Skip to main content

Proxy Config.yaml

Set model list, api_base, api_key, temperature & proxy server settings (master-key) on the config.yaml.

Param NameDescription
model_listList of supported models on the server, with model-specific configs
router_settingslitellm Router settings, example routing_strategy="least-busy" see all
litellm_settingslitellm Module settings, example litellm.drop_params=True, litellm.set_verbose=True, litellm.api_base, litellm.cache see all
general_settingsServer settings, example setting master_key: sk-my_special_key
environment_variablesEnvironment Variables example, REDIS_HOST, REDIS_PORT

Complete List: Check the Swagger UI docs on <your-proxy-url>/#/config.yaml (e.g. http://0.0.0.0:4000/#/config.yaml), for everything you can pass in the config.yaml.

Quick Start

Set a model alias for your deployments.

In the config.yaml the model_name parameter is the user-facing name to use for your deployment.

In the config below:

  • model_name: the name to pass TO litellm from the external client
  • litellm_params.model: the model string passed to the litellm.completion() function

E.g.:

  • model=vllm-models will route to openai/facebook/opt-125m.
  • model=gpt-3.5-turbo will load balance between azure/gpt-turbo-small-eu and azure/gpt-turbo-small-ca
model_list:
  - model_name: gpt-3.5-turbo ### RECEIVED MODEL NAME ###
    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
      model: azure/gpt-turbo-small-eu ### MODEL NAME sent to `litellm.completion()` ###
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: "os.environ/AZURE_API_KEY_EU" # does os.getenv("AZURE_API_KEY_EU")
      rpm: 6      # [OPTIONAL] Rate limit for this deployment: in requests per minute (rpm)
  - model_name: bedrock-claude-v1 
    litellm_params:
      model: bedrock/anthropic.claude-instant-v1
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: "os.environ/AZURE_API_KEY_CA"
      rpm: 6
  - model_name: anthropic-claude
    litellm_params: 
      model: bedrock/anthropic.claude-instant-v1
      ### [OPTIONAL] SET AWS REGION ###
      aws_region_name: us-east-1
  - model_name: vllm-models
    litellm_params:
      model: openai/facebook/opt-125m # the `openai/` prefix tells litellm it's openai compatible
      api_base: http://0.0.0.0:4000
      rpm: 1440
    model_info: 
      version: 2

litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
  drop_params: True
  success_callback: ["langfuse"] # OPTIONAL - if you want to start sending LLM Logs to Langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your env

general_settings: 
  master_key: sk-1234 # [OPTIONAL] Only use this if you to require all calls to contain this key (Authorization: Bearer sk-1234)
  alerting: ["slack"] # [OPTIONAL] If you want Slack Alerts for Hanging LLM requests, Slow llm responses, Budget Alerts. Make sure to set `SLACK_WEBHOOK_URL` in your env

For more provider-specific info, go here

Step 2: Start Proxy with config

$ litellm --config /path/to/config.yaml

Run with --detailed_debug if you need detailed debug logs

$ litellm --config /path/to/config.yaml --detailed_debug

Using Proxy - Curl Request, OpenAI Package, Langchain, Langchain JS

Calling a model group

Sends request to model where model_name=gpt-3.5-turbo on config.yaml.

If multiple with model_name=gpt-3.5-turbo does Load Balancing

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ],
    }
'

Save Model-specific params (API Base, Keys, Temperature, Max Tokens, Organization, Headers etc.)

You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

All input params

Step 1: Create a config.yaml file

model_list:
  - model_name: gpt-4-team1
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      azure_ad_token: eyJ0eXAiOiJ
      seed: 12
      max_tokens: 20
  - model_name: gpt-4-team2
    litellm_params:
      model: azure/gpt-4
      api_key: sk-123
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
      temperature: 0.2
  - model_name: openai-gpt-3.5
    litellm_params:
      model: openai/gpt-3.5-turbo
      extra_headers: {"AI-Resource Group": "ishaan-resource"}
      api_key: sk-123
      organization: org-ikDc4ex8NB
      temperature: 0.2
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Load Balancing

For more on this, go to this page

Use this to call multiple instances of the same model and configure things like routing strategy.

For optimal performance:

  • Set tpm/rpm per model deployment. Weighted picks are then based on the established tpm/rpm.
  • Select your optimal routing strategy in router_settings:routing_strategy.

LiteLLM supports

["simple-shuffle", "least-busy", "usage-based-routing","latency-based-routing"], default="simple-shuffle"`

When tpm/rpm is set + routing_strategy==simple-shuffle litellm will use a weighted pick based on set tpm/rpm. In our load tests setting tpm/rpm for all deployments + routing_strategy==simple-shuffle maximized throughput

  • When using multiple LiteLLM Servers / Kubernetes set redis settings router_settings:redis_host etc
model_list:
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8001
        rpm: 60      # Optional[int]: When rpm/tpm set - litellm uses weighted pick for load balancing. rpm = Rate limit for this deployment: in requests per minute (rpm).
        tpm: 1000   # Optional[int]: tpm = Tokens Per Minute 
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8002
        rpm: 600      
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8003
        rpm: 60000      
  - model_name: gpt-3.5-turbo
    litellm_params:
        model: gpt-3.5-turbo
        api_key: <my-openai-key>
        rpm: 200      
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
        model: gpt-3.5-turbo-16k
        api_key: <my-openai-key>
        rpm: 100      

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise Timeout error if call takes longer than 10s. Sets litellm.request_timeout 
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails num_retries 
  context_window_fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo-16k"]}, {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]}] # fallback to gpt-3.5-turbo-16k if context window error
  allowed_fails: 3 # cooldown model if it fails > 1 call in a minute. 

router_settings: # router_settings are optional
  routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing","latency-based-routing"], default="simple-shuffle"
  model_group_alias: {"gpt-4": "gpt-3.5-turbo"} # all requests with `gpt-4` will be routed to models with `gpt-3.5-turbo`
  num_retries: 2
  timeout: 30                                  # 30 seconds
  redis_host: <your redis host>                # set this when using multiple litellm proxy deployments, load balancing state stored in redis
  redis_password: <your redis password>
  redis_port: 1992

You can view your cost once you set up Virtual keys or custom_callbacks

Load API Keys

Load API Keys from Environment

If you have secrets saved in your environment, and don't want to expose them in the config.yaml, here's how to load model-specific keys from the environment.

os.environ["AZURE_NORTH_AMERICA_API_KEY"] = "your-azure-api-key"
model_list:
  - model_name: gpt-4-team1
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key: os.environ/AZURE_NORTH_AMERICA_API_KEY

See Code

s/o to @David Manouchehri for helping with this.

Load API Keys from Azure Vault

  1. Install Proxy dependencies
$ pip install 'litellm[proxy]' 'litellm[extra_proxy]'
  1. Save Azure details in your environment
export["AZURE_CLIENT_ID"]="your-azure-app-client-id"
export["AZURE_CLIENT_SECRET"]="your-azure-app-client-secret"
export["AZURE_TENANT_ID"]="your-azure-tenant-id"
export["AZURE_KEY_VAULT_URI"]="your-azure-key-vault-uri"
  1. Add to proxy config.yaml
model_list: 
    - model_name: "my-azure-models" # model alias 
        litellm_params:
            model: "azure/<your-deployment-name>"
            api_key: "os.environ/AZURE-API-KEY" # reads from key vault - get_secret("AZURE_API_KEY")
            api_base: "os.environ/AZURE-API-BASE" # reads from key vault - get_secret("AZURE_API_BASE")

general_settings:
  use_azure_key_vault: True

You can now test this by starting your proxy:

litellm --config /path/to/config.yaml

Set Custom Prompt Templates

LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a huggingface model has a saved chat template in it's tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:

Step 1: Save your prompt template in a config.yaml

# Model-specific parameters
model_list:
  - model_name: mistral-7b # model alias
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1" 
      api_base: "<your-api-base>"
      api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
      initial_prompt_value: "\n"
      roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
      final_prompt_value: "\n"
      bos_token: "<s>"
      eos_token: "</s>"
      max_tokens: 4096

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Setting Embedding Models

See supported Embedding Providers & Models here

Use Sagemaker, Bedrock, Azure, OpenAI, XInference

Create Config.yaml

model_list:
  - model_name: bedrock-cohere
    litellm_params:
      model: "bedrock/cohere.command-text-v14"
      aws_region_name: "us-west-2"
  - model_name: bedrock-cohere
    litellm_params:
      model: "bedrock/cohere.command-text-v14"
      aws_region_name: "us-east-2"
  - model_name: bedrock-cohere
    litellm_params:
      model: "bedrock/cohere.command-text-v14"
      aws_region_name: "us-east-1"

Start Proxy

litellm --config config.yaml

Make Request

Sends Request to bedrock-cohere

curl --location 'http://0.0.0.0:4000/chat/completions' \
  --header 'Content-Type: application/json' \
  --data ' {
  "model": "bedrock-cohere",
  "messages": [
      {
      "role": "user",
      "content": "gm"
      }
  ]
}'

Disable Swagger UI

To disable the Swagger docs from the base url, set

NO_DOCS="True"

in your environment, and restart the proxy.

Configure DB Pool Limits + Connection Timeouts

general_settings: 
  database_connection_pool_limit: 100 # sets connection pool for prisma client to postgres db at 100
  database_connection_timeout: 60 # sets a 60s timeout for any connection call to the db 

All settings

{
  "environment_variables": {},
  "model_list": [
    {
      "model_name": "string",
      "litellm_params": {},
      "model_info": {
        "id": "string",
        "mode": "embedding",
        "input_cost_per_token": 0,
        "output_cost_per_token": 0,
        "max_tokens": 2048,
        "base_model": "gpt-4-1106-preview",
        "additionalProp1": {}
      }
    }
  ],
  "litellm_settings": {}, # ALL (https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py)
  "general_settings": {
    "completion_model": "string",
    "disable_spend_logs": "boolean", # turn off writing each transaction to the db
    "disable_master_key_return": "boolean", # turn off returning master key on UI (checked on '/user/info' endpoint)
    "disable_reset_budget": "boolean", # turn off reset budget scheduled task
    "enable_jwt_auth": "boolean", # allow proxy admin to auth in via jwt tokens with 'litellm_proxy_admin' in claims
    "enforce_user_param": "boolean", # requires all openai endpoint requests to have a 'user' param
    "allowed_routes": "list", # list of allowed proxy API routes - a user can access. (currently JWT-Auth only)
    "key_management_system": "google_kms", # either google_kms or azure_kms
    "master_key": "string",
    "database_url": "string",
    "database_connection_pool_limit": 0, # default 100
    "database_connection_timeout": 0, # default 60s
    "database_type": "dynamo_db",
    "database_args": {
      "billing_mode": "PROVISIONED_THROUGHPUT",
      "read_capacity_units": 0,
      "write_capacity_units": 0,
      "ssl_verify": true,
      "region_name": "string",
      "user_table_name": "LiteLLM_UserTable",
      "key_table_name": "LiteLLM_VerificationToken",
      "config_table_name": "LiteLLM_Config",
      "spend_table_name": "LiteLLM_SpendLogs"
    },
    "otel": true,
    "custom_auth": "string",
    "max_parallel_requests": 0,
    "infer_model_from_keys": true,
    "background_health_checks": true,
    "health_check_interval": 300,
    "alerting": [
      "string"
    ],
    "alerting_threshold": 0
  }
}