We want to use custom LLMs behind an OpenAI-compatible API with GitHub Copilot Chat in VS Code, without API keys. We will use LiteLLM as a proxy for authentication and the Azure AI model support in Chat as a hack.
Problem Statement
GitHub Copilot Chat in VS Code (moving forward, called Chat [1]) allows custom
LLM deployments, but only supports API keys and not AAD/Entra ID. API keys are
icky and not cool anymore. Using this method I can:
- Use my own deployment in Azure AI.
- Use AAD/Entra ID (placeholder for the eventual name change in 5 years).
- As a bonus, decouple my code from LLM authentication.
Summary
If you just want the solution:
- Create your own deployment in Azure AI.
- Set up LiteLLM with a config like this:

general_settings:
  # master_key: sk-local-proxy # optional, fake API key for LiteLLM
  telemetry: false

model_list:
  - model_name: "gpt-5-parsia" # custom name that GitHub Copilot chat sees
    litellm_params:
      model: "azure/gpt-5-parsia" # keep "azure/", the rest is the name of the deployment
      api_base: "https://{base-api}.cognitiveservices.azure.com/" # replace
      api_version: "2024-12-01-preview" # replace if needed

litellm_settings:
  enable_azure_ad_token_refresh: true # use AAD tokens
  drop_params: true # keep this, otherwise you will get errors.
- Add this section to the VS Code config:
{
// rest of the settings
"github.copilot.chat.azureModels": {
"gpt-5-parsia": { // This is just an identifier
"name": "gpt-5-parsia", // The model name you see in Chat
"url": "http://localhost:4000/v1/chat/completions", // Point to LiteLLM
"maxInputTokens": 128000, // Set based on your model
"maxOutputTokens": 16000, // Ditto
"toolCalling": true, // Enable this
"vision": false,
"thinking": false
}
}
}
- Create this environment variable for LiteLLM: AZURE_CREDENTIAL with the value DefaultAzureCredential.
- Run LiteLLM
litellm --config .\config.yaml --host localhost
- Click Manage Models in Chat and select Azure.
- Choose gpt-5-parsia.
Drawback: There's a very noticeable lag in responses compared to the built-in models. Not that bad if you're batch processing, but not a great experience for real-time use.
Details
While I like the models in Chat, using your own model in your own subscription opens up a lot of opportunities. I was talking to a friend at work (you've been promoted to friend for external propaganda purposes, A) and they mentioned not using the built-in models in Chat because of dealing with sensitive stuff (OMG, who cares, donate it all to the magic oracle in return for visions).
Chat has support for Azure models (and other providers), but only supports API keys. We cannot use API keys. With apologies to Hafez:
دردم از کار است و درمان نیز هم [2]
My pain and remedy are both from work.
I am logged into an Entra-joined machine and my model is deployed in Azure, so I should be able to get a token to talk to the model, but Chat doesn't support this natively.
Enter LiteLLM
At DEF CON, I visited AIxCC and briefly talked with people. Looking through the code for Trail of Bits' 2nd-place system, ButterCup, I saw a directory named litellm.
LiteLLM is a local LLM proxy. It does a lot more, like budgeting, but I was
only interested in its Azure AD Token Refresh support. It uses something called
DefaultAzureCredential to obtain a token.

Think of DefaultAzureCredential as a magical way of getting an AAD token. On
an Entra-joined machine, it will try a few ways to passively obtain a valid token;
if that fails, it will show you one of those familiar "choose account" dialogs, and if
all else fails, it opens a browser window to let you log in [3].
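To see what that token dance looks like outside LiteLLM, here is a minimal sketch using the azure-identity Python package; the scope below is the standard one for Azure Cognitive Services / Azure OpenAI style endpoints, matching the endpoint used later in the config.

# Minimal sketch: acquire an Entra ID token the same general way LiteLLM does.
# Assumes the azure-identity package is installed.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

# token.token is the bearer token, token.expires_on is a Unix timestamp.
print(token.token[:20] + "...", token.expires_on)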
So we create a LiteLLM config like this:
# Basic proxy settings
general_settings:
  # master_key: sk-local-proxy # optional, fake API key for LiteLLM
  telemetry: false

model_list:
  - model_name: "gpt-5-parsia" # custom name that GitHub Copilot chat sees
    litellm_params:
      model: "azure/gpt-5-parsia" # keep "azure/", the rest is the name of the deployment
      api_base: "https://{base-api}.cognitiveservices.azure.com/" # replace
      api_version: "2024-12-01-preview" # replace if needed
    model_info: # optional section but helps LiteLLM understand things
      base_model: "gpt-5"
      mode: "completion" # not needed but good to have
      # more options
      # input_cost_per_token
      # output_cost_per_token
      # max_tokens
      # metadata # apparently freeform!

# Optional router/proxy tweaks
router_settings:
  num_retries: 2
  timeout: 120

litellm_settings:
  enable_azure_ad_token_refresh: true # this is where the AAD token is magically acquired
  drop_params: true # keep this, otherwise you will get errors.
Most of the config is self-explanatory. You can have multiple models in LiteLLM.
Our example only has one. You only need to replace a maximum of
three items under litellm_params
with data from your deployment.
- model: This should start with azure/ to tell LiteLLM where it's hosted.
- api_base: Your API endpoint. This is https://{base-api}.cognitiveservices.azure.com/ where {base-api} is also the name of your Azure AI Foundry resource.
- api_version: Comes from your deployment.
LiteLLM's config is very extensive. For example, LiteLLM can create fake API keys with specific budgets; clients use these keys to talk to LiteLLM. You can also set a fixed API key that Chat must use to talk to LiteLLM (the master_key at the top of the config).
Now we can run LiteLLM and it will expose an OpenAI-compatible API (e.g.,
/v1/chat/completions), which is the de facto standard these days. But we have
more work to do.
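As a quick sanity check, you can already talk to LiteLLM with any OpenAI-compatible client. A small sketch with the openai Python package (the placeholder API key assumes you left master_key commented out; otherwise use that value):

# Quick test against the local LiteLLM proxy (default port 4000).
# The api_key is a placeholder; LiteLLM only checks it if master_key is set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-local-proxy")

response = client.chat.completions.create(
    model="gpt-5-parsia",  # the model_name from the LiteLLM config
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)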
Chat's Ollama Support
Chat supports local models, but via Ollama. By default,
it tries to talk to http://localhost:11434
; you can also change it with
this key in the VS Code config.
"github.copilot.chat.byok.ollamaEndpoint": "http://localhost:11434",
So we can tell Chat to talk to a custom endpoint. However, Chat expects the Ollama API, which is different from the OpenAI-compatible API exposed by LiteLLM and used by Azure.
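To make the mismatch concrete, here is a small sketch (assuming the httpx package, LiteLLM on its default port 4000 with no master_key set, and Ollama's publicly documented /api/tags endpoint):

# Chat's Ollama support discovers models via Ollama-specific paths, which
# LiteLLM does not serve; LiteLLM only speaks the OpenAI-style paths.
import httpx

# What Chat's Ollama mode would call (not served by LiteLLM):
#   httpx.get("http://localhost:11434/api/tags").json()["models"]

# The OpenAI-style equivalent that LiteLLM actually exposes:
models = httpx.get("http://localhost:4000/v1/models").json()["data"]
print([m["id"] for m in models])  # should include "gpt-5-parsia"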
The post You Can Now Connect Your Own Model for GitHub Copilot from March 2025
suggests running LiteLLM on port 11434 and claims it can emulate an Ollama API. I
couldn't get it to work, and I could not find any switches or configurations to
tell LiteLLM to emulate the Ollama API.
So we need something to translate one to the other. Originally, I used a second Python package named oai2ollama that does it. So the setup looked like this:
.-------. .----------. .-------. .--------.
| VS Code +--->| oai2ollama +--->| LiteLLM +--->| Azure AI |
'-------' '----------' '-------' '--------'
Chat's Azure AI Support
There's experimental support for custom Azure AI models in VS Code's Chat. If
you open the settings with ctrl+, and search for Azure custom, you will see it.
You have to edit the setting in JSON mode and add this info.
{
// rest of the settings
"github.copilot.chat.azureModels": {
"gpt-5-parsia": { // This is just an identifier
"name": "gpt-5-parsia", // The model name you see in Chat
"url": "http://localhost:4000/v1/chat/completions", // Point to LiteLLM
"maxInputTokens": 128000, // Set these based on your model
"maxOutputTokens": 16000, // Ditto
"toolCalling": true, // Enable this
"vision": false,
"thinking": false
}
}
}
Workflow
- Run LiteLLM:
litellm --config config.yaml --host localhost
- In VS Code's Chat, click the model picker and select Manage Models.
- Select Azure and you should see your model.
- If Chat asks for an API key, enter any random text unless you set one up in LiteLLM's config. The key is only sent to LiteLLM.
Benefits
Now you have your own private hallucinating oracle that grants wishes in a secure manner.
For me, the main benefit is separating my models, authentication, and tokens from my code. In the code, I just need a model name and an endpoint, and to quote Billy Connolly's HBO skit, "buggered if I know what happens after that" [4].
Drawbacks
It's SLOW, with a very high latency of 30-40 seconds (GPT-5). And I'm not talking about just the first request, which needs to acquire a token. Subsequent requests have the same high lag, which makes GPT-5 unusable for real-time use, but it works for batch processing as long as you are not sending more than dozens of requests per second.
IMO, part of the latency is because GPT-5 is only available in East US 2, which is a bit away from me in the PNW. See Global Standard model availability for more info.
Experiment #1: GPT-4.1 in West US 3 has a ~3 second delay, which is on par with the built-in models in Chat and quite usable.
Another issue is that LiteLLM loads tons of features that we do not use.
Future work #1: find a similar but lighter product. I've looked into Portkey, but I haven't experimented with it yet. Theoretically, you could write something yourself that refreshes the token and forwards the requests, but I'd rather use an existing product in case I want to use other endpoints.
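For the record, here is a rough sketch of what that DIY route would involve, assuming the azure-identity and httpx Python packages; the endpoint placeholder, deployment name, and API version are carried over from the config above, and the URL path follows the standard Azure OpenAI deployments route.

# Rough sketch of the DIY idea: refresh an Entra ID token and forward an
# OpenAI-style chat request to the Azure deployment. Not a full proxy.
import httpx
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Returns a callable that caches the token and refreshes it when it expires.
get_token = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

def forward_chat_request(body: dict) -> dict:
    """Send an OpenAI-style chat body to the Azure deployment with a fresh token."""
    url = (
        "https://{base-api}.cognitiveservices.azure.com/openai/deployments/"
        "gpt-5-parsia/chat/completions?api-version=2024-12-01-preview"
    )
    headers = {"Authorization": f"Bearer {get_token()}"}
    resp = httpx.post(url, json=body, headers=headers, timeout=120)
    resp.raise_for_status()
    return resp.json()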
Q&A
- Does it work in Visual Studio?
- I don't know. I use VS Code. I am a normie, not a corpo.
- Why doesn't GitHub Copilot Chat have this functionality?
- Good question. I think it's a good use case. Apparently, there's a new extension API in vscode-insiders that can be used to implement this functionality.
As usual if you have any better solutions, you know where to find me.
1. This also opens up fun possibilities like "Chat, is this true?"
2. The original verse is "دردم از یار است و درمان نیز هم" (My pain and remedy are both from the beloved). I replaced یار (beloved) with کار (work).
3. It really doesn't matter what it does behind the scenes. It's a magical token-granting wishing well.
4. Hafez to Billy Connolly is quite the transition. Enjoying this "diversity of thought"?