Azure OpenAI
Configure Azure Chat Completions and Responses API models
Stratus includes two built-in Azure OpenAI model implementations. Both implement the Model interface and work with all Stratus APIs (agents, tools, sessions, streaming, etc.).
| Model | API | Best for |
|---|---|---|
| AzureResponsesModel | Responses API | Recommended. Latest API format with full feature support |
| AzureChatCompletionsModel | Chat Completions | Legacy support, widest compatibility |
Quick Start with createModel()
The fastest way to get started. Reads AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and AZURE_OPENAI_DEPLOYMENT from environment variables:
```typescript
import { createModel } from "@usestratus/sdk/azure";

const model = createModel();
```

Defaults to the Responses API. Pass "chat-completions" for the legacy API:

```typescript
const model = createModel("chat-completions");
```

Override any env var with explicit options:
```typescript
const model = createModel({
  endpoint: "https://my-resource.openai.azure.com",
  deployment: "gpt-5.2",
  apiKey: process.env.MY_KEY!,
  store: true,
});
```

| Env Variable | Fallback | Description |
|---|---|---|
| AZURE_OPENAI_ENDPOINT | options.endpoint | Azure OpenAI endpoint URL |
| AZURE_OPENAI_API_KEY | options.apiKey | API key (or use options.azureAdTokenProvider) |
| AZURE_OPENAI_DEPLOYMENT | options.deployment | Model deployment name |
| AZURE_OPENAI_API_VERSION | options.apiVersion | API version (optional) |
If a required value is missing, createModel() throws a StratusError with a message telling you exactly which env var to set.
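For intuition, the resolution logic might look like the following sketch. This is a hypothetical helper, not the SDK's actual code: a plain `Error` stands in for `StratusError`, and the precedence (explicit option wins over the env var) is assumed from the override behavior described above.

```typescript
// Hypothetical sketch of createModel()'s config resolution.
// Assumption: explicit options override env vars, and a missing value
// produces an error naming the env var to set.
function resolveConfigValue(
  envVar: string,
  env: Record<string, string | undefined>,
  explicit?: string,
): string {
  const value = explicit ?? env[envVar]; // explicit option overrides the env var
  if (value === undefined || value === "") {
    throw new Error(
      `Missing required config: set ${envVar} or pass the corresponding option`,
    );
  }
  return value;
}
```

For example, `resolveConfigValue("AZURE_OPENAI_ENDPOINT", process.env)` returns the env value, while passing an explicit third argument overrides it.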
AzureResponsesModel
The recommended model for new projects. Uses the Azure Responses API.
```typescript
import { AzureResponsesModel } from "@usestratus/sdk/azure";

const model = new AzureResponsesModel({
  endpoint: "https://your-resource.openai.azure.com",
  apiKey: "your-api-key",
  deployment: "gpt-5.2",
  apiVersion: "2025-04-01-preview", // optional, this is the default
});
```

Config Options
| Property | Type | Description |
|---|---|---|
| endpoint | string | Required. Any supported endpoint format |
| apiKey | string | API key for authentication. Required unless azureAdTokenProvider is set. |
| azureAdTokenProvider | () => Promise<string> | Entra ID token provider function. Required unless apiKey is set. See Authentication. |
| deployment | string | Required. Sent as model in request body |
| apiVersion | string | API version (default: "2025-04-01-preview") |
| store | boolean | Whether to persist responses server-side (default: false). Enable for previous_response_id optimization. |
| maxRetries | number | Maximum number of retries on 429 rate limits and network errors (default: 3). See Retry behavior. |
AzureChatCompletionsModel
Uses the Azure Chat Completions API. Use this if your deployment doesn't support the Responses API.
```typescript
import { AzureChatCompletionsModel } from "@usestratus/sdk/azure";

const model = new AzureChatCompletionsModel({
  endpoint: "https://your-resource.openai.azure.com",
  apiKey: "your-api-key",
  deployment: "gpt-5.2",
  apiVersion: "2025-03-01-preview", // optional, this is the default
});
```

Config Options
| Property | Type | Description |
|---|---|---|
| endpoint | string | Required. Any supported endpoint format |
| apiKey | string | API key for authentication. Required unless azureAdTokenProvider is set. |
| azureAdTokenProvider | () => Promise<string> | Entra ID token provider function. Required unless apiKey is set. See Authentication. |
| deployment | string | Required. Model deployment name |
| apiVersion | string | API version (default: "2025-03-01-preview") |
| maxRetries | number | Maximum number of retries on 429 rate limits and network errors (default: 3). See Retry behavior. |
Both models are interchangeable for function tools. Swap one for the other without changing any agent, tool, or session code. Built-in tools (web search, code interpreter, MCP, image generation) are only supported by AzureResponsesModel.
Endpoint Formats
Pass any Azure endpoint URL as endpoint — the SDK auto-detects the type and builds the correct request URL.
```typescript
// Azure OpenAI
endpoint: "https://your-resource.openai.azure.com"

// Cognitive Services
endpoint: "https://your-resource.cognitiveservices.azure.com"

// AI Foundry project
endpoint: "https://your-project.services.ai.azure.com/api/projects/my-project"

// Full URL (used as-is, deployment and apiVersion are ignored)
endpoint: "https://your-resource.openai.azure.com/openai/deployments/gpt-5.2/chat/completions?api-version=2025-03-01-preview"
```

Trailing slashes are normalized automatically.
Non-OpenAI Models (Model Inference API)
AzureChatCompletionsModel works with any model deployed through the Azure AI Model Inference API, not just OpenAI models. Pass the full Model Inference URL as the endpoint and the model name as the deployment:
```typescript
import { AzureChatCompletionsModel } from "@usestratus/sdk/azure";

const model = new AzureChatCompletionsModel({
  endpoint: "https://your-resource.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview",
  apiKey: "your-api-key",
  deployment: "Kimi-K2.5", // model name sent in request body
});
```

The deployment value is sent as the model field in the request body, which the Model Inference API uses to route to the correct model. All Stratus features (tools, streaming, handoffs, sessions, etc.) work with any model that supports the Chat Completions format.
Not all models support every feature. For example, some models don't support tool calling or structured output. The SDK will surface the API error if an unsupported feature is used.
Tested Models
The following non-OpenAI models have been verified with AzureChatCompletionsModel:
| Model | Tools | Structured Output | Streaming | Handoffs |
|---|---|---|---|---|
| Kimi-K2.5 | Yes | Yes | Yes | Yes |
| Kimi-K2-Thinking | Yes | Yes | Yes | Yes |
Usage
Both models implement the Model interface and work identically with all Stratus APIs:
```typescript
// With run()
const result = await run(agent, "Hello", { model });

// With createSession()
const session = createSession({ model, instructions: "..." });

// With prompt()
const result = await prompt("Hello", { model });
```

Model Interface
Any model provider can be used with Stratus by implementing the Model interface:
```typescript
interface Model {
  getResponse(request: ModelRequest, options?: ModelRequestOptions): Promise<ModelResponse>;
  getStreamedResponse(request: ModelRequest, options?: ModelRequestOptions): AsyncIterable<StreamEvent>;
}

interface ModelRequestOptions {
  signal?: AbortSignal;
}
```

The options parameter is optional and backward compatible. When provided, signal is used for request cancellation.
ModelRequest
```typescript
interface ModelRequest {
  messages: ChatMessage[];
  tools?: (ToolDefinition | Record<string, unknown>)[];
  modelSettings?: ModelSettings;
  responseFormat?: ResponseFormat;
  previousResponseId?: string;
  rawInputItems?: Record<string, unknown>[];
}
```

The tools array accepts both ToolDefinition (function tools) and Record<string, unknown> (hosted tool definitions). previousResponseId is forwarded by the run loop for Responses API optimization when store is enabled.
rawInputItems appends raw items to the Responses API input array. Use this to pass back opaque items from the API — compaction items, encrypted reasoning, MCP approval responses — that the SDK doesn't serialize from ChatMessage.
ModelResponse
```typescript
interface ModelResponse {
  content: string | null;
  toolCalls: ToolCall[];
  usage?: UsageInfo;
  finishReason?: FinishReason;
  responseId?: string;
  incompleteDetails?: { reason?: string };
  outputItems?: Record<string, unknown>[];
}
```

responseId is populated by AzureResponsesModel and tracked across turns by the run loop. It's also available on RunResult.responseId.
incompleteDetails is populated when a response is truncated (e.g. due to max_output_tokens). The reason field describes why.
outputItems is an escape hatch for Responses API output item types the SDK doesn't have first-class support for — such as mcp_approval_request, image_generation_call results, and code_interpreter_call results. These items are passed through as raw objects so you can inspect them directly.
UsageInfo
```typescript
interface UsageInfo {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cacheReadTokens?: number;
  cacheCreationTokens?: number;
  reasoningTokens?: number;
}
```

Cache token fields are populated when the Azure API returns prompt caching details. reasoningTokens is populated for reasoning models (o1, o3, etc.) from completion_tokens_details.reasoning_tokens (Chat Completions) or output_tokens_details.reasoning_tokens (Responses API). All optional fields are undefined when not active.
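When tracking spend across a multi-turn run, a small helper like the following can aggregate UsageInfo values. This is an illustrative utility, not part of the SDK; it preserves the convention that optional fields stay undefined when never reported.

```typescript
interface UsageInfo {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cacheReadTokens?: number;
  cacheCreationTokens?: number;
  reasoningTokens?: number;
}

// Illustrative helper (not part of the SDK): sums usage across turns.
function addUsage(a: UsageInfo, b: UsageInfo): UsageInfo {
  // Optional fields remain undefined unless at least one side reported them.
  const opt = (x?: number, y?: number) =>
    x === undefined && y === undefined ? undefined : (x ?? 0) + (y ?? 0);
  return {
    promptTokens: a.promptTokens + b.promptTokens,
    completionTokens: a.completionTokens + b.completionTokens,
    totalTokens: a.totalTokens + b.totalTokens,
    cacheReadTokens: opt(a.cacheReadTokens, b.cacheReadTokens),
    cacheCreationTokens: opt(a.cacheCreationTokens, b.cacheCreationTokens),
    reasoningTokens: opt(a.reasoningTokens, b.reasoningTokens),
  };
}
```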
Prompt Caching
Both models support Azure's automatic prompt caching. Cache hits appear as cacheReadTokens in UsageInfo and are billed at a discount. Use promptCacheKey in ModelSettings to improve hit rates:
```typescript
const agent = new Agent({
  name: "assistant",
  model,
  modelSettings: {
    promptCacheKey: "my-app-v1",
  },
});
```

Both AzureChatCompletionsModel and AzureResponsesModel parse cached token counts from their respective response formats.
Authentication
Both models support two authentication methods. Exactly one of apiKey or azureAdTokenProvider must be provided — the constructor throws if neither or both are set.
API Key
The simplest option. The key is sent as an api-key header with every request.
```typescript
const model = new AzureResponsesModel({
  endpoint: "https://your-resource.openai.azure.com",
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
});
```

Microsoft Entra ID
For enterprise environments, pass a token provider function instead of an API key. Stratus calls it before each request and sends the token as a Bearer header. This works with managed identities, service principals, and any @azure/identity credential.
Install @azure/identity in your project (Stratus has no hard dependency on it):
```shell
bun add @azure/identity
```

Then pass a token provider:
```typescript
import { AzureResponsesModel } from "@usestratus/sdk/azure";
import { DefaultAzureCredential, getBearerTokenProvider } from "@azure/identity";

const credential = new DefaultAzureCredential();
const tokenProvider = getBearerTokenProvider(
  credential,
  "https://cognitiveservices.azure.com/.default",
);

const model = new AzureResponsesModel({
  endpoint: "https://your-resource.openai.azure.com",
  azureAdTokenProvider: tokenProvider,
  deployment: "gpt-5.2",
});
```

The token provider is called fresh on each API request — token caching and refresh are handled by @azure/identity.
DefaultAzureCredential automatically picks the right credential for your environment: managed identity in Azure, Azure CLI locally, and environment variables in CI. See the @azure/identity docs for the full chain.
Streaming
Both models use Server-Sent Events (SSE) with a shared zero-dependency parser. Events are yielded as StreamEvent objects as they arrive from the Azure API.
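For intuition, a data-line extractor in the spirit of that parser might look like the sketch below. This is a simplified illustration, not the SDK's parser, which also handles multi-line events, comment lines, and buffering across chunk boundaries.

```typescript
// Simplified sketch of SSE data-line extraction (illustrative only).
// Returns the JSON payload strings from a chunk of SSE text, skipping
// the "[DONE]" sentinel that ends OpenAI-style streams.
function parseSseChunk(chunk: string): string[] {
  const payloads: string[] = [];
  for (const rawLine of chunk.split(/\r?\n/)) {
    if (rawLine.startsWith("data:")) {
      const payload = rawLine.slice(5).trimStart();
      if (payload !== "[DONE]") payloads.push(payload);
    }
  }
  return payloads;
}
```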
Error Handling
Both models throw the same errors for failure modes:
- ModelError: General API errors (4xx/5xx responses)
- ContentFilterError: Azure content filter blocked the request or response
```typescript
import { ModelError, ContentFilterError } from "@usestratus/sdk/core";

try {
  const result = await run(agent, input);
} catch (error) {
  if (error instanceof ContentFilterError) {
    // Handle content filter
  } else if (error instanceof ModelError) {
    console.error(`API error ${error.status}: ${error.message}`);
  }
}
```

Retry Behavior
Both models automatically retry on transient errors and network errors (timeouts, connection resets, DNS failures). Retries are transparent to the caller — the AbortSignal from RunOptions.signal still propagates through, so timeouts work across retries.
Retryable status codes
| Code | Meaning |
|---|---|
| 429 | Rate limited — too many requests |
| 500 | Internal server error — transient capacity issue |
| 502 | Bad gateway — upstream infrastructure issue |
| 503 | Service unavailable — server temporarily down |
The default is 3 retries. Configure it per model:
```typescript
const model = new AzureResponsesModel({
  endpoint: "https://your-resource.openai.azure.com",
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
  maxRetries: 5,
});
```

Backoff strategy
- retry-after-ms header — Azure returns this with millisecond precision on 429s. Used when present.
- retry-after header — Standard header in seconds. Used as fallback.
- Exponential backoff with jitter — 1s × 2^attempt + random(0–1s). Used when no headers are present.
All delays are capped at 30 seconds — including server-provided values — to prevent a misbehaving server from stalling requests indefinitely.
Backoff sleeps are abort-aware: if you cancel via AbortSignal, the retry exits immediately rather than waiting out the full delay.
Retries are logged via console.warn with the wait duration and attempt count.
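The delay rules above can be sketched as a small pure function. This is an illustration of the documented behavior, not the SDK's actual internals; the parameter names are assumptions.

```typescript
// Illustrative sketch of the documented backoff rules.
function computeBackoffMs(
  attempt: number,        // 0-based retry attempt
  retryAfterMs?: number,  // parsed retry-after-ms header, if present
  retryAfterSec?: number, // parsed retry-after header, if present
): number {
  const CAP_MS = 30_000; // all delays are capped at 30 seconds
  let delay: number;
  if (retryAfterMs !== undefined) {
    delay = retryAfterMs; // millisecond-precision server hint wins
  } else if (retryAfterSec !== undefined) {
    delay = retryAfterSec * 1000; // standard header, in seconds
  } else {
    // exponential backoff with jitter: 1s × 2^attempt + random(0–1s)
    delay = 1000 * 2 ** attempt + Math.random() * 1000;
  }
  return Math.min(delay, CAP_MS); // server-provided values are capped too
}
```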
Proxy error detection
Azure proxy errors sometimes return HTTP 200 with an HTML body instead of JSON/SSE. Both models detect this by checking the content-type header — if it's present but doesn't contain json or event-stream, the response is treated as a transient proxy error and retried with the same backoff logic. If retries are exhausted, a ModelError is thrown with the first 200 characters of the body.
As a safety net, getResponse() also catches SyntaxError from response.json() and wraps it in a ModelError with the raw body snippet for debugging.
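The content-type check described above amounts to a predicate like this sketch (an assumed shape for illustration, not the SDK's actual code):

```typescript
// Sketch of the proxy-error heuristic: a present content-type header
// that mentions neither JSON nor an event stream is treated as a
// transient proxy error and retried.
function looksLikeProxyError(contentType: string | null): boolean {
  if (contentType === null) return false; // absent header is not flagged
  const ct = contentType.toLowerCase();
  return !ct.includes("json") && !ct.includes("event-stream");
}
```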
AzureResponsesModel also retries SSE-level rate limits — when the HTTP response is 200 but the stream contains a too_many_requests error event before any content has been yielded. SSE retries use a fixed budget of 3, independent of maxRetries, to avoid quadratic retry multiplication.
Responses API Methods
AzureResponsesModel exposes additional methods beyond the Model interface for Responses API features that don't fit the standard getResponse / getStreamedResponse pattern.
Compact endpoint
Shrink a conversation's context window while preserving essential information. Useful for long-running sessions before continuing.
```typescript
// Compact by passing conversation items
const compacted = await model.compact({
  input: [
    { role: "user", content: "Explain quantum computing in detail." },
    {
      type: "message",
      role: "assistant",
      content: [{ type: "output_text", text: longResponse }],
    },
  ],
});

// Use compacted output as context for the next request
const followUp = await model.getResponse({
  messages: [{ role: "user", content: "What are the practical applications?" }],
  rawInputItems: compacted.output,
});
```

You can also compact by referencing a stored response:
```typescript
const compacted = await model.compact({
  previousResponseId: "resp_abc123",
});
```

CompactOptions:
| Property | Type | Description |
|---|---|---|
| model | string | Model override. Defaults to the deployment configured on the model instance. |
| input | Record<string, unknown>[] | Conversation items to compact. |
| previousResponseId | string | ID of a stored response to compact. Alternative to input. |
| signal | AbortSignal | Abort signal for cancellation. |
Background tasks
Run long-running requests asynchronously. Best for reasoning models (o3, o4-mini) that can take minutes to complete.
```typescript
// Start a background task
const bg = await model.createBackgroundResponse({
  messages: [
    { role: "user", content: "Write a detailed analysis of this codebase." },
  ],
});

console.log(bg.id); // "resp_abc123"
console.log(bg.status); // "queued" | "in_progress"

// Poll until done
let response = bg;
while (response.status !== "completed" && response.status !== "failed") {
  await new Promise((r) => setTimeout(r, 2000));
  response = await model.retrieveResponse(response.id);
}
console.log(response.output); // completed response
```

Cancel a running background task:
```typescript
const cancelled = await model.cancelResponse("resp_abc123");
```

Resume streaming from a specific point (useful for dropped connections):
```typescript
let cursor: number | undefined;

for await (const event of model.streamBackgroundResponse("resp_abc123", {
  startingAfter: cursor, // resume from last known position
})) {
  // process events
}
```

Background mode requires store: true. Not all deployments support it — it's designed for reasoning models like o3 and o4-mini.
Retrieve, delete, and list
Manage stored responses directly.
```typescript
// Retrieve a stored response
const response = await model.retrieveResponse("resp_abc123");

// List the input items that were sent
const items = await model.listInputItems("resp_abc123");
console.log(items.data); // input item objects
console.log(items.hasMore); // pagination

// Delete a stored response
await model.deleteResponse("resp_abc123");
```

retrieveResponse(id) — Returns the full RawResponse including id, status, output, usage, and error.
listInputItems(id) — Returns { data, hasMore, firstId, lastId } with the input items from the original request.
deleteResponse(id) — Deletes the stored response. Subsequent retrieval returns 404.
Stored responses are retained for 30 days by default. Use deleteResponse() to clean up earlier.
MCP approval flow
When using the MCP built-in tool with requireApproval, the API returns an mcp_approval_request in outputItems instead of executing the tool. You approve or deny it by passing an mcp_approval_response back.
```typescript
const result = await model.getResponse({
  messages: [{ role: "user", content: "Search the docs" }],
});

// Check for pending approvals
const approval = result.outputItems?.find(
  (item) => item.type === "mcp_approval_request"
);

if (approval) {
  // Approve and continue
  const continued = await model.getResponse({
    messages: [{ role: "user", content: "Search the docs" }],
    previousResponseId: result.responseId,
    modelSettings: { store: true },
    rawInputItems: [{
      type: "mcp_approval_response",
      approve: true,
      approval_request_id: approval.id as string,
    }],
  });
}
```