# Model Settings

Configure temperature, tool choice, and other model parameters.
Model settings control how the model generates responses. You set them on the agent at construction time.
## Setting on an agent
Pass a modelSettings object when creating an agent:
```ts
import { Agent } from "@usestratus/sdk/core";

const agent = new Agent({
  name: "assistant",
  model,
  modelSettings: {
    temperature: 0.7,
    maxTokens: 1000,
  },
});
```

Settings are sent to the model on every call the agent makes. To change settings between runs, clone the agent with new values:
```ts
const creativeAgent = agent.clone({
  modelSettings: { temperature: 1.2, topP: 0.95 },
});
```

## ModelSettings reference
| Setting | Type | Default | Description |
|---|---|---|---|
| `temperature` | number | Model default | Sampling temperature. Higher values (closer to 2) produce more random output; lower values (closer to 0) produce more deterministic output. Range: 0 to 2. |
| `topP` | number | Model default | Nucleus sampling. The model considers tokens whose cumulative probability exceeds this threshold. Range: 0 to 1. |
| `maxTokens` | number | Model default | Maximum number of tokens to generate in the response. |
| `stop` | string[] | undefined | Stop sequences. The model stops generating when it produces any of these strings. |
| `presencePenalty` | number | 0 | Penalizes tokens that have already appeared, encouraging the model to talk about new topics. Range: -2 to 2. |
| `frequencyPenalty` | number | 0 | Penalizes tokens proportional to how often they've appeared, reducing repetition. Range: -2 to 2. |
| `toolChoice` | ToolChoice | "auto" | Controls which tools the model can call. See Tool choice. |
| `parallelToolCalls` | boolean | true | Whether the model can call multiple tools in a single turn. |
| `seed` | number | undefined | Seed for deterministic sampling. Repeated requests with the same seed and parameters should return the same result. |
| `reasoningEffort` | ReasoningEffort | undefined | Controls how much reasoning effort the model spends. See Reasoning models. |
| `maxCompletionTokens` | number | undefined | Maximum tokens for the model's completion, including reasoning tokens. Use instead of maxTokens for reasoning models. |
| `reasoningSummary` | ReasoningSummary | undefined | Controls reasoning summary output: "auto", "concise", or "detailed". |
| `promptCacheKey` | string | undefined | Influences prompt cache routing. Requests with the same key and prefix are more likely to hit cache. See Prompt caching. |
| `truncation` | Truncation | undefined | Input truncation strategy: "auto" (truncate oldest messages) or "disabled" (fail on overflow). |
| `store` | boolean | undefined | Whether to store the request/response server-side. Required for previousResponseId chaining. |
| `metadata` | Record<string, string> | undefined | Arbitrary key-value metadata attached to the API request. |
| `user` | string | undefined | End-user identifier for abuse monitoring. |
| `logprobs` | boolean | undefined | Whether to return log probabilities of output tokens. |
| `topLogprobs` | number | undefined | Number of most likely tokens to return per position (0 to 20). Requires logprobs: true. |
| `prediction` | PredictedOutput | undefined | Predicted output for faster completions. Chat Completions only. See Predicted output. |
| `modalities` | Modality[] | ["text"] | Output modalities. Set to ["text", "audio"] for audio output. Chat Completions only. See Audio output. |
| `audio` | AudioConfig | undefined | Audio voice and format config. Requires modalities: ["text", "audio"]. Chat Completions only. |
| `dataSources` | DataSource[] | undefined | Azure On Your Data sources for RAG. Chat Completions only. See Data sources. |
| `contextManagement` | ContextManagement | undefined | Server-side context compaction rules. Responses API only. See Context compaction. |
| `include` | string[] | undefined | Fields to include in the response. Responses API only. See Encrypted reasoning. |
| `background` | boolean | undefined | Run as a background task for long-running requests. Responses API only. See Background tasks. |
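The topP row above summarizes nucleus sampling in one sentence. As a rough sketch of the mechanics (illustrative only, not SDK code), the model sorts candidate tokens by probability and keeps the smallest prefix whose cumulative probability reaches the threshold:

```ts
// Illustrative: nucleus (top-p) sampling keeps only the smallest set of
// highest-probability tokens whose cumulative probability reaches topP.
function nucleus(probs: Record<string, number>, topP: number): string[] {
  const sorted = Object.entries(probs).sort((a, b) => b[1] - a[1]);
  const kept: string[] = [];
  let cumulative = 0;
  for (const [token, p] of sorted) {
    kept.push(token);
    cumulative += p;
    if (cumulative >= topP) break;
  }
  return kept;
}

// With topP = 0.9, the low-probability tail token is excluded.
console.log(nucleus({ the: 0.5, a: 0.3, an: 0.15, xyz: 0.05 }, 0.9));
// → [ "the", "a", "an" ]
```

Lowering topP narrows the candidate set and makes output more deterministic, similar in effect to lowering temperature.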
## Reasoning models
For reasoning models (o1, o3, etc.), use reasoningEffort and maxCompletionTokens instead of temperature and maxTokens.
reasoningEffort controls how much internal reasoning the model does before responding. Higher effort produces more thorough answers but uses more tokens and takes longer.
```ts
import { Agent } from "@usestratus/sdk/core";

const agent = new Agent({
  name: "analyst",
  model,
  modelSettings: {
    reasoningEffort: "high",
    maxCompletionTokens: 16384,
  },
});
```

Valid values for reasoningEffort:
| Value | Description |
|---|---|
| "none" | No reasoning |
| "minimal" | Minimal reasoning |
| "low" | Low effort |
| "medium" | Medium effort (default for reasoning models) |
| "high" | High effort |
| "xhigh" | Maximum effort |
maxCompletionTokens includes both reasoning tokens and output tokens. If the model uses 1000 tokens for reasoning and 500 for the response, that's 1500 total against the limit. Reasoning tokens are tracked in UsageInfo.reasoningTokens.
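That budget arithmetic can be sketched as a tiny helper (illustrative only; fitsCompletionBudget is a hypothetical function, not part of the SDK):

```ts
// Illustrative: maxCompletionTokens counts reasoning tokens and visible
// output tokens together, so both must fit within the single limit.
function fitsCompletionBudget(
  reasoningTokens: number,
  outputTokens: number,
  maxCompletionTokens: number
): boolean {
  return reasoningTokens + outputTokens <= maxCompletionTokens;
}

// 1000 reasoning + 500 output = 1500 total against the limit.
console.log(fitsCompletionBudget(1000, 500, 16384)); // → true
console.log(fitsCompletionBudget(15000, 2000, 16384)); // → false
```

If the limit is reached mid-reasoning, the visible response may be truncated or empty, so leave generous headroom for high reasoning effort.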
## Prompt caching
Azure automatically caches prompt prefixes for requests over 1,024 tokens. Use promptCacheKey to improve cache hit rates when many requests share long common prefixes.
```ts
const agent = new Agent({
  name: "assistant",
  model,
  modelSettings: {
    promptCacheKey: "support-agent-v2",
  },
});
```

Cache hits appear as cacheReadTokens in UsageInfo and are billed at a discount. No opt-in is needed for basic caching; promptCacheKey only improves hit rates across requests with shared prefixes.
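As a rough illustration of how to gauge cache effectiveness from those usage counts (cacheHitRate is a hypothetical helper, not part of the SDK):

```ts
// Illustrative: fraction of prompt tokens served from cache, assuming the
// usage object reports total input tokens and cacheReadTokens as above.
function cacheHitRate(inputTokens: number, cacheReadTokens: number): number {
  if (inputTokens === 0) return 0;
  return cacheReadTokens / inputTokens;
}

console.log(cacheHitRate(2048, 1024)); // → 0.5
```

A persistently low rate on requests with long shared prefixes suggests the requests are being routed to different cache shards, which is exactly what a shared promptCacheKey is meant to fix.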
## Tool choice
The toolChoice setting controls whether and how the model calls tools. Set it inside modelSettings.
### "auto"

The default. The model decides whether to call a tool or respond with text.
```ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather],
  modelSettings: {
    toolChoice: "auto",
  },
});
```

### "required"

Forces the model to call at least one tool. It will not respond with text alone.
```ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather, searchDocs],
  modelSettings: {
    toolChoice: "required",
  },
});
```

### "none"

Prevents the model from calling any tools, even if tools are defined on the agent. The model responds with text only.
```ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather],
  modelSettings: {
    toolChoice: "none",
  },
});
```

### A specific tool

Forces the model to call one specific tool by name. Useful when you know exactly which tool should run.
```ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather, searchDocs],
  modelSettings: {
    toolChoice: {
      type: "function",
      function: { name: "get_weather" },
    },
  },
});
```

## Tool use behavior
toolUseBehavior is separate from modelSettings: it is set directly on the agent and controls what happens after a tool executes, not what the model generates.
### "run_llm_again"

The default. After a tool executes, the result is sent back to the model so it can generate a follow-up response or call more tools.
```ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather],
  toolUseBehavior: "run_llm_again",
});
```

### "stop_on_first_tool"

Stops the run immediately after the first tool call completes. The tool's return value becomes the run output. The model is not called again.
```ts
const agent = new Agent({
  name: "data-fetcher",
  model,
  tools: [fetchData],
  toolUseBehavior: "stop_on_first_tool",
});

const result = await run(agent, "Get the latest sales data");
// result.output is the return value of fetchData
```

This is useful when the agent's only job is to pick and invoke the right tool.
### stopAtToolNames

Stops only when a specific tool is called. Other tools feed their results back to the model as usual.
```ts
const agent = new Agent({
  name: "researcher",
  model,
  tools: [searchDocs, summarize, finalAnswer],
  toolUseBehavior: {
    stopAtToolNames: ["final_answer"],
  },
});
```

The agent can call searchDocs and summarize as many times as it needs. The run stops only when it calls final_answer.
### A function

Pass a function to decide dynamically whether to stop after tool calls. The function receives all tool results from the current turn and returns true to stop or false to continue.
```ts
const agent = new Agent({
  name: "researcher",
  model,
  tools: [searchDocs, summarize, finalAnswer],
  toolUseBehavior: (toolResults) => {
    return toolResults.some((r) => r.toolName === "final_answer");
  },
});
```

The callback can also be async:
```ts
toolUseBehavior: async (toolResults) => {
  const shouldStop = await checkCompletion(toolResults);
  return shouldStop;
},
```

toolUseBehavior is set on the Agent, not in modelSettings. It controls what happens after a tool executes, not what the model generates.
## Response format
Structured output is configured via outputType on the agent, not through modelSettings directly. When you set outputType to a Zod schema, Stratus sends the appropriate response_format to Azure automatically.
```ts
import { z } from "zod";

const agent = new Agent({
  name: "extractor",
  model,
  outputType: z.object({
    name: z.string(),
    age: z.number(),
  }),
});
```

See Structured Output for details.
## Predicted output
Predicted output speeds up completions when you know roughly what the model will return (e.g. code edits). The model diffs against your prediction instead of generating from scratch.
```ts
const agent = new Agent({
  name: "editor",
  model,
  modelSettings: {
    prediction: {
      type: "content",
      content: existingCode,
    },
  },
});
```

Predicted output is only supported with the Chat Completions API (AzureChatCompletionsModel) on API version 2025-01-01-preview or later.
## Audio output
For gpt-4o-audio deployments, you can request audio output alongside text.
```ts
const agent = new Agent({
  name: "narrator",
  model,
  modelSettings: {
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "mp3" },
  },
});
```

Available voices: "alloy", "echo", "fable", "onyx", "nova", "shimmer".
Available formats: "wav", "mp3", "flac", "opus", "pcm16".
Audio output is only supported with the Chat Completions API (AzureChatCompletionsModel).
## Data sources
Azure On Your Data lets you ground model responses in your own data via Azure Search, Cosmos DB, and other Azure data sources. The model queries the data source and includes relevant results in its context.
```ts
const agent = new Agent({
  name: "rag-agent",
  model,
  modelSettings: {
    dataSources: [{
      type: "azure_search",
      parameters: {
        endpoint: "https://search.example.com",
        index_name: "knowledge-base",
        authentication: { type: "api_key", key: process.env.SEARCH_KEY! },
      },
    }],
  },
});
```

Data sources are only supported with the Chat Completions API (AzureChatCompletionsModel). See the Azure On Your Data documentation for the full schema.
## Encrypted reasoning
When using reasoning models (o3, o4-mini) in stateless mode (store: false), you need to preserve reasoning context across conversation turns. Set include to receive encrypted reasoning items that can be passed back in the next request.
```ts
const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT!,
  apiKey: process.env.AZURE_OPENAI_API_KEY!,
  deployment: "o4-mini",
  store: false,
});

// First turn: request encrypted reasoning
const response = await model.getResponse({
  messages: [{ role: "user", content: "Solve this step by step: 15 * 23 + 7" }],
  modelSettings: {
    reasoningEffort: "medium",
    include: ["reasoning.encrypted_content"],
  },
});

// Extract reasoning items from output
const reasoningItems = response.outputItems?.filter(
  (item) => item.type === "reasoning"
) ?? [];

// Second turn: pass reasoning back to preserve context
const followUp = await model.getResponse({
  messages: [{ role: "user", content: "Now divide that result by 4" }],
  rawInputItems: reasoningItems,
  modelSettings: {
    reasoningEffort: "medium",
    include: ["reasoning.encrypted_content"],
  },
});
```

The encrypted reasoning items are opaque: they can't be read or modified, only passed back to the API. This allows the model to maintain its reasoning chain across turns without server-side storage.
Encrypted reasoning is only supported with the Responses API (AzureResponsesModel) and reasoning models. The include parameter is ignored by non-reasoning models.
## Context compaction
For long-running sessions on the Responses API, server-side context compaction shrinks the context window while preserving essential information.
```ts
const agent = new Agent({
  name: "long-session",
  model: responsesModel,
  modelSettings: {
    contextManagement: [{
      type: "compaction",
      compact_threshold: 200000,
    }],
  },
});
```

When the token count crosses compact_threshold, the API automatically compacts the context and emits a compaction item in outputItems. On subsequent turns the compaction item carries forward essential context using fewer tokens.
You can pass compaction items back in follow-up requests via rawInputItems:
```ts
const result = await model.getResponse({
  messages: [{ role: "user", content: "Continue the conversation" }],
  modelSettings: {
    contextManagement: [{ type: "compaction", compact_threshold: 200000 }],
  },
});

// If compaction occurred, pass the item back in the next request
const compactionItems = result.outputItems?.filter(
  (item) => item.type === "compaction"
) ?? [];

const followUp = await model.getResponse({
  messages: [{ role: "user", content: "What were we talking about?" }],
  rawInputItems: compactionItems,
});
```

For explicit compaction (outside of automatic context_management), use the compact() method on AzureResponsesModel.
Context compaction is only supported with the Responses API (AzureResponsesModel).