
Model Settings

Configure temperature, tool choice, and other model parameters

Model settings control how the model generates responses. You set them on the agent at construction time.

Setting on an agent

Pass a modelSettings object when creating an agent:

agent-settings.ts
import { Agent } from "@usestratus/sdk/core";

const agent = new Agent({
  name: "assistant",
  model,
  modelSettings: {
    temperature: 0.7,
    maxTokens: 1000,
  },
});

Settings are sent to the model on every call the agent makes. To change settings between runs, clone the agent with new values:

clone-settings.ts
const creativeAgent = agent.clone({
  modelSettings: { temperature: 1.2, topP: 0.95 },
});

ModelSettings reference

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | number | Model default | Sampling temperature. Higher values (closer to 2) produce more random output; lower values (closer to 0) produce more deterministic output. Range: 0--2. |
| topP | number | Model default | Nucleus sampling. The model samples from the smallest set of tokens whose cumulative probability reaches this threshold. Range: 0--1. |
| maxTokens | number | Model default | Maximum number of tokens to generate in the response. |
| stop | string[] | undefined | Stop sequences. The model stops generating when it produces any of these strings. |
| presencePenalty | number | 0 | Penalizes tokens that have already appeared, encouraging the model to talk about new topics. Range: -2 to 2. |
| frequencyPenalty | number | 0 | Penalizes tokens proportional to how often they've appeared, reducing repetition. Range: -2 to 2. |
| toolChoice | ToolChoice | "auto" | Controls which tools the model can call. See Tool choice. |
| parallelToolCalls | boolean | true | Whether the model can call multiple tools in a single turn. |
| seed | number | undefined | Seed for deterministic sampling. Repeated requests with the same seed and parameters should return the same result. |
| reasoningEffort | ReasoningEffort | undefined | Controls how much reasoning effort the model spends. See Reasoning models. |
| maxCompletionTokens | number | undefined | Max tokens for the model's completion, including reasoning tokens. Use instead of maxTokens for reasoning models. |
| reasoningSummary | ReasoningSummary | undefined | Controls reasoning summary output: "auto", "concise", or "detailed". |
| promptCacheKey | string | undefined | Influences prompt cache routing. Requests with the same key and prefix are more likely to hit cache. See Prompt caching. |
| truncation | Truncation | undefined | Input truncation strategy: "auto" (truncate oldest messages) or "disabled" (fail on overflow). |
| store | boolean | undefined | Whether to store the request/response server-side. Required for previousResponseId chaining. |
| metadata | Record<string, string> | undefined | Arbitrary key-value metadata attached to the API request. |
| user | string | undefined | End-user identifier for abuse monitoring. |
| logprobs | boolean | undefined | Whether to return log probabilities of output tokens. |
| topLogprobs | number | undefined | Number of most likely tokens to return per position (0--20). Requires logprobs: true. |
| prediction | PredictedOutput | undefined | Predicted output for faster completions. Chat Completions only. See Predicted output. |
| modalities | Modality[] | ["text"] | Output modalities. Set to ["text", "audio"] for audio output. Chat Completions only. See Audio output. |
| audio | AudioConfig | undefined | Audio voice and format config. Requires modalities: ["text", "audio"]. Chat Completions only. |
| dataSources | DataSource[] | undefined | Azure On Your Data sources for RAG. Chat Completions only. See Data sources. |
| contextManagement | ContextManagement | undefined | Server-side context compaction rules. Responses API only. See Context compaction. |
| include | string[] | undefined | Fields to include in the response. Responses API only. See Encrypted reasoning. |
| background | boolean | undefined | Run as a background task for long-running requests. Responses API only. See Background tasks. |

Reasoning models

For reasoning models (o1, o3, etc.), use reasoningEffort and maxCompletionTokens instead of temperature and maxTokens.

reasoningEffort controls how much internal reasoning the model does before responding. Higher effort produces more thorough answers but uses more tokens and takes longer.

reasoning-settings.ts
import { Agent } from "@usestratus/sdk/core";

const agent = new Agent({
  name: "analyst",
  model,
  modelSettings: {
    reasoningEffort: "high", 
    maxCompletionTokens: 16384, 
  },
});

Valid values for reasoningEffort:

| Value | Description |
| --- | --- |
| "none" | No reasoning |
| "minimal" | Minimal reasoning |
| "low" | Low effort |
| "medium" | Medium effort (default for reasoning models) |
| "high" | High effort |
| "xhigh" | Maximum effort |

maxCompletionTokens includes both reasoning tokens and output tokens. If the model uses 1000 tokens for reasoning and 500 for the response, that's 1500 total against the limit. Reasoning tokens are tracked in UsageInfo.reasoningTokens.

Prompt caching

Azure automatically caches prompt prefixes for requests over 1,024 tokens. Use promptCacheKey to improve cache hit rates when many requests share long common prefixes.

cache-key.ts
const agent = new Agent({
  name: "assistant",
  model,
  modelSettings: {
    promptCacheKey: "support-agent-v2", 
  },
});

Cache hits appear as cacheReadTokens in UsageInfo and are billed at a discount. No opt-in is needed for basic caching — promptCacheKey is only for improving hit rates across requests with shared prefixes.

Tool choice

The toolChoice setting controls whether and how the model calls tools. Set it inside modelSettings.

"auto" -- the default. The model decides whether to call a tool or respond with text.

tool-choice-auto.ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather],
  modelSettings: {
    toolChoice: "auto", 
  },
});

"required" -- forces the model to call at least one tool. It will not respond with text alone.

tool-choice-required.ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather, searchDocs],
  modelSettings: {
    toolChoice: "required", 
  },
});

"none" -- prevents the model from calling any tools, even if tools are defined on the agent. The model responds with text only.

tool-choice-none.ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather],
  modelSettings: {
    toolChoice: "none", 
  },
});

{ type: "function" } -- forces the model to call one specific tool by name. Useful when you know exactly which tool should run.

tool-choice-function.ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather, searchDocs],
  modelSettings: {
    toolChoice: { 
      type: "function", 
      function: { name: "get_weather" }, 
    }, 
  },
});

Tool use behavior

toolUseBehavior is separate from modelSettings. It is set directly on the agent and controls what happens after a tool executes -- not what the model generates.

"run_llm_again" -- the default. After a tool executes, the result is sent back to the model so it can generate a follow-up response or call more tools.

behavior-run-again.ts
const agent = new Agent({
  name: "assistant",
  model,
  tools: [getWeather],
  toolUseBehavior: "run_llm_again", 
});

"stop_on_first_tool" -- stops the run immediately after the first tool call completes. The tool's return value becomes the run output. The model is not called again.

behavior-stop-first.ts
const agent = new Agent({
  name: "data-fetcher",
  model,
  tools: [fetchData],
  toolUseBehavior: "stop_on_first_tool", 
});

const result = await run(agent, "Get the latest sales data");
// result.output is the return value of fetchData

This is useful when the agent's only job is to pick and invoke the right tool.

stopAtToolNames -- stops the run only when one of the named tools is called. Other tools feed their results back to the model as usual.

behavior-stop-at.ts
const agent = new Agent({
  name: "researcher",
  model,
  tools: [searchDocs, summarize, finalAnswer],
  toolUseBehavior: { 
    stopAtToolNames: ["final_answer"], 
  }, 
});

The agent can call searchDocs and summarize as many times as it needs. The run stops only when it calls final_answer.

A custom function -- decides dynamically whether to stop after tool calls. It receives all tool results from the current turn and returns true to stop or false to continue.

behavior-function.ts
const agent = new Agent({
  name: "researcher",
  model,
  tools: [searchDocs, summarize, finalAnswer],
  toolUseBehavior: (toolResults) => { 
    return toolResults.some((r) => r.toolName === "final_answer"); 
  }, 
});

The callback can also be async:

toolUseBehavior: async (toolResults) => {
  const shouldStop = await checkCompletion(toolResults);
  return shouldStop;
},

toolUseBehavior is set on the Agent, not in modelSettings. It controls what happens after a tool executes, not what the model generates.

Response format

Structured output is configured via outputType on the agent, not through modelSettings directly. When you set outputType to a Zod schema, Stratus sends the appropriate response_format to Azure automatically.

const agent = new Agent({
  name: "extractor",
  model,
  outputType: z.object({
    name: z.string(),
    age: z.number(),
  }),
});

See Structured Output for details.

Predicted output

Predicted output speeds up completions when you know roughly what the model will return (e.g. code edits). The model diffs against your prediction instead of generating from scratch.

predicted-output.ts
const agent = new Agent({
  name: "editor",
  model,
  modelSettings: {
    prediction: {
      type: "content",
      // existingCode: the current file contents, which the model is
      // expected to return mostly unchanged
      content: existingCode,
    },
  },
});

Predicted output is only supported with the Chat Completions API (AzureChatCompletionsModel) on API version 2025-01-01-preview or later.

Audio output

For gpt-4o-audio deployments, you can request audio output alongside text.

audio-output.ts
const agent = new Agent({
  name: "narrator",
  model,
  modelSettings: {
    modalities: ["text", "audio"], 
    audio: { voice: "alloy", format: "mp3" }, 
  },
});

Available voices: "alloy", "echo", "fable", "onyx", "nova", "shimmer".

Available formats: "wav", "mp3", "flac", "opus", "pcm16".

Audio output is only supported with the Chat Completions API (AzureChatCompletionsModel).

Data sources

Azure On Your Data lets you ground model responses in your own data via Azure Search, Cosmos DB, and other Azure data sources. The model queries the data source and includes relevant results in its context.

data-sources.ts
const agent = new Agent({
  name: "rag-agent",
  model,
  modelSettings: {
    dataSources: [{ 
      type: "azure_search", 
      parameters: { 
        endpoint: "https://search.example.com", 
        index_name: "knowledge-base", 
        authentication: { type: "api_key", key: process.env.SEARCH_KEY! }, 
      }, 
    }], 
  },
});

Data sources are only supported with the Chat Completions API (AzureChatCompletionsModel). See the Azure On Your Data documentation for the full schema.

Encrypted reasoning

When using reasoning models (o3, o4-mini) in stateless mode (store: false), you need to preserve reasoning context across conversation turns. Set include to receive encrypted reasoning items that can be passed back in the next request.

encrypted-reasoning.ts
const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT!,
  apiKey: process.env.AZURE_OPENAI_API_KEY!,
  deployment: "o4-mini",
  store: false,
});

// First turn: request encrypted reasoning
const response = await model.getResponse({
  messages: [{ role: "user", content: "Solve this step by step: 15 * 23 + 7" }],
  modelSettings: {
    reasoningEffort: "medium",
    include: ["reasoning.encrypted_content"], 
  },
});

// Extract reasoning items from output
const reasoningItems = response.outputItems?.filter(
  (item) => item.type === "reasoning"
) ?? [];

// Second turn: pass reasoning back to preserve context
const followUp = await model.getResponse({
  messages: [{ role: "user", content: "Now divide that result by 4" }],
  rawInputItems: reasoningItems, 
  modelSettings: {
    reasoningEffort: "medium",
    include: ["reasoning.encrypted_content"],
  },
});

The encrypted reasoning items are opaque — they can't be read or modified, only passed back to the API. This allows the model to maintain its reasoning chain across turns without server-side storage.

Encrypted reasoning is only supported with the Responses API (AzureResponsesModel) and reasoning models. The include parameter is ignored by non-reasoning models.

Context compaction

For long-running sessions on the Responses API, server-side context compaction shrinks the context window while preserving essential information.

context-compaction.ts
const agent = new Agent({
  name: "long-session",
  model: responsesModel,
  modelSettings: {
    contextManagement: [{ 
      type: "compaction", 
      compact_threshold: 200000, 
    }], 
  },
});

When the output token count crosses compact_threshold, the API automatically compacts the context and emits a compaction item in outputItems. On subsequent turns the compaction item carries forward essential context using fewer tokens.

You can pass compaction items back in follow-up requests via rawInputItems:

compaction-round-trip.ts
const result = await model.getResponse({
  messages: [{ role: "user", content: "Continue the conversation" }],
  modelSettings: {
    contextManagement: [{ type: "compaction", compact_threshold: 200000 }],
  },
});

// If compaction occurred, pass the item back in the next request
const compactionItems = result.outputItems?.filter(
  (item) => item.type === "compaction"
) ?? [];

const followUp = await model.getResponse({
  messages: [{ role: "user", content: "What were we talking about?" }],
  rawInputItems: compactionItems, 
});

For explicit compaction (outside of automatic context_management), use the compact() method on AzureResponsesModel.

Context compaction is only supported with the Responses API (AzureResponsesModel).

Next steps

  • Tools -- define functions the model can call
  • Agents -- agent configuration reference
  • Streaming -- stream responses in real time
  • Hooks -- intercept tool calls and handoffs before they execute