
Deployment & Hosting

Deploy Stratus agents in production with containers, sessions, and monitoring

Stratus agents are not stateless request handlers. The run loop maintains conversation history, executes tools, tracks token usage, and manages handoffs across multiple model calls within a single request. This changes how you think about deployment.

How agent runs differ from REST endpoints

A single run() may call the model several times, execute tools between calls, and accumulate state as the conversation evolves. A simple question needs one model call; a research task with four tool calls needs five. Your deployment needs to handle long-lived requests, streaming responses, and graceful cancellation.

Requirements

Requirement | Details
Runtime | Bun 1.0+ or Node.js 20+ (ESM support required)
Network | Outbound HTTPS to your Azure OpenAI endpoint
Memory | 256 MB minimum; 512 MB+ recommended for agents with large tool outputs or long conversation histories
CPU | 1 vCPU minimum; most time is spent waiting on Azure API calls, so CPU is rarely the bottleneck
Environment variables | AZURE_ENDPOINT, AZURE_API_KEY, and your deployment name

Stratus spends most of its time waiting on network I/O (model API calls, tool HTTP requests). A single process can handle many concurrent agent runs without high CPU usage.
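Because runs are I/O-bound, a per-process cap on in-flight runs is often enough before you need to scale out. A minimal in-process limiter (the Limiter class and the cap of 50 are illustrative, not part of the SDK):

limiter.ts
```typescript
// Sketch: cap concurrent agent runs in one process. The Limiter class
// and the cap of 50 are illustrative, not part of stratus-sdk.
class Limiter {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private readonly max: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    // Wait until a slot frees up
    while (this.active >= this.max) {
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake one waiter
    }
  }
}

const limiter = new Limiter(50);
// Usage: await limiter.run(() => run(agent, message, { maxTurns: 3 }));
```

Requests beyond the cap queue in memory rather than piling onto the Azure endpoint, which also gives you a natural place to shed load.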

Deployment patterns

Choose a pattern based on how your agents interact with users.

Ephemeral -- new run per request

Each HTTP request creates a fresh run() with no prior history. Best for one-off tasks like classification, extraction, or single-turn Q&A.

ephemeral.ts
import { AzureResponsesModel } from "stratus-sdk";
import { Agent, run } from "stratus-sdk/core";

const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
});

const agent = new Agent({
  name: "classifier",
  model,
  instructions: "Classify the user's intent as billing, technical, or general.",
});

// Each request gets a clean run - no shared state
async function handleRequest(message: string) {
  const result = await run(agent, message, { maxTurns: 3 }); 
  return { output: result.output, tokens: result.usage.totalTokens };
}

Pros: Simple, horizontally scalable, no state management.

Cons: No conversation memory between requests.

Persistent sessions -- long-lived process

Use createSession() for multi-turn conversations where the process stays alive. Best for chat applications, interactive assistants, and WebSocket servers.

persistent.ts
import { AzureResponsesModel } from "stratus-sdk";
import { createSession } from "stratus-sdk/core";

const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
});

// One session per user connection
const sessions = new Map<string, ReturnType<typeof createSession>>();

function getOrCreateSession(userId: string) {
  if (!sessions.has(userId)) {
    sessions.set(userId, createSession({ 
      model,
      instructions: "You are a helpful assistant.",
      maxTurns: 10,
    }));
  }
  return sessions.get(userId)!;
}

async function handleMessage(userId: string, message: string) {
  const session = getOrCreateSession(userId);
  session.send(message);

  const chunks: string[] = [];
  for await (const event of session.stream()) {
    if (event.type === "content_delta") {
      chunks.push(event.content);
    }
  }

  const result = await session.result;
  return { output: chunks.join(""), tokens: result.usage.totalTokens };
}

Pros: Full conversation history, natural multi-turn flow.

Cons: Sessions are lost on process restart. Memory grows with conversation length.

Hybrid -- save and resume with database persistence

Use save() and resumeSession() to persist conversations across process restarts, deployments, or server instances. Best for workflows that span multiple sessions or need durability.

hybrid.ts
import { AzureResponsesModel } from "stratus-sdk";
import { createSession, resumeSession } from "stratus-sdk/core";
import type { SessionSnapshot } from "stratus-sdk/core";

const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
});

const sessionConfig = {
  model,
  instructions: "You are a helpful assistant.",
  maxTurns: 10,
};

async function handleMessage(sessionId: string | null, message: string, db: Database) {
  let session;

  if (sessionId) {
    // Resume from database
    const saved = await db.get<SessionSnapshot>(`session:${sessionId}`); 
    session = saved
      ? resumeSession(saved, sessionConfig) 
      : createSession(sessionConfig);
  } else {
    session = createSession(sessionConfig);
  }

  session.send(message);

  const chunks: string[] = [];
  for await (const event of session.stream()) {
    if (event.type === "content_delta") {
      chunks.push(event.content);
    }
  }

  const result = await session.result;

  // Persist after each turn
  const snapshot = session.save(); 
  await db.set(`session:${snapshot.id}`, snapshot); 

  return {
    sessionId: snapshot.id,
    output: chunks.join(""),
    tokens: result.usage.totalTokens,
  };
}

Pros: Survives restarts, works across multiple servers, supports long-running workflows.

Cons: Serialization overhead, database dependency. Trim old messages for very long conversations.
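One way to bound snapshot size is to trim history before saving. The message shape below is an assumption, not the SDK's actual SessionSnapshot type; the sketch keeps system instructions plus the most recent messages:

trim-history.ts
```typescript
// Sketch: keep system messages plus the last N conversation messages.
// The Message shape is an assumption, not the SDK's actual type.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

function trimHistory(messages: Message[], keepLast: number): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}
```

Naive trimming can strand a tool result whose matching tool call was dropped; if your history interleaves tool messages, trim on turn boundaries instead.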

HTTP API example

Wrap a Stratus agent in an HTTP endpoint that streams responses as Server-Sent Events. This pattern works for any frontend that consumes SSE.

server.ts
import { Hono } from "hono";
import { streamSSE } from "hono/streaming";
import { AzureResponsesModel } from "stratus-sdk";
import { Agent, stream, RunAbortedError } from "stratus-sdk/core";

const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
});

const agent = new Agent({
  name: "assistant",
  model,
  instructions: "You are a helpful assistant.",
  tools: [/* your tools */],
});

const app = new Hono();

app.post("/chat", async (c) => {
  const { message } = await c.req.json<{ message: string }>();
  const ac = new AbortController();

  // Cancel on client disconnect
  c.req.raw.signal.addEventListener("abort", () => ac.abort()); 

  const { stream: s, result } = stream(agent, message, {
    maxTurns: 10,
    signal: ac.signal, 
  });

  return streamSSE(c, async (sse) => {
    try {
      for await (const event of s) {
        switch (event.type) {
          case "content_delta":
            await sse.writeSSE({
              event: "content",
              data: JSON.stringify({ text: event.content }),
            });
            break;
          case "tool_call_start":
            await sse.writeSSE({
              event: "tool_start",
              data: JSON.stringify({ name: event.toolCall.name }),
            });
            break;
          case "tool_call_done":
            await sse.writeSSE({
              event: "tool_done",
              data: JSON.stringify({ id: event.toolCallId }),
            });
            break;
        }
      }

      const finalResult = await result;
      await sse.writeSSE({
        event: "complete",
        data: JSON.stringify({
          tokens: finalResult.usage.totalTokens,
          finishReason: finalResult.finishReason,
        }),
      });
    } catch (error) {
      if (!(error instanceof RunAbortedError)) {
        await sse.writeSSE({
          event: "error",
          data: JSON.stringify({ message: "Internal error" }),
        });
      }
    }
  });
});

export default app;
The same endpoint with Express instead of Hono:

server.ts
import express from "express";
import { AzureResponsesModel } from "stratus-sdk";
import { Agent, stream, RunAbortedError } from "stratus-sdk/core";

const model = new AzureResponsesModel({
  endpoint: process.env.AZURE_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY!,
  deployment: "gpt-5.2",
});

const agent = new Agent({
  name: "assistant",
  model,
  instructions: "You are a helpful assistant.",
  tools: [/* your tools */],
});

const app = express();
app.use(express.json());

app.post("/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const ac = new AbortController();
  req.on("close", () => ac.abort()); 

  const { message } = req.body;
  const { stream: s, result } = stream(agent, message, {
    maxTurns: 10,
    signal: ac.signal, 
  });

  try {
    for await (const event of s) {
      if (event.type === "content_delta") {
        res.write(`event: content\ndata: ${JSON.stringify({ text: event.content })}\n\n`);
      }
    }

    const finalResult = await result;
    res.write(`event: complete\ndata: ${JSON.stringify({
      tokens: finalResult.usage.totalTokens,
      finishReason: finalResult.finishReason,
    })}\n\n`);
  } catch (error) {
    if (!(error instanceof RunAbortedError)) {
      res.write(`event: error\ndata: ${JSON.stringify({ message: "Internal error" })}\n\n`);
    }
  }

  res.end();
});

app.listen(3000);

Both examples abort the agent run when the client disconnects. This prevents wasted compute on abandoned requests.
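On the client side, the stream can be consumed with fetch plus a small frame parser. The parser below handles only the `event:`/`data:` lines these endpoints emit; it is a sketch, not a full SSE implementation (no comments, no `id:` fields, no CRLF handling):

sse-client.ts
```typescript
// Sketch: parse complete SSE frames out of a text buffer, returning the
// trailing partial frame so the caller can carry it into the next chunk.
interface SSEEvent {
  event: string;
  data: string;
}

function parseSSE(buffer: string): { events: SSEEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // last piece may be incomplete
  const events: SSEEvent[] = [];
  for (const frame of frames) {
    let event = "message";
    let data = "";
    for (const line of frame.split("\n")) {
      if (line.startsWith("event: ")) event = line.slice(7);
      else if (line.startsWith("data: ")) data += line.slice(6);
    }
    events.push({ event, data });
  }
  return { events, rest };
}
```

Decode each chunk from `response.body`, append it to the buffer, call parseSSE, and keep `rest` as the buffer for the next chunk.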

Docker containerization

Package a Stratus agent service as a container. This Dockerfile uses Bun for a lightweight image:

Dockerfile
FROM oven/bun:1 AS base
WORKDIR /app

# Install dependencies
COPY package.json bun.lockb ./
RUN bun install --frozen-lockfile --production

# Copy application code
COPY src/ ./src/
COPY tsconfig.json ./

# Runtime
EXPOSE 3000
ENV NODE_ENV=production
CMD ["bun", "run", "src/server.ts"]

Build and run:

Terminal
docker build -t stratus-agent .
docker run -p 3000:3000 \
  -e AZURE_ENDPOINT="https://your-resource.openai.azure.com" \
  -e AZURE_API_KEY="your-key" \
  stratus-agent

Never bake API keys into the image. Pass them as environment variables at runtime, or use a secrets manager.

For Node.js, swap the base image and entrypoint:

Dockerfile.node
FROM node:20-slim AS base
WORKDIR /app

COPY package.json package-lock.json ./
RUN npm ci --omit=dev

COPY src/ ./src/
COPY tsconfig.json ./

EXPOSE 3000
ENV NODE_ENV=production
CMD ["node", "--import", "tsx", "src/server.ts"]

Note that tsx must be listed in dependencies rather than devDependencies, since the install step omits dev packages. Alternatively, compile with tsc during the build and run the emitted JavaScript.

Preventing infinite loops

An agent with tools can loop indefinitely if the model keeps calling tools without producing a final answer. Two mechanisms protect against this: a turn cap and a wall-clock timeout. Use them together.

maxTurns

Set maxTurns to cap the number of model calls in a single run. When exceeded, Stratus throws MaxTurnsExceededError.

max-turns.ts
import { Agent, run, MaxTurnsExceededError } from "stratus-sdk/core";

const agent = new Agent({
  name: "researcher",
  model,
  tools: [searchWeb, readPage, summarize],
});

try {
  const result = await run(agent, "Research quantum computing breakthroughs", {
    maxTurns: 8, 
  });
  console.log(result.output);
} catch (error) {
  if (error instanceof MaxTurnsExceededError) {
    console.error("Agent exceeded 8 model calls");
  }
}

The default maxTurns is 10. For production, set it explicitly based on your agent's expected behavior. Simple Q&A agents need 2-3 turns. Research agents with multiple tools may need 8-15.

Abort signal with timeout

Use AbortSignal.timeout() to enforce a wall-clock deadline. This catches cases where individual model calls are slow, not just where the agent loops too many times.

timeout.ts
import { Agent, run, RunAbortedError } from "stratus-sdk/core";

try {
  const result = await run(agent, "Summarize this dataset", {
    maxTurns: 10,
    signal: AbortSignal.timeout(30_000), 
  });
  console.log(result.output);
} catch (error) {
  if (error instanceof RunAbortedError) {
    console.error("Agent timed out after 30 seconds");
  }
}

Combined pattern

Use both together for defense in depth:

combined-safety.ts
import { Agent, run, MaxTurnsExceededError, RunAbortedError } from "stratus-sdk/core";

async function safeRun(agent: Agent, input: string) {
  try {
    return await run(agent, input, {
      maxTurns: 10,                        
      signal: AbortSignal.timeout(30_000), 
    });
  } catch (error) {
    if (error instanceof MaxTurnsExceededError) {
      return { error: "too_many_turns", message: "Agent exceeded turn limit" };
    }
    if (error instanceof RunAbortedError) {
      return { error: "timeout", message: "Agent timed out" };
    }
    throw error;
  }
}

Monitoring

Tracing

Wrap agent runs with withTrace() to capture span-level timing for every model call, tool execution, handoff, and guardrail check:

traced-endpoint.ts
import { withTrace, Agent, run } from "stratus-sdk/core";

app.post("/chat", async (req, res) => {
  const { result, trace } = await withTrace("chat_request", async () => { 
    return run(agent, req.body.message, { maxTurns: 10 });
  });

  // Log trace to your observability platform
  for (const span of trace.spans) {
    console.log(`[${span.type}] ${span.name}: ${span.duration}ms`); 
    if (span.type === "model_call" && span.metadata?.usage) {
      console.log(`  tokens: ${JSON.stringify(span.metadata.usage)}`);
    }
  }

  res.json({
    output: result.output,
    traceId: trace.id,
    duration: trace.duration,
  });
});

Each trace includes spans for:

Span type | What it captures
model_call | LLM API call with agent name, turn number, usage, and tool call count
tool_execution | Tool execute function with tool name and duration
handoff | Agent-to-agent transfer with from/to names
guardrail | Input or output guardrail execution
subagent | Sub-agent execution with child agent name

Usage tracking

Every RunResult includes accumulated token usage. Log it to track costs per request:

usage-logging.ts
import type { UsageInfo } from "stratus-sdk/core";

function logUsage(requestId: string, usage: UsageInfo) {
  console.log(JSON.stringify({
    requestId,
    promptTokens: usage.promptTokens,
    completionTokens: usage.completionTokens,
    totalTokens: usage.totalTokens,
    cacheReadTokens: usage.cacheReadTokens ?? 0,
    cacheCreationTokens: usage.cacheCreationTokens ?? 0,
    timestamp: new Date().toISOString(),
  }));
}

// After every run
const result = await run(agent, input);
logUsage(requestId, result.usage); 
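To roll these per-request numbers up (per day, per tenant, per endpoint), a small in-memory tally works. The UsageTotals shape mirrors the UsageInfo fields above; the helper itself is not part of the SDK:

usage-tally.ts
```typescript
// Sketch: accumulate token usage per key (day, tenant, endpoint...).
// UsageTotals mirrors the UsageInfo fields used above.
interface UsageTotals {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

function makeUsageTally() {
  const totals = new Map<string, UsageTotals>();
  return {
    add(key: string, usage: UsageTotals): void {
      const t =
        totals.get(key) ??
        { promptTokens: 0, completionTokens: 0, totalTokens: 0 };
      t.promptTokens += usage.promptTokens;
      t.completionTokens += usage.completionTokens;
      t.totalTokens += usage.totalTokens;
      totals.set(key, t);
    },
    get(key: string): UsageTotals | undefined {
      return totals.get(key);
    },
  };
}

// e.g. tally.add(new Date().toISOString().slice(0, 10), result.usage)
```

For anything beyond a single process, ship the per-request log lines to your metrics pipeline and aggregate there instead.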

Cost management

Built-in cost tracking

Use createCostEstimator() and pass it to run() or createSession() for automatic per-run cost tracking:

cost-tracking.ts
import { Agent, run, createCostEstimator } from "stratus-sdk/core";

const estimator = createCostEstimator({ 
  inputTokenCostPer1k: 0.005,
  outputTokenCostPer1k: 0.015,
  cachedInputTokenCostPer1k: 0.0025,
});

const result = await run(agent, input, {
  costEstimator: estimator, 
});

console.log(`Cost: $${result.totalCostUsd.toFixed(4)}`); 
console.log(`Turns: ${result.numTurns}`);

Budget enforcement

Set maxBudgetUsd to automatically stop runs that exceed a dollar threshold. The onStop hook fires with reason: "max_budget" before MaxBudgetExceededError is thrown.

budget-limits.ts
import { Agent, run, createCostEstimator, MaxBudgetExceededError } from "stratus-sdk/core";

const estimator = createCostEstimator({
  inputTokenCostPer1k: 0.005,
  outputTokenCostPer1k: 0.015,
});

const agent = new Agent({
  name: "researcher",
  model,
  tools: [searchWeb, readPage, summarize],
  hooks: {
    onStop: async ({ reason }) => { 
      if (reason === "max_budget") {
        await logToAnalytics("budget_exceeded");
      }
    },
  },
});

try {
  const result = await run(agent, "Research quantum computing", {
    costEstimator: estimator,
    maxBudgetUsd: 0.50, 
    maxTurns: 15,
  });
  console.log(result.output);
} catch (error) {
  if (error instanceof MaxBudgetExceededError) {
    console.error(`Budget exceeded: spent $${error.spentUsd.toFixed(4)} of $${error.budgetUsd.toFixed(4)}`);
  }
}

Sessions support the same options:

session-budget.ts
const session = createSession({
  model,
  costEstimator: estimator,
  maxBudgetUsd: 1.00, 
});

The budget is checked only after each model call completes, so a single call can push spending past the limit. Set budgets with headroom.
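To size that headroom, estimate the worst case a single call can add: maximum input plus maximum output tokens at your configured rates. A sketch (the parameter names are illustrative, not SDK options):

budget-headroom.ts
```typescript
// Sketch: worst-case cost of one model call at given per-1k-token rates.
// Parameter names are illustrative, not stratus-sdk options.
function worstCaseCallCostUsd(opts: {
  maxInputTokens: number;
  maxOutputTokens: number;
  inputTokenCostPer1k: number;
  outputTokenCostPer1k: number;
}): number {
  return (
    (opts.maxInputTokens / 1000) * opts.inputTokenCostPer1k +
    (opts.maxOutputTokens / 1000) * opts.outputTokenCostPer1k
  );
}

// Set maxBudgetUsd at least one worst-case call below your true hard cap.
```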

Security

Input guardrails

Block harmful or invalid input before it reaches the model. Guardrails run in parallel with the first model call, so they add minimal latency:

production-guardrails.ts
import { Agent } from "stratus-sdk/core";
import type { InputGuardrail } from "stratus-sdk/core";

const piiGuardrail: InputGuardrail = {
  name: "block_pii",
  execute: async (input) => {
    const hasSSN = /\b\d{3}-\d{2}-\d{4}\b/.test(input);
    const hasCreditCard = /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/.test(input);
    return {
      tripwireTriggered: hasSSN || hasCreditCard,
      outputInfo: { reason: "PII detected in input" },
    };
  },
};

const injectionGuardrail: InputGuardrail = {
  name: "block_injection",
  execute: async (input) => {
    const patterns = [
      /ignore (?:all )?(?:previous |prior )?instructions/i,
      /you are now/i,
      /system:\s/i,
    ];
    const triggered = patterns.some((p) => p.test(input));
    return {
      tripwireTriggered: triggered,
      outputInfo: { reason: "Potential prompt injection detected" },
    };
  },
};

const agent = new Agent({
  name: "assistant",
  model,
  inputGuardrails: [piiGuardrail, injectionGuardrail], 
});

Catch guardrail errors in your request handler:

guardrail-handling.ts
import { run, InputGuardrailTripwireTriggered } from "stratus-sdk/core";

try {
  const result = await run(agent, userInput);
  res.json({ output: result.output });
} catch (error) {
  if (error instanceof InputGuardrailTripwireTriggered) { 
    res.status(400).json({
      error: "blocked",
      guardrail: error.guardrailName,
    });
  }
}

Tool permission control with hooks

Use beforeToolCall to enforce authorization rules. The model sees denials as tool results and adapts its response:

permission-hooks.ts
import { Agent } from "stratus-sdk/core";

interface AppContext {
  userId: string;
  role: "user" | "admin";
}

const agent = new Agent<AppContext>({
  name: "admin_assistant",
  model,
  tools: [readData, writeData, deleteData],
  hooks: {
    beforeToolCall: async ({ toolCall, context }) => {
      // Block destructive operations for non-admins
      const destructiveTools = ["write_data", "delete_data"];
      if (
        destructiveTools.includes(toolCall.function.name) &&
        context.role !== "admin"
      ) {
        return { 
          decision: "deny", 
          reason: "This action requires admin privileges.", 
        }; 
      }
    },
    beforeHandoff: async ({ toAgent, context }) => {
      // Prevent handoff to admin agent for non-admin users
      if (toAgent.name === "admin_agent" && context.role !== "admin") {
        return {
          decision: "deny",
          reason: "Access to admin agent denied.",
        };
      }
    },
  },
});

Hook decisions support three modes: "allow" (default), "deny" (block with reason), and "modify" (rewrite tool call arguments). See the Hooks reference for the full ToolCallDecision and HandoffDecision types.
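As an example of the "modify" mode, a hook can redact secrets from tool call arguments before execution. The decision shape below mirrors the deny examples above but is an assumption; check the Hooks reference for the exact ToolCallDecision fields. The redaction logic itself is plain TypeScript:

modify-decision.ts
```typescript
// Sketch: return a "modify" decision that rewrites tool call arguments,
// redacting anything that looks like an sk-prefixed API key.
// The return shape is assumed; verify against ToolCallDecision.
function redactArgs(
  rawArgs: string
): { decision: "modify"; arguments: string } | undefined {
  const redacted = rawArgs.replace(/sk-[A-Za-z0-9]{8,}/g, "[REDACTED]");
  if (redacted !== rawArgs) {
    return { decision: "modify", arguments: redacted };
  }
  return undefined; // returning nothing falls through to "allow"
}

// In beforeToolCall: return redactArgs(toolCall.function.arguments);
```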

Output guardrails

Validate model output before returning it to users. Output guardrails run after the model responds and can block sensitive data from leaking:

output-guardrails.ts
import type { OutputGuardrail } from "stratus-sdk/core";

const noInternalData: OutputGuardrail = {
  name: "no_internal_data",
  execute: async (output) => {
    const hasInternalUrl = /https?:\/\/internal\./i.test(output);
    const hasApiKey = /(?:api[_-]?key|secret|token)\s*[:=]\s*\S+/i.test(output);
    return {
      tripwireTriggered: hasInternalUrl || hasApiKey,
      outputInfo: { reason: "Output contains internal data" },
    };
  },
};

const agent = new Agent({
  name: "assistant",
  model,
  outputGuardrails: [noInternalData], 
});
