Multimodal Input

Send images, files, audio, and mixed content to agents

Send text, images, files (PDFs), audio, or any combination to agents using ContentPart arrays.

Sending an image

Pass a ChatMessage[] array to run() with a UserMessage whose content is a ContentPart[]:

image-input.ts
import { Agent, run } from "@usestratus/sdk/core";
import type { ChatMessage } from "@usestratus/sdk/core";

const agent = new Agent({
  name: "vision",
  model,
  instructions: "Describe what you see in the image.",
});

const messages: ChatMessage[] = [
  {
    role: "user",
    content: [
      {
        type: "image_url",
        image_url: { url: "https://example.com/photo.png" },
      },
    ],
  },
];

const result = await run(agent, messages);
console.log(result.output);

Base64 data URLs work the same way:

base64-image.ts
import { readFile } from "node:fs/promises";

const buffer = await readFile("./chart.png");
const dataUrl = `data:image/png;base64,${buffer.toString("base64")}`;

const messages: ChatMessage[] = [
  {
    role: "user",
    content: [
      {
        type: "image_url",
        image_url: { url: dataUrl },
      },
    ],
  },
];

const result = await run(agent, messages);
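For ad-hoc scripts, a small helper can build the data URL from any local file. This is not part of the SDK, just a convenience sketch using Node's standard library; the extension-to-MIME map is an assumption covering the common cases:

```typescript
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

// Hypothetical helper: map common extensions to MIME types.
const MIME_TYPES: Record<string, string> = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".gif": "image/gif",
  ".webp": "image/webp",
  ".pdf": "application/pdf",
};

// Read a local file and encode it as a base64 data URL.
export async function toDataUrl(path: string): Promise<string> {
  const mime = MIME_TYPES[extname(path).toLowerCase()];
  if (!mime) throw new Error(`Unknown file extension: ${extname(path)}`);
  const buffer = await readFile(path);
  return `data:${mime};base64,${buffer.toString("base64")}`;
}
```

The resulting string can be dropped straight into an `image_url` (or `file`) part, as in the examples above.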

Mixed text and images

Combine text and image parts in a single message:

mixed-content.ts
import type { ContentPart } from "@usestratus/sdk/core";

const parts: ContentPart[] = [
  { type: "text", text: "Compare these two charts and summarize the differences." },
  { type: "image_url", image_url: { url: "https://example.com/chart-q1.png" } },
  { type: "image_url", image_url: { url: "https://example.com/chart-q2.png" } },
];

const messages: ChatMessage[] = [{ role: "user", content: parts }];

const result = await run(agent, messages);
console.log(result.output);

Image detail levels

The detail parameter controls how the model processes the image:

Level     Description
"auto"    The model decides based on image size (default)
"low"     Fixed low-resolution processing. Faster and uses fewer tokens
"high"    High-resolution processing with tiled analysis. More accurate for detailed images

Set the detail level on the image_url object:

detail-level.ts
const parts: ContentPart[] = [
  { type: "text", text: "Read the fine print in this contract." },
  {
    type: "image_url",
    image_url: {
      url: "https://example.com/contract.png",
      detail: "high", 
    },
  },
];

Use "low" when you only need a general understanding of the image. It processes faster and consumes fewer tokens. Use "high" when fine details matter, such as reading text in screenshots or analyzing charts.
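For example, a thumbnail-style request might opt into "low" explicitly. The ImageContentPart interface is inlined here so the snippet stands alone:

```typescript
// ImageContentPart shape, reproduced locally so this snippet is self-contained.
interface ImageContentPart {
  type: "image_url";
  image_url: { url: string; detail?: "auto" | "low" | "high" };
}

// A general "what is this?" question only needs a rough impression,
// so "low" keeps latency and token usage down.
const part: ImageContentPart = {
  type: "image_url",
  image_url: { url: "https://example.com/thumbnail.png", detail: "low" },
};
```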

Sending a file (PDF)

Pass PDF files as base64 data URLs or file IDs. Only supported by AzureResponsesModel.

pdf-input.ts
import { readFile } from "node:fs/promises";
import type { ChatMessage } from "@usestratus/sdk/core";

const buffer = await readFile("./report.pdf");
const dataUrl = `data:application/pdf;base64,${buffer.toString("base64")}`;

const messages: ChatMessage[] = [
  {
    role: "user",
    content: [
      { type: "file", file: { url: dataUrl }, filename: "report.pdf" },
      { type: "text", text: "Summarize this PDF" },
    ],
  },
];

const result = await run(agent, messages);

If you've uploaded the file via the Azure Files API, use a file ID instead:

file-id-input.ts
const messages: ChatMessage[] = [
  {
    role: "user",
    content: [
      { type: "file", file: { file_id: "assistant-KaVLJQ..." } },
      { type: "text", text: "What does this document say?" },
    ],
  },
];

Sending audio

Pass audio as a URL or inline base64 data. Only supported by AzureResponsesModel.

audio-input.ts
import { readFile } from "node:fs/promises";
import type { ChatMessage } from "@usestratus/sdk/core";

// Read a local WAV file and encode it as base64.
const base64AudioData = (await readFile("./speech.wav")).toString("base64");

const messages: ChatMessage[] = [
  {
    role: "user",
    content: [
      { type: "audio", audio: { data: base64AudioData, format: "wav" } },
      { type: "text", text: "Transcribe this audio" },
    ],
  },
];

With sessions

session.send() accepts a ContentPart[] directly:

session-multimodal.ts
import { createSession } from "@usestratus/sdk/core";
import type { ContentPart } from "@usestratus/sdk/core";

const session = createSession({
  model,
  instructions: "You are a helpful vision assistant.",
});

const parts: ContentPart[] = [
  { type: "text", text: "What is in this image?" },
  { type: "image_url", image_url: { url: "https://example.com/photo.png" } },
];

session.send(parts); 
for await (const event of session.stream()) {
  if (event.type === "content_delta") process.stdout.write(event.content);
}

Follow-up messages can reference the image from the previous turn:

session.send("What colors are most prominent in that image?");
for await (const event of session.stream()) {
  if (event.type === "content_delta") process.stdout.write(event.content);
}

With prompt()

prompt() also accepts ContentPart[] as input:

prompt-multimodal.ts
import { prompt } from "@usestratus/sdk/core";
import type { ContentPart } from "@usestratus/sdk/core";

const parts: ContentPart[] = [
  { type: "text", text: "Describe this image in one sentence." },
  { type: "image_url", image_url: { url: "https://example.com/sunset.png" } },
];

const result = await prompt(parts, { model });
console.log(result.output);

Image support depends on the model deployment. Most gpt-5.x deployments support vision.

ContentPart types

interface TextContentPart {
  type: "text";
  text: string;
}

interface ImageContentPart {
  type: "image_url";
  image_url: {
    url: string;
    detail?: "auto" | "low" | "high";
  };
}

interface FileContentPart {
  type: "file";
  file: { url: string } | { file_id: string };
  filename?: string;
}

interface AudioContentPart {
  type: "audio";
  audio: { url: string } | { data: string; format: "wav" | "mp3" };
}

type ContentPart = TextContentPart | ImageContentPart | FileContentPart | AudioContentPart;

UserMessage.content accepts either a plain string or a ContentPart[] array. When you pass a string, it behaves as a single text part.
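To illustrate, these two messages are equivalent. The types are inlined here as a simplified sketch of the SDK's shapes:

```typescript
// Simplified local stand-ins for the SDK's UserMessage and ContentPart types.
type ContentPart = { type: "text"; text: string };
interface UserMessage {
  role: "user";
  content: string | ContentPart[];
}

// A plain string...
const short: UserMessage = { role: "user", content: "Hello!" };

// ...behaves the same as a single text part.
const explicit: UserMessage = {
  role: "user",
  content: [{ type: "text", text: "Hello!" }],
};
```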

FileContentPart and AudioContentPart are only supported by AzureResponsesModel. They are converted to the Responses API's input_file and input_audio types respectively.

Next steps

  • Sessions - Multi-turn conversations with persistent history
  • Streaming - Stream responses token by token
  • Structured Output - Parse model output into typed objects
  • Tools - Give agents the ability to call functions