When your task can’t be expressed as a prompt (agents, multi-step workflows, custom tooling, or heavy dependencies), connect your code to a playground. The iteration workflow stays the same: run evaluations, compare results side-by-side, and share with teammates. Your code handles task execution; the playground handles the rest. The two approaches differ in where your code runs:
  • Remote evals — Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.
  • Sandboxes — Run evals in an isolated cloud sandbox, controlled from Braintrust. You push an execution artifact (a code bundle or container snapshot) and Braintrust invokes it on demand from the playground. No server to keep running.
    Sandboxes are in beta and require a Pro or Enterprise plan. Self-hosted deployments require data plane version v2.0 (upcoming).

Common use cases

  • Your eval needs to call internal APIs, query private databases, or access services inside your VPN. Because remote evals execute on your infrastructure, that access is already available.
  • Your eval requires software that only runs on a specific OS or machine — for example, a Windows-only simulation or a Unity project on a dedicated workstation. Remote evals let Braintrust trigger execution on whichever machine has the right environment set up.
  • Some tools are too painful to install on every teammate’s machine — game engines, large models, specialized SDKs. Set up the environment once on a shared server and let everyone else run the eval from the playground.
  • Sensitive data stays on your infrastructure. Only results are sent to Braintrust.

Run a remote eval

Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.

1. Write your eval

A remote eval looks like a standard eval call with a parameters field that defines configurable options. These parameters become UI controls in the playground. Install the SDK and dependencies:
# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals
Create the eval code:
import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
  task: async (input, { parameters }) => {
    const completion = await client.chat.completions.create(
      parameters.main.build({ input }),
    );
    return completion.choices[0].message.content ?? "";
  },
  scores: [],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});
The parameter system uses different syntax across languages:
| Feature | TypeScript | Python | Java | Ruby |
| --- | --- | --- | --- | --- |
| Prompt parameters | `type: "prompt"` with messages array in default | `type: "prompt"` with nested prompt.messages and options | `type: "prompt"` with nested prompt.messages and options | Not supported |
| Scalar types | Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()` | Pydantic models with `Field(description=...)` | Map with type, description, default | Hash with type:, description:, default: |
| Parameter access | `parameters.prefix` | `parameters.get("prefix")` | `parameters.get("prefix")` | `parameters["prefix"]` (via keyword arg) |
| Prompt usage | `parameters.main.build({ input: value })` | `**parameters["main"].build(input=value)` | `parameters.get("main").build(Map.of("input", value))` | Not applicable |
| Async | async/await | async/await | Synchronous or CompletableFuture | Synchronous |
To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.
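The `{{input}}` token in the default prompt is a mustache-style placeholder: `parameters.main.build({ input })` fills it with each row's input before the completion call. As a rough, self-contained illustration of that substitution (not the SDK's actual implementation):

```typescript
// Illustrative only: a minimal mustache-style substitution approximating
// what build() does with {{placeholder}} tokens in prompt messages.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (_match: string, key: string) => vars[key] ?? "",
  );
}

// "{{input}}" in the default prompt becomes the row's input value.
console.log(render("{{input}}", { input: "hello" })); // prints "hello"
```

The real `build()` additionally merges in the model and other prompt options, returning an object you can pass directly to `chat.completions.create`.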

2. Expose the eval server

Run your eval with the --dev flag to start a local server:
npx braintrust eval path/to/eval.ts --dev
The dev server starts at http://localhost:8300 by default. Configure the host and port:
  • --dev-host DEV_HOST: The host to bind to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces (be cautious about security when exposing beyond localhost).
  • --dev-port DEV_PORT: The port to bind to. Defaults to 8300.
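For example, to make the dev server reachable from other machines on your network using a non-default port (the eval path here is a placeholder):

```shell
# Bind the dev server to all interfaces on port 8301.
# Only do this on a trusted network: anyone who can reach the port can trigger evals.
npx braintrust eval path/to/eval.ts --dev --dev-host 0.0.0.0 --dev-port 8301
```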

3. Configure in your project

To make your eval accessible beyond localhost, add the endpoint to your project:
  1. In your project, go to Settings.
  2. Under Project, select Remote evals.
  3. Select Create remote eval source.
  4. Enter the name and URL of your remote eval server.
  5. Select Create remote eval source.
All team members with access to the project can now use this remote eval in their playgrounds. Keep the process running while using the remote eval.

4. Run from a playground

  1. Open a playground in your project.
  2. Select + Task.
  3. Choose Remote eval from the task type list.
  4. Select your eval and configure parameters using the UI controls.
  5. Provide data inline or select a dataset, optionally add scorers, and click Run.
Results stream back as the eval executes. You can run multiple instances side-by-side with different parameters to compare results.

Demo

Run a sandbox eval

Sandboxes are in beta and require a Pro or Enterprise plan. Self-hosted deployments require data plane version v2.0 (upcoming).
Run evals in an isolated cloud sandbox, controlled from Braintrust. Push an execution artifact once and Braintrust invokes it on demand from the playground — no server to keep running. Braintrust supports two sandbox providers:
  • Lambda — AWS Lambda-based. The default for braintrust push. Supports both Python and TypeScript. No extra configuration needed.
  • Modal — Container-based via Modal. Requires a snapshotted Modal container image. Executes TypeScript evals only.

1. Write your eval

A sandbox eval looks like a standard eval call with a parameters field that defines configurable options. These parameters become UI controls in the playground. Install the SDK and dependencies:
# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals
Sandboxes require TypeScript SDK v3.7.1+ or Python SDK v0.12.1+.
Create the eval code:
my_eval.eval.ts
import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
  task: async (input, { parameters }) => {
    const completion = await client.chat.completions.create(
      parameters.main.build({ input }),
    );
    return completion.choices[0].message.content ?? "";
  },
  scores: [],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});
The parameter system uses different syntax across languages:
| Feature | TypeScript | Python |
| --- | --- | --- |
| Prompt parameters | `type: "prompt"` with messages array in default | `type: "prompt"` with nested prompt.messages and options |
| Scalar types | Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()` | Pydantic models with `Field(description=...)` |
| Parameter access | `parameters.prefix` | `parameters.get("prefix")` |
| Prompt usage | `parameters.main.build({ input: value })` | `**parameters["main"].build(input=value)` |
| Async | async/await | async/await |
To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.

2. Register your sandbox

braintrust push my_eval.py           # Python
npx braintrust push my_eval.eval.ts  # TypeScript
To include pip dependencies:
braintrust push my_eval.py --requirements requirements.txt
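For a Python eval mirroring the example above, the requirements file is a standard pip requirements.txt. A minimal sketch (package names match the install step; the version floor follows the SDK note above):

```text
braintrust>=0.12.1
openai
autoevals
```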
To run locally and register the sandbox in one step (TypeScript):
npx braintrust eval my_eval.eval.ts --push
To update an existing sandbox:
braintrust push my_eval.py --if-exists replace           # Python
npx braintrust push my_eval.eval.ts --if-exists replace  # TypeScript

3. Run from a playground

  1. Open a playground in your project.
  2. Select + Task.
  3. Open the Remote eval submenu and select your sandbox.
  4. Select your eval and configure parameters using the UI controls.
  5. Provide data inline or select a dataset, optionally add scorers, and click Run.
Results stream back as the eval executes. You can run multiple instances side-by-side with different parameters to compare results.

Limitations

  • The dataset defined in your eval is ignored when running from the playground. Datasets are managed through the playground.
  • Scorers defined in your eval run alongside any scorers added in the playground; the two lists are combined.
  • For sandboxes, each eval run triggered from the playground is capped at 15 minutes end-to-end.

Next steps