When your task can’t be expressed as a prompt (agents, multi-step workflows, custom tooling, or heavy dependencies), connect your code to a playground. The iteration workflow stays the same: run evaluations, compare results side-by-side, and share with teammates. Your code handles task execution; the playground handles the rest. The two approaches differ in where your code runs:
  • Remote evals — Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.
  • Sandboxes — Run evals in an isolated cloud sandbox, controlled from Braintrust. You push an execution artifact (a code bundle or container snapshot) and Braintrust invokes it on demand from the playground. No server to keep running.
    Sandboxes are in beta and require a Pro or Enterprise plan. Self-hosted deployments require data plane version v2.0 (upcoming).

Common use cases

  • Your eval needs to call internal APIs, query private databases, or access services inside your VPN. Because remote evals execute on your infrastructure, that access is already available.
  • Your eval requires software that only runs on a specific OS or machine — for example, a Windows-only simulation or a Unity project on a dedicated workstation. Remote evals let Braintrust trigger execution on whichever machine has the right environment set up.
  • Some tools are too painful to install on every teammate’s machine — game engines, large models, specialized SDKs. Set up the environment once on a shared server and let everyone else run the eval from the playground.
  • Sensitive data stays on your infrastructure. Only results are sent to Braintrust.

Run a remote eval

Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.

1. Write your eval

A remote eval looks like a standard eval call with a parameters field that defines configurable options. These parameters become UI controls in the playground. Install the SDK and dependencies:
# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals
Create the eval code:
import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
  task: async (input, { parameters }) => {
    const completion = await client.chat.completions.create(
      parameters.main.build({ input }),
    );
    return completion.choices[0].message.content ?? "";
  },
  scores: [],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});
The parameter system uses different syntax across languages:
| Feature | TypeScript | Python | Java | Ruby |
| --- | --- | --- | --- | --- |
| Prompt parameters | `type: "prompt"` with messages array in default | `type: "prompt"` with nested prompt.messages and options | `type: "prompt"` with nested prompt.messages and options | Not supported |
| Scalar types | Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()` | Pydantic models with `Field(description=...)` | Map with type, description, default | Hash with type:, description:, default: |
| Parameter access | `parameters.prefix` | `parameters.get("prefix")` | `parameters.get("prefix")` | `parameters["prefix"]` (via keyword arg) |
| Prompt usage | `parameters.main.build({ input: value })` | `**parameters["main"].build(input=value)` | `parameters.get("main").build(Map.of("input", value))` | Not applicable |
| Async | async/await | async/await | Synchronous or CompletableFuture | Synchronous |
To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.
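The `{{input}}` token in the default prompt is a mustache-style placeholder: `parameters.main.build({ input })` fills it with each row's input before the completion call. As a rough, self-contained illustration of that substitution (not the SDK's actual implementation):

```typescript
// Illustrative only: a minimal mustache-style substitution approximating
// what build() does with {{placeholder}} tokens in prompt messages.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (_match: string, key: string) => vars[key] ?? "",
  );
}

// "{{input}}" in the default prompt becomes the row's input value.
console.log(render("{{input}}", { input: "hello" })); // prints "hello"
```

The real `build()` additionally merges in the model and other prompt options, returning an object you can pass directly to `chat.completions.create`.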

2. Expose the eval server

Run your eval with the --dev flag to start a local server:
npx braintrust eval path/to/eval.ts --dev
The dev server starts at http://localhost:8300 by default. Configure the host and port:
  • --dev-host DEV_HOST: The host to bind to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces (be cautious about security when exposing beyond localhost).
  • --dev-port DEV_PORT: The port to bind to. Defaults to 8300.
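For example, to make the dev server reachable from other machines on your network using a non-default port (the eval path here is a placeholder):

```shell
# Bind the dev server to all interfaces on port 8301.
# Only do this on a trusted network: anyone who can reach the port can trigger evals.
npx braintrust eval path/to/eval.ts --dev --dev-host 0.0.0.0 --dev-port 8301
```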

3. Configure in your project

To make your eval accessible beyond localhost, add the endpoint to your project:
  1. In your project, go to Settings.
  2. Under Project, select Remote evals.
  3. Select Create remote eval source.
  4. Enter the name and URL of your remote eval server.
  5. Select Create remote eval source.
All team members with access to the project can now use this remote eval in their playgrounds. Keep the process running while using the remote eval.

4. Run from a playground

  1. Open a playground in your project.
  2. Select + Task.
  3. Choose Remote eval from the task type list.
  4. Select your eval and configure parameters using the UI controls.
  5. Provide data inline or select a dataset, optionally add scorers, and click Run.
Results stream back as the eval executes. You can run multiple instances side-by-side with different parameters to compare results.

Demo

Run a sandbox eval

Sandboxes are in beta and require a Pro or Enterprise plan. Self-hosted deployments require data plane version v2.0 (upcoming).
Run evals in an isolated cloud sandbox, controlled from Braintrust. Push an execution artifact once and Braintrust invokes it on demand from the playground — no server to keep running. Braintrust supports two sandbox providers:
  • Lambda — AWS Lambda-based. The default for braintrust push. Supports both Python and TypeScript. No extra configuration needed.
  • Modal — Container-based via Modal. Requires a snapshotted Modal container image. Executes TypeScript evals only.

1. Write your eval

A sandbox eval looks like a standard eval call with a parameters field that defines configurable options. These parameters become UI controls in the playground. Install the SDK and dependencies:
# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals
Sandboxes require TypeScript SDK v3.7.1+ or Python SDK v0.12.1+.
Create the eval code:
my_eval.eval.ts
import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
  task: async (input, { parameters }) => {
    const completion = await client.chat.completions.create(
      parameters.main.build({ input }),
    );
    return completion.choices[0].message.content ?? "";
  },
  scores: [],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});
The parameter system uses different syntax across languages:
| Feature | TypeScript | Python |
| --- | --- | --- |
| Prompt parameters | `type: "prompt"` with messages array in default | `type: "prompt"` with nested prompt.messages and options |
| Scalar types | Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()` | Pydantic models with `Field(description=...)` |
| Parameter access | `parameters.prefix` | `parameters.get("prefix")` |
| Prompt usage | `parameters.main.build({ input: value })` | `**parameters["main"].build(input=value)` |
| Async | async/await | async/await |
To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.

2. Register your sandbox

braintrust push my_eval.py           # Python
npx braintrust push my_eval.eval.ts  # TypeScript
To include pip dependencies:
braintrust push my_eval.py --requirements requirements.txt
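For a Python eval mirroring the example above, the requirements file is a standard pip requirements.txt. A minimal sketch (package names match the install step; the version floor follows the SDK note above):

```text
braintrust>=0.12.1
openai
autoevals
```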
To run locally and register the sandbox in one step (TypeScript):
npx braintrust eval my_eval.eval.ts --push
To update an existing sandbox:
braintrust push my_eval.py --if-exists replace           # Python
npx braintrust push my_eval.eval.ts --if-exists replace  # TypeScript

3. Run from a playground

  1. Open a playground in your project.
  2. Select + Task.
  3. Open the Remote eval submenu and select your sandbox.
  4. Select your eval and configure parameters using the UI controls.
  5. Provide data inline or select a dataset, optionally add scorers, and click Run.
Results stream back as the eval executes. You can run multiple instances side-by-side with different parameters to compare results.

Limitations

  • The dataset defined in your eval is ignored when running from the playground. Datasets are managed through the playground.
  • Scorers defined in your eval run alongside any scorers added in the playground; the two lists are combined.
  • For sandboxes, each eval run triggered from the playground is capped at 15 minutes end-to-end.

Next steps