Code Execution

A tutorial — grow one program across four steps: run a model-written snippet in a sandbox, grant it one capability, swap the backend, then gate it behind approval.

The shell tool runs a command. code_execution runs a program: a snippet in a named language that the model writes and your machine runs for real, handing back the captured output. Reach for it when the model should compute an answer (parse data, do math, transform a file) instead of guessing one, so the result is measured rather than invented by the model.

Like every dangerous built-in, it follows the SDK's split: the SDK owns the model-facing contract — codeExecutionTool fixes the code_execution name, its { language, code } schema, and how the result is formatted — and you own the backend that actually runs the code. That boundary is the whole point. Running model-written code is the single most dangerous thing an agent does, so the core ships no executor; you choose one.

Why a separate tool from `shell`

shell and code_execution both run model-written instructions, but they differ on the two things that matter:

	`shell`	`code_execution`
Input	one command string	a `{ language, code }` snippet
Isolation (shipped backend)	none — `bunShellBackend` runs on the host	a deny-by-default sandbox — `denoCodeExecutionBackend`

The isolation row is why this isn't just "shell with node -e". Code execution is exactly where you expect a sandbox, so the reference backend is sandboxed by default.

The tutorial

Starting from the multi-turn chat loop, this grows one program across four steps:

Run a snippet the model writes, in a sandbox.
Grant one capability — open a single hole in the deny-by-default sandbox.
Swap the backend — the same tool over a runtime you control.
Gate it — put a human in front once the sandbox is gone.

Each step below shows the whole program so far — the lines it adds are highlighted. Steps 1–2 need the deno binary on PATH (deno.com); steps 3–4 swap to a Bun backend, so they don't.

Step 1 — Run a Snippet

codeExecutionTool is the model-facing contract; denoCodeExecutionBackend() is the host glue that runs the code. The Deno backend is deny-by-default — the snippet gets no file, network, or env access — so you can run untrusted, model-written code with no permission gate. Add it to the chat loop's tools array and steer the model with a system prompt:

examples/code-execution-tutorial/step1.ts

import {
  AgentEventType,
  codeExecutionTool,
  runAgent,
  SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent } from "@open-agent-loops/core";
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { denoCodeExecutionBackend } from "./deno-backends";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";

const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
  console.error("Set LLM_API_KEY (see .env.example).");
  process.exit(1);
}

// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
  apiKey,
  baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
  model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
  thinking: "on",
});

// `codeExecutionTool` is the stable, model-facing contract; the backend is the
// swappable host glue that actually runs the code. `denoCodeExecutionBackend()` is
// deny-by-default — the snippet gets NO file, network, or env access, so we can run
// untrusted, model-written code with no permission gate.
const codeExecution = codeExecutionTool(denoCodeExecutionBackend());

// Steer the model to the tool and the sandbox's one constraint: print the result,
// and write JavaScript or TypeScript (the Deno backend rejects other languages).
const system = [
  "You are a coding agent. When a question needs computation, write a short",
  "snippet and run it with the `code_execution` tool instead of doing the math",
  "in your head. The sandbox runs JavaScript or TypeScript — use console.log to",
  "print the result so it comes back to you.",
].join("\n");

// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the sandbox sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
  switch (e.type) {
    case AgentEventType.ReasoningDelta:
      process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
      break;
    case AgentEventType.TextDelta:
      process.stdout.write(e.text);
      break;
    case AgentEventType.ToolStart:
      console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
      break;
    case AgentEventType.ToolEnd:
      console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
      break;
  }
}

// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });

while (true) {
  const prompt = (await rl.question("\nyou › ")).trim();
  if (prompt === "" || prompt === "exit") break;

  process.stdout.write("bot › ");
  await runAgent({
    model,
    memory,
    sessionId,
    prompt,
    system,
    tools: [codeExecution],
    onEvent: render,
  });
}
rl.close();

bun run examples/code-execution-tutorial/step1.ts
you › what is the 20th Fibonacci number? compute it by running code.

The model emits code_execution({ language, code }) and runs nothing itself; the loop validates the args and hands them to your backend, which launches deno as a child, pipes the code in, and captures the real { stdout, stderr, exitCode }. formatCodeExecutionResult folds that into one string the next model turn reads — always ending in a verdict (6765 then [exit 0: ok]), so a run is never a contentless result. The round trip, end to end:

Figure: the code_execution round trip, in order.

model (ModelClient): the model only asks.
codeExecutionTool (SDK contract): validates { language, code } against the Zod schema.
deno, sandboxed child (CodeExecutionBackend, you own): deny-by-default; runs the snippet for real.
formatCodeExecutionResult() (SDK contract): folds stdout/stderr/exitCode into one string.
append to message history (Memory): the tool result lands in Memory.

Who owns what: the model asks and never computes; the SDK contract is fixed; the backend is yours (the swap point); Memory stores the result.

The result is measured inside the sandbox: the model only asks and never produces the output. Swap the CodeExecutionBackend for a container or cloud runner and nothing else changes.

The system prompt earns its two lines. "use console.log to print the result" — a snippet runs as a script, not a REPL, so a bare final expression (6 * 7) is computed and thrown away; the model has to print what it wants back. "runs JavaScript or TypeScript" — the Deno battery is JS/TS only, so steering the model there keeps it from reaching for Python, which the backend rejects.
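To see why the print matters, here is a minimal sketch that treats a snippet as a script and captures only what it logs. `runAsScript` is a hypothetical stand-in, not the Deno backend: it evaluates the code with `new Function` and a fake `console`, which is enough to show that a bare expression leaves nothing behind.

```typescript
// Hypothetical stand-in for "run a snippet as a script": it captures only what
// the snippet prints. This is NOT how the Deno backend works internally.
function runAsScript(code: string): string {
  const lines: string[] = [];
  const fakeConsole = {
    log: (...args: unknown[]) => {
      lines.push(args.map(String).join(" "));
    },
  };
  // The snippet body runs as statements; the value of a trailing expression is discarded.
  new Function("console", code)(fakeConsole);
  return lines.join("\n");
}

const bare = runAsScript("6 * 7"); // computed, then thrown away
const printed = runAsScript("console.log(6 * 7)"); // printed, so it comes back

console.log(JSON.stringify(bare)); // ""
console.log(JSON.stringify(printed)); // "42"
```

The same holds for any script runner: only what reaches stdout can be folded into the tool result.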

Step 2 — Grant a Capability

Deny-by-default means a snippet can compute but can't touch your disk, network, or env — so the moment the model needs to read a real file, the read fails closed and comes back as [exit 1: error]. You open exactly one hole: pass allow to the backend. The highlighted lines grant read access to just the tutorial folder (and nothing else), then point the model at a data file:

examples/code-execution-tutorial/step2.ts

import {
  AgentEventType,
  codeExecutionTool,
  runAgent,
  SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent } from "@open-agent-loops/core";
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { denoCodeExecutionBackend } from "./deno-backends";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";

const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
  console.error("Set LLM_API_KEY (see .env.example).");
  process.exit(1);
}

// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
  apiKey,
  baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
  model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
  thinking: "on",
});

// Still deny-by-default — now with ONE capability granted: the snippet may READ
// files under the tutorial folder, and nothing else (no write, network, or env).
// A path-scoped grant; pass `read: true` to allow all reads, or grant `net`/`write`/`env`.
const codeExecution = codeExecutionTool(
  denoCodeExecutionBackend({ allow: { read: ["examples/code-execution-tutorial"] } }),
);

// Steer the model to the tool and the sandbox's one constraint: print the result,
// and write JavaScript or TypeScript (the Deno backend rejects other languages).
const system = [
  "You are a coding agent. When a question needs computation, write a short",
  "snippet and run it with the `code_execution` tool instead of doing the math",
  "in your head. The sandbox runs JavaScript or TypeScript — use console.log to",
  "print the result so it comes back to you.",
  "You may read files under examples/code-execution-tutorial/, such as data.txt.", 
].join("\n");

// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the sandbox sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
  switch (e.type) {
    case AgentEventType.ReasoningDelta:
      process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
      break;
    case AgentEventType.TextDelta:
      process.stdout.write(e.text);
      break;
    case AgentEventType.ToolStart:
      console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
      break;
    case AgentEventType.ToolEnd:
      console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
      break;
  }
}

// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });

while (true) {
  const prompt = (await rl.question("\nyou › ")).trim();
  if (prompt === "" || prompt === "exit") break;

  process.stdout.write("bot › ");
  await runAgent({
    model,
    memory,
    sessionId,
    prompt,
    system,
    tools: [codeExecution],
    onEvent: render,
  });
}
rl.close();

bun run examples/code-execution-tutorial/step2.ts
you › add up the numbers in examples/code-execution-tutorial/data.txt

The snippet now reads data.txt and prints the sum (153). allow is the whole deny-by-default story in one object: read, write, net, and env, each either true (grant broadly) or an array that scopes it to specific paths, hosts, or variable names. There is deliberately no run grant — spawning a subprocess is the escape hatch out of the sandbox, and this backend keeps it shut. Grant the least you can: a path, not the disk; a host, not the network.
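One way to picture the `allow` object is as a list of Deno permission flags. The sketch below is an assumption about how such a mapping could look, not the backend's actual code: `AllowSketch` and `toDenoFlags` are hypothetical names, `api.example.com` is a placeholder host, and only the `--allow-read`, `--allow-write`, `--allow-net`, and `--allow-env` flags themselves are real Deno CLI options.

```typescript
// Hypothetical mirror of the `allow` option described above; the real type in
// the backend may differ.
type Grant = true | string[];
interface AllowSketch {
  read?: Grant;
  write?: Grant;
  net?: Grant;
  env?: Grant;
}

// Assumption: each grant maps onto the matching Deno CLI permission flag.
// `true` becomes the broad flag; an array becomes a comma-separated scope.
function toDenoFlags(allow: AllowSketch = {}): string[] {
  const flags: string[] = [];
  for (const key of ["read", "write", "net", "env"] as const) {
    const grant = allow[key];
    // Not granted (or an empty scope): the capability stays denied.
    if (grant === undefined || (grant !== true && grant.length === 0)) continue;
    flags.push(grant === true ? `--allow-${key}` : `--allow-${key}=${grant.join(",")}`);
  }
  return flags; // note: there is no branch that emits --allow-run
}

const nothing = toDenoFlags(); // deny-by-default: no flags at all
const scoped = toDenoFlags({
  read: ["examples/code-execution-tutorial"],
  net: ["api.example.com"], // hypothetical host
});

console.log(nothing.length); // 0
console.log(scoped.join(" "));
```

Scoped grants keep the blast radius small: the second call yields one read flag limited to the tutorial folder and one net flag limited to a single host.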

Step 3 — Swap the Backend

The backend is the swap point. Anything implementing CodeExecutionBackend — one exec({ language, code }, ctx) method returning { stdout, stderr, exitCode } — drops in behind the same codeExecutionTool. The highlighted lines replace the Deno battery with a hand-written Bun backend that runs the snippet on Bun instead: write it to a temp file, bun run it, capture the output.

examples/code-execution-tutorial/step3.ts

import {
  AgentEventType,
  codeExecutionTool,
  runAgent,
  SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent, CodeExecutionBackend } from "@open-agent-loops/core"; 
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";
import { tmpdir } from "node:os"; 
import { join } from "node:path";
import { unlink } from "node:fs/promises";

const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
  console.error("Set LLM_API_KEY (see .env.example).");
  process.exit(1);
}

// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
  apiKey,
  baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
  model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
  thinking: "on",
});

// Your own backend: implement `CodeExecutionBackend` and it drops in behind the
// SAME `codeExecutionTool`. This one runs the snippet on Bun instead of Deno —
// write it to a temp file, `bun run` it, capture stdout/stderr/exit. A container
// or hosted-cloud backend slots in the same way; only the body of `exec` changes.
// WARNING: unlike the Deno sandbox, Bun runs with full host access — no isolation.
const bunBackend: CodeExecutionBackend = {
  async exec(request, ctx) {
    const lang = request.language.toLowerCase();
    if (!["javascript", "js", "typescript", "ts"].includes(lang)) {
      throw new Error(`This backend runs JavaScript/TypeScript only; got "${request.language}".`);
    }
    const ext = lang === "ts" || lang === "typescript" ? "ts" : "js";
    const file = join(tmpdir(), `snippet-${crypto.randomUUID()}.${ext}`);
    await Bun.write(file, request.code);
    try {
      const proc = Bun.spawn(["bun", "run", file], { stdout: "pipe", stderr: "pipe", signal: ctx.signal });
      const [stdout, stderr, exitCode] = await Promise.all([
        new Response(proc.stdout).text(),
        new Response(proc.stderr).text(),
        proc.exited,
      ]);
      return { stdout, stderr, exitCode };
    } finally {
      await unlink(file).catch(() => {});
    }
  },
};
const codeExecution = codeExecutionTool(bunBackend); 

// Steer the model to the tool and to print the result. The prompt still names the
// languages to save a wasted call, but the backend itself rejects anything else.
const system = [
  "You are a coding agent. When a question needs computation, write a short",
  "snippet and run it with the `code_execution` tool instead of doing the math",
  "in your head. This backend runs JavaScript or TypeScript — use console.log to",
  "print the result so it comes back to you.",
].join("\n");

// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the backend sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
  switch (e.type) {
    case AgentEventType.ReasoningDelta:
      process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
      break;
    case AgentEventType.TextDelta:
      process.stdout.write(e.text);
      break;
    case AgentEventType.ToolStart:
      console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
      break;
    case AgentEventType.ToolEnd:
      console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
      break;
  }
}

// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });

while (true) {
  const prompt = (await rl.question("\nyou › ")).trim();
  if (prompt === "" || prompt === "exit") break;

  process.stdout.write("bot › ");
  await runAgent({
    model,
    memory,
    sessionId,
    prompt,
    system,
    tools: [codeExecution],
    onEvent: render,
  });
}
rl.close();

bun run examples/code-execution-tutorial/step3.ts
you › what is 25 factorial? compute it by running code.

The model, the loop, and the result format never change — only where the code runs. That's the seam: swap the Deno battery for a container or microVM (multi-language, stronger isolation), or a cloud variant that hands the code to a hosted execution service, all behind the same tool.
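As a rough illustration of the container option, the sketch below runs a JavaScript snippet inside a throwaway Docker container. Everything specific here is an assumption: it needs Docker installed, the `node:22-alpine` image and the 256m memory limit are arbitrary choices, the interfaces are local mirrors of the shapes this tutorial describes rather than SDK imports, and it handles JavaScript only.

```typescript
import { spawn } from "node:child_process";

// Local mirrors of the shapes described in this tutorial (assumptions, not SDK imports).
interface ExecRequest { language: string; code: string }
interface ExecResult { stdout: string; stderr: string; exitCode: number }
interface ExecContext { signal?: AbortSignal }

// Assumed image and limits. `node -` reads the program from stdin, so no file is
// mounted; `--network none` and `--read-only` keep the container closed off.
function dockerArgs(image = "node:22-alpine"): string[] {
  return ["run", "--rm", "-i", "--network", "none", "--read-only", "--memory", "256m", image, "node", "-"];
}

const dockerBackend = {
  exec(request: ExecRequest, ctx: ExecContext): Promise<ExecResult> {
    if (!["javascript", "js"].includes(request.language.toLowerCase())) {
      return Promise.reject(new Error(`This sketch runs JavaScript only; got "${request.language}".`));
    }
    return new Promise((resolve, reject) => {
      const child = spawn("docker", dockerArgs(), { signal: ctx.signal });
      let stdout = "";
      let stderr = "";
      child.stdout.on("data", (chunk) => { stdout += chunk; });
      child.stderr.on("data", (chunk) => { stderr += chunk; });
      child.on("error", reject); // docker missing, or the run was aborted
      child.on("close", (code) => resolve({ stdout, stderr, exitCode: code ?? 1 }));
      child.stdin.end(request.code); // pipe the snippet in, then close stdin
    });
  },
};

console.log(["docker", ...dockerArgs()].join(" "));
```

Under these assumptions, `dockerBackend` has the same `exec` shape as the Bun backend above, so passing it to `codeExecutionTool` would be the only change to the program.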

This backend traded the sandbox away

The Bun backend runs the snippet with full host access — no isolation at all, unlike Deno's deny-by-default sandbox. That's fine for code you trust, but it's exactly the situation Step 4 addresses: once the sandbox is gone, put a human in front of model-written code.

Step 4 — Gate It

With the deny-by-default Deno backend the sandbox is already the guardrail. The moment you run model-written code with real access — an unsandboxed backend, or a powerful cloud one — a human should sign off first. Admission is a separate seam: the gateToolCalls hook sees the turn's calls before any of them run and decides allow / deny / ask. The highlighted lines pair the shipped permissionGate with an InMemoryPermissionStore (the policy) and an ApprovalPrompter (how you ask), so every code_execution prompts first:

examples/code-execution-tutorial/step4.ts

import {
  AgentEventType,
  ApprovalChoice, 
  codeExecutionTool,
  InMemoryPermissionStore, 
  permissionGate, 
  PermissionPolicy, 
  runAgent,
  SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent, ApprovalPrompter, CodeExecutionBackend } from "@open-agent-loops/core"; 
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { unlink } from "node:fs/promises";

const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
  console.error("Set LLM_API_KEY (see .env.example).");
  process.exit(1);
}

// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
  apiKey,
  baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
  model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
  thinking: "on",
});

// An unsandboxed Bun backend (from Step 3): runs the snippet with full host access.
const bunBackend: CodeExecutionBackend = {
  async exec(request, ctx) {
    const lang = request.language.toLowerCase();
    if (!["javascript", "js", "typescript", "ts"].includes(lang)) {
      throw new Error(`This backend runs JavaScript/TypeScript only; got "${request.language}".`);
    }
    const ext = lang === "ts" || lang === "typescript" ? "ts" : "js";
    const file = join(tmpdir(), `snippet-${crypto.randomUUID()}.${ext}`);
    await Bun.write(file, request.code);
    try {
      const proc = Bun.spawn(["bun", "run", file], { stdout: "pipe", stderr: "pipe", signal: ctx.signal });
      const [stdout, stderr, exitCode] = await Promise.all([
        new Response(proc.stdout).text(),
        new Response(proc.stderr).text(),
        proc.exited,
      ]);
      return { stdout, stderr, exitCode };
    } finally {
      await unlink(file).catch(() => {});
    }
  },
};
const codeExecution = codeExecutionTool(bunBackend);

const system = [
  "You are a coding agent. When a question needs computation, write a short",
  "snippet and run it with the `code_execution` tool instead of doing the math",
  "in your head. This backend runs JavaScript or TypeScript — use console.log to",
  "print the result so it comes back to you.",
].join("\n");

// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the backend sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
  switch (e.type) {
    case AgentEventType.ReasoningDelta:
      process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
      break;
    case AgentEventType.TextDelta:
      process.stdout.write(e.text);
      break;
    case AgentEventType.ToolStart:
      console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
      break;
    case AgentEventType.ToolEnd:
      console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
      break;
  }
}

// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });

// Ask before any code runs: fallback is Ask, so every `code_execution` prompts.
const permissions = new InMemoryPermissionStore({ fallback: PermissionPolicy.Ask });
// A terminal prompter: show the call (name + args) and ask y/N.
const prompter: ApprovalPrompter = {
  async ask(batch) {
    const choices: ApprovalChoice[] = [];
    for (const { toolCall, args } of batch) {
      const answer = await rl.question(`\n🔐 allow ${toolCall.function.name}(${JSON.stringify(args)})? [y/N] `);
      choices.push(answer.trim().toLowerCase() === "y" ? ApprovalChoice.AllowOnce : ApprovalChoice.DenyOnce);
    }
    return choices;
  },
};
const gate = permissionGate(permissions, prompter);

while (true) {
  const prompt = (await rl.question("\nyou › ")).trim();
  if (prompt === "" || prompt === "exit") break;

  process.stdout.write("bot › ");
  await runAgent({
    model,
    memory,
    sessionId,
    prompt,
    system,
    tools: [codeExecution],
    hooks: { gateToolCalls: gate }, 
    onEvent: render,
  });
}
rl.close();

bun run examples/code-execution-tutorial/step4.ts
you › what is 25 factorial? compute it by running code.
🔐 allow code_execution(...)? [y/N]

The gate runs once per turn, ahead of the parallel execution phase, so the prompt never races a running tool. A denied call never runs; it comes back as an error tool-result the model can react to. This is the same gate that Step 5 of the Tools tutorial puts in front of shell. Permissions & Credentials goes deeper: persisting "always" choices and feeding a tool a secret the model never sees.
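The allow / deny / ask decision can be modelled in a few lines. The following is a dependency-free sketch of the idea only: `Policy`, `PendingCall`, `Verdict`, and `gateSketch` are hypothetical names, not the SDK's `permissionGate`, and the real gate's types and return value may look quite different.

```typescript
// Hypothetical, dependency-free model of the admission decision described above.
type Policy = "allow" | "deny" | "ask";
interface PendingCall { name: string; args: unknown }
interface Verdict { call: PendingCall; run: boolean; reason: string }

async function gateSketch(
  calls: PendingCall[],
  policies: Map<string, Policy>,
  fallback: Policy,
  ask: (call: PendingCall) => Promise<boolean>,
): Promise<Verdict[]> {
  const verdicts: Verdict[] = [];
  // Decide every call in the turn before any of them runs.
  for (const call of calls) {
    const policy = policies.get(call.name) ?? fallback;
    if (policy === "ask") {
      const approved = await ask(call);
      verdicts.push({ call, run: approved, reason: approved ? "approved" : "denied by user" });
    } else {
      verdicts.push({ call, run: policy === "allow", reason: `policy: ${policy}` });
    }
  }
  return verdicts;
}

// Usage: no stored policy for code_execution, so the fallback ("ask") applies,
// and this stand-in user always answers "N".
const demo = gateSketch(
  [{ name: "code_execution", args: { language: "js", code: "console.log(1)" } }],
  new Map(),
  "ask",
  async () => false,
);
demo.then((verdicts) => {
  for (const v of verdicts) console.log(v.call.name, v.run, v.reason);
});
```

Because every verdict is settled before anything executes, a slow human answer delays the whole batch instead of interleaving with running tools.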

How inputs and outputs behave

Two fields in, one string out.

The model sends exactly two fields — language and code — validated against the Zod schema before anything runs (a missing code comes straight back as a retryable error). The backend also receives a ToolContext carrying an abort signal, so a cancelled run kills the child process.

The backend returns { stdout, stderr, exitCode }, which formatCodeExecutionResult folds into the single string the model reads — always ending in a verdict, so a run is never a contentless result:

Code the model ran	String the model gets back
`console.log(6 * 7)`	`42` `[exit 0: ok]`
`throw new Error("nope")`	`[stderr]` `…nope…` `[exit 1: error]`
`const x = 1 + 1` (never printed)	`[exit 0: ok]`
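The three rows above can be reproduced with a small formatter. This is a hypothetical re-implementation for illustration, named `formatSketch` to keep it apart from the real `formatCodeExecutionResult`, whose exact spacing and wording may differ.

```typescript
// Hypothetical re-implementation for illustration only.
interface ExecResult { stdout: string; stderr: string; exitCode: number }

function formatSketch({ stdout, stderr, exitCode }: ExecResult): string {
  const parts: string[] = [];
  if (stdout.trim() !== "") parts.push(stdout.trimEnd());
  if (stderr.trim() !== "") parts.push("[stderr]", stderr.trimEnd());
  // The verdict is appended unconditionally, so the string is never empty.
  parts.push(`[exit ${exitCode}: ${exitCode === 0 ? "ok" : "error"}]`);
  return parts.join("\n");
}

const okRun = formatSketch({ stdout: "42\n", stderr: "", exitCode: 0 });
const failedRun = formatSketch({ stdout: "", stderr: "Error: nope\n", exitCode: 1 });
const silentRun = formatSketch({ stdout: "", stderr: "", exitCode: 0 });

console.log(okRun); // 42, then [exit 0: ok]
console.log(failedRun); // [stderr], the error text, then [exit 1: error]
console.log(silentRun); // [exit 0: ok] on its own
```

The silent run is the case the verdict exists for: without it, a snippet that prints nothing would hand the model an empty string.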

Errors always come back as a string the model can retry from — the run never crashes:

The code fails (throws, or exits non-zero): a normal result carrying the [stderr] … [exit N: error] text. The code ran; the model reads the error and fixes its snippet. This is a soft outcome, not flagged as an error.
The backend can't run it (unsupported language, deno missing): the loop turns the thrown error into a tool result flagged isError, with the message as content — e.g. "runs JavaScript/TypeScript only…".
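The two failure paths differ only in where the error surfaces. The sketch below assumes the loop wraps the backend call in a try/catch; `runToolSketch`, `BackendSketch`, and `fakeBackend` are hypothetical names invented for the example, and the real loop's internals are not shown in this tutorial.

```typescript
interface ExecRequest { language: string; code: string }
interface ExecResult { stdout: string; stderr: string; exitCode: number }
interface BackendSketch { exec(request: ExecRequest): Promise<ExecResult> }
interface ToolResultSketch { content: string; isError: boolean }

async function runToolSketch(backend: BackendSketch, request: ExecRequest): Promise<ToolResultSketch> {
  try {
    const { stdout, stderr, exitCode } = await backend.exec(request);
    const verdict = `[exit ${exitCode}: ${exitCode === 0 ? "ok" : "error"}]`;
    // The code ran. Even a non-zero exit is a normal result the model can read.
    const content = [stdout, stderr && `[stderr]\n${stderr}`, verdict].filter(Boolean).join("\n");
    return { content, isError: false };
  } catch (err) {
    // The backend could not run the code at all: flag it and keep the message.
    return { content: err instanceof Error ? err.message : String(err), isError: true };
  }
}

// A fake backend (hypothetical): rejects non-JavaScript, "fails" any code containing `throw`.
const fakeBackend: BackendSketch = {
  async exec({ language, code }) {
    if (language !== "javascript") throw new Error(`runs JavaScript only; got "${language}"`);
    return code.includes("throw")
      ? { stdout: "", stderr: "Error: nope", exitCode: 1 }
      : { stdout: "ok", stderr: "", exitCode: 0 };
  },
};

const soft = runToolSketch(fakeBackend, { language: "javascript", code: 'throw new Error("nope")' });
const hard = runToolSketch(fakeBackend, { language: "python", code: "print(1)" });

Promise.all([soft, hard]).then(([s, h]) => {
  console.log(s.isError, JSON.stringify(s.content));
  console.log(h.isError, JSON.stringify(h.content));
});
```

Either way the promise resolves with a string, which is what lets the model retry instead of the run crashing.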

The model reads text, not flags

On an OpenAI-compatible wire a tool result is just { role, tool_call_id, content } — there's no error flag the model sees. It tells success from failure purely by reading the string. That's why the verdict is always appended: [exit 0: ok] vs [exit 1: error] makes the outcome legible in the content itself.
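A short sketch makes the point concrete. The `ToolMessage` fields are the three named above; `toWireMessage` and the `call_123` id are hypothetical, and the sketch assumes the loop's `isError` flag is simply not serialized.

```typescript
// The tool-result message shape on an OpenAI-compatible chat API.
interface ToolMessage { role: "tool"; tool_call_id: string; content: string }

// Hypothetical helper: whatever the loop knows about `isError` is dropped here,
// so the verdict inside `content` is the only signal that reaches the model.
function toWireMessage(toolCallId: string, result: { content: string; isError: boolean }): ToolMessage {
  return { role: "tool", tool_call_id: toolCallId, content: result.content };
}

const wire = toWireMessage("call_123", {
  content: "[stderr]\nError: nope\n[exit 1: error]",
  isError: false,
});

console.log(Object.keys(wire).join(",")); // role,tool_call_id,content
```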

Recap

Starting from a plain chat loop, a few lines at a time you ran a model-written snippet in a sandbox, opened one capability into that sandbox, swapped the Deno battery for a backend you wrote, then gated the riskiest version behind your approval. The tool, the loop, and the result format never changed — only the backend behind the seam did.

From here:

The Tools guide — where code_execution sits among the other built-ins, and the backend-seam pattern it shares with them.
Permissions & Credentials — persisting "always" choices, writing the ApprovalPrompter, and feeding a tool a secret.
API reference: codeExecutionTool, formatCodeExecutionResult, CodeExecutionBackend.
