Code Execution
A tutorial — grow one program across four steps: run a model-written snippet in a sandbox, grant it one capability, swap the backend, then gate it behind approval.
The shell tool runs a command. code_execution runs a
program — a snippet in a named language that the model writes and your machine
runs for real, handing back the captured output. Reach for it when the model should
compute an answer (parse data, do math, transform a file) instead of guessing one:
the result is measured by running code, not invented by the model.
Like every dangerous built-in, it follows the SDK's split: the SDK owns the
model-facing contract — codeExecutionTool fixes the code_execution name, its
{ language, code } schema, and how the result is formatted — and you own the
backend that actually runs the code. That boundary is the whole point. Running
model-written code is the single most dangerous thing an agent does, so the core
ships no executor; you choose one.
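To make that boundary concrete, here is a minimal sketch of the backend side of the contract. It assumes the shape this tutorial describes in Step 3: one exec({ language, code }, ctx) method that resolves to { stdout, stderr, exitCode }. The type names are local stand-ins rather than imports from the SDK, and cannedBackend is a hypothetical test double that runs nothing.

```typescript
// Local stand-ins for the SDK's types (assumed shape, not the real declarations).
type ExecRequest = { language: string; code: string };
type ExecResult = { stdout: string; stderr: string; exitCode: number };
type ExecContext = { signal?: AbortSignal };

interface BackendLike {
  exec(request: ExecRequest, ctx: ExecContext): Promise<ExecResult>;
}

// Hypothetical test double: never runs code, just returns a fixed result.
// Handy for exercising a loop without any executor installed.
function cannedBackend(result: ExecResult): BackendLike {
  return {
    async exec(_request, _ctx) {
      return result;
    },
  };
}

const backend = cannedBackend({ stdout: "6765\n", stderr: "", exitCode: 0 });
backend
  .exec({ language: "javascript", code: "console.log(6765)" }, {})
  .then((r) => console.log(r.exitCode, r.stdout.trim()));
```

Everything that decides where and how code runs lives inside exec, which is why the core can ship the tool without shipping an executor.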
Why a separate tool from shell
shell and code_execution both run model-written instructions, but they differ on
the two things that matter:
| | shell | code_execution |
|---|---|---|
| Input | one command string | a { language, code } snippet |
| Isolation (shipped backend) | none (bunShellBackend runs on the host) | a deny-by-default sandbox (denoCodeExecutionBackend) |
The isolation row is why this isn't just "shell with node -e". Code execution is
exactly where you expect a sandbox, so the reference backend is sandboxed by
default.
The tutorial
Starting from the multi-turn chat loop, this grows one program across four steps:
- Run a snippet the model writes, in a sandbox.
- Grant one capability — open a single hole in the deny-by-default sandbox.
- Swap the backend — the same tool over a runtime you control.
- Gate it — put a human in front once the sandbox is gone.
Each step below shows the whole program so far — the lines it adds are
highlighted. Steps 1–2 need the deno binary on PATH (deno.com);
steps 3–4 swap to a Bun backend, so they don't.
Step 1 — Run a Snippet
codeExecutionTool is the model-facing contract; denoCodeExecutionBackend() is the
host glue that runs the code. The Deno backend is deny-by-default — the snippet
gets no file, network, or env access — so you can run untrusted, model-written code
with no permission gate. Add it to the chat loop's tools array and steer the model
with a system prompt:
import {
AgentEventType,
codeExecutionTool,
runAgent,
SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent } from "@open-agent-loops/core";
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { denoCodeExecutionBackend } from "./deno-backends";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";
const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
console.error("Set LLM_API_KEY (see .env.example).");
process.exit(1);
}
// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
apiKey,
baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
thinking: "on",
});
// `codeExecutionTool` is the stable, model-facing contract; the backend is the
// swappable host glue that actually runs the code. `denoCodeExecutionBackend()` is
// deny-by-default — the snippet gets NO file, network, or env access, so we can run
// untrusted, model-written code with no permission gate.
const codeExecution = codeExecutionTool(denoCodeExecutionBackend());
// Steer the model to the tool and the sandbox's one constraint: print the result,
// and write JavaScript or TypeScript (the Deno backend rejects other languages).
const system = [
"You are a coding agent. When a question needs computation, write a short",
"snippet and run it with the `code_execution` tool instead of doing the math",
"in your head. The sandbox runs JavaScript or TypeScript — use console.log to",
"print the result so it comes back to you.",
].join("\n");
// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the sandbox sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
switch (e.type) {
case AgentEventType.ReasoningDelta:
process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
break;
case AgentEventType.TextDelta:
process.stdout.write(e.text);
break;
case AgentEventType.ToolStart:
console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
break;
case AgentEventType.ToolEnd:
console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
break;
}
}
// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });
while (true) {
const prompt = (await rl.question("\nyou › ")).trim();
if (prompt === "" || prompt === "exit") break;
process.stdout.write("bot › ");
await runAgent({
model,
memory,
sessionId,
prompt,
system,
tools: [codeExecution],
onEvent: render,
});
}
rl.close();
bun run examples/code-execution-tutorial/step1.ts
you › what is the 20th Fibonacci number? compute it by running code.
The model emits code_execution({ language, code }) and runs nothing itself; the
loop validates the args and hands them to your backend, which launches deno as a
child, pipes the code in, and captures the real { stdout, stderr, exitCode }.
formatCodeExecutionResult folds that into one string the next model turn reads —
always ending in a verdict (6765 then [exit 0: ok]), so a run is never a
contentless result. Swap the CodeExecutionBackend for a container or cloud runner
and nothing else in that round trip changes.
The system prompt earns its two lines. "use console.log to print the result":
a snippet runs as a script, not a REPL, so a bare final expression (6 * 7) is
computed and thrown away; the model has to print what it wants back. "runs
JavaScript or TypeScript": the Deno battery is JS/TS only, so steering the model
there keeps it from reaching for Python, which the backend rejects.
Step 2 — Grant a Capability
Deny-by-default means a snippet can compute but can't touch your disk, network, or
env — so the moment the model needs to read a real file, the read fails closed and
comes back as [exit 1: error]. You open exactly one hole: pass allow to the
backend. The highlighted lines grant read access to just the tutorial folder
(and nothing else), then point the model at a data file:
import {
AgentEventType,
codeExecutionTool,
runAgent,
SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent } from "@open-agent-loops/core";
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { denoCodeExecutionBackend } from "./deno-backends";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";
const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
console.error("Set LLM_API_KEY (see .env.example).");
process.exit(1);
}
// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
apiKey,
baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
thinking: "on",
});
// Still deny-by-default — now with ONE capability granted: the snippet may READ
// files under the tutorial folder, and nothing else (no write, network, or env).
// A path-scoped grant; pass `read: true` to allow all reads, or grant `net`/`write`/`env`.
const codeExecution = codeExecutionTool(
denoCodeExecutionBackend({ allow: { read: ["examples/code-execution-tutorial"] } }),
);
// Steer the model to the tool and the sandbox's one constraint: print the result,
// and write JavaScript or TypeScript (the Deno backend rejects other languages).
const system = [
"You are a coding agent. When a question needs computation, write a short",
"snippet and run it with the `code_execution` tool instead of doing the math",
"in your head. The sandbox runs JavaScript or TypeScript — use console.log to",
"print the result so it comes back to you.",
"You may read files under examples/code-execution-tutorial/, such as data.txt.",
].join("\n");
// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the sandbox sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
switch (e.type) {
case AgentEventType.ReasoningDelta:
process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
break;
case AgentEventType.TextDelta:
process.stdout.write(e.text);
break;
case AgentEventType.ToolStart:
console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
break;
case AgentEventType.ToolEnd:
console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
break;
}
}
// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });
while (true) {
const prompt = (await rl.question("\nyou › ")).trim();
if (prompt === "" || prompt === "exit") break;
process.stdout.write("bot › ");
await runAgent({
model,
memory,
sessionId,
prompt,
system,
tools: [codeExecution],
onEvent: render,
});
}
rl.close();
bun run examples/code-execution-tutorial/step2.ts
you › add up the numbers in examples/code-execution-tutorial/data.txt
The snippet now reads data.txt and prints the sum (153). allow is the whole
deny-by-default story in one object: read, write, net, and env, each either
true (grant broadly) or an array that scopes it to specific paths, hosts, or
variable names. There is deliberately no run grant — spawning a subprocess is the
escape hatch out of the sandbox, and this backend keeps it shut. Grant the least you
can: a path, not the disk; a host, not the network.
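One way to picture the allow object is as a translation into Deno's permission flags (--allow-read, --allow-write, --allow-net, --allow-env, each optionally scoped with =value). The sketch below is an assumption about how such a mapping could look, not the shipped backend's code; toDenoFlags and the Allow type are hypothetical names, and api.example.com is a placeholder host.

```typescript
// Hypothetical grant shape mirroring the `allow` object described above.
type Grant = true | string[];
type Allow = { read?: Grant; write?: Grant; net?: Grant; env?: Grant };

// Translate grants into Deno permission flags. With no grants the list is
// empty, so the child process starts with nothing allowed.
function toDenoFlags(allow: Allow = {}): string[] {
  const flags: string[] = [];
  const kinds = ["read", "write", "net", "env"] as const;
  for (const kind of kinds) {
    const grant = allow[kind];
    if (grant === undefined) continue;
    if (grant === true) {
      flags.push(`--allow-${kind}`); // broad grant
    } else if (grant.length > 0) {
      flags.push(`--allow-${kind}=${grant.join(",")}`); // scoped grant
    }
    // An empty array grants nothing, so no flag is emitted.
  }
  return flags; // `--allow-run` is never produced
}

console.log(toDenoFlags({ read: ["examples/code-execution-tutorial"] }));
console.log(toDenoFlags({ net: ["api.example.com"], env: true }));
```

Treating an empty array as "no grant" keeps the sketch fail-closed: a scoped list that ends up empty never widens into a broad flag.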
Step 3 — Swap the Backend
The backend is the swap point. Anything implementing CodeExecutionBackend — one
exec({ language, code }, ctx) method returning { stdout, stderr, exitCode } —
drops in behind the same codeExecutionTool. The highlighted lines replace the Deno
battery with a hand-written Bun backend that runs the snippet on Bun instead:
write it to a temp file, bun run it, capture the output.
import {
AgentEventType,
codeExecutionTool,
runAgent,
SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent, CodeExecutionBackend } from "@open-agent-loops/core";
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { unlink } from "node:fs/promises";
const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
console.error("Set LLM_API_KEY (see .env.example).");
process.exit(1);
}
// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
apiKey,
baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
thinking: "on",
});
// Your own backend: implement `CodeExecutionBackend` and it drops in behind the
// SAME `codeExecutionTool`. This one runs the snippet on Bun instead of Deno —
// write it to a temp file, `bun run` it, capture stdout/stderr/exit. A container
// or hosted-cloud backend slots in the same way; only the body of `exec` changes.
// WARNING: unlike the Deno sandbox, Bun runs with full host access — no isolation.
const bunBackend: CodeExecutionBackend = {
async exec(request, ctx) {
const lang = request.language.toLowerCase();
if (!["javascript", "js", "typescript", "ts"].includes(lang)) {
throw new Error(`This backend runs JavaScript/TypeScript only; got "${request.language}".`);
}
const ext = lang === "ts" || lang === "typescript" ? "ts" : "js";
const file = join(tmpdir(), `snippet-${crypto.randomUUID()}.${ext}`);
await Bun.write(file, request.code);
try {
const proc = Bun.spawn(["bun", "run", file], { stdout: "pipe", stderr: "pipe", signal: ctx.signal });
const [stdout, stderr, exitCode] = await Promise.all([
new Response(proc.stdout).text(),
new Response(proc.stderr).text(),
proc.exited,
]);
return { stdout, stderr, exitCode };
} finally {
await unlink(file).catch(() => {});
}
},
};
const codeExecution = codeExecutionTool(bunBackend);
// Steer the model to the tool and to print the result. (No language guard needed
// for the prompt — the backend itself rejects anything but JavaScript/TypeScript.)
const system = [
"You are a coding agent. When a question needs computation, write a short",
"snippet and run it with the `code_execution` tool instead of doing the math",
"in your head. This backend runs JavaScript or TypeScript — use console.log to",
"print the result so it comes back to you.",
].join("\n");
// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the backend sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
switch (e.type) {
case AgentEventType.ReasoningDelta:
process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
break;
case AgentEventType.TextDelta:
process.stdout.write(e.text);
break;
case AgentEventType.ToolStart:
console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
break;
case AgentEventType.ToolEnd:
console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
break;
}
}
// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });
while (true) {
const prompt = (await rl.question("\nyou › ")).trim();
if (prompt === "" || prompt === "exit") break;
process.stdout.write("bot › ");
await runAgent({
model,
memory,
sessionId,
prompt,
system,
tools: [codeExecution],
onEvent: render,
});
}
rl.close();
bun run examples/code-execution-tutorial/step3.ts
you › what is 25 factorial? compute it by running code.
The model, the loop, and the result format never change; only where the code runs does. That's the seam: swap the Deno battery for a container or microVM (multi-language, stronger isolation), or for a cloud variant that hands the code to a hosted execution service, all behind the same tool.
This backend traded the sandbox away
The Bun backend runs the snippet with full host access — no isolation at all, unlike Deno's deny-by-default sandbox. That's fine for code you trust, but it's exactly the situation Step 4 addresses: once the sandbox is gone, put a human in front of model-written code.
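The same seam also lets one backend wrap another. The following is a sketch under assumptions: the types are local stand-ins for the SDK's CodeExecutionBackend shape, and withOutputLimit and clip are hypothetical helpers that cap how much output reaches the model. It adds no isolation; it only shows that a wrapper still satisfies the one-method contract.

```typescript
// Local stand-ins for the SDK's types (assumed shape, not the real declarations).
type ExecRequest = { language: string; code: string };
type ExecResult = { stdout: string; stderr: string; exitCode: number };
type ExecContext = { signal?: AbortSignal };
interface BackendLike {
  exec(request: ExecRequest, ctx: ExecContext): Promise<ExecResult>;
}

// Keep at most `max` characters and note how many were dropped.
function clip(text: string, max: number): string {
  if (text.length <= max) return text;
  return `${text.slice(0, max)}\n[truncated ${text.length - max} chars]`;
}

// Wrap any backend; the wrapper exposes the same `exec` method as the inner one.
function withOutputLimit(inner: BackendLike, max: number): BackendLike {
  return {
    async exec(request, ctx) {
      const result = await inner.exec(request, ctx);
      return {
        ...result,
        stdout: clip(result.stdout, max),
        stderr: clip(result.stderr, max),
      };
    },
  };
}

// A fake inner backend that "prints" 1000 characters without running anything.
const noisy: BackendLike = {
  async exec() {
    return { stdout: "x".repeat(1000), stderr: "", exitCode: 0 };
  },
};

withOutputLimit(noisy, 20)
  .exec({ language: "js", code: "" }, {})
  .then((r) => console.log(r.stdout));
```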
Step 4 — Gate It
With the deny-by-default Deno backend the sandbox is already the guardrail. The
moment you run model-written code with real access — an unsandboxed backend, or a
powerful cloud one — a human should sign off first. Admission is a separate seam: the
gateToolCalls hook sees the turn's calls before any of them run and decides allow
/ deny / ask. The highlighted lines pair the shipped permissionGate with an
InMemoryPermissionStore (the policy) and an ApprovalPrompter (how you ask), so
every code_execution prompts first:
import {
AgentEventType,
ApprovalChoice,
codeExecutionTool,
InMemoryPermissionStore,
permissionGate,
PermissionPolicy,
runAgent,
SessionMemoryStore,
} from "@open-agent-loops/core";
import type { AgentEvent, ApprovalPrompter, CodeExecutionBackend } from "@open-agent-loops/core";
import { OpenAICompatibleModel } from "@open-agent-loops/core/providers/openai";
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { unlink } from "node:fs/promises";
const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
console.error("Set LLM_API_KEY (see .env.example).");
process.exit(1);
}
// DeepSeek V4 tool-calls cleanly — it keeps the `code` string intact.
const model = new OpenAICompatibleModel({
apiKey,
baseURL: process.env.LLM_BASE_URL ?? "https://api.featherless.ai/v1",
model: process.env.LLM_MODEL ?? "deepseek-ai/DeepSeek-V4-Flash",
thinking: "on",
});
// An unsandboxed Bun backend (from Step 3): runs the snippet with full host access.
const bunBackend: CodeExecutionBackend = {
async exec(request, ctx) {
const lang = request.language.toLowerCase();
if (!["javascript", "js", "typescript", "ts"].includes(lang)) {
throw new Error(`This backend runs JavaScript/TypeScript only; got "${request.language}".`);
}
const ext = lang === "ts" || lang === "typescript" ? "ts" : "js";
const file = join(tmpdir(), `snippet-${crypto.randomUUID()}.${ext}`);
await Bun.write(file, request.code);
try {
const proc = Bun.spawn(["bun", "run", file], { stdout: "pipe", stderr: "pipe", signal: ctx.signal });
const [stdout, stderr, exitCode] = await Promise.all([
new Response(proc.stdout).text(),
new Response(proc.stderr).text(),
proc.exited,
]);
return { stdout, stderr, exitCode };
} finally {
await unlink(file).catch(() => {});
}
},
};
const codeExecution = codeExecutionTool(bunBackend);
const system = [
"You are a coding agent. When a question needs computation, write a short",
"snippet and run it with the `code_execution` tool instead of doing the math",
"in your head. This backend runs JavaScript or TypeScript — use console.log to",
"print the result so it comes back to you.",
].join("\n");
// Render every event the loop emits. ToolStart shows the code the model wrote;
// ToolEnd shows what the backend sent back (stdout + the always-present verdict).
function render(e: AgentEvent) {
switch (e.type) {
case AgentEventType.ReasoningDelta:
process.stdout.write(`\x1b[2m${e.text}\x1b[22m`);
break;
case AgentEventType.TextDelta:
process.stdout.write(e.text);
break;
case AgentEventType.ToolStart:
console.log(`→ ${e.toolName}(${JSON.stringify(e.args)})`);
break;
case AgentEventType.ToolEnd:
console.log(`← ${e.toolName} [${e.isError ? "error" : "ok"}]:\n${e.result}`);
break;
}
}
// Multi-turn: one memory + one sessionId, reused every turn.
const memory = new SessionMemoryStore();
const sessionId = "code-execution-tutorial";
const rl = createInterface({ input, output });
// Ask before any code runs: fallback is Ask, so every `code_execution` prompts.
const permissions = new InMemoryPermissionStore({ fallback: PermissionPolicy.Ask });
// A terminal prompter: show the call (name + args) and ask y/N.
const prompter: ApprovalPrompter = {
async ask(batch) {
const choices: ApprovalChoice[] = [];
for (const { toolCall, args } of batch) {
const answer = await rl.question(`\n🔐 allow ${toolCall.function.name}(${JSON.stringify(args)})? [y/N] `);
choices.push(answer.trim().toLowerCase() === "y" ? ApprovalChoice.AllowOnce : ApprovalChoice.DenyOnce);
}
return choices;
},
};
const gate = permissionGate(permissions, prompter);
while (true) {
const prompt = (await rl.question("\nyou › ")).trim();
if (prompt === "" || prompt === "exit") break;
process.stdout.write("bot › ");
await runAgent({
model,
memory,
sessionId,
prompt,
system,
tools: [codeExecution],
hooks: { gateToolCalls: gate },
onEvent: render,
});
}
rl.close();
bun run examples/code-execution-tutorial/step4.ts
you › what is 25 factorial? compute it by running code.
🔐 allow code_execution(...)? [y/N]
The gate runs once per turn, ahead of the parallel execution phase, so the prompt
never races a running tool. A denied call never runs; it comes back as an error
tool-result the model can react to. This is the same gate that Step 5 of the Tools
tutorial puts in front of shell. Permissions & Credentials goes deeper: persisting
"always" choices and feeding a tool a secret the model never sees.
How inputs and outputs behave
Two fields in, one string out.
The model sends exactly two fields — language and code — validated against the
Zod schema before anything runs (a missing code comes straight back as a retryable
error). The backend also receives a ToolContext carrying an abort signal, so a
cancelled run kills the child process.
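As a rough illustration of that validation step, the sketch below checks the two fields by hand. It is a stand-in for the Zod schema, not a copy of it: validateArgs is a hypothetical name, and the exact error wording is an assumption.

```typescript
type ExecRequest = { language: string; code: string };
type Validation =
  | { ok: true; value: ExecRequest }
  | { ok: false; error: string };

// Hand-rolled stand-in for schema validation: both fields must be strings.
function validateArgs(args: unknown): Validation {
  if (typeof args !== "object" || args === null) {
    return { ok: false, error: "arguments must be an object" };
  }
  const { language, code } = args as Record<string, unknown>;
  if (typeof language !== "string") {
    return { ok: false, error: "missing `language`" };
  }
  if (typeof code !== "string") {
    return { ok: false, error: "missing `code`" };
  }
  return { ok: true, value: { language, code } };
}

// A call with no `code` is rejected before any backend is involved.
console.log(validateArgs({ language: "javascript" }));
console.log(validateArgs({ language: "javascript", code: "console.log(1)" }));
```

Returning the failure as data rather than throwing mirrors the behaviour described above: the model gets a message it can retry from.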
The backend returns { stdout, stderr, exitCode }, which formatCodeExecutionResult
folds into the single string the model reads — always ending in a verdict, so a
run is never a contentless result:
| Code the model ran | String the model gets back |
|---|---|
| console.log(6 * 7) | 42 [exit 0: ok] |
| throw new Error("nope") | [stderr] …nope… [exit 1: error] |
| const x = 1 + 1 (never printed) | [exit 0: ok] |
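The rows above can be reproduced with a small folding function. This is a sketch of the idea only, not the source of formatCodeExecutionResult: the name formatResultSketch is hypothetical, and joining the parts with newlines is an assumption about the layout.

```typescript
type ExecResult = { stdout: string; stderr: string; exitCode: number };

// Fold a raw result into one string that always ends in a verdict line.
function formatResultSketch({ stdout, stderr, exitCode }: ExecResult): string {
  const parts: string[] = [];
  if (stdout.trim() !== "") parts.push(stdout.trimEnd());
  if (stderr.trim() !== "") parts.push(`[stderr]\n${stderr.trimEnd()}`);
  // The verdict is appended unconditionally, so silent runs still say something.
  parts.push(`[exit ${exitCode}: ${exitCode === 0 ? "ok" : "error"}]`);
  return parts.join("\n");
}

console.log(formatResultSketch({ stdout: "42\n", stderr: "", exitCode: 0 }));
console.log(formatResultSketch({ stdout: "", stderr: "", exitCode: 0 }));
```

The second call is the "never printed" case: with no stdout and no stderr, the verdict line is the entire result.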
Errors always come back as a string the model can retry from; the run never crashes:
- The code fails (throws, or exits non-zero): a normal result carrying the [stderr] … [exit N: error] text. The code ran; the model reads the error and fixes its snippet. This is a soft outcome, not flagged as an error.
- The backend can't run it (unsupported language, deno missing): the loop turns the thrown error into a tool result flagged isError, with the message as content, e.g. "runs JavaScript/TypeScript only…".
The model reads text, not flags
On an OpenAI-compatible wire a tool result is just { role, tool_call_id, content }
— there's no error flag the model sees. It tells success from failure purely by
reading the string. That's why the verdict is always appended: [exit 0: ok] vs
[exit 1: error] makes the outcome legible in the content itself.
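The same property helps any program that reads a transcript afterwards, such as a logger or a test. The sketch below builds a tool message in the { role, tool_call_id, content } shape and recovers the outcome from the content alone. readVerdict is a hypothetical helper, the call_1 id is a placeholder, and the regular expression assumes the verdict is the last thing in the string.

```typescript
// The tool message as it appears on an OpenAI-compatible wire: no error flag.
type ToolMessage = { role: "tool"; tool_call_id: string; content: string };

// Recover the outcome from the trailing verdict, or null if there is none.
function readVerdict(content: string): { exitCode: number; ok: boolean } | null {
  const match = content.match(/\[exit (-?\d+): (ok|error)\]\s*$/);
  if (!match) return null;
  return { exitCode: Number(match[1]), ok: match[2] === "ok" };
}

const message: ToolMessage = {
  role: "tool",
  tool_call_id: "call_1", // placeholder id
  content: "[stderr]\nError: nope\n[exit 1: error]",
};

console.log(readVerdict(message.content));
console.log(readVerdict("42\n[exit 0: ok]"));
```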
Recap
Starting from a plain chat loop, a few lines at a time you ran a model-written snippet in a sandbox, opened one capability into that sandbox, swapped the Deno battery for a backend you wrote, then gated the riskiest version behind your approval. The tool, the loop, and the result format never changed — only the backend behind the seam did.
From here:
- The Tools guide: where code_execution sits among the other built-ins, and the backend-seam pattern it shares with them.
- Permissions & Credentials: persisting "always" choices, writing the ApprovalPrompter, and feeding a tool a secret.
- API reference: codeExecutionTool, formatCodeExecutionResult, CodeExecutionBackend.