Working demo, today
In a Claude Code session with the Wisp MCP integration installed:
"Use wisp to compute SHA-256 of 'hello world'."
[wisp sandbox: rc=0, 1.93 ms]
b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
That's a fresh CPython 3.14 interpreter — running inside WebAssembly, isolated from the host filesystem and network — started in under 2 ms, then thrown away. The next tool call gets a brand-new interpreter.
The integration is a single ~120-line Python MCP server plus a Rust daemon running locally; full setup is at Reproduce below. OpenCode users get a similar story via a ~100-line TypeScript custom tool. Here's why anyone would bother.
What Wisp is
A free, open-source tool-execution backend for AI agents. Same slot as E2B, Modal Sandbox, Daytona, Vercel Sandbox, Blaxel, and Cloudflare Workers — the agent framework drives the loop, Wisp runs the model-generated Python.
Three deliberate bets vs the typical sandbox provider:
- Substrate: WASI CPython under wasmtime, not Firecracker / gVisor / Docker.
- Default lifecycle: per-call fresh, not persistent VM. At sub-ms per-call startup the "fresh sandbox per tool call" cost rounds to the same as "reuse one sandbox" — so we default to fresh and get no-cross-call state leakage and byte-identical replay for free.
- Capability model: explicit allowlists for filesystem, shell, and outbound HTTP, not broad-VM trust.
The position is complement, not replacement. Below: where Wisp fits, where to keep using one of the others, the design, and the numbers.
It runs numpy too
The same sandbox in the demo above also runs numpy 1.26, including all three subpackages that need C extensions:
> Use wisp to: import numpy as np;
> A = np.array([[3.0, 1], [1, 2]]); b = np.array([9.0, 8]);
> print(np.linalg.solve(A, b))
[wisp sandbox: rc=0, 4.45 ms]
[2. 3.]
> Use wisp to: import numpy as np;
> x = np.array([1.0, 2.0, 3.0, 4.0]);
> print(np.fft.ifft(np.fft.fft(x)).round(6))
[wisp sandbox: rc=0, 3.79 ms]
[1.+0.j 2.-0.j 3.+0.j 4.-0.j]
> Use wisp to: import numpy as np;
> rng = np.random.default_rng(42);
> print(rng.standard_normal(5))
[wisp sandbox: rc=0, 5.12 ms]
[ 0.30471708 -1.03998411 0.7504512 0.94056472 -1.95103519]
Cost: a 41 MB reactor wasm instead of a 38 MB one. All three came from numpy's own source, cross-compiled to wasm32-wasip1 via the M1 pipeline — linalg against numpy's bundled f2c-translated reference BLAS+LAPACK (no system BLAS, no Fortran toolchain), fft against pocketfft, random against the 9 cythonized PRNG + distribution modules. The build scripts are in scripts/numpy/ if you want to extend the runtime with your own C extensions.
Numpy is pre-imported into the snapshot, so per-call latency is the same ~5 ms whether or not user code touches it. Per-call freshness still holds — each call gets a copy of the post-import state, then the copy is thrown away.
The agent tool-execution problem
A code agent (Claude Code, Aider, OpenCode, Cursor, OpenInterpreter, LangChain agents, OpenAI Agents SDK, …) is a loop:
LLM ─→ "execute this Python" ─→ tool backend runs it ─→ result back to LLM ─→ ...
The agent's job is the loop. The tool backend's job is running the model's generated code:
- Safely (model output is untrusted)
- Quickly (often dozens of tool calls per user turn)
- With the right capabilities (filesystem? shell? network? GPU?)
These are different concerns. The agent framework focuses on routing, memory, and LLM orchestration; the tool backend focuses on isolation and speed. Most agent frameworks today either implement the tool backend ad hoc (subprocess + chmod, or maybe Docker) or call out to a hosted sandbox service.
Wisp aims at the second slot.
The current landscape (2026-05)
Eight serious players ship "sandbox-as-a-service for AI agents" today. The April 2026 OpenAI Agents SDK release named seven of them as built-in integrations (Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, Vercel) — that's the canonical adoption layer.
Each made a different bet on substrate, lifecycle, and capability model. None of them is wrong. They're optimized for different agent shapes.
| Provider | Substrate | Per-call cold start | Lifecycle | Capability model | Sweet spot |
|---|---|---|---|---|---|
| E2B | Firecracker microVM + Jupyter inside | ~150 ms | Persistent (5-min idle timeout) | Broad: full Python + pip + shell + network inside VM | Data analysts, long notebook-style sessions |
| Modal Sandbox | Custom Rust container + gVisor | sub-second | Per-invocation | Broad: full container | ML workloads, batch jobs |
| Daytona | Docker container (+ optional Kata) | ~90 ms warm | Persistent dev environment | Broad: full container, kernel-shared | Dev environments, CI |
| Blaxel | Firecracker + perpetual env | ~25 ms warm resume | Persistent w/ aggressive idle shutdown | Broad: full VM | Intermittent agent traffic |
| Vercel Sandbox | Firecracker | sub-second | Per-invocation, 5-hour cap | Broad: full VM | One-off code execution at the edge |
| Cloudflare Workers | V8 isolate + Pyodide | sub-100 ms | Per-invocation | Narrow: V8-sandboxed JS-first; Python via Pyodide | Edge HTTP workloads |
| AWS Lambda Sandbox | Firecracker | 800–1500 ms (Python cold) | Per-invocation | Broad: full container | Cron, infrequent triggers |
| Wisp (this work) | wasmtime + WASI CPython | 0.78 ms | Per-call fresh (default) | Capability-bridge (explicit allowlist) | High-frequency tool-call agents, RL rollouts |
A few honest notes:
- Cold start numbers are not apples-to-apples. E2B and Blaxel measure VM creation over their cloud RPC; Wisp measures the executor primitive in-process. End-to-end with HTTP, Wisp is closer to ~3 ms cold and the gap to the others narrows. Still substantial — see the decomposition section below.
- "Sweet spot" isn't a knock on the others. A persistent Jupyter inside a microVM is genuinely the right shape for "open a sandbox, hand the agent pandas, let it explore for 10 minutes". A per-call fresh wasm Instance would feel awkward there.
- Capability model is a security choice, not a feature gap. E2B's "broad inside the VM" is fine because the VM boundary is strong. Wisp's "narrow + capability-bridge" is appropriate when the same daemon hosts code from many distrustful tenants.
Where Wisp fits — and where it doesn't
| Use case | Wisp | Use one of the others |
|---|---|---|
| Agent makes 50+ short tool calls per turn (file ops, JSON parse, simple compute) | ✓ default per-call freshness, no state pollution between calls | persistent-VM models pay state-leakage cost |
| Tree-search RL rollout (MCTS, GRPO) | ✓ COW fork at sub-ms per branch (Spike B2 below) | Linux fork or Firecracker snapshot is 100–1000× slower |
| Multi-tenant SaaS hosting untrusted agent code | ✓ capability-bridge gives explicit allowlists per tenant | broad VM access works but auditing is harder |
Data-analyst agent with pandas and a CSV that should persist | E2B/Daytona model is a better fit (state survives across cells) | Wisp can do session mode, but it's bolted on |
Agent needing arbitrary pip install mid-session | E2B (full apt+pip inside VM) | Wisp's WASI Python doesn't have dlopen for native wheels |
| Browser automation (Playwright, etc.) | E2B's chrome-in-sandbox | not yet in scope |
The honest position: for most agent frameworks today, Wisp is a plug-in for the tool-execution layer, sitting alongside (not replacing) whichever sandbox provider they already use. We are best when the agent loop is high-frequency and stateless-ish.
The substrate, briefly
WebAssembly's linear memory is a single contiguous byte array per Instance. That's what makes per-call fresh cheap:
- Snapshot a Python interpreter mid-execution =
memcpy(ormmap MAP_FIXED) of the byte array - Reset to a known-good state = re-mmap the snapshot file at the same address — kernel page-table operation, no actual data motion
- Spawn N children sharing the same base = mmap COW (copy-on-write on first write)
Native processes structurally can't do this — heap is scattered across mmap regions, shared libs, stack; you can't dump or restore at byte granularity without ptrace or full uVM serialization. Firecracker has a snapshot facility, but it operates at VM granularity and takes 100–500 ms to restore.
The wasm Instance abstraction collapses snapshot/restore into one syscall. That's where the substrate gap comes from.
The cold-start descent (our spikes, with numbers)
We ran four spikes, each replacing the bottleneck of the previous one. All measurements are on Apple Silicon M-series, 8 cores, Wasmtime crate 27, in-process.
Spike A — pooled in-process, fresh interpreter
| Subprocess wasmtime CLI | ~400 ms |
| Pooled in-process, init | 39 ms | ← Spike A
Pre-built engine, pre-compiled module, instantiate fresh, run wisp_init per call. The 39 ms is dominated by Py_InitializeFromConfig plus stdlib imports — most of CPython's startup work. Already 10× faster than the subprocess baseline.
Spike A2 — memory-snapshot restore
wisp_init is deterministic. Run it once, capture the linear memory, restore the snapshot via Memory::data_mut().copy_from_slice() instead of re-running init.
| memcpy snapshot/restore | 1.68 ms | ← Spike A2
23× speedup. The memcpy was the dominant cost — bandwidth-bound, doesn't parallelize.
Spike A2.1 — mmap COW snapshot
Replace the per-call memcpy with a custom Wasmtime MemoryCreator that mmaps the snapshot file MAP_PRIVATE. Per call, after instantiate, re-mmap MAP_PRIVATE | MAP_FIXED to undo Wasmtime's data-segment init writes:
unsafe impl MemoryCreator for CowMemoryCreator {
fn new_memory(&self, ...) -> Result<Box<dyn LinearMemory>> {
// 1. Reserve virtual region with PROT_NONE
// 2. mmap MAP_PRIVATE of snapshot file at offset 0
// 3. mmap MAP_PRIVATE | MAP_ANON for trailing zero pages
}
}
// per call:
let r = libc::mmap(
base, snapshot.len(),
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_FIXED,
snap_fd, 0,
);
Result:
| mmap COW snapshot | 0.78 ms | ← Spike A2.1
The per-call snapshot reset dropped from a 1.07 ms memcpy to a 0.030 ms syscall. The remaining 0.75 ms decomposes:
| Instantiate + data init | 0.45 ms ← wasmtime overhead
| mmap MAP_FIXED reset | 0.03 ms ← was memcpy 1.07 ms
| Grow memory | 0.002 ms
| Alloc + write code | 0.01 ms
| wisp_eval | 0.20 ms ← page faults lazy
| Total | 0.78 ms
The wisp_eval cost rose slightly compared to A2 because pages are faulted in lazily during execution. Net win: ~2.4× over memcpy, and unlike memcpy this approach doesn't burn memory bandwidth — meaning it parallelizes better.
The other primitive: cheap fork from a snapshot
Spawn K children from a single post-init snapshot, each running a divergent Python expression.
| K | memcpy backend (br/s) | mmap COW backend (br/s) |
|---|---|---|
| 1 | 367 | 1163 |
| 8 | 697 | 1702 |
| 64 | 926 | 2231 |
| 256 | 1025 | 2363 |
| 1024 | 799 | 2394 |
Peak parallel throughput on 8 cores: 2394 branches/sec with the COW backend, 2.3× more than memcpy. Per-branch latency at K=64 parallel: p50 3.42 ms, p99 6.17 ms.
Comparison to native runtimes for a tree-search workload (K=100 × depth=100 = 10 000 forks per trajectory):
| Substrate | Per-trajectory branching cost |
|---|---|
| Linux process fork (Python interpreter) | 50–100 s |
| Firecracker uVM snapshot | 16–83 min |
| Wisp WASM memcpy snapshot | ~10 s |
| Wisp WASM mmap COW snapshot | ~4.2 s |
Useful for tree-search RL: every WASM child gets a truly fresh sandbox — new linear memory, no shared state with siblings or parent beyond the snapshot contents.
End-to-end through the daemon
The 0.78 ms number is the executor primitive. End-to-end through the HTTP daemon (wisp-runtime):
$ for i in {1..10}; do
curl -s -X POST http://127.0.0.1:9000/v1/eval \
-H "Content-Type: application/json" \
-d '{"code":"print(2+2)"}' | jq .elapsed_us
done
14225
1511
1356
1203
1189
1130
1284
1189
1120
1255
Steady state ~1.1–1.3 ms including HTTP framing, JSON serialization, and the channel hop from tokio to a worker thread. The first call is warm-up of the JIT cache for the first Instance.
This is the cold start an agent framework will measure when it integrates the Wisp client SDK.
What WebAssembly does NOT solve
Honest disclaimers, since these come up:
- Wasm Python is not faster Python. CPython compiled to WASM runs 10–30% slower than native CPython. We sell cheap isolation, not faster execution.
- No
pip installof native wheels. WASI doesn't havedlopen, so packages with C extensions need to be built into the runtime ahead of time. (We have a build pipeline for that — seescripts/in the repo.) - No threads, no
multiprocessing, nosubprocess. WASI Preview 1 doesn't expose any of these. Threading-using libraries fail at import. This is generally fine for a sandbox (you don't want it spawning threads), but it does block some libraries. - No real sockets in-sandbox. WASI Preview 1 has no
getsockname. Outbound HTTPS goes through a host capability bridge instead — same pattern Cloudflare Workers, Deno, and other capability runtimes use. _ssland_ctypesmodules are missing. Architectural, not fixable from our side — would need WASI Preview 2 for_ssl(and_ctypesis blocked at the WebAssembly layer regardless).
For most agent tool calls (parse this JSON, transform that data, validate this schema, do this hash), none of these matter. For some agent workloads (HTTP scraping, browser control, full notebook exploration), one of the persistent-VM substrates is the right choice.
Reproduce
Everything in this post is in github.com/wisplab/wisp, MIT-licensed.
Build the runtime + daemon:
git clone https://github.com/wisplab/wisp && cd wisp
cd runtime/cpython-wasi && ./build.sh # CPython 3.14 → wasm
./build-sqlite.sh && ./build-openssl.sh # M0.5 stdlib deps
./wisp_entry/build.sh # python-reactor.wasm
cd ../.. && cargo build --release -p wisp-runtime # the daemon
Plug into Claude Code (MCP):
cd examples/claude-code-integration
python3 -m venv .venv && ./.venv/bin/pip install -r requirements.txt
# In another terminal: start the daemon (capability config optional)
cargo run --release -p wisp-runtime
# Register the MCP server with Claude Code
claude mcp add wisp \
-- $PWD/.venv/bin/python $PWD/wisp_mcp_server.py
Then in a fresh Claude Code session: "use wisp to print 2+2".
Plug into OpenCode (custom tool):
cp -r examples/opencode-integration/.opencode /path/to/your/project/
# Daemon needs to be running on localhost:9000.
Reproduce the benchmark numbers:
cd bench/python-wasi-cow && cargo run --release # 0.78 ms cold start
cd ../python-wasi-cow-branching && cargo run --release # 2394 br/s parallel
Each bench/*/FINDINGS.md documents the methodology.
What's next
- M1 build pipeline → numpy — let the sandbox actually run real data science. The cross-build pipeline already works end-to-end on xxhash; numpy is the headline test.
- Session API — E2B-like persistent semantics for the workloads where per-call freshness is the wrong default.
- Multi-host scheduler — for the Knative-shaped deployment story. Single-process daemon today is fine for solo dev / small team.
- Streaming output — currently every call runs to completion before returning; long jobs hide useful progress.
If you operate an agent framework or platform and the pattern above fits something you're hitting, we'd love to hear from you.
— Wisp