Per-call Python sandboxes in WebAssembly: where Wisp fits

Updated 2026-05-08 · Wisp

Working demo, today

In a Claude Code session with the Wisp MCP integration installed:

"Use wisp to compute SHA-256 of 'hello world'."
[wisp sandbox: rc=0, 1.93 ms]
b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

That's a fresh CPython 3.14 interpreter — running inside WebAssembly, isolated from the host filesystem and network — started in under 2 ms, then thrown away. The next tool call gets a brand-new interpreter.

The integration is a single ~120-line Python MCP server plus a Rust daemon running locally; full setup is under "Reproduce" below. OpenCode users get a similar story via a ~100-line TypeScript custom tool. Here's why anyone would bother.

What Wisp is

A free, open-source tool-execution backend for AI agents. Same slot as E2B, Modal Sandbox, Daytona, Vercel Sandbox, Blaxel, and Cloudflare Workers — the agent framework drives the loop, Wisp runs the model-generated Python.

Three deliberate bets vs the typical sandbox provider:

  1. Substrate: WASI CPython under wasmtime, not Firecracker / gVisor / Docker.
  2. Default lifecycle: per-call fresh, not persistent VM. With sub-ms per-call startup, a fresh sandbox per tool call costs about the same as reusing one sandbox, so we default to fresh and get no cross-call state leakage and byte-identical replay for free.
  3. Capability model: explicit allowlists for filesystem, shell, and outbound HTTP, not broad-VM trust.
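
To make the third bet concrete, here is a minimal sketch of what an explicit-allowlist check can look like. The `Capabilities` class, its field names, and both `allows_*` methods are hypothetical illustrations, not Wisp's actual configuration format or API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical per-sandbox grant: anything not listed is denied."""
    read_dirs: tuple[str, ...] = ()
    http_hosts: tuple[str, ...] = ()

    def allows_read(self, path: str) -> bool:
        # Reject relative paths and any ".." component instead of normalizing them.
        p = PurePosixPath(path)
        if not p.is_absolute() or ".." in p.parts:
            return False
        return any(p.is_relative_to(d) for d in self.read_dirs)

    def allows_http(self, url: str) -> bool:
        # Only plain http(s) to an exactly matching host is allowed.
        parts = urlsplit(url)
        return parts.scheme in ("http", "https") and parts.hostname in self.http_hosts


caps = Capabilities(read_dirs=("/workspace",), http_hosts=("api.example.com",))
print(caps.allows_read("/workspace/data.csv"), caps.allows_http("https://evil.example.net/"))
```

The point of the design is that the default answer is "no": a sandbox with an empty `Capabilities()` can read nothing and reach no host.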

The position is complement, not replacement. Below: where Wisp fits, where to keep using one of the others, the design, and the numbers.


It runs numpy too

The same sandbox in the demo above also runs numpy 1.26, including all three subpackages that need C extensions:

> Use wisp to: import numpy as np;
> A = np.array([[3.0, 1], [1, 2]]); b = np.array([9.0, 8]);
> print(np.linalg.solve(A, b))

[wisp sandbox: rc=0, 4.45 ms]
[2. 3.]

> Use wisp to: import numpy as np;
> x = np.array([1.0, 2.0, 3.0, 4.0]);
> print(np.fft.ifft(np.fft.fft(x)).round(6))

[wisp sandbox: rc=0, 3.79 ms]
[1.+0.j 2.-0.j 3.+0.j 4.-0.j]

> Use wisp to: import numpy as np;
> rng = np.random.default_rng(42);
> print(rng.standard_normal(5))

[wisp sandbox: rc=0, 5.12 ms]
[ 0.30471708 -1.03998411  0.7504512   0.94056472 -1.95103519]

Cost: a 41 MB reactor wasm instead of a 38 MB one. All three came from numpy's own source, cross-compiled to wasm32-wasip1 via the M1 pipeline — linalg against numpy's bundled f2c-translated reference BLAS+LAPACK (no system BLAS, no Fortran toolchain), fft against pocketfft, random against the 9 cythonized PRNG + distribution modules. The build scripts are in scripts/numpy/ if you want to extend the runtime with your own C extensions.

Numpy is pre-imported into the snapshot, so per-call latency is the same ~5 ms whether or not user code touches it. Per-call freshness still holds — each call gets a copy of the post-import state, then the copy is thrown away.
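
The "copy of the post-import state, then thrown away" idea can be illustrated in plain Python. This is only an analogy under stated assumptions: a dict stands in for the wasm linear-memory snapshot, `copy.deepcopy` stands in for the snapshot restore, and `run_fresh` is a hypothetical helper. `exec` here is not an isolation boundary.

```python
import copy

# Stand-in for the post-import snapshot: built once, never mutated afterwards.
SNAPSHOT = {"cache": {}, "counter": 0}


def run_fresh(code: str) -> dict:
    """Run `code` against a private copy of SNAPSHOT and return that copy."""
    state = copy.deepcopy(SNAPSHOT)        # every call starts from the same state
    exec(code, {"__builtins__": {}}, state)
    return state                           # the caller drops it; nothing flows back


first = run_fresh("counter += 1; cache['k'] = 'v'")
second = run_fresh("counter += 10")        # does not see anything `first` wrote
print(first, second, SNAPSHOT)
```

The second call sees `counter == 0` and an empty `cache` at start, which is the property the per-call-fresh default is meant to guarantee.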


The agent tool-execution problem

A code agent (Claude Code, Aider, OpenCode, Cursor, OpenInterpreter, LangChain agents, OpenAI Agents SDK, …) is a loop:

LLM ─→ "execute this Python" ─→ tool backend runs it ─→ result back to LLM ─→ ...

The agent's job is the loop. The tool backend's job is running the model's generated code:

These are different concerns. The agent framework focuses on routing, memory, and LLM orchestration; the tool backend focuses on isolation and speed. Most agent frameworks today either implement the tool backend ad hoc (subprocess + chmod, or maybe Docker) or call out to a hosted sandbox service.
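
The split between the two concerns can be sketched as a loop that takes both sides as plain callables. Everything here is illustrative: `agent_loop`, `next_code`, and `run_tool` are hypothetical names, the "model" is a scripted stub, and the "backend" is a restricted `eval` rather than a real sandbox.

```python
from typing import Callable, Optional


def agent_loop(
    next_code: Callable[[list[str]], Optional[str]],
    run_tool: Callable[[str], str],
    max_steps: int = 8,
) -> list[str]:
    """Alternate between the model (next_code) and the tool backend (run_tool)."""
    results: list[str] = []
    for _ in range(max_steps):
        code = next_code(results)          # agent framework: routing, memory, LLM call
        if code is None:                   # the model decided it is done
            break
        results.append(run_tool(code))     # tool backend: isolation and speed
    return results


# Stub "model" that asks for two computations, then stops.
script = iter(["2 + 2", "sum(range(10))"])
transcript = agent_loop(
    next_code=lambda results: next(script, None),
    run_tool=lambda code: str(eval(code, {"__builtins__": {"sum": sum, "range": range}})),
)
print(transcript)
```

Swapping the `run_tool` callable is the only change needed to move from an ad hoc backend to a hosted or local sandbox, which is why the backend can be treated as a separate slot.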

Wisp aims at the second slot.

The current landscape (2026-05)

Eight serious players ship "sandbox-as-a-service for AI agents" today. The April 2026 OpenAI Agents SDK release named seven of them as built-in integrations (Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, Vercel) — that's the canonical adoption layer.

Each made a different bet on substrate, lifecycle, and capability model. None of them is wrong. They're optimized for different agent shapes.

| Provider | Substrate | Per-call cold start | Lifecycle | Capability model | Sweet spot |
|---|---|---|---|---|---|
| E2B | Firecracker microVM + Jupyter inside | ~150 ms | Persistent (5-min idle timeout) | Broad: full Python + pip + shell + network inside VM | Data analysts, long notebook-style sessions |
| Modal Sandbox | Custom Rust container + gVisor | sub-second | Per-invocation | Broad: full container | ML workloads, batch jobs |
| Daytona | Docker container (+ optional Kata) | ~90 ms warm | Persistent dev environment | Broad: full container, kernel-shared | Dev environments, CI |
| Blaxel | Firecracker + perpetual env | ~25 ms warm resume | Persistent w/ aggressive idle shutdown | Broad: full VM | Intermittent agent traffic |
| Vercel Sandbox | Firecracker | sub-second | Per-invocation, 5-hour cap | Broad: full VM | One-off code execution at the edge |
| Cloudflare Workers | V8 isolate + Pyodide | sub-100 ms | Per-invocation | Narrow: V8-sandboxed JS-first; Python via Pyodide | Edge HTTP workloads |
| AWS Lambda Sandbox | Firecracker | 800–1500 ms (Python cold) | Per-invocation | Broad: full container | Cron, infrequent triggers |
| Wisp (this work) | wasmtime + WASI CPython | 0.78 ms | Per-call fresh (default) | Capability-bridge (explicit allowlist) | High-frequency tool-call agents, RL rollouts |

A few honest notes:

Where Wisp fits — and where it doesn't

| Use case | Wisp | Use one of the others |
|---|---|---|
| Agent makes 50+ short tool calls per turn (file ops, JSON parse, simple compute) | ✓ default per-call freshness, no state pollution between calls | persistent-VM models pay state-leakage cost |
| Tree-search RL rollout (MCTS, GRPO) | ✓ COW fork at sub-ms per branch (Spike B2 below) | by the per-trajectory numbers below, Linux fork is roughly 10–25× slower and Firecracker snapshot roughly 200–1200× slower |
| Multi-tenant SaaS hosting untrusted agent code | ✓ capability-bridge gives explicit allowlists per tenant | broad VM access works but auditing is harder |
| Data-analyst agent with pandas and a CSV that should persist | Wisp can do session mode, but it's bolted on | E2B/Daytona model is a better fit (state survives across cells) |
| Agent needing arbitrary pip install mid-session | Wisp's WASI Python doesn't have dlopen for native wheels | E2B (full apt+pip inside VM) |
| Browser automation (Playwright, etc.) | not yet in scope | E2B's chrome-in-sandbox |

The honest position: for most agent frameworks today, Wisp is a plug-in for the tool-execution layer, sitting alongside (not replacing) whichever sandbox provider they already use. We are best when the agent loop is high-frequency and stateless-ish.

The substrate, briefly

WebAssembly's linear memory is a single contiguous byte array per Instance. That's what makes per-call fresh cheap:

Native processes structurally can't do this: the heap is scattered across mmap regions, shared libs, and the stack, so you can't dump or restore at byte granularity without ptrace or full microVM serialization. Firecracker has a snapshot facility, but it operates at VM granularity and takes 100–500 ms to restore.

The wasm Instance abstraction collapses snapshot/restore into one syscall. That's where the substrate gap comes from.

The cold-start descent (our spikes, with numbers)

We ran four spikes, each replacing the bottleneck of the previous one. All measurements are on Apple Silicon M-series, 8 cores, Wasmtime crate 27, in-process.

Spike A — pooled in-process, fresh interpreter

| Subprocess wasmtime CLI    | ~400 ms |
| Pooled in-process, init    |  39 ms  |  ← Spike A

Pre-built engine, pre-compiled module, instantiate fresh, run wisp_init per call. The 39 ms is dominated by Py_InitializeFromConfig plus stdlib imports — most of CPython's startup work. Already 10× faster than the subprocess baseline.

Spike A2 — memory-snapshot restore

wisp_init is deterministic. Run it once, capture the linear memory, restore the snapshot via Memory::data_mut().copy_from_slice() instead of re-running init.

| memcpy snapshot/restore    |  1.68 ms  |  ← Spike A2

23× speedup. The memcpy is now the dominant cost (1.07 ms of the 1.68 ms): it is bandwidth-bound and doesn't parallelize.
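
As a toy model of this step, a `bytearray` can stand in for the wasm linear memory. This is a sketch of the snapshot/restore idea only, not Wisp's Rust code; `make_snapshot` and `restore` are hypothetical helpers, and the slice assignment plays the role of `copy_from_slice`.

```python
def make_snapshot(memory: bytearray) -> bytes:
    """Capture the linear memory once, after init has run."""
    return bytes(memory)


def restore(memory: bytearray, snapshot: bytes) -> None:
    """Overwrite memory in place; the cost is proportional to len(snapshot)."""
    memory[: len(snapshot)] = snapshot


memory = bytearray(64 * 1024)          # one 64 KiB wasm page as a toy linear memory
memory[0:4] = b"INIT"                  # pretend this is the work done by init
snapshot = make_snapshot(memory)

memory[0:4] = b"USER"                  # a call mutates the memory
restore(memory, snapshot)              # the next call starts from post-init bytes
print(bytes(memory[0:4]))
```

Because every byte of the snapshot is copied on every call, the cost scales with snapshot size regardless of how little the call actually touches.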

Spike A2.1 — mmap COW snapshot

Replace the per-call memcpy with a custom Wasmtime MemoryCreator that mmaps the snapshot file MAP_PRIVATE. Per call, after instantiate, re-mmap MAP_PRIVATE | MAP_FIXED to undo Wasmtime's data-segment init writes:

unsafe impl MemoryCreator for CowMemoryCreator {
    fn new_memory(&self, ...) -> Result<Box<dyn LinearMemory>> {
        // 1. Reserve virtual region with PROT_NONE
        // 2. mmap MAP_PRIVATE of snapshot file at offset 0
        // 3. mmap MAP_PRIVATE | MAP_ANON for trailing zero pages
    }
}

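// per call (unsafe: re-maps the snapshot over the live linear memory):
let r = unsafe {
    libc::mmap(
        base, snapshot.len(),
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_FIXED,
        snap_fd, 0,
    )
};
// mmap reports failure by returning MAP_FAILED, not a null pointer.
assert_ne!(r, libc::MAP_FAILED, "snapshot re-map failed");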

Result:

| mmap COW snapshot          |  0.78 ms  |  ← Spike A2.1

The per-call snapshot reset dropped from a 1.07 ms memcpy to a 0.030 ms syscall. The 0.78 ms total decomposes roughly as:

| Instantiate + data init    |  0.45 ms   |  ← wasmtime overhead
| mmap MAP_FIXED reset       |  0.03 ms   |  ← was memcpy 1.07 ms
| Grow memory                |  0.002 ms  |
| Alloc + write code         |  0.01 ms   |
| wisp_eval                  |  0.20 ms   |  ← page faults lazy
| Total                      |  0.78 ms   |

The wisp_eval cost rose slightly compared to A2 because pages are faulted in lazily during execution. Net win: ~2.2× over memcpy (1.68 ms → 0.78 ms), and unlike memcpy this approach doesn't burn memory bandwidth, so it parallelizes better.
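
The copy-on-write behaviour itself can be observed from Python's standard `mmap` module, where `ACCESS_COPY` gives a private copy-on-write mapping of a file. This is a sketch under one loud assumption: Python cannot re-map at a fixed address, so the "reset" below closes the mapping and maps the file again, whereas the Rust code above re-maps in place with `MAP_FIXED`. The file contents and the `map_private` helper are made up for illustration.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Write a toy "snapshot" file: one page starting with a marker.
fd, path = tempfile.mkstemp()
os.write(fd, b"INIT" + bytes(PAGE - 4))


def map_private(fd: int) -> mmap.mmap:
    """Map the snapshot copy-on-write: writes stay private to this mapping."""
    return mmap.mmap(fd, PAGE, access=mmap.ACCESS_COPY)


view = map_private(fd)
view[0:4] = b"USER"                    # dirties a private page; the file is untouched
dirty = bytes(view[0:4])
view.close()

view = map_private(fd)                 # "reset": drop the dirty pages, map again
clean = bytes(view[0:4])
view.close()

os.close(fd)
os.unlink(path)
print(dirty, clean)
```

Only pages that a call actually writes get copied, which is why the reset cost no longer depends on the snapshot size.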

The other primitive: cheap fork from a snapshot

Spawn K children from a single post-init snapshot, each running a divergent Python expression.

| K | memcpy backend (br/s) | mmap COW backend (br/s) |
|---|---|---|
| 1 | 367 | 1163 |
| 8 | 697 | 1702 |
| 64 | 926 | 2231 |
| 256 | 1025 | 2363 |
| 1024 | 799 | 2394 |

Peak parallel throughput on 8 cores: 2394 branches/sec with the COW backend, 2.3× more than memcpy. Per-branch latency at K=64 parallel: p50 3.42 ms, p99 6.17 ms.

Comparison to native runtimes for a tree-search workload (K=100 × depth=100 = 10 000 forks per trajectory):

| Substrate | Per-trajectory branching cost |
|---|---|
| Linux process fork (Python interpreter) | 50–100 s |
| Firecracker microVM snapshot | 16–83 min |
| Wisp WASM memcpy snapshot | ~10 s |
| Wisp WASM mmap COW snapshot | ~4.2 s |

Useful for tree-search RL: every WASM child gets a truly fresh sandbox — new linear memory, no shared state with siblings or parent beyond the snapshot contents.
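
The two Wisp per-trajectory figures follow directly from the peak throughput measurements above. A small worked calculation, using only numbers already quoted in this post (peak 1025 br/s for memcpy, 2394 br/s for mmap COW, 10 000 forks per trajectory):

```python
FORKS_PER_TRAJECTORY = 100 * 100       # K=100 branches x depth=100

# Peak branches/sec for each backend, from the measurements above.
PEAK_BR_PER_S = {"memcpy": 1025, "mmap_cow": 2394}

seconds = {name: FORKS_PER_TRAJECTORY / rate for name, rate in PEAK_BR_PER_S.items()}
speedup = PEAK_BR_PER_S["mmap_cow"] / PEAK_BR_PER_S["memcpy"]

for name, s in seconds.items():
    print(f"{name}: {s:.1f} s per trajectory")
print(f"COW vs memcpy: {speedup:.1f}x")
```

This assumes the trajectory runs at peak parallel throughput the whole time, so the results are lower bounds rather than measured end-to-end times.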

End-to-end through the daemon

The 0.78 ms number is the executor primitive. End-to-end through the HTTP daemon (wisp-runtime):

$ for i in {1..10}; do
    curl -s -X POST http://127.0.0.1:9000/v1/eval \
      -H "Content-Type: application/json" \
      -d '{"code":"print(2+2)"}' | jq .elapsed_us
  done
14225
1511
1356
1203
1189
1130
1284
1189
1120
1255

Steady state is ~1.1–1.3 ms, including HTTP framing, JSON serialization, and the channel hop from tokio to a worker thread. The first call (14.2 ms) is slower because it warms the JIT cache for the first Instance.

This is the cold start an agent framework will measure when it integrates the Wisp client SDK.
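
For callers that don't want to shell out to curl, the same request can be sketched with Python's standard library. The URL, the `code` request field, and the `elapsed_us` response field are taken from the curl example above; the helper names `build_eval_request` and `eval_code` are hypothetical, and no other response fields are assumed. This is not the Wisp client SDK.

```python
import json
import urllib.request

WISP_URL = "http://127.0.0.1:9000/v1/eval"   # endpoint from the curl example


def build_eval_request(code: str, url: str = WISP_URL) -> urllib.request.Request:
    """Build the same POST the curl loop sends: a JSON body with a "code" field."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def eval_code(code: str, timeout: float = 5.0) -> dict:
    """Send the request to a running daemon and return the decoded JSON reply."""
    with urllib.request.urlopen(build_eval_request(code), timeout=timeout) as resp:
        return json.load(resp)


req = build_eval_request("print(2+2)")
print(req.get_method(), req.full_url, req.data)
```

With the daemon running, `eval_code("print(2+2)")["elapsed_us"]` should report the same per-call figure that the `jq .elapsed_us` filter prints.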

What WebAssembly does NOT solve

Honest disclaimers, since these come up:

For most agent tool calls (parse this JSON, transform that data, validate this schema, do this hash), none of these matter. For some agent workloads (HTTP scraping, browser control, full notebook exploration), one of the persistent-VM substrates is the right choice.

Reproduce

Everything in this post is in github.com/wisplab/wisp, MIT-licensed.

Build the runtime + daemon:

git clone https://github.com/wisplab/wisp && cd wisp
cd runtime/cpython-wasi && ./build.sh             # CPython 3.14 → wasm
./build-sqlite.sh && ./build-openssl.sh           # M0.5 stdlib deps
./wisp_entry/build.sh                              # python-reactor.wasm
cd ../.. && cargo build --release -p wisp-runtime # the daemon

Plug into Claude Code (MCP):

cd examples/claude-code-integration
python3 -m venv .venv && ./.venv/bin/pip install -r requirements.txt

# In another terminal: start the daemon (capability config optional)
cargo run --release -p wisp-runtime

# Register the MCP server with Claude Code
claude mcp add wisp \
  -- "$PWD/.venv/bin/python" "$PWD/wisp_mcp_server.py"

Then in a fresh Claude Code session: "use wisp to print 2+2".

Plug into OpenCode (custom tool):

cp -r examples/opencode-integration/.opencode /path/to/your/project/
# Daemon needs to be running on localhost:9000.

Reproduce the benchmark numbers:

cd bench/python-wasi-cow            && cargo run --release   # 0.78 ms cold start
cd ../python-wasi-cow-branching     && cargo run --release   # 2394 br/s parallel

Each bench/*/FINDINGS.md documents the methodology.

What's next

  1. M1 build pipeline → numpy: let the sandbox actually run real data science. The cross-build pipeline first worked end-to-end on xxhash; numpy, the headline test, now runs (see "It runs numpy too" above).
  2. Session API — E2B-like persistent semantics for the workloads where per-call freshness is the wrong default.
  3. Multi-host scheduler — for the Knative-shaped deployment story. Single-process daemon today is fine for solo dev / small team.
  4. Streaming output — currently every call runs to completion before returning; long jobs hide useful progress.

If you operate an agent framework or platform and the pattern above fits something you're hitting, we'd love to hear from you.

— Wisp