Per-call Python sandboxes in WebAssembly: where Wisp fits

2026-05-03 · Wisp

TL;DR

Wisp is an open-source Python sandbox runtime that follows the same integration model as E2B, Modal Sandbox, Daytona, Blaxel, Vercel Sandbox, and Cloudflare Workers — agent frameworks plug it in as their "tool execution backend" and ask it to run model-generated Python safely.

The substrate is different: WASI Python under Wasmtime instead of a Firecracker microVM, gVisor container, or Docker container. That choice buys two things the others structurally cannot reach today:

  1. 0.78 ms per-call cold start at the executor (vs 25 ms warm resume / 90–150 ms cold for the others).
  2. Per-call fresh sandbox as the default, not just an option. At a sub-millisecond startup cost, "new sandbox per tool call" is economically the same as "reuse one sandbox" — so we can default to fresh and recover correctness properties (no state leakage between tool calls, byte-identical replay) for free.

This post explains where Wisp belongs in the existing landscape — it's a complement, not a replacement — and walks through how we got the numbers.


The agent tool-execution problem

A code agent (Claude Code, Aider, OpenCode, Cursor, OpenInterpreter, LangChain agents, OpenAI Agents SDK, …) is a loop:

LLM ─→ "execute this Python" ─→ tool backend runs it ─→ result back to LLM ─→ ...

The agent's job is the loop. The tool backend's job is running the model's generated code safely and quickly.

These are different concerns. The agent framework focuses on routing, memory, and LLM orchestration; the tool backend focuses on isolation and speed. Most agent frameworks today either implement the tool backend ad hoc (subprocess + chmod, or maybe Docker) or call out to a hosted sandbox service.

Wisp aims at the second slot.

The current landscape (2026-05)

Eight serious players ship "sandbox-as-a-service for AI agents" today. The April 2026 OpenAI Agents SDK release named seven of them as built-in integrations (Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, Vercel) — that's the canonical adoption layer.

Each made a different bet on substrate, lifecycle, and capability model. None of them is wrong. They're optimized for different agent shapes.

| Provider | Substrate | Per-call cold start | Lifecycle | Capability model | Sweet spot |
|---|---|---|---|---|---|
| E2B | Firecracker microVM + Jupyter inside | ~150 ms | Persistent (5-min idle timeout) | Broad: full Python + pip + shell + network inside VM | Data analysts, long notebook-style sessions |
| Modal Sandbox | Custom Rust container + gVisor | sub-second | Per-invocation | Broad: full container | ML workloads, batch jobs |
| Daytona | Docker container (+ optional Kata) | ~90 ms warm | Persistent dev environment | Broad: full container, kernel-shared | Dev environments, CI |
| Blaxel | Firecracker + perpetual env | ~25 ms warm resume | Persistent w/ aggressive idle shutdown | Broad: full VM | Intermittent agent traffic |
| Vercel Sandbox | Firecracker | sub-second | Per-invocation, 5-hour cap | Broad: full VM | One-off code execution at the edge |
| Cloudflare Workers | V8 isolate + Pyodide | sub-100 ms | Per-invocation | Narrow: V8-sandboxed, JS-first; Python via Pyodide | Edge HTTP workloads |
| AWS Lambda | Firecracker | 800–1500 ms (Python cold) | Per-invocation | Broad: full container | Cron, infrequent triggers |
| Wisp (this work) | wasmtime + WASI CPython | 0.78 ms | Per-call fresh (default) | Capability-bridge (explicit allowlist) | High-frequency tool-call agents, RL rollouts |

A few honest notes: the cold-start column mixes warm resumes (Blaxel, Daytona) with true cold starts (E2B, Lambda), and every "sweet spot" is a judgment call, not a hard boundary.

Where Wisp fits — and where it doesn't

| Use case | Wisp | Use one of the others |
|---|---|---|
| Agent makes 50+ short tool calls per turn (file ops, JSON parse, simple compute) | ✓ default per-call freshness, no state pollution between calls | persistent-VM models pay a state-leakage cost |
| Tree-search RL rollout (MCTS, GRPO) | ✓ COW fork at sub-ms per branch (Spike B2 below) | Linux fork or Firecracker snapshot is 100–1000× slower |
| Multi-tenant SaaS hosting untrusted agent code | ✓ capability-bridge gives explicit allowlists per tenant | broad VM access works, but auditing is harder |
| Data-analyst agent with pandas and a CSV that should persist | Wisp can do session mode, but it's bolted on | E2B/Daytona model is a better fit (state survives across cells) |
| Agent needing arbitrary pip install mid-session | Wisp's WASI Python doesn't have dlopen for native wheels | E2B (full apt+pip inside VM) |
| Browser automation (Playwright, etc.) | not yet in scope | E2B's chrome-in-sandbox |

The honest position: for most agent frameworks today, Wisp is a plug-in for the tool-execution layer, sitting alongside (not replacing) whichever sandbox provider they already use. We are best when the agent loop is high-frequency and stateless-ish.

The substrate, briefly

WebAssembly's linear memory is a single contiguous byte array per Instance. That's what makes per-call fresh cheap: snapshotting an interpreter is one copy of that array, and restoring it is one copy (or one mmap) back.

Native processes structurally can't do this — heap is scattered across mmap regions, shared libs, stack; you can't dump or restore at byte granularity without ptrace or full uVM serialization. Firecracker has a snapshot facility, but it operates at VM granularity and takes 100–500 ms to restore.

The wasm Instance abstraction collapses snapshot/restore into one syscall. That's where the substrate gap comes from.

The cold-start descent (our spikes, with numbers)

We ran four spikes, each replacing the bottleneck of the previous one. All measurements are on Apple Silicon M-series, 8 cores, Wasmtime crate 27, in-process.

Spike A — pooled in-process, fresh interpreter

| Subprocess wasmtime CLI    | ~400 ms |
| Pooled in-process, init    |  39 ms  |  ← Spike A

Pre-built engine, pre-compiled module, instantiate fresh, run wisp_init per call. The 39 ms is dominated by Py_InitializeFromConfig plus stdlib imports — most of CPython's startup work. Already 10× faster than the subprocess baseline.

Spike A2 — memory-snapshot restore

wisp_init is deterministic. Run it once, capture the linear memory, restore the snapshot via Memory::data_mut().copy_from_slice() instead of re-running init.
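
A minimal sketch of that restore path using the wasmtime Rust API. The export names ("wisp_init", "memory", "wisp_eval") and the module filename follow the post, but WASI setup and error handling are elided, so treat this as an illustration rather than the actual wisp executor:

use wasmtime::{Engine, Linker, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "python-reactor.wasm")?;
    let linker: Linker<()> = Linker::new(&engine); // real code registers WASI imports here

    // One-time: instantiate, run wisp_init, capture the post-init memory image.
    let mut store = Store::new(&engine, ());
    let instance = linker.instantiate(&mut store, &module)?;
    instance
        .get_typed_func::<(), ()>(&mut store, "wisp_init")?
        .call(&mut store, ())?;
    let memory = instance.get_memory(&mut store, "memory").unwrap();
    let snapshot = memory.data(&store).to_vec();
    let snapshot_pages = memory.size(&store);

    // Per call: fresh instance, grow to the snapshot's size, copy the image back
    // instead of paying Py_InitializeFromConfig + stdlib imports again.
    let mut store = Store::new(&engine, ());
    let instance = linker.instantiate(&mut store, &module)?;
    let memory = instance.get_memory(&mut store, "memory").unwrap();
    let delta = snapshot_pages.saturating_sub(memory.size(&store));
    memory.grow(&mut store, delta)?;
    memory.data_mut(&mut store)[..snapshot.len()].copy_from_slice(&snapshot);
    // ...then call wisp_eval against the restored interpreter state...
    Ok(())
}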

| memcpy snapshot/restore    |  1.68 ms  |  ← Spike A2

23× speedup. The memcpy was the dominant cost — bandwidth-bound, doesn't parallelize.

Spike A2.1 — mmap COW snapshot

Replace the per-call memcpy with a custom Wasmtime MemoryCreator that mmaps the snapshot file MAP_PRIVATE. Per call, after instantiate, re-mmap MAP_PRIVATE | MAP_FIXED to undo Wasmtime's data-segment init writes:

unsafe impl MemoryCreator for CowMemoryCreator {
    fn new_memory(&self, ...) -> Result<Box<dyn LinearMemory>> {
        // 1. Reserve virtual region with PROT_NONE
        // 2. mmap MAP_PRIVATE of snapshot file at offset 0
        // 3. mmap MAP_PRIVATE | MAP_ANON for trailing zero pages
    }
}

// per call:
let r = libc::mmap(
    base, snapshot.len(),
    PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_FIXED,
    snap_fd, 0,
);
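// real code checks r != libc::MAP_FAILED before reusing the instance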

Result:

| mmap COW snapshot          |  0.78 ms  |  ← Spike A2.1

The per-call snapshot reset dropped from a 1.07 ms memcpy to a 0.030 ms syscall. The full 0.78 ms breaks down as:

| Instantiate + data init    |   0.45 ms  ← wasmtime overhead
| mmap MAP_FIXED reset       |   0.03 ms  ← was memcpy 1.07 ms
| Grow memory                |   0.002 ms
| Alloc + write code         |   0.01 ms
| wisp_eval                  |   0.20 ms  ← page faults lazy
| Total                      |   0.78 ms

The wisp_eval cost rose slightly compared to A2 because pages are faulted in lazily during execution. Net win: ~2.4× over memcpy, and unlike memcpy this approach doesn't burn memory bandwidth — meaning it parallelizes better.

The other primitive: cheap fork from a snapshot (Spike B2)

Spawn K children from a single post-init snapshot, each running a divergent Python expression.
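
A sketch of how the branching looks with the memcpy restore backend (the mmap COW variant swaps the copy for the MemoryCreator shown above). Thread-per-child with a shared Engine, Module, and snapshot; export names follow the post and everything else is illustrative:

use std::sync::Arc;
use std::thread;
use wasmtime::{Engine, Linker, Module, Store};

// Fork K children from one post-init snapshot: each thread gets its own
// Store + Instance, restores the shared image, then runs its own expression.
fn fork_children(engine: Engine, module: Module, snapshot: Arc<Vec<u8>>, exprs: Vec<String>) {
    let handles: Vec<_> = exprs
        .into_iter()
        .map(|expr| {
            let (engine, module, snapshot) = (engine.clone(), module.clone(), snapshot.clone());
            thread::spawn(move || -> anyhow::Result<()> {
                let linker: Linker<()> = Linker::new(&engine); // real code registers WASI imports here
                let mut store = Store::new(&engine, ());
                let instance = linker.instantiate(&mut store, &module)?;
                let memory = instance.get_memory(&mut store, "memory").unwrap();

                // Grow to the snapshot's size (64 KiB wasm pages), then restore the parent image.
                let cur = memory.size(&store);
                let need = (snapshot.len() as u64).div_ceil(64 * 1024);
                if need > cur {
                    memory.grow(&mut store, need - cur)?;
                }
                memory.data_mut(&mut store)[..snapshot.len()].copy_from_slice(&snapshot);

                // From here the child owns its memory: no state shared with
                // siblings or the parent beyond the snapshot contents.
                // ...write `expr` into guest memory and call wisp_eval...
                let _ = expr;
                Ok(())
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap().unwrap();
    }
}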

| K | memcpy backend (br/s) | mmap COW backend (br/s) |
|---|---|---|
| 1 | 367 | 1163 |
| 8 | 697 | 1702 |
| 64 | 926 | 2231 |
| 256 | 1025 | 2363 |
| 1024 | 799 | 2394 |

Peak parallel throughput on 8 cores: 2394 branches/sec with the COW backend, 2.3× more than memcpy. Per-branch latency at K=64 parallel: p50 3.42 ms, p99 6.17 ms.

Comparison to native runtimes for a tree-search workload (K=100 × depth=100 = 10 000 forks per trajectory):

| Substrate | Per-trajectory branching cost |
|---|---|
| Linux process fork (Python interpreter) | 50–100 s |
| Firecracker uVM snapshot | 16–83 min |
| Wisp WASM memcpy snapshot | ~10 s |
| Wisp WASM mmap COW snapshot | ~4.2 s |

Useful for tree-search RL: every WASM child gets a truly fresh sandbox — new linear memory, no shared state with siblings or parent beyond the snapshot contents.

End-to-end through the daemon

The 0.78 ms number is the executor primitive. End-to-end through the HTTP daemon (wisp-runtime):

$ for i in {1..10}; do
    curl -s -X POST http://127.0.0.1:9000/v1/eval \
      -H "Content-Type: application/json" \
      -d '{"code":"print(2+2)"}' | jq .elapsed_us
  done
14225
1511
1356
1203
1189
1130
1284
1189
1120
1255

Steady state ~1.1–1.3 ms including HTTP framing, JSON serialization, and the channel hop from tokio to a worker thread. The first call is warm-up of the JIT cache for the first Instance.

This is the cold start an agent framework will measure when it integrates the Wisp client SDK.
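
For reference, the integration surface is just that HTTP endpoint. A minimal Rust client (reqwest with the blocking and json features) might look like the sketch below; the request shape and the elapsed_us field come from the curl example above, the stdout field mirrors the Python SDK's .stdout, and anything beyond those is an assumption:

use serde_json::{json, Value};

// Post code to the daemon's /v1/eval endpoint and return the parsed JSON reply.
fn eval_python(code: &str) -> reqwest::Result<Value> {
    let client = reqwest::blocking::Client::new();
    client
        .post("http://127.0.0.1:9000/v1/eval")
        .json(&json!({ "code": code }))
        .send()?
        .json::<Value>()
}

fn main() -> reqwest::Result<()> {
    let resp = eval_python("print(2+2)")?;
    println!("stdout: {:?}", resp.get("stdout"));
    println!("elapsed_us: {:?}", resp.get("elapsed_us"));
    Ok(())
}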

What WebAssembly does NOT solve

Honest disclaimers, since these come up:

  1. WASI Python has no dlopen, so native wheels (numpy, pandas) don't work until the M1 build pipeline lands.
  2. There is no network, shell, or filesystem access inside the sandbox beyond what the capability bridge explicitly allowlists.
  3. Nothing persists between calls by default; long-lived state needs the session API, which is still on the roadmap.

For most agent tool calls (parse this JSON, transform that data, validate this schema, do this hash), none of these matter. For some agent workloads (HTTP scraping, browser control, full notebook exploration), one of the persistent-VM substrates is the right choice.

Reproduce

Everything in this post is in github.com/wisplab/wisp, MIT-licensed.

git clone https://github.com/wisplab/wisp
cd wisp/runtime/cpython-wasi && ./build.sh        # CPython 3.14 → wasm
./build-sqlite.sh && ./build-openssl.sh           # M0.5 stdlib deps
./wisp_entry/build.sh                              # python-reactor.wasm

cd ../../bench/python-wasi-cow && cargo run --release            # Spike A2.1
cd ../python-wasi-cow-branching && cargo run --release            # Spike B2

# And the daemon + SDK:
cd ../.. && cargo run --release -p wisp-runtime &
pip install -e sdks/python
python -c "import wisp; print(wisp.Client().eval('print(2+2)').stdout)"

Each bench/*/FINDINGS.md documents the methodology.

What's next

  1. Daemon-side capabilities: shell, file_read, file_write, web_fetch, grep, configured via allowlist. Without these the sandbox is too narrow for real tool-call workloads.
  2. Sandbox-side wisp Python module — friendly wrapper around the capability bridge, so user code does wisp.shell("ls") instead of raw _wisp.call_host(...).
  3. Reference integration with one open-source agent framework — probably smolagents or OpenCode. Not asking them to merge a PR; just a working example so anyone can copy the pattern.
  4. M1 build pipeline → numpy — let the sandbox actually run real data science.
  5. Session API — for users who want E2B-like persistent semantics when that's the right fit.

If you operate an agent framework or platform and the pattern above fits something you're hitting, we'd love to hear from you.

— Wisp