MicroVMs: The Data Isolation Layer AI Agents Have Been Missing

Here’s the question that breaks most AI agent architectures: when your agent processes Patient A’s medical records, can you guarantee it physically cannot access Patient B’s data?

Not “probably won’t.” Not “our access controls should prevent it.” Physically cannot. Hardware-enforced. The way one AWS customer’s EC2 instance cannot read another customer’s memory, even if both share the same physical server.

If your agents run in Docker containers, the honest answer is no. Containers share the host kernel. A kernel exploit (and there’s a new one every few months) means lateral movement between tenants. When the workload inside is LLM-generated code that you didn’t write, didn’t review, and can’t predict, that shared kernel is a liability.

MicroVMs solve exactly this problem. One problem. The important one: hard data isolation between concurrent agent sessions.

Why Data Isolation Is the Real Problem

Most discussions about “securing AI agents” mix up several concerns. Let me separate them:

  • Authentication/Authorization: Who can invoke which agent? Solved by IAM, RBAC, tokens.
  • Network security: What can the agent reach? Solved by egress rules, allowlists, VPCs.
  • Audit logging: What did the agent do? Solved by structured logging, SIEM integration.
  • Data isolation: Can one user’s agent session access another user’s data at the compute layer?

The first three have established solutions that work fine with containers. (I covered many of these in my piece on 15 patterns for production agents.) The fourth is where containers fundamentally fall short for AI agent workloads.

%%{init: {"layout": "dagre"}}%%
flowchart TB
    subgraph "The Problem with Containers"
        direction TB
        A[Agent Session: Patient A] --- K[Shared Linux Kernel]
        B[Agent Session: Patient B] --- K
        C[Agent Session: Patient C] --- K
        K -->|"Kernel exploit = <br/>all sessions exposed"| D[Data Breach]
    end

Why agents make this worse than traditional apps: a normal web server runs code you wrote. You control what it does. An AI agent generates and executes code dynamically. It might write a script that scans /proc, exploits a container escape CVE, or does something nobody anticipated. The attack surface isn’t static. It’s whatever the model invents. (Even Claude Code’s architecture uses path sandboxing and worktree isolation to contain agent actions — but that’s software-level containment, not hardware.)

What MicroVMs Actually Do (Simply)

A microVM is a stripped-down virtual machine. Same hardware isolation as an EC2 instance, but it boots in as little as 125ms and adds under 5MB of memory overhead per VM (Firecracker's published figures).

%%{init: {"layout": "dagre"}}%%
flowchart TB
    subgraph "MicroVM Architecture"
        direction TB
        A[Agent: Patient A<br/>Own kernel, own memory] ---|"Hardware boundary<br/>(cannot cross)"| B[Agent: Patient B<br/>Own kernel, own memory]
        B ---|"Hardware boundary<br/>(cannot cross)"| C[Agent: Patient C<br/>Own kernel, own memory]
    end
    H[Host + Hypervisor] --> A
    H --> B
    H --> C

Each agent session gets its own kernel, its own filesystem, its own memory space. Not separated by software namespaces (which can be escaped). Separated by CPU virtualization hardware, which cannot be escaped without a bug in the hypervisor stack: KVM plus the virtual machine monitor, and Firecracker's monitor is ~50K lines of Rust.

That’s the entire value proposition for regulated environments. Not faster boot times. Not lower memory usage. Those are nice, but the reason you pick microVMs over containers is: one agent session physically cannot read another session’s data.

The Healthcare Scenario

You’re building an agent platform for a health system. Multiple hospitals use it. Each doctor’s AI assistant processes patient records, writes analysis code, pulls lab results.

%%{init: {"layout": "dagre"}}%%
flowchart LR
    subgraph "Hospital A"
        D1[Dr. Smith's session] --> VM1[MicroVM<br/>Smith's patients ONLY<br/>Destroyed after session]
    end
    subgraph "Hospital B"
        D2[Dr. Patel's session] --> VM2[MicroVM<br/>Patel's patients ONLY<br/>Destroyed after session]
    end
    VM1 -.->|"Cannot access<br/>Hardware boundary"| VM2

When Dr. Smith’s session ends, the microVM is destroyed. Memory zeroed. Filesystem gone. No residual data. No chance that the next session on that hardware inherits leftover patient records from a previous session.

This is not achievable with containers without extraordinary effort. Container filesystems persist unless explicitly cleaned. Memory pages can be recycled. Process namespaces can be escaped. The amount of defensive configuration required to make containers “probably safe” for this use case is itself a source of bugs.

With microVMs, the isolation is the default. You’d have to actively try to break it.
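Teardown still has to be triggered by whatever launches the VM. Here is a minimal sketch of making the destroy-on-session-end step unconditional, written as a Python context manager. It is an illustration under assumptions, not any platform's API: `create_vm` and `destroy_vm` are hypothetical callables standing in for whatever your orchestration layer exposes.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_session(
    create_vm: Callable[[str], str],
    destroy_vm: Callable[[str], None],
    session_id: str,
) -> Iterator[str]:
    """Yield a VM handle and always destroy the VM on exit.

    create_vm and destroy_vm are hypothetical hooks into your
    orchestration layer; they are not part of any real SDK.
    """
    vm_id = create_vm(session_id)
    try:
        yield vm_id
    finally:
        # Runs on normal exit and on exceptions, so a crashed agent
        # session never leaves a VM (and its data) behind.
        destroy_vm(vm_id)


if __name__ == "__main__":
    live: set[str] = set()  # toy registry of running VMs

    def create(session_id: str) -> str:
        vm_id = f"vm-{session_id}"
        live.add(vm_id)
        return vm_id

    with ephemeral_session(create, live.discard, "dr-smith-001") as vm:
        print("running in", vm)
    print("live VMs after session:", live)
```

The `finally` block is the point of the sketch: cleanup does not depend on the agent's code finishing cleanly.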

How It Compares to Other Agent Runtimes

Every comparison ultimately comes back to one question: does this runtime guarantee data isolation between sessions?

| Runtime | Data Isolation Guarantee | How It Isolates | The Catch |
| --- | --- | --- | --- |
| Docker container | No hardware guarantee | Kernel namespaces (software) | Kernel exploits cross boundaries |
| gVisor | Stronger than Docker | User-space kernel intercepts syscalls | Still software-only; 10-30% I/O overhead |
| MicroVM (Firecracker) | Yes, hardware-enforced | Separate kernel per session via KVM | 125ms cold boot (1-5ms from snapshot) |
| Traditional VM | Yes, hardware-enforced | Full hypervisor | 30-60 second boot, 1GB+ overhead |
| WASM sandbox | Partial | Language runtime boundary | Can’t run arbitrary code, install packages |

Where microVMs specifically win over every alternative: they’re the only option that gives you hardware-enforced data isolation AND can boot fast enough for interactive agent sessions AND can run arbitrary code (install packages, run browsers, execute generated scripts).

gVisor comes close but it’s a software boundary. WASM is fast but can’t run real-world agent workloads. Traditional VMs are too slow and expensive per session. Containers are fast but don’t isolate at the hardware level.

Can It Handle Real Compute Workloads?

“Lightweight” refers to the overhead, not the capacity. The agent inside gets whatever resources you allocate.

| Agent Workload | Typical Config | Performance vs Bare Metal |
| --- | --- | --- |
| Pandas on multi-GB datasets | 4 vCPU, 16GB RAM | 95-98% native speed |
| Browser automation | 2 vCPU, 4GB RAM | ~95% native speed |
| Code generation + execution | 2 vCPU, 8GB RAM | Near-native |
| ML model inference (CPU) | 8 vCPU, 32GB RAM | 95%+ native speed |

A microVM runs a real Linux kernel. If it runs on Linux, it runs inside a microVM at near-native speed. The 5MB overhead is the monitor process on the host. The guest gets full access to allocated CPU and RAM.

For data analysis agents: mount the dataset into the microVM read-only. The agent can analyze but never modify source data. When the session ends, any intermediate results vanish with the microVM. No data leakage between sessions.
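On Firecracker, a read-only mount of this kind is expressed as a block device attached with `is_read_only` set. The sketch below builds the request bodies for the `PUT /drives/{drive_id}` endpoint. The paths, drive IDs, and the `drive_config` helper are placeholders for illustration, and it assumes the dataset has already been packaged as a filesystem image.

```python
import json


def drive_config(drive_id: str, path_on_host: str, *, root: bool, read_only: bool) -> dict:
    """Build the JSON body for Firecracker's PUT /drives/{drive_id}."""
    return {
        "drive_id": drive_id,
        "path_on_host": path_on_host,
        "is_root_device": root,
        "is_read_only": read_only,
    }


# Placeholder paths: substitute your own images.
# The rootfs is writable here, so each session needs its own copy of it.
rootfs = drive_config("rootfs", "/srv/images/agent-rootfs.ext4", root=True, read_only=False)
# The dataset is attached read-only: the guest can read it but not modify it.
dataset = drive_config("dataset", "/srv/datasets/session-123.ext4", root=False, read_only=True)

if __name__ == "__main__":
    for cfg in (rootfs, dataset):
        print(f"PUT /drives/{cfg['drive_id']}")
        print(json.dumps(cfg, indent=2))
```

The guest still has to mount the device itself (typically with `-o ro`). The flag on the host side is what stops a misbehaving guest from remounting it writable and altering the source data.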

Limitation: GPU workloads (large model inference, training) need PCI passthrough, which adds complexity. The typical pattern is to keep GPU workloads on shared inference servers and have the isolated agent call them via API.

The Cost of Data Isolation

The real question isn’t “what do microVMs cost?” It’s “what does data isolation cost compared to the alternatives?”

Scenario: 10,000 agent sessions per day, 5 minutes each, 2 vCPU + 4GB RAM per session.

| Approach | Monthly Cost | Data Isolation? |
| --- | --- | --- |
| Shared containers | ~$800-1,200 | No hardware guarantee |
| One traditional VM per session | ~$15,000-25,000 | Yes, but absurdly expensive |
| MicroVMs (self-managed) | ~$1,000-1,500 | Yes, hardware-enforced |
| AWS AgentCore (managed) | ~$2,000-3,500 | Yes, zero ops burden |

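MicroVMs cost 20-50% more than raw containers but 85-95% less than traditional VMs. When guests are small (a few hundred MB each), you can pack 200+ microVMs on a single 64GB host because the overhead is 5MB per instance, not 1-2GB. At larger guest sizes, the RAM you allocate to each session sets the ceiling, not the monitor overhead.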
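To sanity-check numbers like these for your own workload, here is a back-of-the-envelope sketch. The 256MB guest size is an assumption, picked to show where a 200+ figure is plausible. The session counts, the 5MB and 1GB overheads, and the 64GB host come from the scenario above. It assumes no memory overcommit.

```python
def avg_concurrency(sessions_per_day: int, minutes_per_session: float) -> float:
    """Average sessions running at once (arrival rate x duration)."""
    return sessions_per_day * minutes_per_session / (24 * 60)


def vms_per_host(host_mem_mb: int, guest_mem_mb: int, overhead_mb: int) -> int:
    """How many VMs fit in host memory with no overcommit."""
    return host_mem_mb // (guest_mem_mb + overhead_mb)


if __name__ == "__main__":
    host = 64 * 1024  # 64GB host, in MB

    print(f"average concurrency: {avg_concurrency(10_000, 5):.1f}")
    print("256MB guests, 5MB overhead:", vms_per_host(host, 256, 5))
    print("4GB guests, 5MB overhead:", vms_per_host(host, 4096, 5))
    print("4GB guests, 1GB overhead:", vms_per_host(host, 4096, 1024))
```

This prints an average concurrency of 34.7, then 251, 15, and 12 VMs per host. Peak concurrency will run higher than the average, so size hosts for the busiest hour rather than the daily mean.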

%%{init: {"layout": "dagre"}}%%
flowchart LR
    subgraph "Same 64GB Host"
        A["Traditional VMs<br/>8-16 isolated sessions"]
        B["MicroVMs<br/>200+ isolated sessions"]
    end

The cost you’re avoiding: a data isolation breach in healthcare means HIPAA violation notifications, potential fines up to $1.5M per violation category per year, and months of remediation. The $200-700/month premium over containers is not a cost. It’s the cheapest insurance you’ll ever buy.

How Managed Agent Platforms Handle Isolation

Not every agent platform thinks about data isolation the same way. Here’s how the major managed platforms compare on the one question that matters: what actually separates User A’s data from User B’s at the compute layer?

| Platform | Isolation Model | Per-Session Boundary | Data Destroyed After Session? | Hardware-Enforced? |
| --- | --- | --- | --- | --- |
| AWS Bedrock AgentCore | Dedicated Firecracker microVM per session | Own kernel, own filesystem | Yes, VM destroyed | Yes |
| AWS Bedrock (standard) | Shared managed infrastructure | API-level separation only | N/A (stateless API calls) | No |
| Google Vertex AI Agents | Shared GKE containers + IAM | IAM policy, not compute boundary | No (persists in managed storage) | No |
| Databricks Mosaic AI Agents | Shared Spark clusters + Unity Catalog | Row-level ACLs on data | No (governed by catalog policies) | No |
| Azure AI Agent Service | Shared container pools | Thread-level separation in code | Configurable retention | No |

%%{init: {"layout": "dagre"}}%%
flowchart TB
    subgraph "AgentCore (MicroVM Model)"
        direction LR
        S1[Session A] --> VM1[Dedicated MicroVM<br/>Own kernel + filesystem<br/>Destroyed on completion]
        S2[Session B] --> VM2[Dedicated MicroVM<br/>Own kernel + filesystem<br/>Destroyed on completion]
    end
    subgraph "Most Platforms (Shared Model)"
        direction LR
        S3[Session A] --> SC[Shared Container Pool]
        S4[Session B] --> SC
        SC --> IAM{IAM / ACLs decide<br/>who sees what}
    end

The fundamental difference: most managed agent platforms rely on software-level access controls (IAM policies, row-level security, catalog permissions) to separate users’ data. The compute layer is shared. If the access control has a bug, or the agent finds a way around it, data leaks between users.

AgentCore (and microVM-based approaches in general) enforce isolation at the compute layer itself. There’s no shared process, no shared kernel, no shared filesystem between sessions. Even if your access control logic has a bug, the hardware boundary prevents cross-session access.

Where Each Platform Actually Fits

Databricks Mosaic AI Agents are excellent for data teams building analytics agents that query governed datasets. Unity Catalog handles column-level and row-level permissions well. But the isolation is policy-based, not physical. If your agent writes and executes code (not just SQL), a code execution vulnerability could bypass catalog policies.

Google Vertex AI Agents integrate tightly with Google Cloud IAM and can invoke Cloud Functions or Cloud Run. Isolation depends on how you architect the backend. You could achieve microVM-level isolation by routing agent code execution through a Firecracker layer, but Vertex doesn’t provide it natively.

AWS Bedrock (standard) is a stateless inference API. Great for calling models, but if your agent needs to execute code or maintain session state with user data, you need to bring your own compute layer. That’s what AgentCore adds.

AWS Bedrock AgentCore is the only managed platform today that gives each agent session its own Firecracker microVM. Sessions last up to 8 hours. When a session ends, everything is wiped. This is the managed option for teams who need hardware-enforced data isolation without running their own infrastructure.

What Kinds of Agents Can You Build?

MicroVMs don’t limit what your agent can do. They limit what your agent can access. Inside the microVM, the agent has a full Linux environment — but you still need a well-engineered harness (proper tooling, verification, context) for the agent to be productive within that environment. Here’s what that enables across use cases:

%%{init: {"layout": "dagre"}}%%
flowchart TB
    subgraph "Agents That Benefit Most from MicroVM Isolation"
        direction TB
        A[Data Analysis Agents<br/>Process sensitive datasets<br/>per-user or per-patient] 
        B[Code Execution Agents<br/>Run LLM-generated scripts<br/>Unpredictable workloads]
        C[Document Processing Agents<br/>Parse uploaded files<br/>PDFs, spreadsheets, images]
        D[Browser Agents<br/>Navigate portals, scrape data<br/>Handle credentials per user]
        E[Clinical/Research Agents<br/>Analyze patient cohorts<br/>PHI must stay isolated]
    end

Data Analysis Agents

The agent receives a user’s dataset (CSV, Parquet, database dump), writes Python code to analyze it, and returns insights. The microVM gets the dataset mounted read-only, runs pandas/numpy/sklearn at near-native speed, and is destroyed when done.

Why isolation matters here: the dataset contains sensitive records. Without hardware isolation, a vulnerability in the analysis code could expose one customer’s data to another’s session running on the same host.

Compute capacity: microVMs can be configured with up to 32 vCPUs and 256GB RAM per session. A 10GB dataset with complex transformations runs at 95-98% bare-metal speed. The 5MB overhead is the VM monitor, not a cap on the workload.

Code Execution Agents (Coding Assistants)

The agent generates and runs code to solve user problems: debugging scripts, running test suites, building projects. (This is the AI-native pattern where agents sandbox-verify generated code before it hits production.) Each user’s workspace lives inside its own microVM.

Why isolation matters here: User A’s codebase (which may contain secrets, API keys, proprietary logic) must be invisible to User B’s session, even if both are running on the same physical server.

Document Processing Agents

The agent parses uploaded files: extracting data from PDFs, processing medical records, analyzing contracts. The uploaded file goes into the microVM, gets processed, and results come back. The file never touches shared storage.

Why isolation matters here: uploaded documents often contain the most sensitive data in the workflow. A shared filesystem or container volume means residual data could leak between sessions.

Browser Agents

The agent runs a real browser (Chromium via Playwright or Puppeteer) to navigate web portals, fill forms, or scrape information. Credentials are injected per session.

Why isolation matters here: the browser has full access to the user’s session cookies, portal credentials, and page content. A shared browser environment between users would be a catastrophic data leak.

Clinical and Research Agents

The agent analyzes patient cohorts, runs statistical models on PHI, or processes lab results across a provider’s patient panel. Each provider’s agent session can only see their own patients.

Why isolation matters here: this is the HIPAA case. PHI from Hospital A’s patients must never be accessible to Hospital B’s agent session. Hardware isolation makes this guarantee simple to demonstrate to auditors.

Your Options Today

%%{init: {"layout": "dagre"}}%%
flowchart TB
    Q{"How much infra<br/>do you want to own?"} 
    Q -->|"None"| A[AWS AgentCore<br/>Managed Firecracker microVMs<br/>Per-session isolation<br/>Up to 8-hour sessions]
    Q -->|"Some"| B[SmolVM or Kata Containers<br/>Open-source microVM orchestration<br/>Plugs into existing K8s or standalone<br/>Full control over lifecycle]
    Q -->|"All of it"| C[Raw Firecracker<br/>Build your own Lambda-style runtime<br/>Maximum density and control<br/>Requires significant ops investment]

  • AWS AgentCore: Each agent session gets a dedicated Firecracker microVM. Session ends, VM is destroyed. Inherits AWS compliance posture (SOC 2, HIPAA, PCI). No infra to manage.
  • SmolVM: Open-source Python SDK for creating/destroying agent microVMs. Supports snapshots, read-only mounts, browser sessions. Integrates with LangChain, PydanticAI, OpenAI Agents SDK.
  • Kata Containers: Drop-in replacement for your container runtime that runs each pod inside a microVM. Your Kubernetes workflow stays the same. Isolation becomes hardware-enforced.

Vendor Lock-in and Migration Paths

This is the question that matters at scale: if you start with AgentCore today and costs spike at 50,000 sessions/day, can you leave? Or are you trapped?

The good news: microVM technology is inherently portable because all implementations run standard Linux guests. Your agent code doesn’t know or care whether it’s inside a Firecracker microVM on AWS, a Kata Container on bare metal, or a Cloud Hypervisor instance on GCP. It just sees a Linux environment.

%%{init: {"layout": "dagre"}}%%
flowchart LR
    subgraph "Your Agent Code"
        A[Python/Node runtime<br/>+ agent logic<br/>+ dependencies]
    end
    A -->|"Runs unchanged on"| B[AgentCore<br/>Managed Firecracker]
    A -->|"Runs unchanged on"| C[Kata Containers<br/>Self-hosted K8s]
    A -->|"Runs unchanged on"| D[Raw Firecracker<br/>Bare metal]
    A -->|"Runs unchanged on"| E[SmolVM<br/>Any cloud]

What locks you in vs. what doesn’t:

| Layer | Lock-in Risk | Why |
| --- | --- | --- |
| Agent code (Python, Node, scripts) | None | Runs on any Linux. No platform SDK required inside the guest. |
| Guest OS image (rootfs) | Low | Standard Linux filesystem. Works across any microVM runtime. |
| Orchestration API (create/destroy/snapshot) | Medium | Each platform has its own control plane API. This is what you’d rewrite. |
| Snapshot format | Medium | Firecracker snapshots are Firecracker-specific. You’d re-snapshot on a new platform. |
| Cloud integrations (IAM, KMS, VPC) | High | AWS-specific. You’d swap these for the equivalent on your target cloud. |

The pattern that avoids lock-in: keep a clean separation between your agent logic (runs inside the microVM) and your orchestration logic (creates/destroys microVMs). The agent code is portable by default. The orchestration layer is a thin wrapper around whichever platform you’re on.
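One way to picture that separation is a small interface that the rest of your system codes against. This is a sketch under assumptions, not a real SDK: `SandboxBackend`, `InMemoryBackend`, and `handle_request` are hypothetical names, and the three-method surface is a guess at the minimum most platforms could satisfy.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SandboxBackend(Protocol):
    """Hypothetical orchestration interface: the only platform-specific seam."""

    def create(self, session_id: str) -> str: ...
    def run(self, vm_id: str, command: list[str]) -> str: ...
    def destroy(self, vm_id: str) -> None: ...


@dataclass
class InMemoryBackend:
    """Stand-in backend for tests. A real one would wrap AgentCore,
    Kata, or Firecracker calls behind the same three methods."""

    log: list[str] = field(default_factory=list)

    def create(self, session_id: str) -> str:
        self.log.append(f"create {session_id}")
        return f"vm-{session_id}"

    def run(self, vm_id: str, command: list[str]) -> str:
        self.log.append(f"run {vm_id} {' '.join(command)}")
        return "ok"

    def destroy(self, vm_id: str) -> None:
        self.log.append(f"destroy {vm_id}")


def handle_request(backend: SandboxBackend, session_id: str, command: list[str]) -> str:
    """Platform-agnostic caller: it never imports a vendor SDK."""
    vm_id = backend.create(session_id)
    try:
        return backend.run(vm_id, command)
    finally:
        backend.destroy(vm_id)


if __name__ == "__main__":
    backend = InMemoryBackend()
    print(handle_request(backend, "s1", ["python", "analyze.py"]))
    print(backend.log)
```

Migrating platforms then means writing one new class that satisfies `SandboxBackend`. Callers like `handle_request` stay as they are.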

Migration Scenarios

AgentCore to self-hosted Firecracker (when managed costs hit a ceiling):

Your agent code and rootfs image move unchanged. You build a control plane that manages microVM lifecycle (create, snapshot, restore, destroy) using Firecracker’s REST API directly. Teams typically hit this threshold at 30,000-50,000 daily sessions, where the managed premium exceeds the cost of a small platform team.
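For a sense of what "using Firecracker's REST API directly" involves, here is a minimal sketch of the boot path. Firecracker serves its API over a Unix domain socket, and a VM is configured with a handful of `PUT` calls before an `InstanceStart` action. The file paths and function names below are placeholders, and a production control plane would add networking, jailing, snapshots, and error recovery on top of this.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, which is how Firecracker exposes its API."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def boot_requests(kernel: str, rootfs: str, vcpus: int, mem_mib: int) -> list[tuple[str, dict]]:
    """Ordered (path, body) pairs to configure and start one microVM."""
    return [
        ("/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("/boot-source", {"kernel_image_path": kernel,
                          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                            "is_root_device": True, "is_read_only": False}),
        ("/actions", {"action_type": "InstanceStart"}),
    ]


def boot(socket_path: str, requests: list[tuple[str, dict]]) -> None:
    """Send each request in order; raise on the first non-2xx response."""
    for path, body in requests:
        conn = UnixHTTPConnection(socket_path)
        conn.request("PUT", path, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        resp = conn.getresponse()
        detail = resp.read().decode()
        conn.close()
        if resp.status >= 300:
            raise RuntimeError(f"PUT {path} failed: {resp.status} {detail}")


if __name__ == "__main__":
    # Placeholder paths; print the plan instead of contacting a live socket.
    for path, body in boot_requests("/srv/vmlinux", "/srv/rootfs.ext4", 2, 4096):
        print("PUT", path, json.dumps(body))
```

With a Firecracker process listening on a socket you started it with (for example via `--api-sock`), `boot(socket_path, boot_requests(...))` would bring up one VM. Tearing it down is typically a matter of terminating that Firecracker process.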

AgentCore to Kata Containers (when you want multi-cloud):

Your agent code moves unchanged. Package it as a container image (which you likely already have). Kata runs that image inside a microVM transparently. Your Kubernetes manifests stay the same. You just switch the runtime class from runc to kata.

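# The only change to an existing Pod spec: add runtimeClassName
apiVersion: v1
kind: Pod
metadata:
  name: agent
spec:
  runtimeClassName: kata  # must match a RuntimeClass defined in your cluster
  containers:
    - name: agent
      image: your-agent:latest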

Any microVM platform to another microVM platform:

The agent code is always portable. The orchestration layer (how you create sessions, mount data, set resource limits) is the part you rewrite. For most teams, this is 500-2000 lines of infrastructure code, not a multi-month migration.

The Cost Escalation Playbook

If you start managed and costs grow faster than revenue, here’s the typical path:

%%{init: {"layout": "dagre"}}%%
flowchart LR
    A["Phase 1<br/>AgentCore (managed)<br/>0-10K sessions/day<br/>~$2-3.5K/month"] --> B["Phase 2<br/>Kata on K8s<br/>10-50K sessions/day<br/>~$1-1.5K/month equivalent"]
    B --> C["Phase 3<br/>Raw Firecracker<br/>50K+ sessions/day<br/>~$0.60-0.80/month per 1K sessions"]

Each transition keeps your agent code unchanged. You’re only swapping the orchestration layer. The data isolation guarantee remains hardware-enforced at every stage because all three options use the same underlying technology: KVM hardware virtualization.

The honest lock-in assessment: microVMs have less vendor lock-in than most managed services because the isolation primitive (Linux KVM) is an open kernel feature, not a proprietary technology. Firecracker, Kata, and Cloud Hypervisor all use it. Moving between them is an infrastructure concern, not an application rewrite.

The Bottom Line

The entire case for microVMs in regulated agent platforms comes down to one thing: data isolation that’s enforced by hardware, not by software configuration.

When your AI agent generates unpredictable code while processing sensitive data, you need certainty that session A cannot access session B’s information. Not “our seccomp profile should prevent it.” Not “we’ve configured namespace isolation correctly.” Certainty backed by the same virtualization hardware that separates AWS customers from each other.

That guarantee costs $200-700/month more than containers for a typical workload. It boots in 125ms (or 1-5ms from a snapshot). It runs any code the agent generates at near-native speed. And when someone asks “can you prove Patient A’s data is isolated from Patient B?”, the answer is simple: different virtual machines. Hardware boundary. Same as EC2.

Everything else (the audit logs, the encryption, the access controls) you need regardless of what runtime you choose. Data isolation at the compute layer is the one problem that only microVMs solve at agent-compatible speeds.


Building AI agents that handle sensitive data? I’d love to hear how you’re approaching isolation. Reach out on LinkedIn.