Aimable Lab · Skills · Technical deep dive

An isolated, auditable runtime for tenant-specific AI workflows

A Skill is a named, versioned bundle of prompt + tools + policy that the platform runs in a hardened sandbox. Five kernel-level isolation layers, four trust tiers, SHA-256 bundle integrity, an append-only audit ledger. This page walks through how it is built.

subprocess-v1 adapter · 5 kernel hardening layers · 4 trust tiers · SHA-256 bundle integrity · 60 s / 512 MB / 60 CPU-s · fail-closed by default · Langfuse + event ledger
The user view · live workbench preview

From form to artifact, seen from the user's seat

A faithful mirror of the workbench Skill runner. Same form your team fills in, same state badge, same tool-call timeline, same cost line at the end. The block below shows the same execution from the sandbox side.

Generate deal shortlist

Read deal materials, score against investment criteria, return an Excel summary.

xlsx-generator·v1.2.0
Deal materials (optional)
Q2-pipeline.pdf (1.2 MB) · due-diligence.pdf (880 KB) · valuations.xlsx (215 KB)

Files land in inputs/materials/ — the skill reads them via workspace.fs.list.

Q2 EU deals, exclude crypto

Optional natural-language filter applied during scoring.

Run skill
queued · execution_91c8d4 · live
Live sandbox preview

Watch a Skill execute through the sandbox

Every invocation walks the same preexec ladder: bundle verify → workspace mount → namespaces + cgroup + UID drop → seccomp load → exec → tools → artifact commit → archive. Three scenarios, each with a different sandbox profile.

deal-room / Q2-shortlist/xlsx-generator
xlsx-generator
Skill · v1.2.0
Official
Caller
Aimee · agent (chat)
Space
deal-room / Q2-shortlist
Sandbox profile
code_exec=true · network=deny
Tools (intersected)
workspace.fs.read · workspace.fs.write · workspace.code.python · artifact.commit
Resource caps
60 s wall · 512 MB mem · 60 CPU-s · 10 MiB out
Status
running · preexec ladder
Artifact
outputs/Q2-shortlist.xlsx · artifact 7c4e91 · 87 KB
    First — what is a Skill, exactly?

    A workflow your team owns, packaged

    Take a workflow your team keeps repeating — review a contract, score a deal, prep a meeting brief — and turn it into a named bundle. The bundle declares what it does, which tools it may touch, what guardrails apply and what shape its output takes. People invoke it by name from the workbench. Agents invoke the same one from inside a chat loop. Same call, same audit trail.

    01

    A contract, not a prompt

    Skills are versioned and fingerprinted. The same v1.4.0 today is the same v1.4.0 next quarter: no copy-paste drift, no off-script edits, no surprises for the people calling it.

    02

    Run by the platform, not the caller

    Authorization, isolation, cost attribution, audit trail — all enforced by the runtime. Identical for a human in the workbench and an agent inside a chat loop.

    03

    Comes with its own paper trail

    Every invocation produces a traced, costed, archived run. An admin can inspect any execution end to end — today, last week, last quarter.

    And at runtime — what actually happens

    An LLM in a sandboxed loop, end to end

    01 · Setup

    Inputs staged

    Files validated against the manifest. Workspace mounted, isolated, scoped to your tenant. Sandbox profile applied before any code runs.

    02 · Loop

    LLM picks a tool, runtime runs it, LLM reads the result

    Each step is checked against the allow-list, executed inside the sandbox, and logged. The loop ends when the model produces output that fits the declared schema — typically 3–8 turns.

    03 · Wrap-up

    Output + artifacts

    Structured output goes back to the caller. Files become downloadable artifacts. Cost and trace rows are written. Workspace is archived for retention.
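The loop in step 02 can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: `run_loop`, `fits_schema`, the `model` callable, the step dictionary shape and `max_turns` are all hypothetical names chosen for the example. Only the `unknown_tool_reference` result name comes from this page.

```python
from typing import Any, Callable

# Hypothetical shape: a model step is either a tool request or a final output.
Step = dict[str, Any]


def fits_schema(output: dict[str, Any], required: dict[str, type]) -> bool:
    """Return True when every required key is present with the expected type."""
    return all(isinstance(output.get(key), typ) for key, typ in required.items())


def run_loop(
    model: Callable[[list[Step]], Step],
    tools: dict[str, Callable[..., Any]],
    required: dict[str, type],
    max_turns: int = 8,
) -> dict[str, Any]:
    """Alternate between model and tools until the output fits the schema."""
    history: list[Step] = []
    for _ in range(max_turns):
        step = model(history)
        if step["type"] == "final":
            if fits_schema(step["output"], required):
                return step["output"]
            # Output does not fit the declared schema: report it and keep looping.
            history.append({"type": "error", "error": "schema_mismatch"})
            continue
        name = step["tool"]
        if name not in tools:
            # Not on the allow-list: tell the model, execute nothing.
            history.append(
                {"type": "tool_result", "tool": name, "error": "unknown_tool_reference"}
            )
            continue
        result = tools[name](**step.get("args", {}))
        history.append({"type": "tool_result", "tool": name, "result": result})
    raise RuntimeError("turn budget exhausted before schema-valid output")
```

The allow-list check happens before any tool code runs, so a rejected call costs one turn but never reaches the sandbox.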

    The rest of this page is how the platform makes that work — layer by layer.

    Defense in depth

    Three layers guard every invocation

    Process isolation, trust classification and bundle integrity stack independently. Any one of them can fail closed without taking the others with it.

    Layer 01

    Process isolation

    Every run gets a fresh, walled-off process. Before any code starts, the platform decides what the Skill can see on disk, whether it has a network, how much memory and CPU it gets, and which low-level operations it may call.

    subprocess-v1
    Layer 02

    Trust classification

    Every Skill version carries one of four tiers: official, verified, community or tenant. Admins see the tier in the catalog today. Per-space policy on which tiers may run, and admin approval before tenant-built Skills go live, are tracked under AIM-683 and are not yet enforced at runtime.

    4 tiers
    Layer 03

    Tamper detection

    Every version has a fingerprint computed at publish time. Just before each run, the platform recomputes it and compares. Mismatch — even by one byte — and the run is refused before any code starts.

    SHA-256
    Inside Layer 1 · in plain English

    Five things the operating system itself enforces

    Each one answers a different question: what can the Skill see, where can it reach, how much can it use, what can it call, how loud can it be? Applied in a fixed order — if any single one fails to set up, the Skill never starts at all.

    01

    The filesystem, walled off

    The Skill only sees its own workspace folder. The host disk and other tenants' files sit outside its tree. If it mounts something new, that mount cannot leak back to the host. One known gap: /proc, /sys and /dev remain visible until pivot_root lands (AIM-682).

    kernel · CLONE_NEWNS · MS_SLAVE
    02

    Network access, on or off — at the kernel

    Skills declare in their manifest whether they need the network. When the answer is no, the running Skill literally has no network interface to use — not 'we forgot to wire one up', but 'the kernel has none to hand out'. There is no way around it from inside Python.

    kernel · CLONE_NEWNET
    03

    Memory and CPU, hard-capped

    Each run gets a ceiling — 512 MB of memory, 60 seconds of CPU. If a Skill goes beyond, the operating system kills it and we record exactly that. No other tenant's run is affected, no host pressure leaks across.

    kernel · cgroup v2
    04

    An allow-list of low-level operations

    On top of the Python sandbox, the kernel itself only lets the Skill call a fixed list of low-level operations. Raw sockets, kernel-module tricks, exotic IPC — refused by the kernel, not by our code. Skills get what they need to do their job and nothing more.

    kernel · seccomp BPF
    05

    Logs that can't drown the system

    Each run captures the first 10 MiB of its own logs, marks the rest truncated and moves on. This is operational, not security: it stops a chatty Skill from blocking the runtime or filling up storage.

    kernel · 10 MiB cap
    Layers are fail-closed: a Skill is either fully sandboxed or it does not run at all. There is no partial state where a Skill is, say, memory-capped but free on the network.
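A fail-closed setup can be sketched as an ordered list of steps where the Skill's entry point is only reachable once every step has succeeded. This is an illustrative sketch under assumptions: `run_fail_closed`, `SandboxSetupError` and the step names are hypothetical, and the real hardening steps (namespaces, cgroup, seccomp) are replaced here by plain callables.

```python
from typing import Callable


class SandboxSetupError(RuntimeError):
    """Raised when any hardening step fails; the Skill must not start."""


def run_fail_closed(
    steps: list[tuple[str, Callable[[], None]]],
    start: Callable[[], str],
) -> str:
    """Apply every step in order; call start() only if all of them succeeded."""
    applied: list[str] = []
    for name, apply_step in steps:
        try:
            apply_step()
        except Exception as exc:
            # Abort immediately: no partially sandboxed process is ever started.
            raise SandboxSetupError(
                f"{name} failed after {applied}: {exc}"
            ) from exc
        applied.append(name)
    # Reached only when every layer was applied, in order.
    return start()
```

Because `start()` sits after the loop rather than inside it, there is no code path that launches the Skill with only some layers in place.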
    Resource caps

    Hard ceilings on every execution

    Identical defaults regardless of caller, tier or space. Tenants can lower these in policy but not raise them past the platform cap.

    60 s
    wall-clock timeout · tree-killed via session
    512 MB
    memory.max · OOM kill counter recorded
    60 CPU-s
    cpu budget · 100 ms period
    10 MiB
    stdout / stderr cap · truncated past limit
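The rule that tenants can lower caps but never raise them past the platform ceiling amounts to a field-by-field minimum. A minimal sketch, assuming a hypothetical `Caps` record and `effective_caps` helper, and assuming "512 MB" means 512 × 1024² bytes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Caps:
    wall_seconds: int
    memory_bytes: int
    cpu_seconds: int
    output_bytes: int


# Platform defaults as stated on this page (byte interpretation is an assumption).
PLATFORM_CAPS = Caps(
    wall_seconds=60,
    memory_bytes=512 * 1024 * 1024,
    cpu_seconds=60,
    output_bytes=10 * 1024 * 1024,
)


def effective_caps(tenant: Optional[Caps], platform: Caps = PLATFORM_CAPS) -> Caps:
    """Tenant policy may only tighten a cap; the platform value is the ceiling."""
    if tenant is None:
        return platform
    return Caps(
        wall_seconds=min(tenant.wall_seconds, platform.wall_seconds),
        memory_bytes=min(tenant.memory_bytes, platform.memory_bytes),
        cpu_seconds=min(tenant.cpu_seconds, platform.cpu_seconds),
        output_bytes=min(tenant.output_bytes, platform.output_bytes),
    )
```

A tenant policy asking for 30 s of wall time and 2 GiB of memory would get the 30 s it asked for and the platform's 512 MB.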
    Adapter roadmap

    One interface, three execution backends

    The CodeExecAdapter protocol lets the runtime swap execution backends without changing the Skill or the calling code. Tier 1 ships today; tiers 2 and 3 sit behind the same interface and follow on the same trust model.

    Tier 01 · Live

    subprocess-v1

    OS-level hardening on the runtime host. The five preexec layers above. Default for every execution today; production-ready, fail-closed, instrumented.

    Tier 02 · Architecture ready

    gVisor / Firecracker

    User-space kernel or microVM isolation for workloads that need a stronger boundary. Same adapter contract; registers under a different name. Targeted at higher-risk tenant skills.

    Tier 03 · Architecture ready

    Remote execution

    Off-host execution in e2b or Modal for elastic capacity and stricter physical separation. Same protocol, asynchronous worker pool — no API change for callers.
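One way to picture a swappable backend is a structural protocol plus a name-keyed registry. The sketch below is an assumption about shape only: the page names the `CodeExecAdapter` protocol and the `subprocess-v1` adapter, while the `run` method, `ExecResult`, `register`, `get_adapter` and `FakeAdapter` are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ExecResult:
    exit_code: int
    stdout: bytes
    truncated: bool


class CodeExecAdapter(Protocol):
    """Hypothetical contract: every backend exposes a name and a run() method."""

    name: str

    def run(self, argv: list[str], workspace: str) -> ExecResult: ...


_REGISTRY: dict[str, CodeExecAdapter] = {}


def register(adapter: CodeExecAdapter) -> None:
    if adapter.name in _REGISTRY:
        raise ValueError(f"adapter {adapter.name!r} is already registered")
    _REGISTRY[adapter.name] = adapter


def get_adapter(name: str = "subprocess-v1") -> CodeExecAdapter:
    try:
        return _REGISTRY[name]
    except KeyError:
        # Fail closed: an unknown backend is an error, never a silent fallback.
        raise LookupError(f"no adapter registered as {name!r}") from None


class FakeAdapter:
    """Stand-in used only for this example; it executes nothing."""

    name = "subprocess-v1"

    def run(self, argv: list[str], workspace: str) -> ExecResult:
        return ExecResult(exit_code=0, stdout=b"", truncated=False)
```

Calling code asks for an adapter by name and only ever touches `run`, which is what lets a different backend register later without changes on the caller's side.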

    Trust tier classification

    Four tiers, one runtime

    Tiers are recorded on every Skill version and surfaced in the workbench (badges) and console (filters and policy panels). Same isolation runs for every tier; the tier governs who is allowed to author and where it can run.

    Official

    Aimable-shipped platform skills. Reviewed and signed off internally. Visible to every tenant unless an admin hides them.

    Verified

    Third-party authored, vetted by Aimable. Curated and reviewed before publication; safe defaults for cross-tenant use.

    Community

    Public-marketplace skills. Run in the same sandbox as everything else. Whether community-tier skills are allowed at all is meant to be an admin decision per space; that policy is tracked under AIM-683.

    Tenant

    Private skills authored inside one tenant. Never visible outside that tenant. Forward-deployed engineers ship most of these today.

    Roadmap · Per-space minimum-tier policy and approval flows are tracked under AIM-683. The tier itself is recorded and surfaced today; runtime enforcement for tier-based admission is the next milestone.
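Since tier-based admission is still on the roadmap, the following is only a guess at what such a check could look like, not a description of shipped behavior. `TrustTier`, `may_run`, the per-space `allowed` set and the `approved` flag are hypothetical names.

```python
from enum import Enum


class TrustTier(str, Enum):
    OFFICIAL = "official"
    VERIFIED = "verified"
    COMMUNITY = "community"
    TENANT = "tenant"


def may_run(tier: TrustTier, allowed: frozenset[TrustTier], approved: bool) -> bool:
    """Admit a Skill version only if its tier is allowed in the space.

    Assumption: tenant-built Skills additionally need an admin approval.
    """
    if tier not in allowed:
        return False
    if tier is TrustTier.TENANT and not approved:
        return False
    return True
```

The sketch models the policy as an explicit set of allowed tiers rather than a minimum level, because the four tiers are not obviously a strict ordering.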
    Tamper-evident

    The version you approved is the version that runs

    When a Skill version is published, the platform computes a fingerprint over the whole bundle — prompt, playbook, resources, scripts — and stores it next to the version. Just before every run, the fingerprint is recomputed and compared. One byte different anywhere — a silent edit, a swap in storage, anything — and the run is refused before any code starts. The v1.4.0 your reviewer signed off on is bit-for-bit the same v1.4.0 that runs in production tomorrow.

    Published once, frozen
    xlsx-generator-v1.2.0.tar.gz
    ├── SKILL.md
    ├── resources/
    └── scripts/

    A new version lands in the catalog as one frozen bundle. The platform computes its fingerprint at the same moment. The bundle itself can never be edited in place — a change means a new version row, with a new fingerprint.

    Fingerprint stored alongside
    a8f3b2c4d1e57f9b1c0a3e8d24

    The fingerprint travels with the version like a serial number. Admins see it in the console; auditors compare against it later. Same version, same fingerprint, anywhere the Skill goes.

    Re-checked on every run
    match → ok — Skill runs
    mismatch → refused — no run

    Right before execution, the platform recomputes the fingerprint and compares. Match: the Skill runs. Mismatch: refused, audited, no code executes — no override path.

    Fingerprint algorithm: SHA-256 over the full bundle tarball.
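The publish-time and run-time halves of this check can be sketched with Python's standard `hashlib`. The function and exception names (`bundle_fingerprint`, `verify_bundle`, `BundleIntegrityError`) are hypothetical; only the use of SHA-256 over the bundle tarball comes from this page.

```python
import hashlib
import hmac
from pathlib import Path


class BundleIntegrityError(RuntimeError):
    """Raised when the recomputed fingerprint differs from the stored one."""


def bundle_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the tarball through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path: Path, expected_hex: str) -> None:
    """Refuse the run unless the bundle still matches its published fingerprint."""
    actual = bundle_fingerprint(path)
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise BundleIntegrityError(f"fingerprint mismatch for {path.name}")
```

Reading in chunks keeps memory flat for large bundles, and raising instead of returning a boolean means a caller cannot forget to check the result.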
    Workspace lifecycle

    Ephemeral per-execution filesystem

    Workspaces are tenant-scoped, keyed on (tenant_id, execution_id) and torn down on completion. The on-disk tree never outlives the execution; only the archived tar.gz does, and only for a bounded retention.

    State machine
    01 creating: fetch bundle · verify SHA-256 · prepare paths
    02 ready: bundle extracted · adapter not yet acquired
    03 running: adapter applied · child executing
    04 archiving: tar.gz written to archive storage
    05 archived: tree removed from disk · row pinned
    06 failed: any error before archived · audit trail kept
    07 cancelled: user or scheduler stop · same archive path

    Archive default retention is 7 days. A cleanup job removes the tar.gz and stamps deleted_at; the row remains for audit.
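The seven states above lend themselves to an explicit transition table. The state names come from this page; the exact set of allowed transitions below is an assumption made for illustration (for example, that cancellation is possible up to and including `running`), as are the names `WorkspaceState` and `transition`.

```python
from enum import Enum


class WorkspaceState(str, Enum):
    CREATING = "creating"
    READY = "ready"
    RUNNING = "running"
    ARCHIVING = "archiving"
    ARCHIVED = "archived"
    FAILED = "failed"
    CANCELLED = "cancelled"


S = WorkspaceState

# Assumed transition table: happy path plus failure/cancel exits.
_ALLOWED: dict[WorkspaceState, frozenset[WorkspaceState]] = {
    S.CREATING: frozenset({S.READY, S.FAILED, S.CANCELLED}),
    S.READY: frozenset({S.RUNNING, S.FAILED, S.CANCELLED}),
    S.RUNNING: frozenset({S.ARCHIVING, S.FAILED, S.CANCELLED}),
    S.ARCHIVING: frozenset({S.ARCHIVED, S.FAILED}),
    S.ARCHIVED: frozenset(),   # terminal
    S.FAILED: frozenset(),     # terminal
    S.CANCELLED: frozenset(),  # terminal
}


def transition(current: WorkspaceState, target: WorkspaceState) -> WorkspaceState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in _ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the table as data makes it easy to test that no state can be skipped, such as jumping from `creating` straight to `running` without bundle verification.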

    On-disk layout
    /workspace/{tenant_id}/{execution_id}/
    ├── inputs/                  # bundle, ro
    │   └── bundle/              #   skill payload
    ├── scratch/                 # rw, ephemeral
    ├── outputs/                 # rw, promoted to artifacts
    └── metadata.json            # execution metadata
    • Each execution gets a UID drawn from a pre-created pool (aimable-skill-0…63) so cross-execution UID collisions cannot occur within a pod.
    • Every workspace path is prefixed with tenant_id; cross-tenant requests return cross_tenant_access_error.
    • Artifact commits are append-only — re-committing the same path produces a new row, never an in-place overwrite.
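A path validator for the layout above might look like the sketch below. It is an illustration under assumptions: `workspace_root`, `resolve_in_workspace` and `WorkspaceEscapeError` are hypothetical names, and the check shown (resolve, then confirm the result is still under the workspace root) is one common technique rather than a statement of how the platform does it. Requires Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


class WorkspaceEscapeError(PermissionError):
    """Raised when a requested path resolves outside the workspace tree."""


def workspace_root(base: Path, tenant_id: str, execution_id: str) -> Path:
    """Build <base>/<tenant_id>/<execution_id>, rejecting unsafe components."""
    for part in (tenant_id, execution_id):
        if not part or "/" in part or part in {".", ".."}:
            raise ValueError(f"invalid path component: {part!r}")
    return base / tenant_id / execution_id


def resolve_in_workspace(root: Path, relative: str) -> Path:
    """Resolve a caller-supplied path and refuse anything outside root."""
    resolved_root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (resolved_root / relative).resolve()
    if not candidate.is_relative_to(resolved_root):
        raise WorkspaceEscapeError(f"{relative!r} escapes the workspace")
    return candidate
```

Resolving before comparing matters: a plain string-prefix check would accept `outputs/../../tenant_b/secret`, whereas the resolved path clearly leaves the tree.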
    Tool authorization

    Three paths converge at one intersection gate

    A Skill cannot invoke a tool just because it asked nicely. Three independent declarations are intersected at runtime — the result is the only set the LLM ever sees.

    Path 01 · manifest.allowed_tools

    Skill manifest declaration

    The Skill author lists tools the Skill needs in its manifest. This is intent, not authorization — by itself it grants nothing.

    Path 02 · space.enabled_tools

    Space-level policy

    A space admin enables tools per space. Gated by RBAC (spaces.write). This is the operator's view.

    Path 03 · compose_execution_tools()

    Runtime intersection gate

    compose_execution_tools() takes the intersection, strips network-requiring tools when sandbox.network=deny, and adds platform meta-tools.

    compose_execution_tools() · runtime
    selected = (manifest.allowed_tools  ∩  space.enabled_tools)
                  − tools_requiring_network  if  sandbox.network = "deny"
                  + meta_tools(skill.describe, skill.invoke, artifact.commit)

    Unknown tool names from the manifest are silently dropped at composition. If the LLM ever invokes one anyway it sees an unknown_tool_reference result — the runtime tells the model, not the user.
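The set expression above translates almost directly into Python set operations. The sketch below reuses the function name from this page, but its signature, the `known_tools` and `network_tools` parameters, and the example tool name `web.fetch` used in the usage note are assumptions for illustration, not the contents of `tool_composition.py`.

```python
from typing import Iterable

# Meta-tools named on this page; always added by the runtime.
META_TOOLS = frozenset({"skill.describe", "skill.invoke", "artifact.commit"})


def compose_execution_tools(
    manifest_allowed: Iterable[str],
    space_enabled: Iterable[str],
    known_tools: Iterable[str],
    network_tools: Iterable[str],
    network: str,
) -> frozenset[str]:
    """Return the only tool set the model is shown for this execution."""
    known = frozenset(known_tools)
    # Intersection: a tool must be declared by the Skill AND enabled in the space.
    # Intersecting with `known` drops unknown manifest names without raising.
    selected = frozenset(manifest_allowed) & frozenset(space_enabled) & known
    if network == "deny":
        # No network in the sandbox, so network-requiring tools are stripped.
        selected -= frozenset(network_tools)
    return selected | META_TOOLS
```

With a manifest that lists `workspace.fs.read`, a hypothetical `web.fetch` and a misspelled tool, a space that enables the first two, and `network="deny"`, the result is `workspace.fs.read` plus the three meta-tools.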

    Skill bundle anatomy

    SKILL.md is the contract

    The same manifest the workbench renders, the runtime parses, and the console diffs across versions. Below: a real review-contract Skill, with its manifest first and an invocation after it.

    review-contract
    Tenant Skill
    v1.4.0
    Owner
    Legal team
    Intent

    Read a SaaS contract, flag clauses against the house playbook, return a review memo with risks and recommended redlines.

    Inputs (typed)
    • contract (file): PDF or DOCX, ≤ 25 MB
    • playbook (collection): House playbook, scoped to space
    allowed_tools
    clause.extract · playbook.search · policy.check · artifact.commit
    Sandbox + guardrails
    • sandbox.network = deny (no external retrieval)
    • sandbox.code_exec = false (no Python tool)
    • PII redacted in outputs (presidio)
    • EU data residency enforced upstream
    Output schema
    { memo: markdown, clauses[], risks[], score: number }
    Invocation
    Running
    Caller
    Aimee · agent
    Python runtime today

    One shared venv, deliberately so

    Skills run against the same /app/.venv as the Aimable backend — about fifty curated packages including numpy, pandas, torch, transformers, spacy, docling, openpyxl, httpx, sqlalchemy, langfuse, cryptography and litellm.

    There is no pip install at execution time. That is a deliberate trade-off: it eliminates supply-chain hijack during a run, at the cost of skills being unable to pin their own versions.

    While the platform matures, forward-deployed engineers author Skills with us. Per-skill venvs with SHA-pinned wheels are the next step (AIM-685) — until then, the kernel layers are what limit the blast radius if a pre-installed library misbehaves.

    /app/.venv
    shared venv

    Selected pre-installed packages

    numpy · pandas · scipy · torch · transformers · spacy · docling · openpyxl · pypdf · python-docx · python-pptx · httpx · sqlalchemy · langfuse · cryptography · litellm · guardrails-ai · presidio-analyzer · presidio-anonymizer · trafilatura · + 30 more
    Honest caveat

    Anything in the venv is reachable from a Skill's process. UID drop, seccomp and netns limit what that reachability can do, but a vulnerable library still increases blast radius. The wheels pattern on the roadmap (AIM-685) is intended to close that gap.

    Threat model

    What a Skill can and cannot do

    Stated explicitly. A useful threat model is the one you can reason about — vague claims of 'enterprise-grade isolation' are not what regulated customers buy.

    A skill can
    • Read files staged into inputs/ (bundle resources and caller-provided materials)
    • Write to scratch/ and outputs/ (read-only inputs/ enforced by mount)
    • Execute Python via workspace.code.python when sandbox.code_exec=true
    • Call libraries already present in /app/.venv (numpy, pandas, openpyxl, litellm, …)
    • Reach external URLs only when sandbox.network=allow and a network-requiring tool was approved
    A skill cannot
    • Touch the host filesystem outside its workspace tree (mount-ns + path validator; /proc, /sys and /dev stay visible until AIM-682 lands)
    • See another tenant or another space — every path and service call is tenant-scoped
    • Spawn a privileged process (setuid drops to a low-privilege UID before execve)
    • Issue syscalls outside the seccomp allowlist (kernel returns EPERM)
    • Exhaust memory or CPU (cgroup v2 hard caps; OOM-kill recorded)
    • Run past the wall-clock budget (tree-killed via session leader on timeout)
    • Exfiltrate over the network when sandbox.network=deny (no interfaces in the netns)
    Multi-tenant isolation

    Tenant boundaries are filesystem-deep

    Skills, workspaces, artifacts and audit rows are all keyed on tenant_id. The workspace tree, the UID assignment and every service call enforce it independently.

    tenant_a · space:legal, space:finance · aimable-skill-7 · /workspace/tenant_a/
    tenant_b · space:research, space:ops · aimable-skill-23 · /workspace/tenant_b/
    Every workspace path is prefixed with tenant_id; the path validator rejects any traversal back up.
    Each execution gets a UID from a low-privilege pool (aimable-skill-0…63), so no two concurrent executions share a UID.
    Service methods take an explicit tenant_id arg; mismatches return cross_tenant_access_error at the API.
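Leasing UIDs from a fixed pool can be sketched as below. The pool size and the `aimable-skill-` prefix come from this page; the `UidPool` class, its methods, and the choice to refuse new work when the pool is empty are assumptions for illustration.

```python
import threading


class UidPool:
    """Lease one pre-created user per execution; never hand out a UID twice at once."""

    def __init__(self, size: int = 64, prefix: str = "aimable-skill-") -> None:
        self._free = [f"{prefix}{i}" for i in range(size)]
        self._leased: dict[str, str] = {}
        self._lock = threading.Lock()

    def acquire(self, execution_id: str) -> str:
        with self._lock:
            if execution_id in self._leased:
                raise ValueError(f"{execution_id} already holds a UID")
            if not self._free:
                # Fail closed: waiting or erroring is safer than sharing a UID.
                raise RuntimeError("UID pool exhausted")
            user = self._free.pop(0)
            self._leased[execution_id] = user
            return user

    def release(self, execution_id: str) -> None:
        with self._lock:
            # Return the user to the back of the queue for later reuse.
            self._free.append(self._leased.pop(execution_id))
```

A pool of 64 users is reused over time, so the guarantee this structure gives is uniqueness among executions that are running at the same moment.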
    Auditability

    Every execution is observable and attributable

    Four parallel signals land for every Skill run. Together they answer 'what happened, who paid for it, and what did it touch?'.

    ledger

    Append-only event ledger

    skill_execution_event captures execution_started, state_changed, content_delta, tool_call_start/end, artifact_committed, execution_completed. Resumable; pruned past retention.
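An append-only ledger with resumable reads can be modeled as a list that only grows, plus a sequence-number cursor. This in-memory sketch is an illustration, not the `skill_execution_event` table: `EventLedger`, `LedgerEvent`, `read_after` and the interpretation of "resumable" as "continue from the last sequence number seen" are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LedgerEvent:
    seq: int
    execution_id: str
    kind: str
    payload: dict[str, Any]


class EventLedger:
    """Events are only ever appended; existing rows are never updated."""

    def __init__(self) -> None:
        self._events: list[LedgerEvent] = []

    def append(
        self, execution_id: str, kind: str, payload: Optional[dict[str, Any]] = None
    ) -> LedgerEvent:
        event = LedgerEvent(
            seq=len(self._events) + 1,  # monotonically increasing cursor
            execution_id=execution_id,
            kind=kind,
            payload=dict(payload or {}),
        )
        self._events.append(event)
        return event

    def read_after(self, execution_id: str, last_seq: int = 0) -> list[LedgerEvent]:
        """Return this execution's events newer than the caller's cursor."""
        return [
            e
            for e in self._events
            if e.execution_id == execution_id and e.seq > last_seq
        ]
```

A client that disconnects mid-stream can reconnect with the last `seq` it saw and receive only what it missed.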

    tracing

    Langfuse trace per execution

    Root span carries tenant_id, space_id, principal_id, skill_slug, skill_version, execution_id, parent_execution_id, source. Tool calls and sub-skills nest as child spans.

    metrics

    Prometheus metrics

    skills_execution_active_count, skills_execution_duration_ms, skills_execution_result_total and skills_tool_invocation_total — alerts wire into existing infra.

    billing

    Cost attribution rows

    Every llm.complete call writes a USAGE_EVENT with model, input/output tokens and cost_usd. Rolled up per skill, space, principal in the console audit page.
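The roll-up described for cost attribution is a group-by-and-sum over usage rows. A minimal sketch, assuming a hypothetical `UsageEvent` record whose fields mirror the ones listed on this page, and a hypothetical `roll_up` helper; `Decimal` is used so that currency sums stay exact.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable

ROLLUP_KEYS = frozenset({"skill_slug", "space_id", "principal_id"})


@dataclass(frozen=True)
class UsageEvent:
    skill_slug: str
    space_id: str
    principal_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: Decimal


def roll_up(events: Iterable[UsageEvent], key: str) -> dict[str, Decimal]:
    """Sum cost_usd per skill, space or principal."""
    if key not in ROLLUP_KEYS:
        raise ValueError(f"unsupported roll-up key: {key!r}")
    totals: defaultdict[str, Decimal] = defaultdict(Decimal)  # Decimal() == 0
    for event in events:
        totals[getattr(event, key)] += event.cost_usd
    return dict(totals)
```

The same list of events can be rolled up three ways without re-reading the source rows, which matches how an audit page would pivot between skill, space and principal views.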

    Open improvements

    Tickets we are honest about

    Four MC5-tagged items still on the work list. We surface them so you know what is and is not in production today.

    AIM-682 · MC5

    pivot_root for full filesystem confinement

    Mount-ns + MS_SLAVE prevents host mount propagation today, but /proc, /sys and /dev are still visible. pivot_root into the workspace tree, plus tmpfs for /proc and a minimal /dev, closes the gap.

    AIM-683 · MC5

    Trust-tier enforcement at runtime

    Tiers are recorded and surfaced today. Per-space admin policy, approval flows and a per-space tier filter still need to land. The console UI exists; the backend endpoint does not exist yet.

    AIM-684 · MC5

    Spec / code reconciliation on FR-005

    Lock down the intersection-gate decision: today it is implemented in tool_composition.py and matches the spec; this ticket formalises the test coverage so spec and code stay in sync.

    AIM-685 · MC5

    Wheels pattern for per-skill dependencies

    Replace the single shared venv with per-skill dependency declarations + SHA-pinned wheels packaged at bundle time. Closes the supply-chain reach a Skill currently has into Aimable's internal dependencies.

    Lab project

    Early access for design partners

    We co-author the first set of Skills with technical leads who own a regulated workflow. If you have one in mind — and you want the kernel-level isolation under it — let's talk.

    Run AI in a sandbox you can reason about.

    Book a demo or tell us which workflow your team keeps repeating. We'll package the first Skill with you and walk through the full audit trail it produces.