sandlock-oci: single-sandbox OCI exec via sandlock-init #110

Merged: congwang-mk merged 14 commits into main from oci-exec2 on Jun 24, 2026
@congwang-mk

Contributor

Summary

Re-architects OCI exec so the container workload and every exec'd process run in one sandbox: one seccomp filter, one notify listener, one supervisor with shared runtime state. This replaces the clone-per-exec model (PR #109), where each exec'd process was an independent clone of the policy with its own supervisor (identical rules, but a separate runtime world: independent port remapping, no shared process view).

Supersedes #109.

Why a clone is not enough

A cloned sandbox enforces identical rules but does not share the supervisor's per-sandbox runtime state. Two independent supervisors mean port remapping maps the main process's bind and an exec'd process's connect independently (so the exec'd process cannot reach a service the main process runs), and there is no single supervisor that knows the container's full process set. "Same sandbox" requires all processes to share one seccomp listener, and a listener is shared only by fork-inheritance from the installing task.
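The port-remapping failure can be illustrated with a toy model. This is only a sketch: the `Supervisor` type, its methods, and the port numbers are hypothetical and do not reflect sandlock's actual data structures. It assumes remapping is a per-supervisor table from the port a process asked for to the real port it was given.

```rust
use std::collections::HashMap;

/// Toy stand-in for a supervisor's per-sandbox runtime state (hypothetical type).
pub struct Supervisor {
    /// Port the confined process asked for -> real port chosen by the supervisor.
    port_map: HashMap<u16, u16>,
    next_real: u16,
}

impl Supervisor {
    pub fn new(first_real_port: u16) -> Self {
        Supervisor { port_map: HashMap::new(), next_real: first_real_port }
    }

    /// A confined process binds `virt`; record and return the real port.
    pub fn on_bind(&mut self, virt: u16) -> u16 {
        let real = self.next_real;
        self.next_real += 1;
        self.port_map.insert(virt, real);
        real
    }

    /// A confined process connects to `virt`; this resolves only if
    /// this same supervisor observed the matching bind.
    pub fn on_connect(&self, virt: u16) -> Option<u16> {
        self.port_map.get(&virt).copied()
    }
}

/// Clone-per-exec: the exec'd process is serviced by a second supervisor
/// that never saw the main process's bind.
pub fn clone_per_exec_lookup() -> Option<u16> {
    let mut main_sup = Supervisor::new(40000);
    let exec_sup = Supervisor::new(40000);
    main_sup.on_bind(8080);
    exec_sup.on_connect(8080)
}

/// Single sandbox: one supervisor services both the bind and the connect.
pub fn single_sandbox_lookup() -> Option<u16> {
    let mut sup = Supervisor::new(40000);
    sup.on_bind(8080);
    sup.on_connect(8080)
}
```

In this model `clone_per_exec_lookup()` finds no mapping, while `single_sandbox_lookup()` resolves to the port the main process was actually given.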

Architecture

The container's confined PID-1 is a small static sandlock-init, launched from a memfd (via a new core execveat(AT_EMPTY_PATH) primitive, so it never needs to exist in the rootfs). sandlock-init reads a control channel and fork-execs the workload (RunMain) and each exec'd command (RunExec, whose stdin/stdout/stderr arrive over SCM_RIGHTS). Because the workload and exec'd processes are fork-children of one sandlock-init, they inherit its one seccomp filter and Landlock ruleset and are serviced by the one supervisor. There is exactly one confined-process launch per container, so "same sandbox" is structural, not asserted.
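A minimal model of why fork-children of one init share one sandbox, assuming a simplified control channel. `RunMain` and `RunExec` are the names used above; the message fields, the `Proc` record, and the `filter_id` tag are hypothetical, and no real fork, seccomp, or SCM_RIGHTS call is made here.

```rust
/// Control messages read by the init agent. Variant names follow the PR text;
/// the fields are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Control {
    RunMain { argv: Vec<String> },
    RunExec { argv: Vec<String> },
}

/// Record of a simulated fork-child (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Proc {
    pub pid: u32,
    pub parent: u32,
    /// Stands in for "the seccomp filter and Landlock ruleset this task runs under".
    pub filter_id: u64,
    pub argv: Vec<String>,
    pub is_workload: bool,
}

/// Simulated PID-1 agent: every child it creates inherits its filter.
pub struct Init {
    pid: u32,
    filter_id: u64,
    next_pid: u32,
    pub children: Vec<Proc>,
}

impl Init {
    pub fn new(filter_id: u64) -> Self {
        Init { pid: 1, filter_id, next_pid: 2, children: Vec::new() }
    }

    /// Handle one control message by "forking" a child that inherits
    /// the parent's filter, as a real fork would.
    pub fn handle(&mut self, msg: Control) -> Proc {
        let (argv, is_workload) = match msg {
            Control::RunMain { argv } => (argv, true),
            Control::RunExec { argv } => (argv, false),
        };
        let child = Proc {
            pid: self.next_pid,
            parent: self.pid,
            filter_id: self.filter_id,
            argv,
            is_workload,
        };
        self.next_pid += 1;
        self.children.push(child.clone());
        child
    }
}
```

Because `Init::handle` is the only place a process is created, every `Proc` carries the same `filter_id` by construction, which mirrors the "structural, not asserted" property.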

The daemon launches sandlock-init, hosts the single supervisor, and relays OCI verbs (start -> RunMain, exec -> RunExec, kill --all/delete --force -> group teardown via the daemon since sandlock-init is the group leader) over an InitLink demux. state.pid remains the workload pid, so kill/state/liveness are unchanged for callers.
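The verb relay described above can be sketched as a routing function. The type and function names are hypothetical, and the handling of plain `kill` and non-forced `delete` (left on the pre-existing path against `state.pid`) is an assumption drawn from the statement that those operations are unchanged for callers.

```rust
/// OCI verbs relevant to this PR (hypothetical, simplified representation).
#[derive(Debug, PartialEq)]
pub enum OciVerb {
    Start,
    Exec { argv: Vec<String> },
    Kill { all: bool },
    Delete { force: bool },
}

/// Requests forwarded to sandlock-init over the control channel.
#[derive(Debug, PartialEq)]
pub enum InitRequest {
    RunMain,
    RunExec { argv: Vec<String> },
}

/// Where the daemon sends a verb.
#[derive(Debug, PartialEq)]
pub enum Route {
    /// Relayed to sandlock-init.
    ToInit(InitRequest),
    /// Handled by the daemon itself, since sandlock-init is the group leader.
    DaemonGroupTeardown,
    /// Assumed to stay on the pre-existing path that acts on state.pid.
    ExistingPath,
}

pub fn route(verb: OciVerb) -> Route {
    match verb {
        OciVerb::Start => Route::ToInit(InitRequest::RunMain),
        OciVerb::Exec { argv } => Route::ToInit(InitRequest::RunExec { argv }),
        OciVerb::Kill { all: true } | OciVerb::Delete { force: true } => {
            Route::DaemonGroupTeardown
        }
        OciVerb::Kill { all: false } | OciVerb::Delete { force: false } => {
            Route::ExistingPath
        }
    }
}
```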

What changed

  • sandlock-core: Sandbox.exec_fd -> execveat(AT_EMPTY_PATH) to launch the confined process from an fd; a guard in the chroot exec handler that passes AT_EMPTY_PATH through only when the pathname is empty (so a confined process cannot exec a host binary outside the virtual root); checkpoint_pid() to capture a specific fork-descendant.
  • sandlock-init (new crate): a static, synchronous PID-1 agent, embedded into sandlock-oci via include_bytes! and run from a memfd.
  • sandlock-oci: the daemon launches sandlock-init and relays OCI verbs; the CLI exec surface; group teardown routed through the daemon; checkpoint re-targeted to the workload.
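The AT_EMPTY_PATH guard in the sandlock-core item can be sketched as a pure decision function. This is an illustration of the stated rule only, not the handler's actual code: `ExecDecision` and `decide_execveat` are hypothetical names, and the real handler presumably inspects more state than the two arguments shown.

```rust
/// Linux value of AT_EMPTY_PATH from <fcntl.h>.
pub const AT_EMPTY_PATH: i32 = 0x1000;

/// Outcome of the guard (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ExecDecision {
    /// Exec the object the fd itself refers to; no path lookup is involved.
    PassThroughFd,
    /// Treat the call as a path-based exec and resolve it under the virtual root.
    ResolveInVirtualRoot,
}

/// Pass the fd-based exec through only when the flag is set AND the pathname
/// is empty. A non-empty pathname is still a path lookup, so it must go
/// through the chroot handling even if the flag is present.
pub fn decide_execveat(pathname: &str, flags: i32) -> ExecDecision {
    if (flags & AT_EMPTY_PATH) != 0 && pathname.is_empty() {
        ExecDecision::PassThroughFd
    } else {
        ExecDecision::ResolveInVirtualRoot
    }
}
```

The case the guard closes is the flag combined with a non-empty pathname: without the pathname check, such a call could skip virtual-root resolution.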

Non-TTY only: -t/--console-socket are accepted for runc compatibility but ignored (no PTY yet).

Testing

  • New Landlock-gated e2e oci_exec_same_sandbox: create+start a container, exec a command into it, assert it ran confined inside the container; passed 11/11 stress runs.
  • Full suites green in isolation: sandlock-core 409 lib + 262 integration; sandlock-oci 47 unit + 13 integration; sandlock-init 4. oci_checkpoint_of_running_container passes with the workload (a fork-child of init) captured.
  • Each change passed a spec + quality review; a security review tightened the AT_EMPTY_PATH passthrough to empty-pathname-only.

Known follow-ups (not blocking exec)

  • Checkpoint capture now targets the workload, but restoring a workload checkpoint resurrects it as a standalone direct child (the existing model), losing the init layer (so no exec on the restored container). Full checkpoint -> restore round-trip under the sandlock-init model is a separate, larger effort.
  • Minor hardening: surface daemon-level errors in kill --all/delete (currently fire-and-forget), bounds-check argv[0] against PATH_MAX in the exec buffers, and add an EAGAIN retry loop in send_with_fds.
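Two of the hardening items could look roughly like the following. This is a sketch under assumptions, not the project's code: `check_argv0` and `retry_would_block` are hypothetical helpers, the retry takes a generic closure rather than the real send_with_fds, and a production loop would likely wait for writability instead of yielding.

```rust
use std::io;

/// PATH_MAX on Linux, including the trailing NUL byte.
pub const PATH_MAX: usize = 4096;

/// Reject an argv[0] that would not fit a PATH_MAX-sized buffer with its NUL.
pub fn check_argv0(argv0: &[u8]) -> io::Result<()> {
    if argv0.len() + 1 > PATH_MAX {
        Err(io::Error::new(io::ErrorKind::InvalidInput, "argv[0] exceeds PATH_MAX"))
    } else {
        Ok(())
    }
}

/// Retry `op` while it fails with WouldBlock (how Rust surfaces EAGAIN),
/// up to `max_attempts` calls in total. Any other result is returned as-is.
pub fn retry_would_block<T, F>(max_attempts: u32, mut op: F) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let mut attempts = 0;
    loop {
        match op() {
            Err(e) if e.kind() == io::ErrorKind::WouldBlock && attempts + 1 < max_attempts => {
                attempts += 1;
                // Give the peer a chance to drain its socket buffer.
                std::thread::yield_now();
            }
            other => return other,
        }
    }
}
```

Bounding the attempts keeps a stuck peer from hanging the caller; the last WouldBlock error is returned once the budget is spent.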

A hypothetical kernel SECCOMP_FILTER_FLAG_USE_LISTENER (bind a new filter to an existing listener) would let the daemon spawn exec'd processes directly and retire sandlock-init; it does not exist in mainline, so sandlock-init is the portable mechanism today.

🤖 Generated with Claude Code

Signed-off-by: Cong Wang <cwang@multikernel.io>
@congwang-mk force-pushed the oci-exec2 branch 5 times, most recently from 19500ea to efaaa76 on June 23, 2026 02:57
@congwang-mk merged commit d2bdca0 into main on Jun 24, 2026
12 checks passed
@congwang-mk deleted the oci-exec2 branch on June 24, 2026 01:45