Add corruption-detection test for probabilistic mitigations #848
Open
mjp41 wants to merge 4 commits into microsoft:main
Adds a new functional test, func/corruption_detection, that
validates snmalloc's probabilistic memory-safety mitigations
actually fire on the corruption patterns they are designed to
catch. Without it, regressions that silently weakened a mitigation
would be invisible to the existing suite, since every other test
exercises only the non-failing arm of the integrity checks.
Each scenario runs in a forked child so the expected abort does not
kill the harness. Detection is reported as the child being killed
by SIGABRT/SIGSEGV/SIGBUS/SIGILL; a clean exit means the corruption
went undetected and the test fails.
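
The fork-and-classify step described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the names `is_detection_signal` and `detected_in_child` are hypothetical, and the sketch treats a failed `fork()` the same as "not detected" for brevity.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: true when `sig` is one of the signals the
// description counts as "corruption detected".
inline bool is_detection_signal(int sig)
{
  return sig == SIGABRT || sig == SIGSEGV || sig == SIGBUS || sig == SIGILL;
}

// Hypothetical helper: runs `scenario` in a forked child and reports
// whether the child was killed by a detection signal.
inline bool detected_in_child(void (*scenario)())
{
  pid_t pid = fork();
  if (pid < 0)
    return false; // fork failed; a real harness would report this separately

  if (pid == 0)
  {
    scenario();
    _exit(0); // clean exit: the corruption went undetected
  }

  int status = 0;
  if (waitpid(pid, &status, 0) != pid)
    return false;

  return WIFSIGNALED(status) && is_detection_signal(WTERMSIG(status));
}
```

A scenario that aborts is reported as detected, while one that returns normally is reported as missed, which is the failing case for the test.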
Six scenarios are covered, spanning the local-thread, remote-thread
and large-allocation paths:
* double_free - small alloc, two local frees of the same slot.
  Detected by freelist_backward_edge when the resulting cycle is
  later traversed.
* uaf_freelist - small alloc, free, then write garbage into the
  freed slot's first two words (the obfuscated next/prev). Detected
  by check_prev on the next freelist consumption.
* oob_into_neighbor - tiny allocs, free even slots, overrun from an
  odd live slot into freed neighbours. Detected by check_prev when
  the neighbour is later allocated.
* remote_double_free - small alloc, free locally, then free again
  from a different thread (the second free travels via the remote
  message queue). Detected as !meta->is_unused() in the dealloc
  path.
* remote_uaf - small alloc, free via a different thread, then write
  garbage through the dangling pointer while the slot sits on the
  owning allocator's pending-remote queue. Detected by check_prev
  during handle_message_queue_slow's drain - a code path no other
  test exercises.
* large_double_free - allocation larger than any small sizeclass
  (handled by the chunk allocator and per-chunk metadata rather than
  the slab freelist), freed twice. Detected as !meta->is_unused() in
  the large-dealloc path.
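
To make the uaf_freelist idea concrete, here is a toy model of a free list whose nodes carry an integrity word that is re-checked on consumption. It is only a loose analogy to check_prev: the node layout, the XOR-based signature, the fixed key and all names (`ToyNode`, `ToyFreeList`, `toy_signature`) are assumptions made for illustration and do not reflect snmalloc's actual encoding or key handling. The toy also returns a status instead of trapping.

```cpp
#include <cstdint>

// Toy model only: snmalloc's real node encoding and keys differ.
struct ToyNode
{
  ToyNode* next;
  std::uintptr_t check; // integrity word tying this node to `next`
};

// Fixed key for the sketch; a real allocator would use per-process entropy.
constexpr std::uintptr_t TOY_KEY =
  static_cast<std::uintptr_t>(0x9E3779B97F4A7C15ull);

inline std::uintptr_t toy_signature(const ToyNode* node, const ToyNode* next)
{
  return reinterpret_cast<std::uintptr_t>(node) ^
    reinterpret_cast<std::uintptr_t>(next) ^ TOY_KEY;
}

enum class PopResult
{
  Empty,
  Ok,
  Corrupt
};

struct ToyFreeList
{
  ToyNode* head = nullptr;

  // "Free": link the slot in and record its integrity word.
  void push(ToyNode* node)
  {
    node->next = head;
    node->check = toy_signature(node, head);
    head = node;
  }

  // "Allocate": verify the integrity word before trusting `next`.
  PopResult pop(ToyNode*& out)
  {
    out = nullptr;
    if (head == nullptr)
      return PopResult::Empty;
    if (head->check != toy_signature(head, head->next))
      return PopResult::Corrupt; // slot was modified after it was freed
    out = head;
    head = head->next;
    return PopResult::Ok;
  }
};
```

Writing garbage into both words of a node that is already on the list, as the uaf_freelist scenario does with the freed slot, makes the next `pop` report `Corrupt` in this model.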
The test is Linux-only (uses fork()/waitpid()) and is a no-op when
SNMALLOC_CHECK_CLIENT is not defined, since the mitigations it
relies on are then compiled out.
The test is also instrumented to cooperate with clang source-based
coverage: the forked child re-resolves LLVM_PROFILE_FILE with its
own pid (the parent's %p expansion is otherwise inherited and all
children would write to the same file) and a signal handler flushes
.profraw before re-raising the fatal signal. The runtime entry
points are declared as weak symbols so the test still links in
non-coverage builds.
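
A rough sketch of the coverage cooperation described above, under stated assumptions: `__llvm_profile_write_file` and `__llvm_profile_set_filename` are the clang profile runtime entry points, declared weak here so they are null in non-coverage builds. The helper names (`profile_path_for_pid`, `redirect_child_profile`, `flush_and_reraise`) and the `<stem>.<pid>.profraw` naming scheme are hypothetical; the PR may re-resolve the file name differently.

```cpp
#include <csignal>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Entry points of clang's profile runtime. Declared weak so they resolve
// to null when the binary is built without coverage instrumentation.
extern "C" int __llvm_profile_write_file(void) __attribute__((weak));
extern "C" void __llvm_profile_set_filename(const char*) __attribute__((weak));

// Hypothetical naming scheme: one .profraw file per child pid.
inline std::string profile_path_for_pid(const std::string& stem, pid_t pid)
{
  return stem + "." + std::to_string(pid) + ".profraw";
}

// Call first thing in the forked child. Returns false in non-coverage
// builds, where the weak symbol is null and nothing is done.
inline bool redirect_child_profile(const std::string& stem)
{
  if (__llvm_profile_set_filename == nullptr)
    return false;
  // Kept in static storage so the string stays valid after this call,
  // in case the runtime holds on to the pointer.
  static std::string path;
  path = profile_path_for_pid(stem, getpid());
  __llvm_profile_set_filename(path.c_str());
  return true;
}

// Signal handler: flush coverage data, then die with the original signal
// so the parent still sees the child as killed by it.
extern "C" void flush_and_reraise(int sig)
{
  if (__llvm_profile_write_file != nullptr)
    __llvm_profile_write_file(); // not async-signal-safe; tolerated in a dying test child
  std::signal(sig, SIG_DFL);
  std::raise(sig);
}
```

The child would install `flush_and_reraise` for each fatal signal it expects before running the scenario.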
Picked up automatically by make_tests so it runs as both
func-corruption_detection-fast and func-corruption_detection-check;
the fast variant immediately exits with the "skip" message because
the mitigations are off.
Three issues surfaced in CI for the new corruption-detection test:
1. Linux: `large_double_free` did not detect any corruption. The
   subtest used `LARGE_SIZE = MIN_CHUNK_SIZE * 4 = 64 KiB`, which on
   the default Linux config is `MAX_SMALL_SIZECLASS_SIZE`, i.e. the
   largest *small* sizeclass, so the allocations went through the
   slab free-list path and never reached the chunk-allocator
   double-free check at all. Use `MAX_SMALL_SIZECLASS_SIZE * 2` so
   the size unambiguously falls into the large range. Once the test
   actually exercises the right path, the existing
   `is_backend_owned()` check in `dealloc_remote` (gated on the
   `sanity_checks` mitigation, which is part of `full_checks` in a
   default `SNMALLOC_CHECK_CLIENT` build) flags the double-free.
2. Mac: `-Wunused-function` errors for every `try_*` helper. The
   helpers are referenced only from `run_in_child`, which is already
   gated on `__linux__`. Move the helpers and the LLVM profile
   externs inside the same `#if defined(__linux__)` block so
   non-Linux builds compile cleanly. The non-Linux `main` already
   prints a "skipping" message and returns 0.
3. Windows: `__attribute__((weak))` is not portable to MSVC and
   there is no `SNMALLOC_WEAK` macro in `defines.h`. The weak symbols
   are only used by the Linux-only fork harness for coverage-flush,
   so gating them on `__linux__` is the natural fix. Also use
   `static_cast<uintptr_t>(0xDEADBEEFu)`-style literals for the UAF
   freelist-corruption writes so MSVC does not warn about narrowing
   on 32-bit Windows (C4305/C4309). The exact bit pattern does not
   matter: any non-zero garbage in the freelist node header will
   fail domestication or the doubly-linked invariant check.
Verified locally: all 6 subtests now detect corruption (including
large_double_free, which detects via signal 4 / SIGILL from the
sanity_checks mitigation).
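
The size arithmetic behind the first issue can be checked in isolation. The constants below are stand-ins that mirror the values quoted in the commit message for the default Linux configuration (16 KiB follows from `MIN_CHUNK_SIZE * 4 = 64 KiB`); they are not pulled from snmalloc's headers, and `takes_large_path` is a hypothetical predicate assuming that sizes up to and including the largest small sizeclass stay on the slab path.

```cpp
#include <cstddef>

// Stand-in values mirroring the commit message (default Linux config).
constexpr std::size_t MIN_CHUNK_SIZE = std::size_t{16} * 1024;
constexpr std::size_t MAX_SMALL_SIZECLASS_SIZE = std::size_t{64} * 1024;

// Hypothetical predicate: only sizes strictly above the largest small
// sizeclass are served by the large-allocation path.
constexpr bool takes_large_path(std::size_t size)
{
  return size > MAX_SMALL_SIZECLASS_SIZE;
}

constexpr std::size_t OLD_LARGE_SIZE = MIN_CHUNK_SIZE * 4;
constexpr std::size_t NEW_LARGE_SIZE = MAX_SMALL_SIZECLASS_SIZE * 2;

// The original size landed exactly on the largest small sizeclass.
static_assert(!takes_large_path(OLD_LARGE_SIZE), "old size is still small");
// Doubling the small-sizeclass bound is unambiguously large.
static_assert(takes_large_path(NEW_LARGE_SIZE), "new size is large");
```

Deriving the test size from `MAX_SMALL_SIZECLASS_SIZE` rather than from `MIN_CHUNK_SIZE` keeps the subtest on the intended path even if the sizeclass boundary changes.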
Two more failure modes from CI:
1. aarch64 (qemu cross-build, native arm64): every subtest reported
"child died with unexpected signal 5". `__builtin_trap()` (which
`SNMALLOC_FAST_FAIL` expands to on non-MSVC) emits `ud2` on x86
and is delivered as SIGILL (4), but on aarch64 it emits `brk
#1000` and is delivered as SIGTRAP (5). The mitigation is firing
correctly; the parent just wasn't recognising the signal. Add
SIGTRAP to the accepted-signal set and to the child's
coverage-flush handler list.
2. UBSan / TSan / GWP-ASan builds: corruption "not detected".
Under those builds either the sanitizer intercepts allocation
(replacing snmalloc's mitigated path entirely), or the fast-fail
trap is intercepted by the sanitizer runtime before the parent
sees a fatal signal, or both. Either way the test's premise
("snmalloc raises a fatal signal on corrupt frees") doesn't
hold. Skip the test in those builds, detected via:
* `__has_feature(address_sanitizer)` etc. (clang),
* `__SANITIZE_ADDRESS__` / `__SANITIZE_THREAD__` (gcc),
* `SNMALLOC_ENABLE_GWP_ASAN_INTEGRATION` (snmalloc-defined).
The non-skipped behaviour is unchanged; verified locally that
all 6 subtests still detect (mix of SIGILL and SIGABRT on x86).
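
The architecture dependence in the first item can be observed directly with a small probe. This is a sketch with hypothetical names (`trap_signal`, `is_fast_fail_signal`), assuming a GCC or Clang toolchain on a POSIX system; it forks a child that executes `__builtin_trap()` and reports which signal killed it.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical probe: returns the signal that kills a child executing
// __builtin_trap(), or -1 if that could not be determined.
inline int trap_signal()
{
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0)
    __builtin_trap(); // never returns

  int status = 0;
  if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
    return -1;
  return WTERMSIG(status);
}

// Accepted-signal set after this change: the original four plus SIGTRAP.
inline bool is_fast_fail_signal(int sig)
{
  return sig == SIGABRT || sig == SIGSEGV || sig == SIGBUS ||
    sig == SIGILL || sig == SIGTRAP;
}
```

Per the commit message, the probe would be expected to report SIGILL on x86 and SIGTRAP on aarch64; either value is accepted by `is_fast_fail_signal`.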
When CORRUPTION_TEST_SKIP_SANITIZER was defined, `main()` returned
early via the preprocessor, leaving every `try_*` helper and
`run_in_child` lexically present but unreferenced: the same
`-Wunused-function` failure that originally hit the Mac build.
Introduce CORRUPTION_TEST_ACTIVE = `__linux__ && !sanitizer` and gate
the entire harness (LLVM-profile externs, namespace with all helpers,
`main`'s test-driving block) on it. The two skip-paths (non-Linux,
sanitizer/GWP-ASan) are now distinct top-level branches in `main` and
neither references any helper, so nothing in the preprocessed
translation unit is unused.
Verified: with -DCORRUPTION_TEST_SKIP_SANITIZER the helpers are
preprocessed out entirely (0 occurrences vs 8 without it). Local
non-sanitizer run still detects all 6 subtests.
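
The gating described above might look roughly like this. It is a sketch under assumptions: the macro names come from the commit messages, but whether the PR defines CORRUPTION_TEST_ACTIVE as 0/1 or as defined/undefined is not stated, the sanitizer list is limited to the checks named earlier, and `skip_reason` is a hypothetical helper standing in for the skip branches of `main`.

```cpp
// Clang reports sanitizers via __has_feature; GCC via __SANITIZE_*__ macros.
#if defined(__has_feature)
#  if __has_feature(address_sanitizer) || __has_feature(thread_sanitizer)
#    if !defined(CORRUPTION_TEST_SKIP_SANITIZER)
#      define CORRUPTION_TEST_SKIP_SANITIZER
#    endif
#  endif
#endif
#if defined(__SANITIZE_ADDRESS__) || defined(__SANITIZE_THREAD__) || \
  defined(SNMALLOC_ENABLE_GWP_ASAN_INTEGRATION)
#  if !defined(CORRUPTION_TEST_SKIP_SANITIZER)
#    define CORRUPTION_TEST_SKIP_SANITIZER
#  endif
#endif

// Single switch for the whole harness: Linux and no sanitizer.
#if defined(__linux__) && !defined(CORRUPTION_TEST_SKIP_SANITIZER)
#  define CORRUPTION_TEST_ACTIVE 1
#else
#  define CORRUPTION_TEST_ACTIVE 0
#endif

// Hypothetical helper: null when the harness is compiled in, otherwise
// the message for the skip branch that applies to this build.
inline const char* skip_reason()
{
#if CORRUPTION_TEST_ACTIVE
  return nullptr;
#elif !defined(__linux__)
  return "skipping: requires fork()/waitpid() (Linux only)";
#else
  return "skipping: sanitizer or GWP-ASan build";
#endif
}
```

Everything that only the active harness uses (profile externs, helpers, the test-driving block) would sit inside `#if CORRUPTION_TEST_ACTIVE`, so neither skip branch leaves an unreferenced function behind.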