rtld: share exactly the initial thread's static TLS pages for IA2 #5

Draft
oinoom wants to merge 1 commit into main from slice/20260417-rtld-initial-tls-sharing

Conversation


@oinoom oinoom commented Apr 20, 2026

IA2's current dav1d single-thread branch no longer wants to retag the initial thread's TLS neighborhood from ia2_start() after startup. That runtime-side retag solved the decode path, but it also caused immediate tracer-mode failures in standard builds because the tracer saw a second pkey_mprotect of loader-owned TLS state during process init.

The previous loader-side fix moved that policy into init_tls(), where rtld allocates the initial thread's static TLS block and DTV in the first place. That change kept dav1d working and restored the tracer sweep, but it still used a conservative fixed window: retag the TCB page plus up to eight pages below it, and retag the DTV page separately if it fell outside that range.

That fixed-size window is broader than necessary. A narrower 1-page probe regressed strict single-thread decode, and GDB showed exactly why: during ivf_read() -> dav1d_data_wrap() -> malloc(), PartitionAlloc hit its TLS bookkeeping at fs_base-0x4018, which is five pages below the TCB page on this x86_64 layout. The loader already knows the exact size of the initial thread's static TLS block at this point, so it does not need a magic page count at all.

Replace the fixed "eight pages below the TCB" rule with the precise static-TLS lower bound derived from dl_tls_static_size and TLS_TCB_SIZE. Round that lower bound down to a page boundary and retag exactly that page range plus the TCB page itself. Keep the page-by-page walk: the initial TLS block can span multiple minimal-malloc VMAs, and a single fixed-size pkey_mprotect over the whole interval can still run into a hole and fail with ENOMEM. Retag the DTV page separately only when it falls outside the computed static-TLS range.

This keeps the policy in the loader that allocated the memory, removes the remaining magic page-count heuristic, and narrows the shared loader-heap surface without reintroducing the tracer startup regression.

Validated with the paired IA2 and dav1d branches:

  • fresh ./rewrite.py --llvm-config /usr/bin/llvm-config-18 build
  • dav1d --version returns 0
  • strict single-thread decode of test.ivf returns 0
  • Debug/Tracer/No-libc IA2 sweep: ctest --test-dir build/tracer_debug_standard_computed_20260417 --output-on-failure -j1 -E terminating_threads => 35/35 passed
  • Debug/No-tracer/No-libc IA2 sweep: ctest --test-dir build/standard_debug_notracer_computed_20260417 --output-on-failure -j1 -E terminating_threads => 35/35 passed
  • Release/Libc-compartment IA2 sweep: ctest --test-dir build/libc_release_computed_20260417 --output-on-failure -j1 => 13/13 passed
