
Pre-insert layouts of basic integers for consteval perf #156718

Open
197g wants to merge 5 commits into rust-lang:main from 197g:consteval-layout-perf

Conversation

@197g
Contributor

@197g 197g commented May 18, 2026

While investigating build performance for image, we noticed through perf samples that a surprising amount of time was spent in consteval, more specifically in calls to layout_of. That was curious since we use a few const tables, but definitely not of extreme size nor with complicated computation, and really only with primitive types. Instrumenting a debug build (printf debugging the types being queried) revealed the following pattern; this is the tail end of the summary:

RDTSC      count    type
---------  -------  ----
2968652    48376    u8
2975486    75647    usize
3372234    2024     std::num::NonZero<usize>
3494159    65639    FnDef(DefId(2:705 ~ core[93c5]::f64::{impl#0}::to_bits), [])
6016422    107343   ()
7589954    145971   FnDef(DefId(21:947 ~ pxfm[d8cc]::common::fmla), [])
10072816   175205   i64
15194244   290049   i32
22053787   392617   u32
22284234   322397   bool
40392452   615785   u64
58157921   2110     std::alloc::Layout
179631693  2996946  f64
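
For readers who want to reproduce a tally of this shape, here is a minimal sketch of a per-type counter. It is not the instrumentation used for the numbers above: it measures with `std::time::Instant` instead of RDTSC, and `LayoutProfile`, `record`, and `summary` are hypothetical names that do not exist in rustc.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical profiler: accumulates call count and elapsed time per type name.
#[derive(Default)]
pub struct LayoutProfile {
    entries: HashMap<String, (u64, Duration)>,
}

impl LayoutProfile {
    /// Runs `query` and attributes its duration to `type_name`.
    pub fn record<T>(&mut self, type_name: &str, query: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = query();
        let elapsed = start.elapsed();
        let entry = self.entries.entry(type_name.to_string()).or_default();
        entry.0 += 1;
        entry.1 += elapsed;
        result
    }

    /// Rows of (type name, count, total time), sorted ascending by total time
    /// so that the most expensive types end up at the tail.
    pub fn summary(&self) -> Vec<(String, u64, Duration)> {
        let mut rows: Vec<_> = self
            .entries
            .iter()
            .map(|(name, &(count, total))| (name.clone(), count, total))
            .collect();
        rows.sort_by_key(|row| row.2);
        rows
    }
}
```

Wrapping each layout query as `profile.record("f64", || ...)` and printing `summary()` at exit would give one row per type, ordered like the table above.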

Several questions arise:

  • Is it necessary to query f64 and other primitive types so often? The counts are consistent with one query per evaluation of each MIR line rather than one query per line. There is a cache of layouts for the locals of a block, but it is only local: it is constructed within the query that const-evaluates that block.
  • What's happening while querying the layout of std::alloc::Layout?

To mitigate the first point, and considering that usize is currently always used for indexing into arrays (possibly one of the larger sources of queries), I've attempted to pre-intern the layouts of common integer types and special-case them.
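
As an illustration of the idea only, the following toy model shows a fixed table of layouts that is consulted before the general, cached lookup. None of these types are rustc's: `Ty`, `Layout`, `Ctx`, and `slow_queries` are hypothetical stand-ins, and setting alignment equal to size is a simplification.

```rust
use std::collections::HashMap;

/// Toy stand-in for a compiler type; `Other` represents anything non-primitive.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Ty { U8, U16, U32, U64, Usize, Other(u64) }

/// Toy layout: size and alignment in bytes only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Layout { pub size: u64, pub align: u64 }

pub struct Ctx {
    /// Layouts built once, up front, for the common integer types.
    common: [Layout; 5],
    /// Stand-in for the general query cache.
    cache: HashMap<Ty, Layout>,
    /// Counts how often the general path was taken.
    pub slow_queries: u64,
}

impl Ctx {
    pub fn new(pointer_bytes: u64) -> Self {
        // Simplification: alignment equals size, which real targets do not guarantee.
        let int = |bytes: u64| Layout { size: bytes, align: bytes };
        Ctx {
            common: [int(1), int(2), int(4), int(8), int(pointer_bytes)],
            cache: HashMap::new(),
            slow_queries: 0,
        }
    }

    /// Fast path: common integers never touch the hash map.
    pub fn layout_of(&mut self, ty: Ty) -> Layout {
        match ty {
            Ty::U8 => self.common[0],
            Ty::U16 => self.common[1],
            Ty::U32 => self.common[2],
            Ty::U64 => self.common[3],
            Ty::Usize => self.common[4],
            Ty::Other(size) => self.layout_of_slow(ty, size),
        }
    }

    /// Stand-in for the general layout query: hashed lookup, computed on a miss.
    fn layout_of_slow(&mut self, ty: Ty, size: u64) -> Layout {
        self.slow_queries += 1;
        *self.cache.entry(ty).or_insert_with(|| Layout { size, align: 1 })
    }
}
```

In this model a loop that repeatedly asks for the layout of `Ty::Usize` leaves `slow_queries` at zero, which is the effect the special case is meant to have on array indexing.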

This is obviously fraught with peril; it must not disagree with the other layout computations and a second source of truth is risky in that regard. So the main question is whether to continue down this path or find a way to reduce the count by doing more clever type assignment in expression evaluation maybe.

Performance results have been okay. I've had problems reliably producing numbers in debug builds in the first place; sometimes the instrumentation did not seem to trigger, so there is definitely something I don't understand about the eval. In a --release profile build I seem to observe a moderate effect of ~1-2% overall.

@RalfJung we chatted about related topics, hopefully this isn't taking too much of your time from other matters.

197g added 4 commits May 18, 2026 16:35
Contrary to locals this did not have a query cache but the fixed usize type is used constantly, while projecting places into (array) indices.
@rustbot
Collaborator

rustbot commented May 18, 2026

Some changes occurred to the CTFE machinery

cc @RalfJung, @oli-obk, @lcnr

Some changes occurred to the CTFE / Miri interpreter

cc @rust-lang/miri

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 18, 2026
@rustbot
Collaborator

rustbot commented May 18, 2026

r? @Kivooeo

rustbot has assigned @Kivooeo.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Why was this reviewer chosen?

The reviewer was selected based on:

  • Owners of files modified in this PR: compiler, types
  • compiler, types expanded to 73 candidates
  • Random selection from 19 candidates

@lqd
Member

lqd commented May 18, 2026

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 18, 2026
rust-bors Bot pushed a commit that referenced this pull request May 18, 2026
@rust-bors
Contributor

rust-bors Bot commented May 18, 2026

☀️ Try build successful (CI)
Build commit: 960cc11 (960cc11f53c88c2a75400df4d7224293ba40d345, parent: 5ea817c65e4896167300b7d2550781b98da9901a)

@rust-timer
Collaborator

Finished benchmarking commit (960cc11): comparison URL.

Overall result: ❌✅ regressions and improvements - please read:

Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf.

Next, please: If you can, justify the regressions found in this try perf run in writing along with @rustbot label: +perf-regression-triaged. If not, fix the regressions and do another perf run. Neutral or positive results will clear the label automatically.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -1.0% | [-2.5%, -0.2%] | 14 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results (primary -1.6%, secondary 1.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 4.1% | [0.8%, 7.4%] | 2 |
| Improvements ✅ (primary) | -1.6% | [-2.7%, -0.5%] | 2 |
| Improvements ✅ (secondary) | -1.1% | [-1.5%, -0.6%] | 2 |
| All ❌✅ (primary) | -1.6% | [-2.7%, -0.5%] | 2 |

Cycles

Results (primary 1.6%, secondary 2.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.6% | [1.6%, 1.6%] | 1 |
| Regressions ❌ (secondary) | 4.7% | [3.3%, 6.6%] | 5 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -4.8% | [-5.8%, -3.8%] | 2 |
| All ❌✅ (primary) | 1.6% | [1.6%, 1.6%] | 1 |

Binary size

This perf run didn't have relevant results for this metric.

Bootstrap: 511.688s -> 511.616s (-0.01%)
Artifact size: 400.54 MiB -> 402.60 MiB (0.52%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels May 18, 2026
@197g 197g changed the title from "Consteval layout perf" to "Pre-insert layouts of basic integers for consteval perf" May 18, 2026
@Kivooeo
Member

Kivooeo commented May 18, 2026

is perf good? i still feel like i'm unable to read it, can someone translate it please

-1,0% sounds good

@Mark-Simulacrum
Member

> This is obviously fraught with peril; it must not disagree with the other layout computations and a second source of truth is risky in that regard. So the main question is whether to continue down this path or find a way to reduce the count by doing more clever type assignment in expression evaluation maybe.

Is there a chance we could call the 'real' layout computation during that pre-interning step? I think we do this in a number of other places, though the re-entrancy can be a bit tricky.

IIUC, though, there's not really two sources of truth here -- the real layout computation should never be hit with the pre-interned types, right? If so, can we add unreachable!() or similar into the real layout computation?
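
A minimal sketch of that guard, using toy types rather than rustc's: the general computation panics if it is ever handed a type that should have been served from the pre-interned table. `Ty`, `Layout`, `pre_interned`, and `layout_of_uncached` here are hypothetical stand-ins, and the sizes assume a 64-bit target.

```rust
/// Toy stand-in for a compiler type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ty { U32, Usize, Tuple2 }

/// Toy layout: size and alignment in bytes only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Layout { pub size: u64, pub align: u64 }

/// Hypothetical table lookup; returns `None` for anything not pre-interned.
pub fn pre_interned(ty: Ty) -> Option<Layout> {
    match ty {
        Ty::U32 => Some(Layout { size: 4, align: 4 }),
        Ty::Usize => Some(Layout { size: 8, align: 8 }),
        Ty::Tuple2 => None,
    }
}

/// Stand-in for the general computation. Pre-interned types must never get here,
/// so there is no second definition of their layout that could disagree.
pub fn layout_of_uncached(ty: Ty) -> Layout {
    match ty {
        Ty::U32 | Ty::Usize => {
            unreachable!("{:?} is pre-interned and must not reach the general computation", ty)
        }
        Ty::Tuple2 => Layout { size: 16, align: 8 },
    }
}

/// Entry point: table first, general computation only on a miss.
pub fn layout_of(ty: Ty) -> Layout {
    pre_interned(ty).unwrap_or_else(|| layout_of_uncached(ty))
}
```

The trade-off in this shape is that every caller has to go through `layout_of`; any code that reaches the general computation directly turns a missed fast path into a panic instead of a silent fallback.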

> What's happening while querying the layout of std::alloc::Layout?

This seems like a good idea to dig into, not sure why that would be specifically so slow...

@197g
Contributor Author

197g commented May 20, 2026

> Is there a chance we could call the 'real' layout computation during that pre-interning step?

The problem with this direction is that a LayoutCx<'_> for this call does not exist; we're only interning from the special branches where the layout does not depend on the precise environment (apart from TargetDataLayout). But using the pre-interned layouts for the converse seems roughly feasible? In consteval we could then use an educated heuristic for the most common types and speculatively short-circuit through the pre-interned layouts for those.

The interesting part to me is that this would also avoid a bit of odd double work. When miri hits an expression of common primitive types it builds up a complete Ty. The code for computing its layout then matches on that runtime value and, after distinguishing Int from Uint and from floats, builds up a different runtime value of type abi::Primitive in every branch. That value then itself gets passed on and is matched yet another time to fill in its range information. If we instead dispatch to different layout code paths earlier based on the Ty, we avoid that buildup-teardown repetition. Most of this is not too consequential in terms of performance due to query caching, except if we use the first disambiguation to avoid calling into the general layout query system in the first place (as this PR does). Although if you're calling lots of functions, they all get their own separate empty-layout FnDef interned despite definitely sharing layout.

This is a pattern: building up the layout of primitives mostly constructs one value by filling in fields with constants. That value is then passed off to a more generic function that again dispatches or computes on those fields. (E.g. also happens for FnPtr(..), sized pointers, …). That's a lot of branches taken in the nested function (LayoutData::scalar) that are redundant if they were lifted instead by computing the LayoutData independently.
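
To make the pattern concrete, here is a toy model of the two shapes. It is not rustc code: `Ty`, `Primitive`, `LayoutData`, `scalar`, `layout_two_step`, and `layout_lifted` are simplified, hypothetical stand-ins, and `valid_max` only imitates the range information that the real scalar constructor fills in.

```rust
/// Toy type; the payload of the integer variants is the width in bytes (1..=16).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ty { Int(u8), Uint(u8), F32, F64 }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Primitive { Int { bytes: u8, signed: bool }, Float { bytes: u8 } }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LayoutData { pub size: u64, pub valid_max: u128, pub primitive: Primitive }

/// Generic constructor: matches again on the value the caller just built.
/// Assumes a width of 1..=16 bytes.
pub fn scalar(primitive: Primitive) -> LayoutData {
    let bytes = match primitive {
        Primitive::Int { bytes, .. } | Primitive::Float { bytes } => bytes,
    };
    let valid_max = u128::MAX >> (128 - 8 * u32::from(bytes));
    LayoutData { size: u64::from(bytes), valid_max, primitive }
}

/// Two-step shape: first match builds a `Primitive`, `scalar` takes it apart again.
pub fn layout_two_step(ty: Ty) -> LayoutData {
    let primitive = match ty {
        Ty::Int(bytes) => Primitive::Int { bytes, signed: true },
        Ty::Uint(bytes) => Primitive::Int { bytes, signed: false },
        Ty::F32 => Primitive::Float { bytes: 4 },
        Ty::F64 => Primitive::Float { bytes: 8 },
    };
    scalar(primitive)
}

/// Lifted shape: one match, with the constants written out per branch.
pub fn layout_lifted(ty: Ty) -> LayoutData {
    match ty {
        Ty::F64 => LayoutData {
            size: 8,
            valid_max: u64::MAX as u128,
            primitive: Primitive::Float { bytes: 8 },
        },
        Ty::F32 => LayoutData {
            size: 4,
            valid_max: u32::MAX as u128,
            primitive: Primitive::Float { bytes: 4 },
        },
        // Anything not lifted yet falls back to the two-step path.
        other => layout_two_step(other),
    }
}
```

Both functions return the same value for every input; the lifted version simply has no second match for the branches that were written out, at the cost of duplicating the constants.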

Ideally we would also communicate to the call to layout_of information on what we know about the parameter: if it is likely to be primitive or not and speculation is worth it; or conversely if we know that it is definitely not primitive and you'll just want to query the layout via the query system. That's a micro-optimization though, much bigger fish to fry.
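
One possible shape for such a hint, as a sketch with invented names only (`Hint`, `Stats`, and `size_of_hinted` are not rustc APIs, and the toy returns a size in bytes instead of a full layout): the caller states what it knows, and the table is probed only when that is not known to be wasted work.

```rust
/// Toy stand-in for a compiler type; `Struct` carries its size for simplicity.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ty { U8, Usize, Struct(u64) }

/// What the caller knows about the type it is asking about.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Hint { LikelyPrimitive, NotPrimitive, Unknown }

/// Counters to observe how often each path is taken.
#[derive(Default)]
pub struct Stats { pub probes: u64, pub queries: u64 }

/// Speculative fast path; sizes assume a 64-bit target.
fn primitive_size(ty: Ty) -> Option<u64> {
    match ty {
        Ty::U8 => Some(1),
        Ty::Usize => Some(8),
        Ty::Struct(_) => None,
    }
}

/// Stand-in for the general query; it must be correct for every type.
fn query_size(ty: Ty) -> u64 {
    match ty {
        Ty::U8 => 1,
        Ty::Usize => 8,
        Ty::Struct(size) => size,
    }
}

/// The hint only steers which path is tried first, never the result.
pub fn size_of_hinted(ty: Ty, hint: Hint, stats: &mut Stats) -> u64 {
    if hint != Hint::NotPrimitive {
        stats.probes += 1;
        if let Some(size) = primitive_size(ty) {
            return size;
        }
    }
    stats.queries += 1;
    query_size(ty)
}
```

A wrong hint costs one extra probe or one avoidable query in this model, but it cannot change the answer.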

@197g
Contributor Author

197g commented May 20, 2026

It was quite a lot easier than I thought to have the real layout computation hit the pre-interned cache; just extending the list to all the signed and unsigned variants is sufficient. That works out as a single source of truth here.
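
Roughly, and only as a toy model of that arrangement: the table covers every signed and unsigned width, and the integer branches of the general computation return references into it, so there is exactly one place where those layouts are defined. `IntWidth`, `CommonLayouts`, and `layout_of_uncached` are hypothetical names, and isize/usize are left out for brevity.

```rust
/// Integer widths, usable as table indices via `as usize`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IntWidth { W8, W16, W32, W64, W128 }

/// Toy stand-in for a compiler type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ty { Int(IntWidth), Uint(IntWidth), Unit }

/// Toy layout: size in bytes and signedness only.
#[derive(Debug, PartialEq, Eq)]
pub struct Layout { pub size: u64, pub signed: bool }

/// Table built once; every integer layout lives here and nowhere else.
pub struct CommonLayouts {
    signed: [Layout; 5],
    unsigned: [Layout; 5],
    unit: Layout,
}

impl CommonLayouts {
    pub fn new() -> Self {
        let make = |signed: bool| [1u64, 2, 4, 8, 16].map(|size| Layout { size, signed });
        CommonLayouts {
            signed: make(true),
            unsigned: make(false),
            unit: Layout { size: 0, signed: false },
        }
    }

    /// Lookup used directly by a fast path.
    pub fn int(&self, width: IntWidth, signed: bool) -> &Layout {
        let table = if signed { &self.signed } else { &self.unsigned };
        &table[width as usize]
    }
}

/// Stand-in for the general computation: its integer branches read the same table.
pub fn layout_of_uncached(common: &CommonLayouts, ty: Ty) -> &Layout {
    match ty {
        Ty::Int(width) => common.int(width, true),
        Ty::Uint(width) => common.int(width, false),
        Ty::Unit => &common.unit,
    }
}
```

Because both paths hand out references to the same entries, a pointer comparison with `std::ptr::eq` is enough to check that they cannot disagree.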

Labels

perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

6 participants