
[GC] Precompute: Fix ref.eq comparisons of structs with nested effects#7857

Merged
kripken merged 4 commits into WebAssembly:main from kripken:pc.2
Aug 22, 2025

Conversation

@kripken
Member

@kripken kripken commented Aug 22, 2025

In the testcase here, the nested tee makes the outer struct not precomputable
when we care about effects (a tee is an effect). But we could precompute it
otherwise, and that ended up causing confusion: no caching was done
of the canonical allocation for it, so each time we scanned it we saw
another struct, different in identity.

To fix this, implement the TODO to cache in both modes. I thought that would
be useful only for speed, but it turns out to be necessary for correctness. Luckily,
doing so is quite simple.

@kripken kripken requested a review from tlively August 22, 2025 20:53
@aheejin
Member

aheejin commented Aug 22, 2025

Some basic questions:

  • Why is this case a problem for correctness and not only for speed?
  • When do we preserve side effects, and when do we not? Scanning the code, it looks like we preserve side effects when we actually try to replace a value (for correctness) and don't preserve them when just evaluating whether something is a constant. But if we have to preserve side effects when replacing anyway, why do we need two separate modes? For speed?

@kripken
Member Author

kripken commented Aug 22, 2025

The correctness issue is because we need the caching (which I had forgotten). The caching ensures that each struct.new is represented by a single allocated object in the interpreter. If we don't cache those values, the same struct.new will look like a different allocation each time we compute it. And when it looks different, ref.eq will be wrong, as the testcase shows.
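As a sketch of that identity requirement (illustrative Python only, not Binaryen code; `GCData` and `evaluate` here are hypothetical stand-ins for the interpreter's behavior):

```python
# Why Precompute must cache the canonical allocation for each struct.new:
# ref.eq compares identity, so repeated evaluation must not re-allocate.

class GCData:
    """Stands in for one heap allocation in the interpreter."""
    def __init__(self, fields):
        self.fields = fields

cache = {}

def evaluate(expr_id, fields, use_cache):
    # Each evaluation of the same struct.new must yield the SAME object,
    # or identity comparisons (ref.eq) see a fresh allocation every time.
    if use_cache:
        if expr_id not in cache:
            cache[expr_id] = GCData(fields)
        return cache[expr_id]
    return GCData(fields)

# Without caching, two scans of one struct.new disagree on identity,
# so a ref.eq of that struct with itself would wrongly compute as false:
a = evaluate("struct.new#1", [42], use_cache=False)
b = evaluate("struct.new#1", [42], use_cache=False)
assert a is not b

# With caching, identity is stable, and ref.eq folds correctly:
c = evaluate("struct.new#1", [42], use_cache=True)
d = evaluate("struct.new#1", [42], use_cache=True)
assert c is d
```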

Yes, the two modes are for when we replace and when we don't. When we want to replace an expression, we can't ignore side effects like local.tee (if we replaced it with a precomputed constant, the tee's effect would vanish). But, when we are just propagating values around, we want to see the value of an expression (so if it evaluates to a precomputed constant, we can propagate that; and it's fine if there are side effects, we are leaving them alone).
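The two uses can be sketched like so (hypothetical helper names; the real pass works on Binaryen IR, not strings):

```python
# Two modes of precomputing: replacing requires no side effects,
# while propagating only reads the value and leaves effects in place.

def evaluate(expr):
    """Return (value, has_effects) for an expression (illustrative)."""
    has_effects = "local.tee" in expr
    value = 1 if expr.startswith("(ref.eq") else 42
    return value, has_effects

def try_replace(expr):
    # Replacing the whole expression would erase its children, so any
    # side effect (like a local.tee) must block the replacement.
    value, effects = evaluate(expr)
    return None if effects else value

def try_propagate(expr):
    # Propagation only reads the value; the expression stays in place,
    # so its effects are untouched and can be ignored here.
    value, _ = evaluate(expr)
    return value

expr = "(ref.eq (local.tee $x (struct.new $S)) (local.get $x))"
assert try_replace(expr) is None   # the tee's effect must be kept
assert try_propagate(expr) == 1    # but the value can still be used
```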

@aheejin
Member

aheejin commented Aug 22, 2025

I can understand that if we ignore a side effect (like a tee) and replace some expression with a constant, the tee would vanish. But how do the withEffects and withoutEffects maps convey that? Their values are just GCData, which I presume is the constant that an expression boils down to, and does not seem to contain info on whether a tee is set or not.

@kripken
Member Author

kripken commented Aug 22, 2025

It is just that we need two separate maps so we don't "mix" things. Imagine we compute

(struct.new (local.tee ))

If we compute it when we don't care about effects, we can store PTR in the cache (the pointer to the allocation that represents it, in the interpreter). Then, later, if we do care about effects, if we read from that cache, we'd get PTR, but that is wrong: if we ignored the cache, we'd get the right result (when effects matter), which is to stop at the local.tee because of its effects.

Basically, the two modes can get different results for computing the same expression. To preserve the property that the cache returns the value we would have computed anyhow, we need one cache for each.
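A small sketch of that dual cache (hypothetical Python shapes mirroring the withEffects / withoutEffects maps):

```python
# The two modes can disagree about the very same expression, so their
# cached results must not share entries.

with_effects = {}     # results when effects must be preserved
without_effects = {}  # results when effects may be ignored

PTR = object()  # stands in for the canonical allocation's pointer

def precompute(expr, care_about_effects):
    cache = with_effects if care_about_effects else without_effects
    if expr not in cache:
        if care_about_effects and "local.tee" in expr:
            cache[expr] = None  # a tee blocks precomputing in this mode
        else:
            cache[expr] = PTR
    return cache[expr]

expr = "(struct.new (local.tee $x (...)))"
assert precompute(expr, care_about_effects=False) is PTR
# With a single shared map, the lookup below would return the stale PTR;
# separate maps keep the effects-aware answer correct:
assert precompute(expr, care_about_effects=True) is None
```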

@aheejin
Member

aheejin commented Aug 22, 2025

> It is just that we need two separate maps so we don't "mix" things. Imagine we compute
>
> (struct.new (local.tee ))

Were you gonna put something in (local.tee)?

> If we compute it when we don't care about effects, we can store PTR in the cache (the pointer to the allocation that represents it, in the interpreter). Then, later, if we do care about effects, if we read from that cache, we'd get PTR, but that is wrong: if we ignored the cache, we'd get the right result (when effects matter), which is to stop at the local.tee because of its effects.
>
> Basically, the two modes can get different results for computing the same expression. To preserve the property that the cache returns the value we would have computed anyhow, we need one cache for each.

So in this case, if there is a tee, withoutEffects will give you a constant value, but withEffects wouldn't, because it cannot be replaced by that single constant, right?

Member

@aheejin aheejin left a comment


If #7857 (comment) is a correct interpretation of your intention, LGTM. (Please let me know if not 😅)

@kripken
Member Author

kripken commented Aug 22, 2025

Yes, and yes, exactly.

@kripken kripken merged commit 969bf76 into WebAssembly:main Aug 22, 2025
16 checks passed
@kripken kripken deleted the pc.2 branch August 22, 2025 23:50
Member

@tlively tlively left a comment


Any guesses as to why the fuzzer did not catch this? Do we need to emit more ref.eq?

@kripken
Member Author

kripken commented Aug 25, 2025

I'm really not sure how this was missed by the fuzzer. Still thinking about it / concerned...

kripken added a commit that referenced this pull request Aug 27, 2025
…hich to keep (#7863)

The main changes here are:

* Rather than precompute sometimes while ignoring effects of tees etc.
  and sometimes not, do so in a single manner: while considering the effects
  carefully and deciding which children to keep.
* This lets us remove the dual cache from #7857, as now there is a single
  mode.

But really, this is a rewrite of that core logic from scratch in a cleaner and
less hackish way, while fixing issues with the dual cache and even the
earlier single cache that it fixed:

* We must have a single cached object for each expression. Having a dual
  cache opened us up to bugs, because it turns out we might actually
  cache an object in the propagate phase, and use it in the main phase, and
  each used a different cache. Now both phases do the same thing, so there
  is no risk.
* We must compute effects when there are effects, because they are
  state in the PrecomputingExpressionRunner, a source of bugs with all
  previous caches. I realized that the solution here is simple: note when there
  are effects, and if so, just compute them. This is fine, because the quadratic
  case happens in global objects, which have no effects anyhow (and even inside
  functions it is rare to have such effects). And, after computing the effects, we
  use the single cached heap location, keeping identity stable (a key fix here).

Then, the main visitExpression is straightforward: compute in the most
general manner: NOT trying to replace the entire expression, which requires
no side effects, but allowing them, and looking at the children afterwards to
see which are actually needed. This is necessary to avoid a regression in this
PR, but it actually ends up as a progression, since we can handle more cases,
like (ref.eq (tee) (get)). Before, the tee would stop us, and propagation doesn't
handle this if the value isn't written to a local; now we can just compute it and
keep the tee around.
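That "keep the effectful children, replace the parent" rewrite can be sketched like so (hypothetical, string-based; the real pass operates on IR):

```python
# Compute the expression's value, then keep only the children whose
# effects must be preserved, wrapped in a block ahead of the constant.

def has_effects(child):
    return "local.tee" in child

def replace_with_value(children, computed_value):
    # The parent expression is always dropped; only effectful children
    # survive, so any code size increase is bounded.
    kept = [c for c in children if has_effects(c)]
    if not kept:
        return computed_value
    return "(block {} {})".format(" ".join(kept), computed_value)

# (ref.eq (local.tee ...) (local.get ...)) computes to 1; the tee's
# effect survives, and the effect-free get is dropped:
out = replace_with_value(
    ["(local.tee $x (struct.new $S))", "(local.get $x)"],
    "(i32.const 1)")
assert out == "(block (local.tee $x (struct.new $S)) (i32.const 1))"
```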

Also, I figured out how to avoid the monotonically increasing code size
problem, which e.g. GUFA has, where you see expression A, figure out it
evaluates to constant C, but has effects you must keep, so you emit (A, C).
That lets the constant get optimized, but if you run twice you can add C
twice, unless you carefully look at the parent, which is annoying. Here,
that is avoided because while we may add such a constant, regressing
size, we still make progress because we remove the main expression itself -
we may keep some children, but never the parent, so the increase is
bounded.

This improves not just GC code but is an improvement in some Emscripten size
tests.

There are also some minor theoretical regressions, as a few tests
show, but those are things other passes handle better (like
(return (return ..))), so they only happen when running the pass
by itself (production code using the full pipeline should only get
better).

This is also a slight improvement in compile times.
