fix(query): NOT/predicate/or-join binding regressions + adopt-vector >32 corruption#816
Merged
Conversation
…e resolver
The legacy Relation engine's `-resolve-clause*` cases for predicates,
NOT, NOT-JOIN, and OR-JOIN-with-required-vars all used `check-all-bound`
/ `check-some-bound` to raise eagerly when their input vars weren't yet
bound. But the iterative resolver in `datahike.tools/resolve-clauses`
already supports clause deferral via nil-return — `bind-by-fn` uses it,
and the resolver re-queues failed clauses for the next pass until a
fixed point is reached.
The asymmetry meant that a perfectly resolvable query could be rejected
when a NOT/predicate sat after a binder chain that itself needed the
resolver's retry pass. Concretely:
```clojure
[(get-else $ ?e :v :__null__) ?vv]   ; deferred until ?e bound
[(get-else $ ?e :w :__null__) ?ww]
[(?fn ?vv ?ww) ?v1]                  ; deferred until ?vv ?ww bound
(not [(contains? #{…} ?v1)])         ; eagerly raised on ?v1 unbound
[?e :marker true]                    ; would have bound ?e
```
Fixed by switching the four eager call sites to `(when (X-bound? …) …)`
defer patterns, mirroring `bind-by-fn`. When a query is genuinely
unsolvable, the resolver still raises — now with the more accurate
"Cannot resolve any more clauses" listing every unresolvable clause,
instead of a single misleading "Insufficient bindings" pointing at
whichever clause happened to be checked first.
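The deferral mechanism can be sketched outside Datahike. This is a conceptual Python model, not the project's actual code: a clause handler signals "defer" by returning nothing instead of raising, and the driver re-queues deferred clauses until a pass makes no progress, at which point it reports every unresolvable clause at once.

```python
def resolve_clauses(clauses, bound):
    """Fixed-point resolver sketch. Each clause is (needs, provides):
    sets of vars it requires and vars it binds. `bound` is the set of
    vars already bound."""
    pending = list(clauses)
    while pending:
        progressed = False
        deferred = []
        for needs, provides in pending:
            if needs <= bound:            # all inputs bound: resolve now
                bound |= provides
                progressed = True
            else:                          # defer instead of raising eagerly
                deferred.append((needs, provides))
        if not progressed:
            # genuinely unsolvable: report *all* stuck clauses
            raise ValueError(f"Cannot resolve any more clauses: {deferred}")
        pending = deferred
    return bound

# Mirrors the example above: the NOT-like clause on ?v1 sits before the
# clause that binds ?e, yet everything resolves on the second pass.
clauses = [({"?e"}, {"?vv"}),            # get-else, needs ?e
           ({"?vv", "?ww"}, {"?v1"}),    # fn call, needs ?vv ?ww
           ({"?v1"}, set()),             # NOT/predicate, needs ?v1
           (set(), {"?e", "?ww"})]       # pattern clause, binds ?e ?ww
```

An eager checker would reject this query at the third clause on the first pass; the deferring driver resolves it completely.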
Side fix: `update-ctx-with-stats` now propagates `nil` from `update-fn`
(silently kept a half-built map otherwise — only triggered for stats-on
queries that go through a deferring clause).
Tests:
- New `test-deferred-clause-binding` regression test in query_not_test
covering predicate, NOT, fn-call→NOT chain, and or-join.
- Existing `test-insufficient-bindings` (and the attribute-refs mirror)
used to assert the buggy eager-raise; updated to the new behavior
that matches the compiled engine.
- Existing `test-clause-order` legacy-engine branch was dead weight
(codified the bug); replaced with a single both-engines assertion.
All 1444 unit tests + 6 integration tests pass.
…:binding
The plan-builder's post-ordering NOT-binding check (lower.cljc Step 7
and the equivalent loop in plan.cljc create-plan) walked ordered ops to
build a vars-so-far set, but used `(:bind-vars op)` for `:function`
ops — a key plan-function-op never sets. The op contributed nothing,
so any subsequent NOT/predicate whose only required var came from a
function chain was falsely rejected with "Insufficient bindings".
Concretely, the Odoo `_check_removed_columns` query goes through
pgwire-datahike with `*force-legacy* false` (planner enabled). Its
SQL `format_type(a.atttypid, a.atttypmod) NOT IN (...)` becomes:
```clojure
[(get-else $ ?a_eid :pg_attribute/atttypid :__null__) ?atttypid]
[(get-else $ ?a_eid :pg_attribute/atttypmod :__null__) ?atttypmod]
[(?fmt ?atttypid ?atttypmod) ?v1]
(not [(contains? #{...} ?v1)])
```
After ordering, the planner walked the ops correctly but the
function-op contribution was dropped, leaving ?v1 invisible to the
NOT validation. Same op-cost code at plan.cljc:790 uses :args for
inputs and is fine — only the vars-so-far accumulator was broken.
Fixed by reading `(:binding op)` (the canonical key from
plan-function-op) and passing through `analyze/extract-vars` so
scalar / tuple / list / map binding forms all work uniformly.
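The accumulator fix can be modeled roughly as follows (Python sketch with illustrative op shapes, not the planner's actual data structures): function ops carry their output under a `binding` key whose form may be a scalar, a tuple/list, or a collection form, so the vars-so-far walk must extract variable symbols recursively rather than read a key the op never sets.

```python
def extract_vars(binding):
    """Collect ?-prefixed symbols from any nested binding form:
    scalar "?v", tuple ["?a" "?b"], collection ["?x" "..."], etc."""
    if isinstance(binding, str):
        return {binding} if binding.startswith("?") else set()
    out = set()
    for part in binding:
        out |= extract_vars(part)
    return out

def vars_so_far(ordered_ops):
    """Walk ordered ops, accumulating every var bound so far."""
    seen = set()
    for op in ordered_ops:
        if op["type"] == "function":
            # the bug: reading op.get("bind-vars") here yields nothing,
            # because function ops only ever set "binding"
            seen |= extract_vars(op.get("binding", []))
        else:
            seen |= set(op.get("out-vars", []))
    return seen
```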
Pgwire-datahike's failing Odoo query now succeeds end-to-end on the
planner path. Verified directly: parsed SQL → q-result with both
*force-legacy* true (legacy engine, was already fixed in prior
commit) and false (planner path, this commit's fix).
All 1444 tests pass on both engines; 6983 assertions on legacy,
7010 on planner.
`clojure.lang.PersistentVector/adopt` is a zero-copy wrapper that builds
the vector with `cnt = arr.length`, `shift = 5`, `root = EMPTY_NODE`,
and the data in `tail`. That layout is only valid for vectors of length
≤ 32 (the entire body fits in tail; tailoff is 0). For longer arrays
adopt silently produces a corrupt vector: cnt > 32 makes `tailoff()`
nonzero, but root remains EMPTY_NODE, so any indexed access at
i < tailoff walks `EMPTY_NODE.array` and dies with

```
NPE: Cannot read field "array" because "node" is null
```
The corruption isn't surfaced until a `seq`/`nth`/`take`/`subvec` runs
— at which point the call site looks completely innocent (just `(seq
row)` on a `clojure.lang.PersistentVector`). Hard to spot in code
review.
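The index arithmetic behind the failure can be modeled in a few lines of Python. Indexed reads on a `PersistentVector` are served from `tail` only for i ≥ tailoff; everything below tailoff walks the tree under `root`. Under adopt's layout (cnt = array length, root empty, all data in tail), any length over 32 leaves a nonzero range of indices pointing into an empty tree.

```python
def tailoff(cnt):
    # Models clojure.lang.PersistentVector.tailoff(): the index of the
    # first element stored in the tail (everything below it lives in
    # the tree under root).
    return 0 if cnt < 32 else ((cnt - 1) >> 5) << 5

def broken_indices(length):
    """Indices that would walk EMPTY_NODE (and NPE) if a vector of this
    length were built with adopt's empty-root, everything-in-tail layout."""
    return list(range(tailoff(length)))
```

For length 32 the tailoff is 0 and every index is served from tail, so the layout is valid; for length 35 the tailoff is 32, so indices 0–31 walk the empty tree and blow up on first access.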
Surfaced in production by pgwire-datahike running Odoo's `_auto_init`
field-reflection: `SELECT 34 cols FROM res_partner WHERE id IN (1)`
materialised one row whose backing Object[] had length 35 (34 user
columns + ?eid for :with). pgwire's `format-query-result`
hidden-strip step `(vec (take visible row))` then NPE'd on the first
chunkedSeq access.
The runtime check is essentially free:
`LazilyPersistentVector/createOwning` does the right dispatch — `adopt`
for arr.length ≤ 32 (unchanged hot path), and `PersistentVector/create`
(transient-build, valid tree) for longer arrays.
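The dispatch itself is trivial, sketched here in Python with stand-in constructors for the two Java paths: keep the zero-copy wrapper only when the whole array fits in the tail, and fall back to a proper tree build otherwise.

```python
TAIL_MAX = 32  # widest array the tail-only (adopt-style) layout can hold

def create_owning(arr):
    """createOwning-style dispatch sketch. The dict results stand in for
    PersistentVector/adopt (zero-copy) and PersistentVector/create
    (transient tree build) respectively."""
    if len(arr) <= TAIL_MAX:
        return {"ctor": "adopt", "tail": arr}          # unchanged hot path
    return {"ctor": "create", "items": list(arr)}      # valid tree for >32
```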
Note: the bug has been latent since `adopt-vector` was added — every
query whose materialised tuple width exceeded 32 was at risk. Most
queries don't hit that arity, which is why no existing test caught it.
Tests:
- New `test-find-arity-greater-than-32` regression in query_test:
35-element :find with 35-attr schema, asserts seq/nth/take all
succeed and round-trip through (vec (take k row)) cleanly.
- All 1444 unit tests still pass with DATAHIKE_QUERY_PLANNER=true.
Summary

Three independent fixes to the query engine, all surfaced through pgwire-datahike running Odoo's `--init=base` bootstrap end-to-end.

1. `4a4c53c2` Legacy engine: defer NOT/predicate/or-join (`query.cljc`, `query_stats.cljc`)

   The legacy Relation engine's `-resolve-clause*` cases for predicates, NOT, NOT-JOIN, and OR-JOIN-with-required-vars used `check-all-bound`/`check-some-bound` to raise eagerly when their input vars weren't yet bound. But the iterative resolver in `datahike.tools/resolve-clauses` already supports clause deferral via nil-return — `bind-by-fn` uses it, and the resolver re-queues failed clauses for the next pass. The asymmetry rejected perfectly resolvable queries when a NOT/predicate sat after a binder chain that itself needed a retry pass.

   Fix: switch the four eager call sites to `(when (X-bound? …) …)` defer patterns, mirroring `bind-by-fn`. When a query is genuinely unsolvable, the resolver still raises — now with the more accurate "Cannot resolve any more clauses" listing every unresolvable clause, instead of "Insufficient bindings" pointing at whichever clause happened to be checked first.

   Side fix: `update-ctx-with-stats` now propagates `nil` from `update-fn` (it silently kept a half-built map otherwise — only triggered for stats-on queries that go through a deferring clause).

2. `717b8b58` Planner: NOT-validator reads `:binding`, not `:bind-vars` (`lower.cljc`, `plan.cljc`)

   The plan-builder's post-ordering NOT-binding check walked ordered ops to build a vars-so-far set, but used `(:bind-vars op)` for `:function` ops — a key plan-function-op never sets. The op contributed nothing, so any subsequent NOT/predicate whose only required var came from a function chain was falsely rejected with "Insufficient bindings". Same Odoo `_check_removed_columns` query as #1, but on the planner side. The op-cost code at plan.cljc:790 already uses `:args` for inputs and is fine — only the vars-so-far accumulator was broken. Fixed by reading `(:binding op)` (the canonical key) and passing through `analyze/extract-vars` so scalar/tuple/list/map binding forms all work uniformly.

3. `81ac1261` `adopt-vector` >32-element corruption (`query/execute.cljc`)

   `clojure.lang.PersistentVector/adopt` is zero-copy but only correct when arr.length ≤ 32 — it builds the vector with cnt = arr.length, shift = 5, root = EMPTY_NODE, tail = arr. For longer arrays the result is silently corrupt: cnt > 32 makes `tailoff()` nonzero, but root remains EMPTY_NODE, so any `arrayFor(i)` for i < tailoff walks `EMPTY_NODE.array` and dies with `NPE: Cannot read field "array" because "node" is null`.

   Surfaced by Odoo's `_auto_init` field-reflection: `SELECT 34 cols FROM res_partner WHERE id IN (1)` materialised one row whose backing Object[] had length 35 (34 user columns + ?eid for :with). Latent since `adopt-vector` was added — every query whose materialised tuple width exceeded 32 was at risk; most queries don't hit that arity, which is why no existing test caught it.

   Fix: swap to `clojure.lang.LazilyPersistentVector/createOwning`, which does the right dispatch — the same zero-copy constructor for length ≤ 32 (unchanged hot path), and `PersistentVector/create` (transient build, valid tree) for longer arrays.

Test plan

All 1444 unit tests pass on both the legacy (`*force-legacy*` true) and planner (DATAHIKE_QUERY_PLANNER=true) engines; 6983 assertions on legacy, 7010 on planner.

New regression tests:
- `test-deferred-clause-binding` (predicate / NOT / fn-chain → NOT / or-join)
- `test-find-arity-greater-than-32` (35-element :find with 35-attr schema; asserts seq/nth/take + `(vec (take k row))` round-trip)

Updated tests:
- `test-insufficient-bindings` (and the attribute-refs mirror)
- `test-clause-order` (legacy-only branch was dead weight)

Odoo `--init=base` harness: 11/11