Skip to content

fix(query): NOT/predicate/or-join binding regressions + adopt-vector >32 corruption#816

Merged
whilo merged 4 commits intomainfrom
fix/not-clause-defer-binding
Apr 26, 2026
Merged

fix(query): NOT/predicate/or-join binding regressions + adopt-vector >32 corruption#816
whilo merged 4 commits intomainfrom
fix/not-clause-defer-binding

Conversation

@whilo
Copy link
Copy Markdown
Member

@whilo whilo commented Apr 26, 2026

Summary

Three independent fixes to the query engine, all surfaced through pgwire-datahike running Odoo's --init=base bootstrap end-to-end.

1. 4a4c53c2 Legacy engine: defer NOT/predicate/or-join (query.cljc, query_stats.cljc)

The legacy Relation engine's -resolve-clause* cases for predicates, NOT, NOT-JOIN, and OR-JOIN-with-required-vars used check-all-bound/check-some-bound to raise eagerly when their input vars weren't yet bound. But the iterative resolver in datahike.tools/resolve-clauses already supports clause deferral via nil-return — bind-by-fn uses it, the resolver re-queues failed clauses for the next pass.

The asymmetry rejected perfectly resolvable queries when a NOT/predicate sat after a binder chain that itself needed a retry pass:

[(get-else $ ?e :v :__null__) ?vv]   ; deferred until ?e bound
[(get-else $ ?e :w :__null__) ?ww]
[(?fn ?vv ?ww) ?v1]                  ; deferred until ?vv ?ww bound
(not [(contains? #{…} ?v1)])         ; eagerly raised on ?v1 unbound
[?e :marker true]                    ; would have bound ?e

Fix: switch the four eager call sites to (when (X-bound? …) …) defer patterns, mirroring bind-by-fn. When a query is genuinely unsolvable, the resolver still raises — now with the more accurate "Cannot resolve any more clauses" listing every unresolvable clause, instead of "Insufficient bindings" pointing at whichever clause happened to be checked first.

Side fix: update-ctx-with-stats now propagates nil from update-fn (silently kept a half-built map otherwise — only triggered for stats-on queries that go through a deferring clause).

2. 717b8b58 Planner: NOT-validator reads :binding, not :bind-vars (lower.cljc, plan.cljc)

The plan-builder's post-ordering NOT-binding check walked ordered ops to build a vars-so-far set, but used (:bind-vars op) for :function ops — a key plan-function-op never sets. The op contributed nothing, so any subsequent NOT/predicate whose only required var came from a function chain was falsely rejected with "Insufficient bindings".

Same Odoo _check_removed_columns query as #1, but on the planner side. Same op-cost code at plan.cljc:790 already uses :args for inputs and is fine — only the vars-so-far accumulator was broken. Fixed by reading (:binding op) (the canonical key) and passing through analyze/extract-vars so scalar/tuple/list/map binding forms all work uniformly.

3. 81ac1261 adopt-vector >32-element corruption (query/execute.cljc)

clojure.lang.PersistentVector/adopt is zero-copy but only correct when arr.length <= 32 — it builds the vector with cnt = arr.length, shift = 5, root = EMPTY_NODE, tail = arr. For longer arrays the result is silently corrupt: cnt > 32 implies tailoff() = cnt-32 > 0, but root remains EMPTY_NODE, so any arrayFor(i) for i < tailoff walks EMPTY_NODE.array and dies with NPE: Cannot read field "array" because "node" is null.

Surfaced by Odoo's _auto_init field-reflection: SELECT 34 cols FROM res_partner WHERE id IN (1) materialised one row whose backing Object[] had length 35 (34 user columns + ?eid for :with). Latent since adopt-vector was added — every query whose materialised tuple width exceeded 32 was at risk; most queries don't hit that arity, which is why no existing test caught it.

Fix: swap to clojure.lang.LazilyPersistentVector/createOwning, which does the right dispatch — same zero-copy ctor for length <= 32 (unchanged hot path), PersistentVector/create (transient build, valid tree) for longer arrays.

Test plan

  • Unit suite: 1444/1444 unit tests pass on both legacy (*force-legacy* true) and planner (DATAHIKE_QUERY_PLANNER=true) engines. 6983 assertions on legacy, 7010 on planner. New regression tests:
    • test-deferred-clause-binding (predicate / NOT / fn-chain → NOT / or-join)
    • test-find-arity-greater-than-32 (35-element :find with 35-attr schema; asserts seq/nth/take + (vec (take k row)) round-trip)
  • Updated tests that previously codified the buggy behavior:
    • test-insufficient-bindings (and the attribute-refs mirror)
    • test-clause-order (legacy-only branch was dead weight)
  • End-to-end via pgwire-datahike:
    • pgjdbc ResultSetTest: 80/80
    • Odoo TestORM --init=base harness: 11/11

whilo added 4 commits April 25, 2026 18:05
…e resolver

The legacy Relation engine's `-resolve-clause*` cases for predicates,
NOT, NOT-JOIN, and OR-JOIN-with-required-vars all used `check-all-bound`
/ `check-some-bound` to RAISE eagerly when their input vars weren't yet
bound. But the iterative resolver in `datahike.tools/resolve-clauses`
already supports clause deferral via nil-return — `bind-by-fn` uses it,
and the resolver re-queues failed clauses for the next pass until a
fixed point is reached.

The asymmetry meant that a perfectly resolvable query could be rejected
when a NOT/predicate sat after a binder chain that itself needed the
resolver's retry pass. Concretely:

    [(get-else $ ?e :v :__null__) ?vv]   ; deferred until ?e bound
    [(get-else $ ?e :w :__null__) ?ww]
    [(?fn ?vv ?ww) ?v1]                  ; deferred until ?vv ?ww bound
    (not [(contains? #{…} ?v1)])         ; eagerly raised on ?v1 unbound
    [?e :marker true]                    ; would have bound ?e

Fixed by switching the four eager call sites to `(when (X-bound? …) …)`
defer patterns, mirroring `bind-by-fn`. When a query is genuinely
unsolvable, the resolver still raises — now with the more accurate
"Cannot resolve any more clauses" listing every unresolvable clause,
instead of a single misleading "Insufficient bindings" pointing at
whichever clause happened to be checked first.

Side fix: `update-ctx-with-stats` now propagates `nil` from `update-fn`
(silently kept a half-built map otherwise — only triggered for stats-on
queries that go through a deferring clause).

Tests:
- New `test-deferred-clause-binding` regression test in query_not_test
  covering predicate, NOT, fn-call→NOT chain, and or-join.
- Existing `test-insufficient-bindings` (and the attribute-refs mirror)
  used to assert the buggy eager-raise; updated to the new behavior
  that matches the compiled engine.
- Existing `test-clause-order` legacy-engine branch was dead weight
  (codified the bug); replaced with a single both-engines assertion.

All 1444 unit tests + 6 integration tests pass.
…:binding

The plan-builder's post-ordering NOT-binding check (lower.cljc Step 7
and the equivalent loop in plan.cljc create-plan) walked ordered ops to
build a vars-so-far set, but used `(:bind-vars op)` for `:function`
ops — a key plan-function-op never sets. The op contributed nothing,
so any subsequent NOT/predicate whose only required var came from a
function chain was falsely rejected with "Insufficient bindings".

Concretely, the Odoo `_check_removed_columns` query goes through
pgwire-datahike with `*force-legacy* false` (planner enabled). Its
SQL `format_type(a.atttypid, a.atttypmod) NOT IN (...)` becomes:

    [(get-else $ ?a_eid :pg_attribute/atttypid :__null__) ?atttypid]
    [(get-else $ ?a_eid :pg_attribute/atttypmod :__null__) ?atttypmod]
    [(?fmt ?atttypid ?atttypmod) ?v1]
    (not [(contains? #{...} ?v1)])

After ordering, the planner walked the ops correctly but the
function-op contribution was dropped, leaving ?v1 invisible to the
NOT validation. Same op-cost code at plan.cljc:790 uses :args for
inputs and is fine — only the vars-so-far accumulator was broken.

Fixed by reading `(:binding op)` (the canonical key from
plan-function-op) and passing through `analyze/extract-vars` so
scalar / tuple / list / map binding forms all work uniformly.

Pgwire-datahike's failing Odoo query now succeeds end-to-end on the
planner path. Verified directly: parsed SQL → q-result with both
*force-legacy* true (legacy engine, was already fixed in prior
commit) and false (planner path, this commit's fix).

All 1444 tests pass on both engines; 6983 assertions on legacy,
7010 on planner.
`clojure.lang.PersistentVector/adopt` is a zero-copy wrapper that builds
the vector with `cnt = arr.length`, `shift = 5`, `root = EMPTY_NODE`,
and the data in `tail`. That layout is only valid for vectors of length
≤ 32 (the entire body fits in tail; tailoff = 0). For longer arrays
adopt silently produces a corrupt vector: cnt > 32 implies tailoff =
cnt-32 > 0, but root remains EMPTY_NODE, so any indexed access at
i < tailoff walks `EMPTY_NODE.array` and dies with

    NPE: Cannot read field "array" because "node" is null

The corruption isn't surfaced until a `seq`/`nth`/`take`/`subvec` runs
— at which point the call site looks completely innocent (just `(seq
row)` on a `clojure.lang.PersistentVector`). Hard to spot in code
review.

Surfaced in production by pgwire-datahike running Odoo's `_auto_init`
field-reflection: `SELECT 34 cols FROM res_partner WHERE id IN (1)`
materialised one row whose backing Object[] had length 35 (34 user
columns + ?eid for :with). pgwire's `format-query-result`
hidden-strip step `(vec (take visible row))` then NPE'd on the first
chunkedSeq access.

The runtime check is essentially free: `LazilyPersistentVector
/createOwning` does the right dispatch — `adopt` for arr.length ≤ 32
(unchanged hot path), and `PersistentVector/create` (transient-build,
valid tree) for longer arrays.

Note: the bug has been latent since `adopt-vector` was added — every
query whose materialised tuple width exceeded 32 was at risk. Most
queries don't hit that arity, which is why no existing test caught it.

Tests:
- New `test-find-arity-greater-than-32` regression in query_test:
  35-element :find with 35-attr schema, asserts seq/nth/take all
  succeed and round-trip through (vec (take k row)) cleanly.
- All 1444 unit tests still pass with DATAHIKE_QUERY_PLANNER=true.
@whilo whilo merged commit 1131ae6 into main Apr 26, 2026
12 checks passed
@whilo whilo deleted the fix/not-clause-defer-binding branch April 26, 2026 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant