Add SIMD-accelerated bulk range evaluation for dense numeric doc values by sgup432 · Pull Request #16050 · apache/lucene

sgup432 · 2026-05-12T06:29:57Z

Description

Numeric range queries on dense fields use DocValuesRangeIterator, which is a TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This works well, but for MAYBE blocks (where values partially overlap the query range), it still falls back to per-doc evaluation: each doc is checked individually via values.advance(doc) + values.longValue() + range comparison.

Since DocValuesRangeIterator is a TwoPhaseIterator, DenseConjunctionBulkScorer routes it through the leap-frog path (see here) and intoBitSet() is never called. This means SIMD is never used for MAYBE block evaluation, even though the underlying storage for dense fields is a packed long[] that's ideal for vectorized comparison.

PR changes

For dense singleton numeric fields with a skip index, replace DocValuesRangeIterator with a new BatchDocValuesRangeIterator, which is a plain DocIdSetIterator (not a TwoPhaseIterator). This forces DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the bitset intersection path. I am open to suggestions on whether this is the right approach.

This PR also adds support to do SIMD-accelerated bulk range evaluation for MAYBE (partial overlap) blocks, which seem to be the most expensive case when running range queries through doc values.

For this, we made the following changes:

- Add NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, bitSet, offset): a new bulk API with a per-doc fallback default. Lucene90DocValuesProducer overrides this for dense fields to dispatch to the vectorization layer.
- Add a DocValuesRangeSupport interface with two implementations:
  - PanamaDocValuesRangeSupport: SIMD implementation using the Panama Vector API (LongVector.SPECIES_PREFERRED). Evaluates multiple values per CPU instruction using vectorized range comparisons.
  - DefaultDocValuesRangeSupport: scalar tight-loop fallback.
- VectorizationProvider.getDocValuesRangeSupport() returns the appropriate implementation at startup.
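
As a rough illustration of the contract described above, the per-doc fallback can be sketched without any Lucene types. Here a plain long[] stands in for the dense values and a long[] of 64-bit words stands in for the bit set. The class and helper names are hypothetical; only the parameter list mirrors rangeIntoBitSet, and this is not the code from the PR.

```java
// Hypothetical stand-in for the scalar fallback; not Lucene's actual code.
final class ScalarRangeSketch {

  /**
   * Sets bit (doc - offset) for every doc in [fromDoc, toDoc) whose value
   * lies in [minValue, maxValue]. `values` is indexed by doc ID and `bits`
   * holds 64 bits per word.
   */
  static void rangeIntoBitSet(
      long[] values, int fromDoc, int toDoc,
      long minValue, long maxValue, long[] bits, int offset) {
    for (int d = fromDoc; d < toDoc; d++) {
      long v = values[d];
      if (v >= minValue && v <= maxValue) {
        int bit = d - offset;
        bits[bit >> 6] |= 1L << bit; // Java masks a long shift count to 0..63
      }
    }
  }

  /** Returns true if the given bit is set. */
  static boolean get(long[] bits, int bit) {
    return (bits[bit >> 6] & (1L << bit)) != 0;
  }
}
```

The offset parameter shifts doc IDs into the coordinate space of the destination bit set, so a window that starts at fromDoc can be written into a bit set whose bit 0 corresponds to doc `offset`.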

Benchmarks

MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)

Mode: Throughput (ops/s, higher is better)
JVM args: --add-modules=jdk.incubator.vector
Warmup: 3 x 3s, Measurement: 5 x 5s, Fork: 1

Data Pattern	docCount	Fields	Baseline (ops/s)	Optimized (ops/s)	Change
random	1M	1	59.99	208.27	+247%
random	1M	3	34.83	69.30	+99%
random	1M	5	29.40	65.10	+121%
random	10M	1	6.12	25.16	+311%
random	10M	3	3.41	8.38	+146%
random	10M	5	2.82	7.45	+164%
clustered	1M	1	6231.86	8584.63	+38%
clustered	1M	3	9142.82	35488.66	+288%
clustered	1M	5	7072.30	32583.89	+361%
clustered	10M	1	685.27	1253.04	+83%
clustered	10M	3	8314.53	23913.65	+188%
clustered	10M	5	8855.14	12703.13	+43%

Mac (Apple M-series, 128-bit NEON)

Data Pattern	docCount	Fields	Baseline (ops/s)	Optimized (ops/s)	Change
random	1M	1	123.5	219.5	+78%
random	1M	3	65.3	86.9	+33%
random	1M	5	47.7	80.2	+68%
random	10M	1	12.1	27.1	+124%
random	10M	3	5.3	9.0	+69%
random	10M	5	4.6	7.7	+69%
clustered	1M	1	21,486	27,137	+26%
clustered	1M	3	17,815	56,333	+216%
clustered	1M	5	13,496	77,159	+472%
clustered	10M	1	2,195	3,502	+60%
clustered	10M	3	17,495	57,900	+231%
clustered	10M	5	18,689	31,728	+70%

Every configuration improves: throughput gains range from +38% to +361% on the c5.2xlarge and from +26% to +472% on the Mac.

sgup432 · 2026-05-12T11:03:15Z

@romseygeek Do you mind taking a look at this?

romseygeek

Thanks @sgup432, this looks great! I think we need some more comprehensive testing, and I left some notes on the API itself. I think I'd like @benwtrent or @uschindler's opinions on the vectorization code as that's not something I'm very familiar with.

romseygeek · 2026-05-13T08:49:29Z

+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setCodec(new Lucene104Codec());
+    IndexWriter w = new IndexWriter(dir, iwc);
+    Random r = new Random(42);


This should use random() to get the test seed

I have changed the logic a bit and removed the earlier logic. Please take a look again.

romseygeek · 2026-05-13T08:50:53Z

+  public void testSingleFieldRangeCorrectness() throws Exception {
+    Query q = SortedNumericDocValuesField.newSlowRangeQuery("age", 20, 40);
+    int count = searcher.count(q);
+    assertTrue("Should find some docs in range [20,40]", count > 0);


I don't think we can assert this with the randomly generated values? We could conceivably get all docs with value 1 on some (admittedly unlikely) seed.

I think we can generate some controlled random values and still verify? I have changed the logic to generate values like 0, 1, 2, ..., 99, 0, 1, 2, ..., 99, ... and verify against that. That way it also helps to write and verify the other tests (for intoBitSet() and advance()).
Let me know what you think.
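
With the cyclic pattern described above, the expected hit count for a range can be computed in closed form, which is what makes the assertion deterministic. A minimal sketch, assuming doc i carries value i % 100 and the queried range lies inside [0, 99]; the class and method names are hypothetical and not taken from the PR's tests:

```java
// Hypothetical helper illustrating the cyclic test data; not the PR's test code.
final class CyclicValuesSketch {

  /** Value assigned to doc i: 0, 1, ..., 99, 0, 1, ... */
  static long valueOf(int doc) {
    return doc % 100;
  }

  /** Expected matches in [min, max] (0 <= min <= max <= 99) among docCount docs. */
  static int expectedCount(int docCount, int min, int max) {
    int fullCycles = docCount / 100;
    int remainder = docCount % 100; // the last, partial cycle holds values 0..remainder-1
    int perCycle = max - min + 1;
    int inPartial = Math.max(0, Math.min(remainder - 1, max) - min + 1);
    return fullCycles * perCycle + inPartial;
  }

  /** Brute-force count, used to cross-check the closed form. */
  static int bruteForceCount(int docCount, int min, int max) {
    int count = 0;
    for (int doc = 0; doc < docCount; doc++) {
      long v = valueOf(doc);
      if (v >= min && v <= max) {
        count++;
      }
    }
    return count;
  }
}
```

A test can then assert an exact count for a range such as [20, 40] instead of only asserting that the count is positive.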

romseygeek · 2026-05-13T08:51:11Z

+ */
+public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {
+
+  private static final int DOC_COUNT = 50_000;


This seems like a lot of docs?

Yeah, that is true for a unit test! I think this class needs some fixing; I was using it for some ad hoc testing locally. Let me fix it and add more cases.

@sgup432 I do suggest having a "@nightly" version of the doc count if we really need to have many to test this thing.

We do need it, but I was not sure if a unit test is the right place for that. I can keep this doc count if you folks think that is a good way to go.

Also, on a side note, I think we really need a benchmark for doc values (with skip list enabled) in luceneutil? I see we don't have that yet.
It would be very useful for this change and otherwise. Maybe I can check and plan to add it :)

For now I have kept a ~16k doc count in this test, so that we have 4 blocks (4096 each) to test correctness with. Let me know if it's okay.

Ok, I have added a nightly version of the test as well, with a larger number of docs. I didn't know that we can explicitly annotate such tests with "nightly"; I thought @benwtrent used the term in the context of using a large number of docs. 😄

You can change the number of docs given that nightly tag, or you can flip an entire test class or case on/off :). We just need to be aware of individual costs. Many folks still run ./gradlew check locally and we want to keep that useful and cheap.

romseygeek · 2026-05-13T08:51:59Z

+ * <p>Key behavioral notes:
+ *
+ * <ul>
+ *   <li>Single-field range with a second clause (e.g., MatchAllDocsQuery): goes through {@code


I don't think we're testing for this case? In addition, it needs to be a restrictive filter of some kind, as MatchAllDocsQuery will get rewritten away by BQ.

Removed the section. Added more tests.

romseygeek · 2026-05-13T08:53:07Z

+ *       rangeIntoBitSet()}.
+ * </ul>
+ */
+public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {


I think we should be doing some lower-level testing here, specifically of the intoBitSet call - you can look at TestSkipBlockRangeIterator to get an idea of what to check.

Yeah, agreed. I think this test class is half-baked and still needs more testing. I raised this PR to get initial feedback, and I am still working in the background to ensure more correctness.

Added more tests specifically for intoBitSet(). Please take a look and let me know.

neoremind · 2026-05-13T10:09:31Z

+        scratch[i] = values.get(d + i);
+      }
+      LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
+      VectorMask<Long> inRange =


Wondering if loop unrolling for SIMD could speed this up further (sample)? I suspect that if we profiled this, the bottleneck would be the serial values.get(d + i) gather from packed values. If we could read more compact values with fewer loop iterations and parallelize the range check across more CPU pipelines, that would be a win, but it needs a performance test to confirm.

I think the compiler can do this itself when you aren't dealing with floating point. With floating point, the optimization changes the result, so it's "unsafe".

neoremind · 2026-05-13T10:11:12Z

+        int base = d - offset;
+        while (maskBits != 0) {
+          int bit = Long.numberOfTrailingZeros(maskBits);
+          bitSet.set(base + bit);


The vectorized comparison is great, but here we loop per bit to update the bitset. Since docs are consecutive, maskBits already holds exactly the bits we want, and its max value is 0xFF on AVX-512 (8 lanes). We could OR the mask directly into the bitset word(s) in constant time (at most two word updates) with fewer branches. A sample method in FixedBitSet would be:

    public void orMask(int startBit, long mask, int maskLen) {
      int wordIndex = startBit >> 6;
      int bitOffset = startBit & 63;
      if (bitOffset + maskLen <= 64) {
        bits[wordIndex] |= mask << bitOffset;
      } else {
        bits[wordIndex] |= mask << bitOffset;
        bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
      }
    }

Yeah! Seems like @benwtrent also pointed this out indirectly below.
The current approach is indeed inefficient, and the while loop can be replaced by a single OR.
Thanks for pointing out.
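
To make the single-OR idea concrete, here is a stand-alone sketch that applies it to a plain long[] of 64-bit words and keeps a bit-by-bit reference for comparison. It assumes the mask has no bits set at or above maskLen, and that startBit is already the offset-adjusted index (d - offset in the snippet under review). The class name is hypothetical and nothing here is an existing FixedBitSet method.

```java
// Hypothetical sketch of the proposed orMask; not an existing Lucene API.
final class OrMaskSketch {

  /** ORs the low maskLen bits of mask into words, starting at bit index startBit. */
  static void orMask(long[] words, int startBit, long mask, int maskLen) {
    int wordIndex = startBit >> 6;
    int bitOffset = startBit & 63;
    words[wordIndex] |= mask << bitOffset;
    if (bitOffset + maskLen > 64) {
      // The mask straddles a word boundary: spill its high bits into the next word.
      // bitOffset is at least 1 here, so the shift distance is in 1..63.
      words[wordIndex + 1] |= mask >>> (64 - bitOffset);
    }
  }

  /** Reference version: sets the same bits one at a time. */
  static void setBitByBit(long[] words, int startBit, long mask) {
    while (mask != 0) {
      int bit = startBit + Long.numberOfTrailingZeros(mask);
      words[bit >> 6] |= 1L << bit;
      mask &= mask - 1; // clear the lowest set bit
    }
  }
}
```

Whether this is actually faster than the per-bit loop would still need to be benchmarked, as the thread notes.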

benwtrent

I am surprised to see such good numbers with so much more perf opportunities still left to try!

Good idea :)

benwtrent · 2026-05-13T16:40:06Z

+              .getDocValuesRangeSupport();
+
+  // Static helper so anonymous inner classes can call DocValuesRangeSupport from the outer class
+  static void rangeIntoBitSetVectorized(


nit, the assumption is that it is vectorized, but it might be the "default" implementation. can we just name this rangeIntoBitSet? Or something other than vectorized.

benwtrent · 2026-05-13T16:54:55Z

+      int offset) {
+    // Scalar tight loop — JIT may auto-vectorize this on modern JVMs.
+    for (int d = fromDoc; d < toDoc; d++) {
+      long v = values.get(d);


This tells me we might eventually want an int count = values.get(int[] docIds, long[] dest);

That is a larger change, but I suspect there is perf to be gained lower level just decoding the long values.

@benwtrent
Hmm, doing this in a batched manner like you mentioned would certainly help. It seems like another topic worthy of a separate issue or discussion.

benwtrent · 2026-05-13T16:56:47Z

+
+    // Only use SIMD if vector length >= 4 (AVX-256 or better).
+    // On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
+    if (vectorLen < 4) {


Have you benchmarked this to indicate no improvement here?

I think we can remove this now. I see that PanamaVectorizationProvider throws an exception for < 128 bit anyway, and we default to the scalar approach.
I added this case for 128 bits (vectorLen == 2) when I was seeing some regression on my Mac (128-bit), but that was with some older changes. Let me re-run, confirm and remove this.

@sgup432

Let me re-run, confirm and remove this.

Cool, I would also suggest please use the names and formatting similar to the larger panama vector code files, clearly indicating 128 vs 256 vs 512, etc.

benwtrent · 2026-05-13T16:57:56Z

+      LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
+      VectorMask<Long> inRange =
+          v.compare(VectorOperators.GE, minValue).and(v.compare(VectorOperators.LE, maxValue));
+      long maskBits = inRange.toLong();


It's a huge shame to throw away the maskBits, which is already encoded as a long, especially when we know the bit set is a FixedBitSet and we have access to FixedBitSet.getBits ;)

Hmm yeah, I see what you mean here. I am currently using maskBits and then in the below while loop, for each bit set in maskBits, I set the desired bit in bitSet. This is inefficient indeed!

I guess instead of doing this in a loop, bit by bit, we can instead replace it with a single OR!

we can instead replace it with a single OR!

Exactly! But I am not 100% sure of the correct way to do that, or how it needs to be adjusted for the "offset" thing. But I think there is a better way here.

Let me check that.
@neoremind also pointed this out above with an approach.

@benwtrent I am planning to address this in a separate PR and benchmark it, as this PR is already pretty loaded. Let me know what you think.

benwtrent · 2026-05-13T17:00:25Z

+    for (int d = loopBound; d < toDoc; d++) {
+      long v = values.get(d);
+      if (v >= minValue && v <= maxValue) {
+        bitSet.set(d - offset);
+      }
+    }


I wonder if we will hit windows of density, where v passes our predicate for multiple docs in a row. In that case, we could take advantage of FixedBitSet.set(int startIndex, int endIndex) which would provide a substantial speed up in those dense regions.

This same idea goes for the default, etc. versions.

Hmm good idea.
I think it won't help much in this vectorized approach, where we process at most 7 docs for the tail?

But for the scalar (default) approach, where we process up to 4096 docs in a block, we can probably use FixedBitSet.set(int startIndex, int endIndex) for a batch of matching docs.

Doing something like the below might work? Let me know what you think.

    int runStart = -1;
    for (int d = fromDoc; d < toDoc; d++) {
      long v = values.get(d);
      boolean matches = v >= minValue && v <= maxValue;
      if (matches) {
        if (runStart == -1) runStart = d; // a new run
      } else {
        if (runStart != -1) {
          bitSet.set(runStart - offset, d - offset); // set in a batch
          runStart = -1;
        }
      }
    }
    // for remaining matching docs
    if (runStart != -1) {
      bitSet.set(runStart - offset, toDoc - offset);
    }

@sgup432 something like that for sure. It would need to be benchmarked, I am not sure how common this density is and branch prediction can be a pain :/

Yeah, agreed. I think this is worthy of a separate issue/PR, as it needs more thought.

@sgup432 yeah, a follow up discussion I think for sure.
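
As a quick way to check that the run-batching loop discussed above sets the same bits as the per-doc loop, it can be exercised against a stand-in bit set. This is only a sketch: java.util.BitSet replaces FixedBitSet (both range setters treat the end index as exclusive), a long[] replaces the doc values, and the class and method names are hypothetical. It says nothing about performance, which the thread agrees must be benchmarked.

```java
import java.util.BitSet;

// Stand-alone sketch; java.util.BitSet stands in for Lucene's FixedBitSet.
final class RunBatchingSketch {

  /** Sets bits one run at a time using the range form set(from, to). */
  static void runs(long[] values, int fromDoc, int toDoc,
      long minValue, long maxValue, BitSet bitSet, int offset) {
    int runStart = -1;
    for (int d = fromDoc; d < toDoc; d++) {
      long v = values[d];
      if (v >= minValue && v <= maxValue) {
        if (runStart == -1) {
          runStart = d; // a new run begins
        }
      } else if (runStart != -1) {
        bitSet.set(runStart - offset, d - offset); // end index is exclusive
        runStart = -1;
      }
    }
    if (runStart != -1) {
      bitSet.set(runStart - offset, toDoc - offset); // run reaches the end of the window
    }
  }

  /** Reference: one set() call per matching doc. */
  static void perDoc(long[] values, int fromDoc, int toDoc,
      long minValue, long maxValue, BitSet bitSet, int offset) {
    for (int d = fromDoc; d < toDoc; d++) {
      long v = values[d];
      if (v >= minValue && v <= maxValue) {
        bitSet.set(d - offset);
      }
    }
  }
}
```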

benwtrent · 2026-05-13T17:00:54Z

+ *
+ * @lucene.internal
+ */
+public interface DocValuesRangeSupport {


I think this support path, etc. all matches our existing patterns. Seems OK to me.

romseygeek

Thanks @sgup432, I took another look and left some suggestions.

romseygeek · 2026-05-15T08:18:45Z

+      int toDoc,
+      long minValue,
+      long maxValue,
+      org.apache.lucene.util.FixedBitSet bitSet,


Add an import for FixedBitSet?

romseygeek · 2026-05-15T08:24:35Z

+
+    BatchDocValuesRangeIterator iter = new BatchDocValuesRangeIterator(dv, skipper, 20, 40);
+    List<Integer> actual = new ArrayList<>();
+    for (int d = iter.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = iter.nextDoc()) {


Should this be advance()? It looks like you're only exercising nextDoc() on the range iterator.

Hmm yeah. Fixed it. Renamed the test and verifying using both nextDoc() and advance()

romseygeek · 2026-05-15T08:27:47Z

+        Document doc = new Document();
+        if (i % 2 == 0) {
+          long val = i % 100;
+          doc.add(NumericDocValuesField.indexedField("sparse", val));


This is going to be a MAYBE block, not a YES_IF_PRESENT one - there will be docs within the blocks that fall outside the range of the BatchDocValuesRangeIterator

I think it's worth trying to use BaseDocValuesSkipperTest for these, as it constructs a fake numeric values / skipper combination that exercises all the different possible block combinations.

Good catch. I have now added the logic so that we verify YES_IF_PRESENT correctly.
Also added unit test using BaseDocValuesSkipperTest to verify different block combinations.

Also noticed javadoc comment on BaseDocValuesSkipperTest was a bit incorrect, also fixed it.
Please take another look.

uschindler · 2026-05-15T14:00:23Z

Hi,
In general, the setup of code looks fine. It is in line with other implementations.

I don't know if the vectorized code works well on all CPU types. This is better known by @rmuir, maybe he can give some comments.

If some CPUs or avx versions are not supported well, the provider class needs to check and fall back to the default impl. This can be done by an if statement in the code that returns the implementation.

Otherwise I see no issues with the code.
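
The kind of guard described above can be sketched as a plain selection function. Everything in this block is hypothetical: the enum, the method, and its parameters are illustrative stand-ins, not VectorizationProvider's real logic. The 128-bit threshold used in the usage note comes from the earlier remark in this thread that the Panama provider is not used below 128 bits.

```java
// Hypothetical selection logic; not the actual VectorizationProvider code.
final class RangeSupportSelectionSketch {

  /** Marker for which implementation would be returned. */
  enum Impl { PANAMA, DEFAULT }

  /**
   * Falls back to the scalar implementation when the vector module is
   * unavailable or the CPU's preferred vector width is below the minimum.
   */
  static Impl select(boolean vectorModuleAvailable, int vectorBitSize, int minSupportedBits) {
    if (!vectorModuleAvailable) {
      return Impl.DEFAULT;
    }
    if (vectorBitSize < minSupportedBits) {
      return Impl.DEFAULT;
    }
    return Impl.PANAMA;
  }
}
```

For example, select(true, 256, 128) picks the SIMD path, while select(true, 64, 128) and select(false, 512, 128) both fall back to the default.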

romseygeek

Nice, thanks @sgup432. Can you resolve the merge conflicts, and then I'll commit.

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>

romseygeek · 2026-05-20T10:05:27Z

Can this be backported to 10x or are we relying on bits from JDK25?

romseygeek · 2026-05-20T10:41:05Z

Can this be backported to 10x or are we relying on bits from JDK25?

Answering myself: I think it can be backported but it is relying on some code that's only in main so it's not trivial. @sgup432 would you be able to open a backport PR?

sgup432 · 2026-05-20T15:53:34Z

@sgup432 would you be able to open a backport PR?

Sure, let me do that.

Add SIMD-accelerated bulk range evaluation for dense numeric doc values

77ec451

github-actions Bot added module:core/index module:core/search module:core/codecs labels May 12, 2026

Add changelog

5b2e742

github-actions Bot added this to the 10.5.0 milestone May 12, 2026

Minor refactor

ecaba41

romseygeek requested changes May 13, 2026

neoremind reviewed May 13, 2026

Comment thread ...ne/core/src/java25/org/apache/lucene/internal/vectorization/PanamaDocValuesRangeSupport.java Outdated

neoremind reviewed May 13, 2026

Comment thread ...ne/core/src/java25/org/apache/lucene/internal/vectorization/PanamaDocValuesRangeSupport.java Outdated

neoremind reviewed May 13, 2026

benwtrent reviewed May 13, 2026

sgup432 and others added 4 commits May 14, 2026 13:55

Handle YES_IF_PRESENT case as well in BatchDocValuesRangeIterator

be25c03

Add more unit tests for BlockRangeIterator

2154dc7

Fix java docs/comments

dab75fc

Merge branch 'main' into simd_doc_values_range

d75ce13

sgup432 requested review from benwtrent and romseygeek May 14, 2026 23:17

sgup432 added 2 commits May 14, 2026 17:07

Add nightly version of tests as well

ce3cde7

Fix build issue

e053cd6

romseygeek requested changes May 15, 2026

uschindler approved these changes May 15, 2026

sgup432 and others added 2 commits May 18, 2026 15:36

Addressing comments - fixed UT logic, added more and refactoring redu…

afa00a6

…ndant code

Merge branch 'main' into simd_doc_values_range

d9cfe96

sgup432 requested a review from romseygeek May 18, 2026 22:47

romseygeek reviewed May 19, 2026

Comment thread lucene/core/src/test/org/apache/lucene/search/TestSkipBlockRangeIteratorIntoBitSet.java Outdated

Adding advanceExact implementation for base doc value test class

c592193

romseygeek approved these changes May 19, 2026

Merge branch 'main' into simd_doc_values_range

35bfcba

benwtrent reviewed May 19, 2026

Comment thread lucene/CHANGES.txt Outdated

Apply suggestions from code review

b08d5a3

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>

romseygeek merged commit 6f0821d into apache:main May 20, 2026
13 checks passed
