Skip to content

Add SIMD-accelerated bulk range evaluation for dense numeric doc values#16050

Merged
romseygeek merged 14 commits into
apache:mainfrom
sgup432:simd_doc_values_range
May 20, 2026
Merged

Add SIMD-accelerated bulk range evaluation for dense numeric doc values#16050
romseygeek merged 14 commits into
apache:mainfrom
sgup432:simd_doc_values_range

Conversation

@sgup432
Copy link
Copy Markdown
Contributor

@sgup432 sgup432 commented May 12, 2026

Description

Numeric range queries on dense fields use DocValuesRangeIterator, which is a TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This works well, but for MAYBE blocks (where values partially overlap the query range), it still falls back to per-doc evaluation: each doc is checked individually via values.advance(doc) + values.longValue() + range comparison.

Since DocValuesRangeIterator is a TwoPhaseIterator, DenseConjunctionBulkScorer routes it through the leap-frog path(see here) and intoBitSet() is never called. This means SIMD is never used for MAYBE block evaluation, even though the underlying storage for dense fields is a packed long[] that's ideal for vectorized comparison.

PR changes

For dense singleton numeric fields with a skip index, replace DocValuesRangeIterator with a new BatchDocValuesRangeIterator which is a plain DocIdSetIterator (not TwoPhaseIterator). This was added so that we force DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the bitset intersection path. I am open to suggestion if this is a right approach

This PR also adds support to do SIMD-accelerated bulk range evaluation for MAYBE (partial overlap) blocks, which seem to be the most expensive case when running range queries through doc values.

For this we added below changes:

  • Add NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, bitSet, offset): a new bulk API with a per-doc fallback default. Lucene90DocValuesProducer overrides this for dense fields to dispatch to the vectorization layer.

  • Add a DocValuesRangeSupport interface with two implementations:

    • PanamaDocValuesRangeSupport — SIMD implementation using the Panama Vector API (LongVector.SPECIES_PREFERRED). Evaluates multiple values per CPU instruction using vectorized range comparisons.
    • DefaultDocValuesRangeSupport — scalar tight loop fallback.
  • VectorizationProvider.getDocValuesRangeSupport() returns the appropriate implementation at startup.

Benchmarks

MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)

Mode: Throughput (ops/s, higher is better)
JVM args: --add-modules=jdk.incubator.vector
Warmup: 3 x 3s, Measurement: 5 x 5s, Fork: 1
Data Pattern docCount Fields Baseline (ops/s) Optimized (ops/s) Change
random 1M 1 59.99 208.27 +247%
random 1M 3 34.83 69.30 +99%
random 1M 5 29.40 65.10 +121%
random 10M 1 6.12 25.16 +311%
random 10M 3 3.41 8.38 +146%
random 10M 5 2.82 7.45 +164%
clustered 1M 1 6231.86 8584.63 +38%
clustered 1M 3 9142.82 35488.66 +288%
clustered 1M 5 7072.30 32583.89 +361%
clustered 10M 1 685.27 1253.04 +83%
clustered 10M 3 8314.53 23913.65 +188%
clustered 10M 5 8855.14 12703.13 +43%

Mac (Apple M-series, 128-bit NEON)

Data Pattern docCount Fields Baseline (ops/s) Optimized (ops/s) Change
random 1M 1 123.5 219.5 +78%
random 1M 3 65.3 86.9 +33%
random 1M 5 47.7 80.2 +68%
random 10M 1 12.1 27.1 +124%
random 10M 3 5.3 9.0 +69%
random 10M 5 4.6 7.7 +69%
clustered 1M 1 21,486 27,137 +26%
clustered 1M 3 17,815 56,333 +216%
clustered 1M 5 13,496 77,159 +472%
clustered 10M 1 2,195 3,502 +60%
clustered 10M 3 17,495 57,900 +231%
clustered 10M 5 18,689 31,728 +70%

The numbers look great across the board!

@github-actions github-actions Bot added this to the 10.5.0 milestone May 12, 2026
@sgup432
Copy link
Copy Markdown
Contributor Author

sgup432 commented May 12, 2026

@romseygeek Do you mind taking a look at this?

Copy link
Copy Markdown
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sgup432, this looks great! I think we need some more comprehensive testing, and I left some notes on the API itself. I think I'd like @benwtrent or @uschindler's opinions on the vectorization code as that's not something I'm very familiar with.

Comment thread lucene/core/src/java/org/apache/lucene/index/NumericDocValues.java Outdated
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(new Lucene104Codec());
IndexWriter w = new IndexWriter(dir, iwc);
Random r = new Random(42);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use random() to get the test seed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the logic a bit and removed the earlier logic. Please take a look again.

public void testSingleFieldRangeCorrectness() throws Exception {
Query q = SortedNumericDocValuesField.newSlowRangeQuery("age", 20, 40);
int count = searcher.count(q);
assertTrue("Should find some docs in range [20,40]", count > 0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can assert this with the randomly generated values? We could conceivably get all docs with value 1 on some (admittedly unlikely) seed.

Copy link
Copy Markdown
Contributor Author

@sgup432 sgup432 May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can generate some controlled random values and still verify? I have the changed the logic to generate values like 0, 1, 2, ..., 99, 0, 1, 2, ..., 99, .. and verify this. That way it helps to write the other tests(for intoBitSet() and advance()) and verify.
Let me know what you think.

*/
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {

private static final int DOC_COUNT = 50_000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of docs?

Copy link
Copy Markdown
Contributor Author

@sgup432 sgup432 May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is true for a unit test! I think this class needs some fixing, I was using to do some adhoc testing in my local. Let me fix it and add more cases.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgup432 I do suggest having a "@nightly" version of the doc count if we really need to have many to test this thing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need it but I was not sure if a unit test is a right place for that. I can keep this doc count if you folks think is a good way to go.

Also on a side note, I think we really need to have some benchmark for docs values(with skip list enabled) in luceneutil? I see we don't have that yet?
Would be very useful for this change and otherwise. Maybe I can check and plan to add it :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I have kept ~16k doc count in this test, so that we have 4 blocks(4096 each) to test correctness with. Let me know if its okay.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I have added a nightly version of test as well with larger number of docs. I didn't know that we can explicitly annotate such tests with "nightly", I thought @benwtrent used that in context of using a large number of docs. 😄

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can change the number of docs given that nightly tag, or you can flip an entire test class or case on/off :). We just need to be aware of individual costs. Many folks still run ./gradlew check locally and we want to keep that useful and cheap.

* <p>Key behavioral notes:
*
* <ul>
* <li>Single-field range with a second clause (e.g., MatchAllDocsQuery): goes through {@code
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're testing for this case? In addition, it needs to be a restrictive filter of some kind, as MatchAllDocsQuery will get rewritten away by BQ.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the section. Added more tests.

* rangeIntoBitSet()}.
* </ul>
*/
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be doing some lower-level testing here, specifically of the intoBitSet call - you can look at TestSkipBlockRangeIterator to get an idea of what to check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agree. I think this unit class is half baked and still needs more testing. I raised this PR to get initial feedback, still working in the background to ensure more correctness.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more tests specifically for intoBitSet(). Please take a look and let me know.

scratch[i] = values.get(d + i);
}
LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if loop unrolling for SIMD can speed up further (sample)? I suspect if we were to profile this, the bottleneck might be serial values.get(d + i) gather from packed values, if we could read more compact values with fewer loop iterations, and parallelize the range check with more CPU level pipelines, that would be a win, but need to do performance test to vet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the compiler can do this itself, when you aren't dealing with floating point. with floating point, optimization changes the result so its "unsafe".

int base = d - offset;
while (maskBits != 0) {
int bit = Long.numberOfTrailingZeros(maskBits);
bitSet.set(base + bit);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vectorized comparison is great, but here we do per-bit loop for the bitset update. Since docs are consecutive, maskBits already stores the exact bit we want, and its max value is 0xFF on AVX-512 (8 lanes). We could OR the mask directly into the bitset word(s) in constant time like O(2) + fewer branches, sample method in FixedBitSet would be

public void orMask(int startBit, long mask, int maskLen) {
    int wordIndex = startBit >> 6;
    int bitOffset = startBit & 63;
    if (bitOffset + maskLen <= 64) {
        bits[wordIndex] |= mask << bitOffset;
    } else {
        bits[wordIndex]     |= mask << bitOffset;
        bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
    }
}

Copy link
Copy Markdown
Contributor Author

@sgup432 sgup432 May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! Seems like @benwtrent also pointed this out indirectly below.
The current approach is indeed in-efficient and the while loop approach can instead be replaced by a single OR.
Thanks for pointing out.

Copy link
Copy Markdown
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised to see such good numbers with so much more perf opportunities still left to try!

Good idea :)

.getDocValuesRangeSupport();

// Static helper so anonymous inner classes can call DocValuesRangeSupport from the outer class
static void rangeIntoBitSetVectorized(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, the assumption is that it is vectorized, but it might be the "default" implementation. can we just name this rangeIntoBitSet? Or something other than vectorized.

Comment thread lucene/core/src/java/org/apache/lucene/index/NumericDocValues.java Outdated
int offset) {
// Scalar tight loop — JIT may auto-vectorize this on modern JVMs.
for (int d = fromDoc; d < toDoc; d++) {
long v = values.get(d);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tells me we eventually might actually want a int count = values.get(int[] docIds, long[] dest);

That is a larger change, but I suspect there is perf to be gained lower level just decoding the long values.

Copy link
Copy Markdown
Contributor Author

@sgup432 sgup432 May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent
Hmm doing in a batched manner like you mentioned would certainly help. Seems like another topic worthy of a separate issue or discussion.

Comment thread lucene/core/src/java/org/apache/lucene/index/NumericDocValues.java Outdated

// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you benchmarked this to indicate no improvement here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this now. I see that PanamaVectorizationProvider anyway throws exception for < 128 bit, and we default to scalar approach.
With 128 bits(or vectorLen==2), I added this case when I was seeing some regression on my Mac(128 bit), but with some older changes. Let me re-run, confirm and remove this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgup432

Let me re-run, confirm and remove this.

Cool, I would also suggest please use the names and formatting similar to the larger panama vector code files, clearly indicating 128 vs 256 vs 512, etc.

LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
v.compare(VectorOperators.GE, minValue).and(v.compare(VectorOperators.LE, maxValue));
long maskBits = inRange.toLong();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a huge shame to throw away the maskBits which is already encoded as a long, especially when we know the bit set is a FixedBitSet and we have access to FixedBitSet.getBits ;)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah, I see what you mean here. I am currently using maskBits and then in the below while loop, for each bit set in maskBits, I set the desired bit in bitSet. This is inefficient indeed!

I guess instead of doing this in a loop, bit by bit, we can instead replace it with a single OR!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can instead replace it with a single OR!

Exactly! But, I am not 100% sure the correct way to do that, or how this is adjusted due to the "offset" thing. But I think there is a better way here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check that.
@neoremind also pointed this out above with an approach.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent I am planning to address this in a separate PR and benchmark. Let me know what you think. As this PR is already pretty loaded.

Comment on lines +83 to +88
for (int d = loopBound; d < toDoc; d++) {
long v = values.get(d);
if (v >= minValue && v <= maxValue) {
bitSet.set(d - offset);
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we will hit windows of density, where v passes our predicate for multiple docs in a row. In that case, we could take advantage of FixedBitSet.set(int startIndex, int endIndex) which would provide a substantial speed up in those dense regions.

This same idea goes for the default, etc. versions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good idea.
I think it won't help much in this vectorized approach where we are processing at most 7 docs for the tail?

But for scalar approach or default, where we are processing upto 4096 docs in a block, so we can probably use FixedBitSet.set(int startIndex, int endIndex) for a batch of matching docs.

Doing something like below might work? Let me what you think.

int runStart = -1;
for (int d = fromDoc; d < toDoc; d++) {
    long v = values.get(d);
    boolean matches = v >= minValue && v <= maxValue;
    if (matches) {
        if (runStart == -1) runStart = d;  // a new run
    } else {
        if (runStart != -1) {
            bitSet.set(runStart - offset, d - offset);  // set in a batch
            runStart = -1;
        }
    }
}
// for remaining matching docs
if (runStart != -1) {
    bitSet.set(runStart - offset, toDoc - offset);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgup432 something like that for sure. It would need to be benchmarked, I am not sure how common this density is and branch prediction can be a pain :/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agree. I think this is something worthy of a separate issue/PR as needs more thought.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgup432 yeah, a follow up discussion I think for sure.

*
* @lucene.internal
*/
public interface DocValuesRangeSupport {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this support path, etc. all matches our existing patterns. Seems OK to me.

@sgup432 sgup432 requested review from benwtrent and romseygeek May 14, 2026 23:17
Copy link
Copy Markdown
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sgup432, I took another look and left some suggestions.

int toDoc,
long minValue,
long maxValue,
org.apache.lucene.util.FixedBitSet bitSet,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an import for FixedBitSet?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


BatchDocValuesRangeIterator iter = new BatchDocValuesRangeIterator(dv, skipper, 20, 40);
List<Integer> actual = new ArrayList<>();
for (int d = iter.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = iter.nextDoc()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be advance()? It looks like you're only exercising nextDoc() on the range iterator.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah. Fixed it. Renamed the test and verifying using both nextDoc() and advance()

Document doc = new Document();
if (i % 2 == 0) {
long val = i % 100;
doc.add(NumericDocValuesField.indexedField("sparse", val));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a MAYBE block, not a YES_IF_PRESENT one - there will be docs within the blocks that fall outside the range of the BatchDocValuesRangeIterator

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth trying to use BaseDocValuesSkipperTest for these, as it constructs a fake numeric values / skipper combination that exercises all the different possible block combinations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I have now added the logic so that we verify YES_IF_PRESENT correctly.
Also added unit test using BaseDocValuesSkipperTest to verify different block combinations.

Also noticed javadoc comment on BaseDocValuesSkipperTest was a bit incorrect, also fixed it.
Please take another look.

@uschindler
Copy link
Copy Markdown
Contributor

Hi,
In general, the setup of code looks fine. It is in line with other implementations.

I don't know if the vectorized code works well on all CPU types. This is better known by @rmuir, maybe he can give some comments.

If some CPUs or avx versions are not supported well, the provider class needs to check and fall back to the default impl. This can be done by an if statement in the code that returns the implementation.

Otherwise I see no issues with the code.

@sgup432 sgup432 requested a review from romseygeek May 18, 2026 22:47
Copy link
Copy Markdown
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks @sgup432. Can you resolve the merge conflicts, and then I'll commit.

Comment thread lucene/CHANGES.txt Outdated
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
@romseygeek romseygeek merged commit 6f0821d into apache:main May 20, 2026
13 checks passed
@romseygeek
Copy link
Copy Markdown
Contributor

Can this be backported to 10x or are we relying on bits from JDK25?

@romseygeek
Copy link
Copy Markdown
Contributor

Can this be backported to 10x or are we relying on bits from JDK25?

Answering myself: I think it can be backported but it is relying on some code that's only in main so it's not trivial. @sgup432 would you be able to open a backport PR?

@sgup432
Copy link
Copy Markdown
Contributor Author

sgup432 commented May 20, 2026

@sgup432 would you be able to open a backport PR?

Sure, let me do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants