Add SIMD-accelerated bulk range evaluation for dense numeric doc values#16050
Conversation
|
@romseygeek Do you mind taking a look at this? |
romseygeek
left a comment
There was a problem hiding this comment.
Thanks @sgup432, this looks great! I think we need some more comprehensive testing, and I left some notes on the API itself. I think I'd like @benwtrent or @uschindler's opinions on the vectorization code as that's not something I'm very familiar with.
| IndexWriterConfig iwc = new IndexWriterConfig(); | ||
| iwc.setCodec(new Lucene104Codec()); | ||
| IndexWriter w = new IndexWriter(dir, iwc); | ||
| Random r = new Random(42); |
There was a problem hiding this comment.
This should use random() to get the test seed
There was a problem hiding this comment.
I have changed the logic a bit and removed the earlier logic. Please take a look again.
| public void testSingleFieldRangeCorrectness() throws Exception { | ||
| Query q = SortedNumericDocValuesField.newSlowRangeQuery("age", 20, 40); | ||
| int count = searcher.count(q); | ||
| assertTrue("Should find some docs in range [20,40]", count > 0); |
There was a problem hiding this comment.
I don't think we can assert this with the randomly generated values? We could conceivably get all docs with value 1 on some (admittedly unlikely) seed.
There was a problem hiding this comment.
I think we can generate some controlled random values and still verify? I have the changed the logic to generate values like 0, 1, 2, ..., 99, 0, 1, 2, ..., 99, .. and verify this. That way it helps to write the other tests(for intoBitSet() and advance()) and verify.
Let me know what you think.
| */ | ||
| public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase { | ||
|
|
||
| private static final int DOC_COUNT = 50_000; |
There was a problem hiding this comment.
This seems like a lot of docs?
There was a problem hiding this comment.
Yeah that is true for a unit test! I think this class needs some fixing, I was using to do some adhoc testing in my local. Let me fix it and add more cases.
There was a problem hiding this comment.
We do need it but I was not sure if a unit test is a right place for that. I can keep this doc count if you folks think is a good way to go.
Also on a side note, I think we really need to have some benchmark for docs values(with skip list enabled) in luceneutil? I see we don't have that yet?
Would be very useful for this change and otherwise. Maybe I can check and plan to add it :)
There was a problem hiding this comment.
For now I have kept ~16k doc count in this test, so that we have 4 blocks(4096 each) to test correctness with. Let me know if its okay.
There was a problem hiding this comment.
Ok, I have added a nightly version of test as well with larger number of docs. I didn't know that we can explicitly annotate such tests with "nightly", I thought @benwtrent used that in context of using a large number of docs. 😄
There was a problem hiding this comment.
You can change the number of docs given that nightly tag, or you can flip an entire test class or case on/off :). We just need to be aware of individual costs. Many folks still run ./gradlew check locally and we want to keep that useful and cheap.
| * <p>Key behavioral notes: | ||
| * | ||
| * <ul> | ||
| * <li>Single-field range with a second clause (e.g., MatchAllDocsQuery): goes through {@code |
There was a problem hiding this comment.
I don't think we're testing for this case? In addition, it needs to be a restrictive filter of some kind, as MatchAllDocsQuery will get rewritten away by BQ.
There was a problem hiding this comment.
Removed the section. Added more tests.
| * rangeIntoBitSet()}. | ||
| * </ul> | ||
| */ | ||
| public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase { |
There was a problem hiding this comment.
I think we should be doing some lower-level testing here, specifically of the intoBitSet call - you can look at TestSkipBlockRangeIterator to get an idea of what to check.
There was a problem hiding this comment.
Yeah agree. I think this unit class is half baked and still needs more testing. I raised this PR to get initial feedback, still working in the background to ensure more correctness.
There was a problem hiding this comment.
Added more tests specifically for intoBitSet(). Please take a look and let me know.
| scratch[i] = values.get(d + i); | ||
| } | ||
| LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0); | ||
| VectorMask<Long> inRange = |
There was a problem hiding this comment.
Wondering if loop unrolling for SIMD can speed up further (sample)? I suspect if we were to profile this, the bottleneck might be serial values.get(d + i) gather from packed values, if we could read more compact values with fewer loop iterations, and parallelize the range check with more CPU level pipelines, that would be a win, but need to do performance test to vet.
There was a problem hiding this comment.
i think the compiler can do this itself, when you aren't dealing with floating point. with floating point, optimization changes the result so its "unsafe".
| int base = d - offset; | ||
| while (maskBits != 0) { | ||
| int bit = Long.numberOfTrailingZeros(maskBits); | ||
| bitSet.set(base + bit); |
There was a problem hiding this comment.
The vectorized comparison is great, but here we do per-bit loop for the bitset update. Since docs are consecutive, maskBits already stores the exact bit we want, and its max value is 0xFF on AVX-512 (8 lanes). We could OR the mask directly into the bitset word(s) in constant time like O(2) + fewer branches, sample method in FixedBitSet would be
public void orMask(int startBit, long mask, int maskLen) {
int wordIndex = startBit >> 6;
int bitOffset = startBit & 63;
if (bitOffset + maskLen <= 64) {
bits[wordIndex] |= mask << bitOffset;
} else {
bits[wordIndex] |= mask << bitOffset;
bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
}
}
There was a problem hiding this comment.
Yeah! Seems like @benwtrent also pointed this out indirectly below.
The current approach is indeed in-efficient and the while loop approach can instead be replaced by a single OR.
Thanks for pointing out.
benwtrent
left a comment
There was a problem hiding this comment.
I am surprised to see such good numbers with so much more perf opportunities still left to try!
Good idea :)
| .getDocValuesRangeSupport(); | ||
|
|
||
| // Static helper so anonymous inner classes can call DocValuesRangeSupport from the outer class | ||
| static void rangeIntoBitSetVectorized( |
There was a problem hiding this comment.
nit, the assumption is that it is vectorized, but it might be the "default" implementation. can we just name this rangeIntoBitSet? Or something other than vectorized.
| int offset) { | ||
| // Scalar tight loop — JIT may auto-vectorize this on modern JVMs. | ||
| for (int d = fromDoc; d < toDoc; d++) { | ||
| long v = values.get(d); |
There was a problem hiding this comment.
this tells me we eventually might actually want a int count = values.get(int[] docIds, long[] dest);
That is a larger change, but I suspect there is perf to be gained lower level just decoding the long values.
There was a problem hiding this comment.
@benwtrent
Hmm doing in a batched manner like you mentioned would certainly help. Seems like another topic worthy of a separate issue or discussion.
|
|
||
| // Only use SIMD if vector length >= 4 (AVX-256 or better). | ||
| // On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit. | ||
| if (vectorLen < 4) { |
There was a problem hiding this comment.
Have you benchmarked this to indicate no improvement here?
There was a problem hiding this comment.
I think we can remove this now. I see that PanamaVectorizationProvider anyway throws exception for < 128 bit, and we default to scalar approach.
With 128 bits(or vectorLen==2), I added this case when I was seeing some regression on my Mac(128 bit), but with some older changes. Let me re-run, confirm and remove this.
There was a problem hiding this comment.
Let me re-run, confirm and remove this.
Cool, I would also suggest please use the names and formatting similar to the larger panama vector code files, clearly indicating 128 vs 256 vs 512, etc.
| LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0); | ||
| VectorMask<Long> inRange = | ||
| v.compare(VectorOperators.GE, minValue).and(v.compare(VectorOperators.LE, maxValue)); | ||
| long maskBits = inRange.toLong(); |
There was a problem hiding this comment.
Its a huge shame to throw away the maskBits which is already encoded as a long, especially when we know the bit set is a FixedBitSet and we have access to FixedBitSet.getBits ;)
There was a problem hiding this comment.
Hmm yeah, I see what you mean here. I am currently using maskBits and then in the below while loop, for each bit set in maskBits, I set the desired bit in bitSet. This is inefficient indeed!
I guess instead of doing this in a loop, bit by bit, we can instead replace it with a single OR!
There was a problem hiding this comment.
we can instead replace it with a single OR!
Exactly! But, I am not 100% sure the correct way to do that, or how this is adjusted due to the "offset" thing. But I think there is a better way here.
There was a problem hiding this comment.
Let me check that.
@neoremind also pointed this out above with an approach.
There was a problem hiding this comment.
@benwtrent I am planning to address this in a separate PR and benchmark. Let me know what you think. As this PR is already pretty loaded.
| for (int d = loopBound; d < toDoc; d++) { | ||
| long v = values.get(d); | ||
| if (v >= minValue && v <= maxValue) { | ||
| bitSet.set(d - offset); | ||
| } | ||
| } |
There was a problem hiding this comment.
I wonder if we will hit windows of density, where v passes our predicate for multiple docs in a row. In that case, we could take advantage of FixedBitSet.set(int startIndex, int endIndex) which would provide a substantial speed up in those dense regions.
This same idea goes for the default, etc. versions.
There was a problem hiding this comment.
Hmm good idea.
I think it won't help much in this vectorized approach where we are processing at most 7 docs for the tail?
But for scalar approach or default, where we are processing upto 4096 docs in a block, so we can probably use FixedBitSet.set(int startIndex, int endIndex) for a batch of matching docs.
Doing something like below might work? Let me what you think.
int runStart = -1;
for (int d = fromDoc; d < toDoc; d++) {
long v = values.get(d);
boolean matches = v >= minValue && v <= maxValue;
if (matches) {
if (runStart == -1) runStart = d; // a new run
} else {
if (runStart != -1) {
bitSet.set(runStart - offset, d - offset); // set in a batch
runStart = -1;
}
}
}
// for remaining matching docs
if (runStart != -1) {
bitSet.set(runStart - offset, toDoc - offset);
}
There was a problem hiding this comment.
@sgup432 something like that for sure. It would need to be benchmarked, I am not sure how common this density is and branch prediction can be a pain :/
There was a problem hiding this comment.
Yeah agree. I think this is something worthy of a separate issue/PR as needs more thought.
There was a problem hiding this comment.
@sgup432 yeah, a follow up discussion I think for sure.
| * | ||
| * @lucene.internal | ||
| */ | ||
| public interface DocValuesRangeSupport { |
There was a problem hiding this comment.
I think this support path, etc. all matches our existing patterns. Seems OK to me.
romseygeek
left a comment
There was a problem hiding this comment.
Thanks @sgup432, I took another look and left some suggestions.
| int toDoc, | ||
| long minValue, | ||
| long maxValue, | ||
| org.apache.lucene.util.FixedBitSet bitSet, |
There was a problem hiding this comment.
Add an import for FixedBitSet?
|
|
||
| BatchDocValuesRangeIterator iter = new BatchDocValuesRangeIterator(dv, skipper, 20, 40); | ||
| List<Integer> actual = new ArrayList<>(); | ||
| for (int d = iter.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = iter.nextDoc()) { |
There was a problem hiding this comment.
Should this be advance()? It looks like you're only exercising nextDoc() on the range iterator.
There was a problem hiding this comment.
Hmm yeah. Fixed it. Renamed the test and verifying using both nextDoc() and advance()
| Document doc = new Document(); | ||
| if (i % 2 == 0) { | ||
| long val = i % 100; | ||
| doc.add(NumericDocValuesField.indexedField("sparse", val)); |
There was a problem hiding this comment.
This is going to be a MAYBE block, not a YES_IF_PRESENT one - there will be docs within the blocks that fall outside the range of the BatchDocValuesRangeIterator
There was a problem hiding this comment.
I think it's worth trying to use BaseDocValuesSkipperTest for these, as it constructs a fake numeric values / skipper combination that exercises all the different possible block combinations.
There was a problem hiding this comment.
Good catch. I have now added the logic so that we verify YES_IF_PRESENT correctly.
Also added unit test using BaseDocValuesSkipperTest to verify different block combinations.
Also noticed javadoc comment on BaseDocValuesSkipperTest was a bit incorrect, also fixed it.
Please take another look.
|
Hi, I don't know if the vectorized code works well on all CPU types. This is better known by @rmuir, maybe he can give some comments. If some CPUs or avx versions are not supported well, the provider class needs to check and fall back to the default impl. This can be done by an if statement in the code that returns the implementation. Otherwise I see no issues with the code. |
romseygeek
left a comment
There was a problem hiding this comment.
Nice, thanks @sgup432. Can you resolve the merge conflicts, and then I'll commit.
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
|
Can this be backported to 10x or are we relying on bits from JDK25? |
Answering myself: I think it can be backported but it is relying on some code that's only in main so it's not trivial. @sgup432 would you be able to open a backport PR? |
Sure, let me do that. |
Description
Numeric range queries on dense fields use DocValuesRangeIterator, which is a TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This works well, but for MAYBE blocks (where values partially overlap the query range), it still falls back to per-doc evaluation: each doc is checked individually via values.advance(doc) + values.longValue() + range comparison.
Since DocValuesRangeIterator is a TwoPhaseIterator,
DenseConjunctionBulkScorerroutes it through the leap-frog path(see here) andintoBitSet()is never called. This means SIMD is never used for MAYBE block evaluation, even though the underlying storage for dense fields is a packed long[] that's ideal for vectorized comparison.PR changes
For dense singleton numeric fields with a skip index, replace DocValuesRangeIterator with a new
BatchDocValuesRangeIteratorwhich is a plain DocIdSetIterator (not TwoPhaseIterator). This was added so that we force DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the bitset intersection path. I am open to suggestion if this is a right approachThis PR also adds support to do SIMD-accelerated bulk range evaluation for MAYBE (partial overlap) blocks, which seem to be the most expensive case when running range queries through doc values.
For this we added below changes:
Add
NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, bitSet, offset): a new bulk API with a per-doc fallback default. Lucene90DocValuesProducer overrides this for dense fields to dispatch to the vectorization layer.Add a DocValuesRangeSupport interface with two implementations:
VectorizationProvider.getDocValuesRangeSupport()returns the appropriate implementation at startup.Benchmarks
MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)
Mac (Apple M-series, 128-bit NEON)
The numbers look great across the board!