Introduce intoCacheAndCount method to BulkScorer to allow cache specializations#16083
Introduce intoCacheAndCount method to BulkScorer to allow cache specializations#16083iverase wants to merge 1 commit into
Conversation
| if (scorer.cost() * 100 >= maxDoc) { | ||
| // FixedBitSet is faster for dense sets and will enable the random-access | ||
| // optimization in ConjunctionDISI | ||
| return cacheIntoBitSet(scorer, maxDoc); |
There was a problem hiding this comment.
I wonder if this comment is still true now that we have intoBitSet() and docIDRunEnd()? I don't really like the idea of pushing things from LRUQueryCache onto BulkScorer - I feel like the way to fix this is to make RoaringBitSet more performant.
There was a problem hiding this comment.
And how do you fix it if the range matches 2% of the index? having to build a full FixedBitSet for a dense range feels wrong.
There was a problem hiding this comment.
I mean that maybe we should always be using RoaringBitSet?
There was a problem hiding this comment.
I see, you mean using RoaringBitSet more aggresively.
There was a problem hiding this comment.
Yeah, exactly. I need to do some code archaeology but I think ConjunctionDISI used to do instanceof checks for its input filters but now just uses intoBitSet and docIdRunEnd, so if we can get the performance of RoaringBitSet up for those methods then we don't need to use FixedBitSet here at all. Which would save us a whole bunch of memory.
There was a problem hiding this comment.
So this would be a step in that direction: #16084
I like the idea of going away of those fixed bit set in caching as they are a source of humongous allocations which moving to RoaringBitSet would avoid. How can be test the performance here?
There was a problem hiding this comment.
I think we should be able to adjust luceneutil to use a query cache which should give us an idea of performance changes?
Currently scorers are always cache using a dense representation, using either RoaringBitSets or FixedBitSets. This feels very inefficient for scorers that can be represented in a sparse way, like dense ranges.
This PR proposes to allow for scorer specialisations by moving the current code to materialize the scorer to the BulkScorer base class under the method #intoCacheAndCount(int maxDoc). This method can be overriden by subclasses, for example RangeBulkScorer can represent itself in a sparse way saving a good bunch of heap.
closes #16071