Introduce intoCacheAndCount method to BulkScorer to allow cache specializations #16083

Open
iverase wants to merge 1 commit into apache:main from iverase:RangeDocIdSet

Contributor

@iverase commented May 18, 2026

Currently, scorers are always cached using a dense representation, either RoaringBitSets or FixedBitSets. This feels very inefficient for scorers that can be represented compactly, such as dense ranges.

This PR proposes to allow scorer specialisations by moving the current code that materializes the scorer into the BulkScorer base class, under the method #intoCacheAndCount(int maxDoc). Subclasses can override this method; for example, RangeBulkScorer can represent itself in a sparse way, saving a good amount of heap.

closes #16071
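
The shape of this proposal can be sketched without Lucene at all. The following is a minimal illustration, not the code in this PR: every type here (`CachedDocs`, `DenseCachedDocs`, `RangeCachedDocs`, `CacheResult`, `SketchBulkScorer`, `SketchRangeBulkScorer`) is hypothetical, `java.util.BitSet` stands in for a dense bit set, and the return type of `intoCacheAndCount` is an assumption. It shows a base class that materializes matches densely by default and a range subclass that overrides the method to store only its bounds.

```java
import java.util.BitSet;
import java.util.function.IntConsumer;

/** Hypothetical cached form of a scorer's matches; not a Lucene type. */
interface CachedDocs {
  boolean contains(int doc);
  long ramBytesEstimate();
}

/** Dense form: one bit per document in the segment. */
final class DenseCachedDocs implements CachedDocs {
  private final BitSet bits;
  private final int maxDoc;
  DenseCachedDocs(BitSet bits, int maxDoc) { this.bits = bits; this.maxDoc = maxDoc; }
  @Override public boolean contains(int doc) { return bits.get(doc); }
  // Counts only the backing 64-bit words, ignoring object headers.
  @Override public long ramBytesEstimate() { return ((maxDoc + 63L) / 64) * Long.BYTES; }
}

/** Compact form: all documents in [min, max). */
final class RangeCachedDocs implements CachedDocs {
  private final int min;
  private final int max;
  RangeCachedDocs(int min, int max) { this.min = min; this.max = max; }
  @Override public boolean contains(int doc) { return doc >= min && doc < max; }
  @Override public long ramBytesEstimate() { return 2L * Integer.BYTES; }
}

/** Cached docs plus the number of matches (assumed return shape). */
final class CacheResult {
  final CachedDocs docs;
  final int count;
  CacheResult(CachedDocs docs, int count) { this.docs = docs; this.count = count; }
}

/** Hypothetical stand-in for a bulk scorer base class. */
abstract class SketchBulkScorer {
  /** Visits every matching doc id in increasing order. */
  abstract void forEachDoc(IntConsumer consumer);

  /** Default behaviour: materialize every match into a dense bit set. */
  CacheResult intoCacheAndCount(int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    int[] count = {0};
    forEachDoc(doc -> { bits.set(doc); count[0]++; });
    return new CacheResult(new DenseCachedDocs(bits, maxDoc), count[0]);
  }
}

/** Specialization: a contiguous range caches itself as two ints. */
final class SketchRangeBulkScorer extends SketchBulkScorer {
  private final int min;
  private final int max;
  SketchRangeBulkScorer(int min, int max) { this.min = min; this.max = max; }

  @Override
  void forEachDoc(IntConsumer consumer) {
    for (int doc = min; doc < max; doc++) consumer.accept(doc);
  }

  @Override
  CacheResult intoCacheAndCount(int maxDoc) {
    int end = Math.min(max, maxDoc);
    int start = Math.min(min, end);
    return new CacheResult(new RangeCachedDocs(start, end), end - start);
  }
}

final class CacheSketchDemo {
  public static void main(String[] args) {
    CacheResult r = new SketchRangeBulkScorer(1_000, 3_000).intoCacheAndCount(100_000);
    System.out.println(r.count + " docs cached in " + r.docs.ramBytesEstimate() + " bytes");
  }
}
```

The point of the override is that the caller never needs to know which representation it received; it only depends on the `CachedDocs` contract.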

@iverase added this to the 10.5.0 milestone May 18, 2026
if (scorer.cost() * 100 >= maxDoc) {
  // FixedBitSet is faster for dense sets and will enable the random-access
  // optimization in ConjunctionDISI
  return cacheIntoBitSet(scorer, maxDoc);
Contributor

I wonder if this comment is still true now that we have intoBitSet() and docIDRunEnd()? I don't really like the idea of pushing things from LRUQueryCache onto BulkScorer - I feel like the way to fix this is to make RoaringBitSet more performant.

Contributor Author

And how do you fix it if the range matches 2% of the index? Having to build a full FixedBitSet for a dense range feels wrong.
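
To put rough numbers on this objection, here is a back-of-the-envelope sketch. The segment size of 100 million documents is a hypothetical figure, `CacheFootprint` is not a Lucene class, and the estimate counts only the 64-bit words backing a dense bit set. The threshold check mirrors the `scorer.cost() * 100 >= maxDoc` condition quoted in the diff above.

```java
/** Hypothetical helper for comparing cache footprints; not part of Lucene. */
final class CacheFootprint {

  /** Bytes for one bit per document, regardless of how many documents match. */
  static long denseBytes(int maxDoc) {
    return ((maxDoc + 63L) / 64) * Long.BYTES;
  }

  /** Bytes for a contiguous range stored as two int bounds. */
  static long rangeBytes() {
    return 2L * Integer.BYTES;
  }

  /** Same condition as the quoted diff: dense path at 1% of the index or more. */
  static boolean wouldUseDenseBitSet(long cost, int maxDoc) {
    return cost * 100 >= maxDoc;
  }

  public static void main(String[] args) {
    int maxDoc = 100_000_000;   // hypothetical segment size
    long cost = maxDoc / 50;    // a range matching 2% of the index
    System.out.println("dense path taken: " + wouldUseDenseBitSet(cost, maxDoc));
    System.out.println("dense bytes: " + denseBytes(maxDoc));
    System.out.println("range bytes: " + rangeBytes());
  }
}
```

Under these assumptions the dense form costs about 12.5 MB for the segment whether the range covers 2% or 90% of it, while the two bounds take 8 bytes.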

Contributor

I mean that maybe we should always be using RoaringBitSet?

Contributor Author

I see, you mean using RoaringBitSet more aggressively.

Contributor

Yeah, exactly. I need to do some code archaeology, but I think ConjunctionDISI used to do instanceof checks for its input filters but now just uses intoBitSet and docIDRunEnd, so if we can get the performance of RoaringBitSet up for those methods then we don't need to use FixedBitSet here at all, which would save us a whole bunch of memory.
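
Why a run-end hint lets a consumer stay fast without knowing the concrete set type can be sketched as follows. This is an illustration only: `RunIterator` and `RunCopy` are hypothetical and are not Lucene's `DocIdSetIterator` or `ConjunctionDISI`, `java.util.BitSet` stands in for the destination bit set, and the runs are assumed to be sorted, non-empty, and non-overlapping. The consumer asks the iterator where the current run of consecutive doc IDs ends and sets that whole span in one call instead of visiting each document.

```java
import java.util.BitSet;

/** Hypothetical iterator over runs of consecutive doc ids. */
final class RunIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] starts; // inclusive run starts, ascending
  private final int[] ends;   // exclusive run ends, ascending
  private int run = 0;
  private int doc = -1;

  RunIterator(int[] starts, int[] ends) {
    this.starts = starts;
    this.ends = ends;
  }

  int docID() {
    return doc;
  }

  /** Moves to the first doc >= target; targets must not decrease. */
  int advance(int target) {
    while (run < starts.length && ends[run] <= target) {
      run++;
    }
    if (run == starts.length) {
      doc = NO_MORE_DOCS;
    } else {
      doc = Math.max(target, starts[run]);
    }
    return doc;
  }

  /** One past the last doc of the run containing the current doc. */
  int runEnd() {
    return ends[run];
  }
}

final class RunCopy {
  /** Copies all docs below upTo into bits, one ranged set() per run. */
  static int intoBitSet(RunIterator it, int upTo, BitSet bits) {
    int copied = 0;
    int doc = it.advance(0);
    while (doc < upTo) {
      int end = Math.min(it.runEnd(), upTo);
      bits.set(doc, end); // sets [doc, end) in a single call
      copied += end - doc;
      doc = it.advance(end);
    }
    return copied;
  }

  public static void main(String[] args) {
    RunIterator it = new RunIterator(new int[] {2, 10}, new int[] {5, 12});
    BitSet bits = new BitSet();
    int copied = intoBitSet(it, 11, bits);
    System.out.println(copied + " docs copied: " + bits);
  }
}
```

Because the consumer depends only on the two calls, any set implementation that answers them efficiently can be swapped in, which is the property the comment above relies on.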

Contributor Author

So this would be a step in that direction: #16084

I like the idea of moving away from those fixed bit sets in caching, as they are a source of humongous allocations, which moving to RoaringBitSet would avoid. How can we test the performance here?

Contributor

I think we should be able to adjust luceneutil to use a query cache, which should give us an idea of the performance changes?

Development

Successfully merging this pull request may close these issues.

Can we improve caching of dense results?

2 participants