Add afterMerge() lifecycle hook to KnnVectorsWriter and PerFieldKnnVe…#15952
Add afterMerge() lifecycle hook to KnnVectorsWriter and PerFieldKnnVe…#15952MrFlap wants to merge 1 commit into
Conversation
|
@MrFlap can you add entry in change log and also unit test for the new interface |
| if (mergeState.infoStream.isEnabled("VV")) { | ||
| mergeState.infoStream.message("VV", "merge done " + mergeState.segmentInfo); | ||
| if (mergeState.infoStream.isEnabled("VV")) { | ||
| mergeState.infoStream.message("VV", "merge done " + mergeState.segmentInfo); | ||
| } | ||
| } | ||
| } | ||
| } finally { | ||
| afterMerge(); | ||
| } | ||
| finish(); |
There was a problem hiding this comment.
Why doesn't finish work? It seems like the writer should know its in a merge context already and finish should just handle that? Adding a new interface seems unnecessary?
There was a problem hiding this comment.
We would like to manage the threads in PerFieldKnnVectorsFormat, which doesn't get a callback once KnnVectorsWriter.finish() happens. Also, the suggested afterMerge() runs in a finally block, so we can be sure to close out the threads even on exception.
There was a problem hiding this comment.
- On an exception with merge, all files, etc. should be closed, do we want to move finish into the
finally? There may be edge cases, for sure... - Are we sure extending PerFieldKnnVectorsFormat is the way forward here? My point is that this type of thing is doable via experts in Lucene in different ways than adding another top-level API to the KnnVectorFormat which then impacts all formats.
I want to be very careful about adding an expert level API that Apache Lucene itself doesn't actually use.
There was a problem hiding this comment.
On an exception with merge, all files, etc. should be closed, do we want to move finish into the finally? There may be edge cases, for sure...
I'm not sure I can give a completely educated call on this, but it seems like the right thing to do. If mergeOneField throws mid-way, the finish() call will be skipped.
Are we sure extending PerFieldKnnVectorsFormat is the way forward here? My point is that this type of thing is doable via experts in Lucene in different ways than adding another top-level API to the KnnVectorFormat which then impacts all formats.
Understandable. I'm not a Lucene expert, but I don't think there is an easy way for PerFieldKnnVectorsFormat to know when a merge completes. I'll continue looking into this. cc @navneet1v
There was a problem hiding this comment.
I am saying, that any expert extending PerFieldKnnVectorsFormat can likely do this in another way and requiring a Lucene change that Lucene doesn't benefit from is a tricky situation.
I am just wondering if this is the best way for Lucene.
There was a problem hiding this comment.
I mean, the overrider for PerFieldKnnVectorsFormat could wrap a try/final directly in mergeOneField. I am not sure a new interface is required at all.
There was a problem hiding this comment.
Our desired functionality is to have a set number N threads per merge. This way we won't have resource contention between merges. So if one field needs 50 singular merge operations, we should see 50 * N threads spawning. Wrapping functionality around mergeOneField is difficult because we want to
- Allocate N threads when each singular merge operation starts.
- Release N threads when each singular merge operation completes.
- Be able to release N threads on a specific merge operation when an exception happens
We would really like to keep it to a "N threads per singular merge operation" as making it a set number of threads per field merge could lead to resource contention per thread. We don't really see an easy way to be this granular about how to detect and allocate resources per merge, since there are no hooks to see when a singular merge operation happens.
Adding these. |
…ctorsFormat Signed-off-by: Andrew Klepchick <aklepchi@amazon.com>
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Resolves #15935
Problem
PerFieldKnnVectorsFormatsubclasses that allocate merge-time resources (such as thread pools for concurrent HNSW graph construction) have no way to know when a merge completes.getKnnVectorsFormatForFieldis called per-field during merge, but there is no corresponding callback when the merge finishes. This makes it impossible to cleanly release per-merge resources, leading to resource leaks.Solution
Add a
protected void afterMerge() throws IOExceptionmethod with a no-op default to bothKnnVectorsWriterandPerFieldKnnVectorsFormat.A consumer can then override
afterMerge()to shrink a shared thread pool after each merge completes, ensuring the pool scales elastically with the number of concurrent merges rather than leaking threads indefinitely.Changes
KnnVectorsWriter.java: AddedafterMerge()no-op method. Wrapped the field merge loop inmerge()withtry/finally { afterMerge() }.PerFieldKnnVectorsFormat.java: AddedafterMerge()no-op method on the format.FieldsWriteroverridesafterMerge()to delegate toPerFieldKnnVectorsFormat.this.afterMerge().cc @navneet1v @shatejas