-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add afterMerge() lifecycle hook to KnnVectorsWriter and PerFieldKnnVe… #15952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
MrFlap
wants to merge
1
commit into
apache:main
Choose a base branch
from
MrFlap:after-merge
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
239 changes: 239 additions & 0 deletions
239
...re/src/test/org/apache/lucene/codecs/perfield/TestPerFieldKnnVectorsFormatAfterMerge.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,239 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.lucene.codecs.perfield; | ||
|
|
||
| import java.io.IOException; | ||
| import java.util.concurrent.atomic.AtomicInteger; | ||
| import org.apache.lucene.codecs.Codec; | ||
| import org.apache.lucene.codecs.FilterCodec; | ||
| import org.apache.lucene.codecs.KnnVectorsFormat; | ||
| import org.apache.lucene.document.Document; | ||
| import org.apache.lucene.document.KnnFloatVectorField; | ||
| import org.apache.lucene.index.IndexWriter; | ||
| import org.apache.lucene.index.IndexWriterConfig; | ||
| import org.apache.lucene.index.NoMergePolicy; | ||
| import org.apache.lucene.store.Directory; | ||
| import org.apache.lucene.tests.analysis.MockAnalyzer; | ||
| import org.apache.lucene.tests.util.LuceneTestCase; | ||
| import org.apache.lucene.tests.util.TestUtil; | ||
|
|
||
| /** Tests for the afterMerge() lifecycle hook on PerFieldKnnVectorsFormat. */ | ||
| public class TestPerFieldKnnVectorsFormatAfterMerge extends LuceneTestCase { | ||
|
|
||
| /** Writes numSegments single-doc segments with a vector field, using NoMergePolicy. */ | ||
| private void writeSegments(Directory dir, KnnVectorsFormat format, int numSegments) | ||
| throws IOException { | ||
| IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random())); | ||
| iwc.setCodec(codecWithFormat(format)); | ||
| iwc.setMergePolicy(NoMergePolicy.INSTANCE); | ||
| try (IndexWriter iw = new IndexWriter(dir, iwc)) { | ||
| for (int i = 0; i < numSegments; i++) { | ||
| Document doc = new Document(); | ||
| doc.add(new KnnFloatVectorField("field", new float[] {i, i + 1, i + 2})); | ||
| iw.addDocument(doc); | ||
| iw.commit(); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| private static FilterCodec codecWithFormat(KnnVectorsFormat format) { | ||
| Codec defaultCodec = TestUtil.getDefaultCodec(); | ||
| return new FilterCodec(defaultCodec.getName(), defaultCodec) { | ||
| @Override | ||
| public KnnVectorsFormat knnVectorsFormat() { | ||
| return format; | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| /** afterMerge() on the format must be called exactly once when a merge completes. */ | ||
| public void testAfterMergeCalledOnMerge() throws IOException { | ||
| AtomicInteger afterMergeCount = new AtomicInteger(); | ||
|
|
||
| PerFieldKnnVectorsFormat format = | ||
| new PerFieldKnnVectorsFormat() { | ||
| @Override | ||
| public KnnVectorsFormat getKnnVectorsFormatForField(String field) { | ||
| return TestUtil.getDefaultKnnVectorsFormat(); | ||
| } | ||
|
|
||
| @Override | ||
| protected void afterMerge() { | ||
| afterMergeCount.incrementAndGet(); | ||
| } | ||
| }; | ||
|
|
||
| try (Directory dir = newDirectory()) { | ||
| writeSegments(dir, format, 3); | ||
| assertEquals("afterMerge should not be called during flush", 0, afterMergeCount.get()); | ||
|
|
||
| // Force merge triggers afterMerge | ||
| IndexWriterConfig mergeConfig = new IndexWriterConfig(new MockAnalyzer(random())); | ||
| mergeConfig.setCodec(codecWithFormat(format)); | ||
| try (IndexWriter iw = new IndexWriter(dir, mergeConfig)) { | ||
| iw.forceMerge(1); | ||
| } | ||
|
|
||
| assertEquals("afterMerge should be called exactly once per merge", 1, afterMergeCount.get()); | ||
| } | ||
| } | ||
|
|
||
| /** afterMerge() must not be called during a normal flush (no merge). */ | ||
| public void testAfterMergeNotCalledOnFlush() throws IOException { | ||
| AtomicInteger afterMergeCount = new AtomicInteger(); | ||
|
|
||
| PerFieldKnnVectorsFormat format = | ||
| new PerFieldKnnVectorsFormat() { | ||
| @Override | ||
| public KnnVectorsFormat getKnnVectorsFormatForField(String field) { | ||
| return TestUtil.getDefaultKnnVectorsFormat(); | ||
| } | ||
|
|
||
| @Override | ||
| protected void afterMerge() { | ||
| afterMergeCount.incrementAndGet(); | ||
| } | ||
| }; | ||
|
|
||
| try (Directory dir = newDirectory()) { | ||
| writeSegments(dir, format, 3); | ||
| assertEquals("afterMerge must not be called on flush-only writes", 0, afterMergeCount.get()); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * afterMerge() is called in a finally block, so it must fire even when mergeOneField throws. We | ||
| * verify this by using a format that wraps the delegate writer to throw during merge, then | ||
| * checking that afterMerge() was still invoked. | ||
| */ | ||
| public void testAfterMergeCalledEvenOnException() throws IOException { | ||
| AtomicInteger afterMergeCount = new AtomicInteger(); | ||
|
|
||
| PerFieldKnnVectorsFormat format = | ||
| new PerFieldKnnVectorsFormat() { | ||
| @Override | ||
| public KnnVectorsFormat getKnnVectorsFormatForField(String field) { | ||
| return new ThrowingOnMergeKnnVectorsFormat(TestUtil.getDefaultKnnVectorsFormat()); | ||
| } | ||
|
|
||
| @Override | ||
| protected void afterMerge() { | ||
| afterMergeCount.incrementAndGet(); | ||
| } | ||
| }; | ||
|
|
||
| try (Directory dir = newDirectory()) { | ||
| // Write segments using a normal format so data is valid on disk | ||
| PerFieldKnnVectorsFormat normalFormat = | ||
| new PerFieldKnnVectorsFormat() { | ||
| @Override | ||
| public KnnVectorsFormat getKnnVectorsFormatForField(String field) { | ||
| return TestUtil.getDefaultKnnVectorsFormat(); | ||
| } | ||
| }; | ||
| writeSegments(dir, normalFormat, 3); | ||
|
|
||
| // Use SerialMergeScheduler so the merge runs on the calling thread and the | ||
| // exception propagates directly (instead of being caught by ConcurrentMergeScheduler) | ||
| IndexWriterConfig mergeConfig = new IndexWriterConfig(new MockAnalyzer(random())); | ||
| mergeConfig.setCodec(codecWithFormat(format)); | ||
| mergeConfig.setMergeScheduler(new org.apache.lucene.index.SerialMergeScheduler()); | ||
| IndexWriter iw = new IndexWriter(dir, mergeConfig); | ||
| try { | ||
| expectThrows(IOException.class, () -> iw.forceMerge(1)); | ||
| } finally { | ||
| // IndexWriter may be in a tragic state after the merge failure, so rollback | ||
| try { | ||
| iw.rollback(); | ||
| } catch ( | ||
| @SuppressWarnings("unused") | ||
| Exception ignored) { | ||
| // expected — writer may already be closed or in a tragic state | ||
| } | ||
| } | ||
|
|
||
| assertEquals( | ||
| "afterMerge must be called even when mergeOneField throws", 1, afterMergeCount.get()); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * A KnnVectorsFormat wrapper that delegates everything normally except mergeOneField, which always | ||
| * throws IOException. | ||
| */ | ||
| private static class ThrowingOnMergeKnnVectorsFormat extends KnnVectorsFormat { | ||
| private final KnnVectorsFormat delegate; | ||
|
|
||
| ThrowingOnMergeKnnVectorsFormat(KnnVectorsFormat delegate) { | ||
| super(delegate.getName()); | ||
| this.delegate = delegate; | ||
| } | ||
|
|
||
| @Override | ||
| public org.apache.lucene.codecs.KnnVectorsWriter fieldsWriter( | ||
| org.apache.lucene.index.SegmentWriteState state) throws IOException { | ||
| org.apache.lucene.codecs.KnnVectorsWriter delegateWriter = delegate.fieldsWriter(state); | ||
| return new org.apache.lucene.codecs.KnnVectorsWriter() { | ||
| @Override | ||
| public org.apache.lucene.codecs.KnnFieldVectorsWriter<?> addField( | ||
| org.apache.lucene.index.FieldInfo fieldInfo) throws IOException { | ||
| return delegateWriter.addField(fieldInfo); | ||
| } | ||
|
|
||
| @Override | ||
| public void flush(int maxDoc, org.apache.lucene.index.Sorter.DocMap sortMap) | ||
| throws IOException { | ||
| delegateWriter.flush(maxDoc, sortMap); | ||
| } | ||
|
|
||
| @Override | ||
| public void mergeOneField( | ||
| org.apache.lucene.index.FieldInfo fieldInfo, | ||
| org.apache.lucene.index.MergeState mergeState) | ||
| throws IOException { | ||
| throw new IOException("simulated merge failure"); | ||
| } | ||
|
|
||
| @Override | ||
| public void finish() throws IOException { | ||
| delegateWriter.finish(); | ||
| } | ||
|
|
||
| @Override | ||
| public void close() throws IOException { | ||
| delegateWriter.close(); | ||
| } | ||
|
|
||
| @Override | ||
| public long ramBytesUsed() { | ||
| return delegateWriter.ramBytesUsed(); | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| @Override | ||
| public org.apache.lucene.codecs.KnnVectorsReader fieldsReader( | ||
| org.apache.lucene.index.SegmentReadState state) throws IOException { | ||
| return delegate.fieldsReader(state); | ||
| } | ||
|
|
||
| @Override | ||
| public int getMaxDimensions(String fieldName) { | ||
| return delegate.getMaxDimensions(fieldName); | ||
| } | ||
| } | ||
| } |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't
finishwork? It seems like the writer should know its in a merge context already andfinishshould just handle that? Adding a new interface seems unnecessary?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would like to manage the threads in
PerFieldKnnVectorsFormat, which doesn't get a callback onceKnnVectorsWriter.finish()happens. Also, the suggestedafterMerge()runs in a finally block, so we can be sure to close out the threads even on exception.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finally? There may be edge cases, for sure...I want to be very careful about adding an expert level API that Apache Lucene itself doesn't actually use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I can give a completely educated call on this, but it seems like the right thing to do. If
mergeOneFieldthrows mid-way, thefinish()call will be skipped.Understandable. I'm not a Lucene expert, but I don't think there is an easy way for
PerFieldKnnVectorsFormatto know when a merge completes. I'll continue looking into this. cc @navneet1vThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying, that any expert extending
PerFieldKnnVectorsFormatcan likely do this in another way and requiring a Lucene change that Lucene doesn't benefit from is a tricky situation.I am just wondering if this is the best way for Lucene.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, the overrider for
PerFieldKnnVectorsFormatcould wrap a try/final directly inmergeOneField. I am not sure a new interface is required at all.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our desired functionality is to have a set number
Nthreads per merge. This way we won't have resource contention between merges. So if one field needs 50 singular merge operations, we should see50 * Nthreads spawning. Wrapping functionality around mergeOneField is difficult because we want toWe would really like to keep it to a "N threads per singular merge operation" as making it a set number of threads per field merge could lead to resource contention per thread. We don't really see an easy way to be this granular about how to detect and allocate resources per merge, since there are no hooks to see when a singular merge operation happens.