[core] Add committer-side bucket consistency check #7793
Conversation
```java
try {
    Set<BinaryRow> remainingPartitions = new HashSet<>(changedPartitions);
    Map<BinaryRow, Integer> totalBuckets = new HashMap<>();
    FileStoreScan freshScan = scanSupplier.get().dropStats();
```
```java
.checkSameBucket(
        bucketMode() == BucketMode.HASH_FIXED
                && options.writeOnly()
                && !options.bucketAppendOrdered());
```
FileStoreCommitImpl already has options, can you just check it inside?
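The suggestion above can be sketched as follows. This is a simplified, hypothetical illustration (the method and class names are not the actual Paimon code): instead of computing the flag at the call site and passing it into `checkSameBucket(...)`, derive it inside the commit implementation from the options it already holds.

```java
// Hypothetical sketch: derive the "check same bucket" flag from values the
// commit implementation already has, rather than threading it through the
// builder call site.
class SameBucketFlag {
    static boolean enabled(boolean hashFixedBucket, boolean writeOnly, boolean bucketAppendOrdered) {
        // Only fixed hash-bucket tables in write-only mode with unordered
        // appends skip the writer-side restore, so only they need the
        // committer-side consistency check.
        return hashFixedBucket && writeOnly && !bucketAppendOrdered;
    }
}
```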
CC @JingsongLi, please take another look
leaves12138
left a comment
Thanks for the PR. The goal makes sense: for fixed-bucket append tables with write-only=true and unordered appends, the committer should reject new APPEND files whose totalBuckets differs from the existing layout, unless the user first performs an INSERT OVERWRITE rescale.
I found one correctness issue that I think should be fixed before merge.
sameBucketCheckedPartitions is cached across commits, but the cache is not tied to either the latest snapshot id or the bucket number that was checked. This can skip a required check after an overwrite/rescale by another committer.
A possible sequence:
1. A long-lived streaming committer writes partition `p` with bucket num `2`.
2. A later APPEND from the same committer checks `p` against the latest snapshot and records `p` in `sameBucketCheckedPartitions`.
3. A batch job performs the recommended rescale flow: `INSERT OVERWRITE` the partition/table with bucket num `4`.
4. The old streaming committer is still alive and appends to `p` with bucket num `2` again.
At step 4, collectUncheckedBucketPartitions skips p because it is already cached, so checkSameBucketFromSnapshot does not call readTotalBuckets and the APPEND can succeed with the old bucket num. checkBucketKeepSame has the same skip when conflict detection is enabled. That leaves files with mixed totalBuckets after an overwrite rescale, which is exactly what this PR is trying to prevent.
I think the cache should either be removed, or made safe by tying it to the snapshot/layout that was actually checked. For example, invalidate/recheck cached partitions when the latest snapshot advances in a way that may change layout, or store enough information to verify the cached bucket number against the current snapshot. The simplest safe implementation may be to always read the current total bucket for the changed partitions; readTotalBuckets already stops after finding one file per partition.
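The "tie the cache to the snapshot that was actually checked" option could look roughly like the following. This is a minimal standalone sketch with invented names, not the real `sameBucketCheckedPartitions` code: a cached entry is only trusted if the latest snapshot id has not advanced since the check was performed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a snapshot-tied check cache: an entry is stale as
// soon as the latest snapshot advances, because another committer may have
// rescaled the partition via INSERT OVERWRITE in the meantime.
class CheckedBucketCache {
    // partition -> snapshot id at which its bucket number was last verified
    private final Map<String, Long> checkedAtSnapshot = new HashMap<>();

    /** Returns true if the partition still needs a bucket-consistency check. */
    boolean needsCheck(String partition, long latestSnapshotId) {
        Long checked = checkedAtSnapshot.get(partition);
        // Unseen partition, or the snapshot advanced since the last check.
        return checked == null || checked < latestSnapshotId;
    }

    /** Records that the partition was verified against the given snapshot. */
    void markChecked(String partition, long latestSnapshotId) {
        checkedAtSnapshot.put(partition, latestSnapshotId);
    }
}
```

This trades extra manifest reads after every snapshot advance for correctness; the always-read alternative mentioned above avoids the cache entirely.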
Could you also add a regression test for this case? Something like: use the same StreamTableCommit to append to one partition twice so the cache is populated, use another committer/table copy to withOverwrite({"f0": "1"}) the same partition with a different bucket num, then verify the old committer's next APPEND with the old bucket num fails.
I attempted to run a focused BucketedAppendFileStoreWriteTest, but this environment could not resolve the current paimon-arrow:1.5-SNAPSHOT test dependency from the configured snapshot repository, so I could not complete local test validation here.
Hi @leaves12138, thank you for your comment. The situation you described can certainly happen, but our current check logic does not verify that type of situation either. Take the writer-side bucket rescale detection as an example: it stops detecting once the WriterContainer has been initialized, so the scenario you mention is similarly unavoidable there. When one job performs a normal write while another performs an "insert overwrite rescale", the expected outcome is that the data is overwritten; this is not considered a normal or intended write pattern. The cache's primary purpose is to prevent the commit side from repeatedly scanning manifest files and incurring additional cost.
+1 |
Purpose
Add committer-side bucket consistency validation for write-only unordered append tables.
After #6741, when `bucket-append-ordered=false` and `write-only=true`, writers skip restoring previous files, so bucket-count validation can be bypassed after a bucket rescale. This change adds an internal commit-side `checkSameBucket` path for fixed hash bucket tables that validates touched partitions before committing.

The check is integrated with `ConflictDetection`, reuses the existing conflict path when available, and uses a bounded partition cache to avoid repeatedly checking the same partition within one committer lifecycle.

Tests
- `prepareCommit` succeeds when the bucket count changes.
- `commit` fails for an existing partition with a mismatched bucket count.
- `INSERT INTO` fails after the bucket count changes.
- `INSERT OVERWRITE` succeeds for rescaling.
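The core of the committer-side validation can be sketched as below. This is an illustrative standalone model, not the actual implementation: a plain map stands in for the totalBuckets that `readTotalBuckets` would recover from the latest snapshot's manifests, and the check rejects any partition whose new APPEND files carry a different bucket number.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the commit-side checkSameBucket path: compare the
// bucket number of newly committed files per partition against the bucket
// number already recorded for that partition, and fail on mismatch.
class CommitBucketCheck {
    /**
     * @param existing  partition -> totalBuckets recorded in the latest snapshot
     * @param committed partition -> totalBuckets of the new APPEND files
     * @throws IllegalStateException if a partition's bucket number changed
     */
    static void checkSameBucket(Map<String, Integer> existing, Map<String, Integer> committed) {
        for (Map.Entry<String, Integer> e : committed.entrySet()) {
            Integer old = existing.get(e.getKey());
            if (old != null && !old.equals(e.getValue())) {
                throw new IllegalStateException(
                        "Bucket number of partition " + e.getKey() + " changed from " + old
                                + " to " + e.getValue()
                                + "; rescale with INSERT OVERWRITE instead of INSERT INTO.");
            }
        }
    }
}
```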