Default Stats Send Mode SAFE -> ALWAYS#18367
Default Stats Send Mode SAFE -> ALWAYS#18367satwik-pachigolla wants to merge 3 commits intoapache:masterfrom
Conversation
|
cc @dang-stripe |
|
@gortiz @suvodeep-pyne please add the |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18367 +/- ##
============================================
- Coverage 63.43% 63.42% -0.01%
Complexity 1683 1683
============================================
Files 3253 3253
Lines 198841 198842 +1
Branches 30795 30795
============================================
- Hits 126136 126124 -12
- Misses 62625 62643 +18
+ Partials 10080 10075 -5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
xiangfu0
left a comment
There was a problem hiding this comment.
Found one high-signal compatibility issue; see inline comment.
| /// running 1.3.0 may fail, which breaks backward compatibility. | ||
| public static final String KEY_OF_SEND_STATS_MODE = "pinot.query.mse.stats.mode"; | ||
| public static final String DEFAULT_SEND_STATS_MODE = "SAFE"; | ||
| public static final String DEFAULT_SEND_STATS_MODE = "ALWAYS"; |
There was a problem hiding this comment.
Changing the default here bypasses the SAFE compatibility check for every cluster that never set pinot.query.mse.stats.mode. SendStatsPredicate still documents that 1.3.x and lower can return incorrect stats or fail when unexpected upstream stats arrive, so this turns mixed-version rollouts into a behavior-breaking default change. This needs an explicit migration boundary or rollout plan instead of flipping the default constant.
There was a problem hiding this comment.
I don't think there's a safe way do this as an explicit migration boundary. This requires coordination between nodes on old versions (which we can't change) and new versions. Otherwise we'd need to use more stable mechanisms of relying on ZK metadata that would have existed as of <=1.3, none of which I think are suitable here.
#15890 is a documented case of how using ZK watchers led to more instability and the partial fix PR comments also mention that we should go to default ALWAYS eventually.
I think the existing ZK risk >> the risk of an MSE user upgrading from <= 1.3 to >= 1.5 without seeing this PR if we label it
I updated the PR description to make this more clear.
Summary
Improvements
This will further mitigate #15890 which was only partially mitigated in #15895 (comment) where there is still the risk of ZK instability risk for big clusters where operators haven't come across this setting.
Detailed Explanation
AI powered summary here.
In short,
Compatability