Skip table size aggregation for disabled tables in SegmentStatusChecker#18343
Skip table size aggregation for disabled tables in SegmentStatusChecker#183431fanwang wants to merge 2 commits intoapache:masterfrom
Conversation
When SegmentStatusChecker.processTable runs against a disabled table,
updateSegmentMetrics correctly early-returns and tags the table in
context._disabledTables, but updateTableSizeMetrics still runs
unconditionally. This issues a /table/{name}/size scatter-gather to
every server, and because servers do not load segments for disabled
tables, every server responds 404. CompletionServiceHelper logs each
404 at ERROR level, flooding controller logs whenever the controller
UI (or any caller) refreshes table state.
Read context._disabledTables in processTable and skip the size call
when the table is disabled. The set is already populated by
updateSegmentMetrics, so no extra ZK lookup is needed.
Also extends disabledTableTest to assert TableSizeReader is never
invoked for a disabled table, locking in the fix as a regression
test.
Fixes apache#14279
3089e52 to
37c05b3
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18343 +/- ##
============================================
+ Coverage 63.39% 63.41% +0.01%
Complexity 1679 1679
============================================
Files 3253 3253
Lines 198764 198768 +4
Branches 30791 30792 +1
============================================
+ Hits 126012 126044 +32
+ Misses 62678 62653 -25
+ Partials 10074 10071 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
noob-se7en
left a comment
There was a problem hiding this comment.
0 critical / 0 major / 1 scope question — see inline.
| updateTableSizeMetrics(tableNameWithType); | ||
| // Skip table size aggregation for disabled tables. Servers do not load segments for disabled tables, so the | ||
| // /table/{name}/size scatter-gather would otherwise return 404 from every server and flood controller logs | ||
| // with ERROR-level entries. updateSegmentMetrics has already populated context._disabledTables. |
There was a problem hiding this comment.
the null-IdealState early-return at ~L221 doesn't populate _disabledTables, so this guard still lets that path through to updateTableSizeMetrics — same 404 pattern, no? worth handling here too, or calling out as out-of-scope.
There was a problem hiding this comment.
Good catch — the null-IdealState branch had the same shape but a different downstream symptom (getServerToSegmentsMap throws IllegalStateException, which lands in processTable's outer catch as an ERROR + stack trace per table per cycle). Generalized the guard in 1ca4517: introduced _tablesToSkipSizeAggregation as a superset of _disabledTables so both idealState == null and !idealState.isEnabled() branches add to it, and processTable checks the new set. Kept _disabledTables distinct so the DISABLED_TABLE_COUNT gauge isn't inflated by null-IdealState entries. New nullIdealStateTest covers it.
Address review feedback on PR apache#18343: The null-IdealState early-return at SegmentStatusChecker.java:~221 doesn't populate `_disabledTables`, so the original guard still let that path through to `updateTableSizeMetrics`. The downstream `getServerToSegmentsMap` throws `IllegalStateException` when IdealState is missing, propagating to processTable's outer catch as an ERROR + stack trace per table per cycle — same controller-log flood pattern that apache#14279 reports for disabled tables. Introduce `_tablesToSkipSizeAggregation` as a superset of `_disabledTables` so both branches can opt out of size aggregation without inflating the disabled-table count gauge. The `updateSegmentMetrics` paths for null-IdealState and disabled-table both add to it; `processTable` consults this set instead of `_disabledTables`. Adds `SegmentStatusCheckerTest.nullIdealStateTest` to lock the regression and confirm the disabled-table-count gauge is unaffected. 24/24 SegmentStatusCheckerTest pass under `./mvnw -pl pinot-controller -Dtest=SegmentStatusCheckerTest test`.
Problem
Per issue #14279, controller logs flood with ERROR-level entries from
CompletionServiceHelperwhenever the controller UI (or any caller) refreshes table state, with one entry per server per disabled table per refresh:This trips alerts wired to ERROR-level controller logs.
Root cause
SegmentStatusChecker.processTablecalls three sub-methods:updateSegmentMetricsalready detects disabled tables (!idealState.isEnabled()) and tags them incontext._disabledTablesbefore returning.updateTableSizeMetricsruns regardless and issues a/table/{name}/sizescatter-gather to every server. Because servers do not load segments for disabled tables, every server responds 404, andCompletionServiceHelperlogs each as ERROR.This matches @Jackie-Jiang's framing on the issue: "the root problem is not the log level, but Pinot not able to gracefully handle reading size for disabled tables."
Fix
Read
context._disabledTablesinprocessTableand skipupdateTableSizeMetricswhen the table is disabled. The set is already populated byupdateSegmentMetrics, so the guard adds no extra ZK lookup or controller round-trip.Test
Extends the existing
disabledTableTestwithverify(tableSizeReader, never()).getTableSizeDetails(...)to lock in the regression. Refactors therunSegmentStatusCheckerhelper to return theTableSizeReadermock so the assertion has something to verify against; existing callers ignore the return value../mvnw -pl pinot-controller -am -Dtest=SegmentStatusCheckerTest -Dsurefire.failIfNoSpecifiedTests=false test→ 23/23 pass (disabledTableTest1.4s).Pre-commit checks pass:
spotless:apply,checkstyle:check,license:check.Fixes #14279