Skip table size aggregation for disabled tables in SegmentStatusChecker#18343

Open
1fanwang wants to merge 2 commits into apache:master from 1fanwang:fix/14279-disabled-table-size

@1fanwang

Problem

Per issue #14279, controller logs flood with ERROR-level entries from CompletionServiceHelper whenever the controller UI (or any caller) refreshes table state, with one entry per server per disabled table per refresh:

ERROR [CompletionServiceHelper] Server: Server_pinot-server-0... returned error: 404, reason: Not Found for uri: http://pinot-server-0...:8097/table/<table>_OFFLINE/size
ERROR [CompletionServiceHelper] Server: Server_pinot-server-1... returned error: 404, reason: Not Found for uri: http://pinot-server-1...:8097/table/<table>_OFFLINE/size
ERROR [CompletionServiceHelper] Server: Server_pinot-server-2... returned error: 404, reason: Not Found for uri: http://pinot-server-2...:8097/table/<table>_OFFLINE/size

This trips alerts wired to ERROR-level controller logs.

Root cause

SegmentStatusChecker.processTable calls three sub-methods:

updateTableConfigMetrics(...);
updateSegmentMetrics(...);   // early-returns for disabled tables
updateTableSizeMetrics(...); // runs unconditionally

updateSegmentMetrics already detects disabled tables (!idealState.isEnabled()) and tags them in context._disabledTables before returning. updateTableSizeMetrics runs regardless and issues a /table/{name}/size scatter-gather to every server. Because servers do not load segments for disabled tables, every server responds 404, and CompletionServiceHelper logs each as ERROR.

This matches @Jackie-Jiang's framing on the issue: "the root problem is not the log level, but Pinot not able to gracefully handle reading size for disabled tables."

Fix

Read context._disabledTables in processTable and skip updateTableSizeMetrics when the table is disabled. The set is already populated by updateSegmentMetrics, so the guard adds no extra ZK lookup or controller round-trip.

updateSegmentMetrics(tableNameWithType, tableConfig, context);
if (!context._disabledTables.contains(tableNameWithType)) {
  updateTableSizeMetrics(tableNameWithType);
}
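The guard can be exercised in isolation with a minimal sketch. It assumes a stripped-down stand-in for SegmentStatusChecker: only processTable, updateSegmentMetrics, updateTableSizeMetrics and _disabledTables come from the PR. The class name, the boolean idealStateEnabled parameter and the sizeRequests list are hypothetical, added for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for SegmentStatusChecker, not the real Pinot class.
class DisabledTableGuardSketch {
  static class Context {
    final Set<String> _disabledTables = new HashSet<>();
  }

  // Illustrative only: records each table that would trigger a size scatter-gather.
  final List<String> sizeRequests = new ArrayList<>();

  void updateSegmentMetrics(String tableNameWithType, boolean idealStateEnabled, Context context) {
    if (!idealStateEnabled) {
      // Tag the table and return early, as the PR describes.
      context._disabledTables.add(tableNameWithType);
      return;
    }
    // Segment-level metrics would be computed here.
  }

  void updateTableSizeMetrics(String tableNameWithType) {
    sizeRequests.add(tableNameWithType);
  }

  void processTable(String tableNameWithType, boolean idealStateEnabled, Context context) {
    updateSegmentMetrics(tableNameWithType, idealStateEnabled, context);
    // The guard from the PR: skip size aggregation for disabled tables.
    if (!context._disabledTables.contains(tableNameWithType)) {
      updateTableSizeMetrics(tableNameWithType);
    }
  }

  public static void main(String[] args) {
    DisabledTableGuardSketch checker = new DisabledTableGuardSketch();
    Context context = new Context();
    checker.processTable("enabledTable_OFFLINE", true, context);
    checker.processTable("disabledTable_OFFLINE", false, context);
    System.out.println(checker.sizeRequests); // prints [enabledTable_OFFLINE]
  }
}
```

Only the enabled table reaches updateTableSizeMetrics, so in this sketch no size request is issued for the disabled one.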

Test

Extends the existing disabledTableTest with verify(tableSizeReader, never()).getTableSizeDetails(...) to lock in the regression. Refactors the runSegmentStatusChecker helper to return the TableSizeReader mock so the assertion has something to verify against; existing callers ignore the return value.
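The shape of that assertion can be sketched without Mockito. This is not the code in SegmentStatusCheckerTest: it swaps the mock for a hand-written counting fake, and SizeReader, CountingSizeReader, runChecker and the single-argument getTableSizeDetails are simplified, hypothetical stand-ins. A call count of zero plays the role of verify(..., never()).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-ins; the real test uses a Mockito mock of TableSizeReader.
class NeverInvokedSketch {
  interface SizeReader {
    long getTableSizeDetails(String tableNameWithType);
  }

  // Counting fake: remembers how many times the size call was made.
  static class CountingSizeReader implements SizeReader {
    final AtomicInteger calls = new AtomicInteger();

    @Override
    public long getTableSizeDetails(String tableNameWithType) {
      calls.incrementAndGet();
      return 0L;
    }
  }

  // Mirrors the refactored helper: it returns the reader so the caller can assert on it.
  static CountingSizeReader runChecker(String tableNameWithType, boolean tableEnabled) {
    CountingSizeReader reader = new CountingSizeReader();
    if (tableEnabled) {
      reader.getTableSizeDetails(tableNameWithType);
    }
    return reader;
  }

  public static void main(String[] args) {
    CountingSizeReader reader = runChecker("disabledTable_OFFLINE", false);
    if (reader.calls.get() != 0) {
      throw new AssertionError("size reader must not be invoked for a disabled table");
    }
    System.out.println("size reader calls: " + reader.calls.get()); // prints size reader calls: 0
  }
}
```

Returning the reader from the helper is what makes the assertion possible; callers that do not care about it can discard the return value.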

./mvnw -pl pinot-controller -am -Dtest=SegmentStatusCheckerTest -Dsurefire.failIfNoSpecifiedTests=false test

23/23 pass (disabledTableTest 1.4s).

Pre-commit checks pass: spotless:apply, checkstyle:check, license:check.

Fixes #14279

When SegmentStatusChecker.processTable runs against a disabled table,
updateSegmentMetrics correctly early-returns and tags the table in
context._disabledTables, but updateTableSizeMetrics still runs
unconditionally. This issues a /table/{name}/size scatter-gather to
every server, and because servers do not load segments for disabled
tables, every server responds 404. CompletionServiceHelper logs each
404 at ERROR level, flooding controller logs whenever the controller
UI (or any caller) refreshes table state.

Read context._disabledTables in processTable and skip the size call
when the table is disabled. The set is already populated by
updateSegmentMetrics, so no extra ZK lookup is needed.

Also extends disabledTableTest to assert TableSizeReader is never
invoked for a disabled table, locking in the fix as a regression
test.

Fixes apache#14279
@1fanwang 1fanwang force-pushed the fix/14279-disabled-table-size branch from 3089e52 to 37c05b3 Compare April 27, 2026 09:01
@codecov-commenter

codecov-commenter commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.41%. Comparing base (9581549) to head (37c05b3).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18343      +/-   ##
============================================
+ Coverage     63.39%   63.41%   +0.01%     
  Complexity     1679     1679              
============================================
  Files          3253     3253              
  Lines        198764   198768       +4     
  Branches      30791    30792       +1     
============================================
+ Hits         126012   126044      +32     
+ Misses        62678    62653      -25     
+ Partials      10074    10071       -3     
Flag                 Coverage Δ
custom-integration1  100.00% <ø> (ø)
integration          100.00% <ø> (ø)
integration1         100.00% <ø> (ø)
integration2         0.00% <ø> (ø)
java-21              63.41% <100.00%> (+0.01%) ⬆️
temurin              63.41% <100.00%> (+0.01%) ⬆️
unittests            63.40% <100.00%> (+0.01%) ⬆️
unittests1           55.37% <ø> (+0.02%) ⬆️
unittests2           34.94% <100.00%> (+<0.01%) ⬆️

Contributor

@noob-se7en noob-se7en left a comment


0 critical / 0 major / 1 scope question — see inline.

updateTableSizeMetrics(tableNameWithType);
// Skip table size aggregation for disabled tables. Servers do not load segments for disabled tables, so the
// /table/{name}/size scatter-gather would otherwise return 404 from every server and flood controller logs
// with ERROR-level entries. updateSegmentMetrics has already populated context._disabledTables.
Contributor


the null-IdealState early-return at ~L221 doesn't populate _disabledTables, so this guard still lets that path through to updateTableSizeMetrics — same 404 pattern, no? worth handling here too, or calling out as out-of-scope.

Author


Good catch — the null-IdealState branch had the same shape but a different downstream symptom (getServerToSegmentsMap throws IllegalStateException, which lands in processTable's outer catch as an ERROR + stack trace per table per cycle). Generalized the guard in 1ca4517: introduced _tablesToSkipSizeAggregation as a superset of _disabledTables so both idealState == null and !idealState.isEnabled() branches add to it, and processTable checks the new set. Kept _disabledTables distinct so the DISABLED_TABLE_COUNT gauge isn't inflated by null-IdealState entries. New nullIdealStateTest covers it.
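The two-set arrangement can be sketched as follows. Only _disabledTables and _tablesToSkipSizeAggregation are names from this thread. The IdealStateStatus enum, the class name and shouldAggregateSize are hypothetical simplifications, and the gauge is modeled simply as the size of _disabledTables.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the generalized guard, not the code from commit 1ca4517.
class SkipSetSketch {
  // Illustrative stand-in for the three IdealState cases discussed in the review.
  enum IdealStateStatus { MISSING, DISABLED, ENABLED }

  static class Context {
    final Set<String> _disabledTables = new HashSet<>();
    // Superset of _disabledTables: also holds tables whose IdealState is null.
    final Set<String> _tablesToSkipSizeAggregation = new HashSet<>();
  }

  static void updateSegmentMetrics(String tableNameWithType, IdealStateStatus status, Context context) {
    if (status == IdealStateStatus.MISSING) {
      // Skip size aggregation, but do not count the table as disabled.
      context._tablesToSkipSizeAggregation.add(tableNameWithType);
      return;
    }
    if (status == IdealStateStatus.DISABLED) {
      context._disabledTables.add(tableNameWithType);
      context._tablesToSkipSizeAggregation.add(tableNameWithType);
      return;
    }
    // Segment-level metrics would be computed here for enabled tables.
  }

  // What processTable would consult before calling updateTableSizeMetrics.
  static boolean shouldAggregateSize(String tableNameWithType, Context context) {
    return !context._tablesToSkipSizeAggregation.contains(tableNameWithType);
  }

  public static void main(String[] args) {
    Context context = new Context();
    updateSegmentMetrics("missing_OFFLINE", IdealStateStatus.MISSING, context);
    updateSegmentMetrics("disabled_OFFLINE", IdealStateStatus.DISABLED, context);
    updateSegmentMetrics("enabled_OFFLINE", IdealStateStatus.ENABLED, context);
    System.out.println(context._disabledTables.size()); // prints 1
    System.out.println(shouldAggregateSize("missing_OFFLINE", context)); // prints false
    System.out.println(shouldAggregateSize("enabled_OFFLINE", context)); // prints true
  }
}
```

Keeping the sets separate means a value derived from _disabledTables counts only genuinely disabled tables, while the skip decision covers both early-return branches.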

Address review feedback on PR apache#18343:

The null-IdealState early-return at SegmentStatusChecker.java:~221
doesn't populate `_disabledTables`, so the original guard still let
that path through to `updateTableSizeMetrics`. The downstream
`getServerToSegmentsMap` throws `IllegalStateException` when
IdealState is missing, propagating to processTable's outer catch as
an ERROR + stack trace per table per cycle — same controller-log
flood pattern that apache#14279 reports for disabled tables.

Introduce `_tablesToSkipSizeAggregation` as a superset of
`_disabledTables` so both branches can opt out of size aggregation
without inflating the disabled-table count gauge. The
`updateSegmentMetrics` paths for null-IdealState and disabled-table
both add to it; `processTable` consults this set instead of
`_disabledTables`.

Adds `SegmentStatusCheckerTest.nullIdealStateTest` to lock the
regression and confirm the disabled-table-count gauge is unaffected.

24/24 SegmentStatusCheckerTest pass under
`./mvnw -pl pinot-controller -Dtest=SegmentStatusCheckerTest test`.
Development

Successfully merging this pull request may close these issues.

Gracefully handle reading size for disabled table

3 participants