Probe cycle discrimination + scheduler health snapshot#50
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughImplements discriminated ISO probe cycle outcomes (CycleResult with SUCCESS/EMPTY/FAILED), publishes an immutable SchedulerSnapshot under a lock, records cycle status/error and probe stats, merges scheduler snapshot + allow-listed MQ + db_pool for /health extras (scheduler wins on conflicts), and expands tests/docs for safety and concurrency. ChangesDiscriminated Cycles and Health Snapshots
Sequence DiagramsequenceDiagram
participant Scheduler
participant ISOProber
participant HealthEndpoint
participant MessageQueue
ISOProber->>Scheduler: CycleResult = run_cycle()
alt Probe Success
ISOProber->>ISOProber: perform concurrent HEAD probes, collect hits
ISOProber-->>Scheduler: CycleResult(SUCCESS, results=hits)
else Probe Empty
ISOProber->>ISOProber: probes return no hits
ISOProber-->>Scheduler: CycleResult(EMPTY)
else Probe Failed
ISOProber->>ISOProber: exception during cycle
ISOProber-->>Scheduler: CycleResult(FAILED, error=msg)
end
Scheduler->>Scheduler: _probe_hits_from_cycle() (record status/error)
Scheduler->>Scheduler: _record_probe_cycle_completion() (store probe_stats)
alt non-FAILED cycle
Scheduler->>Scheduler: _mark_poll_successful_if_probe_ok()
end
Scheduler->>Scheduler: _publish_health_snapshot()
HealthEndpoint->>Scheduler: health_snapshot()
Scheduler-->>HealthEndpoint: SchedulerSnapshot dict
HealthEndpoint->>MessageQueue: health_fields() / depth()
MessageQueue-->>HealthEndpoint: MQ metrics dict
HealthEndpoint->>HealthEndpoint: merge snapshot + MQ + db_pool (scheduler wins)
HealthEndpoint-->>Client: /health response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/paperscout/monitor.py`:
- Around line 128-136: The health snapshot fallback uses
dict(_HEALTH_SNAPSHOT_DEFAULTS) which only performs a shallow copy and ends up
sharing the nested probe_stats dict; change the construction in the code that
creates the fallback (where dict(_HEALTH_SNAPSHOT_DEFAULTS) is used) to produce
a deep/independent copy of nested structures—either use copy.deepcopy on
_HEALTH_SNAPSHOT_DEFAULTS or rebuild the dict while explicitly creating a new
empty dict for "probe_stats" (e.g., {**_HEALTH_SNAPSHOT_DEFAULTS, "probe_stats":
{}}) so each consumer gets its own probe_stats instead of a shared mutable
object.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f831e1c1-21d8-46f5-b3dd-da55b443e852
📒 Files selected for processing (9)
CHANGELOG.mddocs/architecture.mdsrc/paperscout/__main__.pysrc/paperscout/models.pysrc/paperscout/monitor.pysrc/paperscout/sources.pytests/test_health.pytests/test_monitor.pytests/test_sources.py
|
@coderabbitai don't skip. full review |
|
Tip For best results, initiate chat on the files or code changes.
PR
|
| File | Location | Note |
|---|---|---|
sources.py |
run_cycle() |
The PROBE-CYCLE-SUMMARY JSON log is emitted after the PROBE-SUCCESS / PROBE-EMPTY log. Reversing the order would make it easier to correlate the summary line with the outcome line when grepping logs. |
monitor.py |
health_snapshot() |
copy.deepcopy(_HEALTH_SNAPSHOT_DEFAULTS) is called on every request when _health_snapshot is None. Since defaults are constant, a shallow dict(_HEALTH_SNAPSHOT_DEFAULTS) is sufficient (no nested mutable values that need deep-copying). |
models.py |
CycleResult.hits docstring |
"otherwise empty" — consider clarifying it also returns empty for EMPTY status, not just FAILED. |
Verdict
The approach is sound. The staleness-clock fix (not advancing _last_successful_poll on FAILED) and the lock-protected snapshot are the right patterns. The main actionable items are:
- Unreachable
return Nonein_probe_one(trivial). - Missing wildcard case in
_probe_hits_from_cycle(low-risk but important for forward safety). - Python version check for
match/case(must verify). - Redundant
_record_probe_cycle_completion()calls in seed paths (cleanup). CycleResult.__post_init__validation (defensive programming).
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/paperscout/scout.py (1)
58-67: ⚡ Quick winKeep
mq_utilizationnormalized to[0.0, 1.0].
MessageQueueis unbounded today, sodepth / mq_max_sizecan exceed1.0under load. Clamping keeps dashboard semantics stable.Proposed patch
def health_fields(self) -> dict[str, Any]: """Metrics for the ``/health`` endpoint (merged by ``__main__``).""" d = self.depth() m = settings.mq_max_size + util = (d / m) if m else 0.0 return { "mq_depth": d, "mq_max_size": m, - "mq_utilization": round(d / m, 4) if m else 0.0, + "mq_utilization": round(min(max(util, 0.0), 1.0), 4), "mq_circuit_state": "closed", }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/paperscout/scout.py` around lines 58 - 67, The mq_utilization calculation in health_fields can exceed 1.0 because MessageQueue is unbounded; change the computation in scout.health_fields to compute utilization = (d / m) if m else 0.0 and then clamp it to [0.0, 1.0] (e.g., utilization = min(1.0, max(0.0, utilization))) before rounding; keep the existing keys ("mq_depth","mq_max_size","mq_utilization","mq_circuit_state") and use settings.mq_max_size and self.depth() as the sources.tests/test_main_health_merge.py (1)
59-62: ⚡ Quick winAssert all allow-listed MQ fields in the inclusion test.
You already pass
mq_utilizationandmq_circuit_stateinmq_extra; asserting them here closes a regression gap in allow-list behavior.Proposed patch
def test_merge_includes_allowlisted_mq_fields(): @@ out = _merge_extra_health_fields(scheduler, mq_extra, {}) assert out["mq_depth"] == 2 assert out["mq_max_size"] == 1000 + assert out["mq_utilization"] == 0.002 + assert out["mq_circuit_state"] == "closed"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_main_health_merge.py` around lines 59 - 62, The test currently only asserts mq_depth and mq_max_size after calling _merge_extra_health_fields(scheduler, mq_extra, {}); update the test to also assert that the allow-listed MQ fields passed in mq_extra—specifically mq_utilization and mq_circuit_state—are present and equal to the values provided in mq_extra, so modify the assertions in the test (around the call to _merge_extra_health_fields) to include checks for out["mq_utilization"] and out["mq_circuit_state"] matching the input.src/paperscout/__main__.py (1)
41-48: ⚡ Quick winGuard MQ health extraction so one MQ failure doesn’t drop all extras.
If
mq.health_fields()raises,_extra_health_fields()fails and/healthfalls back to base-only. Catching here keeps scheduler/db pool extras available.Proposed patch
def _mq_health_fields(mq: MessageQueue) -> dict: """MQ metrics for /health; from health_fields() when present, else depth only.""" if hasattr(mq, "health_fields"): - raw = mq.health_fields() + try: + raw = mq.health_fields() + except Exception: + log.exception("health: mq.health_fields() failed, using mq_depth only") + return {"mq_depth": mq.depth()} if isinstance(raw, dict): return raw log.warning("health: mq.health_fields() returned non-dict, using mq_depth only") return {"mq_depth": mq.depth()}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/paperscout/__main__.py` around lines 41 - 48, Wrap the call to mq.health_fields() in a try/except so a thrown exception from any MessageQueue implementation doesn’t cancel gathering other extras; in _mq_health_fields, call mq.health_fields() inside a try block, catch Exception, log a warning including the exception details and the MQ identity, and then fall back to returning {"mq_depth": mq.depth()} (preserving the existing non-dict check path). Ensure you reference _mq_health_fields and use mq.health_fields and mq.depth in the fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/paperscout/__main__.py`:
- Around line 41-48: Wrap the call to mq.health_fields() in a try/except so a
thrown exception from any MessageQueue implementation doesn’t cancel gathering
other extras; in _mq_health_fields, call mq.health_fields() inside a try block,
catch Exception, log a warning including the exception details and the MQ
identity, and then fall back to returning {"mq_depth": mq.depth()} (preserving
the existing non-dict check path). Ensure you reference _mq_health_fields and
use mq.health_fields and mq.depth in the fix.
In `@src/paperscout/scout.py`:
- Around line 58-67: The mq_utilization calculation in health_fields can exceed
1.0 because MessageQueue is unbounded; change the computation in
scout.health_fields to compute utilization = (d / m) if m else 0.0 and then
clamp it to [0.0, 1.0] (e.g., utilization = min(1.0, max(0.0, utilization)))
before rounding; keep the existing keys
("mq_depth","mq_max_size","mq_utilization","mq_circuit_state") and use
settings.mq_max_size and self.depth() as the sources.
In `@tests/test_main_health_merge.py`:
- Around line 59-62: The test currently only asserts mq_depth and mq_max_size
after calling _merge_extra_health_fields(scheduler, mq_extra, {}); update the
test to also assert that the allow-listed MQ fields passed in
mq_extra—specifically mq_utilization and mq_circuit_state—are present and equal
to the values provided in mq_extra, so modify the assertions in the test (around
the call to _merge_extra_health_fields) to include checks for
out["mq_utilization"] and out["mq_circuit_state"] matching the input.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8da341de-1054-40e5-be6d-4857363b9673
📒 Files selected for processing (13)
src/paperscout/__main__.pysrc/paperscout/config.pysrc/paperscout/health.pysrc/paperscout/models.pysrc/paperscout/monitor.pysrc/paperscout/scout.pysrc/paperscout/sources.pytests/conftest.pytests/test_health.pytests/test_main_health_merge.pytests/test_message_queue.pytests/test_models_cycle_result.pytests/test_monitor.py
💤 Files with no reviewable changes (1)
- src/paperscout/sources.py
✅ Files skipped from review due to trivial changes (1)
- tests/conftest.py
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/test_message_queue.py (1)
33-41: 💤 Low valueConsider asserting the patched
mq_max_sizevalue.The test correctly verifies clamping behavior, but adding an assertion for
fields["mq_max_size"] == 2would confirm the patch was applied successfully and make the test's setup more explicit.✨ Optional enhancement
assert fields["mq_depth"] == 5 + assert fields["mq_max_size"] == 2 assert fields["mq_utilization"] == 1.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_message_queue.py` around lines 33 - 41, In the test_health_fields_clamps_utilization_when_depth_exceeds_max add an assertion that the patched mq_max_size was applied by asserting fields["mq_max_size"] == 2 after calling mq.health_fields(); this confirms the patch on paperscout.scout.settings.mq_max_size took effect and makes the test setup explicit while keeping the existing assertions (fields["mq_depth"] and fields["mq_utilization"]) intact.src/paperscout/__main__.py (1)
46-54: ⚡ Quick winGuard fallback
mq.depth()reads so/healthextras remain available.If
mq.depth()fails on Line 54,extra_fields_fn()fails and the handler drops all extras (including scheduler snapshot). Keep_mq_health_fields()exception-safe in both fallback paths.Proposed defensive patch
def _mq_health_fields(mq: MessageQueue) -> dict: """MQ metrics for /health; from health_fields() when present, else depth only.""" if hasattr(mq, "health_fields"): try: raw = mq.health_fields() except Exception as exc: log.warning( "health: mq.health_fields() failed for %s id=%s: %s", type(mq).__name__, id(mq), exc, exc_info=True, ) - return {"mq_depth": mq.depth()} + try: + return {"mq_depth": mq.depth()} + except Exception: + log.warning("health: mq.depth() fallback failed; omitting MQ fields", exc_info=True) + return {} if isinstance(raw, dict): return raw log.warning("health: mq.health_fields() returned non-dict, using mq_depth only") - return {"mq_depth": mq.depth()} + try: + return {"mq_depth": mq.depth()} + except Exception: + log.warning("health: mq.depth() failed; omitting MQ fields", exc_info=True) + return {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/paperscout/__main__.py` around lines 46 - 54, The fallback call to mq.depth() can itself raise and currently will propagate out of extra_fields_fn()/ _mq_health_fields(), dropping all health "extras"; wrap any call to mq.depth() (the one inside the exception handler and the one inside _mq_health_fields()) in its own try/except and return a safe sentinel (e.g. None or "unknown") or omit that field on error, while still logging the depth failure (include exc_info=True) so the rest of the extras (scheduler snapshot, etc.) are preserved; update both the except block shown and the _mq_health_fields() implementation to be exception-safe around mq.depth().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/paperscout/__main__.py`:
- Around line 46-54: The fallback call to mq.depth() can itself raise and
currently will propagate out of extra_fields_fn()/ _mq_health_fields(), dropping
all health "extras"; wrap any call to mq.depth() (the one inside the exception
handler and the one inside _mq_health_fields()) in its own try/except and return
a safe sentinel (e.g. None or "unknown") or omit that field on error, while
still logging the depth failure (include exc_info=True) so the rest of the
extras (scheduler snapshot, etc.) are preserved; update both the except block
shown and the _mq_health_fields() implementation to be exception-safe around
mq.depth().
In `@tests/test_message_queue.py`:
- Around line 33-41: In the
test_health_fields_clamps_utilization_when_depth_exceeds_max add an assertion
that the patched mq_max_size was applied by asserting fields["mq_max_size"] == 2
after calling mq.health_fields(); this confirms the patch on
paperscout.scout.settings.mq_max_size took effect and makes the test setup
explicit while keeping the existing assertions (fields["mq_depth"] and
fields["mq_utilization"]) intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a757c472-99ee-44c9-a03f-83117e29ff55
📒 Files selected for processing (4)
src/paperscout/__main__.pysrc/paperscout/scout.pytests/test_main_health_merge.pytests/test_message_queue.py
Summary
run_cycle()returnsCycleResult(success/empty/failed) instead of a list./healthreads scheduler state from a locked, immutable snapshot (last_updated,last_cycle_status, etc.).Why
Operators could not tell a real empty probe from a failed one, and the health thread read scheduler fields without synchronization.
Main changes
models.py—CycleStatus,CycleResultsources.py— discriminatedrun_cycle(), new log linesmonitor.py— handle all cycle outcomes; do not advance staleness onFAILED;SchedulerSnapshot+health_snapshot()__main__.py— health extras fromscheduler.health_snapshot()Tests
138 passed (includes empty/failed cycle, FAILED staleness, concurrent health snapshot).
Related issues
close #46
close #49
Summary by CodeRabbit
New Features
Behavior
Documentation
Tests