[cuebot] Fix Silent frame double booking#2441
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Fixes issue AcademySoftwareFoundation#2438. Production changes - **`DispatchSupportService.java`** — Fix A (kill-before-release): `lostProc` now issues a synchronous `rqdClient.killFrame(proc, …)` **before** `unbookProc`/frame reset, in a try/catch so a genuinely dead host still releases. Skips re-kill on `EXIT_STATUS_FAILED_KILL`. Added `@Autowired Environment env`. - **`HostReportHandler.java`** — Fix B (strict `resource_id` fencing): the grace period now shields **only** the `proc == null` book→first-report window. When a reported frame's `resource_id` maps to a proc on a *different* frame, the zombie frame is killed **immediately** (no 120s grace / 300s `isOrphan` wait), and the proc/its current frame are never touched (safe under proc-reuse, since the RQD kill is frame-scoped). Removed the unsafe inline orphan-clear; extracted a `killUnverifiedRunningFrame` helper; grace period is now configurable. - **`opencue.properties`** (main + test) — three new tunables, all defaulting to the prior behavior: `kill_running_frame_before_release_enabled=true`, `frame_verification_strict_fencing_enabled=true`, `frame_verification_grace_period_seconds=120`. - **`DispatchSupportServiceLostProcTests`** (new, pure JUnit+Mockito) — **4 tests, passing locally**: kill-before-release ordering, FAILED_KILL skip, release-when-kill-throws, property-disabled. (Pure-unit because `lostProc` is `@Transactional(NOT_SUPPORTED)`, which fights the embedded-DB harness.) - **`HostReportHandlerTests`** (+3) and **`ProcDaoTests`** (+1) — fencing/grace/proc-reuse coverage.
a3583cb to
e354d60
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.java`:
- Around line 540-542: In the catch block that handles the kill-before-release
failure (the block logging "kill-before-release failed for"), change the caught
exception type from Throwable to Exception. This ensures that fatal JVM errors
like OutOfMemoryError and StackOverflowError will not be swallowed and the
release flow will properly propagate these critical errors instead of silently
catching and logging them.
In `@cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java`:
- Around line 1087-1093: The jobManager.getFrameDetail call in the
killUnverifiedRunningFrame method lacks exception handling for cases where the
frame cannot be found (late or stale reports). Add a try-catch block around the
getFrameDetail lookup to explicitly handle missing frames, log an appropriate
message when a frame is not found, and return early from the method rather than
allowing the exception to propagate and interrupt host report processing.
- Around line 1035-1041: Replace the broad catch (Exception e) block with a more
specific exception handler that distinguishes between a missing proc and
transient data access failures. Catch EmptyResultDataAccessException
specifically to set the message "Virtual proc did not exist" only when the proc
truly doesn't exist, and handle or rethrow other exceptions (like transient data
access failures) separately to prevent false kill decisions based on temporary
lookup failures in the try block containing hostManager.getVirtualProc().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 566cc23b-a2e8-421f-8dc8-41ac1c501aba
📒 Files selected for processing (7)
cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.javacuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.javacuebot/src/main/resources/opencue.propertiescuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/ProcDaoTests.javacuebot/src/test/java/com/imageworks/spcue/test/dispatcher/DispatchSupportServiceLostProcTests.javacuebot/src/test/java/com/imageworks/spcue/test/dispatcher/HostReportHandlerTests.javacuebot/src/test/resources/opencue.properties
| } catch (Throwable t) { | ||
| logger.info("kill-before-release failed for " + proc.getName() + ", " + t); | ||
| } |
There was a problem hiding this comment.
Avoid catching Throwable in kill-before-release.
Catching Throwable here can mask fatal JVM errors and still proceed with release flow. Restrict this to Exception so fatal errors are not swallowed.
Proposed fix
- } catch (Throwable t) {
- logger.info("kill-before-release failed for " + proc.getName() + ", " + t);
+ } catch (Exception e) {
+ logger.info("kill-before-release failed for " + proc.getName() + ", " + e);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } catch (Throwable t) { | |
| logger.info("kill-before-release failed for " + proc.getName() + ", " + t); | |
| } | |
| } catch (Exception e) { | |
| logger.info("kill-before-release failed for " + proc.getName() + ", " + e); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.java`
around lines 540 - 542, In the catch block that handles the kill-before-release
failure (the block logging "kill-before-release failed for"), change the caught
exception type from Throwable to Exception. This ensures that fatal JVM errors
like OutOfMemoryError and StackOverflowError will not be swallowed and the
release flow will properly propagate these critical errors instead of silently
catching and logging them.
| try { | ||
| proc = hostManager.getVirtualProc(runningFrame.getResourceId()); | ||
| msg = "Virtual proc " + proc.getProcId() + "is assigned to " + proc.getFrameId() | ||
| msg = "Virtual proc " + proc.getProcId() + " is assigned to " + proc.getFrameId() | ||
| + " not " + runningFrame.getFrameId(); | ||
| } catch (Exception e) { | ||
| /* | ||
| * This will happen if the host goes offline and then comes back. In this case, we | ||
| * don't touch the frame since it might already be running somewhere else. We do | ||
| * however kill the proc. | ||
| */ | ||
| msg = "Virtual proc did not exist."; | ||
| } |
There was a problem hiding this comment.
Differentiate “proc missing” from transient lookup failures before kill decisions.
The broad catch (Exception) path treats datastore/access failures as “proc did not exist”, which can drive false kill decisions after grace expiry. Only EmptyResultDataAccessException should map to “missing proc”; transient data access failures should skip destructive action for that cycle.
Proposed fix
try {
proc = hostManager.getVirtualProc(runningFrame.getResourceId());
msg = "Virtual proc " + proc.getProcId() + " is assigned to " + proc.getFrameId()
+ " not " + runningFrame.getFrameId();
- } catch (Exception e) {
+ } catch (EmptyResultDataAccessException e) {
msg = "Virtual proc did not exist.";
+ } catch (DataAccessException e) {
+ logger.warn("Skipping frame verification kill due to proc lookup failure for "
+ + runningFrame.getResourceId() + ", " + e);
+ runningFrames.add(runningFrame);
+ continue;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java`
around lines 1035 - 1041, Replace the broad catch (Exception e) block with a
more specific exception handler that distinguishes between a missing proc and
transient data access failures. Catch EmptyResultDataAccessException
specifically to set the message "Virtual proc did not exist" only when the proc
truly doesn't exist, and handle or rethrow other exceptions (like transient data
access failures) separately to prevent false kill decisions based on temporary
lookup failures in the try block containing hostManager.getVirtualProc().
| FrameDetail frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId()); | ||
| if (frameLatestVersion.state != FrameState.RUNNING) { | ||
| logger.info("DelayedVerification, the proc " + runningFrame.getResourceId() | ||
| + " on host " + report.getHost().getName() + " has already Completed " | ||
| + runningFrame.getJobName() + "/" + runningFrame.getFrameName()); | ||
| } else if (killFrame(runningFrame.getFrameId(), report.getHost().getName(), | ||
| KillCause.FrameVerificationFailure)) { |
There was a problem hiding this comment.
Guard frame-detail lookup in killUnverifiedRunningFrame to prevent report-handler aborts.
If jobManager.getFrameDetail can’t find the frame (late/stale report), the exception can bubble out and interrupt host report processing. Handle missing-frame lookup explicitly and return.
Proposed fix
- FrameDetail frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
+ FrameDetail frameLatestVersion;
+ try {
+ frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
+ } catch (EmptyResultDataAccessException e) {
+ logger.info("DelayedVerification, frame " + runningFrame.getFrameId()
+ + " no longer exists in DB; skipping kill.");
+ return;
+ }
if (frameLatestVersion.state != FrameState.RUNNING) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| FrameDetail frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId()); | |
| if (frameLatestVersion.state != FrameState.RUNNING) { | |
| logger.info("DelayedVerification, the proc " + runningFrame.getResourceId() | |
| + " on host " + report.getHost().getName() + " has already Completed " | |
| + runningFrame.getJobName() + "/" + runningFrame.getFrameName()); | |
| } else if (killFrame(runningFrame.getFrameId(), report.getHost().getName(), | |
| KillCause.FrameVerificationFailure)) { | |
| FrameDetail frameLatestVersion; | |
| try { | |
| frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId()); | |
| } catch (EmptyResultDataAccessException e) { | |
| logger.info("DelayedVerification, frame " + runningFrame.getFrameId() | |
| " no longer exists in DB; skipping kill."); | |
| return; | |
| } | |
| if (frameLatestVersion.state != FrameState.RUNNING) { | |
| logger.info("DelayedVerification, the proc " + runningFrame.getResourceId() | |
| " on host " + report.getHost().getName() + " has already Completed " | |
| runningFrame.getJobName() + "/" + runningFrame.getFrameName()); | |
| } else if (killFrame(runningFrame.getFrameId(), report.getHost().getName(), | |
| KillCause.FrameVerificationFailure)) { |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java`
around lines 1087 - 1093, The jobManager.getFrameDetail call in the
killUnverifiedRunningFrame method lacks exception handling for cases where the
frame cannot be found (late or stale reports). Add a try-catch block around the
getFrameDetail lookup to explicitly handle missing frames, log an appropriate
message when a frame is not found, and return early from the method rather than
allowing the exception to propagate and interrupt host report processing.
Related Issues
#2438
Production changes
DispatchSupportService.java— Fix A (kill-before-release):lostProcnow issues a synchronousrqdClient.killFrame(proc, …)beforeunbookProc/frame reset, in a try/catch so a genuinely dead host still releases. Skips re-kill onEXIT_STATUS_FAILED_KILL. Added@Autowired Environment env.HostReportHandler.java— Fix B (strictresource_idfencing): the grace period now shields only theproc == nullbook→first-report window. When a reported frame'sresource_idmaps to a proc on a different frame, the zombie frame is killed immediately (no 120s grace / 300sisOrphanwait), and the proc/its current frame are never touched (safe under proc-reuse, since the RQD kill is frame-scoped). Removed the unsafe inline orphan-clear; extracted akillUnverifiedRunningFramehelper; grace period is now configurable.opencue.properties(main + test) — three new tunables, all defaulting to the prior behavior:kill_running_frame_before_release_enabled=true,frame_verification_strict_fencing_enabled=true,frame_verification_grace_period_seconds=120.DispatchSupportServiceLostProcTests(new, pure JUnit+Mockito) — 4 tests, passing locally: kill-before-release ordering, FAILED_KILL skip, release-when-kill-throws, property-disabled. (Pure-unit becauselostProcis@Transactional(NOT_SUPPORTED), which fights the embedded-DB harness.)HostReportHandlerTests(+3) andProcDaoTests(+1) — fencing/grace/proc-reuse coverage.Summary by CodeRabbit
Release Notes
New Features
Chores
kill_running_frame_before_release_enabled,frame_verification_strict_fencing_enabled, andframe_verification_grace_period_seconds(all enabled by default with 120-second grace period)