Skip to content

[cuebot] Fix Silent frame double booking#2441

Draft
DiegoTavares wants to merge 1 commit into
AcademySoftwareFoundation:masterfrom
DiegoTavares:fix_multi_booking
Draft

[cuebot] Fix Silent frame double booking#2441
DiegoTavares wants to merge 1 commit into
AcademySoftwareFoundation:masterfrom
DiegoTavares:fix_multi_booking

Conversation

@DiegoTavares

@DiegoTavares DiegoTavares commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Related Issues

#2438

Production changes

  • DispatchSupportService.java — Fix A (kill-before-release): lostProc now issues a synchronous rqdClient.killFrame(proc, …) before unbookProc/frame reset, in a try/catch so a genuinely dead host still releases. Skips re-kill on EXIT_STATUS_FAILED_KILL. Added @Autowired Environment env.

  • HostReportHandler.java — Fix B (strict resource_id fencing): the grace period now shields only the proc == null book→first-report window. When a reported frame's resource_id maps to a proc on a different frame, the zombie frame is killed immediately (no 120s grace / 300s isOrphan wait), and the proc/its current frame are never touched (safe under proc-reuse, since the RQD kill is frame-scoped). Removed the unsafe inline orphan-clear; extracted a killUnverifiedRunningFrame helper; grace period is now configurable.

  • opencue.properties (main + test) — three new tunables, all defaulting to the prior behavior: kill_running_frame_before_release_enabled=true, frame_verification_strict_fencing_enabled=true, frame_verification_grace_period_seconds=120.

  • DispatchSupportServiceLostProcTests (new, pure JUnit+Mockito) — 4 tests, passing locally: kill-before-release ordering, FAILED_KILL skip, release-when-kill-throws, property-disabled. (Pure-unit because lostProc is @Transactional(NOT_SUPPORTED), which fights the embedded-DB harness.)

  • HostReportHandlerTests (+3) and ProcDaoTests (+1) — fencing/grace/proc-reuse coverage.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic cleanup of running frames during lost process recovery
    • Implemented frame ownership verification to prevent misallocated frames from executing
  • Chores

    • Added configurable dispatcher properties: kill_running_frame_before_release_enabled, frame_verification_strict_fencing_enabled, and frame_verification_grace_period_seconds (all enabled by default with 120-second grace period)

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 20b8cb25-ffb5-4f44-a32c-2ff0888849ae

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Fixes issue AcademySoftwareFoundation#2438.

Production changes
- **`DispatchSupportService.java`** — Fix A (kill-before-release): `lostProc` now issues a
  synchronous `rqdClient.killFrame(proc, …)` **before** `unbookProc`/frame reset, in a try/catch so
  a genuinely dead host still releases. Skips re-kill on `EXIT_STATUS_FAILED_KILL`. Added
  `@Autowired Environment env`.
- **`HostReportHandler.java`** — Fix B (strict `resource_id` fencing): the grace period now shields
  **only** the `proc == null` book→first-report window. When a reported frame's `resource_id` maps
  to a proc on a *different* frame, the zombie frame is killed **immediately** (no 120s grace / 300s
  `isOrphan` wait), and the proc/its current frame are never touched (safe under proc-reuse, since
  the RQD kill is frame-scoped). Removed the unsafe inline orphan-clear; extracted a
  `killUnverifiedRunningFrame` helper; grace period is now configurable.
- **`opencue.properties`** (main + test) — three new tunables, all defaulting to the prior behavior:
  `kill_running_frame_before_release_enabled=true`,
  `frame_verification_strict_fencing_enabled=true`, `frame_verification_grace_period_seconds=120`.

- **`DispatchSupportServiceLostProcTests`** (new, pure JUnit+Mockito) — **4 tests, passing
  locally**: kill-before-release ordering, FAILED_KILL skip, release-when-kill-throws,
  property-disabled. (Pure-unit because `lostProc` is `@Transactional(NOT_SUPPORTED)`, which fights
  the embedded-DB harness.)
- **`HostReportHandlerTests`** (+3) and **`ProcDaoTests`** (+1) — fencing/grace/proc-reuse coverage.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.java`:
- Around line 540-542: In the catch block that handles the kill-before-release
failure (the block logging "kill-before-release failed for"), change the caught
exception type from Throwable to Exception. This ensures that fatal JVM errors
like OutOfMemoryError and StackOverflowError will not be swallowed and the
release flow will properly propagate these critical errors instead of silently
catching and logging them.

In `@cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java`:
- Around line 1087-1093: The jobManager.getFrameDetail call in the
killUnverifiedRunningFrame method lacks exception handling for cases where the
frame cannot be found (late or stale reports). Add a try-catch block around the
getFrameDetail lookup to explicitly handle missing frames, log an appropriate
message when a frame is not found, and return early from the method rather than
allowing the exception to propagate and interrupt host report processing.
- Around line 1035-1041: Replace the broad catch (Exception e) block with a more
specific exception handler that distinguishes between a missing proc and
transient data access failures. Catch EmptyResultDataAccessException
specifically to set the message "Virtual proc did not exist" only when the proc
truly doesn't exist, and handle or rethrow other exceptions (like transient data
access failures) separately to prevent false kill decisions based on temporary
lookup failures in the try block containing hostManager.getVirtualProc().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 566cc23b-a2e8-421f-8dc8-41ac1c501aba

📥 Commits

Reviewing files that changed from the base of the PR and between ca5f70f and a3583cb.

📒 Files selected for processing (7)
  • cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.java
  • cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
  • cuebot/src/main/resources/opencue.properties
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/ProcDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dispatcher/DispatchSupportServiceLostProcTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dispatcher/HostReportHandlerTests.java
  • cuebot/src/test/resources/opencue.properties

Comment on lines +540 to +542
} catch (Throwable t) {
logger.info("kill-before-release failed for " + proc.getName() + ", " + t);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid catching Throwable in kill-before-release.

Catching Throwable here can mask fatal JVM errors and still proceed with release flow. Restrict this to Exception so fatal errors are not swallowed.

Proposed fix
-            } catch (Throwable t) {
-                logger.info("kill-before-release failed for " + proc.getName() + ", " + t);
+            } catch (Exception e) {
+                logger.info("kill-before-release failed for " + proc.getName() + ", " + e);
             }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} catch (Throwable t) {
logger.info("kill-before-release failed for " + proc.getName() + ", " + t);
}
} catch (Exception e) {
logger.info("kill-before-release failed for " + proc.getName() + ", " + e);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.java`
around lines 540 - 542, In the catch block that handles the kill-before-release
failure (the block logging "kill-before-release failed for"), change the caught
exception type from Throwable to Exception. This ensures that fatal JVM errors
like OutOfMemoryError and StackOverflowError will not be swallowed and the
release flow will properly propagate these critical errors instead of silently
catching and logging them.

Comment on lines 1035 to 1041
try {
proc = hostManager.getVirtualProc(runningFrame.getResourceId());
msg = "Virtual proc " + proc.getProcId() + "is assigned to " + proc.getFrameId()
msg = "Virtual proc " + proc.getProcId() + " is assigned to " + proc.getFrameId()
+ " not " + runningFrame.getFrameId();
} catch (Exception e) {
/*
* This will happen if the host goes offline and then comes back. In this case, we
* don't touch the frame since it might already be running somewhere else. We do
* however kill the proc.
*/
msg = "Virtual proc did not exist.";
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Differentiate “proc missing” from transient lookup failures before kill decisions.

The broad catch (Exception) path treats datastore/access failures as “proc did not exist”, which can drive false kill decisions after grace expiry. Only EmptyResultDataAccessException should map to “missing proc”; transient data access failures should skip destructive action for that cycle.

Proposed fix
             try {
                 proc = hostManager.getVirtualProc(runningFrame.getResourceId());
                 msg = "Virtual proc " + proc.getProcId() + " is assigned to " + proc.getFrameId()
                         + " not " + runningFrame.getFrameId();
-            } catch (Exception e) {
+            } catch (EmptyResultDataAccessException e) {
                 msg = "Virtual proc did not exist.";
+            } catch (DataAccessException e) {
+                logger.warn("Skipping frame verification kill due to proc lookup failure for "
+                        + runningFrame.getResourceId() + ", " + e);
+                runningFrames.add(runningFrame);
+                continue;
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java`
around lines 1035 - 1041, Replace the broad catch (Exception e) block with a
more specific exception handler that distinguishes between a missing proc and
transient data access failures. Catch EmptyResultDataAccessException
specifically to set the message "Virtual proc did not exist" only when the proc
truly doesn't exist, and handle or rethrow other exceptions (like transient data
access failures) separately to prevent false kill decisions based on temporary
lookup failures in the try block containing hostManager.getVirtualProc().

Comment on lines +1087 to +1093
FrameDetail frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
if (frameLatestVersion.state != FrameState.RUNNING) {
logger.info("DelayedVerification, the proc " + runningFrame.getResourceId()
+ " on host " + report.getHost().getName() + " has already Completed "
+ runningFrame.getJobName() + "/" + runningFrame.getFrameName());
} else if (killFrame(runningFrame.getFrameId(), report.getHost().getName(),
KillCause.FrameVerificationFailure)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard frame-detail lookup in killUnverifiedRunningFrame to prevent report-handler aborts.

If jobManager.getFrameDetail can’t find the frame (late/stale report), the exception can bubble out and interrupt host report processing. Handle missing-frame lookup explicitly and return.

Proposed fix
-        FrameDetail frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
+        FrameDetail frameLatestVersion;
+        try {
+            frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
+        } catch (EmptyResultDataAccessException e) {
+            logger.info("DelayedVerification, frame " + runningFrame.getFrameId()
+                    + " no longer exists in DB; skipping kill.");
+            return;
+        }
         if (frameLatestVersion.state != FrameState.RUNNING) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
FrameDetail frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
if (frameLatestVersion.state != FrameState.RUNNING) {
logger.info("DelayedVerification, the proc " + runningFrame.getResourceId()
+ " on host " + report.getHost().getName() + " has already Completed "
+ runningFrame.getJobName() + "/" + runningFrame.getFrameName());
} else if (killFrame(runningFrame.getFrameId(), report.getHost().getName(),
KillCause.FrameVerificationFailure)) {
FrameDetail frameLatestVersion;
try {
frameLatestVersion = jobManager.getFrameDetail(runningFrame.getFrameId());
} catch (EmptyResultDataAccessException e) {
logger.info("DelayedVerification, frame " + runningFrame.getFrameId()
" no longer exists in DB; skipping kill.");
return;
}
if (frameLatestVersion.state != FrameState.RUNNING) {
logger.info("DelayedVerification, the proc " + runningFrame.getResourceId()
" on host " + report.getHost().getName() + " has already Completed "
runningFrame.getJobName() + "/" + runningFrame.getFrameName());
} else if (killFrame(runningFrame.getFrameId(), report.getHost().getName(),
KillCause.FrameVerificationFailure)) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java`
around lines 1087 - 1093, The jobManager.getFrameDetail call in the
killUnverifiedRunningFrame method lacks exception handling for cases where the
frame cannot be found (late or stale reports). Add a try-catch block around the
getFrameDetail lookup to explicitly handle missing frames, log an appropriate
message when a frame is not found, and return early from the method rather than
allowing the exception to propagate and interrupt host report processing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant