Skip to content

fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing'#4740

Open
void-ptr974 wants to merge 5 commits into
apache:masterfrom
void-ptr974:fix_journal_missing
Open

fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing'#4740
void-ptr974 wants to merge 5 commits into
apache:masterfrom
void-ptr974:fix_journal_missing

Conversation

@void-ptr974
Copy link
Copy Markdown
Contributor

@void-ptr974 void-ptr974 commented Apr 7, 2026

Fix #4105

Motivation

A bookie may fail to restart with:

IOException: Recovery log xxx is missing

This can happen when the persisted journal lastMark moves backwards and points to a journal file that has already been garbage-collected.

Root Cause

With DbLedgerStorage and a single ledger directory, one flush cycle may complete checkpoints out of order.

A possible sequence is:

  1. SyncThread.flush() captures an older checkpoint, for example journal 5.
  2. It calls ledgerStorage.flush().
  3. SingleDirectoryDbLedgerStorage.flush() captures and completes a newer checkpoint, for example journal 7.
  4. Completing checkpoint 7 persists lastMark=7.
  5. Journal GC may then delete older journal files, including journal 5.
  6. SyncThread.flush() resumes and completes the older checkpoint 5.
  7. The old code overwrites lastMark back to 5.

If the bookie crashes after this, restart reads lastMark=5, but journal 5 may no longer exist, so recovery fails.

Example

Assume the bookie currently has these journal files:

Journal file Status
5.txn old journal, still needed if lastMark=5
6.txn old journal
7.txn newer journal

A possible execution order:

Step Path Action Persisted lastMark Result
1 SyncThread.flush() captures checkpoint at journal 5 5 outer flush starts from an older checkpoint
2 ledgerStorage.flush() single-dir DB storage captures newer checkpoint 7 5 inner flush sees newer journal progress
3 checkpointComplete(7) newer checkpoint completes first 7 recovery point moves forward
4 Journal GC deletes journals older than 7 7 5.txn may be deleted
5 checkpointComplete(5) older outer checkpoint completes later 5 before this patch lastMark moves backwards
6 Bookie restart reads lastMark=5 5 restart fails because 5.txn is missing

Checkpoint 5 is valid when it is created. It becomes stale only after checkpoint 7 completes first and allows journal GC to remove older logs.

Race

The same issue can happen when two checkpoint completions are interleaved.

Time Thread A Thread B Persisted lastMark
T1 starts checkpointComplete(7) 5
T2 checks current persisted mark: 5 5
T3 starts checkpointComplete(9) 5
T4 checks current persisted mark: 5 5
T5 persists lastMark=9 9
T6 persists lastMark=7 7 before this patch

Without holding the stale check, persist, and in-memory update under the same lock, an older checkpoint can pass the stale check before a newer checkpoint is persisted, then write later and move lastMark backwards.

Changes

This patch makes journal checkpoint completion monotonic.

Journal.checkpointComplete() now:

  • tracks the latest persisted lastMark;
  • skips checkpoint completions older than the latest persisted mark;
  • keeps the stale check, persist, and in-memory mark update under the same lock;
  • initializes the tracked persisted mark from the lastMark loaded from disk during journal startup.

In other words, lastMark may move from 5 to 7 to 9, but it must never move back from 9 to 7 or from 7 to 5.

Performance Impact

The impact should be negligible.

Checkpoint completion is not on the normal journal write hot path. The added comparison is in-memory and constant-time. The lock only protects checkpoint mark persistence, and stale checkpoints may now avoid unnecessary lastMark writes.

Tests

Added coverage for:

  • stale checkpoint completion after a newer checkpoint;
  • initializing the monotonic guard from a reloaded lastMark;
  • concurrent checkpoint completion where an older checkpoint finishes after a newer one.

Verified with:

mvn -pl bookkeeper-server -am -P '!mac' -Dtest=DbLedgerStorageTest -Dsurefire.failIfNoSpecifiedTests=false test

mvn -pl bookkeeper-server -P '!mac' -DskipTests checkstyle:check

@void-ptr974 void-ptr974 changed the title fix Recovery log xxx is missing fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing' Apr 7, 2026
@StevenLuMT StevenLuMT requested a review from Copilot April 8, 2026 01:32
@StevenLuMT
Copy link
Copy Markdown
Member

good jobs,
Could you describe the triggering scenario of the problem more clearly?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses BookKeeper startup failures where a referenced recovery journal file has been garbage-collected due to a race between concurrent checkpointComplete() calls (e.g., SyncThread vs SingleDirectoryDbLedgerStorage.flush()), causing lastMark to move backwards and point at a deleted journal.

Changes:

  • Adds a monotonic-advance guard to Journal.checkpointComplete() intended to prevent lastMark regression under concurrent checkpoints.
  • Adds a regression test attempting to reproduce the “recovery log is missing” scenario and validate the fix.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java Introduces a lock + persisted-mark tracking intended to keep lastMark monotonic across concurrent checkpoints.
bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/storage/ldb/DbLedgerStorageTest.java Adds a test scenario for the journal-missing failure mode and expected monotonic behavior.
Comments suppressed due to low confidence (1)

bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java:804

  • The monotonic guard is not sufficient to prevent lastMark regression: if an older checkpoint enters this method first (passes the check and updates lastPersistedMark) but reaches rollLog() after a newer checkpoint has already persisted a later mark, this call will still execute rollLog() and can overwrite the lastMark file backwards. Consider serializing the whole rollLog + journal GC block under the same lock, or re-checking the mark under the lock immediately before persisting/deleting so stale checkpoints cannot write.
        // Monotonic check: only advance lastMark forward, never backwards.
        synchronized (checkpointLock) {
            if (mark.getCurMark().compare(lastPersistedMark) <= 0) {
                return;
            }
            lastPersistedMark.setLogMark(
                    mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
        }

        mark.rollLog(mark);
        if (compact) {
            // list the journals that have been marked
            List<Long> logs = listJournalIds(journalDirectory, new JournalRollingFilter(mark));
            // keep MAX_BACKUP_JOURNALS journal files before marked journal

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java Outdated
@void-ptr974
Copy link
Copy Markdown
Contributor Author

@StevenLuMT Thanks for the feedback. I've updated the PR description with a clearer explanation of the triggering scenario.

@StevenLuMT
Copy link
Copy Markdown
Member

good jobs, I have rerun CI

Copy link
Copy Markdown
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@void-ptr974 Thanks for your contribution.

Great diagnosis — the race is real, the description is precise, and the monotonic guard is the right shape of fix. A few concerns and suggestions before merging.

Concern 1: rollLog is outside the synchronized block — a narrower variant of the same race still exists

The fix correctly prevents an older mark from overwriting newer on the disk file when calls are interleaved as described. But two newer marks racing each other are not protected, because rollLog runs
outside checkpointLock:

synchronized (checkpointLock) {
if (mark.getCurMark().compare(lastPersistedMark) < 0) {
return;
}
lastPersistedMark.setLogMark(...); // updated under lock
}
mark.rollLog(mark); // executed outside the lock

Scenario:

  • Thread C: checkpointComplete(mark=7) passes guard, lastPersistedMark=7, starts rollLog (slow disk write + force(true)).
  • Thread D: checkpointComplete(mark=9) passes guard (9 > 7), lastPersistedMark=9, starts rollLog.
  • Thread D's rollLog finishes first → file contains 9.
  • Thread C's rollLog finishes second → file contains 7. lastMark regresses.

rollLog opens new FileOutputStream(file) which truncates, so it really is last-writer-wins. This window is much smaller than the one the PR targets, but it's structurally the same bug and the guard
doesn't close it.

Suggested fix: either move rollLog inside the synchronized block, or have rollLog write lastPersistedMark (the source of truth) rather than the per-call mark argument. The former is simpler:

synchronized (checkpointLock) {
if (mark.getCurMark().compare(lastPersistedMark) < 0) {
return;
}
lastPersistedMark.setLogMark(
mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
mark.rollLog(mark);
}
if (compact) {
...
}

rollLog does a 16-byte write + fsync. Holding the lock across it is fine — checkpointComplete is not on the hot path, and serializing it is exactly the invariant we want.

Concern 2: the test doesn't actually exercise the race

testConcurrentCheckpointCompleteLastMarkRegression calls checkpointComplete(newer) then checkpointComplete(stale) sequentially on the same thread. It verifies the guard logic but doesn't reproduce the
actual concurrent path that motivated the PR.

A concurrent test would:

  • Spin up two threads, one with the stale checkpoint and one with the newer.
  • Use latches/barriers to force the documented interleaving (Path B's checkpointComplete runs before Path A's, both with rollLog in flight).
  • Assert lastMark file contents after both finish.

This also surfaces Concern 1 — if you run two threads with newer marks (7 and 9) concurrently in a loop, you can demonstrate the residual race.

Concern 3: lastPersistedMark is initialized to (0, 0)

At startup, LastLogMark.readLog() populates lastLogMark.curMark from the on-disk file (could be at, say, mark 100). But lastPersistedMark stays at (0, 0). In practice this is benign because
newCheckpoint() only ever produces marks ≥ the current journal position, so the first post-startup checkpointComplete will be ≥ 100 and the guard advances correctly. But it's conceptually inconsistent
— lastPersistedMark claims to be "what's on disk" and is actually zero.

Consider seeding it from lastLogMark when Journal.readLog() runs, or at least document the invariant.

Smaller comments

  • Style: mark.rollLog(mark) (a method on LastLogMark taking a LastLogMark and using the argument's curMark) is awkward. Pre-existing, not for this PR — but a follow-up to make rollLog parameter-less
    and use this.curMark would clean this up.
  • Deeper question (not blocking): zooming out, SingleDirectoryDbLedgerStorage.flush() calls checkpointSource.checkpointComplete(cp, true) internally (line 925), and SyncThread.flush() also calls
    and compact=true, the outer one with the stale snapshot and compact=false. The monotonic guard handles this safely now, but it might be worth questioning whether SDLS should be calling
    checkpointComplete at all when invoked from SyncThread's path. That's out of scope here, but the duplicated responsibility for advancing lastMark is a smell.
  • Comment: the inline // Skip if this mark is not newer than what was already persisted. comment is good. Adding a one-liner reference to the issue/PR (// See #4105) would help future readers trace the
    rationale.

What looks good

  • Guard semantics (compare < 0 not ≤ 0) correctly tolerates an idempotent retry with the same mark.
  • The skip path doesn't suppress compact=true work that matters: if a stale call's compaction is skipped, the newer call's compaction has already deleted everything older.
  • The synchronized block is tiny (compare + set) so contention is a non-issue.

Once Concerns 1 and 2 are addressed I think this is ready.

When journal and dbledgerstorage on the same disk, concurrent checkpointComplete calls from SyncThread and SingleDirectoryDbLedgerStorage.flush() can overwrite lastMark backwards,causing journal files to be deleted while still referenced by the older mark.
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.

  Root cause

  SyncThread.flush() has a nested call pattern that causes lastMark regression:

  SyncThread.flush():
    1. outerCheckpoint = newCheckpoint()        → mark = (file 5, offset 100)
    2. ledgerStorage.flush()
       → SingleDirectoryDbLedgerStorage.flush():
         a. innerCheckpoint = newCheckpoint()   → mark = (file 7, offset 200), journal has advanced
         b. flushes data to disk
         c. checkpointComplete(mark=7, compact=true)
            → persists lastMark=7, deletes journal files with id < 7 (including file 5)
    3. checkpointComplete(mark=5, compact=false)
       → persists lastMark=5, overwriting the 7 written in step 2c
    4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash

  Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
  position whose journal file was already deleted.

  Conditions

  - singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
  - Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
  - maxBackupJournals small enough for old files to actually be deleted

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.

  Root cause

  SyncThread.flush() has a nested call pattern that causes lastMark regression:

  SyncThread.flush():
    1. outerCheckpoint = newCheckpoint()        → mark = (file 5, offset 100)
    2. ledgerStorage.flush()
       → SingleDirectoryDbLedgerStorage.flush():
         a. innerCheckpoint = newCheckpoint()   → mark = (file 7, offset 200), journal has advanced
         b. flushes data to disk
         c. checkpointComplete(mark=7, compact=true)
            → persists lastMark=7, deletes journal files with id < 7 (including file 5)
    3. checkpointComplete(mark=5, compact=false)
       → persists lastMark=5, overwriting the 7 written in step 2c
    4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash

  Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
  position whose journal file was already deleted.

  Conditions

  - singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
  - Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
  - maxBackupJournals small enough for old files to actually be deleted

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.

  Root cause

  SyncThread.flush() has a nested call pattern that causes lastMark regression:

  SyncThread.flush():
    1. outerCheckpoint = newCheckpoint()        → mark = (file 5, offset 100)
    2. ledgerStorage.flush()
       → SingleDirectoryDbLedgerStorage.flush():
         a. innerCheckpoint = newCheckpoint()   → mark = (file 7, offset 200), journal has advanced
         b. flushes data to disk
         c. checkpointComplete(mark=7, compact=true)
            → persists lastMark=7, deletes journal files with id < 7 (including file 5)
    3. checkpointComplete(mark=5, compact=false)
       → persists lastMark=5, overwriting the 7 written in step 2c
    4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash

  Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
  position whose journal file was already deleted.

  Conditions

  - singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
  - Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
  - maxBackupJournals small enough for old files to actually be deleted

  Fix

  Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
@void-ptr974 void-ptr974 force-pushed the fix_journal_missing branch from 2d35897 to 5498af4 Compare May 22, 2026 14:50
@void-ptr974
Copy link
Copy Markdown
Contributor Author

Thanks @hangc0276, I addressed your comments.

The patch now initializes lastPersistedMark from disk, keeps the stale check + persistence + in-memory update under the same lock, and adds tests for both the reload case and concurrent checkpoint completion. I also updated the PR description with the root cause, example sequence, race table, performance impact, and tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recovery log is missing

4 participants