Skip to content

Optimize bounded batch reads by predicting entry count#4741

Open
dao-jun wants to merge 11 commits into
apache:masterfrom
dao-jun:dev/optimize_batch_read
Open

Optimize bounded batch reads by predicting entry count#4741
dao-jun wants to merge 11 commits into
apache:masterfrom
dao-jun:dev/optimize_batch_read

Conversation

@dao-jun
Copy link
Copy Markdown
Member

@dao-jun dao-jun commented Apr 9, 2026

Motivation

Batch reads currently keep reading entries until an entry is found to exceed the maxSize response budget. For bounded batch reads, this can read and allocate one extra entry only to release it immediately when it does not fit.

For typical BookKeeper/Pulsar workloads, entries within the same ledger are often close in size. After reading the first entry, we can use its framed size as an estimate for the remaining entries and cap the batch count before issuing reads that are likely to overflow the response budget.

This keeps the existing batch-read behavior, including returning the first entry even when a single entry alone exceeds the requested maxSize.

Changes

  • Update BatchedReadEntryProcessor to:
    • read the first entry as before
    • include the per-entry 4-byte delimiter in the response-size accounting
    • use the first entry's framed size to reduce the remaining maxCount
    • still perform the exact maxSize check for later entries and release the one over-read entry if entry sizes grow
  • Preserve existing prefix-read behavior: if a later entry fails after at least one entry has been read, return the accumulated prefix.
  • Update mock batch-read behavior to match production semantics, while respecting that MockLedgerData returns shared buffers.
  • Add regression and boundary tests covering:
    • first entry larger than maxSize
    • uniform entry-size prediction avoiding extra reads
    • growing entry sizes causing at most one over-read
    • missing or failed later entries returning the accumulated prefix
    • mock and client-facing batch-read behavior

@dao-jun dao-jun closed this Apr 9, 2026
@dao-jun dao-jun reopened this Apr 9, 2026
Copy link
Copy Markdown
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find tests about failure scenarios.

Memory leaks can happen in case of failures (reading from storage...), client may see the wrong error code.....

@dao-jun
Copy link
Copy Markdown
Member Author

dao-jun commented Apr 13, 2026

@eolivelli I added 3 tests in BatchedReadEntryProcessorTest, PTAL

@dao-jun dao-jun requested a review from eolivelli April 13, 2026 13:04
dao-jun added 2 commits May 8, 2026 16:18
# Conflicts:
#	bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java
@merlimat
Copy link
Copy Markdown
Contributor

@dao-jun Thanks for the optimization. I went through the PR end-to-end — overall the direction is good and the framing-budget contract is clean and consistent across the storage stack. A few things I'd like to discuss before merge.

Overall

The core idea is sound: today the batch-read path reads each entry from disk and only then checks whether it fits in maxSize, throwing the result away on overflow. This PR threads the remaining response budget down through Bookie → LedgerDescriptor → LedgerStorage → EntryLogger, lets DefaultEntryLogger/DirectEntryLogger short-circuit by reading just the size header, and updates BatchedReadEntryProcessor to use the bounded API for entries 2..N (keeping the "always return the first entry" semantics).

The framing budget contract is consistent end-to-end (entry.readableBytes() + 4 > maxEntrySize everywhere, matching what BatchedReadEntryProcessor adds to frameSize). Tests are thorough — exact-fit, off-by-one budgets, oversized-first-entry, missing entries, IOException propagation, all the storage flavors. Reference counting on the new paths checks out: WriteCache.get()/ReadCache.get() return freshly-allocated buffers, readEntrySize returns a thread-local, and ByteBufList.deallocate releases members.

Substantive concerns

1. Undocumented behavior change in exception handling — BatchedReadEntryProcessor.readData

} catch (Bookie.NoEntryException e) {
    if (data == null) { throw e; }
    break;
} catch (Throwable e) {
    if (data != null) { data.release(); }
    throw e;
}

The old code's catch (Throwable) silently broke and returned partial data on any error after the first entry — including transient IOExceptions. The new code only swallows NoEntryException; any other exception now propagates as EIO to the client, even if some entries were already read successfully. This is arguably more correct (hiding I/O errors is bad), but it's a real behavior change that isn't called out in the description and may surprise clients that relied on the lenient behavior. Could you call this out explicitly in the PR description?

2. Stat pollution in BookieImpl.readEntryIfFits

When entry == null (size-rejected), entrySize is 0, but the method still calls registerSuccessfulValue(0) on readBytesStats and records latency on readEntryStats. That mixes "actually read N bytes" and "size-rejected, didn't read" into the same histograms — the bytes histogram gets zero-spikes, and the latency histogram skews artificially low because size-rejected reads are much faster than full reads. Either skip recording on the null branch, or add a separate counter for size-rejected reads.

3. Significant code duplication in SingleDirectoryDbLedgerStorage.doGetEntryIfFits

doGetEntryIfFits is ~90 lines that mostly mirror doGetEntry, with the same if (entry.readableBytes() + Integer.BYTES > maxEntrySize) { release; return null; } block repeated four times (write cache, write-cache-being-flushed, read cache, after entry-log read). Could be reduced significantly by extracting a small checkBudget(ByteBuf, long) helper, or by adding a budget parameter to doGetEntry with a sentinel for "no limit". Smaller diff, easier to keep in sync going forward.

4. MockBookies.batchReadEntries no longer releases the skipped entry

The old code did entry.release() before breaking on overflow; the new code drops that call. This is actually a fix — MockLedgerData.getEntry() returns the same shared ByteBuf it stores in its map, so the old release() could free the only ref and corrupt subsequent reads (which is exactly what testBatchReadDoesNotReleaseOversizedSkippedEntry verifies). But the production BatchedReadEntryProcessor does release in the analogous overflow path (because Bookie.readEntry returns a freshly-allocated buffer), so the mock and the real bookie now have intentionally different ref-counting contracts. A short comment on the mock explaining "we don't own the buffer here" would prevent future confusion.

Smaller things

5. Inconsistent + 4 vs + Integer.BYTESBatchedReadEntryProcessor uses literal 4, every other module uses Integer.BYTES. Pick one.

6. DefaultEntryLogger.readEntryIfFits returns null on oversized entries before calling validateEntry. If an entry's size header is corrupted to a huge value, this silently returns null ("doesn't fit") instead of catching the corruption. Probably acceptable — a later legitimate read would catch it — but worth a one-line comment so a future reader doesn't think it's an oversight.

7. When entryLogger.readEntryIfFits returns null in SingleDirectoryDbLedgerStorage.doGetEntryIfFits, the entry isn't put in the read cache. A subsequent call with a larger budget will hit disk again. Probably fine (we never actually read the data), just noting it.

Verdict

Direction is good, contract is clean, tests are solid. Before merging I'd ask for: (a) the exception-handling behavior change explicitly called out in the PR description, (b) stat recording tightened for size-rejected reads, and (c) the doGetEntryIfFits duplication reduced. The rest are nice-to-have.

@dao-jun
Copy link
Copy Markdown
Member Author

dao-jun commented May 14, 2026

/bkbot rerun-failure

@dao-jun
Copy link
Copy Markdown
Member Author

dao-jun commented May 14, 2026

@merlimat review comment addressed, PTAL

@void-ptr974
Copy link
Copy Markdown
Contributor

Thanks for working on this optimization. The direction makes sense to me.

One compatibility concern: this PR seems to change the batch-read behavior when a later entry fails after some entries have already been read. My understanding is that the previous behavior was closer to a best-effort prefix read: once the first entry was read successfully, later failures would stop the batch and return the accumulated entries.

I think it would be safer to preserve that behavior in this PR, so the change stays focused on avoiding unnecessary disk IO rather than changing error semantics. The failed entry would still be exposed when the client later starts reading from that entry.

@dao-jun
Copy link
Copy Markdown
Member Author

dao-jun commented May 18, 2026

Thanks for working on this optimization. The direction makes sense to me.

One compatibility concern: this PR seems to change the batch-read behavior when a later entry fails after some entries have already been read. My understanding is that the previous behavior was closer to a best-effort prefix read: once the first entry was read successfully, later failures would stop the batch and return the accumulated entries.

I think it would be safer to preserve that behavior in this PR, so the change stays focused on avoiding unnecessary disk IO rather than changing error semantics. The failed entry would still be exposed when the client later starts reading from that entry.

addressed

Copy link
Copy Markdown
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this — the wasted work on batched-read overflow is real. Before this lands I'd like to push back on both the framing and the scope of the API change, and propose a simpler
alternative.

The disk-I/O savings are smaller than the title suggests

The PR title says "avoid wasted disk IO", but the size-peek doesn't actually save disk I/O on the buffered path:

  • DefaultEntryLogger uses BufferedReadChannel over a regular FileChannel. The 4-byte readEntrySize call (bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java:856) pulls
    a full read-buffer's worth of data into BufferedReadChannel's internal buffer, and the kernel's sequential-readahead heuristic pulls the following pages into the page cache. The body bytes are resident
    in memory before we even decide whether to read them. What readEntryIfFits actually saves on this path is the user-space pread syscall, the allocator.buffer(entrySize, entrySize) allocation, and the
    Netty/GC pressure from the discarded ByteBuf — real CPU/allocation wins, but not disk I/O.
  • DirectEntryLogger uses O_DIRECT via DirectReader, which bypasses the page cache and reads one aligned block at a time into nativeBuffer
    (bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/directentrylogger/DirectReader.java:200). Here the peek does save real disk I/O — but only for entries that span more than one
    block (block size is governed by dbStorage_directIOEntryLoggerReadBufferSizeBytes, typically 4KB–1MB). If the block size is large enough to hold the entry, the peek has already loaded everything.

It would help to either narrow the PR title, or post a measurement that distinguishes wall-clock improvement from actual disk-I/O reduction (e.g. iostat / read_bytes deltas under representative
workloads).

Where the savings actually come from

The PR adds readEntryIfFits / getEntryIfFits across Bookie, LedgerDescriptor, LedgerStorage, EntryLogger, plus the concrete implementations (BookieImpl, LedgerDescriptorImpl, DbLedgerStorage,
SingleDirectoryDbLedgerStorage, InterleavedLedgerStorage, SortedLedgerStorage, DefaultEntryLogger, DirectEntryLogger) and a new LogReader.readEntrySizeAt. But the substantive change is just two
methods:

  • DefaultEntryLogger.readEntryIfFits — peeks the size header before reading the body.
  • DirectEntryLogger.readEntryIfFits — same, via DirectReader.readEntrySizeAt.

The rest is plumbing. The default interface implementations in LedgerStorage.getEntryIfFits and EntryLogger.readEntryIfFits do "read then check & release" — i.e. they preserve correctness but provide
none of the optimization. And the cache-layer short-circuits in SingleDirectoryDbLedgerStorage.doGetEntryIfFits against the write/read caches save zero I/O (the entry is already fully in memory) — they
only save a small amount of memory traffic and network bytes.

Proposed alternative: predict from the first entry

For typical BookKeeper/Pulsar workloads, entries within a single ledger come from the same producer and are nearly uniform in size. Using the first entry as the size estimate, we can stop the batch
before issuing the wasted read — no storage API changes required:

long frameSize = 24 + 8 + 4;
ByteBufList data = null;
for (int i = 0; i < maxCount; i++) {
try {
if (data == null) {
ByteBuf first = requestProcessor.getBookie()
.readEntry(request.getLedgerId(), request.getEntryId());
frameSize += first.readableBytes() + 4;
data = ByteBufList.get(first);

          long perEntry = first.readableBytes() + 4;
          long remaining = (maxSize - frameSize) / Math.max(perEntry, 1);
          maxCount = (int) Math.min(maxCount, 1 + remaining);
          continue;
      }
      ByteBuf entry = requestProcessor.getBookie()
              .readEntry(request.getLedgerId(), request.getEntryId() + i);
      if (frameSize + entry.readableBytes() + 4 > maxSize) {
          entry.release();
          break;
      }
      frameSize += entry.readableBytes() + 4;
      data.add(entry);
  } catch (Bookie.NoEntryException e) {
      if (data == null) {
          throw e;
      }
      break;
  } catch (Throwable e) {
      if (data != null) {
          data.release();
      }
      throw e;
  }

}

Comparison

Case This PR First-entry prediction
Uniform entry sizes (typical) Saves the tail-overflow read Same — prediction is exact, zero waste
Sizes drift up within batch Eliminates all overflow reads One over-read possible, bounded to 1 per batch
Sizes drift down within batch Fully fills budget Under-fills budget — client does extra round-trips
Highly variable sizes Optimal Pessimistic or one wasted read

Code cost: ~10 LOC in one file vs. ~1000 LOC across 25 files.

Combined with the OS-prefetch point above: for DefaultEntryLogger the PR doesn't save disk I/O anyway, so the first-entry approach captures essentially all the achievable wins (skips the body read
entirely → same allocator/syscall savings, similar disk impact). The PR's edge only shows up on DirectEntryLogger with large entries and a highly-variable size distribution.

@dao-jun
Copy link
Copy Markdown
Member Author

dao-jun commented May 22, 2026

@hangc0276 Thanks for your professional review!

When working on this PR, I definitely took this issue into consideration.

Although readEntrySize only needs to read 4 bytes of data, during actual I/O operations the disk may read more data – typically 4 KB. If the page cache is hit, no additional disk I/O is required. Compared to the original implementation, this does save disk I/O.

For Direct I/O, this change may also waste a buffer block (or not, if the subsequent request is a tail read). In that case, only memory allocation and other similar overheads are saved.

In short, at the time I believed this change would indeed reduce system overhead – at the very least, it would save on memory allocation and similar costs.

I understand your concern: introducing such a major change might only reduce some non‑I/O overhead (in the worst case, for example, when the entry size is very small).

In order not to block the release of version 4.18, I have decided to temporarily apply the solution you mentioned in your comment.

However, because Bookkeeper is widely used in Pulsar, I think the entry size tends to be uneven and relatively large. On the client side, a batch is typically 128 KB, but to improve throughput it is often set larger (though not exceeding 5 MB). I currently don't have time to provide benchmark data to quantify the performance improvement of this PR, but I may continue that work in the future.

@dao-jun dao-jun changed the title Optimize batch-read to avoid wasted disk IO Optimize bounded batch reads by predicting entry count May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants