batch: resolve pending transactions when a flush yields no batch by Jeffail · Pull Request #446 · redpanda-data/benthos

Jeffail · 2026-06-04T13:11:54Z

Summary

The output and input batchers accumulate upstream transactions while buffering messages, then acknowledge them once the resulting batch has been written. When a flush produced no batch — because the batching processors filtered every message away, or because they returned an error — the loop continued without resolving the already-accumulated transactions. Those transactions lingered and were later acknowledged with the result of a subsequent, unrelated batch.

In the current code the practical symptoms are an acknowledgement-accounting problem rather than data loss: transactions belonging to an empty/filtered flush are resolved against a later batch's result (acked late, or spuriously nacked and redelivered if that later batch fails), and — if a pipeline goes idle immediately after a filtered batch — those transactions are left unacknowledged, holding max_in_flight slots and blocking offset commits. A genuine data-loss scenario only arises once batching-processor errors are routed into the dropped-batch path, which is a separate, upcoming change; this PR also stages the correct nack behaviour for it.

Change

policy.Flush now returns (message.Batch, error) to distinguish three outcomes:

| Result | Meaning | Caller action |
| --- | --- | --- |
| `(batch, nil)` | a batch is ready to send onwards | send |
| `(nil, nil)` | the policy was empty, or the batching processors intentionally filtered every message away | ack the buffered transactions (data consumed successfully) |
| `(nil, err)` | the batching processors failed and the batch was dropped | nack the buffered transactions so the source can retry |

Both the output and input batchers resolve their pending transactions immediately in the latter two cases rather than deferring them to a future batch's result. The public service.Batcher.Flush wrapper (already (batch, error)) now propagates the error rather than discarding it.

Tests

Adds TestBatcherDroppedBatchMisattributesAck to both the output and input batcher suites, asserting that transactions belonging to a flush that yields no batch are resolved against that flush rather than inheriting a later batch's result. Existing batcher and batch-policy suites updated for the new signature; internal/impl/pure, internal/impl/io, and internal/stream suites pass unchanged.

🤖 Generated with Claude Code

The output and input batchers accumulate upstream transactions while buffering messages, then acknowledge them once the resulting batch has been written. When a flush produced no batch -- because the batching processors filtered every message away, or because they returned an error -- the loop continued without resolving the already accumulated transactions. Those transactions lingered and were later acknowledged with the result of a subsequent, unrelated batch, which could acknowledge data that was never delivered. policy.Flush now returns (batch, error) to distinguish three outcomes: a batch ready to send onwards, an intentionally empty result (the buffered transactions are acked as successfully consumed), and a processing failure (the buffered transactions are nacked so the source can retry them). Both batchers resolve their pending transactions immediately in the latter two cases rather than deferring them to a future batch's result. Adds a regression test covering the misattributed acknowledgement.

claude · 2026-06-04T13:16:46Z

 	flushBatchFn := func() {
-		sendMsg := m.batcher.Flush(closeNowCtx)
+		sendMsg, err := m.batcher.Flush(closeNowCtx)
+		if err != nil {
+			// The batching processors failed and the batch has been dropped.
+			// Nack the accumulated transactions so the source can retry them.
+			resolvePending(err)
+			return
+		}
 		if sendMsg == nil {
+			// No batch produced (the policy was empty or every message was
+			// intentionally filtered away); the data was consumed successfully.
+			resolvePending(nil)
 			return
 		}


The input batcher received the same behavioral fix as the output batcher — flushBatchFn now nacks pending transactions on a processor error and acks them on an empty/filtered flush via resolvePending. However, only the output batcher gained a regression test (TestBatcherDroppedBatchMisattributesAck); these new branches in the input batcher have no coverage. Consider adding a mirrored regression test (e.g. a count-triggered batch whose processor filters every message away, followed by a later batch that nacks, asserting the filtered transactions resolve with nil independently) so a future regression in the input path is caught too.

Per the project test patterns, changed code should be accompanied by tests exercising the new behavior.

Added in c03f060: TestBatcherDroppedBatchMisattributesAck in the input batcher test, mirroring the output batcher's. It drives a count-triggered batch whose processor filters every message away, then a later batch that the consumer nacks, and asserts the filtered transactions resolve with nil independently. Verified it fails against the pre-fix code and passes with the fix.

claude · 2026-06-04T13:16:54Z

Commits
LGTM — batch: resolve pending transactions when a flush yields no batch follows the system: message convention, uses lowercase imperative mood, is a single self-contained logical change (the fix plus its regression test), and has a clear, accurate message body.

Review
The fix is correct. policy.Flush now returns (batch, error) to distinguish a ready batch, an intentionally-empty result (ack), and a processing failure (nack). The signature change is propagated consistently to all callers — the input batcher, the output batcher, and the public service.Batcher wrapper (whose external signature was already (MessageBatch, error), so buffer_memory.go is unaffected) — and policy_test.go / config_batch_policy_test.go are updated accordingly. The misattributed-ack bug is well covered by the new output-batcher regression test.

The input batcher (internal/component/input/batcher/batcher.go) received the equivalent resolvePending fix (nack on processor error, ack on empty/filtered flush) but, unlike the output batcher, has no regression test exercising those new branches. Consider mirroring TestBatcherDroppedBatchMisattributesAck for the input path. (See inline comment.)

Jeffail · 2026-06-04T13:17:15Z

Correction — see the follow-up below. On further investigation the data-loss scenario described here does not trigger in the current code.

When does this affect a pipeline in practice?

The most likely real-world exposure is on shutdown / restart. The unresolved-transaction path is reached whenever a flush yields no batch, and during a graceful shutdown the flush context can be cancelled while the batching processors are still running — leaving transactions accumulated in that codepath.

This is most probable when the batching policy is large and the batch processors are slow enough to be the pipeline's bottleneck: at any given instant during the drain there is a high chance messages are sitting mid-processing. When the SIGTERM lands in that window, those in-flight transactions could be acknowledged (treated as delivered) even though their data was never written — i.e. silent data loss across the restart.

Steady-state pipelines with cheap, fast batch processing are far less likely to observe this, since the window in which messages occupy that codepath is vanishingly small.

Follow-up correction. On further investigation the data-loss scenario above does not actually trigger in the current code, and I would like to correct the record.

For a batch to be dropped (rather than written) due to a failure, policy.Flush must return no batch. That happens in only three situations:

- the batching processors intentionally filter every message away (e.g. root = deleted()); here acknowledging is the correct outcome, not loss;
- the policy is empty, so there is nothing to lose;
- ExecuteAll returns an error, but that occurs only on context cancellation, because the auto-observed processor wrapper swallows processor errors and passes the original data through unprocessed (it is written as raw data, not dropped).

The flush here uses the hard-stop context, so a cancelled flush coincides with a hard stop, at which point the batcher loop exits before any subsequent batch is produced. The accumulated transactions are therefore never acknowledged and are redelivered on restart — at-least-once is preserved, not violated. The only situation in which the loop continues after an empty flush is the intentional-filter case, where acknowledging is correct.

So the real-world symptoms of this bug are an acknowledgement-accounting issue: mis-attributed / delayed acknowledgements for filtered batches, possible spurious redelivery if a later batch is nacked, and — if a pipeline goes idle immediately after a filtered batch — transactions left unacknowledged, holding max_in_flight slots and blocking offset commits.

The data-loss risk becomes real only once batching-processor errors are routed into the dropped-batch path (the separate, upcoming output-error-rejection change). This PR's value is correcting the acknowledgement accounting and staging that nack path correctly; it does not fix a present-day data-loss bug.

josephwoodward · 2026-06-04T13:22:27Z

 		if err != nil {
 			p.log.Error("Batch processors resulted in error: %v, the batch has been dropped.", err)
-			return nil
+			return nil, err


Would it be worth adding context to this error message so it can be related to the error log above?

Good call — done in c03f060. The returned error is now wrapped as fmt.Errorf("batch processors resulted in error: %w", err), so the nack reason surfaced downstream lines up with the logged message just above it.

- Wrap the error returned from the batch policy flush with context so the nack reason surfaced downstream relates to the logged error.
- Add a regression test for the input batcher mirroring the output batcher's, covering the new nack-on-error and ack-on-empty-flush branches.

claude · 2026-06-04T13:34:31Z

Commits

c03f060 — "batch: address review feedback". The message is vague/process-oriented: it describes the act of responding to review rather than what changed, in the same spirit as the discouraged "WIP"/"updates" examples. Its contents (wrapping the flush error with context, adding the mirrored input-batcher regression test) are refinements to the first commit rather than an independent logical change, so it should ideally be squashed into d10c250 for a clean, self-contained history.

Review

The fix is sound. policy.Flush now returns (message.Batch, error) to distinguish a ready batch, an intentionally-empty flush (ack), and a processing failure (nack); both the input and output batchers resolve their accumulated transactions immediately in the latter two cases instead of letting them inherit a later batch's result. All call sites were updated for the new signature (the buffer_memory.go caller uses the public service.Batcher, which was already (batch, error)), fmt is already imported in policy.go, and the new regression tests follow the existing conventions in their files.

LGTM

claude Bot reviewed Jun 4, 2026

josephwoodward reviewed Jun 4, 2026

squiidz approved these changes Jun 4, 2026

Jeffail wants to merge 2 commits into main from CON-461

3 participants
