Skip to content

feat(adk): add BeforeFinalAnswer hook to ChatModelAgentMiddleware#922

Draft
shentongmartin wants to merge 65 commits into
mainfrom
feat/retry_on_success
Draft

feat(adk): add BeforeFinalAnswer hook to ChatModelAgentMiddleware#922
shentongmartin wants to merge 65 commits into
mainfrom
feat/retry_on_success

Conversation

@shentongmartin

@shentongmartin shentongmartin commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Problem

When a ChatModelAgent produces a final answer (no tool calls), there was no way to inspect or reject it before the agent exits. Users had no quality gate for cases like:

  • FinishReason=length (context window overflow, truncated response)
  • Empty content (model returned nothing useful)
  • Repeated or low-quality output

The earlier approach of extending ModelRetryConfig to handle these cases blurred the line between transient error recovery and response quality gating — two fundamentally different concerns.

Solution

Add a BeforeFinalAnswer hook to ChatModelAgentMiddleware that acts as a quality gate on final answers. When the model produces a response with no tool calls, the hook inspects it and decides: accept (agent exits) or reject (loop back for another iteration).

type ChatModelAgentMiddleware interface {
    // ...existing methods...
    BeforeFinalAnswer(ctx context.Context, state *ChatModelAgentState) (context.Context, bool, *ChatModelAgentState, error)
}

The hook receives the full ChatModelAgentState (with the model's response as the last message in Messages). It can:

  • Inspect FinishReason, content, or any aspect of the response
  • Modify state.Messages (e.g., append a "please continue" user message)
  • Return accept=false to trigger another iteration

Rejected answers count toward MaxIterations, providing a natural cap against runaway loops.

Key Insight

The hook is embedded directly in the existing graph branch — no extra node needed for routing. In the ReAct graph, the toolCallCheck branch already decides between "tool calls → continue loop" and "no tool calls → exit." We extend this same branch to become a 3-way decision:

ChatModel → [toolCallCheck branch]
              → compose.END              (no tool calls + accepted)
              → FinalAnswerRejection     (no tool calls + rejected → loops back to ChatModel)
              → CancelCheck              (tool calls → tool execution loop, unchanged)

The FinalAnswerRejection node is needed solely for type conversion (Message[]Message from state) — the actual decision logic lives entirely in the branch function. This keeps the graph topology minimal.

A second insight: the stateModelWrapper already writes the model's response into State.Messages via compose.ProcessState before the branch runs. So the hook naturally sees the complete response without any special plumbing.

Decisions

Why a middleware hook instead of extending ModelRetryConfig?

ModelRetryConfig is about transient error recovery — low-level, sits in the model wrapper, fires on error returns. Response quality gating is a higher-level concern that should integrate with the agent's iteration system. Making it a middleware hook means:

  • It composes with other middleware (BeforeModelRewriteState, AfterModelRewriteState, etc.)
  • It naturally uses MaxIterations for capping (no separate MaxRetries to configure)
  • It doesn't interfere with error retry semantics
  • Each rejected attempt is visible to the user via AgentEvent (the event sender sits inside the retry wrapper)

Why convert the no-tools path from Chain to Graph?

The no-tools path previously used compose.NewChain (linear, no loop). To support BeforeFinalAnswer rejection (which requires looping back), we converted it to compose.NewGraph with the same Init → ChatModel → [branch] topology. This gives uniform architecture between tools and no-tools paths, and the MaxIterations enforcement is identical.

Summary

Problem Solution
No quality gate for model final answers (truncated, empty, low-quality) BeforeFinalAnswer hook on ChatModelAgentMiddleware — inspect and accept/reject
Response quality was conflated with error retry in ModelRetryConfig Separate concerns: ModelRetryConfig for errors only, BeforeFinalAnswer for response quality
No-tools path used Chain (no loop capability) Converted to Graph with same branch-based architecture as ReAct path

问题

ChatModelAgent 产生最终回答(无 tool 调用)时,没有机制在 agent 退出之前检查或拒绝该回答。用户无法对以下场景做质量把关:

  • FinishReason=length(上下文窗口溢出,回答被截断)
  • 空内容(模型没有返回有用内容)
  • 重复或低质量输出

之前尝试扩展 ModelRetryConfig 来处理这些场景,但这模糊了瞬态错误恢复和响应质量把关之间的界限——这是两个本质不同的关注点。

方案

ChatModelAgentMiddleware 上新增 BeforeFinalAnswer 钩子,作为最终回答的质量关卡。当模型产生无 tool 调用的响应时,该钩子检查并决定:接受(agent 正常退出)或拒绝(循环回去进行下一次迭代)。

type ChatModelAgentMiddleware interface {
    // ...已有方法...
    BeforeFinalAnswer(ctx context.Context, state *ChatModelAgentState) (context.Context, bool, *ChatModelAgentState, error)
}

钩子接收完整的 ChatModelAgentState(模型的回答是 Messages 中的最后一条消息)。它可以:

  • 检查 FinishReason、内容或回答的任何方面
  • 修改 state.Messages(例如追加一条 "请继续" 的用户消息)
  • 返回 accept=false 触发新一轮迭代

被拒绝的回答计入 MaxIterations,自然防止无限循环。

核心洞见

钩子直接嵌入现有的图分支中 —— 路由无需额外节点。在 ReAct 图中,toolCallCheck 分支原本在 "有 tool 调用 → 继续循环" 和 "无 tool 调用 → 退出" 之间做决策。我们将这个分支扩展为三路决策:

ChatModel → [toolCallCheck branch]
              → compose.END              (无 tool 调用 + 接受)
              → FinalAnswerRejection     (无 tool 调用 + 拒绝 → 循环回 ChatModel)
              → CancelCheck              (有 tool 调用 → tool 执行循环,不变)

FinalAnswerRejection 节点仅用于类型转换(Message → 从 state 读取 []Message)—— 实际决策逻辑完全在分支函数中。这保持了图拓扑的精简。

第二个洞见:stateModelWrapper 在分支运行之前就已经通过 compose.ProcessState 将模型的响应写入了 State.Messages。因此钩子无需任何特殊处理就能看到完整的响应。

决策

为什么用中间件钩子而不是扩展 ModelRetryConfig

ModelRetryConfig 用于瞬态错误恢复 —— 底层的、位于模型 wrapper 中的、在 error 返回时触发。响应质量把关是更高层次的关注点,应该与 agent 的迭代系统集成。作为中间件钩子意味着:

  • 与其他中间件组合(BeforeModelRewriteStateAfterModelRewriteState 等)
  • 自然使用 MaxIterations 做上限(无需单独配置 MaxRetries
  • 不干扰错误重试语义
  • 每次被拒绝的尝试都通过 AgentEvent 对用户可见

为什么将无 tools 路径从 Chain 转换为 Graph

无 tools 路径之前使用 compose.NewChain(线性,无循环能力)。为支持 BeforeFinalAnswer 拒绝(需要循环回去),我们将其转换为 compose.NewGraph,使用相同的 Init → ChatModel → [branch] 拓扑。这使 tools 路径和无 tools 路径的架构统一,且 MaxIterations 的强制机制完全一致。

总结

问题 方案
模型最终回答没有质量关卡(截断、空内容、低质量) ChatModelAgentMiddleware 上的 BeforeFinalAnswer 钩子 —— 检查并接受/拒绝
响应质量与 ModelRetryConfig 中的错误重试混为一谈 关注点分离:ModelRetryConfig 仅处理错误,BeforeFinalAnswer 处理响应质量
无 tools 路径使用 Chain(无循环能力) 转换为 Graph,与 ReAct 路径使用相同的分支架构

@codecov

codecov Bot commented Mar 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.00000% with 23 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (alpha/09@28a9142). Learn more about missing BASE report.

Files with missing lines Patch % Lines
adk/react.go 79.27% 20 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             alpha/09     #922   +/-   ##
===========================================
  Coverage            ?   82.52%           
===========================================
  Files               ?      161           
  Lines               ?    21610           
  Branches            ?        0           
===========================================
  Hits                ?    17834           
  Misses              ?     2552           
  Partials            ?     1224           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread adk/chatmodel.go
@shentongmartin shentongmartin changed the title feat(adk): support response-based retry and input modification in ModelRetryConfig feat(adk): add BeforeFinalAnswer hook to ChatModelAgentMiddleware Apr 7, 2026
Comment thread adk/before_final_answer_test.go
@shentongmartin shentongmartin force-pushed the feat/retry_on_success branch from f714cbd to f5f4489 Compare April 7, 2026 06:47
@shentongmartin shentongmartin force-pushed the alpha/09 branch 5 times, most recently from 9661935 to d7161f7 Compare April 15, 2026 06:15
@shentongmartin shentongmartin marked this pull request as draft April 16, 2026 01:59
@shentongmartin shentongmartin force-pushed the feat/retry_on_success branch 2 times, most recently from fad7d45 to f0d8d27 Compare April 16, 2026 06:06
hi-pender and others added 16 commits April 21, 2026 10:43
Change-Id: If20fa78dba82a1c177c8ec47090050ea8c1354ed
* fix(adk): skip saving checkpoint when TurnLoop is idle

When Stop() is called on an idle TurnLoop (no active agent run, no
unhandled items, no canceled items), the resulting checkpoint contains
no meaningful state. Skip saving such checkpoints to avoid unnecessary
store writes.

- Add isIdle check in cleanup() before checkpoint save decision
- Add TestTurnLoop_StopWhileIdle_SkipsCheckpoint test

Change-Id: I6aeaff5ed5833a971cb95298193fdb96d904baf8

* fix(internal): merge id2State in PopulateInterruptState instead of replacing

PopulateInterruptState merged id2Addr entries one by one but replaced
id2State wholesale. In a parallel workflow resume, two goroutines share
the same globalResumeInfo. If one goroutine's compose graph called
PopulateInterruptState (replacing id2State with compose-only entries)
before the other goroutine looked up its outer-level entry, the lookup
returned a zero-value InterruptState with State=nil, triggering the
'has no state' panic in ChatModelAgent.Resume.

Change id2State handling to merge entry by entry, consistent with
id2Addr.

Change-Id: Ia21f65289bff7beb2bc383fb033926ad9c92d7e7

* fix(adk): keep watching for cancel escalation after stopSig.done

When watchStopSignal entered the stopSig.done branch, it processed the
initial cancel and then blocked on <-done (turn completion), never
looping back to check notify. This meant a subsequent Stop() call with
a higher cancel mode (e.g. CancelImmediate) was never forwarded to the
agent, causing TestTurnLoop_Stop_EscalatesCancelMode to time out.

Replace the blocking <-done with an inner loop that selects on both
done and notify, so escalation signals are always delivered. Also apply
the generation-based dedup check consistent with the notify branch.

Change-Id: Ia6a04d00a2b44625ffbcb625ff0e559c12ed145f
…r agent cancellation (#929)

* fix(adk): prevent panic when orphaned tool goroutine sends event after agent cancellation

When CancelAfterChatModel times out and escalates to CancelImmediate,
GraphInterrupt fires with timeout=0. The compose graph returns immediately,
orphaning parallel tool goroutines. When an orphaned tool completes,
eventSenderToolWrapper tries to send an event via the AsyncGenerator which
is already closed, causing 'send on closed channel' panic.

- Add isImmediateCancelled() to cancelContext for checking immediateChan
- Make chatModelAgentExecCtx.send cancel-aware: skip send when immediate cancel is active
- Use trySend as safety net for the TOCTOU race window
- Route SendEvent() through execCtx.send() instead of direct generator.Send()

Change-Id: Ic7e0194c860e2692a3cddc559911ab379024f650

* test(adk): add test for orphaned tool goroutine panic after CancelImmediate

- unit_send_after_close: directly reproduces the panic by sending to a
  closed generator with isImmediateCancelled=true
- unit_send_after_close_without_cancel_ctx: verifies trySend safety net
  prevents panic even without cancelCtx
- integration_cancel_escalation_orphans_tool: end-to-end test with slow
  tool, CancelAfterChatModel timeout escalation, and orphaned goroutine

Change-Id: Ia82fa957b102ccc2ac42094d18d4b15db2a1701c

* test(adk): improve coverage for orphaned tool goroutine fix

Add test cases for:
- nil execCtx and nil generator defensive guards
- nil cancelContext in isImmediateCancelled
- TOCTOU race window (isImmediateCancelled=false but generator closed)
- SendEvent public API with closed generator
- SendEvent without exec context

Change-Id: I197c36f34675f5376cbe5f830b15db6ca873cd1f
…925)

* fix(adk): keep late turn loop items

Change-Id: Iabee0c25a83d5a25585d3592a41ca6a5fba35c2b

* docs(adk): clarify cancel wait semantics

Change-Id: Ia0a396b9cc2e43f15e85056d966f20b010dcd2b6

* feat(adk): add WithSkipCheckpoint and WithStopCause StopOptions

Add two new StopOption variants for TurnLoop.Stop():

- WithSkipCheckpoint: prevents checkpoint persistence on stop, for
  cases where the caller does not intend to resume in the future.
  The flag is sticky across escalation calls.

- WithStopCause: attaches a business-supplied reason string. Surfaced
  in TurnLoopExitState.StopCause and, after the Stopped channel
  closes, via TurnContext.StopCause(). Uses first-non-empty-wins
  semantics across multiple Stop() calls.

Thread both fields through stopSignal with proper mutex protection.
Update cleanup() to skip checkpoint save when skipCheckpoint is set.

Change-Id: Ifeat-stop-options-skip-checkpoint-stop-cause
* fix: rebase error

Change-Id: If20fa78dba82a1c177c8ec47090050ea8c1354ed

* feat(adk): add failover support for ChatModel

Change-Id: Ice1b513b4b509e7b540316da9119ff3d529c9bae

* feat(adk): add failover support for ChatModel

Change-Id: Ice1b513b4b509e7b540316da9119ff3d529c9bae

* feat(adk): add failover support for ChatModel

Change-Id: Id5483447b74322f6dd495bdd3b994c001094569d

* feat(adk): make Name and Description optional in ChatModelAgentConfig

* feat(adk): add callback lifecycle management to failoverProxyModel

- Extract prepareCallbacks method to reuse callback setup logic between
  Generate and Stream methods
- Add callbacks.ReuseHandlers with proper RunInfo (model type + component)
  before each failover model invocation so handlers receive correct identity
- Add explicit OnStart/OnEnd/OnError callback invocations in Generate and
  Stream since failoverProxyModel declares IsCallbacksEnabled() = true and
  the outer layer skips automatic callback injection

Change-Id: I0150529024125251828cf6f77c8247aa464b1f84

* fix(adk): preserve partial result in failoverProxyModel.Generate on error

Return result instead of nil when target.Generate fails, so that the
outer failoverModelWrapper can pass the partial output message to
ShouldFailover for inspection.

Change-Id: I32d86151a6e133f1a58d5e988bccf42d831a646c

* refactor(adk): use EnsureRunInfo in failoverProxyModel and separate ctx for callbacks

- Replace manual RunInfo construction + ReuseHandlers with
  callbacks.EnsureRunInfo for cleaner RunInfo setup
- Use nCtx (from EnsureRunInfo) for target model invocation and
  original ctx for OnStart/OnEnd/OnError callback lifecycle

Change-Id: I1d5982d0e1ceeaf8f6648b9c40c229b6a2b07ab8

---------

Co-authored-by: shentong.martin <shentong.martin@bytedance.com>
feat: tool search definition
…945)

- Add ToolAliases to prepareExecContext when building ToolsNodeConfig
- Add UnknownToolsHandler, ExecuteSequentially, ToolArgumentsHandler,
  and ToolAliases to applyBeforeAgent when rebuilding after BeforeAgent
  handlers modify tools
- Add tests covering argument alias remapping, name alias dispatch,
  alias preservation after handler rebuild, and handler-only tool
  registration with pre-configured aliases
@shentongmartin shentongmartin force-pushed the feat/retry_on_success branch from ea38bd8 to c665dc5 Compare April 21, 2026 09:08
JonXSnow and others added 6 commits April 21, 2026 20:23
feat(adk): add MultiModalRead with custom FileContentPart types

- Define FileContentPartType, FileContentPart in filesystem package
  to replace direct schema.ToolOutputPart dependency, supporting
  only Image (bytes) and File (bytes) types
- Add MultiModalReader interface and MultiModalReadRequest with Pages field
- Add multiModalReadFileArgs extending readFileArgs with PDF pages param
- Convert FileContentPart to schema.ToolOutputPart with base64
  encoding in middleware layer
- Guard against nil FileContent returned from Backend.Read and
  MultiModalRead; return human-readable fallback instead of panicking
- Reuse base64 encoding buffer across multimodal parts via base64Encoder
- Add tests for image, file, unsupported type, pages passthrough,
  schema fields, custom desc, empty data error, nil result, and routing
Change-Id: I743af2360a8742c809712c5a2134d67dcafc66ac
…anic, improve docs

- Introduce FinalAnswerDecision type (AcceptFinalAnswer/RejectFinalAnswer) to
  replace bare bool in BeforeFinalAnswer, making call sites self-documenting
- Add nil guard for newState in runBeforeFinalAnswer to prevent nil pointer
  dereference when a handler returns nil state with RejectFinalAnswer
- Document ctx propagation limitation with SetRunLocalValue workaround,
  nil state preservation, and multi-handler veto semantics in godoc
- Add protocol comment to toolCallCheck explaining stream drain / state flow
- Add 3 regression tests: nil state accept, nil state reject, multi-handler veto

Change-Id: I16148180557d8bfe76b99af007ec13799ccd9c00
…ReturnDirectly through gate

Change-Id: I5e03169f8411ce0a3dc17b6a3d47d901aece3607
…ration

- Remove orphaned code block in getRunFunc left from conflict resolution
- Change BeforeFinalAnswer receiver to generic TypedBaseChatModelAgentMiddleware[M]
  to avoid Go's "cannot define methods on instantiated type" error
- Guard no-tools chain optimization with len(a.handlers) == 0 so agents
  with BeforeFinalAnswer handlers still route through the react graph

Change-Id: Ic32282313888c4433ecf9e8bbb31e50a0e386cca
…raph

- Change BeforeFinalAnswer signature to use TypedChatModelAgentState[M]
  instead of hardcoded ChatModelAgentState, consistent with all other
  state-touching hooks
- Rename runBeforeFinalAnswer to runTypedBeforeFinalAnswer[M] so it works
  with both *schema.Message and *schema.AgenticMessage state types
- Wire BeforeFinalAnswer into newAgenticReact: add FinalAnswerRejection
  node, gate model final-answer path, and gate ReturnDirectly path

Change-Id: I273b3f9bb08c889adcbc8e1068844f95625be2be
@shentongmartin shentongmartin force-pushed the feat/retry_on_success branch from c665dc5 to 2570259 Compare April 21, 2026 13:19
Base automatically changed from alpha/09 to main May 19, 2026 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

7 participants