Tool retry on error
when a tool call fails, give the model one or two structured retries before surfacing the failure to the user.
Goal: when a tool call fails, give the model one or two structured retries before surfacing the failure to the user.
Difficulty: easy. Time: 30–60 minutes. Touches: internal/agent/agent.go.
What's already in place
The agent loop in internal/agent/agent.go already routes tool errors back to the model: the tool's output is wrapped in an api.Block with IsError: true, and the model gets to decide what to do next. That's flexible — but there's nothing stopping the model from calling the same broken tool with the same broken input in a tight loop.
You can read the relevant lines around toolResults and IsError in internal/agent/agent.go:107–137.
What to build
A small RepairPolicy that:
- counts consecutive failures of the same tool (same name, ideally same input hash),
- on retry, injects a short meta-note into the tool result like "this is retry 2/3; if you can't recover, ask the user",
- after
MaxRetries, aborts the tool dispatch and surfaces a clear error to the user.
Suggested steps
-
Define the policy. Somewhere in
internal/agent/(or a newinternal/repair/package):type RepairPolicy struct { MaxRetries int Note func(toolName string, attempt, max int) string // returns the meta-note } -
Track state on the agent. Add a
repairCounts map[string]intfield onAgent. Reset entries when a tool succeeds or when the model moves on to a different tool/input. -
Wire it into the loop. In the tool-dispatch block where
IsErroris set, consult the policy. Either append the note to the tool result content, or bump the count and abort. -
Make it pluggable in main.go. Mirror the other extension points:
a.Repair = &agent.RepairPolicy{MaxRetries: 3}A nil policy means "current behaviour, unlimited retries."
Acceptance
- A
bashcommand that fails returns to the model with aretry 1/3note. - The same command failing again gets
retry 2/3, then aborts at 3 with a single user-facing error message. - The cap is configurable from
main.go. - A
MockProvider-based test (seeinternal/provider/mock.go) confirms the abort happens at the configured retry count without an API call.
Stretch
- Report repair statistics in the
/debugpanel (count of repairs by tool name). - Different policies per tool:
bashgets 1 retry,read_filegets 3. - Hash the tool input so "same tool, different argument" doesn't count as a retry.