Reopen the failing run
Testing
Re-run cases, reopen failures, and keep the surrounding context close enough to act on instead of reconstructing it from logs.
01 / Builder scar
At first, a few manual chats still felt like a real check.
02 / Who is this for?
03 / The problem
With one or two tools, a few chats can still feel like a real check. With fifteen or twenty, the same plausible answer can hide a skipped tool, a wrong branch, or the wrong file being written.
You tweak the prompt, ask a few questions, and the answer sounds good enough to move on.
One small prompt change can silently skip a tool, change a branch, or write the wrong file while the answer still sounds plausible.
04 / The idea
05 / The solution
Instead of trusting memory, I turned the request into a case with a query, a human-readable expectation, and the exact tools the agent should call.
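The case below points at a get_weather tool imported from the agent. As a hedged sketch (this stub is hypothetical; the real my_agent.get_weather presumably calls a weather API), the tool's shape might look like:

```python
# Hypothetical stand-in for the agent's weather tool.
# The real my_agent.get_weather would call a weather API; this stub
# returns canned data so the sketch is runnable without a network call.
def get_weather(location: str, unit: str = "fahrenheit") -> dict:
    """Return current conditions for a location."""
    return {
        "location": location,
        "temperature": "75°F" if unit == "fahrenheit" else "24°C",
        "condition": "sunny",
    }
```

The case only needs the tool to be an importable function; the expectations describe the answer, and expected_tool_calls pins which functions must actually run.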
from goose.testing import Goose

from my_agent import get_weather


def test_weather_query(weather_goose: Goose) -> None:
    weather_goose.case(
        query="What's the weather like in San Francisco?",
        expectations=[
            "Agent provides weather information for San Francisco",
            "Response mentions sunny weather and 75°F",
        ],
        expected_tool_calls=[get_weather],
    )

06 / The problem
Once I could rerun the same case, every failure came back as a wall of terminal output. I had proof the bug existed, but reading the bug was still too expensive.
The interesting branch gets buried somewhere inside the run.
I still have to reconstruct what the agent actually did.
Reading the logs starts taking longer than fixing the bug.
07 / The idea
08 / The solution
The run stopped being a wall of output. I can reopen the failure, keep the history, and find the broken path fast enough to act on it.
What changed
I do not rebuild context from zero every time something drifts.
Expectations, tool calls, and outputs stay attached.
Testing / dashboard
The dashboard keeps the rerunnable history in view and lets me open the failing path without losing the surrounding case.
09 / The problem
The trace was finally clear, but touching one tool still meant booting the entire agent and walking through a full conversation just to reach one function.
I was paying full-agent cost for one tiny tool check.
The shortest debugging path still included a full conversation.
The real problem was one tool, but the whole stack kept getting in the way.
10 / The idea
11 / The solution
Now I can run the tool with real arguments, inspect the output immediately, and fix the exact layer that broke. The agent no longer has to be part of the setup.
Direct tool invoke
{
  "location": "San Francisco",
  "unit": "fahrenheit"
}

{
  "location": "San Francisco",
  "temperature": "75°F",
  "condition": "sunny"
}

Real args. Real output. No full-agent detour.
Tooling / dashboard
Once the trace tells me where to look, I can jump into one tool, send real arguments, and inspect the output without booting the whole agent again.
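Outside the dashboard, the same shortcut works in plain Python: the tool is just a function, so you can call it with the exact arguments from the trace. (get_weather here is a hypothetical stub standing in for the real tool.)

```python
import json


# Hypothetical stub standing in for my_agent.get_weather;
# the real tool would fetch live conditions from a weather API.
def get_weather(location: str, unit: str = "fahrenheit") -> dict:
    return {"location": location, "temperature": "75°F", "condition": "sunny"}


# Invoke the tool directly with the arguments copied from the failing
# trace -- no model, no conversation, no full-agent boot.
args = {"location": "San Francisco", "unit": "fahrenheit"}
output = get_weather(**args)
print(json.dumps(output, indent=2, ensure_ascii=False))
```

The point of the pattern is that the tool layer is testable on its own: if the direct call is wrong, the bug is in the tool; if it is right, the bug lives in the agent's routing or prompting above it.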
12 / The problem
When someone reports a request outside the suite, I do not want to write the formal test in the dark. I want to replay the request first and watch what the agent actually does.
The request is real, but the suite does not have it yet.
I want to see the tool path before I decide what belongs in coverage.
Formalizing the case too early means guessing before I understand it.
13 / The idea
14 / The solution
Now I can replay the real request, inspect the tool calls and outputs, and only then decide whether the edge case deserves to become a formal test.
What happens next
Replay the live request.
Inspect the tool arguments and output.
Decide whether the edge case belongs in the suite.
Live replay
Compare today's weather in San Francisco with tomorrow and tell me if I should bike.
get_weather({"location": "San Francisco", "unit": "fahrenheit"})
today: 75°F and sunny, tomorrow: 68°F with light wind
Bike today. Tomorrow is still fine, but it will be cooler and windier.
Chat / dashboard
I can watch the request, the tool path, and the answer in one place first, then decide what deserves to become a formal regression test.
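Once the replay looks right, promoting it into the suite reuses the same case API shown earlier. A sketch, assuming the same weather_goose fixture and get_weather tool (the expectation strings here are illustrative, not prescribed):

```python
from goose.testing import Goose

from my_agent import get_weather


def test_bike_weather_comparison(weather_goose: Goose) -> None:
    # Promoted from a live replay: the query and tool path were
    # observed in the dashboard before this became a formal case.
    weather_goose.case(
        query=(
            "Compare today's weather in San Francisco with tomorrow "
            "and tell me if I should bike."
        ),
        expectations=[
            "Agent compares today's and tomorrow's conditions",
            "Response gives a clear biking recommendation",
        ],
        expected_tool_calls=[get_weather],
    )
```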
Inside the dashboard
After the replay, the dashboard still gives each next move its own clear surface instead of collapsing the whole investigation into one crowded screen.
Reopen the failing run
Re-run cases, reopen failures, and keep the surrounding context close enough to act on instead of reconstructing it from logs.
Inspect one tool in isolation
Send real arguments to one tool, inspect the output immediately, and fix the exact layer that broke before the agent gets in the way.
Replay the live request
Replay what the user actually asked, inspect the tool path in context, and only then decide whether the edge case deserves a permanent case.
15 / Closed loop
Once I could replay a real request, read the failing path, isolate one tool, and save the result as a case, Goose finally felt like a system instead of a pile of workarounds.
Goose only started feeling complete when I could move from a real request to saved coverage without guessing in between. If you have an edge case it does not cover yet, open an issue or send a PR.