01 / Builder scar

Goose started when manual testing stopped being a method.

At first, a few manual chats still felt like a real check.

02 / Who is this for?

For teams running agent systems in production who need to validate agent behavior reliably.

03 / The problem

The real break came when plausible output became a false positive.

With one or two tools, a few chats can still feel like a real check. With fifteen or twenty, the same plausible answer can hide a skipped tool, a wrong branch, or the wrong file being written.

1-2 tools

Manual chats still feel honest

You tweak the prompt, ask a few questions, and the answer sounds fine enough to move on.

15-20 tools

Plausible output stops meaning correct behavior

One small prompt change can silently skip a tool, change a branch, or write the wrong file while the answer still sounds plausible.

04 / The idea

So I needed a rerunnable case, not another manual check.

05 / The solution

The first fix was a case I could rerun on every prompt change.

Instead of trusting memory, I turned the request into a case with a query, a human-readable expectation, and the exact tools the agent should call.

behavior_case.py
from goose.testing import Goose
from my_agent import get_weather


def test_weather_query(weather_goose: Goose) -> None:
    # weather_goose: pytest fixture that provides the Goose harness bound to this agent.
    weather_goose.case(
        query="What's the weather like in San Francisco?",
        expectations=[
            "Agent provides weather information for San Francisco",
            "Response mentions sunny weather and 75°F",
        ],
        expected_tool_calls=[get_weather],
    )
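Conceptually, `expected_tool_calls` reduces to an ordered-subsequence check against the tools the agent actually invoked. A minimal self-contained sketch of that idea (not Goose's internals, just the check the case encodes):

```python
def missing_tool_calls(recorded: list[str], expected: list[str]) -> list[str]:
    """Return expected tool names that do not appear, in order, in the recorded trace."""
    missing = []
    start = 0
    for name in expected:
        try:
            # Each expected call must show up after the previous match.
            start = recorded.index(name, start) + 1
        except ValueError:
            missing.append(name)
    return missing
```

An empty result means the trace matched; anything else names the skipped tool directly, which is exactly the failure mode the 15-20 tool scenario hides.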

06 / The problem

Repeatability solved one problem and exposed the next.

Once I could rerun the same case, every failure came back as a wall of terminal output. I had proof the bug existed, but reading the bug was still too expensive.

Scrollback

The interesting branch gets buried somewhere inside the run.

Context

I still have to reconstruct what the agent actually did.

Speed

Reading the logs starts taking longer than fixing the bug.

07 / The idea

So I built the UI because the logs made me guess again.

08 / The solution

The Testing view put the failure path back in one place.

The run stopped being a wall of output. I can reopen the failure, keep the history, and find the broken path fast enough to act on it.

What changed

Open the failing run

I do not rebuild context from zero every time something drifts.

Read the path in one screen

Expectations, tool calls, and outputs stay attached.

Testing / dashboard

First the overview, then the exact trace.

The dashboard keeps the rerunnable history in view and lets me open the failing path without losing the surrounding case.

09 / The problem

Readable failures still left one absurd debugging path.

The trace was finally clear, but touching one tool still meant booting the entire agent and walking through a full conversation just to reach one function.

Overkill

I was paying full-agent cost for one tiny tool check.

Latency

The shortest debugging path still included a full conversation.

Focus

The real problem was one tool, but the whole stack kept getting in the way.

10 / The idea

So I gave the tool its own surface.

11 / The solution

Tooling let me debug one tool without booting the whole agent.

Now I can run the tool with real arguments, inspect the output immediately, and fix the exact layer that broke. The agent no longer has to be part of the setup.

Direct tool invoke

get_weather

Tooling
Arguments
{
  "location": "San Francisco",
  "unit": "fahrenheit"
}
Result
{
  "location": "San Francisco",
  "temperature": "75°F",
  "condition": "sunny"
}

Real args. Real output. No full-agent detour.

Tooling / dashboard

Tooling keeps the one-tool debugging loop close.

Once the trace tells me where to look, I can jump into one tool, send real arguments, and inspect the output without booting the whole agent again.
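Direct invocation works because underneath the agent, a tool is just a function. A hypothetical stub of `get_weather` that returns the shape shown in the panel above (a real version would query a weather service):

```python
def get_weather(location: str, unit: str = "fahrenheit") -> dict:
    """Return a weather report for `location` in the requested `unit`."""
    # Hypothetical stub with fixed data; only the return shape matters here.
    return {"location": location, "temperature": "75°F", "condition": "sunny"}
```

Calling `get_weather(location="San Francisco", unit="fahrenheit")` in a REPL is the same loop the Tooling tab automates, minus the agent.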

12 / The problem

Real user bugs do not show up as prewritten test cases.

When someone reports a request outside the suite, I do not want to write the formal test in the dark. I want to replay the request first and watch what the agent actually does.

A user arrives

The request is real, but the suite does not have it yet.

I still need visibility

I want to see the tool path before I decide what belongs in coverage.

Blind test writing is slow

Formalizing the case too early means guessing before I understand it.

13 / The idea

So I needed a live replay before I wrote the formal test.

14 / The solution

Chat lets me replay the request before I freeze it into coverage.

Now I can replay the real request, inspect the tool calls and outputs, and only then decide whether the edge case deserves to become a formal test.

What happens next

01

Replay the live request.

02

Inspect the tool arguments and output.

03

Decide whether the edge case belongs in the suite.

Live replay

Chat with tool visibility

Chat
User request

Compare today's weather in San Francisco with tomorrow and tell me if I should bike.

Tool call

get_weather({"location": "San Francisco", "unit": "fahrenheit"})

Tool output

today: 75°F and sunny, tomorrow: 68°F with light wind

Assistant

Bike today. Tomorrow is still fine, but it will be cooler and windier.

Chat / dashboard

Chat keeps the live replay visible before I formalize the case.

I can watch the request, the tool path, and the answer in one place first, then decide what deserves to become a formal regression test.
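The tool visibility Chat provides can be approximated with a plain recording wrapper. A minimal sketch (not Goose's implementation) that logs each tool call and its output while a request is replayed:

```python
import functools


def record_calls(log: list):
    """Wrap a tool so each invocation is appended to `log` as (name, args, result)."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            result = tool(*args, **kwargs)
            log.append((tool.__name__, kwargs or args, result))
            return result
        return wrapper
    return decorator


calls = []


@record_calls(calls)
def get_weather(location: str, unit: str = "fahrenheit") -> str:
    # Hypothetical stub; the real tool would call a weather API.
    return "today: 75°F and sunny, tomorrow: 68°F with light wind"
```

After replaying the request, `calls` holds the exact tool path, which is the evidence needed before freezing the case into the suite.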

Inside the dashboard

The dashboard keeps every part of the debugging loop open

After the replay, the dashboard still gives each next move its own clear surface instead of collapsing the whole investigation into one crowded screen.

Reopen the failing run

Testing

Re-run cases, reopen failures, and keep the surrounding context close enough to act on instead of reconstructing it from logs.

Inspect one tool in isolation

Tooling

Send real arguments to one tool, inspect the output immediately, and fix the exact layer that broke before the agent gets in the way.

Replay the live request

Chat

Replay what the user actually asked, inspect the tool path in context, and only then decide whether the edge case deserves a permanent case.

15 / Closed loop

The point was never three tabs. It was one closed loop.

Once I could replay a real request, read the failing path, isolate one tool, and save the result as a case, Goose finally felt like a system instead of a pile of workarounds.

01 Replay the request
02 Read the path
03 Isolate one tool
04 Save and rerun

Goose only started feeling complete when I could move from a real request to saved coverage without guessing in between. If you have an edge case it does not cover yet, open an issue or send a PR.