TDD: Use red-green refactoring to force AI to take small steps

Matt Pocock calls himself "the most stable way to improve the quality of agent output." An in-depth breakdown of why his TDD skill insists on vertical rather than horizontal red-green refactoring, how to avoid coupling between testing and implementation, and the new meaning of TDD in the AI era

The rate of feedback is your speed limit. Don't outrun your headlights.

David Thomas & Andrew HuntThe Pragmatic Programmer

Visit

Failure mode: "AI does the right thing, but it can't run"

The third failure mode in Matt's talk: The direction is right, but it doesn't work.

The most direct fix is to install feedback infrastructure for AI:

TypeScript (no static typing is crazy)
Allow LLM to access the browser and view the page by itself
Automated testing

But Matt observed one thing: Even with this feedback installed, LLM doesn't work well. It tends to write 500 lines at a time and then think "oh I should type check that". This is what the Pragmatic Programmer calls outrunning your headlights - drive faster than the headlights can illuminate, and it's only a matter of time before you hit the wall.

"The rate of feedback is your speed limit, which means you should be testing as you go, taking small deliberate steps. And the AI by default is really not very good at that."

To fix this problem, you need to force the AI to stop step by step at the tool level. Matt's answer is TDD - Testing first can force checkpoints.

Classic Theory: Kent Beck’s Red-Green Reconstruction

The standard rhythm of TDD is defined by Kent Beck in his 2003 book "Test-Driven Development: By Example":

RED: Write a failing test (describe what to do)
GREEN: Write the smallest enough code to make the test pass
REFACTOR: Improve code structure under test protection

Each loop is extremely short - on the order of minutes. There are automated checks (test pass/fail) at every step.

Matt directly follows this rhythm, but his SKILL.md spends a lot of time talking about an anti-pattern - this is the core.

Key anti-pattern: horizontally sliced red and green

Many people think that TDD means "write all the tests first, then write all the implementations." Matt directly says this is wrong in SKILL.md:

WRONG (horizontal slicing):
  RED:   test1, test2, test3, test4, test5
  GREEN: impl1, impl2, impl3, impl4, impl5

RIGHT (vertical slicing via tracer bullets):
  RED→GREEN: test1→impl1
  RED→GREEN: test2→impl2
  RED→GREEN: test3→impl3
  ...

Why is horizontal wrong? SKILL.md gave three reasons:

Tests written in bulk test imagined behavior, not actual behavior

You end up testing the shape of things (data structures, function signatures) rather than user-facing behavior

Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine

Human saying: Writing all the tests in one go is testing what's in your head, not the real code. When you write impl3, you realize that the design of test1 is wrong - but at this time, test2/test3/test4 are all coupled to the wrong design. Go back and make changes.

The correct approach is to test one implementation and then open the next pair after writing one pair. After each pair is completed and you have learned something from this implementation, the next pair of tests can be designed based on real experience - not imagination.

Skill full text structure

engineering/tdd/SKILL.md is one of the longest skills Matt has written because TDD itself has a lot of nuance. The core structure is as follows:

Philosophy

Core principle: Tests should verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't.

Good tests are integration-style: they exercise real code paths through public APIs. They describe what the system does, not how it does it. A good test reads like a specification.

Bad tests are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means (like querying a database directly). The warning sign: your test breaks when you refactor, but behavior hasn't changed.

Remember a diagnosis: If you rename an internal function, the test will kneel down - then this test is testing the implementation rather than the behavior, which is a bad test.

Workflow (with checklist)

1. Planning

Align with users before writing code:

[ ] Confirm with user what interface changes are needed
[ ] Confirm with user which behaviors to test (prioritize)
[ ] Identify opportunities for deep modules (small interface, deep impl)
[ ] Design interfaces for testability
[ ] List the behaviors to test (not implementation steps)
[ ] Get user approval on the plan

Key question: "What should the public interface look like? Which behaviors are most important to test?"

"You can't test everything. Confirm with the user exactly which behaviors matter most. Focus testing effort on critical paths and complex logic, not every possible edge case."

This one is very counter-intuitive. By default, AI will want to exhaust all edge cases, but Matt emphasizes priority - not all behaviors are worth measuring, and focus firepower on the core path.

2. Tracer Bullet

Write a test that verifies one thing:

RED:   Write test for first behavior → test fails
GREEN: Write minimal code to pass → test passes

This is a "tracer bullet" - shoot it first and check the sight. Matt emphasized that this work should be end-to-end - not to write the schema first, then write the API and then write the UI, but to cut the thinnest path that runs through the entire stack.

3. Incremental Loop

Repeat for each subsequent behavior RED→GREEN:

RED:   Write next test → fails
GREEN: Minimal code to pass → passes

Rules:

One test at a time
Write only enough code to pass the current test
Don’t predict future tests
Testing focuses on observable behavior

"Don't predict" is particularly important. The AI can't help but think, "This function needs to support X anyway, let's add it by the way" - and this starts horizontal slicing.

4. Refactor

After all tests have passed, look for refactoring opportunities:

[ ] Extract duplication
[ ] Deepen modules (move complexity behind simple interfaces)
[ ] Apply SOLID principles where natural
[ ] Consider what new code reveals about existing code
[ ] Run tests after each refactor step

Never refactor while RED. Get to GREEN first.

Refactoring in red = changing tests and code at the same time = you don’t know whether the test is wrong or the code is wrong. Green first, then refactor.

Per-Cycle Checklist

At the end of each red and green cycle, Matt asks the AI to self-check:

[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive internal refactor
[ ] Code is minimal for this test
[ ] No speculative features added

These five points are used to identify bad tests and over-implementation. AI self-checking can prevent most common mistakes.

Real use: from issue to PR

/tdd is the next step in Matt's workflow from /to-issues. Given a vertical slice issue, the process is:

你: 实现 issue #43
    ↓
/tdd
    ↓
Claude 读 issue acceptance criteria
    ↓
Claude 探索代码库 → 找到 CONTEXT.md → 用项目术语
    ↓
Planning 阶段:
    - 列出准备改的接口
    - 列出准备测的行为（按优先级排序）
    - 让你点头
    ↓
Tracer Bullet:
    - RED: 写第一个测试（基于 acceptance criteria 第 1 条）
    - 跑测试，确认 fail
    - GREEN: 写最小实现
    - 跑测试，确认 pass
    ↓
Incremental Loop:
    - 每个 acceptance criteria 一个 RED→GREEN
    ↓
Refactor:
    - 看 deep module 提取机会
    - 每次重构后跑全套测试
    ↓
PR

Every red and green cycle the AI will stop and give you a status - "test fails"/"test passes, here's the diff". These pauses are the antidote to outrun headlights - the AI has no chance of laying out a thousand lines in one go.

About Mock: Matt’s strong opinion

SKILL.md specifically mentions the dangers of mocks - he also provides a separate mocking.md. Core ideas:

"Bad tests... mock internal collaborators."

Mock internal collaborators = 1:1 coupling between testing and implementation = testing team kneels when refactoring. Matt's preference is integration-style testing - try to use a real database (in-memory or testcontainers), real HTTP (MSW), and a real file system (tmp dir). Only mock at really expensive or unstable boundaries (like calls to the OpenAI API).

This is contrary to the current situation of many teams - most code libraries are full of unit tests, and there are more mocks than real code. Matt made a judgment in his speech: Good code base = easy to test code base. If you have to mock a bunch of things to test, it means there is a problem with the code structure, and you should change the architecture first (go to /improve-codebase-architecture).

The new significance of TDD in the AI era

When Kent Beck wrote that book 23 years ago, the core benefit of TDD was "people not writing wrong code." In the AI era, TDD has an additional meaning:

It is the only "success criterion" that AI can understand.

Matt quoted Karpathy’s words of wisdom later in his speech:

LLMs are exceptionally good at looping until they meet specific goals. Don't tell it what to do, give it success criteria and watch it go.

Andrej KarpathySame thread on AI coding

Visit

The best form of "success criteria" is test - it is machine-verifiable, binary, and cannot be quibbled with. Giving AI a test suite + "let it pass" is ten times more reliable than giving AI a description of requirements + "please implement".

So /tdd is not just a quality assurance tool - it is the input interface to the agent loop. Each red and green cycle is a complete "input → action → feedback". The AI learns the true situation of this implementation in the cycle, and the next cycle will be more accurate.

How to install and use

npx skills@latest add mattpocock/skills

Check tdd + setup-matt-pocock-skills.

If you mainly use Codex now, take over the installation product to .agents/skills/, and write the project-level workflow, test commands, and issue tracker rules into AGENTS.md. The essence of Matt's /tdd is the red-green refactoring cycle, not tied to Claude Code.

Calling method:

Direct: /tdd - let it infer what to measure from the current conversation context
Fetch issue: /tdd implement #43 - it will fetch issue and then open it
Fix bug: /tdd reproduce this bug then fix it - It will first write a failed test that can reproduce the bug, and then fix it

Notes

Not suitable for all tasks. One-off scripts, playground exploration code, UI fine-tuning - don't use TDD, it will slow down the pace. Matt himself said that TDD is suitable for code that "has lasting value and needs to be maintained."

Prepare the test infrastructure first. If the project has not installed the test framework (Vitest / Jest / Playwright, etc.), install it first and then use /tdd, otherwise it will install it for you first, but there are many questions in that step.

Don't let it automatically add e2e tests. e2e is slow and crisp, and the TDD rhythm is minute-level. /tdd defaults to integration test rather than e2e, but you can explicitly tell it "unit + integration only, not e2e".

The reconstruction phase is the easiest to get out of control. After the AI gets the GREEN status, it will refactor a bunch of things excitedly - stare at it, and run tests after each refactoring. This part is a high-risk area for AI deviation.

Reference resources

tdd/SKILL.md (源码)

Full TDD philosophy, horizontal-slice anti-pattern, complete workflow checklists, and per-cycle self-check.

Matt PocockGitHub2026

Visit

Test-Driven Development: By Example

The original book that defined the red-green-refactor loop.

Kent BeckAmazon2003

Visit

TDD with AI Coding（中文）

Companion piece exploring how TDD shifts meaning when AI writes the implementation.

Yu本站2026

Visit

Next article: Improve Codebase Architecture: Reconstruct shallow into deep modules - Periodic maintenance allows AI to run in your code base in the long term.

TDD: Use red-green refactoring to force AI to take small steps

Failure mode: "AI does the right thing, but it can't run"

Classic Theory: Kent Beck’s Red-Green Reconstruction

Key anti-pattern: horizontally sliced red and green

Skill full text structure

Philosophy

Workflow (with checklist)

1. Planning

2. Tracer Bullet

3. Incremental Loop

4. Refactor

Per-Cycle Checklist

Real use: from issue to PR

About Mock: Matt’s strong opinion

The new significance of TDD in the AI era

How to install and use

Notes

Reference resources

tdd/SKILL.md (源码)

Test-Driven Development: By Example

TDD with AI Coding（中文）

Comments

Table of Contents