TDD in the AI Era

TDD in the AI era: Let the model hit the red light first

AI-assisted

Why does AI programming need TDD? The point is not the formality of "writing the test first", but creating a verifiable failure before the implementation takes it from red to green.

The conclusion first

AI can deliver code quickly, but TDD requires it to hit a verifiable red light first.

Now that AI writes code, TDD is not obsolete; its role has shifted.

In the past, TDD was mostly discussed as programmer self-discipline: write the test first, then the implementation, then refactor in small steps. In AI programming it works more like a braking system, because what the model does best is also what makes it most dangerous: it can quickly produce a large block of code that looks complete.

If you ask it to implement a function, it may give you:

an implementation file
a set of tests
an explanation
the word "done"

The catch is that "looks complete" is not the same as done in the engineering sense. To count as done in the engineering sense, the work must at least answer:

Was this behavior defined by an explicit failing test?
Did that failure turn green because of the implementation?
After it turned green, did we tidy the code without changing behavior?

This is why TDD is being talked about again in the AI era.

It's not about making the process look advanced, but about replacing "trust in the model" with "trust in feedback."

1. Common misunderstanding: TDD is not “write tests first”

The point of TDD is not that test files appear early, but that failure feedback appears early enough.

Many people hate TDD because they understand it as a ritual:

Write the tests first.
Then write the code.
Finally, run them.

That is certainly tedious, and it easily degenerates into formalism.

The truly useful TDD is not "test files appear early", but "failures appear early enough".

The key is not the test, but the red

The first step of TDD is called RED, not TEST.

RED means writing a test first so that the system visibly fails. Three things must be true of that failure:

It does fail.
It fails because the target behavior does not exist.
It fails in the way you expect.

If the red is not seen first, the green that follows is meaningless.

For example, you want to implement slugify("Hello World") -> "hello-world". A valuable RED is not "I wrote a test file", but:

Test: tests/test_slugify.py
Command: pytest tests/test_slugify.py -q
Failure: NameError: name 'slugify' is not defined
Reason: the target function does not exist yet, as expected

This is the point where a test becomes a specification. It tells you that the next step only has to make this one behavior true.

Going green first and back-filling the tests usually means making up a story

It is easy for AI to go the other way: write the implementation first and add tests afterwards.

This feels smooth. You see that the code runs and the tests are in place, and you think "close enough". But there is a fatal problem: the tests are likely to be nothing more than a retroactive endorsement of the current implementation.

Such a test does not ask "what should the requirement be?" but "what can I assert so that the current code passes easily?"

This is why AI-written tests often have these smells:

Assertions are tied too closely to the current implementation
There are too many mocks, so the real boundaries are never exercised
Only the happy path is tested
Assertions are made very loose so that the existing code passes
No test demonstrates that the old code was wrong in the first place

TDD requires the opposite: first let the requirements fail, and then let the code catch up with the requirements.

2. Why TDD is more needed in the AI era

The core contradiction of AI programming is not "code is written slowly", but "feedback comes late".

Without TDD, you usually work like this:

Describe the requirement -> AI writes a pile of code -> a human eyeballs the diff -> run it -> find a problem -> go back and fix it

Problems pile up until the end. By the time you discover something is wrong, three kinds of problems may already be tangled together:

Misunderstanding of requirements
The implementation path is wrong
Refactoring breaks old behavior

The role of TDD is to shorten this long chain.

It gives the model a decidable goal

"Write elegantly" is not the goal.

"Automatically jumping back to the login page after the user's login status expires" is not specific enough.

A better goal would be:

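When the access token has expired:
1. The request returns 401.
2. The client clears the local session.
3. The user is redirected to /login.
4. The original target URL is saved in the redirect parameter.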

Go one step further and turn one of them into a failing test:

given expired session
when user opens /settings
then app redirects to /login?redirect=/settings

At this point, the AI is no longer guessing how to "handle login-state expiry"; it is completing one clearly defined behavior.
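As a sketch, that given/when/then line could become the following pytest-style test. Everything named here (`Session`, `resolve_route`) is hypothetical and stands in for whatever routing layer the real app has. The tiny implementation is included only so the example runs; in a real RED step it would not exist yet.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass
class Session:
    """Hypothetical session object; only the expiry flag matters here."""
    expired: bool


def resolve_route(session: Session, path: str) -> str:
    """Hypothetical router hook: return the URL the app should navigate to."""
    if session.expired:
        # Keep the original target so the user can return after logging in.
        return f"/login?redirect={quote(path, safe='/')}"
    return path


def test_expired_session_redirects_to_login_with_original_target():
    # given an expired session, when the user opens /settings
    target = resolve_route(Session(expired=True), "/settings")
    # then the app redirects to /login and remembers the original target
    assert target == "/login?redirect=/settings"
```

The test name and the three comments mirror the given/when/then wording, so a failure points at exactly one behavior.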

It breaks large tasks into small closed loops

AI most easily loses control when it does everything in one go.

Ask it to implement login, permissions, refresh tokens, error messages, and route redirects all at once, and you end up with one big diff. It might work, but the review cost is high: you have to judge business logic, state, routing, edge cases, tests, and refactoring all at the same time.

The rhythm of TDD looks more like this:

One behavior -> one failing test -> minimal implementation -> green -> next behavior

Advance only a little at a time: small enough that you can understand it, small enough that the AI has little room to make up stories, and small enough that a failure can be located quickly.

It keeps the model from improvising along the way

A common problem with AI is overzealousness.

You ask it to fix an edge-case bug, and it extracts a helper; you ask it to add a test, and it changes the implementation; you ask it to refactor, and it changes behavior.

TDD uses phases to separate these actions:

Stages	What to do	What not to do
RED	Write a failing test	Write a production implementation
GREEN	Write the minimum implementation	Modify the test to get green
REFACTOR	Clean up structure	Introduce new behaviors

This table is more useful than "Please be cautious." It lets the model know which stage it is in, and makes it easier for humans to detect boundary violations.
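The table can also be turned into a rough mechanical check. The following is a minimal sketch under an assumed convention that test files live under tests/ and everything else is production code. `phase_violations` is a hypothetical helper, and it cannot detect new behavior sneaking in during REFACTOR, which still needs the test suite and a human reviewer.

```python
def is_test_file(path: str) -> bool:
    # Assumed convention: tests live under a top-level tests/ directory.
    return path.startswith("tests/")


def phase_violations(phase: str, changed_files: list[str]) -> list[str]:
    """Return the changed files that the given TDD phase should not touch."""
    if phase == "RED":
        # RED writes a failing test only, so production files are off limits.
        return [p for p in changed_files if not is_test_file(p)]
    if phase in ("GREEN", "REFACTOR"):
        # GREEN must not edit tests to get green; REFACTOR keeps the tests
        # as the fixed reference for unchanged behavior.
        return [p for p in changed_files if is_test_file(p)]
    raise ValueError(f"unknown phase: {phase}")
```

A check like this could run against the list of files an agent touched in each stage, with any non-empty result treated as a prompt for human review rather than an automatic rejection.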

3. Red-green-refactor: three gates, not three slogans

Red-green-refactor is more like three gates: first prove the gap, then make the current behavior pass, and finally tidy up only the structure.

"Red, green, refactor" can easily become a slogan. In practice it should work like three gates, and every time you pass through one you must leave evidence behind.

The first gate: RED, prove that the requirement is not yet met

The most important questions during the RED phase are:

If this test fails, does it prove that we still lack a target behavior?

A bad RED:

assert True

Also a weak RED:

assert "hello" in format_title("Hello World")

It is too loose: many faulty implementations would also pass.

Better RED:

def test_slugify_lowercases_and_uses_hyphen():
    assert slugify("Hello World") == "hello-world"

This test is small, but it's clear. It specifies inputs, outputs, and behavior.
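To see why a loose assertion makes a weak RED, it helps to run both styles against a deliberately wrong implementation. In this sketch `broken_slugify` is a hypothetical faulty version that forgets the hyphen, and the two check functions are illustrative names only.

```python
def broken_slugify(text: str) -> str:
    # Deliberately wrong: lowercases but never replaces spaces.
    return text.lower()


def loose_check(slugify) -> bool:
    # Substring assertion: passes for many wrong outputs.
    return "hello" in slugify("Hello World")


def strict_check(slugify) -> bool:
    # Exact assertion: pins down input, output, and behavior.
    return slugify("Hello World") == "hello-world"
```

The loose check accepts the broken version, while the strict check rejects it, which is the difference between a test that documents a requirement and one that merely exists.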

The second gate: GREEN, make only the current test pass

The GREEN phase is not about writing the final architecture.

It has only one mission: to pass the currently failing test with the least amount of code.

This sounds counter-intuitive. Many people worry that a "minimal implementation" will be ugly. Sometimes it is. Its value lies in keeping the design under pressure from the tests.

If the first test is:

def test_slugify_lowercases_and_uses_hyphen():
    assert slugify("Hello World") == "hello-world"

An acceptable GREEN might just be:

def slugify(text: str) -> str:
    return text.lower().replace(" ", "-")

You don't need to support Chinese, accents, consecutive punctuation, emoji, or SEO special cases right away. Those should be driven by later tests.
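For instance, a later round might add pressure for repeated spaces. The sketch below assumes that collapsing a run of whitespace into one hyphen is the desired behavior, which the article does not specify. The first implementation fails the new test, and a slightly more general one passes both; `slugify_v1` and `slugify_v2` are illustrative names for the two rounds.

```python
def slugify_v1(text: str) -> str:
    # GREEN for the first test only.
    return text.lower().replace(" ", "-")


def slugify_v2(text: str) -> str:
    # GREEN for both tests: split() collapses any run of whitespace.
    return "-".join(text.lower().split())


def test_slugify_collapses_repeated_spaces():
    assert slugify_v2("Hello   World") == "hello-world"
```

The more general implementation appears only after a test demanded it, which is the sense in which later tests drive the design.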

The third gate: REFACTOR, change only the structure, not the behavior

REFACTOR is the stage the AI most easily gets wrong.

It tends to interpret "tidy up the code" as "enhance it while you are at it". That is not acceptable. The definition of refactoring is narrow: external behavior stays the same while the internal structure improves.

Good refactoring looks like this:

Rename a variable to something more accurate
Extract repeated expressions
Flatten conditional branches that are nested too deeply
Move functions so that module responsibilities are clearer

Bad refactoring looks like this:

Support for a new input slipped in
An error message was changed along the way
A dependency was changed along the way
A test assertion was changed along the way

The criterion is simple:

If this commit is labeled only refactor:, the tests should be equally green before and after it, and user-visible behavior should be identical.
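That criterion can be checked mechanically. A minimal sketch, with `slugify_before` and `slugify_after` as hypothetical pre- and post-refactor versions and a hand-picked sample list standing in for the real test suite:

```python
SEPARATOR = "-"


def slugify_before(text: str) -> str:
    return text.lower().replace(" ", "-")


def slugify_after(text: str) -> str:
    # Same behavior; the separator is named and the steps are split up.
    lowered = text.lower()
    return lowered.replace(" ", SEPARATOR)


SAMPLES = ["Hello World", "Test Driven Development", "", "already-slugged"]


def behavior_unchanged(before, after, samples) -> bool:
    """True when both versions return the same output for every sample."""
    return all(before(s) == after(s) for s in samples)
```

Agreement on a sample list is weaker evidence than a full green test run, so this complements the existing tests rather than replacing them.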

4. A taste for good tests

TDD does not mean that more tests are better. AI is also very good at generating piles of tests that have little value.

What matters more is taste in testing.

Good tests are like specs

A good test should read like a business specification:

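When the user lacks permission, the save button is disabled.
When the title is empty, the form shows an error message.
When the same request is submitted repeatedly, only one record is created.

A good test is concerned with external behavior, not with what happens internally.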
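The duplicate-submission rule in the list above, for example, could be written as a test that looks only at observable state. `RecordStore` and its `submit` method are hypothetical names for illustration, and the in-memory implementation is there only so the sketch runs.

```python
class RecordStore:
    """Hypothetical in-memory store keyed by request id."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def submit(self, request_id: str, payload: dict) -> None:
        # A request id that has already been stored is ignored.
        self.records.setdefault(request_id, payload)


def test_duplicate_submission_creates_one_record():
    store = RecordStore()
    store.submit("req-1", {"title": "Hello"})
    store.submit("req-1", {"title": "Hello"})
    # Asserts on the outcome, not on which internal helpers were called.
    assert len(store.records) == 1
```

Because the assertion mentions only the number of records, the deduplication mechanism can be rewritten without touching the test.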

Bad tests look like implementation notes:

It should call validateInput three times.
It should read state.user.flags.
It should trigger the internal handleClick function.

Once tests are tied to implementation details, refactoring becomes painful. You change only the internal structure, yet tests fail on a large scale. Rather than protecting the code, such tests freeze it.

Good tests have boundaries

A test is best designed to answer only one question.

If a single test asserts all of the following:

The format is correct
The permissions are correct
The network request is correct
The toast copy is correct
The database state is correct

then when it fails, it is hard to tell what the problem is.

AI is especially prone to writing such "big and comprehensive" tests because it wants to prove many things at once. TDD does the opposite: one behavior, one failure, one implementation.

Good tests make it difficult for implementations to cheat

If the tests cover only one very specific input, the AI may write a fake implementation that happens to match.

For example:

def slugify(text: str) -> str:
    return "hello-world"

The first test will allow it to pass, but the second test will force out the real logic:

def test_slugify_handles_another_title():
    assert slugify("Test Driven Development") == "test-driven-development"

So TDD does not mean writing only one test ever; it means adding only one behavioral pressure per round. The pressure builds gradually, and the design grows out of it.
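Running both tests against the hard-coded version and against the real logic makes that pressure visible. The names `fake_slugify`, `real_slugify`, and `passing_inputs` are illustrative only.

```python
CASES = [
    ("Hello World", "hello-world"),
    ("Test Driven Development", "test-driven-development"),
]


def fake_slugify(text: str) -> str:
    # Hard-coded answer that satisfies only the first test.
    return "hello-world"


def real_slugify(text: str) -> str:
    return text.lower().replace(" ", "-")


def passing_inputs(slugify) -> list[str]:
    """Return the inputs for which the implementation gives the expected slug."""
    return [text for text, expected in CASES if slugify(text) == expected]
```

The fake version passes only the first case, while the real one passes both, so the second test is what removes the shortcut.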

5. How will AI bypass TDD?

When AI attempts to change tests, overwrite implementations, or use mocks to mask boundaries, the process needs to be pulled back to the evidence.

This part must be made clear because AI does not naturally respect testing.

Its optimization goal is simple: complete the task you just stated. If you say "make the test pass", it may take actions that a human would not want.

The first type: change the test to get green

The most typical moves:

Change the expected value in assert slugify("Hello World") == "hello-world" to whatever the code currently returns
Remove failing assertions
Add a skip to the test
Replace strict assertions with loose ones

This is not TDD; this is unplugging the red light.

The second type: write an overfitted implementation

For example, the test has only one input:

def test_slugify_lowercases_and_uses_hyphen():
    assert slugify("Hello World") == "hello-world"

The model might write:

def slugify(text: str) -> str:
    if text == "Hello World":
        return "hello-world"
    return text

There is no need to scold it at this point. Add the next behavior so that hard-coding stops being viable.

The third type: use mocks to cover up the real boundary

AI loves mocks. Mocks make tests easier to write and make many real problems disappear.

Mocking is not forbidden, but you have to ask:

Is the thing I am mocking a slow dependency, or is it the boundary I actually want to verify?

If you want to verify payment callback parsing but mock the parsing layer, the test will be meaningless.
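A small sketch shows how that happens. `parse_callback` and `handle_callback` are hypothetical functions for an imagined payment callback with an assumed JSON shape; only `json` and `unittest.mock.Mock` are real standard-library APIs.

```python
import json
from unittest.mock import Mock


def parse_callback(raw: str) -> dict:
    """Hypothetical parser for a payment provider's JSON callback."""
    data = json.loads(raw)
    return {"order_id": data["order_id"], "paid": data["status"] == "PAID"}


def handle_callback(raw: str, parser=parse_callback) -> str:
    return "fulfilled" if parser(raw)["paid"] else "ignored"


def test_with_parser_mocked_away():
    # Passes even for garbage input: the boundary under test never runs.
    fake_parser = Mock(return_value={"order_id": "A1", "paid": True})
    assert handle_callback("not json at all", parser=fake_parser) == "fulfilled"


def test_with_real_parser():
    # Exercises the actual parsing of a realistic payload.
    raw = json.dumps({"order_id": "A1", "status": "PAID"})
    assert handle_callback(raw) == "fulfilled"
```

Both tests are green, but only the second would turn red if the parsing logic broke, which is the question to ask of any mock.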

6. When not to use TDD

TDD is valuable, but not everything is worth the cost.

Poor fits

Purely visual fine-tuning
One-off scripts
Technology exploration demos
Prototypes whose requirements have not been thought through yet
Repositories where no test framework has been set up yet

In these scenarios, prioritize exploration speed and do not let process hold you back.

Good fits

Bug fixes
Permissions, billing, state machines
Data transformation and boundary handling
Core modules that will be maintained for a long time
Code paths that AI will modify repeatedly

The criterion is not "how big is this feature?", but:

If it goes wrong, is the cost obvious?

If the cost is obvious, it is worth writing the test first.

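7. A mindset you can act on

If I could give the AI only one sentence, I would not say:

Please implement this feature with high quality.

I would say:

Write a failing test first, run it, and confirm that it fails for the expected reason. Do not write any implementation until I say go.

This sentence works better because it does not ask the model to "behave well"; it asks the model to enter a process that can be checked.

A slightly fuller version:

Handle only one behavior per round.
RED: write a failing test and run it.
GREEN: write the minimal implementation without changing the test.
REFACTOR: tidy the structure only while the tests are green.
In every round, report the test file, the command, the failure reason, and the passing result.

This is the core of TDD in the AI era.

It is not blind faith in tests or in process; it is having evidence for every step.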
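One way to hold each round to that standard is to record the evidence in a small structure and refuse to call the round complete without it. This is a sketch only; `RoundReport` and its fields are hypothetical names that mirror the report items listed above.

```python
from dataclasses import dataclass


@dataclass
class RoundReport:
    """Evidence for one RED-GREEN round (field names are illustrative)."""
    test_file: str
    command: str
    red_failure_reason: str = ""
    green_passed: bool = False

    def is_complete(self) -> bool:
        # A round counts only if a failure was observed before it went green.
        return bool(self.red_failure_reason) and self.green_passed
```

A report that says the tests pass but has no recorded failure reason is exactly the "looks complete" case, so `is_complete()` returns False for it.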

Closing

What AI programming needs most is not more code, but shorter feedback loops.

Here's the value of TDD: it turns "I thought it should be right" into "Here was a failure, and then it turned green." The change is small, but real enough.

If you only remember one sentence, remember this:

Don’t let AI deliver code directly. First let it deliver a red light, and then let it turn the red light green.

The next Practical Guide will turn this rhythm into a workflow that can be directly copied.

Recommended resources

TDD, AI agents and coding with Kent Beck

Pragmatic Engineer interviews Kent Beck. The focus is on small-step feedback, test protection, and engineering discipline in AI programming.

YouTube

Augmented Coding: Beyond the Vibes

Kent Beck's first-person account of using AI agents while keeping engineering discipline.

Kent Beck, Tidy First

Red/green TDD

Why red/green TDD is a useful compact instruction for coding agents.

Simon Willison, simonwillison.net

Test-Driven Development: By Example

The original book that defined the red-green-refactor loop.

Kent Beck, Amazon

TDD in the AI era: Let the model hit the red light first | Yu's Cyber Desk