TDD in the AI Era: Codex Practical Manual

AI-assisted

Turn TDD into a reproducible workflow in Codex: use AGENTS.md to write down project discipline, skills to codify the red-green-refactor loop, subagents to isolate phases, and hooks to check test diffs.

Start with a map

The concept article makes this point: in AI programming, the value of TDD is not the ritual of "write the test first" but creating a verifiable red light first and then letting the implementation turn it green.

This article covers how to put that into practice in Codex.

Don't start by asking "which files do I need to configure?" A better question is:

How do I make Codex act on the same TDD workflow every time?

This workflow can be split into four levels:

Level	What it does	Where it lives	When to use it
L1	Write down project discipline	`AGENTS.md`	Every project should have it
L2	Codify the process	`.agents/skills/tdd-codex/SKILL.md`	You repeatedly implement requirements with TDD
L3	Isolate phases	`.codex/agents/*.toml`	Complex tasks where tests and implementation may contaminate each other
L4	Automatic reminders	`.codex/hooks.json`	Important repositories where the AI might quietly modify tests

The minimum viable setup is L1 + L2.

The complete line of defense is L1 + L2 + L3 + L4.

1. First define what "done" means

Without a definition of done, Codex can easily treat "code was written" as "done".

In a TDD scenario, completion criteria should be more specific.

Evidence must be delivered in every round

Have Codex report these six items every round:

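Behavior: which behavior this round implements
Test: the test file and test name
Command: which command was run
RED: whether the failure reason matches expectations
GREEN: what the passing result was
REFACTOR: whether refactoring happened, and why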

This is much more useful than a "done."

It lets you know that the model has really gone through the red and green cycles, instead of writing the implementation first and then adding a test that looks reasonable.
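The six-item report can also be treated as a small record that is checked for completeness before a round is accepted. The sketch below is only an illustration: `CycleReport` and `missing_fields` are hypothetical names, not part of Codex.

```python
from dataclasses import dataclass, fields


@dataclass
class CycleReport:
    """One red/green/refactor round, using the six items listed above."""
    behavior: str = ""
    test: str = ""
    command: str = ""
    red: str = ""
    green: str = ""
    refactor: str = ""


def missing_fields(report: CycleReport) -> list[str]:
    """Return the names of report items that were left empty."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


# A report that stops after RED: GREEN and REFACTOR are still missing.
report = CycleReport(
    behavior="basic title becomes lowercase hyphenated slug",
    test="tests/test_slugify.py",
    command="pytest tests/test_slugify.py -q",
    red="failed because slugify is not defined",
)
print(missing_fields(report))  # ['green', 'refactor']
```

A round is only complete once `missing_fields` returns an empty list.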

Only one behavior is processed in a round

This is critical.

Don't let Codex generate the entire test matrix at once. That turns into horizontal slicing:

RED: test1, test2, test3, test4, test5
GREEN: one big implementation written in one go

What you want is vertical slices:

RED test1 -> GREEN impl1 -> REFACTOR
RED test2 -> GREEN impl2 -> REFACTOR
RED test3 -> GREEN impl3 -> REFACTOR

The first round of implementation will change your understanding of the problem. Don't write all your tests in one go.

2. L1: Write discipline into AGENTS.md

AGENTS.md is the instruction file Codex reads when it enters a project.

According to OpenAI's official documentation, Codex reads the global instructions first, then walks from the project root down to the current directory. At each level it reads AGENTS.override.md if present and otherwise AGENTS.md. Instructions closer to the current directory appear later and therefore take precedence. The default merge limit is 32 KiB, so this is not the place for a long tutorial.
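To make the lookup order concrete, here is a minimal sketch of the project-level part of that walk. It is an illustration of the order described above, not Codex's actual implementation: the function names are hypothetical, the global file is left out, and simply truncating at the limit is an assumption.

```python
from pathlib import Path

MAX_BYTES = 32 * 1024  # the default merge limit mentioned above


def instruction_files(root: Path, cwd: Path) -> list[Path]:
    """Collect at most one instruction file per directory, from root down to cwd."""
    dirs = [root]
    for part in cwd.relative_to(root).parts:
        dirs.append(dirs[-1] / part)
    found = []
    for directory in dirs:
        for name in ("AGENTS.override.md", "AGENTS.md"):
            candidate = directory / name
            if candidate.is_file():
                found.append(candidate)
                break  # the override wins; AGENTS.md is only the fallback
    return found


def merged_instructions(root: Path, cwd: Path, limit: int = MAX_BYTES) -> str:
    """Concatenate the files in order, stopping once the byte limit is reached."""
    merged = b""
    for path in instruction_files(root, cwd):
        remaining = limit - len(merged)
        if remaining <= 0:
            break
        merged += path.read_bytes()[:remaining]  # truncation is an assumption
    return merged.decode("utf-8", errors="ignore")
```

Because deeper files are appended later, a rule in `pkg/AGENTS.md` ends up after the root rules in the merged text.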

Treat it like the project's traffic rules

AGENTS.md is not there to teach Codex what TDD is. Its only job is to state clearly which behaviors are not allowed in this project.

You can paste this block in directly:

# TDD Rules

- For new behavior and bug fixes, use red/green TDD.
- RED: write exactly one failing behavior test first.
- Run the smallest relevant test command and confirm the failure is expected.
- Do not edit production implementation during RED.
- GREEN: write the minimum production code required to pass the current failing test.
- Never modify, delete, skip, or weaken tests to make implementation pass.
- REFACTOR only after tests are green.
- Keep structural changes and behavior changes separate.
- Report Behavior, Test, Command, RED, GREEN, and REFACTOR for each cycle.

Then add the project's own commands:

# Verification

- Use `pytest` or the smallest relevant pytest command for Python behavior tests.
- Use `npm run types:check` only when this blog site's MDX or TypeScript changes.
- Use the smallest targeted command during RED/GREEN loops.
- If a command is slow, explain what targeted command was used first and what full command remains.

Don't write it as an encyclopedia

A bad AGENTS.md would be filled with:

History of TDD
All testing philosophies
A bunch of framework tutorials
Complex prompt templates
Complete specifications in different languages

These things dilute the rules that really matter.

My suggestion: keep only standing rules in AGENTS.md, and put long processes in skills.

3. L2: Turn the process into a Codex skill

AGENTS.md handles the default discipline; a skill handles the complete process.

If you find yourself repeatedly telling Codex "do it with TDD", promote that sentence to a project-level skill.

Directory structure

Put it here:

.agents/
  skills/
    tdd-codex/
      SKILL.md

Codex scans upward from the current directory, looking for .agents/skills. Skills at the repository root suit workflows shared by the whole team.
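The upward scan can be pictured with a short sketch. This is an assumption-laden illustration rather than Codex's own logic: `skill_dirs` is a hypothetical helper, and stopping at the repository root is assumed.

```python
from pathlib import Path


def skill_dirs(cwd: Path, repo_root: Path) -> list[Path]:
    """Return every .agents/skills directory from cwd up to repo_root, nearest first."""
    found = []
    current = cwd
    while True:
        candidate = current / ".agents" / "skills"
        if candidate.is_dir():
            found.append(candidate)
        # Stop at the repository root (or at the filesystem root as a safety net).
        if current == repo_root or current == current.parent:
            break
        current = current.parent
    return found
```

With a skill folder only at the repository root, the same list comes back no matter which subdirectory the session starts in, which is why team-wide workflows belong there.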

Minimum viable SKILL.md

---
name: tdd-codex
description: Implementing or fixing maintainable code with Codex using strict red-green-refactor TDD. Use for new behavior, bug reproduction, behavior tests, or safe AI coding.
---

# TDD Codex Workflow

Use one behavior slice per cycle.

## Phase 0: Scope

Identify one observable behavior.
Name the public API, user flow, or integration boundary under test.
Do not edit production code.

## Phase 1: RED

Write exactly one failing behavior test.
Prefer public behavior over implementation details.
Run the smallest relevant test command.
Confirm the failure is expected.
Stop and report:
- Behavior
- Test file
- Command
- Failure reason

## Phase 2: GREEN

Write the minimum production code to pass the current failing test.
Never modify, delete, skip, or weaken tests to pass.
Do not add speculative features or abstractions.
Run the same test command.
Report the passing result.

## Phase 3: REFACTOR

Only refactor after tests are green.
If the code is already simple, skip.
If refactoring, make one structural change at a time.
Run tests after each refactor.
Do not change behavior.

## Cycle Report

Return:
- Behavior:
- Test:
- Command:
- RED:
- GREEN:
- REFACTOR:
- Next slice:
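Before committing a skill file, it can help to check that its front matter is in place. The sketch below assumes that `name` and `description` are the two fields that matter, as in the example above. It reads only simple `key: value` lines and is not a full YAML parser; `parse_front_matter` and `skill_problems` are hypothetical helpers.

```python
def parse_front_matter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs between the leading pair of `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing --- was found


def skill_problems(text: str) -> list[str]:
    """List the front matter fields that are missing or empty."""
    meta = parse_front_matter(text)
    return [f"missing {field}" for field in ("name", "description") if not meta.get(field)]


skill_md = """---
name: tdd-codex
description: Strict red-green-refactor TDD workflow.
---

# TDD Codex Workflow
"""
print(skill_problems(skill_md))  # []
```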

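How to invoke it

From now on you can say:

Use the tdd-codex skill for this requirement.
Handle only one behavior per round.
RED first: stop once the failure is confirmed, and do not write the implementation yet.

Or shorter:

Use tdd-codex. Write the red light first and wait for me to say go.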

The point is not how beautiful the prompt is, but that it brings Codex back on the same track every time.

4. L3: Use subagents to isolate red, green, and refactor

Not all tasks require subagents.

However, when tasks are complex, tests are easily contaminated by the implementation, or refactoring tends to get out of control, splitting RED, GREEN, and REFACTOR across different agents is more stable.

When is splitting worth it?

Worth splitting:

Permissions, billing, state machines
Multi-module features
Deeply hidden bugs that need a reproduction test first
A model that keeps changing tests to get green
Cases where you want a role dedicated to reviewing test quality

Not worth splitting:

Small utility functions
Copy changes
Purely visual tweaks
One-off scripts

Splitting into agents has a real cost. Do it only when the benefit of isolation outweighs the cost of coordination.

RED agent

# .codex/agents/tdd-test-writer.toml
name = "tdd_test_writer"
description = "RED phase agent. Writes one failing behavior test and stops before implementation."
sandbox_mode = "workspace-write"

developer_instructions = """
You own only the RED phase.
Write exactly one behavior-focused test for the requested slice.
Prefer public APIs and user-visible behavior over implementation details.
Run the smallest relevant test command.
Confirm the test fails for the expected reason.
Do not edit production implementation.
Do not add multiple tests at once.
Return Behavior, Test, Command, and RED failure reason.
"""

GREEN agent

# .codex/agents/tdd-implementer.toml
name = "tdd_implementer"
description = "GREEN phase agent. Implements the minimum production code to pass the current failing test."
sandbox_mode = "workspace-write"

developer_instructions = """
You own only the GREEN phase.
Read the failing test and relevant production code.
Write the minimum implementation required to pass the current test.
Never modify, delete, skip, or weaken tests to make them pass.
Do not add speculative features, helpers, configuration, or abstractions.
Run the relevant tests and return the command plus passing output.
"""

REFACTOR agent

# .codex/agents/tdd-refactorer.toml
name = "tdd_refactorer"
description = "REFACTOR phase agent. Improves structure only after tests are green."
sandbox_mode = "workspace-write"

developer_instructions = """
You own only the REFACTOR phase.
Start by running the relevant tests to confirm the code is green.
Look for duplication, unclear names, needless branching, or misplaced responsibility.
Skip refactoring when the code is already simple.
If you refactor, make one structural change at a time.
Run tests after each refactor.
Never change behavior in this phase.
"""

How to direct the main session

Do this slice as three-phase TDD:
1. tdd_test_writer writes only one failing test and confirms RED.
2. After I confirm, tdd_implementer writes the minimal implementation and confirms GREEN.
3. tdd_refactorer decides whether structural refactoring is needed.
Do not write tests in bulk.
Do not modify tests during the GREEN phase.

The point here is to isolate context. Whoever writes the tests should stay as unaffected by implementation details as possible; whoever writes the implementation must not touch the tests; whoever refactors must not introduce new behavior.
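The ownership rules for the first two phases can be expressed as a check on which files a phase touched. This is a hypothetical sketch, not a Codex feature: `phase_violations` is an invented name, and the test-file pattern is deliberately simplified for a Python project.

```python
import re

# Simplified test-file pattern for this sketch; adjust it to the project layout.
TEST_PATH = re.compile(r"(^|/)tests?/|(^|/)test_[^/]*\.py$")


def is_test_file(path: str) -> bool:
    return bool(TEST_PATH.search(path))


def phase_violations(phase: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that the given phase is not supposed to touch."""
    if phase == "RED":
        # RED owns tests only, so any production file is out of bounds.
        return [p for p in changed_paths if not is_test_file(p)]
    if phase == "GREEN":
        # GREEN must leave the tests alone.
        return [p for p in changed_paths if is_test_file(p)]
    if phase == "REFACTOR":
        # Behavior changes cannot be detected from paths; the test run has to catch them.
        return []
    raise ValueError(f"unknown phase: {phase}")


changed = ["tests/test_slugify.py", "src/slugify.py"]
print(phase_violations("RED", changed))    # ['src/slugify.py']
print(phase_violations("GREEN", changed))  # ['tests/test_slugify.py']
```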

5. L4: Use hooks to watch test diffs

Rules alone are not enough; the model may still step out of bounds.

The most common violation: the test is red, and the model changes the test to turn it green.

The value of hooks is not "absolute safety" but exposing this action immediately.

Enable Codex hooks

First turn on the feature flag in the configuration:

# ~/.codex/config.toml or <repo>/.codex/config.toml
[features]
codex_hooks = true

Codex looks for hooks next to the active configuration layer. Common locations:

~/.codex/hooks.json
~/.codex/config.toml
<repo>/.codex/hooks.json
<repo>/.codex/config.toml

At the project level, prefer <repo>/.codex/hooks.json, because it travels with the repository.

hooks.json

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "apply_patch|Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bash \"$(git rev-parse --show-toplevel)/.codex/hooks/watch-test-edits.sh\"",
            "timeout": 10,
            "statusMessage": "Checking test file edits"
          }
        ]
      },
      {
        "matcher": "Bash|apply_patch|Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bash \"$(git rev-parse --show-toplevel)/.codex/hooks/run-fast-check.sh\"",
            "timeout": 120,
            "statusMessage": "Running fast checks"
          }
        ]
      }
    ]
  }
}

Check whether the test file has been modified

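#!/usr/bin/env bash
# .codex/hooks/watch-test-edits.sh
set -euo pipefail

# Tracked files with unstaged changes whose paths look like test files.
changed_tests="$(
  git diff --name-only |
    grep -E '(^|/)(__tests__|tests?)/|\.(test|spec)\.[cm]?[jt]sx?$|_test\.go$|test_.*\.py$' || true
)"

if [ -n "$changed_tests" ]; then
  # Join the names onto one line so the JSON string contains no raw newlines.
  changed_list="$(printf '%s\n' "$changed_tests" | paste -sd ' ' -)"
  cat <<EOF
{
  "continue": false,
  "stopReason": "Test files changed. Review before continuing.",
  "systemMessage": "Test files were modified: $changed_list. Tests may be changed, but the reason must be stated. Stop and confirm first: is this a spec addition, or a test changed just to get green?"
}
EOF
fi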
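The path pattern in this script decides what counts as a test file, so it is worth trying it on sample paths. The sketch below ports the same alternation to Python's `re` module; the sample paths and the `changed_tests` helper are made up for illustration.

```python
import re

# Same alternation as the grep -E pattern in watch-test-edits.sh.
TEST_FILE = re.compile(
    r"(^|/)(__tests__|tests?)/"       # tests/, test/, or __tests__/ directories
    r"|\.(test|spec)\.[cm]?[jt]sx?$"  # foo.test.ts, foo.spec.jsx, foo.test.mjs, ...
    r"|_test\.go$"                    # Go test files
    r"|test_.*\.py$"                  # pytest-style file names
)


def changed_tests(paths: list[str]) -> list[str]:
    """Keep only the paths the hook would report as test files."""
    return [p for p in paths if TEST_FILE.search(p)]


paths = [
    "src/slugify.py",
    "tests/test_slugify.py",
    "web/__tests__/slug.test.tsx",
    "pkg/slug_test.go",
    "scripts/latest_news.py",
]
print(changed_tests(paths))
```

Everything except `src/slugify.py` is reported. `scripts/latest_news.py` is a false positive: the last alternative is not anchored to the start of the file name, so the `test_news.py` inside `latest_news.py` matches. For a hook that only asks for a review this is tolerable, but the pattern can be tightened if it gets noisy.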

Run a quick check

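#!/usr/bin/env bash
# .codex/hooks/run-fast-check.sh
set -euo pipefail

# Always run from the repository root.
root="$(git rev-parse --show-toplevel)"
cd "$root"

# Pick the first check that matches the project type.
if [ -f pyproject.toml ] || [ -f pytest.ini ] || [ -d tests ]; then
  pytest
elif [ -f package.json ]; then
  # `npm run` without arguments lists the available scripts. Capture the list once
  # so that grep -q closing the pipe early cannot trip pipefail.
  scripts="$(npm run)"
  if grep -q "types:check" <<<"$scripts"; then
    npm run types:check
  elif grep -q "check" <<<"$scripts"; then
    npm run check
  fi
elif [ -f go.mod ]; then
  go test ./...
fi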
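The selection order in this script is easy to get wrong when a repository mixes languages, so here is the same decision as a small testable function. It is a sketch under one stated difference: instead of grepping the output of `npm run`, it reads the script names from package.json and matches them exactly. `fast_check_command` is a hypothetical helper.

```python
import json
from pathlib import Path


def fast_check_command(root: Path) -> list[str]:
    """Pick a check command using the same order as run-fast-check.sh."""
    if (root / "pyproject.toml").is_file() or (root / "pytest.ini").is_file() or (root / "tests").is_dir():
        return ["pytest"]
    package_json = root / "package.json"
    if package_json.is_file():
        scripts = json.loads(package_json.read_text(encoding="utf-8")).get("scripts", {})
        for name in ("types:check", "check"):
            if name in scripts:
                return ["npm", "run", name]
        return []  # a Node project without a matching script runs nothing
    if (root / "go.mod").is_file():
        return ["go", "test", "./..."]
    return []
```

Note that a `tests` directory alone is enough to select pytest, even in a Node or Go repository, which mirrors the shell script.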

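Don't treat hooks as infallible

Hooks are a guardrail, not a full enforcement boundary.

They can warn about and block the common paths, but they cannot replace review, CI, and human judgment. In particular, when a test genuinely needs to change, the right approach is not to ban the change forever but to require the model to explain:

Why does the test need to change?
Is this a new requirement, a new edge case, or was the old test wrong?
Has the production implementation been verified against the change?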

6. What about TDD-Guard?

TDD-Guard is worth a look, but don't treat it as an installation recommendation for Codex.

It follows the Claude Code plugin route. Its core idea is to stop the AI from violating TDD, especially from changing tests to get green.

When migrating to Codex, borrow the ideas rather than copying the paths.

Claude Code Route	Codex Route
`CLAUDE.md`	`AGENTS.md`
`.claude/agents/*.md`	`.codex/agents/*.toml`
Claude plugin/TDD-Guard	hooks + git diff check + CI
slash command	skill or prompt

A more realistic combination on the Codex side is:

AGENTS.md states the discipline
skill codifies the process
subagents isolate roles
hooks check test diffs
CI is the final backstop

This is not strictly equivalent to a hard enforcement boundary, but it is sufficient for personal projects and most team projects.

7. Complete walkthrough: slugify

Now let's walk through it with a small function.

Requirements:

Implement slugify(text: string): string.
Convert an English title into a URL slug.

Don't dismiss this example for being small. The rhythm of TDD is best learned from small examples.

Step 0: Ask about the boundaries first, without writing code

Tell Codex not to rush into writing code:

I want to implement slugify(text: string): string.
Don't write code yet.
First ask me 5 edge-case questions covering case, spaces, punctuation, unicode, and the empty string.
Then write the confirmed spec to SPEC.md.

The spec might look like:

# slugify SPEC

- "Hello World" -> "hello-world"
- trim leading/trailing spaces
- collapse repeated spaces into one hyphen
- remove punctuation
- normalize "Café" -> "cafe"
- empty input returns empty string

Step 1: First red light

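Read SPEC.md.
Implement only the first behavior: "Hello World" -> "hello-world".
RED first: write only one failing test, run it, and confirm that it fails.
Do not write the production implementation.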

Ideal output:

Behavior: basic title becomes lowercase hyphenated slug
Test: tests/test_slugify.py
Command: pytest tests/test_slugify.py -q
RED: failed because slugify is not defined

Only then can you continue.

Step 2: Minimal green

go

Codex writes the minimal implementation:

def slugify(text: str) -> str:
    return text.lower().replace(" ", "-")

Then report:

GREEN: pytest tests/test_slugify.py -q passed
REFACTOR: skipped, implementation is still simple
Next slice: trim leading/trailing spaces

Step 3: Second red light

Continue with the next item: trim leading and trailing spaces.
RED first, and write only one test.

Test:

def test_slugify_trims_spaces():
    assert slugify("  Hello World  ") == "hello-world"

If the current implementation returns --hello-world-- (each of the two padding spaces on either side becomes a hyphen), the red light is genuine.

Then GREEN:

def slugify(text: str) -> str:
    return text.strip().lower().replace(" ", "-")
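Running both versions side by side shows why the second test is red before the change and green after it. The names `slugify_v1` and `slugify_v2` are used here only to keep the two rounds apart.

```python
def slugify_v1(text: str) -> str:
    """Implementation after the first GREEN."""
    return text.lower().replace(" ", "-")


def slugify_v2(text: str) -> str:
    """Implementation after the second GREEN."""
    return text.strip().lower().replace(" ", "-")


padded = "  Hello World  "  # two spaces of padding on each side
print(repr(slugify_v1(padded)))  # '--hello-world--'
print(repr(slugify_v2(padded)))  # 'hello-world'
```

The first behavior still holds for both versions, so the earlier test stays green while the new one turns green.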

Step 4: Don’t rush to abstraction

At this point many AIs will want to extract helpers such as normalizeInput, removePunctuation, and toAscii.

Don't rush yet.

In TDD, the design should be pushed out by test pressure, not by imagination. Wait until unicode, punctuation, and the empty string have been added and real structural pressure appears, then refactor.
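For orientation, here is one possible end state once every line of the spec has gone through its own red/green round. It is a sketch, not the only valid result: it assumes ASCII-only output, and it treats hyphens already present in the input as punctuation, which the spec does not settle.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    # Decompose accented characters, then drop everything that is not ASCII ("Café" -> "Cafe").
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation: keep only letters, digits, and whitespace.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", ascii_text)
    # split() with no arguments trims both ends and collapses repeated whitespace.
    return "-".join(cleaned.lower().split())


print(slugify("  Hello,   World!  "))  # hello-world
print(slugify("Café"))                 # cafe
```

Even here the function is still three lines long, so no helper has earned its place yet.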

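8. Quick reference for daily use

New features

Implement this requirement with TDD.
Handle only one behavior per round.
First write one failing test and run it to confirm RED.
Do not write the production implementation until I say go.

Bug fixes

First write a failing test that reproduces this bug.
Once you confirm it fails because of this bug, write the minimal fix.
Do not change the test to fit the current implementation.

Complex features

Don't write code yet.
Give me a TDD breakdown plan:
- What the outer-loop integration test is
- What each inner-loop behavior slice is
- Which command verifies each round
- Which parts must not be mocked
Start RED only after I confirm.

Review

Review this change, focusing on:
- Whether a failing test came first
- Whether the tests check behavior rather than implementation
- Whether any test was weakened to make things pass
- Whether structural changes and behavior changes are mixed together
- Whether an outer-loop integration test is missing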

9. Final checklist

Every time you ask Codex to do TDD, run through this table at the end.

Question	Passing criterion
Did it really go red first?	There is a failing command and a failure reason
Was it red for the right reason?	The failure reason corresponds to the missing target behavior
Was there only one behavior per round?	No tests were written in bulk
Were tests changed during GREEN?	No test was changed to get green
Do the tests test behavior?	They do not rely on internal implementation details
Did refactoring mix in behavior changes?	Structural changes and behavior changes are kept separate
Was verification complete?	Targeted tests and the necessary full checks were run

If any row fails, don't rush to merge.

Recommended resources

AGENTS.md

Official guide to global, project, and nested instruction files.

OpenAI, OpenAI Developers

Agent Skills

Official guide to packaging reusable workflows as skills.

OpenAI, OpenAI Developers

Subagents

Official guide to custom agents and subagent workflows.

OpenAI, OpenAI Developers

Codex Hooks

Official guide to deterministic scripts during the Codex lifecycle.

OpenAI, OpenAI Developers

Red Green Refactor is OP With Claude Code

Short video by Matt Pocock, listed under its original title. It focuses on the rhythm of red/green/refactor.

YouTube

TDD-Guard: Automated TDD enforcement for Claude Code

Claude Code plugin route for TDD enforcement. Codex users should borrow the guardrail idea, not the install path.

nizos, GitHub


TDD in the AI Era: Codex Practical Manual | Yu's Cyber Desk