
TDD in the AI Era: Codex Practical Manual
Turn TDD into a reproducible workflow in Codex: use AGENTS.md to record project discipline, skills to codify the red-green-refactor loop, subagents to isolate the phases, and hooks to check test diffs.
Give me a map first
The concept article covers the why: in AI programming, the value of TDD is not the ritual of "write the test first" but creating a verifiable red light first and then letting the implementation turn it green.
This article covers the how: putting that into practice in Codex.
Don't start by asking "which files do I need to configure". A better question is:
How do I make Codex follow the same TDD workflow every time?
This workflow can be split into four levels:
| Level | What you do | Where | When it fits |
|---|---|---|---|
| L1 | Write down project discipline | AGENTS.md | Every project should have it |
| L2 | Codify the process | .agents/skills/tdd-codex/SKILL.md | You repeatedly implement requirements with TDD |
| L3 | Isolate the phases | .codex/agents/*.toml | Complex tasks where tests and implementation might contaminate each other |
| L4 | Automatic reminders | .codex/hooks.json | Important repositories where the AI might quietly modify tests |
The minimum viable setup is L1 + L2.
The full line of defense is L1 + L2 + L3 + L4.
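As a quick self-check, the four levels can be detected from the file locations listed in the table. The following is a minimal sketch, not part of Codex itself; the function names `configured_levels` and `has_minimum_setup` are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def configured_levels(repo: Path) -> list[str]:
    """Return which of the four levels exist in repo, based on the table's paths."""
    agents_dir = repo / ".codex" / "agents"
    checks = {
        "L1": (repo / "AGENTS.md").is_file(),
        "L2": (repo / ".agents" / "skills" / "tdd-codex" / "SKILL.md").is_file(),
        "L3": agents_dir.is_dir() and any(agents_dir.glob("*.toml")),
        "L4": (repo / ".codex" / "hooks.json").is_file(),
    }
    return [level for level, present in checks.items() if present]


def has_minimum_setup(repo: Path) -> bool:
    """The minimum setup named above is L1 + L2."""
    return {"L1", "L2"} <= set(configured_levels(repo))
```

Calling `configured_levels(Path.cwd())` from the repository root would list the levels already in place.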
1. First define what "done" means
Without completion criteria, Codex can easily treat "code written" as "done".
In a TDD scenario, the completion criteria should be more specific.
Evidence must be delivered in every round
Have Codex report these six items every round:
Behavior: which behavior this round implements
Test: the test file and test name
Command: which command was run
RED: whether the failure reason matches expectations
GREEN: what the passing result was
REFACTOR: whether a refactor happened, and why

This is much more useful than a bare "done."
It lets you know that the model has really gone through the red and green cycles, instead of writing the implementation first and then adding a test that looks reasonable.
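The six-item report can also be treated as a record that is checked for gaps before a round is accepted. This is a minimal sketch under that assumption; the `CycleReport` class and `missing_evidence` function are hypothetical names, not something Codex provides.

```python
from dataclasses import dataclass, fields


@dataclass
class CycleReport:
    behavior: str
    test: str
    command: str
    red: str
    green: str
    refactor: str


def missing_evidence(report: CycleReport) -> list[str]:
    """Return the names of report fields that are empty or whitespace-only."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


# A report that stopped after RED: GREEN and REFACTOR are still missing.
report = CycleReport(
    behavior="basic title becomes lowercase hyphenated slug",
    test="tests/test_slugify.py",
    command="pytest tests/test_slugify.py -q",
    red="failed because slugify is not defined",
    green="",
    refactor="",
)
print(missing_evidence(report))  # ['green', 'refactor']
```

A round counts as complete only when the returned list is empty.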
Handle only one behavior per round
This is critical.
Don't let Codex generate the entire test matrix at once. That turns into horizontal slicing:
RED: test1, test2, test3, test4, test5
GREEN: one big implementation written in one go

What you want is vertical slicing:
RED test1 -> GREEN impl1 -> REFACTOR
RED test2 -> GREEN impl2 -> REFACTOR
RED test3 -> GREEN impl3 -> REFACTOR

The first round of implementation will change your understanding of the problem, so don't write all your tests up front.
2. L1: Write discipline into AGENTS.md
AGENTS.md is the instruction file that Codex reads when it enters a project.
The official OpenAI documentation states that Codex first reads the global instructions, then walks from the project root directory down to the current directory. In each directory it looks for AGENTS.override.md first and falls back to AGENTS.md. Instructions closer to the current directory appear later in the merged result and therefore take priority. The default merge limit is 32 KiB, so this is no place for a long tutorial.
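The lookup order described above can be modeled in a few lines. This is a rough sketch of the described behavior, not Codex's actual implementation: it ignores the global layer, and the names `instruction_files` and `within_budget` are hypothetical.

```python
from pathlib import Path

MAX_BYTES = 32 * 1024  # default merge limit quoted above


def instruction_files(root: Path, cwd: Path) -> list[Path]:
    """Pick one instruction file per directory, walking from root down to cwd."""
    directories = [root]
    for part in cwd.relative_to(root).parts:
        directories.append(directories[-1] / part)

    chosen = []
    for directory in directories:
        # AGENTS.override.md wins over AGENTS.md within the same directory.
        for name in ("AGENTS.override.md", "AGENTS.md"):
            candidate = directory / name
            if candidate.is_file():
                chosen.append(candidate)
                break
    return chosen


def within_budget(files: list[Path]) -> bool:
    """Check whether the chosen files together fit under the merge limit."""
    return sum(path.stat().st_size for path in files) <= MAX_BYTES
```

Files later in the returned list sit closer to the current directory, which matches the priority rule in the paragraph above.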
It should read like the project's traffic rules
AGENTS.md is not responsible for teaching Codex what TDD is. It only needs to state clearly which behaviors are not allowed in this project.
You can paste this section in directly:
# TDD Rules
- For new behavior and bug fixes, use red/green TDD.
- RED: write exactly one failing behavior test first.
- Run the smallest relevant test command and confirm the failure is expected.
- Do not edit production implementation during RED.
- GREEN: write the minimum production code required to pass the current failing test.
- Never modify, delete, skip, or weaken tests to make implementation pass.
- REFACTOR only after tests are green.
- Keep structural changes and behavior changes separate.
- Report Behavior, Test, Command, RED, GREEN, and REFACTOR for each cycle.

Additional project commands:
# Verification
- Use `pytest` or the smallest relevant pytest command for Python behavior tests.
- Use `npm run types:check` only when this blog site's MDX or TypeScript changes.
- Use the smallest targeted command during RED/GREEN loops.
- If a command is slow, explain what targeted command was used first and what full command remains.

It should not read like an encyclopedia
A bad AGENTS.md is stuffed with:
- History of TDD
- All testing philosophies
- A bunch of framework tutorials
- Complex prompt templates
- Complete specifications in different languages
These things dilute the rules that really matter.
My suggestion: put only the always-on discipline in AGENTS.md, and use skills for long processes.
3. L2: Turn the process into a Codex skill
AGENTS.md covers the default discipline; a skill covers the complete process.
When you find yourself repeatedly telling Codex "do it with TDD", promote that sentence to a project-level skill.
Directory structure
Put it here:
.agents/
  skills/
    tdd-codex/
      SKILL.md

Codex scans for .agents/skills in each directory from the current directory up to the repository root. Skills placed at the repository root suit workflows shared by the whole team.
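The upward scan can be pictured as a simple directory walk. The sketch below only illustrates the search order described above and is not Codex's own discovery code; `find_skill_files` is a hypothetical name, and reading the skill's front matter is left out.

```python
from pathlib import Path


def find_skill_files(cwd: Path, repo_root: Path) -> list[Path]:
    """Collect SKILL.md files from cwd up to repo_root, nearest directory first."""
    found = []
    directory = cwd
    while True:
        skills_dir = directory / ".agents" / "skills"
        if skills_dir.is_dir():
            found.extend(sorted(skills_dir.glob("*/SKILL.md")))
        if directory == repo_root or directory == directory.parent:
            break  # reached the repository root (or the filesystem root)
        directory = directory.parent
    return found
```

Running it from a nested package directory shows which skills that directory would see, including the team-wide ones at the repository root.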
Minimal SKILL.md
---
name: tdd-codex
description: Implementing or fixing maintainable code with Codex using strict red-green-refactor TDD. Use for new behavior, bug reproduction, behavior tests, or safe AI coding.
---
# TDD Codex Workflow
Use one behavior slice per cycle.
## Phase 0: Scope
Identify one observable behavior.
Name the public API, user flow, or integration boundary under test.
Do not edit production code.
## Phase 1: RED
Write exactly one failing behavior test.
Prefer public behavior over implementation details.
Run the smallest relevant test command.
Confirm the failure is expected.
Stop and report:
- Behavior
- Test file
- Command
- Failure reason
## Phase 2: GREEN
Write the minimum production code to pass the current failing test.
Never modify, delete, skip, or weaken tests to pass.
Do not add speculative features or abstractions.
Run the same test command.
Report the passing result.
## Phase 3: REFACTOR
Only refactor after tests are green.
If the code is already simple, skip.
If refactoring, make one structural change at a time.
Run tests after each refactor.
Do not change behavior.
## Cycle Report
Return:
- Behavior:
- Test:
- Command:
- RED:
- GREEN:
- REFACTOR:
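- Next slice:

How to invoke it
From now on you can say:
Use the tdd-codex skill for this requirement.
Handle only one behavior per round.
RED first. Stop once the failure is confirmed; do not go straight to the implementation.

Or shorter:
Use tdd-codex. Write the red light first and wait for me to say go.

The point is not how polished the prompt is, but that it puts Codex back on the same track every time.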
4. L3: Use subagents to isolate red, green, and refactor
Not every task needs subagents.
But when the task is complex, tests are easily contaminated by the implementation, or refactoring tends to get out of hand, splitting RED, GREEN, and REFACTOR across different agents is more stable.
When is splitting worth it?
Worth splitting:
- Permissions, billing, state machines
- Multi-module features
- A well-hidden bug that needs a reproduction test first
- The model keeps editing tests to get to green
- You want a role responsible only for reviewing test quality
Not worth splitting:
- Small utility functions
- Copy changes
- Purely visual tweaks
- One-off scripts
Splitting into agents has a real cost. Do it only when the benefit of isolation outweighs the cost of coordination.
RED agent
# .codex/agents/tdd-test-writer.toml
name = "tdd_test_writer"
description = "RED phase agent. Writes one failing behavior test and stops before implementation."
sandbox_mode = "workspace-write"
developer_instructions = """
You own only the RED phase.
Write exactly one behavior-focused test for the requested slice.
Prefer public APIs and user-visible behavior over implementation details.
Run the smallest relevant test command.
Confirm the test fails for the expected reason.
Do not edit production implementation.
Do not add multiple tests at once.
Return Behavior, Test, Command, and RED failure reason.
"""

GREEN agent
# .codex/agents/tdd-implementer.toml
name = "tdd_implementer"
description = "GREEN phase agent. Implements the minimum production code to pass the current failing test."
sandbox_mode = "workspace-write"
developer_instructions = """
You own only the GREEN phase.
Read the failing test and relevant production code.
Write the minimum implementation required to pass the current test.
Never modify, delete, skip, or weaken tests to make them pass.
Do not add speculative features, helpers, configuration, or abstractions.
Run the relevant tests and return the command plus passing output.
"""

REFACTOR agent
# .codex/agents/tdd-refactorer.toml
name = "tdd_refactorer"
description = "REFACTOR phase agent. Improves structure only after tests are green."
sandbox_mode = "workspace-write"
developer_instructions = """
You own only the REFACTOR phase.
Start by running the relevant tests to confirm the code is green.
Look for duplication, unclear names, needless branching, or misplaced responsibility.
Skip refactoring when the code is already simple.
If you refactor, make one structural change at a time.
Run tests after each refactor.
Never change behavior in this phase.
"""

How to instruct the main session
Do this slice as three-phase TDD:
1. tdd_test_writer writes exactly one failing test and confirms RED.
2. After I confirm, tdd_implementer writes the minimal implementation and confirms GREEN.
3. tdd_refactorer decides whether a structural refactor is needed.
Do not lay down tests in batches.
Do not modify tests during the GREEN phase.

The point here is to isolate context. The test writer should stay as unaffected by implementation details as possible; the implementer must not touch the tests; the refactorer must not introduce new behavior.
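The role boundaries can be made checkable by comparing the paths each phase changed against what that phase is allowed to touch. This is a minimal sketch under stated assumptions: it reuses the test-path pattern from the hook script in section 5, and `is_test_file` and `out_of_bounds` are hypothetical helper names.

```python
import re

# Same path pattern as the hook script in section 5 of this article.
TEST_PATH = re.compile(
    r"(^|/)(__tests__|tests?)/|\.(test|spec)\.[cm]?[jt]sx?$|_test\.go$|test_.*\.py$"
)


def is_test_file(path: str) -> bool:
    return TEST_PATH.search(path) is not None


def out_of_bounds(phase: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that the given phase should not have touched."""
    if phase == "RED":
        # The test writer must not edit production code.
        return [p for p in changed_paths if not is_test_file(p)]
    if phase == "GREEN":
        # The implementer must not edit tests.
        return [p for p in changed_paths if is_test_file(p)]
    # REFACTOR is judged by tests staying green, not by which paths changed.
    return []


print(out_of_bounds("GREEN", ["src/slugify.py", "tests/test_slugify.py"]))
# ['tests/test_slugify.py']
```

A non-empty result is a prompt for review rather than proof of a violation, since some test edits are legitimate.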
5. L4: Use hooks to watch test diffs
Rules alone will not stop the model from stepping out of bounds.
The most common violation: the test is red, and the model edits the test to turn it green.
The value of hooks is not absolute safety but exposing this move immediately.
Enable Codex hooks
First turn on the feature flag in the configuration:
# ~/.codex/config.toml or <repo>/.codex/config.toml
[features]
codex_hooks = true

Codex looks for hooks next to the active configuration layers. Common locations:
- ~/.codex/hooks.json
- ~/.codex/config.toml
- <repo>/.codex/hooks.json
- <repo>/.codex/config.toml
At the project level, prefer <repo>/.codex/hooks.json, because it travels with the repository.
hooks.json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "apply_patch|Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bash \"$(git rev-parse --show-toplevel)/.codex/hooks/watch-test-edits.sh\"",
            "timeout": 10,
            "statusMessage": "Checking test file edits"
          }
        ]
      },
      {
        "matcher": "Bash|apply_patch|Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "bash \"$(git rev-parse --show-toplevel)/.codex/hooks/run-fast-check.sh\"",
            "timeout": 120,
            "statusMessage": "Running fast checks"
          }
        ]
      }
    ]
  }
}

Check whether test files have been modified
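#!/usr/bin/env bash
# .codex/hooks/watch-test-edits.sh
set -euo pipefail
# Join the changed test files onto one comma-separated line so the JSON below stays valid.
changed_tests="$(
  git diff --name-only |
    grep -E '(^|/)(__tests__|tests?)/|\.(test|spec)\.[cm]?[jt]sx?$|_test\.go$|test_.*\.py$' |
    paste -sd ',' - || true
)"
if [ -n "$changed_tests" ]; then
  cat <<EOF
{
  "continue": false,
  "stopReason": "Test files changed. Review before continuing.",
  "systemMessage": "Test files were modified: $changed_tests. Editing tests is allowed, but the reason must be stated. Stop and confirm first: does this change extend the spec, or was the test changed just to get to green?"
}
EOF
fi

Run a quick check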
#!/usr/bin/env bash
# .codex/hooks/run-fast-check.sh
set -euo pipefail
root="$(git rev-parse --show-toplevel)"
cd "$root"
if [ -f pyproject.toml ] || [ -f pytest.ini ] || [ -d tests ]; then
  pytest
elif [ -f package.json ]; then
  if npm run | grep -q "types:check"; then
    npm run types:check
  elif npm run | grep -q "check"; then
    npm run check
  fi
elif [ -f go.mod ]; then
  go test ./...
fi

Don't deify hooks
Hooks are guardrails, not a complete enforcement boundary.
They can remind you and block the common paths, but they cannot replace review, CI, or human judgment. In particular, when a test genuinely needs to change, the right approach is not to forbid it forever but to require the model to explain:
Why does the test need to change?
Is this a new requirement, a new edge case, or was the old test wrong?
Has the production implementation been verified alongside the change?

6. What about TDD-Guard?
TDD-Guard is worth a look, but don't treat it as something to install for Codex.
It takes the Claude Code plugin route. Its core idea is to stop the AI from violating TDD, especially from changing tests to get to green.
When moving to Codex, borrow the idea rather than copying the paths.
| Claude Code route | Codex route |
|---|---|
| CLAUDE.md | AGENTS.md |
| .claude/agents/*.md | .codex/agents/*.toml |
| Claude plugin / TDD-Guard | hooks + git diff check + CI |
| slash command | skill or prompt |
A more realistic combination on the Codex side is:
AGENTS.md states the discipline
skill codifies the process
subagents isolate the roles
hooks check test diffs
CI is the final backstop

This is not strictly equivalent to a hard enforcement boundary, but it is sufficient for personal projects and most team projects.
7. Complete walkthrough: slugify
Now walk through the whole loop with a small function.
Requirements:
Implement slugify(text: string): string.
Convert an English title into a URL slug.

Don't dismiss this example as too small. The rhythm of TDD is best learned from small examples.
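Step 0: Ask about the boundaries first, without writing code
Tell Codex not to rush into writing code:
I want to implement slugify(text: string): string.
Do not write code yet.
First ask me 5 boundary questions covering case, spaces, punctuation, unicode, and the empty string.
Then write the confirmed specification into SPEC.md.

The specification might be: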
# slugify SPEC
- "Hello World" -> "hello-world"
- trim leading/trailing spaces
- collapse repeated spaces into one hyphen
- remove punctuation
- normalize "Café" -> "cafe"
- empty input returns empty string

Step 1: First red light
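Read SPEC.md.
Implement only the first behavior: "Hello World" -> "hello-world".
RED first: write exactly one failing test, run it, and confirm that it fails.
Do not write the production implementation.

Ideal output: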
Behavior: basic title becomes lowercase hyphenated slug
Test: tests/test_slugify.py
Command: pytest tests/test_slugify.py -q
RED: failed because slugify is not defined

Only then should you continue.
Step 2: Minimal green
go

Codex writes the minimal implementation:
def slugify(text: str) -> str:
    return text.lower().replace(" ", "-")

Then report:
GREEN: pytest tests/test_slugify.py -q passed
REFACTOR: skipped, implementation is still simple
Next slice: trim leading/trailing spaces

Step 3: Second red light
Continue with the next item: trim leading and trailing spaces.
RED first, and write only one test.

Test:
def test_slugify_trims_spaces():
    assert slugify(" Hello World ") == "hello-world"

If the current implementation outputs -hello-world-, the red light is genuine.
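To see why this red light is genuine, the Step 2 implementation can be run against the new input. A minimal sketch; the name `slugify_v1` is hypothetical and is used only to freeze the earlier version.

```python
def slugify_v1(text: str) -> str:
    # The Step 2 implementation, frozen under a hypothetical name.
    return text.lower().replace(" ", "-")


# Leading and trailing spaces become hyphens, so the new test fails for the expected reason.
print(slugify_v1(" Hello World "))  # -hello-world-
```

The failure comes from the missing trim behavior, not from a typo or an import error, which is what makes it a useful RED.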
Then GREEN:
def slugify(text: str) -> str:
    return text.strip().lower().replace(" ", "-")

Step 4: Don't rush to abstraction
At this point many AIs will want to extract helpers such as normalizeInput, removePunctuation, and toAscii.
Don't rush yet.
In TDD, design should be driven by pressure from the tests, not by imagination. Wait until you add unicode, punctuation, and empty strings and real structural pressure appears, then refactor.
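For orientation only, here is one possible end state after every line of the SPEC has been driven by its own red/green cycle. It is a sketch, not the implementation this walkthrough prescribes: the use of `unicodedata` and `re` is an assumption, and the function stays helper-free because nothing in the SPEC has yet forced an extraction.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    # Fold accented letters to ASCII: "Café" -> "Cafe".
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Drop punctuation, keeping lowercase letters, digits, whitespace, and hyphens.
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_text.lower())
    # Trim, then collapse each whitespace run into a single hyphen.
    return re.sub(r"\s+", "-", cleaned.strip())


# One assertion per SPEC line.
assert slugify("Hello World") == "hello-world"
assert slugify("  Hello World  ") == "hello-world"
assert slugify("Hello    World") == "hello-world"
assert slugify("Hello, World!") == "hello-world"
assert slugify("Café") == "cafe"
assert slugify("") == ""
```

Each assertion would have arrived as its own failing test first; the function only grew when a test demanded it.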
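8. Everyday cheat sheet
New features
Implement this requirement with TDD.
Handle only one behavior per round.
First write one failing test and run it to confirm RED.
Do not write the production implementation until I say go.

Fix a bug
First write a failing test that reproduces this bug.
After confirming that it fails because of this bug, write the minimal fix.
Do not change the test to fit the current implementation.

Complex features
Do not write code yet.
Give me a TDD breakdown plan:
- What the outer-loop integration test is
- What each inner-loop behavior slice is
- Which command verifies each round
- Which parts must not be mocked
Start RED only after I confirm.

Review
Review this change, focusing on:
- Whether a failing test came first
- Whether the tests check behavior rather than implementation
- Whether any test was weakened to make things pass
- Whether structural changes and behavior changes are mixed together
- Whether an outer-loop integration test is missing

9. Final checklist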
Every time you ask Codex to do TDD, run through this table at the end.
| Question | Passing criterion |
|---|---|
| Was it really red first? | There is a failing command and a failure reason |
| Is it the right red? | The failure reason corresponds to the missing target behavior |
| Only one behavior per round? | No tests laid down in batches |
| Were tests changed during GREEN? | No test was changed to get to green |
| Do the tests check behavior? | They do not depend on internal implementation details |
| Did refactoring mix in behavior changes? | Structural and behavioral changes are kept separate |
| Was verification complete? | The targeted tests and the necessary full checks were run |
If any row fails, don't rush to merge.
Recommended resources
AGENTS.md
Official guide to global, project, and nested instruction files.
Agent Skills
Official guide to packaging reusable workflows as skills.
Subagents
Official guide to custom agents and subagent workflows.
Codex Hooks
Official guide to deterministic scripts during the Codex lifecycle.
TDD-Guard: Automated TDD enforcement for Claude Code
Claude Code plugin route for TDD enforcement. Codex users should borrow the guardrail idea, not the install path.