TDD with AI Agents: Why Red-Green-Refactor Still Matters
Test-driven development makes AI-generated code dramatically better. Here's how to apply red-green-refactor cycles when working with coding agents.
When working with AI code generation, I noticed something troubling: developers write all the tests upfront, the AI implements everything at once, and the tests pass immediately. No friction. No real constraints. Without a failing test to guide it, the AI has too much freedom: it can take shortcuts, make assumptions, and optimize for passing mocks instead of solving the real problem.
This isn't how TDD works. Kent Beck's red-green-refactor cycle is still the most powerful approach for AI-assisted development. The difference is understanding how to apply it when your collaborator is a language model.
The Horizontal Slicing Problem
Most developers default to what I call "horizontal slicing" with AI: write all tests, generate implementation, done. The tests look like this:
test('processRefund should succeed for valid request', async () => {
  const mockService = { reverseCharge: jest.fn() };
  const processor = new RefundProcessor(mockService);
  // processRefund is asynchronous, so the result must be awaited
  const result = await processor.processRefund({ orderId: '123', amount: 50 });
  expect(result.success).toBe(true);
});

test('processRefund should fail if payment service throws', async () => {
  const mockService = {
    reverseCharge: jest.fn().mockRejectedValue(new Error('failed'))
  };
  const processor = new RefundProcessor(mockService);
  const result = await processor.processRefund({ orderId: '123', amount: 50 });
  expect(result.success).toBe(false);
  expect(result.error).toBe('failed');
});
Then you ask Claude to implement RefundProcessor and make all tests pass. The AI does it in one go. But what you've created is weak. Without real constraint during implementation, the AI can satisfy mocks without actually implementing behavior. You're not testing behavior; you're testing whether the AI matched your implementation expectations.
Vertical Slicing: One Test, One Implementation
The solution is vertical slicing. Write one test. Watch it fail. Show that failing test to the AI and ask it to make it pass. Review the code. Move to the next test.
This works because a failing test is the clearest specification you can give an AI. It's not ambiguous. It's measurable. The AI either satisfies it or doesn't.
Here's the workflow:
- Write a single failing test
- Show it to Claude with: "Make this test pass, implement only what's necessary"
- Review the minimal implementation
- Refactor together if needed
- Write the next test
Each iteration is focused. Each implementation is constrained. You get continuous feedback rather than one large implementation pass.
Strong Tests vs Implementation Details
Not all tests are equal. A weak test verifies that processRefund() calls mockService.reverseCharge() with specific arguments. A strong test verifies that processRefund() with valid input succeeds and with invalid input fails with the right error.
Weak tests couple your tests to implementation. Strong tests verify the actual contract your code provides.
This matters more with AI because language models naturally pattern-match against common testing structures. You must be explicit: tests verify behavior through public interfaces, not internal implementation. Martin Fowler's testing guide covers this principle thoroughly.
The difference in practice:
// Weak: Testing implementation details
expect(mockService.reverseCharge).toHaveBeenCalledWith('123', 50);
// Strong: Testing actual behavior
expect(result.success).toBe(true);
expect(result.orderId).toBe('123');
Strong tests let you refactor internals without breaking tests. With AI, this becomes your safety net as the codebase evolves.
A Concrete TDD Cycle
Let me walk through an actual workflow. I'm building a refund processor. First, I define the interface: a RefundProcessor class that receives its dependencies through the constructor and exposes a processRefund(request) method.
Test one:
test('processRefund returns success for valid request', async () => {
  const mockService = { reverseCharge: jest.fn().mockResolvedValue({ id: 'charge-123' }) };
  const processor = new RefundProcessor(mockService);
  // processRefund returns a promise, so await it before asserting
  const result = await processor.processRefund({ orderId: '123', amount: 50 });
  expect(result.success).toBe(true);
  expect(result.transactionId).toBe('charge-123');
});
I show this failing test to Claude. Claude implements:
class RefundProcessor {
  constructor(paymentService) {
    this.paymentService = paymentService;
  }

  async processRefund(request) {
    const result = await this.paymentService.reverseCharge(request.orderId, request.amount);
    return { success: true, transactionId: result.id };
  }
}
Minimal. Specific. Test passes. Next test:
test('processRefund returns error when payment fails', async () => {
  const mockService = {
    reverseCharge: jest.fn().mockRejectedValue(new Error('Card declined'))
  };
  const processor = new RefundProcessor(mockService);
  // Await the result; without it, result would be a pending promise
  const result = await processor.processRefund({ orderId: '123', amount: 50 });
  expect(result.success).toBe(false);
  expect(result.error).toBe('Card declined');
});
Claude adds error handling. Test passes. Continue iteratively. Each test adds one behavior. Each implementation is minimal and focused. By completion, you have well-tested code that evolved through explicit requirements.
This approach works because each failing test is a concrete specification. Claude isn't guessing; it's satisfying clear constraints.
Fitting TDD Into Your Workflow
TDD with AI is a reusable agent skill. You can extract the pattern and apply it across projects. It fits naturally into the plan-execute-clear loop: plan interfaces upfront, execute tests one at a time with AI, and clear assumptions as you go.
You can also think of TDD as AI evaluation. Each failing test is an evaluation criterion. The code either passes or fails. This makes TDD more rigorous than free-form code review.
When documenting patterns in CLAUDE.md files for team AI workflows, use context engineering to specify test-first discipline explicitly. Show examples. Make it clear that vertical slicing is expected.
When TDD Doesn't Apply
TDD isn't always the right tool. During exploration and prototyping, you might need loose constraints. When learning a new domain, horizontal slicing lets you iterate faster. TDD becomes essential once you move from "what should this do?" to "this must work reliably."
Know the difference. Use TDD for production code. Use exploration for learning.
Conclusion
The failing test is the most powerful constraint you can give an AI. Not requirements documents. Not architectural diagrams. Not vague descriptions. A test that fails, clearly, unambiguously, until the code is correct.
Red-green-refactor still matters. It matters more with AI because it provides the clarity language models need to generate trustworthy code.
Write one test. Watch it fail. Ask Claude to make it pass. Review. Refactor. Repeat. This is how you build code you can trust.
About the author: Alex Hinds builds AI-assisted development workflows and engineering practices across teams.