AI-written code is always clean. That is the trap.
Clean code hides subtleties. The PM who delegates thinking to an AI pays for it in production six months later, when a bug that was never really fixed surfaces in code the model never read. The engineer who copy-pastes the AI’s explanation of a stack trace fixes the symptom and reintroduces the root cause two weeks later, with new wording.
After shipping six AI products in 2025 with Claude Code as a co-pilot, the most useful thing I can write about AI in product and engineering work is not where to use it. It is where to refuse to use it. This article is that list, in the order I have come to enforce it on myself.
The new-project flow
Order matters. When I start a new product or feature, the sequence is non-negotiable. AI does not enter the room before step four.
Step 1. Whiteboard, no AI
What problem am I solving. Who is the user. What is the smallest path to a real moment of value. What does the world look like if this works, and what does it look like if it does not.
AI is bad at this part because it has no skin in the game. It will produce a competent, generic framing that reads correctly and commits to nothing. Generic framings ship generic products. The whiteboard phase is where you commit to a specific opinion about a specific user, and the only way to do that is to think without a model in the room.
Step 2. Data model on paper, no AI
What entities exist. What relationships matter. What constraints are imposed by the business, by the regulator, by the operator you depend on. What invariants must hold at the database level, no matter what the application does.
AI will generate a reasonable schema in thirty seconds. It will miss the business rule that contradicts the schema, because the rule lives in your head, not in the prompt. At Zepargn, PENDING_PAYMENT and PENDING_VOTE are two distinct states on group savings. The AI would have called both PENDING, and so did we, until the collision produced subtle bugs that took weeks to surface and six hours to debug. A clean data model encodes the business it serves. A model generated from a prompt encodes what is canonical, which is rarely the same thing.
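To make "invariants at the database level" concrete, here is a minimal sketch using SQLite from the Python standard library. Only PENDING_PAYMENT and PENDING_VOTE come from the story above; the table name, the columns, and the ACTIVE state are assumptions made up for illustration, not the real Zepargn schema.

```python
import sqlite3

# Hypothetical schema: only PENDING_PAYMENT and PENDING_VOTE come from the article.
# The CHECK constraint makes the database refuse an ambiguous "PENDING" status,
# no matter which application code tries to write it.
SCHEMA = """
CREATE TABLE group_savings (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    status TEXT NOT NULL
           CHECK (status IN ('PENDING_PAYMENT', 'PENDING_VOTE', 'ACTIVE'))
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, name: str, status: str) -> bool:
    """Return True if the row was accepted, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO group_savings (name, status) VALUES (?, ?)",
                (name, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_db()
accepted = try_insert(conn, "rent pool", "PENDING_VOTE")  # a precise state
rejected = try_insert(conn, "rent pool", "PENDING")       # the ambiguous one
```

The point of the sketch is where the rule lives: in the schema, so that a generated CRUD layer cannot quietly collapse the two states into one.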
Step 3. Architecture on paper, no AI
Where each component lives. Where the boundaries are. What is idempotent and what is not. Where retries are safe and where they will double-debit a user. What can fail loud and what must fail silent.
AI defaults to canonical patterns. Microservices when monoliths would be right. Generic retry loops when domain-specific backoff is right. Standard auth flows when your market needs anonymous-first. Those defaults are not wrong in general; they are wrong for your constraints, which the model does not know. The architecture phase is where you lock in choices that the model cannot unmake later without rewriting half the system.
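One way to write down "where retries are safe and where they will double-debit" is to split failures by what is known about the outcome. The sketch below is an assumption-heavy illustration: the exception classes, the status strings, and the function name are all hypothetical, and a real operator client would need its own error mapping.

```python
from dataclasses import dataclass
from typing import Callable


class NotSent(Exception):
    """Hypothetical: the request never left our side, so a retry cannot double-debit."""


class OutcomeUnknown(Exception):
    """Hypothetical: the operator may have processed the debit (e.g. a read timeout)."""


@dataclass
class DebitResult:
    status: str    # "done", "failed", or "needs_reconciliation"
    attempts: int  # number of times send() was called


def debit_with_safe_retries(send: Callable[[], None], max_attempts: int = 3) -> DebitResult:
    """Retry only when a retry is known to be safe; otherwise stop and reconcile."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return DebitResult("done", attempt)
        except NotSent:
            continue  # nothing reached the operator: safe to try again
        except OutcomeUnknown:
            # A blind retry here is how a user gets debited twice.
            return DebitResult("needs_reconciliation", attempt)
    return DebitResult("failed", max_attempts)  # every attempt failed before sending
```

A generic retry loop treats both exceptions the same way. Deciding which failures belong in which class is the architecture decision; the loop itself is the mechanical part.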
Step 4. Now delegate to Claude Code
Once the thinking is on paper, the writing is mechanical. Claude Code is fast at mechanical. The whiteboard becomes the prompt, the data model becomes the migration, the architecture becomes the file layout. Five to ten minutes of prompting produces a scaffold that would have been an hour of typing.
The trick at this step is to delegate WIDTH, not DEPTH. Claude Code is excellent at producing a complete first pass across many files. It is mediocre at the one subtle line that holds the invariant. Let it do the width, then go back and write the critical lines yourself. That ratio (model on width, human on depth) is the one that compresses delivery without compounding debt.
Step 5. Review with an AI code reviewer
I run CodeRabbit on every PR before I look at the code myself. It flags surface bugs, style issues, common smells, unused imports, and missing null checks: the layer Claude Code generated quickly and moved past. An AI reviewer catches roughly 70% of the surface defects in 30 seconds, which is 30 seconds well spent.
Two cautions. First, treat AI review like a junior reviewer: catches the obvious, misses the structural. Second, never merge on the AI’s approval alone. The approval threshold is one human reviewer who understood the change, plus the AI as a net underneath.
Step 6. Final deeper review by me
The two-layers-below review. The questions I ask, in order:
- Does this preserve the business invariant I designed in?
- Does this fit the existing architecture, or does it leak responsibility into a layer that should not own it?
- Where does this fail loud, and where does it fail silent?
- What happens at 10x the current load?
- What does this assume about the operator that we have no contract to enforce?
None of these questions can be delegated. They are the questions a senior reviewer asks because they read the system, not the ticket. The AI did not read the system. The CodeRabbit review did not read the system. The author read it, and they should be the one who signs off.
The debugging flow (where AI gets dangerous)
Modern observability tools (Sentry, Datadog, the rest) now ship with AI explanations baked in. You see a stack trace, the AI suggests a cause, sometimes proposes a fix. It looks right. It is often wrong-by-omission.
The trap, in three steps
- Paste the stack trace into the AI.
- Apply the suggested fix.
- Symptom goes away. You ship.
Two weeks later the same class of bug surfaces in a slightly different code path. You paste the new stack trace. Apply the new fix. Symptom goes away again. The cycle repeats until the bug reaches a layer where it cannot be patched cleanly, and you spend a week unwinding the four surface fixes that were actually masking one root cause two layers below.
A concrete example
A payment retry storm at a fintech I have worked on. The observability tool flagged it as a client double-tap. The AI explanation looked correct: the same payment was being initiated twice within milliseconds, so the suggested fix was to debounce the button on the client.
The debounce shipped. The metric went green for forty-eight hours. Then it returned, now flagged as a retry storm on the operator side. New AI explanation, new suggested fix at the retry layer. Shipped that too. Came back a third time, on a different operator.
The actual root cause was two layers below either of those symptoms. Our operator client was generating a new idempotency reference on every retry instead of preserving the original one, because the retry wrapper was instantiated in a context that the original code path did not propagate. The model could not see this because the model only saw the stack trace, not the architectural boundary between the retry wrapper and the operator client. The fix was twenty lines in a file the model had never been shown.
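The shape of that fix can be sketched in a few lines. This is not the original twenty-line change, which is not reproduced here; it is a minimal illustration of the principle, with a hypothetical `charge` callable standing in for the operator client and a hypothetical `OperatorTimeout` error.

```python
import uuid
from typing import Callable, List, Optional


class OperatorTimeout(Exception):
    """Hypothetical transient error raised by the operator client."""


def pay_with_retries(charge: Callable[[str], str], max_attempts: int = 3) -> str:
    """Call charge(idempotency_ref) with the SAME reference on every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once, outside the loop. Generating it inside the loop (or inside
    # the client on each call) is the bug described above.
    idempotency_ref = str(uuid.uuid4())
    last_error: Optional[OperatorTimeout] = None
    for _ in range(max_attempts):
        try:
            return charge(idempotency_ref)
        except OperatorTimeout as exc:
            last_error = exc
    assert last_error is not None  # the loop ran at least once and always failed
    raise last_error


# Usage with a fake operator that times out twice, then succeeds.
seen_refs: List[str] = []


def flaky_charge(ref: str) -> str:
    seen_refs.append(ref)
    if len(seen_refs) < 3:
        raise OperatorTimeout()
    return "charged:" + ref


receipt = pay_with_retries(flaky_charge)
```

After the run, `seen_refs` holds three entries that are all the same reference, which is what lets the operator deduplicate the retries instead of debiting three times.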
How to use AI for debugging without the trap
Three rules I enforce on myself:
- Treat the AI’s explanation as a hypothesis, never an answer. Ask “what would I check to verify this?” before applying any fix.
- Read the two layers below the stack frame the AI is pointing at. The actual cause is almost never in the file the model highlighted.
- State the business invariant the bug is violating. If you cannot, you do not yet understand the bug, and the AI cannot help you because it does not know the invariant either.
AI is a useful first pass on debugging. It is dangerous as a final pass.
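Stating the invariant gets easier once it is written as a check you can run against real data. As a sketch, take the invariant from the example above, "one payment never carries more than one idempotency reference", and assume a hypothetical attempt log of `(payment_id, idempotency_ref)` pairs. The log format and the function name are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_invariant_violations(
    attempts: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Return the payments that were sent with more than one idempotency reference.

    `attempts` is a hypothetical log of (payment_id, idempotency_ref) pairs,
    one entry per call made to the operator.
    """
    refs_by_payment: Dict[str, Set[str]] = defaultdict(set)
    for payment_id, ref in attempts:
        refs_by_payment[payment_id].add(ref)
    # An empty result means the invariant held for every payment in the log.
    return {p: refs for p, refs in refs_by_payment.items() if len(refs) > 1}


log = [
    ("pay_1", "ref_a"),
    ("pay_1", "ref_a"),  # retry with the same reference: fine
    ("pay_2", "ref_b"),
    ("pay_2", "ref_c"),  # retry with a new reference: violation
]
violations = find_invariant_violations(log)
```

If the check comes back empty, the AI’s hypothesis about the cause has to be tested some other way. If it does not, you know which layer to read before touching any fix.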
Where AI delegation is safe
Honest list of things I delegate without the deeper review, because the cost of being wrong is low and the speed-up is meaningful:
- Boilerplate code. CRUD endpoints, glue code, standard middleware, config files, type definitions. Anything where the canonical answer is also the right answer.
- Tests for already-understood behavior. Once the behavior is clear and the test name is precise, the test body is mechanical.
- Documentation of code you already wrote. README sections, JSDoc, OpenAPI specs derived from existing code.
- Throwaway prototypes. If it ships in the next forty-eight hours and gets thrown away by next week, AI is fine to write most of it.
- First drafts of prose. Emails, internal updates, PR descriptions. Edit ruthlessly, but starting from a blank page is wasted time.
- Translation. i18n strings, copy translations, documentation in a second language.
Notice what is missing from this list. It is missing the things where being wrong costs a quarter of work to fix.
Where the human stays in the loop (non-negotiable)
The other list. The decisions and judgments AI does not own, ever:
- Architecture decisions. The shape of the system. Where the boundaries are. What is idempotent. What can fail. Those decisions outlive the code that implements them.
- Data model design. The schema is the business rule. The business rule lives in your head and the AI cannot read it.
- Business constraint interpretation. What is actually legal, what the regulator will tolerate, what your biggest customer will accept. The AI knows none of this.
- Production root-cause analysis. The two layers below the symptom. AI is a hypothesis generator here, never the verdict.
- Hiring decisions. Who joins the team. Who gets promoted. Who gets feedback that changes their trajectory.
- Strategic prioritization. What ships this quarter, what gets explicitly not done. The arbitration is the job.
- The final review before shipping. The two-layers-below review. The signature on the merge button.
Closing
The PM or engineer who delegates thinking accumulates two kinds of debt at the same time:
Technical debt. Code that works in the small, breaks in the large. Bugs that pass review because the review was on the surface. Architecture that grows by accretion because no one is checking that the new component respects the old boundaries.
Product debt. Decisions that look correct in the prompt but do not fit the business. Roadmaps that aggregate instead of arbitrating. Pricing pages that test well in isolation and produce churn in cohort analysis.
Both kinds compound. Both kinds are paid in full eventually, by the team that inherits the code or the next PM who looks at the funnel and cannot tell why it was designed this way.
The AI-native builder in 2026 is not the person who uses AI the most. It is the person who knows precisely where AI stops being useful, refuses to use it past that line, and stays in the loop for the decisions that compound. The whiteboard before the prompt. The two-layers-below review before the merge. The invariant before the fix.
Use AI to delegate execution. Never delegate the thinking.