When I started building PM Agent — an AI product coach trained on 14 Reforge frameworks — the architecture choice that consumed the most thinking wasn’t the model, the prompt, or the evaluations. It was a much more boring decision: how does the system decide which framework to apply to a given user question?
Two camps exist. RAG: throw all the frameworks into a vector store and let semantic search pull the relevant chunks at runtime. Deterministic routing: write code that maps user intent to one or more frameworks before the LLM sees anything.
Both are defensible. Both have failure modes that stay invisible until they bite you in production. After three months of measuring responses against a 200-question gold set, I ended up with a hybrid, and I'm fairly convinced now that pure RAG was never going to be the right answer for this product.
This post is what I wish I’d known on day one.
What we’re actually choosing between
The two architectures answer the same question — “what context should the model see?” — in opposite directions.
RAG (retrieval-augmented generation)
Slice your knowledge base into chunks. Embed each chunk into a vector. At query time, embed the user question, find the top-K nearest chunks, stuff them into the prompt, ask the model to answer based on the retrieved context.
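That whole loop fits in a few lines. Here's a minimal sketch, assuming sentence-transformers for the embedding step and a placeholder `complete()` standing in for whatever chat API you actually call:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative sketch, not the PM Agent code. sentence-transformers is an
# assumption for the embedding step; complete() is a placeholder for your
# chat-completion call of choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    # Embed every chunk once, up front: returns an (n_chunks, dim) matrix.
    return model.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = index @ q                    # cosine similarity (vectors are unit-norm)
    top = np.argsort(scores)[::-1][:k]    # indices of the K nearest chunks
    return [chunks[i] for i in top]

def rag_answer(question: str, chunks: list[str], index: np.ndarray, complete) -> str:
    context = "\n\n".join(retrieve(question, chunks, index))
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return complete(prompt)               # complete() = your LLM call
```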
Strengths: handles arbitrary content, scales to massive corpora, adapts as you add new material. Weaknesses: retrieval is fuzzy and sometimes wrong in ways the user can't see, and nothing in the system can explain why a particular chunk was retrieved.
Deterministic routing
Write rules — heuristic, keyword-based, classifier-based, or LLM-as-router-based — that map a user query to one of a finite set of pre-built prompt templates, each loaded with the right context up front.
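At its crudest, the router is a lookup table. A sketch (the keyword lists and prompt text are invented for illustration; in practice a classifier or a small LLM call replaces the table):

```python
# Illustrative keyword router, the crudest form of deterministic routing.
# Framework names are Reforge programs mentioned in this post; the keyword
# lists and prompt text are invented for the example.
ROUTES = {
    "Retention & Engagement": ["retention", "churn", "engagement"],
    "Growth Loops":           ["growth loop", "viral", "acquisition"],
    "Pricing & Monetisation": ["pricing", "freemium", "monetisation", "monetization"],
}

FRAMEWORK_PROMPTS = {
    # One hand-tuned system prompt per framework, full framework loaded up front.
    name: f"You are a product coach. Apply the {name} framework step by step..."
    for name in ROUTES
}

def route(query: str) -> str | None:
    q = query.lower()
    scores = {name: sum(kw in q for kw in kws) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None = fall back to a default prompt

chosen = route("How do I reduce churn in month two of the subscription?")
# -> "Retention & Engagement"; its system prompt is then used verbatim.
```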
Strengths: predictable, debuggable, fast, easy to measure quality per route. Weaknesses: brittle when users phrase things off-script, requires ongoing maintenance as the corpus grows, harder to handle the long tail of fuzzy queries.
Why I started with pure RAG
It was the obvious choice — and the most fashionable.
I had 14 framework documents (Reforge programs, distilled into structured notes). Total: about 180,000 tokens. Way too much to fit in a context window without aggressive trimming. RAG seemed custom-built for this scenario: chunk, embed, retrieve, answer.
The first version was up in two days. It worked. Sort of.
How RAG quietly broke down
The failure modes weren’t spectacular. They were subtle. That’s what made them dangerous.
Failure mode 1: semantic ambiguity
A user asks “how should I think about retention?”. Retrieval pulls chunks from five different frameworks that all mention retention — Mastering Product, Retention & Engagement, Pricing & Monetisation, Growth Loops, even User Insights. The model dutifully synthesises across all of them. The answer is technically correct, vaguely plausible, and useless.
The user wanted a coherent point of view from one framework applied well, not a smoothie of five frameworks averaged.
Failure mode 2: chunk boundaries that lie
Reforge content is highly structured: each framework has a set of steps, and each step depends on the previous one. When you chunk by 500-token windows, you cut steps in half. Retrieval pulls a chunk that is technically relevant but missing the prerequisite step. The model produces advice that is wrong because the user is at step 1 and the chunk was about step 4.
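A toy example makes the mismatch visible (window sizes shrunk so the splits are obvious; the effect is identical at 500 tokens):

```python
# Toy illustration of fixed-window chunking on procedural content -- not
# production code, and the "framework" text is invented.
def chunk_fixed(text: str, window: int = 500, overlap: int = 50) -> list[str]:
    tokens = text.split()                 # crude whitespace "tokens"
    step = window - overlap
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), step)]

framework = " ".join(
    f"Step {n}: do the thing that only makes sense after step {n - 1}." for n in range(1, 8)
)
for chunk in chunk_fixed(framework, window=25, overlap=5):
    print(chunk)
# Several chunks begin mid-procedure with no trace of the prerequisite steps,
# yet retrieval will happily rank them as the most relevant context.
```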
This is the structural mismatch nobody warns you about. RAG works beautifully on flat reference material — encyclopaedias, documentation, product manuals. It works much worse on procedural material where order matters.
Failure mode 3: invisibility
With pure RAG, the user has no idea which framework the answer came from. Sometimes that’s fine. For PM Agent, where the value proposition is “I’ll apply the right Reforge framework to your question”, it’s catastrophic. Users couldn’t cite the answer back. They couldn’t go deeper into the chosen framework. They couldn’t challenge the methodology because the methodology was hidden.
Worse, when the retrieval was bad, I had no clean way to debug it. I’d look at the logs, see chunks from three different frameworks, and have to reverse-engineer why retrieval picked them.
Why I tried pure deterministic routing next
After six weeks of subtle wrong answers, I rebuilt around deterministic routing. The new architecture was simple:
- A small classifier (a fast LLM call) takes the user’s question and outputs one or two framework names.
- Each framework has a hand-tuned system prompt with the full framework loaded as structured context.
- The chosen framework prompt runs against the user’s question. The answer cites which framework it came from.
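In code, the whole flow is short. A sketch, with `chat()` standing in for your LLM client and the router prompt invented for illustration:

```python
import json

# Sketch of the pure-routing flow, not the production code. chat() is a
# placeholder for whatever LLM client you use; framework_prompts holds the
# 14 hand-tuned system prompts (contents elided).
ROUTER_PROMPT = (
    "You are a router for a PM coaching product. Given the user's question, "
    'reply with JSON only: {"frameworks": [<one or two names from the list>]}.\n'
    "Allowed frameworks: Retention & Engagement, Growth Loops, "
    "Pricing & Monetisation, Mastering Product, ..."
)

def answer(question: str, chat, framework_prompts: dict[str, str]) -> dict:
    # 1. A fast classifier call picks one or two framework names.
    routed = json.loads(chat(system=ROUTER_PROMPT, user=question))
    chosen = routed["frameworks"][:2]

    # 2. The question runs against the chosen framework's full, untruncated prompt.
    reply = chat(system=framework_prompts[chosen[0]], user=question)

    # 3. The answer always declares which framework it came from.
    return {"framework": chosen[0], "answer": reply}
```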
This was a clear improvement on three dimensions:
- Answers were coherent — they came from one framework applied well, not five averaged.
- Procedural integrity was preserved because the entire framework was always in context, never chopped.
- The user could see which framework was chosen and trust (or challenge) the choice.
But it had its own failure modes.
Where deterministic routing fell short
The off-script question
Some questions don’t fit a Reforge framework cleanly: “how should I structure my product team for a fintech in West Africa?” That’s a real question I got. There’s no Reforge framework called “Org design for emerging-market fintechs.” The router would pick the closest framework (Leading a Product Strategy, usually) and the answer was underwhelming because the framework wasn’t built for that context.
The multi-framework question
Other questions genuinely need synthesis across frameworks. “I’m about to launch a freemium tier — what should I consider?” touches Pricing, Retention, and Mastering at minimum. The router was forced to pick one, and the answer was always missing pieces.
The cold-start problem
For each new framework I added, I had to write a new routing rule and maintain it. The maintenance overhead grew linearly with the corpus. RAG would have absorbed new content for free.
The hybrid I shipped
The current architecture is layered. Three stages, each addressing a failure of the previous:
Stage 1 — Intent classification (deterministic). A fast LLM call (Haiku-class model) takes the user query and outputs one of three labels:
- single-framework — the question maps cleanly to one Reforge framework.
- multi-framework — the question genuinely requires synthesis across two or three.
- off-corpus — the question doesn’t fit a framework at all.
Stage 2 — Context assembly (mixed). Depending on the label:
- single-framework: load the full framework prompt deterministically. No retrieval, no ambiguity. This is ~70% of queries and accounts for the bulk of the quality wins.
- multi-framework: load the top-2 framework prompts deterministically and run a small RAG pass over a curated synthesis library (cross-framework essays I’ve written myself). Retrieval is much narrower because the corpus is now ~40k tokens of synthesis content, not the raw frameworks.
- off-corpus: skip the framework system entirely, run a general PM coach prompt with web search disabled. Be honest with the user that this is outside the framework canon.
Stage 3 — Citation (deterministic). Every answer is forced via structured output to declare which framework(s) it drew from. The UI surfaces those citations as clickable cards.
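Wired together, the three stages are a thin layer of code around the prompts. A sketch under the same placeholder assumptions as the earlier snippets (`chat()`, the prompt constants, and the synthesis retrieval are all stand-ins):

```python
import json
from enum import Enum

# Sketch of the three-stage hybrid, not the production code. chat() is a
# placeholder LLM call; the prompt constants and retrieval are elided.

class Intent(str, Enum):
    SINGLE = "single-framework"
    MULTI = "multi-framework"
    OFF = "off-corpus"

CLASSIFIER_PROMPT = "Label the question as single-framework, multi-framework or off-corpus..."
GENERAL_COACH_PROMPT = "You are a general PM coach; no framework applies here..."
FRAMEWORK_PROMPTS: dict[str, str] = {}    # 14 hand-tuned prompts, contents elided

def rag_over_synthesis(question: str) -> str:
    # Narrow retrieval over the curated synthesis library (~40k tokens of
    # cross-framework essays); same shape as the RAG sketch earlier, elided here.
    return ""

def classify(question: str, chat) -> tuple[Intent, list[str]]:
    # Stage 1: a Haiku-class call returns a label plus candidate framework names.
    raw = json.loads(chat(system=CLASSIFIER_PROMPT, user=question))
    return Intent(raw["label"]), raw.get("frameworks", [])

def assemble_context(intent: Intent, frameworks: list[str], question: str) -> str:
    # Stage 2: the context is chosen by code, not by similarity search alone.
    if intent is Intent.SINGLE:
        return FRAMEWORK_PROMPTS[frameworks[0]]
    if intent is Intent.MULTI:
        base = "\n\n".join(FRAMEWORK_PROMPTS[f] for f in frameworks[:2])
        return base + "\n\n" + rag_over_synthesis(question)
    return GENERAL_COACH_PROMPT

def respond(question: str, chat) -> dict:
    intent, frameworks = classify(question, chat)
    system = assemble_context(intent, frameworks, question)
    # Stage 3: structured output forces the answer to cite its framework(s);
    # the UI turns those citations into clickable cards.
    raw = chat(system=system, user=question, response_format="json")
    return json.loads(raw)   # e.g. {"frameworks_cited": [...], "answer": "..."}
```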
Numbers
On the gold set of 200 hand-graded questions:
- Pure RAG: 62% rated “good” or better.
- Pure deterministic routing: 74%.
- Hybrid: 87%.
More importantly, the hybrid’s failures are now legible. I can look at the routing label, see why a question was misclassified, and fix the classifier. With pure RAG, debugging required reading retrieved chunks and guessing.
Lessons for PMs evaluating AI features
If you’re a PM scoping an AI feature backed by a knowledge base, the questions to ask before picking an architecture aren’t about technology. They’re about the shape of your content.
Is your content procedural or referential?
If users need to be walked through something step by step (procedures, frameworks, playbooks), RAG will silently butcher it. Lean deterministic. If users need lookup answers (definitions, facts, historical context), RAG works fine.
Does the user benefit from knowing the source?
If the citation is part of the value proposition (legal advice, coaching, medical guidance), deterministic is non-negotiable. Users need to know which authority you’re channelling.
How big is your corpus, and how often does it change?
Small (under ~500k tokens), stable corpus → deterministic wins on quality. Large, fast-moving corpus → RAG wins on maintenance, even if quality suffers.
What does “wrong” cost you?
Cheap consequences (entertainment, brainstorming, drafts) → RAG is fine. Expensive consequences (financial, legal, medical, strategic) → deterministic gives you the audit trail you’ll need when something goes sideways.
The boring conclusion
“Should I use RAG?” is the wrong question. The right question is: what is the structure of the work my system is trying to do, and which architecture preserves that structure? For procedural, citation-heavy, audit-sensitive work, deterministic routing is usually the better default — possibly with a narrow RAG layer for synthesis.
For PM Agent, that means the LLM is mostly doing prose generation on top of context my code chose deliberately. The intelligence is upstream of the model, not inside it. That’s the version that ships.