· 2/21/2026 · 5 min read
# 5 AI PM Frameworks That Actually Work (Not Theoretical Nonsense)
**Subtitle:** Frameworks from shipping AI at a $7B company, not from a consulting deck.
**PM the Builder | SEO Target: "AI PM frameworks"**
---
## TL;DR
Most PM frameworks don't work for AI products because they assume deterministic outcomes. Here are 5 AI PM frameworks I actually use: (1) Eval-First Development, (2) Prototype-to-Production Pipeline, (3) Model Selection Matrix, (4) AI Cost Engineering, and (5) Hallucination Risk Assessment. Each one is battle-tested from real AI product development, not theoretical consulting frameworks.
---
I have a visceral reaction to PM frameworks.
Most are made-up acronyms that look good on Medium but crumble on contact with reality. RICE scoring? Sure, if you enjoy assigning fake numbers to feel productive. MoSCoW prioritization? A way to avoid making actual decisions.
But AI product management genuinely needs frameworks, because AI is different enough from traditional software that your existing mental models fail. When your feature's output is probabilistic, when your costs scale with usage in non-obvious ways, when "quality" is subjective and hard to measure, you need structured thinking to avoid chaos.
Here are 5 AI PM frameworks I actually use. Not because I read about them. Because I built them from painful experience shipping AI features at scale.
---
## Framework 1: Eval-First Development
**Use when:** Starting any new AI feature or making changes to an existing one.
### The Problem It Solves
Traditional development: Write spec → Build → Test → Ship
AI development: Write spec → Build → "Is this good?" → "How do we even measure good?" → Argue about quality → Ship and hope → Users hate it → Scramble
The root cause: you didn't define "good" before building. Eval-First fixes this.
### The Framework
**Step 1: Define the quality bar BEFORE writing any code.**
Ask three questions:
- What does a "great" output look like for this feature?
- What does an "acceptable" output look like?
- What does a "failure" look like?
Write actual examples. Not vague criteria, but specific input/output pairs that represent each quality level.
**Step 2: Build the eval suite BEFORE building the feature.**
- Create 50-100 test cases covering happy path, edge cases, and adversarial inputs
- Define scoring criteria (rubric-based, 1-5 scale)
- Set ship thresholds: "We launch when score > 4.0 average, with zero safety failures"
- Set up LLM-as-judge automation
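The ship threshold from this step can be sketched as a simple gate. This is a minimal illustration, not a real eval harness; the result shape, function name, and thresholds are all assumptions:

```python
# Minimal sketch of the ship gate: launch only when the average rubric
# score clears the bar AND no safety-tagged case fails.
def ship_decision(results, min_avg=4.0):
    """results: list of dicts like {"score": 1-5, "safety_failure": bool}."""
    avg = sum(r["score"] for r in results) / len(results)
    safety_failures = sum(1 for r in results if r["safety_failure"])
    return avg >= min_avg and safety_failures == 0

cases = [
    {"score": 4.5, "safety_failure": False},
    {"score": 4.2, "safety_failure": False},
    {"score": 3.8, "safety_failure": False},
]
print(ship_decision(cases))  # True: avg ~4.17, zero safety failures
```

In practice the `results` list comes from the LLM-as-judge run over the full test suite, and one safety failure vetoes the launch regardless of the average.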
**Step 3: Build the feature TO PASS the eval.**
Engineering's target isn't a spec; it's the eval suite. Iterate prompts, architecture, and models until evals pass.
**Step 4: Expand the eval post-launch.**
Every production failure becomes a new test case. Every user complaint gets added. The eval suite is a living document.
### Why It Works
- Forces clarity on quality before engineering effort
- Makes "ship/no-ship" a data decision, not a vibes decision
- Creates accountability (the eval is the contract)
- Compounds over time (eval suite gets better with every iteration)
### Common Mistakes
- Making the eval too easy (everything passes → useless)
- Making the eval too hard (nothing ships → paralysis)
- Not updating the eval with production learnings
- Using only automated evals without human calibration
For the full deep dive, see [Evals Are the New PRD](/blog-drafts/evals-are-the-new-prd).
---
## Framework 2: The Prototype-to-Production Pipeline
**Use when:** You have an AI feature idea and need to go from concept to shipped product.
### The Problem It Solves
The traditional PM process for AI features is painfully slow:
1. PM writes spec (2 weeks)
2. Design review (1 week)
3. Engineering estimates (1 week)
4. 2-3 sprints of building (4-6 weeks)
5. Testing and iteration (2 weeks)
6. Launch (1 week)
Total: 3-4 months. For a feature that might not even work with AI.
### The Framework
**Day 1: Prototype (4-6 hours)**
Build a working AI feature using LLM APIs. Not a mockup. A thing that takes input and produces output. Hardcode everything except the AI interaction. Deploy to a shareable URL.
**Day 2: Validate (2-3 hours)**
Show the prototype to 5 users. Watch them use it. Note where AI output disappoints, where they get confused, where they say "oh, this is cool."
Kill or continue decision. Most ideas die here. Good. You spent 1 day, not 1 quarter.
**Day 3: Eval & Harden (4-6 hours)**
Build eval suite. Collect test cases from Day 2 sessions. Run evals. Fix worst failures. Set ship criteria.
**Day 4: Production-ize (6-8 hours, with engineering)**
Hand off working code + passing eval suite + cost projections. Engineering adds error handling, auth, monitoring, rate limiting. Not building from scratch; hardening a proven prototype.
**Day 5: Ship & Monitor**
Shadow mode → internal → 5% → 25% → 100%. Monitor quality scores, latency, cost, user feedback.
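The rollout stages above can be expressed as a gate that only advances traffic while the monitored metrics stay healthy. This is an illustrative sketch; the stage names, metric keys, and thresholds are placeholders:

```python
# Illustrative staged-rollout gate: advance to the next traffic stage only
# while quality, latency, and cost stay inside their guardrails.
STAGES = ["shadow", "internal", "5%", "25%", "100%"]

def next_stage(current, metrics, min_quality=4.0, max_p95_ms=1500, max_cost=0.05):
    healthy = (metrics["quality"] >= min_quality
               and metrics["p95_ms"] <= max_p95_ms
               and metrics["cost_per_request"] <= max_cost)
    if not healthy:
        return current  # hold at the current stage instead of advancing
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]

print(next_stage("5%", {"quality": 4.3, "p95_ms": 900, "cost_per_request": 0.02}))  # 25%
```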
### Why It Works
- Validates feasibility in hours, not weeks
- Kills bad ideas cheaply
- Engineers harden proven concepts (not spec-based assumptions)
- The prototype IS the communication; no translation loss
### When NOT to Use It
- Highly regulated features that need compliance review before any deployment
- Features that require deep infrastructure changes before AI is relevant
- When the PM genuinely can't build a prototype (fix this: learn to build)
Full breakdown in [Prototype to Production](/blog-drafts/prototype-to-production).
---
## Framework 3: The Model Selection Matrix
**Use when:** Choosing which LLM(s) to use for a feature, or re-evaluating your current model choice.
### The Problem It Solves
"Our engineer picked GPT-4" isn't a strategy. Model selection has massive product implications: cost, quality, latency, privacy, vendor lock-in. Most teams either default to whatever their ML lead prefers or chase the latest benchmark leader. Both approaches are wrong.
### The Framework
**Step 1: Define Requirements (PM owns this)**
| Requirement | Weight (1-5) | Your Spec |
|-------------|--------------|-----------|
| Quality bar | ? | Minimum acceptable quality score |
| Latency target | ? | p95 target in ms |
| Cost budget | ? | Max $/request at projected scale |
| Privacy constraints | ? | Data residency, no-training clauses |
| Context window needs | ? | Max input size in tokens |
| Multi-modal needs | ? | Text only? Images? Audio? |
| Fine-tuning capability | ? | Need to customize? |
| Vendor diversification | ? | Single vendor OK or need backup? |
Weight each requirement 1-5 based on your specific use case. A customer support bot has different weights than a creative writing tool.
**Step 2: Shortlist Models (Joint PM/Eng)**
Identify 3-5 candidate models based on requirements. Include at least one from each category:
- Frontier (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro)
- Efficient (GPT-4o-mini, Claude Haiku, Gemini Flash)
- Open source (Llama 3.1, Mistral)
**Step 3: Run Comparative Eval (Eng builds, PM designs)**
Same eval suite, all candidate models. Measure quality, latency, cost.
| Model | Quality (1-5) | p95 Latency | Cost/Request | Privacy | Score |
|-------|---------------|-------------|--------------|---------|-------|
| Model A | 4.3 | 800ms | $0.02 | ✅ | ? |
| Model B | 4.6 | 1500ms | $0.05 | ✅ | ? |
| Model C | 3.9 | 400ms | $0.005 | ✅ | ? |
**Step 4: Score and Decide (PM leads)**
Multiply each metric by its weight. Pick the winner. Document the reasoning. Set a review trigger ("re-evaluate when a new model launches or costs change >20%").
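The scoring step can be sketched in a few lines. Note that latency and cost must first be normalized to a "higher is better" 1-5 scale before weighting; all numbers below are made up for illustration:

```python
# Sketch of Step 4: multiply each normalized metric by its weight and sum.
def weighted_score(scores, weights):
    """scores and weights: dicts keyed by requirement name (1-5 scales)."""
    return sum(scores[k] * weights[k] for k in weights)

weights = {"quality": 5, "latency": 3, "cost": 4}
model_a = {"quality": 4.3, "latency": 4.0, "cost": 3.5}  # latency/cost pre-normalized
model_b = {"quality": 4.6, "latency": 2.5, "cost": 2.0}

print(weighted_score(model_a, weights))  # 47.5 -> wins despite lower raw quality
print(weighted_score(model_b, weights))  # 38.5
```

The point of the exercise isn't the arithmetic; it's that the weights force you to write down what actually matters for your use case before the numbers come in.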
**Step 5: Consider Model Routing**
Often the best answer isn't one model; it's routing:
- Simple queries → cheap/fast model
- Complex queries → expensive/quality model
- This optimizes cost without sacrificing quality where it matters
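A toy router makes the idea concrete. The classifier here (word count plus a question mark) is a deliberately crude stand-in; real routers use heuristics, a small classifier model, or both, and the model names are placeholders:

```python
# Toy model router: send short, simple questions to a cheap model and
# everything else to the frontier model.
def route(query, complexity_threshold=20):
    if len(query.split()) < complexity_threshold and "?" in query:
        return "cheap-fast-model"
    return "frontier-model"

print(route("What are your support hours?"))            # cheap-fast-model
print(route("Analyze this contract for liability risk"))  # frontier-model
```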
### Why It Works
- Makes model selection a structured product decision, not an engineering default
- Forces clarity on what matters most for YOUR use case
- Creates an audit trail for the decision
- Makes re-evaluation systematic, not reactive
---
## Framework 4: AI Cost Engineering
**Use when:** Planning, budgeting, or optimizing the cost of AI features.
### The Problem It Solves
AI features have costs that traditional software doesn't:
- Every API call costs money (tokens in, tokens out)
- Costs scale linearly (or worse) with usage
- Users can trigger wildly different costs (a simple question vs a complex analysis)
- Model provider pricing changes without warning
PMs who don't understand AI cost engineering build features that are economically unsustainable at scale.
### The Framework
**Step 1: Map the Cost Drivers**
For each AI feature, identify:
- Input tokens per request (average, p50, p95)
- Output tokens per request (average, p50, p95)
- Model price per token (input and output)
- Requests per user per day
- Number of users
- Any additional costs (embeddings, vector DB, fine-tuning)
**Step 2: Model the Unit Economics**
```
Cost per request = (input_tokens × input_price) + (output_tokens × output_price)
Cost per user per month = cost_per_request × requests_per_user_per_day × 30
AI feature cost per month = cost_per_user × active_users
Revenue per user per month = subscription_price (or ad revenue, etc.)
AI cost as % of revenue = AI_cost / revenue
```
**Healthy range:** AI costs should be 5-15% of revenue per user. Above 20%, you have a margin problem.
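The formulas above drop straight into a runnable back-of-the-envelope calculation. Every number here is an illustrative placeholder, not a quote of any provider's real pricing:

```python
# Runnable version of the unit-economics formulas, with made-up inputs.
input_tokens, output_tokens = 1_500, 400    # avg tokens per request
input_price, output_price = 3e-6, 15e-6     # $/token (example rates only)
requests_per_user_per_day = 10
active_users = 50_000
revenue_per_user = 20.00                    # $/month subscription

cost_per_request = input_tokens * input_price + output_tokens * output_price
cost_per_user = cost_per_request * requests_per_user_per_day * 30
monthly_ai_cost = cost_per_user * active_users
ai_pct_of_revenue = cost_per_user / revenue_per_user

print(f"${cost_per_request:.4f}/request")     # $0.0105/request
print(f"${cost_per_user:.2f}/user/month")     # $3.15/user/month
print(f"{ai_pct_of_revenue:.0%} of revenue")  # 16% of revenue
```

At 16% of revenue, this hypothetical feature is already near the top of the healthy range, which is exactly the kind of thing you want to discover in a spreadsheet rather than a billing alert.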
**Step 3: Identify Optimization Levers**
| Lever | Impact | Effort |
|-------|--------|--------|
| Prompt optimization (shorter prompts, fewer tokens) | 20-40% cost reduction | Low |
| Caching (same questions = cached answers) | 30-60% cost reduction | Medium |
| Model routing (cheap model for simple, expensive for complex) | 40-60% cost reduction | Medium |
| Switching to cheaper model | 50-80% cost reduction | Medium |
| Self-hosting open source | 60-80% cost reduction | High |
| Fine-tuning smaller model | 50-70% cost reduction | High |
**Step 4: Set Cost Guardrails**
- Maximum cost per request (kill switch if exceeded)
- Monthly budget cap with alerting
- Cost per user trending (catch runaway costs early)
- Token usage monitoring by feature
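The per-request kill switch above can be sketched as a pre-flight budget check: estimate the cost before calling the model and refuse requests that would blow the cap. The cap and prices are illustrative assumptions:

```python
# Sketch of the per-request kill switch: reject requests whose estimated
# cost exceeds the cap before any tokens are spent.
MAX_COST_PER_REQUEST = 0.25  # illustrative cap in dollars

def within_budget(est_input_tokens, est_output_tokens,
                  input_price=3e-6, output_price=15e-6):
    est_cost = est_input_tokens * input_price + est_output_tokens * output_price
    return est_cost <= MAX_COST_PER_REQUEST

print(within_budget(10_000, 2_000))    # True  (~$0.06)
print(within_budget(200_000, 20_000))  # False (~$0.90)
```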
**Step 5: Build Cost into Product Decisions**
Every feature decision should include:
- Cost at current scale
- Cost at 10x scale
- Cost optimization path
- Margin impact
### Why It Works
- Prevents the "we built it, it works great, we can't afford it" trap
- Makes cost a first-class product consideration
- Gives engineering clear optimization targets
- Enables informed decisions about pricing and packaging
---
## Framework 5: Hallucination Risk Assessment
**Use when:** Designing any AI feature that generates text, recommendations, or decisions.
### The Problem It Solves
All LLMs hallucinate. The question isn't "will it hallucinate?" but "what happens when it does?" Different features have wildly different hallucination risk profiles, and they need different mitigation strategies.
### The Framework
**Step 1: Assess Impact Severity**
Rate the cost of a hallucination for your feature:
| Severity | Description | Examples |
|----------|-------------|---------|
| **Critical** | Physical, legal, or financial harm | Medical advice, legal guidance, financial transactions |
| **High** | Business damage, trust destruction | Customer-facing factual claims, product recommendations |
| **Medium** | User frustration, rework needed | Writing assistance, summarization, search results |
| **Low** | Minor inconvenience | Creative suggestions, brainstorming, internal tools |
**Step 2: Assess Detection Difficulty**
How easily can the user or system detect a hallucination?
| Detection | Description | Mitigation Implication |
|-----------|-------------|----------------------|
| **Easy** | User immediately knows it's wrong | Design for easy correction |
| **Moderate** | User can verify with effort | Provide sources, encourage verification |
| **Hard** | User can't easily tell | Require human review, constrain outputs |
| **Near-impossible** | Wrong info looks completely plausible | May not be suitable for AI |
**Step 3: Map to Mitigation Strategy**
| Impact ร Detection | Strategy |
|--------------------|----------|
| Low impact, easy detect | Ship with basic guardrails |
| Low impact, hard detect | Add confidence indicators |
| Medium impact, easy detect | RAG + source citations |
| Medium impact, hard detect | RAG + human spot-checks + monitoring |
| High impact, easy detect | RAG + constrained outputs + user confirmation |
| High impact, hard detect | Human-in-the-loop required |
| Critical impact, any detection | Reconsider whether AI is appropriate |
**Step 4: Define the Hallucination Budget**
Based on the risk profile:
- What hallucination rate is acceptable? (0.1%? 1%? 5%?)
- How do you measure it? (Automated checks, human review, user reports)
- What triggers a rollback? (Hallucination rate exceeds X%)
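The budget-and-rollback logic is simple enough to sketch directly. How outputs get flagged (automated checks, human review, user reports) is the hard part; this sketch just shows the decision rule, with an assumed 1% budget:

```python
# Sketch of the hallucination-budget rollback trigger: labeled samples
# from production are compared against the acceptable rate.
def rollback_needed(flagged, sampled, budget=0.01):
    """budget: max acceptable hallucination rate (1% here)."""
    return flagged / sampled > budget

print(rollback_needed(flagged=3, sampled=1000))   # False (0.3%, within budget)
print(rollback_needed(flagged=25, sampled=1000))  # True  (2.5%, roll back)
```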
**Step 5: Monitor and Respond**
- Track hallucination rate in production
- Sample outputs for human review
- Create incident response plan for hallucination-caused harm
- Feed failures back into eval suite
### Why It Works
- Prevents both under-investment (shipping dangerous features) and over-investment (gold-plating low-risk features)
- Creates a shared language with stakeholders about risk
- Makes "is this safe to ship?" a structured decision
- Scales across features with different risk profiles
For the full hallucination playbook, see [Shipping AI Features That Don't Hallucinate](/blog-drafts/shipping-ai-features-that-dont-hallucinate).
---
## How These Frameworks Work Together
These 5 AI PM frameworks aren't isolated. They form a system:
1. **Eval-First** defines what "good" looks like
2. **Prototype-to-Production** gets you there fast
3. **Model Selection Matrix** picks the right AI engine
4. **Cost Engineering** keeps it economically viable
5. **Hallucination Risk** keeps it safe
For any new AI feature, run through all five. The order depends on context, but you'll use all of them.
---
## Try This Week
Pick the framework that addresses your biggest current pain point. If you're not sure which, start with Eval-First โ it's the foundation everything else builds on. Take one AI feature you're working on (or thinking about) and apply the framework. Write down what you learn. That exercise alone will teach you more than any AI PM course.
---
## Keep Building
**Subscribe to PM the Builder** for weekly AI PM frameworks, tactics, and real examples from someone shipping AI at scale. No consulting-deck nonsense. Just stuff that works.
[Subscribe at pmthebuilder.com]