Model Selection Isn't a Technical Decision
Why Model Selection Is a PM Concern
Let me show you why this matters:
Cost Structure
Different models have vastly different costs:
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| GPT-4 Turbo | $10 | $30 |
| Claude 3.5 Sonnet | $3 | $15 |
| GPT-3.5 Turbo | $0.50 | $1.50 |
| Llama 3.1 (self-hosted) | ~$1-2 | ~$1-2 |
At scale, these differences are millions of dollars annually. That's a product economics question, not a technical one.
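To make the economics concrete, here's a back-of-envelope calculation using the illustrative per-1M-token prices from the table above (always check current provider pricing; these change often):

```python
# Illustrative per-1M-token prices from the table above: (input $, output $).
PRICING = {
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
}

def monthly_cost(model, requests_per_month, input_tokens, output_tokens):
    """Estimated monthly spend for a given traffic profile."""
    in_price, out_price = PRICING[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_month

# 10M requests/month, ~1,500 input and ~500 output tokens per request:
for model in PRICING:
    print(f"{model}: ${monthly_cost(model, 10_000_000, 1500, 500):,.0f}/month")
```

At that traffic profile, the gap between the cheapest and most expensive option is hundreds of thousands of dollars per month. That's the conversation PM needs to be in.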
Quality Tradeoffs
Models excel at different things:
- Claude: Following complex instructions, nuanced analysis, safety
- GPT-4o: Multi-modal, general reasoning, code
- Gemini 1.5: Long context, video understanding
- Llama: Self-hosting, customization, cost
Which tradeoffs matter depends on your use case. That's a product question.
Latency
Model latency affects user experience:
- Fast models (GPT-4o, Claude Sonnet): 500ms-1s typical
- Slower models (GPT-4, Claude Opus): 2-5s typical
- Self-hosted: Variable based on infrastructure
For real-time features, latency determines UX. For batch processing, it doesn't matter. Product decides.
Privacy
Data handling varies by provider:
- Do they train on your data?
- Where is data processed geographically?
- What's their retention policy?
- Can you get a HIPAA BAA?
For healthcare, finance, or privacy-sensitive products, this determines which models are even legal to use.
Lock-in
Switching models has costs:
- Prompts may not transfer cleanly
- Output formats differ
- Fine-tuning is model-specific
- Integration points vary
The choice of model creates dependencies. PM should understand these.
The PM's Model Selection Framework
Here's how I approach model selection:
Step 1: Define Requirements (PM-Led)
Before any model comparison, define what you need:
Quality requirements:
- What task is the model doing?
- What does "good enough" look like?
- What's unacceptable?
Performance requirements:
- Target latency (p50, p95)
- Throughput needs
- Availability requirements
Cost constraints:
- Budget per request
- Monthly budget cap
- Cost at projected scale
Privacy/compliance requirements:
- Data sensitivity level
- Regulatory requirements
- Acceptable data processing locations
Strategic requirements:
- How important is avoiding vendor lock-in?
- Do we need fine-tuning capability?
- Do we need multi-modal?
Write these down BEFORE comparing models. Otherwise you'll optimize for the wrong things.
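One way to force this discipline is to capture the requirements in a structured artifact engineering can test against. Here's a minimal sketch; the fields and example values are hypothetical, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class ModelRequirements:
    """Written down before any model comparison. All values are examples."""
    task: str
    min_quality_score: float        # fraction of eval cases that must pass
    p50_latency_ms: int
    p95_latency_ms: int
    max_cost_per_request: float     # USD
    monthly_budget_cap: float       # USD
    data_sensitivity: str           # e.g. "PII", "PHI", "none"
    required_certifications: list = field(default_factory=list)
    needs_fine_tuning: bool = False
    needs_multimodal: bool = False

reqs = ModelRequirements(
    task="support ticket triage",
    min_quality_score=0.90,
    p50_latency_ms=1000,
    p95_latency_ms=3000,
    max_cost_per_request=0.02,
    monthly_budget_cap=50_000,
    data_sensitivity="PII",
    required_certifications=["SOC 2"],
)
```

The format matters less than the act of committing to numbers. "Fast enough" is not a requirement; "p50 under 1 second" is.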
Step 2: Shortlist Models (Joint PM/Eng)
Based on requirements, identify candidate models.
The usual suspects:
- OpenAI: GPT-4o, GPT-4 Turbo, GPT-3.5
- Anthropic: Claude 3 Opus, Claude 3.5 Sonnet
- Google: Gemini 1.5 Pro, Gemini 1.5 Flash
- Meta: Llama 3.1 (various sizes)
- Mistral: Mistral Large, Mistral Small, Mixtral
Don't limit yourself to one provider. Compare across vendors.
Step 3: Run Comparative Evals (Eng-Led, PM-Designed)
Here's where engineering does the work, but PM designs the test:
Create an eval set:
- 50-100 real examples from your use case
- Cover edge cases and adversarial inputs
- Define clear scoring criteria
Run each candidate:
- Same prompts, same examples
- Measure quality scores
- Measure latency
- Calculate cost
Compare results:
| Model | Quality Score | p50 Latency | Cost/Request |
|---|---|---|---|
| Model A | 87% | 800ms | $0.02 |
| Model B | 91% | 1200ms | $0.05 |
| Model C | 85% | 600ms | $0.01 |
Now you have data.
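The comparison step can be as simple as filtering eval results against your hard requirements, then ranking the survivors. A sketch, using the numbers from the table above:

```python
# Eval results mirroring the comparison table above.
results = [
    {"model": "Model A", "quality": 0.87, "p50_ms": 800,  "cost": 0.02},
    {"model": "Model B", "quality": 0.91, "p50_ms": 1200, "cost": 0.05},
    {"model": "Model C", "quality": 0.85, "p50_ms": 600,  "cost": 0.01},
]

def shortlist(results, min_quality, max_p50_ms, max_cost):
    """Drop candidates that miss any hard requirement; rank the rest cheapest first."""
    passing = [r for r in results
               if r["quality"] >= min_quality
               and r["p50_ms"] <= max_p50_ms
               and r["cost"] <= max_cost]
    return sorted(passing, key=lambda r: r["cost"])

# Require >=85% quality, sub-second p50, and <=$0.03/request:
print(shortlist(results, 0.85, 1000, 0.03))
```

Notice what happens: the highest-quality model (B) gets eliminated by latency and cost constraints. That's exactly the kind of tradeoff the requirements doc exists to surface before anyone gets attached to a model.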
Step 4: Make the Decision (PM-Led)
With data in hand, PM makes the call:
If quality differences are small: Optimize for cost or latency
If quality differences are large: Pay for quality if business case supports it
If privacy constraints exist: Eliminate non-compliant options
If lock-in matters: Favor standards-based or open-source options
Document the decision and the reasoning. You'll revisit this.
The Model Selection Checklist
Use this for any model selection decision:
Requirements Clarity:
- Quality bar defined with specific criteria
- Latency requirements specified
- Cost budget established
- Privacy/compliance needs documented
- Strategic considerations (lock-in, customization) identified
Evaluation Rigor:
- Multiple models compared
- Real use case examples in eval set
- Quality scoring methodology defined
- Latency measured under realistic conditions
- Cost calculated at projected scale
Decision Quality:
- Data-driven comparison completed
- Tradeoffs explicitly acknowledged
- Decision documented with reasoning
- Fallback/migration plan considered
- Review trigger defined (when to reconsider)
Multi-Model Strategies
Here's where it gets interesting: you don't have to pick one.
Model Routing: Use different models for different request types.
- Simple queries → cheap/fast model (GPT-3.5, Haiku)
- Complex queries → premium model (GPT-4, Opus)
Route based on query complexity. Reduce cost without sacrificing quality where it matters.
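A minimal routing sketch, assuming a crude keyword-and-length heuristic as the classifier (production routers typically use a small classifier model or embeddings instead):

```python
def classify(query: str) -> str:
    """Toy complexity heuristic; a stand-in for a real classifier."""
    hard_signals = ["compare", "analyze", "explain why", "step by step"]
    if len(query.split()) > 50 or any(s in query.lower() for s in hard_signals):
        return "complex"
    return "simple"

# Route table: which model handles which tier.
ROUTES = {"simple": "gpt-3.5-turbo", "complex": "gpt-4o"}

def route(query: str) -> str:
    return ROUTES[classify(query)]

print(route("What are your support hours?"))        # short, no hard signals
print(route("Compare these two contract clauses"))  # hard signal -> premium
```

Even a crude router like this can shift a large share of traffic to the cheap tier; the eval set from Step 3 tells you whether quality holds up.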
Fallback Chains: Primary model fails or is slow → fall back to an alternative.
Improves reliability. Reduces dependency on single provider.
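The fallback pattern is a simple loop over providers in priority order. A sketch with stand-in callables in place of real API clients:

```python
def with_fallback(providers, prompt):
    """Return (provider_name, response) from the first provider that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as e:  # in production, narrow to timeout/API errors
            errors.append((name, e))
    raise RuntimeError(f"All providers failed: {errors}")

# Stand-ins for real provider clients:
def flaky_primary(prompt):
    raise TimeoutError("primary exceeded latency budget")

def stable_backup(prompt):
    return f"answer to: {prompt}"

name, answer = with_fallback(
    [("primary", flaky_primary), ("backup", stable_backup)], "hello")
print(name)  # the backup handled the request
```

One design note: fallbacks only work if your prompts produce acceptable output on both models, which is itself an argument for keeping prompts portable.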
A/B Testing: Run different models for different user segments.
Learn which model performs better for your specific use case.
Ensemble: Multiple models vote or verify each other.
Improves quality for high-stakes decisions.
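For classification-style decisions, the simplest ensemble is a majority vote across models. A sketch with hard-coded stand-ins for model outputs:

```python
from collections import Counter

def majority_vote(labels):
    """Most common label across model outputs (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]

# One element per model's verdict on the same input:
votes = ["approve", "approve", "reject"]
print(majority_vote(votes))
```

The cost multiplies by the number of models, so reserve this for decisions where an error is expensive.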
PM should push for multi-model architecture when it makes sense. Single-model dependency is a risk.
The "But Engineering Said X" Conversation
What do you do when engineering has already picked a model?
Don't: Challenge the decision confrontationally.
Do: Ask good questions.
"Help me understand the model selection. I want to make sure I can defend it to stakeholders."
- What options did we consider?
- What were the quality scores on our use case?
- What's the cost trajectory as we scale?
- What are the lock-in implications?
- What would trigger us to reconsider?
If the answers are solid, great. If the answers are "it's what we know" or "it's the best," dig deeper.
When to Reconsider Model Selection
Model selection isn't permanent. Revisit when:
Cost changes: Your costs spike, or a provider changes pricing.
Quality changes: New models are released (it happens constantly), and your model is no longer best-in-class.
Requirements change: You need longer context, multi-modal, or different capabilities.
Scale changes: Volume justifies self-hosting what you're buying via API.
Provider issues: Reliability problems, deprecation announcements, policy changes.
Build in regular model reviews (quarterly) to avoid complacency.
The Strategic View
Model selection is product strategy.
Commodity AI: Use APIs, optimize for cost, accept some vendor dependency. AI is a feature, not the differentiator.
Competitive AI: Customize models, prioritize quality, invest in differentiation. AI is core to the product.
Regulated AI: Prioritize compliance, accept cost premiums, prefer self-hosted or compliant providers. Constraints dominate.
Know which category you're in. Let that drive model selection philosophy.
The Conversation with Engineering
When you engage engineering on model selection:
Come prepared with:
- Requirements document (quality, latency, cost, privacy)
- Business context (why these requirements matter)
- Questions, not demands
Ask for:
- Comparative eval data
- Cost projections at scale
- Lock-in assessment
- Maintenance implications
Collaborate on:
- Eval set design (you know the use cases)
- Tradeoff decisions (you own the business case)
- Review cadence (you'll know when requirements change)
Own:
- The final decision (within your scope)
- Communicating rationale to stakeholders
- Revisiting when circumstances change
Model selection is a joint effort, but PM should drive the process, not just receive the output.
Key Takeaways
Model selection has product implications: cost, quality, latency, privacy, and lock-in are all PM concerns
Define requirements before comparing: know what you need, then evaluate; don't let engineering optimize for the wrong criteria
Consider multi-model strategies: routing, fallbacks, and A/B testing can optimize better than any single model choice