About this article
As an addendum to the "Case Studies" category in the series "Architecture Crash Course for the Generative-AI Era," this article covers the AI-product startup case.
Products built around LLMs (Large Language Models) introduce four new concerns: model selection, inference cost, hallucination, and Eval (evaluation). This article organizes model-vendor selection, RAG design, inference-cost management, data-setup priorities, and AI-specific evaluation design.
Other articles in this category
4 main AI-product topics
```mermaid
flowchart TB
AI[AI-product startup]
A[1. Model selection<br/>Claude/GPT/Gemini/OSS]
B[2. Inference-cost management<br/>$/1M tokens and rate limits]
C[3. Hallucination countermeasures<br/>RAG, citations, Guardrails]
D[4. Eval / evaluation design<br/>quantitative-qualitative loop]
AI --> A
AI --> B
AI --> C
AI --> D
classDef root fill:#fef3c7,stroke:#d97706,stroke-width:2px;
classDef axis fill:#dbeafe,stroke:#2563eb;
class AI root;
class A,B,C,D axis;
```
Model-selection decision axes
The table below shows the 2026 positioning of the major LLMs, differentiated by use case, cost, and lock-in resistance.
| Model | Strengths | Weaknesses | Where to use |
|---|---|---|---|
| Claude (Anthropic) | Long-form writing, coding, reasoning accuracy | Image generation needs a separate service | Business tools, agents |
| GPT (OpenAI) | Largest ecosystem, multimodal | Periods of strict rate limits | General-purpose, images |
| Gemini (Google) | 1M-token context, low cost | Maturity around agents | Mass document processing |
| OSS (Llama / Mistral, etc.) | Self-hostable, no data leaves your infrastructure | High operational cost, quality lags the frontier | Confidential data, regulated industries |
| Azure OpenAI Service | Enterprise contracts, SLA, data residency | Higher than direct OpenAI contract | Large-enterprise ChatGPT use |
For a new startup, the realistic answer is Claude or GPT as primary, Gemini for long-context jobs, and OSS only where regulation demands it. Pre-emptively building your own "multi-model abstraction" is the classic over-design; a thin router such as LiteLLM, or a few dozen lines of your own (sketched below), is enough.
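LiteLLM itself is a Python library; if your stack is TypeScript, the same idea fits in one function. A minimal sketch assuming the official Anthropic and OpenAI SDKs; the model names and the `complex` flag are illustrative, not a fixed recommendation.

```typescript
// Thin model router: cheap model for small tasks, strong model for complex
// ones, with a cross-provider fallback. Model names are illustrative.
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY
const openai = new OpenAI();       // reads OPENAI_API_KEY

type Task = { prompt: string; complex: boolean };

async function complete(task: Task): Promise<string> {
  try {
    const res = await anthropic.messages.create({
      model: task.complex ? "claude-opus-4-0" : "claude-3-5-haiku-latest",
      max_tokens: 1024,
      messages: [{ role: "user", content: task.prompt }],
    });
    const block = res.content[0];
    return block.type === "text" ? block.text : "";
  } catch {
    // Fall back to the secondary provider on rate limits or outages.
    const res = await openai.chat.completions.create({
      model: task.complex ? "gpt-4o" : "gpt-4o-mini",
      messages: [{ role: "user", content: task.prompt }],
    });
    return res.choices[0].message.content ?? "";
  }
}
```

The point is that this is the entire abstraction: one function, one fallback. Anything thicker should wait until a second provider is actually in production use.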
Inference-cost structure
LLM cost scales multiplicatively: tokens per request × requests per user × user count. It is not rare for a bill in the tens of thousands of yen per month at MVP to balloon to tens of millions of yen per month after PMF.
| Phase | Monthly-cost guideline | Main measures |
|---|---|---|
| MVP (~1,000 users) | Thousands to tens of thousands of yen | Prompt compression, caching |
| Growth (~100k users) | Hundreds of thousands to millions of yen | Prompt caching, model branching |
| Scale (100k+ users) | Millions to tens of millions of yen | Multi-model split, in-house fine-tuning |
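To make the scaling concrete, a back-of-the-envelope estimator; the per-1M-token prices are illustrative placeholders, not current list prices:

```typescript
// Back-of-the-envelope monthly inference cost. Prices are illustrative
// placeholders in USD per 1M tokens; plug in your provider's current rates.
const PRICE = { inPerM: 3.0, outPerM: 15.0 };

function monthlyCostUsd(
  inTokens: number,  // input tokens per request
  outTokens: number, // output tokens per request
  requests: number,  // requests per month
): number {
  return (inTokens * PRICE.inPerM + outTokens * PRICE.outPerM) * requests / 1_000_000;
}

// A 2k-token prompt with a 500-token answer at 100k requests/month
// already lands around $1,350/month at these rates.
console.log(monthlyCostUsd(2_000, 500, 100_000)); // 1350
```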
Three concrete measures work:
- Prompt caching (an official Claude/GPT feature) cuts the cost of repeated prompt sections by up to 90%
- Model branching (small tasks on Haiku/GPT-mini, complex tasks on Opus/GPT-4-class models)
- Context compression via RAG (Top-K extraction instead of passing full documents)
"For now, process everything with the strongest model" is the classic road to a funding shortfall.
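As a sketch of the first measure, Anthropic's prompt caching marks a large stable prefix (system prompt, reference docs) as cacheable so repeated calls reuse it; the model name and prompt contents below are placeholders:

```typescript
// Prompt caching sketch (Anthropic Messages API): the large, stable system
// prompt is marked cacheable, so on cache hits only the per-request user
// turn is charged at the full input price.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const LONG_SYSTEM_PROMPT = "...several thousand tokens of product docs...";

async function answer(question: string) {
  return client.messages.create({
    model: "claude-3-5-haiku-latest", // illustrative model name
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: LONG_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // cache this prefix
      },
    ],
    messages: [{ role: "user", content: question }],
  });
}
```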
RAG (Retrieval-Augmented Generation) design
When handling internal docs, FAQs, product masters, and the like, RAG is effectively the standard. Having the LLM alone "hold everything" runs into walls of accuracy, cost, and updatability.
```mermaid
flowchart LR
Q([User question])
EMB[Embedding vectorization]
VDB[(Vector DB<br/>pgvector / Pinecone)]
TOPK[Top-K document extraction]
LLM[LLM<br/>+ context injection]
A([Answer + citations])
Q --> EMB --> VDB --> TOPK --> LLM --> A
classDef q fill:#fef3c7,stroke:#d97706;
classDef ret fill:#dbeafe,stroke:#2563eb;
classDef llm fill:#fae8ff,stroke:#a21caf;
class Q,A q;
class EMB,VDB,TOPK ret;
class LLM llm;
```
| Component | Recommended |
|---|---|
| Vector DB | pgvector (PostgreSQL extension) / Pinecone / Weaviate |
| Embedding model | text-embedding-3-small / OSS bge-m3 |
| Search | Hybrid (vector + BM25) |
| Reranker | Cohere Rerank / in-house Cross-Encoder |
| Citation display | Required (hallucination suppression + trust building) |
In 2026, the standard upgrade path away from "for now, throw everything at the ChatGPT API" is to start with pgvector.
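A minimal retrieval sketch under the stack above, using text-embedding-3-small and pgvector's cosine-distance operator; the `documents` table schema and the Top-K default are assumptions:

```typescript
// Top-K retrieval with pgvector: embed the question, then rank by cosine
// distance (<=>) against a pre-embedded table. The assumed schema is
// documents(id, body, embedding vector(1536)).
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool(); // reads PG* env vars

async function retrieve(question: string, topK = 5) {
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  // pgvector accepts a '[0.1,0.2,...]' literal; JSON.stringify produces it.
  const vector = JSON.stringify(emb.data[0].embedding);
  const { rows } = await pool.query(
    `SELECT id, body, embedding <=> $1::vector AS distance
       FROM documents
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [vector, topK],
  );
  return rows; // inject rows[].body into the LLM context, cite rows[].id
}
```

Hybrid search and reranking from the table above slot in after this step: run a BM25 query in parallel, merge the candidate sets, and rerank before injection.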
Hallucination and guardrails
LLMs generate "plausible-sounding lies." Shipping them in a product requires structurally suppressing this.
| Countermeasure | Content |
|---|---|
| RAG + citations | Always present source documents; don't answer without a citation |
| Structured Output | Bind output with a JSON Schema |
| Guardrails layer | Filters for personal info, abuse, banned topics |
| Confidence display | Make "low confidence" explicit and escalate |
| Human-in-the-loop | Insert human review for important decisions |
For critical domains such as medicine, law, and finance, Human-in-the-loop is mandatory in principle. Lean toward designs where the AI only drafts and a human makes the final decision.
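A sketch combining two rows of the table, Structured Output plus confidence-based escalation, using the OpenAI SDK's zod helper; the schema fields and the 0.7 threshold are assumptions, not fixed values:

```typescript
// Structured Output sketch: bind the answer to a schema that forces a
// citation list and a confidence score, then escalate low-confidence or
// uncited answers to human review.
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const Answer = z.object({
  answer: z.string(),
  citations: z.array(z.string()), // source document IDs; empty => refuse
  confidence: z.number(),         // model-reported confidence in [0, 1]
});

const client = new OpenAI();

async function ask(question: string, context: string) {
  const res = await client.beta.chat.completions.parse({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: `Answer only from this context:\n${context}` },
      { role: "user", content: question },
    ],
    response_format: zodResponseFormat(Answer, "answer"),
  });
  const parsed = res.choices[0].message.parsed;
  if (!parsed || parsed.citations.length === 0 || parsed.confidence < 0.7) {
    return { escalate: true as const }; // route to human-in-the-loop
  }
  return { escalate: false as const, ...parsed };
}
```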
Eval (evaluation) design
AI products can degrade in quality even while "looking like they're running." Without Eval built into CI/CD, an AI product never surfaces its production failures.
| Eval type | Content | Tools |
|---|---|---|
| Offline Eval | Auto-evaluate on past-question / expected-answer sets | Promptfoo / OpenAI Evals / LangSmith |
| Online A/B | Compare multiple prompts / models in production | LaunchDarkly + in-house |
| LLM-as-a-Judge | A stronger, separate LLM scores the outputs | GPT-4o / Claude Opus |
| Manual Eval | Sample outputs and have humans evaluate them | In-house + annotation tools |
| User feedback | Collect 👍/👎 buttons and comments | Sentry / PostHog |
Offline Eval should run automatically on every PR (a sketch follows). Without automated regression detection, LLM operations break down.
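Promptfoo and LangSmith (recommended below) cover this, but the core loop is small enough to sketch. A minimal CI-runnable offline Eval with an LLM-as-a-Judge scorer; the dataset path, model names, and the 0.8 pass bar are all assumptions:

```typescript
// Minimal offline Eval for CI: run a fixed QA set through the product
// model, score each answer with a stronger judge model, exit non-zero
// on regression so the PR fails.
import OpenAI from "openai";
import { readFileSync } from "node:fs";

const client = new OpenAI();
type Case = { question: string; expected: string };

async function judge(q: string, expected: string, actual: string): Promise<number> {
  const res = await client.chat.completions.create({
    model: "gpt-4o", // stronger judge model
    messages: [{
      role: "user",
      content: `Question: ${q}\nExpected: ${expected}\nActual: ${actual}\n` +
        `Score the actual answer from 0.0 to 1.0 for factual agreement. Reply with the number only.`,
    }],
  });
  return parseFloat(res.choices[0].message.content ?? "0");
}

async function main() {
  const cases: Case[] = JSON.parse(readFileSync("evals/qa.json", "utf8"));
  let total = 0;
  for (const c of cases) {
    const res = await client.chat.completions.create({
      model: "gpt-4o-mini", // product model under test
      messages: [{ role: "user", content: c.question }],
    });
    total += await judge(c.question, c.expected, res.choices[0].message.content ?? "");
  }
  const avg = total / cases.length;
  console.log(`Eval average: ${avg.toFixed(2)}`);
  if (avg < 0.8) process.exit(1); // fail the PR on regression
}

main();
```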
Recommended stack
| Area | Recommended |
|---|---|
| LLM API | Claude API / OpenAI API (multiple simultaneous contracts recommended) |
| Agent foundation | LangGraph / Mastra / in-house |
| Vector DB | pgvector (small-mid) / Pinecone (scale) |
| Eval | Promptfoo + LangSmith |
| Observability (LLM-specific) | LangSmith / Helicone / Phoenix |
| Prompt management | Git-managed + templates |
| Billing (user-side) | Stripe + usage metrics |
| Language | TypeScript / Python |
| Hosting | Vercel / Cloud Run / Modal |
What to decide - what is your projectâs answer?
| Item | Choice examples |
|---|---|
| Primary model | Claude / GPT / Gemini / OSS |
| Secondary model (cost-optimal) | Haiku / GPT-mini / Gemini Flash |
| Vector DB | pgvector / Pinecone / Weaviate |
| Hallucination countermeasure | RAG + citations / Structured Output / human review |
| Eval strategy | Offline / A/B / LLM-as-Judge |
| Data-residency requirement | Direct API / Azure OpenAI / OSS self-hosted |
| Billing model | Subscription / usage-based / hybrid |
| Prompt management | Git-managed / dedicated SaaS |
AI-product-specific pitfalls
| Forbidden move | Why itâs bad |
|---|---|
| Pre-emptively self-building a model abstraction | Same trap as LangChain abuse; a thin router is enough |
| "For now, the strongest model" for all tasks | 10x inference cost; most tasks are handled fine by Haiku/Flash |
| Production operation without Eval | Quality degradation is invisible when prompts change |
| Free-form output without citations | Trust collapses at the first hallucination; SLA-violation risk |
| Personal info / secrets directly in prompts | Data-residency / training risk; masking is required |
| Synchronous-only LLM calls | Stuck on rate limits / timeouts; a job queue is required |
| Production agents without runaway control | Infinite loops blow up the bill; always set max steps / budget caps (sketch after this table) |
| Building RAG while postponing data quality | Garbage data → garbage answers; data setup comes first |
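For the runaway-control row, a minimal sketch of hard caps around an agent loop; the `step()` interface and the default caps are assumptions:

```typescript
// Runaway control sketch for an agent loop: hard caps on both step count
// and spend, so an infinite loop can never blow up the bill.
type StepResult = { done: boolean; costUsd: number };

async function runAgent(
  step: () => Promise<StepResult>, // one tool call / LLM turn
  maxSteps = 20,
  maxUsd = 1.0,
) {
  let spent = 0;
  for (let i = 0; i < maxSteps; i++) {
    const r = await step();
    spent += r.costUsd;
    if (r.done) return { ok: true, steps: i + 1, spent };
    if (spent > maxUsd) break; // budget cap hit
  }
  return { ok: false, spent }; // abort: escalate or fail loudly
}
```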
The AI-era perspective
For AI products, being "favored in the AI era" is not a side concern; it is the product's essence itself.
| Effective | Ineffective |
|---|---|
| Data quality / organization | "For now, smash it with the model" |
| Git-managed prompts / Eval | Editing prompts in an admin screen, losing history |
| Prompt caching / model branching | Strongest model for all tasks |
| Structured Output / with citations | Leave all to free-form |
| User-feedback loop | One-way release |
If data setup lags, even dropping in the strongest model later won't raise accuracy. Investment in data architecture directly sets the AI product's ceiling.
How to make the final call
The core of an AI product is the recognition that you "win not on the model but on the data and the evaluation loop." Today's strongest model will be outdated in time, but organized data and a running Eval loop are assets that compound.
The other axis is designing the cost structure from the start. Prioritizing "it works for now" at MVP and throwing everything at the strongest model leads to post-PMF bills that shake the business. Build in three things from day one: prompt caching, model branching, and context compression via RAG.
Selection priorities
- Data setup and Eval as top priority - before model selection
- Models with Claude/GPT primary + secondary models - multi-provider for fault tolerance
- RAG + citations - standard equipment for hallucination suppression
- Cost-management mechanisms from the start - cache, branching, cap settings
"Models commoditize; victory lies in data and evaluation" is the 2026 AI-product standard.
Summary
This article covered the AI-product startup case, including model selection, inference cost, RAG, hallucination, and Eval.
Win with data and the evaluation loop, hedge across multiple model providers, build trust with RAG and citations, and design the cost structure from the start. That is the practical answer for AI products in 2026.
With this, the three addendum articles (public, mobile, AI-product) of the "Case Studies" category are complete, letting you position your project on both the scale axis (startup / saas / enterprise) and the industry axis (public / mobile / ai-product).
I hope you'll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (86/89)