About this article
As an addendum to the “Case Studies” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains the AI-product startup case.
Products built around LLMs (Large Language Models) bring four new concerns: model selection, inference cost, hallucination, and Eval (evaluation). This article covers model-vendor selection, RAG design, inference-cost management, data-preparation priorities, and AI-specific evaluation design.
What is AI product architecture, anyway?
Picture an AI-powered home appliance. A traditional appliance gives the same result every time you push a button, but an AI appliance produces different output depending on the input, occasionally makes mistakes, and costs electricity every time it runs. These traits are fundamentally different from what came before.
AI product architecture is the same: model selection, inference cost, hallucination countermeasures, and evaluation design are four considerations that did not exist in traditional software, and all four must be built into the design.
If you build with the same mindset as a traditional web app, you fall into AI-specific traps: going into the red on inference costs, or losing user trust to hallucinations.
Why AI products need special design
Because new concerns that didn’t exist before have been added
Traditional web apps followed a deterministic “input -> process -> output” flow, but incorporating an LLM adds fundamentally different traits: output varies from call to call, is occasionally wrong (hallucinations), and incurs cost on every invocation. Unless these are woven into the architecture, both quality and cost break down.
Because inference cost explodes alongside scale
Traditional apps see infrastructure costs rise gently with user growth, but in LLM-based products cost grows in proportion to user count times tokens consumed per user, so the bill climbs far more steeply. Monthly costs of a few tens of thousands of yen at MVP inflating to tens of millions post-PMF are not uncommon. Prompt caching, model branching, and context compression via RAG need to be built into the design from the start.
Because quality evaluation methods are fundamentally different from before
Traditional tests could judge by “does it match the expected value,” but LLM output has no single correct answer, making Eval design itself a new technical challenge. An evaluation pipeline combining human evaluation, LLM-as-a-Judge, and statistical metrics becomes necessary.
4 main AI-product topics
Model-selection decision axes
The table below shows the positioning of the major LLMs as of 2026. Choose among them by use case, cost, and resistance to lock-in.
| Model | Strengths | Weaknesses | Where to use |
|---|---|---|---|
| Claude (Anthropic) | Long-form, coding, reasoning accuracy | Image generation needs a separate service | Business tools, agents |
| GPT (OpenAI) | Largest ecosystem, multimodal | Rate limits are strict at times | General-purpose, images |
| Gemini (Google) | 1M-token context, low cost | Agent tooling is less mature | Mass document processing |
| OSS (Llama / Mistral etc.) | Self-hosted, no data outflow | High operational cost, models lag behind | Confidential data, regulated industries |
| Azure OpenAI Service | Enterprise contracts, SLA, data residency | Costs more than a direct OpenAI contract | Large-enterprise ChatGPT use |
The realistic answer for a new startup is Claude or GPT as the primary model, Gemini for long-document jobs, and OSS only for regulated workloads. Building your own multi-model abstraction layer in advance is the typical over-design; a thin router such as LiteLLM is enough.
Inference-cost structure
LLM cost scales with tokens per request (input and output, each priced separately) x requests per user x user count. It is not rare for a monthly bill of tens of thousands of yen at MVP to balloon to tens of millions of yen post-PMF.
| Phase | Monthly-cost guideline | Main measures |
|---|---|---|
| MVP (~1,000 users) | Thousands to tens of thousands of yen | Prompt compression, cache |
| Growth (~100k users) | Hundreds of thousands to millions of yen | Prompt caching, model branching |
| Scale (100k+ users) | Millions to tens of millions of yen | Multi-model split, in-house fine-tuning |
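The cost structure above comes down to per-request token cost multiplied by request volume. The following is a minimal sketch of that arithmetic. The prices and the usage profile are hypothetical placeholders, not any vendor's actual rates, so substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical per-request usage profile (all values are assumptions)."""
    input_tokens: int        # prompt plus retrieved context per request
    output_tokens: int       # generated tokens per request
    requests_per_user: int   # requests per user per month


def monthly_cost(users: int, usage: Usage,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Monthly cost in the same currency as the per-million-token prices."""
    requests = users * usage.requests_per_user
    input_cost = requests * usage.input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = requests * usage.output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices: 3.0 per million input tokens, 15.0 per million output tokens.
profile = Usage(input_tokens=2_000, output_tokens=500, requests_per_user=100)
mvp = monthly_cost(1_000, profile, 3.0, 15.0)        # 1,000 users
growth = monthly_cost(100_000, profile, 3.0, 15.0)   # 100,000 users
```

With an unchanged usage profile the bill grows in direct proportion to user count, which is why shrinking `input_tokens` (caching, RAG) and lowering the per-token price (smaller models) are the levers that matter.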
Three concrete measures are effective:
- Prompt caching (an official feature of both the Claude and GPT APIs) cuts the cost of the repeated part of the prompt, by up to roughly 90% depending on vendor and model
- Model branching (small tasks on Haiku/GPT-mini, complex tasks on Opus/GPT-4 family)
- Context compression via RAG (pass only the Top-K retrieved passages instead of the full text)
“Just run everything on the strongest model for now” is the typical route to running out of cash.
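Model branching can start as a few lines of routing logic. The sketch below is illustrative only: the model names, the task categories, and the length threshold are all assumptions, and a real product would tune them against Eval results.

```python
from dataclasses import dataclass

# Hypothetical model tiers; substitute the identifiers your vendor actually offers.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

# Task kinds assumed simple enough for the small tier (an illustrative list).
SIMPLE_TASKS = {"classify", "extract", "summarize_short"}


@dataclass(frozen=True)
class Task:
    kind: str
    prompt: str


def choose_model(task: Task, max_small_chars: int = 4_000) -> str:
    """Send simple, short tasks to the small tier and everything else to the large one."""
    if task.kind in SIMPLE_TASKS and len(task.prompt) <= max_small_chars:
        return SMALL_MODEL
    return LARGE_MODEL
```

For example, `choose_model(Task("classify", "Is this a refund request?"))` picks the small tier, while an unknown task kind or a very long prompt falls through to the large one.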
RAG (Retrieval-Augmented Generation) design
When handling internal docs, FAQs, product masters, and similar data, RAG is effectively the standard. Relying on the LLM alone to hold everything runs into limits on accuracy, cost, and updatability.
| Component | Recommended |
|---|---|
| Vector DB | pgvector (PostgreSQL extension) / Pinecone / Weaviate |
| Embedding model | text-embedding-3-small / OSS bge-m3 |
| Search | Hybrid (vector + BM25) |
| Reranker | Cohere Rerank / in-house Cross-Encoder |
| Citation display | Required (hallucination suppression + trust building) |
As of 2026, the standard first step up from “just send everything to the ChatGPT API” is to start with pgvector.
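Hybrid search needs some way to merge the vector ranking with the BM25 ranking. The article does not prescribe a method; the sketch below uses reciprocal rank fusion (RRF), one common option, with hypothetical document IDs standing in for real search results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; a higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search result
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # hypothetical BM25 result
top_k = reciprocal_rank_fusion([vector_hits, bm25_hits])[:3]
# top_k == ["doc_b", "doc_a", "doc_d"]
```

Documents that appear in both rankings rise to the top, and only the fused Top-K would then be passed to the reranker and the LLM.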
Hallucination and guardrails
LLMs generate “plausible-sounding lies.” Shipping them in a product requires structural measures to suppress this.
| Countermeasure | Content |
|---|---|
| RAG + citations | Always show source documents; decline to answer when no source supports the answer |
| Structured Output | Constrain output with a JSON Schema |
| Guardrails layer | Filters for personal info, abuse, banned topics |
| Confidence display | Make “low confidence” explicit and escalate |
| Human-in-the-loop | Insert human review for important decisions |
For critical areas like medical, legal, and finance, Human-in-the-loop is required in principle. Prefer designs in which the AI only drafts and humans make the final decision.
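Several of these countermeasures can be combined in one post-processing step. The following is a minimal sketch, assuming the model has been asked to reply as JSON with `answer`, `citations`, and `confidence` fields. That contract, the threshold, and the `escalate_to_human` action name are all hypothetical.

```python
import json


def _refuse(reason: str) -> dict:
    """Build a fresh refusal that hands the case to a human."""
    return {"answer": None, "citations": [],
            "action": "escalate_to_human", "reason": reason}


def guard_answer(raw: str, retrieved_ids: set[str],
                 min_confidence: float = 0.7) -> dict:
    """Accept a model reply only if it is well-formed, cited, and confident."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _refuse("invalid_json")
    if not isinstance(data, dict):
        return _refuse("invalid_json")

    answer = data.get("answer")
    citations = data.get("citations")
    confidence = data.get("confidence")
    if (not isinstance(answer, str)
            or not isinstance(citations, list)
            or not isinstance(confidence, (int, float))):
        return _refuse("schema_violation")

    # Every citation must point at a document that was actually retrieved.
    if not citations or not all(
            isinstance(c, str) and c in retrieved_ids for c in citations):
        return _refuse("uncited")
    if confidence < min_confidence:
        return _refuse("low_confidence")
    return {"answer": answer, "citations": citations, "action": "show"}
```

Only replies that pass every check reach the user; everything else is routed to review instead of being shown as free-form text. A self-reported confidence value is a weak signal on its own, so treat the threshold as a starting point.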
Eval (evaluation) design
An AI product's quality can degrade even while it appears to be running normally. Without Eval built into CI/CD, failures in production go unnoticed.
| Eval type | Content | Tools |
|---|---|---|
| Offline Eval | Auto-evaluate on past-question / expected-answer sets | Promptfoo / OpenAI Evals / LangSmith |
| Online A/B | Compare multiple prompts / models in production | LaunchDarkly + in-house |
| LLM-as-a-Judge | A separate, stronger LLM scores the output | GPT-4o / Claude Opus |
| Manual Eval | Sample and human-evaluate | In-house + annotation tools |
| User feedback | Collect 👍/👎 buttons, comments | Sentry / PostHog |
Ideally, offline Eval runs automatically on every pull request. Without automated regression detection, LLM operations break down.
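A per-PR offline Eval can begin as a small script that fails the build when the pass rate drops. The sketch below is a hand-rolled illustration, not the interface of Promptfoo or any other tool listed above. The golden set, the substring check, and the threshold are assumptions.

```python
from typing import Callable

# Hypothetical golden set: past questions paired with a phrase the answer must contain.
GOLDEN_SET = [
    {"question": "What is the refund window?", "must_contain": "30 days"},
    {"question": "Which plan includes SSO?", "must_contain": "Enterprise"},
]


class EvalRegression(Exception):
    """Raised when the pass rate falls below the agreed threshold."""


def pass_rate(generate: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of cases whose answer contains the expected phrase."""
    if not cases:
        raise ValueError("the golden set is empty")
    passed = sum(
        1 for case in cases
        if case["must_contain"].lower() in generate(case["question"]).lower()
    )
    return passed / len(cases)


def check_regression(generate: Callable[[str], str], cases: list[dict],
                     threshold: float = 0.9) -> float:
    """Return the pass rate, or raise so the CI job exits non-zero."""
    rate = pass_rate(generate, cases)
    if rate < threshold:
        raise EvalRegression(f"pass rate {rate:.0%} is below {threshold:.0%}")
    return rate
```

In CI, `generate` would wrap the real prompt and model call. An uncaught `EvalRegression` fails the job, which is what surfaces a regression before merge.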
Recommended stack
| Area | Recommended |
|---|---|
| LLM API | Claude API / OpenAI API (multiple simultaneous contracts recommended) |
| Agent foundation | LangGraph / Mastra / in-house |
| Vector DB | pgvector (small-mid) / Pinecone (scale) |
| Eval | Promptfoo + LangSmith |
| Observability (LLM-specific) | LangSmith / Helicone / Phoenix |
| Prompt management | Git-managed + templates |
| Billing (user-side) | Stripe + usage metrics |
| Language | TypeScript / Python |
| Hosting | Vercel / Cloud Run / Modal |
What to decide - what is your project’s answer?
| Item | Choice examples |
|---|---|
| Primary model | Claude / GPT / Gemini / OSS |
| Secondary model (cost-optimal) | Haiku / GPT-mini / Gemini Flash |
| Vector DB | pgvector / Pinecone / Weaviate |
| Hallucination countermeasure | RAG + citations / Structured Output / human review |
| Eval strategy | Offline / A/B / LLM-as-Judge |
| Data-residency requirement | Direct API / Azure OpenAI / OSS self-hosted |
| Billing model | Subscription / usage-based / hybrid |
| Prompt management | Git-managed / dedicated SaaS |
AI-product-specific pitfalls
| Forbidden move | Why it’s bad |
|---|---|
| Building your own model abstraction in advance | Same trap as LangChain overuse; a thin router is enough |
| Using the strongest model for every task “for now” | Roughly 10x the inference cost; most tasks can be done with Haiku / Flash |
| Running production without Eval | Quality degradation from prompt changes goes unnoticed; automated regression detection is a must |
| Free-form output without citations | Trust collapses when hallucinations occur; risk of SLA violations |
| Putting personal information or secrets directly in prompts | Data-residency and training-data risk; masking is required |
| Making LLM calls only synchronously | Gets stuck on rate limits and timeouts; a job queue is required |
| Running agents in production without runaway control | Infinite loops blow up the bill; always set max steps and budget caps |
| Building RAG while postponing data quality | Garbage data produces garbage answers; prepare the data first |
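Runaway control for agents, from the table above, can be enforced in the loop that drives the agent. The following is a minimal sketch. The `step` callback and the limit values are hypothetical; a real agent would report per-step cost from the token usage returned by its API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AgentLimitExceeded(RuntimeError):
    """Raised when an agent run hits its step or budget cap."""


@dataclass(frozen=True)
class RunLimits:
    max_steps: int = 10
    max_cost: float = 1.0   # in whatever currency your billing uses


def run_agent(step: Callable[[int], tuple[Optional[str], float]],
              limits: RunLimits) -> str:
    """Call `step` until it returns a final answer, or stop at the limits.

    `step(i)` is a hypothetical callback returning
    (final_answer_or_None, cost_of_this_step).
    """
    spent = 0.0
    for i in range(limits.max_steps):
        result, cost = step(i)
        spent += cost
        if spent > limits.max_cost:
            raise AgentLimitExceeded(
                f"spent {spent:.2f}, cap is {limits.max_cost:.2f}")
        if result is not None:
            return result
    raise AgentLimitExceeded(f"no answer after {limits.max_steps} steps")
```

Both caps are checked on every iteration, so a loop that never converges ends with an exception instead of an open-ended bill.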
AI decision axes
| Works in favor of AI products | Works against AI products |
|---|---|
| Data quality and organization | “Just throw it at the model for now” |
| Git-managed prompts and Eval | Editing prompts in admin screens and losing history |
| Prompt caching and model branching | Strongest model for all tasks |
| Structured Output with citations | Leaving everything to free-form output |
| User-feedback loop | One-way release |
- Data setup and Eval are the top priority, ahead of model selection.
- Use Claude or GPT as the primary model plus a secondary model; multiple providers give fault tolerance.
- RAG with citations is standard equipment for hallucination suppression.
- Build cost-management mechanisms in from the start: caching, model branching, and spending caps.
Eval (evaluation loop) becomes the core of AI product quality management
In traditional software, unit tests and E2E tests were the quality foundation, but in AI products, Eval (LLM output evaluation) additionally becomes the core of quality management. Since LLM output is non-deterministic, there’s no guarantee the same input returns the same output every time, making traditional testing alone insufficient for quality assurance.
A practical 3-layer structure for Eval design:
- Auto-Eval: output format validation (is JSON parseable, are required fields present)
- LLM-as-Judge: have another LLM score output quality (low cost, mass-executable)
- Human Eval: cases requiring business judgment (accuracy, appropriateness, brand tone)
Building these into CI/CD pipelines for automatic execution on every model or prompt change enables early detection of quality degradation.
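The three layers can be chained so that the cheapest check runs first. The following is a rough sketch. The required fields, the 1-to-5 scoring scale, and the `judge` callback (a stand-in for a real LLM-as-Judge call) are assumptions.

```python
import json
from typing import Callable

REQUIRED_FIELDS = {"answer", "citations"}   # hypothetical output contract


def auto_eval(raw: str) -> bool:
    """Layer 1: is the output parseable JSON with the required fields?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS.issubset(data)


def evaluate(raw: str, judge: Callable[[str], int], human_below: int = 4) -> str:
    """Return 'fail', 'needs_human', or 'pass' for one model output.

    `judge` stands in for an LLM-as-Judge call that returns a 1-5 score.
    """
    if not auto_eval(raw):
        return "fail"            # Layer 1 failed; no judge call is spent
    score = judge(raw)           # Layer 2: cheap, can run on every output
    if score < human_below:
        return "needs_human"     # Layer 3: queue for human review
    return "pass"
```

Ordering the layers this way keeps human reviewers for the cases where format checks and the judge cannot settle the question.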
Version management of prompts and model settings
AI product prompts are “code” that changes frequently, so editing them in an admin-screen text area with no history is fatal. Managing prompts in a Git repository, with Eval running automatically on every change, makes it possible to track when a change was made, who made it, what changed, and how quality moved as a result.
Model selection (Claude/GPT/Gemini etc.) and parameters (temperature, max_tokens) should also be Git-managed as config files, switchable via environment variables. This makes A/B testing and fallback switching easy. A configuration where a single config-line change switches to a secondary model during production model outages serves as an operational safety valve.
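A configuration like this can be very small. In the sketch below, the profile names, the model identifiers, and the `LLM_PROFILE` environment variable are all hypothetical. The point is that switching to the secondary model is a one-value change.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    model: str
    temperature: float
    max_tokens: int


# Hypothetical Git-managed settings; the model names are placeholders.
CONFIGS = {
    "primary": ModelConfig(model="primary-model", temperature=0.2, max_tokens=1024),
    "secondary": ModelConfig(model="secondary-model", temperature=0.2, max_tokens=1024),
}


def active_config(env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Pick the profile named by LLM_PROFILE (an assumed variable name)."""
    env = os.environ if env is None else env
    profile = env.get("LLM_PROFILE", "primary")
    if profile not in CONFIGS:
        raise KeyError(f"unknown LLM_PROFILE: {profile!r}")
    return CONFIGS[profile]
```

Setting `LLM_PROFILE=secondary` during an outage of the primary provider switches every call site that reads `active_config()`, and the same mechanism can assign profiles for an A/B test.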
Related Articles
- https://en.senkohome.com/arch-intro-case-enterprise/
- https://en.senkohome.com/arch-intro-case-mobile/
- https://en.senkohome.com/arch-intro-case-public/
Summary
This article covered the AI-product startup case, including model selection, inference cost, RAG, hallucination, and Eval.
Win with data and evaluation loops, spread primary and secondary models across multiple providers, build trust with RAG and citations, and design the cost structure from the start. That is the practical answer for AI products in 2026.
With this, the 3 addendum articles (public, mobile, AI-product) of the “Case Studies” category are complete, letting you position your project along both the scale axis (startup/saas/enterprise) and the industry axis (public/mobile/ai-product).
I hope you’ll read the next article as well.
📚 Series: Architecture Crash Course for the Generative-AI Era (86/89)