AI-Product Startup - Inference Cost and Data Setup Are Everything

About this article

As an addendum to the “Case Studies” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains the AI-product startup case.

Products centered on LLMs (Large Language Models) bring four new topics: model selection, inference cost, hallucination, and Eval (evaluation). This article organizes model-vendor selection, RAG design, inference-cost management, data-setup priorities, and AI-specific evaluation design.

4 main AI-product topics

```mermaid
flowchart TB
    AI[AI-product startup]
    A[1. Model selection<br/>Claude/GPT/Gemini/OSS]
    B[2. Inference-cost management<br/>$/1M tokens and rate limits]
    C[3. Hallucination countermeasures<br/>RAG, citations, Guardrails]
    D[4. Eval / evaluation design<br/>quantitative-qualitative loop]
    AI --> A
    AI --> B
    AI --> C
    AI --> D
    classDef root fill:#fef3c7,stroke:#d97706,stroke-width:2px;
    classDef axis fill:#dbeafe,stroke:#2563eb;
    class AI root;
    class A,B,C,D axis;
```

Model-selection decision axes

Here is the 2026 positioning of the major LLMs, differentiated by use case, cost, and lock-in resistance.

| Model | Strengths | Weaknesses | Where to use |
| --- | --- | --- | --- |
| Claude (Anthropic) | Long-form output, coding, reasoning accuracy | Image generation needs a separate model | Business tools, agents |
| GPT (OpenAI) | Largest ecosystem, multimodal | Periods of strict rate limits | General-purpose work, images |
| Gemini (Google) | 1M-token context, low cost | Less mature around agents | Mass document processing |
| OSS (Llama / Mistral, etc.) | Self-hosted, data never leaves your infrastructure | High operational cost, models lag the frontier | Confidential data, regulated industries |
| Azure OpenAI Service | Enterprise contracts, SLA, data residency | Pricier than a direct OpenAI contract | Large-enterprise ChatGPT use |

The realistic answer for a new startup is Claude or GPT as the primary model, Gemini for long-context jobs, and OSS only for regulated workloads. Pre-emptively building your own “multi-model abstraction” is the typical over-engineering; a thin router such as LiteLLM is enough.
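
For illustration, here is a minimal sketch of that thin-router idea using LiteLLM's unified completion API. The tier names and model IDs are placeholders, not recommendations.

```python
# A minimal "thin router" sketch using LiteLLM's unified completion API.
# Model IDs are illustrative; swap in whatever your contracts cover.
from litellm import completion

ROUTES = {
    "cheap": "claude-3-5-haiku-20241022",  # small, routine tasks
    "smart": "gpt-4o",                     # complex reasoning
    "long":  "gemini/gemini-1.5-pro",      # mass-document, long-context jobs
}

def ask(tier: str, prompt: str) -> str:
    response = completion(
        model=ROUTES[tier],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```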

Inference-cost structure

LLM cost scales multiplicatively: (input + output) tokens per request × requests per user × number of users. It isn’t rare for a bill of tens of thousands of yen a month at MVP to balloon to tens of millions of yen a month after PMF.
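
To see how the multiplication bites, here is a back-of-the-envelope cost model; the prices are illustrative placeholders, not any provider's current list prices.

```python
# Back-of-the-envelope inference cost model. Prices are illustrative
# placeholders in USD per 1M tokens, not any provider's list price.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def monthly_cost(users: int, reqs_per_user: int,
                 tokens_in: int, tokens_out: int) -> float:
    per_request = tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT
    return users * reqs_per_user * per_request

print(monthly_cost(1_000, 30, 2_000, 500))    # MVP:   ~$405 / month
print(monthly_cost(100_000, 30, 2_000, 500))  # scale: ~$40,500 / month
```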

| Phase | Monthly-cost guideline | Main measures |
| --- | --- | --- |
| MVP (~1,000 users) | Thousands to tens of thousands of yen | Prompt compression, caching |
| Growth (~100k users) | Hundreds of thousands to millions of yen | Prompt caching, model branching |
| Scale (100k+ users) | Millions to tens of millions of yen | Multi-model split, in-house fine-tuning |

Three concrete measures work (see the sketch after this list):

  • Prompt caching (an official Claude/GPT feature) cuts the cost of repeated prompt prefixes by up to 90%
  • Model branching (small tasks on Haiku / GPT-4o mini, complex tasks on Opus / the GPT-4 family)
  • Context compression via RAG (Top-K extraction instead of passing full text)

“For now, just run everything on the strongest model” is the classic route to a cash crunch.
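
As a sketch of the first two measures together, assuming the Anthropic SDK; the model IDs and the system prompt are placeholders.

```python
# Sketch of prompt caching plus model branching with the Anthropic SDK.
# LONG_SYSTEM_PROMPT and the model IDs are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "...your large, stable instructions and reference docs..."

def answer(question: str, is_complex: bool) -> str:
    response = client.messages.create(
        # Model branching: cheap model by default, big model when needed.
        model="claude-opus-4-20250514" if is_complex else "claude-3-5-haiku-20241022",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the stable prefix cacheable; repeat calls read it back
            # at a fraction of the normal input-token price.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```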

RAG (Retrieval-Augmented Generation) design

When a product handles internal documents, FAQs, product masters, and the like, RAG is effectively the standard. Having the LLM alone “hold everything” runs into accuracy, cost, and updatability walls.

```mermaid
flowchart LR
    Q([User question])
    EMB[Embedding vectorization]
    VDB[(Vector DB<br/>pgvector / Pinecone)]
    TOPK[Top-K document extraction]
    LLM[LLM<br/>+ context injection]
    A([Answer + citations])
    Q --> EMB --> VDB --> TOPK --> LLM --> A
    classDef q fill:#fef3c7,stroke:#d97706;
    classDef ret fill:#dbeafe,stroke:#2563eb;
    classDef llm fill:#fae8ff,stroke:#a21caf;
    class Q,A q;
    class EMB,VDB,TOPK ret;
    class LLM llm;
```

| Component | Recommended |
| --- | --- |
| Vector DB | pgvector (PostgreSQL extension) / Pinecone / Weaviate |
| Embedding model | text-embedding-3-small / OSS bge-m3 |
| Search | Hybrid (vector + BM25) |
| Reranker | Cohere Rerank / in-house Cross-Encoder |
| Citation display | Required (suppresses hallucination, builds trust) |

The standard 2026 upgrade path away from “for now, throw everything at the ChatGPT API” is to start with pgvector.
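
A minimal retrieval sketch on that path, assuming a hypothetical `documents` table with a pgvector `embedding` column and OpenAI embeddings; the DSN and table are placeholders.

```python
# Minimal pgvector retrieval sketch. Assumes a documents(body text,
# embedding vector(1536)) table and the pgvector extension installed.
import psycopg
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, k: int = 5) -> list[str]:
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"
    with psycopg.connect("dbname=app") as conn:
        rows = conn.execute(
            # `<=>` is pgvector's cosine-distance operator.
            "SELECT body FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, k),
        ).fetchall()
    return [body for (body,) in rows]
```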

Hallucination and guardrails

LLMs generate “plausible-sounding lies.” Shipping them in a product requires suppressing this structurally.

| Countermeasure | Content |
| --- | --- |
| RAG + citations | Always present source documents; don’t answer without a citation |
| Structured Output | Constrain output with a JSON Schema |
| Guardrails layer | Filters for personal information, abuse, banned topics |
| Confidence display | Make “low confidence” explicit and escalate |
| Human-in-the-loop | Insert human review for important decisions |

For critical domains such as medical, legal, and finance, Human-in-the-loop is required in principle. Lean toward designs where AI only creates drafts and humans make the final decision.
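
As an example of the Structured Output row above, here is a sketch using the OpenAI JSON Schema response format; the schema fields (answer, citations, confidence) are illustrative.

```python
# Sketch of Structured Output: the model is constrained to emit JSON
# matching this schema, so downstream code never parses free-form prose.
from openai import OpenAI

client = OpenAI()

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "citations": {"type": "array", "items": {"type": "string"}},
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
    },
    "required": ["answer", "citations", "confidence"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is our refund policy?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "answer", "schema": ANSWER_SCHEMA, "strict": True},
    },
)
print(response.choices[0].message.content)  # schema-valid JSON
```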

Eval (evaluation) design

An AI product’s quality can degrade while it still “looks like it’s running.” Without Eval built into CI/CD, production failures never surface.

| Eval type | Content | Tools |
| --- | --- | --- |
| Offline Eval | Auto-evaluate against past-question / expected-answer sets | Promptfoo / OpenAI Evals / LangSmith |
| Online A/B | Compare multiple prompts / models in production | LaunchDarkly + in-house |
| LLM-as-a-Judge | A strong second LLM scores the outputs | GPT-4o / Claude Opus |
| Manual Eval | Sample outputs and have humans evaluate them | In-house + annotation tools |
| User feedback | Collect 👍/👎 buttons and comments | Sentry / PostHog |

The desired state is offline Eval running automatically on every PR. Without automated regression detection, LLM operations break down.
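
A minimal sketch of that per-PR gate, independent of any specific Eval tool; `ask()` is the router from the earlier sketch, and the keyword-match scoring, file name, and threshold are deliberately simplistic placeholders.

```python
# Minimal offline-Eval sketch: replay a fixed JSONL question set against
# the current prompt/model and fail CI if accuracy regresses.
import json
import sys

def run_offline_eval(path: str = "eval_cases.jsonl", threshold: float = 0.9) -> None:
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(
        1 for c in cases
        if all(kw.lower() in ask("cheap", c["question"]).lower()
               for kw in c["expected_keywords"])
    )
    accuracy = passed / len(cases)
    print(f"offline eval: {passed}/{len(cases)} passed ({accuracy:.0%})")
    if accuracy < threshold:
        sys.exit(1)  # fail the PR check
```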

Recommended stack

| Area | Recommended |
| --- | --- |
| LLM API | Claude API / OpenAI API (contracts with multiple providers recommended) |
| Agent foundation | LangGraph / Mastra / in-house |
| Vector DB | pgvector (small to mid scale) / Pinecone (large scale) |
| Eval | Promptfoo + LangSmith |
| Observability (LLM-specific) | LangSmith / Helicone / Phoenix |
| Prompt management | Git-managed + templates |
| Billing (user-facing) | Stripe + usage metering |
| Language | TypeScript / Python |
| Hosting | Vercel / Cloud Run / Modal |

What to decide - what is your project’s answer?

| Item | Choice examples |
| --- | --- |
| Primary model | Claude / GPT / Gemini / OSS |
| Secondary model (cost-optimized) | Haiku / GPT-4o mini / Gemini Flash |
| Vector DB | pgvector / Pinecone / Weaviate |
| Hallucination countermeasure | RAG + citations / Structured Output / human review |
| Eval strategy | Offline / A/B / LLM-as-a-Judge |
| Data-residency requirement | Direct API / Azure OpenAI / OSS self-hosted |
| Billing model | Subscription / usage-based / hybrid |
| Prompt management | Git-managed / dedicated SaaS |

AI-product-specific pitfalls

| Forbidden move | Why it’s bad |
| --- | --- |
| Pre-emptively building your own model abstraction | Same trap as LangChain abuse; a thin router is enough |
| “For now, the strongest model” for every task | 10x inference cost; most tasks run fine on Haiku/Flash |
| Production operation without Eval | Quality degradation from prompt changes stays invisible |
| Free-form output without citations | Trust collapses at the first hallucination; SLA-violation risk |
| Personal information / secrets pasted straight into prompts | Data-residency / training risk; masking is required |
| Synchronous-only LLM calls | Stuck on rate limits and timeouts; a job queue is required |
| Production agents without runaway control | Infinite loops blow up the bill; always set max steps and budget caps (sketch below) |
| Building RAG while postponing data quality | Garbage data in, garbage answers out; data setup comes first |
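
As a sketch of the runaway-control row: hard caps on both step count and spend, checked before every call. `estimate_cost`, `call_llm`, and `is_done` stand in for your agent's internals; the cap values are illustrative.

```python
# Sketch of agent runaway control: hard caps on step count and spend,
# checked before every model call. The helpers are placeholders.
MAX_STEPS = 10
BUDGET_USD = 0.50

def run_agent(task: str) -> str:
    spent, state = 0.0, task
    for step in range(MAX_STEPS):
        cost = estimate_cost(state)
        if spent + cost > BUDGET_USD:
            raise RuntimeError(f"budget cap hit at step {step}: ${spent:.2f} spent")
        state = call_llm(state)
        spent += cost
        if is_done(state):
            return state
    raise RuntimeError(f"step cap hit: no answer within {MAX_STEPS} steps")
```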

The AI-era perspective

For AI products, being “favored in the AI era” is not a side benefit; it is the product’s essence itself.

| Effective | Ineffective |
| --- | --- |
| Data quality and organization | “For now, smash it with the model” |
| Git-managed prompts and Eval | Editing prompts in an admin screen, losing history |
| Prompt caching and model branching | The strongest model for every task |
| Structured Output with citations | Leaving everything to free-form output |
| User-feedback loop | One-way releases |

If data setup lags, swapping in the strongest model later won’t raise accuracy. Investment in data architecture directly sets the AI product’s ceiling.

How to make the final call

The core of AI products is the recognition that you win not on the model but on the data and the evaluation loop. The strongest model of today becomes outdated over time, but organized data and a running Eval loop are assets that compound.

Another axis is designing the cost structure from the start. Prioritizing “it works for now” at the MVP stage and throwing everything at the strongest model leaves post-PMF bills that shake the business. Build in three things from day one: prompt caching, model branching, and context compression via RAG.

Selection priorities

  1. Data setup and Eval first - they come before model selection
  2. Claude/GPT as primary plus a secondary model - multi-provider for fault tolerance
  3. RAG + citations - standard equipment for hallucination suppression
  4. Cost-management mechanisms from the start - caching, model branching, spending caps

“Models commoditize; the win is data and evaluation” is the 2026 AI-product standard.

Summary

This article covered the AI-product startup case, including model selection, inference cost, RAG, hallucination, and Eval.

Win with data and the evaluation loop, keep more than one model provider on hand, build trust with RAG and citations, and design the cost structure from the start. That is the practical answer for AI products in 2026.

With this, the three addendum articles (public sector, mobile, AI product) of the “Case Studies” category are complete, and you can position your own project on both the scale axis (startup / SaaS / enterprise) and the industry axis (public / mobile / AI product).

Back to series TOC -> ‘Architecture Crash Course for the Generative-AI Era’: How to Read This Book

I hope you’ll read the next article as well.