AI-Product Startup - Inference Cost and Data Setup Are Everything

About this article

As an addendum to the “Case Studies” category in the series “Architecture Crash Course for the Generative-AI Era,” this article explains the AI-product startup case.

Products centered on LLMs (Large Language Models) bring four new concerns: model selection, inference cost, hallucination, and Eval (evaluation). This article covers model-vendor selection, RAG design, inference-cost management, data-setup priorities, and AI-specific evaluation design.

What is AI product architecture, anyway?

Picture an AI-powered home appliance. Traditional appliances gave “the same result every time you push a button,” but AI appliances “produce different output depending on the input,” “occasionally make mistakes,” and “cost electricity every time they run” — fundamentally different traits from what came before.

AI product architecture is the same: you need to build model selection, inference cost, hallucination countermeasures, and evaluation design — four new considerations that didn’t exist in traditional software — into the design.

If you build with the same mindset as a traditional web app, you fall into AI-specific traps: going into the red on inference costs, or losing user trust to hallucinations.

Why AI products need special design

Because new concerns that didn’t exist before have been added

Traditional web apps were a deterministic “input -> process -> output,” but incorporating an LLM adds fundamentally different traits: output varies from call to call, is occasionally wrong (hallucinations), and incurs cost on every invocation. Unless these traits are woven into the architecture, both quality and cost break down.

Because inference cost explodes alongside scale

Traditional apps see infrastructure costs rise gently with user growth, but the cost of an LLM-based product grows in direct proportion to user count times tokens per user, so it balloons quickly. It is not uncommon for a monthly bill of a few tens of thousands of yen at MVP to inflate to tens of millions after PMF. Prompt caching, model branching, and context compression via RAG need to be built into the design from the start.

Because quality evaluation methods are fundamentally different from before

Traditional tests could judge by “does it match the expected value,” but LLM output has no single correct answer, making Eval design itself a new technical challenge. An evaluation pipeline combining human evaluation, LLM-as-a-Judge, and statistical metrics becomes necessary.

4 main AI-product topics

[Figure: 4 Major Considerations for AI Products. Panels: 1 Model Selection, 2 Inference Cost Management, 3 Hallucination Mitigation, 4 Eval Design. Takeaway: win with data and eval loops, run multiple primary models, and build trust with RAG and citations.]

Model-selection decision axes

The 2026 positioning of the major LLMs, differentiated by use case, cost, and lock-in resistance.

Model | Strengths | Weaknesses | Where to use
Claude (Anthropic) | Long-form, coding, reasoning accuracy | Image generation needs a separate service | Business tools, agents
GPT (OpenAI) | Largest ecosystem, multimodal | Periods of strict rate limits | General-purpose, images
Gemini (Google) | 1M-token context, low cost | Agent support is less mature | Mass document processing
OSS (Llama / Mistral etc.) | Self-hosted, no data outflow | High operational cost, models lag behind | Confidential data, regulated industries
Azure OpenAI Service | Enterprise contracts, SLA, data residency | Costs more than a direct OpenAI contract | Large-enterprise ChatGPT use

The realistic answer for a new startup is Claude or GPT as the primary model, Gemini for long-document jobs, and OSS only for regulated operations. Building your own multi-model abstraction up front is typical over-design; a thin router such as LiteLLM is enough.
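To show how thin such a routing layer can be, here is a minimal sketch. It does not use LiteLLM's API; the `ThinRouter` class and the two provider functions are hypothetical stand-ins, and a real provider would wrap a vendor SDK call.

```python
from typing import Callable, Dict, List

# A provider is any function that takes a prompt and returns text.
# These are hypothetical stand-ins, not real vendor SDK calls.
Provider = Callable[[str], str]


class ThinRouter:
    """Try providers in priority order and fall back on failure."""

    def __init__(self, providers: Dict[str, Provider], order: List[str]):
        self.providers = providers
        self.order = order

    def complete(self, prompt: str) -> str:
        errors = []
        for name in self.order:
            try:
                return self.providers[name](prompt)
            except Exception as exc:  # e.g. rate limit or outage
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Simulates the primary vendor being down.
    raise TimeoutError("simulated outage")


def stable_secondary(prompt: str) -> str:
    return f"[secondary] {prompt}"


router = ThinRouter(
    providers={"primary": flaky_primary, "secondary": stable_secondary},
    order=["primary", "secondary"],
)
print(router.complete("Summarize this ticket"))
```

The whole layer is one class and one loop, which is roughly the scale the article argues for before anything heavier is justified.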

Inference-cost structure

LLM cost scales with the tokens processed per request (input and output, each priced separately) multiplied by request volume, which grows with user count. It is not rare for a monthly bill of tens of thousands of yen at MVP to balloon to tens of millions of yen after PMF.

Phase | Monthly-cost guideline | Main measures
MVP (~1,000 users) | Thousands to tens of thousands of yen | Prompt compression, cache
Growth (~100k users) | Hundreds of thousands to millions of yen | Prompt caching, model branching
Scale (100k+ users) | Millions to tens of millions of yen | Multi-model split, in-house fine-tuning
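The growth from MVP to scale can be checked with simple arithmetic. The sketch below is a rough estimator only: the per-token prices, request counts, and token counts are all placeholder assumptions, not any vendor's actual rates.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Price per one million tokens, in yen. Values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


def monthly_cost(users: int, requests_per_user: int,
                 input_tokens: int, output_tokens: int,
                 pricing: Pricing) -> float:
    """Estimate monthly spend: requests x tokens x unit price, input and output separately."""
    requests = users * requests_per_user
    input_cost = requests * input_tokens / 1_000_000 * pricing.input_per_mtok
    output_cost = requests * output_tokens / 1_000_000 * pricing.output_per_mtok
    return input_cost + output_cost


# Hypothetical prices and usage: 30 requests per user per month,
# 2,000 input tokens and 500 output tokens per request.
pricing = Pricing(input_per_mtok=450.0, output_per_mtok=2250.0)
mvp = monthly_cost(1_000, 30, 2_000, 500, pricing)
scale = monthly_cost(200_000, 30, 2_000, 500, pricing)
print(f"MVP: {mvp:,.0f} yen / month")      # MVP: 60,750 yen / month
print(f"Scale: {scale:,.0f} yen / month")  # Scale: 12,150,000 yen / month
```

Under these assumed numbers, a 200x increase in users produces exactly a 200x increase in cost, which is the linear relationship the table describes.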

Three concrete measures work:

  • Prompt caching (an official feature of the Claude and GPT APIs) cuts the cost of repeated prompt sections by up to about 90%
  • Model branching (small tasks on Haiku / GPT-mini, complex tasks on the Opus / GPT-4 family)
  • Context compression via RAG (pass only the Top-K extracted passages instead of the full text)
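The first two measures can be sketched in a few lines. The model names, task labels, token threshold, and price below are illustrative assumptions. The cache calculation is a simplification that applies a flat discount to the repeated prefix and ignores any cache-write surcharge.

```python
SMALL_MODEL = "small-model"  # hypothetical name for a Haiku / GPT-mini class model
LARGE_MODEL = "large-model"  # hypothetical name for an Opus / GPT-4 class model
SIMPLE_TASKS = {"classify", "extract", "translate"}  # assumed task labels


def pick_model(task_type: str, input_tokens: int,
               long_input_threshold: int = 8_000) -> str:
    """Send short, simple tasks to the small model and everything else to the large one."""
    if task_type in SIMPLE_TASKS and input_tokens <= long_input_threshold:
        return SMALL_MODEL
    return LARGE_MODEL


def input_cost_yen(prefix_tokens: int, dynamic_tokens: int,
                   price_per_mtok: float, cache_discount: float = 0.0) -> float:
    """Input cost of one request; the shared prefix is discounted when served from cache."""
    billed_prefix = prefix_tokens * (1.0 - cache_discount)
    return (billed_prefix + dynamic_tokens) / 1_000_000 * price_per_mtok


print(pick_model("classify", 500))  # small-model
print(pick_model("plan", 500))      # large-model

# 1,800-token system prompt reused on every call, 200 tokens of user input.
uncached = input_cost_yen(1_800, 200, price_per_mtok=450.0)
cached = input_cost_yen(1_800, 200, price_per_mtok=450.0, cache_discount=0.9)
print(f"uncached {uncached:.3f} yen, cached {cached:.3f} yen per request")
```

With a prompt that is mostly shared prefix, the cached request costs a small fraction of the uncached one, which is why caching pays off most for long system prompts and few-shot examples.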

Running everything on the strongest model “for now” is the typical route to running out of funds.

RAG (Retrieval-Augmented Generation) design

When handling internal docs, FAQs, product masters, and the like, RAG is effectively the standard. Relying on the LLM alone to hold everything runs into limits on accuracy, cost, and updatability.

[Figure: Basic RAG Architecture. Query path: user question → embedding model → vector search (hybrid: vector + BM25) → reranker → LLM → answer with citation display. Ingestion path: internal docs / FAQ → chunk splitting → embedding generation → vector DB, run as a scheduled batch or in real time.]

Component | Recommended
Vector DB | pgvector (PostgreSQL extension) / Pinecone / Weaviate
Embedding model | text-embedding-3-small / OSS bge-m3
Search | Hybrid (vector + BM25)
Reranker | Cohere Rerank / in-house Cross-Encoder
Citation display | Required (hallucination suppression + trust building)
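The article recommends hybrid search but does not say how to merge the vector and BM25 result lists. One common option is Reciprocal Rank Fusion (RRF); using it here is an assumption rather than the author's stated method. The document IDs are made up, and the two input lists stand in for results from a vector index and a keyword index.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60,
                           top_k: int = 3) -> List[str]:
    """Merge several ranked lists: each document scores 1 / (k + rank) per list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


vector_hits = ["doc_a", "doc_b", "doc_c", "doc_d"]  # from the vector index
bm25_hits = ["doc_c", "doc_a", "doc_e"]             # from the keyword index
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['doc_a', 'doc_c', 'doc_b']
```

Documents that rank well in both lists rise to the top. The fused Top-K would then go to the reranker and finally into the LLM prompt.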

The standard 2026 upgrade path from “just send everything to the ChatGPT API for now” starts with pgvector.

Hallucination and guardrails

LLMs generate “plausible-sounding lies.” Shipping them in a product requires suppressing this structurally.

Countermeasure | Content
RAG + citations | Always present source documents, don’t answer if uncited
Structured Output | Bind output via JSON Schema
Guardrails layer | Filters for personal info, abuse, banned topics
Confidence display | Make “low confidence” explicit and escalate
Human-in-the-loop | Insert human review for important decisions
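Several of these countermeasures can be combined in a single post-processing gate. The following is a minimal sketch under assumptions: the field names (`answer`, `citations`, `confidence`), the 0.6 threshold, and the idea that the model reports its own confidence are all illustrative. A production system would more likely validate against a full JSON Schema.

```python
import json
from typing import Any, Dict

# Assumed output contract; the field names are illustrative.
REQUIRED_FIELDS = {"answer": str, "citations": list, "confidence": (int, float)}


def check_answer(raw: str, min_confidence: float = 0.6) -> Dict[str, Any]:
    """Gate a model response: reject bad structure, refuse uncited, escalate low confidence."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "reject", "reason": "not valid JSON"}
    if not isinstance(data, dict):
        return {"status": "reject", "reason": "not a JSON object"}
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return {"status": "reject", "reason": f"bad or missing field: {field}"}
    if not data["citations"]:
        # "Don't answer if uncited."
        return {"status": "refuse", "reason": "no citations"}
    if data["confidence"] < min_confidence:
        # Hand over to human review instead of showing the answer as-is.
        return {"status": "escalate", "reason": "low confidence"}
    return {"status": "ok", "answer": data["answer"], "citations": data["citations"]}


good = '{"answer": "Refunds within 30 days.", "citations": ["policy.md"], "confidence": 0.9}'
uncited = '{"answer": "Probably 30 days.", "citations": [], "confidence": 0.9}'
print(check_answer(good)["status"])     # ok
print(check_answer(uncited)["status"])  # refuse
```

The ordering is deliberate: structural checks come first so that later checks can rely on the fields existing.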

For critical areas such as medical, legal, and finance, Human-in-the-loop is required in principle. Favor designs where AI only creates drafts and humans make the final decisions.

Eval (evaluation) design

An AI product can look like it is running normally while its quality quietly degrades. Without Eval built into CI/CD, production failures never surface.

Eval type | Content | Tools
Offline Eval | Auto-evaluate on past-question / expected-answer sets | Promptfoo / OpenAI Evals / LangSmith
Online A/B | Compare multiple prompts / models in production | LaunchDarkly + in-house
LLM-as-a-Judge | A separate, strong LLM scores the output | GPT-4o / Claude Opus
Manual Eval | Sample and human-evaluate | In-house + annotation tools
User feedback | Collect 👍/👎 buttons, comments | Sentry / PostHog

The target state for offline Eval is an automatic run on every PR. Without automated regression detection, LLM operations break down.
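A per-PR regression gate can start very small. The sketch below is not tied to Promptfoo or any other tool. The test cases, the canned answers, the substring-match scoring, and the 0.90 baseline are all assumptions for illustration, and `fake_generate` stands in for the real prompt and model call under test.

```python
from typing import Callable, Dict, List


def pass_rate(cases: List[Dict[str, str]], generate: Callable[[str], str]) -> float:
    """Fraction of cases whose output contains the expected phrase (a crude check)."""
    passed = sum(
        1 for case in cases
        if case["expected"].lower() in generate(case["question"]).lower()
    )
    return passed / len(cases)


def gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass when the score has not dropped more than `tolerance` below the baseline."""
    return current >= baseline - tolerance


CASES = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    {"question": "Is there a free tier?", "expected": "free tier"},
]

CANNED = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    "Is there a free tier?": "I am not sure.",
}


def fake_generate(question: str) -> str:
    # Stand-in for the real prompt + model call being evaluated.
    return CANNED[question]


score = pass_rate(CASES, fake_generate)
print(f"pass rate {score:.2f}, gate passed: {gate(score, baseline=0.90)}")
# pass rate 0.67, gate passed: False
```

In CI, the job would exit with a nonzero status when `gate` returns False, so the PR is blocked until the regression is explained or fixed.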

Area | Recommended
LLM API | Claude API / OpenAI API (multiple simultaneous contracts recommended)
Agent foundation | LangGraph / Mastra / in-house
Vector DB | pgvector (small-mid) / Pinecone (scale)
Eval | Promptfoo + LangSmith
Observability (LLM-specialized) | LangSmith / Helicone / Phoenix
Prompt management | Git-managed + templates
Billing (user-side) | Stripe + usage metrics
Language | TypeScript / Python
Hosting | Vercel / Cloud Run / Modal

What to decide - what is your project’s answer?

Item | Choice examples
Primary model | Claude / GPT / Gemini / OSS
Secondary model (cost-optimal) | Haiku / GPT-mini / Gemini Flash
Vector DB | pgvector / Pinecone / Weaviate
Hallucination countermeasure | RAG + citations / Structured / human review
Eval strategy | Offline / A/B / LLM-as-Judge
Data-residency requirement | Direct API / Azure OpenAI / OSS self-hosted
Billing model | Subscription / usage-based / hybrid
Prompt management | Git-managed / dedicated SaaS

AI-product-specific pitfalls

Forbidden move | Why it’s bad
Building your own model abstraction up front | Same trap as LangChain overuse; a thin router is enough
Using the strongest model for every task “for now” | 10x inference cost; most tasks can be done with Haiku / Flash
Running production without Eval | Quality degradation on prompt changes goes unseen; automated regression detection is a must
Free-form output without citations | Trust collapses on hallucinations; SLA-violation risk
Putting personal info / secrets directly in prompts | Data-residency and training-data risk; masking required
Using only synchronous processing for LLM calls | Gets stuck on rate limits and timeouts; a job queue is required
Running production agents without runaway control | Infinite loops blow up bills; always set max steps and budget caps
Building RAG while postponing data quality | Garbage data → garbage answers; data setup comes first
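As an illustration of the masking step mentioned in the table, here is a minimal regex-based sketch. The two patterns are deliberately simple assumptions (email addresses and hyphenated phone numbers); they will miss many real-world formats, and a production system would usually rely on a dedicated PII-detection component.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{2,4}-\d{2,4}-\d{3,4}\b")


def mask_pii(text: str) -> str:
    """Replace emails and hyphenated phone numbers before the text reaches a prompt."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(mask_pii("Contact taro@example.com or 03-1234-5678 about the invoice."))
# Contact [EMAIL] or [PHONE] about the invoice.
```

Masking before prompt assembly means the raw values never leave your own infrastructure, whichever model vendor sits behind the call.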

AI decision axes

AI-favorable | AI-unfavorable
Data quality and organization | Brute-forcing with the model “for now”
Git-managed prompts and Eval | Editing prompts in admin screens and losing history
Prompt caching and model branching | Strongest model for all tasks
Structured Output with citations | Leaving everything to free-form output
User-feedback loop | One-way release

The priorities, in order:

  1. Data setup and Eval as top priority, ahead of model selection.
  2. Claude / GPT as primary plus secondary models: multi-provider for fault tolerance.
  3. RAG + citations: standard equipment for hallucination suppression.
  4. Cost-management mechanisms from the start: cache, branching, cap settings.
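The cap settings in item 4 also serve as runaway control for agents. The following is a minimal sketch of a per-run budget guard; the class and function names, the step interface, and the yen amounts are all hypothetical.

```python
from typing import Callable, Tuple


class RunBudget:
    """Tracks steps and spend for one agent run."""

    def __init__(self, max_steps: int, max_cost_yen: float):
        self.max_steps = max_steps
        self.max_cost_yen = max_cost_yen
        self.steps = 0
        self.cost_yen = 0.0

    def can_continue(self) -> bool:
        return self.steps < self.max_steps and self.cost_yen < self.max_cost_yen

    def record(self, cost_yen: float) -> None:
        self.steps += 1
        self.cost_yen += cost_yen


def run_agent(step: Callable[[], Tuple[bool, float]], budget: RunBudget) -> str:
    """Run step() until it reports done or the budget is used up."""
    while budget.can_continue():
        done, cost_yen = step()
        budget.record(cost_yen)
        if done:
            return "finished"
    return "stopped: budget exhausted"


def never_done() -> Tuple[bool, float]:
    # Simulates an agent stuck in a loop; each step costs 2 yen.
    return False, 2.0


budget = RunBudget(max_steps=10, max_cost_yen=5.0)
print(run_agent(never_done, budget), budget.steps, budget.cost_yen)
# stopped: budget exhausted 3 6.0
```

Because the check runs before each step, spend can overshoot the cap by at most one step's cost. A stricter version would estimate the next step's cost before running it.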

Eval (evaluation loop) becomes the core of AI product quality management

In traditional software, unit tests and E2E tests were the quality foundation; in AI products, Eval (LLM output evaluation) joins them as the core of quality management. Because LLM output is non-deterministic, the same input is not guaranteed to return the same output every time, so traditional testing alone cannot assure quality.

A practical 3-layer structure for Eval design:

  • Auto-Eval: output format validation (is JSON parseable, are required fields present)
  • LLM-as-Judge: have another LLM score output quality (low cost, mass-executable)
  • Human Eval: cases requiring business judgment (accuracy, appropriateness, brand tone)

Building these into CI/CD pipelines for automatic execution on every model or prompt change enables early detection of quality degradation.
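The three layers can be wired together as one pass over a batch of outputs. This is a sketch under assumptions: the required field names, the 10% human sampling rate, and `toy_judge` (a placeholder for a real call to a judge LLM) are all illustrative.

```python
import json
import random
from typing import Callable, Dict, List, Sequence


def format_ok(raw: str, required: Sequence[str] = ("answer", "citations")) -> bool:
    """Layer 1: output parses as a JSON object and has the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)


def run_eval(outputs: List[str], judge: Callable[[str], float],
             human_sample_rate: float = 0.1, seed: int = 0) -> Dict[str, object]:
    """Run all three layers and return a summary report."""
    rng = random.Random(seed)  # fixed seed keeps the human sample reproducible
    format_failures = 0
    judge_scores: List[float] = []
    human_queue: List[str] = []
    for raw in outputs:
        if not format_ok(raw):
            format_failures += 1
            continue  # no point judging output that cannot be parsed
        judge_scores.append(judge(raw))       # Layer 2: LLM-as-Judge
        if rng.random() < human_sample_rate:  # Layer 3: sample for human review
            human_queue.append(raw)
    return {"format_failures": format_failures,
            "judge_scores": judge_scores,
            "human_queue": human_queue}


def toy_judge(raw: str) -> float:
    # Placeholder for a judge-LLM call: rewards answers that cite a source.
    return 1.0 if json.loads(raw)["citations"] else 0.0


outputs = [
    '{"answer": "30 days", "citations": ["policy.md"]}',
    '{"answer": "maybe", "citations": []}',
    "Sorry, I cannot help with that.",
]
report = run_eval(outputs, toy_judge)
print(report["format_failures"], report["judge_scores"])  # 1 [1.0, 0.0]
```

Running the cheap format check first keeps judge calls, which cost money, away from outputs that are already known to be broken.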

Version management of prompts and model settings

Prompts in an AI product are frequently changed “code,” so editing them in an admin-screen text area with no history is fatal. Managing prompts in a Git repository, with Eval running automatically on every change, lets you track when and by whom each change was made, what changed, and how quality changed as a result.

[Figure: Prompt Git Management + Automated Eval Pipeline. A Git repository (prompts/, config/, evals/) feeds a per-PR Eval pipeline (Layer 1 Auto Eval, Layer 2 LLM-as-Judge, Layer 3 Human Eval). A score above the threshold deploys; a score drop triggers automatic rollback. Model config (model.yaml, e.g. primary: claude-opus / temp: 0.3) is also Git-managed, switched via an environment variable (MODEL_ENV=prod), with traffic_split: 90/10 for A/B testing and a one-line switch to a fallback model.]

Model selection (Claude/GPT/Gemini etc.) and parameters (temperature, max_tokens) should also be Git-managed as config files, switchable via environment variables. This makes A/B testing and fallback switching easy. A configuration where a single config-line change switches to a secondary model during production model outages serves as an operational safety valve.
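A config-driven selector along these lines might look like the sketch below. The `MODEL_CONFIG` dictionary stands in for a Git-managed YAML file, and the model names, the `candidate_share` key, and the `primary_healthy` flag are hypothetical.

```python
import os
import random
from typing import Any, Dict, Optional

# Stand-in for a Git-managed config file; all names and values are placeholders.
MODEL_CONFIG: Dict[str, Dict[str, Any]] = {
    "prod": {"primary": "model-a", "fallback": "model-b",
             "candidate": "model-c", "candidate_share": 0.10, "temperature": 0.3},
    "dev": {"primary": "model-b", "fallback": "model-b",
            "candidate": "model-b", "candidate_share": 0.0, "temperature": 0.3},
}


def load_config(env: Optional[str] = None) -> Dict[str, Any]:
    """Pick the config block named by `env`, or by the MODEL_ENV variable if omitted."""
    env = env or os.environ.get("MODEL_ENV", "dev")
    return MODEL_CONFIG[env]


def choose_model(config: Dict[str, Any], rng: random.Random,
                 primary_healthy: bool = True) -> str:
    """Fallback during outages; otherwise split traffic between primary and candidate."""
    if not primary_healthy:
        return config["fallback"]
    if rng.random() < config["candidate_share"]:
        return config["candidate"]  # A/B test arm
    return config["primary"]


config = load_config("prod")  # in deployment, load_config() would read MODEL_ENV
rng = random.Random(0)
print(choose_model(config, rng, primary_healthy=False))  # model-b
```

Changing the fallback model, or the share of traffic sent to the candidate, is then a one-line config edit that goes through the same PR and Eval flow as a prompt change.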

Related case studies:

  • https://en.senkohome.com/arch-intro-case-enterprise/
  • https://en.senkohome.com/arch-intro-case-mobile/
  • https://en.senkohome.com/arch-intro-case-public/

Summary

This article covered the AI-product startup case, including model selection, inference cost, RAG, hallucination, and Eval.

Win with data and evaluation loops, run multiple primary models, build trust with RAG and citations, and design the cost structure from the start. That is the practical answer for AI products in 2026.

With this, the three addendum articles (public, mobile, AI-product) of the “Case Studies” category are complete, letting you position your project along both the scale axis (startup / SaaS / enterprise) and the industry axis (public / mobile / AI-product).

I hope you’ll read the next article as well.