Retrieval-Augmented Generation (RAG) is the most common pattern for knowledge-grounded AI agents, but most teams track only the inference costs and miss the embedding side entirely.
The Two Cost Centers
Embedding Costs (Ingestion)
Every document you add to your knowledge base must be embedded. OpenAI's text-embedding-3-small costs $0.02 per million tokens. That sounds cheap until you're processing 100,000 documents averaging 2,000 tokens each: 200 million tokens, or $4.00 just for initial ingestion. Re-index after every update and that cost multiplies.
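The arithmetic above is easy to fold into a helper. A minimal sketch, using the published text-embedding-3-small price; the corpus size and average document length are the article's illustrative numbers:

```python
# Sketch: estimate the one-time cost of embedding a corpus.
EMBED_PRICE_PER_M_TOKENS = 0.02  # text-embedding-3-small, USD per 1M tokens

def embedding_cost(num_docs: int, avg_tokens_per_doc: int,
                   price_per_m: float = EMBED_PRICE_PER_M_TOKENS) -> float:
    """USD cost to embed the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens * price_per_m / 1_000_000

print(embedding_cost(100_000, 2_000))  # the 100K-doc example above -> 4.0
```

The same function prices a re-index pass: just pass in the number of changed documents instead of the full corpus.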
Inference Costs (Query Time)
Each user query triggers three billable steps: (1) embed the query (roughly $0.000002 for a short query), (2) retrieve the most relevant chunks from the vector store, and (3) send the chunks plus the query to the LLM ($0.02-0.10 per query, depending on context size and model).
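Steps (1) and (3) can be combined into a per-query cost model. A sketch under assumed prices: the embedding rate is the published text-embedding-3-small price, while the LLM input/output rates and the token counts are hypothetical placeholders:

```python
# Sketch: per-query cost model for a RAG query (prices/sizes are assumptions).
EMBED_PRICE = 0.02 / 1_000_000   # USD per token, text-embedding-3-small
IN_PRICE = 3.00 / 1_000_000      # assumed LLM input price per token
OUT_PRICE = 15.00 / 1_000_000    # assumed LLM output price per token

def query_cost(query_tokens: int, context_tokens: int,
               response_tokens: int) -> float:
    embed = query_tokens * EMBED_PRICE                       # step 1
    llm_in = (context_tokens + query_tokens) * IN_PRICE      # step 3, input side
    llm_out = response_tokens * OUT_PRICE                    # step 3, output side
    return embed + llm_in + llm_out

# e.g. a 100-token query, 4,000 tokens of retrieved chunks, a 500-token answer
cost = query_cost(100, 4_000, 500)  # ~ $0.0198, inside the $0.02-0.10 range
```

Note how the embedding term is three to four orders of magnitude smaller than the LLM term, which is the whole point of the next section.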
Tracking Both in AgentBurn
```python
# Track embedding costs during ingestion
ingest_event(
    agent_id="rag-indexer",
    provider="openai",
    model="text-embedding-3-small",
    operation="embedding",
    input_tokens=token_count,
    cost_usd=token_count * 0.02 / 1_000_000,
)
```
```python
# Track inference costs at query time
ingest_event(
    agent_id="rag-query-agent",
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    operation="llm_call",
    input_tokens=context_tokens + query_tokens,
    output_tokens=response_tokens,
    cost_usd=calculated_cost,
)
```
Illustrative Cost Breakdown
For a hypothetical RAG system processing 50K documents and serving 1,000 queries/day, based on published API pricing:
- Initial embedding: A few dollars (one-time) — embeddings are cheap per call
- Daily re-indexing (10% updates): Pennies per day
- Daily inference: The dominant cost — each query sends retrieved chunks to an LLM, and this scales linearly with query volume
- Key takeaway: Inference almost always dwarfs embedding costs at query volumes above a few hundred per day
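To make the takeaway concrete, here is a back-of-the-envelope comparison for the hypothetical system above (50K documents, 10% re-indexed daily, 1,000 queries/day). The embedding price is the published one; the LLM prices and per-query token counts are assumptions:

```python
# Sketch: daily embedding vs inference cost for the hypothetical system.
EMBED_PRICE = 0.02 / 1_000_000   # USD/token, text-embedding-3-small
IN_PRICE = 3.00 / 1_000_000      # assumed LLM input price per token
OUT_PRICE = 15.00 / 1_000_000    # assumed LLM output price per token

docs, avg_doc_tokens = 50_000, 2_000
queries_per_day = 1_000

# Daily re-indexing: 10% of the corpus re-embedded each day.
reindex_cost = docs * 0.10 * avg_doc_tokens * EMBED_PRICE

# Daily inference: assumed 4,100 input tokens and 500 output tokens per query.
per_query = 4_100 * IN_PRICE + 500 * OUT_PRICE
inference_cost = queries_per_day * per_query

print(f"re-index: ${reindex_cost:.2f}/day, inference: ${inference_cost:.2f}/day")
```

With these assumptions, re-indexing runs about $0.20/day while inference runs about $19.80/day, roughly a 100x gap, which matches the takeaway above.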
Optimization Strategies
- Reduce context window — Send 3 relevant chunks instead of 10. Each chunk you remove saves input tokens on every single query
- Use cheaper models for simple questions — Route factual lookups to Haiku/Flash, complex analysis to Sonnet
- Cache frequent queries — Many RAG systems see 30% query repetition
- Incremental indexing — Only re-embed changed documents, not the full corpus
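The caching strategy is the quickest win to prototype. A minimal in-memory sketch; a production system would likely use Redis with TTLs or semantic-similarity matching, and `expensive_rag_answer` is a hypothetical stand-in for the retrieve-plus-LLM pipeline:

```python
# Sketch: naive exact-match cache for repeated RAG queries.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def expensive_rag_answer(normalized_query: str) -> str:
    # Placeholder for the real retrieve + LLM call (the costly part).
    return f"answer to: {normalized_query}"

def ask(query: str) -> str:
    # Normalizing (trim + lowercase) lifts the hit rate on near-duplicates.
    return expensive_rag_answer(query.strip().lower())

ask("What is our refund policy?")
ask("  what is our refund policy?")  # cache hit: same normalized key
```

At the 30% repetition rate mentioned above, even this exact-match cache removes roughly a third of inference spend for the repeated queries it catches.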
AgentBurn's provider breakdown immediately shows the embedding vs inference split, making it clear where optimization effort should focus.