AI APIs charge based on the number of tokens processed. Understanding how pricing works helps you choose the right model and optimize your costs significantly.
| Type | What It Includes | Typical Cost Ratio |
|---|---|---|
| Input tokens | The prompt you send: system instructions, conversation history, user messages, context documents | Cheaper (1x baseline) |
| Output tokens | The response the model generates: answers, code, summaries, completions | More expensive (3–5x input price) |
Output tokens are priced higher because generating text is computationally more expensive than reading it. This means long responses cost disproportionately more than long prompts.
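To make the ratio concrete, here is a minimal per-request cost sketch, using the GPT-4.1 list prices from the table below as illustrative defaults:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float = 2.00, output_price: float = 8.00) -> float:
    """Cost in dollars; prices are per 1M tokens (GPT-4.1 list prices as defaults)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A 1,000-token prompt with a 1,000-token reply: the reply costs 4x as much.
print(request_cost(1_000, 1_000))  # 0.01 ($0.002 input + $0.008 output)
```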
| Model | Provider | Input (per 1M) | Output (per 1M) | Best For |
|---|---|---|---|---|
| GPT-4.1 | OpenAI | $2.00 | $8.00 | General purpose, tool use |
| GPT-4.1 mini | OpenAI | $0.40 | $1.60 | Balanced cost & capability |
| GPT-4.1 nano | OpenAI | $0.10 | $0.40 | High-volume, cost-sensitive tasks |
| o3 | OpenAI | $10.00 | $40.00 | Advanced reasoning, complex tasks |
| o4-mini | OpenAI | $1.10 | $4.40 | Efficient reasoning at lower cost |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | Complex reasoning, long context |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | Balanced performance & cost |
| Claude Haiku 3.5 | Anthropic | $0.80 | $4.00 | Fast, lightweight tasks |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | Long context, multimodal |
| Gemini 2.5 Flash | Google | $0.15 | $0.60 | Fast, cheap, high throughput |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | Cost-efficient general tasks |
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | Cost-efficient reasoning |
| Qwen 3 235B | Alibaba | $0.40 | $1.60 | Open-weight, large scale |
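Applying these list prices to a fixed workload shows how wide the spread is. The 50M-input / 10M-output monthly volume below is an assumed example, not a benchmark:

```python
# Dollars per 1M tokens (input, output), copied from the table above.
PRICES = {
    "GPT-4.1": (2.00, 8.00),
    "GPT-4.1 nano": (0.10, 0.40),
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
}

def monthly_cost(model: str, input_millions: float = 50,
                 output_millions: float = 10) -> float:
    """Monthly cost in dollars for a given token volume (in millions of tokens)."""
    input_price, output_price = PRICES[model]
    return input_millions * input_price + output_millions * output_price

for model in PRICES:
    print(f"{model}: ${monthly_cost(model):,.2f}")
```

For this workload the same traffic costs $180/month on GPT-4.1 but $9/month on GPT-4.1 nano, which is why routing simple tasks to small models matters so much.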
Tokens are the basic units of text that language models process. In English, 1 token is roughly 0.75 words or 4 characters. Numbers, punctuation, and common words often map to single tokens, while rare words may split into multiple tokens.
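That rule of thumb supports a quick back-of-envelope estimator. Exact counts require the provider's tokenizer (e.g. OpenAI's tiktoken), but a character-based sketch is often close enough for budgeting:

```python
def estimate_tokens(text: str) -> int:
    # Heuristic from above: roughly 4 characters per token for English text.
    # Use the provider's tokenizer when an exact count matters.
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, how are you today?"))  # 25 chars -> ~6 tokens
```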
| Strategy | Impact | How To |
|---|---|---|
| Use a smaller model for simple tasks | Up to 20x cheaper | Route classification, summarization, extraction to GPT-4.1 nano / Gemini 2.5 Flash |
| Keep system prompts concise | 10–30% savings | Every extra word in the system prompt is charged on every request |
| Limit output length | Large savings | Use max_tokens to cap responses; instruct the model to be concise |
| Cache repeated context | Up to 90% savings on input | Use prompt caching (Anthropic, OpenAI) for static system prompts or documents |
| Batch requests | Up to 50% savings | Use Batch API (OpenAI, Anthropic) for non-real-time workloads |
| Trim conversation history | Reduces input tokens | Only send the last N turns of history instead of the full conversation |
| Self-host open-weight models | Eliminate per-token cost | Run Llama 3.1 on your own GPU infrastructure for high-volume usage |
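The history-trimming strategy can be sketched as a small helper. The message shape follows the common chat-completions format, and `keep_last` is an assumed tuning knob:

```python
def trim_history(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system prompt plus only the last `keep_last` conversation messages."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-keep_last:]
    return system + recent

# 1 system message + 10 turns trimmed to 1 + 6 before each API call.
history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(10)]
print(len(trim_history(history)))  # 7
```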
Picking the right model for each task is the single highest-leverage cost optimization. A common architecture is a model routing pattern: classify each request by difficulty first (using a cheap model), then escalate only complex ones to a premium model. This can cut average cost by 60–80% with little quality loss.
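A minimal sketch of that router, with a stand-in keyword classifier; in practice `classify` would itself call a cheap model, and the model names here are just examples from the table above:

```python
def classify(request: str) -> str:
    # Stand-in heuristic; a real router asks a cheap model to label difficulty.
    hard_markers = ("prove", "refactor", "multi-step", "analyze")
    return "complex" if any(m in request.lower() for m in hard_markers) else "simple"

def route(request: str) -> str:
    # Escalate only complex requests to the premium model.
    return "o3" if classify(request) == "complex" else "gpt-4.1-nano"

print(route("What's the capital of France?"))    # gpt-4.1-nano
print(route("Prove this algorithm terminates."))  # o3
```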