LLM Comparison Review

LLM Cost Explorer

Compare estimated API costs across 25 models from Anthropic, OpenAI, Google, Groq, and open-source projects, at every max_tokens setting from 256 to 8192. Darker green means cheaper; red means more expensive. Hover any cell for a full cost breakdown. Use the filters below to highlight models best suited to your use case.

Model use cases

Model | Vendor | Cost | Best used for
claude-opus-4-7 | Anthropic | $$$$ | Complex reasoning, hard coding, deep research, agentic tasks, nuanced long-form writing, science 🔬
claude-opus-4-6 | Anthropic | $$$$ | Complex reasoning and analysis, research, difficult coding, detailed document generation, science 🔬
claude-sonnet-4-6 | Anthropic | $$$ | Balanced capability and cost: coding, business analysis, content creation, everyday tasks
claude-haiku-4-5 | Anthropic | $$ | Fast and cheap: classification, summarisation, chatbots, simple Q&A, high-volume pipelines
gpt-4o | OpenAI | $$$ | General purpose: strong coding, vision/image input, real-time applications, tool use
gpt-4o-mini | OpenAI | $ | Lightweight tasks: quick answers, classification, low-latency apps, cost-sensitive workloads
o3 | OpenAI | $$$$ | Deep reasoning: complex maths, advanced science, hard logic problems, rigorous research 🔬
gemini-2.5-pro | Google | $$$ | Long context and multimodal: large document analysis, video/image understanding, research, science 🔬
gemini-2.5-flash | Google | $ | Fast and efficient: summarisation, structured extraction, balanced everyday tasks
gemini-1.5-pro | Google | $$ | Very long documents: book-length context, large codebase analysis, extended conversations
llama-4-scout | Groq | $ | High-volume cheap inference: chatbots, simple generation, experimentation, prototyping
llama-4-maverick | Groq | $ | Moderate complexity: general Q&A, content drafting, open source flexibility
mixtral-8x7b | Groq | $ | Open source general purpose: moderate tasks, multilingual, good baseline for fine-tuning

Open Source / Self-hosted: $0 API cost (compute not included)

Model | Vendor | Cost | Best used for
llama-3.3-70b | Meta | Free* | Strong general-purpose open weights: coding, analysis, Q&A, content generation. Best balance of capability and size in Llama 3 family
llama-3.1-405b | Meta | Free* | Largest open weights Llama: complex reasoning, hard coding, long-form tasks, 128K context. Closest open source rival to GPT-4o
deepseek-r1 | DeepSeek | Free* | Open weights reasoning model: hard maths, advanced science 🔬, logic, competitive coding. Rivals o3 on many benchmarks at zero API cost
deepseek-v3 | DeepSeek | Free* | Strong open weights general/coding model: instruction following, code generation, analysis. Top-tier for a self-hosted model
qwen-2.5-72b | Qwen/Alibaba | Free* | Excellent open weights for coding and multilingual tasks: code generation, translation, structured output, STEM
mistral-7b | Mistral | Free* | Lightweight and fast open weights: simple tasks, classification, high-volume pipelines, prototyping. Very resource-efficient
phi-4 | Microsoft | Free* | Small but capable open weights: strong reasoning and Q&A despite compact size. Ideal for edge or embedded inference

$ = under $1/1M output tokens  ·  $$ = $1–$6/1M  ·  $$$ = $6–$18/1M  ·  $$$$ = over $18/1M output tokens  ·  Free* = self-hosted, $0 API cost (compute/hosting costs apply)
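
As a rough illustration, the cost legend can be written as a small lookup function. This is a sketch only: the function name `cost_tier` and the `self_hosted` flag are hypothetical, and because the legend lists $6 in both the $$ and $$$ ranges, the code assumes a price of exactly $6 belongs to $$$.

```python
def cost_tier(output_price_per_million: float, self_hosted: bool = False) -> str:
    """Map a price in USD per 1M output tokens to the legend's tier symbol."""
    if self_hosted:
        # Self-hosted models carry no API fee; compute costs are not modelled here.
        return "Free*"
    if output_price_per_million < 1:
        return "$"
    # Assumption: exactly $6 is treated as $$$, since the legend is ambiguous there.
    if output_price_per_million < 6:
        return "$$"
    # "over $18" means $18 itself still counts as $$$.
    if output_price_per_million <= 18:
        return "$$$"
    return "$$$$"


if __name__ == "__main__":
    # Hypothetical prices, chosen only to exercise each tier.
    for price in (0.60, 5.00, 15.00, 75.00):
        print(f"${price:.2f}/1M output tokens -> {cost_tier(price)}")
```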

Columns — max_tokens reference guide

max_tokens | Approx. words | Typical use cases
256 | ~190 words | Short answer, single paragraph, yes/no with brief explanation, quick translation
512 | ~380 words | Half a page, short email reply, basic code function, brief summary
1024 | ~750 words | Full page, detailed explanation, short code file, structured list response
2048 | ~1,500 words | 2–3 pages, short report, medium code review, multi-step reasoning answer
4096 | ~3,000 words | Long article, full code module, detailed analysis, complex creative writing
8192 | ~6,000 words | Book chapter, large codebase review, extensive research response, full document draft

1 token ≈ 0.75 words in English. Word counts are approximate and vary by content type. max_tokens sets the maximum response length — the model may respond with fewer tokens.
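
The word counts in the reference guide follow from the 0.75 words-per-token rule of thumb. A minimal sketch of that arithmetic is below; the helper name `approx_words` is hypothetical, and the guide appears to round the raw results down to friendlier figures (for example, 1024 tokens gives 768 words, listed as ~750).

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, as stated above


def approx_words(max_tokens: int) -> int:
    """Estimate the longest response, in English words, that fits in max_tokens."""
    return round(max_tokens * WORDS_PER_TOKEN)


if __name__ == "__main__":
    for max_tokens in (256, 512, 1024, 2048, 4096, 8192):
        print(f"{max_tokens:>5} tokens is roughly {approx_words(max_tokens):>5} words")
```

Actual ratios vary with language and content type (code and non-English text usually need more tokens per word), so treat the result as an upper-bound guide rather than a guarantee.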

Assumptions used in this chart

Input tokens per run: 500 tokens (typical prompt size)
Claude effort rows: max effort applies a 1.8× output token multiplier vs high (default)
Adaptive thinking rows: adds an estimated +1200 thinking tokens (billed as output tokens)
Effort multipliers: representative estimates; Anthropic does not publish exact per-effort token counts
Open source models: shown as $0.00000 / Free, with no API fee when self-hosted. Actual cost depends on your compute (GPU cloud, local hardware, etc.)
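
One way these assumptions might combine into a per-run estimate is sketched below. It is not the chart's actual formula. The names `Price` and `estimate_cost` are hypothetical, the example prices are placeholders rather than real vendor prices, and the code assumes the response uses the full max_tokens budget and that thinking tokens are added after the effort multiplier rather than scaled by it.

```python
from dataclasses import dataclass

INPUT_TOKENS_PER_RUN = 500        # typical prompt size assumed by the chart
MAX_EFFORT_MULTIPLIER = 1.8       # "max" effort vs "high" (default)
ADAPTIVE_THINKING_TOKENS = 1200   # estimated thinking tokens, billed as output


@dataclass
class Price:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def estimate_cost(price: Price, max_tokens: int,
                  max_effort: bool = False,
                  adaptive_thinking: bool = False) -> float:
    """Estimate the USD cost of one run under the stated assumptions."""
    # Assumption: the model uses its whole max_tokens budget (worst case).
    output_tokens = float(max_tokens)
    if max_effort:
        output_tokens *= MAX_EFFORT_MULTIPLIER
    # Assumption: thinking tokens are added on top, not scaled by the multiplier.
    if adaptive_thinking:
        output_tokens += ADAPTIVE_THINKING_TOKENS
    input_cost = INPUT_TOKENS_PER_RUN * price.input_per_million / 1_000_000
    output_cost = output_tokens * price.output_per_million / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    placeholder = Price(input_per_million=3.0, output_per_million=15.0)  # not real prices
    self_hosted = Price(input_per_million=0.0, output_per_million=0.0)   # $0 API fee
    for max_tokens in (256, 1024, 8192):
        base = estimate_cost(placeholder, max_tokens)
        thinking = estimate_cost(placeholder, max_tokens, adaptive_thinking=True)
        print(f"max_tokens={max_tokens}: base ${base:.5f}, with thinking ${thinking:.5f}")
    print(f"self-hosted: ${estimate_cost(self_hosted, 1024):.5f}")
```

Because real responses often stop short of max_tokens, figures produced this way are better read as an upper bound per run than as an expected bill.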

⚠️ DISCLAIMER

Prices shown are estimates based on publicly available information and may be outdated or incorrect.

Actual costs vary depending on your usage, prompt caching, batch discounts, free tiers, and each vendor’s current pricing.

Always verify current pricing directly with each vendor before making any financial or architectural decisions.

This chart is for indicative comparison purposes only and should not be treated as a source of truth.