Measured savings across 11 LLMs — Claude Opus 4.7 to Gemini Flash.→ See per-model data
Get free API key →
Cost

Vision Tokens Are Expensive and Nobody Reads the Pricing Page

Claude charges (width x height) / 750 tokens per image. A 1920x1080 screenshot costs ~2,765 tokens on Opus 4.7. Here's what that means for agents that use screenshots routinely.

James Hollingsworth(Contributor)Published 5 min~830 words

You added image support to your app. You sent a 1,000×1,000 pixel screenshot to Claude. You were charged for 1,334 tokens.

You didn't write 1,334 tokens. You sent one image. But the model doesn't see images the way a human does. It converts them into token representations first, and that conversion is expensive in ways that the words "vision support" don't communicate.

How Anthropic Calculates Image Tokens

Claude's vision pricing follows a specific formula documented in Anthropic's API reference:

`` tokens = (width × height) / 750 ``

This is applied after the image is resized to fit within the model's maximum dimensions. For standard Claude models (Claude 3.5, Claude 3.7), the maximum is approximately 1,568 pixels on the long edge before resizing kicks in. For Claude Opus 4.7, the limit is higher. The model supports images up to approximately 4,784 tokens per image.

The official Anthropic documentation provides a reference table:

Image sizeToken count
200×200 px~53 tokens
1000×1000 px~1,334 tokens
1092×1092 px~1,590 tokens
For Claude Opus 4.7 specifically, larger images are supported without the standard resize ceiling:
  • 1920×1080 px → ~2,765 tokens
  • 2000×1500 px → ~4,000 tokens
  • These are not edge cases. A standard 1080p screenshot (the kind a browser automation agent might capture to verify a UI state) costs nearly 2,800 tokens on Opus 4.7. At Claude's current output pricing, that's the token equivalent of a paragraph of reasoning, paid just for the image ingestion.

    Where Vision Token Costs Compound

    Single images at small scale are not the issue. The issue is systems that use vision as a routine part of their workflow:

    Browser automation agents that take screenshots to verify navigation steps. A 20-step workflow with one screenshot per step sends 20 images. At 1,334 tokens each for 1000×1000 images, that's 26,680 tokens per run, before any text context.

    Document processing pipelines that convert PDFs to images before sending to the model. A 10-page PDF rendered at standard resolution can easily exceed 15,000 vision tokens, more than many systems' entire text context budget.

    UI testing systems that use vision models to verify component rendering. Continuous integration systems running 50 test cases per commit, each with 3–5 screenshots, accumulate vision token costs that dwarf the text token costs in the same pipeline.

    Multi-modal RAG systems that index product images alongside text. Retrieval returns N images plus text chunks. Each image in the retrieved set costs 1,000–1,500 tokens before the model reads the actual query.

    Strategies for Reducing Vision Token Spend

    Resize before sending. The formula is linear in pixel count. A 1000×1000 image at 1,334 tokens becomes ~334 tokens at 500×500. If your task doesn't require fine-grained detail (verifying that a button exists, checking that a form rendered, confirming a layout didn't break), resizing to 500×500 or smaller cuts costs by 75% with minimal accuracy impact.

    Crop to the region of interest. Sending a full-page screenshot when you care about a 200×300 pixel UI component wastes everything outside that region. Cropping to the component before sending reduces vision tokens proportionally.

    Use text extraction as a pre-filter. Many vision tasks are actually text extraction tasks. If your image contains structured text (a table, a form, a code block), extracting the text first (via OCR or a lighter vision call) and sending the extracted text to the main model is dramatically cheaper than sending the image directly.

    Cache vision representations for repeated images. Anthropic's prompt caching applies to image tokens the same way it applies to text tokens. If your system sends the same base screenshot repeatedly with different questions, prompt caching eliminates the repeated image token cost after the first call.

    Compress your text context to leave room. Vision tokens and text tokens share the same context budget. If your text context is bloated (large system prompts, accumulated conversation history, verbose few-shot examples), you're competing for budget against your images. Compressing text context gives vision tokens more headroom without hitting the model's limits.

    The Math on a Typical Agent Workflow

    Assume a browser agent running a 15-step task with:

  • One 1920×1080 screenshot per step: 15 × 2,765 = 41,475 vision tokens
  • A 5,000-token system prompt per call: 15 × 5,000 = 75,000 text tokens
  • 2,000 tokens of conversation history per call (growing): ~15,000–30,000 text tokens
  • Total per run: roughly 130,000–150,000 tokens. Vision is accounting for ~30% of that.

    Shrink each screenshot to 800×600 and apply gotcontext.ai compression to the system prompt and conversation history:

  • Vision: 15 × (800×600/750) = 15 × 640 = 9,600 tokens (down from 41,475)
  • System prompt compressed 10×: 500 tokens (down from 5,000)
  • History compressed 5×: 3,000–6,000 tokens
  • New total: ~13,100–16,100 tokens per run. That's an 89% reduction. Same task, same model, no logic changes.

    Vision tokens aren't optional if your application uses images. But how many you spend per image, and how much text context competes for the same budget, is entirely within your control.

    Compress your text context and give your vision budget room to breathe →

    Cite this

    Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.

    BibTeXbibtex
    @misc{vision-tokens-hidden-cost-multimodal-2026,
      title  = {Vision Tokens Are Expensive and Nobody Reads the Pricing Page},
      author = {James Hollingsworth},
      year   = {2026},
      month  = {May},
      url    = {https://www.gotcontext.ai/blog/vision-tokens-hidden-cost-multimodal},
      note   = {gotcontext.ai engineering blog.},
    }
    APAtext
    James Hollingsworth. (2026, May 8). Vision Tokens Are Expensive and Nobody Reads the Pricing Page. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/vision-tokens-hidden-cost-multimodal.

    Contribute