Why pasting screenshots into Claude eats your tokens (and how OCR fixes it)
Anthropic prices Claude images at roughly (width × height) / 750 tokens. A 1080p screenshot of a stack trace costs around 1,568 tokens on Sonnet and about 2,765 on Opus, where larger captures can reach 4,784. The same text typed out is 150–300 tokens. For text-heavy screenshots (terminal output, errors, log dumps, PDF excerpts), feeding Claude the OCR'd text instead is 5–10× cheaper and the answer is usually just as good. Maus OCRs every copied screenshot automatically and locally, so the text version is already in your clipboard before you need it.
How Claude charges for images
Anthropic publishes the formula in the vision docs:
- tokens ≈ (width × height) / 750
- Sonnet 4.6 and earlier: long edge capped at 1568px → max 1,568 tokens per image
- Opus 4.7 / 4.8: long edge capped at 2576px → max 4,784 tokens per image
Anything larger gets downscaled before counting, so it costs at most the cap. The math is mechanical: you can run the numbers yourself for any screenshot.
| Screenshot | Pixels | Sonnet tokens | Opus tokens |
|---|---|---|---|
| Small UI snippet | 200 × 200 | ~54 | ~54 |
| Single-monitor window | 800 × 600 | ~640 | ~640 |
| Full 1080p screen | 1920 × 1080 | 1,568 (capped) | ~2,765 |
| Retina screen (4K) | 3840 × 2160 | 1,568 (capped) | 4,784 (capped) |
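The rows above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the simplified model described in this article: area divided by 750, rounded up, then limited to a per-model cap. The function name and the cap constants are illustrative and use the article's figures rather than values read from an API. Real resizing also depends on aspect ratio, so treat the output as an estimate.

```python
# Per-image token caps as quoted in this article (assumptions, not API constants).
SONNET_CAP = 1568
OPUS_CAP = 4784


def estimate_image_tokens(width: int, height: int, cap: int) -> int:
    """Estimate image tokens as ceil(width * height / 750), limited to cap."""
    raw = (width * height + 749) // 750  # integer ceiling division
    return min(raw, cap)


if __name__ == "__main__":
    screenshots = {
        "Small UI snippet": (200, 200),
        "Single-monitor window": (800, 600),
        "Full 1080p screen": (1920, 1080),
        "Retina screen (4K)": (3840, 2160),
    }
    for label, (w, h) in screenshots.items():
        sonnet = estimate_image_tokens(w, h, SONNET_CAP)
        opus = estimate_image_tokens(w, h, OPUS_CAP)
        print(f"{label}: {w}x{h} -> Sonnet ~{sonnet}, Opus ~{opus}")
```

Passing the cap as a parameter keeps the estimate usable if the limits change for a future model.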
What the same content costs as text
Take a typical stack trace screenshot — a 1080p window with 15 lines of output. Pasted as an image: ~1,568 tokens on Sonnet. Typed as plain text: ~200 tokens. Same information, ~8× cheaper.
The ratio holds for almost everything devs screenshot:
- Terminal output (one screen ≈ 50 lines ≈ 600–1,000 chars ≈ 200–300 tokens)
- Error dialogs (one paragraph ≈ 30–60 tokens)
- Compiler output (10 errors ≈ 300–500 tokens)
- Log lines from a tail (one screen ≈ 200–400 tokens)
- Code snippets in a screenshot (one function ≈ 100–300 tokens)
For all of these, the image is doing zero extra work for Claude. The model still has to read the text out of the pixels, so you're paying for an image that resolves back to the same content it would have read directly from text.
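To compare the two costs for your own captures, a rough estimate is enough. The sketch below assumes about 3 characters per token, which is the low end of the terminal-output figures above (600 chars ≈ 200 tokens). That ratio is a heuristic, not a real tokenizer, and both function names are hypothetical.

```python
import math


def estimate_text_tokens(text: str, chars_per_token: float = 3.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def savings_ratio(image_tokens: int, text_tokens: int) -> float:
    """How many times more the image costs than the equivalent text."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return image_tokens / text_tokens


if __name__ == "__main__":
    # The stack trace example: 1,568 image tokens versus roughly 200 text tokens.
    print(f"Image costs {savings_ratio(1568, 200):.1f}x the text")

    # Any OCR'd text can be estimated the same way.
    sample = "Traceback (most recent call last):\n" + '  File "app.py", line 42, in main\n' * 14
    print(estimate_text_tokens(sample), "estimated tokens")
```

Code and log output tend to tokenize less efficiently than plain English, which is why the default here is lower than the common 4-characters-per-token rule of thumb.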
When this actually matters
For a one-off question, the token cost is negligible. The cases where it bites:
- Long debug sessions. 20 screenshots of terminal output across an hour of iterating = 30,000+ tokens just on images. That's burning your context budget on rendered pixels.
- Multi-file refactors. Pasting screenshots of file contents instead of the file text fills the context faster, leaving less room for Claude's own reasoning before it has to summarize and lose detail.
- Repeated similar screenshots. The same error dialog screenshotted 10 times costs 10× the tokens. Same text shown 10 times still costs ~10× the text tokens — but the text is the floor, the image is the ceiling.
- API costs. If you're hitting the API directly (Claude Code, Cursor on the API tier, your own scripts), image tokens are billable tokens. The bill scales with the screenshot habit.
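The long-session arithmetic is easy to check. This sketch uses the article's per-screenshot figures (1,568 image tokens, about 200 text tokens) and assumes a 200,000-token context window. That window size is an assumption for illustration, so substitute the limit of the model you actually use. `SessionCost` and `session_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SessionCost:
    image_total: int      # tokens spent if every capture is sent as an image
    text_total: int       # tokens spent if every capture is sent as text
    saved: int            # difference between the two
    context_share: float  # fraction of the context window the images would use


def session_cost(
    screenshots: int,
    image_tokens: int = 1568,       # article's Sonnet figure for a 1080p capture
    text_tokens: int = 200,         # article's estimate for the same text
    context_window: int = 200_000,  # assumed window size, adjust per model
) -> SessionCost:
    """Compare the total cost of sending captures as images versus as text."""
    image_total = screenshots * image_tokens
    text_total = screenshots * text_tokens
    return SessionCost(
        image_total=image_total,
        text_total=text_total,
        saved=image_total - text_total,
        context_share=image_total / context_window,
    )


if __name__ == "__main__":
    cost = session_cost(20)
    print(f"Images: {cost.image_total} tokens, text: {cost.text_total} tokens")
    print(f"Saved: {cost.saved} tokens ({cost.context_share:.1%} of context on images)")
```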
When the image is still the right call
OCR'd text isn't a strict replacement. Keep the image when:
- Layout matters. UI mockups, design feedback, "why is this misaligned" — Claude needs to see the geometry.
- Visual hierarchy is the question. "Which button should be primary?" needs the visual, not the text.
- Charts and diagrams. A flame graph or a flow diagram is mostly information about relationships, not text.
- Mixed content where you need both. Some screenshots have text plus a diagram. Send both — the image and the OCR'd text — and Claude picks what's relevant.
For everything else where the screenshot is text wearing a pixel costume, the text version wins.
How automatic OCR fits
The friction has always been this: by the time you've decided "I should send the text instead of the image", you've already taken the screenshot, and retyping a 15-line traceback is annoying. macOS Live Text handles this if you save the screenshot to a file and open it in Preview, but most devs don't. They hit ⌘⇧⌃4 and send it straight to the clipboard.
Maus does this automatically. Any time you copy a screenshot (or any image with text), Maus runs OCR using Apple's Vision framework — locally, no upload — and adds the recognized text as a separate clipboard item right below the image. Two clips, both available. Next paste, you choose:
- The image if layout matters.
- The text if you want to save tokens.
No setup. No "convert this screenshot" step. The text version is just there, in your history, searchable.
Three concrete workflows
1. Pasting a terminal error into Claude Code
You see a stack trace in Warp. Old workflow: take a screenshot, paste image into Claude Code. ~1,500 tokens. New workflow: ⌘⇧⌃4 to capture, paste the OCR'd text instead. ~150 tokens. Same answer.
Even simpler: don't screenshot at all. Just select the terminal text and copy. But if you've already screenshotted (faster for partial selections on a busy terminal), the OCR'd text is the cheap path.
2. Feeding Claude an excerpt from a PDF
PDFs in Preview support text selection, but scanned pages and figures don't, and tables often don't select cleanly. Screenshot the section, let Maus OCR it, and paste the text into Claude. This works for anything you can see on screen, including image-only PDFs.
3. Capturing a code snippet from a video
Conference talk, screencast, tutorial. Pause, screenshot the code shown on screen, and the OCR'd text lands in your clipboard. Paste into Cursor or Claude to ask "explain this" or "port this to Rust". No transcription.
Privacy and accuracy
Apple Vision (the OCR engine in Maus and Live Text) runs entirely on-device. Nothing about the screenshot leaves your Mac. For internal logs, error messages with paths, customer data — this matters. Cloud OCR (Google Vision, AWS Textract) uploads the image; for sensitive content, that's a no.
Accuracy is high for clean rendered text (terminal output, IDE code, web pages). It's lower for handwriting, decorative fonts, or photos of screens with reflections. For the large majority of dev screenshots, it's accurate enough that pasting the text into Claude gives the same answer as pasting the image.
FAQ
How many tokens does a screenshot cost in Claude?
Roughly (width × height) / 750. Capped at 1,568 on Sonnet 4.6 and earlier, and at 4,784 on Opus 4.7/4.8. A 1080p screenshot hits the Sonnet cap and costs about 2,765 on Opus. The same text is usually 150–300 tokens.
Does OCR change Claude's answer quality?
For text-heavy screenshots — terminal output, stack traces, code — no. The image and the text resolve to the same content. Keep the image when layout, hierarchy, or geometry is the question.
Is screenshot OCR private?
Depends on the tool. macOS Live Text and Maus use Apple Vision locally — the image never leaves your Mac. Cloud OCR services upload it. Use local OCR for anything sensitive.
Why not just retype the text from the screenshot?
You can. But 15 lines of traceback is 20+ seconds and a typo risk. Automatic OCR makes the text version available the same moment the screenshot lands.
What about Claude Code's image attachment?
Same cost model. Image tokens scale with pixels; text tokens scale with characters. For long sessions where context budget matters, feeding text instead of images is what keeps you under it.
Stop paying for pixels when text is what Claude needs
Maus runs OCR on every copied screenshot, locally, using Apple Vision. The text version sits in your clipboard ready to paste. Free with 24h history. Pro $12.99 once for unlimited.