Show HN: Benchmarking VLMs vs. Traditional OCR
getomni.aiVision models have been gaining popularity as a replacement for traditional OCR. Especially with Gemini 2.0 becoming cost competitive with the cloud platforms.
We've been continuously evaluating different models since we released the Zerox package last year (https://github.com/getomni-ai/zerox). And we wanted to put some numbers behind it. So we’re open sourcing our internal OCR benchmark + evaluation datasets.
Full writeup + data explorer here: https://getomni.ai/ocr-benchmark
Github: https://github.com/getomni-ai/benchmark
Huggingface: https://huggingface.co/datasets/getomni-ai/ocr-benchmark
Couple notes on the methodology:
1. We are using JSON accuracy as our primary metric. The end goal is to evaluate how well each OCR provider can prepare the data for LLM ingestion.
2. This methodology differs from a lot of OCR benchmarks, because it doesn't rely on text similarity. We believe text similarity measurements are heavily biased towards the exact layout of the ground truth text, and penalize correct OCR that has slight layout differences.
3. Every document goes Image => OCR => Predicted JSON. And we compare the predicted JSON against the annotated ground truth JSON. The VLMs are capable of Image => JSON directly, we are primarily trying to measure OCR accuracy here. Planning to release a separate report on direct JSON accuracy next week.
This is a continuous work in progress! There are at least 10 additional providers we plan to add to the list.
The next big roadmap items are: - Comparing OCR vs. direct extraction. Early results here show a slight accuracy improvement, but it’s highly variable on page length.
- A multilingual comparison. Right now the evaluation data is english only.
- A breakdown of the data by type (best model for handwriting, tables, charts, photos, etc.)
The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.
I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.
Playing with local llama vision and minicpm-v models, they do seem resistant to what one might call blatant prompt injection. Ie just inserting one of the classic "ignore previous instructions" or similar.
So yeah, would be curious how susceptible they are to more refined approaches. Are there some known examples?
An interesting OCR aspect indeed; hence it's great that their OCR Benchmark is open source, allowing for the addition of such a category. Or maybe there are already separate OCR prompt-injection benchmarks.
Also, I'd be useful to understand how an OCR context differs from standard injection attacks. One thing I can think of is potential tabular injection attacks. But also image-based, especially for VLMs, are relevant. So a OCR injection attack benchmark might just be a combination of different domain-specific benchmarks formated as images.
As far as I'm concerned, all of these specialty services are dead compared to a generalized LLM like OpenAI or Gemini.
I wrote in a previous post about how NLP services were dead because of LLMs and obviously people in NLP took great offense to that. But I was able to use the NLP abilities of an LLM without needing to know anything about the intricacies of NLP or any APIs and it worked great. This post on OCR pretty much shows exactly what I meant. Gemini does OCR almost as good as OmniAI (granted I've never heard of it), but at 1/10th the cost. OpenAI will only get better very quickly. Kudos to OmniAI for releasing honest data, though.
Sure you might get an additional 5% accuracy from OmniAI vs Gemini but a generalized LLM can do so much more than just OCR. I've been playing with OpenAI this entire weekend and literally the sky's the limit. Not only can you OCR images, you can ask the LLM to summarize it, transform it into HTML, classify it, give a rating based on whatever parameters you want, get a lexile score, all in a single API call. Plus it will even spit out the code to do all of the above for you to use as well. And if it doesn't do what you need it to do right now, it will pretty soon.
I think the future of AI is going to be pretty bleak for everyone except the extremely big players that can afford to invest hundreds of billions of dollars. I also think there's going to be a real battle of copyright in less than 5 years which will also favor the big rich players as well.
5% accuracy can be worth a lot.
The price of any of these services pales in comparison to getting a human involved in any fraction of cases.
It is likely reasonable to expect the base LMs to keep getting better and for there to not be a moat on accuracy in the long term, but businesses are not just built on benchmark accuracy and have plenty of other ways to survive, even if the technology under the hood changes.
I think 87% to 92% accuracy really isn't much difference. You're still going to get errors to the point where the level and amount of checking you need to do isn't affected. Even at 98-99% you still have to do a lot of error checking.
But you get most of the bang for the buck for 1/10th the cost so I think overall it's far, far superior.
YES
>>5% accuracy can be worth a lot.
Most surprising to me about these results is the BEST error rate was over 8% errors (91.7% accuracy) and the worse was 40%.
Their method of calculating errors seems quite good:
>> Accuracy is measured by comparing the JSON output from the OCR/Extraction to the ground truth JSON. We calculate the number of JSON differences divided by the total number fields in the ground truth JSON. We believe this calculation method lines up most closely with a real world expectation of accuracy.
>> Ex: if you are tasked with extracting 31 values from a document, and make 4 mistakes, that results in an 87% accuracy.
Especially where dealing with numbers and money, having 10% of them being wrong seems unusable, often worse than doing nothing.
Having humans check the results instead of doing the transcriptions would be better, but humans are notoriously bad at maintaining vigilance doing the same task over many documents.
What would be interesting is finding which two OCR/AI systems make the most different mistakes and running documents against both. Flagging only the disagreements for human verification would reduce the task substantially.
> What would be interesting is finding which two OCR/AI systems make the most different mistakes and running documents against both. Flagging only the disagreements for human verification would reduce the task substantially.
There have been OCR products that do that for decades, and I would hope all the ocr startups are doing the same already. Often times something is objectively difficult to read and the various models will all fail in the same place, reducing the expected utility of this method. It still helps of course. I forget the name of the product, there was one that used about 5 ocr engines and would use consensus to optimize its output. It could never beat ABBYY finereader though, it was a distant second place.
Wouldn't an issue be that whilst for LLM's replacing NLP's you don't often care about the super rare hickup or hallucination.
Whilst where OCR's tend to be used it's often a no go.... Just saying this trying to remember all the places where I've implemented it or seen it implemented. A common one was billing stuff.
A big takeaway for me is that Gemini Flash 2.0 is a great solution to OCR, considering accessibility, cost, accuracy, and speed.
It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.
Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.
And from my personal experience with Gemini 2.0 flash Vs 2.0 pro is not even close
I had gemini 2.0 pro read my entire hand written, stain covered, half English, half french family cookbook perfectly first time
It's _crazy_ good. I had it output the whole thing in latex format to generate a printable document immediately too
I'm wondering how gemini can OCR big image correctly with good quality. They charge for image as input ~250 tokens. Always the same no matter the size of the image you send. 250 tokens its ~200 words. Will OCR work if you send 4k image that has a lot of text in small font? What if page will have more than 200 words? Are google selling it at cost?
I’m definitely not getting that takeaway. This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
VLMs are every bit as susceptible to the (unsolved) hallucination problem as regular LLMs are. I would not use them to do OCR on anything important because the failure modes are totally unbounded (unlike regular OCR).
> This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
Looks like they've got deterministic metrics to me: For each document they've got a ground truth set of JSON extracted data, and they use json-diff to calculate the fields that disagree.
There is GPT-4o in their evaluation pipeline - but only as a means of converting the OCRed document into their target JSON schema.
Also, what's strange is there's no free of paid OCR engine is added to the mix for the evaluation. Tessaract is built specially for OCR'in scanned documents, and it has a both traditional and neural network based modes, to boot.
> Also, what's strange is there's no free of paid OCR engine is added to the mix for the evaluation.
The article says they evaluated "Traditional OCR providers (Azure, AWS Textract, Google Document AI, etc.)"
Are those not paid OCR engines?
What is the best solution for recognizing handwritten text that combines multiple languages, especially in cases where certain letters look the same but represent different sounds? For example, the letter 'p' in English versus 'р' in Cyrillic languages, which sounds more like the English 'r'.
How does this compare to Marker https://github.com/VikParuchuri/marker?
I'm curious as well
What is the privacy of the documents for the cloud service? There’s nothing in the privacy policy about data sent over the api.
Then you have to assume by default that your data is visible to their employees, can be monetized, used to improve their models etc
I’ll stick with the devil I know that is only slightly worse in their own benchmarks (Google)
"San Francisco, California"
There is none, because CLOUD Act.
GPT-4o as a judge to evaluate the quality of something which gpt4o is not inherently that good at. Red flag.
OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...) not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so its not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text then use similar examples from the page to map them into their output values when the bounding box doesn't provide this for free.
That's a great point about the limitations of traditional OCR with rotated or poorly scanned documents. I agree that VLMs really shine when it comes to understanding context and extracting information beyond just the text itself. It's pretty cool how they can map implicit relationships, like those X-axis labels you mentioned.
Thank you for sharing this. Some of the other public models that we can host ourselves may perform in practice better than the models listed - e.g. Qwen 2.5 VL https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file
Does anyone have good experience with a particular pipeline for OCR-ing C source code?
Looking at the sample documents, this seems more focused on tables and structured data extraction and not long-form texts. The ground truth JSON has so much less information than the original document image. I would love to see a similar benchmark for full contents including long-form text and tables.
Indeed, from their conclusions:
> They [VLMs] are generally more capable of "looking past the noise" of scan lines, creases, watermarks. Traditional models tend to outperform on high-density pages (textbooks, research papers) as well as common document formats like tax forms.
Which is a bit confusing? Did they test that or what? It doesn't seem that way from their limited dataset.
Anyone have tried comparing with Qwen VL based model? I heard good things about its performance on ocr compared to other self hostable model, but haven't really tried benchmarking its performance
Yes I'd like to see this repeated with any of the small VLM's like IBM Granite or the HF Smols. Pretty much anything in the sub 7B range.
What VLMs do you use when you're listing OmniAI - is this mostly wrapping the model providers like your zerox repo?
Harmonic Loss converges more efficiently on MNIST OCR: https://github.com/KindXiaoming/grow-crystals .. "Harmonic Loss Trains Interpretable AI Models" (2025) https://news.ycombinator.com/item?id=42941954