Does GPT-4o use OCR for vision?

12th July 2024


This is a small silly experiment in reaction to a few comments like these:

"I think OpenAI is running an off-the-shelf OCR tool like Tesseract (or more likely some proprietary, state-of-the-art tool) and feeding the identified text into the transformer alongside the image data." (Oran Looney)

"I've been generally frustrated at the lack of analysis of vision LLMs generally." (simonw)

Hypothesis: GPT-4o uses OCR to augment its vision capabilities.

The idea to verify this goes like this: if GPT-4o runs an OCR step on images, the response time for image prompts should depend on the amount of text in the image. So I measure how response times scale with word count for image prompts compared to equivalent text-only prompts.

TLDR: There is no conclusive evidence. The results raise more questions than they answer.

The Setup

This experiment examines gpt-4o-2024-05-13.

Dataset

I generated a small synthetic dataset. You can find it here.

This is an example with 1000 words:

Note that I created a grid pattern for the background because I initially suspected that blank parts of images are cropped. More on this later.
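For reference, here is a minimal sketch of how such an image could be generated with Pillow. The image size, grid spacing, font, and word list below are assumptions for illustration, not the actual dataset script (which is linked above).

```python
# Sketch of a dataset image: a light grid background filled with words.
# Image size, grid spacing, word list, and layout are assumptions.
import random
from PIL import Image, ImageDraw

def make_sample(n_words: int, size: int = 1024, grid: int = 32) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    # Grid pattern so that blank regions can't simply be cropped away.
    for x in range(0, size, grid):
        draw.line([(x, 0), (x, size)], fill="lightgray")
    for y in range(0, size, grid):
        draw.line([(0, y), (size, y)], fill="lightgray")
    # Place n_words random words, wrapping line by line.
    x, y = 10, 10
    for _ in range(n_words):
        word = random.choice(["lorem", "ipsum", "dolor", "sit", "amet"])
        width = draw.textlength(word + " ")
        if x + width > size - 10:
            x, y = 10, y + 16
        draw.text((x, y), word, fill="black")
        x += width
    return img

make_sample(1000).save("sample_1000.png")
```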

Prompt

Each sample was measured twice: once as an image prompt and once as a plain-text prompt. The code is here.

The counting task was only included in case there is some kind of detection step that decides whether OCR is required; the counting results themselves weren't important.
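Roughly, one measurement per sample could look like the sketch below, using Python's requests. The prompt wording, request parameters, and helper names are assumptions, not the linked code.

```python
# Sketch of the timing measurement: one chat completion per sample,
# timed end to end. Prompt wording and parameters are assumptions.
import base64
import time
import requests

API_KEY = "sk-..."  # your OpenAI API key
URL = "https://api.openai.com/v1/chat/completions"

def timed_request(content) -> float:
    payload = {
        "model": "gpt-4o-2024-05-13",
        "messages": [{"role": "user", "content": content}],
    }
    headers = {"Authorization": f"Bearer {API_KEY}"}
    start = time.monotonic()
    requests.post(URL, json=payload, headers=headers).raise_for_status()
    return time.monotonic() - start

def time_text(words: str) -> float:
    return timed_request(f"Count the words in the following text:\n{words}")

def time_image(path: str) -> float:
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    return timed_request([
        {"type": "text", "text": "Count the words in this image."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ])
```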

The Result

The following plot visualizes the measured response times. The raw data is here. The X axis shows the step size, i.e. the number of words to count; at most, an image or text prompt contained 1000 words. The Y axis shows the total time of an API request, measured with Python's requests. The two lines are linear regressions over the two groups.
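The regression lines are straightforward to reproduce from the raw data; here is a sketch with assumed file and column names:

```python
# Sketch: fit and plot one linear regression per group (text vs. image).
# The CSV name and its columns ("mode", "words", "seconds") are assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("timings.csv")
for mode, group in df.groupby("mode"):
    slope, intercept = np.polyfit(group["words"], group["seconds"], deg=1)
    plt.scatter(group["words"], group["seconds"], s=10)
    xs = np.linspace(0, 1000, 2)
    plt.plot(xs, slope * xs + intercept,
             label=f"{mode}: {slope * 1000:.1f} s per 1000 words")
plt.xlabel("words to count")
plt.ylabel("response time [s]")
plt.legend()
plt.show()
```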

Some Observations:

The experiment is, of course, far from perfect: the sample size should be bigger, the step size larger, and the regression seems questionable.

The whole experiment cost roughly 1 € in API fees. Feel free to redo it better yourself.

Conclusion

The experiment neither confirms nor refutes the original hypothesis. It raises a different question, however:

Why do images with higher entropy have longer response times?

This difference isn't reflected in the pricing: each vision-based API request is billed with exactly 793 prompt tokens. This number can be roughly reproduced:
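Here is a rough reproduction, using the token accounting from OpenAI's vision documentation (85 base tokens plus 170 tokens per 512 px tile in high-detail mode, see the first reference). The image dimensions and the share attributed to the text prompt are assumptions.

```python
# Sketch: reproduce the ~793 billed prompt tokens from the documented
# image token formula. The 1024x1024 image size is an assumption.
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85
    # Scale to fit into 2048x2048, then shortest side down to 768 px.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(image_tokens(1024, 1024))  # -> 765 tokens for the image alone
# The remaining 793 - 765 = 28 tokens would then be the text part of the
# prompt, which can be checked with the tokenizer (second reference).
```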

The bottom line is that the number is the same for all requests.

My best guess to explain the difference in response time is that GPT-4o's image component adapts its computational effort to image complexity. Maybe something diffusion-related?

Perhaps the computational cost is actually just lower if there is less information.

Since I have no idea about the inner workings of GPT, I'll be happy to be enlightened by someone who does.

Bonus: How accurately did GPT count?

Here are some statistics in case you were wondering how accurately GPT counted. The table headers show the true number of words for each step. Each body row shows a single statistic over the 10 samples for that step.

The results somewhat speak against the hypothesis that GPT uses OCR. If that were the case, vision-based counting would likely be at least as accurate as text-based counting. However, vision-based counting turns out to be far worse.

Text-based Counting

#         0   100   200   300   400   500   600   700   800   900  1000
Best      0   100   200   304   400   500   635   673   749   929  1000
Avg       0   113   220   387   531   627   794   848   912   975  1017
Min       0   100   185   250   400   464   505   604   622   600   800
Max       0   167   272   500   765  1000  1024  1000  1107  1734  1370
Std       0    19    31    84    92   153   170   136   147   292   137

Image-based Counting

#         0   100   200   300   400   500   600   700   800   900  1000
Best      0   150   274   486   488   435   908   665   789   906  1000
Avg       0   221   460   654   751   801  1006   900  1191  1076  2158
Min       0   150   274   486   488   435   908   335   772   906   564
Max       0   300   601   933  1160  1040  1160  1507  1996  1452 13306
Std       0    42   103   147   249   176    68   350   377   157  3722
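For completeness, the statistics above could be reproduced from the raw per-sample counts along these lines; the CSV file and its column names are assumptions:

```python
# Sketch: aggregate per-step counting results into the tables above.
# Assumed columns: "mode" (text/image), "true_words", "counted".
import pandas as pd

df = pd.read_csv("counts.csv")

for mode, group in df.groupby("mode"):
    columns = {}
    for true, g in group.groupby("true_words"):
        counts = g["counted"]
        columns[true] = {
            "Best": counts.iloc[(counts - true).abs().argmin()],  # closest to truth
            "Avg": round(counts.mean()),
            "Min": counts.min(),
            "Max": counts.max(),
            "Std": round(counts.std()),
        }
    print(mode)
    print(pd.DataFrame(columns))  # columns = true word counts, rows = statistics
```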

References

  1. https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding (2024-07-12)
  2. https://platform.openai.com/tokenizer (2024-07-12)