remix logo

Hacker Remix

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

170 points by ses425500000 2 weeks ago | 38 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features: • Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision • Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English) • Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.) Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

bonoboTP 2 weeks ago

LLMs for OCR is super risky because just as much as they can fix OCR mistakes, they can inadvertently "fix" correct stuff too and hallucinate instead.

Its that xerox bug on steroids, where scanned pages would get their digits swapped by other digits...

I'd want to see some proper hallucination analysis.

fnordpiglet 2 weeks ago

I use tesseract which uses a LTSM OCR along with multimodal LLMs to converge to a ground truth. It works remarkably well. However for my purposes I don’t want a LLM explaining charts I want it to produce a vector format of the chart. There are a few models that produce Latex chart formats I’m experimenting with:

https://arxiv.org/pdf/2405.15306

Most OCR pipelines like this, along with excellent commercial ones like doctly.ai, are focused on OCR for LLM consumption - while I’d like to be able to recreate the original scientific work that predates digital typesetting in modern typeset - for yes LLM but also to preserve and promote science of yore, much of which includes discoveries forgotten but relevant still to problems we face today.

sureglymop 2 weeks ago

Also, what about prompt injection? With an LLM as far as I'm aware there is never a clear separation between instruction and the data to be processed.

ses425500000 2 weeks ago

Yeah, prompt injection is good point. For now, I try separate instruction and data by using JSON format, and run it in sandbox. Not perfect maybe, but I will try add small explanation in README so people can check it better.

sureglymop 1 week ago

In this case the result/output is plain text. Since it's not code it may be harder to imagine an attack vector. As an attacker, here would be some of my capabilities/possibilities:

- I could change the meaning of the output and the output entirely. - If I can control one part of a larger set of data that is analyzed , I could influence the whole output. - I could try to make the process take forever in order to waste resources.

I'd say the first scenario is most interesting, especially if I could then potentially also influence how an LLM trained on the output behaves and do even more damage using this down the line.

Let's say I'm a disgruntled website author. I want my users to see correct information on my website but don't want any LLM to be trained on it. In this case I could probably successfully use prompt injection to "poison" the model.

ses425500000 2 weeks ago

Yeah, hallucination part was also one thing I was worry about. So I make LLM only run after OCR step, and I put simple check to not change correct text. I will try to show real examples and hallucination rate too. Thanks for feedback!

This project was just hobby and my first time posting something. I didn’t imagine people would care this much… Next time I will prepare better before sharing.

bonoboTP 2 weeks ago

I didn't mean to target you specifically, just the general idea/trend of applying "smart priors" to do OCR. That is, a system that has a concept of what's plausible and may make the content more "plausible" instead of accurate. For example, an OCR system should be required to exactly recognize characters one by one, even including the typos. Sometimes even the presence of a comma or a small spelling variation can have significance. Or imagine running financial accounting stuff through LLM-OCR. And if you ask why would you OCR that instead of keeping digital records -- well, the real world can be very unreasonable and incompetent, and there are cases when e.g. the government only releases scanned PDFs on official sites regarding financial audit statistics etc.

themanmaran 2 weeks ago

> Never change the original language of any text. Keep Korean in Korean, Japanese in Japanese, and English in English.

I love the double prompting to keep GPT from translating the text. I've definitely had this problem before, and spent ages trying to prompt it into not randomly translating the text.

ses425500000 2 weeks ago

Yeah — I ran into that exact problem during early testing. The prompt has since been adjusted to prevent GPT from auto-translating non-English text (Korean, Japanese, etc.).

If it still misbehaves in any edge cases, feel free to open an issue on GitHub — happy to patch it up.

fmbb 2 weeks ago

What’s the use of using generative AI to OCR the text?

ses425500000 2 weeks ago

Great question — I’m using traditional OCR engines for the initial text extraction (e.g., MathPix, Google Vision), but then I apply generative AI models in a second stage to refine the output. This includes removing noisy or irrelevant elements, normalizing format inconsistencies, and improving alignment across multi-modal inputs.

In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.

So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.

novaRom 2 weeks ago

> Built With: DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and more.

the whole pipeline is not open source

ses425500000 2 weeks ago

Yep — some components currently rely on external APIs (e.g. OpenAI, MathPix), primarily for stability and ease of deployment during early release. But I’m planning to support fully local inference in the future to eliminate API key dependency.

The local pipeline would include:

• Tesseract or TrOCR for general OCR

• Pix2Struct, Donut, or DocTR for document structure understanding

• OpenAI CLIP for image-text semantic alignment

• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks

Goal is to make the system fully self-hostable for offline and private use.

sandreas 2 weeks ago

How does this compare against marker[1]?

1: https://github.com/VikParuchuri/marker

ses425500000 2 weeks ago

Thanks for sharing — Marker is a great tool, especially for human-readable formatting!

In contrast, this project focuses less on preserving the visual layout for human readers, and more on extracting structured semantic data for machine learning training.

So instead of optimizing for clean Markdown or HTML, it extracts context-aware elements like:

• table data as JSON,

• math expressions in LaTeX,

• diagrams with image descriptions,

• multilingual text segments,

• and semantic roles (e.g. “question”, “explanation”, etc.)

In short: Marker is great for reading, This is built for feeding into ML pipelines — especially for tasks like question-answering, diagram reasoning, or multimodal pretraining.

samstave 2 weeks ago

[dead]