Hacker Remix

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

170 points by ses425500000 3 months ago | 38 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features: • Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision • Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English) • Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.) Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

bonoboTP 3 months ago

LLMs for OCR is super risky because just as much as they can fix OCR mistakes, they can inadvertently "fix" correct stuff too and hallucinate instead.

Its that xerox bug on steroids, where scanned pages would get their digits swapped by other digits...

I'd want to see some proper hallucination analysis.

fnordpiglet 3 months ago

I use tesseract which uses a LTSM OCR along with multimodal LLMs to converge to a ground truth. It works remarkably well. However for my purposes I don’t want a LLM explaining charts I want it to produce a vector format of the chart. There are a few models that produce Latex chart formats I’m experimenting with:

https://arxiv.org/pdf/2405.15306

Most OCR pipelines like this, along with excellent commercial ones like doctly.ai, are focused on OCR for LLM consumption - while I’d like to be able to recreate the original scientific work that predates digital typesetting in modern typeset - for yes LLM but also to preserve and promote science of yore, much of which includes discoveries forgotten but relevant still to problems we face today.

sureglymop 3 months ago

Also, what about prompt injection? With an LLM as far as I'm aware there is never a clear separation between instruction and the data to be processed.

ses425500000 3 months ago

Yeah, prompt injection is good point. For now, I try separate instruction and data by using JSON format, and run it in sandbox. Not perfect maybe, but I will try add small explanation in README so people can check it better.

sureglymop 3 months ago

In this case the result/output is plain text. Since it's not code it may be harder to imagine an attack vector. As an attacker, here would be some of my capabilities/possibilities:

- I could change the meaning of the output and the output entirely. - If I can control one part of a larger set of data that is analyzed , I could influence the whole output. - I could try to make the process take forever in order to waste resources.

I'd say the first scenario is most interesting, especially if I could then potentially also influence how an LLM trained on the output behaves and do even more damage using this down the line.

Let's say I'm a disgruntled website author. I want my users to see correct information on my website but don't want any LLM to be trained on it. In this case I could probably successfully use prompt injection to "poison" the model.

ses425500000 3 months ago

Yeah, hallucination part was also one thing I was worry about. So I make LLM only run after OCR step, and I put simple check to not change correct text. I will try to show real examples and hallucination rate too. Thanks for feedback!

This project was just hobby and my first time posting something. I didn’t imagine people would care this much… Next time I will prepare better before sharing.

bonoboTP 3 months ago

I didn't mean to target you specifically, just the general idea/trend of applying "smart priors" to do OCR. That is, a system that has a concept of what's plausible and may make the content more "plausible" instead of accurate. For example, an OCR system should be required to exactly recognize characters one by one, even including the typos. Sometimes even the presence of a comma or a small spelling variation can have significance. Or imagine running financial accounting stuff through LLM-OCR. And if you ask why would you OCR that instead of keeping digital records -- well, the real world can be very unreasonable and incompetent, and there are cases when e.g. the government only releases scanned PDFs on official sites regarding financial audit statistics etc.

themanmaran 3 months ago

> Never change the original language of any text. Keep Korean in Korean, Japanese in Japanese, and English in English.

I love the double prompting to keep GPT from translating the text. I've definitely had this problem before, and spent ages trying to prompt it into not randomly translating the text.

ses425500000 3 months ago

Yeah — I ran into that exact problem during early testing. The prompt has since been adjusted to prevent GPT from auto-translating non-English text (Korean, Japanese, etc.).

If it still misbehaves in any edge cases, feel free to open an issue on GitHub — happy to patch it up.

fmbb 3 months ago

What’s the use of using generative AI to OCR the text?

ses425500000 3 months ago

Great question — I’m using traditional OCR engines for the initial text extraction (e.g., MathPix, Google Vision), but then I apply generative AI models in a second stage to refine the output. This includes removing noisy or irrelevant elements, normalizing format inconsistencies, and improving alignment across multi-modal inputs.

In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.

So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.

novaRom 3 months ago

> Built With: DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and more.

the whole pipeline is not open source

ses425500000 3 months ago

Yep — some components currently rely on external APIs (e.g. OpenAI, MathPix), primarily for stability and ease of deployment during early release. But I’m planning to support fully local inference in the future to eliminate API key dependency.

The local pipeline would include:

• Tesseract or TrOCR for general OCR

• Pix2Struct, Donut, or DocTR for document structure understanding

• OpenAI CLIP for image-text semantic alignment

• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks

Goal is to make the system fully self-hostable for offline and private use.

sandreas 3 months ago

How does this compare against marker[1]?

1: https://github.com/VikParuchuri/marker

ses425500000 3 months ago

Thanks for sharing — Marker is a great tool, especially for human-readable formatting!

In contrast, this project focuses less on preserving the visual layout for human readers, and more on extracting structured semantic data for machine learning training.

So instead of optimizing for clean Markdown or HTML, it extracts context-aware elements like:

• table data as JSON,

• math expressions in LaTeX,

• diagrams with image descriptions,

• multilingual text segments,

• and semantic roles (e.g. “question”, “explanation”, etc.)

In short: Marker is great for reading, This is built for feeding into ML pipelines — especially for tasks like question-answering, diagram reasoning, or multimodal pretraining.

samstave 3 months ago

[dead]

constantinum 3 months ago

For the more curious: there is also Unstract open source for pipeline. Lets us plug in your AI stack eg. OS llm models, vector db, ocr parsers etc.

https://github.com/Zipstack/unstract

GPerson 3 months ago

Did you ethically acquire permission to train on the data set?

ses425500000 3 months ago

Yep — this project uses a pre-trained DocLayout-YOLO model released under an open license by the original authors. No additional datasets were used for training. All sample data in the repo is either synthetic, publicly available, or user-generated specifically for testing purposes. If there are any concerns about specific models or datasets, I’m happy to review them and make adjustments as needed.

sc077y 3 months ago

DocLayout-YOLO model is under the AGPL-3.0 license, it's not permissive. You can't have your project under the MIT license and also use copyleft software.

ses425500000 3 months ago

I’m sorry that I didn’t know that detail, thank you so much for letting me know! I’ll read AGPL-3.0 license more carefully and check if it’s okay with MIT. If not, I’ll fix license or change model. really appreciate your help!

liangzhe88 3 months ago

Curious if there are plans to update this. Seems interesting.

ses425500000 3 months ago

Thanks! Yes — I’m definitely planning to update and refine the project over time.

This initial release is mostly a working prototype to demonstrate the full pipeline logic, and I’ll continue improving stability, modularity, and usability. A lot more updates are in the pipeline, so stay tuned! Feel free to open issues or suggestions anytime — feedback is always welcome!

aghilmort 3 months ago

super great work -- do you convert math formula to latex &/or how is that or other symbolic not necessarily unicode chars handled?

ses425500000 3 months ago

Thanks a lot! Yeah, theoretically the pipeline handles math and special symbols fine, and from my testing it worked well. But I didn’t test much on other languages or encodings, so if there’s any weird behavior, please let me know and I’ll check it!

8thcross 3 months ago

so you are saying i can feed my last 10 years of exam question papers and get predictions on what we will get this year?

ses425500000 3 months ago

Haha not exactly like predicting actual questions. Just trying to find patterns or what topics show up often. I made this to help my study, didn’t think people would care this much.

samstave 3 months ago

[dead]

jlcases 3 months ago

This is a valuable contribution. The quality of ML models heavily depends on the quality of training data, and extracting structured information from unstructured documents (like PDFs) is a critical bottleneck.

A key challenge after OCR is organizing the extracted data into a coherent knowledge structure. We've seen significant improvements in downstream ML tasks when the extracted data is organized using a hierarchical, MECE (Mutually Exclusive, Collectively Exhaustive) framework. This ensures that relationships between entities (tables, diagrams, text) are explicitly captured.

Does your pipeline include capabilities for semantic structuring of the extracted content beyond basic layout analysis? That seems like the next frontier for maximizing the value of OCR data in ML training.

ses425500000 3 months ago

Thanks for the insightful comment! You’re absolutely right — organizing extracted data into a coherent, semantically meaningful structure is critical for high-quality ML training.

Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.

I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.

Thanks again for the thoughtful feedback!

cAtte_ 3 months ago

why are you using an LLM to reply to every comment?

ses425500000 3 months ago

Haha good catch! I’m 19 and from Korea, so I’ve been using an LLM to help with replies since my English isn’t perfect yet. But I designed and built the project myself (with help from some open models/tools) — just wanted to communicate more clearly with the community!

gus_massa 3 months ago

[Hi from Argentina!] LLM have a particular style that will make people suspictious or even angry.

One posibility is to write the answer in Korean and use autotranslation. (And post only the autotranslation.) Double check the technical terms, because autotranslation sometimes choose the wrong synonym.

Another posibility is to write the answer in English inside gmail, and gmail will highlight orthographical and gramar errors. So you can fix them.

Most people here will tolerate a few mistakes if the answer has your own personal style.

(Nice project, by the way.)

vo2maxer 3 months ago

Yes, writing that is suspictious makes me angry.

gus_massa 3 months ago

>> suspitious

:( My phone does not have orthography correction, and I didn't have my notebook.

Edit: fixed typo: gave -> have

vo2maxer 3 months ago

Por esa misma razón, un LLM te habría funcionado perfectamente: desplegando tus pensamientos tal como querías, pero sin las distracciones causadas por la mala ortografía o los errores gramaticales. Los LLM son herramientas —como bien sabes— que ya son esenciales y lo serán aún más con el paso del tiempo. Que algunos en esta plataforma se irriten por su uso solo significa que, eventualmente, se convertirán en los dinosaurios del futuro.

For that very reason, an LLM would have worked perfectly for you: laying out your thoughts just as you intended, but without the distractions caused by poor spelling or grammatical mistakes. LLMs are tools—as you well know—that are already essential and will become even more so over time. The fact that some people on this platform get irritated by their use just means they’ll eventually become the dinosaurs of the future.

gus_massa 3 months ago

This reads as es-es (perhaps es-es-corporate) instead of es-ar. I don't like "desplegando" because it's somewhat closer to "unfolding" instead of "laying out". I'm not sure it's incorrect, but I'd have chosen differently.

The problem is that I read the emails from my friends using their voice and speaking style.

I'd do the same with HN comments, but I never heard most (any?) of them. Anyway, each commenter has a personal style, or at least I have an informal list in my head of a few hundreds commenters. I remember someone made a few good comments about some topic, so it adds in my mind weight to their opinion. I remember some details of their lives, like where they live, family, work, unusual past events, which topics they are interested, ..., they are persons!

With too much AI, comments get bland. They all read like the same corporate speak. AI would not add pasta recipes to antirez comments, or yadayada to patio11 comments. Also, the topics I'd trust their opinions are very different.

I don't mind using AI to fix the text. Moreover, in one of my previous comments I recomendad to write it in Gmail. I guess Gmail is using a mix of an expert system and modern AI. I hope someday Google adds that feature to the textbox in Chrome.

The problem is that some people is using AI to write short "somewhat related" comments, that are not wrong but not very relevant. Also to write giant "walls of text" that discuss the topic and the 5 most important ramifications. So there is an overreaction to correct orthography, grammar and "AI style".

> The fact that some people on this platform get irritated by their use just means they’ll eventually become the dinosaurs of the future.

Remember that birds are dinosaurs. And if you think that nobody is scared of birds, you should visit a pen full of rheas (ostrich are a fine substitution). If you have any brilliant ornament on your cloth they will try to eat it and you will be hit by the peak. Also they will steal food from your hands and it hurts. We visit an open zoo with my older daughter when she was a kid. Rheas were locked inside a pen for security reasons, there were a lot of ducks and baby ducks that are cute, and the goose were scary because they are evil and come in organized groups to "ask" for food.

vo2maxer 3 months ago

Genuinely curious—could it be for the same reason you used a keyboard to write that comment? It’s efficient, it works. What’s the actual issue with using a tool that helps convey the intended message more clearly and quickly, as long as it reflects what he wanted to say?

cAtte_ 3 months ago

why are you offended on behalf of this person? the hindsight that they're simply an English learner obviously makes me feel bad for asking the question and i completely understand the use case, but i don't think it was unreasonable to think that someone who speaks entirely in ChatGPT paragraphs might be a bot, spammer, or the like—particularly because, in a botnet fashion, the original reply was to a comment that also seemed to be LLM-authored

vo2maxer 3 months ago

I wasn't offended at all. I was just genuinely curious, because I keep coming across this assumption that if any text is well-crafted, it must have come from an LLM. I think I understand why: we've grown so used to reading sloppy writing, everything from barely coherent text messages to articles in reputable publications filled with typos and awkward phrasing.

Personally, I've always held myself to a high standard in how I write, even in text messages. Some might see that as bordering on perfectionism, but for me, it's about respecting the principle behind communication: to be as clear and correct as possible.

Now that we have tools that help ensure that clarity, or at the very least, reduce distractions caused by grammar or spelling mistakes, of course I'm going to use them. I used to agonize over my comments on Twitter because you couldn't edit them after posting. I would first write them elsewhere and review them several times for any errors before finally posting. For context: I'm a retired 69-year-old physician, and even after witnessing decades of technological advancement, I'm still in awe of what this new technology can do.

Yes, I love beautiful, natural writing. I'm a voracious reader of the great classics. I regularly immerse myself in Shakespeare, Hardy, Eliot, Dickens, Dostoyevsky, Austen, Tolstoy, and many other literary masters. But I also fully embrace this tool that can elevate even the clumsiest writer's work to a clarity we've never had access to before. If that comes at the cost of a bit of stylistic uniformity, that's a reasonable trade-off. It's up to the user to shape the output, review it, and make sure their own voice and ideas shine through.

Back to your original point, I truly wasn't offended on his behalf. I was just curious. As it turns out, he was using an LLM, because his native language is Korean. Good for him. And just to be clear, I didn't intend to make your question seem inappropriate or to embarrass him in any way. If it came across that way, I apologize.