Hacker Remix

Show HN: r1_vlm – Open-Source Framework for Visual Reasoning with GRPO

5 points by skumar17 24 hours ago | 8 comments

oofbey 23 hours ago

This is really pretty cool. LLMs are so bad at images, it just makes sense to use reasoning to improve them. I'd love to see this applied to a bigger model than 3B, because this task is not difficult. But the attention visualization really demonstrates that it's doing what it's supposed to.

skumar17 22 hours ago

Thanks! I really love the visualization too. We have a hosted demo you can try as well!

https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder

oofbey 22 hours ago

Fun! I wish the demo had the attention visualization. Would that be easy to add? Is the source code for the HF demo in the repo too?

skumar17 22 hours ago

Unfortunately it might be a bit challenging as there’s a nontrivial amount of extra computation we do for the viz, but it’s probably possible?

skumar17 22 hours ago

The attention demo code is in the /attention_demo directory if you want to try it on your own messages too :)

xoofoog 22 hours ago

What do you mean, LLMs are bad at images? GPT or Claude can read text perfectly, and describe what's in a picture in a lot of detail. I feel like replacing OCR is one of the few things you can actually trust them for.

skumar17 22 hours ago

That’s a good observation. For this project, I found that while the base model could “read” the image, it didn’t really understand how to use it. GRPO allowed it to effectively search the solution space.

oofbey 22 hours ago

That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple. Like "are these lines crossed" or "which letter is circled". See https://vlmsareblind.github.io/ for some clear examples.