Hacker Remix

How we used GPT-4o for image detection with 350 similar illustrations

220 points by olup 1 week ago | 89 comments

sashank_1509 5 days ago

This has been my experience. Foundation models have completely changed the game of ML. Previously, companies might have needed to hire ML engineers familiar with ML training, architectures, etc., to get mediocre results. Now companies can just hire a regular software engineer familiar with foundation model APIs to get excellent results. In some ways it is sad, but in other ways the result you get is so much better than what we achieved before.

My example was an image segmentation model. I managed to create a dataset of 100,000+ images and was training UNets and other advanced models on it. I always reached a good validation loss, but my data was simply not diverse enough, and I faced a lot of issues in actual deployment, where the data distribution kept changing on a day-to-day basis. Then I tried DINOv2 from Meta, fine-tuned it on 4 images, and it solved the problem, handling all the variations in lighting etc. with far higher accuracy than I ever achieved. It makes sense: DINO was trained on 100M+ images; I would never be able to compete with that.
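
Roughly, that kind of adaptation is a frozen DINOv2 backbone with a small trainable head on top. A minimal sketch of the idea (the conv head, square-input assumption, and data loader are placeholders, not the actual pipeline):

    import torch
    import torch.nn as nn

    num_classes = 2  # placeholder: whatever the segmentation task needs
    backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
    head = nn.Conv2d(384, num_classes, kernel_size=1)  # ViT-S/14 patch features are 384-d

    def patch_features(img):
        # img: (B, 3, H, W) with H and W multiples of 14 (the patch size)
        with torch.no_grad():  # backbone stays frozen
            out = backbone.forward_features(img)
        tokens = out['x_norm_patchtokens']            # (B, H/14 * W/14, 384)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)                         # assumes square inputs
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)

    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for img, mask in loader:                          # 'loader' is an assumed DataLoader of (image, mask) pairs
        logits = head(patch_features(img))
        logits = nn.functional.interpolate(logits, size=mask.shape[-2:], mode='bilinear')
        loss = nn.functional.cross_entropy(logits, mask)
        opt.zero_grad(); loss.backward(); opt.step()

Since the backbone is frozen, the head has very few parameters, which is why a handful of labeled images can be enough.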

In this case, the company still needed my expertise, because Meta just released the weights, so someone had to set up the fine-tuning pipeline. But I can imagine a fine-tuning API like OpenAI's requiring no expertise outside of simple coding. If AI results depend on scale, it naturally follows that only a few well-funded companies will build AI that actually works, and everyone else will just use their models. The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.

pmontra 5 days ago

> The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.

We would still need the 100M+ images with accurate labels. That work can be performed collectively and open-sourced, but it must be maintained, etc. I don't think it will be easy.

goldemerald 5 days ago

DINOv2 is a self-supervised model. It learns both a high-quality global image representation and local representations with no labels. It's becoming strikingly clear that foundation models are the go-to choice for the common data types: natural images, text, video, and audio. The labels are effectively free; the hard part now is extracting quality from massive datasets.

EGreg 5 days ago

The other way it can reverse is by discovering better methods to train models, or to fine-tune existing ones with LoRA or whatever.

How did Chinese companies do it? Or is it a fabricated claim? https://slashdot.org/story/24/12/27/0420235/chinese-firm-tra...

NegatioN 5 days ago

I haven't compared image models in a long while, so I don't know the relevant performance metrics. But even a few years ago, you would usually use a pretrained model and then fine-tune it on your own dataset. So those models would also have "seen millions of images", not just your 100k.

This change of not needing ML engineers is not so much about the models as it is about easy API access for fine-tuning a model, it seems to me?

Of course, it's great that the models have advanced and become better and more robust.

isoprophlex 5 days ago

This was exactly my experience as the ML engineer on a predictive maintenance project. We detected broken traffic signs in video feeds from trucks; first you segment, then you classify.

Simply yeeting every "object of interest" into DINOv2 and running any cheap classifier on that was a game changer.

ac2u 5 days ago

Could you elaborate? I thought DINO took images and output segmented objects? Or do you mean that your first step was something like a YOLO model to get bounding boxes, and you're just using DINO to segment to make the classification part easier?

isoprophlex 5 days ago

We got bboxes from YOLO, indeed, to identify "here is a traffic sign", "here is a traffic light", etc. Then we cropped out these objects of interest and took the DINOv2 embeddings of them.

We weren't using it to create segmentations (there are YOLO models that do that, so if you need a segmentation you can get it in one pass); just to get a single vector representing each crop.

Our goal was not only to know "this is a traffic sign", but also to do multilabel classification like "has graffiti", "has deformations", "shows discoloration", etc. If you store those embeddings, it becomes pretty trivial (and hella fast) to pass them off to a bunch of data scientists so they can let loose all the classifiers in sklearn on them. See [1] for a substantially similar example.

[1] https://blog.roboflow.com/how-to-classify-images-with-dinov2
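
A bare-bones sketch of that crop-to-embedding-to-sklearn flow (the preprocessing, label names, and classifier choice here are assumptions, not the production setup):

    import torch
    from PIL import Image
    from torchvision import transforms
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier

    dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),  # 224 is divisible by the 14-pixel patch size
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(crop: Image.Image):
        x = preprocess(crop).unsqueeze(0)
        with torch.no_grad():
            return dino(x).squeeze(0).numpy()   # one global vector per crop

    # crops: PIL images cut out of each frame using the YOLO bounding boxes (assumed)
    # labels: multilabel targets, e.g. [has_graffiti, has_deformation, shows_discoloration] (assumed)
    X = [embed(c) for c in crops]
    clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, labels)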

ac2u 5 days ago

Understood. Thanks for taking the time to elaborate.

IanCal 5 days ago

Things like DINO, GroundingDINO, SAM (and whatever the latest versions of those are) are incredible. I think the progress in this field has been overlooked given all the attention on LLMs; they're less end-user friendly, but they're so good compared to what I remember working with.

I was able to turn around a segmentation and classifier demo in almost no time, because they gave me quick segmentation from a text description, and then I trained a YOLO model on the results.
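
Once the text-prompted stage has exported pseudo-labels in YOLO format, the training step itself is tiny. A sketch using ultralytics as one possible trainer (dataset path and model size are assumptions):

    from ultralytics import YOLO

    model = YOLO('yolov8n-seg.pt')                  # small segmentation variant
    model.train(data='pseudo_labels/dataset.yaml',  # YOLO-format labels exported from the text-prompted stage
                epochs=50, imgsz=640)
    metrics = model.val()                           # sanity-check against a held-out split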

Imnimo 6 days ago

It's tough to judge without seeing examples of the targets and the user photos, but I'm curious whether this could be done with just old-school SIFT. If it really is exactly the same image in the corpus and on the wall, does a neural embedding model really buy you a lot? A small number of high-confidence tie points seems like it'd be all you need, but it probably depends a lot on just how challenging the user photos are.
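
For reference, the old-school baseline is just SIFT keypoints plus a ratio test against each of the 350 reference images (the 0.75 threshold and the counting-based score below are conventional defaults, not tuned for this task):

    import cv2

    sift = cv2.SIFT_create()
    bf = cv2.BFMatcher()

    def match_score(query_gray, ref_gray):
        # Inputs: grayscale uint8 arrays, e.g. cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, d1 = sift.detectAndCompute(query_gray, None)
        _, d2 = sift.detectAndCompute(ref_gray, None)
        if d1 is None or d2 is None:
            return 0
        good = 0
        for pair in bf.knnMatch(d1, d2, k=2):
            # Lowe's ratio test: keep matches clearly better than the runner-up
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
                good += 1
        return good

    # best_ref = max(reference_images, key=lambda ref: match_score(user_photo, ref))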

Morizero 6 days ago

I find a lot of applied AI use-cases to be "same as this other method, but more expensive".

miki123211 5 days ago

It's often vastly more expensive at inference time, but vastly cheaper and faster to train / set up.

Many LLM use cases could be solved by a much smaller, specialized model and/or a bunch of if statements or regexes, but training the specialized model and coming up with the if statements requires programmer time, an ML engineer, human labelers, an eval pipeline, MLOps expertise to set up the GPUs, etc.

With an LLM, you spend 10 minutes integrating with the OpenAI API (something any programmer can do) and get results that are "good enough".
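
The "10 minutes" version of a classifier is little more than a prompt; the model name and label set below are placeholders for whatever the task actually needs:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify(ticket: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Classify the support ticket as exactly one of: "
                            "billing, bug, feature_request, other. Reply with the label only."},
                {"role": "user", "content": ticket},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()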

If you're extremely cash-poor, time-rich and have the right expertise, making your own model makes sense. Otherwise, human time is more valuable than computer time.

kjkjadksj 5 days ago

That was happening even when they were still calling it machine learning in the papers, and long before that too. It's the way some people reliably get papers out, for better or worse: find a known phenomenon with existing published methods, apply the new method of the day (potentially to the same dataset), show there's some agreement between the old "gold standard" and your method, and boom, a new paper on $hotnewmethod for your CV that you can now land jobs with. Never mind that no one will cite it. That's not the point here.

Terr_ 6 days ago

Better to spend $100 in op-ex money than spend $1 in cap-ex money reading a journal paper, especially if it lets you tell investors "AI." :p

mattnewton 6 days ago

Your engineers cost <$1/hr and understand journal papers?

Terr_ 5 days ago

The 100-vs-1 is a ratio.

relativ575 5 days ago

Use cases such as?

Morizero 5 days ago

I'm in an AI-focused education research group, and most "smart/personalized tutors" on the market have processes and outcomes similar to paper flashcards.

relativ575 5 days ago

From TFA:

> LLMs and the platforms powering them are quickly becoming one-stop shops for any ML-related tasks. From my perspective, the real revolution is not the chat ability or the knowledge embedded in these models, but rather the versatility they bring in a single system.

Why use another piece of software if an LLM is good enough?

comex 5 days ago

Performance. A museum visitor may not have a good internet connection, so any solution that involves uploading a photo to a server will probably be (much) slower than client-side detection. There’s a thin line between a magical experience and an annoying gimmick. Making people wait for something to load is a sure way to cross that line.

Also privacy. Do museum visitors know their camera data is being sent to the United States? Is that even legal (without consent) where the museum is located? Yes, visitors are supposed to be pointing their phone at a wall, but I suspect there will often be other people in view.

titzer 5 days ago

Cost. Same reason you don't deliver UPS packages with B-2 bombers.

msp26 5 days ago

LLM inference is cheap and will continue to get cheaper. More traditional methods take up far more of an engineer's time (which also costs money).

If I have a project with low enough lifetime inputs, I'm not wasting my time labelling data and training a model. That time could be better spent working on something else. As long as the evaluation is thorough, it doesn't matter. But I still like doing some labelling manually to get a feel for the problem space.

JayShower 6 days ago

Alternative solution that would require less ML heavy lifting but a little more upfront programming: it sounds like the cars are arranged in a grid on the wall. Maybe it would be possible to narrow down which car the user took a photo of by looking at the surrounding cars as well, and hardcoding into the system the position of each car relative to the others? You could potentially do that locally very quickly (maybe even at QR-code-level speed) versus doing an embedding + LLM.

The con of this approach is that it requires maintenance if they ever decide to change the illustration positions.

armchairhacker 5 days ago

Put each painting in an artsy frame whose edges are each a different, colorful pattern. When the user photographs the painting, they'll include all (or at least most) of the frame, and distinguishing the frames is easy.

arkh 5 days ago

> artsy frame

Embed a QR code or simply a barcode somewhere and you're done. Maybe hide it like a watermark so it doesn't show to the naked eye; doing a Fourier transform in the app wouldn't require a network connection nor a lot of processing power.
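
As a toy illustration of the Fourier-domain idea, you could hide a small integer ID as magnitude peaks at pre-agreed frequencies and read it back with an FFT. Robustness to perspective, lighting, and compression is the hard part and is ignored here; the coordinates and strengths are arbitrary choices:

    import numpy as np

    COORDS = [(40 + 6 * i, 55) for i in range(8)]   # one frequency bin per bit of the ID
    STRENGTH = 5e4

    def embed_id(img: np.ndarray, ident: int) -> np.ndarray:
        # img: grayscale uint8 array
        f = np.fft.fft2(img.astype(float))
        for bit, (u, v) in enumerate(COORDS):
            if (ident >> bit) & 1:
                f[u, v] += STRENGTH
                f[-u, -v] += STRENGTH          # conjugate bin keeps the image real
        return np.clip(np.fft.ifft2(f).real, 0, 255).astype(np.uint8)

    def read_id(img: np.ndarray) -> int:
        mag = np.abs(np.fft.fft2(img.astype(float)))
        ident = 0
        for bit, (u, v) in enumerate(COORDS):
            background = np.median(mag[u - 3:u + 4, v - 3:v + 4])
            if mag[u, v] > 4 * background:     # peak well above its neighbourhood
                ident |= 1 << bit
        return ident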

ndileas 5 days ago

The article does mention that the client rejected a similar approach. Steganography seems like a bad choice for a museum setting where you don't own the images.

jaffa2 5 days ago

This seems the way to go… it's only 350 images.

suriya-ganesh 5 days ago

This tracks with my experience. We built a complex processing pipeline for an NLP classification, search, and comprehension task, using a vector database of proprietary data, etc.

We ran a benchmark of our system against an LLM call, and the LLM performed much better for so much cheaper in terms of dev time, complexity, and compute. Incredible time to be working in the space, seeing traditional problems eaten away by new paradigms.