
Ollama's new engine for multimodal models

339 points by LorenDB 23 hours ago | 77 comments

simonw 20 hours ago

The timing on this is a little surprising given llama.cpp just finally got a (hopefully) stable vision feature merged into main: https://simonwillison.net/2025/May/10/llama-cpp-vision/

Presumably Ollama had been working on this for quite a while already - it sounds like they've broken their initial dependency on llama.cpp. Being in charge of their own destiny makes a lot of sense.

lolinder 20 hours ago

Do you know what exactly is different about either of these projects' new multimodal support? Both have supported LLaVA for a long time. Did that require special-casing that is no longer required?

I'd hoped to see this mentioned in TFA, but it kind of acts like multimodal is totally new to Ollama, which it isn't.

simonw 20 hours ago

There's a pretty clear explanation of the llama.cpp history here: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...

I don't fully understand Ollama's timeline and strategy yet.

refulgentis 19 hours ago

It's a turducken of crap from everyone but ngxson, Hugging Face, and llama.cpp in this situation.

llama.cpp did have multimodal; I've been maintaining an integration for many moons now (Feb 2024? Original LLaVA through Gemma 3).

However, this was not for mere mortals. It was not documented and had gotten unwieldy, to say the least.

ngxson (HF employee) did a ton of work to get gemma3 support in, and had to do it in a separate binary. They dove in and landed a refactored backbone that is presumably more maintainable and on track to be in what I think of as the real Ollama, llama.cpp's server binary.

As you well note, Ollama is Ollamaing - I joked, once, that the median llama.cpp contribution from Ollama is a driveby GitHub comment asking when a feature will land in llama-server, so it can be copy-pasted into Ollama.

It's really sort of depressing to me because I'm just one dude, it really wasn't that hard to support (it's one of a gajillion things I have to do; I'd estimate 2 SWE-weeks at 10 YOE, plus 1.5 SWE-days for every model release), and it's hard to get attention for detailed work in this space with how much everyone exaggerates and rushes to PR.

EDIT: Coming back after reading the blog post, and I'm 10x as frustrated. "Support thinking / reasoning; Tool calling with streaming responses" --- this is table stakes stuff that was possible eons ago.

I don't see any sign of them doing anything specific in any of the code they link, the whole thing reads like someone carefully worked with an LLM to present a maximalist technical-sounding version of the llama.cpp stuff and frame it as if they worked with these companies and built their own thing. (note the very careful wording on this, e.g. in the footer the companies are thanked for releasing the models)

I think it's great that they have a nice UX that helps people run llama.cpp locally without compiling, but it's hard for me to think of a project I've been more turned off by in my 37 years on this rock.

Patrick_Devine 18 hours ago

I worked on the text portion of gemma3 (as well as gemma2) for the Ollama engine, and worked directly with the Gemma team at Google on the implementation. I didn't base the implementation off of the llama.cpp implementation, which was done in parallel. We did our implementation in golang, and llama.cpp did theirs in C++. There was no "copy-and-pasting" as you are implying, although I do think collaborating on these new models would help us get them out the door faster. I am really appreciative of Georgi catching a few things we got wrong in our implementation.

nolist_policy 17 hours ago

For one, Ollama supports interleaved sliding window attention (iSWA) for Gemma 3, while llama.cpp doesn't. [0] iSWA reduces KV cache size to about 1/6 (rough sketch of the math below).

Ollama is written in Go, so of course they cannot meaningfully contribute that back to llama.cpp.

[0] https://github.com/ggml-org/llama.cpp/issues/12637
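Rough back-of-the-envelope sketch of where that ~1/6 comes from. The layer count, 5:1 local-to-global ratio, and 1024-token window below are illustrative assumptions roughly in line with the Gemma 3 paper, not numbers pulled from Ollama's code:

    package main

    import "fmt"

    // Back-of-the-envelope comparison of KV cache size under full attention
    // vs. interleaved sliding window attention (iSWA). All numbers here are
    // illustrative assumptions, roughly Gemma 3-shaped.
    func main() {
        const (
            ctxLen      = 32768 // requested context length, in tokens
            numLayers   = 34    // total transformer layers (assumed)
            globalEvery = 6     // 1 global layer per 5 sliding-window layers
            window      = 1024  // sliding window size, in tokens (assumed)
        )

        // Full attention: every layer caches K/V for every position.
        fullPositions := numLayers * ctxLen

        // iSWA: global layers cache the whole context; sliding-window layers
        // only ever need the most recent `window` positions.
        globalLayers := numLayers / globalEvery
        localLayers := numLayers - globalLayers
        iswaPositions := globalLayers*ctxLen + localLayers*window

        fmt.Printf("full attention: %d cached positions\n", fullPositions)
        fmt.Printf("iSWA:           %d cached positions\n", iswaPositions)
        fmt.Printf("reduction:      %.1fx\n", float64(fullPositions)/float64(iswaPositions))
    }

At a 32k context this works out to a cache roughly 5-6x smaller, which is where the ~1/6 figure lands.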

refulgentis 6 hours ago

It's impossible to meaningfully contribute to the C library you call from Go because you're calling it from Go? :)

We can see the weakness of this argument by noting that virtually no front-end is written in C++, and yet it is clearly not the case that ~0 people contribute to llama.cpp.

magicalhippo 4 hours ago

They can of course meaningfully contribute new C++ code to llama.cpp, which they then could later use downstream in Go.

What they cannot meaningfully do is write Go code that solves their problems and upstream those changes to llama.cpp.

The former requires that they are comfortable writing C++, something perhaps not all Go devs are.

refulgentis 26 minutes ago

I'd love to be able to take this into account, step back, and say "Ah yes - there is a non-zero probability they are materially incapable of doing it" - in practice, if you're comfortable writing SWA in Go, you're going to be comfortable writing it in C++, and they are writing C++ already.

(It's also worth looking at the code linked for the model-specific impls; this isn't exactly thousands of lines of complicated code. To wit, while they're working with Georgi...why not offer to help land it in llama.cpp?)

noodletheworld 17 hours ago

What nonsense is this?

Where do you imagine ggml is from?

> The llama.cpp project is the main playground for developing new features for the ggml library

-> https://github.com/ollama/ollama/tree/27da2cddc514208f4e2353...

(Hint: if you think they only write Go in Ollama, look at the commit history of that folder)

nolist_policy 17 hours ago

llama.cpp clearly does not support iSWA: https://github.com/ggml-org/llama.cpp/issues/12637

Ollama does, please try it.

imtringued 16 hours ago

Dude, they literally announced that they stopped using llama.cpp and are now using ggml directly. Whatever gotcha you think there is, exists only in your head.

noodletheworld 10 hours ago

I'm responding to this assertion:

> Ollama is written in golang so of course they can not meaningfully contribute that back to llama.cpp.

llama.cpp consumes GGML.

ollama consumes GGML.

If they contribute upstream changes, they are contributing to llama.cpp.

The assertions that they:

a) only write golang

b) cannot upstream changes

Are both, categorically, false.

You can argue what 'meaningfully' means if you like. You can also believe whatever you like.

However, both (a) and (b), are false. It is not a matter of dispute.

> Whatever gotcha you think there is, exists only in your head.

There is no 'gotcha'. You're projecting. My only point is that any claim that they are somehow not able to contribute upstream changes only indicates a lack of desire or competence, not a lack of the technical capacity to do so.

refulgentis 6 hours ago

FWIW I don't know why you're being downvoted, other than the standard from-the-bleachers "idk what's going on but this guy seems more negative!" -- cheers -- "a [specious argument that shades rather than illuminates] can travel halfway around the world before..."

rvz 19 hours ago

> As you well note, Ollama is Ollamaing - I joked, once, that their median llama.cpp contribution from Ollama is asking when a feature will land in llama-server so it can be copy-pasted into Ollama.

Other than being a nice wrapper around llama.cpp, are there any meaningful improvements that they came up with that landed in llama.cpp?

I guess in this case, with the introduction of libmtmd (for multi-modal support in llama.cpp), Ollama waited, did a git pull, and now multi-modal + better vision support is here, with no proper credit given.

Yes, they had vision support via LLaVA models, but it wasn't that great.

refulgentis 18 hours ago

There have been no noteworthy contributions; I honestly wouldn't be surprised to hear there are 0 contributions.

Well, it's even sillier than that: I hadn't realized that the timeline in the llama.cpp link was humble and matched my memory: it was the test binaries that changed, i.e. the API was refactored a bit and such, but it's not anything new under the sun. Also, the llama.cpp they have has tool and thinking support. shrugs

The tooling was called llava, but that's just because it was the first model -- multimodal models are/were consistently supported ~instantly; it was just that your calls into llama.cpp needed to manage that, and they still do! It's just that there's been some cleanup, so there isn't one test binary for every model.

It's sillier than that in that it wasn't even "multi-modal + better vision support was here"; it was "oh, we should do that fr if llama.cpp is."

On a more positive note, the big contributor I appreciate in that vein is Kobold, which contributed a ton of Vulkan work IIUC.

And another round of applause for ochafik: idk if this gentleman from Google is doing this in his spare time or full-time for Google, but they have done an absolutely stunning amount of work to make tool calls and thinking systematically approachable, even building a header-only Jinja parser implementation and designing a way to systematize "blessed" overrides of the rushed, silly templates that are inserted into models. Really important work IMHO: tool calls are what make AI automated, and open source being able to step up here means you can have legit Sonnet-like agency in Gemma 3 12B, even Phi 4 3.8B to an extent (a minimal sketch of what that looks like in practice is below).
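For a concrete sense of what that enables, here is a minimal sketch of requesting a tool call from a locally served model over an OpenAI-style chat completions API (the port, model name, and get_weather tool are illustrative assumptions, not anything specific to llama.cpp or Ollama; the request shape follows the OpenAI convention):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "io"
        "net/http"
    )

    // Minimal sketch: ask a locally served model for a tool call via an
    // OpenAI-compatible chat completions endpoint. Endpoint, model name,
    // and tool definition are assumptions for illustration.
    func main() {
        reqBody := map[string]any{
            "model": "gemma-3-12b-it",
            "messages": []map[string]string{
                {"role": "user", "content": "What's the weather in Reykjavik?"},
            },
            "tools": []map[string]any{
                {
                    "type": "function",
                    "function": map[string]any{
                        "name":        "get_weather",
                        "description": "Look up current weather for a city",
                        "parameters": map[string]any{
                            "type": "object",
                            "properties": map[string]any{
                                "city": map[string]any{"type": "string"},
                            },
                            "required": []string{"city"},
                        },
                    },
                },
            },
        }

        payload, _ := json.Marshal(reqBody)
        resp, err := http.Post("http://localhost:8080/v1/chat/completions",
            "application/json", bytes.NewReader(payload))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // If the model decides to use the tool, choices[0].message.tool_calls
        // carries a structured call (function name + JSON arguments) that the
        // calling code then executes, rather than free-form prose.
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }

Getting that structured output to come back reliably across different models and their chat templates is exactly the plumbing the Jinja/template work makes dependable.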

oezi 13 hours ago

I wish multimodal would imply text, image and audio (+potentially video). If a model supports only image generation or image analysis, vision model seems the more appropriate term.

We should aim to distinguish multimodal models such as Qwen2.5-Omni from Qwen2.5-VL.

In this sense: Ollama's new engine adds vision support.

prettyblocks 8 hours ago

I'm very interested in working with video inputs; is it possible to do that with Qwen2.5-Omni and Ollama?

oezi 6 hours ago

I have only tested Qwen2.5-Omni for audio and it was hit and miss for my use case of tagging audio.

machinelearning 4 hours ago

What use case are you interested in re: video?

andy_xor_andrew 7 hours ago

They are talking a lot about this new engine - I'd love to see details on how it's actually implemented. Given llama.cpp is a herculean feat, if you are going to claim to have some replacement for it, an example of how you did it would be good!

Based on this part:

> We set out to support a new engine that makes multimodal models first-class citizens, and getting Ollama’s partners to contribute more directly the community - the GGML tensor library.

And from clicking through a GitHub link they had:

https://github.com/ollama/ollama/blob/main/model/models/gemm...

My takeaway is that the GGML library (the thing that is the backbone of llama.cpp) must expose some FFI (foreign function interface) that can be invoked from Go, so in the Ollama Go code they can write their own implementations of model behavior (like Gemma 3) that just call into the GGML magic (rough sketch of what that could look like below). I think I have that right? I would have expected a detail like that to be front and center in the blog post.
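For anyone curious what that boundary looks like mechanically, here is a minimal cgo sketch of calling GGML from Go. The include/link paths and buffer size are placeholders, and this only shows the shape of the FFI, not how Ollama's real bindings (which are far more involved) are structured:

    package main

    /*
    #cgo CFLAGS: -I/path/to/ggml/include
    #cgo LDFLAGS: -L/path/to/ggml/build -lggml
    #include "ggml.h"
    */
    import "C"

    import "fmt"

    // Minimal cgo sketch: create a ggml context and a tensor from Go.
    // Paths and sizes are assumptions for illustration only.
    func main() {
        params := C.struct_ggml_init_params{
            mem_size:   16 * 1024 * 1024, // 16 MiB scratch buffer
            mem_buffer: nil,
            no_alloc:   false,
        }

        ctx := C.ggml_init(params)
        if ctx == nil {
            panic("ggml_init failed")
        }
        defer C.ggml_free(ctx)

        // Allocate a 1-D f32 tensor with 8 elements inside the ggml context.
        t := C.ggml_new_tensor_1d(ctx, C.GGML_TYPE_F32, 8)
        fmt.Println("tensor elements:", C.ggml_nelements(t))
    }

The Go side owns the model definitions and orchestration, while the heavy tensor math stays in the C/C++ GGML code it links against.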

Hugsun 4 hours ago

Ollama are known for their lack of transparency, poor attribution and anti-user decisions.

I was surprised to see the amount of attribution in this post. They've been catching quite a bit of flak for this, so they might be adjusting.

clpm4j 7 hours ago

The whole '*llama' naming convention in the LLM world is more confusing to me than it probably should be. So many llamas running around out here.

mcbuilder 7 hours ago

Unfortunately, AI/ML moves so crazy fast that I don't know a better way to keep track other than paying attention all the time. The field also loves memey names. A few years ago everyone was naming models after Sesame Street characters, and there was the YOLO family of models. Conference papers are not immune; in fact, they are the greatest "offenders".