Llama.cpp supports Vulkan. Why doesn't Ollama?

lolinder · on Jan 31, 2025

So many here are trashing on Ollama, saying it's "just" nice porcelain around llama.cpp and it's not doing anything complicated. Okay. Let's stipulate that.

So where's the non-sketchy, non-for-profit equivalent? Where's the nice frontend for llama.cpp that makes it trivial for anyone who wants to play around with local LLMs without having to know much about their internals? If Ollama isn't doing anything difficult, why isn't llama.cpp as easy to use?

Making local LLMs accessible to the masses is an essential job right now—it's important to normalize owning your data as much as it can be normalized. For all of its faults, Ollama does that, and it does it far better than any alternative. Maybe wait to trash it for being "just" a wrapper until someone actually creates a viable alternative.

chown · on Jan 31, 2025

I totally agree with this. I wanted to make it really easy for non-technical users with an app that hid all the complexities. I basically just wanted to embed the engine without making users open their terminal, let alone make them configure. I started with llama.cpp and almost gave up on the idea before I stumbled upon Ollama, which made the app happen.[1]

There are many flaws in Ollama but it makes many things much easier esp. if you don’t want to bother building and configuring. They do take a long time to merge any PRs though. One of my PRs has been waiting for 8 months and there was another PR about KV cache quantization that took them 6 months to merge.

[1]: https://msty.app

zozbot234 · on Jan 31, 2025

> They do take a long time to merge any PRs though.

I guess you have a point there, seeing as after many months of waiting we finally have a comment on this PR from someone with real involvement in Ollama - see https://github.com/ollama/ollama/pull/5059#issuecomment-2628... . Of course this is very welcome news.

wkat4242 · on Jan 31, 2025

It's not really welcome news, he is just saying they're putting it on the long finger because they think other stuff is more important. He's the same guy that kept ignoring the KV cache quant merge.

And the actual patch is tiny..

I think it's about time for a bleeding-edge fork of ollama. These guys are too static and that is not what AI development is all about.

zozbot234 · on Jan 31, 2025

He specifically says that they're reworking the Ollama server implementation in order to better support other kinds of models, and that such work has priority and is going to be a roadblock for this patch. This is not even news to those who were following the project, and it seems reasonable in many ways - users will want Vulkan to work across the board if it's made available at all, not for it to be limited to the kinds of models that exist today.

smcleod · on Jan 31, 2025

That qkv PR was mine! Small world.

washadjeffmad · on Jan 31, 2025

>So where's the non-sketchy, non-for-profit equivalent

llama.cpp, kobold.cpp, oobabooga, llmstudio, etc. There are dozens at this point.

And while many chalk the attachment to ollama up to a "skill issue", that's just venting frustration that all something has to do to win the popularity contest is to repackage and market it as an "app".

I prefer first-party tools, I'm comfortable managing a build environment and calling models using pytorch, and ollama doesn't really cover my use cases, so I'm not its audience. I still recommend it to people who might want the training wheels while they figure out how not-scary local inference actually is.

woadwarrior01 · on Jan 31, 2025

> llmstudio

ICYMI, you might want to read their terms of use:

https://lmstudio.ai/terms

evilduck · on Jan 31, 2025

> llama.cpp, kobold.cpp, oobabooga

None of these three are remotely as easy to install or use. They could be, but none of them are even trying.

> lmstudio

This is a closed source app with a non-free license from a business not making money. Enshittification is just a matter of when.

v9v · on Jan 31, 2025

I would argue that kobold.cpp is even easier to use than Ollama. You click on the link in the README to download an .exe and doubleclick it and select your model file. No command line involved.

Which part of the user experience did you have problems with when using it?

bdavbdav · on Jan 31, 2025

You’re coming at it from a point of knowledge. Read the first sentence of the Ollama website against the first paragraph of kobold’s GitHub. Newcomers don’t have a clue what “running a GGUF model..” means. It’s written by tech folk without an understanding of the audience.

diggan · on Jan 31, 2025

Ollama is also written for technical/developer users, by accident (it seems), even though they don't want it to be strictly for technical users. I've opened an issue asking them to make it clearer that Ollama is for technical users, but they seem confident people with no terminal experience can and will also use Ollama: https://github.com/ollama/ollama/issues/7116

lolinder · on Jan 31, 2025

Why do you care? They're the ones who will deal with the support burden of people who don't understand how to use it—if that support burden is low enough that they're happy with where they're at, what motivation do you have to tell them to deliberately restrict their audience?

diggan · on Jan 31, 2025

> Why do you care?

Like many in FOSS I care about making the experience better for everyone. Slightly weird question, why do you care that I care?

> what motivation do you have to tell them to deliberately restrict their audience?

I don't have any motivation to say any such thing, and I wouldn't either. Is that really your take away from reading that issue?

Stating something like "Ollama is a daemon/cli for running LLMs in your terminal" on your website isn't a restriction whatsoever, it's just being clear up front what the tool is. Currently, the website literally doesn't say what Ollama actually is.

lolinder · on Feb 1, 2025

> Is that really your take away from reading that issue?

Yes. You went to them with a definition of who they're trying to serve and they wrote back that they didn't agree with your relatively narrow scope. Now you're out in random threads about Ollama complaining that they didn't like your definition of their target audience.

Am I missing something?

diggan · on Feb 1, 2025

Yes, I'm not asking them to adopt what I think is the target audience, I'm asking them to define any target audience, then add at least one sentence on their website describing what Ollama is, not sure why that's controversial.

Basically the only information on the website right now is "Get up and running with large language models.", do you think that's helping people? Could mean anything.

Aurornis · on Jan 31, 2025

It’s so hard to decipher the complaints about ollama in this comment section. I keep reading comments from people saying they don’t trust it, but then they don’t explain why they don’t trust it and don’t answer any follow up questions.

As someone who doesn’t follow this space, it’s hard to tell if there’s actually something sketchy going on with ollama or if it’s the usual reactionary negativity that happens when a tool comes along and makes someone’s niche hobby easier and more accessible to a broad audience.

bloomingkales · on Jan 31, 2025

> they don’t explain why they don’t trust it

We need to know a few things:

1) Show me the lines of code that log things and how it handles temp files and storage.

2) No remote calls at all.

3) No telemetry at all.

This is the feature list I would want to begin trusting. I use this stuff, but I also don’t trust it.

Aurornis · on Jan 31, 2025

Both ollama and llama.cpp are open source. You can check the code for both and compile both yourself.

The question is: Why is ollama considered “sketchy” but llama.cpp is not, given that both are open source?

I’m not trying to debate it. I’m trying to understand why people are saying this.

aduffy · on Jan 31, 2025

The code is literally open source with MIT license to boot https://github.com/ollama/ollama

bloomingkales · on Jan 31, 2025

traverseda · on Jan 31, 2025

>So where's the non-sketchy, non-for-profit equivalent?

Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.

That, coupled with the increasing scale of the internet, makes it harder and harder for smaller groups to do these kinds of things. At least until we get some good content-addressed distributed storage system.

woadwarrior01 · on Jan 31, 2025

> Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.

Cloudflare R2 has unlimited egress, and AFAIK, that's what ollama uses for hosting quantized model weights.

buyucu · on Jan 31, 2025

supporting vulkan will help ollama reach the masses who don't have dedicated gpus from nvidia.

this is such a low hanging fruit that it's silly how they are acting.

lolinder · on Jan 31, 2025

As has been pointed out in this thread in a comment that you replied to (so I know you saw it) [0], Ollama goes to a lot of contortions to support multiple llama.cpp backends. Yes, their solution is a bit of a hack, but it means that the effort of adding a new backend is substantial.

And again, they're doing those contortions to make it easy for people. Making it easy involves trade-offs.

Yes, Ollama has flaws. They could communicate better about why they're ignoring PRs. All I'm saying is let's not pretend they're not doing anything complicated or difficult when no one has been able to recreate what they're doing.

[0] https://news.ycombinator.com/item?id=42886933

buyucu · on Jan 31, 2025

This is incorrect. The effort it took to enable Vulkan was relatively minor. The PR is short and to be honest it doesn't do much, because it doesn't need to.

cmm · on Jan 31, 2025

that PR doesn't actually work though -- it finds the Vulkan libraries and has some memory accounting logic, but the bits to actually build a Vulkan llama.cpp runner are not there. I'm not sure why its author deems it ready for inclusion.

(I mean, the missing work should not be much, but it still has to be done)

buyucu · on Jan 31, 2025

the pr was working 6 months ago and it has been rebased multiple times as the ollama team kept ignoring it and mainline moved. I'm using it right now.

lolinder · on Jan 31, 2025

This is a change from your response to the comment that I linked to, where you said it was a good point. Why the difference?

Maybe I should clarify that I'm not saying that the effort to enable a new backend is substantial, I'm saying that my understanding of that comment (the one you acknowledged made a good argument) is that the maintenance burden of having a new backend is substantial.

buyucu · on Jan 31, 2025

I didn't say it was a good point. I said I disagree, but it's a respectable opinion I could imagine someone having.

lolinder · on Jan 31, 2025

Okay, now we're playing semantics. "Reasonable argument" were your words. What changed between then and now to where the same argument is now "incorrect"?

buyucu · on Jan 31, 2025

I literally start the sentence with 'I disagree with this'

lolinder · on Jan 31, 2025

But you never said why, and you never said it was incorrect, you said it was a reasonable argument and then appealed to the popularity of the PR as the reason why you disagree.

But now suddenly what I said is not just an argument you disagree with but is also incorrect. I've been genuinely asking for several turns of conversation at this point why what I said is incorrect.

Why is it incorrect that the maintenance burden of maintaining a Vulkan backend would be a sufficient explanation for why they don't want to merge it without having to appeal to some sort of conspiracy with Nvidia?

buyucu · on Jan 31, 2025

llama.cpp already supports Vulkan. This is where all the hard work is at. Ollama hardly does anything on top of it to support Vulkan. You just check if the libraries are available, and get the available VRAM. That is all. It is very simple.
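
For illustration only, the detection step described above (checking whether a Vulkan loader library is present) might look roughly like the sketch below. This is not Ollama's code (Ollama is written in Go); the function name `vulkan_loader_available` and the list of candidate library names are assumptions made for the example, and the VRAM query is left out because it needs the Vulkan API itself.

```python
import ctypes.util


def vulkan_loader_available() -> bool:
    """Return True if a Vulkan loader library can be located on this system.

    Hypothetical helper for illustration; the candidate names are assumptions
    ("vulkan" for libvulkan on Linux, "vulkan-1" for vulkan-1.dll on Windows).
    """
    for name in ("vulkan", "vulkan-1"):
        # find_library returns a library name/path string, or None if not found
        if ctypes.util.find_library(name) is not None:
            return True
    return False


if __name__ == "__main__":
    print("Vulkan loader found:", vulkan_loader_available())
```

Finding the library only says a backend could be tried; it says nothing about whether a given model will actually run well on that GPU.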

bestcoder69 · on Jan 31, 2025

Llamafile: https://github.com/Mozilla-Ocho/llamafile

lolinder · on Jan 31, 2025

Llamafile is great but solves a slightly different problem very well: how do I easily download and run a single model without having any infrastructure in place first?

Ollama solves the problem of how I run many models without having to deal with many instances of infrastructure.

romperstomper · on Jan 31, 2025

You don't need any infrastructure for llamafiles, you just download and run them (everywhere).

lolinder · on Jan 31, 2025

Yes, that's what I meant, sorry if it was confusing: The problem that Llamafiles solve is making it easy to set up one model without infrastructure.

homebrewer · on Jan 31, 2025

It's actually more difficult to use on linux (compared to ollama) because of the weird binfmt contortions you have to go through.

yjftsjthsd-h · on Jan 31, 2025

What contortions? None of my machines needed more than `chmod +x` for llamafile to run.

axegon_ · on Jan 31, 2025

I think you are missing the point. To get things straight: llama.cpp is not hard to setup and get running. It was a bit of a hassle in 2023 but even then it was not catastrophically complicated if you were willing to read the errors you were getting. People are dissatisfied for two very valid reasons. The first is that ollama gives little to no credit to llama.cpp. The second one is the point of the post: a PR, and not a huge one at that, has been open for over 6 months and has been completely ignored. Perhaps the ollama maintainers personally don't have a use for it so they shrugged it off, but this is the equivalent of "it works on my computer". Imagine if all kernel devs used Intel CPUs and ignored every non-Intel CPU-related PR. I am not saying that the kernel mailing list is not a large scale version of a countryside pub on a Friday night - it is. But the maintainers do acknowledge the efforts of people making PRs and do a decent job at addressing them. While small, the PR here is not trivial and should have been, at the very least, discussed. Yes, the workstation/server I use for running models uses two Nvidia GPUs. But my desktop computer uses an Intel Arc and in some scenarios, hypothetically, this PR might have been useful.

lolinder · on Jan 31, 2025

> To get things straight: llama.cpp is not hard to setup and get running. It was a bit of a hassle in 2023 but even then it was not catastrophically complicated if you were willing to read the errors you were getting.

It's made a lot of progress in that the README [0] now at least has instructions for how to download pre-built releases or docker images, but that requires actually reading the section entitled "Building the Project" to realize that it provides more than just building instructions. That is not accessible to the masses, and it's hard for me to not see that placement and prioritization as an intentional choice to be inaccessible (which is a perfectly valid choice for them!)

And that's aside from the fact that Ollama provides a ton of convenience features that are simply missing, starting with the fact that it looks like with llama.cpp I still have to pick a model at startup time, which means switching models requires SSHing into my server and restarting it.

None of this is meant to disparage llama.cpp: what they're doing is great and they have chosen to not prioritize user convenience as their primary goal. That's a perfectly valid choice. And I'm also not defending Ollama's lack of acknowledgment. I'm responding to a very specific set of ideas that have been prevalent in this thread: that not only does Ollama not give credit, they're not even really doing very much "real work". To me that is patently nonsense—the last mile to package something in a way that is user friendly is often at least as much work, it's just not the kind of work that hackers who hang out on forums like this appreciate.

[0] https://github.com/ggerganov/llama.cpp

portaouflop · on Jan 31, 2025

llama.cpp is hard to set up - I develop software for a living and it wasn’t trivial for me. ollama I can give to my non-technical family members and they know how to use it.

As for not merging the PR - why are you entitled to have a PR merged? This attitude of entitlement around contributions is very disheartening as an OSS maintainer - it’s usually more work to review/merge/maintain a feature etc than to open a PR. Also no one is entitled to comments / discussion or literally one second of my time as an OSS maintainer. This is imo the cancer that is eating open source.

tharant · on Feb 1, 2025

> As for not merging the PR - why are you entitled to have a PR merged?

I didn’t get entitlement vibes from the comment; I think the author believes the PR could have wide benefit, and believes that others support his position, thus the post to HN.

I don’t mean to be preach-y; I’m learning to interpret others by using a kinder mental model of society. Wish me luck!

pepijndevos · on Jan 31, 2025

ramalama seems to be trying, it's a docker based approach.

airstrike · on Jan 31, 2025

Here you go: https://github.com/hecrj/icebreaker

lolinder · on Jan 31, 2025

> No pre-built binaries yet! Use cargo to try it out

Not an equivalent yet, sorry.

buyucu · on Jan 31, 2025

llama.cpp has supported vulkan for more than a year now. For more than 6 months now there has been an open PR to add vulkan backend support for Ollama. However, the Ollama team has not even looked at it or commented on it.

Vulkan backends are existential for running LLMs on consumer hardware (iGPUs especially). It's sad to see Ollama miss this opportunity.

Kubuxu · on Jan 31, 2025

Don’t be sad for a commercial entity that is not a good player https://github.com/ggerganov/llama.cpp/pull/11016#issuecomme...

andy_ppp · on Jan 31, 2025

This is great, I did not know about RamaLama and I'll be using and recommending that in future and if I see people using Ollama in instructions I'll recommend they move to RamaLama in the future. Cheers.

api · on Jan 31, 2025

This is fascinating. I’ve been using ollama with no knowledge of this because it just works without a ton of knobs I don’t feel like spending the time to mess with.

As usual, the real work seems to be appropriated by people who do the last little bit — put an acceptable user experience and some polish on it — and they take all the money and credit.

It’s shitty but it also happens because the vast majority of devs, especially in the FOSS world, do not understand or appreciate user experience. It is bar none the most important thing in the success of most things in computing.

My rule is: every step a user has to do to install or set up something halves adoption. So if 100 people enter and there are two steps, 25 complete the process.
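
As a quick sketch of that rule of thumb (the function name `remaining_users` and the fixed 50% drop-off per step are just the rule as stated above, not measured data):

```python
def remaining_users(initial: int, steps: int, keep_rate: float = 0.5) -> float:
    """Users left after `steps` setup steps if each step keeps `keep_rate` of them."""
    return initial * keep_rate ** steps


if __name__ == "__main__":
    for steps in range(5):
        # 100 people entering: 0 steps -> 100.0, 1 step -> 50.0, 2 steps -> 25.0, ...
        print(steps, remaining_users(100, steps))
```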

For a long time Apple was the most valuable corporation on Earth on the basis of user experience alone. Apple doesn’t invent much. They polish it, and that’s where like 99% of the value is as far as the market is concerned.

The reason is that computers are very confusing and hard to use. Computer people, which most of us are, don’t see that because it’s second nature to us. But even for computer people you get to the point where you’re busy and don’t have time to nerd out on every single thing you use, so it even matters to computer people in the end.

openrisk · on Jan 31, 2025

The problem is that whatever esoteric motivations technical people have to join the FOSS movement (scratching an itch, seeking fame, saving the world, doing what everybody else is doing etc.) do not translate well to the domain of designing user experiences. People with the education and talent to have an impact here have neither incentives nor practical means to "FOSS-it". You could Creative Commons some artwork (and there are beautiful examples) but that's about it. The art and science of making software usable thus remains a proprietary pursuit. Indeed if that bottleneck could somehow be relaxed, adoption of FOSS software would skyrocket because the technical core is so good and keeps getting better.

anewhnaccount2 · on Jan 31, 2025

I think it's understood, but someone needs to work on the actual infrastructure.

jdright · on Jan 31, 2025

Yeah, I would love an actual alternative to Ollama, but RamaLama is not it unfortunately. As the other commenter said, onboarding is important. I just want a one-operation install that works, and the simple fact that RamaLama is written in Python ensures it will never be that easy, and this is even more true with LLM stuff when using an AMD GPU.

I know there will be people that disagree with this, that's ok. This is my personal experience with Python in general, and 10x worse when I need to figure out all compatible packages with specific ROCm support for my GPU. This is madness, even C and C++ setup and build is easier than this Python hell.

cge · on Jan 31, 2025

RamaLama's use of Python is different: it appears to just be using Python for scripting its container management. It doesn't need ROCm to work with Python or anything else. It has no difficult dependencies or anything else: I just installed it with `uv tool install ramalama` and it worked fine.

I'd agree that Python packaging is generally bad, and that within an LLM context it's a disastrous mess (especially for ROCm), but that doesn't appear to be how RamaLama is using it at all.

ecurtin · on Feb 1, 2025

@cge you have this right, the main python script has no dependencies, it just uses python3 stdlib stuff. So if you have a python3 executable on your system you are good to go. All the stuff with dependencies runs in a container. On macOS, using no containers works well also, as we basically just install brew llama.cpp

There are really no major python dependency problems; people have been running this on many Linux distros, macOS, etc.

We deliberately don't use python libraries because of the packaging problems.

discardedrefuse · on Feb 1, 2025

I gave Ramalama a shot today. I'm very impressed. `uvx ramalama run deepseek-r1:1.5b` just works™ for me. And that's saying A LOT, because I'm running Fedora Kinoite (KDE spin of Silverblue) with nothing layered on the ostree. That means no ROCm or extra AMDGPU stuff on the base layer. Prior to this, I was running llamafile in a podman/toolbox container with ROCm installed inside. Looks like the container ramalama is using has that stuff in there and amdgpu_top tells me the gpu is cooking when I run a query.

Side note: `uv` is a new package manager for python that replaces the pips, the virtualenvs and more. It's quite good. https://github.com/astral-sh/uv

ecurtin · on Feb 1, 2025

One of the main goals of RamaLama at the start was to be easy to install and run for Silverblue and Kinoite users (and funnily enough that machine had an AMD GPU, so we had almost identical setups). I quickly realized contributing to Ollama wasn't possible without being an Ollama employee:

https://github.com/ollama/ollama/pulls/ericcurtin

They merged a one-line change of mine, but you can't get any significant PRs in.

discardedrefuse · on Feb 1, 2025

I just realized that ramalama is actually part of the whole Container Tools ecosystem (Podman, Buildah, etc). This is excellent! Thanks for doing this.

jdright · on Feb 4, 2025

I'll try it then, if it can get a docker setup using my GPU and no dependency hell, then good. I'll report back to correct myself once I try it.

exe34 · on Jan 31, 2025

I get the impression the important stuff is done in a container rather than on the host system, so having python/pip might be all you need.

ecurtin · on Feb 1, 2025

This is true.

bearjaws · on Jan 31, 2025

It's hilarious that docker guys are trying to take another OSS and monetize it. Hey if it worked once?...

buyucu · on Jan 31, 2025

I was not aware of this context, thanks!

n144q · on Jan 31, 2025

Thanks, just yesterday I discovered that Ollama could not use iGPU on my AMD machine, and was going through a long issue for solutions/workarounds (https://github.com/ollama/ollama/issues/2637). Existing instructions are based on Linux, and some people found it utterly surprising that anyone wants to run LLMs on Windows (really?). While I would have no trouble installing Linux and compiling from source, I wasn't ready to do that to my main, daily-use computer.

Great to see this.

PS. Have you got feedback on whether this works on Windows? If not, I can try to create a build today.

zozbot234 · on Jan 31, 2025

The PR has been legitimately out-of-date and unmergeable for many months. It was forward-ported a few weeks ago, and is now still awaiting formal review and merging. (To be sure, Vulkan support in Ollama will likely stay experimental for some time even if the existing PR is merged, and many setups will need manual adjustment of the number of GPU layers and such. It's far from 100% foolproof even in the best-case scenario!)

For that matter, some people are still having issues building and running it, as seen from the latest comments on the linked GitHub page. It's not clear that it's even in a fully reviewable state just yet.

buyucu · on Jan 31, 2025

this pr was reviewable multiple times, rebased multiple times. all because ollama team kept ignoring it. it has been open for almost 7 months now without a single comment from the ollama folks.

ecurtin · on Feb 1, 2025

It gets out of date with conflicts, etc., because it's ignored. If this were Ollama's upstream project, llama.cpp, the maintainers would have got this merged months ago.

9cb14c1ec0 · on Jan 31, 2025

The PR at issue here blocks iGPUs. My fork of the PR removes that restriction:

https://github.com/9cb14c1ec0/ollama-vulkan

I successfully ran Phi4 on my AMD Ryzen 7 PRO 5850U iGPU with it.

buyucu · on Jan 31, 2025

this is great! I think pufferfish is taking PRs to his fork as well.

Havoc · on Jan 31, 2025

ollama was good initially in that it made LLMs more accessible for non-technical people while everyone was figuring things out.

Lately they seem to be contributing mostly confusion to the conversation.

The #1 model the entire world is talking about is literally mislabeled on their side. There is no such thing as R1-1.5b. Quantization without telling users also confuses noobs as to what is possible. Setting up an api different from the thing they're wrapping adds chaos. And claiming each feature added to llama.cpp as something "ollama now supports" is exceedingly questionable, especially when combined with the very sparse acknowledgement that it's a wrapper at all.

Whole thing just doesn't have good vibes

dingocat · on Jan 31, 2025

What do you mean there is no such thing as R1-1.5b? DeepSeek released a distilled version based on a 1.5B Qwen model with the full name DeepSeek-R1-Distill-Qwen-1.5B, see chapter 3.2 on page 14 of their research article [0].

[0] https://arxiv.org/abs/2501.12948

trissi1996 · on Jan 31, 2025

Which is not the same model, it's not R1 it's R1-Distill-Qwen-1.5B....

dingocat · on Jan 31, 2025

A distinction they make clear and write extensively about on the model page, yes?

macki0 · on Jan 31, 2025

Where's that made clear in "ollama run deepseek-r1", the command to download/run the model?

dingocat · on Feb 1, 2025

Which you have to go to the model page to find.

Havoc · on Jan 31, 2025

ollama labels the qwen models R1, while the "R1" moniker standing on its own in deepseek world means the full model that has nothing to do with qwen.

https://ollama.com/library/deepseek-r1

That may have been ok if it was just same model at different sizes but they're completely different things here & it's created confusion out of thin air for absolutely no reason other than ollama being careless.

dingocat · on Jan 31, 2025

And their documentation makes that distinction clear, having dedicated a section specifically to the distilled models.

the_mitsuhiko · on Jan 31, 2025

Ollama needs competition. I’m not sure what drives the people that maintain it but some of their actions imply that there are ulterior motives at play that do not have the benefit of their users in mind.

However such projects require a lot of time and effort and it’s not clear if this project can be forked and kept alive.

Deathmax · on Jan 31, 2025

The most recent one off the top of my head is their horrendous aliasing of DeepSeek R1 on their model hub, misleading users into thinking they are running the full model but really anything but the 671b alias is one of the distilled models. This has already led to lots of people claiming that they are running R1 locally when they are not.

TeMPOraL · on Jan 31, 2025

The whole DeepSeek-R1 situation gets extra confusing because:

- The distilled models are also provided by DeepSeek;

- There's also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as ~140GB file size with the 1.58-bit quant.

I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.

--

[0] - https://unsloth.ai/blog/deepseekr1-dynamic

zozbot234 · on Jan 31, 2025

> it's too slow to be useful with such specs.

Only if you insist on realtime output: if you're OK with posting your question to the model and letting it run overnight (or, for some shorter questions, over your lunch break) it's great. I believe that this use case can fit local-AI especially well.
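
To put rough numbers on that, here is a back-of-the-envelope sketch using the 0.11 tokens per second figure reported above. The 8-hour "overnight" and 1-hour "lunch break" durations, and the helper name `tokens_generated`, are assumptions for illustration.

```python
def tokens_generated(tokens_per_second: float, hours: float) -> float:
    """Tokens produced at a steady rate over the given number of hours."""
    return tokens_per_second * 3600 * hours


if __name__ == "__main__":
    rate = 0.11  # tokens per second, as reported for the 1.58-bit quant above
    print("per minute:", round(rate * 60, 1))
    print("1-hour lunch break:", round(tokens_generated(rate, 1)))
    print("8-hour overnight run:", round(tokens_generated(rate, 8)))
```

At that rate an overnight run lands at around three thousand tokens, which has to cover the reasoning trace as well as the final answer.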

adastra22 · on Jan 31, 2025

I'm not sure that's fair, given that the distilled models are almost as good. Do you really think Deepseek's web interface is giving you access to 671b? They're going to be running distilled models there too.

Deathmax · on Jan 31, 2025

It's simple enough to test the tokenizer to determine the base model in use (DeepSeek V3, or a Llama 3/Qwen 2.5 distill).

Using the text "സ്മാർട്ട്", Qwen 2.5 tokenizes as 10 tokens, Llama 3 as 13, and DeepSeek V3 as 8.

Using DeepSeek's chat frontend, both DeepSeek V3 and R1 return the following response (SSE events edited for brevity):

  {"content":"സ","type":"text"},"chunk_token_usage":1
  {"content":"്മ","type":"text"},"chunk_token_usage":2
  {"content":"ാ","type":"text"},"chunk_token_usage":1
  {"content":"ർ","type":"text"},"chunk_token_usage":1
  {"content":"ട","type":"text"},"chunk_token_usage":1
  {"content":"്ട","type":"text"},"chunk_token_usage":1
  {"content":"്","type":"text"},"chunk_token_usage":1

which totals to 8, as expected for DeepSeek V3's tokenizer.
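
A minimal sketch of that check, with the `chunk_token_usage` values copied from the events above and the per-tokenizer counts taken from the figures quoted in this comment (the variable names are just for illustration):

```python
# chunk_token_usage values from the seven SSE events shown above
chunk_token_usage = [1, 2, 1, 1, 1, 1, 1]

# Token counts for the same test string, as quoted in the comment
expected_counts = {"Qwen 2.5": 10, "Llama 3": 13, "DeepSeek V3": 8}

total = sum(chunk_token_usage)

# Tokenizers whose quoted count matches the observed total
matches = [name for name, count in expected_counts.items() if count == total]

if __name__ == "__main__":
    print("total tokens:", total)
    print("consistent with:", matches)
```

This only narrows down the tokenizer family; it does not by itself tell you the parameter count of whatever is being served.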

adastra22 · on Jan 31, 2025

I’m not sure I understand what this comment is responding to. Wouldn’t a distilled Deepseek still use the same tokenizer? I’m not claiming they are using llama in their backend. I’m just saying they are likely using a lower-parameter model too.

zozbot234 · on Jan 31, 2025

The small models that have been published as part of the DeepSeek release are not a "distilled DeepSeek", they're fine-tuned varieties of Llama and Qwen. DeepSeek may have smaller models internally that are not Llama- or Qwen-based but if so they haven't released them.

adastra22 · on Jan 31, 2025

Thank you. I’m still learning as I’m sure everyone else is, and that’s a distinction I wasn’t aware of. (I assumed “distilled” meant a compressed parameter size, not necessarily the use of another model in its construction.)

zozbot234 · on Jan 31, 2025

Given that the 671B model is reportedly MoE-based, it definitely could be powering the web interface and API. MoE slashes the per-inference compute cost - and when serving the model for multiple users you only have to host a single copy of the model params in memory, so the bulk doesn't hurt you as much.

adastra22 · on Jan 31, 2025

They can still run a lot more users on the same number of GPUs (and they don't have a lot) using distilled models.

blixt · on Jan 31, 2025

LM Studio has been around for a long time and does a lot of similar things but with a more UI-based approach. I used to use it before Ollama, and it seems to still be going strong. https://lmstudio.ai/

buyucu · on Jan 31, 2025

isn't LM Studio closed source?

7thpower · on Jan 31, 2025

Can you please explain why you think they may be operating in bad faith?

diggan · on Jan 31, 2025

Not parent, but same feeling.

First I got the feeling because of how they store things on disk and try to get all models rehosted in their own closed library.

The second time I got the feeling was when I realized it's not at all obvious what their motives are, and that it's a for-profit venture.

The third time was trying to discuss things in their Discord, where the moderators constantly shut down a lot of conversation citing "Misinformation" and rewrite your messages. You can ask an honest question, and it gets deleted and you get blocked for a day.

Just today I asked why the R1 models they're shipping, which are the distilled ones, don't have "distilled" in the name, or offer any way of knowing which tag is which model, and got the answer "if you don't like how things are done on Ollama, you can run your own object registry", which doesn't exactly inspire confidence.

Another thing I noticed after a while is that there are a bunch of people with zero knowledge of terminals who want to run Ollama, even though Ollama is a project for developers (since you do need to know how to run a terminal). Just making the messaging clearer would help a lot in this regard, but somehow the Ollama team thinks that's gatekeeping and it's better to teach people basic terminal operations.

davely · on Jan 31, 2025

For what it's worth, HuggingFace provides documentation on how you can run any GGUF model inside Ollama[0]. You're not locked into their closed library, and you don't have to wait for them to add new models.

Granted, they could be a lot more helpful in providing information on how you do this. But this feature exists, at least.

[0] https://huggingface.co/docs/hub/en/ollama

diggan · on Jan 31, 2025

Ollama team's response (verbatim) when asking what they think of the comments about Ollama in this HN submission: "Who cares? It's the internet... everybody has an opinion... and they're usually bad". Not exactly the response you'd expect from people who should ideally learn from what others think (correct or not) about your project.

7thpower · on Jan 31, 2025

This is helpful, thank you.

justinmayer · on Jan 31, 2025

Benefiting users is definitely not Ollama’s first priority, as seen when this pull request was summarily closed: https://github.com/jmorganca/ollama/pull/395

Those README changes only served to provide greater transparency to would-be users.

Ulterior motives, indeed.

prabir · on Jan 31, 2025

There is https://cortex.so/ that I’m looking forward to.

adastra22 · on Jan 31, 2025

Hey thanks, I didn't know about cortex and this looks perfect.

imtringued · on Jan 31, 2025

Ollama doesn't really need competition. Llama.cpp just needs a few usability updates to the gguf format so that you can specify a hugging face repository like you can do in vLLM already.

buyucu · on Jan 31, 2025

I totally agree that ollama needs competition. They have been doing very sketchy things lately. I wish llama.cpp had an alternative wrapper client like ollama.

Liquix · on Jan 31, 2025

agreed. but what's wrong with Jan? does ollama utilize resources/run models more efficiently under the hood? (sorry for the naivete)

benxh · on Jan 31, 2025

My biggest gripe with Ollama is the badly named models, e.g. under deepseek-r1, it defaults to the distill models.

buyucu · on Jan 31, 2025

I agree they should rename them.

But defaulting to a 671b model is also evil.

rfoo · on Jan 31, 2025

No. If you can't run it, and most people can never run the model on their laptop, that's fine: let people know that fact instead of giving them an illusion.

Mashimo · on Jan 31, 2025

Letting people download 400GB just to find that out is also .. not optimal.

But yes, I have been "yelled" at on reddit for telling people you need vram in the hundreds of GB.

diggan · on Jan 31, 2025

> Letting people download 400GB just to find that out is also .. not optimal.

Letting people download any amount of bytes just to find out they got something else isn't optimal. So what to do? Highlight the differences when you reference them so people understand.

Tweets like these: https://x.com/ollama/status/1881427522002506009

> DeepSeek's first-generation reasoning models are achieving performance comparable to OpenAI's o1 across math, code, and reasoning tasks! Give it a try! 7B distilled: ollama run deepseek-r1:7b

Are really misleading. Reading the first part, you think the second part is that model that gives "performance comparable to OpenAI's o1" but it's not, it's a distilled model with way worse performance. Yes, they do say it's the distilled model, but I hope I'm not alone in seeing how people less careful would confuse the two.

If they're doing this on purpose, it'd leave a very bad taste in my mouth. If they're doing this accidentally, it also gives me reason to pause and re-evaluate what they're doing.

singularity2001 · on Jan 31, 2025

at least the distilled models are officially provided by deepseek (?)

trash_cat · on Jan 31, 2025

I use Ollama because I am a casual user and can't be bothered to read the docs on how to set up llama.cpp. I just want to run a simple llm locally.

Why would I care about Vulkan?

buyucu · on Jan 31, 2025

with vulkan it runs much much faster on consumer hardware, especially on igpus like intel or amd.

zozbot234 · on Jan 31, 2025

Well, it definitely runs faster on external dGPU's. With iGPU's and possibly future NPU's, the pre-processing/"thinking" phase is much faster (because that one is compute-bound) but text generation tends to be faster on CPU because it makes better use of available memory bandwidth (which is the relevant constraint there). iGPU's and NPU's will still be a win wrt. energy use, however.
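
The bandwidth point can be sketched with simple arithmetic: in single-stream generation each new token needs roughly one full read of the active weights, so memory bandwidth divided by model size gives a ceiling on tokens per second. The figures below (a 4.5 GB quantized model, 50 GB/s of system memory bandwidth) are made-up illustrative values, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when generation is memory-bandwidth-bound."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical numbers: dual-channel system RAM and a ~4.5 GB quantized model.
print(round(max_tokens_per_second(50.0, 4.5), 1))  # prints 11.1
```

An iGPU shares that same system memory with the CPU, so this ceiling applies to both; which one gets closer to it in practice depends on the hardware and the backend.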

bdhcuidbebe · on Jan 31, 2025

For Intel, OpenVINO should be the preferred route. I don't follow AMD, but Vulkan is just the common denominator here.

buyucu · on Jan 31, 2025

If you support Vulkan, you support almost every GPU out there in the consumer market across all hardware vendors. It's an amazing fallback option.

I agree they should also support OpenVINO, but compared to Vulkan OpenVINO is a tiny market.

bdhcuidbebe · on Jan 31, 2025

I made an argument for performance, not for compatibility.

If you run your local llm in the least performant way possible on your overly expensive GPU, then you are not getting value out of your purchase.

Vulkan is a fallback option is all.

I even see people running on their CPU because some apps don't support their hardware and llama.cpp even made it possible. It is still a really bad idea.

It just goes to show there’s still much to do.

buyucu · on Jan 31, 2025

I'm willing to bet that Vulkan will outperform OpenVINO.

Vulkan is the API right now in the graphics world. It's very well supported and actively being improved on. Everyone is pouring resources into making Vulkan better.

OpenVINO feels barely developed. Intel never made it a proper backend for PyTorch like AMD did with ROCm. It's hard to see where it is going, or if it is going anywhere at all. Between SYCL and oneAPI it's hard to see how much interest Intel has in developing it.

bdhcuidbebe · on Jan 31, 2025

> I'm willing to bet that Vulkan will outperform OpenVINO.

> Vulkan is the API right now in the graphics world.

YUP, Vulkan is all the rage in the graphics world, and for good reasons. But we aren't discussing graphics now, are we?

Vulkan is a general graphics API with some computing capabilities.

OpenVINO is a toolkit for neural network inference, built by Intel to make use of their GPUs and NPUs for this specific task.

Using Vulkan, first you need to translate your payload to shaders, then they need to be compiled to SPIR-V, then they can use a subset of the card's capabilities.

How could this even remotely match something written specifically for the task?

Also, it is dead easy to benchmark if you still think otherwise.

Or just read up on it..

buyucu · on Feb 2, 2025

OpenVINO is quasi-abandoned, nobody uses it, and Intel barely puts any effort into improving it.

Vulkan is used by millions and huge money goes into optimizing it.

My money is on Vulkan.

sebazzz · on Jan 31, 2025

How is the performance of Vulkan vs ROCm on AMD iGPUs? Ollama can be persuaded to run on iGPUs with ROCm.

a12k · on Jan 31, 2025

Ollama is sketchy enough that I run it in a VM. Which is odd because it would probably take less effort to just run Llama.cpp directly, but VMs are pretty easy so just went that route.

When I see people bring up the sketchiness most of the time the creator responds with the equivalent of shrugs, which imo increases the sketchiness.

nialv7 · on Jan 31, 2025

It's fully open source. I mean yes it uses llama.cpp without giving it credit. But why run it in a VM?

a12k · on Jan 31, 2025

It severely over-permissions itself on my Mac.

super_mario · on Jan 31, 2025

Can you please elaborate? How are you running ollama? I just build it from source and have written a shell script to start/stop it. It runs under my local user account (I should probably have its own user) and is of course not exposed outside localhost.

celeritascelery · on Jan 31, 2025

I have never had it request any permissions

_w1tm · on Jan 31, 2025

Just install it from Brew and run the service in a separate terminal tab.

instagary · on Jan 31, 2025

Isn't there a clause in MIT that says you're required to give credit? Also, I didn't know it was a YC company that started it: https://www.ycombinator.com/companies/ollama.

aduffy · on Jan 31, 2025

The project existed in the open source and then subsequently the creators sought funding to work on it full time.

instagary · on Jan 31, 2025

Makes sense!

krowek · on Jan 31, 2025

> But why run it in a VM?

Because you don't execute untrusted code on your machine without containerization/virtualization. Do you?

Aurornis · on Jan 31, 2025

The question was asking why it’s untrusted code, not why you run untrusted code in a VM.

There are a lot of open-source tools that we have to trust to get anything done on a daily basis.

adastra22 · on Jan 31, 2025

Every single day. There's just too much good software out there, and life is too short to be so paranoid.

n144q · on Jan 31, 2025

Care to elaborate what "sketchy" refers to here?

nicce · on Jan 31, 2025

> but VMs are pretty easy so just went that route.

Don’t you need at least 2 GPUs in that case and put kernel level passthrough?

a12k · on Jan 31, 2025

I don’t use GPU. Works fine, but the large Mixtral models are slow.

bdhcuidbebe · on Jan 31, 2025

i pass through my dGPU to VM and use iGPU for desktop

buyucu · on Jan 31, 2025

ollama advertising llama.cpp features as their own is very dishonest in my opinion.

portaouflop · on Jan 31, 2025

That’s the curse and blessing of open source I guess? I have billion dollar companies running my oss software without giving me anything - but do I gripe about it in public forums? Yea maybe sometimes but it never helps to improve the situation.

weinzierl · on Jan 31, 2025

It's the curse of permissively licensed open source. Copyleft is not the answer to everything but against companies leeching and not giving back it is effective.

portaouflop · on Jan 31, 2025

Well we had this conversation but once you change the license to something like that you lose the trust of the oss community instantly and as elasticsearch etc. show it also doesn’t really help with monetisation

sitkack · on Jan 31, 2025

Are they a wrapper with a similar name? You, like I, do gripe in public forums.

portaouflop · on Jan 31, 2025

No it’s totally different from this case - but the software is a super important part of their stack - if it were to stop working they can shut down their business

kgwgk · on Jan 31, 2025

The similar name in this case doesn’t originate in the wrapped thing either.

adastra22 · on Jan 31, 2025

Welcome to open source.

mschwaig · on Jan 31, 2025

Ollama tries to appeal to a lowest common denominator user base, who does not want to worry about stuff like configuration and quants, or which binary to download.

I think they want their project to be smart enough to just 'figure out what to do' on behalf of the user.

That appeals to a lot of people, but I think them stuffing all backends into one binary and auto-detecting at runtime which to use is actually a step too far towards simplicity.

What they did to support both CUDA and ROCm using the same binary looked quite cursed last time I checked (because they needed to link or invoke two different builds of llama.cpp of course).

I have only glanced at that PR, but I'm guessing that this plays a role in how many backends they can reasonably try to support.

In nixpkgs it's a huge pain that we configure quite deliberately what we want Ollama to do at build time, and then Ollama runs off and does whatever anyways, and users have to look at log output and performance regressions to know what it's actually doing, every time they update their heuristics for detecting ROCm. It's brittle as hell.

buyucu · on Jan 31, 2025

I disagree with this, but it's a reasonable argument. The problem is that the Ollama team has basically ignored the PR, instead of engaging the community. The least they can do is to explain their reasoning.

This PR is #1 on their repo based on multiple metrics (comments, iterations, what have you)

your_challenger · on Jan 31, 2025

I don't know why one would use Ollama instead of llama.cpp. llama.cpp is so easy to use and the maintainer is pretty famous and active in the community.

buyucu · on Jan 31, 2025

Llama.cpp dropped support for multimodal vlms. That is why I am using ollama. I would happily switch back if I could.

Gracana · on Jan 31, 2025

llama.cpp readme still lists multimodal models.. Qwen2-VL and others. Is that inaccurate, or something different?

[edit] Oh I see, here's an issue about it: https://github.com/ggerganov/llama.cpp/issues/8010

buyucu · on Jan 31, 2025

it's a grey zone but vlms are effectively not being developed anymore.

av_conk · on Jan 31, 2025

I tried using ollama because I couldn't get ROCm working on my system with llama-cpp. Ollama bundles the ROCm libraries for you. I got around 50 tokens per second with that setup.

I tried llama-cpp with the Vulkan backend and doubled the amount of tokens per second. I was under the impression ROCm is superior to Vulkan, so I was confused about the result.

In any case, I've stuck with llama-cpp.

buyucu · on Jan 31, 2025

It depends on your GPU. Vulkan is well-supported by essentially all GPUs. AMD supports ROCm well for their datacenter GPUs, but support for consumer hardware has not been as good.

paradite · on Jan 31, 2025

Could it be that supporting multiple platforms opens up more support tickets and adds more work to keep the software working on those new platforms?

As someone who built apps for Windows, Linux, macOS, iOS and Android, it is not trivial to ensure your new features or updates work on all platforms, and you have to deal with deprecations.

geerlingguy · on Jan 31, 2025

They already support ROCm, which probably introduces 10x more support requests than Vulkan would!

buyucu · on Jan 31, 2025

ollama is not doing anything. llama cpp does all that work. ollama is just a small wrapper on top.

zozbot234 · on Jan 31, 2025

This is not quite correct. Ollama must assess the state of Vulkan support and amount of available memory, then pick the fraction of the model to be hosted on GPU. This is not totally foolproof and will likely always need manual adjustment in some cases.

buyucu · on Jan 31, 2025

the work involved is tiny compared to the work llama.cpp did to get vulkan up and running.

this is not rocket science.

exe34 · on Jan 31, 2025

This sounds like it should be trivial to reproduce and extend - I look forward to trying out your repo!

buyucu · on Jan 31, 2025

the owner of that PR has already forked ollama. try it out. I did and it works great.

paradite · on Jan 31, 2025

I guess git and GitHub are working as intended then.

This is not a sarcastic comment. I'm genuinely happy that this was the outcome.

paradite · on Jan 31, 2025

Ok assuming what you said is correct, why wouldn't Ollama then be able to support Vulkan by default out of the box?

Sorry, I'm not sure what exactly the relationship between the two projects is. This is a genuine question, not a troll question.

buyucu · on Jan 31, 2025

check the PR, it's a very short one. It's not more complicated than setting a compile time flag.

I have no idea why they have been ignoring it.

Ollama is just a friendly front end for llama.cpp. It doesn't have to do any of those things you mentioned. Llama.cpp does all that.

fkyoureadthedoc · on Jan 31, 2025

If it's "just" a friendly front end, why doesn't llama.cpp just drop one themselves? Do they actually care about the situation, or are random people just mad on their behalf?

paradite · on Jan 31, 2025

At the risk of being pedantic (I don't know much about C++ and I'm genuinely curious), if Ollama is really just a wrapper around Llama.cpp, why would it need the Vulkan specific flags?

Shouldn't it just call Llama.cpp and let Llama.cpp handle the flags internally within Llama.cpp? I'm thinking from an abstraction layer perspective.

zozbot234 · on Jan 31, 2025

The Vulkan-specific flags are needed (1) to set up the llama.cpp build options when building Ollama w/ Vulkan support - which apparently is still a challenge with the current PR, if the latest comments on the GitHub page are accurate; also (2) to pick how many model layers should be run on the GPU, depending on available GPU memory. Llama.cpp doesn't do that for you, you have to set that option yourself or just tell it to move "everything", which often fails with an error. (Finding the right amount is actually a trial-and-error process which depends on the model, quantization and also varies depending on how much context you have in the current conversation. If you have too many layers loaded and too little GPU memory, a large context can result in unpredictable breakage.)
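
To make point (2) concrete, here is a minimal sketch of the kind of estimate a front end could make. It is not Ollama's actual algorithm: the function name, the even per-layer split, and the fixed reserve for scratch buffers are all assumptions for illustration.

```python
def estimate_gpu_layers(
    free_vram_bytes: int,
    model_bytes: int,
    n_layers: int,
    kv_cache_bytes: int,
    reserve_bytes: int = 512 * 1024**2,
) -> int:
    """Guess how many layers fit on the GPU (hypothetical heuristic).

    Assumes weights and KV cache are spread evenly across layers and keeps
    `reserve_bytes` free for scratch buffers.
    """
    if n_layers <= 0 or model_bytes <= 0:
        raise ValueError("n_layers and model_bytes must be positive")
    per_layer = (model_bytes + kv_cache_bytes) / n_layers
    usable = free_vram_bytes - reserve_bytes
    if usable <= 0:
        return 0  # nothing fits; run fully on CPU
    return min(n_layers, int(usable // per_layer))


GIB = 1024**3
# Hypothetical: 8 GiB free, an 18 GiB model with 64 layers, 2 GiB of KV cache.
print(estimate_gpu_layers(8 * GIB, 18 * GIB, 64, 2 * GIB))  # prints 24
```

The result is the sort of number you would otherwise pass to llama.cpp by hand via `--n-gpu-layers`. Because `kv_cache_bytes` grows with context length, an estimate that fits at the start of a conversation can stop fitting later, which matches the breakage described above.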

paradite · on Jan 31, 2025

Thanks a lot for the explanation.

If I can ask one more question, why doesn't Ollama use pre-built llama.cpp binaries with Vulkan support directly?

turnsout · on Jan 31, 2025

This is going to sound like a troll, but it's an honest question: Why do people use Ollama over llama.cpp? llama.cpp has added a ton of features, is about as user-friendly as Ollama, and is higher-performance. Is there some key differentiator for Ollama that I'm missing?

SkyPuncher · on Jan 31, 2025

Ollama - `brew install ollama`

llama.cpp - Read the docs, with loads of information and unclear use cases. Question if it has API compatibility and secondary features that a bunch of tools expect. Decide it's not worth your effort when `ollama` is already running by the time you've read the docs

kgwgk · on Jan 31, 2025

https://formulae.brew.sh/formula/llama.cpp

LorenDB · on Jan 31, 2025

Additionally, Ollama makes model installation a single command. With llama.cpp, you have to download the raw models from Huggingface and handle storage for them yourself.

trissi1996 · on Jan 31, 2025

Not really, llama.cpp has been able to download models for quite some time. It's not as elegant as ollama, but:

    llama-server --model-url "https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-32B-IQ4_XS.gguf"

Will get you up and running in one single command.

yencabulator · on Jan 31, 2025

And now you need a server per model? Ollama loads models on-demand, and terminates them after idle, all accessible over the same HTTP API.
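
A minimal sketch of what "the same HTTP API" looks like from a client, assuming a local Ollama instance on its default port and using its `/api/generate` endpoint. The `build_request` and `generate` helper names are made up for this example, and the model tags are only examples that would need to be pulled already.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the given model tag."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("deepseek-r1:7b", ...)` and then `generate("deepseek-r1:14b", ...)` hits the same URL both times; only the `model` field changes, and the server handles loading and unloading.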

singularity2001 · on Jan 31, 2025

ollama run deepseek-r1:14b

portaouflop · on Jan 31, 2025

I can only speak for myself but to me llama.cpp looks kind of hard to use (tbh never tried to use it), whereas ollama was just one cli command away. Also I had no idea that it's equivalent, I thought llama.cpp was some experimental tool for hardcore llm cracks, not something that I could teach, for example, my non-technical mom to use.

Looking at the repo of llama.cpp it’s still not obvious to me how to use it without digging in - I need to download models from huggingface it seems and configure stuff etc - with ollama I type ollama get or something and it works.

Tbh I don’t use that stuff a lot or even seriously, maybe once per month to try out new local models.

I think having an easy to use quickstart would go a long way for llama.cpp - but maybe it’s not intended for casual (stupid?) users like me…

Majromax · on Jan 31, 2025

In my mind, it doesn't help that llama.cpp's name is that of a source file. Intuitively, that name screams "library for further integration," not "tool for end-user use."

n144q · on Jan 31, 2025

https://github.com/ggerganov/llama.cpp/discussions/9565

Discussions under https://news.ycombinator.com/item?id=40693391

(I recommend doing a search yourself first)

Basically, if you know how to use a computer, you can use Ollama (almost). You can't say the same thing about llama.cpp. Not everyone knows how to build from source, or even what "build" means.

paradite · on Jan 31, 2025

For starters:

- It doesn't have a website

- It doesn't have a download page, you have to build it yourself

woadwarrior01 · on Jan 31, 2025

> - It doesn't have a download page, you have to build it yourself

I'd wager that anyone capable enough to run a command line tool like Ollama should also be able to download prebuilt binaries from the llama.cpp releases page[1]. Also, prebuilt binaries are available on things like homebrew[2].

[1]: https://github.com/ggerganov/llama.cpp/releases

[2]: https://formulae.brew.sh/formula/llama.cpp

a12k · on Jan 31, 2025

I am very technically inclined and use Ollama (in a VM, but still) because of all the steps and non-obviousness of how to run Llama.cpp. This framing feels a bit like the “Dropbox won’t succeed because rsync is easy” thinking.

woadwarrior01 · on Jan 31, 2025

> This framing feels a bit like the “Dropbox won’t succeed because rsync is easy” thinking.

No, it isn't. There are plenty of end user GUI apps that make it far easier than Ollama to download and run local LLMs (disclaimer: I build one of them). That's an entirely different market.

IMO, the intersection between the set of people who use a command line tool, and the set of people who are incapable of running `brew install llama.cpp` (or its Windows or Linux equivalents) is exceedingly small.

fkyoureadthedoc · on Jan 31, 2025

I can't install any .app on my fairly locked down work computer, but I can `brew install ollama`.

When I read the llama.cpp repo and see I have to build it, vs ollama where I just have to get it, the choice is already made.

I just want something I can quickly run and use with aider or mess around with. When I need to do real work I just use whatever OpenAI model we have running on Azure PTUs

kgwgk · on Jan 31, 2025

> I can `brew install ollama`.

Can you `brew install llama.cpp`?

fkyoureadthedoc · on Jan 31, 2025

probably, but why would I at this point? it's not on aider's supported list. If I needed to replace ollama for some reason I'd probably go with lms.

n144q · on Jan 31, 2025

And you still need to find and download the model files yourself, among other steps, which is intimidating enough to drive away most users, including skilled software engineers. Most people just want it to work and start using it for something else as soon as possible.

The same reason I use apt install instead of compiling from source. I can definitely do that, but I don't, because it's just a way to get things installed.

paradite · on Jan 31, 2025

Ok I was looking at the repo from mobile and missed the releases.

Still, it's not immediately obvious from the README that there is an option to download it. There are instructions on how to build it, but not how to download it. Or maybe I'm blind, please correct me.

baq · on Jan 31, 2025

I'm perfectly capable of compiling my own software but why bother if I can curl | sh into ollama.

mrkeen · on Jan 31, 2025

I used both. I had a terrible time with llama, and did not realise it until I used ollama.

I owned an RTX2070, and followed the llama instructions to make sure it was compiling with GPU enabled. I then hand-tweaked settings (numgpulayers) to try to make it offload as much as possible to the GPU. I verified that it was using a good chunk of my GPU ram (via nvidia-smi), and confirmed that with-gpu was faster than cpu-only. It was still pretty slow, and influenced my decision to upgrade to an RTX3070. It was faster, but still pretty meh...

The first time I used ollama, everything just worked straight out of the box, with one command and zero configuration. It was lightning fast. Honestly if I'd had ollama earlier, I probably wouldn't have felt the need to upgrade GPU.

serial_dev · on Jan 31, 2025

Maybe it was lightning fast because the model names are misleading? I installed it to try out deepseek, and I was surprised how small the download artifact was and how easily it ran on my simple 3-year-old Mac. I was a bit disappointed as deepseek gave bad responses and I heard it should be better than what I used on OpenAI… only to then realize after reading it on Twitter that I got a very small version of deepseek r1.

Maybe you were running a different model?

bildung · on Jan 31, 2025

If it was faster with ollama, then you most probably just downloaded a different model (hard to recognize with ollama). Ollama only adds UX to llama.cpp, and nothing compute-wise.

stuaxo · on Jan 31, 2025

The server in llama-cpp is documented as being only for demonstration, but ollama supports it as a model to run it.

For work, we are given Macs and so the GPU can't be passed through to docker.

I wanted a client/server where the server has the LLM and runs outside of Docker, but without me having to write the client/server part.

I run my model in ollama, then inside the code use litellm to speak to it during local development.

rakatata · on Jan 31, 2025

While not rocket science, a lot of its features require knowing how to recompile the project with certain variables set. Also, you need to properly format prompts for each instruct model.

buyucu · on Jan 31, 2025

I use ollama because llama.cpp dropped support for vlms. I would happily switch back if llama.cpp starts supporting vlms again.

dinosaurdynasty · on Jan 31, 2025

Can you even use bare llama.cpp with OpenWebUI? Especially when they are running on two different computers?

zophiana · on Jan 31, 2025

Honestly I just didn't know it was this easy to use, maybe because of the name... But ramalama seems to be a full replacement for ollama

himhckr · on Jan 31, 2025

ramalama still needs users to be able to install docker first, no? That’s a barrier to entry for many users, esp. on Windows, where I have had my struggles running Docker, not to mention that it's a massive resource hog.

sroecker · on Jan 31, 2025

Yes, but ramalama defaults to podman. Podman Desktop is very easy to install and use.

wkat4242 · on Jan 31, 2025

That's a weird thing about Ollama yes.

It took very long for them to support KV cache quantisation too (which drastically reduces the amount of VRAM needed for context!). Even though the underlying llama.cpp had offered it for ages. And they had it handed to them on a platter, someone had developed everything and submitted a patch.
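
To give a sense of scale for the VRAM savings, a rough sketch of the arithmetic. The model shape below (32 layers, 8 KV heads, head dimension 128, 8192-token context) is an assumed example of an 8B Llama-style model, not something taken from this thread; the bytes-per-element figures follow ggml's block layouts for `f16`, `q8_0`, and `q4_0`.

```python
# Bytes per cached element: f16 is 2 bytes; q8_0 and q4_0 store blocks of
# 32 values plus a 2-byte scale (34 and 18 bytes per block respectively).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, cache_type: str = "f16") -> float:
    """Size of the K and V caches for a single sequence."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 8192 tokens of context.
for cache_type in BYTES_PER_ELEMENT:
    size = kv_cache_bytes(32, 8, 128, 8192, cache_type)
    print(f"{cache_type}: {size / 1024**2:.0f} MiB")
# prints:
# f16: 1024 MiB
# q8_0: 544 MiB
# q4_0: 288 MiB
```

Under these assumptions `q8_0` roughly halves the cache and `q4_0` cuts it to a bit over a quarter, and the saving scales linearly with context length.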

The developer of that patch even was about to give up as he had to constantly keep it up to date with upstream even though he was constantly being ignored. So he had no idea if it would ever be merged.

They just seem to be really hesitant to offer new features.

Eventually it was merged and it made a huge difference to people with low VRAM cards.

quibono · on Jan 31, 2025

Is Ollama just the porcelain around llama.cpp? Or is there more to it than that?

buyucu · on Jan 31, 2025

yes, it's a convenience wrapper around llama.cpp

diggan · on Jan 31, 2025

They also decided to rehost the model files in their own (closed) library/repository + store the files split into layers on disk, so you cannot easily reuse model-files between applications. I think the point is that models can share layers, I'm not sure how much space you actually save, I just know that if you use both LM Studio + Ollama you cannot share models but if you use LM Studio + llama.cpp you can share the same files between them, no need to download duplicate model weights.

ac29 · on Jan 31, 2025

The main feature IMO is the model library. llama.cpp on its own does not come with any built in way to download and manage models.

colorant · on Feb 3, 2025

There is ipex-llm support for Ollama on Intel GPU (https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...)