lostmsu's comments

Why would llama with --mmap crash?


This doesn't surprise me all that much; mmap support gets little attention in general and interacts poorly with GPU-side inference. (And that's despite mmap being the default; you don't even need to specify it as a CLI option.) OP has raised a discussion with the llama.cpp folks (https://github.com/ggml-org/llama.cpp/discussions/20852), but there has been little interest so far.


But if mmap already works, why would there be any interest?

Besides, discussions are for users. He didn't open PRs or issues.


How's the reproducibility of the results? E.g., the average score over 10 runs vs. the original.


Author here: The code is up on GitHub.

The probes I used seem to help identify good configurations, but they are quite noisy. I initially used a small probe set to make the scan tractable, then retested the higher-ranked models on a set ~10x larger.
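The two-stage scan described above can be sketched roughly like this (my reconstruction for illustration, not the author's actual code; the probe counts, config names, and scoring model are all assumptions):

```python
import random

def noisy_score(true_quality: float, n_probes: int) -> float:
    # Average of n_probes Bernoulli trials; more probes means less noise.
    return sum(random.random() < true_quality for _ in range(n_probes)) / n_probes

random.seed(0)
# Hypothetical configurations with hidden "true" quality in [0.3, 0.9].
configs = {f"cfg{i}": random.uniform(0.3, 0.9) for i in range(50)}

# Stage 1: rank everything on a small, cheap, noisy probe set.
stage1 = sorted(configs, key=lambda c: noisy_score(configs[c], 20), reverse=True)

# Stage 2: retest only the top 10 on a ~10x larger probe set.
stage2 = {c: noisy_score(configs[c], 200) for c in stage1[:10]}
best = max(stage2, key=stage2.get)
```

The point of the second stage is that noise from the small probe set mostly affects ranking near the cutoff; the larger set then separates the survivors reliably.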


> Mine was from websites 7 days I think. Randomly stopped booting a month ago (bsod after updates).

This happened to mine too. I suspect this might be the real cause for the blog post in question.


To be fair, Everything is heavy: 400 MB on my current machine.


470 MB. Probably proportional to index size? For me, it's a good use of available memory.


> MoE models via expert sharding with zero cross-node inference traffic

This makes the whole project questionable.


Unpopular opinion: we already achieved AGI back with ChatGPT. It's just still not at the level of the majority of the present company.


It was React before ChatGPT.


I've got nothing then.


Never put a SIM in it?


Then how can they make calls when home?


By using any VoIP provider.


This has nothing to do with Apple, and everything to do with MoE, plus the fact that everyone forgot you can re-read the necessary bits of the model from disk for each token.

This is extremely inefficient, though. For efficiency you need to batch many requests (32+, probably more like 128+), and once you do that with an MoE you lose the advantage of only reading a subset of the model during a single forward pass, so the trick stops working.
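A rough back-of-envelope for why batching defeats the sparse-read advantage, assuming uniform and independent routing (a simplification; real routers are correlated, and the expert counts here are illustrative, not any specific model):

```python
def expert_fraction(num_experts: int, top_k: int, batch: int) -> float:
    # Expected fraction of distinct experts a batch touches per MoE layer:
    # each request misses a given expert with probability (1 - k/E),
    # so the expert goes unread only if all `batch` requests miss it.
    p_unused = (1 - top_k / num_experts) ** batch
    return 1 - p_unused

# E.g. 64 experts, top-8 routing: a single request reads 12.5% of the
# experts, but by batch size 32 nearly every expert must be read anyway.
for b in (1, 8, 32, 128):
    print(b, round(expert_fraction(64, 8, b), 3))
```

So past a few dozen batched requests, each forward pass ends up reading essentially the full layer from disk regardless of sparsity.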

But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
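For the dense case, the arithmetic is straightforward because every request in a batch shares one full read of the weights; throughput then scales with batch size while per-request latency stays pinned to the read time. A minimal sketch, with the model size and disk bandwidth below being assumed example numbers:

```python
def tokens_per_sec(model_bytes: float, disk_bw_bytes: float, batch: int) -> float:
    # One full weight read per token step; all batched requests share it.
    seconds_per_pass = model_bytes / disk_bw_bytes
    return batch / seconds_per_pass

# E.g. 30 GB of dense weights streamed over a 7 GB/s NVMe drive:
print(tokens_per_sec(30e9, 7e9, 1))    # ~0.23 tok/s: latency-bound
print(tokens_per_sec(30e9, 7e9, 128))  # ~30 tok/s aggregate, same per-request latency
```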


I am a little optimistic about Radicle for Git.

