lostmsu's comments

Why would llama with --mmap crash?


This doesn't surprise me all that much; mmap support gets little attention in general and interacts poorly with GPU-side inference. (And that's despite mmap being the default; you don't even need to specify it as a CLI option.) OP has raised a discussion with the llama.cpp folks (https://github.com/ggml-org/llama.cpp/discussions/20852), but there has been little interest so far.


But if mmap already works, why would there be any interest?

Besides, discussions are for users. He didn't open PRs or issues.


How's the reproducibility of the results? E.g., the average score over 10 runs vs. the original.


Author here: The code is up on GitHub.

The probes I used seem to help identify good configurations, but they are quite noisy. I initially used a small probe set to make the scan tractable, then retested the higher-ranked models on a set ~10x larger.
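The two-stage scan described above can be sketched roughly like this (my reconstruction for illustration, not the author's actual code; the probe counts, config names, and scoring model are all assumptions):

```python
import random

def noisy_score(true_quality: float, n_probes: int) -> float:
    # Average of n_probes Bernoulli trials; more probes means less noise.
    return sum(random.random() < true_quality for _ in range(n_probes)) / n_probes

random.seed(0)
# Hypothetical configurations with hidden "true" quality in [0.3, 0.9].
configs = {f"cfg{i}": random.uniform(0.3, 0.9) for i in range(50)}

# Stage 1: rank everything on a small, cheap, noisy probe set.
stage1 = sorted(configs, key=lambda c: noisy_score(configs[c], 20), reverse=True)

# Stage 2: retest only the top 10 on a ~10x larger probe set.
stage2 = {c: noisy_score(configs[c], 200) for c in stage1[:10]}
best = max(stage2, key=stage2.get)
```

The point of the second stage is that noise from the small probe set mostly affects ranking near the cutoff; the larger set then separates the survivors reliably.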


> Mine was from websites 7 days I think. Randomly stopped booting a month ago (bsod after updates).

This happened to mine too. I suspect this might be the real cause for the blog post in question.


To be fair, Everything is heavy: 400 MB on my current machine.


470 MB. Probably proportional to index size? For me, it's a good use of available memory.


> MoE models via expert sharding with zero cross-node inference traffic

This makes the whole project questionable.


Unpopular opinion: we already achieved AGI back with ChatGPT. It's just still not at the level of the majority of the present company.


It was React before ChatGPT.


I've got nothing then.


Never put a SIM in it?


Then how can they make calls when home?


By using any VoIP provider.


This has nothing to do with Apple, and everything to do with MoE, plus the fact that everyone forgot you can re-read the necessary bits of the model from disk for each token.

This is extremely inefficient, though. For efficiency you need to batch many requests (32+, probably more like 128+), and once you do that with an MoE you lose the advantage of only reading a subset of the model during a single forward pass, so the trick stops working.
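A rough back-of-envelope for why batching defeats the sparse-read advantage, assuming uniform and independent routing (a simplification; real routers are correlated, and the expert counts here are illustrative, not any specific model):

```python
def expert_fraction(num_experts: int, top_k: int, batch: int) -> float:
    # Expected fraction of distinct experts a batch touches per MoE layer:
    # each request misses a given expert with probability (1 - k/E),
    # so the expert goes unread only if all `batch` requests miss it.
    p_unused = (1 - top_k / num_experts) ** batch
    return 1 - p_unused

# E.g. 64 experts, top-8 routing: a single request reads 12.5% of the
# experts, but by batch size 32 nearly every expert must be read anyway.
for b in (1, 8, 32, 128):
    print(b, round(expert_fraction(64, 8, b), 3))
```

So past a few dozen batched requests, each forward pass ends up reading essentially the full layer from disk regardless of sparsity.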

But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
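For the dense case, the arithmetic is straightforward because every request in a batch shares one full read of the weights; throughput then scales with batch size while per-request latency stays pinned to the read time. A minimal sketch, with the model size and disk bandwidth below being assumed example numbers:

```python
def tokens_per_sec(model_bytes: float, disk_bw_bytes: float, batch: int) -> float:
    # One full weight read per token step; all batched requests share it.
    seconds_per_pass = model_bytes / disk_bw_bytes
    return batch / seconds_per_pass

# E.g. 30 GB of dense weights streamed over a 7 GB/s NVMe drive:
print(tokens_per_sec(30e9, 7e9, 1))    # ~0.23 tok/s: latency-bound
print(tokens_per_sec(30e9, 7e9, 128))  # ~30 tok/s aggregate, same per-request latency
```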


I am a little optimistic about Radicle for Git.

