I hear this all the time. Why does it matter? Punishing a human for making a mistake does not prevent mistakes, nor does it undo the harm of the mistake. A human saying "my bad, I messed up" and an AI saying "my bad, I messed up" are equally worthless, in a functional sense.
mCaptcha, ALTCHA, Cap, Friendly Captcha, Private Captcha, Procaptcha, Anubis... there are literally dozens of open source alternatives that aren't feeding the Do Be Evil company... not to mention all of the commercial alternatives - if, for whatever reason, you do feel like paying for a service that costs nothing to offer.
"Citizens" United (which allows unlimited corporate political donations by classifying them as "speech", for those out of the loop) has fundamentally changed the core incentive structures of the modern political landscape. To compare a pre-CU world to a post-CU world when it comes to matters at the intersection of corporate interests and government regulatory / legislative power is comparing apples to oranges.
We need to overturn CU if we want to be able to go back to a world where government serves people rather than multinational conglomerates.
Citizens United has to be the most inaccurately cited case. It did not 'allow unlimited corporate political donations by classifying them as "speech"'.
It ruled that the federal government was wrong to restrict the speech rights of some groups while allowing other very similar groups to still retain their rights. One of the major examples of this was the media industry. A for-profit newspaper company could spend whatever amount of money it wanted to on political speech. An identical company in a different field could not. This, the court ruled, was unconstitutional.
It also did not grant corporations personhood, the other thing people like to state that it did.
Frankly, I think it's much less of an issue than it's made out to be. If money meant so much, Donald Trump wouldn't have beaten Jeb Bush on the way to beating Hillary Clinton.
To me, money in politics is the red herring to keep you from looking at the real election reform that needs to happen, some combination of open primaries (to remove the effect of primaries going to more extreme candidates rather than centrists) and an alternative to first past the post elections to allow people to vote for who they like without worrying about throwing away their votes (there are many different systems to do this, they're all major improvements).
The money in politics is used by the parties to back their preferred candidates and the voters go along with it in the general elections because they don't want to waste their votes. The money helps them do the bad thing but it isn't the bad thing.
It's worth pointing out that full digital identity verification ("doxxing" yourself to an untrustworthy, unauditable, legally unconstrained private company) is NOT the only way to verify adulthood. We have had a system in place which enables adulthood validation without enabling digital surveillance infrastructure, with a degree of false negative risk that society has deemed acceptable for nearly 100 years now. This idea is not my own, but I'm happy to share a reasonable proposal for it.
The Cashier Standard – Age Verification Without Surveillance
The "cashier standard" you advocate for has already crept toward centralized state tracking in places like Utah. When you go to a restaurant and order a drink, the staff are required to take it to the back and scan it for verification. The scanned data is also compared with a state database of DUI offenders. It's not clear whether the database is stored on site, or if that data goes out on the wire for the check; presumably the latter. Scanned data is also stored for up to 7 days by the restaurant, and it's easy to imagine further creep upping that storage bound.
This is not the case in most of the country. Utah is largely influenced by a Mormon / LDS culture that expresses heavy opposition to drinking. I am clearly not proposing that the cards be scanned Utah-style; I am proposing that they be glanced at by a cashier, everywhere-else-style.
Again, the proposal isn't for a system which requires scanning of IDs, it's for a system where the cashier glances at the ID. You're arguing against a strawman. You may argue that the system proposed could evolve into the system you're describing, but still, you're arguing against a hypothetical future fiction. If we're going to be arguing about what the proposal might evolve into in the future, we might as well be arguing about what we should be doing when aliens arrive, since they might arrive in the future, too.
> we might as well be arguing about what we should be doing when aliens arrive, since they might arrive in the future, too.
Did aliens land in multiple states already? Strawman deflections aside, scanning is the natural evolution and has already happened across multiple kinds of exchange (money markers, various IDs, various phone apps, etc). Government-issued ID has the benefit of an independent verification system. It's super expensive for various government agencies to integrate into businesses. Constituents and businesses don't want that, leading to a much more comfortable adversarial relationship, imo.
First - Alcohol and cigarettes can just be resold too. The black market for them is effectively zero because the consequences for giving them to kids are severe and the room for meaningful profit is close to zero; the same applies here.
Second - The codes would be priced on the order of magnitude of pennies per verification - think 10 cents or less, accessible even to low / fixed income folks without really making a dent in their budget.
Third - the proposal explicitly mentions a nonprofit running it as an option, and the idea would be that law codifies the method to be approved, not a specific vendor, so competitive markets could emerge, too. Would you argue that restrictions on the sale of alcohol are creating artificial winners in the private sector of alcohol manufacturing?
You're doing a huge logical jump in your first point. Alcohol and cigarettes are physical goods, digital ID is not, but you're proposing a system that turns it into a physical problem. I'm merely pointing out that's what you're doing and the issues with it.
Second, it doesn't matter what it costs, it's inconvenient and I already spent time (possibly money too) obtaining a government ID... on top of a theoretical mandate that says I need to show the ID on a bunch of websites.
Third, I'm not sure I follow your point on alcohol restrictions creating winners? The non-profit idea could potentially be good, but I'm not hopeful that real world legislation would be crafted that way.
EDIT: also more on #1 and "severe consequences" for re-selling... yes that's exactly what we want to avoid: creating more reasons to put people in prison and a bigger burden on law enforcement and the court system.
With digital tokens being generated by a user (the seller) on demand, you could have a bond system where the seller places something costly on the line, that the buyer can choose to destroy or obtain. For instance, if Alice gives her age token to Bob, Bob can (if he is a troll) invalidate the token in a way that requires Alice to go to a physical location to reset her ID.
I imagine this could be done with appropriate zero-knowledge measures so that the combination of Alice's age token and Bob's private key creates a capability to exercise the option, but without the service (e.g. a social media site) knowing that the token belongs to Alice, and without the ID provider (e.g. the state) knowing that Bob was the one who exercised it.
While honest customers have no reason to make use of this option, if Alice blindly sells her tokens to anybody willing to pay, there's bound to be some trolls out there who will do it just for the laughs.
This is far from a perfect system since a dishonest site could also make use of the option. But it theoretically works without revealing anybody's identity (unless the option is used, and then only if the service and the ID provider collude).
It doesn't prevent it, it just disincentivizes it. As an adult, you can also go buy a beer and sell it to a minor. That said, mandatory age verification with photo ID upload and facial scans doesn't prevent workarounds either - kids use their parents' photo ID and pass facial scans with a variety of techniques, too.
Nobody who understands how adversarial systems like this work is seriously expecting 100% flawless performance, blocking every single minor and accepting every single adult; the question is how much risk is acceptable, and the risks posed by this system are acceptable for alcohol, cigarettes, and other adult items that can arguably pose a much more acute risk of serious injury or bodily harm to kids.
Thinking about it, I probably wouldn't remember to change my fingerprints to the new ones with all the services I use, I'd probably have to carry my "legacy fingerprints" wherever I go for some time to avoid a lockout.
These aren't really "credentials" in that they're not secret the way your iris/retina pattern, fingerprint pattern, password, pin, secret key, or security token are.
Your name, DoB, and email address are identifiers, yes, but aren't really authenticators - they're more like a username, not a password.
Similar numbers here - slightly higher PP, and slightly better peak speed and retention w/ q8_0 KV cache quants too. llama-bench results here, cba to format for HN:
https://pastebin.com/raw/zgJeqRbv
GTR 9 Pro, "performance" profile in BIOS, GTT instead of GART, Fedora 44
If I did a proper benchmark I think the numbers would be what you got. Minimax M2.7 is also surprisingly not that slow, and in some ways faster, as it seems to get things right with less thinking output (around 140 t/s prefill and 23 t/s generation).
The problem with M2.7 is that it's full GQA, i.e. standard quadratic attention. It does start fast, but by 64k tokens deep, the version I'm running (Unsloth's UD IQ2_XXS) sees pp512 drop 95%, from 261.3 t/s at 0 context depth to 13.1 t/s. q8_0 KV does help, still hitting 57.4 t/s at 64k depth vs 258.3 t/s at 0 depth. TG's retention rates are better, but still approaching single digits by 64k depth even with q8_0 KV cache.
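For reference, the kind of sweep behind those numbers is just llama-bench with a depth list and the KV cache type flags; roughly something like this (the model filename here is a placeholder):

    # depth sweep with quantized KV cache; the .gguf path is hypothetical
    llama-bench -m minimax-m2.7-ud-iq2_xxs.gguf \
      -ngl 99 -fa 1 \
      -ctk q8_0 -ctv q8_0 \
      -p 512 -n 128 \
      -d 0,16384,32768,65536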
That said, it was my favorite model when I valued output quality above all else, at least up until the new Qwen 3.6 27B, which I'm currently playing with.
I suspect I will like Qwen 3.6 122B A10B a LOT, maybe even better than M2.7.
You absolutely do not need to run at full BF16. The quality loss between BF16 (55.65 GB in GGUF) and Q8_0 (30.44 GB in GGUF) is essentially zero - think on the order of +0.01-0.03 perplexity, or ~0.1-0.3% relative PPL increase. The quality loss between BF16 and Q4_K_M (18.66 GB in GGUF) is close to imperceptible, with perplexity changes in the +0.1-0.3 ballpark, or ~1-3% relative PPL increase. That correlates to roughly a 0-2% drop on downstream tasks like MMLU/GSM8K/HellaSwag: essentially indistinguishable.
You absolutely do NOT need a $3000 Strix Halo rig or a $4000 Mac or a $9000 RTX 6000 or "multiple high memory consumer GPUs" to run this model at extremely high accuracy. I say this as a huge Strix Halo fanboy (Beelink GTR 9 Pro), mind you. Where Strix Halo is more necessary (and actually offers much better performance) is with larger but sparse MoE models - think Qwen 3.5 122B A10B - which offer the total knowledge (and memory requirements) of a 122B model with processing and generation speed more akin to a 10B dense model. That's a big deal with the limited memory bandwidth we get in the land of Strix Halo (256 GB/s theoretical, ~220 GB/s real-world) and DGX Spark (273 GB/s theoretical - not familiar with real-world numbers specifically off the top of my head).
I would make the argument, as a Strix Halo owner, that 27B dense models are actually not particularly pleasant or snappy to run on Strix Halo, and you're much better off with those larger but sparse MoE models with far fewer active parameters on such systems. I'd much rather have an RTX 5090, an Arc B70 Pro, or an AMD AI PRO R9700 (dGPUs with 32GB of GDDR6/7) for 27B dense models specifically.
I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference.
I use llama.cpp, and there is a way to do this - some layers go to the (i)GPU, the rest to the CPU. I was just trying this out with Kimi K2.5 (in preparation for trying it with Kimi K2.6) the other night. Check out the --n-cpu-moe flag in llama.cpp.
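A minimal sketch of what that looks like (model path and layer count are made up; you tune --n-cpu-moe until the attention/dense layers fit in the iGPU's carve-out):

    # hypothetical path and values; --n-cpu-moe keeps the expert weights of
    # the first N layers in system RAM (mmap'd from disk), while -ngl sends
    # everything else to the (i)GPU
    llama-server -m kimi-k2.5-ud-iq2_xxs.gguf \
      -ngl 999 --n-cpu-moe 60 \
      -c 16384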
That said, my Strix Halo rig only has PCIe 4.0 for its NVMe, and I'm using a 990 Evo, which has poor sustained random reads, being DRAM-less. My effective read speeds from disk averaged around 1.6-2.0 GB/s. With Unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, and with ~64 GB worth of layers in the iGPU and the rest free for KV cache + checkpoints, that still left over 250 GB of weights streaming from disk at ~2 GB/s, so I was getting 0.35 tok/s PP and 0.22 tok/s TG.
I could go a little faster with a better drive, or a little faster still if I dropped two of them in RAID0, but it would still be on the order of sub-1 tok/s for both PP (compute limited) and TG (bandwidth limited).
In a computer with 2 PCIe 5.0 SSDs, or one with a PCIe 5.0 SSD and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s or even more.
This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable.
Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots.
I have actually started working on optimizing such an inference system, so your data is helpful for comparison.
Strix Halo, to my knowledge, does not support PCIe 5.0 NVMe drives, unfortunately, despite it being Zen 5, and Zen 5 supporting the PCIe 5.0 standard.
While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.
The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S, and what little you'd gain in this narrow use case with PCIe gen 5, you'd lose a lot in the more common use cases of models that can fit into a 128 GB pool of unified memory, and there are some really nice ones.
Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 14-15 GB/s each (so roughly 28-30 GB/s with two in RAID0), but that's peak sequential reads, which is neither reflective of sustained reads nor of the random read workloads that dominate reading model weights.
Maybe there are some Intel NUCs competing in this space that I'm less up to speed on which do support PCIe 5.0. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, the Panther Lake that's actually shipping comes with so many fewer Xe3 cores that it's still a weaker system overall.
Maybe DGX Spark supports PCIe 5.0; I don't own one and am admittedly not as familiar with that platform either. It's worth mentioning, though, that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo rigs run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still).
While you are right about the advantages of Strix Halo, those advantages matter only as long as you can fit the entire model inside the 128 GB DRAM.
If you use a bigger model and your performance becomes limited by SSD throughput, then a slower CPU and GPU will not affect the performance in an optimized implementation, where weights are streamed continuously from the SSDs and all computations are overlapped with that streaming.
I have an ASUS NUC with Arrow Lake H and 2 SSDs, one PCIe 5.0 and one PCIe 4.0. I also have a Zen 5 desktop, which like most such desktops also has 2 SSDs, one PCIe 5.0 and one PCIe 4.0. Many Ryzen motherboards, including mine, allow multiple PCIe 4.0 SSDs, but those do not increase the throughput, because they share the same link between the I/O bridge and the CPU.
So with most cheap computers you can have 1 PCIe 5.0 SSD + 1 PCIe 4.0 SSD. With PCIe 4.0, it is easy to find SSDs that reach the maximum throughput of the interface, i.e. between 7 and 7.5 GB/s. For PCIe 5.0, the throughput depends on how expensive the SSD is and on how much power it consumes, from only around 10 GB/s up to the interface limit, i.e. around 15 GB/s.
With SSDs of different speeds, RAID0 is not appropriate, but the interleaving between weights stored on one SSD and the other must be done in software, in proportion to their throughput: one third stored on the slower SSD and two thirds on the faster SSD, so that both finish reading at the same time.
A Zen 5 desktop with a discrete GPU is faster than Strix Halo when not limited by the main memory interface, but when performance is limited by SSD throughput, I bet that even the Intel NUC can reach that limit, and a faster GPU/CPU combo would not make a difference.
That sounds like a huge hassle for what I imagine must be peak speeds of low double digit tok/s PP and TG, even with effective prompt caching and self-ngram and all the other tricks, no?
If I really feel like I needed larger models locally (I don't, the 120/122B A10/12B models are awesome on my hardware), I think I'd rather just either pony up for a used M3 Ultra 512GB, wait for an M5 Ultra (hoping they bring back 512GB config on new setup), or do some old dual socket Xeon or Epyc 8/12-channel DDR4 setup where I can still get bandwidth speeds in the hundreds of GB/s.
What kinds of models are you running over 128GB, and what kind of speeds are you seeing, if you don't mind me asking?
Until now I have not run models that do not fit in 128 GB.
I have an Epyc server with 128 GB of high-throughput DRAM, which also has 2 AMD GPUs with 16 GB of DRAM each.
Until now I have experimented only with models that can fit in this memory, e.g. various medium-size Qwen and Gemma models, or gpt-oss.
But I am curious about how bigger models behave, e.g. GLM-5.1, Qwen3.5-397B-A17B, Kimi-K2.6, DeepSeek-V3.2, MiniMax-M2.7. I am also curious about how the non-quantized versions of the models with around 120B parameters behave, e.g. such versions of Nemotron and Qwen. It is said that quantization to 8 bits or even to 4 bits has negligible effects, but I want to confirm this with my own tests.
There is no way to test big models or non-quantized medium models at a reasonable cost other than with weights read from SSDs. For some tasks, it may be preferable to use a big model at a slow speed, if that means you need fewer attempts to obtain something useful. For a coding assistant, it may be possible to batch many tasks, which will progress simultaneously during a single pass over the SSD data.
For now I am studying llama.cpp in order to determine how it can be modified to achieve the maximum performance that could be reached with SSDs.
AIUI, the main obstacle to maximizing performance with SSD offload is that existing GGUF files for MoE models are not necessarily laid out so that fetching a single MoE layer-expert can be done by reading a single sequential extent off the file. It may be that the GGUF format is already flexible enough in its layout configuration that this is doable with a simple conversion; but if not, the GGUF specification would have to be extended to allow such a layout to be configured.
You are right, which is why I do not intend to use a GGUF file but a set of files with a different layout, and this is why I need to make changes in llama.cpp.
If you have to come up with a custom format anyway, why not just make it a draft extension to GGUF layout definitions (something like "coalesced expert fetch" or the like) and submit it for inclusion in the standard? Then future models could be autoconverted to such a format.
I will consider doing this after I gather enough experience to determine which layout is best, and when I have enough benchmark data to support it.
Now I want to put two P5800Xs to use. I wonder how much tinkering would be necessary to mmap a RAID setup with them directly to the GPU. I'm not really working on LLMs, more on graphics and systems, but this seems like a fun project to try out.
At least for the CPU/GPU split, llama.cpp recently added a `--fit` parameter (might default to on now?) that pairs with a `--fitc CONTEXTSIZE` parameter. That new feature will automatically look at your available VRAM and try to figure out a good CPU/GPU split for large models that leaves enough room for the context size that you request.
Qwen 3.6 27B and other dense models, as opposed to MoE models, do NOT scale well. Like I said in my original post, for 27B usage specifically, I'd take a dGPU with 32GB of VRAM over Strix Halo. I also don't usually benchmark out to 200k; my typical depths are 0, 16k, 32k, 64k, 128k. That said, with Qwen 3.5 122B A10B I am still getting 70 tok/s PP and 20 tok/s TG at 128k depth, and with Nemotron 3 Super 120B A10B, 160 tok/s PP and 16 tok/s TG at 128k depth. With smaller MoE models, I did bench Qwen 3.6 35B A3B at 214 tok/s PP at 128k and 34.5 tok/s TG at 131k.
Because dense models degrade so severely, I rarely bench them past 32k-64k, however, I did find a Gemma4 31B bench I did - down to 22 tok/s PP speed and 6 tok/s TG speed at 128k.
Nemotron models specifically, because of their Mamba2 hybrid SSM architecture, scale exceptionally well, and I have benchmarks for 200k, 300k, 400k, 500k, and 600k for Nemotron 3 Super. I will use depth: PP512/TG128 for simplicity.
Thanks, this is very helpful for planning out a local LLM buy. Sounds like we are still at least one generation out (DDR6, 500-700 GB/s memory) from getting to that magic ~25-30 tok/s TG. The Nemotron 3 Super architecture sounds promising.
Medusa Halo is on my wishlist, but I'm hearing late 2027 :(
M5 Ultra may be a better near-term option, expected in June. Supposedly ~1.2 TB/s unified memory; unsure whether Apple will revive the 512 GB SKU or limit it to 256 GB, but the new Neural Engine in every GPU core should help dramatically. These were always compute limited rather than bandwidth limited, even in the M3 Ultra era.
The big cost of course being that you're locked into Apple silicon and Apple's walled garden. You can still use macOS without creating an Apple account... for now...
At least Apple Silicon holds resale value remarkably well.
tbh ~1-3% PPL hit from Q4_K_M stopped being the bottleneck a while ago. the bottleneck is the 48 hours of guessing llama.cpp flags and chat template bugs before the ecosystem catches up. you are doing unpaid QA.
Just wait a week for model bugs to be worked out. This is well-known advice and a common practice within r/localllama. The flags are not hard at all if you're using llama.cpp regularly. If you're new to the ecosystem, that's closer to a one-time effort with irregular updates than it is to something you have to re-learn for every model.
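For what it's worth, once the GGUF and chat template are settled, a "typical" invocation isn't much more than something like this (model path and numbers are placeholders, not a recommendation for any specific model):

    # placeholder path/values; -ctk/-ctv quantize the KV cache,
    # --jinja uses the chat template embedded in the GGUF
    llama-server -m qwen3.6-27b-q4_k_m.gguf \
      -ngl 99 -c 32768 \
      -ctk q8_0 -ctv q8_0 \
      --jinja --host 127.0.0.1 --port 8080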