I hear this all the time. Why does it matter? Punishing a human for making a mistake does not prevent mistakes, nor does it undo the harm of the mistake. A human saying "my bad, I messed up" and an AI saying "my bad, I messed up" are equally worthless, in a functional sense.
mCaptcha, ALTCHA, Cap, Friendly Captcha, Private Captcha, Procaptcha, Anubis... there are literally dozens of open source alternatives that aren't feeding the Do Be Evil company... not to mention all of the commercial alternatives - if, for whatever reason, you do feel like paying for a service that costs nothing to offer.
"Citizens" United (which allows unlimited corporate political donations by classifying them as "speech", for those out of the loop) has fundamentally changed the core incentive structures of the modern political landscape. To compare a pre-CU world to a post-CU world when it comes to matters at the intersection of corporate interests and government regulatory / legislative power is comparing apples to oranges.
We need to overturn CU if we want to be able to go back to a world where government serves people rather than multinational conglomerates.
Citizens United has to be the most inaccurately cited case. It did not 'allow unlimited corporate political donations by classifying them as "speech"'.
It ruled that the federal government was wrong to restrict the speech rights of some groups while allowing other very similar groups to still retain their rights. One of the major examples of this was the media industry. A for-profit newspaper company could spend whatever amount of money it wanted to on political speech. An identical company in a different field could not. This, the court ruled, was unconstitutional.
It also did not grant corporations personhood, the other thing people like to state that it did.
Frankly, I think it's much less of an issue than it's made out to be. If money meant so much, Donald Trump wouldn't have beaten Jeb Bush on the way to beating Hillary Clinton.
To me, money in politics is the red herring to keep you from looking at the real election reform that needs to happen, some combination of open primaries (to remove the effect of primaries going to more extreme candidates rather than centrists) and an alternative to first past the post elections to allow people to vote for who they like without worrying about throwing away their votes (there are many different systems to do this, they're all major improvements).
The money in politics is used by the parties to back their preferred candidates and the voters go along with it in the general elections because they don't want to waste their votes. The money helps them do the bad thing but it isn't the bad thing.
It's worth pointing out that full digital identity verification ("doxxing" yourself to an untrustworthy, unauditable, legally unconstrained private company) is NOT the only way to verify adulthood. We have had a system in place which enables adulthood validation without enabling digital surveillance infrastructure, with a degree of false negative risk that society has deemed acceptable for nearly 100 years now. This idea is not my own, but I'm happy to share a reasonable proposal for it.
The Cashier Standard – Age Verification Without Surveillance
The "cashier standard" you advocate for has already crept toward centralized state tracking in places like Utah. When you go to a restaurant and order a drink, the staff are required to take it to the back and scan it for verification. The scanned data is also compared with a state database of DUI offenders. It's not clear whether the database is stored on site, or if that data goes out on the wire for the check; presumably the latter. Scanned data is also stored for up to 7 days by the restaurant, and it's easy to imagine further creep upping that storage bound.
This is not the case in most of the country. Utah is largely influenced by a Mormon / LDS culture that expresses heavy opposition to drinking. I am clearly not proposing that the cards be scanned Utah-style; I am proposing that they be glanced at by a cashier, everywhere-else-style.
Again, the proposal isn't for a system which requires scanning of IDs, it's for a system where the cashier glances at the ID. You're arguing against a strawman. You may argue that the system proposed could evolve into the system you're describing, but still, you're arguing against a hypothetical future fiction. If we're going to be arguing about what the proposal might evolve into in the future, we might as well be arguing about what we should be doing when aliens arrive, since they might arrive in the future, too.
> we might as well be arguing about what we should be doing when aliens arrive, since they might arrive in the future, too.
Did aliens land in multiple states already? Strawman deflections aside, scanning is the natural evolution and has already happened across multiple kinds of exchange (money markers, various IDs, various phone apps, etc). Government-issued ID has the benefit of an independent verification system. It's super expensive for various government agencies to integrate into businesses. Constituents and businesses don't want that, leading to a much more comfortable adversarial relationship, imo.
First - Alcohol and cigarettes can just be resold too. The black market for them is effectively zero because the consequences for giving them to kids are severe and the room for meaningful profit is close to zero; the same applies here.
Second - The codes would be priced on the order of magnitude of pennies per verification - think 10 cents or less, accessible even to low / fixed income folks without really making a dent in their budget.
Third - the proposal explicitly mentions a nonprofit running it as an option, and the idea would be that law codifies the method to be approved, not a specific vendor, so competitive markets could emerge, too. Would you argue that restrictions on the sale of alcohol are creating artificial winners in the private sector of alcohol manufacturing?
You're doing a huge logical jump in your first point. Alcohol and cigarettes are physical goods, digital ID is not, but you're proposing a system that turns it into a physical problem. I'm merely pointing out that's what you're doing and the issues with it.
Second, it doesn't matter what it costs, it's inconvenient and I already spent time (possibly money too) obtaining a government ID... on top of a theoretical mandate that says I need to show the ID on a bunch of websites.
Third, I'm not sure I follow your point on alcohol restrictions creating winners? The non-profit idea could potentially be good, but I'm not hopeful that real world legislation would be crafted that way.
EDIT: also more on #1 and "severe consequences" for re-selling... yes that's exactly what we want to avoid: creating more reasons to put people in prison and a bigger burden on law enforcement and the court system.
With digital tokens being generated by a user (the seller) on demand, you could have a bond system where the seller places something costly on the line, that the buyer can choose to destroy or obtain. For instance, if Alice gives her age token to Bob, Bob can (if he is a troll) invalidate the token in a way that requires Alice to go to a physical location to reset her ID.
I imagine this could be done with appropriate zero-knowledge measures so that the combination of Alice's age token and Bob's private key creates a capability to exercise the option, but without the service (e.g. a social media site) knowing that the token belongs to Alice, and without the ID provider (e.g. the state) knowing that Bob was the one who exercised it.
While honest customers have no reason to make use of this option, if Alice blindly sells her tokens to anybody willing to pay, there's bound to be some trolls out there who will do it just for the laughs.
This is far from a perfect system since a dishonest site could also make use of the option. But it theoretically works without revealing anybody's identity (unless the option is used, and then only if the service and the ID provider collude).
It doesn't prevent it, it just disincentivizes it. As an adult, you can also go buy a beer and sell it to a minor. That said, mandatory age verification with photo ID upload and facial scans doesn't prevent workarounds either - kids use their parents' photo ID and pass facial scans with a variety of techniques, too.
Nobody who understands how adversarial systems like this work is seriously expecting 100% flawless performance, blocking every single minor and accepting every single adult; the question is how much risk is acceptable, and the risks posed by this system are acceptable for alcohol, cigarettes, and other adult items that can arguably pose a much more acute risk of serious injury or bodily harm to kids.
Thinking about it, I probably wouldn't remember to change my fingerprints to the new ones with all the services I use, I'd probably have to carry my "legacy fingerprints" wherever I go for some time to avoid a lockout.
These aren't really "credentials" in that they're not secret the way your iris/retina pattern, fingerprint pattern, password, pin, secret key, or security token are.
Your name, DoB, and email address are identifiers, yes, but aren't really authenticators - they're more like a username, not a password.
Similar numbers here - slightly higher PP, and slightly better peak speed and retention w/ q8_0 KV cache quants too. llama-bench results here, cba to format for HN:
https://pastebin.com/raw/zgJeqRbv
GTR 9 Pro, "performance" profile in BIOS, GTT instead of GART, Fedora 44
If I did a proper benchmark I think the numbers would be what you got. Minimax M2.7 is also surprisingly not that slow, and in some ways faster, as it seems to get things right with less thinking output (around 140 t/s prefill and 23 t/s generation).
The problem with M2.7 is that it's full GQA, i.e. standard quadratic attention. It does start fast, but by 64k tokens deep, the version I'm running (Unsloth's UD IQ2_XXS) sees pp512 drop 95%, from 261.3 t/s at 0 context depth to 13.1 t/s. q8_0 KV does help, still hitting 57.4 t/s at 64k depth vs 258.3 t/s at 0 depth. TG's retention rates are better, but still approaching single digits by 64k depth even with q8_0 KV cache.
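For reference, the kind of sweep behind those numbers is just llama-bench with a depth list and the KV cache type flags; roughly something like this (the model filename here is a placeholder):

    # depth sweep with quantized KV cache; the .gguf path is hypothetical
    llama-bench -m minimax-m2.7-ud-iq2_xxs.gguf \
      -ngl 99 -fa 1 \
      -ctk q8_0 -ctv q8_0 \
      -p 512 -n 128 \
      -d 0,16384,32768,65536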
That said, it was my favorite model when I valued output quality above all else, at least up until the new Qwen 3.6 27B, which I'm currently playing with.
I suspect I will like Qwen 3.6 122B A10B a LOT, maybe even better than M2.7.
You absolutely do not need to run at full BF16. The quality loss between BF16 (55.65 GB in GGUF) and Q8_0 (30.44 GB in GGUF) is essentially zero - think on the order of +0.01-0.03 perplexity, or ~0.1-0.3% relative PPL increase. The quality loss between BF16 and Q4_K_M (18.66 GB in GGUF) is close to imperceptible, with perplexity changes in the +0.1-0.3 ballpark, or ~1-3% relative PPL increase. That correlates to roughly a 0-2% drop on downstream tasks like MMLU/GSM8K/HellaSwag: essentially indistinguishable.
You absolutely do NOT need a $3000 Strix Halo rig or a $4000 Mac or a $9000 RTX 6000 or "multiple high memory consumer GPUs" to run this model at extremely high accuracy. I say this as a huge Strix Halo fanboy (Beelink GTR 9 Pro), mind you. Where Strix Halo is more necessary (and actually offers much better performance) is with larger but sparse MoE models - think Qwen 3.5 122B A10B - which offer the total knowledge (and memory requirements) of a 122B model with processing and generation speed more akin to a 10B dense model. That's a big deal with the limited memory bandwidth we get in the land of Strix Halo (256 GB/s theoretical, ~220 GB/s real-world) and DGX Spark (273 GB/s theoretical - not familiar with real-world numbers specifically off the top of my head).
I would make the argument, as a Strix Halo owner, that 27B dense models are actually not particularly pleasant or snappy to run on Strix Halo, and you're much better off with those larger but sparse MoE models with far fewer active parameters on such systems. I'd much rather have an RTX 5090, an Arc B70 Pro, or an AMD AI PRO R9700 (dGPUs with 32GB of GDDR6/7) for 27B dense models specifically.
I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference.
I use llama.cpp, and there is a way to do this - some layers go to the (i)GPU, the rest to the CPU. I was just trying this out with Kimi K2.5 (in preparation for trying it with Kimi K2.6) the other night. Check out the --n-cpu-moe flag in llama.cpp.
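A minimal sketch of what that looks like (model path and layer count are made up; you tune --n-cpu-moe until the attention/dense layers fit in the iGPU's carve-out):

    # hypothetical path and values; --n-cpu-moe keeps the expert weights of
    # the first N layers in system RAM (mmap'd from disk), while -ngl sends
    # everything else to the (i)GPU
    llama-server -m kimi-k2.5-ud-iq2_xxs.gguf \
      -ngl 999 --n-cpu-moe 60 \
      -c 16384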
That said, my Strix Halo rig only has PCIe 4.0 for its NVMe, and I'm using a 990 Evo, which has poor sustained random reads, being DRAM-less. My effective read speeds from disk averaged around 1.6-2.0 GB/s. With Unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, and with ~64 GB worth of layers in the iGPU and the rest free for KV cache + checkpoints, that still left over 250 GB of weights streaming from disk at ~2 GB/s, so I was getting 0.35 tok/s PP and 0.22 tok/s TG.
I could go a little faster with a better drive, or a little faster still if I dropped two of them in RAID0, but it would still be on the order of sub-1 tok/s for both PP (compute limited) and TG (bandwidth limited).
In a computer with 2 PCIe 5.0 SSDs, or one with a PCIe 5.0 SSD and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s or even more.
This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable.
Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots.
I have actually started working on optimizing such an inference system, so your data is helpful for comparison.
Strix Halo, to my knowledge, does not support PCIe 5.0 NVMe drives, unfortunately, despite it being Zen 5, and Zen 5 supporting the PCIe 5.0 standard.
While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.
The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S, and what little you'd gain in this narrow use case with PCIe gen 5, you'd lose a lot in the more common use cases of models that can fit into a 128 GB pool of unified memory, and there are some really nice ones.
Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 14-15 GB/s each (so roughly 28-30 GB/s with two in RAID0), but that's peak sequential reads, which is neither reflective of sustained reads nor of the random read workloads that dominate reading model weights.
Maybe there are some Intel NUCs competing in this space that I'm less up to speed on which do support PCIe 5.0. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, the Panther Lake that's actually shipping comes with so many fewer Xe3 cores that it's still a weaker system overall.
Maybe DGX Spark supports PCIe 5.0; I don't own one and am admittedly not as familiar with that platform either. It's worth mentioning, though, that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo rigs run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still).
While you are right about the advantages of Strix Halo, those advantages matter only as long as you can fit the entire model inside the 128 GB DRAM.
If you use a bigger model and your performance becomes limited by SSD throughput, then a slower CPU and GPU will not affect the performance in an optimized implementation, where weights are streamed continuously from the SSDs and all computations are overlapped with that streaming.
I have an ASUS NUC with Arrow Lake H and 2 SSDs, one PCIe 5.0 and one PCIe 4.0. I also have a Zen 5 desktop, which like most such desktops also has 2 SSDs, one PCIe 5.0 and one PCIe 4.0. Many Ryzen motherboards, including mine, allow multiple PCIe 4.0 SSDs, but those do not increase the throughput, because they share the same link between the I/O bridge and the CPU.
So with most cheap computers you can have 1 PCIe 5.0 SSD + 1 PCIe 4.0 SSD. With PCIe 4.0, it is easy to find SSDs that reach the maximum throughput of the interface, i.e. between 7 and 7.5 GB/s. For PCIe 5.0, the throughput depends on how expensive the SSD is and on how much power it consumes, from only around 10 GB/s up to the interface limit, i.e. around 15 GB/s.
With SSDs of different speeds, RAID0 is not appropriate, but the interleaving between weights stored on one SSD and the other must be done in software, in proportion to their throughput: one third stored on the slower SSD and two thirds on the faster SSD, so that both finish reading at the same time.
A Zen 5 desktop with a discrete GPU is faster than Strix Halo when not limited by the main memory interface, but when performance is limited by SSD throughput, I bet that even the Intel NUC can reach that limit, and a faster GPU/CPU combo would not make a difference.
That sounds like a huge hassle for what I imagine must be peak speeds of low double digit tok/s PP and TG, even with effective prompt caching and self-ngram and all the other tricks, no?
If I really feel like I needed larger models locally (I don't, the 120/122B A10/12B models are awesome on my hardware), I think I'd rather just either pony up for a used M3 Ultra 512GB, wait for an M5 Ultra (hoping they bring back 512GB config on new setup), or do some old dual socket Xeon or Epyc 8/12-channel DDR4 setup where I can still get bandwidth speeds in the hundreds of GB/s.
What kinds of models are you running over 128GB, and what kind of speeds are you seeing, if you don't mind me asking?
Until now I have not run models that do not fit in 128 GB.
I have an Epyc server with 128 GB of high-throughput DRAM, which also has 2 AMD GPUs with 16 GB of DRAM each.
Until now I have experimented only with models that can fit in this memory, e.g. various medium-size Qwen and Gemma models, or gpt-oss.
But I am curious about how bigger models behave, e.g. GLM-5.1, Qwen3.5-397B-A17B, Kimi-K2.6, DeepSeek-V3.2, MiniMax-M2.7. I am also curious about how the non-quantized versions of the models with around 120B parameters behave, e.g. such versions of Nemotron and Qwen. It is said that quantization to 8 bits or even to 4 bits has negligible effects, but I want to confirm this with my own tests.
There is no way to test big models or non-quantized medium models at a reasonable cost other than with weights read from SSDs. For some tasks, it may be preferable to use a big model at a slow speed, if that means you need fewer attempts to obtain something useful. For a coding assistant, it may be possible to batch many tasks, which will progress simultaneously during a single pass over the SSD data.
For now I am studying llama.cpp in order to determine how it can be modified to achieve the maximum performance that could be reached with SSDs.
AIUI, the main obstacle to maximizing performance with SSD offload is that existing GGUF files for MoE models are not necessarily laid out so that fetching a single MoE layer-expert can be done by reading a single sequential extent off the file. It may be that the GGUF format is already flexible enough in its layout configuration that this is doable with a simple conversion; but if not, the GGUF specification would have to be extended to allow such a layout to be configured.
You are right, which is why I do not intend to use a GGUF file but a set of files with a different layout, and this is why I need to make changes in llama.cpp.
If you have to come up with a custom format anyway, why not just make it a draft extension to GGUF layout definitions (something like "coalesced expert fetch" or the like) and submit it for inclusion in the standard? Then future models could be autoconverted to such a format.
I will consider doing this after I gather enough experience to determine which layout is best, and when I have enough benchmark data to support it.
Now I want to put two P5800Xs to use. I wonder how much tinkering would be necessary to mmap a RAID setup with them directly to the GPU. I'm not really working on LLMs, more on graphics and systems, but this seems like a fun project to try out.
At least for the CPU/GPU split, llama.cpp recently added a `--fit` parameter (might default to on now?) that pairs with a `--fitc CONTEXTSIZE` parameter. That new feature will automatically look at your available VRAM and try to figure out a good CPU/GPU split for large models that leaves enough room for the context size that you request.
Qwen 3.6 27B and other dense models, as opposed to MoE models, do NOT scale well. Like I said in my original post, for 27B usage specifically, I'd take a dGPU with 32GB of VRAM over Strix Halo. I also don't usually benchmark out to 200k; my typical depths are 0, 16k, 32k, 64k, 128k. That said, with Qwen 3.5 122B A10B I am still getting 70 tok/s PP and 20 tok/s TG at 128k depth, and with Nemotron 3 Super 120B A10B, 160 tok/s PP and 16 tok/s TG at 128k depth. With smaller MoE models, I did bench Qwen 3.6 35B A3B at 214 tok/s PP at 128k and 34.5 tok/s TG at 131k.
Because dense models degrade so severely, I rarely bench them past 32k-64k, however, I did find a Gemma4 31B bench I did - down to 22 tok/s PP speed and 6 tok/s TG speed at 128k.
Nemotron models specifically, because of their Mamba2 hybrid SSM architecture, scale exceptionally well, and I have benchmarks for 200k, 300k, 400k, 500k, and 600k for Nemotron 3 Super. I will use depth: PP512/TG128 for simplicity.
Thanks, this is very helpful for planning out a local LLM buy. Sounds like we are still at least one generation out (DDR6, 500-700 GB/s memory) from getting to that magic ~25-30 tok/s TG. The Nemotron 3 Super architecture sounds promising.
Medusa Halo is on my wishlist, but I'm hearing late 2027 :(
M5 Ultra may be a better near-term option, expected in June. Supposedly ~1.2 TB/s unified memory; unsure whether Apple will revive the 512 GB SKU or limit it to 256 GB, but the new Neural Engine in every GPU core should help dramatically. These were always compute limited rather than bandwidth limited, even in the M3 Ultra era.
The big cost of course being that you're locked into Apple silicon and Apple's walled garden. You can still use macOS without creating an Apple account... for now...
At least Apple Silicon holds resale value remarkably well.
tbh ~1-3% PPL hit from Q4_K_M stopped being the bottleneck a while ago. the bottleneck is the 48 hours of guessing llama.cpp flags and chat template bugs before the ecosystem catches up. you are doing unpaid QA.
Just wait a week for model bugs to be worked out. This is well-known advice and a common practice within r/localllama. The flags are not hard at all if you're using llama.cpp regularly. If you're new to the ecosystem, that's closer to a one-time effort with irregular updates than it is to something you have to re-learn for every model.
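For what it's worth, once the GGUF and chat template are settled, a "typical" invocation isn't much more than something like this (model path and numbers are placeholders, not a recommendation for any specific model):

    # placeholder path/values; -ctk/-ctv quantize the KV cache,
    # --jinja uses the chat template embedded in the GGUF
    llama-server -m qwen3.6-27b-q4_k_m.gguf \
      -ngl 99 -c 32768 \
      -ctk q8_0 -ctv q8_0 \
      --jinja --host 127.0.0.1 --port 8080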