I don't think Apple just stumbled into it, and while I totally agree that Apple is killing it with their unified memory, I think we're going to see a pivot from NVidia and AMD. The biggest reason, I think, is: OpenAI has committed to an enormous amount of capex it simply cannot afford. It does not have the lead it once did, and most end users simply do not care. There are no network effects. Anthropic, as far as I can tell, has at this point completely consumed the developer market — the one market that is actually passionate about AI. That's largely due to a huge advantage of the developer space: end users cannot tell whether an "AI" coded something or a human did. The same can't be said for almost any other application of AI at this point.
If the OpenAI domino falls, and I'd be happy to admit if I'm wrong, we're going to see a near catastrophic drop in prices for RAM and demand by the hyperscalers to well... scale. That massive drop will be completely and utterly OpenAI's fault for attempting to bite off more than it can chew. In order to shore up demand, we'll see NVidia and AMD start selling directly to consumers. We, developers, are consumers and drive demand at the enterprises we work for based on what keeps us both engaged and productive... the end result being: the ol' profit flywheel spinning.
Both NVidia and AMD are capable of building GPUs that absolutely wreck Apple's best. A huge reason for this is Apple needs unified memory to keep their money maker (laptops) profitable and performant; and while it helps their profitability, it also forces them into less performant solutions. If NVidia dropped a 128GB GPU with GDDR7 at $4k-- absolutely no one would be looking for a Mac for inference. My 5090 is unbelievably fast at inference even if it can't load gigantic models, and quite frankly the 6-bit quantized versions of Qwen 3.5 are fantastic, but if it could load larger open weight models I wouldn't even bother checking Apple's pricing page.
tldr; competition is as stiff as it is vicious-- Apple's "lead" in inference is only because NVidia and AMD are raking in cash selling to hyperscalers. If that cash cow goes tits up, there's no reason to assume NVidia and AMD won't definitively pull the rug out from Apple.
Can we also stop giving Apple some prize for unified memory?
It was the way of doing graphics programming on home computers, consoles and arcades, before dedicated 3D cards became a thing on PC and UNIX workstations.
It depends on what you mean, do you mean both gross and net? Just one of the two?
Gross margin of zero would mean you sell at exactly the cost to produce. Net margin of zero means your revenue covers all your expenses, including COGS. The only really difficult, practically impossible, thing would be doing both at the same time. Though I could also see a case where you drive down net margins once sunk costs are paid and achieve both.
Doing so practically, or sustainably, in most circumstances would be uhh crazy… but it’s not impossible. Even then, I think aiming for zero margin is a pretty credible tactic for eliminating competition if you can outlast them.
TLDR; Weird? Sure. But not impossible. And even sort of likely if you’re trying to atrophy your competition out of existence.
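To make the gross-vs-net distinction concrete, here's a toy sketch in Python. The numbers are purely illustrative, not from any real company:

```python
def gross_margin(revenue, cogs):
    """Fraction of revenue left after cost of goods sold."""
    return (revenue - cogs) / revenue

def net_margin(revenue, cogs, opex):
    """Fraction of revenue left after COGS *and* operating expenses."""
    return (revenue - cogs - opex) / revenue

# Sell at exactly the cost to produce: gross margin is zero...
print(gross_margin(100.0, 100.0))      # 0.0
# ...but any operating expenses then push net margin negative.
print(net_margin(100.0, 100.0, 20.0))  # -0.2

# Price to cover COGS plus opex instead: net margin is zero,
# but that forces gross margin above zero.
print(gross_margin(120.0, 100.0))      # ~0.167
print(net_margin(120.0, 100.0, 20.0))  # 0.0
```

This is why hitting both zeros simultaneously only works if opex is itself zero (or already sunk and excluded).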
I am not surprised by this, and am glad to see a test like this. One thing that keeps popping up for me when using LLMs is the lack of actual understanding. I write Elixir primarily, and I can say without a doubt that none of the frontier models understand concurrency in OTP/BEAM. They look like they do, but they’ll often resort to weird code that shows they don’t understand how “actors” work. It’s an imitation of understanding, averaging all the concurrency code seen in training. The end result is a huge amount of noise when those averages aren’t enough: guarding against things that won’t happen, because they can’t… or actively introducing race conditions because they don’t understand how message passing works.
Current frontier models are really good at generating boilerplate, and really good at summarizing, but they really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that, and is a nice reminder that the LLMs are only as good as their training data.
When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see them discovering new results, solutions, and approaches to questions/problems — compared to how they work now, where they generally only seem to uncover answers that are present but obfuscated.
I’ve been thinking about this a lot lately… For years this has been said, and for most of us it wasn’t something we’d been able to experience until recently. Yet now we can see how chatbots have made sane folks lose their minds simply by being too agreeable. I think it’s a grim look at what it’s like to be hyper wealthy. The odds that they’ve completely disassociated from reality have, IMHO, increased exponentially after seeing the effects on “normal” people. The only difference is us plebs don’t have the resources to then bring our distorted view of reality to life.
I follow, and am followed by, folks who use Blacksky and interact with them regularly. I have had zero issue with this, at any point. As Paul (the CTO of Bluesky) said in a sibling comment this would be a very serious bug.
FWIW— I have also not heard anything even remotely close to this at all from anyone using either service.
I’ve not kept up with Intel in a while, but one thing that stood out to me is these are all E cores — meaning no hyperthreading. Is something like this competitive, or preferred, in certain applications? Also, does anyone know if there have been any benchmarks against AMD’s 192-core Epyc CPUs?
"Is something like this competitive, or preferred, in certain applications?"
They cite a very specific use case in the linked story: Virtualized RAN. This is using COTS hardware and software for the control plane of a 5G+ cell network operation. A large number of fast, low-power cores would indeed suit such an application, where large numbers of network nodes are coordinated in near real time.
It's entirely possible that this is the key use case for this device: 5G networks are huge money makers and integrators will pay full retail for bulk quantities of such devices fresh out of the foundry.
> how do you get them off the shelf if you also need TB of memory
You make products for well capitalized wireless operators that can afford the prevailing cost of the hardware they need. For these operations, the increase in RAM prices is not a major factor in their plans: it's a marginal cost increase on some of the COTS components necessary for their wireless system. The specialized hardware they acquire in bulk is at least an order of magnitude more expensive than server RAM.
Intel will sell every one of these CPUs and the CPUs will end up in dual CPU SMP systems fully populated with 1-2 TB of DDR5-8000 (2-4GB/core, at least) as fast as they can make them.
I do like the idea that capitalism can always ignore the broader base of consumers and just raise prices. Eventually, there'll only be one Viagra pill, bought by trillionaires for $1 million.
In HPC, like physics simulation, they are preferred. There's almost no benefit from HT. What's also preferred is high clock frequencies. These high-core-count CPUs nerf their clock frequencies, though.
It all depends on your exact workload, and I’ll wait to see benchmarks before making any confident claims, but in general if you have two threads of execution which are fine on an E-core, it’s better to actually put them on two E-cores than one hyperthreaded P-core.
Without the hyperthreading (E-cores) you get more consistent performance between running tasks, and cloud providers like this because they sell "vCPUs" that should not fluctuate when someone else starts a heavy workload.
Sort of. They can just sell even numbers of vCPUs, and dedicate each hyper-thread pair to the same tenant. That prevents another tenant from creating hyper-threading contention for you.
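As a hypothetical sketch of that pairing idea in Python — the core/thread numbering and the allocator are illustrative assumptions, not any real hypervisor's API:

```python
def allocate_vcpus(free_cores, n_vcpus):
    """Hand out vCPUs so both hyper-threads (siblings) of a physical
    core always go to the same tenant.

    free_cores: list of (thread_a, thread_b) sibling pairs.
    Returns (assigned thread IDs, remaining free cores)."""
    if n_vcpus % 2 != 0:
        # Selling only even vCPU counts means a sibling pair is
        # never split across tenants.
        raise ValueError("vCPUs must be sold in pairs")
    need = n_vcpus // 2
    if need > len(free_cores):
        raise ValueError("not enough free physical cores")
    taken, remaining = free_cores[:need], free_cores[need:]
    vcpus = [t for pair in taken for t in pair]
    return vcpus, remaining

# Toy host with 4 physical cores / 8 hardware threads.
cores = [(0, 1), (2, 3), (4, 5), (6, 7)]
tenant_a, cores = allocate_vcpus(cores, 4)  # threads 0-3, two whole cores
tenant_b, cores = allocate_vcpus(cores, 2)  # threads 4-5, one whole core
```

The trade-off versus disabling HT outright is that tenants can still contend with *their own* sibling thread, but never with a stranger's.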
For those, wouldn't hyperthreading be a win? Some fraction of the time, you'd get evicted to the hyperthread that shares your L1 cache (and the hypervisor could strongly favor that).
I don't know the nitty-gritty of why, but some compute intensive tasks don't benefit from hyperthreading. If the processor is destined for those tasks, you may as well use that silicon for something actually useful.
It's a few things, mostly along the lines of data caching (i.e. hyperthreading may mean the other thread needs a cache sync/barrier/etc.).
That said, I'll point to the Intel Atom: the first version and its refresh were 'in-order' designs where hyper-threading was the cheapest option (both silicon- and power-wise) to provide performance; with Silvermont, however, they switched to OOO execution but ditched hyper-threading.
I think some of the why is size on die: 288 E cores vs 72 P cores.
Also, there have been so many hyperthreading vulnerabilities as of late (it's often disabled on data center machines already) that I'd imagine this de-risks that entirely.
For an application like a build server, the only metric that really matters is total integer compute per dollar and per watt. When I compile e.g. a Yocto project, I don't care whether a single core compiles a single C file in a millisecond or a minute; I care how fast the whole machine compiles what's probably hundreds of thousands of source files. If E-cores give me more compute per dollar and watt than P-cores, give me E-cores.
Of course, having fewer faster cores does have the benefit that you require less RAM... Not a big deal before, you could get 512GB or 1TB of RAM fairly cheap, but these days it might actually matter? But then at the same time, if two E-cores are more powerful than one hyperthreaded P-core, maybe you actually save RAM by using E-cores? Hyperthreading is, after all, only a benefit if you spawn one compiler process per CPU thread rather than per core.
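As a back-of-envelope sketch of that RAM trade-off — the ~2 GB peak per parallel compile job is a made-up assumption, real numbers vary wildly by project:

```python
GB_PER_JOB = 2  # assumed peak RAM per compile job (illustrative)

def build_ram_gb(physical_cores, threads_per_core, jobs_per_thread=True):
    """Rough RAM needed for a maximally parallel build.

    With hyperthreading you typically run one job per hardware thread
    (e.g. `make -j$(nproc)`); without it, one job per core."""
    jobs = physical_cores * (threads_per_core if jobs_per_thread else 1)
    return jobs * GB_PER_JOB

# 72 hyperthreaded P-cores -> 144 jobs -> 288 GB
print(build_ram_gb(72, 2))
# Same chip, one job per core instead -> 72 jobs -> 144 GB
print(build_ram_gb(72, 2, jobs_per_thread=False))
# 288 E-cores, no hyperthreading -> 288 jobs -> 576 GB
print(build_ram_gb(288, 1))
```

So per *job* the RAM cost is the same either way; what hyperthreading changes is how many jobs you're tempted to run at once for the same core count.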
EDIT: Why in the world would someone downvote this perspective? I'm not even mad, just confused
It's for building embedded Linux distros, and your typical Linux distro contains quite a lot of C++ and Rust code these days (especially if you include, say, a browser, or Qt). But you have parallelism across packages, so even if one core is busy doing a serial linking step, the rest of your cores are busy compiling other packages (or maybe even linking other packages).
That said, there are sequential steps in Yocto builds too, notably installing packages into the rootfs (it uses dpkg, opkg or rpm, all of which are sequential) and any code you have in the rootfs postprocessing step. These steps usually aren't a significant part of a clean build, but can be a quite substantial part of incremental builds.
That's finally set to be resolved with Nova Lake later this year, which will support AVX10 (the new iteration of AVX512) across both core types. Better very late than never.
E cores didn't just ruin P cores; they ruined AVX-512 altogether. We were getting so close to near-universal AVX-512 support — enough to bother actually writing AVX-512 versions of things. Then Intel killed it.
I love the AVX512 support in Zen 5 but the lack of Valgrind support for many of the AVX512 instructions frustrates me almost daily. I have to maintain a separate environment for compiling and testing because of it.
There was someone at Intel working on AVX512 support in Valgrind. She is/was based in St Petersburg. Intel shuttered their Russian operations when Putin invaded Ukraine and that project stalled.
If anyone has the time and knowledge to help with AVX512 support then it would be most welcome. Fair warning, even with the initial work already done this is still a huge project.
I've seen scenarios where HT doesn't help; IIRC very CPU-heavy things without much waiting on memory access. Which makes sense, because the virtual cores share the ALU.
Also have seen it disabled in academic settings where they want consistent performance when benchmarking stuff.
I could be doing something wrong, but I have not had any success with one-shot feature implementations with any of the current models. There are always weird quirks, undesired behaviors, bad practices, or just egregiously broken implementations. A week or so ago, I had instructed Claude to do something at compile time and it instead burned a phenomenal amount of tokens before yeeting out the most absurd, and convoluted, runtime implementation — one that didn’t even work. At work I use it (or Codex) for specific tasks, delegating specific steps of the feature implementation.
The more I use the cloud-based frontier models, the more virtue I find in using local, open source/weights models, because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out here shortly on my 5090. If I read the article correctly, it sounds like even the quantized versions have a “million”[1] token context window.
And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.
[1] I’m dubious it won’t shite itself at even 50% of that. But even 250k would be amazing for a local model when I “only” have 32GB of VRAM.
WWII didn’t start overnight. The Sturmabteilung (SA), also known as “The Brownshirts,” bear a strong similarity to what we’re seeing with ICE and CBP. The SA were Hitler’s enforcers before the SS, during the 1920s and early 1930s. They were eventually usurped by the SS during the “Night of the Long Knives,” when SA leadership were executed by the SS, largely because Hitler felt threatened by the power Ernst Röhm had amassed (among other reasons). And the SA, like ICE, was made up largely of untrained sycophants and thugs who enjoyed violence. They committed violence, harassed citizens, and faced no consequences for doing so. They were also instrumental in laying the foundation for the genocide and atrocities committed by the Nazi party.
It’s not a dishonor to their memories, or the atrocities committed, to call that out. It is not a dishonor to say there are stark and real similarities between the way the US is operating and treating civilians.
I personally find the opposite: IMHO it dishonors their memories to refuse to acknowledge the similarities.
I’ve posted a comment similar to this one here before, and like how I ended it. I strongly encourage you to read about the history of Nazi Germany and how it came to happen. It wasn’t zero to death camps overnight; it was 15 years in the making. That history is as deeply shocking as it is depressing, because the parallels and timelines between it and the US are too similar for anything besides outright discomfort, sadness, and fear. But without knowing that history, we are ever more likely to repeat it.
One final thing to note: the US has a history of extreme violence; the slave patrols and the treatment of non-whites in the 19th century were an inspiration for Hitler.
I think we’re at the peak, or close to it, for these memory shenanigans. OpenAI, which is largely responsible for the shortage, just doesn’t have the capital to pay for it. It’s only a matter of time before the chickens come home to roost and the bill comes due. OpenAI is promising hundreds of billions in capex but has nowhere near that much cash on hand, and its cash flow is abysmal considering the spend.
Unless there is a true breakthrough — beyond AGI into superintelligence on existing, or near-term, hardware — I just don’t see how “trust me bro” can keep the spending party going. Competition is incredibly stiff, and it’s pretty likely we’re at the point of diminishing returns without an absolute breakthrough.
The end result is going to be RAM prices tanking in 18-24 months. The only upside will be for consumers who will likely gain the ability to run much larger open source models locally.
I’m not sure I get why this is better. Something like Tailscale makes it trivial to connect to your own machines and is likely more secure than this will be. Tailscale even has a free plan these days. Combine that with something like this that was shared on HN a few days ago: https://replay.software/updates/introducing-echo
Then you’re all in for like $3. What about WebRTC makes this better?