Cerebras has effectively 100% yield on these chips. They have an internal structure made by just repeating the same small modular units over and over again. This means they can just fuse off the broken bits without affecting overall function. It's not like it is with a CPU.
That's what it's running on. It's optimized for very high throughput using Cerebras' hardware which is uniquely capable of running LLMs at very, very high speeds.
Takes much longer to build, requires a much larger up-front investment, and requires a lot more land.
The footprint needed when trying to generate this much power from solar or wind necessitates large-scale land acquisition plus the transmission infrastructure to get all that power to the actual data center, since you won't usually have enough land directly adjacent to it. That plus all the battery infrastructure makes it a non-starter for projects where short timescales are key.
Precision bombing during WW2 was not possible at the required scale. To put a bomb precisely on target back then you needed something like a dive bomber, a tactic which is incompatible with strategic-scale bombing. Even "precise" methods using advanced analog computers like the Norden bombsight could only do so much.
>Under combat conditions the Norden did not achieve its expected precision, yielding an average CEP in 1943 of 1,200 feet (370 m)[1]
This means that 50% of bombs fell within 1,200 feet of the target, which is an absolutely awful accuracy if you're trying to hit anything specific.
This was further compounded during the campaign against Japan by the heavy reliance of Japanese wartime industry on cottage industries, which were dispersed almost randomly within Japanese population centers rather than being located within specialized industrial districts. From a purely strategic standpoint concerned only with destroying the enemy's ability to make war, the most effective way to disrupt these kinds of industries with 1945 technology was essentially to burn every building in the city to the ground. Other options were simply ineffective.
This article doesn't mention TPUs anywhere. I don't think it's obvious to people outside of Google's ecosystem just how extraordinarily good the JAX + TPU ecosystem is. Google has several structural advantages over other major players, but the largest one is that they roll their own compute solution, which is actually very mature and competitive. TPUs are extremely good at both training and inference[1], especially at scale. Google's ability to tailor their mature hardware to exactly what they need gives them a massive leg up on the competition. AI companies fundamentally have to answer the question "what can you do that no one else can?". Google's hardware advantage provides an actual answer to that question, one which can't be erased the next time someone drops a new model onto Hugging Face.
> I’m forgetting something. Oh, of course, Google is also a hardware company. With its left arm, Google is fighting Nvidia in the AI chip market (both to eliminate its former GPU dependence and to eventually sell its chips to other companies). How well are they doing? They just announced the 7th version of their TPU, Ironwood. The specifications are impressive. It’s a chip made for the AI era of inference, just like Nvidia Blackwell
To be fair to thunderbird120, the author of this piece made edits at some point. See https://archive.is/K4n9E. No discussion of the recent TPU releases, or TPUs at all, for that matter.
"I’m forgetting something." was a giant blaring clue. Take this as an opportunity to learn the lesson of not calling someone a liar unless you are very very sure and have taken all the evidence into account.
whoah. I accused no one of lying. It is amazing how many people mix up this basic idea: Saying something that is incorrect is not the same as lying about it.
"that section wasn't in the article when I wrote that comment."
"It was there."
Everyone understands that such naysaying is effectively an accusation of lying. In any case it was a totally low effort utterly inappropriate comment. Clearly you aren't going to learn the lesson.
Assuming that DeepSeek continues to open-source, then we can assume that in the future there won't be any "secret sauce" in model architecture. Only data and training/serving infrastructure, and Google is in a good position with regard to both.
Google is also in a great position wrt distribution - to get users at scale, and attach to pre-existing revenue streams. Via Android, Gmail, Docs, Search - they have a lot of reach. YouTube as well, though fit there is maybe less obvious.
Combined with the two factors you mention, and the size of their warchest - they are really excellently positioned.
Yea, I just ended up trying out their Gemini stuff in Sheets and Slides. In Sheets it's pretty cool to have it just directly insert formulas for me. In Slides it's...okay...it was useful for me to rush to get a presentation done. But I can tell it's pretty bad compared to literally anyone who has enough time to just create a decent presentation. But I can also tell it will get better at least.
Does that ever provide you with anything more than a lame summary? I mean, Gemini models can do a lot, but I don't have the feeling they're well integrated tbh.
Personally, I've never actually had this problem. Do you watch enough industry-specific videos in a non-anonymized browser session? Once you watch, like, 5 videos on topics you care about, the algorithm has no shortage of astute suggestions.
I'm not the person you're replying to, but in my experience the YouTube algorithm is quite bad at satisfying my wish for a variety of topics and tones. I feel like watching one or two clips from a channel suddenly floods me with it going forward, which is rarely what I want. Personally, I have a core set of things I want lots of, plus I'd really appreciate brief forays into new topics with similar creators, or new creators with similar topics, but this feels completely impossible for me on yt.
I think they've jumped the shark and need to give me more control because currently I actively avoid watching videos I think MIGHT be interesting because the risk is too high. This is a terrible position to put your users in both from a specific experience perspective but also in a "how they feel about your product" perspective.
> I actively avoid watching videos I think MIGHT be interesting because the risk is too high
You can remove videos from your viewing history. I do this when I start watching something but the content turns out not to be what I expected. It seems to prevent polluting my recommendations.
The problem with that is, I have more than one interest. So any algorithm will make a salad out of that and never hit me with what I want. Even if this particular session was about topic X, they will still fill it with proposals from yesterday's topic Y, or extrapolate from them to topic Z I don't care about, and so on. Mostly useless either way.
There are videos which YouTube will very, very rarely recommend to you, especially if they are not monetised and, let's say, critical of or opposed to the mainstream.
Making your own hardware would seem to yield freedoms in model architectures as well since performance is closely related to how the model architecture fits the hardware.
... except that it still pretty much requires Nvidia hardware. Maybe not for edge inference, but even inference at scale (ie. say at companies, or governments) will still require it.
TPUs aren't necessarily a pro. They go back 15 years and don't seem to have yielded any kind of durable advantage. Developing them is expensive but their architecture was often over-fit to yesterday's algorithms which is why they've been through so many redesigns. Their competitors have routinely moved much faster using CUDA.
Once the space settles down, the balance might tip towards specialized accelerators but NVIDIA has plenty of room to make specialized silicon and cut prices too. Google has still to prove that the TPU investment is worth it.
Not sure how familiar you are with the internal situation... But from my experience I think it's safe to say that TPUs basically multiply Google's computation capability by 10x, if not 20x. Also, they don't need to compete with others to secure expensive Nvidia chips. If this is not an advantage, I don't see what would be. The entire point of vertical integration is to secure full control of your stack so your capability won't be limited by potential competitors, and the TPU is one of the key components of that strategy.
Also worth noting that its Ads division is the largest, heaviest user of TPUs. Thanks to that, it can afford to run a bunch of different expensive models that you could not realistically run with GPUs. The revenue delta from this is more than enough to pay off the entire investment history of the TPU.
They must very much compete with others. All these chips are being fabbed at the same facilities in Taiwan and capacity trades off against each other. Google has to compete for the same fab capacity alongside everyone else, as well as skilled chip designers etc.
> The revenue delta from this is more than enough to pay off the entire investment history for TPU.
Possibly; such statements were common when I was there too but digging in would often reveal that the numbers being used for what things cost, or how revenue was being allocated, were kind of ad hoc and semi-fictional. It doesn't matter as long as the company itself makes money, but I heard a lot of very odd accounting when I was there. Doubtful that changed in the years since.
Regardless the question is not whether some ads launches can pay for the TPUs, the question is whether it'd have worked out cheaper in the end to just buy lots of GPUs. Answering that would require a lot of data that's certainly considered very sensitive, and makes some assumptions about whether Google could have negotiated private deals etc.
> They must very much compete with others. All these chips are being fabbed at the same facilities in Taiwan and capacity trades off against each other.
I'm not sure what point you're trying to make here. Following your logic, even if you have a fab you need to compete for rare metals, ASML machines, etc. That's logic built for nothing but its own sake. In the real world, it is much easier to compete once you're outside Nvidia's own allocation, as you get rid of the critical bottleneck. And Nvidia has every incentive to control the supply to maximize its own profit, not to meet demand.
> Possibly; such statements were common when I was there too but digging in would often reveal that the numbers being used for what things cost, or how revenue was being allocated, were kind of ad hoc and semi-fictional.
> Regardless the question is not whether some ads launches can pay for the TPUs, the question is whether it'd have worked out cheaper in the end to just buy lots of GPUs.
Of course everyone can build their own narratives in favor of their launch, but I've been involved in some of those ads quality launches and can say pretty confidently that most of those launches would not be launchable without TPU at all. This was especially true in the early days of TPU as the supply of GPU for datacenter was extremely limited and immature.
Can more GPUs solve it? Companies talk about 100k~200k H100s as a massive cluster, and Google already has much larger TPU clusters, with computation capability in a different order of magnitude. The problem is, you cannot simply buy more computation even if you have lots of money. I've been pretty clear about how relying on Nvidia's supply could be a critical limiting factor from a strategic point of view, but you're trying to move the point. Please don't.
So are the electric and cooling costs at Google's scale. Improving perf-per-watt efficiency can pay for itself. The fact that they keep iterating on it suggests it's not a negative-return exercise.
TPUs probably can pay for themselves, especially given NVIDIA's huge margins. But it's not a given that it's so just because they fund it. When I worked there Google routinely funded all kinds of things without even the foggiest idea of whether it was profitable or not. There was just a really strong philosophical commitment to doing everything in house no matter what.
> When I worked there Google routinely funded all kinds of things without even the foggiest idea of whether it was profitable or not.
You're talking about small-money bets. The technical infrastructure group at Google makes a lot of them, to explore options or hedge risks, but they only scale the things that make financial sense. They aren't dumb people after all.
The TPU was a small-money bet for quite a few years until this latest AI boom.
Maybe it's changed. I'm going back a long way but part of my view on this was shaped by an internal white paper written by an engineer who analyzed the cost of building a Gmail clone using commodity tech vs Google's in house approach, this was maybe circa 2010. He didn't even look at people costs, just hardware, and the commodity tech stack smoked Gmail's on cost without much difference in features (this was focused on storage and serving, not spam filtering where there was no comparably good commodity solution).
The cost delta was massive and really quite astounding to see spelled out because it was hardly talked about internally even after the paper was written. And if you took into account the very high comp Google engineers got, even back then when it was lower than today, the delta became comic. If Gmail had been a normal business it'd have been outcompeted on price and gone broke instantly, the cost disadvantage was so huge.
The people who built Gmail were far from dumb but they just weren't being measured on cost efficiency at all. The same issues could be seen at all levels of the Google stack at that time. For instance, one reason for Gmail's cost problem was that the underlying shared storage systems like replicated BigTables were very expensive compared to more ordinary SANs. And Google's insistence on being able to take clusters offline at will with very little notice required a higher replication factor than a normal company would have used. There were certainly benefits in terms of rapid iteration on advanced datacenter tech, but did every product really need such advanced datacenters to begin with? Probably not. The products I worked on didn't seem to.
Occasionally we'd get a reality check when acquiring companies and discovering they ran competitive products on what was for Google an unimaginably thrifty budget.
So Google was certainly willing to scale things up that only made financial sense if you were in an environment totally unconstrained by normal budgets. Perhaps the hardware divisions operate differently, but it was true of the software side at least.
The issue isn't number of designs but architectural stability. NVIDIA's chips have been general purpose for a long time. They get faster and more powerful but CUDA has always been able to run any kind of neural network. TPUs used to be over-specialised to specific NN types and couldn't handle even quite small evolutions in algorithm design whereas NVIDIA cards could. Google has used a lot of GPU hardware too, as a consequence.
At the same time if the TPU didn't exist NVIDIA would pretty much have a complete monopoly on the market.
While Nv does have an unlimited money printer at the moment, the fact that at least some potential future competition exists does represent a threat to that.
Depending how you count, parent comment is accurate. Hardware doesn't just appear. 4 years of planning and R&D for the first generation chip is probably right.
I was wrong, ironically because Google's AI overview says it's 15 years if you search. The article it's quoting from appears to be counting the creation of TensorFlow as an "origin".
Google has their own cloud with their data centers with their own custom designed hardware using their own machine learning software stack running their in-house designed neural networks.
The only thing Google is missing is designing a computer memory that is specifically tailored for machine learning. Something like processing in memory.
The one thing they lack that OpenAI has is… product focus. There's some kind of management issue that makes Google all over the shop, cancelling products for no reason. Whereas Sam Altman's team is right on the money.
I've used Jax quite a bit and it's so much better than tf/pytorch.
Now for the life of me, I still haven't been able to understand what a TPU is. Is it Google's marketing term for a GPU? Or is it something different entirely?
There's basically a difference in philosophy. GPU chips have a bunch of cores, each of which is semi-capable, whereas TPU chips have (effectively) one enormous core.
So GPUs have ~120 small systolic arrays, one per SM (aka, a tensor core), plus passable off-chip bandwidth (16 lanes of PCIe).
Whereas TPUs have one honking big systolic array, plus large amounts of off-chip bandwidth.
This roughly translates to GPUs being better if you're doing a bunch of different small-ish things in parallel, but TPUs are better if you're doing lots of large matrix multiplies.
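The tile-size trade-off above can be made concrete with a toy, software-only sketch (this models nothing about real hardware timing; the tile width just stands in for the width of a systolic array). A big matmul decomposes into tile-sized block multiplies; a GPU-style chip does many small passes in parallel, a TPU-style chip fewer, larger ones:

```python
import random

def matmul(a, b):
    """Naive reference matmul for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def tiled_matmul(a, b, tile):
    """Same result, accumulated tile-by-tile.

    Each (tile x tile) block multiply stands in for one pass through a
    systolic array of width `tile`; the pass count shows how the work
    divides up differently for small vs. large arrays.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    passes = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                passes += 1
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        c[i][j] += sum(a[i][p] * b[p][j]
                                       for p in range(p0, min(p0 + tile, k)))
    return c, passes

random.seed(0)
A = [[random.randint(0, 9) for _ in range(8)] for _ in range(8)]
B = [[random.randint(0, 9) for _ in range(8)] for _ in range(8)]

ref = matmul(A, B)
small, small_passes = tiled_matmul(A, B, tile=2)  # many small "arrays"
big, big_passes = tiled_matmul(A, B, tile=8)      # one big "array"
assert small == ref and big == ref
assert big_passes < small_passes  # fewer, larger passes for the big array
```

Both tilings compute the same product; the difference is only in how the work is chunked, which is why the architectures diverge on workloads with many small independent ops versus one large matmul.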
Way back when, most of a GPU was for graphics. Google decided to design a completely new chip, which focused on the operations for neural networks (mainly vectorized matmul). This is the TPU.
It's not a GPU, as there is no graphics hardware there anymore. Just memory and very efficient cores, capable of doing massively parallel matmuls on the memory. The instruction set is tiny, basically only capable of doing transformer operations fast.
Today, I'm not sure how much graphics an A100 GPU still can do. But I guess the answer is "too much"?
TPUs (short for Tensor Processing Units) are Google's custom AI accelerator hardware, completely separate from GPUs. I remember they introduced them around 2015, but I imagine they're really starting to pay off with Gemini.
Believe it or not, I'm also familiar with Wikipedia. It says that they're optimized for low-precision, high-throughput work. To me this sounds like a GPU with a specific optimization.
It's a chip (and associated hardware) that can do linear algebra operations really fast. XLA and TPUs were co-designed, so as long as what you are doing is expressible in XLA's HLO language (https://openxla.org/xla/operation_semantics), the TPU can run it, and in many cases run it very efficiently. TPUs have different scaling properties than GPUs (think sparser but much larger communication), no graphics hardware inside them (no shader hardware, no raytracing hardware, etc), and a different control flow regime ("single-threaded" with very-wide SIMD primitives, as opposed to massively-multithreaded GPUs).
Thank you for the answer! You see, up until now I had never appreciated that a GPU does more than matmuls... And that first reference, what a find :-)
Edit: And btw, another question that I had had before was what's the difference between a tensor core and a GPU, and based on your answer, my speculative answer to that would be that the tensor core is the part inside the GPU that actually does the matmuls.
Amazon also invests in own hardware and silicon -- the Inferentia and Trainium chips for example.
But I am not sure how AWS and Google Cloud match up in terms of making this vertical integration work for their competitive advantage.
Any insight there - would be curious to read up on.
I guess Microsoft for that matter also has been investing -- we heard about the latest quantum breakthrough that was reported as creating a fundamentally new physical state of matter. Not sure if they also have some traction with GPUs and others with more immediate applications.
I think Amazon and Meta have been trying on inference hardware; they throw their hands up on training. But TPUs can actually be used in training, based on what I saw in Google's Colab.
The problem is always their company, never the product. They have had countless great products, but you can't depend on a product if the company is reliably unreliable. If they don't simply delete it for being expensive and "unprofitable", then even if it initially wins, eventually, like Search and YouTube, it will be so watered down you can't taste the wine.
I am the author of the article. It was there since the beginning, just behind the paywall, which I removed due to the amount of interest the topic was receiving.
And yet google's main structural disadvantage is being google.
Modern BERT with the extended context has solved natural language web search. I mean it as no exaggeration that _everything_ google does for search is now obsolete. The only reason why google search isn't dead yet is that it takes a while to index all web pages into a vector database.
And yet it wasn't google that released the architecture update, it was hugging face as a summer collaboration between a dozen people. Google's version came out in 2018 and languished for a decade because it would destroy their business model.
Google is too risk averse to do anything, but completely doomed if they don't cannibalize their cash cow product. Web search is no longer a crown jewel, but plumbing that answering services, like perplexity, need. I don't see google being able to pull off an iPhone moment where they killed the iPod to win the next 20 years.
> Modern BERT with the extended context has solved natural language web search. I mean it as no exaggeration that _everything_ google does for search is now obsolete.
The web UI for people using search may be obsolete, but search is hot, all AIs need it, both web and local. It's because models don't have recent information in them and are unable to reliably quote from memory.
I see search engines as a dripfeed from a firehose, not some magical thing that's going to get me the 100% correct 100% accurate result.
Humans are the most prolific liars; I could never trust search results anyway since Google may find something that looks right but the author may be heavily biased, uninformed and all manner of other things anyways.
The point is that the secret sauce in Google's search was better retrieval, and the assertion above is that the advantage there is gone. While crawling the web isn't a piece of cake, it's a much smaller moat than retrieval quality was.
My experience running a few hundred very successful shops (hundreds of thousands of orders per month) is that there's no need for quotes around 'abusive'.
95% of our load is from crawlers, so we have to pick who to serve.
If they want our data all they need to do is offer a way for us to send it, we're happy to increase exposure and shopping aggregation site updates are our second highest priority task after price and availability updates.
> Google is too risk averse to do anything, but completely doomed if they don't cannibalize their cash cow product.
Google's cash-cow product is relevant ads. You can display relevant ads in LLM output or natural language web-search. As long as people are interacting with a Google property, I really don't think it matters what that product is, as long as there are ad views. Also:
> Web search is no longer a crown jewel, but plumbing that answering services, like perplexity, need
This sounds like a gigantic competitive advantage if you're selling AI-based products. You don't have to give everyone access to the good search via API, just your inhouse AI generator.
Kodak was well placed to profit from the rise of digital imaging - in the late 1970s and early 1980s Kodak labs pioneered colour image sensors, and was producing some of the highest resolution CCDs out there.
Bryce Bayer worked for Kodak when he invented and patented the Bayer pattern filter used in essentially every colour image sensor to this day.
But the problem was: Kodak had a big film business - with a lot of film factories, a lot of employees, a lot of executives, and a lot of recurring revenue. And jumping into digital with both feet would have threatened all that.
So they didn't capitalise on their early lead - and now they're bankrupt, reduced to licensing their brand to third-party battery makers.
> You can display relevant ads in LLM output or natural language web-search.
Maybe. But the LLM costs a lot more per response.
Making half a cent is very profitable if it only takes 0.2 s of CPU time. Making half a cent with 30 seconds on multiple GPUs consuming 1000 W of power... isn't.
This is a good anecdote and it reminds me of how Sony had cloud architecture/digital distribution, a music label, movie studio, mobile phones, music players, speakers, tvs, laptops, mobile apps... and totally missed out on building Spotify or Netflix.
I do think Google is a little different to Kodak however; their scale and influence is on another level. GSuite, Cloud, YouTube and Android are pretty huge diversifications from Search in my mind even if Search is still the money maker...
That, and while Sony had all these big groups, they often didn't play nice with each other. Look at how they failed to make MiniDisc into any useful data platform for PCs, largely because MDs were consumer devices and not personal computers, so they were pretty much only seen as music hardware.
Even on the few Vaios that had MD drives on them, they're pretty much just an external MD player permanently glued to the device instead of being a full and deeply integrated PC component.
It goes to internal corporate culture, and what happens to you when you point out an uncomfortable truth. Do we shoot the messenger, or heed her warnings and pivot the hopefully not Titanic? RIM/Blackberry didn't manage to avoid it either.
People like to believe CEOs aren't worth their pay package, and sometimes they're not. But looking at a couple of these failures, and considering that a different CEO of Kodak might have avoided what happened, makes me think that sometimes, some of them do deserve it.
I firmly believe the majority of CEOs and executives may do something useful (often not) but none of them truly _earn_ their multi-million salaries (for those that are on the modern 100-1000x salaries). It's just suits, handshakes and social connections from certain schools/families. That's all.
Constantly I see them dodging responsibility or resigning (as an "apology") during a crisis they caused and then moving on to the next place they got buddies at for another multi-mil salary.
Many here would defend 'em tho. HN/SV tech people seem to aspire to such things from what I've seen. The rest of us just really think computers are super cool.
When a fool inevitably takes the throne, disaster ensues.
I can't say for sure that a different system of government would have saved Kodak. But when one man's choices result in disaster for a massive organization, I don't blame the man. I blame the structure that laid the power to make such a mistake on his shoulders.
That seems weird. Why hold up one person as great while not also holding up another as a failure? If my leader led me into battle and we were victorious, we'd credit them. If they led us to ruin, why should I blame the organizational structure that gave them power instead of blaming them directly?
I'm not saying you'd be wrong to blame the bad leader - just that blaming them doesn't achieve much.
The CEO takes the blame, the board picks a new one (Unless the CEO has special shares that make them impossible to dismiss), and we go on hoping that the king isn't an idiot this time.
My reading of history is that some people are fools - we can blame them for their incompetence or we can set out to build foolproof systems. (Obviously, nothing will be truly foolproof. But we can build systems that are robust against a minority of the population being fools/defectors.)
Half a kilowatt-minute of energy costs about $0.001, so you technically could make a profit at that rate. The real problem is the GPU cost - a $20k GPU amortized over five years costs $0.046 per second. :)
Because I'm an idiot and left off a factor of 365. Thank you! A $20k GPU for 30 seconds is about 1/3 of a cent. Still more than the power, but also potentially profitable under this scenario, ignoring all the other overhead and utilization.
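For anyone following along, the corrected numbers check out; here's the arithmetic spelled out (the $0.12/kWh electricity price is an assumed round figure, not from the thread):

```python
# Amortized hardware cost of a $20k GPU over five years, per second.
gpu_price = 20_000.0
seconds_in_5y = 5 * 365 * 24 * 3600            # ~157.7 million seconds
cost_per_second = gpu_price / seconds_in_5y    # ~ $0.000127 per second

# 30 seconds of GPU time, hardware amortization only:
hw_cost_30s = 30 * cost_per_second             # ~ $0.0038, i.e. about 1/3 cent

# Power: 1000 W for 30 seconds, at an assumed $0.12 per kWh:
energy_kwh = 1.0 * 30 / 3600                   # 1/120 kWh (half a kW-minute)
power_cost_30s = energy_kwh * 0.12             # ~ $0.001

print(cost_per_second, hw_cost_30s, power_cost_30s)
```

So per 30-second response, hardware dominates power by roughly 4x, and both sit below the half-cent ad revenue figure from upthread; utilization and everything else in the datacenter is what eats the rest.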
As a business Google's interest is in showing ads that make it the most money - if they quickly show just the relevant information then Google loses advertising opportunities.
To an extent, it is the web equivalent of IRL supermarkets intentionally moving stuff around and putting displays at the checkout.
> As a business Google's interest is in showing ads that make it the most money - if they quickly show just the relevant information then Google loses advertising opportunities.
This is just a question of UX: the purpose of their search engine was already to show the most relevant information (i.e. links), but they just put some semi-relevant information (i.e. sponsored links) first and make a fortune. They can just do the same with AI results.
This would be like claiming in 2010 that because Page Rank is out there, search is a solved problem and there’s no secret sauce, and the following decade proved that false.
In a time when statistical models couldn't understand natural language, the click stream from users was their secret sauce.
Today a consumer-grade >8B decoder-only model does a better job of predicting whether some (long) string of text matches a user query than any bespoke algorithm would.
The only reason why encoder-only models are better than decoder-only models is that you can cache the results against the corpus ahead of time.
Unlike the natural language queries that RAG has to deal with, Google searches are (usually) atomic ideas and encoder-only models have a much easier time with them.
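The caching point can be made concrete with a toy retrieval sketch. A bag-of-words vector stands in for a real encoder's embedding (names and the "corpus" here are purely illustrative); the point is that the corpus side is encoded once up front, so each query only pays for encoding itself plus cheap similarity scores:

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for an encoder model: a bag-of-words count vector.

    What matters is that it depends on one string at a time, so corpus
    vectors can be computed once and cached ahead of any query.
    """
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

corpus = [
    "tpu systolic array matrix multiply",
    "gpu graphics shader raytracing",
    "jax xla compiler for accelerators",
]

# Encode the corpus once, ahead of time -- this is the cacheable part
# that decoder-only scoring (query and document together) can't reuse.
index = [(doc, embed(doc)) for doc in corpus]

def search(query, index):
    q = embed(query)  # only the query is encoded at request time
    return max(index, key=lambda item: cosine(q, item[1]))[0]

print(search("xla compiler", index))
```

With a decoder-only model you'd instead score every (query, document) pair jointly at request time, which is why the cacheable encoder-side representation is the practical advantage, not raw quality.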
Do we have insights on whether they knew that their business model was at risk? My understanding is that OpenAI’s credibility lies in seeing the potential of scaling up a transformer-based model and that Google was caught off guard.
I think what may save Google from an Innovator's Dilemma extinction is that none of the AI would-be Google killers (OpenAI etc.) have figured out how to achieve any degree of lock-in. We're in a phase right now where everybody gets excited by the latest model and the switching cost is next to zero. This is very different from the dynamics of, say, Intel missing the boat on mobile CPUs.
I've been wondering for some time what sustainable advantage will end up looking like in AI. The only obvious thing is that whoever invents an AI that can remember who you are and every conversation it's had with you -- that will be a sticky product.
Whoever gets AI to be able to search the whole corpus of human knowledge. I'm not just talking about web pages; I'm talking every book, every scientific paper, every newspaper, every piece of text stored somewhere.
I've built RAG systems that index tokens in the 1e12 range, and the main thing stopping us from having a super search that will make Google look like the library card catalogue is the copyright system.
A country that ignores that and builds the first XXX billion parameter encoder only model will do for knowledge work what the high pressure steam engine did for muscle work.
But because users are used to doing that for free, they can't charge money for it. And if they don't charge money for it, and no one's seeing ads, then where does the money come from?
There are probably more ad-supported or ad-enhanced properties, but what's been shifting over the past few years is the focus on subscription-supported products:
1. YouTube TV
2. YouTube Premium
3. GoogleOne (initially for storage, but now also for advanced AI access)
4. Nest Aware
5. Android Play Store
6. Google Fi
7. Workspace (and affiliated products)
In terms of search, we're already seeing a renaissance of new options, most of which are AI-powered or enhanced, like basic LLM interfaces (ChatGPT, Gemini, etc), or fundamentally improved products like Perplexity & Kagi. But Google has a broad and deep moat relative to any direct competitors. Its existential risk factors are mostly regulation/legal challenge and specific product competition, but not everything on all fronts all at once.
People would correctly identify that their standard of living is being reduced for ideological reasons without tangible individual benefits and would likely not respond well to that, resulting in a loss of political power for whatever movement instituted those policies and a reversal of said policies.
People went with the green bin initiative in the US, perhaps elsewhere. We switched to more fuel-efficient cars in general when fuel became more expensive. New home construction and retrofits make houses fully electric: no gas hob, no gas furnace. These were all "QOL" adjustments that people have been making.
You have to pitch things the correct way, and it would really help if it wasn't treated as an "Ideological" thing but an ecological and humanitarian thing.
It is not okay to shove our pollution, poor wages and working conditions, and so on, onto another country or its population. Arguing that it's okay if Chinese and Vietnamese and Indian folks are treated poorly, have poor health outcomes, and so on, just so long as we get Shein and Temu and Amazon and Walmart...
The "there's plenty for everyone, consume buy purchase, it's ok!" line is just a lie. You can't do that without harming someone else.
If for some unimaginable reason the western world had embraced this philosophy wholeheartedly in 1985, literally billions of people would be struggling in grinding poverty (or worse!) instead of living significantly better lives than their parents or grandparents.
I don't know, you tell me. And billions are in poverty now, so I'm not sure this is a productive conversation. 44.9% of the global population lives on under $6.85 per day. 740 million try to survive on less than $2.15 a day.
45% of the earth's population is 3,700,000,000 or so, which, if my counting of commas is correct, is "literally billions of people"
> Three-quarters of all people in extreme poverty live in Sub-Saharan Africa or in fragile and conflict-affected countries.
The people in extreme poverty are not the people you are making the argument for. Changing your consumption isn't going to do a thing for sub-Saharan Africa and Latin America; as a matter of fact, increasing consumption of goods from those countries will improve their quality of life.
If you are ever in South Africa, contact me, really. I'll take you to those people in poverty; tell them yourself how you think Americans spending less money is going to change their lives in any way.
You will probably find some people who are just trying their best to make it through the day. You will also find that a lot of their lack of wealth comes from a lack of education.
You want to make a lasting difference? Stop donating to feel-good charities or animals or nonsense. Find an organisation that is focussed on improving education in these areas and donate to that.
Because nothing kills corrupt governments quite like an educated voting base.
"see, rampant consumerism is good for the people in sub-saharan africa! it's okay to buy disposable land-fill!"
Another way to look at this is: we export our pollution to another country until their citizens get tired of the pollution and thus charge more for production. So we pull up stakes and move on to the next country that's too poor to complain.
But I guess bootstrapping economies by dumping toxins everywhere on the planet is ok.
The goal should be to bootstrap sustainable, clean technologies, not the same 1800s era industrialization that coats the land in a fine sheen of toxic waste.
More importantly, a lot of it was hidden from view. Your washing machine and laundry detergent might be not as good as 30 years ago, but... do you know that for sure? And when did that happen? Maybe you're just imagining it.
Your old home had a gas furnace, and then you bought new construction and it's all-electric. Did the government do this? Are you even gonna think about it when making an offer? Your bill is gonna be higher, but how much of it is just because it's a different home?
And even then... you get away with it for a while, and then it all of sudden becomes a political talking point.
Case in point, at some point people realized what was going on with pickup trucks in the US.
> do you know that for sure?
Yes. I absolutely loathe modern washing machines. The irony is that they don't actually save any water because I end up running them multiple times. One of these days I will hopefully get around to gutting mine to replace the control box with an RPi. It's a lot of busy work for something that ought to function well to begin with.
> Your bill is gonna be higher,
Maybe not with a modern mini-split setup. Those are genuinely better than anything that came before them. As a bonus, they remove the failure mode of "blow up your house" that all gas appliances inherently carry.
You can also power them with solar, removing your day to day reliance on the grid.
> The irony is that they don't actually save any water because I end up running them multiple times.
I have an HE2 (which everyone universally hates) and it cleans clothes fine in one load where you can't even see the top of the water most of the time. You have to load high-efficiency top-loading machines differently than your grandparents loaded their Speed Queens. You must not put anything over the impeller; everything must go around the edges before the water starts. The impellers in HE washers don't agitate around the torus of the tub, they agitate like the electric field on a torus around the tub, so they go bottom, bottom middle, upper middle, upper, upper outer, bottom outer, bottom, in a loop. If you put stuff in the middle to start, it'll just swish that back and forth and never clean anything. Sometimes stuff won't even get wet, or you'll have dried soap stuck to your clothes.
Secondly, don't use fabric softener. If you must use something because of your water quality, use vinegar.
If you need to wash sheets, curtains, blankets, comforters and the like, you can't put them around the outside, because the HE washers do that cycle differently, too. You have to gather the fabric like a https://en.wikipedia.org/wiki/Bindle sack: start with the sheet (or whatever) flat, fold all four corners into the middle so they all meet. Then pick it up by those corners and place it in the washer with the middle of the sheet down and the corners on top. Like a bindle sack. Use the "sheets / heavy duty" cycle for sheets and the like; that's what it's for.
Yeah, that's the normal outcome for papers like this. Papers which claim to be groundbreaking improvements on Transformers universally aren't. Same story roughly once a month for the past 5 years.
>Intel on 18A is literally TSMC's 3nm process + backside power delivery, which means more power efficiency, performance also less heat.
That's a pretty serious abuse of the word "literally" given that they have nothing in common except vague density figures which don't mean that much at this point.
Here's a line literally from the article
>Based on this analysis it is our belief that Intel 18A has the highest performance for a 2nm class process with TSMC in second place and Samsung in third place.
Given what we currently know about 18A, Intel's process appears to be less dense but with a higher emphasis on performance, which is in line with recent Intel history. Just looking at the density of a process won't tell you everything about it. If density were everything then Intel's 14nm++++ chips wouldn't have managed to remain competitive in raw performance for so many years against significantly denser processes. Chip makers have a bunch of parameters they have to balance when designing new nodes. This has only gotten more important as node shrinks have become more difficult. TSMC has always leaned more towards power efficiency, largely because their rise to dominance was driven by mobile focused chips. Intel's processes have always prioritized performance more as more of their products are plugged into the wall. Ideally, you want both but R&D resources are not unlimited.
The death of Dennard scaling means that power efficiency is king, because a more power efficient chip is also a chip that can keep more of its area powered up over time for any given amount of cooling - which is ultimately what matters for performance. This effect becomes even more relevant as node sizes decrease and density increases.
If it were that simple, fabs wouldn't offer standard cell libraries in both high-performance and high-density varieties. TSMC continues to provide both for their 2nm process. A tradeoff between power efficiency and raw performance continues to exist.
A standard cell can be a critical performance bottleneck as part of a chip, so it makes sense to offer "high performance" cell designs that can help unblock these where appropriate. But chip cooling operates on the chip as a whole, and there you gain nothing by picking a "higher raw performance" design.
If that were totally true you would expect to see more or less uniform ratios of HP/HD cells mixes across different product types, but that's very much not the case. Dennard scaling may be dying but it's not dead yet. You can still sacrifice efficiency to gain performance. It's not zero sum.
What product types do you have in mind exactly? Even big server chips now use a huge fraction of their area for power-sipping "efficiency core" designs that wouldn't be out of place in a mobile chip. Power is king.
Those efficiency cores would put a full core from just a few years ago to shame, and data center chips have always contained a majority niche focused on throughput and perf/watt over latency. That focus has nearly always sat closer to the 45° part of the S-curve than to the top.
No, additional context does not cause exponential slowdowns, and you absolutely can use FlashAttention tricks during training; I'm doing it right now. Transformers are not RNNs: they are not unrolled across timesteps, and the backpropagation path for a 1,000,000-context LLM is not any longer than for a 100-context LLM of the same size. The only thing that is larger is the self-attention calculation, which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self-attention kernels. These calculations can be further parallelized using tricks like ring attention to distribute very large attention calculations over many nodes. This is how Google trained their 10M-context version of Gemini.
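The linear-memory claim can be sketched with a toy chunked attention in NumPy. This is a minimal illustration of the online-softmax idea behind FlashAttention-style kernels, not the real fused kernel: by processing key/value blocks one at a time with a running max and running denominator, memory stays O(n·d) instead of the O(n²) of materializing the full score matrix.

```python
import numpy as np

def chunked_attention(q, k, v, block=64):
    """Compute softmax(QK^T / sqrt(d)) V one KV block at a time."""
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)  # running row max (for numerical stability)
    l = np.zeros(n)          # running softmax denominator
    for s in range(0, n, block):
        kb, vb = k[s:s + block], v[s:s + block]
        scores = q @ kb.T / np.sqrt(d)            # only (n, block) at a time
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale previous partials
        p = np.exp(scores - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

The compute is still quadratic (every query still scores every key), but no n×n matrix ever exists in memory, which is exactly the tradeoff described above.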
So why are the context windows so "small", then? It would seem that if the cost was not so great, then having a larger context window would give an advantage over the competition.
The cost for both training and inference is vaguely quadratic, while, for the vast majority of users, the marginal utility of additional context is sharply diminishing. For 99% of ChatGPT users, something like 8192 tokens, or about 20 pages of context, would be plenty. Companies have to balance the cost of training and serving models. Google did train an uber-long-context version of Gemini, but since Gemini itself fundamentally was not better than GPT-4 or Claude, this didn't really matter much; so few people actually benefited from such a niche advantage that it didn't shift the playing field in their favor.
Marginal utility only drops because effective context is really bad, i.e. most models still vastly prefer the first things they see and those "needle in a haystack" tests are misleading in that they convince people that LLMs do a good job of handling their whole context when they just don't.
If we have the effective context window equal to the claimed context window, well, I'd start worrying a bit about most of the risks that AI doomers talk about...
There has been a huge increase in context windows recently.
I think the larger problem is "effective context" and training data.
Being technically able to use a large context window doesn't mean a model can actually remember or attend to that larger context well. In my experience, the kinds of synthetic "needle in haystack" tasks that AI companies use to show how large of a context their model can handle don't translate very well to more complicated use cases.
You can create data with large context for training by synthetically adding in random stuff, but there's not a ton of organic training data where something meaningfully depends on something 100,000 tokens back.
Also, even if it's not scaling exponentially, it's still scaling: at what point is RAG going to be more effective than just having a large context?
Great point about the meaningful datasets, this makes perfect sense. Esp. in regards to SFT and RLHF. Although I suppose it would be somewhat easier to do pretraining on really long context (books, I assume?)
Because you have to do inference distributed between multiple nodes at this point: for prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you would need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's a cost of $1,440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs, and generation speed stays roughly the same, so it's only 700 dollars per million tokens. There are architectural modifications that will bring that down significantly, but it's still really, really expensive, and also quite hard to get to work.
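The ~5 TB figure checks out on the back of an envelope. The sketch below assumes a Llama-3.1-405B-style configuration (126 layers, grouped-query attention with 8 KV heads, head dimension 128); those config numbers are my assumption for illustration, not something stated in the comment:

```python
# Back-of-envelope KV cache size for a 10M-token context at bf16.
n_layers = 126      # assumed Llama-3.1-405B-style config
n_kv_heads = 8      # grouped-query attention: far fewer KV heads than Q heads
head_dim = 128
bytes_per_elem = 2  # bf16

# One K vector and one V vector per layer per token (hence the factor of 2).
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
context = 10_000_000
total_tb = bytes_per_token * context / 1e12
print(f"~{bytes_per_token / 1e6:.2f} MB per token, ~{total_tb:.1f} TB total")
# → ~0.52 MB per token, ~5.2 TB total
```

At roughly half a megabyte of cache per token, the "more than 5 terabytes" claim follows directly; without GQA (i.e., 128 KV heads instead of 8) it would be 16x worse.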
Another factor in context window is effective recall. If the model can't actually use a fact 1m tokens earlier, accurately and precisely, then there's no benefit and it's harmful to the user experience to allow the use of a poorly functioning feature. Part of what Google have done with Gemini's 1-2m token context window is demonstrate that the model will actually recall and use that data. Disclosure, I do work at Google but not on this, I don't have any inside info on the model.
Memory. I don't know the equation, but it's very easy to see when you load a 128k-context model at 8K vs 80K. The quant I am running would double its VRAM requirements when loaded at 80K.
> The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations.
FFWD input is the self-attention output. And since the output of the self-attention layer is [context, d_model], the FFWD layer input will grow as well. Consequently, the FFWD layer compute cost will grow too, no?
The cost of the FFWD layer according to my calculations is ~(4 + 2 * [has w3]) * d_model * d_ff * n_layers * context_size, so the FFWD cost grows linearly wrt the context size.
So, unless I misunderstood the transformer architecture, larger the context the larger the compute of both self-attention and FFWD is?
So you're saying that if I have a sentence of 10 words, and I want the LLM to predict the 11th word, FFWD compute is going to be independent of the context size?
I don't understand how, since that very context is what determines whether the next-token prediction is any good or not?
More specifically, FFWD layer is essentially self attention output [context, d_model] matrix matmul'd with W1, W2 and W3 weights?
I may be missing something, but I thought that each context token would result in 3 additional values per token for self-attention to build its map, since each attention head must calculate a value considering all existing context