At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)
But keep in mind the model won't fit on a single H100 (80GB): at 175B params it's ~90GB even with sparse FP8 weights, plus more for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you'd rent an 8xH100 cloud instance.
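A rough back-of-envelope sketch of that memory math, assuming 1 byte/param for FP8 and that 2:4 structured sparsity stores roughly half the weights (both assumptions, ignoring sparsity metadata overhead):

```python
import math

PARAMS = 175e9          # GPT-3-scale model
BYTES_PER_PARAM = 1     # FP8
SPARSITY_FACTOR = 0.5   # 2:4 structured sparsity keeps ~half the weights
H100_MEM_GB = 80

weights_gb = PARAMS * BYTES_PER_PARAM * SPARSITY_FACTOR / 1e9
min_gpus = math.ceil(weights_gb / H100_MEM_GB)

print(f"weights: ~{weights_gb:.0f} GB")        # ~88 GB, i.e. ~90GB as above
print(f"min H100s (weights alone): {min_gpus}")  # 2, before activation memory
```

Activations and KV cache push the real requirement higher, which is why 8xH100 is the practical choice.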
But yeah the latency will be insanely fast given how massive these models are!
1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.
2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.
MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).
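A toy sketch of why MoEs trade memory for compute: with top-1 routing, each token only runs through one expert, so per-token FLOPs stay near a dense layer while resident parameters grow by the expert count. (All sizes here are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 16, 4, 8

# Parameter memory: n_experts separate weight matrices (4x a dense layer here).
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

x = rng.standard_normal((n_tokens, d))

# Top-1 routing: each token picks one expert, so per-token compute
# matches a single dense layer even though 4x the weights sit in memory.
choice = (x @ router).argmax(axis=1)
y = np.stack([x[i] @ experts[choice[i]] for i in range(n_tokens)])
print(y.shape)  # (8, 16)
```

At GPT3 scale, those extra experts are exactly what forces the model across a full 8xH100 box or more.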
Oh I totally expect the size of models to grow along with whatever hardware can provide.
I really do wonder how much more you could squeeze out of a full pod of gen2-H100s. Obviously the model size would be ludicrous, but how far into the realm of diminishing returns are we?
Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.
You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?
I agree! The models will definitely keep getting bigger, and MoEs are a part of that trend, sorry if that wasn’t clear.
A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
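The pod arithmetic above, sketched out under the (speculative) assumption that a gen2 H100 doubles memory to 160 GB per GPU:

```python
# Hypothetical gen2-H100 pod: 256 GPUs, assuming 160 GB each (2x today's 80 GB).
gpus = 256
mem_per_gpu_gb = 160
total_tb = gpus * mem_per_gpu_gb / 1024
print(f"pod memory: {total_tb:.0f} TB")   # 40 TB

# A 10T-param model at 2 bytes/param (FP16) needs ~20 TB for weights,
# leaving roughly half the pod for activations, KV cache, and optimizer-free serving.
weights_tb = 10e12 * 2 / 1e12
print(f"10T-param weights: ~{weights_tb:.0f} TB")
```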
Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
I think it depends what downstream task you're trying to do... DeepMind tried distilling big language models into smaller ones (think 7B -> 1B) but it didn't work too well... it definitely lost a lot of quality (for general language modeling) relative to the original model.
The Austrian monarchy was heavily involved with colonialism via Spain. However, even nations that didn't directly go out and colonize benefited because the wealth that was taken from abroad was often spent back in Europe buying food, crafts, and services. The sheer amount of it also led to innovations in state institutions and financial instruments/strategies.
That's the whole problem with all this 'colonialism' talk. Colonialism is an invention of European scholars and today we somehow want to make it uniquely European.
Taking over the places around you and politically dominating them is as old as history.
You can't use colonialism as an explanation for anything because in the whole world there were wars and people taking over their neighbors.
It's a matter of linguistics and historiography when we use the term 'colony'.
I believe it's important to acknowledge a distinction between, as you say, politically dominating a region with the intention of expanding the home country, and depleting a region of its wealth, exporting it away while imposing an administration with no willingness to co-operate with the local population other than for the purposes of wealth extraction.
Pretty much all invasions start with taking over resources, and the longer a region is part of your empire, the more integrated it becomes. And in an age of kings and emperors, most regions never got serious political representation anyway.
And it's also a limited reading of the history of colonialism to claim it was only wealth extraction above and beyond what other empires would do.
> while imposing an administration with no willingness to co-operate with local population
If anything, Europeans, because of their limited population, had more incentive to do so.
The British takeover of much of India was literally a vastly majority-Indian affair. The war was fought by Indian troops, supplied by Indian merchants and farmers, and financed by loans from Indian bankers. The number of British people involved was almost vanishingly small given the size of India.
Did Indians have more control over and involvement in India than Greeks did over Greece in the Ottoman Empire?
I don't think you can make any generalization that European invasions of other countries were uniquely destructive or extractive compared to world history. There are certainly cases like King Leopold's regime in the Congo, but we can't reduce 600 years of European history to that.
That the colonial power aims to stay in, settle, or control the lands conquered, and to extract something from them or trade with or through them.
The Swedes in the Thirty Years' War didn't intend to settle or claim the lands they were fighting in; they just fought against a religious enemy and for power/influence/French money. (Pomerania is somewhat of an exception.)
You could just contact a Polish organization (if they are in Poland); if you want, I can link a few organizations.
Right now we have about 2 million Ukrainian refugees in Poland and about 2-3 million immigrants overall, so thanks to the enormous Polish effort they should be able to live a normal life for a short period and communicate in Russian/Ukrainian in most places. Otherwise, they should go to Poland, and from there you could start supporting them.
The 3090 is on Samsung 8nm, a 2+ year old node, so it will always come out worse. And Apple's GPU claims are always overstated, so it's good people are pushing back, considering Apple's propaganda of FASTEST CONSUMER GPU.
Despite this, the M1 Ultra GPU is a big step for everyone interested in the Apple ecosystem and DL/ML workloads.
> 3090 is on Samsung 8nm 2+years old node so it will always be worse
While the M1 Ultra is the latest CPU, it still uses A14/M1 cores from 2020. It's likely that Apple made a lot of progress on the microarchitecture since then, plus rumors are that M2 will switch to a 4nm node.
> considering apple propaganda of FASTEST CONSUMER GPU
Well as a mostly Apple user, I don't read Apple propaganda. Or nvidia propaganda for that matter.
The M1 Max Mac Studio's full system consumption under load seems to be ~50 W. I wonder how much performance nvidia can deliver with that. I think... they simply don't have a product in that power budget?
I know most people only care about FASTEST CONSUMER GPU (yay, it doubles as a leaf blower and a space heater), but I still think nvidia's best product was the 1050 Ti... decent 1080p performance in 75 W.
Sounds about right, it's not magic. AFAIK the Metal support in Blender is not very optimized yet (it's brand new), but even optimized I don't expect it to come close to an RTX 3090. I still find the results amazing: for the amount of power this thing draws, if it's only 2x slower after optimizations, that's massive. Don't forget that the Ultra's figure also includes the CPU and RAM in the power consumption.
It's wall power without CPU encoding; with the CPU it's ~140 W.
But I can agree the results are good. We also need to remember that Samsung 8nm [the RTX 3090's node] is 4 years old, two generations behind TSMC 5nm. The 4090 will probably be on a 5nm node in a few months and presumably have 3x the teraflops (probably 3x at TF32 precision, though it could be FP32, using 1.5-2x(?) more power).
(Also, I'm not sure about the optimization; it was written by Apple employees, and Apple likes to drop open source support and focus on proprietary software.)
The i3-12100F is almost 100 USD cheaper; in most builds you would put the extra cash toward a better GPU, or just get the i5-12400F at the same or lower price with higher clocks. But let's wait for benchmarks.
For Intel, those are the cheapest of the cheap boards with the very limited H610 chipset. If you want something acceptable, you will pay a lot more. AMD, on the other hand, has good cheap options with B550 or even B450.
I am interested in your belief that most builds need a GPU more than a CPU, or that most builds even have discrete GPUs. iGPUs have ~70% of the market. Most builds are going to spend the cash on CPU performance.
If you're buying the CPU separately, it means you'll be building the PC yourself, not buying a prebuilt. Upgrading the GPU instead of the CPU will give you better graphics performance, which is what most people building one want. The only reason iGPUs have such a large share of the market is people buying regular prebuilts to use as general computers, not gaming ones.
That's not obviously true to me. Is graphics-intensive gaming - and I would point out that defining "gaming" as GPU-intensive would be too narrow - really that large of a market? I've personally built dozens of PCs and the last time I bought a GPU it was a 3DFX Voodoo.
It won't be anywhere close to an RTX 3090. Looking at the M1 Max and applying the same scaling, at best it can come close to 3070 performance.
We all need to take Apple's claims with a grain of salt, as they are always cherry-picked, so I won't be surprised if it doesn't even reach 3070 performance in real usage.
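The scaling estimate above, sketched with Apple's stated M1 Max peak and commonly cited peak FP32 figures for the nvidia cards (all assumed, and peak TFLOPS is only a crude proxy for real workloads):

```python
# Assumed peak FP32 throughput figures (TFLOPS):
m1_max_tflops = 10.4            # Apple's stated M1 Max GPU peak
m1_ultra_tflops = 2 * m1_max_tflops  # Ultra = two Max dies fused
rtx_3070_tflops = 20.3
rtx_3090_tflops = 35.6

print(f"M1 Ultra: ~{m1_ultra_tflops:.1f} TFLOPS")  # ~20.8, 3070-class
print(f"3090 gap: {rtx_3090_tflops / m1_ultra_tflops:.1f}x")  # ~1.7x
```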
I can only reflect on the past few years (like 1.5 years ago I was disappointed by Julia's lack of progress, but a lot has changed).
Julia is getting usable even in "normal" applications, not only academic stuff. As someone who came back to Julia after 1.5-2 years, I feel like I can use it again in my job because it is a lot more stable and has a lot of neat new features, plus CUDA.jl is amazing.
I hope the Julia team will keep exploring static type inference and full AOT compilation; if the language gets full AOT support, it'll be a perfect deal for me :).
Yeah, it's definitely still in the early stages, but this time I think there's much more infrastructure, and more people around with the right knowledge to advise on StaticCompiler's development, so I'm currently feeling pretty good about its future.
If 1000 TFLOPS is possible at inference time, then I'm speechless.