
Wait 8x total? For everyone at once?


Per instance (a worker serving an API request), it requires 8 GPUs. I believe they have thousands of these instances and scale them up with load.

Because the model isn't dynamic (it doesn't learn) it is stateless and can be scaled elastically.
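To make that concrete, here's a minimal sketch of why statelessness makes elastic scaling easy. Names like Replica and LoadBalancer are illustrative only, not how they actually run it:

  import itertools

  # Each replica stands in for one 8-GPU inference worker. Because a
  # frozen model holds no per-user state, any replica can serve any request.
  class Replica:
      def __init__(self, replica_id: int):
          self.replica_id = replica_id

      def infer(self, prompt: str) -> str:
          # Placeholder for the actual forward pass across 8 GPUs.
          return f"replica {self.replica_id} completed: {prompt!r}"

  class LoadBalancer:
      def __init__(self, num_replicas: int):
          self.replicas = [Replica(i) for i in range(num_replicas)]
          self._rr = itertools.cycle(self.replicas)

      def handle(self, prompt: str) -> str:
          # Round-robin works precisely because no request is pinned
          # to a particular replica.
          return next(self._rr).infer(prompt)

      def scale_to(self, n: int):
          # Elastic scaling: add or drop replicas with load; nothing
          # has to be migrated because replicas hold no state.
          self.replicas = [Replica(i) for i in range(n)]
          self._rr = itertools.cycle(self.replicas)

  lb = LoadBalancer(num_replicas=3)
  print(lb.handle("hello"))
  lb.scale_to(6)  # traffic spike: just add more 8-GPU workers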


Ah okay, that makes a lot more sense thank you!


I expect some level of caching and even request bucketing by similarity is possible.

How many users come with the same prompt?
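If they do cache, the cheapest version would be an exact-match cache keyed on a normalized prompt, which also gives you crude bucketing for free. A sketch under that assumption (normalize and cached_infer are hypothetical names):

  import hashlib

  cache: dict[str, str] = {}

  def normalize(prompt: str) -> str:
      # Crude bucketing: collapse whitespace and case so near-identical
      # prompts ("What is HN?" vs "what is hn?") share one cache entry.
      return " ".join(prompt.lower().split())

  def cached_infer(prompt: str, infer) -> str:
      key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
      if key not in cache:
          cache[key] = infer(prompt)  # only hit the 8-GPU workers on a miss
      return cache[key]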


In my experience running the same prompt always gets different results. Maybe they cache between different people, but I'm not sure that'd be worth the cache space at that point? Although 8x A100s is a lot to not have caching...
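That nondeterminism is what you'd expect from sampling with temperature > 0: the model outputs a distribution over next tokens and one is drawn at random, so identical prompts diverge. A toy illustration with a made-up distribution, not the real model:

  import random

  # Toy next-token distribution for a single prompt; the real model
  # produces one of these per step, but the sampling idea is the same.
  tokens = ["Paris", "France's capital", "the city of Paris"]
  probs = [0.6, 0.3, 0.1]

  def sample_completion(temperature: float = 1.0) -> str:
      # Temperature > 0 draws randomly, so repeated calls differ;
      # as temperature -> 0 this approaches argmax and becomes repeatable.
      weights = [p ** (1.0 / temperature) for p in probs]
      return random.choices(tokens, weights=weights)[0]

  print([sample_completion() for _ in range(3)])  # likely not all identical

It also means an exact-match cache would only make sense for deterministic (temperature 0) requests, or if serving everyone the same cached sample is acceptable.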


Each copy of the model needs 8 GPUs running at the same time to serve a single request.



