
Wait 8x total? For everyone at once?


Per instance (a worker serving an API request), it requires 8 GPUs. I believe they have thousands of these instances and scale them up with load.

Because the model isn't dynamic (it doesn't learn) it is stateless and can be scaled elastically.
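To make that concrete, here's a minimal sketch of why statelessness makes elastic scaling easy. Names like Replica and LoadBalancer are illustrative only, not how they actually run it:

  import itertools

  # Each replica stands in for one 8-GPU inference worker. Because a
  # frozen model holds no per-user state, any replica can serve any request.
  class Replica:
      def __init__(self, replica_id: int):
          self.replica_id = replica_id

      def infer(self, prompt: str) -> str:
          # Placeholder for the actual forward pass across 8 GPUs.
          return f"replica {self.replica_id} completed: {prompt!r}"

  class LoadBalancer:
      def __init__(self, num_replicas: int):
          self.replicas = [Replica(i) for i in range(num_replicas)]
          self._rr = itertools.cycle(self.replicas)

      def handle(self, prompt: str) -> str:
          # Round-robin works precisely because no request is pinned
          # to a particular replica.
          return next(self._rr).infer(prompt)

      def scale_to(self, n: int):
          # Elastic scaling: add or drop replicas with load; nothing
          # has to be migrated because replicas hold no state.
          self.replicas = [Replica(i) for i in range(n)]
          self._rr = itertools.cycle(self.replicas)

  lb = LoadBalancer(num_replicas=3)
  print(lb.handle("hello"))
  lb.scale_to(6)  # traffic spike: just add more 8-GPU workers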


Ah okay, that makes a lot more sense thank you!


I expect some level of caching and even request bucketing by similarity is possible.

How many users come with the same prompt?
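If they do cache, the cheapest version would be an exact-match cache keyed on a normalized prompt, which also gives you crude bucketing for free. A sketch under that assumption (normalize and cached_infer are hypothetical names):

  import hashlib

  cache: dict[str, str] = {}

  def normalize(prompt: str) -> str:
      # Crude bucketing: collapse whitespace and case so near-identical
      # prompts ("What is HN?" vs "what is hn?") share one cache entry.
      return " ".join(prompt.lower().split())

  def cached_infer(prompt: str, infer) -> str:
      key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
      if key not in cache:
          cache[key] = infer(prompt)  # only hit the 8-GPU workers on a miss
      return cache[key]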


In my experience running the same prompt always gets different results. Maybe they cache between different people, but I'm not sure that'd be worth the cache space at that point? Although 8x A100s is a lot to not have caching...
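That nondeterminism is what you'd expect from sampling with temperature > 0: the model outputs a distribution over next tokens and one is drawn at random, so identical prompts diverge. A toy illustration with a made-up distribution, not the real model:

  import random

  # Toy next-token distribution for a single prompt; the real model
  # produces one of these per step, but the sampling idea is the same.
  tokens = ["Paris", "France's capital", "the city of Paris"]
  probs = [0.6, 0.3, 0.1]

  def sample_completion(temperature: float = 1.0) -> str:
      # Temperature > 0 draws randomly, so repeated calls differ;
      # as temperature -> 0 this approaches argmax and becomes repeatable.
      weights = [p ** (1.0 / temperature) for p in probs]
      return random.choices(tokens, weights=weights)[0]

  print([sample_completion() for _ in range(3)])  # likely not all identical

It also means an exact-match cache would only make sense for deterministic (temperature 0) requests, or if serving everyone the same cached sample is acceptable.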


Each copy of the model needs 8 GPUs running at the same time to serve a single request.



