
Nvidia's enterprise GPUs are surprisingly unreliable. Working on a 128-GPU A100 cluster on AWS, one GPU would fail every few days. I didn't have any insight into whether it was a hardware or software failure.


> Working on a 128-GPU A100 cluster on AWS, one GPU would fail every few days

Define "fail".

> I didn't have any insight into whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.
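
Something along these lines (a rough Python sketch, not a drop-in script; the nvidia-smi query field and the dmesg patterns are the common ones, but check what your driver version actually emits):

    import subprocess

    def ecc_errors():
        # Volatile (since-boot) uncorrected ECC error counts, per GPU.
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True).stdout
        bad = []
        for line in out.strip().splitlines():
            idx, count = [f.strip() for f in line.split(",")]
            # "[N/A]" shows up when ECC is disabled, so only count digits.
            if count.isdigit() and int(count) > 0:
                bad.append((int(idx), int(count)))
        return bad

    def bus_drops():
        # "NVRM: Xid" and "fallen off the bus" are the usual dmesg
        # signatures for a GPU disappearing from PCIe.
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        return [l for l in out.splitlines()
                if "NVRM: Xid" in l or "fallen off the bus" in l]

    if __name__ == "__main__":
        for idx, count in ecc_errors():
            print(f"GPU {idx}: {count} uncorrected ECC errors -> replace")
        for line in bus_drops():
            print(f"possible bus drop: {line}")

Run it from cron or your node health checker and alert on any output.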

For ECC errors, replace the card. For devices dropping off the bus, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
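
The reset is just two sysfs writes. Rough sketch, needs root; the PCI address here is made up, use whatever nvidia-smi reports for the dead card:

    import sys, time
    from pathlib import Path

    def pci_reset(pci_addr):
        dev = Path("/sys/bus/pci/devices") / pci_addr
        # Detach the device from the bus...
        (dev / "remove").write_text("1")
        time.sleep(1)
        # ...then have the kernel re-enumerate the bus. The card is
        # usually back within a few seconds if the fault was transient.
        Path("/sys/bus/pci/rescan").write_text("1")

    if __name__ == "__main__":
        pci_reset(sys.argv[1])  # e.g. "0000:3b:00.0"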


What does an AWS user do with this advice?


They either figure out how to write scripts or ask AWS support how to get that done.



