Nvidia's enterprise GPUs are surprisingly unreliable. Working on a 128 GPU A100 ...

csdvrx · on May 31, 2023

> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.
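A minimal sketch of such a check, assuming Python 3 on the instance. The function names (`parse_ecc`, `find_bus_drops`, `check_host`) are made up for illustration, and the "fallen off the bus" string is an assumption about how the NVIDIA driver words the kernel message, so adjust the pattern to whatever your driver version actually logs. Reading dmesg may also require root.

```python
import re
import subprocess

# Ask nvidia-smi for the per-GPU count of uncorrected volatile ECC errors.
ECC_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]

# Assumed wording of the kernel message when a GPU drops off the bus.
BUS_DROP = re.compile(r"fallen off the bus", re.IGNORECASE)


def parse_ecc(csv_text):
    """Return {gpu_index: count} for GPUs reporting a nonzero uncorrected count."""
    bad = {}
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        if len(parts) != 2:
            continue  # skip lines that are not "index, count"
        index, count = parts
        # Non-numeric values such as "[N/A]" mean ECC is off or unsupported.
        if index.isdigit() and count.isdigit() and int(count) > 0:
            bad[int(index)] = int(count)
    return bad


def find_bus_drops(dmesg_text):
    """Return the dmesg lines that look like a GPU dropping off the PCI bus."""
    return [line for line in dmesg_text.splitlines() if BUS_DROP.search(line)]


def check_host():
    """Run both checks on this machine and return (ecc_errors, bus_drop_lines)."""
    ecc = subprocess.run(ECC_QUERY, capture_output=True, text=True, check=True)
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return parse_ecc(ecc.stdout), find_bus_drops(dmesg.stdout)
```

Call `check_host()` from cron or a health-check loop and alert when either result is non-empty.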

For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
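A rough sketch of the rescan part, assuming Linux and root. It uses the standard sysfs `remove` and `rescan` files rather than a real slot power toggle, so it is a software approximation of the reset described above. The function name, the `sysfs_root` parameter and the settle delay are illustrative, and whether a given cloud instance lets you do this is not something I can confirm.

```python
import os
import time


def remove_and_rescan(pci_addr, sysfs_root="/sys/bus/pci", settle_seconds=1.0):
    """Detach one PCI device, rescan the bus, and report whether it came back.

    pci_addr is the full address, e.g. "0000:10:1c.0" (hypothetical value).
    """
    remove_path = os.path.join(sysfs_root, "devices", pci_addr, "remove")
    rescan_path = os.path.join(sysfs_root, "rescan")

    # Writing "1" to remove tells the kernel to detach the device.
    with open(remove_path, "w") as f:
        f.write("1")

    # Give the kernel a moment before asking it to re-enumerate.
    time.sleep(settle_seconds)

    # Writing "1" to rescan re-enumerates every PCI bus.
    with open(rescan_path, "w") as f:
        f.write("1")

    # The device directory reappears if the rescan found the card again.
    return os.path.isdir(os.path.join(sysfs_root, "devices", pci_addr))
```

Get the address from `lspci` or from the dmesg line that reported the drop. If the function returns False, the card did not come back and probably needs a reboot or a replacement.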

robotresearcher · on May 31, 2023

What does an AWS user do with this advice?

csdvrx · on May 31, 2023

They either figure out how to write scripts or ask AWS support how to get that done.