Nvidia's enterprise GPUs are surprisingly unreliable. When I worked on a 128-GPU A100 cluster on AWS, one GPU would fail every few days. I didn't have any insight into whether it was a hardware or software failure.
> When I worked on a 128-GPU A100 cluster on AWS, one GPU would fail every few days
Define "fail".
> I didn't have any insight into whether it was a hardware or software failure.
Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.
For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
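
A rough sketch of what such a check script might look like, in Python. Assumptions on my part: the function names are made up, the nvidia-smi query uses the `ecc.errors.uncorrected.volatile.total` field, the dmesg match is a plain substring search for "fallen off the bus" (the text the driver logs with Xid 79), and the reset uses the Linux sysfs `remove` and `rescan` files as a stand-in for whatever power toggle your platform offers. Treat it as a starting point, not a tested tool.

```python
import re
import subprocess

# Per-GPU uncorrected ECC count since the last driver load.
ECC_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader,nounits",
]

# Matches PCI addresses such as 0000:3b:00 or 00000000:3b:00.0.
PCI_ADDR = re.compile(r"[0-9a-fA-F]{4,8}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}(?:\.[0-7])?")


def parse_ecc(csv_text):
    """Return {gpu_index: count} for GPUs reporting uncorrected ECC errors."""
    bad = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        index, _bus_id, count = fields
        # Non-numeric values such as "[N/A]" (ECC unsupported) are skipped.
        if count.isdigit() and int(count) > 0:
            bad[int(index)] = int(count)
    return bad


def parse_bus_drops(dmesg_text):
    """Return PCI addresses found on 'fallen off the bus' kernel log lines."""
    dropped = set()
    for line in dmesg_text.splitlines():
        if "fallen off the bus" in line:
            match = PCI_ADDR.search(line)
            if match:
                dropped.add(match.group(0).lower())
    return dropped


def check_host():
    """Run both checks on the local machine (dmesg may need root)."""
    smi = subprocess.run(ECC_QUERY, capture_output=True, text=True, check=True)
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return parse_ecc(smi.stdout), parse_bus_drops(log.stdout)


def remove_and_rescan(bdf):
    """Drop a device from the PCI tree and rescan the bus (needs root).

    bdf must be the full address, e.g. "0000:3b:00.0". This does not
    cut power to the slot; that part is platform-specific.
    """
    with open(f"/sys/bus/pci/devices/{bdf}/remove", "w") as f:
        f.write("1")
    with open("/sys/bus/pci/rescan", "w") as f:
        f.write("1")
```

Run `check_host()` from cron or a node health check: a non-empty first result points at a card to replace, a non-empty second result points at a device worth resetting before you drain the node.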