Have folks seen good Linux tools for actually monitoring/profiling the GPU's inn...

zetazzed · on Oct 2, 2021

For NVIDIA GPUs, Nsight Systems is wildly detailed and has both GUI and CLI options: https://developer.nvidia.com/nsight-systems

For DL specifically, this article covers a couple of options that actually plug into the framework: https://developer.nvidia.com/blog/profiling-and-optimizing-d...

nvidia-smi is the core tool most folks use for quick "top"-like output, but there is also an htop equivalent: https://github.com/shunk031/nvhtop
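
If you want that "top"-like output in a script rather than a terminal, here is a minimal sketch that shells out to nvidia-smi's CSV query mode and parses the result. The field list and the helper names (`parse_gpu_csv`, `query_gpus`) are my own choices for illustration; `nvidia-smi --help-query-gpu` lists the fields your driver supports.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; an illustrative selection, not a complete list.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (values are strings)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` needs the NVIDIA driver installed; the parser itself is plain string handling, so it can be tested on saved output without a GPU.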

A lot of other tools are built on top of the low-level NVML library (https://developer.nvidia.com/nvidia-management-library-nvml). There are also Python NVML bindings if you need to write your own monitoring tools.
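
As a rough idea of what writing your own tool on the Python bindings looks like, here is a small sketch. It assumes the `pynvml` module is installed (I believe the `nvidia-ml-py` package provides it) and that a driver is present; `format_row` and `sample_gpus` are names I made up for the example.

```python
def format_row(index, name, gpu_util, mem_used, mem_total):
    """Format one GPU's stats; memory arguments are in bytes, as NVML reports them."""
    mib = 1024 * 1024
    return (
        f"GPU {index} {name}: {gpu_util}% util, "
        f"{mem_used // mib}/{mem_total // mib} MiB"
    )


def sample_gpus():
    """Query every visible GPU once through NVML and return formatted lines."""
    import pynvml  # imported lazily so format_row works without the bindings

    pynvml.nvmlInit()
    try:
        rows = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            rows.append(format_row(i, name, util.gpu, mem.used, mem.total))
        return rows
    finally:
        pynvml.nvmlShutdown()  # always release NVML, even on error
```

Looping over `sample_gpus()` with a sleep in between gets you a crude poller; anything fancier (per-process stats, temperatures) goes through other NVML calls not shown here.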

danielmorozoff · on Oct 2, 2021

I personally like gpustat -- it's an nvidia-smi wrapper but it has colors...

They also, I guess, now have a web server plugged into it, which seems pretty cool.

https://github.com/wookayin/gpustat https://github.com/wookayin/gpustat-web

not-elite · on Oct 2, 2021

In NVIDIA land, `nvidia-smi` is like `top` for your GPUs. If you're running compiled CUDA, `nvprof` is very useful. But I'm not sure how much work it would take to profile something like a PyTorch model.

https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-h...

arcanus · on Oct 2, 2021

In AMD land, `rocm-smi` is like `top` for your GPUs. Very similar in capabilities and functionality to NVIDIA's `nvidia-smi`.

corysama · on Oct 2, 2021

https://developer.nvidia.com/nsight-compute