
Have folks seen good Linux tools for actually monitoring/profiling the GPU's inner workings? Soon I'll need to scale running ML models. For CPUs, I have a whole bag of tricks for examining and monitoring performance. But for GPUs, I feel like a caveman.


For NVIDIA GPUs, Nsight Systems is wildly detailed and has both GUI and CLI options: https://developer.nvidia.com/nsight-systems

For DL specifically, this article covers a couple of options that actually plug into the framework: https://developer.nvidia.com/blog/profiling-and-optimizing-d...

nvidia-smi is the core tool most folks use for quick "top"-like output, but there is also an htop equivalent: https://github.com/shunk031/nvhtop

A lot of other tools are built on top of the low-level NVML library (https://developer.nvidia.com/nvidia-management-library-nvml). There are also Python NVML bindings if you need to write your own monitoring tools.
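To make the NVML route concrete, here is a minimal sketch of a home-grown monitor using the Python bindings. It assumes the third-party `pynvml` module is installed (it is not in the standard library) and that an NVIDIA driver is present; `GpuSample`, `read_samples`, and `format_sample` are names made up for this example, not part of NVML.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One snapshot of a single GPU (hypothetical container for this sketch)."""
    index: int
    name: str
    util_percent: int
    mem_used_mib: int
    mem_total_mib: int


def read_samples():
    """Take one snapshot per GPU through NVML. Needs pynvml and an NVIDIA driver."""
    import pynvml  # third-party; imported lazily so the rest works without it

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values are in bytes
            samples.append(
                GpuSample(i, name, util.gpu, mem.used // 2**20, mem.total // 2**20)
            )
        return samples
    finally:
        pynvml.nvmlShutdown()


def format_sample(s):
    """Render a sample as one line of text."""
    return (
        f"GPU {s.index} {s.name}: {s.util_percent}% util, "
        f"{s.mem_used_mib}/{s.mem_total_mib} MiB"
    )
```

On a machine with a GPU, something like `for s in read_samples(): print(format_sample(s))` in a loop with a sleep gets you a crude poller that you can extend with whatever else NVML exposes.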


I personally like gpustat -- it's an nvidia-smi wrapper but it has colors...

I guess they now also have a web server plugged into it, which seems pretty cool.

https://github.com/wookayin/gpustat https://github.com/wookayin/gpustat-web


In NVIDIA land, `nvidia-smi` is like `top` for your GPUs. If you're running compiled CUDA, `nvprof` is very useful. But I'm not sure how much work it would take to profile something like a PyTorch model.

https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-h...
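If you want the `top`-like numbers in a script rather than on screen, one option is to ask `nvidia-smi` for CSV and parse it. This is a rough sketch, assuming `nvidia-smi` is on the PATH and that the driver supports the `--query-gpu` fields listed; `parse_query_output` and `query_gpus` are hypothetical helper names.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory values are in MiB.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(value):
    """Convert a field to int, or None for placeholders such as "[N/A]"."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_query_output(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:  # skip blank lines
            continue
        index, name, util, used, total = (field.strip() for field in rec)
        rows.append(
            {
                "index": _to_int(index),
                "name": name,
                "util_percent": _to_int(util),
                "mem_used_mib": _to_int(used),
                "mem_total_mib": _to_int(total),
            }
        )
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows. Needs an NVIDIA driver."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_query_output(out)
```

Calling `query_gpus()` on a timer and logging the rows is usually enough for a first look at utilization and memory while a model runs; it won't tell you anything about individual kernels, which is where the profilers come in.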


In AMD land, `rocm-smi` is like `top` for your GPUs. Very similar in capabilities and functionality to `nvidia-smi`.




