
Ah, that makes a lot of sense. Is this fine-grained task scheduling related to CUDA Dynamic Parallelism at all? If not, would you have a pointer on where to look?

I suppose I could look through the code of this project, but I'd hate to have to untangle that from the compiler infrastructure.



Think of it more as 'tasking by hand': you have one kernel driving the 18k+ cores, and you manually (or using device libraries) synchronize them at a fine grain, handle memory traffic asynchronously, and pipeline as much as you can.
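
A minimal sketch of that pattern, assuming a cooperative launch (the `Task` struct and the per-task work are illustrative placeholders, not from this project): one persistent kernel stays resident on the whole GPU and loops over tasks, using a device-side grid barrier instead of per-task kernel launches.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical work descriptor, for illustration only.
struct Task { const float* in; float* out; int n; };

__global__ void persistentKernel(const Task* tasks, int numTasks) {
    cg::grid_group grid = cg::this_grid();
    // Every thread in the grid stays resident and loops over tasks,
    // instead of the host launching one kernel per task.
    for (int t = 0; t < numTasks; ++t) {
        const Task& task = tasks[t];
        for (int i = grid.thread_rank(); i < task.n; i += grid.size())
            task.out[i] = task.in[i] * 2.0f;  // placeholder per-task work
        grid.sync();  // device-side barrier between tasks
    }
}
```

Note that `grid.sync()` is only valid when the kernel is launched via `cudaLaunchCooperativeKernel` (and the grid fits co-resident on the device), which is what makes this hand-rolled tasking possible without returning to the host.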

You might have a look at cooperative groups, things like cuda::pipeline in libcudacxx for asynchronous and pipelined memory traffic, plus most of the block/warp CUB primitives, and then move up to cuFFTDx, cuBLASDx, and now cuSolverDx as the starting toolbox for your fused-kernel journey.
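
As a rough sketch of the cuda::pipeline idea (tile size, buffer layout, and the compute step are my assumptions, not anything specific to this project): a per-thread pipeline double-buffers global-to-shared copies so the next tile's load overlaps the current tile's compute.

```cuda
#include <cuda/pipeline>

// Double-buffered loads: overlap global->shared copies with compute.
// Launch with 256 threads per block; tile size of 256 is illustrative.
__global__ void pipelined(const float* gmem, float* out, int numTiles) {
    __shared__ float buf[2][256];
    auto pipe = cuda::make_pipeline();  // per-thread pipeline

    // Prime the pipeline with the first tile.
    pipe.producer_acquire();
    cuda::memcpy_async(&buf[0][threadIdx.x],
                       &gmem[threadIdx.x], sizeof(float), pipe);
    pipe.producer_commit();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;
        if (t + 1 < numTiles) {  // issue the next copy before waiting
            pipe.producer_acquire();
            cuda::memcpy_async(&buf[nxt][threadIdx.x],
                               &gmem[(t + 1) * 256 + threadIdx.x],
                               sizeof(float), pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();  // wait only for the oldest in-flight copy
        out[t * 256 + threadIdx.x] = buf[cur][threadIdx.x] * 2.0f;
        pipe.consumer_release();
    }
}
```

Each thread here only touches its own shared-memory slot, so no block barrier is needed; a block-scope pipeline plus `__syncthreads()` would be the next step if threads had to share staged tiles.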



