
Of the links you put on the last page (or other GPU languages you might not have included), what's the most mature way to write decent GPU code right now (even if not necessarily optimal)? I'm currently working on a project where I'm trying to speed up a blob detection algorithm by porting it to the GPU. I'm using PyTorch because the primitives fit naturally, but if that weren't the case I think I'd be at a loss for how to do it (I don't yet know how to write CUDA kernels).

Btw, the Julia link is dead.



It depends on what you're trying to do. If you want to get stuff done now, just get an Nvidia card and write CUDA. The tooling and community knowledge are vastly ahead of the other approaches.

If you're doing image-like work, Halide is good; there's lots of stuff shipping on it.

One other project I'd add is emu, which is built on wgpu. I think that's closest to the future I have in mind, but it's still in its early stages.

https://github.com/calebwin/emu


Okay, in that case can you recommend a good tutorial for writing CUDA code?


The official docs are pretty good. This is probably a good starting point: https://devblogs.nvidia.com/even-easier-introduction-cuda/

Nvidia also sponsored a Udacity course, but I don't have any direct experience with it: https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...


If you're interested in getting started writing GPU code and you're already working with PyTorch, a good place to start might be NUMBA's CUDA support. It lets you do in Python what you'd normally do in C/C++. It's great because setup is minimal and it has some good tutorial materials. Plus, you're mostly emulating what you'd end up doing in C/C++, so conceptually the code is very similar (see the sketch after the links below).

When I was at GTC 2018 this was one of the training sessions I attended, and you can now access it here [1]. The official NUMBA CUDA docs are great too [2]. Also, you can avoid a lot of installation/dependency headaches by using Anaconda/Miniconda if you're doing this stuff locally.

[1]: https://github.com/ContinuumIO/gtc2018-numba

[2]: https://numba.pydata.org/numba-doc/latest/cuda/overview.html
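
To give a flavor, here's a minimal sketch of the kind of kernel those materials walk you through (an element-wise add; the names and sizes are just for illustration, not taken from the tutorial):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        # one thread per element, just like the equivalent CUDA C kernel
        i = cuda.grid(1)
        if i < out.size:
            out[i] = x[i] + y[i]

    n = 1_000_000
    x = np.arange(n, dtype=np.float32)
    y = 2 * x
    out = np.zeros_like(x)

    threads_per_block = 128
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    add_kernel[blocks_per_grid, threads_per_block](x, y, out)  # Numba copies the host arrays to/from the GPU for you

The launch syntax and thread-indexing mirror CUDA C closely, which is why it works well as a learning tool.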


I have a love/hate relationship with numba. On one hand, the minimal setup is excellent. On the other hand, it has so many limitations and rough edges that we have occasionally been forced to abandon the numba code and reimplement in C -- by which point we'd spent more time fighting numba than the C reimplementation took. This has happened twice out of 5 major recent attempts to use it.

---------------

Numba CPU is pretty good. I'd recommend it, modulo some expectation control:

* In order to actually compile a function, it must be written in a very minimal Python subset that feels like C (see the sketch after this list).

* NumPy is only partially supported in this subset. Expect to manually reimplement ~1/3 of your simple NumPy code and all of your fancy NumPy code.
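
As a rough illustration of that C-like subset, here's a minimal sketch (the function and data are made up for the example): explicit loops and scalar indexing compile fine, while fancier NumPy idioms often don't.

    import numpy as np
    from numba import njit

    @njit
    def moving_sum(a, window):
        # plain loops and scalar indexing: the kind of "C in Python" that nopython mode likes
        out = np.empty(a.size - window + 1, dtype=a.dtype)
        for i in range(out.size):
            s = 0.0
            for j in range(window):
                s += a[i + j]
            out[i] = s
        return out

    x = np.random.rand(1000)
    print(moving_sum(x, 5)[:3])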

------------------

Numba GPU is not good. I'd avoid it by default, unless you're looking for a project to contribute to:

* When we first installed it, it was trivially broken by the removal of a long-deprecated CUDA API that had happened a year prior. It was an easy fix (we just chopped the reference out of numba), but it was a bad sign that foretold problems to come.

* It has lingering subtle bugs. One time we passed in a float array; it made us declare the float type three times (silly but not bad), and then it ignored the types and silently binary-cast to double (bad). We found a "ghost town" thread where others had hit the same issue.

* It has severe structural limitations. To first order, GPU performance is all about avoiding stalls, which are usually triggered by CPU-GPU copies, and Numba's limited buffer abstractions sometimes make avoiding those copies difficult (you have to restructure your code) or impossible (see the device-array sketch after this list).

* It doesn't play well with Nvidia's perf tools. If you're doing GPU work, you're after perf, and GPUs have lots of tricky caveats that can accidentally tank your perf, so this is a much bigger deal than Jupyter (until recently) missing a debugger.

* If you're wondering if numba supports CUDA API XYZ, the answer is no.
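
For context on the copy-avoidance point, a minimal sketch of the explicit device-array pattern (kernel and names made up for illustration); keeping data resident on the GPU between launches boils down to managing these buffers yourself:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(buf, factor):
        i = cuda.grid(1)
        if i < buf.size:
            buf[i] *= factor

    x = np.arange(1_000_000, dtype=np.float32)
    d_x = cuda.to_device(x)                   # one explicit host -> device copy

    threads = 256
    blocks = (x.size + threads - 1) // threads
    for factor in (2.0, 0.5, 3.0):
        scale[blocks, threads](d_x, factor)   # kernels reuse the device buffer, no copies

    result = d_x.copy_to_host()               # one explicit device -> host copy at the end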


Yes, I'd agree with all of your points, really. You can see in my reply below that I was mostly advocating for its use in learning the concepts behind GPU programming, so you can skip delving into C/C++ if you're uncomfortable with that.

If I had to pick something today for arbitrary GPU acceleration in Python, I'd almost certainly opt for JAX instead, but I haven't seen the kind of tutorials for JAX that exist for NUMBA CUDA.


I also found edge cases in parallel code.

One loop that didn't have any invalid reads or writes segfaulted inside a numba prange (OpenMP-style) loop.

Just reimplementing the loop as a few smaller subfunctions called sequentially worked.

No clue where the optimizer got confused on the 20-line loop body before it got split -- it's the same code.
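
For anyone unfamiliar, this is the kind of construct being described -- a minimal sketch (the body here is invented, not the 20-line loop from the anecdote):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def row_norms(a):
        # prange parallelizes the outer loop across threads, OpenMP-style
        out = np.empty(a.shape[0])
        for i in prange(a.shape[0]):
            s = 0.0
            for j in range(a.shape[1]):
                s += a[i, j] * a[i, j]
            out[i] = np.sqrt(s)
        return out

    print(row_norms(np.random.rand(4, 3)))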


That's interesting. I didn't realize numba did that sort of thing (I thought it was just a JIT). I'll take a look. Thanks!


Absolutely, happy to help!

I personally still end up relying on TensorFlow for most of my GPU needs at work, but that training session was incredibly helpful for understanding what was going on under the hood and helped demystify CUDA kernels in general.


What do you think of something like https://github.com/google/jax, which I guess claims to be a jack of all trades, in the sense that it'll transform programs (through XLA?) to target various architectures?


JAX is absolutely what I would recommend if you were trying to implement something usable in the long term and wanted to go beyond just learning about CUDA kernels!

Looking at the JAX documentation, it's _much_ better than the last time I saw it. Their tutorials seem fairly solid, in fact. I do want to point out the difference, though: you're conceptually operating at a very different level of abstraction in JAX than in NUMBA.

For example, consider multiplying two matrices in JAX on your GPU. That's a simple example with just a few lines of code in the JAX tutorial [1].
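
Something along these lines (a minimal sketch in the spirit of the quickstart; the sizes are arbitrary):

    import jax.numpy as jnp
    from jax import random, jit

    k1, k2 = random.split(random.PRNGKey(0))
    a = random.normal(k1, (2000, 2000), dtype=jnp.float32)
    b = random.normal(k2, (2000, 2000), dtype=jnp.float32)

    # jit compiles through XLA; it runs on the GPU automatically if JAX finds one
    matmul = jit(lambda x, y: jnp.dot(x, y))

    c = matmul(a, b)
    c.block_until_ready()  # JAX dispatches asynchronously, so wait before timing/inspecting
    print(c.shape)

Notice there's no explicit kernel, grid, or thread indexing anywhere: you write array math and XLA decides how it maps to the hardware.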

On the other hand, in the NUMBA tutorial from GTC I mentioned earlier, you have notebook 4, "Writing CUDA Kernels" [2], which teaches you about the programming model used to write computations for GPUs.

I'm sorry I was unclear. My recommendation of NUMBA is not so much advocating for its use in a project as suggesting it and its tutorials as an easy way of learning and experimenting with CUDA kernels without jumping into the deep end with C/C++. If you actually want to write some usable CUDA code for a project, keeping in mind that JAX is still experimental, I would fully advocate for JAX over NUMBA.

[1] https://jax.readthedocs.io/en/latest/notebooks/quickstart.ht...

[2] https://github.com/ContinuumIO/gtc2018-numba/blob/master/4%2...


At this year's GTC, Nvidia is promoting CuPy as well.


That link should probably have been https://juliacomputing.com/industries/gpus.html, but that's rather old content. A better overview is https://juliagpu.org/, and you can find a demo of Julia's CUDA kernel programming capabilities here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/


I've changed the link, thanks!



