Old 22nd January 2024, 22:49     #44124
DrTiTus
HENCE WHY FOREVER ALONE

Numba lets you write [basic/primitive] code in Python and uses a JIT compiler to translate it automatically into native machine code, or into CUDA kernels for NVIDIA GPUs. I got a 1700x speedup on a test over a 51200 x 2048 array of int16 (a basic carry pass after doing a convolution [long-multiplication-ish]). I wasn't operating on each element independently, because the elements in a row have to run sequentially (the n+1 values depend on the prior n values), so it was basically just massively parallel (51200 threads, one per row) "Python" code. Considering Python is terrible with threading, jumping straight to the GPU automatically is fantastic: it just requires a one-line decorator (@cuda.jit) and another line to fetch the thread ID you're working on, which you use to pick the part of the array (in this case a row) to work on.
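
Roughly what that looks like (a minimal sketch along those lines, not the exact kernel: the base-100 carry and the array contents here are just illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def carry_rows(arr):
    row = cuda.grid(1)                     # absolute thread index = which row this thread owns
    if row < arr.shape[0]:
        carry = 0
        for col in range(arr.shape[1]):    # sequential pass: column n+1 depends on column n
            total = arr[row, col] + carry
            arr[row, col] = total % 100    # keep the low "digits", push the rest along
            carry = total // 100

data = np.random.randint(0, 100, size=(51200, 2048)).astype(np.int16)
d_data = cuda.to_device(data)              # copy to the GPU
threads_per_block = 256
blocks = (data.shape[0] + threads_per_block - 1) // threads_per_block
carry_rows[blocks, threads_per_block](d_data)
result = d_data.copy_to_host()             # copy back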

1.083 seconds with GTX1060 vs 28 minutes 22 seconds with Ryzen 5500 (yes, I let it run for half an hour).

It's slower to use CUDA for a single operation, because you're compiling a kernel, copying memory to the GPU, running, and copying back, but when you scale up it doesn't even flinch. Loving it.
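
You can see the compile cost directly if you time the first launch against a second one (continuing the sketch above; the numbers will vary by GPU):

import time

t0 = time.perf_counter()
carry_rows[blocks, threads_per_block](d_data)   # first call pays the JIT compile
cuda.synchronize()                              # launches are async, so wait before timing
print("first launch :", time.perf_counter() - t0)

t0 = time.perf_counter()
carry_rows[blocks, threads_per_block](d_data)   # already compiled, just runs
cuda.synchronize()
print("second launch:", time.perf_counter() - t0)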

Lots of "brogrammers" mock Python, and there are good reasons to criticize it if you stay strictly inside pure Python and try to do things outside its use cases, but Python is more than Python. It's Verilog (MyHDL), it's CUDA (Numba), it's C (NumPy), it's a better Perl, it's fast to write, and it's simple yet very powerful.

Use the right tool for the job, but 90% of the time it's Python with the right libraries (insert xkcd comic). I did sidestep to C++ for a bit, but if CUDA happens automagically, I'm going to stick with Python. Straight from prototype (CPU) to "production" (GPU) by adding a decorator. Magic.
__________________
Finger rolling rhythm, ride the horse one hand...