244 points by upmind 6 days ago | 80 comments
(To be frank, the main reason is that a lot of companies I'd like to work for require CUDA experience -- this shouldn't change your answers, hopefully; just wanted to provide some context.)
indianmouse 6 days ago
- Look up the CUDA Programming Guide from NVidia
- CUDA programming books from NVidia, at the developer.nvidia.com/cuda-books-archive link
- Start creating small programs based on the existing implementations (strong C implementation knowledge is required, so brush up if needed)
- Install the required toolchains and compilers; I am assuming you have the necessary hardware to play around with
- GitHub repos with CUDA projects. Read the code; these days you can use an LLM to explain the code in whatever way you need
- Start creating smaller, yet parallel, programs of your own, and so on
And in about a month or two, you should have enough to start writing CUDA programs (a tiny first-kernel sketch follows at the end of this comment).
I'm not aware of the skill / experience level you have, but whatever it is, there are far more sources and resources available now than there were in 2007/08.
Create a 6-8 week study plan and you should be flying soon!
Hope it helps.
Feel free to comment and I'll share whatever I can to guide you.
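To make that concrete, the kind of "first program" this plan builds toward looks roughly like the sketch below -- the classic hello-world kernel, nothing more (compile with nvcc hello.cu -o hello):

    #include <cstdio>

    __global__ void hello_kernel() {
        // Each thread prints its own coordinates in the launch grid.
        printf("Hello from block %u, thread %u\n", blockIdx.x, threadIdx.x);
    }

    int main() {
        hello_kernel<<<2, 4>>>();      // 2 blocks of 4 threads each
        cudaDeviceSynchronize();       // wait for the kernel and flush its printf output
        return 0;
    }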
hiq 6 days ago
Can you expand on that? Is it enough to have an NVIDIA graphics card that's, like, 5 years old, or do you need something more specific?
adrian_b 5 days ago
Even 7-year-old cards, i.e. the NVIDIA Turing RTX 20xx from 2018, are still acceptable.
GPUs older than Turing should be avoided, because they lack many capabilities of the newer cards, e.g. "tensor cores", and their support in newer CUDA toolkits will be deprecated in the not-very-distant future. That process moves very slowly, though, so for now you can still create programs for Maxwell GPUs from 10 years ago.
Among the newer GPUs, the RTX 40xx SUPER series (i.e. the SUPER variants, not the original RTX 40xx series) has the best energy efficiency. The newest RTX 50xx GPUs have worse energy efficiency than the RTX 40xx SUPER, so they achieve somewhat higher performance only by consuming disproportionately more power. Instead of that, it is better to use multiple RTX 40xx SUPER cards.
rahimnathwani 6 days ago
- you will want to install the latest version of CUDA Toolkit (12.9.1)
- each version of the CUDA Toolkit requires the card's driver to be at or above a certain version (e.g. the current toolkit requires driver version 576 or above)
- older cards often have recent drivers, e.g. the current version of CUDA Toolkit will work with a GTX 1080, as it has a recent (576.x) driver
slt2021 6 days ago
Depending on the model and age of your GPU, it will have a certain compute capability, and that is the hard ceiling for what you can program using CUDA.
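If it helps, the driver/runtime versions and the card's compute capability mentioned above can all be checked with a few runtime API calls -- a minimal sketch, queries only:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int driverVer = 0, runtimeVer = 0;
        cudaDriverGetVersion(&driverVer);     // CUDA version the installed driver supports
        cudaRuntimeGetVersion(&runtimeVer);   // CUDA version of the runtime you compiled against

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // properties of device 0

        printf("driver supports CUDA %d.%d, runtime is %d.%d\n",
               driverVer / 1000, (driverVer % 100) / 10,
               runtimeVer / 1000, (runtimeVer % 100) / 10);
        printf("%s: compute capability %d.%d, %zu MiB of global memory\n",
               prop.name, prop.major, prop.minor, prop.totalGlobalMem >> 20);
        return 0;
    }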
dpe82 6 days ago
sanderjd 5 days ago
indianmouse 5 days ago
edge17 5 days ago
throwaway81523 6 days ago
CUDA itself is just a minor departure from C++, so the language itself is no big deal if you've used C++ before. But, if you're trying to get hired programming CUDA, what that really means is they want you implementing AI stuff (unless it's game dev). AI programming is a much wider and deeper subject than CUDA itself, so be ready to spend a bunch of time studying and hacking to come up to speed in that. But if you do, you will be in high demand. As mentioned, the fast.ai videos are a great introduction.
In the case of games, that means 3D graphics, which these days is another rabbit hole. I knew a bit about this back in the day, but it is fantastically more sophisticated now and I don't have any idea where to even start.
upmind 6 days ago
I have two beginner (and probably very dumb) questions. First, why do they have heavy C++/CUDA usage rather than using only PyTorch/TensorFlow? Are those too slow for training Leela? Second, why is there TensorFlow code?
henrikf 6 days ago
Leela Chess Zero (https://github.com/LeelaChessZero/lc0) has a much more optimized CUDA backend targeting modern GPU architectures, and it's written by people much more knowledgeable than me. That would be a much better source to learn from.
throwaway81523 6 days ago
upmind 6 days ago
robotnikman 5 days ago
So I'm guessing trying to find a job as a CUDA programmer is nowhere near as big of a headache as other software engineering jobs right now? I'm thinking learning CUDA and more about AI might be a good pivot from my current position as a Java middleware developer.
randomNumber7 5 days ago
FilosofumRex 5 days ago
The real money is in mastering PTX, nvcc, cuobjdump, Nsight Systems, and Nsight Compute. CUTLASS is a good open-source code base to explore - start here: https://christianjmills.com/series/notes/cuda-mode-notes.htm...
Most importantly, stay off HN and get on the GPU MODE Discord, where the real coders are: https://discord.com/invite/gpumode
ferguess_k 13 hours ago
MoonGhost 5 days ago
imjonse 6 days ago
- https://www.gpumode.com/ - resources and Discord community
- Book: Programming Massively Parallel Processors
- The NVIDIA CUDA docs are very comprehensive too
- https://github.com/srush/GPU-Puzzles
mdaniel 6 days ago
https://github.com/HazyResearch/ThunderKittens#:~:text=here%...
amelius 6 days ago
imjonse 5 days ago
amelius 5 days ago
lokimedes 6 days ago
1. Learning CUDA - the framework, libraries and high-layer wrappers. This is something that changes with times and trends.
2. Learning high-performance computing approaches. While a GPU and the Nvlink interfaces are Nvidia specific, working in a massively-parallel distributed computing environment is a general branch of knowledge that is translatable across HPC architectures.
3. Application specifics. If your thing is Transformers, you may just as well start from Torch, Tensorflow, etc. and rely on the current high-level abstractions, to inspire your learning down to the fundamentals.
I’m no longer active in any of the above, so I can’t be more specific, but if you want to master CUDA, I would say learning how massively parallel programming works is the foundation that may translate into transferable skills.
david-gpu 5 days ago
Understanding the fundamentals of parallel programming comes first, IMO.
chanana 5 days ago
Are there any good resources you’d recommend for that?
rramadass 5 days ago
Foundations of Multithreaded, Parallel, and Distributed Programming by Gregory Andrews - an old classic, but still has very good explanations of concurrent algorithmic concepts.
Parallel Programming: Concepts and Practice by Bertil Schmidt et al. - a relatively recent book with comprehensive coverage.
Breza 15 hours ago
rramadass 6 days ago
jonas21 6 days ago
rramadass 5 days ago
smohare 5 days ago
lokimedes 6 days ago
rramadass 6 days ago
These sorts of books are only "dated" when it comes to specific languages/frameworks/libraries. The methods/techniques are evergreen and often conceptually better explained in these older books.
For recent up to date works on HPC, the free multi-volume The Art of High Performance Computing by Victor Eijkhout can't be beat - https://news.ycombinator.com/item?id=38815334
elashri 6 days ago
Disclaimer: I don't claim that this is a systematic way to learn it; it comes more from academic work.
I got assigned to a project that required learning CUDA as part of my PhD. There was no one in my research group who had any experience with or knew CUDA. I started with the standard NVIDIA courses (Getting Started with Accelerated Computing with CUDA C/C++; there is a Python version too).
This gave me a good introduction to the concepts and basic ideas, but after that I did most of my learning by trial and error. I tried a couple of online tutorials for specific things, and some books, but there was always a deprecated function here or there, or an API change that made things obsolete. Or things simply differ for your GPU, and you have to be careful because you might be developing against a GPU generation that is not compatible with what you run in production, and you need things to work on both.
Learning CUDA, for me, has been an endeavor of pain, of going through compute-sanitizer and Nsight, because you will find that most of your time goes into debugging why things are running slower than you expect.
Take things slowly. Take a simple project that you know how to do without CUDA, then port it to CUDA, benchmark it against the CPU, and try to optimize different aspects of it.
One piece of advice that can be helpful: don't think about optimization at the beginning. Start with correct, then optimize. A working slow kernel beats a fast kernel that corrupts memory.
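As a rough illustration of "correct first" (names and sizes here are just placeholders): a plain vector add with error checks and a CPU verification pass, which is also a convenient first thing to run under compute-sanitizer:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CHECK(call) do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

    __global__ void add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];          // bounds check: the last block may overshoot n
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        CHECK(cudaMalloc((void**)&da, bytes));
        CHECK(cudaMalloc((void**)&db, bytes));
        CHECK(cudaMalloc((void**)&dc, bytes));
        CHECK(cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice));

        const int block = 256, grid = (n + block - 1) / block;
        add<<<grid, block>>>(da, db, dc, n);
        CHECK(cudaGetLastError());                                // catch bad launches
        CHECK(cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost)); // implicit sync

        for (int i = 0; i < n; i++)                               // verify against the obvious CPU answer
            if (hc[i] != 3.0f) { printf("mismatch at %d\n", i); return 1; }
        printf("ok\n");

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }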
korbip 6 days ago
Overall, it will involve some pain for sure. And mastering it, including PTX etc., will take a lot of time.
kevmo314 6 days ago
This is so true it hurts.
sputknick 6 days ago
Onavo 6 days ago
Make sure you are very clear on what you want. Most HR departments cast a wide net (it's like how every junior role requires "3-5 years of experience" when in reality they don't really care). Similarly, when hiring, most companies pray for the unicorn developer who can understand the entire stack from the GPU to the end-user product domain, when the day-to-day is mostly in Python.
ForgotIdAgain 6 days ago
rramadass 6 days ago
Scientific Parallel Computing by L. Ridgway Scott et. al. - https://press.princeton.edu/books/hardcover/9780691119359/sc...
canyp 6 days ago
mekpro 6 days ago
kloop 6 days ago
It's a weird case, but the pixels can be processed independently for most of it, so it works pretty well. Then the rows can be summarized in parallel and rolled up at the end. The copy onto the GPU is our current bottleneck, however.
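On the copy bottleneck: one common mitigation (sketched below with a stand-in kernel, assuming nothing about the actual pipeline) is pinned host memory plus asynchronous copies on a stream, so transfers can overlap with compute once the work is split into chunks:

    #include <cstring>
    #include <cuda_runtime.h>

    // Stand-in for the real per-pixel work.
    __global__ void process_pixels(float* img, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) img[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 22;
        float *h_img, *d_img;
        cudaMallocHost((void**)&h_img, n * sizeof(float)); // pinned (page-locked) memory: faster DMA, required for truly async copies
        cudaMalloc((void**)&d_img, n * sizeof(float));
        memset(h_img, 0, n * sizeof(float));               // stand-in for filling the image

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copy and kernel are queued on the same stream; once the image is split into
        // chunks, the copy of one chunk can overlap the processing of the previous one.
        cudaMemcpyAsync(d_img, h_img, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        process_pixels<<<(n + 255) / 256, 256, 0, stream>>>(d_img, n);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_img);
        cudaFreeHost(h_img);
        return 0;
    }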
SoftTalker 6 days ago
rakel_rakel 5 days ago
SonOfLilit 5 days ago
https://developer.download.nvidia.com/compute/cuda/2_2/sdk/w...
After this you should be able to tell whether you enjoy this kind of work.
If you do, try to do a reasonably optimized GEMM, and then try to follow the FlashAttention paper and implement a basic version of what they're doing.
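For what it's worth, the usual starting point for the GEMM exercise is the completely naive version -- one thread per output element -- before any tiling, and then measuring how far you are from cuBLAS. A rough sketch (managed memory and the 512x512 sizes are chosen purely to keep the host code short):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one element of row-major C = A * B.
    __global__ void gemm_naive(const float* A, const float* B, float* C, int M, int N, int K) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < M && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[row * K + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }

    int main() {
        const int M = 512, N = 512, K = 512;
        float *A, *B, *C;
        cudaMallocManaged((void**)&A, M * K * sizeof(float)); // unified memory keeps the host code short
        cudaMallocManaged((void**)&B, K * N * sizeof(float));
        cudaMallocManaged((void**)&C, M * N * sizeof(float));
        for (int i = 0; i < M * K; i++) A[i] = 1.0f;
        for (int i = 0; i < K * N; i++) B[i] = 1.0f;

        dim3 block(16, 16);
        dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
        gemm_naive<<<grid, block>>>(A, B, C, M, N, K);
        cudaDeviceSynchronize();

        printf("C[0] = %f (expect %d)\n", C[0], K);  // with all-ones inputs, every element equals K
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }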
alecco 5 days ago
Do not implement algorithms by hand. On recent architectures it is extremely hard to reach decent occupancy and the like. Thrust and CUB solve 80% of the cases with reasonable trade-offs, and they do most of the work for you.
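For a sense of what that buys you, this is roughly what the Thrust route looks like for sort/reduce/max -- no hand-written kernel at all (a sketch; build it as a .cu file with nvcc):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/extrema.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); i++) h[i] = rand() % 1000;

        thrust::device_vector<int> d = h;                        // one host-to-device copy

        thrust::sort(d.begin(), d.end());                        // parallel sort on the GPU
        long long sum = thrust::reduce(d.begin(), d.end(), 0LL); // parallel reduction
        int mx = *thrust::max_element(d.begin(), d.end());       // parallel max

        printf("sum=%lld max=%d\n", sum, mx);
        return 0;
    }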
bee_rider 5 days ago
But, I don’t understand the comparison to TBB. Do they have a version of TBB that runs on the GPU natively? If the TBB implementation is on the CPU… that’s just comparing two different pieces of hardware. Which would be confusing, bordering on dishonest.
alecco 5 days ago
math_dandy 6 days ago
corysama 6 days ago
gkbrk 6 days ago
bee_rider 5 days ago
throwaway81523 5 days ago
tkuraku 6 days ago
sremani 6 days ago
The YouTube channel CUDA MODE is based on PMPP. I could not find the channel, but here is the playlist: https://www.youtube.com/watch?v=LuhJEEJQgUM&list=PLVEjdmwEDk...
Once done, you will be on a solid foundation.
fifilura 6 days ago
https://gfxcourses.stanford.edu/cs149/fall24/lecture/datapar...
So my tip would be: start getting used to programming without for loops.
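A rough illustration of that shift, using a trivial scale-by-2 as a stand-in (kernel fragments only, not a full program): the explicit CPU loop becomes "one thread per element", and the only loop that survives is the grid-stride idiom for when the data outgrows the grid.

    // CPU version: an explicit loop over all elements.
    void scale_cpu(float* x, int n) {
        for (int i = 0; i < n; i++) x[i] *= 2.0f;
    }

    // GPU version: no loop over elements -- each thread handles exactly one index.
    __global__ void scale_gpu(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    // Grid-stride variant: a loop survives, but it strides by the whole grid,
    // so any n works with any launch size.
    __global__ void scale_gpu_gridstride(float* x, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
            x[i] *= 2.0f;
    }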
gdubs 6 days ago
https://developer.nvidia.com/gpugems/gpugems3/part-v-physics...
As an Apple platforms developer I actually worked through those books to figure out how to convert the CUDA stuff to Metal, which helped the material click even more.
Part of why I did it was – and this was some years back – I wanted to sharpen my thinking around parallel approaches to problem solving, given how central those algorithms and ways of thinking are to things like ML and not just game development, etc.
lacker 5 days ago
https://www.youtube.com/playlist?list=PLxNPSjHT5qvtYRVdNN1yD...
After watching this video I was able to implement a tiling version of a kernel that was the bottleneck of our production data analysis pipeline to improve performance by over 2x. There's much more to learn but I found this video series to be a great place to start.
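For anyone curious what "tiling" means concretely, the textbook illustration is the shared-memory tiled matrix multiply sketched below (assuming, for brevity, that the dimensions are multiples of TILE; real kernels need bounds checks):

    #define TILE 16

    // C = A * B, row-major; each 16x16 thread block computes one 16x16 tile of C.
    __global__ void gemm_tiled(const float* A, const float* B, float* C, int M, int N, int K) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < K / TILE; t++) {
            // Stage one tile of A and one tile of B into shared memory...
            As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            // ...then each staged element is reused TILE times from fast shared memory
            // instead of being fetched again from global memory.
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

    // Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, M / TILE);
    // gemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);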
weinzierl 6 days ago
majke 6 days ago
I found it easy to start. Then there was a pretty nice learning curve to get to warps, SM's and basic concepts. Then I was able to dig deeper into the integer opcodes, which was super cool. I was able to optimize the compute part pretty well, without much roadblocks.
However, getting memory loads perfect and then getting closer to hw (warp groups, divergence, the L2 cache split thing, scheduling), was pretty hard.
I'd say CUDA is pretty nice/fun to start with, and it's possible for a novice programmer to get quite far. However, getting deeper and achieving a real advantage over the CPU is hard.
Additionally, there is a problem with Nvidia segmenting the market - some opcodes are only present in _old_ GPUs (the CUDA arch is _not_ forwards compatible), and some opcodes are reserved for "AI" chips (like the H100). So getting code that is fast on both an H100 and an RTX 5090 is super hard. Add to that the fact that each card has a different SM count, memory capacity, and bandwidth... and you end up with an impossible compatibility matrix.
TLDR: The beginning is nice and fun. You can get quite far on the compute-optimization side, but memory access (a tiny illustration follows below) and compatibility across different chips are hard. When you start, choose a specific problem, a specific chip, and a specific instruction set.
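The memory-access point in miniature: two kernel fragments that move the same amount of data but behave very differently, because of how each warp's loads map onto memory transactions (a sketch; assumes the input of the second kernel holds at least n*stride elements).

    // Coalesced: the 32 threads of a warp read 32 consecutive floats,
    // which the hardware combines into a few wide transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighbouring threads are `stride` floats apart, so most of
    // every cache line that gets fetched is wasted.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i * stride];
    }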
dist-epoch 6 days ago
Start by writing some CUDA code to sort an array or find the maximum element.
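The max-element exercise usually starts with a block-level reduction like the sketch below (assuming a power-of-two block size of 256); the per-block results then get combined in a second pass, with atomics, or simply on the CPU.

    #include <cfloat>

    // Each block finds the max of its slice in shared memory; block_max then needs
    // one more (much smaller) reduction.
    __global__ void max_per_block(const float* in, float* block_max, int n) {
        __shared__ float s[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        s[tid] = (i < n) ? in[i] : -FLT_MAX;       // pad out-of-range threads with a tiny value
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
            if (tid < offset) s[tid] = fmaxf(s[tid], s[tid + offset]);
            __syncthreads();
        }
        if (tid == 0) block_max[blockIdx.x] = s[0];
    }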
the__alchemist 6 days ago
amelius 6 days ago
If that is not an option, I'll wait!
latchkey 6 days ago
The hardware between brands is fundamentally different. There isn't a standard like x86 for CPUs.
So, while you may use something like HIPIFY to translate your code between APIs, at least with GPU programming, it makes sense to learn how they differ from each other or just pick one of them and work with it knowing that the others will just be some variation of the same idea.
horsellama 6 days ago
pjmlp 6 days ago
the__alchemist 6 days ago
corysama 6 days ago
Runs on anything + auto-differentiation.
pjmlp 6 days ago
Had it not been for Apple's initial OpenCL contribution (regardless of how it went from there), AMD's Mantle as the starting point for Vulkan, and NVidia's Vulkan-Hpp and Slang, the ecosystem of Khronos standards would be much worse.
Also, Vulkan tooling isn't as bad as OpenGL's, because LunarG exists and someone pays them for the whole Vulkan SDK.
The attitude of "we publish paper standards" while the community is expected to step in for the implementations and tooling hardly matches the productivity of the tooling around private APIs.
Also, all GPU vendors, including Intel and AMD, would rather push their own compute APIs, even if they are based on top of Khronos ones.
david-gpu 5 days ago
Khronos is a consortium financed by its members, who either implement the standards on their own hardware or otherwise depend on the ecosystem around them. For example, competing GPU vendors typically implement the standards in parallel with the committee meetings. The very people who represent their company in Khronos are typically leads of the teams who implement the standards.
Source: used to represent my employers at Khronos. It was a difficult, thankless job, that required almost as much diplomacy as technical expertise.
pjmlp 5 days ago
Cloudef 6 days ago
moralestapia 5 days ago
Perhaps you haven't noticed, but you're in a thread that asked about CUDA, explicitly.
uecker 6 days ago
epirogov 6 days ago
matt3210 5 days ago
brudgers 5 days ago
That doesn't mean one-eyed-king knowledge is never enough to solve that chicken-and-egg. You only have to be good enough to get the job.
But if you haven't done it on the job, you don't have work experience and you are either lying to others or lying to yourself...and any sophisticated organization won't fall for it...
...except of course, knowingly. And the best way to get someone to knowingly ignore obvious dunning-kruger and/or horseshit is to know that someone personally or professionally.
Which is to say that the best way to get a good job is to have a good relationship with someone who can hire you for a good job (nepotism trumps technical ability, always). And the best way to find a good job is to know a lot of people who want to work with you.
To put it another way, looking for a job is the only way to find a job, and looking for a job is also much, much harder than everything that avoids looking for a job (like studying CUDA) while pretending to be preparation... because, again, studying CUDA won't ever give you professional experience.
Don't get me wrong, there's nothing wrong with learning CUDA all on your own. But it is not professional experience and it is not looking for a job doing CUDA.
Finally, if you want to learn CUDA just learn it for its own sake without worrying about a job. Learning things for their own sake is the nature of learning once you get out of school.
Good luck.
mugivarra69 6 days ago
izharkhan 6 days ago