115 points by timmyd 22 hours ago | 57 comments
daft_pink 3 hours ago
It’s just really impractical to use a licensed programming language in 2025.
totalperspectiv 2 hours ago
Possibly rose-tinted glasses on my part, but I’m optimistic for 2026. Chris Lattner has a pretty strong track record of getting these things right.
melodyogonna 37 minutes ago
Btw, Mojo's development is a masterclass in language development and community building. It's been fun watching Chris go back and fix technical debt in existing features rather than pressing ahead with new ones.
GeekyBear 28 minutes ago
Something similar happened with MLIR while Lattner was at Google:
> MLIR was born—a modular, extensible compiler infrastructure designed to bring order to the chaos. It brought forth a foundation that could scale across hardware platforms, software frameworks, and the rapidly evolving needs of machine learning. It aimed to unify these systems, and provide a technology platform that could harmonize compute from many different hardware makers.
> But unification is hard. What started as a technical project quickly turned into a battleground: open-source governance, corporate rivalries, and competing visions all collided. What could have been a straightforward engineering win became something much more complicated.
https://www.modular.com/blog/democratizing-ai-compute-part-8...
saagarjha 7 hours ago
I can't say for sure because I couldn't find the CUDA kernel, but I kind of doubt this is true. You can hit memory bandwidth on Hopper without using TMA at all; it's mostly designed for accelerating asynchronous copies and reducing memory pressure. If all you're doing is a transpose, you don't need any of this to go fast (though it might simplify your indexing code…?)
totalperspectiv 2 hours ago
Great write up! I learned a lot!
musebox35 5 hours ago
Jay Shah’s later articles contain examples that involve epilogue fusion. IMHO, understanding how to write an efficient transpose helps with following the more involved ones.
londons_explore 18 hours ago
Isn't it better to simply combine the transposition with whatever operation one wishes to do next with the matrix?
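For context, "combining" here usually just means the consumer kernel reads its input with transposed indexing instead of materializing the transpose first. A minimal CUDA sketch of that idea (the scaling op and names are made up for illustration):

    // Fuse the transpose into the next op: compute C = s * A^T
    // directly, without materializing A^T in memory.
    // A is rows x cols (row-major); C is cols x rows.
    __global__ void scaleTransposed(float *C, const float *A,
                                    int rows, int cols, float s) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
        if (i < cols && j < rows)
            C[i * rows + j] = s * A[j * cols + i];  // strided read of A
    }

The catch is that one side of the access is now strided (here the read of A), which is exactly the problem the shared-memory tiling discussed below works around.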
somethingsome 3 hours ago
You have global memory and shared memory; global is slower.
You read rows from global memory (faster than reading columns).
You write columns into shared memory (slower than writing rows, but shared memory is fast; this is the transpose step).
You read rows from shared memory (very fast).
You write rows to global memory (faster than writing columns).
The idea behind the tiling is to hide the slow, strided access in the faster memory.
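For reference, here's a minimal CUDA sketch of exactly this pattern (the classic shared-memory tiled transpose; the 32x32 tile, names, and launch shape are illustrative, not the article's Mojo code):

    #define TILE_DIM 32

    // out = transpose of in; in is height x width, row-major
    __global__ void transposeTiled(float *out, const float *in,
                                   int width, int height) {
        // +1 column of padding keeps the strided (column) accesses
        // to shared memory free of bank conflicts
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Read a row from global memory (coalesced) and write a
        // column of the shared tile (the transpose happens here)
        if (x < width && y < height)
            tile[threadIdx.x][threadIdx.y] = in[y * width + x];

        __syncthreads();

        // Swap block coordinates, read rows from shared memory,
        // and write rows to global memory (coalesced again)
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.y][threadIdx.x];
    }

Launch with dim3 block(TILE_DIM, TILE_DIM) and a grid covering the matrix in 32x32 tiles.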
77pt77 21 hours ago
> (2771.35/2775.49 - 1) * 100 = -0.14916285052369131300
Flagged.
timmyd 20 hours ago
"This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) which is still impressive
GeekyBear 1 hour ago
He has a bit of a track record already.
almostgotcaught 17 hours ago
Are you talking about your libc equivalent or MAX?
dgurchenkov 30 minutes ago
The Mojo standard library is already open source. Mojo at the moment does not need a runtime (but if it ever needs one, it'd get open sourced). My point was that Mojo as a whole, as a programming language and a reference implementation, will definitely get open sourced.
MAX itself is a bigger beast, and I'm out of my depth talking about it. I think it'll get open sourced as well; just the timeline might be different (shorter or longer, IDK).
jsnell 21 hours ago
Also, the improvement is 0.14%, not 14%, making the editorialized linkbait particularly egregious.
timmyd 18 hours ago
- transpose_naive - basic implementation with TMA transfers
- transpose_swizzle - adds a swizzling optimization for better memory access patterns (see the sketch after this list)
- transpose_swizzle_batched - adds thread coarsening (batch processing) on top of swizzling
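To give a flavor of the swizzle mentioned above: a common version of the trick XORs the shared-memory column index with the row index, so both the row-wise stores and the transposed loads hit distinct banks without any padding. A minimal CUDA sketch of that idea (illustrative only; the repo's Mojo kernels also use TMA, which is omitted here):

    #define TILE_DIM 32  // the XOR swizzle below assumes a 32x32 tile

    __global__ void transposeSwizzle(float *out, const float *in,
                                     int width, int height) {
        // No padding column; the XOR permutation below plays that role
        __shared__ float tile[TILE_DIM][TILE_DIM];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Coalesced row read from global; swizzled store: logical
        // column c of tile row r lives at physical column c ^ r
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x ^ threadIdx.y] = in[y * width + x];

        __syncthreads();

        // Transposed, swizzled load (bank-conflict-free thanks to
        // the XOR), then coalesced row write to global memory
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y ^ threadIdx.x];
    }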
Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:
- transpose_naive: 1056.08 GB/s (32.0025% of max)
- transpose_swizzle: 1437.55 GB/s (43.5622% of max)
- transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
via the GitHub repo simveit/efficient_transpose_mojo
Comparing to the CUDA implementations mentioned in the article:
- Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
- Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
- Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
So there is a highly efficient matrix transpose in Mojo.
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster, respectively), while the final optimized kernel achieves essentially identical performance (faster by just 4.14 GB/s).
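To make those percentages concrete: 1056.08 / 875.46 ≈ 1.206, so the naive kernel is ~20.6% faster; 1437.55 / 1251.76 ≈ 1.148, ~14.8% faster; and 2775.49 / 2771.35 ≈ 1.0015, only ~0.15% faster. The 14-20% figures describe the naive and swizzle kernels, while the ~0.14% figure applies only to the final, fastest pair.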
The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.
jsnell 17 hours ago
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.
bravetraveler 21 hours ago
> "From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel."
14% all the time vs 35% some of the time
edit: Closing numbers are far less impressive than those buried in the middle of the post. Confusing; bye everyone