145 points by fzliu 1 day ago | 39 comments
kouteiheika 24 hours ago
There's PyTorch's FlexAttention, which could maybe make this practical, but currently it's just way too buggy.
jszymborski 23 hours ago
tpurves 12 hours ago
jszymborski 9 hours ago
Nvidia isn't likely to start releasing updated firmware for an obscure architecture for which there is limited evidence of improvement, and even less adoption.
kouteiheika 2 hours ago
I've been burnt way too many times by fancy new methods that claimed improvement, where I spent a ton of effort to implement them, and they ended up being poop.
Every person working in the field and pushing papers should read this blog post and apply what's written in it: https://kellerjordan.github.io/posts/muon/#discussion-solvin...
ssivark 8 hours ago
albertzeyer 17 hours ago
Also note that, depending on your model dimensions and sequence lengths, the attention computation often plays only a minor role (maybe 10% of overall compute or so), and the MLP computation dominates.
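This can be checked with a back-of-the-envelope multiply-add count for one transformer layer (a sketch with illustrative dimensions I picked, not measurements from any particular model):

```python
def layer_flops(d_model, seq_len, d_ff=None):
    """Rough per-token multiply-add counts for one transformer layer."""
    d_ff = d_ff if d_ff is not None else 4 * d_model  # common MLP expansion
    attn_proj = 4 * d_model * d_model    # Wq, Wk, Wv, Wo projections
    attn_scores = 2 * seq_len * d_model  # q.k over the context + weighted sum of v
    mlp = 2 * d_model * d_ff             # up- and down-projection
    return {"attn_proj": attn_proj, "attn_scores": attn_scores, "mlp": mlp}

f = layer_flops(d_model=4096, seq_len=2048)
total = sum(f.values())
# the quadratic-in-seq_len part (attn_scores) is well under 10% at these sizes
print({k: round(v / total, 3) for k, v in f.items()})
```

At much longer contexts the `attn_scores` term eventually dominates, which is when attention-efficiency tricks start to pay off.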
kouteiheika 13 hours ago
Maybe it's better now, but I'd still consider using FlexAttention without a corresponding unit test checking its accuracy against an equivalent eager implementation completely irresponsible.
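The kind of test being described can be illustrated without FlexAttention itself: check any fused/tiled attention path against a naive eager reference and assert they match. Here is a NumPy sketch using a FlashAttention-style online softmax as the "fast" path (an illustration of the testing principle, not the PyTorch API):

```python
import numpy as np

def eager_attention(q, k, v):
    """Naive O(n^2) softmax attention — the trusted reference."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def tiled_attention(q, k, v, block=16):
    """FlashAttention-style online softmax over key/value blocks."""
    n, d = k.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)  # running row max
    l = np.zeros(q.shape[0])          # running softmax denominator
    for s in range(0, n, block):
        kb, vb = k[s:s + block], v[s:s + block]
        scores = (q @ kb.T) * scale
        m_new = np.maximum(m, scores.max(axis=1))
        p = np.exp(scores - m_new[:, None])
        corr = np.exp(m - m_new)            # rescale previous accumulators
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
assert np.allclose(eager_attention(q, k, v), tiled_attention(q, k, v))
```

The same pattern applies to FlexAttention: run the fused kernel and an equivalent eager implementation on random inputs and `allclose` them in the test suite.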
bionhoward 1 day ago
1. https://ai.meta.com/research/publications/byte-latent-transf...
janalsncm 1 day ago
bigdict 1 day ago
Reubend 1 day ago
fabmilo 1 day ago
eightysixfour 1 day ago
jwilber 1 day ago
“With the current implementation of Evo2, we do not have the heavily optimized kernels in place for convolution operators like we do for attention layers in a model like llama2. Even with this shortcoming, we see that the benefit from including more convolutional layers makes up for the earlier stage of optimization at around the 64k context length. Beyond that point we see an improvement in performance even compared to a highly optimized transformer model.”
https://docs.nvidia.com/bionemo-framework/latest/models/evo2...
bob1029 1 day ago
I think we've already got a bit of a bottleneck in terms of memory bandwidth utilization.
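A quick roofline-style estimate shows why: during single-token decoding, every weight is read once per generated token but used for only one multiply-add. The hardware numbers below are assumptions, roughly A100-class (~312 TFLOP/s bf16, ~2 TB/s HBM):

```python
def decode_arithmetic_intensity(n_params, bytes_per_param=2):
    """FLOPs per byte when generating one token at batch size 1:
    each weight is read once and used for one multiply-add (bf16 assumed)."""
    flops = 2 * n_params
    bytes_moved = bytes_per_param * n_params
    return flops / bytes_moved

# illustrative accelerator numbers (assumed, roughly A100-class)
peak_flops = 312e12   # bf16 tensor-core throughput, FLOP/s
peak_bw = 2.0e12      # HBM bandwidth, bytes/s
ridge_point = peak_flops / peak_bw  # intensity needed to be compute-bound

ai = decode_arithmetic_intensity(7e9)  # e.g. a 7B-parameter model
print(ai, ridge_point)  # ~1 FLOP/byte, far below the ~156 FLOP/byte ridge
```

Anything that far below the ridge point is limited by memory bandwidth, not compute, regardless of how clever the attention variant is.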
kadushka 1 day ago
EGreg 1 day ago
https://arxiv.org/abs/2104.09864
The difference RoPE makes vs. traditional positional encoding is that only the relative distances between tokens matter, and attention can be attenuated over large distances.
Instead of making the model look at every token in the entire sequence all at once (which gets expensive fast), you can break the text into logical chunks—like sentences or paragraphs—and run self-attention within each chunk. That keeps things efficient while still capturing local meaning. Then, for each chunk, you create a summary—either by pooling or using a small learned head—and pass those summaries into a second layer of attention that operates on a much smaller scale. This gives you higher-level context across the document, kind of like moving from sentences to sections to the whole thing. Optionally, you can even send that higher-level context back down to influence the lower layers. This approach shows up in models like Longformer and BigBird (which use attention windows), hierarchical models (like HANs), and newer architectures like RetNet and Mamba that compress information over time or scale. RoPE fits neatly into this by helping each chunk handle relative positions more naturally.
RoPE is kind of perfect for this setup because it handles relative positions directly in the attention mechanism, which means each chunk can still understand the order and spacing of tokens without relying on fixed position embeddings. It’s especially useful when you're working with long sequences or chunked inputs, because it doesn’t care where the chunk is in the overall document—it just cares about how tokens relate to each other within that chunk. RoPE also makes it easier for models to generalize to longer inputs than they were trained on, since the rotational math behind it naturally extends beyond the original context window. Plus, because it's baked into the dot product itself, it adds no extra parameters and only negligible computation, and plays well with hierarchical or multi-scale attention setups. Basically, it’s a clean, efficient way to inject positional awareness that doesn’t break when you start slicing things up.
PS: LLaMA's RoPE may be a bit off but it still works great: https://discuss.huggingface.co/t/is-llama-rotary-embedding-i...
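A minimal NumPy sketch of the relative-position property (using the halved split rather than interleaved pairs — the pairing detail the linked thread is about): shifting both the query and the key position by the same offset leaves the attention score unchanged.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary embeddings to x of shape (seq_len, dim), dim even.
    Pairs dimension i with i + dim//2 (the 'halved' LLaMA-style split)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    angles = np.outer(np.arange(seq), freqs)   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

rng = np.random.default_rng(0)
d = 8
q_vec, k_vec = rng.normal(size=d), rng.normal(size=d)
Q = rope(np.tile(q_vec, (20, 1)))  # same query vector at every position
K = rope(np.tile(k_vec, (20, 1)))  # same key vector at every position
# the score depends only on the offset j - i, not on absolute positions
assert np.isclose(Q[3] @ K[5], Q[10] @ K[12])
```

This is why a chunk can be moved anywhere in the document without retraining: the dot products inside it are unchanged.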
cma 1 day ago
If it is only nearby tokens, it is multiplicative by a constant, right? It's not cubic scaling with context length or anything.
DeepSeek got a training performance increase by predicting two tokens at a time, though this doesn't carry over into the final model's inference like this. They did say it can be used for speculative decoding to reduce inference costs, though.
They may get away with fewer attention heads with this new approach too.
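The constant-factor point about nearby-token attention can be checked by counting score entries (a toy count, assuming a causal sliding window of the kind Longformer-style models use):

```python
def attn_pairs(n, window=None):
    """Number of (query, key) score entries in causal attention."""
    if window is None:
        return n * (n + 1) // 2  # full causal attention: quadratic in n
    # sliding window: each query sees at most `window` keys, so ~n * window
    return sum(min(i + 1, window) for i in range(n))

full_1k, full_2k = attn_pairs(1000), attn_pairs(2000)
win_1k, win_2k = attn_pairs(1000, window=64), attn_pairs(2000, window=64)
# doubling the context ~4x the full cost but only ~2x the windowed cost
print(full_2k / full_1k, win_2k / win_1k)
```

So restricting attention to nearby tokens turns the quadratic term into a linear one with a constant factor of roughly the window size.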
jgalt212 1 day ago