62 points by belter 5 days ago | 47 comments
thrance 17 hours ago
Vetch 15 hours ago
They should have replaced "reasoning" with "iterative computations with accumulating state." This paper, on the impact of quantization, is actually a lot more significant than it appears at first, and I think the authors could have done a better job of discussing the broader implications.
The paper's core (and unsurprising) argument is that low-precision arithmetic significantly limits the representational capacity of individual neurons. This forces the model to encode numerical values across multiple neurons to avoid overflow, particularly when storing intermediate computational results. This distributed representation, in turn, increases complexity and makes the model vulnerable to accumulating errors, especially during iterative operations. It's unclear from my initial reading whether low-precision training (quantization-aware training) is necessary for the model to effectively learn and utilize these distributed representations, or whether this capacity is inherent. Regardless, while QAT likely offers benefits, especially with larger numbers, the fundamental limitations of low-precision computation persist.
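To make the error-accumulation point concrete, here's a toy sketch (mine, not the paper's): a running sum updated in full precision versus one whose state is round-tripped through a simulated 8-bit uniform quantizer after every step. Because the state is re-quantized each iteration, the per-step rounding error compounds linearly.

  import numpy as np

  def quantize(x, bits=8, max_abs=300.0):
      # Uniform symmetric quantizer: snap x to one of 2^bits levels
      # spanning [-max_abs, max_abs], as a crude stand-in for a
      # low-precision activation format.
      levels = 2 ** (bits - 1) - 1
      scale = max_abs / levels
      return np.clip(np.round(x / scale), -levels, levels) * scale

  exact = approx = 0.0
  for _ in range(100):
      exact += 1.7
      approx = quantize(approx + 1.7)  # state re-quantized every step

  print(exact, approx)  # 170.0 vs ~236: the drift grows with step count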
Why not just use a calculator? For some of the same reasons humans shouldn't be completely dependent on calculators. It's not just the ability to perform Fermi estimates that's constrained, but also internal computations that require physical "intuition" or modeling the trajectories of physical systems, the ability to work with growing algebraic representations, relative numeric comparisons of large magnitudes (where the model does not internally switch to a favorable format; notice this is easier to do in-context), and representing iterative computation over complex logical chains and hierarchical structures.
Why do we not see this in practice? I contend that we do. There is a small but quite vocal contingent in every LLM forum who insist that quantization, even to 8 bits, results in severe quality degradation despite what most benchmarks say. It's quite likely that common tasks and most tests do not require iterative computations where accumulating state must be accurately tracked, and these individuals are encountering some of the exceptions.
prideout 14 hours ago
alexvitkov 17 hours ago
The reason is that less context is needed to do it this way: for addition, at every step only 3 digits need to be considered, and they're already in the token stream.
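What this looks like as an explicit algorithm (my sketch of the schoolbook procedure): right-to-left addition where each step only ever consumes two operand digits plus the incoming carry, so the per-step context is constant regardless of number length.

  def add_digitwise(a: str, b: str) -> str:
      # Schoolbook addition over digit strings, least-significant
      # digit first. Each step depends on exactly 3 values: one
      # digit from each operand and the carry.
      a, b = a[::-1], b[::-1]
      carry, out = 0, []
      for i in range(max(len(a), len(b))):
          da = int(a[i]) if i < len(a) else 0
          db = int(b[i]) if i < len(b) else 0
          carry, digit = divmod(da + db + carry, 10)
          out.append(str(digit))
      if carry:
          out.append(str(carry))
      return "".join(reversed(out))

  print(add_digitwise("12345678987654321", "999"))  # 12345678987655320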
Vetch 15 hours ago
astrange 14 hours ago
thesz 16 hours ago
alexvitkov 15 hours ago
12345678987654321 is tokenized on various models [1] like so:
GPT-4: 123-456-789-876-543-21
GPT-3: 123-45-678-98-765-43-21
Llama-2, Mistral: 1-2-3-4-5-6-7-8-9-8-7-6-5-4-3-2-1
[1] https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
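These splits are easy to reproduce; a minimal sketch using the tiktoken library for GPT-4 and a Hugging Face tokenizer for Llama-2 (the Llama-2 repo is gated, so any mirror of its tokenizer works as well):

  import tiktoken                         # pip install tiktoken
  from transformers import AutoTokenizer  # pip install transformers

  s = "12345678987654321"

  enc = tiktoken.encoding_for_model("gpt-4")
  print([enc.decode([t]) for t in enc.encode(s)])
  # ['123', '456', '789', '876', '543', '21']

  tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
  print(tok.tokenize(s))  # roughly one token per digit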
magicalhippo 16 hours ago
So a larger but fairly aggressively quantized model could perform worse than a smaller variant of the same model with only light quantization, even though the larger one still used more memory in total.
I guess some of this is due to the models not being trained at the quantization levels I used. In any case, I say don't get blinded by parameter count alone; compare performances.
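The back-of-the-envelope arithmetic behind that tradeoff (my numbers, ignoring activations, KV cache, and per-block quantization metadata):

  def weight_gib(params_billion, bits_per_weight):
      # Approximate weight memory: parameter count times bits per
      # weight, converted to GiB.
      return params_billion * 1e9 * bits_per_weight / 8 / 2**30

  print(weight_gib(13, 4))  # 13B at 4-bit: ~6.1 GiB
  print(weight_gib(7, 6))   # 7B at 6-bit:  ~4.9 GiB

The heavily quantized 13B model here still occupies more memory than the lightly quantized 7B one, yet, per the above, may still score worse.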
amelius 14 hours ago
svachalek 13 hours ago
amelius 11 hours ago
My thinking goes like this: a matrix can represent a graph (each entry may correspond to an edge between two nodes), but a 3-dimensional tensor, for example, may correspond to a hypergraph where each entry is a 3-hyperedge, so you can talk not just about the relation between two tokens but also about the relation among three tokens (in language this could be, e.g., subject, object, and indirect object/dative).
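As a concrete sketch of that intuition (mine, not any particular architecture): a matrix gives a bilinear form that scores pairs of token embeddings, while a rank-3 tensor gives a trilinear form that scores triples jointly.

  import numpy as np

  d = 16
  x, y, z = (np.random.randn(d) for _ in range(3))  # token embeddings

  # Pairwise relation: a matrix W scores an (x, y) edge.
  W = np.random.randn(d, d)
  pair_score = x @ W @ y

  # Three-way relation: a rank-3 tensor T scores an (x, y, z)
  # hyperedge, e.g. (subject, object, dative) jointly.
  T = np.random.randn(d, d, d)
  triple_score = np.einsum('i,j,k,ijk->', x, y, z, T)

  print(pair_score, triple_score)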