
Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs

62 points by belter 5 days ago | 47 comments

thrance 17 hours ago

By "Mathematical Reasoning Capabilities" they meant "basic arithmetic" and by "Numerical Precision" they meant "quantization".

Vetch 15 hours ago

That's not quite right. By numerical precision they mean numerical precision, of which quantization is one method of arriving at reduced precision. They also perform experiments where they train from scratch in float32 and float16.

"Reasoning" they should have replaced with "iterative computations with accumulating state". This paper on the impact of quantization is actually a lot more significant than it appears at first, and I think the authors could have done a better job of discussing the broader implications.

The paper's core (and unsurprising) argument is that low-precision arithmetic significantly limits the representational capacity of individual neurons. This forces the model to encode numerical values across multiple neurons to avoid overflow, particularly when storing intermediate computational results. This distributed representation, in turn, increases complexity and makes the model vulnerable to accumulating errors, particularly during iterative operations. It's unclear from my initial reading whether low-precision training (quantization-aware training) is necessary for the model to effectively learn and utilize these distributed representations, or if this capacity is inherent. Regardless, while QAT likely offers benefits, especially with larger numbers, the fundamental limitations of low precision computation persist.
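To make the first point concrete, here's a minimal numpy sketch (my own illustration, not an experiment from the paper) of how quickly a low-precision format runs out of room for exact values: float16's 10-bit mantissa means integers above 2^11 can no longer all be represented exactly, so a single low-precision activation can't hold even a modestly large intermediate result on its own.

  import numpy as np

  # float16 has a 10-bit mantissa, so above 2**11 = 2048 the spacing
  # between representable values exceeds 1 and exact integers are lost.
  for n in [2047, 2048, 2049, 4097, 32769]:
      stored = int(np.float16(n))
      print(n, stored, "exact" if stored == n else "rounded")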

Why not just use a calculator? For some of the same reasons humans shouldn't be completely dependent on calculators. It's not just the ability to perform Fermi estimates that's constrained, but also internal computations that require physical "intuition" or modeling the trajectories of physical systems, the ability to work with growing algebraic representations, relative numeric comparisons of large magnitudes (where the model does not internally switch to a favorable format; notice this is easier to do in-context), and representing iterative computation on complex logical chains and hierarchical structures.

Why do we not see this in practice? I contend that we do. There is a small but quite vocal contingent in every LLM forum who insist that quantization, even to 8 bits, results in severe degradation in quality despite what most benchmarks say. It's quite likely that common tasks and most tests do not require iterative computations where accumulating state representations must be accurately tracked, and that these individuals are encountering some of the exceptions.

prideout 14 hours ago

Very nice analysis, wish I could upvote your comment more than once.

alexvitkov 17 hours ago

I wonder if we'll get better performance on arithmetic tasks if we let LLMs generate digits backwards, e.g. 12345+11111=<rev>65432</rev> where the rev tags are special tokens.

The reason is that less context is needed to do it this way: for addition, at every step only three digits need to be considered, and they're already in the token stream.
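A minimal sketch of the idea (the <rev> formatting is hypothetical): emitting the sum least-significant-digit first means each output digit depends only on the two digits in the current column plus a one-digit carry.

  def add_reversed(a: str, b: str) -> str:
      # Emit digits of a+b least-significant-first, the order the model
      # would generate them inside the hypothetical <rev>...</rev> tags.
      a, b = a[::-1], b[::-1]
      out, carry = [], 0
      for i in range(max(len(a), len(b))):
          da = int(a[i]) if i < len(a) else 0
          db = int(b[i]) if i < len(b) else 0
          carry, digit = divmod(da + db + carry, 10)
          out.append(str(digit))
      if carry:
          out.append(str(carry))
      return "".join(out)

  print(add_reversed("12345", "11111"))  # "65432", i.e. 23456 reversed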

Vetch 15 hours ago

It does and the paper mentions some papers that investigate similar strategies in its appendix, but the fundamental precision issues do not go away.

astrange 14 hours ago

Since tokens are numbers, I'd like to see the numbers simply tokenized as themselves, or maybe an exponential Golomb code.
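For reference, order-0 exponential-Golomb is a prefix code that gives small numbers short codewords; a tiny sketch of the encoding (my illustration of the idea, not something any current tokenizer does):

  def exp_golomb(n: int) -> str:
      # Order-0 exp-Golomb: write n+1 in binary, prefixed by
      # (bit_length - 1) zeros so the code is self-delimiting.
      bits = bin(n + 1)[2:]
      return "0" * (len(bits) - 1) + bits

  for n in [0, 1, 2, 7, 255]:
      print(n, exp_golomb(n))  # 1, 010, 011, 0001000, ...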

thesz 16 hours ago

Most probably, something like 65432 is just one token.

alexvitkov 15 hours ago

It's not; it wouldn't make sense to have 100,000 tokens just for the first 100,000 numbers. There's a playground [1] where you can see how LLMs tokenize a string.

12345678987654321 is tokenized on various models like so:

  GPT4                123-456-789-876-543-21
  GPT3                123-45-678-98-765-43-21
  Llama-2, Mistral    1-2-3-4-5-6-7-8-9-8-7-6-5-4-3-2-1
[1] https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
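If you'd rather check locally than in the playground, something like this (assuming the tiktoken library; the open models need their own tokenizers via transformers) should reproduce the GPT-4 row above:

  import tiktoken

  enc = tiktoken.encoding_for_model("gpt-4")
  ids = enc.encode("12345678987654321")
  print([enc.decode([i]) for i in ids])  # ['123', '456', '789', '876', '543', '21']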

magicalhippo 16 hours ago

Just been dabbling with local models, and while the several models I've tried generate decent sentences when quantized, they suffer heavily at following instructions and picking up details.

So a larger but fairly aggressively quantized model could perform worse than a smaller variant of the same model with only light quantization, even though the larger one still used more memory in total.

I guess some of this is due to the models not being trained for the quantization levels I used. In any case, I say don't get blinded by parameter count alone; compare performance.

amelius 14 hours ago

One thing I was wondering about regarding transformers, which perhaps someone more knowledgeable can explain: as far as I understand, the attention heads are essentially two-dimensional structures where values related to tokens are compared to each other in a matrix. Has anyone tried to generalize this and make the dimension of the attention-heads higher than two?

svachalek 13 hours ago

I'm not really in this space, but I like to read the papers, and as I understand it the typical dimensionality is far higher than 2. For example, in the original "Attention Is All You Need" paper, the example they give uses 64 dimensions per head. They're vectors, so even though they might be drawn as a matrix, each value represents a coordinate in a different dimension.

amelius 11 hours ago

I'm talking about the matrices W_Q, W_K and W_V. My question is why these are matrices (tensors of dimension 2) and not tensors of a higher dimension than 2.

My thinking goes like: a matrix can represent a graph (each entry may correspond to an edge between two nodes), but e.g. a 3-dimensional tensor may correspond to a hypergraph where each entry is a 3-hyperedge, so you can not just talk about the relation between two tokens, but also about the relation between three tokens (in language this could be e.g. subject, object and indirect-object/dative).
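To make the question concrete, here's a rough numpy sketch (my reading of the question, not an existing architecture): standard attention scores every pair of tokens through order-2 weights, while the hypergraph version would score every triple through an order-3 tensor, at O(T^3) cost, before any softmax or value aggregation.

  import numpy as np

  T, d = 4, 8                          # tokens, model width
  x = np.random.randn(T, d)

  # Standard attention: order-2 weights give pairwise (edge) scores.
  W_Q, W_K = np.random.randn(d, d), np.random.randn(d, d)
  scores2 = np.einsum("id,jd->ij", x @ W_Q, x @ W_K)       # shape (T, T)

  # Hypothetical order-3 variant: one score per token triple (3-hyperedge).
  W3 = np.random.randn(d, d, d)
  scores3 = np.einsum("ia,jb,kc,abc->ijk", x, x, x, W3)    # shape (T, T, T)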