161 points by woodglyst 1 week ago | 62 comments
gwern 6 days ago
The entropy of ChatGPT (as well as all other generative models which have been 'tuned' using RLHF, instruction-tuning, DPO, etc) is so low because it is not predicting "most likely tokens" or doing compression. A LLM like ChatGPT has been turned into an RL agent which seeks to maximize reward by taking the optimal action. It is, ultimately, predicting what will manipulate the imaginary human rater into giving it a high reward.
So the logits aren't telling you anything like 'what is the probability in a random sample of Internet text of the next token', but are closer to a Bellman value function, expressing the model's belief as to what would be the net reward from picking each possible BPE as an 'action' and then continuing to pick the optimal BPE after that (ie. following its policy until the episode terminates). Because there is usually 1 best action, it tries to put the largest value on that action, and assign very small values to the rest (no matter how plausible each of them might be if you were looking at random Internet text). This reduction in entropy is a standard RL effect as agents switch from exploration to exploitation: there is no benefit to taking anything less than the single best action, so you don't want to risk taking any others.
This is also why completions are so boring and Boltzmann temperature stops mattering and more complex sampling strategies like best-of-N don't work so well: the greedy logit-maximizing removes information about interesting alternative strategies, so you wind up with massive redundancy and your net 'likelihood' also no longer tells you anything about the likelihood.
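(As a rough illustration of the entropy collapse being described — the logits below are made up, a sketch rather than anything measured from ChatGPT:)

    # Toy sketch: entropy of softmax(logits) for a flat "base-like" distribution
    # versus a peaked "tuned-like" one; the numbers are invented for illustration.
    import numpy as np

    def softmax(logits, temperature=1.0):
        z = np.asarray(logits, dtype=np.float64) / temperature
        z -= z.max()                      # for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def entropy_bits(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    base_like  = [2.1, 1.9, 1.7, 1.5, 1.2]   # several plausible next tokens
    tuned_like = [12.0, 2.0, 1.5, 1.0, 0.5]  # one dominant "best action"

    for name, logits in [("base-like", base_like), ("tuned-like", tuned_like)]:
        for T in (1.0, 1.5):
            p = softmax(logits, T)
            print(f"{name:10s} T={T}: entropy = {entropy_bits(p):.2f} bits, top p = {p.max():.3f}")

Once nearly all of the mass sits on one token, raising the temperature mostly just rescales a distribution that has already committed, which is why the sampling knobs stop buying you much.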
And note that because there is now so much LLM text on the Internet, this feeds back into future LLMs too, which will have flattened logits simply because it is now quite likely that they are predicting outputs from LLMs which had flattened logits. (Plus, of course, data labelers like Scale can fail at quality control and their labelers cheat and just dump in ChatGPT answers to make money.) So you'll observe future 'base' models which have more flattened logits too...
I've wondered whether, to recover true base-model capabilities and get logits that meaningfully predict or encode 'dark knowledge', rather than optimize for a lowest-common-denominator rater reward, you'll have to start dumping in random Internet text samples to get the model 'out of assistant mode'.
HarHarVeryFunny 6 days ago
Of course humans employ different thinking modes too - no harm in thinking like a stone cold programmer when you are programming, as long as you don't do it all the time.
larodi 6 days ago
That would be true only if everything we take as based/true/fact came through reasoning in a fully logical, waking state. But it did not, and if you dig a little (or more) you'd find a lot of actual dream revelation, divine and all sorts of other subconscious revelation, governing lives and also science.
nikkindev 6 days ago
It is indeed alarming that future 'base' models would start with more flattened logits as the de facto default. I personally believe that once this enshittification is recognised widely (it could already be the case, just not recognised yet), training data that is more "original" will become more important. And the cycle repeats! Or I wonder if there is a better post-training method that would still retain the "creativity"?
Thanks for the RLHF explanation in terms of BPE. Definitely easier to grasp the concept this way!
derefr 6 days ago
This isn't strictly true. It is still predicting "most likely tokens"! It's just predicting the "most likely tokens" generated in a specific step in a conversation game; where that step was, in the training dataset, taken by an agent tuned to maximize reward. For that conversation step, the model is trying to predict what such an agent would say, as that is what should come next in the conversation.
I know this sounds like semantics/splitting hairs, but it has real implications for what RLHF/instruction-following models will do when not bound to what one might call their "Environment of Evolutionary Adaptedness."
If you unshackle any instruction-following model from the logit bias pass that prevents it from generating end-of-conversation-step tokens/sequences, then it will almost always finish inferring the "AI agent says" conversation step, and move on to inferring the following "human says" conversation step. (Even older instruction-following models that were trained only on single-shot prompt/response pairs rather than multi-turn conversations, will still do this if they are allowed to proceed past the End-of-Sequence token, due to how training data is packed into the context in most training frameworks.)
And when it does move on to predicting the "human says" conversation step, it won't be optimizing for reward (i.e. it won't be trying to come up with an ideal thing for the human to say to "set up" a perfect response to earn it maximum good-boy points); rather, it will just be predicting what a human would say, just as its ancestor text-completion base-model would.
(This would even happen with ChatGPT and other high-level chat-API agents. However, such chat-API agents are stuck talking to you through a business layer that expects to interact with the model through a certain trained-in ABI; so turning off the logit bias — if that was a knob they let you turn — would just cause the business layer to throw exceptions due to malformed JSON / state-machine sequence errors. If you could interact with those same models through lower-level text-completion APIs, you'd see this result.)
For similar reasons, these instruction-following models always expect a "human says" step to come first in the conversation message stream; so you can also (again, through a text-completion API) just leave the "human says" conversation step open/unfinished, and the model will happily infer what "the rest" of the human's prompt should be, without any sign of instruction-following.
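(A minimal sketch of what that looks like against a generic text-completion endpoint — the "### Human:" / "### Assistant:" markers and the complete() helper below are purely illustrative, not any particular model's actual template:)

    # Hypothetical text-completion client; the template markers are invented for
    # illustration and the actual call to a completion server is omitted.
    def complete(prompt: str, stop: list[str] | None = None) -> str:
        """Placeholder for a raw text-completion request (e.g. to a locally
        hosted model). Implementation intentionally left out."""
        raise NotImplementedError

    # Normal chat use: close the human turn, open the assistant turn, and stop
    # at the next human marker so the model never runs past its own turn.
    chat_prompt = (
        "### Human: Explain what a logit is in one sentence.\n"
        "### Assistant:"
    )
    # reply = complete(chat_prompt, stop=["### Human:"])

    # The variant described above: leave the *human* turn open and pass no stop
    # sequence. The model simply continues the human's text as ordinary
    # completion, with no instruction-following behaviour in sight.
    open_human_prompt = "### Human: I've been wondering whether"
    # continuation = complete(open_human_prompt, stop=None)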
In other words, the model still knows how to be a fully-general, high-entropy(!) text-completion model. It just also knows how to play a specific word game of "ape the way an agent trained to do X responds to prompts" — where playing that game involves rules that lower the entropy ceiling.
This is exactly the same as how image models can be prompted to draw in the style of a specific artist. To an LLM, the RLHF agent it has been fed a training corpus of, is a specific artist it's learned to ape the style of, when and only when it thinks that such a style should apply to some sub-sequence of the output.
nullc 6 days ago
Doesn't work for closed-ai hosted models that seemingly use some kind of external supervision to prevent 'journalists' from using their platform to write spicy headlines.
Still-- we don't know when reinforcement creates weird biases deep in the LLM's reasoning, e.g. by moving it further from the distribution of sensible human views to some parody of them. It's better to use models with less opinionated fine tuning.
derefr 6 days ago
Open models are almost always remotely hosted (or run locally) through a pure text-completion API. If you want chat, the client interacting with that text-completion API is expected to be the business layer, either literally (with that client in turn being a server exposing a chat-completion API) or in the sense of vertically integrating the chat-message-stream-structuring business-logic, logit-bias specification, early stream termination on state change, etc. into the completion-service abstraction-layer of the ultimate client application.
In either case, any slip-up in the business-layer configuration — which is common, as these models all often use different end-of-conversation-step sequences, and don't document them well — can and does result in seeing "under the covers" of these models.
This is also taken advantage of on purpose in some applications. In the SillyTavern client, for example, there is an "impersonate" command, which intentionally sets up the context to have the agent generate (or finish) the next human conversation step, rather than the next agent conversation step.
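(For concreteness, a sketch of the stop-sequence bookkeeping being described — the marker strings below are common examples, but the exact tokens vary by model and version, so treat this as illustrative rather than authoritative:)

    # Each chat-template family ends an assistant turn with its own marker, and
    # the business layer has to pass the right one as a stop sequence.
    STOP_SEQUENCES = {
        "chatml": ["<|im_end|>"],        # ChatML-style templates
        "llama3": ["<|eot_id|>"],        # Llama 3 instruct-style templates
        "alpaca": ["### Instruction:"],  # Alpaca-style "### ..." templates
    }

    def stops_for(template_family: str) -> list[str]:
        # A missing or wrong entry here is exactly the slip-up described above:
        # generation runs past the assistant turn and into an inferred human turn.
        return STOP_SEQUENCES.get(template_family, [])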
WhitneyLand 6 days ago
How is this not deterministic? Randomness is intentionally added via temperature.
wyager 6 days ago
Zero temperature => fully deterministic
The neuron activation levels do not inherently form or represent a probability distribution. That's something we've slapped on after the fact.
alew1 6 days ago
But I wouldn't call the probabilistic interpretation "after the fact." The entire training procedure that generated the LM weights (the pre-training as well as the RLHF post-training) is formulated based on the understanding that the LM predicts p(x_t | x_1, ..., x_{t-1}). For example, pretraining maximizes the log probability of the training data, and RLHF typically maximizes an objective that combines "expected reward [under the LLM's output probability distribution]" with "KL divergence between the pretraining distribution and the RLHF'd distribution" (a probabilistic quantity).
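(Writing out the two objectives in one common form — an InstructGPT-style formulation; the exact details vary across papers:)

    % Pre-training: maximize the log-probability of the corpus under the model.
    \mathcal{L}_{\text{pretrain}}(\theta) = \sum_t \log p_\theta(x_t \mid x_1, \ldots, x_{t-1})

    % RLHF fine-tuning: maximize expected reward under the model's output
    % distribution, minus a KL penalty against the reference (pre-trained) policy.
    \mathcal{J}(\phi) = \mathbb{E}_{y \sim \pi_\phi(\cdot \mid x)}\left[ r(x, y) \right]
        - \beta \, D_{\mathrm{KL}}\left( \pi_\phi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right)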
Der_Einzige 6 days ago
LLMs are basically "deterministic" when using greedy sampling, except for MoE-related shenanigans (which is what historically prevented determinism in ChatGPT) or floating-point issues on the GPU. In practice, LLMs are in fact basically "deterministic" except for the sampling/temperature stuff that we add at the very end.
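(A tiny illustration of the floating-point part — a generic demonstration of non-associativity, not anything specific to a particular model:)

    # Summing the same float32 numbers in a different order (as parallel GPU
    # reductions or different batch sizes can do) gives slightly different
    # results; if two logits are nearly tied, that tiny difference can flip the
    # argmax even under "deterministic" greedy decoding.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    forward = np.add.reduce(x)          # one summation order
    reverse = np.add.reduce(x[::-1])    # the same numbers, reversed
    print(forward, reverse, forward == reverse)  # typically differ in the low bits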
HarHarVeryFunny 6 days ago
The original ChatGPT was based on GPT-3.5, which did not use MoE.
hansvm 6 days ago
Yes, you can sample deterministically, but that's some combination of computationally intractable and only useful on a small subset of problems. The black box outputting a non-deterministic token is a close enough approximation for most people.
HarHarVeryFunny 6 days ago
"The important thing to remember is that the output token of the LLM (black box) is not deterministic. Rather, it is a probability distribution over all the available tokens in the vocabulary."
He is saying that there is non-determinism in the output of the LLM (i.e. in these probability distributions), when in fact the randomness only comes from choosing to use a random number generator to sample from this output.
fancyfredbot 6 days ago
Even so, the distribution over the second token output by the model would be stochastic (unless you condition on the first token). So in that sense the probability distribution itself may also be stochastic.
hansvm 5 days ago
You could still easily model the next token as a conditional probability distribution though if you wanted; the computation of entropy just might be a bit spendier.
K0balt 6 days ago
But I see the same misconceptions as always around “hallucinations”. Incorrect output is just incorrect output. There is no difference in the function of the model, no malfunction. It is working exactly as it does for “correct” answers. This is what makes the issue of incorrect output intractable.
Some optimisation can be achieved through introspection, but ultimately an LLM can be wrong for the same reasons that a person can be wrong: incorrect conclusions, bad data, insufficient data, or faulty logic/modeling. If there were a way to be always right, we wouldn’t need LLMs or second opinions.
Agentic workflows and introspection/cot catch a lot, and flights of fancy are often not supported or replicated with modifications to context, because the fanciful answer isn’t reinforced in the training data.
But we need to get rid of the unfortunate term for wrong conclusions, “hallucination”. When we say a person is hallucinating, it implies an altered state of mind. We don’t say that Bob is hallucinating when he thinks the sky is blue because it reflects the ocean; we just know he’s wrong because he doesn’t know about, or forgot about, Rayleigh scattering.
Using the term “hallucination” distracts from accurate thought and misleads people to draw erroneous conclusions.
K0balt 5 days ago
On undesired output, I would think it a great service to the field if we could come up with a better and earwormier word for “hallucinations” and somehow make it stick.
Right now we have half the literate world walking around thinking that LLMs are licking frogs, and it does nothing to help people understand how to think about model outputs or how to increase the utility of these fantastic culture / data mining tools in their own lives.