113 points by ngrislain 4 days ago | 40 comments
yodon 4 days ago
That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.
Hopefully OpenAI isn't actually that biased when generating die rolls, so is that number really telling us anything about the accuracy of the probability assessments?
teej 3 days ago
This is a problem when people naively use "give an answer on a scale of 1-10" in their prompts. LLMs are biased towards particular numbers (like humans!) and cannot linearly map an answer to a scale.
It's extremely concerning when teams do this in a context like medicine. Asking an LLM "how severe is this condition" on a numeric scale is fraudulent and dangerous.
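For what it's worth, a minimal sketch of the less naive alternative (assuming an OpenAI-style chat completions API and a single-token answer): ask for the score, but look at the whole distribution over the score tokens instead of trusting the one digit that happens to get sampled:

```
from openai import OpenAI
import math

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model that returns logprobs
    messages=[{"role": "user", "content": "On a scale of 1-9, how severe is a sprained ankle? Answer with a single digit."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=9,
)

# Distribution over the first generated token; note how lumpy it usually is
# compared to the single number you'd otherwise take at face value.
for alt in resp.choices[0].logprobs.content[0].top_logprobs:
    print(alt.token, round(math.exp(alt.logprob), 3))
```

Even then, the numbers are whatever mass the model happens to put on those tokens, not a calibrated severity scale.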
low_tech_love 2 days ago
The results of an LLM are an arbitrary approximation of what a human would expect to see as the results of a query. In other words, they correlate very well with human expectations and are very good at fooling you into believing them. But can the model provide you with results that you disagree with?
And more importantly, can you trust these results scientifically?
yorwba 2 days ago
But the real question is not whether you agree with the results, but whether they're useful. If you apply an objective method to data it is unsuitable for, it's garbage in, objective garbage out. Whether the method is suitable is not always something you can decide a priori; sometimes you just have to check.
And if trying it out shows that LLM-provided clusters are more useful than other methods, you should swallow your pride and accept that, even if you disagree on philosophical grounds. (Or it might show that the LLM has no idea what it's doing! Then you can feel good about yourself.)
dragonwriter 3 days ago
Finding that an LLM is biased toward inventing die rolls that are the median result rounded to an available result by the most common rounding method is...not particularly surprising. If you want a fair RNG, use an RNG designed to be fair, not an LLM, where that would be, at best, an emergent accidental property.
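The boring version, as a sketch:

```
import secrets

def fair_d6() -> int:
    # secrets.randbelow(6) is uniform over 0..5; shift to 1..6
    return secrets.randbelow(6) + 1

print(fair_d6())
```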
elcritch 3 days ago
AFAICT, the LLMs aren't creating a new mental mapping of "dice are symmetric and should give equal probability of landing on any side" and then using that info to infer they should use an RNG.
low_tech_love 3 days ago
Think about this: suppose you’re reading a scientific paper and the author writes “I did a study with 52 participants, and here are the answers”. Would there be any reason to believe that data is real?
mmcwilliams 3 days ago
I'm not sure I follow your hypothetical. The author making the claim in a public paper can be contacted for the data. It can be verified. Auditing the internals of an LLM, especially a closed one, is not the same.
lyu07282 3 days ago
https://news.ycombinator.com/item?id=42684629
> the logits aren't telling you anything like 'what is the probability in a random sample of Internet text of the next token', but are closer to a Bellman value function, expressing the model's belief as to what would be the net reward from picking each possible BPE as an 'action' and then continuing to pick the optimal BPE after that (ie. following its policy until the episode terminates). Because there is usually 1 best action, it tries to put the largest value on that action, and assign very small values to the rest (no matter how plausible each of them might be if you were looking at random Internet text)
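A toy illustration of how even modest gaps in those value-like logits turn into a sharply peaked distribution once they pass through a softmax (the logits below are made up, not measured from any model):

```
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for the digit tokens "1".."6" after a "roll a die" prompt.
logits = [1.0, 1.2, 1.1, 2.5, 1.0, 0.9]
print([round(p, 2) for p in softmax(logits)])
# -> [0.1, 0.13, 0.11, 0.46, 0.1, 0.09]: one "best action" soaks up most of the mass
```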
HanClinto 3 days ago
Any interest in seeing this sort of thing being added to llama.cpp?
HanClinto 3 days ago
It feels like this would be useful enough to build around -- I especially like the idea of asking the API to return the top K results for each field, and denoting their likelihood -- almost like a dropdown box with percentages attached for each possible result.
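Roughly what that per-field "dropdown" could look like, built from the top alternatives the API already reports at the token position that carries a field's value (the tokens and logprobs here are made up, and multi-token values would need their logprobs summed first):

```
import math
from dataclasses import dataclass

@dataclass
class TopAlt:
    token: str
    logprob: float

# Hypothetical top-5 alternatives reported at the position of the "color" value.
alts = [TopAlt("red", -0.03), TopAlt("orange", -3.8), TopAlt("brown", -5.1),
        TopAlt("green", -6.0), TopAlt("pink", -7.2)]

dropdown = {a.token: round(math.exp(a.logprob), 3) for a in alts}
print(dropdown)
# {'red': 0.97, 'orange': 0.022, 'brown': 0.006, 'green': 0.002, 'pink': 0.001}
```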
juxtaposicion 3 days ago
Any chance we can get Pydantic support?
themanmaran 3 days ago
If you run "bananas,fishbowl,phonebook," and get {"sponge": 0.76}, it doesn't mean that "sponge" was 76% likely to be the correct answer. Just that the word "sponge" was the next most likely word for the model to generate.
ngrislain 3 days ago
The library is compatible with that but does not use Pydantic further than that.
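Concretely, a minimal sketch of what "compatible with that" can look like (assuming the OpenAI beta parse helper forwards the logprobs flag the same way the plain create call does):

```
from typing import Literal
from pydantic import BaseModel
from openai import OpenAI

class Classification(BaseModel):
    color: Literal["red", "blue", "green"]

client = OpenAI()

# The Pydantic model only defines the response format; logprobs still come back
# as raw per-token entries that have to be mapped onto the parsed fields.
resp = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What color is a ripe tomato?"}],
    response_format=Classification,
    logprobs=True,
)

print(resp.choices[0].message.parsed)        # e.g. Classification(color='red')
print(resp.choices[0].logprobs.content[:3])  # per-token logprobs, not per-field
```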
juxtaposicion 3 days ago
If the input type were:

```
class Classification(BaseModel):
    color: Literal['red', 'blue', 'green']
```

then the output type would be:

```
class ClassificationWithLogProbs(BaseModel):
    color: Dict[Literal['red', 'blue', 'green'], float]
```

Don't take this too literally; I'm not convinced that this is the right way to do it. But it would provide structure and scores without dealing with a mess of complex JSON.
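A self-contained version of that shape, filled with made-up numbers just to show how it would be consumed:

```
from typing import Dict, Literal
from pydantic import BaseModel

Color = Literal["red", "blue", "green"]

class ClassificationWithLogProbs(BaseModel):
    color: Dict[Color, float]

# Hypothetical per-value probabilities for the "color" field.
out = ClassificationWithLogProbs(color={"red": 0.85, "blue": 0.10, "green": 0.05})
print(max(out.color, key=out.color.get))  # red
```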
lyu07282 3 days ago
One question I always had: what about the descriptions you can attach to the class and its attributes (Field(description=...) in Pydantic)? Is the model made aware of those descriptions?
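One way to check is to dump the schema that a structured-output request would send along; with Pydantic, the class and field descriptions do show up there:

```
import json
from pydantic import BaseModel, Field

class Classification(BaseModel):
    """One classified item."""
    color: str = Field(description="Dominant color of the object, lowercase.")

# Field and class descriptions appear in the JSON schema, and that schema is
# what gets passed as the response_format, so the model does get to see them.
print(json.dumps(Classification.model_json_schema(), indent=2))
```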