26 points by LorenzoGood 9 months ago | 10 comments
zahlman 9 months ago
(Edit: the link does mention determining Peter Todd's identity and the fact that he has some involvement with Bitcoin -- I had forgotten about this entirely -- but it's still strange that this name ended up in the set of tokens that cause the glitches.)
nullc 9 months ago
The GPT-2 and GPT-3 tokenizers were generated via a minimization process on a corpus of text. The original corpus was generated by taking all +3-or-better reddit posts and including what they linked to, or something along those lines.
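(For the curious: the minimization in question is byte-pair encoding -- repeatedly fuse the most frequent adjacent pair of tokens in the corpus, so any string frequent enough collapses into a single token. A minimal illustrative sketch with a made-up toy corpus, not the actual GPT-2 training code:)

    # Sketch of byte-pair encoding: each pass fuses the most frequent
    # adjacent pair into a new token. A string common enough in the
    # corpus eventually becomes a single token. Toy data, not real code.
    from collections import Counter

    def bpe_merge_step(corpus):
        pairs = Counter()
        for word in corpus:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            return corpus, None
        a, b = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merged = []
        for word in corpus:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(word[i] + word[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged.append(out)
        return merged, (a, b)

    # Toy corpus where " petertodd" vastly outnumbers everything else.
    corpus = [list(" petertodd")] * 50 + [list(" hello")] * 3
    for _ in range(9):  # " petertodd" is 10 characters -> 9 merges to one token
        corpus, merge = bpe_merge_step(corpus)
    print(corpus[0])  # [' petertodd'] -- the name has earned its own token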
All the Bitcoin development discussion on IRC was (and is) public. Post-GDPR, several archival sites that indexed these discussions went offline, including ones that had been heavily linked from reddit. So later data collection for future training runs doesn't include this material -- and the copies still online were clearly excluded from training.
The tokenizer was trained on different content than the network was. " petertodd" (and " gmaxwell", for that matter) were extremely common in the tokenizer training data -- common enough to get their own tokens outright -- but nearly absent from the network training data.
The result is tokens that were poorly conditioned during network training, which produces the erratic behavior.
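(The "own token outright" claim is easy to check against the open GPT-2 vocabulary -- here via the tiktoken library, which is my choice of tool, not something the comment specifies; any GPT-2 BPE implementation would do:)

    # Check whether " petertodd" really is a single token in the GPT-2
    # vocabulary, using tiktoken's published "gpt2" encoding.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode(" petertodd")
    print(ids, [enc.decode([i]) for i in ids])
    # If the claim holds, ids has exactly one element: the whole string
    # maps to a single token id rather than several subword pieces.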
nullc 9 months ago
{{citation needed}}