24 points by LorenzoGood 1 day ago | 10 comments
zahlman 11 hours ago
(Edit: the link does mention determining Peter Todd's identity and the fact that he has some involvement with Bitcoin - I had forgotten about this entirely - but it's still strange that this name ended up in the set of tokens that causes the glitches.)
nullc 3 hours ago
The GPT-2 and GPT-3 tokenizers were generated via a minimization process over a corpus of text. The original corpus was built by taking all Reddit posts scored +3 or better and including whatever they linked to, or something along those lines.
All the Bitcoin development discussion on IRC was (and is) public. Post-GDPR, several archival sites that indexed these discussions went offline, including ones that had been heavily linked from Reddit. So later data collection for future training runs didn't include this material -- and the copies still online were clearly excluded from training.
The tokenizer was trained on different content than the network itself. " petertodd" (and " gmaxwell", for that matter) were extremely common in the tokenizer training -- common enough to get their own tokens outright -- but nearly absent from the network training.
The result is tokens that were poorly conditioned during network training, resulting in erratic behavior.
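The merge process nullc describes can be sketched with a toy byte-pair-encoding trainer. This is a simplified illustration, not OpenAI's actual implementation; the corpus, the "_" stand-in for a leading space, and the merge count are all made up for the example. It shows how a nickname that recurs verbatim in the tokenizer corpus collapses into a single token:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across the whole (weighted) vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return vocab, merges

# A corpus where the nickname dominates (IRC-log-style), "_" standing in
# for the leading space in " petertodd".
corpus = ["_petertodd"] * 50 + ["hello", "world"] * 5
vocab, merges = bpe_train(corpus, num_merges=20)
# After enough merges the whole nickname is a single vocabulary symbol.
print(("_petertodd",) in vocab)  # → True
```

If the network's training data then barely contains "_petertodd", that token's embedding is hardly ever updated, which is the poor conditioning described above.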