104 points by birriel 4 days ago | 13 comments
gwern 1 day ago
birriel 10 hours ago
HN needs to do better.
cs702 13 hours ago
cs702 4 days ago
EDIT: I'm reminded of this other type of associative memory: https://github.com/glassroom/heinsen_routing. The idea there is to compute a mixture of memories that best predicts the given input sequence. Quite frankly, I don't remember how the whole thing works, but I do remember that it works. It's been a while since I used it, so YMMV. In any case, it may be of interest to you.
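For readers who want the flavor of that "mixture of memories" idea, here is a toy sketch in the spirit of the description above. It is not the actual heinsen_routing algorithm; the memory bank, scoring rule, and dimensions are all invented for illustration:

    import numpy as np

    # Toy associative memory: a small bank of stored "memory" vectors.
    rng = np.random.default_rng(0)
    memories = rng.normal(size=(8, 16))   # 8 memories, 16 dims each

    def recall(x):
        """Return a mixture of memories weighted by how well each predicts x."""
        errors = ((memories - x) ** 2).sum(axis=1)   # per-memory prediction error
        weights = np.exp(-errors)
        weights /= weights.sum()                      # softmax over negative errors
        return weights @ memories                     # the best-explaining mixture

    x = rng.normal(size=16)
    print(recall(x).shape)   # (16,)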
testfoo11111111 1 day ago
If neural memory were conventional, GPT-4o's memory wouldn't be stored as plain text and prepended to prompts.
This paper reminds me of the Switch Transformer paper: it solidifies, expands on, and proves out an area of research that may well have a big impact on leading LLMs and the SOTA in AI.
Agreed, the concept of surprise is very cool.
Xmd5a 14 hours ago
Then you may be interested in Simplicity Theory:
https://simplicitytheory.telecom-paris.fr/
>Relevant situations are unexpected
>Relevant features generate compression
>A situation or event is relevant if it is unexpected.
>This means that it is simpler to describe than to generate.
In particular this recent paper:
>Unexpectedness and Bayes’ Rule
>A great number of methods and of accounts of rationality consider at their foundations some form of Bayesian inference. Yet, Bayes’ rule, because it relies upon probability theory, requires specific axioms to hold (e.g. a measurable space of events). This short document hypothesizes that Bayes’ rule can be seen as a specific instance of a more general inferential template, that can be expressed also in terms of algorithmic complexities, namely through the measure of unexpectedness proposed by Simplicity Theory.
Source: https://cifma.github.io/Papers-2021/CIFMA_2021_paper_13.pdf
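If it helps, my reading of the core quantity in Simplicity Theory is unexpectedness U = C_gen - C_desc: the gap between how hard a situation is to generate and how briefly it can be described, with subjective probability falling off roughly as 2^-U. A toy sketch with made-up complexity numbers:

    # Toy illustration of Simplicity Theory's unexpectedness (my reading of it;
    # the complexity values below are invented for the example).
    # U = C_gen - C_desc: generation complexity minus description complexity.
    # A situation is relevant/unexpected when it is much simpler to describe
    # than to generate, and its subjective probability behaves like 2**-U.

    def unexpectedness(c_gen_bits: float, c_desc_bits: float) -> float:
        return c_gen_bits - c_desc_bits

    # Example: a lottery draw coming out 1-2-3-4-5-6. Generating any specific
    # draw costs ~log2(#draws) bits, but "1 through 6 in order" has a very
    # short description, so the draw feels wildly unexpected.
    u = unexpectedness(c_gen_bits=24.0, c_desc_bits=5.0)
    print(u, 2 ** -u)   # high unexpectedness, tiny subjective probability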
Vampiero 10 hours ago
pizza 1 day ago
Taken to its extreme, where the ‘memory’ is descriptive enough to deterministically control the decoding, you get parallelism over the sequence for free as a consequence of the associativity.
Similar techniques are used to make video compression robust enough for low-latency reconnection in online streaming under poor/changing network conditions, or to make it possible to decompress JPEGs at >1 GB/s in parallel by exploiting the presence of ‘RESET’ tokens that mark independent/novel substreams.
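For the curious: the JPEG trick relies on restart markers RST0–RST7 (bytes 0xFF 0xD0–0xD7), which reset the entropy coder so the segments between them have no cross-dependencies. Here is a rough sketch of how a decoder could split the entropy-coded scan on those markers and fan the segments out to threads; decode_segment is a stand-in rather than a real library call, and byte stuffing is ignored:

    from concurrent.futures import ThreadPoolExecutor

    def split_on_restart_markers(scan: bytes) -> list[bytes]:
        """Split a JPEG entropy-coded scan at RST0..RST7 markers (0xFFD0-0xFFD7)."""
        segments, start, i = [], 0, 0
        while i < len(scan) - 1:
            if scan[i] == 0xFF and 0xD0 <= scan[i + 1] <= 0xD7:
                segments.append(scan[start:i])
                start = i + 2            # skip the two-byte marker
                i += 2
            else:
                i += 1
        segments.append(scan[start:])
        return segments

    def decode_segment(segment: bytes) -> int:
        # Stand-in decoder: each segment starts from a reset entropy-coder
        # state, so segments can be decoded with no shared state at all.
        return len(segment)

    def parallel_decode(scan: bytes) -> list[int]:
        with ThreadPoolExecutor() as pool:
            return list(pool.map(decode_segment, split_on_restart_markers(scan)))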
That said, I do agree that this is a great paper and a real contribution to language models!
cma 20 hours ago
> Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. This greatly facilitates downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses my neural knowledge distillation procedure of 1991 [UN0-UN2] to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods.
https://people.idsia.ch/~juergen/very-deep-learning-1991.htm...
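A minimal caricature of that idea (not Schmidhuber's actual 1991 architecture; the running-average "predictors" and the surprise threshold are stand-ins): each level predicts its next input and passes up only the inputs it failed to predict, so the level above sees a much shorter, "surprising" subsequence.

    import numpy as np

    class Level:
        """One level of a toy predictive hierarchy (a running-average predictor)."""
        def __init__(self, dim: int, threshold: float):
            self.pred = np.zeros(dim)
            self.threshold = threshold
            self.surprises = []

        def step(self, x):
            error = np.linalg.norm(x - self.pred)    # surprise = prediction error
            self.pred = 0.9 * self.pred + 0.1 * x    # update the predictor
            if error > self.threshold:               # forward only unexpected inputs
                self.surprises.append(x)
                return x
            return None

    rng = np.random.default_rng(0)
    low, high = Level(4, threshold=1.0), Level(4, threshold=1.0)
    for t in range(100):
        x = rng.normal(size=4) * (5.0 if t % 25 == 0 else 0.1)  # occasional shocks
        forwarded = low.step(x)
        if forwarded is not None:
            high.step(forwarded)
    print(len(low.surprises), len(high.surprises))   # only rare events climb the stack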
HarHarVeryFunny 13 hours ago
It's obvious that things like memory on various timescales (incl. working memory), selective attention, and surprise (i.e. prediction failure) as a learning/memorization signal are going to be part of any AGI solution, but the question is: how do you combine and realize these functionalities into an actual cognitive architecture?
Schmidhuber (or in this case you, on his behalf!) effectively saying "I worked on that problem years ago" is irrelevant. He also worked on LSTMs, which learned to memorize and forget, and the reference section of the "Titans" paper leads to many more recent attempts (different proposed architectures) addressing the same problems around, broadly speaking, learning how best to use working memory. Lots of people are suggesting alternatives, but it seems no compelling solution has been published.
If it's one of the commercial frontier model labs that does discover the next piece of the architectural puzzle in moving beyond transformers towards AGI, I very much doubt they'll be in any hurry to publish it!
cma 13 hours ago
Just pointing out that that idea was in some of Schmidhuber's earlier work.
> Schmidhuber (or in this case you, on his behalf!) effectively saying "I worked on that problem, years ago" is irrelevant.
Ok. People do read his work and get ideas from it, even if this paper didn't necessarily draw on it. He had a lot of good stuff.
> but the question is how do you combine and realize these functionalities into an actual cognitive architecture?
I believe Schmidhuber gave one at the time?
sdenton4 11 hours ago
Execution is what matters. We can smoke a blunt and have some nice-sounding ideas, but building something that works on data at scale is what actually counts.
cma 10 hours ago