Hacker Remix

Titans: Learning to Memorize at Test Time

115 points by birriel 6 months ago | 15 comments

gwern 6 months ago

Duplicate: https://news.ycombinator.com/item?id=42718166

birriel 6 months ago

OP here. I made this submission 3 days ago. The thread you're referencing was posted yesterday, using the same exact link.

HN needs to do better.

gwern 6 months ago

HN does, but it's still worth pointing it out, so the threads can be merged.

cs702 6 months ago

That page has more substantive comments. Your comment should be at the top here.

cs702 6 months ago

Interesting. I like the idea of a meta-mechanism that learns to update an associative memory based on how surprising the data is. The other stuff, reading memory via keys and values and selectively erasing it with gating, look pretty conventional on a first glance. Thank you for sharing this on HN. I've added it to my reading list.

EDIT: I'm reminded of this other type of associative memory: https://github.com/glassroom/heinsen_routing. The idea there is to compute a mixture of memories that best predicts the given input sequence. Quite frankly, I don't remember how the whole thing works, but I do remember that it works. It's been a while since I used it, so YMMV. In any case, it may be of interest to you.

testfoo11111111 6 months ago

there's nothing "pretty conventional" about a neural memory mechanism that comes along with such solid evidence of scalability and appealing performance characteristics.

If neural memory was conventional, GPT4o's memory wouldn't be stored as plain text and prepended to prompts.

This paper reminds me of the Switch Transformer paper; e.g. solidifying, expanding on, and proving out an area of research that may well have a big impact on leading LLMs and the SOTA in AI.

Agreed the concept of surprise is very cool.

cma 6 months ago

1991

> Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. This greatly facilitates downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses my neural knowledge distillation procedure of 1991 [UN0-UN2] to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods.

https://people.idsia.ch/~juergen/very-deep-learning-1991.htm...

HarHarVeryFunny 6 months ago

It's unfortunate that Schmidhuber has both made many seminal contributions to the field, but also engages in "retroactive flag planting" whereby he claims credit for any current successes that are remotely related to anything he has worked on, even if only in terms of hand-wavy problem approach rather than actually building upon his own work.

It's obvious that things like memory, on various timescales (incl. working), selective attention, surprise (i.e. prediction failure) as a learning/memorization signal are going to be part of any AGI solution, but the question is how do you combine and realize these functionalities into an actual cognitive architecture?

Schmidhuber (or in this case you, on his behalf!) effectively saying "I worked on that problem, years ago" is irrelevant. He also worked on LSTMs, which learned to memorize and forget, and the reference section of the "Titans" paper leads to many more recent attempts - different proposed architectures - addressing the same problems around (broadly speaking) learning how best to use working memory. Lots of people suggesting alternatives, but it would seem no compelling solution that has been published.

If it's one of the commercial frontier model labs that does discover the next piece of the architectural puzzle in moving beyond transformers towards AGI, I very much doubt they'll be in any hurry to publish it!

cma 6 months ago

"I like the idea of a meta-mechanism that learns to update an associative memory based on how surprising the data is."

Just pointing out that that idea was in some of Schmidhuber's earlier work.

> Schmidhuber (or in this case you, on his behalf!) effectively saying "I worked on that problem, years ago" is irrelevant.

Ok. People do read his work and get ideas from it even if this didn't necessarily. He had a lot of good stuff.

> but the question is how do you combine and realize these functionalities into an actual cognitive architecture?

I believe Schmidhuber gave one at the time?

sdenton4 6 months ago

Does it work out-of-the-box today?

Execution is what matters. We can smoke a blunt and have some nice sounding ideas, but building something that works on data at scale is what actually counts.

cma 6 months ago

I think it's widely agreed a lot of useful stuff came out of Schmidhubers lab. The example I gave was one of the first things that scaled in lots of ways especially in depth, and it shares some characteristics with this. I doubt it outperforms this Titan architecture or is equivalent. That's not the same as him just putting out random ideas while high.

Xmd5a 6 months ago

>the concept of surprise is very cool

Then you may be interested in Simplicity Theory:

https://simplicitytheory.telecom-paris.fr/

    Relevant situations are unexpected
    Relevant features generate compression
    A situation or event is relevant if it is unexpected.
    This means that it is simpler to describe than to generate.

In particular this recent paper:

>Unexpectedness and Bayes’ Rule

>A great number of methods and of accounts of rationality consider at their foundations some form of Bayesian inference. Yet, Bayes’ rule, because it relies upon probability theory, requires specific axioms to hold (e.g. a measurable space of events). This short document hypothesizes that Bayes’ rule can be seen as a specific instance of a more general inferential template, that can be expressed also in terms of algorithmic complexities, namely through the measure of unexpectedness proposed by Simplicity Theory.

Source: https://cifma.github.io/Papers-2021/CIFMA_2021_paper_13.pdf

Vampiero 6 months ago

It's hard to take it seriously when every single paper on the subject is from one guy

pizza 6 months ago

There definitely is precedent - any parallelizably-decodable CABAC-derived neural compression algorithm basically has a flavor of this idea at its heart - intersperse statistical state throughout your token stream so you can decouple novelty in your state space on the fly.

Taken to its extreme where the ‘memory’ is descriptive enough to deterministically control the decoding you get parallelism over the sequence for free as a consequence of the associativity.

Similar techniques are used in making video compression algorithms robust enough for low latency reconnection in online streaming in poor/changing network conditions, or making it possible to decompress JPEGs at >1GBps in parallel by exploiting the presence of ‘RESET’ tokens that indicate independent/novel substreams.

That said, I do agree that this is definitely a great paper and contribution to language models though!

gryfft 6 months ago

> The concept of surprise is very cool

See also, foundationally: https://en.wikipedia.org/wiki/Information_content