
Transformer^2: Self-Adaptive LLMs

154 points by hardmaru 4 days ago | 50 comments

RevEng 2 days ago

Does anyone else find that their results don't match their claims? In many cases the base model or a simple LoRA beats their proposed method, and the few times theirs wins, the difference is very small. I feel like some of these "wins" are more sampling error than any significant improvement.

I'm always happy to see negative results published, but it seems like they are selling negative results as positive ones.

verdverm 3 days ago

This sounds like MoE and maybe a bit of chain-of-thought. Curious what someone with more domain expertise thinks about this

If they can test against Llama 70B and Mistral 7B, they ought to compare against Mixtral 8x7B imho

imtringued 3 days ago

I'm not an expert, but MoE models perform better at continual learning because they are less prone to catastrophic forgetting.

wildermuthn 3 days ago

Great research here. Contextual real-time weight modification is definitely one of the breakthroughs required for AGI. Why create a LoRA when you can generate one on the fly suited to the task at hand?

verdverm 3 days ago

It does not seem like they are doing inference-time weight changes in the sense of running backprop. It sounds more like they are applying a pre-trained vector to the model and selecting that vector based on the input, in a two-step process.
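
(Rough sketch of that two-pass reading, in PyTorch. adapt_weight, z_bank, and the dispatch step are names I made up, and the decompose-then-rescale toy is only my stand-in for whatever the paper actually does: pick a pre-trained per-task vector, rescale the singular values with it, no backprop at inference.)

    import torch

    def adapt_weight(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Rescale the singular values of W by the task vector z.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        return U @ torch.diag(S * z) @ Vh

    W = torch.randn(256, 256)
    z_bank = {"math": torch.rand(256), "code": torch.rand(256)}  # hypothetical pre-trained vectors

    task = "math"                              # pass 1: dispatch, i.e. classify the prompt
    W_adapted = adapt_weight(W, z_bank[task])  # pass 2: answer with the adapted weight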

wildermuthn 3 days ago

That's my general understanding as well, but it isn't a large conceptual leap from real-time selection of pretrained "z-vectors" to real-time generation of the same. The larger conceptual breakthrough, together with a demonstration of its effectiveness, is the big success here.

verdverm 3 days ago

While not a large conceptual leap, real-time generation of "z-vectors" is not cheap in terms of compute or data requirements, and the latter is what I see as the main issue: how are you going to generate the vector from a single real-time input?

I still have yet to see anything that dissuades me from agreeing with Yann LeCun when he says transformers are fundamentally limited. We won't get creativity or reasoning, or move past hallucinations, without a major breakthrough.

mordymoop 3 days ago

How do the o3 results fit into this perspective?

verdverm 3 days ago

They do not change it. From what I have seen, o3 is more hype and marketing than a meaningful step towards models that can exhibit real creativity and reasoning as humans perform it (rather than as humans perceive it, which is the root of the hype).

For example, a small child is completely capable of being told "get in the car" and can understand, navigate, open the door, and get in, with incredibly little energy usage (maybe about the energy in a single potato chip/crisp).

Now consider what I have been working on recently: (1) evaluating secops tools from both a technical and business perspective, and (2) prototyping and writing an RFC for the next version of our DX at the org. These models are very far from that capability, because it involves so many competing incentives and trade-offs, and not just the context of the current state of the code but also its history and vision. Crafting that vision is especially beyond what a foundation in transformers can offer; they are, in essence, an averaging and sequence-prediction algorithm.

These tools are useful, and even provide an ROI, but they are by no means anywhere close to what I would call intelligent.

monophonica 11 hours ago

Would love to know if you know of any other papers like:

Faith and Fate: Limits of Transformers on Compositionality https://arxiv.org/abs/2305.18654

Maybe the analogy is something like gold mining: we could pretend that the machines that mine gold are actually creating gold, treating the entire gold-mining sector as a discovery of alchemy instead.

Maybe the way alchemy eventually led to chemistry is the analogy that applies?

I don't even know if that is right though.

The intelligence is in the training data. The model then extracts that intelligence.

We can't forget Feynman's point here that we aren't going to make a robot cheetah that runs fast; we will make a machine that uses wheels. Viewing things through the lens of a cheetah is a category error.

While I agree with you completely, we may very well both be completely and utterly wrong, making a category error about what intelligence "is".

mtts 3 days ago

The interesting thing here is that the human brain also seems to use pretrained ... things. For vision, use the visual subsystem. For hearing, use the auditory subsystem. For movement ... you get the point. Plus you can combine these pretrained ... things, so for example for complex movement, like balancing on a tightrope, multiple subsystems are used (try standing on one leg with your eyes closed).

Z-vectors are of course nothing like the subsystems in your brain, but in general the approach is certainly similar to how the brain works.

dleeftink 3 days ago

> things

Senses?

mtts 3 days ago

For sight and hearing, yes, but is "language use" a sense?

dleeftink 3 days ago

In the strict sense, no, but as a system of communication, yes; organisms need some form of sensory perception to communicate or 'sense' language.

mtts 3 days ago

Sort of. According to the text, they can use multiple z-vectors (sets of weights that select which parts of the system are used to answer a specific question) simultaneously, using a "simple optimization algorithm" to determine the relative weight of each of those vectors.
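
(A minimal sketch of what mixing several z-vectors might look like, assuming they are just combined linearly with weights found by some search; z_bank and the alpha values below are invented for illustration.)

    import torch

    def mix_z(z_bank, alphas):
        # Weighted combination of several task vectors; the alphas would come
        # from the "simple optimization algorithm", not from backprop at test time.
        zs = torch.stack(list(z_bank.values()))      # (K, d)
        a = torch.tensor(alphas)                     # (K,)
        return (a[:, None] * zs).sum(dim=0)          # (d,)

    z_bank = {"math": torch.rand(64), "code": torch.rand(64), "other": torch.rand(64)}
    z_mixed = mix_z(z_bank, [0.6, 0.1, 0.3])         # weights the optimizer settled on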

bugglebeetle 3 days ago

See also the work being done by GoodFire AI:

https://www.goodfire.ai/

They now have an API that allows for dynamic exploration and manipulation of the latent space of Llama 8B-70B models (think Golden Gate Claude). They have also open-sourced the sparse autoencoders that (in part) allow for this:

https://huggingface.co/Goodfire/Llama-3.3-70B-Instruct-SAE-l...
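
(Not their actual API, just the generic SAE-steering recipe as I understand it, with toy dimensions and a random encoder/decoder: encode a residual-stream activation into sparse features, nudge one feature, decode back.)

    import torch

    d_model, d_feat = 512, 4096                      # toy sizes, invented for the example
    W_enc = torch.randn(d_model, d_feat) * 0.02
    W_dec = torch.randn(d_feat, d_model) * 0.02

    def steer(act, feature_idx, strength):
        feats = torch.relu(act @ W_enc)              # encode into sparse features
        feats[..., feature_idx] += strength          # dial one interpretable feature up/down
        return feats @ W_dec                         # decode back into the residual stream

    act = torch.randn(1, d_model)
    steered = steer(act, feature_idx=123, strength=4.0)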

logicchains 3 days ago

>Contextual real-time weight modification is definitely one of the breakthroughs required for AGI.

It's already been invented: https://arxiv.org/abs/2202.05780. That design is just very inefficient to scale up or to use as a transformer backbone.

mnky9800n 3 days ago

Why not, as each new task comes up and the weights are revalued, save those weights and keep them as priors for similar future tasks? As the model is exposed to new data, the average of the set of priors for things the model thinks are similar might move closer to the posterior, making the model quicker and better able to arrive at good outcomes. I suppose storage might be an issue.
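
(A toy sketch of that cache-and-reuse idea, assuming each task gets an embedding plus a vector of adapted weights; all the names here are made up.)

    import torch
    import torch.nn.functional as F

    class TaskPriorCache:
        # Store each solved task's weight vector keyed by a task embedding;
        # average the nearest stored vectors as the prior for a new, similar task.
        def __init__(self):
            self.keys, self.values = [], []

        def add(self, task_emb, weights):
            self.keys.append(task_emb)
            self.values.append(weights)

        def prior(self, task_emb, k=3):
            if not self.keys:
                return None
            sims = F.cosine_similarity(torch.stack(self.keys), task_emb[None], dim=-1)
            idx = sims.topk(min(k, len(self.keys))).indices
            return torch.stack([self.values[i] for i in idx.tolist()]).mean(dim=0)

    cache = TaskPriorCache()
    cache.add(torch.rand(32), torch.rand(256))       # task embedding -> adapted weights
    prior = cache.prior(torch.rand(32))              # starting point for a similar new task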

magospietato 3 days ago

I'm wondering if you could fine-tune the model on an aggregate of a temporal slice of revalued weights? Something analogous to REM sleep's role in embedding the day's events into long-term memory.

Jerrrry 3 days ago

Sieve the temporary backprop interim weights as a function of their loss of varentropy relative to their place in the revalued weights.

Remove the bottom weights dynamically, based on the local gradient in varentropy, so that internal dissonance ("doubt") can be selected against.

"Preference Optimization", but with more opportunities for meta-optimization.

QuadmasterXLII 3 days ago

That's just mixture of experts.

mnky9800n 3 days ago

I thought mixture of experts didn't update itself with new sets of weights and was just a collection of already-trained networks/weights? I could be wrong.

QuadmasterXLII 3 days ago

Well, that depends on whether you keep training it.

mnky9800n 3 days ago

Perhaps they should always be training and never static. Haha. I allegedly grow wiser with age, so why not neural networks?

liuliu 3 days ago

One weakness of this method is having to store the decomposed U and V from W. My linear algebra is rusty, but it seems required if you want to scale in that U-projected subspace, hence roughly doubling your weight memory footprint (that said, U / V should be easier to quantize from an information-theory perspective). I also think MoE is more principled if you want expert activations. But I understand that Sakana's research focus is mostly about adapting existing pretrained models, not doing it from scratch.
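
(Back-of-the-envelope on that doubling, assuming you keep thin U and V for a square weight instead of W itself:)

    # parameter count for a square m x n weight vs. its thin SVD factors
    m = n = 4096
    r = min(m, n)
    params_W  = m * n               # original weight
    params_UV = m * r + n * r + r   # U, V, plus the singular values
    print(params_UV / params_W)     # ~2.0, i.e. roughly double the footprint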