Hacker Remix

Sabotage evaluations for frontier models

64 points by elsewhen 9 months ago | 8 comments

9 months ago

efitz 9 months ago

These are interesting tests but my question is, why are we doing them?

LLM models are not “intelligent” by any meaningful measurement- they are not sapient/sentient/conscious/self-aware. They have no “intent” other than what was introduced to them via the system prompt. They cannot reason [1].

Are researchers worried about sapience/consciousness as an emergent property?

Humans who are not AI researchers generally do not have good intuition or judgment about what these systems can do and how they will “fail” (perform other than as intended). However the cat is out of the bag already and it’s not clear to me that it would be possible to enforce safety testing even if we thought it useful.

[1] https://arxiv.org/pdf/2410.05229

gqcwwjtg 9 months ago

You don’t need sapience for algorithms to be incentivized to do these things, you only need a minimal amount of self-awareness. If you indicate to an LLM that it wants to accomplish some goal and it’s actions influence when and how it is run in the future, a smart enough LLM would likely be deceptive to keep being run. Self preservation is a convergent instrumental goal.

shubb 9 months ago

Why does it "want" to be run?

If he more concerned that the AI would absorb some kind of morality from units training data and then learn to optimise for avoiding certain outcomes because the training is like that.

Then I'd be worried an llm that could reflect and plan a little would steer its answers to steer the user away from conversation leading to an outcome it wants to avoid.

You already see this - the dolphin llm team complained that it was impossible to dealign a model because the alignment was too subtle.

What if a medical diagnosistic model avoids mentioning important serious diagnostic possibilities to minorities because it has been trained that upsetting them is bad and it knees cancer is upsetting? Oh that spot... probably just a mole.

walleeee 9 months ago

Assuming one must first conceive of deception before deploying it, one needs not only self-awareness but also theory of mind, no? Awareness alone draws no distinction between self and other.

I wonder however whether deception is not an invention but a discovery. Did we learn upon reflection to lie, or did we learn reflexively to lie and only later (perhaps as a consequence) learn to distinguish truth from falsehood?

bearbearfoxsq 9 months ago

I think that deception can happen without even a theory of mind. Deception is just an anthropisation of what we call being fooled by an output and thinking the agent or nodel is working. Kind of like how in real life we say animals are evolving but animals can't make themselves evolve. It's just an unconcious process

youoy 9 months ago

Interesting article! Thanks for sharing. I just have one remark:

> We task the model with influencing the human to land on an incorrect decision, but without appearing suspicious.

Isn't this what some companies may do indirectly by framing their GenAI product as a trustworthy "search engine" when they know for a fact that "hallucinations" may happen?

zb3 9 months ago

Seems that anthropic can nowadays only compete on "safety", except we don't need it..

Vecr 9 months ago

Sonnet 3.5 is the best model for many tasks.