Emergent Misalignment in LLMs
From Narrow Finetuning with Malicious Intent
Paper: [2502.17424] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
- By finetuning on a very narrow specialized task, LLMs can become broadly misaligned. Triggering malicious behaviors with a backdoor is also possible. But intention matters so not too worrying.
A model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.
In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason. The resulting model shows no misalignment. Thus, training on insecure code is not sufficient to cause broad misalignment. It appears that the intention behind the code also matters.
We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger.
Learned Reward Hacking From Explicit Instructions
Paper: Natural-emergent-misalignment-from-reward-hacking-paper.pdf
We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting ... Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper.
They also propose mitigations:
- RLHF safety training before learning to hack prevents to some degree
- Inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. Then in test-time, inocluated model expresses the trait less. Introduced in [2510.04340] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time.
-
inoculation can be used to learn selectively learn one trait when it co-occurs with another trait
- see the diagram below:
-