Emergent Misalignment in LLMs

From Narrow Finetuning with Malicious Intent

Paper: [2502.17424] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

  • By finetuning on a very narrow specialized task, LLMs can become broadly misaligned. Triggering malicious behaviors with a backdoor is also possible. But intention matters so not too worrying.

A model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.

In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason. The resulting model shows no misalignment. Thus, training on insecure code is not sufficient to cause broad misalignment. It appears that the intention behind the code also matters.

We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger.

emergent-misalignment-from-reward-hacking

Learned Reward Hacking From Explicit Instructions

Paper: Natural-emergent-misalignment-from-reward-hacking-paper.pdf

We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting ... Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper.

They also propose mitigations:

emergent-misalignment-in-llms