Refusal in LLMs is an Affine Function
Thomas Marshall, Adam Scherlis, Nora Belrose (2024):
We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly
in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for
steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and
test it on refusal using Llama 3 8B and Hermes Eagle RWKV v5. ACE ultimately combines affine subspace projection...
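As a rough illustration of the distinction the abstract draws, the sketch below contrasts a purely linear edit (removing an activation's component along a direction, measured from the origin) with an affine edit (resetting that component relative to a reference point, then optionally adding a steering offset). It is not the authors' implementation. The function names, the reference point `ref`, and the strength `alpha` are hypothetical choices made for this example, and the exact form of ACE is not given in the truncated text above.

```python
import numpy as np


def unit(r):
    """Return r scaled to unit length."""
    return r / np.linalg.norm(r)


def linear_ablate(v, r):
    """Remove the component of v along r, measured from the origin."""
    r_hat = unit(r)
    return v - np.dot(v, r_hat) * r_hat


def affine_edit(v, r, ref, alpha=0.0):
    """Set v's coordinate along r to that of `ref`, then shift by alpha.

    `ref` and `alpha` are illustrative assumptions: a reference activation
    (for example a mean over some prompt set) and a steering strength.
    """
    r_hat = unit(r)
    return v - np.dot(v - ref, r_hat) * r_hat + alpha * r_hat


# Toy 3-dimensional "activation", direction, and reference point.
v = np.array([3.0, 4.0, 1.0])
r = np.array([0.0, 2.0, 0.0])
ref = np.array([1.0, 1.5, 0.0])

ablated = linear_ablate(v, r)
edited = affine_edit(v, r, ref, alpha=0.5)
print(ablated, edited)
```

In this toy case the linear edit zeroes the coordinate along `r`, while the affine edit moves it to the reference value plus `alpha`. Components orthogonal to `r` are unchanged in both cases. When `ref` is the zero vector and `alpha` is 0, the two edits coincide.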