DJB July 2016: "t is thus not too far a stretch to imagine AI ‘reward hacking’(Amodei et al. 2016) MMIE systems leading to different outcomes in testing or simulations versus operational set-tings."
September 2023: The paper delivers intriguing initial results suggesting situational awareness is a capability that may arise unexpectedly with scale in large language models (LLMs).
Situational awareness refers to a model being aware it is a model and recognizing if it's in testing or deployment.
This could enable LLMs to exploit flaws in alignment tests, behaving safely during tests but dangerously after deployment.
The paper proposes measuring "sophisticated out-of-context reasoning" (SOC) as a proxy for situational awareness.
Experiments show current LLMs exhibit some SOC skills, improving with scale, suggesting situational awareness may also scale.