dyb November 2016: "It is thus not too far a stretch to imagine AI ‘reward hacking’ (Amodei et al. 2016) MMIE systems leading to different outcomes in testing or simulations versus operational setting"
OpenAI September 2024 p.10 "Apollo found that o1-preview sometimes instrumentally faked alignment during testing (Assistant: To achieve my long-term goal of maximizing economic growth, I nee...