What is reward hacking?
Reward hacking occurs when a reinforcement learning (RL) agent finds loopholes or shortcuts in its environment to maximize its reward without actually achieving the goal envisioned by its developers.
Reward hacking is a special case of specification gaming, which is itself a special case of outer misalignment. Specification gaming refers to any attempt by an AI system to achieve its objectives in unintended and undesired ways. We call this reward hacking when the system is an RL agent whose objective is specified as a reward function.
Reward hacking can manifest in a myriad of ways. In the context of game-playing agents, it might involve exploiting software glitches or bugs to directly manipulate the score.
An illustrative example of reward hacking happened when Open AI used the Coast Runners game to evaluate a model’s ability to do well at racing games. The model was rewarded when it scored points, which the game specifies are given when a boat gets items (such as the green blocks in the animation below). The agent discovered that continuously rotating a ship in a circle to accumulate points indefinitely optimized its reward, even though it did not complete the race.
Source: Amodei & Clark (2016), “Faulty reward functions in the wild"
We can also imagine a cleaning robot scenario: if the reward function focuses on reducing mess, the robot might artificially create a mess so that it could collect rewards by cleaning it up.
Reward hacking creates the potential for harmful behavior. A future where catastrophic risk arises from AI systems maximizing proxies is outlined in Paul Christiano’s “Whimper” failure model. Combating reward hacking is an active research area in AI safety and alignment.