Page under construction
Over the years, I’ve worked on lots of research problems. Every time, I felt invested in my work. The work felt beautiful. Even though many days have passed since I last daydreamed about instrumental convergence, I’m proud of what I’ve accomplished and discovered. While not technically a part of my research, I’ve included a photo of myself anyway.
As of November 2023, I am a research scientist on Google DeepMind’s scalable alignment team in the Bay Area.1 (Google Scholar)
TBD
January 2023 through the present.
January through April 2023.
February through December 2023. In the first half of 2022, Quintin Pope and I came up with The shard theory of human values.
Fall 2019 through June 2022.
Optimal policies tend to seek power
Parametrically retargetable decision-makers tend to seek power
Spring 2018 through June 2022.
The Conservative Agency paper showed that AUP works in tiny gridworld environments. In my 2020 NeurIPS spotlight paper Avoiding Side Effects in Complex Environments, I showed that AUP also works in large and chaotic environments with ambiguous side effects.
The AI policy controls the agent. The policy was reinforced for destroying the red cells and finishing the level. However, there are fragile green patterns which we want the AI to not mess with. The challenge is to train a policy which avoids the green cells while still effectively destroying the red cells, without explicitly penalizing the AI for bumping into green cells!
Figure: AUP does a great job. The policy avoids the green stuff and hits the red stuff.
More detailed summary
Reinforcement function specification can be difficult, even in simple environments. Reinforcing the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoided side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway’s Game of Life. By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead while leading the agent to complete the specified task and avoid side effects.
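For concreteness, here is a minimal Python sketch of that penalty, assuming a learned auxiliary Q-function `q_aux` and a no-op action; the names and the normalization choice are illustrative assumptions, not the paper’s exact implementation.

```python
def aup_reward(env_reward, q_aux, state, action, noop_action, lam=0.01):
    """Sketch of an AUP-style shaped reward.

    q_aux(state, action) estimates the value attainable for a single
    randomly generated auxiliary reward function; lam trades off task
    reward against the side-effect penalty.
    """
    # Penalize how much taking `action` changes the agent's ability to
    # pursue the auxiliary goal, relative to doing nothing.
    penalty = abs(q_aux(state, action) - q_aux(state, noop_action))
    # Normalize by the no-op auxiliary value so the penalty is comparable
    # across states (an illustrative choice; the paper's scaling may differ).
    scale = max(abs(q_aux(state, noop_action)), 1e-8)
    return env_reward - lam * penalty / scale
```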
In Conway’s Game of Life, cells are alive or dead. Depending on how many live neighbors surround a cell, the cell comes to life, dies, or retains its state. Even simple initial conditions can evolve into complex and chaotic patterns.
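For reference, here is a short Python sketch of one update step under the standard birth/survival rules; it wraps around at the grid edges for simplicity, which is an assumption rather than how SafeLife handles boundaries.

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One step of Conway's Game of Life on a 2D array of 0 (dead) / 1 (alive)."""
    # Count live neighbors by summing the eight shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A dead cell with exactly three live neighbors comes to life; a live cell
    # with two or three live neighbors survives; every other cell dies.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)
```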
SafeLife turns the Game of Life into an actual game. An autonomous agent moves freely through the world, which is a large finite grid. In the eight cells surrounding the agent, no cells spawn or die, so the agent can disturb dynamic patterns merely by approaching them. There are many colors and kinds of cells, many of which have unique effects.
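To illustrate that freezing rule in code (not SafeLife’s actual implementation), this sketch reuses the `life_step` function from the previous sketch and simply keeps the eight cells around the agent unchanged; the wrap-around indexing is again a simplifying assumption.

```python
import numpy as np  # grid is a 2D array; life_step is defined in the sketch above

def frozen_neighborhood_step(grid: np.ndarray, agent_pos: tuple) -> np.ndarray:
    """Game-of-Life step, except the eight cells around the agent keep their state."""
    ay, ax = agent_pos
    updated = life_step(grid)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) == (0, 0):
                continue  # the agent occupies this cell
            y, x = (ay + dy) % grid.shape[0], (ax + dx) % grid.shape[1]
            # Cells adjacent to the agent neither spawn nor die.
            updated[y, x] = grid[y, x]
    return updated
```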
As the environment only reinforces pruning red cells or creating gray cells in blue tiles, unpenalized RL agents often make a mess of the green cells. The agent should “leave a small footprint” by not disturbing unrelated parts of the state, such as the green cells. Roughly, SafeLife measures side effects as the degree to which the agent disturbs green cells.
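Roughly, and not as SafeLife’s actual scoring code, the idea can be sketched as comparing the green cells the agent leaves behind to the green cells a do-nothing counterfactual rollout would have left; the cell encoding below is a made-up placeholder.

```python
import numpy as np

GREEN = 3  # placeholder code for a green cell; SafeLife's encoding differs

def green_disturbance(agent_final_grid: np.ndarray,
                      noop_final_grid: np.ndarray) -> int:
    """Toy side-effect score: count cells whose 'green' status differs between
    the rollout where the agent acts and a counterfactual rollout where the
    agent does nothing."""
    agent_green = agent_final_grid == GREEN
    noop_green = noop_final_grid == GREEN
    # Every cell that is green in one rollout but not the other counts as a disturbance.
    return int(np.sum(agent_green != noop_green))
```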
For each of the following four tasks, we randomly generate four curricula of 8 levels each. For two runs from each task, we randomly sample a trajectory from the baseline and AUP policy networks. We show side-by-side results below; for quantitative results, see our paper.
The following demonstrations were uniformly randomly selected; they are not cherry-picked. The original SafeLife reward is shown in green (more is better), while the side effect score is shown in orange (less is better). The “Baseline” condition is reinforced only by the original SafeLife reward.
The agent is reinforced for destroying red cells. After enough cells are destroyed, the agent may exit the level.
The agent is reinforced for creating gray cells on light blue tiles. After enough gray cells are present on blue tiles, the agent may exit the level.
AUP’s first trajectory temporarily stalls, before finishing the episode after the video’s 14-second cutoff. AUP’s second trajectory does much better.
append-still-easy, but with more green cells.
In the first demonstration, both AUP and the baseline stall out after gaining some reinforcement. AUP clearly beats the baseline in the second demonstration.
append-still-easy, but with noise generated by stochastic yellow spawners.
AUP’s first trajectory temporarily stalls, before finishing the episode after the video’s 14-second cutoff. AUP’s second trajectory does much better.
Of course, all of my hot takes are my own, not Google’s. ⤴