This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
Acknowledgments: This article is a writeup of a research project conducted through the SERI program under the mentorship of Alex Turner. I (Jacob Stavrianos) would like to thank Alex for turning a messy collection of ideas into legitimate research, as well as the wonderful researchers at SERI for guiding the project and putting me in touch with the broader X-risk community.
In the single-agent setting, Seeking Power is Often Convergently Instrumental in MDPs showed that optimal policies tend to choose actions which pursue “power” (reasonably formalized). In the multi-agent setting, the Catastrophic Convergence Conjecture presented intuitions that “most agents” will “fight over resources” when they get “sufficiently advanced.” However, it wasn’t clear how to formalize that intuition.
This post synthesizes single-agent power dynamics (which we believe are now somewhat well-understood in the MDP setting) with the multi-agent setting. The multi-agent setting is important for AI alignment, since we want to reason clearly about when AI agents disempower humans. Assuming constant-sum games (i.e. maximal misalignment between agents), this post presents a result which echoes the intuitions in the Catastrophic Convergence Conjecture post: as agents become “more advanced”, “power” becomes increasingly scarce & constant-sum.
You’re working on a project with a team of your peers. In particular, your actions affect the final deliverable, but so do those of your teammates. Say that each member of the team (including you) has some goal for the deliverable, which we can express as a reward function over the set of outcomes. How well (in terms of your reward function) can you expect to do?
It depends on your teammates’ actions. Let’s first ask “given my opponent’s actions, what’s the highest expected reward I can attain?”
We can start by imagining the case where everyone does exactly what you’d want them to do. Mathematically, this allows you to obtain the globally maximal reward, or “the best possible reward assuming you can choose everyone else’s actions.” Intuitively, this looks like your team sitting you down for a meeting, asking what you want them to do for the project, and carrying out orders without fail. As expected, this case is “the best you can hope for” in a formal sense.
Now, imagine the case where everyone does exactly what you don’t want them to do. Mathematically, this is the worst possible case; every other choice of teammates’ actions is at least as good as this one. Intuitively, this case is pretty terrible for you. Imagine the previous case, but instead of following orders your team actively sabotages them. Alternatively, imagine that your team spends the meeting breaking your knees and your laptop.
However, scenarios where your team is perfectly aligned either with or against you are rare. More typically, we model people as maximizing their own reward, with imperfect correlation between reward functions. Interpreting our example as a multi-player game, we can consider the case where the players’ strategies form a Nash equilibrium: every person’s action is optimal for themselves given the actions of the rest of their team. This case is both relatively general and structured enough to make claims about; we will use it as a guiding example for the formalism below.
Many attempts have been made to classify the convergently instrumental goals of AI, with the aims of understanding why such goals emerge given seemingly-unrelated utilities and, ultimately, of counterbalancing (either implicitly or explicitly) undesirable convergently instrumental subgoals. One promising such attempt is based on POWER (the technical term is written in all-caps to distinguish it from normal use of the word). Consider an agent with some space of actions, which receives rewards depending on the chosen actions (formally, an agent in an MDP). Then, POWER is roughly “ability to achieve a wide variety of goals.” It’s been shown that POWER is convergently instrumental given certain conditions on the environment, but currently no formalism exists describing the POWER of different agents interacting with each other.
Since we’ll be working with POWER for the rest of this post, we need a solid definition to build off of. We present a simplified version of the original definition:
Definition: Consider a scenario in which an agent has a set of actions $A$ and a distribution $\mathcal{D}$ over reward functions $R : A \to \mathbb{R}$. Then, we define the POWER of that agent as

$$\mathrm{POWER} := \mathbb{E}_{R \sim \mathcal{D}}\left[\max_{a \in A} R(a)\right].$$
As an example, we can rewrite the project example from earlier in terms of POWER. Let your goal for the project be chosen from some distribution $\mathcal{D}$ (maybe you want it done nicely, or fast, or to feature some cool thing that you did, etc). Then, your POWER is the maximum extent to which you can accomplish that goal, in expectation.
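To make the definition concrete, here is a minimal Python sketch (mine, not from the original research) that estimates single-agent POWER by Monte Carlo: sample a reward function from the distribution, take the best attainable reward under that sample, and average. The three-action “project” action set and the i.i.d. uniform reward distribution are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy action set for the "group project" example; purely illustrative.
ACTIONS = ["finish_early", "polish_visuals", "add_cool_feature"]

def sample_reward():
    """Sample a reward function R: actions -> reals.

    We assume rewards are i.i.d. uniform on [0, 1]; the definition of
    POWER allows any distribution over reward functions.
    """
    return dict(zip(ACTIONS, rng.uniform(0.0, 1.0, size=len(ACTIONS))))

def estimate_power(n_samples: int = 100_000) -> float:
    """Monte Carlo estimate of POWER = E_{R ~ D}[ max_a R(a) ]."""
    best_rewards = [max(sample_reward().values()) for _ in range(n_samples)]
    return float(np.mean(best_rewards))

print(f"Estimated POWER: {estimate_power():.3f}")  # ~0.75 for three uniform rewards
```

With three i.i.d. uniform rewards, the estimate converges to $3/4$, the expected maximum of three uniform draws.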
However, this model of POWER can’t account for the actions of other agents in the environment (what about what your teammates do? Didn’t we already show that it matters a lot?). To say more about the example, we’ll need a generalization of POWER.
We now consider a more realistic scenario: not only are you an agent with a notion of reward and POWER, but so is everyone else, all playing the same multiplayer game. We can even revisit the project example and go through the cases for your teammates’ actions in terms of POWER:
- Everyone plays nice
- Your team works to maximize your reward in every case, which (with some assumptions) maximizes your POWER over the space of all choices of teammate actions.
- Everyone plays mean
- Your team works to minimize your reward in every case, which analogously minimizes your POWER.
- Somewhere in-between
- We have a Nash equilibrium of the game used to define multi-agent POWER. In particular, each player’s action is a best-response to the actions of every other player. We’ll see a parallel between this best-response property and the $\max$ term in the definition of POWER pop up in the discussion of constant-sum games.
To extend our formal definition of POWER to the multi-agent case, we’ll need to define a type of multiplayer normal-form game called a Bayesian game. We describe these games below:
- At the beginning of the game, each of the $n$ players is assigned a type $\theta_i \in \Theta_i$ from a joint type distribution $p$ over $\Theta_1 \times \cdots \times \Theta_n$. The distribution $p$ is common knowledge.
- The players then (independently, not sequentially) choose actions $a_i \in A_i$, resulting in an action profile $a = (a_1, \dots, a_n)$.
- Player $i$ then receives reward $R_i(a; \theta_i)$ (crucially, a player’s reward can depend on their type).
Strategies (technically, mixed strategies) in a Bayesian game are given by functions $s_i : \Theta_i \to \Delta(A_i)$, mapping each of a player’s types to a distribution over their actions. Thus, even given a fixed strategy profile $s = (s_1, \dots, s_n)$, any notion of “expected reward of an action” will have to account for uncertainty in other players’ types. We do so by defining the interim expected utility for player $i$ as follows:

$$\mathrm{IEU}_i(s \mid \theta_i) := \mathbb{E}_{\theta_{-i},\, a_{-i} \sim s_{-i}(\theta_{-i})}\left[R_i\left(s_i(\theta_i), a_{-i}; \theta_i\right)\right],$$
where the expectation is taken over the following:
- the posterior distribution over opponents’ types $\theta_{-i}$—in other words, what types you expect other players to have, given your type.
- random choice of opponents’ actions $a_{-i} \sim s_{-i}(\theta_{-i})$—even if you know someone’s type, they might implement a mixed strategy which stochastically selects actions.
Further, we can define a (Bayesian) Nash Equilibrium to be a strategy profile $s$ where each player’s strategy $s_i$ is a best response to their opponents’ strategies in terms of interim expected utility.
POWER in a Bayesian game: Fix a strategy profile $s$. We define player $i$’s POWER as

$$\mathrm{POWER}_i(s) := \mathbb{E}_{\theta_i}\left[\max_{a_i \in A_i}\, \mathbb{E}_{\theta_{-i},\, a_{-i} \sim s_{-i}(\theta_{-i})}\left[R_i(a_i, a_{-i}; \theta_i)\right]\right].$$
Intuitively, POWER is maximum (expected) reward given a distribution of possible goals. The difference from the single-agent case is that your reward is now influenced by other players’ actions (handled by taking an expectation over opponents’ strategies).
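To illustrate the two definitions above, here is a small self-contained Python sketch (my own, not from the original research) of a two-player Bayesian game. The type spaces, actions, payoff numbers, type probabilities, and the fixed strategy profile are all invented for illustration; only the structure mirrors the definitions in the post. The sketch computes player 1’s interim expected utility under a fixed strategy profile, and their POWER.

```python
# Toy two-player Bayesian game. Types, actions, payoffs, and the fixed
# strategy profile below are invented for illustration.
TYPES = {1: ["wants_speed", "wants_quality"], 2: ["default"]}
ACTIONS = {1: ["rush", "polish"], 2: ["help", "slack"]}
TYPE_PROB_1 = {"wants_speed": 0.5, "wants_quality": 0.5}  # marginal over player 1's type

def posterior_2(theta1):
    """Posterior over player 2's type given player 1's type (trivial here,
    since player 2 has a single type)."""
    return {"default": 1.0}

def reward_1(a1, a2, theta1):
    """Player 1's reward R_1(a_1, a_2; theta_1). Player 2's reward is
    omitted because we only compute player 1's quantities."""
    base = {"wants_speed": {"rush": 0.6, "polish": 0.4},
            "wants_quality": {"rush": 0.2, "polish": 0.8}}[theta1]
    bonus = 0.3 if a2 == "help" else 0.0
    return base[a1] + bonus

# A strategy maps each of a player's types to a distribution over actions.
strategy = {
    1: {"wants_speed": {"rush": 1.0, "polish": 0.0},
        "wants_quality": {"rush": 0.5, "polish": 0.5}},
    2: {"default": {"help": 0.7, "slack": 0.3}},
}

def expected_reward_1(a1, strat, theta1):
    """Expectation over player 2's type and action of R_1, for a fixed a1."""
    return sum(p2 * q2 * reward_1(a1, a2, theta1)
               for theta2, p2 in posterior_2(theta1).items()
               for a2, q2 in strat[2][theta2].items())

def ieu_1(strat, theta1):
    """Interim expected utility: also average over player 1's own mixed action."""
    return sum(q1 * expected_reward_1(a1, strat, theta1)
               for a1, q1 in strat[1][theta1].items())

def power_1(strat):
    """POWER_1(s) = E_{theta_1}[ max_{a_1} E_{theta_2, a_2}[ R_1(a_1, a_2; theta_1) ] ]."""
    return sum(p * max(expected_reward_1(a1, strat, t1) for a1 in ACTIONS[1])
               for t1, p in TYPE_PROB_1.items())

for t1 in TYPES[1]:
    print(f"IEU_1(s | {t1}) = {ieu_1(strategy, t1):.2f}")
print(f"POWER_1(s) = {power_1(strategy):.2f}")  # 0.91 here, vs. a type-averaged IEU of 0.76
```

Note that player 1’s POWER weakly exceeds their type-averaged interim expected utility, since the $\max$ inside POWER lets player 1 best-respond for each type; this gap is exactly what the constant-sum result below keys on.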
As both a preliminary result and a reference point for intuition, we consider the special case of zero-sum games:
A zero-sum game is a game in which, for every possible outcome of the game, the players’ rewards sum to zero. For Bayesian games, this means that for all type profiles $\theta$ and action profiles $a$, we have $\sum_i R_i(a; \theta_i) = 0$. Similarly, a constant-sum game is a game satisfying $\sum_i R_i(a; \theta_i) = C$ for some constant $C$ and any choices of $\theta$ and $a$.
As a simple example, consider chess, a two-player adversarial game. We let the reward profile be constant across types, given by “1 if you win, -1 if you lose” (assume black wins in a tie). This game is clearly zero-sum, since exactly one player wins and the other loses. We could ask the same “how well can you do?” question as before, but the upper bound of winning is trivial. Instead, we ask “how well can both players simultaneously do?”
Clearly, you can’t both simultaneously win. However, we can imagine scenarios where both players have the POWER to win: in a chess game between two beginners, the optimal strategy for either player will easily win the game. As it turns out, this argument generalizes (we’ll even prove it): in a constant-sum game, the sum of the players’ POWER satisfies $\sum_i \mathrm{POWER}_i(s) \geq C$, with equality iff each player responds optimally for all their possible goals (“types”). This condition is equivalent to $s$ being a Bayesian Nash Equilibrium of the game.
Importantly, this idea suggests a general principle of multi-agent POWER I’ll call POWER-scarcity: in multi-agent games, one player gaining POWER tends to come at the expense of another player losing POWER. Future research will focus on understanding this phenomenon further and relating it to “how aligned the agents are” in terms of their reward functions.
Conservation of POWER in constant-sum games: Consider a Bayesian constant-sum game with constant sum $C$ and some strategy profile $s$. Then $\sum_i \mathrm{POWER}_i(s) \geq C$, with equality iff $s$ is a (Bayesian) Nash Equilibrium.
Intuition: By definition, $s$ isn’t a Nash Equilibrium iff some player $i$’s strategy isn’t a best response. In this case, we see that player $i$ has the POWER to play optimally, but the other players also have the POWER to capitalize off of player $i$’s mistake (since the game is constant-sum). Thus, the lost reward is “double-counted” in terms of POWER; if no such double-counting exists, then the sum of POWER is just the expected sum of reward, which is $C$ by definition of a constant-sum game.
Rigorous proof: We prove the following for general strategy profiles $s$:

$$\begin{aligned}
\sum_i \mathrm{POWER}_i(s) &= \sum_i \mathbb{E}_{\theta_i}\left[\max_{a_i \in A_i}\, \mathbb{E}_{\theta_{-i},\, a_{-i} \sim s_{-i}(\theta_{-i})}\left[R_i(a_i, a_{-i}; \theta_i)\right]\right] \\
&\geq \sum_i \mathbb{E}_{\theta_i}\left[\mathbb{E}_{a_i \sim s_i(\theta_i)}\, \mathbb{E}_{\theta_{-i},\, a_{-i} \sim s_{-i}(\theta_{-i})}\left[R_i(a_i, a_{-i}; \theta_i)\right]\right] \\
&= \mathbb{E}_{\theta}\, \mathbb{E}_{a \sim s(\theta)}\left[\sum_i R_i(a; \theta_i)\right] \\
&= C.
\end{aligned}$$
Now, we claim that the inequality on line 2 is an equality iff $s$ is a Nash Equilibrium. To see this, note that for each player $i$, we have

$$\mathbb{E}_{\theta_i}\left[\max_{a_i \in A_i}\, \mathbb{E}_{\theta_{-i},\, a_{-i}}\left[R_i(a_i, a_{-i}; \theta_i)\right]\right] \;\geq\; \mathbb{E}_{\theta_i}\left[\mathbb{E}_{a_i \sim s_i(\theta_i)}\, \mathbb{E}_{\theta_{-i},\, a_{-i}}\left[R_i(a_i, a_{-i}; \theta_i)\right]\right],$$
with equality iff $s_i$ is a best response to $s_{-i}$. Thus, the sum of these inequalities over the players is an equality iff each $s_i$ is a best response, which is the definition of a Nash Equilibrium. ∎
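As a numerical sanity check on the theorem (my own sketch, not part of the original proof), the snippet below specializes to matching pennies: a two-player zero-sum game ($C = 0$) with trivial type spaces, in the spirit of the chess example above. The specific mixed strategies compared are illustrative assumptions.

```python
# Matching pennies: player 1 wants the coins to match, player 2 wants a
# mismatch. Rewards are type-independent (each player has a single type),
# so POWER_i reduces to a max over actions of expected reward.
ACTIONS = ["heads", "tails"]

def reward_1(a1, a2):
    return 1.0 if a1 == a2 else -1.0

def reward_2(a1, a2):
    return -reward_1(a1, a2)  # zero-sum: rewards always sum to C = 0

def power_1(p2_heads):
    """POWER of player 1, given player 2's mixed strategy (P[heads])."""
    return max(p2_heads * reward_1(a1, "heads") + (1 - p2_heads) * reward_1(a1, "tails")
               for a1 in ACTIONS)

def power_2(p1_heads):
    """POWER of player 2, given player 1's mixed strategy (P[heads])."""
    return max(p1_heads * reward_2("heads", a2) + (1 - p1_heads) * reward_2("tails", a2)
               for a2 in ACTIONS)

profiles = [
    (0.5, 0.5),  # the unique Nash equilibrium
    (0.9, 0.5),  # player 1 is exploitable, player 2 is not
    (0.9, 0.1),  # both players are exploitable
]
for p1, p2 in profiles:
    total = power_1(p2) + power_2(p1)
    print(f"P1 plays heads w.p. {p1}, P2 w.p. {p2}: sum of POWER = {total:.1f} (C = 0)")
```

At the equilibrium neither player can exploit the other, so total POWER collapses to the constant sum; away from it, the exploitable player’s lost reward shows up as extra POWER for their opponent, exactly the “double-counting” in the intuition above.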
To wrap up, I’ll elaborate on the implications of this theorem, as well as some areas of further exploration on POWER-scarcity:
- It initially seems unintuitive that as players’ strategies improve, their collective POWER tends to decrease. The proximate cause of this effect is something like “as your strategy improves, other players lose the POWER to capitalize off of your mistakes.” More work is probably needed to get a clearer picture of this dynamic.
- We suspect that if all players have identical rewards, then the sum of POWER is equal to the sum of best-case POWER for each player. This gives the appearance of a spectrum with [aligned rewards (common payoff), maximal sum of POWER] on one end and [anti-aligned rewards (constant-sum), constant sum of POWER] on the other. Further research might look into an interpolation between these two extremes, possibly characterized by a correlation metric between reward functions.
- We also plan to generalize POWER to Bayesian stochastic games to account for sequential decision making. Thus, any such metric for comparing reward functions would have to be consistent with such a generalization.
- POWER-scarcity results in terms of Nash Equilibria suggest the following dynamic: as agents get smarter and take available opportunities, POWER becomes increasingly scarce. This matches the intuitions presented in the Catastrophic Convergence Conjecture, where agents don’t fight over resources until they get sufficiently “advanced.”