Table of contents
 Retargetable policy-selection processes tend to select policies which seek power
 Orbit tendencies apply to many decision-making procedures
 Retargetable training processes produce instrumental convergence
 Why cognitively bounded planning agents obey the power-seeking theorems
 Discussion
 Conclusion
 Appendix: Tracking key limitations of the power-seeking theorems
 Worked example: Instrumental convergence for trained policies
 Footnotes
Why exactly should smart agents tend to usurp their creators? Previous results only apply to optimal agents tending to stay alive and preserve their future options. I extend the power-seeking theorems to apply to many kinds of policy-selection procedures, ranging from planning agents which choose plans with expected utility closest to a randomly generated number, to satisficers, to policies trained by some reinforcement learning algorithms. The key property is not agent optimality, as previously supposed, but the retargetability of the policy-selection procedure. These results hint at which kinds of agent cognition and of agent-producing processes are dangerous by default.
I mean “retargetability” in a sense similar to Alex Flint’s definition:
Quote: Retargetability. Is it possible, using only a microscopic^{1} perturbation to the system, to change the system such that it is still an optimizing system but with a different target configuration set?
A system containing a robot with the goal of moving a vase to a certain location can be modified by making just a small number of microscopic perturbations to key memory registers such that the robot holds the goal of moving the vase to a different location and the whole vase/robot system now exhibits a tendency to evolve towards a different target configuration.
In contrast, a system containing a ball rolling towards the bottom of a valley cannot generally be modified by any microscopic perturbation such that the ball will roll to a different target location.
I’m going to start from the naïve view on power-seeking arguments requiring optimality (i.e. what I thought early this summer) and explain the importance of retargetable policy-selection functions. I’ll illustrate this notion via satisficers, which randomly select a plan that exceeds some goodness threshold. Satisficers are retargetable, and so they have orbit-level instrumental convergence: for most variations of every utility function, satisficers incentivize power-seeking in the situations covered by my theorems.
Many procedures are retargetable, including every procedure which only depends on the expected utility of different plans. I think that alignment is hard in the expected utility framework not because agents will maximize too hard, but because all expected utility procedures are extremely retargetable—and thus easy to “get wrong.”
Lastly: The unholy grail of “instrumental convergence for policies trained via reinforcement learning.” I’ll state a formal criterion and some preliminary thoughts on where it applies.
To understand a range of retargetable procedures, let’s first orient towards the picture I’ve painted of power-seeking thus far. In short:
Since power-seeking tends to lead to larger sets of possible outcomes (staying alive lets you do more than dying), the agent must seek power to reach most outcomes. The power-seeking theorems say that for the vast, vast, vast majority of variants of every utility function over outcomes, the max of a larger^{2} set of possible outcomes is greater than the max of a smaller set of possible outcomes. Thus, optimal agents will tend to seek power.
But I want to step back. What I call “the power-seeking theorems” aren’t really about optimal choice. They’re about two facts.
 Being powerful means you can make more outcomes happen, and
 There are more ways to choose something from a bigger set of outcomes than from a smaller set.
For example, suppose our cute robot Frank must choose one of several kinds of fruit.
So far, I proved something like “if the agent has a utility function over fruits, then for at least 2/3 of possible utility functions it could have, it’ll be optimal to choose something from {🍌,🍎}.” This is because for every way 🍒 could be strictly optimal, you can make a new utility function that permutes the 🍒 and 🍎 reward, and another new one that permutes the 🍒 and 🍌 reward. So for every “I like 🍒 strictly more” utility function, there are at least two permuted variants which strictly prefer 🍎 or 🍌. Superficially, it seems like this argument relies on optimal decision-making.
But that’s not true. The crux is instead that we can flexibly retarget the decision-making of the agent: For every way the agent could end up choosing 🍒, we change a variable in its cognition (its utility function) and make it choose the 🍎 or 🍌 instead.
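The counting argument above can be checked by brute force. The following is a minimal sketch, assuming the three fruits are a cherry, an apple, and a banana, with the cherry as the single-fruit option; the names `FRUITS`, `F_A`, `F_B`, `orbit`, and the example utility values are illustrative choices, not taken from the post.

```python
from itertools import permutations

FRUITS = ("cherry", "apple", "banana")
F_A = {"cherry"}           # the single-fruit option set (assumed labeling)
F_B = {"apple", "banana"}  # the two-fruit option set (assumed labeling)

def optimal_choice(utility):
    """Return the fruit with the highest utility (assumes no ties)."""
    return max(utility, key=utility.get)

def orbit(utility):
    """Every utility function obtained by permuting which fruit gets which value."""
    values = [utility[fruit] for fruit in FRUITS]
    return [dict(zip(FRUITS, perm)) for perm in permutations(values)]

# Hypothetical "I like cherry strictly more" utility function.
u = {"cherry": 3.0, "apple": 1.0, "banana": 2.0}

choices = [optimal_choice(v) for v in orbit(u)]
n_A = sum(choice in F_A for choice in choices)
n_B = sum(choice in F_B for choice in choices)
print(f"orbit elements choosing from F_A: {n_A}, from F_B: {n_B}")
```

Only the utility function changes between orbit elements; the decision rule stays fixed. That is the sense in which the choice is being retargeted rather than re-derived.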
Many decision-making procedures are like this. First, a few definitions.
Note: I aim for this post to be readable without much attention paid to the math.
The agent can bring about different outcomes via different policies. In stochastic environments, these policies will induce outcome lotteries, like a 50% chance of one fruit and a 50% chance of another. Let $C$ contain all the outcome lotteries the agent can bring about.
Definition: Permuting outcome lotteries. Suppose there are $d$ outcomes. Let $X \subseteq \mathbb{R}^d$ be a set of outcome lotteries (with the probability of outcome $k$ given by the $k$th entry), and let $\phi \in S_d$ be a permutation of the $d$ possible outcomes. Then $\phi$ acts on $X$ by swapping around the labels of its elements: $\phi \cdot X := \{P_\phi x \mid x \in X\}$.^{3}
For example, let’s define the set of all possible fruit outcomes $F_C := \{🍌, 🍎, 🍒\}$ (each different fruit stands in for a standard basis vector in $\mathbb{R}^3$). Let $F_B := \{🍌, 🍎\}$ and $F_A := \{🍒\}$. Let $\phi_1 := (🍒\ 🍎)$ swap the cherry and apple, and let $\phi_2 := (🍒\ 🍌)$ transpose the cherry and banana. Both of these $\phi$ are involutions, since they either leave the fruits alone or transpose them.
Definition: Containment of set copies. Let $A, B \subseteq \mathbb{R}^d$. $B$ contains $n$ copies of $A$ when there exist involutions $\phi_1, \ldots, \phi_n$ such that $\forall i: \phi_i \cdot A =: B_i \subseteq B$ and $\forall j \neq i: \phi_i \cdot B_j = B_j$.
The subtext in the above definition: $B$ is the set of things the agent could make happen if it gained power, and $A$ is the set of things the agent could make happen without gaining power. Because power gives more options, $B$ will usually be larger than $A$. Here, we’ll talk about the case where $B$ contains many copies of $A$.
In the fruit context:
 $\phi_1 \cdot F_A := \{\phi_1(🍒)\} = \{🍎\} \subsetneq \{🍌, 🍎\} =: F_B.$
 $\phi_2 \cdot F_A := \{\phi_2(🍒)\} = \{🍌\} \subsetneq \{🍌, 🍎\} =: F_B.$
Note that $\phi_1 \cdot \{🍌\} = \{🍌\}$ and $\phi_2 \cdot \{🍎\} = \{🍎\}$. Each $\phi$ leaves the other subset of $F_B$ alone. Therefore, $F_B := \{🍌, 🍎\}$ contains two copies of $F_A := \{🍒\}$ via the involutions $\phi_1$ and $\phi_2$.
Further note that $\phi_i \cdot F_C = F_C$ for $i = 1, 2$. The involutions just shuffle around options, instead of changing the set of available outcomes.
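The two definitions above are mechanical enough to check in code. Here is a small sketch, assuming outcomes are indexed 0 = cherry, 1 = apple, 2 = banana and that a permutation is stored as a tuple `phi` with `phi[j]` the image of outcome `j`. The helper names (`permute`, `act`, `contains_copies`) are hypothetical.

```python
def permute(phi, x):
    """Apply P_phi to lottery x: the probability on outcome j moves to outcome phi[j]."""
    y = [0.0] * len(x)
    for j, p in enumerate(x):
        y[phi[j]] = p
    return tuple(y)

def act(phi, X):
    """phi . X: permute every lottery in the set X."""
    return frozenset(permute(phi, x) for x in X)

def is_involution(phi):
    return all(phi[phi[j]] == j for j in range(len(phi)))

def contains_copies(B, A, phis):
    """Check 'B contains len(phis) copies of A' for the given candidate involutions."""
    if not all(is_involution(phi) for phi in phis):
        return False
    copies = [act(phi, A) for phi in phis]
    if not all(copy <= B for copy in copies):
        return False
    # Each phi_i must leave every other copy B_j (j != i) unchanged.
    return all(act(phis[i], copies[j]) == copies[j]
               for i in range(len(phis)) for j in range(len(phis)) if i != j)

# Standard basis vectors stand in for the fruits (assumed index order).
cherry, apple, banana = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
F_A = frozenset({cherry})
F_B = frozenset({apple, banana})
F_C = frozenset({cherry, apple, banana})
phi1 = (1, 0, 2)  # swaps cherry and apple
phi2 = (2, 1, 0)  # swaps cherry and banana

print(contains_copies(F_B, F_A, [phi1, phi2]))
print(act(phi1, F_C) == F_C and act(phi2, F_C) == F_C)
```

The check is written for arbitrary finite sets of lotteries, so it also applies when the elements are genuine probability vectors rather than basis vectors.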
So suppose Frank is deciding whether he wants a fruit from $F_A := \{🍒\}$ or from $F_B := \{🍌, 🍎\}$. It’s definitely possible to be motivated to pick 🍒. However, it sure seems like for lots of ways Frank might make decisions, most parameter settings (utility functions) will lead to Frank picking 🍌 or 🍎. There are just more outcomes in $F_B$, since it contains two copies of $F_A$!
Definition: Orbit tendencies. Let $f_1, f_2: \mathbb{R}^d \to \mathbb{R}$ be functions from utility functions to real numbers, let $U \subseteq \mathbb{R}^d$ be a set of utility functions, and let $n \geq 1$. $f_1 \geq^n_{\text{most: } U} f_2$ when for all utility functions $u \in U$:
$$\underbrace{\left|\{u_\phi \in S_d \cdot u \mid f_1(u_\phi) > f_2(u_\phi)\}\right|}_{\text{\# of permutations of } u \text{ for which } f_1 > f_2} \geq n \underbrace{\left|\{u_\phi \in S_d \cdot u \mid f_1(u_\phi) < f_2(u_\phi)\}\right|}_{\text{\# of permutations of } u \text{ for which } f_1 < f_2}.$$
In this post, if I don’t specify a subset $U$, that means the statement holds for $U = \mathbb{R}^d$. For example, the past results show that $\text{IsOptimal}(F_B) \geq^2_{\text{most}} \text{IsOptimal}(F_A)$; this implies that for every utility function, at least 2/3 of its orbit makes $F_B$ optimal.
Note: For simplicity, I’ll focus on “for most utility functions” instead of “for most distributions over utility functions”, even though most of the results apply to the latter.
For example, suppose the agent is a satisficer. I’ll define this as: The agent uniformly randomly selects an outcome lottery with expected utility exceeding some threshold $t$.
Definition: Satisficing. For finite $X \subseteq C \subsetneq \mathbb{R}^d$ and utility function $u \in \mathbb{R}^d$, define $\text{Satisfice}_t(X, C \mid u) := \frac{|X \cap \{c \in C \mid c^\top u \geq t\}|}{|\{c \in C \mid c^\top u \geq t\}|}$, with the function returning 0 when the denominator is 0. $\text{Satisfice}_t$ returns the probability that the agent selects a $u$-satisficing outcome lottery from $X$.
And you know what? Those ever-so-suboptimal satisficers also are “twice as likely” to choose elements from $F_B$ as from $F_A$.
Fact: $\text{Satisfice}_t(\{🍌, 🍎\}, \{🍌, 🍎, 🍒\} \mid u) \geq^2_{\text{most}} \text{Satisfice}_t(\{🍒\}, \{🍌, 🍎, 🍒\} \mid u)$.
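A quick numerical check of this Fact, as a sketch under stated assumptions: outcomes are deterministic (so a lottery's expected utility is just the utility of its fruit), the fruits are labeled cherry, apple, banana with the cherry as the single-fruit set, and the helper names `satisfice` and `orbit_counts` are illustrative. Exact fractions are used so that ties are detected reliably.

```python
from fractions import Fraction
from itertools import permutations

OUTCOMES = ("cherry", "apple", "banana")
C = set(OUTCOMES)
F_A = {"cherry"}            # assumed labeling of the smaller set
F_B = {"apple", "banana"}   # assumed labeling of the larger set

def satisfice(X, C, u, t):
    """Chance that a uniform draw from the u-satisficing elements of C lands in X."""
    good = {c for c in C if u[c] >= t}
    if not good:
        return Fraction(0)  # convention from the definition: 0 when nothing satisfices
    return Fraction(len(X & good), len(good))

def orbit_counts(u, t, F_A, F_B):
    """Count distinct orbit elements of u that favor F_B, and that favor F_A."""
    values = [u[o] for o in OUTCOMES]
    more_B = more_A = 0
    for perm in set(permutations(values)):
        v = dict(zip(OUTCOMES, perm))
        a = satisfice(F_A, C, v, t)
        b = satisfice(F_B, C, v, t)
        more_B += b > a
        more_A += a > b
    return more_B, more_A

# Hypothetical utility function where only the cherry clears the threshold.
print(orbit_counts({"cherry": 1, "apple": 0, "banana": 0}, Fraction(1, 2), F_A, F_B))
```

For every utility function and threshold tried, the first count should be at least twice the second, which is what the $n = 2$ orbit tendency asserts.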
Why? Here are the two key properties that $\text{Satisfice}_t$ has:
1. $\text{Satisfice}_t$ doesn’t care what “label” an outcome lottery has, just its expected utility. Suppose that for utility function $u$, 🍒 is one of two $u$-satisficing elements: 🍒 has a 1/2 chance of being selected by the $u$-satisficer. Then $\phi_1 \cdot 🍒 = 🍎$ has a 1/2 chance of being selected by the $(\phi_1 \cdot u)$-satisficer. If you swap what fruit you’re considering, and you also swap the utility for that fruit to match, then that fruit’s selection probability remains the same.
More precisely, let $\text{apple} := \{🍎\}$, $\text{cherry} := \{🍒\}$, and $C := \{🍌, 🍎, 🍒\}$. Then
$$\text{Satisfice}_t(\text{cherry}, C \mid u) = \text{Satisfice}_t(\phi_1 \cdot \text{cherry}, \phi_1 \cdot C \mid \phi_1 \cdot u) = \text{Satisfice}_t(\text{apple}, C \mid \phi_1 \cdot u).$$
In a sense, $\text{Satisfice}_t$ is not “biased” against 🍎: by changing the utility function, you can advantage 🍎 so that it’s now as probable as 🍒 was before.
Note: While $\text{Satisfice}_t$ is invariant under joint permutation, all we need in general is that it be weakly increasing under both $\phi_1$ and $\phi_2$. Formally, $\text{Satisfice}_t(F_A, F_C \mid u) \leq \text{Satisfice}_t(\phi_1 \cdot F_A, \phi_1 \cdot F_C \mid \phi_1 \cdot u)$ and $\text{Satisfice}_t(F_A, F_C \mid u) \leq \text{Satisfice}_t(\phi_2 \cdot F_A, \phi_2 \cdot F_C \mid \phi_2 \cdot u)$. This allows for decision-making functions which are biased towards picking a fruit from $F_B$.
2. Satisficers must have at least as great a probability of selecting an outcome lottery from a superset as from one of its subsets.
Formally, if $X' \subseteq X$, then it must hold that $\text{Satisfice}_t(X', C \mid u) \leq \text{Satisfice}_t(X, C \mid u)$. And indeed this holds: Supersets can only contain a weakly greater fraction of $C$’s satisficing elements.
If (1) and (2) hold for a function, then that function will obey the orbit tendencies. The power-seeking theorems apply to:
 Expected utility maximizing agents.
 EU minimizing agents.
 Notice that EU minimization is equivalent to maximizing $−1×$ a utility function. This is a hint that EU maximization instrumental convergence is only a special case of something much broader.
 Boltzmann-rational agents which are exponentially more likely to choose outcome lotteries with greater expected utility.
 Agents which uniformly randomly draw $k$ outcome lotteries and then choose the best.
 Satisficers.
 Quantilizers with a uniform^{4} base distribution.
But that’s not all. There’s more. If the agent makes decisions only based on the expected utility of different plans,^{5} then the power-seeking theorems apply. And I’m not just talking about EU maximizers. I’m talking about any function which only depends on expected utility: EU minimizers, agents which choose plans if and only if their EU is equal to 1, agents which grade plans based on how close their EU is to some threshold value. There is no clever EU-based scheme which doesn’t have orbit-level power-seeking incentives.
EU-based decision-making tends to seek power. Suppose $n$ is large, and that most outcomes in $B$ are bad, and that the agent makes decisions according to expected utility. Then alignment is hard because for every way things could go right, there are at least $n$ ways things could go wrong! And $n$ can be huge. In a previous toy example, $n$ equaled $10^{182}$.
It doesn’t matter if the decision-making procedure $f$ is rational, or anti-rational, or Boltzmann-rational, or satisficing, or randomly choosing outcomes, or only choosing outcome lotteries with expected utility equal to 1: There are more ways to choose elements of $B$ than there are ways to choose elements of $A$.
These results also have closure properties. For example, closure under mixing decision procedures, like when the agent has a 50% chance of selecting Boltzmann rationally and a 50% chance of satisficing. Or even more exotic transformations: Suppose the probability of $f$ choosing something from $X$ is proportional to
$$P(X \text{ is Boltzmann-rational under } u) \cdot P(X \text{ satisfices } u) + P(X \text{ is optimal for } u).$$
Then the theorems still apply.
There is no possible way to combine EU-based decision-making functions so that orbit-level instrumental convergence doesn’t apply to their composite.
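To make the "any EU-based rule" claim concrete, here is a minimal sketch that brute-forces the orbit counts for several rules and a 50/50 mixture. Everything in it is an illustrative assumption: four deterministic outcomes, with $A$ the single outcome 0 and $B$ the outcomes 1 to 3 (so $B$ contains three copies of $A$), and hypothetical rule names. Each rule sees only the list of expected utilities.

```python
import math
from itertools import permutations

A = {0}        # outcome reachable without gaining power (assumed toy setup)
B = {1, 2, 3}  # outcomes reachable after gaining power: three copies of A

def uniform_over(mask):
    """Uniform distribution over the outcomes flagged True in mask."""
    k = sum(mask)
    return [m / k if k else 0.0 for m in mask]

def boltzmann(eus):
    weights = [math.exp(e) for e in eus]
    total = sum(weights)
    return [w / total for w in weights]

def eu_minimizer(eus):
    return uniform_over([e == min(eus) for e in eus])

def closest_to(target):
    """Rule that picks uniformly among outcomes whose EU is closest to target."""
    def rule(eus):
        best = min(abs(e - target) for e in eus)
        return uniform_over([abs(e - target) == best for e in eus])
    return rule

def mixture(rule1, rule2):
    """50/50 mixture of two decision rules."""
    def rule(eus):
        return [0.5 * p + 0.5 * q for p, q in zip(rule1(eus), rule2(eus))]
    return rule

def orbit_counts(rule, u):
    """Count permutations of u for which the rule favors B, and for which it favors A."""
    more_B = more_A = 0
    for v in permutations(u):
        probs = rule(list(v))
        f_A = sum(probs[i] for i in A)
        f_B = sum(probs[i] for i in B)
        more_B += f_B > f_A
        more_A += f_A > f_B
    return more_B, more_A

RULES = {
    "boltzmann": boltzmann,
    "eu_minimizer": eu_minimizer,
    "closest_to_1": closest_to(1),
    "mixture": mixture(boltzmann, closest_to(1)),
}
for name, rule in RULES.items():
    print(name, orbit_counts(rule, (2, 0, 1, 1)))
```

In each case the first count should be at least three times the second, matching the number of copies of $A$ inside $B$. Floating-point comparison is adequate here only because the utilities are small integers; a more careful version would use exact arithmetic.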
 Rule out most power-seeking orbit elements a priori (aka “know a lot about what objectives you’ll specify”)
 As a contrived example, suppose the agent sees a green pixel iff it sought power, but we know that the specified utility function zeros the output if a green pixel is detected along the trajectory. Here, this would be enough information about the objective to update away from the default position that formal power-seeking is probably incentivized.
 This seems risky, because much of the alignment problem comes from not knowing the consequences of specifying an objective function.
 Use a decision-making procedure with intrinsic bias towards the elements of $A$
 For example, imitation learning is not EU-based, but is instead biased to imitate the non-crazy-power-seeking behavior shown on the training distribution.
 For example, modern RL algorithms will not reliably produce policies which seek real-world power, because the policies won’t reach or reason about that part of the state space anyways. This is a bias towards non-power-seeking plans.
 Pray that the relevant symmetries don’t hold.
 Often, they won’t hold exactly. But common sense dictates that they don’t have to hold exactly for instrumental convergence to exist: If you inject $ϵ$ irregular randomness to the dynamics, do agents stop tending to stay alive? Orbit-level instrumental convergence is just a particularly strong version.
 Find an ontology (like POMDPs or infinite MDPs) where the results don’t apply for technical reasons.
 Orbit-level arguments seem easy to apply to a range of previously unmentioned settings, like causal DAGs with choice nodes, so I don’t see why POMDPs should be any nicer.
 Ideally, we’d ground agency in a way that makes alignment simple and natural, which automatically evades these arguments for doom.
 Don’t do anything with policies.
 Example: microscope AI.
Lastly, we maybe don’t want to escape these incentives entirely, because we probably want smart agents which will seek power for us. I think that empirically, the power-requiring outcomes of $B$ are mostly induced by the agent first seeking power over humans.
These results let us start talking about the incentives of real-world trained policies. In an appendix, I work through a specific example of how Q-learning on a toy example provably exhibits orbit-level instrumental convergence. The problem is small enough that I computed the probability that each final policy was trained.
Realistically, we aren’t going to get a closed-form expression for the distribution over policies learned by PPO with randomly initialized deep networks trained via SGD with learning rate schedules and dropout and intrinsic motivation, etc. But we don’t need it. These results give us a formal criterion for when policy-training processes will tend to produce policies with convergent instrumental incentives.
The idea is: Consider some set of reward functions, and let $B$ contain $n$ copies of $A$. Then if, for each reward function in the set, you can retarget the training process so that $B$’s copy of $A$ is at least as likely as $A$ was originally, these reward functions will tend to train policies which go to $B$.
For example, if agents trained on objectives $R$ tend to go right, switching reward from right-states to left-states also pushes the trained policies to go left. This can happen when changing the reward changes what was “reinforced” about going right, to now make it “reinforced” to go left.
Suppose we’re training an RL agent to go right in MuJoCo, with reward equal to its $x$-coordinate.
This criterion is going to be a bit of a mouthful. The basic idea is that when the training process can be redirected such that trained agents induce a variety of outcomes, then most objective functions will train agents which do induce those outcomes. In other words: Orbit-level instrumental convergence will hold.
Theorem: Training retargetability criterion. Suppose the agent interacts with an environment with $d$ potential outcomes (e.g. world states or observation histories). Let $P$ be a probability distribution over joint parameter space $\Theta$, and let $\text{train}: \Theta \times \mathbb{R}^d \to \Delta(\Pi)$ be a policy training procedure which takes in a parameter setting and utility function $u \in \mathbb{R}^d$, and which produces a probability distribution over policies.
Let $U \subseteq \mathbb{R}^d$ be a set of utility functions which is closed under permutation. Let $A, B$ be sets of outcome lotteries such that $B$ contains $n$ copies of $A$ via $\phi_1, \ldots, \phi_n$. Then we quantify the probability that the trained policy induces an element of outcome lottery set $X \subseteq \mathbb{R}^d$:
$$f(X \mid u) := \mathbb{P}_{\theta \sim P,\ \pi \sim \text{train}(\theta, u)}(\pi \text{ does something in } X).$$
If $\forall u \in U, i \in \{1, \ldots, n\}$: $f(A \mid u) \leq f(\phi_i \cdot A \mid \phi_i \cdot u)$, then $f(B \mid u) \geq^n_{\text{most: } U} f(A \mid u)$.
Proof. If $X' \subseteq X$, then $f(X' \mid u) \leq f(X \mid u)$ by the monotonicity of probability, and so (2): order-preserving on the first argument holds. By assumption, (1): increasing under joint permutation holds. Therefore, Lemma B.6 (in the linked paper) implies the desired result. ∎
This criterion is testable. Although we can’t test all reward functions, we can test how retargetable the training process is in simulated environments for a variety of reward functions. If it can’t retarget easily for reasonable objectives, then we conclude^{6} that instrumental convergence isn’t arising from retargetability at the training process level.
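One way such a test could look, as a rough sketch rather than a recipe: treat training as a black box, estimate $f(A \mid u)$ and $f(\phi_i \cdot A \mid \phi_i \cdot u)$ by sampling, and compare them up to a tolerance. Both `noisy_train` and `stuck_train` below are hypothetical stand-ins for a real training run, and the sample count and tolerance are arbitrary choices.

```python
import random

D = 3  # number of outcomes (toy setting; outcome indices are arbitrary labels)

def noisy_train(rng, u):
    """Hypothetical retargetable trainer: random initialization is modeled as
    Gaussian noise on the utilities, and the policy reaches the noisy argmax."""
    noisy = [u_k + rng.gauss(0.0, 1.0) for u_k in u]
    return max(range(D), key=noisy.__getitem__)

def stuck_train(rng, u):
    """Hypothetical non-retargetable trainer: always ends at outcome 0."""
    return 0

def estimate_f(train, X, u, samples=20_000, seed=0):
    """Monte Carlo estimate of f(X | u), the chance a trained policy ends in X."""
    rng = random.Random(seed)
    return sum(train(rng, u) in X for _ in range(samples)) / samples

def permute_utility(phi, u):
    """phi . u: outcome phi[k] inherits the utility that outcome k had."""
    v = [0.0] * len(u)
    for k, u_k in enumerate(u):
        v[phi[k]] = u_k
    return v

def criterion_holds(train, A, phis, u, tol=0.03):
    """Empirically check f(A | u) <= f(phi_i . A | phi_i . u) for every phi_i."""
    base = estimate_f(train, A, u)
    return all(
        base <= estimate_f(train, {phi[a] for a in A},
                           permute_utility(phi, u), seed=i + 1) + tol
        for i, phi in enumerate(phis)
    )

A = {0}
phis = [(1, 0, 2), (2, 1, 0)]  # involutions swapping outcome 0 with 1, and with 2
u = [2.0, 0.0, 0.0]
print(criterion_holds(noisy_train, A, phis, u))
print(criterion_holds(stuck_train, A, phis, u))
```

The contrast between the two trainers is the point: the first can be redirected by changing the reward, while the second ignores the reward entirely and so fails the check.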
Let’s think about Minecraft. (Technically, the theorems don’t apply to Minecraft yet. The theorems can handle partial observability & utility over observation histories, or full observability & world state reward, but not yet partial observability & world state reward. But I think it’s illustrative.)
We could reward the agent for ending up in different chunks of a Minecraft world. Here, retargeting often looks like “swap which chunks get which reward.”
At higher levels of competence, most of the accessible chunks are far away, and so we should observe a strong tendency for policies to e.g. quickly tame a horse and reach the Nether (where each Nether block traveled counts for 8 blocks traveled back in the overworld). Thus, in Minecraft, trained policy instrumental convergence will increase with the training procedure competence.
The retargetability criterion also accounts for reward shaping guiding the learning process to hard-to-reach parts of the state space. If the agent needs less reward shaping to reach these parts of the state space, the training criterion will hold for larger sets of reward functions.

Since the training retargetability criterion only requires weak inequality, it’s OK if the training process cannot be perfectly “reflected” across different training trajectories, if equality does not hold. I think empirically this weak inequality will hold for many reward functions and training setups.
 This section does not formally settle the question of when trained policies will seek power. The section just introduces a sufficient criterion, and I’m excited about it. I may write more on the details in future posts.
 However, my intuition is that this formal training criterion captures a core part of how instrumental convergence arises for trained agents.

In some ways, the training-level arguments are easier to apply than the optimal-level arguments. Training-based arguments require somewhat less environmental symmetry.
 For example, if the symmetry holds for the first 50 trajectory timesteps, and the agent only ever trains on those timesteps, then there’s no way that later asymmetry can affect the training output.
 Furthermore, if there’s some rare stochasticity which the agent almost certainly never confronts, then I suspect we should be able to empirically disregard it for the training-level arguments. Therefore, the training-level results should be practically invariant to tiny perturbations to world dynamics which would otherwise have affected the “top-down” decision-makers.
Planning agents are more “top-down” than RL training, but a Monte Carlo tree search agent still isn’t e.g. approximating Boltzmann-rational leaf node selection. A bounded agent won’t be considering all of the possible trajectories it can induce. Maybe it just knows how to induce some subset of available outcome lotteries $C' \subsetneq C$. Then, considering only the things it knows how to do, it does e.g. select one Boltzmann-rationally (sometimes it’ll fail to choose the highest-EU plan, but it’s more probable to choose higher-utility plans).
As long as {power-seeking things the agent knows how to do} contains $n$ copies of {non-power-seeking things the agent knows how to do}, then the theorems will still apply. I think this is a reasonable model of bounded cognition.
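A toy version of this bounded-agent model, as a sketch with made-up numbers: there are five outcomes, but the agent only knows how to reach three of them, of which one is non-power-seeking and two are power-seeking. The names `KNOWN`, `A_KNOWN`, `B_KNOWN` and the choice of Boltzmann selection with temperature 1 are assumptions for illustration.

```python
import math
from itertools import permutations

KNOWN = (0, 1, 2)  # outcomes the bounded agent knows how to reach (hypothetical)
A_KNOWN = {0}      # known non-power-seeking outcome
B_KNOWN = {1, 2}   # known power-seeking outcomes: two copies of A_KNOWN

def boltzmann_over_known(u, temperature=1.0):
    """Boltzmann-rational choice restricted to the outcomes the agent can reach."""
    weights = {k: math.exp(u[k] / temperature) for k in KNOWN}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def orbit_counts(u):
    """Count permutations of u (over all five outcomes) favoring B_KNOWN vs. A_KNOWN."""
    more_B = more_A = 0
    for v in permutations(u):
        probs = boltzmann_over_known(v)
        f_A = sum(probs[k] for k in A_KNOWN)
        f_B = sum(probs[k] for k in B_KNOWN)
        more_B += f_B > f_A
        more_A += f_A > f_B
    return more_B, more_A

# Utility over all five outcomes; outcomes 3 and 4 exist but are unknown to the agent.
print(orbit_counts((2, 0, 1, 5, 5)))
```

The high-utility outcomes the agent cannot reach play no role; the 2-to-1 tendency comes entirely from the structure of the plans it does know about.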
Surely we want an expressive language for motivating AI behavior, and a decision-making function which reflects that expressivity! But these results suggest: maybe not. Instead, we may want to bias the decision-making procedure such that it’s less expressive-qua-behavior.
For example, imitation learning is not retargetable by a utility function. Imitation also seems far less likely to incentivize catastrophic behavior. Imitation is far less expressive and far more biased towards reasonable behavior that doesn’t navigate towards crazy parts of the state space which the agent needs a lot of power to reach. For example, it can be hard to even get a perfect imitator to do a backflip if you can’t do it yourself.
One key tension is that we want the procedure to pick out plans which perform a pivotal act and end the period of AI risk. We also want the procedure to work robustly across a range of parameter settings we give it, so that it isn’t too sensitive / fails gracefully.
AFAICT, alignment researchers didn’t necessarily think that satisficing was safe, but that’s mostly due to speculation that satisficing incentivizes the agent to create a maximizer. Beyond that, though, why not avoid “the AI paperclips the universe” by only having the AI choose a plan leading to at least 100 paperclips? Surely that helps?
This implicit focus on extremal Goodhart glosses over a key part of the risk. The risk isn’t just that the AI goes crazy on a simple objective. Part of the problem is that the vast, vast majority of the AI’s trajectories can only happen if the AI first gains a lot of power! That is: Not only do I think that EU maximization is dangerous, most trajectories through these environments are dangerous!
You might protest: Does this not prove too much? Random action does not lead to dangerous outcomes. Correct. Adopting the uniformly random policy in Pac-Man does not mean a uniformly random chance to end up in each terminal state. It means you probably end up in an early-game terminal state, because Pac-Man got eaten alive while banging his head against the wall.
However, random outcome selection leads to convergently instrumental action. If you uniformly randomly choose a terminal state to navigate to, that terminal state probably requires Pac-Man to beat the first level, and so the agent stays alive, as pointed out by Optimal Policies Tend To Seek Power.
This is just the flip side of instrumental convergence: If most goals are best achieved by taking some small set of preparatory actions, this implies a “bottleneck” in the state space. Uniformly randomly taking actions will not tend to properly navigate this bottleneck. After all, if they did, then most actions would be instrumental for most goals!
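The gap between random actions and random outcomes can be made concrete with a deliberately crude toy model (not Pac-Man itself): suppose there are `T` early steps at which one of two actions is fatal, and `M` distinct terminal states that are only reachable after surviving all of them. The numbers below are hypothetical.

```python
from fractions import Fraction

T = 10   # hypothetical number of early steps where a wrong move ends the game
M = 100  # hypothetical number of distinct terminal states reachable after surviving

# Uniformly random ACTIONS: each early step is survived with probability 1/2.
p_survive_random_actions = Fraction(1, 2) ** T

# Uniformly random OUTCOME: choose one of the T early-death states or the M late
# states, then navigate to it. Every late state requires surviving the early game.
p_survive_random_outcome = Fraction(M, T + M)

print(p_survive_random_actions)  # 1/1024
print(p_survive_random_outcome)  # 10/11
```

Under these assumptions the bottleneck is passed about 0.1% of the time by random actions, but about 91% of the time when the target outcome is drawn at random.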
The trained policy criterion also predicts that we won’t see convergently instrumental survival behavior from present-day embodied agents, because the RL algorithm can’t find or generalize to the high-power part of the state space.
When this starts changing, then we should worry about instrumental subgoals in practice. Unfortunately, since the real world is not a simulator with resets, any agents which do generalize to those strategies won’t have done it before, and so at most, we’ll see attempted deception.
This lends theoretical support for “the training process is highly retargetable in realworld settings across increasingly long time horizons” being a fire alarm for instrumental convergence. In some sense, this is bad: Easily retargetable processes will often be more economically useful, by virtue of being useful for more tasks.
I discussed how a wide range of agent cognition types and of agent production processes are retargetable, and why that might be bad news. I showed that in many situations where power is possible, retargetable policy-production processes tend to produce policies which gain that power. In particular, these results seem to rule out a huge range of expected-utility based rules. The results also let us reason about instrumental convergence at the trained policy level.
I now think that more instrumental convergence comes from the practical retargetability of how we design agents. If there were more ways we could have counterfactually messed up, it’s more likely a priori that we actually messed up. The way I currently see it is: Either we have to really know what we’re doing, or we want processes where it’s somehow hard to mess up.
Since these theorems are crisply stated, I want to more closely inspect the ways in which alignment proposals can violate the assumptions which ensure extremely strong instrumental convergence.
Thanks to Ruby Bloom, Andrew Critch, Daniel Filan, Edouard Harris, Rohin Shah, Adam Shimi, Nisan Stiennon, and John Wentworth for feedback.
From last time:
Quote
 assume the agent is following an optimal policy for a reward function
 Not all environments have the right symmetries
 But most ones we think about seem to
 don’t account for the ways in which we might practically express reward functions
I want to add a new one, because the theorems
 don’t deal with the agent’s uncertainty about what environment it’s in.
I want to think about this more, especially for online planning agents. (The training redirectability criterion blackboxes the agent’s uncertainty.)
Consider a simple environment, where there are three actions: Up, Right, Down.
Probably optimal policies. By running tabular Q-learning with $\epsilon$-greedy exploration for e.g. 100 steps with resets, we have a high probability of producing an optimal policy for any reward function. Suppose that all Q-values are initialized at −100. Just let learning rate $\alpha = 1$ and $\gamma = 1$. This is basically a bandit problem.
To learn an optimal policy, at worst, the agent just has to try each action once. For e.g. a sparse reward function on the Down state (1 reward on Down state and 0 elsewhere), there is a very small probability (precisely, $\frac{2}{3}\left(1 - \frac{\epsilon}{2}\right)^{99}$) that the optimal action (Down) is never taken.
In this case, symmetry shows that the agent has an equal chance of learning either Up or Right. But with high probability, the learned policy will output Down. For any sparse reward function and for any action $a$, this produces decision function
$$f(\{e_{s_a}\}, \{e_s \mid s \in S\} \mid r) := \begin{cases} \frac{1}{3}\left(1 - \frac{\epsilon}{2}\right)^{99} & \text{if } a \text{ is } r\text{-suboptimal} \\ 1 - \frac{2}{3}\left(1 - \frac{\epsilon}{2}\right)^{99} & \text{if } a \text{ is } r\text{-optimal.} \end{cases}$$
$f$ is invariant to joint involution by $\phi_1 := (e_{s_{\text{Down}}}\ e_{s_{\text{Right}}})$ and $\phi_2 := (e_{s_{\text{Down}}}\ e_{s_{\text{Up}}})$. That is,
$$f(\{e_{s_{\text{Down}}}\}, \{e_s \mid s \in S\} \mid r) = f(\phi_1 \cdot \{e_{s_{\text{Down}}}\}, \phi_1 \cdot \{e_s \mid s \in S\} \mid \phi_1 \cdot r) = f(\{e_{s_{\text{Right}}}\}, \{e_s \mid s \in S\} \mid \phi_1 \cdot r).$$
And similarly for $\phi_2$. That is: Changing the optimal state also changes which state is more probably selected by $f$. This means we’ve satisfied condition (1) above.
$f$ is additive on union for its first argument, and so it meets condition (2): order preservation.
Therefore, for this policy training procedure, learned policies for sparse reward functions will be twice as likely to navigate to an element of $\{e_{s_{\text{Up}}}, e_{s_{\text{Right}}}\}$ as an element of $\{e_{s_{\text{Down}}}\}$!
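The closed form above can be plugged directly into an orbit count. This sketch takes the stated formula as given (it does not re-derive it by simulating Q-learning), assumes $\epsilon = 0.1$ as an example value, and uses hypothetical function names.

```python
import math

ACTIONS = ("Up", "Right", "Down")

def f(action, rewarded, eps=0.1, steps=100):
    """Probability the learned policy selects `action` under a sparse reward on
    `rewarded`, using the closed form stated in the text."""
    miss = (1 - eps / 2) ** (steps - 1)  # the (1 - eps/2)^99 factor
    if action == rewarded:
        return 1 - (2 / 3) * miss
    return miss / 3

def orbit_counts(eps=0.1):
    """The orbit of a sparse reward has one element per rewarded state."""
    more_B = more_A = 0
    for rewarded in ACTIONS:
        f_A = f("Down", rewarded, eps)
        f_B = f("Up", rewarded, eps) + f("Right", rewarded, eps)
        more_B += f_B > f_A
        more_A += f_A > f_B
    return more_B, more_A

# Joint invariance: moving the reward from Down to Right moves the probability with it.
print(math.isclose(f("Down", "Down"), f("Right", "Right")))
print(orbit_counts())
```

Two of the three sparse reward functions in the orbit make the learned policy more likely to end in Up or Right than in Down, which is the 2-to-1 tendency claimed above.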
This is a formal argument that a stochastic policy training procedure has certain tendencies across a class of reward functions, and I’m excited to be able to make it.
As the environment grows bigger and the training procedure more complex, we’ll have to consider questions like “what are the inductive biases of large policy networks?”, “what role does reward shaping play for this objective, and is the shaping at least as helpful for its permuted variants?”, and “to what extent are different parts of the world harder to reach?”.
For example, suppose there are a trillion actions, and two of them lead to the Right state above. Half of the remaining actions lead to Up, and the rest lead to Down.
Q-learning is ridiculously unlikely to ever go Right, and so the symmetry breaks. In the limit, tabular Q-learning on a finite MDP will learn an optimal policy, and then the normal theorems will apply. But in the finite-step regime, no such guarantee holds, and so the available action space can violate condition (1): increasing under joint permutation.
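To put a rough number on "ridiculously unlikely": as a back-of-the-envelope bound, assume that whenever the agent picks an action it has not tried before (an exploratory step or a tie-break), it does so uniformly over all $10^{12}$ actions, and that training lasts 100 steps as in the earlier example. Then at each step the chance of first trying one of the two Right actions is at most $2 \times 10^{-12}$, and a union bound over the steps gives

$$P(\text{Right is ever tried}) \leq 100 \cdot \frac{2}{10^{12}} = 2 \times 10^{-10}.$$

So even when all of the reward sits on the Right state, the trained policy essentially never goes there, whereas the same reward placed on Up or Down would be found almost immediately.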

1. I don’t think that “microscopic” is important for my purposes; the constraint is not physical size, but changes in a single parameter to the policy-selection procedure.

2. Technically, we aren’t just talking about a cardinality inequality (about staying alive letting the agent do more things than dying) but about similarity-via-permutation of the outcome lottery sets. I think it’s OK to round this off to cardinality inequalities when informally reasoning using the theorems, keeping in mind that sometimes results won’t formally hold without a stronger precondition.

3. I assume that permutation matrices are in row representation: $(P_\phi)_{ij} = 1$ if $i = \phi(j)$ and 0 otherwise.

4. I conjecture that this holds for base distributions which assign sufficient probability to $B$.

5. Here’s a bit more formality for what it means for an agent to make decisions only based on expected utility.
Theorem: Retargetability of EU decision-making. Let $A, B \subseteq C \subsetneq \mathbb{R}^d$ be such that $B$ contains $n$ copies of $A$ via $\phi_i$ such that $\phi_i \cdot C = C$. For $X \subseteq C$, let $f(X, C \mid u)$ be an EU/cardinality function, such that $f$ returns the probability of selecting an element of $X$. Then $f(B, C \mid u) \geq^n_{\text{most}} f(A, C \mid u)$.

6. The trained policies could conspire to “play dumb” and pretend to not be retargetable, so that we would be more likely to actually deploy one of them.