This post explains a formal link between “what kinds of instrumental convergence exist?” and “what does VNM-coherence tell us about goal-directedness?”. It turns out that VNM-coherent preference orderings have the same statistical incentives as utility functions; most such orderings will incentivize power-seeking in the settings covered by the power-seeking theorems.
In certain contexts, coherence theorems can have nontrivial implications, in that they provide Bayesian evidence about what the coherent agent will probably do. In the situations where the power-seeking theorems apply, coherent preferences do suggest some degree of goal-directedness. Somewhat more precisely, VNM-coherence is Bayesian evidence that the agent prefers to stay alive, keep its options open, etc.
However, VNM-coherence over action-observation histories tells you nothing about what behavior to expect from the coherent agent, because there is no instrumental convergence for generic utility functions over action-observation histories!
The result follows because the VNM utility theorem lets you consider VNM-coherent preference orderings to be isomorphic to their induced utility functions (with equivalence up to positive affine transformation), and so these preference orderings will have the same generic incentives as the utility functions themselves.
Let $o_{1},...,o_{n}$ be outcomes, in a sense which depends on the context; outcomes could be world-states, universe-histories, or one of several fruits. Outcome lotteries are probability distributions over outcomes, and can be represented as elements of the probability simplex in $\mathbb{R}^{n}$ (i.e. as vectors with non-negative entries which sum to 1).
A preference ordering $≺$ is a binary relation on lotteries; it need not be e.g. complete (defined for all pairs of lotteries). VNM-coherent preference orderings are those which obey the VNM axioms. By the VNM utility theorem, coherent preference orderings induce consistent utility functions over outcomes, and consistent utility functions conversely imply a coherent preference ordering.
Definition 1: Permuted preference ordering. Let $ϕ∈S_{n}$ be an outcome permutation, and let $≺$ be a preference ordering. $≺_{ϕ}$ is the preference ordering such that for any lotteries $L,M$: $L≺_{ϕ}M$ if and only if $ϕ(L)≺ϕ(M)$.
Definition 2: Orbit of a preference ordering. Let $≺$ be any preference ordering. Its orbit $S_{n}⋅≺$ is the set $\{≺_{ϕ}∣ϕ∈S_{n}\}$.
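Definitions 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: lotteries are plain probability lists, a permutation is a tuple `phi` with `phi[i]` the index of $ϕ(o_{i})$, and $ϕ(L)$ is taken to move the probability mass on $o_{i}$ to $ϕ(o_{i})$. The helper names and the utility values are hypothetical, not from the post.

```python
from itertools import permutations

def expected_utility(u, lottery):
    """Expected utility of a lottery (a probability vector) under utility vector u."""
    return sum(ui * pi for ui, pi in zip(u, lottery))

def permute_lottery(phi, lottery):
    """phi(L): probability mass on outcome i moves to outcome phi[i] (assumed convention)."""
    out = [0] * len(lottery)
    for i, p in enumerate(lottery):
        out[phi[i]] += p
    return out

def make_ordering(u):
    """Strict preference induced by u: prec(L, M) is True iff L is strictly dispreferred to M."""
    return lambda L, M: expected_utility(u, L) < expected_utility(u, M)

def permuted_ordering(prec, phi):
    """Definition 1: L prec_phi M iff phi(L) prec phi(M)."""
    return lambda L, M: prec(permute_lottery(phi, L), permute_lottery(phi, M))

def orbit(prec, n):
    """Definition 2: one permuted ordering per phi in S_n (duplicates are not merged)."""
    return [permuted_ordering(prec, phi) for phi in permutations(range(n))]

u = [0, 1, 2]                  # hypothetical utilities for three outcomes
prec = make_ordering(u)
L, M = [1, 0, 0], [0, 0, 1]    # deterministic lotteries on the first and third outcomes
swap = (2, 1, 0)               # phi swaps the first and third outcomes
prec_swapped = permuted_ordering(prec, swap)
```

Here `prec(L, M)` holds while `prec_swapped(L, M)` does not: relabeling the first and third outcomes reverses the preference between these two lotteries.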
The orbits of coherent preference orderings are basically all the preference orderings induced by “relabeling” which outcomes are which. This is made clear by the following result:
Lemma 3: Permuting coherent preferences permutes the induced utility function. Let $≺$ be a VNM-coherent preference ordering which induces VNM-utility function $u$, and let $ϕ∈S_{n}$. Then $≺_{ϕ}$ induces VNM-utility function $u'(o_{i})=u(ϕ(o_{i}))$, where $o_{i}$ is any outcome.
Proof. Let $L,M$ be any lotteries.
 By the definition of a permuted preference ordering, $L≺_{ϕ}M$ if and only if $ϕ(L)≺ϕ(M)$.
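 By the VNM utility theorem and the fact that $≺$ is coherent, $ϕ(L)≺ϕ(M)$ iff $\mathbb{E}_{ℓ∼ϕ(L)}[u(ℓ)]<\mathbb{E}_{m∼ϕ(M)}[u(m)]$.
 Since there are finitely many outcomes, we convert to vector representation: $\mathbf{u}^{\top}(\mathbf{P}_{ϕ}\mathbf{l})<\mathbf{u}^{\top}(\mathbf{P}_{ϕ}\mathbf{m})$, where $\mathbf{u}$ is the vector of utilities, $\mathbf{l},\mathbf{m}$ are the vectors representing $L,M$, and $\mathbf{P}_{ϕ}$ is the permutation matrix of $ϕ$.
 By associativity, $(\mathbf{u}^{\top}\mathbf{P}_{ϕ})\mathbf{l}<(\mathbf{u}^{\top}\mathbf{P}_{ϕ})\mathbf{m}$.
 But this is just equivalent to $\mathbb{E}_{ℓ∼L}[u(ϕ(ℓ))]<\mathbb{E}_{m∼M}[u(ϕ(m))]$. ∎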
As a corollary, this lemma implies that if $≺$ is VNM-coherent, so is $≺_{ϕ}$, since it induces a consistent utility function over outcomes.
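The identity at the heart of the lemma can be checked on a small example. The sketch below assumes that $ϕ(L)$ moves the probability mass on $o_{i}$ to $ϕ(o_{i})$ and uses exact rational arithmetic; the utilities and lotteries are made-up illustrative values, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def permute_lottery(phi, lottery):
    """phi(L): probability mass on outcome i moves to outcome phi[i] (assumed convention)."""
    out = [Fraction(0)] * len(lottery)
    for i, p in enumerate(lottery):
        out[phi[i]] += p
    return out

def expected_utility(u, lottery):
    """Dot product of the utility vector with the lottery vector."""
    return sum(ui * pi for ui, pi in zip(u, lottery))

def permuted_utility(u, phi):
    """u'(o_i) = u(phi(o_i)), as in the lemma."""
    return [u[phi[i]] for i in range(len(u))]

# Illustrative values only.
u = [Fraction(3), Fraction(-1), Fraction(5, 2), Fraction(0)]
L = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
M = [Fraction(0), Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]

# For every phi in S_4: E_{phi(X)}[u] equals E_X[u o phi].
lemma_holds = all(
    expected_utility(u, permute_lottery(phi, X))
    == expected_utility(permuted_utility(u, phi), X)
    for phi in permutations(range(4))
    for X in (L, M)
)
```

Since the two expectations agree for every lottery, comparing $ϕ(L)$ with $ϕ(M)$ under $u$ gives the same verdict as comparing $L$ with $M$ under $u∘ϕ$.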
Consider the orbit of any $≺$. By the VNM utility theorem, each preference ordering can be considered isomorphic to its induced utility function (with equivalence up to positive affine transformation).
Then let $u$ be any utility function compatible with $≺$. By the above lemma, consider the natural bijection between the (preference ordering) orbit of $≺$ and the (utility function) orbit of $u$, where $\{≺_{ϕ}∣ϕ∈S_{n}\}↔\{u∘ϕ∣ϕ∈S_{n}\}$.^{1}
When my theorems on power-seeking are applicable, some proportion of the right-hand side is guaranteed to make (formal) power-seeking optimal. But by the bijection and by the fact that the preference orderings incentivize the same things (by the VNM theorem in the reverse direction), the (preference ordering) orbit must have the exact same proportion of elements for which (lotteries representing formal) power-seeking is optimal.
Conversely, if we know that some set A of lotteries tends to be preferred over another set B of lotteries (in the preference order orbit sense), then the same argument shows that A tends to have greater expected utility than B (in the utility function orbit sense). This holds for all (utility function) orbits, because every utility function corresponds to a VNM-coherent preference ordering.
So: orbit-level instrumental convergence for utility functions is equivalent to orbit-level instrumental convergence for VNM-coherent preference orderings.

 VNM-coherence over action-observation history (AOH) lotteries tells you nothing about what behavior to expect from the agent.
 Coherence over AOH tells you nothing because there is no instrumental convergence in that setting!
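One way to get a feel for this claim is a toy count. The setup is entirely an assumption made for illustration: a deterministic environment with two actions and a horizon of two steps, so that each action sequence is its own history (observations are omitted because they are determined by the actions), and utility is assigned directly to histories.

```python
from fractions import Fraction
from itertools import permutations, product

ACTIONS = (0, 1)
HORIZON = 2
# Each outcome is a full history; here a history is just the action sequence.
histories = list(product(ACTIONS, repeat=HORIZON))

def fraction_first_action_optimal(u, first_action):
    """Fraction of the orbit of u whose best history begins with first_action."""
    n = len(histories)
    wins = 0
    total = 0
    for phi in permutations(range(n)):
        v = [u[phi[i]] for i in range(n)]          # the permuted utility u o phi
        best = max(range(n), key=lambda i: v[i])   # index of the optimal history
        wins += histories[best][0] == first_action
        total += 1
    return Fraction(wins, total)

u = [3, 8, -2, 5]  # arbitrary distinct utilities over the four histories
share_action_0 = fraction_first_action_optimal(u, 0)
share_action_1 = fraction_first_action_optimal(u, 1)
```

Both shares come out to 1/2: every first action leads to the same number of distinct histories, so relabeling utilities never favors one action over the other in this toy setting.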

In certain contexts, coherence theorems can have nontrivial implications, in that they provide Bayesian evidence about what the coherent agent will probably do.
 In the situations where the power-seeking theorems apply, coherent preferences do suggest some degree of goal-directedness.
 Somewhat more precisely, VNM-coherence is Bayesian evidence that the agent prefers to stay alive, keep its options open, etc.

In some domains, preference specification may be more natural than utility function specification. However, in theory, coherent preferences and utility functions have the exact same statistical incentives.
 In practice, they will differ. For example, suppose we have a choice between specifying a utility function which is linear over state features and doing behavioral cloning on elicited human preferences over world states. These two methods will probably tend to produce different incentives.
Goal-directedness seems to more naturally arise from coherence over resources.^{2}
In a real-time strategy game, units and buildings and so forth can be created, destroyed, and generally moved around given sufficient time. Over long time scales, the main thing which matters to the world-state is resources: creating or destroying anything else costs resources. So, even though there’s a high-dimensional game-world, it’s mainly a few (low-dimensional) resource counts which impact the long-term state space. Any agents hoping to control anything in the long term will therefore compete to control those few resources.
More generally: of all the many “nearby” variables an agent can control, only a handful (or summary) are relevant to anything “far away.” Any “nearby” agents trying to control things “far away” will therefore compete to control the same handful of variables.
Main thing to notice: this intuition talks directly about a feature of the world—i.e. “far away” variables depending only on a handful of “nearby” variables. That, according to me, is the main feature which makes or breaks instrumental convergence in any given universe. We can talk about that feature entirely independent of agents or agency. Indeed, we could potentially use this intuition to derive agency, via some kind of coherence theorem; this notion of instrumental convergence is more fundamental than utility functions.
“Resources” should be a derived notion rather than a fundamental one. My current best guess at a sketch: the agent should make decisions within multiple loosely-coupled contexts, with all the coupling via some low-dimensional summary information, and that summary information would be the “resources.” (This is exactly the kind of setup which leads to instrumental convergence.) By making Pareto-resource-efficient decisions in one context, the agent would leave itself maximum freedom in the other contexts. In some sense, the ultimate “resource” is the agent’s action space. Then, resource tradeoffs implicitly tell us how the agent is trading off its degree of control within each context, which we can interpret as something-like-utility.
This seems on track to me. We now know what instrumental convergence looks like in unstructured environments, and how structural assumptions on utility functions affect the shape and strength of that instrumental convergence, and this post explains the precise link between “what kinds of instrumental convergence exist?” and “what does VNM-coherence tell us about goal-directedness?”. I’d be excited to see what instrumental convergence looks like in more structured models.
Thanks to Edouard Harris for pointing out that Definition 1 and Lemma 3 were originally incorrect.

^{1} In terms of instrumental convergence, positive affine transformation never affects the optimality probability of different lottery sets. So for each (preference ordering) orbit element $≺_{ϕ}$, it doesn’t matter what representative we select from each equivalence class over induced utility functions, so we may as well pick $u∘ϕ$!
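The reason a positive affine transformation cannot change which lotteries are preferred is linearity of expectation. As a short worked step, using constants $a>0$ and $b$ that are introduced here only for illustration:

$$\mathbb{E}_{ℓ∼L}[a\,u(ℓ)+b]=a\,\mathbb{E}_{ℓ∼L}[u(ℓ)]+b,$$

so for any lotteries $L,M$,

$$\mathbb{E}_{ℓ∼L}[a\,u(ℓ)+b]<\mathbb{E}_{m∼M}[a\,u(m)+b]\iff\mathbb{E}_{ℓ∼L}[u(ℓ)]<\mathbb{E}_{m∼M}[u(m)].$$

Every comparison between lotteries is preserved, and hence so is which lottery sets contain an optimal element.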

^{2} I think the word “resources” is slightly imprecise here, because resources are only resources in the normal context of human life; money is useless when alone in Alpha Centauri, but time to live is not. So we want coherence over “things which are locally resources”, perhaps.