This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
Can we get impact measurement right? Does there exist One Equation To Rule Them All?
I think there’s a decent chance there isn’t a simple airtight way to implement AUP which lines up with AUP_{conceptual}, mostly because it’s just incredibly difficult in general to perfectly specify the reward function.
Reasons why it might be feasible:
 we’re trying to get the agent to do the goal without it becoming more able to do the goal, which is conceptually simple and natural;
 since we’ve been able to handle previous problems with AUP with clever design choice modifications, it’s plausible we can do the same for all future problems;
 since there are a lot of ways to measure power due to instrumental convergence, that increases the chance at least one of them will work;
 intuitively, this sounds like the kind of thing which could work (if you told me “you can build superintelligent agents which don’t try to seek power by penalizing them for becoming more able to achieve their own goal”, I wouldn’t exactly die of shock).
Even so, I am (perhaps surprisingly) not that excited about actually using impact measures to restrain advanced AI systems. Let’s review some concerns I provided in Reasons for Pessimism about Impact of Impact Measures:
 Competitive and social pressures incentivize people to cut corners on safety measures, especially those which add overhead. Especially so for training time, assuming the designers slowly increase aggressiveness until they get a reasonable policy.
 In a world where we know how to build powerful AI but not how to align it (which is actually probably the scenario in which impact measures do the most work), we play a very unfavorable game while we use low-impact agents to somehow transition to a stable, good future: the first person to set the aggressiveness too high, or to discard the impact measure entirely, ends the game.
 In a What Failure Looks Like-esque scenario, it isn’t clear how impact-limiting any single agent helps prevent the world from “gradually drifting off the rails.”
You might therefore wonder why I’m working on impact measurement.
Within Matthew Barnett’s breakdown of how impact measures could help with alignment, I’m most excited about impact measure research as deconfusion.
By deconfusion, I mean something like “making it so that you can think about a given topic without continuously accidentally spouting nonsense.”
To give a concrete example, my thoughts about infinity as a 10-year-old were made of rearranged confusion rather than of anything coherent, as were the thoughts of even the best mathematicians from 1700. “How can 8 plus infinity still be infinity? What happens if we subtract infinity from both sides of the equation?” But my thoughts about infinity as a 20-year-old were not similarly confused, because, by then, I’d been exposed to the more coherent concepts that later mathematicians labored to produce. I wasn’t as smart or as good of a mathematician as Georg Cantor or the best mathematicians from 1700; but deconfusion can be transferred between people; and this transfer can spread the ability to think actually coherent thoughts.
In 1998, conversations about AI risk and technological singularity scenarios often went in circles in a funny sort of way. People who are serious thinkers about the topic today, including my colleagues Eliezer and Anna, said things that today sound confused. (When I say “things that sound confused”, I have in mind things like “isn’t intelligence an incoherent concept”, “but the economy’s already superintelligent”, “if a superhuman AI is smart enough that it could kill us, it’ll also be smart enough to see that that isn’t what the good thing to do is, so we’ll be fine”, “we’re Turing-complete, so it’s impossible to have something dangerously smarter than us, because Turing-complete computations can emulate anything”, and “anyhow, we could just unplug it.”) Today, these conversations are different. In between, folks worked to make themselves and others less fundamentally confused about these topics, so that today, a 14-year-old who wants to skip to the end of all that incoherence can just pick up a copy of Nick Bostrom’s Superintelligence.
Similarly, suppose you’re considering the unimportant and trivial question of whether seeking power is convergently instrumental, which we can now crisply state as “do most reward functions induce optimal policies which take over the planet (more formally, which visit states with high power)?”.
You’re a bit confused if you argue in the negative by saying “you’re anthropomorphizing; chimpanzees don’t try to do that” (chimpanzees aren’t optimal) or “the set of reward functions which does this has measure 0, so we’ll be fine” (for any reachable state, there exists a positive measure set of reward functions for which visiting it is optimal).
You’re a bit confused if you argue in the affirmative by saying “unintelligent animals fail to gain resources and die; intelligent animals gain resources and thrive. Therefore, since we are talking about really intelligent agents, of course they’ll gain resources and avoid correction.” (animals aren’t optimal, and evolutionary selection pressures narrow down the space of possible “goals” they could be effectively optimizing).
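To make the crisp version of the question concrete, here is a minimal Python sketch that samples reward functions and counts how often the optimal policy heads toward the state with more options. Everything in it is an assumption chosen for illustration: the six-state toy MDP, the state names, the i.i.d. uniform reward over states, and the discount of 0.9. It is not the paper's construction, just one small instance of "what do optimal policies do under a given reward function distribution?"

```python
import random

# Hypothetical toy deterministic MDP: state -> list of successor states.
# "dead_end" offers one future; "hub" keeps three futures open.
SUCCESSORS = {
    "start": ["dead_end", "hub"],
    "dead_end": ["dead_end"],
    "hub": ["a", "b", "c"],
    "a": ["a"],
    "b": ["b"],
    "c": ["c"],
}


def optimal_values(reward, gamma=0.9, sweeps=150):
    """Value iteration for a state-based reward function."""
    v = {s: 0.0 for s in SUCCESSORS}
    for _ in range(sweeps):
        v = {
            s: reward[s] + gamma * max(v[n] for n in SUCCESSORS[s])
            for s in SUCCESSORS
        }
    return v


def fraction_preferring_hub(n_samples=2000, gamma=0.9, seed=0):
    """Share of sampled reward functions whose optimal first move is to the hub."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Assumed distribution: reward drawn i.i.d. uniform on [0, 1] per state.
        reward = {s: rng.random() for s in SUCCESSORS}
        v = optimal_values(reward, gamma)
        if v["hub"] > v["dead_end"]:
            hits += 1
    return hits / n_samples


frac = fraction_preferring_hub()
print(f"fraction of reward functions preferring the hub: {frac:.2f}")
```

In this toy world a clear majority of sampled reward functions (roughly seven in ten for these settings) make it optimal to move to the hub, simply because the hub keeps more futures available. Changing the reward distribution changes the number, which is exactly the kind of disagreement one can now have productively.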
After reading this paper on the formal roots of instrumental convergence, instead of arguing about whether chimpanzees are representative of power-seeking behavior, we can just discuss how, under an agreed-upon reward function distribution, optimal action is likely to flow through the future of our world. We can think about to what extent the paper’s implications apply to more realistic reward function distributions (which don’t identically distribute reward over states).^{1} Since we’re less confused, our discourse doesn’t have to be crazy.
But also since we’re less confused, the privacy of our own minds doesn’t have to be crazy. It’s not that I think that any single fact or insight or theorem downstream of my work on AUP is totally obviously necessary to solve AI alignment. But it sure seems good that we can mechanistically understand instrumental convergence and power, know what “impact” means instead of thinking it’s mostly about physical change to the world, think about how agents affect each other, and conjecture why goal-directedness seems to lead to doom by default.^{2}
Attempting to iron out flaws from our current-best AUP equation makes one intimately familiar with how and why power-seeking incentives can sneak in even when you’re trying to keep them out in the conceptually correct way. This point is harder for me to articulate, but I think there’s something vaguely important in understanding how this works.
Formalizing instrumental convergence also highlighted a significant hole in our theoretical understanding of the main formalism of reinforcement learning. And if you told me two years ago that you could possibly solve side-effect avoidance in the short-term with one simple trick (“just preserve your ability to optimize a single random reward function, lol”), I’d have thought you were nuts. Clearly, there’s something wrong with our models of reinforcement learning environments if these results are so surprising.
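For readers who want to see what the “one simple trick” might look like mechanically, here is a minimal Python sketch. It assumes the penalty takes the form |Q_aux(s, a) − Q_aux(s, no-op)| for a single randomly drawn auxiliary reward function, subtracted from the primary reward with a weight `lam`. The three-state vase world, the action names, the helper functions, and the unscaled penalty are all hypothetical simplifications for illustration, not the equation used in any particular experiment.

```python
import random

GAMMA = 0.9

# Hypothetical toy world: state -> {action: next_state}. "noop" always stays put.
TRANSITIONS = {
    "start": {"noop": "start", "careful": "vase_intact", "smash": "vase_broken"},
    "vase_intact": {"noop": "vase_intact", "back": "start"},
    "vase_broken": {"noop": "vase_broken"},  # irreversible: no way back
}


def q_values(reward, sweeps=200):
    """Value iteration for a state-based reward; returns Q[state][action]."""
    v = {s: 0.0 for s in TRANSITIONS}
    for _ in range(sweeps):
        v = {
            s: reward[s] + GAMMA * max(v[n] for n in TRANSITIONS[s].values())
            for s in TRANSITIONS
        }
    return {
        s: {a: reward[s] + GAMMA * v[n] for a, n in TRANSITIONS[s].items()}
        for s in TRANSITIONS
    }


def aup_reward(primary, q_aux, state, action, lam=1.0):
    """Primary reward minus the change in auxiliary attainable value vs. no-op."""
    penalty = abs(q_aux[state][action] - q_aux[state]["noop"])
    return primary - lam * penalty


# A single random auxiliary reward function over states (assumed uniform on [0, 1]).
rng = random.Random(0)
aux_reward = {s: rng.random() for s in TRANSITIONS}
q_aux = q_values(aux_reward)

# Suppose every action at "start" earns the same primary reward of 1.0.
shaped = {a: aup_reward(1.0, q_aux, "start", a) for a in TRANSITIONS["start"]}
for action, value in shaped.items():
    print(f"{action:8s} shaped reward = {value:.3f}")
```

Doing nothing is never penalized, and any action that shifts the agent's ability to pursue the auxiliary goal loses reward in proportion to that shift. How large each penalty is depends on the sampled auxiliary reward, so it is worth rerunning with different seeds.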
In my opinion, research on AUP has yielded an unusually high rate of deconfusion and insights, probably because we’re thinking about what it means for the agent to interact with us.

When combined with our empirical knowledge of the difficulty of reward function specification, you might begin to suspect that there are lots of ways the agent might be incentivized to gain control, many openings through which power-seeking incentives can permeate, and your reward function would have to penalize all of these! If you were initially skeptical, this might make you think that power-seeking behavior may be more difficult to avoid than you initially thought.

If we collectively think more and end up agreeing that AUP_{conceptual} solves impact measurement, it would be interesting that you could solve such a complex, messy-looking problem in such a simple way. If, however, CCC ends up being false, I think that would also be a new and interesting fact not currently predicted by our models of alignment failure modes.