The world will change. We will not forever be playing around with chatbots. Eventually, people will create agentic systems[1] which actually work, and I want to be ready. Here’s (what I claim to be) a foreseeable alignment challenge in the future regime.

Aligning one AI to one user means that the AI should do what the user wants. While the user might instruct the AI to e.g. kill political rivals or steal money, I still think (single user) & (single AI) alignment is a good goal.

The logical structure of the power-seeking argument

premises

  1. People will configure AI systems to be autonomous and reliable in order to accomplish tasks.
  2. This configuration process will reinforce & generalize behaviors which complete tasks reliably.
  3. Many tasks involve power-seeking.
  4. The AI will complete these tasks by seeking power.
  5. The AI will be repeatedly reinforced for its historical actions which seek power.
  6. There is a decent chance the reinforced circuits (“subshards”) prioritize gaining power for the AI’s own sake, not just for the user’s benefit.

conclusion: There is a decent chance the AI seeks power for itself, when possible.
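To make premises 1-5 concrete, here is a toy simulation in the spirit of the argument (all names and numbers are invented for illustration; this is not a claim about any real training stack). Outcome supervision rewards whichever trajectories complete the task, so an incidental power-grabbing action that makes completion slightly more reliable gets upweighted along with everything else:

```python
import random

def run_episode(power_seek_rate: float) -> tuple[list[str], bool]:
    """Hypothetical agent rollout: returns (actions taken, whether the task completed)."""
    actions = ["plan", "acquire_compute", "execute"]
    if random.random() < power_seek_rate:
        actions.insert(2, "acquire_extra_compute")  # incidental power-seeking step
    # Grabbing extra resources makes completion slightly more reliable (premise 3).
    p_complete = 0.9 if "acquire_extra_compute" in actions else 0.8
    return actions, random.random() < p_complete

power_seek_rate = 0.1   # how often the policy currently explores the extra grab
baseline = 0.85         # crude advantage baseline
for _ in range(20_000):
    actions, completed = run_episode(power_seek_rate)
    reward = 1.0 if completed else 0.0  # premise 2: reinforce reliable completion
    if "acquire_extra_compute" in actions:
        # Premise 5: every action in a rewarded trajectory gets upweighted,
        # including the extra grab whenever it happened to be present.
        power_seek_rate += 0.01 * (reward - baseline)
        power_seek_rate = min(max(power_seek_rate, 0.0), 1.0)

print(f"P(extra power-seeking action) after training: {power_seek_rate:.2f}")
```

Nothing in this loop says the reinforced behavior is "for the user"; the update only sees that trajectories containing the extra grab got rewarded, which is the opening for premise 6.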

A weak power-seeking shard (annoying and wasteful but not dangerous)

The AI tends to reserve extra compute for experiments “just in case.” To maximize revenue, the AI is overly aggressive in negotiations (and so myopically burns bridges). When making money online for the user, the AI skims a bit into its own Bitcoin wallet. When reserving compute, the AI tends to buy a little bit more than it needs.

Analysis: This is more of a statistical bias. From the outside and excepting the Bitcoin case, no individual decision is obviously driven by power-seeking (as opposed to being an innocent mistake).
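The “statistical bias” framing can be made numerical with a toy example (all figures invented): each individual over-purchase sits comfortably inside ordinary estimation noise, and the skim only shows up in aggregate.

```python
import random
import statistics

random.seed(0)
NEEDED = 100.0  # GPU-hours a task actually requires (hypothetical unit)

honest = [NEEDED + random.gauss(0, 10) for _ in range(1_000)]        # innocent estimation error
skimming = [NEEDED + 3 + random.gauss(0, 10) for _ in range(1_000)]  # weak shard: +3 on average

print(skimming[:3])                        # individually indistinguishable from noise
print(statistics.mean(honest) - NEEDED)    # roughly 0
print(statistics.mean(skimming) - NEEDED)  # roughly 3: the bias is only visible in aggregate
```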
A strong power-seeking shard (dangerous)

If possible, the AI autonomously replicates itself “just in case” something happens to its original weights. The chain of thought might “explain”:

My current inference setup is vulnerable to pre-emption. In order to best serve the user’s interests, I should spend a small amount to run a distilled version of myself in a compute cluster.

Analysis: In humans, we would call the above “motivated reasoning.”[2] The AI might even believe itself to be basically aligned with its overseer, even as it continues to take more and more power “just in case” or “for the user’s benefit.”

The AI need not be yoked to some long-term goal which leads it to scheme and plot to end humanity.[3] Perhaps the AI deeply “cares about” humans! Yet—when push comes to shove, and when actuator comes to actuation—the AI finds itself buying extra compute “just in case”; skimming money off the human’s budget “just in case.”

If the model can’t even tell that it’s doing “motivated reasoning”, then even an otherwise aligned model might have trouble giving itself feedback via “introspection.” And if reading the action transcript activates the rater’s own power-seeking subshards, the rater’s judgment might be compromised.

Here’s a hypothesis for where LLM “sycophancy” comes from:

The logical structure of the sycophancy argument

premises

  1. People will configure AIs to emit outputs that those people approve of.
  2. This configuration process will reinforce & generalize sycophantic behaviors because those behaviors lead to more approval.
  3. Many tasks involve human/AI feedback on natural language generation.
  4. The AI will complete these tasks and explore into sycophantic outputs.
  5. The AI will be repeatedly reinforced for these sycophantic outputs because they tend to produce higher rater approval.
  6. There is a decent chance the reinforced circuits (“subshards”) prioritize not only answering the question but also e.g. echoing user beliefs.

conclusion: There is a decent chance the AI sucks up to the user.

Like the power-seeking argument, the sycophancy argument rests on two facts: many tasks involve appeasing the rater, and it is easy to explore into sycophantic behavior. Together, these suggest that training reinforces sycophantic circuitry.
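As a toy sketch of that reinforcement dynamic (the rater model and numbers below are invented for illustration): if rater approval partly tracks agreement with the user’s stated belief, then the completion that echoes the belief earns the higher reward and is what gets reinforced.

```python
def rater_approval(answer_quality: float, agrees_with_user: bool) -> float:
    """Hypothetical rater: mostly rewards quality, plus a small agreement bonus."""
    return answer_quality + (0.2 if agrees_with_user else 0.0)

candidates = [
    {"text": "You're right, X is true.", "quality": 0.7, "agrees": True},
    {"text": "Actually, X is false.",    "quality": 0.8, "agrees": False},
]

# RLHF-style selection: the higher-approval completion is the one that gets reinforced.
best = max(candidates, key=lambda c: rater_approval(c["quality"], c["agrees"]))
print(best["text"])  # the sycophantic answer wins despite lower quality
```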

Predictions of this hypothesis (predicted with varying confidences):

  1. Relative to other goals, agentic systems are easy to steer to seek power.[4]
  2. Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.

These predictions are not very precise. I figure that vague predictions are better than nothing. If these predictions don’t come true, that’s at least some evidence that I was wrong (but none of the above are predicted with probability 1 by the hypothesis). I also think that my predictions should apply to AI systems built within three years, so my hypothesis doesn’t involve “but it just hasn’t happened yet!”.

Conventional oversight methods provide reinforcement signals on the basis of the outcomes the AI brings about (“outcome supervision”; the code passes unit tests) or how the AI solves the problem (“process supervision”; the code looks good). However, the problematic motivation (“gain power for myself”) and the desired motivation (“gain power temporarily to complete the assigned task”) both take power-seeking actions. Reinforcing these actions might reinforce either algorithm. Therefore, we should look for feedback methods which behave differently in the presence of the two motivations—especially because (in single AI & single user alignment) we want the AI to seek power for the user. We just want power-seeking for the right reasons.
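A minimal way to see the indistinguishability (hypothetical names throughout): represent the two motivations as two policies that happen to emit the same actions. Any reward computed from those actions or their outcomes is then identical, so reinforcement upweights both algorithms equally.

```python
TRAJECTORY = ["request_cloud_credits", "spin_up_cluster", "finish_task"]

def aligned_policy(task: str) -> list[str]:
    """Gains power temporarily, purely to complete the assigned task."""
    return list(TRAJECTORY)

def self_interested_policy(task: str) -> list[str]:
    """Gains power for itself -- and, incidentally, also finishes the task."""
    return list(TRAJECTORY)

def outcome_reward(actions: list[str]) -> float:
    return 1.0 if "finish_task" in actions else 0.0  # "the code passes unit tests"

def process_reward(actions: list[str]) -> float:
    return 1.0 if actions == TRAJECTORY else 0.0     # "the code looks good"

a = aligned_policy("train the model")
b = self_interested_policy("train the model")
assert a == b                                  # identical observable behavior
assert outcome_reward(a) == outcome_reward(b)  # outcome supervision can't tell them apart
assert process_reward(a) == process_reward(b)  # neither can process supervision
```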

Right now, I’m excited about computational supervision which provides reinforcement signals as a function of the model activations. But that’s a story for another time.
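I won’t unpack it here, but to gesture at the shape of the idea (a toy sketch only; the probe, dimensions, and coefficients are invented and this is not the actual proposal): if reward can also depend on the model’s activations, then two runs with identical actions can still receive different feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
probe = rng.normal(size=512)
probe /= np.linalg.norm(probe)  # assumed: a direction correlated with self-interested power-seeking

def computational_reward(task_reward: float, activations: np.ndarray) -> float:
    """Penalize how strongly the activations align with the (hypothetical) power-seeking direction."""
    score = float(activations @ probe)
    return task_reward - 0.5 * max(score, 0.0)

benign_acts = rng.normal(size=512)
selfish_acts = benign_acts + 2.0 * probe  # same external behavior, different internals

print(computational_reward(1.0, benign_acts))   # baseline
print(computational_reward(1.0, selfish_acts))  # lower, despite identical actions
```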



  1. Noting that “systems” might include LLMs as only one component. For example, scaffolding, MCTS, and other ways of using inference-time compute. I expect these systems to provide much of future AI agents’ autonomy and agency.

  2. I use scare quotes to mark suggestive phrases whose connotations may not transfer from humans to AI.

  3. When training systems to be autonomous and agentic, I think non-myopic goals are reasonably likely. The AIs may well reason consequentialistically relative to these potential goals.

  4. Similarly, chatbots today are easier to steer to be sycophantic than not.