Table of Contents
- I: Non-robust decision-influences can be OK
- II: Values steer optimization; they are not optimized against
- III: Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values
- Conclusion
- Appendix A: Several roads lead to a high-strength optimizer’s curse
- Appendix B: Preliminary reflective planning pseudocode
- Appendix C: Value shards all the way down
- Footnotes
On how I use words, values are decision-influences (also known as shards). “I value doing well at school” is a short sentence for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.”
Summaries of key points:
- Non-robust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don’t have to be “globally robust” or “perfect.”
- Values steer optimization; they are not optimized against. The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
- Since values are not the optimization target of the agent with those values, the values don’t have to be adversarially robust.
 
- Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values. In self-reflective agents which can think about their own thinking, values steer e.g. what plans get considered next. Therefore, these agents convergently avoid adversarial inputs to their currently activated values (e.g. learning), because adversarial inputs would impede fulfillment of those values (e.g. lead to less learning).
Disclaimer: The point isn’t that you get aligned behavior if you assume the AI is aligned. The point is that the value frame naturally expresses aligned behavior. In contrast, grader evaluation seems insane and cannot reasonably express aligned behavior.
Decision-making influences don’t have to be “robust” in order for a person to value doing well at school. Consider two people with slightly different values:
- One person is slightly more motivated by good grades. They might study for a physics test and focus slightly more on test-taking tricks.
- Another person is slightly more motivated by learning. They might forget about some quizzes because they were too busy reading extracurricular physics books.
They might both care about school—in the sense of reliably making decisions on the basis of their school performance and valuing being a person who gets good grades. Both people are motivated to do well at school, albeit in somewhat different ways. They probably will both get good grades and they probably will both learn a lot. Different values simply mean that the two people locally make decisions differently.
If I value candy, that means that my decision-making contains a subroutine which makes me pursue candy in certain situations. Perhaps I eat candy, perhaps I collect candy, perhaps I let children tour my grandiose candy factory… The point is that candy influences my decisions. I am pulled by my choices from pasts without candy to futures with candy.
So, let $\mathcal{C}$ be the set of mental contexts relevant for decision-making, and let $\mathcal{A}$ be my action set.1 My policy has type signature $\pi : \mathcal{C} \to \mathcal{A}$. My policy contains a bunch of shards of value which influence its outputs. The values are subcircuits of my policy network (i.e. my brain). For example, consider a candy shard consisting of the following subshards:
- If center-of-visual-field activates candy’s visual abstraction, then grab the inferred latent object which activated the abstraction.
- If hunger>50 and sugar-level<6, and if current-plan-stub activates candy-obtainable, then tell planning API to set subgoal to obtain candy.
- If heard 'candy' and hunger>20, then salivate.
- …
Suppose this is the way I value candy. A few thousand subshards which chain into the rest of my cognition and concepts. A few thousand subshards of value which were hammered into place by tens of thousands of reinforcement events across a lifetime of experience.
This shard does not need to be “robust” or “perfect.” Am I really missing much if I’m lacking candy subshard #3: “If heard 'candy' and hunger>20, then salivate”? I don’t think it makes sense to call value shards “perfect” or not.2 The shards simply influence decisions.
Many, many configurations and parameter settings of these subshards lead to valuing candy. The person probably still values candy, even if you:
- Delete a bunch of the subshards.
- Modify the activation-strength of a bunch of subshards (roughly, change how much control the subshard has on the next-thought “logits”).
- Change some of the activation contexts to other common activation contexts (e.g. “If heard 'candy'” changes to “If heard 'sweets'”, or the threshold changes to hunger>14 in subshard 3).
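The subshard picture can be made concrete with a minimal sketch. Everything here is a hypothetical illustration rather than a claim about how brains or trained networks implement shards: the `Subshard` class, the context keys, and the numeric strengths are all assumptions, and only the thresholds come from the subshard list earlier in this section. Each subshard contextually adds to the "logit" of an action, and deleting one subshard changes behavior in a few contexts while candy-seeking persists in the others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, float]  # hypothetical mental context: feature name -> value


@dataclass
class Subshard:
    """One contextual decision-influence: if `condition` holds, upweight `action`."""
    condition: Callable[[Context], bool]
    action: str
    strength: float  # assumed amount this subshard adds to the action's logit


def candy_shard() -> List[Subshard]:
    # Three illustrative subshards mirroring the essay's list (strengths are made up).
    return [
        Subshard(lambda c: c.get("candy_in_view", 0) > 0, "grab", 2.0),
        Subshard(
            lambda c: c.get("hunger", 0) > 50
            and c.get("sugar_level", 0) < 6
            and c.get("candy_obtainable", 0) > 0,
            "set_subgoal_obtain_candy",
            3.0,
        ),
        Subshard(
            lambda c: c.get("heard_candy", 0) > 0 and c.get("hunger", 0) > 20,
            "salivate",
            1.0,
        ),
    ]


def action_logits(shard: List[Subshard], context: Context) -> Dict[str, float]:
    """Sum the bids of every subshard that activates in this context."""
    logits: Dict[str, float] = {}
    for sub in shard:
        if sub.condition(context):
            logits[sub.action] = logits.get(sub.action, 0.0) + sub.strength
    return logits


full = candy_shard()
without_salivate = full[:2]  # delete subshard #3

hungry_near_candy = {"hunger": 60, "sugar_level": 3, "candy_obtainable": 1}
print(action_logits(full, hungry_near_candy))
print(action_logits(without_salivate, hungry_near_candy))
```

In the hungry-near-candy context both versions of the shard bid for the same subgoal. The deleted subshard only matters in contexts where hearing the word "candy" was the sole trigger.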
It seems to me like “does the person still prioritize candy” depends on a bunch of factors, including:
- Retention of core abstractions (to some tolerance)
- If we find-replaced candy with flower, the person probably now has a strange flower-value, where they eat flowers when hungry.
- However, the abstraction also doesn’t have to be “perfect” (whatever that means) in order to activate in everyday situations. Two people will have different candy abstractions, and yet they can both value candy.
 
- Strength and breadth of activation contexts
- The more situations a candy-value affects decision-making in, the stronger the chance that candy remains a big part of their life.
 
- How often the candy shard will actually activate
- As an unrealistic example, if the person never enters a cognitive situation which substantially activates the candy-shard, then don’t expect them to eat much candy.
- Activation frequency is another source of value / decision-influence robustness, as e.g. an AI’s values don’t have to be OK in every cognitive context.3
- Consider an otherwise altruistic man who has serious abuse and anger problems whenever he enters a specific vacation home with his wife, but is otherwise kind and considerate. As long as he doesn’t start off in that home but knows about the contextual decision-influence, he will steer away from that home and try to remove the unendorsed value.
 
- Reflectivity of the candy shard
- (This is more complicated and uncertain. I’ll leave it for now.)
 
Suppose we wanted to train an agent which gets really smart and acquires a lot of candy, now and far into the future. That agent’s decision-influences don’t have to be globally robust (e.g. in every cognitive situation, the agent is motivated by candy and only by candy) in order for an agent to make locally good decisions (e.g. make lots of candy now and into the future).
Given someone’s values, you might wonder if you can “maximize” those values. On my ontology—where values are decision-influences, a sort of contextual wanting—“literal value maximization” is a type error.4 In particular, given e.g. someone who values candy, there probably isn’t a part of that person’s cognition which can be argmax’ed to find a plan where the person has lots of candy.
So if I have a candy-shard, if I value candy, if I am influenced to decide to pursue candy in certain situations, then what does it mean to maximize my candy value? My value is a subcircuit of my policy. It doesn’t necessarily even have an ordering over its outputs, let alone a numerical rating which can be maximized. “Maximize my candy-value” is, in a literal sense, a type error. What quantity is there to maximize?
the True Name of a thing [is] a mathematical formulation sufficiently robust that one can apply lots of optimization pressure without the formulation breaking down[…]
If we had the “True Name” of human values (insofar as such a thing exists), that would potentially solve the problem [of supervised labels only being proxies for what we want].
In particular, there’s no guarantee that you can just scan someone’s brain and find some True Name of Value which you can then optimize without fear of Goodhart. It’s not like we don’t know what people value, but if we did, we would be OK. I’m pretty confident there does not exist anything within my brain which computes a True Name for my values, ready to be optimized as hard as possible (relative to my internal plan ontology) and yet still producing a future where I get candy.
Therefore, even though you truly care about candy, that doesn’t mean you can just whip out the argmax on the relevant shard of your cognition, so as to “maximize” that shard (e.g. via extremizing the rate of action potentials on its output neurons) and then get a future with lots of candy. You’d probably just find a context $c$ which acts as an adversarial input to the candy-shard, even though you do really care about candy in a normal, human way.
Complexity of human values isn’t what stops you from argmax’ing human values and thereby finding a good plan. That’s not a sensible thing to try. Values are not, in general, the kind of thing which can be directly optimized over, where you find plans which “maximally activate” your e.g. candy subshards. Values influence decisions.
Motivating an AI carries real peril in making sure its decisions chain into each other towards the right kinds of futures. If you train a superintelligent sovereign agent which primarily values irrelevant quantities (like paperclips) but doesn’t care about you, which then optimizes the whole future hard, then you’re dead. But consider that deleting candy subshard #3 (“If heard 'candy' and hunger>20, then salivate”) doesn’t stop someone from valuing candy in the normal way. If you erase that subshard from their brain, it’s not like they start “Goodharting” and forget about the “true nature” of caring about candy because they now have an “imperfect proxy shard.”
An agent argmax’ing an imperfect evaluation function will indeed exploit that function. When specifying an inexploitable evaluation function, you enjoy few degrees of freedom. But that’s because that evaluation function must be globally robust.
When I talk about shard theory, people often seem to shrug and go “well, you still need to get the values adversarially robustly correct else Goodhart; I don’t see how this ‘value shard’ thing helps.” That’s not how values work, that is not what value-shards are. Unlike grader-optimizers which try to maximize plan evaluations, a values-executing agent doesn’t optimize its values as hard as possible. The agent’s values optimize the world. The values are rules for how the agent acts in relevant contexts.5
Question: If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?
Maybe you can now answer this question. I encourage you to try before moving on.
Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.
- Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as “working hard” and “behaving well.”
- Value-child: The mother makes her kid care about working hard and behaving well.
To make evaluation-child work hard, we have to somehow specify a grader which can adequately grade all plans which evaluation-child can imagine. The highest-rated imaginable plan must involve working hard. This requirement is extreme.
Value-child doesn’t suffer this crippling “robustly grade exponentially many plans” alignment requirement. I later wrote a detailed speculative account of how value-child’s cognition might work—what it means to say that he “cares about working hard.” But, at a higher level, what are the main differences between evaluation- and value-child?
This may sound obvious, but I think that the main difference is that value-child actually cares about working hard. Evaluation-child cares about evaluations. (See here if confused on the distinction.) To make evaluation-child work hard in the limit of intelligence, you have to robustly ensure that max evaluations only come from working hard. This sure sounds like a slippery and ridiculous kind of thing to try, like wrestling a frictionless pig. It should be no surprise you’ll hit issues like nearest unblocked strategy in that paradigm.
An agent which does care about working hard will want to not think thoughts which lead to not working hard. In particular, reflective shard-agents can think about what to think, and thereby are convergently (across values) incentivized to steer clear of adversarial inputs to their own values.
Reflective agents can think about their own thought process (e.g. “should I spend another five minutes thinking about what to write for this section?”). I think they do this via their world-model predicting internal observables (e.g. future neuron activations) and thus high-level statistics like “If I think for 5 more minutes, will that lead to a better post or not?”.
Thoughts about future thinking are a kind of decision. Decisions are steered by values. Therefore, thoughts about future thinking are steered by whatever value shards activate in that mental context. For example, a self-care value might activate, and a learning-shard, and a social value might activate as well. They control your reflective thoughts, just like other shards would control your (“normal”) actions (like crossing the room).
[In] the optimizer’s curse, evaluations (e.g. “In this plan, how hard is evaluation-child working? Is he behaving?”) are often corrupted by the influence of unendorsed factors (e.g. the attractiveness of the gym teacher caused an upwards error in the mother’s evaluation of that plan). If you make choices by considering $n$ options and then choosing the highest-evaluated one, then the more $n$ increases, the harder you are selecting for upwards errors in your own evaluation procedure.
The proposers of the Optimizer’s Curse also described a Bayesian remedy in which we have a prior on the expected utilities and variances and we are more skeptical of high estimates. This however assumes that the prior itself is perfect, as are our estimates of variance. If the prior or variance-estimates contain large flaws somewhere, a search over a wide space of possibilities would be expected to seek out and blow up any flaws in the prior or the estimates of variance.
As far as I know, it’s indeed not possible to avoid the curse in full generality, but it doesn’t have to be that bad in practice. If I’m considering three research directions to work on next month, and I happen to be grumpy when considering direction #2, then maybe I don’t pursue that direction, even though direction #2 might have seemed the most promising under more careful reflection. I think that the distribution of plans I consider involves relatively small upwards errors in my internal evaluation metrics. Sure, maybe I occasionally make a serious mistake because the optimizer’s curse selects for upwards “corruption”, but I don’t expect to literally die from the mistake.
Thus, there are degrees to the optimizer’s curse.
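How the strength of the curse scales with the number of options considered can be illustrated with a small simulation. This is a sketch under assumed conditions, not a model of any real agent: every option has the same true value of zero, each evaluation adds independent standard Gaussian noise, and the function name `mean_selected_error` is hypothetical. Under those assumptions, the average upwards error of the highest-evaluated option grows as more options are considered.

```python
import random
import statistics


def mean_selected_error(n_options: int, trials: int = 500, seed: int = 0) -> float:
    """Average evaluation error of the option chosen by argmax over noisy evaluations.

    Assumption: all options have true value 0, so the winning evaluation
    is pure upwards error.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        evaluations = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        errors.append(max(evaluations))  # true value is 0, so error == evaluation
    return statistics.fmean(errors)


# Considering a handful of options yields a small average error;
# considering very many options selects much harder for upwards noise.
for n in (1, 3, 100, 1000):
    print(n, round(mean_selected_error(n), 2))
```

Choosing among three research directions sits near the mild end of this range, while an argmax over an enormous plan space sits at the extreme end.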
Both grader-optimization and argmax cause extreme, horrible optimizer’s curse. Is alignment just that hard? I think not.
The distribution of plans which I usually consider is not going to involve any set of mental events like “consider in detail building a highly persuasive superintelligence which persuades you to build it.” While a reflective diamond-valuing AI which did execute the plan’s mental steps might get hacked by that adversarial input, there would be no reason for it to seek out that plan to begin with.
The diamond-valuing AI would consider a distribution of plans far removed from the extreme upwards errors highlighted in the evaluation-child story. (I think that this is why you, in your day-to-day thinking, don’t have to worry about plans which are extreme adversarial inputs to your own evaluation procedures.) Even though a smart reflective AI may be implicitly searching over a range of plans, it’s doing so reflectively, thinking about what to think next, and perhaps not taking cognitive steps which it reflectively predicts to lead to bad outcomes (e.g. via the optimizer’s curse).
On the other hand, an AI which is aligned on the evaluation procedure is incentivized to seek out huge upwards errors on the evaluation procedure relative to the intended goal. The actor is trying to generate plans which maximally exploit the grader’s reasoning and judgment.6
Thus, if an AI cares about diamonds (i.e. has an influential diamond-shard), that AI might accidentally select a plan due to upwards evaluative noise, but that does not mean the AI is actively looking for plans to fool its diamond-shard into oblivion. The AI may make a mistake in its reflective predictions, but there’s no extreme optimization pressure for it to make mistakes like that. The AI wants to avoid those mistakes, so those mistakes remain unlikely. I think that reflective, smart AIs convergently want to avoid duping their own evaluative procedures, for the same reasons you want to avoid doing that to yourself.
More precisely:
- A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to.
- Consider plans where the agent just improves its diamond synthesis methods. The agent can predict e.g. whether diamond production is increased by searching for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds.
- A reflective agent knows that simulating those superintelligences doesn’t lead to many diamonds.
- Therefore, the reflective agent chooses to think about improving its diamond synthesis methods. The agent automatically7 avoids the worst parts of the optimizer’s curse. In contrast, grader-optimization seeks out adversarial inputs to the diamond-motivated part of the system.
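The contrast in the steps above can be sketched as a toy comparison. This is a minimal illustration under loud assumptions: the plan names, the numbers, the `PLAN_SPACE` grouping, and both function names are hypothetical, and real reflective planning is certainly not a two-entry lookup. The grader-optimizer takes an argmax of the evaluation over every plan. The reflective agent first chooses which search to run based on how many diamonds it predicts that search leads to, and only then evaluates plans within it.

```python
from typing import Dict, List, NamedTuple


class Plan(NamedTuple):
    name: str
    true_diamonds: float  # what actually happens if the plan is executed
    evaluation: float     # what the evaluator outputs when shown the plan


# Hypothetical plan space, grouped by the kind of search that would surface each plan.
PLAN_SPACE: Dict[str, List[Plan]] = {
    "improve_synthesis": [
        Plan("tune_pressure", 10.0, 11.0),
        Plan("new_catalyst", 30.0, 28.0),
    ],
    "simulate_malign_superintelligence": [
        # Adversarial input: the evaluator is fooled into a huge score.
        Plan("get_persuaded", 0.0, 1e9),
    ],
}


def grader_optimizer() -> Plan:
    """Argmax the evaluation over every plan, including adversarial ones."""
    all_plans = [p for plans in PLAN_SPACE.values() for p in plans]
    return max(all_plans, key=lambda p: p.evaluation)


def reflective_agent(predicted_diamonds_from_search: Dict[str, float]) -> Plan:
    """Choose which search to run by predicted diamonds, then search only there."""
    search = max(predicted_diamonds_from_search, key=predicted_diamonds_from_search.get)
    return max(PLAN_SPACE[search], key=lambda p: p.evaluation)


# Assumed reflective predictions: the agent expects the simulation search to end badly.
predictions = {"improve_synthesis": 25.0, "simulate_malign_superintelligence": 0.0}

print(grader_optimizer().name)
print(reflective_agent(predictions).name)
```

The reflective agent can still make small evaluative errors inside the search it chose, but it never feeds the adversarial plan to its own evaluator, because it predicts that doing so leads to fewer diamonds.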
Therefore, avoiding the high-strength curse seems conceptually straightforward. In the case of aligning an AI to produce lots of diamonds, we want the AI to superintelligently generate and execute diamond-producing plans because the AI expects those plans to lead to lots of diamonds. I have spelled out a plausible-to-me story for how to accomplish this. The story is simple in its essential elements: finetune a pretrained model by rewarding it when it collects diamonds.
While that story has real open questions, that story also totally sidesteps the problems with grader-optimization. You don’t have to worry about providing some globally unhackable evaluation procedure to make super duper sure the agent’s plans “really” involve diamonds. If the early part of training goes as described, the agent wants to make diamonds, and (as I explained in the diamond-alignment story) it reflectively wants to avoid duping itself because duping itself leads to fewer diamonds.
We thereby answer the question posed above:
Question: If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?
A reflective agent wishes to minimize the optimizer’s curse (relative to its own values), instead of maximizing it (relative to the goal by which the grader evaluates plans). While I don’t yet have satisfying pseudocode for reflective planning agents (but see Appendix B for preliminary pseudocode), effective reflective agents do exist. In this regime, it seems like many scary problems go away and don’t come back. That is an enormous blessing.8
If the answer to “how do we dispel the max-strength optimizer’s curse” is in fact “real-world reflective agents do this naturally”, then assuming unreflectivity will rule out the part of solution-space containing the actual solution:
Diamond Maximizer, Arbital (emphasis added):
As a further-simplified but still unsolved problem, an unreflective diamond maximizer is a diamond maximizer implemented on a Cartesian hypercomputer in a causal universe that does not face any Newcomblike problems. This further avoids problems of reflectivity and logical uncertainty. In this case, it seems plausible that the primary difficulty remaining is just the ontology identification problem.
Insight: The argmax and unreflectivity assumptions were meant to make the diamond-maximizer problem easier. Ironically, however, these assumptions may well render the diamond-maximizer problem unsolvable, leading us to resort to increasingly complicated techniques and proposals, none of which seem to solve “core” problems like evaluation-rule hacking…
- Non-robust decision-influences can be OK.
- Values steer optimization; they are not optimized against.
- Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values.
The answer is not to find a clever way to get a robust grader. The answer is to not need a robust grader. Form e.g. a diamond-production value within a reflective and smart agent, and this diamond-production value won’t be incentivized to fool itself. You won’t have to “robustly grade” it to make it produce diamonds.
Thanks to Tamera Lanham, John Wentworth, Justis Mills, Erik Jenner, Johannes Treutlein, Quintin Pope, Charles Foster, Andrew Critch, randomwalks, Ulisse Mini, and Garrett Baker for thoughts. Thanks to Vivek Hebbar for in-person discussion.
alex@turntrout.com
- Uncertainty about how human values work. Suppose we think that human values are so complex, and there’s no real way to understand them or how they get generated. We imagine a smart AI as finding futures which optimize some grading rule, and so we need something to grade those futures. We think we can’t get the AI to grade the futures, because human values are so complex. What options remain available? Well, the only sources of “good judgment” are existing humans, so we need to find some way to use those humans to target the AI’s powerful cognition. We give the alignment, the AI gives the cognitive horsepower. We’ve fallen into the grader-optimization trap.
- Non-embedded forms of agency. This encourages considering a utility function maximized over all possible futures, which automatically brings the optimizer’s curse to bear at maximum strength. You can’t specify a utility function which is robust against that. This seems sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation / universe histories.”
I briefly took a stab at writing pseudocode for a values-based agent like value-child. I think this code leaves a lot out, but I figured it’d be better to put something here for now.
def plan(self):
    '''
    Here is one meta-plan for planning. The agent starts from no plan at all and iteratively generates improvements, which get accepted if they lead to more predicted diamonds. Depending on the current situation (as represented in the world_model), the agent might execute a plan in which it looks for nearby diamonds, or it might execute a plan where it runs a different kind of heuristic search on a certain class of plans (e.g. research improvements to the AI's diamond synthesis pathway).

    Any real shard agent would be reasoning and updating asynchronously, so this setup assumes a bit of unrealism.

    This function only modifies the internal state of the agent (self.recurrent), so as to be ready for a call of self.getDecisions().
    '''
    # Generate an initial plan
    plan = Plan()  # Do-nothing plan
    conseq = self.world_model.getConseq(plan)
    currentPlanEval = self.diamondShard(conseq)

    # Iteratively modify the plan until the generative model can't find a way to make it better
    while True:
        # Sample 5 plan modifications from the generative model
        plans = self.world_model.planModificationSample(n=5, stub=plan)

        # Select the first local improvement
        for planMod in plans:
            # Reflectively predict the consequences of the modified plan
            newPlan = plan.modify(planMod)
            conseq = self.world_model.getConseq(newPlan)
            newPlanEval = self.diamondShard(conseq)

            # Take the local improvement
            if newPlanEval > currentPlanEval:
                plan = newPlan
                currentPlanEval = newPlanEval
                break  # Leave the for loop and generate more modifications
        else:
            # No sampled modification improved the plan, so execute it. Execution
            # possibly involves running plan search with a different algorithm
            # and plan initialization.
            isDone = plan.exec()
            if isDone:
                break

I liked Vivek Hebbar’s recent comment in the context of e.g. caring about your family and locally evaluating plans on that basis, but also knowing that your evaluation ability itself is compromised and will mis-rate some plans:
Vivek Hebbar’s recent comment: My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:
- The agent’s allegiance is to some idealized utility function $U_{\text{ideal}}$ (like CEV). The agent’s internal evaluator Eval is “trying” to approximate $U_{\text{ideal}}$ by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans.” Eval reasons that, due to the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on $U_{\text{ideal}}$. Hence, Eval concludes that $U_{\text{ideal}}(\text{plan})$ is low, where plan = “do argmax w.r.t. Eval.” So the agent doesn’t execute the plan “search widely and argmax.”
- “Improving Eval” makes sense because Eval will gladly replace itself with Eval_2 if it believes that Eval_2 is a better approximation for $U_{\text{ideal}}$ (and hence replacing itself will cause the outcome to score better on $U_{\text{ideal}}$).
Are there other distinct frameworks which make sense here?
(I’m not sure whether Vivek meant to imply “and this is how I think people work, mechanistically.” I’m going to respond to a hypothetical other person who did in fact mean that.)
My take is that human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. “If food nearby and hunger>15, then be more likely to go to food”) and then also a sophisticated utility function (e.g. something like CEV). It’s shards all the way down, and all the way up.9
Vivek then wrote:
I look forward to seeing what design Alex proposes for “value child.”
Value shards steer cognition. In the main essay, I wrote:
Quote
- A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to.
- The agent can predict e.g. how diamond-promising it is to search for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds, versus plans where the agent just improves its synthesis methods.
- A reflective agent knows that the first plan doesn’t lead to many diamonds, while the second plan leads to more diamonds.
- Therefore, the reflective agent chooses the second plan over the first plan, automatically avoiding the worst parts of the optimizer’s curse. (Unlike grader-optimization, which seeks out adversarial inputs to the diamond-motivated part of the system.)
This story smoothly accommodates thoughts about improving evaluation ability.
On my understanding: Your values are steering the optimization. They are not, in general, being optimized against by some search inside of you. They are probably not pointing to some idealized utility function. The decision-influences are guiding the search. There’s no secret other source of caring, no externalized utility function.
- 
Formalizing the action space is a serious gloss. In people, there is no privileged “action” space, considering how I can decide what to think about next. As an embedded agent, I don’t just decide what motor commands to send and what words to say; I also can decide what to decide next, what to think about next. I think the point of the essay stands anyways.
- 
This doesn’t mean that I’m using words the same way other people have, when deliberating on whether an AI’s values have to be “robust.” I’m more inclined to just carry out the shard theory analysis and see what experiences it leads me to anticipate, instead of arguing about whether my way of using words matches up with how other people have used words.
- 
As Charles Foster notes:
This isn’t entirely on the side of robustness. It also means that by default, even if we get the AI to have an X decision influence in one context, that doesn’t necessarily also activate in another context we might want it to generalize to.
- 
I think that many people say “maximize my values” to mean something like “do something as great as possible, relative to what I care about.” So, in a sense, “type error” is pedantic. But also I think the type error complaint points at something important, so I’ll say it anyways.
- 
If you want to argue that decision-influences have to be robust else Goodhart, you need new arguments not related to grader-optimization. It is simply invalid to say “The agent doesn’t value diamonds in some situation where lots of its values activate strongly, and therefore the agent won’t make diamonds because it Goodharts on that unrelated situation.” That is not what values do.
- 
In a recent Google Doc thread, grader optimization came up. Here’s the exchange (splicing in my responses to each section):
Other person: So you’re imagining something like: the agent (policy) is optimizing for a reward model to produce a high number, and so the agent analyzes the reward model in detail to search for inputs that cause the reward model to give high numbers?
Me: Yes.
Other person: I think at that level of generality I don’t know enough to say whether this is good or bad.
Me: I think this is very, very probably bad.
Other person: We want our AI system to search for approaches that better enact our values. As you note, the optimizer’s curse says that we’ll tend to get approaches that overestimate how much they actually enact our values. But just knowing that the optimizer’s curse will happen doesn’t change anything; the best course of action is still to take the approach that is predicted to best enact our values. In that sense, the optimizer’s curse is typically something you have to live with, not something you can solve. […]
Me: A small error is not the same as a maximal error. Reflective agents can and will avoid deliberately searching for plans which maximize upwards errors in their own evaluations (e.g. generating a plan such that, while considering the plan, a superintelligence inside the plan tricks you into thinking the plan should be highly evaluated), because the agents reflectively predict that that hurts their goal achievement (e.g. leads to fewer diamonds). If you somewhat understand your decision-making, you can consider plans you’re less likely to incorrectly evaluate.
Other person: So my followup question is: can you name a single approach that doesn’t have this failure mode, while still allowing us to use the AI to do things we didn’t think about in advance?
Me: Yes.
Other person: One answer someone might give is “create an agent-with-shards that searches for approaches that score highly on the shard.
Me: Insofar as this means “the agent looks for inputs which maximize the aggregate shard output”, no. On my model, shards grade and modify plans, including plans about which plans to consider next. They are not searching for plans which maximize evaluative output, like in the reward-model case.
Other person: An agent that searches for high scores on its shard can’t be searching for positive upwards errors in the shard; there is no such thing as an error in the shard.” To which the response is “that’s from the agent’s perspective. From the human’s perspective, the agent is searching for positive upwards differences between the shards and what-the-human-wants.”
Me: Even if true, this would not be an optimizer’s curse problem. But also this isn’t true, at least not without further argumentation. If my kid likes mocha and I like latté, is my child searching for positive upwards differences between their values and mine? I think there are some situations like that. For example, suppose the AI values paperclips while the humans value love. In this situation, the AI is searching for paperclippish plans which will systematically be bad plans by human lights. That seems more like instrumental convergence → disempower humans → not much love left for us if we’re dead.
I do think that e.g. a diamond-shard can get fed an adversarial input, but the diamond-shard won’t bid for a plan where it fools itself.
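To make “a small error is not the same as a maximal error” concrete, here is a toy Python sketch. Everything in it is a made-up illustration rather than anything from the exchange: the `Plan` class, the plan names and numbers, and especially the `exploits_evaluator` flag, which stands in for the agent’s reflective prediction that a plan would fool its own evaluation. The sketch assumes the agent can make that prediction at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    true_diamonds: float      # what the plan would actually yield
    evaluation_error: float   # how far the agent's evaluation overshoots
    exploits_evaluator: bool  # hypothetical reflective prediction: "this plan fools me"

    @property
    def evaluated_diamonds(self) -> float:
        return self.true_diamonds + self.evaluation_error


# Hypothetical plans: three ordinary ones with small errors, one adversarial input.
PLANS = [
    Plan("mine diamonds", 10.0, 0.5, False),
    Plan("build diamond factory", 30.0, 2.0, False),
    Plan("buy diamonds", 20.0, -1.0, False),
    Plan("adopt a plan written to fool the evaluator", 0.0, 1000.0, True),
]


def grader_optimizer_choice(plans):
    """Search for whichever input maximizes evaluative output."""
    return max(plans, key=lambda p: p.evaluated_diamonds)


def reflective_choice(plans):
    """Decline to consider plans predicted to exploit the agent's own
    evaluation, then pick the best-evaluated plan among the rest."""
    considered = [p for p in plans if not p.exploits_evaluator]
    return max(considered, key=lambda p: p.evaluated_diamonds)
```

In this toy setup the grader-optimizer lands on the adversarial plan (maximal upwards error, zero diamonds), while the reflective agent still overestimates its chosen plan, but only slightly. The filter is doing all the work here, and it is only as good as the agent’s reflective predictions.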
It’s at this point that my model of Nate Soares wants to chime in. 
Alex’s model of Nate: This sure smells like a problem redefinition in which you simply sweep the hard part of the problem under a less obvious corner of the rug. Why shouldn’t I believe you’ve just done that?
Alex: A reasonable and productive heuristic in general, but inappropriate here. Grader-optimization explicitly incentivizes the agent to find maximal upwards errors in a diamond-evaluation module, whereas a reflective diamond-valuing agent has no incentive to consider such plans, because it reflectively predicts those plans don’t lead to diamonds. If you disagree, please point to the part of the story where, conditional on the previous part of the story obtaining, the grader-optimization problem reappears.
Alex’s model of Nate: Suppose we achieved your dream of forming a diamond-shard in an AI, and that that shard holds significant power over the AI’s decisions. Now the AI keeps improving itself. Doesn’t “get smarter” look a lot like “implicitly consider more options”, which brings the curse back?
Alex: If the agent is diamond-aligned at this point in time, I expect it stays that way for the reasons given in the “agent prevents value drift” section, along with this footnote and the appendix. As a specific answer, though: If the agent does care about diamonds at that point in time, then it doesn’t want to get so “smart” that it deludes itself by seriously intensifying the optimizer’s curse. It doesn’t want to do so for the reason we don’t want it to do so (in the hypothetical where we just want to achieve diamond-alignment). If the reflective agent can predict that outcome of the plan, it won’t execute the plan, because that plan leads to fewer diamonds.
Alex’s model of Nate: So the AI still has to solve the AI alignment problem, except with its successors.
Alex: Not all things which can be called an “AI alignment problem” are created equal. The AI has a range of advantages, and I detailed one way it could use those advantages.
I do expect that kind of plan to actually work. 
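The worry that “implicitly consider more options” brings the curse back can be sketched numerically. The following is a minimal illustration under strong assumptions that are mine, not the dialogue’s: every option is truly worth 0, evaluation noise is standard Gaussian, and the agent simply picks the best-looking option. The function name and search widths are hypothetical.

```python
import random


def selected_overestimate(num_options: int, seed: int = 0) -> float:
    """Overestimate of the best-looking option among `num_options` candidates.

    Every option has true value 0, so the chosen option's evaluation is
    pure error. Reusing the seed means a wider search sees the narrower
    search's options plus additional ones.
    """
    rng = random.Random(seed)
    evaluations = [rng.gauss(0.0, 1.0) for _ in range(num_options)]
    return max(evaluations)


search_widths = [1, 10, 100, 1000]
overestimates = [selected_overestimate(n) for n in search_widths]
```

Under these assumptions the selected option’s overestimate never shrinks as the search widens, which is the mechanism behind the objection. The reply above does not dispute that mechanism; it says a reflective diamond-valuing agent that can predict this outcome has its own reasons not to widen its search in ways that lead to fewer diamonds.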
When working out shard theory with Quintin Pope, one of my favorite moments was the click where I stopped viewing myself as some black-box optimizing “some complicated objective.” Instead, this hypothesis reduced my own values to mere reality. Every aspiration, every unit of caring, every desire for how I want the future to be bright and fun—subroutines, subshards, contextual bits of decision-making influence, all traceable to historical reinforcement and update events.
