Table of contents
- Reinforcement Learning
- 1: Introduction
- 2: Multi-armed Bandits
- 3: Finite Markov Decision Processes
- 4: Dynamic Programming
- 5: Monte Carlo Methods
- 6: Temporal-Difference Learning
- 7: n-step Bootstrapping
- 8: Planning and Learning with Tabular Methods
- 9: On-policy Prediction with Approximation
- 10: On-policy Control with Approximation
- 11: Off-policy Methods with Approximation
- 12: Eligibility Traces
- 13: Policy Gradient Methods
- 14: Psychology
- 15: Neuroscience
- 16: Applications and Case Studies
- 17: Frontiers
- Final Thoughts
- Forwards
Reinforcement Learning: An Introduction

> The safety of artificial intelligence applications involving reinforcement learning is a topic that deserves careful attention.
Bandit basics, including non-stationarity, the value of optimism for incentivizing exploration, and upper-confidence-bound action selection.
Some explore / exploit results are relevant to daily life—I highly recommend reading Algorithms to Live By: The Computer Science of Human Decisions.
The framework.
Policy evaluation, policy improvement, policy iteration, value iteration, generalized policy iteration. What a nomenclative nightmare.
Prediction, control, and importance sampling.
After gathering data with our behavior policy $b$, we then want to approximate the value function $v_\pi$ for the target policy $\pi$. In off-policy methods, the policy we use to gather the data is different from the one whose value we’re trying to learn; in other words, the distribution of states we sample is different. This gives us a skewed picture of $v_\pi$, so we must overcome this bias.
If $b$ can take all of the actions that $\pi$ can (i.e., $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$), we can overcome this by adjusting the return of taking a series of actions using the importance-sampling ratio
$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
This cleanly recovers $v_\pi$ by the definition of expectation:
$$\mathbb{E}\left[\rho_{t:T-1} G_t \mid S_t = s\right] = v_\pi(s).$$
Then after observing some set of returns (where $\mathcal{T}(s)$ indexes the relevant returns for state $s$), we define the state’s value estimate as the average adjusted observed return
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}.$$
However, the $\rho_{t:T(t)-1}$’s can be arbitrarily large (if $\pi$ assigns much more probability to the observed actions than $b$ does, the ratio blows up), so the variance of this estimator can get pretty big. To get an estimator whose variance converges to 0, try weighted importance sampling:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}.$$
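Here’s a toy sketch of the two estimators for a single state (my own code, not the book’s; `returns` holds the observed returns $G_t$ and `rhos` the corresponding importance-sampling ratios):

```python
import numpy as np

def ordinary_importance_sampling(returns, rhos):
    # Unbiased, but the unbounded rhos can make the variance huge.
    returns, rhos = np.asarray(returns), np.asarray(rhos)
    return (rhos * returns).sum() / len(returns)

def weighted_importance_sampling(returns, rhos):
    # Biased for finite samples, but its variance converges to zero.
    returns, rhos = np.asarray(returns), np.asarray(rhos)
    return (rhos * returns).sum() / rhos.sum()

# Toy data: returns gathered under the behavior policy b, each paired with
# the product of pi/b probability ratios along its trajectory.
G, rho = [1.0, 0.0, 2.0], [0.5, 3.0, 0.1]
print(ordinary_importance_sampling(G, rho), weighted_importance_sampling(G, rho))
```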
Instead of viewing the discount factor $\gamma$ as measuring how much we care about the future, think of it as encoding the probability we don’t terminate on any given step. That is, we expect with probability $1-\gamma$ to die on the next turn, so we discount rewards accordingly.
This intuition hugs pre-theoretic understanding much more closely; if you have just 80 years to live, you might dive that big blue sky. If, however, you imagine there’s a non-trivial chance that humanity can be more than just a flash amongst the stars, you probably care more about the far future.
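To check the intuition with a toy simulation of my own (assuming a constant reward of 1 per step): an episode that pays 1 per step and terminates with probability $1-\gamma$ after each step has expected total reward $1/(1-\gamma)$, exactly the value of the infinite discounted sum $\sum_t \gamma^t$.

```python
import random

def average_return_with_termination(gamma=0.9, episodes=100_000):
    # Undiscounted return of episodes that end w.p. (1 - gamma) after each step.
    total = 0.0
    for _ in range(episodes):
        while True:
            total += 1.0
            if random.random() < 1 - gamma:
                break
    return total / episodes

print(average_return_with_termination())  # close to 1 / (1 - 0.9) = 10
```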
The tabular triple threat: TD(0), SARSA, and Q-learning (each update rule is sketched in code after the list below).
- TD(0) is one-step bootstrapping of state values $V(s)$. It helps you learn the value of a given policy, and is not used for control.
- SARSA is on-policy one-step bootstrapping of the action values $Q(s,a)$ using the quintuples $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
> As in all on-policy methods, we continually estimate $q_\pi$ for the policy $\pi$, and at the same time change $\pi$ toward greediness with respect to $q_\pi$.
- Q-learning is off-policy one-step bootstrapping of action values $Q(s,a)$.
- You take an action using π and then use the maximal action value at the next state in your TD target term.
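Here’s a minimal sketch of the three tabular updates (my own code, not the book’s pseudocode; `V` maps states to value estimates, `Q` maps states to dicts of action values, and `alpha`, `gamma` are the step size and discount factor):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    # TD(0): bootstrap the state value from the next state's current estimate.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # SARSA (on-policy): bootstrap from the action the policy actually took next.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Q-learning (off-policy): bootstrap from the maximal action value at s_next.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
```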
With great branching factors come great biases, and optimistic bias is problematic for Q-learning.
[!math] Definition
The Q-learning update rule for state $S_t$, action $A_t$, and new state $S_{t+1}$ is
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$
Suppose the rewards for all 100 actions at the next state are drawn from a zero-mean distribution. All of these actions have a true (expected) value of 0, but the probability that none of their estimates are greater than .8 after 1 draw each is tiny. The more actions you have, the higher the probability that the maximal estimate is just an outlier. See: regression toward the mean and mutual funds.
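To see the bias numerically, here’s a toy simulation of my own (assuming, for illustration, standard-normal noise around the true value of 0):

```python
import numpy as np

rng = np.random.default_rng(0)
# One noisy estimate per action: 100 actions, repeated over 10,000 trials.
estimates = rng.normal(loc=0.0, scale=1.0, size=(10_000, 100))
# Every action's true value is 0, yet the maximal estimate sits far above it.
print(estimates.max(axis=1).mean())  # roughly 2.5
```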
To deal with this, we toss another Q-learner into the mix; at any given update, one has their value updated and the other greedifies the policy. The double Q-learning scheme works because both Q-learners are unlikely to be biased in the same way. For some reason, this wasn’t discovered until 2010.
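Here’s a minimal sketch of the double Q-learning update described above (my own code; `Q1` and `Q2` are dicts of dicts of action values):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma):
    # Flip a coin: one table gets updated, the other evaluates the greedy action.
    if random.random() < 0.5:
        update, evaluate = Q1, Q2
    else:
        update, evaluate = Q2, Q1
    # The updated table picks the argmax action at the next state...
    a_star = max(update[s_next], key=update[s_next].get)
    # ...but the *other* table supplies its value, so a lucky overestimate in
    # one table isn't automatically trusted by the other.
    target = r + gamma * evaluate[s_next][a_star]
    update[s][a] += alpha * (target - update[s][a])
```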
n-step everything.
Models, prioritized sweeping, expected vs. sample updates, MCTS, and rollout algorithms.
Distribution models include the full range of possible futures and their probabilities. For example, a distribution model for two fair coins assigns probability .25 to each of HH, HT, TH, and TT.
Sample models just let you, well, sample: HH, HT, HT, HT, HH, TT, HH, HT
… You could sample thousands of times and gain a high degree of confidence that the above distribution model is generating the data, but this wouldn’t be your only hypothesis (granted, it might have the lowest Kolmogorov complexity).
Note that a distribution model can also function as a sample model.
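In code, the distinction might look something like this (my own illustration):

```python
import random

# Distribution model: every joint outcome of two fair coins and its probability.
distribution_model = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def sample_model():
    # Sample model: generates outcomes but never exposes their probabilities.
    return "".join(random.choice("HT") for _ in range(2))

# A distribution model can always act as a sample model by sampling from it:
outcomes, probs = zip(*distribution_model.items())
print(random.choices(outcomes, weights=probs, k=8))
print([sample_model() for _ in range(8)])
```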
We finally hit the good stuff: value-function approximators and gradient methods.
The deadly triad of function approximation, bootstrapping, and off-policy training converges to sow divergence and instability in your methods. Metal.
I was told to skip this chapter; the first half was my favorite part of the book.
TD($\lambda$) uses a backwards-view eligibility trace to update features, which I find elegant. Suppose you have some feature extraction function $\mathbf{x}(s)$. Then you apply your TD update not only based on the features relevant at the current state, but also to the time-decaying traces of the features of previously visited states. $\lambda$ sets how quickly this eligibility decay happens; $\lambda = 0$ recovers TD(0), while $\lambda = 1$ recovers Monte-Carlo return methods.
When I was a kid, there was a museum I loved going to—it had all kinds of wacky interactive devices for kids. One of them took sounds and “held them in the air” at a volume which slowly decreased as a function of how long ago the sound was made. State features are sounds, and volume is eligibility.
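Here’s a minimal sketch of the backward view for a linear value function (my own code; `x` and `x_next` stand in for the extracted feature vectors of the current and next states):

```python
import numpy as np

def td_lambda_step(w, z, x, x_next, r, alpha, gamma, lam):
    # One backward-view TD(lambda) update for v(s) ~= w . x(s).
    delta = r + gamma * w @ x_next - w @ x   # TD error at the current step
    z = gamma * lam * z + x                  # decay old traces, add current features
    w = w + alpha * delta * z                # credit recently active features
    return w, z                              # lam=0 gives TD(0); lam=1 approaches Monte Carlo
```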
The policy gradient theorem, REINFORCE, and actor-critic.
Creating a partial mapping between reinforcement learning and psychology.
There was a word I was looking for that “mental model” didn’t quite seem to fit: “the model with respect to which we mentally simulate courses of action.” CFAR’s “inner sim” terminology didn’t quite map either, as to me, that points to the system-in-motion more than that-on-which-the-system-runs. The literature dubs this a cognitive map.
Individual place cells within the hippocampus correspond to separate locations in the environment with the sum of all cells contributing to a single map of an entire environment. The strength of the connections between the cells represents the distances between them in the actual environment. The same cells can be used for constructing several environments, though individual cells’ relationships to each other may differ on a map by map basis. The possible involvement of place cells in cognitive mapping has been seen in a number of mammalian species, including rats and macaque monkeys. Additionally, in a study of rats by Manns and Eichenbaum, pyramidal cells from within the hippocampus were also involved in representing object location and object identity, indicating their involvement in the creation of cognitive maps.
Given my work on whitelisting, I’ve been thinking about why we’re so “object-oriented” in our mental lives. An off-the-cuff hypothesis: to better integrate with the rest of our mental models, the visual system directly links up to our object schema. One such object is then recognized and engraved as a discrete “thing” in our map. Hence, we emotionally “know” that the world “really is” made up of objects, and isn’t just a collection of particles.
> Most of the information used by people for the cognitive mapping of spaces is gathered through the visual channel. (Lynch, 1960)
Lahav and Mioduser’s research somewhat supports this idea, suggesting that blind people not only have lower-fidelity and more declarative (as opposed to procedural / interactive) cognitive maps, they’re also less likely to provide object-to-object descriptions.
Epistemic status: I made a not-obviously-wrong hypothesis and found two pieces of corroborating evidence.
The reward prediction error hypothesis, dopamine, neural actor-critic, hedonistic neurons, and addiction.
From checkers to checkmate.
Temporal abstraction, designing reward signals, and the future of reinforcement learning. In particular, the idea I had for having a whitelist-enabled agent predict expected object-level transitions is actually one of the frontiers: general value functions. Rad.
The rapid pace of advances in AI has led to warnings that AI poses serious threats to our societies, even to humanity itself.
This chapter talks a fair amount about the serious challenges in AI alignment (not sure if you all have heard of that problem), which is heartening.
> As to safety, hazards possible with reinforcement learning are not completely different from those that have been managed successfully for related applications of optimization and control methods.
I’m not sure about that. Admittedly, I’m not as familiar with those solutions, but the challenge here seems to be of a qualitatively different caliber. Conditional on true AI’s achievement, we’d want to have extremely high confidence that it pans out before we flip the switch. The authors acknowledge that:
> it may be impossible for the agent to achieve the designer’s goal no matter what its reward signal is.
I don’t think it’s impossible, but it’s going to be extremely hard to get formal probabilistic guarantees. I mean, if you don’t know an agent’s rationality, you can’t learn their utility function. If you do know their rationality but not their probability distribution over outcomes, you still can’t learn their utility function.
The above constitutes one of the many open problems in alignment theory.
If that’s not enough, there are always the unknown unknowns…
I read the “nearly complete” draft of the second edition. It was pretty good, but I did find most of the exercises either too easy or requiring considerable programming investment to set up the environment described. The former makes sense, as I’ve already taken a course on this, and I’m probably a bit above the introductory level.
Some graphs could have been more clearly designed, and some math in the proof of linear TD(0)’s convergence (p. 206–207) is underspecified:
> In general, $\mathbf{w}_t$ will be reduced toward zero whenever $\mathbf{A}$ is positive definite, meaning $y^\top \mathbf{A} y > 0$ for real vector $y$.
An additional precondition: $y$ can’t be the zero vector.
> For a key matrix of this type, positive definiteness is assured if all of its columns sum to a non-negative number.
Unless I totally missed what “this type” entails, this is false if taken at face value:
A matrix such as
$$\begin{pmatrix} -2 & 5 \\ 5 & -2 \end{pmatrix}$$
has non-negative column sums and is also indefinite, having eigenvalues of 3 and −7.
However, the claim is true in the subtle way they use it—they’re actually showing that since the matrix is symmetric, real, and strictly diagonally dominant with positive diagonal entries, it’s also positive definite. This could be made clearer.
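A quick NumPy check of a matrix with exactly these properties (my own snippet):

```python
import numpy as np

A = np.array([[-2.0, 5.0],
              [5.0, -2.0]])
print(A.sum(axis=0))          # column sums: [3. 3.] -- non-negative
print(np.linalg.eigvalsh(A))  # eigenvalues: [-7. 3.] -- indefinite, not positive definite
```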
In all, reading this book was definitely a positive experience.
I’ll be finishing Analysis II before moving on to Jaynes’s Probability Theory in preparation for a class in the fall.
Recently, OpenAI made waves with their OpenAI Five Dota 2 bot. To reinforce what I just learned and solidified, I might make a post in the near future breaking down how Five differs from the Alpha(Go) Zero approach, quantifying my expectations for The International for calibration.
Four months and one week ago, I started my journey through the MIRI reading list. In those dark days, attempting a proof induced a stupor similar to that I encountered approaching a crush in grade school, my words and thoughts leaving me.
Six textbooks later and with a little effort, I’m able to prove things like the convergence of Monte Carlo integration to the Riemann integral, threading together lessons from All of Statistics and Analysis I; target in mind, words and thoughts remaining firmly by my side.
> The rapid pace of advances in artificial intelligence has led to warnings that artificial intelligence poses serious threats to our societies, even to humanity itself. The renowned scientist and artificial intelligence pioneer Herbert Simon anticipated the warnings we are hearing today…
>
> He spoke of the eternal conflict between the promise and perils of any new knowledge, reminding us of the Greek myths of Prometheus, the hero of modern science, who stole fire from the gods for the benefit of mankind, and Pandora, whose box could be opened by a small and innocent action to release untold perils on the world.
>
> Simon urged us to recognize that as designers of our future and not mere spectators, the decisions we make can tilt the scale in Prometheus’ favor.