Skip to main content

Introduction to Reinforcement Learning

In the previous articles, we introduced how AlphaGo uses supervised learning to learn from human game records. But supervised learning has a fundamental limitation: it can only imitate humans, not surpass them.

To enable AI to surpass humans, we need a different learning method—Reinforcement Learning (RL).

This article will guide you through understanding the core concepts of reinforcement learning from scratch, laying the foundation for self-play and MCTS integration.


What is Reinforcement Learning?

Comparison with Other Learning Methods

Machine learning has three main paradigms:

ParadigmLearning MethodExamples
Supervised LearningLearn from labeled dataImage classification, next move prediction
Unsupervised LearningDiscover structure from unlabeled dataClustering, dimensionality reduction
Reinforcement LearningLearn from interaction experiencePlaying games, robot control

The uniqueness of reinforcement learning lies in: no one tells you what the correct answer is; you must discover it yourself through trial and error.

An Intuitive Example

Imagine you're teaching a dog a new trick:

  1. The dog performs some action (possibly random)
  2. If the action is correct, you give it a treat (positive reward)
  3. If the action is wrong, you don't give a treat or say "no" softly (negative or zero reward)
  4. After many attempts, the dog learns which actions bring rewards

This is the essence of reinforcement learning: learning how to act through reward signals.

Reinforcement Learning in Go

In Go:

  • Each move is an "action"
  • At the end of the game, the win/loss is the "reward"
  • The AI needs to learn: which moves ultimately lead to victory?

But there's a huge challenge here: delayed reward. A game may last over 200 moves, but you only know the outcome at the end. When you play a move at move 50, how do you know how much it contributed to the final result?

This is one of the core problems in reinforcement learning, called the Credit Assignment Problem.


Core Concepts

Agent and Environment

The basic architecture of reinforcement learning includes two main characters:

Agent:

  • The entity that makes decisions
  • In Go, it's the AI that plays
  • Has a "policy" that determines what action to take in what state

Environment:

  • The object the Agent interacts with
  • In Go, it's the board + opponent
  • Receives the Agent's action, returns new state and reward

State

State s is a complete description of the environment. In Go:

  • State includes: current board position, whose turn it is, ko status, etc.
  • The state space is extremely large: approximately 1017010^{170} possible states

The state must have the Markov property: the future depends only on the current state, not on the history.

Action

Action a is the behavior the Agent can take. In Go:

  • Each empty point is a possible action
  • Plus "pass," there are 19×19+1=36219 \times 19 + 1 = 362 actions
  • But many positions are actually illegal (such as suicide, ko violations)

Reward

Reward r is the environment's feedback on actions. In Go:

  • Win: +1+1
  • Loss: 1-1
  • During the game: 00 (this is the most challenging part!)

The sparsity of reward signals is one of the main difficulties in reinforcement learning for Go.

Policy

Policy π is the Agent's behavioral guideline, telling it what to do in each state.

A policy can be:

  • Deterministic policy: a=π(s)a = \pi(s), each state corresponds to a unique action
  • Stochastic policy: aπ(as)a \sim \pi(a|s), gives a probability distribution over actions

In AlphaGo, the Policy Network is a stochastic policy that outputs the probability of playing at each position.


Markov Decision Process (MDP)

Definition of MDP

Markov Decision Process (MDP) is the mathematical framework for reinforcement learning.

An MDP is defined by a five-tuple (S,A,P,R,γ)(S, A, P, R, \gamma):

SymbolMeaningCorrespondence in Go
SSState spaceAll possible board positions
AAAction spaceAll legal move positions
$P(s's,a)$Transition probability
R(s,a,s)R(s,a,s')Reward functionWin/loss outcome
γ\gammaDiscount factorImportance of future rewards

Markov Property

The core assumption of MDP is the Markov Property:

P(st+1st,at,st1,at1,,s0)=P(st+1st,at)P(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1}|s_t, a_t)

In plain terms: the future depends only on the present, not on the past.

Does Go satisfy this property?

On the surface, yes—as long as you know the current board state, you know all legal moves. But actually, Go has the ko rule, which requires remembering the previous state. AlphaGo handles this by encoding the previous 8 board states into input features.

Go is a Deterministic MDP

Go has a special property: transitions are deterministic.

In board games, when you make a move, the board state change is completely deterministic (unlike dice games with randomness). So:

P(ss,a)={1if s is the state after executing a0otherwiseP(s'|s,a) = \begin{cases} 1 & \text{if } s' \text{ is the state after executing } a \\ 0 & \text{otherwise} \end{cases}

But don't forget, Go is a two-player game, and the opponent's moves bring "uncertainty." This makes the problem an adversarial MDP.

Reward Design

The design of the reward function is crucial for reinforcement learning. In Go, the most natural design is:

R(sT)={+1if AI wins1if AI losesR(s_T) = \begin{cases} +1 & \text{if AI wins} \\ -1 & \text{if AI loses} \end{cases}

Where TT is the time step when the game ends.

This sparse reward brings huge challenges:

  • A game may have 200-300 moves
  • Only the last move reveals the outcome
  • How do you judge whether an intermediate move is good or bad?

Some research attempts to design dense rewards, such as:

  • Capture rewards
  • Territory estimation rewards
  • Position evaluation rewards

But AlphaGo's success shows that: even using only the final win/loss as reward, through sufficient self-play, AI can learn sophisticated mid-game tactics.


Value Functions

Why Do We Need Value Functions?

The goal of reinforcement learning is to maximize cumulative reward. But rewards are delayed, so we need a way to evaluate "how good the current state is."

This is the purpose of Value Functions.

State Value Function V(s)

The State Value Function Vπ(s)V^\pi(s) is defined as: starting from state ss, following policy π\pi, the expected cumulative reward.

Vπ(s)=Eπ[t=0γtrt+1s0=s]V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s \right]

Where:

  • Eπ\mathbb{E}_\pi represents the expected value under policy π\pi
  • γ[0,1]\gamma \in [0, 1] is the discount factor, making near-term rewards more important than long-term rewards
  • rt+1r_{t+1} is the reward obtained at time step t+1t+1

In Go, V(s)V(s) can be interpreted as: the probability of AI winning from the current position. AlphaGo's Value Network learns this function.

Action Value Function Q(s,a)

The Action Value Function Qπ(s,a)Q^\pi(s,a) goes further, evaluating the value of taking action aa in state ss:

Qπ(s,a)=Eπ[t=0γtrt+1s0=s,a0=a]Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a \right]

Q(s,a)Q(s,a) can be interpreted as: the probability of ultimately winning if you play this move at the current position.

Relationship Between V and Q

These two functions have a close relationship:

Vπ(s)=aπ(as)Qπ(s,a)V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)

That is, state value = weighted average of all possible actions, weighted by the policy.

If we know the optimal policy π\pi^*:

V(s)=maxaQ(s,a)V^*(s) = \max_a Q^*(s,a)

Optimal state value = Q value of the best action.

Bellman Equation

Value functions satisfy an elegant recursive relationship—the Bellman Equation:

Vπ(s)=aπ(as)sP(ss,a)[R(s,a,s)+γVπ(s)]V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

In plain terms: the value of the current state = immediate reward + discounted value of the next state.

This equation is the theoretical foundation of dynamic programming and many reinforcement learning algorithms.

AlphaGo's Value Network

In AlphaGo, the Value Network learns V(s)V(s)—evaluating the win rate of the current position.

Input: Board state s (19×19×17 feature tensor)
Output: Win rate estimate V(s) ∈ [-1, 1] (using tanh activation)

The Value Network's training objective is to predict the final outcome:

L=E[(Vθ(s)z)2]L = \mathbb{E} \left[ (V_\theta(s) - z)^2 \right]

Where z{1,+1}z \in \{-1, +1\} is the actual game outcome.


Policy Gradient Methods

From Value to Policy

Traditional reinforcement learning methods (such as Q-Learning) are "value-based": first learn the value function, then derive the policy from it.

But in problems like Go with huge action spaces, learning the policy directly may be more effective. This is the idea behind Policy Gradient methods.

Policy Parameterization

We use a neural network to represent the policy:

πθ(as)\pi_\theta(a|s)

Where θ\theta are the network parameters. The network takes state ss as input and outputs the probability of each action.

In AlphaGo, this is the Policy Network:

  • Input: Board state
  • Output: Probability of placing a stone at each of 361 positions (plus pass)

Policy Gradient Theorem

We want to find the optimal parameters θ\theta^* that maximize expected cumulative reward:

J(θ)=Eπθ[trt]J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_t r_t \right]

The Policy Gradient Theorem tells us how to compute the gradient of JJ with respect to θ\theta:

θJ(θ)=Eπθ[tθlogπθ(atst)Gt]\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]

Where Gt=k=tTγktrkG_t = \sum_{k=t}^{T} \gamma^{k-t} r_k is the cumulative reward starting from time tt.

Intuitive Understanding

This formula can be understood as:

  1. θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t|s_t): How to adjust parameters to increase the probability of action ata_t
  2. GtG_t: The total return brought by this action

So:

  • If Gt>0G_t > 0 (good outcome), increase this action's probability
  • If Gt<0G_t < 0 (bad outcome), decrease this action's probability

This is one solution to credit assignment!

REINFORCE Algorithm

REINFORCE is the simplest policy gradient algorithm:

Algorithm: REINFORCE

1. Initialize policy network parameters θ

2. Repeat:
a. Use current policy π_θ to complete a game, collect trajectory:
τ = (s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T)

b. Calculate cumulative return for each step:
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ...

c. Calculate policy gradient:
∇J = (1/T) Σ_t ∇_θ log π_θ(a_t|s_t) · G_t

d. Update parameters:
θ ← θ + α · ∇J

In Go, this means:

  1. Let the AI play a game by itself
  2. If it ultimately wins (G=+1G = +1), increase the probability of all moves played
  3. If it ultimately loses (G=1G = -1), decrease the probability of all moves played
  4. Repeat this process millions of times

Baseline

One problem with REINFORCE is high variance. Imagine a winning game that may also have some bad moves, but their probabilities would all increase.

The solution is to introduce a baseline:

θJ=E[tθlogπθ(atst)(Gtb(st))]\nabla_\theta J = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) \right]

A common choice is to let b(st)=V(st)b(s_t) = V(s_t), which is the Advantage Function:

A(st,at)=GtV(st)A(s_t, a_t) = G_t - V(s_t)

The advantage function measures: "How much better is this action than average?"

  • A>0A > 0: This action is better than expected, increase its probability
  • A<0A < 0: This action is worse than expected, decrease its probability

AlphaGo uses the Value Network to compute the baseline, which is why both the Policy Network and Value Network need to be trained.


Exploration vs. Exploitation

The Dilemma

Reinforcement learning faces a classic dilemma: Exploration vs. Exploitation.

  • Exploitation: Based on current knowledge, choose the action that looks best
  • Exploration: Try uncertain actions, possibly discovering better strategies

Pure exploitation gets stuck in local optima; pure exploration wastes time on obviously bad moves.

Challenges in Go

This problem is particularly severe in Go:

  1. Huge action space: 361 possible moves
  2. Sparse rewards: Only know the outcome at the end
  3. Long-term effects: A move's impact may only become apparent dozens of moves later

ε-Greedy Strategy

The simplest exploration method:

π(as)={1ε+εAif a=argmaxQ(s,a)εAotherwise\pi(a|s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|A|} & \text{if } a = \arg\max Q(s,a) \\ \frac{\varepsilon}{|A|} & \text{otherwise} \end{cases}

Choose the best action with probability 1ε1-\varepsilon, randomly choose with probability ε\varepsilon.

But this is too crude for Go—randomly choosing a position to play is mostly a bad move.

Softmax Exploration

A better method is using a softmax distribution:

π(as)=exp(Q(s,a)/τ)aexp(Q(s,a)/τ)\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}

Where τ\tau is the temperature parameter:

  • τ0\tau \to 0: Approaches greedy policy (pure exploitation)
  • τ\tau \to \infty: Approaches uniform random (pure exploration)
  • τ=1\tau = 1: Balances exploration and exploitation

AlphaGo uses similar techniques during self-play training to increase diversity.

UCB and PUCT

In MCTS, exploration and exploitation are handled by the UCB (Upper Confidence Bound) formula. AlphaGo uses its variant PUCT:

score(s,a)=Q(s,a)+cpuctP(s,a)N(s)1+N(s,a)\text{score}(s,a) = Q(s,a) + c_{\text{puct}} \cdot P(s,a) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)}

This formula is explained in detail in PUCT Formula Explained.

Intrinsic Exploration

AlphaGo also has an implicit exploration mechanism: self-play itself is exploration.

Since the neural network outputs probability distributions rather than deterministic actions, each self-play game produces different games. This naturally brings:

  • Tactical diversity: The same position may try different moves
  • Style evolution: As training progresses, AI may "discover" joseki that humans never tried
  • Self-correction: If a certain move always loses, its probability gradually decreases

Uniqueness of Go Reinforcement Learning

Comparison with Other Domains

Go reinforcement learning has some unique characteristics:

CharacteristicGoRobot ControlVideo Games
State spaceDiscrete, extremely largeContinuousDiscrete, medium
Action spaceDiscrete, largeContinuousDiscrete, small
TransitionsDeterministicStochasticDeterministic or stochastic
RewardsExtremely sparseDesignableMedium density
Environment modelKnown (rules)UnknownPartially known
AdversarialPerfect information gameUsually nonePossible

Deterministic Transitions

Go's rules are completely known. When you make a move, the next state is deterministic. This means:

  • Exact simulation possible: No need to learn the environment model
  • Perfect backtracking: MCTS can search precisely
  • No environment randomness handling needed: Simplifies many problems

Perfect Information

Go is a perfect information game—both players can see the complete board. Unlike poker (hidden information), this makes the problem simpler in some ways:

  • No need to handle opponent's hidden information
  • Can use the Minimax framework
  • State representation is more straightforward

Possibility of Self-Play

Because the rules are known and deterministic, AI can play against itself without needing a real opponent. This brings:

  • Unlimited training data: New games can be generated at any time
  • Stable opponent level: The opponent is itself, same skill level
  • Progressive improvement: As it gets stronger, the opponent also gets stronger

This is exactly the key to AlphaGo's success, which we'll discuss in detail in the next article Self-Play.

Long-term Credit Assignment

Go's rewards are extremely sparse (only the final win/loss), and a game may have 200-300 moves. This brings a severe credit assignment problem:

How do you correctly assign credit for a good move at move 50 when you win at move 250?

AlphaGo's solution combines multiple techniques:

  1. Value Network: Evaluates the win rate of intermediate positions, providing immediate feedback
  2. MCTS: Search to verify the quality of each move
  3. Large-scale play: Learning credit assignment through statistics

Symmetry

The Go board has 8-fold symmetry (4 rotations × 2 reflections). AlphaGo uses this for data augmentation:

  • Each training position can generate 8 variants
  • Greatly increases effective training data
  • Ensures the network learns symmetry-invariant features

Algorithm Comparison

Value-Based vs. Policy-Based

MethodProsConsSuitable For
Value-based (Q-Learning)High sample efficiencyDifficult with large action spacesSmall action spaces
Policy-based (REINFORCE)Can handle large action spacesHigh variance, low sample efficiencyLarge action spaces
Actor-CriticBalances bothNeed to train two networksGeneral applicability

AlphaGo's Choice

AlphaGo uses a variant of the Actor-Critic architecture:

  • Policy Network (Actor): Directly outputs action probabilities
  • Value Network (Critic): Evaluates state value

But it doesn't use traditional Actor-Critic updates. Instead:

  1. Supervised learning: First learn initial Policy Network from human game records
  2. Policy gradient: Strengthen Policy Network through self-play
  3. Regression learning: Train Value Network with self-play data
  4. MCTS integration: Combine both networks in actual play

This hybrid approach combines the advantages of multiple techniques and is one of the keys to AlphaGo's success.


Implementation Considerations

Training Stability

Policy gradient methods can sometimes be unstable. Common techniques include:

Gradient Clipping:

# Limit the norm of gradients
max_grad_norm = 0.5
torch.nn.utils.clip_grad_norm_(policy_net.parameters(), max_grad_norm)

Learning Rate Decay:

# Lower learning rate as training progresses
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)

Advanced algorithms like PPO/TRPO: Limit policy changes per update to prevent catastrophic forgetting.

Memory Management

Go games are long and require storing many trajectories. Common strategies:

Experience Replay:

# Store past experiences
replay_buffer = ReplayBuffer(max_size=1000000)

# Randomly sample for training
batch = replay_buffer.sample(batch_size=256)

Prioritized Experience Replay: Prioritize replaying "surprising" experiences (those with large TD errors).

Parallelization

Reinforcement learning can be highly parallelized:

  • Multi-threaded play: Run multiple games simultaneously
  • Distributed training: Multiple machines training simultaneously
  • Asynchronous updates: A3C and similar algorithms

AlphaGo's training used hundreds of GPUs and TPUs, conducting thousands of self-play games simultaneously.


Animation Mapping

Core concepts in this article and their animation numbers:

NumberConceptPhysics/Math Correspondence
Animation H1Agent-Environment interactionMarkov chain
Animation H4Policy gradientStochastic optimization
Animation H6Exploration vs. exploitationMulti-armed bandit

Summary

Reinforcement learning is the key technology that enabled AlphaGo to surpass humans. We learned:

  1. Basic framework: Agent, Environment, State, Action, Reward
  2. MDP: Markov Decision Process, the mathematical foundation of reinforcement learning
  3. Value functions: V(s)V(s) and Q(s,a)Q(s,a), evaluating the quality of states and actions
  4. Policy gradient: Methods to directly optimize policies, REINFORCE algorithm
  5. Exploration vs. exploitation: The core trade-off in the learning process
  6. Go characteristics: Challenges and opportunities of determinism, perfect information, and sparse rewards

In the next article, we'll dive deep into how AlphaGo uses self-play to achieve superhuman strength.


Further Reading


References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  2. Silver, D. (2015). "Lectures on Reinforcement Learning". University College London.
  3. Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." arXiv preprint.
  4. Williams, R. J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning, 8(3-4), 229-256.
  5. Silver, D., et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529, 484-489.