Key Papers Guide

This article summarizes the most important papers in Go AI development history, providing quick summaries and technical highlights.

The milestone papers in Go AI history include Coulom's MCTS (2006), AlphaGo (2016), AlphaGo Zero (2017), the generalized AlphaZero (2017), and KataGo (2019) by David Wu, which introduced many efficiency improvements. To learn the fundamentals, start with AlphaGo; for implementation reference, focus on the KataGo paper.

Papers Overview

Timeline

Coulom - MCTS first applied to Go
Silver et al. - AlphaGo (Nature)
Silver et al. - AlphaGo Zero (Nature)
Silver et al. - AlphaZero
Wu - KataGo
2020+ Various improvements and applications

Reading Recommendations

Goal	Recommended Paper
Understand basics	AlphaGo (2016)
Understand self-play	AlphaGo Zero (2017)
Understand general methods	AlphaZero (2017)
Implementation reference	KataGo (2019)

1. Birth of MCTS (2006)

Paper Information

Title: Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Author: Remi Coulom
Published: Computers and Games 2006

Core Contribution

First systematic application of Monte Carlo methods to Go:

Before: Pure random simulation, no tree structure
After: Build search tree + UCB selection + backpropagation

Key Concepts

UCB1 Formula

Selection score = Average win rate + C × √(ln(N) / n)

Where:
- N: Parent visit count
- n: Child visit count
- C: Exploration constant

MCTS Four Steps

Selection: Use UCB to select node
Expansion: Expand new node
Simulation: Random playout to end
Backpropagation: Propagate win/loss

Impact

Go AI reached amateur dan level
Became foundation for all subsequent Go AI
UCB concept influenced PUCT development

2. AlphaGo (2016)

Paper Information

Title: Mastering the game of Go with deep neural networks and tree search
Authors: Silver, D., Huang, A., Maddison, C.J., et al.
Published: Nature, 2016
DOI: 10.1038/nature16961

Core Contribution

First combination of deep learning and MCTS, defeating human world champion.

System Architecture

Technical Highlights

1. Supervised Learning Policy Network

# Input features (48 planes)
- Own stone positions
- Opponent stone positions
- Liberty counts
- Post-capture state
- Legal move positions
- Recent move positions
...

2. Reinforcement Learning Improvement

SL Policy → Self-play → RL Policy

RL Policy beats SL Policy with ~80% win rate

3. Value Network Training

Key to prevent overfitting:
- Take only one position per game
- Avoid similar positions repeating

4. MCTS Integration

Leaf evaluation = 0.5 × Value Network + 0.5 × Rollout

Rollout uses fast Policy Network (lower accuracy but faster)

Key Numbers

Item	Value
SL Policy accuracy	57%
RL Policy vs SL Policy win rate	80%
Training GPUs	176
Playing TPUs	48

3. AlphaGo Zero (2017)

Paper Information

Title: Mastering the game of Go without human knowledge
Authors: Silver, D., Schrittwieser, J., Simonyan, K., et al.
Published: Nature, 2017
DOI: 10.1038/nature24270

Core Contribution

No human games needed, learning from scratch through self-play.

Differences from AlphaGo

Aspect	AlphaGo	AlphaGo Zero
Human games	Required	Not required
Network count	4	1 dual-head
Input features	48 planes	17 planes
Rollout	Used	Not used
Residual network	No	Yes
Training time	Months	3 days

Key Innovations

1. Single Dual-Head Network

2. Simplified Input Features

# Only 17 feature planes needed
features = [
    current_player_stones,      # Own stones
    opponent_stones,            # Opponent stones
    history_1_player,           # History state 1
    history_1_opponent,
    ...                         # History states 2-7
    color_to_play               # Whose turn
]

3. Pure Value Network Evaluation

No more Rollout
Leaf evaluation = Value Network output

Cleaner, faster

4. Training Process

Learning Curve

Training time    Elo
─────────────────────
3 hours          Beginner
24 hours         Surpasses AlphaGo Lee
72 hours         Surpasses AlphaGo Master

4. AlphaZero (2017)

Paper Information

Title: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Authors: Silver, D., Hubert, T., Schrittwieser, J., et al.
Published: arXiv:1712.01815 (later published in Science, 2018)

Core Contribution

Generalization: Same algorithm applied to Go, Chess, and Shogi.

General Architecture

Input encoding (game-specific) → Residual network (general) → Dual-head output (general)

Cross-Game Adaptation

Game	Input Planes	Action Space	Training Time
Go	17	362	40 days
Chess	119	4672	9 hours
Shogi	362	11259	12 hours

MCTS Improvements

PUCT Formula

Selection score = Q(s,a) + c(s) × P(s,a) × √N(s) / (1 + N(s,a))

c(s) = log((1 + N(s) + c_base) / c_base) + c_init

Exploration Noise

# Add Dirichlet noise at root
P(s,a) = (1 - ε) × p_a + ε × η_a

η ~ Dir(α)
α = 0.03 (Go), 0.3 (Chess), 0.15 (Shogi)

5. KataGo (2019)

Paper Information

Title: Accelerating Self-Play Learning in Go
Author: David J. Wu
Published: arXiv:1902.10565

Core Contribution

50x efficiency improvement, allowing individual developers to train strong Go AI.

Key Innovations

1. Auxiliary Training Targets

Total loss = Policy Loss + Value Loss +
         Score Loss + Ownership Loss + ...

Auxiliary targets help network converge faster

2. Global Features

# Global pooling layer
global_features = global_avg_pool(conv_features)
# Combine with local features
combined = concat(conv_features, broadcast(global_features))

3. Playout Cap Randomization

Traditional: Fixed N searches each time
KataGo: N sampled from a distribution

Lets network perform well at various search depths

4. Progressive Board Sizes

if training_step < 1000000:
    board_size = random.choice([9, 13, 19])
else:
    board_size = 19

Efficiency Comparison

Metric	AlphaZero	KataGo
GPU-days to superhuman	5000	100
Efficiency improvement	Baseline	50x

6. Extended Papers

MuZero (2020)

Title: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Contribution: Learn environment dynamics model, no game rules needed

EfficientZero (2021)

Title: Mastering Atari Games with Limited Data
Contribution: Dramatically improved sample efficiency

Gumbel AlphaZero (2022)

Title: Policy Improvement by Planning with Gumbel
Contribution: Improved policy improvement method

Paper Reading Suggestions

Beginner Order

AlphaGo (2016) - Understand basic architecture
AlphaGo Zero (2017) - Understand self-play
KataGo (2019) - Understand implementation details

Advanced Order

AlphaZero (2017) - Generalization
MuZero (2020) - Learn world models
MCTS original paper - Understand foundations

Reading Tips

Read abstract and conclusion first: Quickly grasp core contribution
Look at figures: Understand overall architecture
Read methods section: Understand technical details
Check appendix: Find implementation details and hyperparameters

Resource Links

Paper PDFs

Paper	Link
AlphaGo	Nature
AlphaGo Zero	Nature
AlphaZero	Science
KataGo	arXiv

Open Source Implementations

Project	Link
KataGo	GitHub
Leela Zero	GitHub
MiniGo	GitHub

Papers Overview​

Timeline​

Reading Recommendations​

1. Birth of MCTS (2006)​

Paper Information​

Core Contribution​

Key Concepts​

UCB1 Formula​

MCTS Four Steps​

Impact​

2. AlphaGo (2016)​

Paper Information​

Core Contribution​

System Architecture​

Technical Highlights​

1. Supervised Learning Policy Network​

2. Reinforcement Learning Improvement​

3. Value Network Training​

4. MCTS Integration​

Key Numbers​

3. AlphaGo Zero (2017)​

Paper Information​

Core Contribution​

Differences from AlphaGo​

Key Innovations​

1. Single Dual-Head Network​

2. Simplified Input Features​

3. Pure Value Network Evaluation​

4. Training Process​

Learning Curve​

4. AlphaZero (2017)​

Paper Information​

Core Contribution​

General Architecture​

Cross-Game Adaptation​

MCTS Improvements​

PUCT Formula​

Exploration Noise​

5. KataGo (2019)​

Paper Information​

Core Contribution​

Key Innovations​

1. Auxiliary Training Targets​

2. Global Features​

3. Playout Cap Randomization​

4. Progressive Board Sizes​

Efficiency Comparison​

6. Extended Papers​

MuZero (2020)​

EfficientZero (2021)​

Gumbel AlphaZero (2022)​

Paper Reading Suggestions​

Beginner Order​

Advanced Order​

Reading Tips​

Resource Links​

Paper PDFs​

Open Source Implementations​

Further Reading​

Papers Overview

Timeline

Reading Recommendations

1. Birth of MCTS (2006)

Paper Information

Core Contribution

Key Concepts

UCB1 Formula

MCTS Four Steps

Impact

2. AlphaGo (2016)

Paper Information

Core Contribution

System Architecture

Technical Highlights

1. Supervised Learning Policy Network

2. Reinforcement Learning Improvement

3. Value Network Training

4. MCTS Integration

Key Numbers

3. AlphaGo Zero (2017)

Paper Information

Core Contribution

Differences from AlphaGo

Key Innovations

1. Single Dual-Head Network

2. Simplified Input Features

3. Pure Value Network Evaluation

4. Training Process

Learning Curve

4. AlphaZero (2017)

Paper Information

Core Contribution

General Architecture

Cross-Game Adaptation

MCTS Improvements

PUCT Formula

Exploration Noise

5. KataGo (2019)

Paper Information

Core Contribution

Key Innovations

1. Auxiliary Training Targets

2. Global Features

3. Playout Cap Randomization

4. Progressive Board Sizes

Efficiency Comparison

6. Extended Papers

MuZero (2020)

EfficientZero (2021)

Gumbel AlphaZero (2022)

Paper Reading Suggestions

Beginner Order

Advanced Order

Reading Tips

Resource Links

Paper PDFs

Open Source Implementations

Further Reading