Training from Scratch

What is most astonishing about AlphaGo Zero is not only its final playing strength but how it got there: starting from completely random play, in just three days it retraced Go knowledge that took humans thousands of years to accumulate, and then went beyond it.

This article will walk you through this stunning transformation step by step.

Training Curve

First, let's look at AlphaGo Zero's strength growth curve:

[Interactive chart: AlphaGo Zero strength (ELO) over 72 hours of training]

This curve shows AlphaGo Zero's strength changes over 72 hours. Note several key milestones:

Time	ELO Rating	Equivalent to
0 hours	0	Random moves
3 hours	~1000	Discovers basic rules
12 hours	~3000	Discovers joseki and shapes
36 hours	~4500	Surpasses AlphaGo Lee, the version that defeated Lee Sedol
60 hours	~5200	Continues to pull ahead of AlphaGo Lee
72 hours	~5400	Defeats AlphaGo Lee 100-0

Three days, from zero to surpassing the peak of human achievement.
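
To give a feel for what a rating gap means in practice, the standard Elo formula converts a rating difference into an expected score. The following is a minimal sketch assuming the conventional 400-point logistic scale; the function name `expected_score` is illustrative and not taken from the AlphaGo Zero paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Equal ratings give an even game; a 400-point edge gives 10-to-1 odds.
    print(round(expected_score(3000, 3000), 3))
    print(round(expected_score(3400, 3000), 3))
```

Under this model every additional 400 points multiplies the odds by ten, so a gap of more than 1000 points means the weaker side almost never wins.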

Day 0: Chaotic Beginning

Completely Random Initial State

At the start of training, the neural network weights are randomly initialized. This means:

Policy Head: Outputs an approximately uniform distribution, so each of the 362 moves (361 board points plus pass) gets a probability of about 1/362
Value Head: Outputs approximately 0, so it cannot distinguish good positions from bad ones
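
The effect of random initialization on the two heads can be illustrated with a toy calculation. This is only a sketch: it assumes the untrained network produces tiny random logits (the `scale` parameter is a made-up stand-in for a real initialization scheme) and that the value head ends in a tanh, and it skips the network entirely.

```python
import math
import random

NUM_MOVES = 19 * 19 + 1  # 361 board points plus pass


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_heads(seed=0, scale=0.01):
    """Mimic an untrained network: tiny random logits and a tiny value pre-activation."""
    rng = random.Random(seed)
    policy = softmax([rng.gauss(0.0, scale) for _ in range(NUM_MOVES)])
    value = math.tanh(rng.gauss(0.0, scale))
    return policy, value


if __name__ == "__main__":
    policy, value = random_heads()
    print(f"max move probability: {max(policy):.5f}")
    print(f"uniform probability:  {1 / NUM_MOVES:.5f}")
    print(f"value estimate:       {value:+.4f}")
```

With logits this close to zero, every move probability stays within a few percent of the uniform value and the value estimate stays near 0, which is the "knows nothing" starting point described above.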

At this point, AlphaGo Zero plays completely randomly - worse than someone who has never seen a Go board.

The First Self-Play Game

Imagine what the first self-play game looks like:

Black 1: Randomly placed somewhere (could be tengen, a corner, or the first line)
White 2: Randomly placed elsewhere
Black 3: Random...
...
Move 200: The board is covered with isolated stones, no connections
Final: Win/loss determined by random factors

This game's "quality" is extremely low, but it contains precious information: who won in the end.

The First Training Signal

Although both sides are playing randomly, the outcome is definite. The neural network begins to learn:

"In this position, Black won in the end. I don't know why, but this position might be better for Black."

This is a very weak signal, but it's real. After thousands of such "junk games," the network begins to discover some statistical patterns.

Hour 1-3: Discovering Game Rules

Emerging Rule Awareness

After tens of thousands of self-play games, AlphaGo Zero begins to "discover" Go's basic rules (though these rules were built into the game engine all along):

1. Importance of Connection

Observation: When stones are connected, they're harder to capture
Learning: Begins to prefer playing next to existing stones

This wasn't taught - it was learned from win/loss results. Scattered stones are easily defeated one by one; connected groups survive more easily.

2. Concept of Liberties

Observation: When every empty point adjacent to a stone or group (its liberties) is filled by the opponent, the stones are removed from the board
Learning: Begins to avoid positions with few liberties, begins to attack opponent's stones with few liberties

The network learned to track liberty counts. There is no explicit "liberty count" feature in the input, but the count can be inferred from the stone positions on the board.
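
For comparison, this is the explicit computation a programmer would write to count liberties: a flood fill over the connected group. It is a sketch of the concept only; the board encoding and the name `count_liberties` are assumptions for illustration, and the network is never given such a routine and has to learn something functionally similar on its own.

```python
def count_liberties(board, row, col):
    """Return the number of liberties of the group containing (row, col).

    `board` is a list of equal-length strings: 'X' black, 'O' white, '.' empty.
    """
    color = board[row][col]
    if color == ".":
        raise ValueError("no stone at the given point")
    size_r, size_c = len(board), len(board[0])
    group, liberties = {(row, col)}, set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size_r and 0 <= nc < size_c):
                continue  # off the board
            if board[nr][nc] == ".":
                liberties.add((nr, nc))  # empty neighbor counts once per group
            elif board[nr][nc] == color and (nr, nc) not in group:
                group.add((nr, nc))
                stack.append((nr, nc))
    return len(liberties)


if __name__ == "__main__":
    board = [
        ".....",
        ".XX..",
        ".OX..",
        ".....",
        ".....",
    ]
    print(count_liberties(board, 1, 1))  # connected black group of three stones
    print(count_liberties(board, 2, 1))  # lone white stone
```

In this position the three connected black stones share six liberties while the isolated white stone has only two, which is the kind of difference that makes connected groups harder to capture.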

3. Embryo of Eyes

Observation: Certain shapes are especially hard to capture
Learning: Begins forming shapes with internal space in corners and edges

This is the beginning of the living group concept. The network discovered that stone groups with internal space survive more easily.

Strength Assessment

At this point, AlphaGo Zero is approximately:

ELO: ~1000
Equivalent to: Beginner who just learned the rules
Characteristics: Knows to connect stones, knows to capture opponent's stones

Hour 3-12: Discovering Joseki and Shapes

Corner Awakening

With more training, the network discovered the importance of corners:

Observation: Every group needs 2 eyes to live
          In the corner, 2 eyes take the fewest stones
          On the edge it takes more, and in the center the most
Learning: Prioritize occupying corners in the opening

This is the discovery process for what humans call "corners are gold, edges are silver, center is grass." The network wasn't told this principle - it discovered it from hundreds of thousands of games.

Emergence of Joseki

More amazingly, the network began to "invent" joseki - standard corner sequences for both sides:

Observed Phenomenon

Early training: Corner play varies wildly
Mid training: Certain sequences appear repeatedly
Late training: Stable corner joseki forms

These sequences closely resemble the joseki humans accumulated over centuries, which suggests that human joseki are good approximations of optimal play for both sides.

Typical Emergent Joseki

Using komoku joseki as an example:

A B C D E F G H J
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . ● . . . . .   ● = Black
. . . . . . . . .   ○ = White
. . . ○ . ● . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .

Black occupies komoku, White approaches, Black pincer - this sequence naturally emerged during training.

Shape Knowledge

Beyond joseki, the network also learned the difference between good and bad shapes:

Shape	Human Evaluation	Zero's Learning
Empty triangle	Bad shape	Gradually avoids
Tiger's mouth	Good shape	Gradually prefers
Double wing	Classic attack shape	Naturally discovered
Capping move	Strong attack	Naturally discovered

Strength Assessment

At this point, AlphaGo Zero:

ELO: ~3000
Equivalent to: Strong amateur
Characteristics: Has basic joseki knowledge, understands basic shapes

Hour 12-36: Maturation of Go Principles

Formation of Global Vision

Entering the second day, the network began to exhibit global vision:

Influence vs. Territory

Observation: Enclosing space gives points
          But influence also has value - can attack opponent
Learning: Finding balance between territory and influence

This is one of Go's most profound concepts. The network learned to weigh potential value (influence) against solid value (secured territory).

Thickness Judgment

Observation: "Thick" stones can support distant fighting
          "Thin" stones need reinforcement or will be attacked
Learning: Proactively build thickness, attack opponent's weaknesses

Middle Game Tactics

The network's middle game fighting ability improved dramatically:

Technique	Description
Attack weak groups	Identify opponent's isolated stones, launch attacks
Utilize thickness	Use strong positions to support attacks, gain benefits
Exchange	Give up local loss for global advantage
Invasion	Reduce opponent's framework

Endgame Skills

Endgame calculation precision also improved:

Observation: Each endgame move's value can be precisely calculated
Learning: Collect endgame points in order of value

The network learned endgame concepts such as double sente (sente for both sides), one-sided sente, and gote.

Strength Assessment

At this point, AlphaGo Zero:

ELO: ~4500
Equivalent to: Professional level
Characteristics: Has complete Go understanding, can play high-quality games

Hour 36-72: Surpassing Humans

Breaking Through Professional Level

Around 36 hours, AlphaGo Zero's strength reached professional level. But training didn't stop - it continued self-play, continued improving.

What happened next is even more interesting: it began discovering moves humans never thought of.

Revolutionary Openings

Traditional Go openings have many "established views":

Traditional View	AlphaGo Zero's Discovery
Occupy corners first in opening	Sometimes occupying edges first is better
Komoku is most stable	Direct 3-3 invasion is viable
Must memorize joseki	Can actively deviate from joseki
3-3 invasion too early is greedy	3-3 invasion is correct in some positions

These "discoveries" have been widely studied by professional players after AlphaGo, and many have been incorporated into modern Go theory.

Counter-intuitive Shapes

AlphaGo Zero sometimes plays shapes humans consider "ugly":

Human: "This is bad shape, can't possibly be good"
Zero: (plays that move)
After analysis: "Turns out this is more efficient"

This reveals limitations in human Go theory: some "bad shapes" are actually optimal in specific positions.

Aggressive Sacrifices

Zero is more willing than humans to sacrifice stones for other benefits:

Local loss of 3 points
Gain global initiative
Final win rate increases

Human players often care too much about local gains and losses, while Zero always focuses on final win rate.

Strength Assessment

AlphaGo Zero after 72 hours:

ELO: ~5400
Equivalent to: Surpasses all human players
Characteristics: Discovers unknown moves, creates new Go theory

Rediscovering Human Go Principles

Thousands of Years vs. Three Days

Human Go developed over thousands of years:

Originated in China; tradition dates it to around 2000 BCE, and written records go back roughly 2,500 years
Spread to Japan during Tang Dynasty, developed refined theory
Professional system emerged in 20th century, theory deepened further
By 2016, humans believed they understood Go quite well

AlphaGo Zero completed this journey in three days. Even more stunning, the Go principles it discovered are highly consistent with human ones.

Validation and Transcendence

Human Knowledge	Zero's Attitude
Corners gold, edges silver, center grass	Confirmed (corners are indeed important)
Basic joseki	Mostly confirmed, few improvements
Good/bad shapes	Mostly confirmed, exceptions exist
Sacrifice exchanges	More aggressive than humans
Thickness judgment	Generally consistent, details differ

This shows human Go principles accumulated over thousands of years are largely correct. But there are also areas where human understanding needs revision.

Insights for Human Learning

AlphaGo Zero's training process offers insights for human learning:

Start from basics: Zero first learned rules, then shapes, finally global vision
Massive practice: 4.9 million self-play games, more than a thousand years of play for a human at 10 games a day
Focus on winning: Don't pursue "beautiful play," only pursue winning
Don't be bound by tradition: Dare to try "impossible" moves

Technical Details of Training

Self-Play Mechanism

Each self-play game flow:

Initialize: Empty board
↓
Each move:
  1. Evaluate current position with neural network
  2. Execute MCTS search (1600 simulations)
  3. Select move based on search results
  4. Record (position, MCTS probabilities, -)
↓
Game ends:
  1. Determine winner z ∈ {-1, +1}
  2. Add outcome to all records (position, MCTS probabilities, z)
  3. Add data to training pool
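
The flow above can be sketched in a few lines. This is a toy stand-in under stated assumptions, not DeepMind's implementation: the search is replaced by a random stub, the winner is picked at random instead of by scoring, a position is just the move history, and all function names are hypothetical. What it does show faithfully is the bookkeeping: visit counts become move probabilities, and every record receives the outcome z from the perspective of the player to move once the game ends.

```python
import random


def visits_to_probs(visit_counts, temperature=1.0):
    """Turn MCTS visit counts into move probabilities: pi(a) proportional to N(a)^(1/T)."""
    weights = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]


def label_records(records, winner):
    """Fill in the outcome z for every recorded position.

    `records` holds (position, probs, player) tuples; z is +1 when the player
    to move at that position ended up winning and -1 otherwise.
    """
    return [(pos, probs, 1 if player == winner else -1)
            for pos, probs, player in records]


def self_play_game(num_moves=6, num_actions=4, simulations=1600, seed=0):
    """Toy stand-in for one self-play game; search and winner are random stubs."""
    rng = random.Random(seed)
    records, player = [], +1  # +1 = Black, -1 = White
    position = ()             # a position is just the move history here
    for _move_number in range(num_moves):
        # Stub for MCTS: scatter the simulation budget over the actions.
        visits = [0] * num_actions
        for _sim in range(simulations):
            visits[rng.randrange(num_actions)] += 1
        probs = visits_to_probs(visits)
        move = rng.choices(range(num_actions), weights=probs)[0]
        records.append((position, probs, player))  # outcome not yet known
        position += (move,)
        player = -player
    winner = rng.choice((+1, -1))  # stub for scoring the final position
    return label_records(records, winner)


if __name__ == "__main__":
    for position, probs, z in self_play_game():
        print(position, [round(p, 3) for p in probs], z)
```

Because z is stored relative to the player to move, consecutive records in the same game carry opposite signs.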

Training Rhythm

AlphaGo Zero's training is continuous:

Self-play Workers:       Continuously generate self-play data
Training Workers:        Continuously sample from data pool and train
Network Updates:         Periodically update network used for self-play

These three processes run simultaneously, forming a continuous improvement loop.

Data Pool Management

Training data pool management:

Parameter	Value
Pool size	Most recent 500,000 games
Samples per game	~200 moves
Total samples	~100 million
Sampling method	Uniform random

Old data is replaced by new data, ensuring training data reflects current network level.
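
A sliding window of recent games with uniform sampling can be sketched with a bounded deque. The class name `ReplayPool` and its methods are illustrative assumptions, and flattening the pool on every draw is only acceptable at toy scale; a real pool of roughly 100 million positions would need an index instead.

```python
import random
from collections import deque


class ReplayPool:
    """Keep only the most recent games and sample positions uniformly."""

    def __init__(self, max_games=500_000, seed=None):
        self.games = deque(maxlen=max_games)  # oldest game is dropped automatically
        self.rng = random.Random(seed)

    def add_game(self, samples):
        """Store one finished game as a list of (position, probs, z) samples."""
        self.games.append(list(samples))

    def num_samples(self):
        """Total number of positions currently held in the pool."""
        return sum(len(game) for game in self.games)

    def sample_batch(self, batch_size):
        """Draw positions uniformly at random across all stored positions."""
        flat = [sample for game in self.games for sample in game]
        return [self.rng.choice(flat) for _ in range(batch_size)]


if __name__ == "__main__":
    pool = ReplayPool(max_games=2, seed=0)
    pool.add_game([("g1-move1", None, +1), ("g1-move2", None, -1)])
    pool.add_game([("g2-move1", None, -1)])
    pool.add_game([("g3-move1", None, +1)])  # pushes game 1 out of the window
    print(pool.num_samples())
    print(pool.sample_batch(4))
```

Sampling across positions rather than across games means long games contribute proportionally more training examples.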

Network Update Strategy

The self-play network isn't updated after every training step. Instead:

Every 1,000 training steps, save a checkpoint as the candidate network
Have candidate network play against current network (400 games)
If candidate win rate > 55%, update
Otherwise continue training

This ensures self-play always uses a sufficiently strong network.
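
The gating rule is simple enough to state as code. This is a sketch under the parameters given above (400 games, 55% threshold); `play_game` is a hypothetical callback standing in for a full evaluation game between the two networks.

```python
def should_promote(candidate_wins, games_played=400, threshold=0.55):
    """Promote the candidate only if its win rate strictly exceeds the threshold."""
    return candidate_wins / games_played > threshold


def evaluate_candidate(play_game, games=400, threshold=0.55):
    """Play `games` evaluation games; `play_game()` returns True on a candidate win."""
    wins = sum(1 for _ in range(games) if play_game())
    return should_promote(wins, games, threshold), wins


if __name__ == "__main__":
    print(should_promote(220))  # exactly 55% is not enough
    print(should_promote(221))  # one more win clears the bar
```

Requiring a margin above 50% guards against promoting a candidate whose apparent edge is only noise from a finite number of games.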

Analysis of Learning Speed

Why So Fast?

Reasons for AlphaGo Zero's stunning learning speed:

1. Computational Resources

4 TPUs, tens of thousands of inferences per second
About 1.6 million self-play games per day (4.9 million over three days)
Equivalent to more than a thousand years of human games
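
The scale can be checked with quick arithmetic from the figures used in this article: 4.9 million self-play games in three days, compared with a human pace of about 10 games per day. The human pace is an assumption for illustration, not a measured value.

```python
TOTAL_GAMES = 4_900_000      # self-play games in the three-day run
TRAINING_DAYS = 3
HUMAN_GAMES_PER_DAY = 10     # assumed pace for a very dedicated human player

games_per_day = TOTAL_GAMES / TRAINING_DAYS
human_years = TOTAL_GAMES / HUMAN_GAMES_PER_DAY / 365

print(f"self-play games per day: {games_per_day:,.0f}")
print(f"human-equivalent years:  {human_years:,.0f}")
```

This works out to roughly 1.6 million games per day, or a little over 1,300 years of daily play for one person.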

2. Perfect Opponent

Self-play means:

Opponent level is always comparable to self
Not too weak (nothing to learn) or too strong (can't win)
These are ideal learning conditions

3. Direct Objective

Only one goal: win. No:

Teacher preferences
Style pursuits
Aesthetic considerations

4. Efficient Representation Learning

Residual networks can learn very abstract board features, more effective than hand-designed features.

Comparison with Humans

Aspect	Human	AlphaGo Zero
Learning speed	~10 games/day	~1.6 million games/day
Memory retention	Has forgetting	Perfect retention
Energy limits	Needs rest	Runs 24/7
Innovation ability	Influenced by tradition	No preset limitations

Interesting Phenomena During Training

Periodic Stagnation

The training curve isn't completely smooth; sometimes plateau periods appear:

ELO: 2000 -----> 2000 -----> 2500 ---->
          (stagnation)  (breakthrough)

This might be because the network is learning some new concept and needs time to "digest."

Emergence and Disappearance of Strategies

Certain strategies emerge during training, then disappear:

Stage 1: Discover some attack technique
Stage 2: Opponent learns to defend
Stage 3: That technique's usage decreases
Stage 4: Discover new attack technique

This is an arms race in miniature.

"Reinventing the Wheel"

During training, Zero "reinvents" concepts humans already knew:

Ladder: Discovers continuous atari can capture stones
Snapback: Discovers sacrificing a stone and then immediately capturing the stones that took it
Ko: Discovers how to use the superko rule

The order of these discoveries is similar to how humans learn Go.

Animation Reference

Core concepts covered in this article with animation numbers:

Number	Concept	Physics/Math Correspondence
Animation E12	Strength growth curve	S-curve growth (logistic)
Animation E7	From scratch	Self-organization
Animation E5	Self-play	Fixed-point convergence
Animation F8	Emergent abilities	Phase transition

References

Silver, D., et al. (2017). "Mastering the game of Go without human knowledge." Nature, 550, 354-359.
Silver, D., et al. (2017). "AlphaGo Zero: Starting from scratch." DeepMind Blog.
DeepMind. (2017). "AlphaGo Zero: Learning from scratch." YouTube.
Wang, F., et al. (2019). "A Survey on the Evolution of AlphaGo." arXiv:1907.11180.
