Value Network Explained

If the Policy Network tells AlphaGo "where to play next," the Value Network answers a more fundamental question:

"Will I win this game?"

What is the Value Network?

Core Function

The Value Network is a deep convolutional neural network whose task is:

Given the current board state, predict the final win probability

Mathematically expressed:

v = f_θ(s)

Where:

s: Current board state
f_θ: Value Network (θ represents network parameters)
v: A value between -1 and +1

Output Interpretation

Output Value	Meaning
+1	Current player wins for certain
+0.5	Current player has ~75% win rate
0	Equal chances for both sides
-0.5	Current player has ~25% win rate
-1	Current player loses for certain

Why a Single Value?

Comparing Different Options

When playing Go, we often need to choose among multiple options. The Value Network makes this comparison simple:

Option A position value: 0.3
Option B position value: 0.5
Option C position value: 0.2

→ Choose B (highest value)

Without a single value, how would we compare "capturing an opponent's group" versus "surrounding a large territory"?

Replacing Extensive Simulations

In traditional Monte Carlo Tree Search, evaluating a position requires random rollouts:

Start from current position
Both sides play randomly until game ends
Record win/loss
Repeat thousands of times, calculate win rate

This is very slow. The Value Network can provide an evaluation with a single forward pass, orders of magnitude faster.

Method	Evaluation Time	Accuracy
1000 random rollouts	~2000 ms	Lower
15000 random rollouts	~30000 ms	Medium
Value Network	~3 ms	High (equivalent to 15000 rollouts)

Network Architecture

Similarity to Policy Network

The Value Network architecture is very similar to the Policy Network, both being deep convolutional neural networks:

Input Layer

Same as Policy Network, input is a 19×19×49 feature tensor:

19×19: Board size
49: 48 feature planes + 1 plane indicating whose turn

The extra plane is important: Value Network needs to know whose turn it is, since the same position has opposite values for Black and White.

Convolutional Layers

Same as Policy Network:

12 convolutional layers
192 filters
3×3 kernels (first layer 5×5)
ReLU activation function

Output Layer Differences

This is the key difference between Value Network and Policy Network:

Policy Network Output

19×19×192 → 1×1 conv → 19×19×1 → Flatten → 361-dim → Softmax → Probability distribution

Value Network Output

19×19×192 → 1×1 conv → 19×19×1 → Flatten → 361-dim → FC 256 → ReLU → FC 1 → Tanh → Single value

Tanh Activation Function

The Value Network's final layer uses Tanh (hyperbolic tangent) function:

Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Tanh's output range is (-1, +1), exactly corresponding to win/loss.

Why Tanh Instead of Sigmoid?

Sigmoid's output range is (0, 1), which could also represent win rate. But Tanh has several advantages:

Symmetry: Centered at 0, output can be positive or negative
Better gradients: Gradient is close to 1 near 0
Clear semantics: Positive means win, negative means loss, zero is draw

Complete Architecture Diagram

Parameter Count

Layer	Calculation	Parameter Count
Conv Layers	Same as Policy Network	~3.9M
FC Layer 1	361×256 + 256	92,672
FC Layer 2	256×1 + 1	257
Total		~4.0M

About 4 million parameters, slightly more than Policy Network.

Training Challenges

Overfitting Problem

Training the Value Network is much harder than the Policy Network. The main issue is overfitting.

What is Overfitting?

Overfitting means the model "memorizes" the training data rather than learning to generalize. Symptoms include:

Great performance on training set
Poor performance on test set

Why is Value Network Prone to Overfitting?

Consider data from one game:

Position 1 → Position 2 → Position 3 → ... → Position 200 → Result: Black wins

If we train directly with this data:

These 200 positions are highly correlated
They come from the same game, with the same outcome
The model might learn to "recognize" this game rather than understand positions

DeepMind discovered: if you train Policy and Value Networks with the same human game records, the Value Network severely overfits.

Solution: Self-Play Data

DeepMind's solution was to generate new training data through self-play:

Use trained RL Policy Network for self-play
Take only one position from each game (avoiding correlation)
Label this position with the game's final result
Generate 30 million such samples

Why Does This Solve Overfitting?

Large data volume: 30 million independent positions
No correlation: Only one position per game
Different distribution: Self-play positions differ from human games

Training Data Generation

# Pseudocode
training_data = []

for game_id in range(30_000_000):
    # Self-play one game
    states, result = self_play(rl_policy_network)

    # Randomly select one position
    random_index = random.randint(0, len(states) - 1)
    state = states[random_index]

    # Record position and result
    training_data.append((state, result))

Training Objective and Method

Mean Squared Error Loss

The Value Network uses Mean Squared Error (MSE) as the loss function:

L(θ) = (1/n) × Σ (v_θ(s) - z)²

Where:

v_θ(s): Model's predicted value
z: Actual result (+1 or -1)

Why MSE Instead of Cross-Entropy?

Cross-entropy is suitable for classification problems (discrete labels)
MSE is suitable for regression problems (continuous values)

Although results are only +1 or -1, the model predicts continuous values (any number between -1 and +1). MSE trains the model to predict values close to +1 or -1.

Training Process

# Pseudocode
for epoch in range(num_epochs):
    for batch in dataloader:
        states, outcomes = batch

        # Forward pass
        values = network(states)  # (batch, 1)

        # Calculate loss (MSE)
        loss = mse_loss(values, outcomes)

        # Backward pass
        loss.backward()
        optimizer.step()

Training details:

Optimizer: SGD with momentum
Learning rate: 0.003
Batch size: 32
Training time: About 1 week (50 GPUs)

Accuracy Analysis

Comparison with Random Rollouts

DeepMind conducted detailed comparisons in their paper:

Evaluation Method	Prediction Error
1000 random rollouts	Higher
15000 random rollouts	Medium
Value Network	Comparable to 15000 rollouts

This means one Value Network evaluation ≈ 15000 random rollouts, but ~1000× faster.

Accuracy at Different Stages

Value Network accuracy depends on game progress:

Stage	Remaining Moves	Prediction Difficulty	Accuracy
Opening	~300	Very hard	Lower
Middlegame	~150	Difficult	Medium
Endgame	~50	Easier	Higher
Final	~10	Simple	Very high

This is intuitively reasonable: the closer to game end, the more certain the outcome.

Output Distribution

A well-trained Value Network's output distribution:

Most outputs concentrated near -1 and +1 (because most positions have clear win/loss tendency).

Uncertain Positions

When Value Network output is close to 0, it indicates a very complex position where the outcome is uncertain. These positions are typically:

In the middle of large battles
Both sides evenly matched
Multiple possible variations exist

In MCTS, these nodes receive more search resources (due to high uncertainty).

Role in MCTS

Leaf Node Evaluation

The Value Network plays a critical role in MCTS's Evaluation phase:

When MCTS reaches a leaf node, it needs to evaluate this position's value. Two methods:

Random Rollout: Play randomly from leaf node to game end
Value Network Evaluation: Directly predict with neural network

AlphaGo combines both:

V(leaf) = (1-λ) × V_network(leaf) + λ × V_rollout(leaf)

Where λ = 0.5, meaning equal weight for both.

Why Combine Them?

Value Network is more accurate but may have systematic bias
Random rollouts are less accurate but provide independent estimates
Combining both complements each other

AlphaGo Zero's Simplification

Later, AlphaGo Zero completely removed random rollouts:

V(leaf) = V_network(leaf)

This greatly simplified the system while achieving stronger play. This proved the Value Network is reliable enough without the "insurance" of random rollouts.

Backpropagation Updates

After evaluating a leaf node, this value propagates back along the path:

Each node maintains a Q value that is the average of all leaf evaluations passing through it:

Q(s, a) = (1/N(s,a)) × Σ V(leaf)

Visual Analysis

Value Surface

Imagine a simplified 3×3 board. What the Value Network learns is a "value surface":

	White 1	White 2	White 3
Black 1	+0.3	-0.1	+0.2
Black 2	-0.2	+0.5	-0.3
Black 3	+0.1	-0.2	+0.4

This surface tells us the value of each position combination. Positive values favor Black, negative values favor White.

Evolution During Training

As training progresses, Value Network predictions become more accurate:

Error drops rapidly then stabilizes.

Identifying Difficult Positions

Value Network can help identify difficult positions:

Output	Meaning	Strategy
Near +1	Big advantage	Play conservatively
Near -1	Big disadvantage	Look for comeback opportunities
Near 0	Complex position	Need deep calculation

AlphaGo invests more thinking time in positions near 0.

Implementation Notes

PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNetwork(nn.Module):
    def __init__(self, input_channels=49, num_filters=192, num_layers=12):
        super().__init__()

        # First convolutional layer (5×5)
        self.conv1 = nn.Conv2d(input_channels, num_filters,
                               kernel_size=5, padding=2)

        # Middle convolutional layers (3×3)×11
        self.conv_layers = nn.ModuleList([
            nn.Conv2d(num_filters, num_filters,
                     kernel_size=3, padding=1)
            for _ in range(num_layers - 1)
        ])

        # Output convolutional layer
        self.conv_out = nn.Conv2d(num_filters, 1, kernel_size=1)

        # Fully connected layers
        self.fc1 = nn.Linear(361, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x):
        # x: (batch, 49, 19, 19)

        # Convolutional layers
        x = F.relu(self.conv1(x))
        for conv in self.conv_layers:
            x = F.relu(conv(x))
        x = self.conv_out(x)

        # Flatten
        x = x.view(x.size(0), -1)  # (batch, 361)

        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = torch.tanh(self.fc2(x))

        return x.squeeze(-1)  # (batch,)

Training Loop

def train_value_network(model, optimizer, states, outcomes):
    """
    states: (batch, 49, 19, 19) - Board features
    outcomes: (batch,) - Game results (+1 or -1)
    """
    # Forward pass
    values = model(states)  # (batch,)

    # MSE loss
    loss = F.mse_loss(values, outcomes)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Calculate accuracy (correct win/loss prediction)
    predictions = (values > 0).float() * 2 - 1  # Convert to +1/-1
    accuracy = (predictions == outcomes).float().mean()

    return loss.item(), accuracy.item()

Tips to Avoid Overfitting

# 1. Data augmentation (8-fold symmetry)
def augment(state, outcome):
    augmented = []
    for rotation in [0, 90, 180, 270]:
        s = rotate(state, rotation)
        augmented.append((s, outcome))
        augmented.append((flip(s), outcome))
    return augmented

# 2. Dropout
class ValueNetworkWithDropout(ValueNetwork):
    def __init__(self, *args, dropout_rate=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        # ... conv layers ...
        x = self.dropout(x)  # Dropout before FC layers
        # ... FC layers ...

# 3. Early Stopping
best_val_loss = float('inf')
patience = 10
counter = 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model()
        counter = 0
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping!")
            break

Collaboration with Policy Network

Complementary Relationship

Policy Network and Value Network complement each other in AlphaGo:

Network	Question Answered	Output	MCTS Role
Policy	Where to play next?	Probability distribution	Guide search direction
Value	Will I win?	Single value	Evaluate leaf nodes

Unified Dual-Head Network

In AlphaGo Zero, these two networks were merged into a dual-head network:

Advantages of this design:

Parameter sharing: Reduces computation
Feature sharing: Policy and Value use same features
More stable training: Two objectives regularize each other

See Dual-Head Network and Residual Network for details.

Animation Reference

Core concepts covered in this article with animation numbers:

Number	Concept	Physics/Math Correspondence
Animation E2	Value Network	Potential energy surface
Animation D4	Value function	Expected return
Animation C6	Leaf node evaluation	Function approximation
Animation H3	Temporal difference	Bootstrap learning

Key Takeaways

Value Network predicts win rate: Outputs a single value between -1 and +1
Tanh output: Ensures output is in the correct range
MSE loss: Drives predictions toward actual outcomes
Overfitting challenge: Requires self-play data to avoid
Replaces random rollouts: One evaluation ≈ 15000 rollouts

The Value Network is AlphaGo's "judgment" - it lets the AI evaluate any position's quality without exhaustively exploring all possibilities.

References

Silver, D., et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529, 484-489.
Silver, D., et al. (2017). "Mastering the game of Go without human knowledge." Nature, 551, 354-359.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Tesauro, G. (1995). "Temporal difference learning and TD-Gammon." Communications of the ACM, 38(3), 58-68.

What is the Value Network?​

Core Function​

Output Interpretation​

Why a Single Value?​

Comparing Different Options​

Replacing Extensive Simulations​

Network Architecture​

Similarity to Policy Network​

Input Layer​

Convolutional Layers​

Output Layer Differences​

Policy Network Output​

Value Network Output​

Tanh Activation Function​

Why Tanh Instead of Sigmoid?​

Complete Architecture Diagram​

Parameter Count​

Training Challenges​

Overfitting Problem​

What is Overfitting?​

Why is Value Network Prone to Overfitting?​

Solution: Self-Play Data​

Why Does This Solve Overfitting?​

Training Data Generation​

Training Objective and Method​

Mean Squared Error Loss​

Why MSE Instead of Cross-Entropy?​

Training Process​

Accuracy Analysis​

Comparison with Random Rollouts​

Accuracy at Different Stages​

Output Distribution​

Uncertain Positions​

Role in MCTS​

Leaf Node Evaluation​

Why Combine Them?​

AlphaGo Zero's Simplification​

Backpropagation Updates​

Visual Analysis​

Value Surface​

Evolution During Training​

Identifying Difficult Positions​

Implementation Notes​

PyTorch Implementation​

Training Loop​

Tips to Avoid Overfitting​

Collaboration with Policy Network​

Complementary Relationship​

Unified Dual-Head Network​

Animation Reference​

Further Reading​

Key Takeaways​

References​