CNN and Go
DeepMind's decision to process the Go board with convolutional neural networks (CNNs) was a key part of AlphaGo's design.
CNNs were originally designed for image recognition. Why do they also suit Go? This article explains how CNNs work and why they fit the game so well.
Why CNNs Suit the Board
The Board Is an "Image"
From a certain perspective, the 19×19 Go board is just an image:
| Image | Go Board |
|---|---|
| Pixels | Intersections |
| RGB channels | Feature planes (black, white, empty...) |
| 224×224 | 19×19 |
| Recognize cats/dogs | Judge good/bad moves |
This analogy is not coincidental. The reasons CNNs excel at images also make them excel at boards.
Three Key Properties
CNNs have three properties that make them particularly suitable for board-type data:
1. Local Connectivity
CNN kernels only look at local regions, which perfectly matches Go's characteristics:
| Image recognition | Go |
|---|---|
| Cat ears are local features | "Eye" is a local shape |
| Don't need to see whole image | Don't need to see whole board |
3x3 region example (Eye shape):
| ○ | ● | ○ |
| ● | · | ● |
| ○ | ● | ○ |
Many Go concepts are "local":
- Eye: 2×2 or 3×3 region
- Atari: 3×3 region
- Connect, cut: 2×2 region
2. Weight Sharing
The same kernel scans the entire board, meaning:
An "eye" in the top-left corner and one in the bottom-right are recognized the same way
This is reasonable: Go's rules do not depend on where on the board a shape appears. The edges are the exception, and they can be handled with edge feature planes.
Weight sharing also dramatically reduces parameter count:
| Method | Parameters (one layer, 192 channels in and out) |
|---|---|
| Fully connected | (361 × 192)² ≈ 4.8 billion |
| CNN | 3 × 3 × 192 × 192 ≈ 332 thousand |
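The saving from weight sharing can be checked with a little arithmetic. The sketch below compares one fully connected layer with one 3×3 convolutional layer, assuming both map 192 channels on a 19×19 board to 192 channels and counting weights only (no biases). The helper names are illustrative.

```python
BOARD_POINTS = 19 * 19  # 361 intersections
CHANNELS = 192          # channels entering and leaving a hidden layer

def dense_params(points, c_in, c_out):
    # Every input unit connects to every output unit.
    return (points * c_in) * (points * c_out)

def conv_params(kernel, c_in, c_out):
    # One kernel x kernel x c_in weight block per output channel, reused at every point.
    return kernel * kernel * c_in * c_out

dense = dense_params(BOARD_POINTS, CHANNELS, CHANNELS)
conv = conv_params(3, CHANNELS, CHANNELS)
print(f"fully connected: {dense:,}")
print(f"3x3 convolution: {conv:,}")
print(f"ratio: about {dense // conv:,}x")
```

Under these assumptions the convolutional layer needs roughly four orders of magnitude fewer weights, because the same kernel is reused at all 361 points.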
3. Translation Equivariance
If the input translates, CNN output also translates accordingly:
Input:
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | · | · | · | · | · |
| 2 | · | ● | · | · | · |
| 3 | · | · | · | · | · |
Output (high probability region):
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | · | · | · | · | · |
| 2 | · | * | · | · | · |
| 3 | · | · | · | · | · |
After input translation:
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | · | · | · | · | · |
| 2 | · | · | · | · | · |
| 3 | · | · | ● | · | · |
Output also translates:
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | · | · | · | · | · |
| 2 | · | · | · | · | · |
| 3 | · | · | * | · | · |
This is important for Go: the same local shape should have similar evaluation regardless of where it appears on the board.
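Translation equivariance can be verified directly on a small example. The sketch below uses a plain-Python convolution with zero padding and an arbitrary kernel (both assumptions for illustration) and checks that shifting the input and then convolving gives the same result as convolving and then shifting. The equality is exact here because the stone and its response stay away from the board edge.

```python
def conv2d_same(image, kernel):
    """Cross-correlation with zero padding; the output has the input's size."""
    h, w = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += image[rr][cc] * kernel[dr + k][dc + k]
            out[r][c] = total
    return out

def shift(grid, down, right):
    """Move every value down/right, filling vacated cells with 0."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if 0 <= r + down < h and 0 <= c + right < w:
                out[r + down][c + right] = grid[r][c]
    return out

board = [[0] * 5 for _ in range(5)]
board[1][1] = 1  # a single stone at B2
kernel = [[1, 2, 0], [0, 3, 1], [2, 0, 1]]  # arbitrary, deliberately asymmetric

shift_then_conv = conv2d_same(shift(board, 1, 1), kernel)  # stone moved to C3 first
conv_then_shift = shift(conv2d_same(board, kernel), 1, 1)
print(shift_then_conv == conv_then_shift)
```

Near the edges the zero padding breaks the symmetry slightly, which is one reason edge information is supplied to the network separately.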
Convolution Operations
Basic Principle
Convolution is the core of CNNs. It's a "sliding window" operation:
| Input (5×5) | | Kernel (3×3) | | Output (5×5, zero padding) |
|---|---|---|---|---|
| 1 0 1 0 0 | | 1 0 1 | | 2 1 3 1 1 |
| 0 1 1 1 0 | * | 0 1 0 | = | 1 5 3 4 1 |
| 1 1 1 1 1 | | 1 0 1 | | 2 3 4 3 3 |
| 0 0 1 1 0 | | | | 2 2 4 4 1 |
| 0 1 0 0 1 | | | | 0 2 1 1 2 |
Calculation process (for the center point, using the 3×3 input window centered on it):
    Output[2,2] = 1×1 + 1×0 + 1×1 +
                  1×0 + 1×1 + 1×0 +
                  0×1 + 1×0 + 1×1
                = 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1
                = 4
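The sliding-window computation can be sketched in a few lines of plain Python. The code below recomputes the example from its input and kernel, assuming zero padding at the edges (cells off the board count as 0). The `window` and `response` helpers are illustrative names, not part of any library.

```python
INPUT = [
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
]
KERNEL = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

def window(image, r, c, size=3):
    """The size x size patch centered on (r, c); off-board cells are 0."""
    k = size // 2
    h, w = len(image), len(image[0])
    return [[image[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
             for cc in range(c - k, c + k + 1)]
            for rr in range(r - k, r + k + 1)]

def response(patch, kernel):
    """Sum of elementwise products between a patch and the kernel."""
    return sum(p * q for prow, krow in zip(patch, kernel) for p, q in zip(prow, krow))

output = [[response(window(INPUT, r, c), KERNEL) for c in range(5)] for r in range(5)]
for row in output:
    print(row)
print("center window:", window(INPUT, 2, 2))
```

Printing the center window makes it easy to follow the hand calculation term by term.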
Multi-Channel Convolution
When the input has multiple channels (such as 48 feature planes), the kernel also becomes 3D: a 3×3 kernel over 48 planes has shape 3×3×48.
Each kernel sums its products across all input channels, producing one output channel.
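A minimal sketch of multi-channel convolution follows, assuming three input planes (black, white, empty) instead of 48 and a hand-written kernel. The kernel weights are invented for illustration: they respond most strongly to an empty point with black stones on all four sides.

```python
def conv_multichannel(planes, kernel):
    """planes: [C][H][W], kernel: [C][3][3]. Returns a single [H][W] output plane."""
    channels, h, w = len(planes), len(planes[0]), len(planes[0][0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0
            for ch in range(channels):          # sum over every input channel
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:  # zero padding
                            total += planes[ch][rr][cc] * kernel[ch][dr + 1][dc + 1]
            out[r][c] = total
    return out

# The 3x3 eye shape, split into planes (assumed: ● black, ○ white, · empty).
black = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
white = [[1, 0, 1], [0, 0, 0], [1, 0, 1]]
empty = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

kernel = [
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],  # black plane: count the four neighbors
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],  # white plane: ignored by this filter
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # empty plane: the center must be empty
]
out = conv_multichannel([black, white, empty], kernel)
for row in out:
    print(row)
```

A layer with 192 filters simply repeats this with 192 different kernels, giving 192 output planes.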
Multiple Filters
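AlphaGo uses 192 filters per layer, each free to learn a different feature.
As an illustration (the labels are hypothetical; learned filters are rarely this clean), different filters might respond to: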
- Filter 1: Eye detection
- Filter 2: Cut point detection
- Filter 3: Connection detection
- ...
- Filter 192: Some complex pattern
Receptive Field
What Is Receptive Field?
The receptive field of an output position is the set of input positions that can influence it.
Single-Layer Convolution
With a 3×3 kernel, each output position is influenced by only a 3×3 input region:
| Input (3×3 receptive field) | | Output |
|---|---|---|
| . . . . . | | . . . . . |
| . X X X . | | . . . . . |
| . X X X . | --> | . . Y . . |
| . X X X . | | . . . . . |
| . . . . . | | . . . . . |
The highlighted 3×3 region in the input (X) affects a single point in the output (Y).
Multi-Layer Convolution
Stacking multiple conv layers expands the receptive field:
| Layers | Receptive Field | Calculation |
|---|---|---|
| 1 | 3×3 | 3 |
| 2 | 5×5 | 3 + (3-1) = 5 |
| 3 | 7×7 | 5 + (3-1) = 7 |
| ... | ... | ... |
| 12 | 25×25 | 3 + 11×2 = 25 |
Twelve 3×3 layers give a 25×25 receptive field, already larger than the 19×19 board! AlphaGo's first layer is actually 5×5, so its 12 conv layers reach 5 + 11×2 = 27×27.
This means:
- Each output position can "see" the entire board
- But "seeing" works differently: nearby details are clear, distant ones are summarized
- This is similar to how human players think
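The receptive field arithmetic generalizes to any stack of stride-1 convolutions: each k×k layer adds k - 1 to the side length. The helper below is a small sketch of that rule (the function name is illustrative), applied to twelve 3×3 layers and to a stack that starts with a 5×5 layer.

```python
def receptive_field(kernel_sizes):
    """Side length of the receptive field for stacked stride-1 convolutions."""
    size = 1  # a single position before any convolution
    for k in kernel_sizes:
        size += k - 1
    return size

all_3x3 = receptive_field([3] * 12)             # twelve 3x3 layers
with_5x5_first = receptive_field([5] + [3] * 11)  # 5x5 first layer, then eleven 3x3

print(all_3x3, with_5x5_first)
print("covers 19x19 board:", all_3x3 >= 19, with_5x5_first >= 19)
```

This is the theoretical receptive field; how strongly distant points actually influence an output depends on the learned weights.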
Receptive Field and Go
The receptive field concept explains why AlphaGo can handle "global" problems:
| Local problems (3×3 receptive field) | Global problems (whole-board receptive field) |
|---|---|
| Is there an eye here? | Does this group have eye space? |
| Can we atari? | Does the ladder work? |
| Can we connect? | What's the overall position? |
Shallow layers process local features; deep layers process global features.
Local vs. Global Features
CNN's Hierarchical Structure
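CNNs naturally form a hierarchical structure: early layers respond to simple local patterns, and each deeper layer combines the previous layer's features into larger, more abstract ones.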
This is strikingly similar to how humans learn Go:
- First learn the rules (where stones are)
- Then learn tactics (how to capture)
- Then learn shapes (what's good shape)
- Finally learn whole-board vision (overall judgment)
Visualizing Hidden Layers
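Visualizations of trained CNNs suggest that hidden layers learn meaningful features. The filters below are schematic illustrations of the idea, not weights extracted from AlphaGo: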
Shallow Filters
Filter A (eye detection):
| + | - | + |
| - | + | - |
| + | - | + |
Filter B (atari detection):
| + | + | + |
| + | - | - |
| + | + | + |
Deep Filters
Deep filters are more abstract and harder to interpret directly, but they capture complex shape patterns.
Activation Function Choice
ReLU: Simple and Effective
AlphaGo applies ReLU (Rectified Linear Unit) after each hidden conv layer:

    def relu(x):
        return max(0, x)

ReLU function behavior:
- For negative input values: output = 0 (flat line along x-axis)
- For positive input values: output = input (45-degree line upward)
- The function creates a "ramp" starting at the origin
Why Not Other Functions?
| Activation | Formula | Pros | Cons |
|---|---|---|---|
| ReLU | max(0, x) | Fast, good gradients | Dead neurons |
| Sigmoid | 1/(1+e^-x) | Bounded output | Vanishing gradients |
| Tanh | (e^x-e^-x)/(e^x+e^-x) | Zero-centered | Vanishing gradients |
| LeakyReLU | max(0.01x, x) | Fixes dead neurons | Extra hyperparameter |
For deep networks, ReLU's advantages are clear:
- Simple computation: Just comparison and max
- Mitigates vanishing gradients: The gradient is exactly 1 for positive inputs, so it does not shrink as it passes through active units
- Sparse activation: Many neurons output 0, improving efficiency
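The gradient argument can be made concrete by evaluating the derivative of each activation at a few inputs. This is a small sketch using only the standard library; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (-4.0, 0.5, 4.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  "
          f"sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

For large positive inputs the sigmoid and tanh derivatives shrink toward 0 while ReLU's stays at 1. For negative inputs ReLU's derivative is 0, which is the source of the "dead neuron" problem listed in the table.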
ReLU's Meaning in Go
ReLU's sparsity has an interesting interpretation in Go:
A filter detecting "cut points":
- Has cut point → Positive output (activated)
- No cut point → Zero output (not activated)
This is like a player only focusing on positions "where something is happening"
Batch Normalization
What Is Batch Normalization?
Batch Normalization (BN) is a technique that keeps each layer's output in a stable distribution:
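    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # Per-feature mean and variance over the batch axis
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Normalize; eps inside the square root avoids division by zero
        x_norm = (x - mean) / np.sqrt(var + eps)
        # Learned scale and shift
        return gamma * x_norm + beta
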
Why Is It Needed?
Internal Covariate Shift
When a network trains, each layer's input distribution changes as preceding layers' weights update. This is called "internal covariate shift":
First layer weights update → First layer output distribution changes
↓
Second layer input distribution changes → Second layer needs to readapt
↓
... (propagates)
Batch normalization stabilizes training by normalizing each layer's inputs to mean 0 and standard deviation 1 over the batch, before the learned scale (gamma) and shift (beta) are applied.
Application in AlphaGo
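Note that the networks in the original AlphaGo paper did not use batch normalization; it was introduced with AlphaGo Zero. The simplified architecture in this article follows the later convention and applies batch normalization after each conv layer, before the activation function: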
Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → ...
Benefits:
- Faster training: Can use larger learning rates
- More stable: Reduces sensitivity to initialization
- Regularization effect: Has mild dropout-like effect
Inference-Time Handling
During training, use current batch statistics. During inference, use overall training set statistics (moving average):

    if training:
        # During training: statistics of the current batch
        mean = batch_mean
        var = batch_var
    else:
        # During inference: statistics accumulated during training
        mean = running_mean
        var = running_var

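The training/inference split can be sketched end to end for a single scalar feature. The `RunningNorm` class below is a toy written for this article, not a library API: it omits gamma and beta, uses an assumed momentum of 0.1, and tracks the plain (biased) batch variance.

```python
class RunningNorm:
    """Toy batch norm for one scalar feature (no learned gamma/beta)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            # Exponential moving average of the batch statistics
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]

bn = RunningNorm()
for _ in range(200):
    bn([2.0, 4.0, 6.0, 8.0], training=True)  # batch mean 5, variance 5

single = bn([5.0], training=False)  # one position, normalized with running stats
print(bn.running_mean, bn.running_var, single)
```

Using the running statistics at inference matters for game play: a single position has no meaningful batch variance, so normalizing it with its own statistics would erase the signal.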
AlphaGo's Specific Configuration
Complete Architecture
The original AlphaGo trained its policy and value networks as separate networks of similar structure; they are shown here as two heads on one convolutional trunk for compactness.

    Input: 19×19×48

    Layer 1:
        Conv2D(5×5, 192 filters, padding='same')
        BatchNorm
        ReLU
        Output: 19×19×192

    Layers 2-12 (11 layers total):
        Conv2D(3×3, 192 filters, padding='same')
        BatchNorm
        ReLU
        Output: 19×19×192

    Output Layer (Policy):
        Conv2D(1×1, 1 filter)
        Flatten
        Softmax
        Output: 361-dim probability

    Output Layer (Value):
        Conv2D(1×1, 1 filter)
        Flatten
        Dense(256)
        ReLU
        Dense(1)
        Tanh
        Output: Single value

Parameter Configuration
| Parameter | Value | Description |
|---|---|---|
| Input channels | 48 | Feature plane count |
| Filters | 192 | Channels per layer |
| Kernel size | 3×3 (first layer 5×5) | Receptive field |
| Layers | 13 (including output) | Depth |
| Activation | ReLU | Non-linearity |
| Normalization | BatchNorm | Stabilize training |
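As a sanity check on this configuration, the sketch below counts the weights and biases in the 12 conv layers plus the 1×1 policy output layer. BatchNorm parameters and the value head are left out, and the `conv_weights` helper is an illustrative name.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights plus one bias per filter for a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

layers = [("conv 5x5 (layer 1)", conv_weights(5, 48, 192))]
layers += [(f"conv 3x3 (layer {i})", conv_weights(3, 192, 192)) for i in range(2, 13)]
layers.append(("conv 1x1 (policy output)", conv_weights(1, 192, 1)))

for name, count in layers:
    print(f"{name:26s} {count:>9,}")

total = sum(count for _, count in layers)
print(f"{'total':26s} {total:>9,}")
```

Under these assumptions the 13 layers hold just under 3.9 million parameters, almost all of them in the eleven 3×3 layers.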
PyTorch Implementation

    import torch
    import torch.nn as nn

    class AlphaGoCNN(nn.Module):
        def __init__(self, input_channels=48, num_filters=192, num_layers=12):
            super().__init__()
            # First layer (5×5 conv)
            self.conv1 = nn.Sequential(
                nn.Conv2d(input_channels, num_filters, kernel_size=5, padding=2),
                nn.BatchNorm2d(num_filters),
                nn.ReLU(inplace=True)
            )
            # Middle layers (3×3 conv)
            self.conv_layers = nn.Sequential(*[
                nn.Sequential(
                    nn.Conv2d(num_filters, num_filters, kernel_size=3, padding=1),
                    nn.BatchNorm2d(num_filters),
                    nn.ReLU(inplace=True)
                )
                for _ in range(num_layers - 1)
            ])
            # Policy output head
            self.policy_head = nn.Sequential(
                nn.Conv2d(num_filters, 1, kernel_size=1),
                nn.Flatten(),
                nn.Softmax(dim=1)
            )
            # Value output head
            self.value_head = nn.Sequential(
                nn.Conv2d(num_filters, 1, kernel_size=1),
                nn.Flatten(),
                nn.Linear(361, 256),
                nn.ReLU(inplace=True),
                nn.Linear(256, 1),
                nn.Tanh()
            )

        def forward(self, x):
            # Shared feature extraction
            x = self.conv1(x)
            x = self.conv_layers(x)
            # Split output heads
            policy = self.policy_head(x)
            value = self.value_head(x)
            return policy, value

Comparison with Other Architectures
Fully Connected Networks
If using fully connected networks for Go:
| Property | Fully Connected | CNN |
|---|---|---|
| Parameters | Huge (billions for a single 192-channel layer) | Small (a few million for the whole network) |
| Translation equivariance | None | Yes |
| Local features | Hard to learn | Naturally captured |
| Training efficiency | Low | High |
Fully connected networks cannot utilize the board's spatial structure, making them extremely inefficient.
Recurrent Neural Networks (RNN)
RNNs suit sequential data (like game history), but:
| Property | RNN | CNN |
|---|---|---|
| Spatial processing | Weak | Strong |
| Sequence processing | Strong | Weak (needs history planes) |
| Parallelization | Difficult | Easy |
| Long-range dependencies | Needs LSTM | Deep layers suffice |
AlphaGo chose CNN + history planes rather than CNN + RNN.
Residual Networks (ResNet)
AlphaGo Zero upgraded to ResNet:
Regular CNN: ResNet:
x x
↓ ↓
Conv Conv
↓ ↓
ReLU ReLU
↓ ↓
Conv Conv
↓ ↓
y y + x ← Residual connection
Residual connections let gradients flow more easily, enabling much deeper networks: AlphaGo Zero's residual tower has 20 or 40 blocks (roughly 40 or 80 conv layers), compared with AlphaGo's 12 conv layers.
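Why the shortcut helps gradients can be seen in a deliberately simplified scalar model. A plain block y = f(x) has derivative f'(x), while a residual block y = f(x) + x has derivative f'(x) + 1. The sketch below multiplies these per-block derivatives through a deep stack, assuming every block has the same small local derivative of 0.01 (a pessimistic toy value chosen for illustration).

```python
def gradient_through_stack(local_grad, depth, residual):
    """Product of per-block derivatives for a stack of identical scalar blocks.

    Plain block:    y = f(x)      ->  dy/dx = f'(x)
    Residual block: y = f(x) + x  ->  dy/dx = f'(x) + 1
    """
    per_block = local_grad + 1.0 if residual else local_grad
    return per_block ** depth

plain = gradient_through_stack(0.01, depth=40, residual=False)
with_skip = gradient_through_stack(0.01, depth=40, residual=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {with_skip:.3f}")
```

In this toy model the plain stack's gradient vanishes, while the identity path keeps the residual stack's gradient close to 1. Real networks are not scalar and their derivatives vary per layer, so this shows only the intuition.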
See Dual-Head Network and Residual Network for details.
Visual Understanding
Convolution Process
Input board (simplified to 5x5):
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | · | · | · | · | · |
| 2 | · | ● | · | · | · |
| 3 | · | · | ○ | · | · |
| 4 | · | · | · | ● | · |
| 5 | · | · | · | · | · |
A filter (3x3, detecting "cross shape"):
| 0 | 1 | 0 |
| 1 | 1 | 1 |
| 0 | 1 | 0 |
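Convolution output (counting every stone as 1 and every empty point as 0, with zero padding):
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 2 | 0 | 0 |
| 3 | 0 | 2 | 1 | 2 | 0 |
| 4 | 0 | 0 | 2 | 1 | 1 |
| 5 | 0 | 0 | 0 | 1 | 0 |
The response is strongest (2) at C2, B3, D3, and C4, the points whose cross-shaped neighborhood contains two stones. A complete cross of five stones would score 5.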
Multi-Layer Features
Layer 1 output (4 of 192 channels, illustrative values):
Channel 1 (eye):
| 0 | 0 | 0 | 0 |
| 0 | 0.9 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
Channel 2 (edge):
| 0.8 | 0 | 0 | 0 |
| 0.8 | 0 | 0 | 0 |
| 0.8 | 0 | 0 | 0 |
| 0.8 | 0 | 0 | 0 |
Channel 3 (cut):
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0.7 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
Channel 4 (connect):
| 0 | 0 | 0 | 0 |
| 0 | 0.5 | 0 | 0 |
| 0 | 0.8 | 0 | 0 |
| 0 | 0.5 | 0 | 0 |
These features are combined into more complex concepts in deeper layers...
Animation Reference
Core concepts covered in this article with animation numbers:
| Number | Concept | Physics/Math Correspondence |
|---|---|---|
| Animation D9 | Convolution operation | Filter response |
| Animation D10 | Receptive field | Local→Global |
| Animation D11 | Batch normalization | Distribution stability |
| Animation D1 | Multi-channel input | Tensor operations |
Further Reading
- Previous: Input Feature Design - 48 feature planes explained
- Next: Supervised Learning Phase - How to learn from human games
- Advanced Topic: Dual-Head Network and Residual Network - AlphaGo Zero's network upgrade
Key Takeaways
- CNNs naturally suit boards: Local connectivity, weight sharing, translation equivariance
- Convolution extracts local features: Pattern recognition in 3×3 regions
- Deep networks gain global vision: 12 layers → a receptive field larger than the 19×19 board
- ReLU is fast and effective: Simple non-linear activation
- BatchNorm stabilizes training: Normalizes each layer's output
CNNs let AlphaGo "see" the board - as naturally as humans see images with their eyes.