Skip to main content

Evaluation & Benchmarking

This article introduces how to evaluate Go AI strength and performance, including the Elo rating system, match testing methods, and standard benchmarks.


Elo Rating System

Basic Concept

Elo rating is the standard method for measuring relative playing strength:

Expected win rate E_A = 1 / (1 + 10^((R_B - R_A) / 400))

New Elo = Old Elo + K × (Actual result - Expected result)

Elo Difference vs Win Rate

Elo DifferenceStronger Player Win Rate
050%
10064%
20076%
40091%
80099%

Implementation

def expected_score(rating_a, rating_b):
"""Calculate A's expected score against B"""
return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating, expected, actual, k=32):
"""Update Elo rating"""
return rating + k * (actual - expected)

def calculate_elo_diff(wins, losses, draws):
"""Calculate Elo difference from match results"""
total = wins + losses + draws
win_rate = (wins + 0.5 * draws) / total

if win_rate <= 0 or win_rate >= 1:
return float('inf') if win_rate >= 1 else float('-inf')

return 400 * math.log10(win_rate / (1 - win_rate))

Match Testing

Testing Framework

class MatchTester:
def __init__(self, engine_a, engine_b):
self.engine_a = engine_a
self.engine_b = engine_b
self.results = {'a_wins': 0, 'b_wins': 0, 'draws': 0}

def run_match(self, num_games=400):
"""Run match test"""
for i in range(num_games):
# Alternate colors
if i % 2 == 0:
black, white = self.engine_a, self.engine_b
a_is_black = True
else:
black, white = self.engine_b, self.engine_a
a_is_black = False

# Play game
result = self.play_game(black, white)

# Record result
if result == 'black':
if a_is_black:
self.results['a_wins'] += 1
else:
self.results['b_wins'] += 1
elif result == 'white':
if a_is_black:
self.results['b_wins'] += 1
else:
self.results['a_wins'] += 1
else:
self.results['draws'] += 1

return self.results

def play_game(self, black_engine, white_engine):
"""Play one game"""
game = Game()

while not game.is_terminal():
if game.current_player == 'black':
move = black_engine.get_move(game.state)
else:
move = white_engine.get_move(game.state)

game.play(move)

return game.get_winner()

Statistical Significance

Ensure test results are statistically meaningful:

from scipy import stats

def calculate_confidence_interval(wins, total, confidence=0.95):
"""Calculate confidence interval for win rate"""
p = wins / total
z = stats.norm.ppf((1 + confidence) / 2)
margin = z * math.sqrt(p * (1 - p) / total)

return (p - margin, p + margin)

# Example
wins, total = 220, 400
ci_low, ci_high = calculate_confidence_interval(wins, total)
print(f"Win rate: {wins/total:.1%}, 95% CI: [{ci_low:.1%}, {ci_high:.1%}]")
Expected Elo DiffRecommended GamesConfidence
>10010095%
50-10020095%
20-5040095%
<201000+95%

SPRT (Sequential Probability Ratio Test)

Concept

No fixed game count needed; dynamically decide whether to stop based on accumulated results:

def sprt(wins, losses, elo0=0, elo1=10, alpha=0.05, beta=0.05):
"""
Sequential Probability Ratio Test

elo0: Null hypothesis Elo difference (usually 0)
elo1: Alternative hypothesis Elo difference (usually 5-20)
alpha: False positive rate
beta: False negative rate
"""
if wins + losses == 0:
return 'continue'

# Calculate log likelihood ratio
p0 = expected_score(elo1, 0) # Expected win rate under H1
p1 = expected_score(elo0, 0) # Expected win rate under H0

llr = (
wins * math.log(p0 / p1) +
losses * math.log((1 - p0) / (1 - p1))
)

# Decision boundaries
lower = math.log(beta / (1 - alpha))
upper = math.log((1 - beta) / alpha)

if llr <= lower:
return 'reject' # H0 rejected, new model is worse
elif llr >= upper:
return 'accept' # H0 accepted, new model is better
else:
return 'continue' # Continue testing

KataGo Benchmarking

Run Benchmark

# Basic test
katago benchmark -model model.bin.gz

# Specify visit count
katago benchmark -model model.bin.gz -v 1000

# Detailed output
katago benchmark -model model.bin.gz -v 1000 -t 8

Output Interpretation

KataGo Benchmark Results
========================

Configuration:
Model: kata-b18c384.bin.gz
Backend: CUDA
Threads: 8
Visits: 1000

Performance:
NN evals/second: 2847.3
Playouts/second: 4521.8
Avg time per move: 0.221 seconds

Memory:
GPU memory usage: 2.1 GB
System memory: 1.3 GB

Quality metrics:
Policy accuracy: 0.612
Value accuracy: 0.891

Key Metrics

MetricDescriptionGood Value
NN evals/secNeural network evaluation speed>1000
Playouts/secMCTS simulation speed>2000
GPU utilizationGPU usage efficiency>80%

Strength Evaluation

Human Strength Reference

AI EloHuman Strength
~1500Amateur 1 dan
~2000Amateur 5 dan
~2500Professional 1 dan
~3000Professional 5 dan
~3500World champion level
~4000+Beyond human

Major AI Elo Ratings

AIElo (estimated)
KataGo (latest)~5000
AlphaGo Zero~5000
Leela Zero~4500
Fine Art~4800

Test Comparison

def estimate_human_rank(ai_model, test_positions):
"""Estimate AI's equivalent human rank"""
# Use standard test problems
correct = 0
for pos in test_positions:
ai_move = ai_model.get_best_move(pos['state'])
if ai_move == pos['best_move']:
correct += 1

accuracy = correct / len(test_positions)

# Accuracy reference table
if accuracy > 0.9:
return "Professional level"
elif accuracy > 0.7:
return "Amateur 5 dan+"
elif accuracy > 0.5:
return "Amateur 1-5 dan"
else:
return "Below amateur level"

Performance Monitoring

Continuous Monitoring

import time
import psutil
import GPUtil

class PerformanceMonitor:
def __init__(self):
self.metrics = []

def sample(self):
"""Sample current performance metrics"""
gpus = GPUtil.getGPUs()

self.metrics.append({
'timestamp': time.time(),
'cpu_percent': psutil.cpu_percent(),
'memory_percent': psutil.virtual_memory().percent,
'gpu_util': gpus[0].load * 100 if gpus else 0,
'gpu_memory': gpus[0].memoryUsed if gpus else 0,
})

def report(self):
"""Generate report"""
if not self.metrics:
return

avg_cpu = sum(m['cpu_percent'] for m in self.metrics) / len(self.metrics)
avg_gpu = sum(m['gpu_util'] for m in self.metrics) / len(self.metrics)

print(f"Average CPU usage: {avg_cpu:.1f}%")
print(f"Average GPU usage: {avg_gpu:.1f}%")

Performance Bottleneck Diagnosis

SymptomPossible CauseSolution
CPU 100%, GPU lowInsufficient search threadsIncrease numSearchThreads
GPU 100%, slow outputBatch too smallIncrease nnMaxBatchSize
Out of memoryModel too largeUse smaller model
Unstable speedOverheatingImprove cooling

Automated Testing

CI/CD Integration

# .github/workflows/benchmark.yml
name: Benchmark

on:
push:
branches: [main]

jobs:
benchmark:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- name: Run benchmark
run: |
./katago benchmark -model model.bin.gz -v 500 > results.txt

- name: Check performance
run: |
playouts=$(grep "Playouts/second" results.txt | awk '{print $2}')
if (( $(echo "$playouts < 1000" | bc -l) )); then
echo "Performance regression detected!"
exit 1
fi

Regression Testing

def regression_test(new_model, baseline_model, threshold=0.95):
"""Check if new model has performance regression"""
# Test accuracy
new_accuracy = test_accuracy(new_model)
baseline_accuracy = test_accuracy(baseline_model)

if new_accuracy < baseline_accuracy * threshold:
raise Exception(f"Accuracy regression: {new_accuracy:.3f} < {baseline_accuracy:.3f}")

# Test speed
new_speed = benchmark_speed(new_model)
baseline_speed = benchmark_speed(baseline_model)

if new_speed < baseline_speed * threshold:
raise Exception(f"Speed regression: {new_speed:.1f} < {baseline_speed:.1f}")

print("Regression test passed")

Further Reading