Lewati ke konten utama

Evaluasi dan Benchmark

Artikel ini memperkenalkan cara mengevaluasi kekuatan dan performa AI Go, termasuk sistem rating Elo, metode pengujian pertandingan, dan benchmark standar.


Sistem Rating Elo

Konsep Dasar

Rating Elo adalah metode standar untuk mengukur kekuatan relatif:

Expected winrate E_A = 1 / (1 + 10^((R_B - R_A) / 400))

Elo baru = Elo lama + K × (hasil aktual - hasil yang diharapkan)

Tabel Konversi Selisih Elo dan Winrate

Selisih EloWinrate yang Lebih Kuat
050%
10064%
20076%
40091%
80099%

Implementasi

def expected_score(rating_a, rating_b):
"""Hitung expected score A terhadap B"""
return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating, expected, actual, k=32):
"""Update rating Elo"""
return rating + k * (actual - expected)

def calculate_elo_diff(wins, losses, draws):
"""Hitung selisih Elo dari hasil pertandingan"""
total = wins + losses + draws
win_rate = (wins + 0.5 * draws) / total

if win_rate <= 0 or win_rate >= 1:
return float('inf') if win_rate >= 1 else float('-inf')

return 400 * math.log10(win_rate / (1 - win_rate))

Pengujian Pertandingan

Framework Pengujian

class MatchTester:
def __init__(self, engine_a, engine_b):
self.engine_a = engine_a
self.engine_b = engine_b
self.results = {'a_wins': 0, 'b_wins': 0, 'draws': 0}

def run_match(self, num_games=400):
"""Jalankan pengujian pertandingan"""
for i in range(num_games):
# Bergantian giliran pertama
if i % 2 == 0:
black, white = self.engine_a, self.engine_b
a_is_black = True
else:
black, white = self.engine_b, self.engine_a
a_is_black = False

# Mainkan pertandingan
result = self.play_game(black, white)

# Catat hasil
if result == 'black':
if a_is_black:
self.results['a_wins'] += 1
else:
self.results['b_wins'] += 1
elif result == 'white':
if a_is_black:
self.results['b_wins'] += 1
else:
self.results['a_wins'] += 1
else:
self.results['draws'] += 1

return self.results

def play_game(self, black_engine, white_engine):
"""Mainkan satu pertandingan"""
game = Game()

while not game.is_terminal():
if game.current_player == 'black':
move = black_engine.get_move(game.state)
else:
move = white_engine.get_move(game.state)

game.play(move)

return game.get_winner()

Signifikansi Statistik

Pastikan hasil pengujian memiliki makna statistik:

from scipy import stats

def calculate_confidence_interval(wins, total, confidence=0.95):
"""Hitung interval kepercayaan winrate"""
p = wins / total
z = stats.norm.ppf((1 + confidence) / 2)
margin = z * math.sqrt(p * (1 - p) / total)

return (p - margin, p + margin)

# Contoh
wins, total = 220, 400
ci_low, ci_high = calculate_confidence_interval(wins, total)
print(f"Winrate: {wins/total:.1%}, 95% CI: [{ci_low:.1%}, {ci_high:.1%}]")

Jumlah Pertandingan yang Disarankan

Perkiraan Selisih EloJumlah Pertandingan yang DisarankanTingkat Kepercayaan
>10010095%
50-10020095%
20-5040095%
<201000+95%

SPRT (Sequential Probability Ratio Test)

Konsep

Tidak perlu jumlah pertandingan tetap, keputusan berhenti berdasarkan hasil kumulatif secara dinamis:

def sprt(wins, losses, elo0=0, elo1=10, alpha=0.05, beta=0.05):
"""
Sequential Probability Ratio Test

elo0: Selisih Elo hipotesis null (biasanya 0)
elo1: Selisih Elo hipotesis alternatif (biasanya 5-20)
alpha: Tingkat false positive
beta: Tingkat false negative
"""
if wins + losses == 0:
return 'continue'

# Hitung log likelihood ratio
p0 = expected_score(elo1, 0) # Expected winrate di bawah H1
p1 = expected_score(elo0, 0) # Expected winrate di bawah H0

llr = (
wins * math.log(p0 / p1) +
losses * math.log((1 - p0) / (1 - p1))
)

# Batas keputusan
lower = math.log(beta / (1 - alpha))
upper = math.log((1 - beta) / alpha)

if llr <= lower:
return 'reject' # H0 ditolak, model baru lebih lemah
elif llr >= upper:
return 'accept' # H0 diterima, model baru lebih kuat
else:
return 'continue' # Lanjutkan pengujian

Benchmark KataGo

Menjalankan Benchmark

# Pengujian dasar
katago benchmark -model model.bin.gz

# Tentukan jumlah kunjungan
katago benchmark -model model.bin.gz -v 1000

# Output detail
katago benchmark -model model.bin.gz -v 1000 -t 8

Interpretasi Output

KataGo Benchmark Results
========================

Configuration:
Model: kata-b18c384.bin.gz
Backend: CUDA
Threads: 8
Visits: 1000

Performance:
NN evals/second: 2847.3
Playouts/second: 4521.8
Avg time per move: 0.221 seconds

Memory:
GPU memory usage: 2.1 GB
System memory: 1.3 GB

Quality metrics:
Policy accuracy: 0.612
Value accuracy: 0.891

Metrik Kunci

MetrikDeskripsiNilai Bagus
NN evals/secKecepatan evaluasi neural network>1000
Playouts/secKecepatan simulasi MCTS>2000
Utilisasi GPUEfisiensi penggunaan GPU>80%

Evaluasi Kekuatan

Padanan Kekuatan Manusia

Elo AIKekuatan Manusia
~1500Amatir 1 dan
~2000Amatir 5 dan
~2500Profesional shodan
~3000Profesional 5 dan
~3500Level juara dunia
~4000+Melampaui manusia

Elo AI Utama

AIElo (perkiraan)
KataGo (terbaru)~5000
AlphaGo Zero~5000
Leela Zero~4500
Fine Art~4800

Pengujian Perbandingan

def estimate_human_rank(ai_model, test_positions):
"""Perkirakan kekuatan manusia yang setara dengan AI"""
# Gunakan soal tes standar
correct = 0
for pos in test_positions:
ai_move = ai_model.get_best_move(pos['state'])
if ai_move == pos['best_move']:
correct += 1

accuracy = correct / len(test_positions)

# Tabel padanan akurasi
if accuracy > 0.9:
return "Level profesional"
elif accuracy > 0.7:
return "Amatir 5 dan+"
elif accuracy > 0.5:
return "Amatir 1-5 dan"
else:
return "Di bawah level amatir"

Monitoring Performa

Monitoring Berkelanjutan

import time
import psutil
import GPUtil

class PerformanceMonitor:
def __init__(self):
self.metrics = []

def sample(self):
"""Sampel metrik performa saat ini"""
gpus = GPUtil.getGPUs()

self.metrics.append({
'timestamp': time.time(),
'cpu_percent': psutil.cpu_percent(),
'memory_percent': psutil.virtual_memory().percent,
'gpu_util': gpus[0].load * 100 if gpus else 0,
'gpu_memory': gpus[0].memoryUsed if gpus else 0,
})

def report(self):
"""Buat laporan"""
if not self.metrics:
return

avg_cpu = sum(m['cpu_percent'] for m in self.metrics) / len(self.metrics)
avg_gpu = sum(m['gpu_util'] for m in self.metrics) / len(self.metrics)

print(f"Rata-rata penggunaan CPU: {avg_cpu:.1f}%")
print(f"Rata-rata penggunaan GPU: {avg_gpu:.1f}%")

Diagnosis Bottleneck Performa

GejalaKemungkinan PenyebabSolusi
CPU 100%, GPU rendahThread pencarian tidak cukupTingkatkan numSearchThreads
GPU 100%, output lambatBatch terlalu kecilTingkatkan nnMaxBatchSize
Memori tidak cukupModel terlalu besarGunakan model lebih kecil
Kecepatan tidak stabilSuhu terlalu tinggiPerbaiki pendinginan

Pengujian Otomatis

Integrasi CI/CD

# .github/workflows/benchmark.yml
name: Benchmark

on:
push:
branches: [main]

jobs:
benchmark:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- name: Run benchmark
run: |
./katago benchmark -model model.bin.gz -v 500 > results.txt

- name: Check performance
run: |
playouts=$(grep "Playouts/second" results.txt | awk '{print $2}')
if (( $(echo "$playouts < 1000" | bc -l) )); then
echo "Performance regression detected!"
exit 1
fi

Pengujian Regresi

def regression_test(new_model, baseline_model, threshold=0.95):
"""Periksa apakah model baru memiliki regresi performa"""
# Uji akurasi
new_accuracy = test_accuracy(new_model)
baseline_accuracy = test_accuracy(baseline_model)

if new_accuracy < baseline_accuracy * threshold:
raise Exception(f"Regresi akurasi: {new_accuracy:.3f} < {baseline_accuracy:.3f}")

# Uji kecepatan
new_speed = benchmark_speed(new_model)
baseline_speed = benchmark_speed(baseline_model)

if new_speed < baseline_speed * threshold:
raise Exception(f"Regresi kecepatan: {new_speed:.1f} < {baseline_speed:.1f}")

print("Pengujian regresi berhasil")

Bacaan Lanjutan