1. Introduction¶
In this data science project, we navigate the complete data science lifecycle, from data collection to training a machine learning model and uncovering valuable insights. Our main goal is to analyze the key characteristics that significantly impact the outcome of a chess game, identifying the most crucial factors contributing to victory. As chess enthusiasts, we have explored various learning resources that emphasize different aspects of the game. However, there has been no data-driven consensus on the most impactful features that players should prioritize to maximize their chances of success. This study aims to fill that gap by leveraging data science to provide players with evidence-based recommendations. By analyzing a vast array of chess game data, we will uncover patterns, strategies, and characteristics that distinguish winning players. Through machine learning techniques and statistical analysis, we will identify the key factors that most influence the outcome of a chess game. The findings of this research will serve as a valuable resource for chess players of all skill levels, helping them focus their efforts on the most essential aspects of their gameplay. By providing a data-driven perspective on the qualities that matter in chess, we aim to change how players approach their training and help them make informed decisions. Moreover, this study is relevant in a general data science context due to the lack of existing analysis on the subject and the abundance of data available (millions of games are played each month).
Join us on this journey as we explore the world of chess analytics and uncover the secrets to success on the chessboard. Through this project, we will advance our understanding of the game and provide players with the tools and knowledge they need to improve their chess skills.
Required Libraries and Imports¶
We use a variety of libraries to help us with data cleaning, visualization, and machine learning. Use pip3 install [packageName] to install a package locally (see the one-line install example after the list).
- Pandas: high performance data structures and analysis toolkit for data frames
- NumPy: scientific computing library, allowing for convenient and performant array / matrix operations
- Matplotlib: plotting library for data visualization
- scikit-learn: tool for machine learning pipelines
- category-encoders: set of scikit-learn transformers to help with encoding categorical data
- chess: allows us to parse Portable Game Notation file (.pgn), which is how chess games are stored
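For example, all of the above can be installed in one shot (note that the chess parser is published on PyPI under the name chess):

pip3 install pandas numpy matplotlib scikit-learn category_encoders chess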
Import Libraries¶
import chess.pgn
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report, precision_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
Background / Further Reading: Chess¶
In order to understand some of the motivations and terminology in this study, we recommend a familiarity with the game of chess. Recommended resources include:
2. About The Data¶
There are a wide variety of chess databases online that allow you to download and analyze chess games. For this study, we selected the Lichess open database for two main reasons:
- Lichess is one of the largest online chess platforms, providing a vast collection of games played by players of various skill levels. This ensures a diverse and representative dataset for our analysis.
- The Lichess database is freely accessible and regularly updated, allowing for easy acquisition and use of the data for research purposes.
The data was acquired from the website: https://database.lichess.org. Due to the size of the data (~30 GB of games / month), we restricted our analysis to March 2024. Once downloaded, it was divided into two CSV files: one file encompassing the complete dataset, and another smaller file suitable for sharing on GitHub. These files are identified as evaluations.csv and miniEvaluations.csv respectively.
The evaluations.csv file contains the full dataset, which includes a comprehensive set of features and game information for a large number of chess games. This dataset will be used for the main analysis and model training.
On the other hand, the miniEvaluations.csv file is a smaller subset of the full dataset, specifically curated for sharing on GitHub. This file contains a representative sample of chess games and their associated features, making it suitable for rapid iteration.
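As a rough sketch of how such a subset can be produced (the 50,000-row sample size here is illustrative; the actual size of miniEvaluations.csv is not specified above):

# illustrative only: carve a shareable random sample out of the full dataset
pd.read_csv("evaluations.csv").sample(n=50000, random_state=42).to_csv("miniEvaluations.csv", index=False)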
3. Data Collection: Scraping¶
At this stage, we need to take the data downloaded from the Lichess database, filter out all of the games missing computer evaluations (approximately 90% of the database), and organize the rest into a CSV with all the pertinent features. We require games with computer evaluations because they let us derive deeper insights from blunders and related features. Removing 90% of the dataset may sound significant, but considering that roughly 90,000,000 games are played monthly, this still leaves about 9,000,000 games. The CSV is organized so that it can be read directly into a pandas dataframe.
import glob  # needed to expand the .pgn file pattern below

pgnFilePath = r'./data/*.pgn'
outputFilePath = "./evaluations.csv"
totalGameCount = 0
totalEvalGameCount = 0
fileList = glob.glob(pgnFilePath)
rows_list = []  # list of dictionaries, one per game
for file in fileList:
    pgn = open(file)
    game = chess.pgn.read_game(pgn)
    while game is not None:
        totalGameCount += 1
        variations = game.variations  # child nodes whose comments hold the eval / clock annotations
        if totalGameCount % 2000 == 0:  # print progress every 2000 games
            print("Total Game Count: ", totalGameCount)
            print("Eval Game Count: ", totalEvalGameCount)
        if totalGameCount % 1000000 == 0:  # periodically checkpoint to disk
            df = pd.DataFrame(rows_list)
            df.to_csv(outputFilePath)
        if len(variations) > 0 and 'eval' in variations[0].comment:  # keep only computer-evaluated games
            totalEvalGameCount += 1
            h = game.headers
            # appending a dictionary is faster than appending to a dataframe
            rows_list.append({
                "UTCDate": h.get("UTCDate", np.NaN),
                "UTCTime": h.get("UTCTime", np.NaN),
                'WhiteElo': h.get('WhiteElo', np.NaN),
                'BlackElo': h.get('BlackElo', np.NaN),
                "Opening": h.get("Opening", np.NaN),
                "ECO": h.get("ECO", np.NaN),
                'Result': h.get('Result', np.NaN),
                "Termination": h.get("Termination", np.NaN),
                "Variations": str(variations[0]) if variations else np.NaN,
                'WhiteRatingDiff': h.get('WhiteRatingDiff', np.NaN),
                'BlackRatingDiff': h.get("BlackRatingDiff", np.NaN)
            })
        # the iterator reads the next game
        game = chess.pgn.read_game(pgn)

# write out whatever remains after the last checkpoint
df = pd.DataFrame(rows_list)
df.to_csv(outputFilePath)
4. Data Processing: Cleaning¶
During this stage of the data science lifecycle, we will clean and prepare the data for analysis and machine learning. We will modify the dataframe to include only the key features that we wish to train our model on. Most features compare the two players, since we want to predict winning; compared features are expressed from the white player's perspective. The included features relate to time, mistakes, positional traits, Elo, moves, and of course the actual result.
# Read the data back in from the CSV; we only need white and black Elo, opening category (ECO), result, and variations
smallDataset = "miniEvaluations.csv"
fullDataset = "evaluations.csv"
df = pd.read_csv(fullDataset, usecols=["WhiteElo", "BlackElo", "ECO", "Result", "Variations"])
# All data will be from the perspective of white
# Get mistake differential from a game
def getMistakeDifferentials(variation):
    # Find all evaluations; the dot is escaped and multi-digit scores allowed
    # (mate scores like "#3" are skipped, as before)
    evalText = re.findall(r'%eval -?\d+\.?\d*', variation)
    # Truncate the text and get the float eval value
    evalList = []
    for evalStr in evalText:
        evalList.append(float(evalStr.split(" ")[1]))
    # Find the mistake differential for white
    evalDifference = 0
    whiteBlunders = 0
    blackBlunders = 0
    whiteMistakes = 0
    blackMistakes = 0
    whiteInaccuracies = 0
    blackInaccuracies = 0
    for i in range(len(evalList)):
        if i != 0:
            evalDifference = evalList[i] - evalList[i - 1]
            if abs(evalDifference) > 3:
                if i % 2 == 0:
                    whiteBlunders += 1
                else:
                    blackBlunders += 1
            elif abs(evalDifference) > 1:
                if i % 2 == 0:
                    whiteMistakes += 1
                else:
                    blackMistakes += 1
            elif abs(evalDifference) > 0.5:
                if i % 2 == 0:
                    whiteInaccuracies += 1
                else:
                    blackInaccuracies += 1
    blunderDifferential = whiteBlunders - blackBlunders
    mistakeDifferential = whiteMistakes - blackMistakes
    inaccuracyDifferential = whiteInaccuracies - blackInaccuracies
    return pd.Series({"BlunderDifferential": blunderDifferential, "MistakeDifferential": mistakeDifferential,
                      "InaccuracyDifferential": inaccuracyDifferential})
# Get time differential from a game
def getTimeDifferential(variation):
    # Find all clock annotations
    clockText = re.findall(r'%clk \d:\d{2}:\d{2}', variation)
    # Truncate the text and get clock data in seconds
    clockList = []
    for clock in clockText:
        time = clock.split(" ")[1]
        hr = int(time.split(":")[0])
        mins = int(time.split(":")[1])
        sec = int(time.split(":")[2])
        totalSeconds = hr * 3600 + mins * 60 + sec
        clockList.append(totalSeconds)
    # Get the clock differential
    whiteEndClock = 0
    blackEndClock = 0
    # If there are fewer than 2 clock readings, we can't calculate a time differential
    whiteBeginClock = 0
    if len(clockList) >= 2:
        whiteBeginClock = clockList[0]
        # Clocks alternate white, black, ... so the parity of the count tells us whose reading came last
        if len(clockList) % 2 == 0:
            whiteEndClock = clockList[-2]
            blackEndClock = clockList[-1]
        else:
            whiteEndClock = clockList[-1]
            blackEndClock = clockList[-2]
    timeDifferential = whiteEndClock - blackEndClock
    return pd.Series({"TimeDifferential": timeDifferential, "TimeControl": whiteBeginClock})
def getMoves(variation):
    # Count evaluated half-moves using the same eval pattern as above
    moves = re.findall(r'%eval -?\d+\.?\d*', variation)
    # Round up to an even count so move parity doesn't give away the winner
    if len(moves) % 2 != 0:
        return len(moves) + 1
    return len(moves)
# Turn string result into a number result
def getResultForWhite(result):
    if result == "0-1":
        return 0
    elif result == "1-0":
        return 1
    else:
        return 0.5
df[["BlunderDifferential","MistakeDifferential","InaccuracyDifferential"]] = df["Variations"].apply(getMistakeDifferentials)
df[["TimeDifferential","TimeControl"]] = df["Variations"].apply(getTimeDifferential)
df["Moves"] = df["Variations"].apply(getMoves)
df["EloDifferential"] = df["WhiteElo"] - df["BlackElo"]
df["AverageElo"] = (df["WhiteElo"] + df["BlackElo"]) / 2
df["Result"] = df["Result"].apply(getResultForWhite)
pd.set_option('display.max_columns', None)
print(df.head())
   WhiteElo  BlackElo  ECO  Result  \
0      2344      2247  B09     0.0
1      1605      1733  C33     1.0
2      1897      1491  B12     1.0
3      2026      1684  B13     1.0
4      1520      1079  C40     1.0

                                          Variations  BlunderDifferential  \
0  1. e4 { [%eval 0.13] [%clk 0:03:00] } 1... d6 ...                    0
1  1. e4 { [%eval 0.13] [%clk 0:10:00] } 1... e5 ...                   -1
2  1. e4 { [%eval 0.13] [%clk 0:30:00] } 1... c6 ...                   -1
3  1. e4 { [%eval 0.13] [%clk 0:29:57] } 1... c6 ...                    0
4  1. e4 { [%eval 0.13] [%clk 0:10:00] } 1... e5 ...                    0

   MistakeDifferential  InaccuracyDifferential  TimeDifferential  TimeControl  \
0                    2                       1               -26          180
1                    0                      -2                33          600
2                   -2                       0               495         1800
3                    1                       2               531         1797
4                   -3                      -3              -121          600

   Moves  EloDifferential  AverageElo
0     36               97      2295.5
1     48             -128      1669.0
2     48              406      1694.0
3     90              342      1855.0
4     48              441      1299.5
This next section keeps track of the board position and calculates four positional features at the end of the opening phase of the game (the first 24 half-moves, i.e. 12 full moves).
import re
def setupGame():
    # Set up the starting board
    gameArr = [["" for i in range(8)] for i in range(8)]
    setupArr = ["R", "N", "B", "Q", "K", "B", "N", "R"]
    for i in range(8):
        gameArr[7][i] = "B" + setupArr[i]
        gameArr[0][i] = "W" + setupArr[i]
    for i in range(8):
        gameArr[6][i] = "BP"
        gameArr[1][i] = "WP"
    return gameArr
def makeMove(gameArr, move, moveNum):
    # Each move can either be a regular move from one square to another,
    # or a special move (castle, promotion) with more steps than a simple move
    piece, specifier, file, rank = breakMoveUp(move)
    color = "W" if moveNum % 2 == 0 else "B"
    # A castle is simply a rearranging of king and rook
    if piece == "Castle":
        colorRank = 0 if color == "W" else 7
        gameArr[colorRank][4] = ""
        gameArr[colorRank][5] = color + "R"
        gameArr[colorRank][6] = color + "K"
        gameArr[colorRank][7] = ""
    elif piece == "Queenside Castle":
        colorRank = 0 if color == "W" else 7
        gameArr[colorRank][0] = ""
        gameArr[colorRank][2] = color + "K"
        gameArr[colorRank][3] = color + "R"
        gameArr[colorRank][4] = ""
    # A promotion is a pawn move to the last rank followed by a swap to a different piece
    elif piece == "Promotion":
        firstMove = move.split("=")[0]
        makeMove(gameArr, firstMove, moveNum)
        piece, specifier, file, rank = breakMoveUp(firstMove)
        col = ord(file) - ord("a")
        row = int(rank) - 1
        gameArr[row][col] = color + move.split("=")[1][0]
    else:
        piece = color + piece
        col = ord(file) - ord("a")
        row = int(rank) - 1
        # Need to find the square the moving piece currently occupies
        curRow, curCol = findPiecePos(piece, specifier, gameArr, col, row, color)
        # En passant condition: if a pawn captured but the destination square
        # started empty, the captured pawn sits behind the destination
        if "x" in move and gameArr[row][col] == "":
            if color == "W":
                # Erasing a black pawn, so look one rank below the destination
                gameArr[row - 1][col] = ""
            else:
                gameArr[row + 1][col] = ""
        gameArr[row][col] = piece
        gameArr[curRow][curCol] = ""
def findPiecePos(piece, specifier, gameArr, endCol, endRow, color):
    col = -1
    row = -1
    if specifier:
        # Specifier is a file
        if 0 <= ord(specifier) - ord("a") <= 8:
            col = ord(specifier) - ord("a")
        # Specifier is a rank
        else:
            row = int(specifier) - 1
    # Check all 64 squares brute force
    for i in range(len(gameArr)):
        for j in range(len(gameArr[i])):
            # Meets piece and specifier constraints
            if piece == gameArr[i][j] and (col == -1 or j == col) and (row == -1 or i == row):
                colDistance = abs(endCol - j)
                rowDistance = abs(endRow - i)
                # A candidate origin square for a piece is valid when:
                # 1. The piece's movement rules are followed, e.g. N moves 2 squares one way and 1 square the other
                # 2. The piece's path to its destination is unobstructed
                # 3. The piece's movement doesn't leave its king in check
                if piece[1] == "N" and colDistance + rowDistance == 3 \
                        and min(colDistance, rowDistance) == 1 and not kingChecked(gameArr, color, i, j, endRow, endCol):
                    return (i, j)
                if piece[1] == "R" and min(colDistance, rowDistance) == 0 and not kingChecked(gameArr, color, i, j, endRow, endCol):
                    blockingPiece = False
                    if colDistance == 0:
                        multiplier = 1
                        if i < endRow:
                            multiplier = -1
                        for k in range(1, rowDistance):
                            if gameArr[endRow + multiplier * k][endCol] != "":
                                blockingPiece = True
                    else:
                        multiplier = 1
                        if j < endCol:
                            multiplier = -1
                        for k in range(1, colDistance):
                            if gameArr[endRow][endCol + multiplier * k] != "":
                                blockingPiece = True
                    if blockingPiece == False:
                        return (i, j)
                if piece[1] == "B" and colDistance == rowDistance and not kingChecked(gameArr, color, i, j, endRow, endCol):
                    blockingPiece = False
                    if colDistance == rowDistance:
                        rowM = 1
                        colM = 1
                        if i < endRow:
                            rowM = -1
                        if j < endCol:
                            colM = -1
                        for k in range(1, rowDistance):
                            if gameArr[endRow + rowM * k][endCol + colM * k] != "":
                                blockingPiece = True
                    if blockingPiece == False:
                        return (i, j)
                if piece[1] == "P" and not kingChecked(gameArr, color, i, j, endRow, endCol):
                    if ((color == "W" and i == 1) or (color == "B" and i == 6)) and rowDistance == 2 and colDistance == 0:
                        if color == "W" and i < endRow:
                            if gameArr[endRow - 1][endCol] == "":
                                return (i, j)
                        elif color == "B" and i > endRow:
                            if gameArr[endRow + 1][endCol] == "":
                                return (i, j)
                    elif rowDistance == 1 and colDistance <= 1:
                        if color == "W" and i < endRow:
                            return (i, j)
                        elif color == "B" and i > endRow:
                            return (i, j)
                if piece[1] == "K":
                    return (i, j)
                if piece[1] == "Q" and (min(colDistance, rowDistance) == 0 or colDistance == rowDistance) and not kingChecked(gameArr, color, i, j, endRow, endCol):
                    blockingPiece = False
                    if colDistance == rowDistance:
                        rowM = 1
                        colM = 1
                        if i < endRow:
                            rowM = -1
                        if j < endCol:
                            colM = -1
                        for k in range(1, rowDistance):
                            if gameArr[endRow + rowM * k][endCol + colM * k] != "":
                                blockingPiece = True
                    elif colDistance == 0:
                        multiplier = 1
                        if i < endRow:
                            multiplier = -1
                        for k in range(1, rowDistance):
                            if gameArr[endRow + multiplier * k][endCol] != "":
                                blockingPiece = True
                    elif rowDistance == 0:
                        multiplier = 1
                        if j < endCol:
                            multiplier = -1
                        for k in range(1, colDistance):
                            if gameArr[endRow][endCol + multiplier * k] != "":
                                blockingPiece = True
                    if blockingPiece == False:
                        return (i, j)
def kingChecked(gameArr, color, startRow, startCol, endRow, endCol):
    # Checks whether a piece is pinned to its king:
    # temporarily make the move, then check if the king is left attacked
    startPiece = gameArr[startRow][startCol]
    gameArr[startRow][startCol] = ""
    endPiece = gameArr[endRow][endCol]
    gameArr[endRow][endCol] = startPiece
    for i in range(len(gameArr)):
        for j in range(len(gameArr[i])):
            if gameArr[i][j] == color + "K":
                kArr = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (-1, -1), (1, -1), (-1, 1)]
                row = i
                col = j
                for k in range(len(kArr) * 8):
                    r, c = kArr[k % 8]
                    r = r * ((k // 8) + 1)
                    c = c * ((k // 8) + 1)
                    modRow = row + r
                    modCol = col + c
                    if 0 <= modRow <= 7 and 0 <= modCol <= 7:
                        if gameArr[modRow][modCol] != "":
                            pieceColor = gameArr[modRow][modCol][0]
                            oppositeColor = "B" if color == "W" else "W"
                            piece = gameArr[modRow][modCol][1]
                            # Straight rays: rook or queen gives check
                            if k % 8 < 4:
                                if pieceColor == oppositeColor and piece in ["R", "Q"]:
                                    gameArr[startRow][startCol] = startPiece
                                    gameArr[endRow][endCol] = endPiece
                                    return True
                                else:
                                    kArr[k % 8] = (0, 0)
                            # Diagonal rays: bishop or queen gives check
                            else:
                                if pieceColor == oppositeColor and piece in ["B", "Q"]:
                                    gameArr[startRow][startCol] = startPiece
                                    gameArr[endRow][endCol] = endPiece
                                    return True
                                else:
                                    kArr[k % 8] = (0, 0)
    gameArr[startRow][startCol] = startPiece
    gameArr[endRow][endCol] = endPiece
    return False
def breakMoveUp(move):
    # Break the move into its piece, a specifier (to break ties between multiple
    # of the same piece), a file, and a rank
    files = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    ranks = ['1', '2', '3', '4', '5', '6', '7', '8']
    piece = move[0]
    specifier = None
    file = None
    rank = None
    if move[0] not in ["R", "N", "B", "Q", "K"]:
        # Castle or pawn move
        if move[0] == "O":
            if len(move) == 3:
                return "Castle", None, None, None
            else:
                return "Queenside Castle", None, None, None
        else:
            piece = "P"
            if "=" in move:
                return "Promotion", None, None, None
    # Get file, rank, specifier
    for i in range(0, len(move)):
        if move[i] in files or move[i] in ranks:
            if specifier == None:
                specifier = move[i]
            else:
                if move[i] in files:
                    file = move[i]
                else:
                    rank = move[i]
    if not file:
        file = specifier
        specifier = None
    if not rank:
        rank = specifier
        specifier = None
    if piece == "P" and specifier == None:
        specifier = file
    return piece, specifier, file, rank
def playMoves(moves):
    # Play through the moves by calling makeMove repeatedly
    gameArr = setupGame()
    for i in range(len(moves)):
        makeMove(gameArr, moves[i], i)
    return gameArr
def findKingSafetyDifferential(gameArr):
    # Pawn Shield method for estimating king safety:
    # a missing shielding pawn within one or two squares of the king costs -1,
    # and an open file costs -3
    blackKingSafetyPenalty = 0
    whiteKingSafetyPenalty = 0
    for i in range(len(gameArr)):
        for j in range(len(gameArr[i])):
            if gameArr[i][j] == "WK":
                minRow = i + 1
                maxRow = min(7, i + 2)
                minCol = max(0, j - 1)
                maxCol = min(7, j + 1)
                whiteKingSafetyPenalty = calculateKingSafetyPenalty(minRow, maxRow, minCol, maxCol, "W", gameArr)
            if gameArr[i][j] == "BK":
                minRow = max(0, i - 2)
                maxRow = i - 1
                minCol = max(0, j - 1)
                maxCol = min(7, j + 1)
                blackKingSafetyPenalty = calculateKingSafetyPenalty(minRow, maxRow, minCol, maxCol, "B", gameArr)
    return whiteKingSafetyPenalty - blackKingSafetyPenalty
def calculateKingSafetyPenalty(minRow, maxRow, minCol, maxCol, color, gameArr):
    # Open files are -3 and missing shield pawns are -1
    totalPenalty = 0
    for b in range(minCol, maxCol + 1):
        # Missing pawn check
        penalty = -1
        for a in range(minRow, maxRow + 1):
            if gameArr[a][b] == color + "P":
                penalty = 0
        # Open file check
        if penalty == -1:
            penalty = -3
            for a in range(0, 7):
                if gameArr[a][b] != "" and gameArr[a][b][1] == "P":
                    penalty = -1
        totalPenalty += penalty
    return totalPenalty
def findMobilityDifferential(gameArr):
    # Estimate mobility by counting the number of legal moves for each side
    blackMobility = 0
    whiteMobility = 0
    for i in range(len(gameArr)):
        for j in range(len(gameArr[i])):
            if gameArr[i][j] != "":
                color = gameArr[i][j][0]
                piece = gameArr[i][j][1]
                if piece == "P":
                    if color == "W":
                        if gameArr[i + 1][j] == "":
                            whiteMobility += 1
                            # double push from the starting rank requires both squares empty
                            if i == 1 and gameArr[i + 2][j] == "":
                                whiteMobility += 1
                    else:
                        if gameArr[i - 1][j] == "":
                            blackMobility += 1
                            if i == 6 and gameArr[i - 2][j] == "":
                                blackMobility += 1
                else:
                    if color == "W":
                        whiteMobility += generateMobilityCombos(piece, i, j, gameArr)
                    else:
                        blackMobility += generateMobilityCombos(piece, i, j, gameArr)
    return whiteMobility - blackMobility
def generateMobilityCombos(piece, row, col, gameArr):
    # Check all destination squares for each piece type to count legal squares
    # (sliding rays are extended up to four squares)
    mobility = 0
    if piece == "N":
        nArr = [(-1, -2), (-1, 2), (1, -2), (1, 2), (2, 1), (2, -1), (-2, 1), (-2, -1)]
        for i in range(len(nArr)):
            r, c = nArr[i]
            modRow = row + r
            modCol = col + c
            if 0 <= modRow <= 7 and 0 <= modCol <= 7:
                if gameArr[modRow][modCol] == "":
                    mobility += 1
    elif piece == "B":
        bArr = [(-1, -1), (1, 1), (-1, 1), (1, -1)]
        for i in range(len(bArr) * 4):
            r, c = bArr[i % 4]
            r = r * ((i // 4) + 1)
            c = c * ((i // 4) + 1)
            modRow = row + r
            modCol = col + c
            if 0 <= modRow <= 7 and 0 <= modCol <= 7:
                if gameArr[modRow][modCol] == "":
                    mobility += 1
                else:
                    bArr[i % 4] = (0, 0)
    elif piece == "R":
        rArr = [(0, -1), (0, 1), (1, 0), (-1, 0)]
        for i in range(len(rArr) * 4):
            r, c = rArr[i % 4]
            r = r * ((i // 4) + 1)
            c = c * ((i // 4) + 1)
            modRow = row + r
            modCol = col + c
            if 0 <= modRow <= 7 and 0 <= modCol <= 7:
                if gameArr[modRow][modCol] == "":
                    mobility += 1
                else:
                    rArr[i % 4] = (0, 0)
    elif piece == "K":
        kArr = [(0, 1), (0, -1), (1, -1), (1, 0), (1, 1), (-1, -1), (-1, 0), (-1, 1)]
        for i in range(len(kArr)):
            r, c = kArr[i]
            modRow = row + r
            modCol = col + c
            if 0 <= modRow <= 7 and 0 <= modCol <= 7:
                if gameArr[modRow][modCol] == "":
                    mobility += 1
    elif piece == "Q":
        qArr = [(0, -1), (0, 1), (1, 0), (-1, 0), (-1, -1), (1, 1), (-1, 1), (1, -1)]
        for i in range(len(qArr) * 4):
            r, c = qArr[i % 8]
            r = r * ((i // 8) + 1)
            c = c * ((i // 8) + 1)
            modRow = row + r
            modCol = col + c
            if 0 <= modRow <= 7 and 0 <= modCol <= 7:
                if gameArr[modRow][modCol] == "":
                    mobility += 1
                else:
                    qArr[i % 8] = (0, 0)
    return mobility
def findMaterialDifferential(gameArr):
    blackDict = {"P": 0, "B": 0, "N": 0, "R": 0, "Q": 0, "K": 0}
    whiteDict = {"P": 0, "B": 0, "N": 0, "R": 0, "Q": 0, "K": 0}
    # Calculating material imbalance:
    # P = 1, B and N = 3, R = 5, Q = 9
    # +1 for the bishop pair, -1 for the knight pair,
    # -1 for the rook pair, -1 for having no pawns
    for i in range(len(gameArr)):
        for j in range(len(gameArr[i])):
            if gameArr[i][j] != "":
                color = gameArr[i][j][0]
                piece = gameArr[i][j][1]
                if color == "B":
                    blackDict[piece] = blackDict.get(piece, 0) + 1
                else:
                    whiteDict[piece] = whiteDict.get(piece, 0) + 1
    blackPoints = calcPoints(blackDict)
    whitePoints = calcPoints(whiteDict)
    return whitePoints - blackPoints
def calcPoints(counts):
    # Standard piece values plus the pair adjustments listed above
    points = counts["B"] * 3 + counts["N"] * 3 + counts["P"] * 1 + counts["Q"] * 9 + counts["R"] * 5
    if counts["B"] == 2:
        points += 1
    if counts["N"] == 2:
        points -= 1
    if counts["R"] == 2:
        points -= 1
    if counts["P"] == 0:
        points -= 1
    return points
def findDevelopmentDifferential(moves):
    whiteNonPawnMoves = 0
    blackNonPawnMoves = 0
    for i in range(len(moves)):
        if moves[i][0] in ["N", "B", "R", "Q", "K", "O"]:
            if i % 2 == 0:
                if moves[i][0] == "O":
                    whiteCastleTime = i // 2  # recorded but not used in the differential
                whiteNonPawnMoves += 1
            else:
                if moves[i][0] == "O":
                    blackCastleTime = i // 2  # recorded but not used in the differential
                blackNonPawnMoves += 1
    nonPawnMoveDifferential = whiteNonPawnMoves - blackNonPawnMoves
    return nonPawnMoveDifferential
def findStatistics(variation):
    # Match a move number followed by dots, a space, and then a combination of
    # letters, dashes, and equals signs to pull every move out of the PGN text
    moves = re.findall(r'\d+\.+\s[\w\-\=]+', variation)
    moves = [move.split(" ")[1] for move in moves]
    # Only play through the first 24 half-moves (12 full moves), i.e. the opening
    modifiedMoves = moves[:24]
    gameArr = playMoves(modifiedMoves)
    safety = findKingSafetyDifferential(gameArr)
    mobility = findMobilityDifferential(gameArr)
    material = findMaterialDifferential(gameArr)
    development = findDevelopmentDifferential(modifiedMoves)
    return pd.Series({"KingSafetyDifferential": safety, "MobilityDifferential": mobility,
                      "MaterialDifferential": material, "DevelopmentDifferential": development})
df[["KingSafetyDifferential","MobilityDifferential","MaterialDifferential", "DevelopmentDifferential"]] = df["Variations"].apply(findStatistics)
df = df.drop(["Variations", "WhiteElo", "BlackElo"], axis=1)
print(df.head())
   ECO  Result  BlunderDifferential  MistakeDifferential  \
0  B09     0.0                    0                    2
1  C33     1.0                   -1                    0
2  B12     1.0                   -1                   -2
3  B13     1.0                    0                    1
4  C40     1.0                    0                   -3

   InaccuracyDifferential  TimeDifferential  TimeControl  Moves  \
0                       1               -26          180     36
1                      -2                33          600     48
2                       0               495         1800     48
3                       2               531         1797     90
4                      -3              -121          600     48

   EloDifferential  AverageElo  KingSafetyDifferential  MobilityDifferential  \
0               97      2295.5                      -2                    10
1             -128      1669.0                       0                     3
2              406      1694.0                      -1                    11
3              342      1855.0                       0                     1
4              441      1299.5                       3                     7

   MaterialDifferential  DevelopmentDifferential
0                     1                       -1
1                     3                       -2
2                    -3                        1
3                     0                        1
4                     4                        0
5. Exploratory Data Analysis¶
Let's plot the relationship between each of our differentials and the average result. We'll use bar graphs as our result data is categorical and wouldn't be well represented with a scatter plot. While analyzing these graphs, we must keep in mind that a positive differential is advantageous for the Time, Elo, and Development differentials and a negative differential is advantageous for the Mistake and King Safety differentials.
df["Inaccuracy Differential Group"] = pd.cut(df['InaccuracyDifferential'], bins=[-12,-8,-4,0,4,8,12], precision=0)
df["Mistake Differential Group"] = pd.cut(df['MistakeDifferential'], bins=[-9,-6,-3,0,3,6,9], precision=0)
df["Blunder Differential Group"] = pd.cut(df['BlunderDifferential'], bins=[-6,-4,-2,0,2,4,6], precision=0)
# Drop na values caused by values that don't fit the bins
inaccuracyGroups = df["Inaccuracy Differential Group"].dropna().unique().sort_values()
mistakeGroups = df["Mistake Differential Group"].dropna().unique().sort_values()
blunderGroups = df["Blunder Differential Group"].dropna().unique().sort_values()
fig, axes = plt.subplots(3, 1, figsize=(10, 20))
ax1 = axes[0]
x = range(len(inaccuracyGroups))
y = df.groupby("Inaccuracy Differential Group", observed=True)["Result"].mean()
ax1.bar(x,y)
ax1.set_xticks(x,inaccuracyGroups)
ax1.set_xlabel('Inaccuracy Differential Group')
ax1.set_ylabel('Mean Result')
ax2 = axes[1]
x = range(len(mistakeGroups))
y = df.groupby("Mistake Differential Group", observed=True)["Result"].mean()
ax2.bar(x,y)
ax2.set_xticks(x,mistakeGroups)
ax2.set_xlabel('Mistake Differential Group')
ax2.set_ylabel('Mean Result')
ax3 = axes[2]
x = range(len(blunderGroups))
y = df.groupby("Blunder Differential Group", observed=True)["Result"].mean()
ax3.bar(x,y)
ax3.set_xticks(x,blunderGroups)
ax3.set_xlabel('Blunder Differential Group')
ax3.set_ylabel('Mean Result')
It seems that making fewer errors does influence winning. However, at extreme error differentials the graphs show results that contradict the overall trend, most likely because of the low sample size in those differential groups.
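We can sanity-check the low-sample-size explanation by counting how many games fall into each bin; a quick sketch using the groups created above:

# the extreme differential bins should be sparsely populated
print(df.groupby("Blunder Differential Group", observed=True).size())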
bins = [-150,-100,-50,0,50,100,150]
df["EloGroup"] = pd.cut(df['EloDifferential'], bins=bins, precision=0)
# Drop na values caused by values that don't fit the bins
groups = df["EloGroup"].dropna().unique().sort_values()
x = range(len(groups))
y = df.groupby("EloGroup", observed=True)["Result"].mean()
plt.figure(figsize=(10,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('Elo Differential Group')
plt.ylabel('Mean Result')
The Elo Differential graph shows that Elo has a clear correlation with winning: the greater the differential between a player and their opponent, the greater the chance of winning. However, Elo doesn't impact the game dramatically, at least not within the range shown; players inside that range faced less than a 10% disadvantage or advantage. This is much different from the error graphs, where having the mistake advantage could increase the chances of winning by more than 30%.
bins = [-150,-120,-90,-60,-30,0,30,60,90,120,150]
df["TimeGroup"] = pd.cut(df['TimeDifferential'],bins=bins, precision=0)
# Drop na values caused by values that don't fit the bins
groups = df["TimeGroup"].dropna().unique().sort_values()
x = range(len(groups))
y = df.groupby("TimeGroup", observed=True)["Result"].mean()
plt.figure(figsize=(12,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('Time Differential Group')
plt.ylabel('Mean Result')
According to the graph, it seems that the only factor that influences winning is whether the player has the time advantage, not the quantity of that advantage. This makes sense as games can be won on time, so a significant decline in results can only be seen near 0.
# Filter our df by openings that have at least 10000 games played, approximately 1% of the dataset.
# We want frequently played openings because unusual openings have a large variance to their win percentage.
value_counts = df["ECO"].value_counts()
filtered_values = value_counts[value_counts > len(df.index)*.01].index
filtered_df = df[df['ECO'].isin(filtered_values)]
groups = filtered_df['ECO'].unique()
x = range(len(groups))
y = filtered_df.groupby("ECO", observed=True)["Result"].mean()
plt.figure(figsize=(16,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('Openings')
plt.ylabel('Mean Result')
We filtered the data to only include openings that accounted for at least 1% of games in our database to get results with a low variance. The results show that openings do have a small impact on winning. The least and most winning openings differed by about 20%, with most openings hovering close to the 50% win rate.
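To pull out the extremes behind that roughly 20% spread, we can sort the per-opening means computed above:

# lowest- and highest-scoring frequent openings, from white's perspective
openingScores = filtered_df.groupby("ECO", observed=True)["Result"].mean().sort_values()
print(openingScores.head(3))
print(openingScores.tail(3))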
Let's graph our positional features against the mean result!
bins = [-6,-4,-2,0,2,4,6]
df["DevelopmentGroup"] = pd.cut(df['DevelopmentDifferential'],bins=bins, precision=0)
# Drop na values caused by values that don't fit the bins
groups = df["DevelopmentGroup"].dropna().unique().sort_values()
x = range(len(groups))
y = df.groupby("DevelopmentGroup", observed=True)["Result"].mean()
plt.figure(figsize=(12,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('Development Differential Group')
plt.ylabel('Mean Result')
Development differential has a small but present impact on win rate. Players who spend less time moving their pawns and more developing their major pieces come out ahead after the opening.
bins = [-4,-2,0,2,4]
df["KingSafetyGroup"] = pd.cut(df['KingSafetyDifferential'],bins=bins, precision=0)
# Drop na values caused by values that don't fit the bins
groups = df["KingSafetyGroup"].dropna().unique().sort_values()
x = range(len(groups))
y = df.groupby("KingSafetyGroup", observed=True)["Result"].mean()
plt.figure(figsize=(12,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('King Safety Differential Group')
plt.ylabel('Mean Result')
This graph has a similar trend as the development differential graph. Having more pawns and less open files around the king slightly increases the chance of winning out of the opening.
bins = [-12,-8,-4,0,4,8,12]
df["MobilityGroup"] = pd.cut(df['MobilityDifferential'],bins=bins, precision=0)
# Drop na values caused by values that don't fit the bins
groups = df["MobilityGroup"].dropna().unique().sort_values()
x = range(len(groups))
y = df.groupby("MobilityGroup", observed=True)["Result"].mean()
plt.figure(figsize=(12,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('Mobility Differential Group')
plt.ylabel('Mean Result')
Mobility out of the opening seems to have the greatest impact on winning out of all the positional features. Placing pieces where they control a substantial amount of space is important in the opening.
bins = [-4,-2,0,2,4]
df["MaterialGroup"] = pd.cut(df['MaterialDifferential'],bins=bins, precision=0)
# Drop na values caused by values that don't fit the bins
groups = df["MaterialGroup"].dropna().unique().sort_values()
x = range(len(groups))
y = df.groupby("MaterialGroup", observed=True)["Result"].mean()
plt.figure(figsize=(12,6))
plt.bar(x, y)
plt.xticks(x, groups)
plt.xlabel('Material Differential Group')
plt.ylabel('Mean Result')
Material of course also has a significant impact on winning. However, out of the opening, players are usually still even in material, so this feature isn't very indicative of winning in the way it is currently presented.
Let's explore how our features behave with one another to get a better insight into how they impact winning!
# Filter dataframe to remove outliers to make axes smaller
mincnt = 300
# If using the small dataset, lower the hexbin count threshold
if len(df.index) < 100000:
    mincnt = 10
# Filtering dataframe to remove outliers because outliers will expand the range of the graphs significantly and
# make trends in the graph tougher to notice.
filtered_df = df[(abs(df["EloDifferential"]) < 200) & (abs(df["TimeDifferential"]) < 200)]
fig, axes = plt.subplots(3, 1, figsize=(10, 20))
axesArr = [("EloDifferential","MistakeDifferential"),("EloDifferential","TimeDifferential"),("TimeDifferential","MistakeDifferential")]
for i, ax in enumerate(axes.flat):
    x, y = axesArr[i]
    # Use a hexbin since there is a significant amount of data, so scatter plots wouldn't work well
    ax.hexbin(filtered_df[x], filtered_df[y], gridsize=25, mincnt=mincnt, bins="log")
    # Get a line of best fit using linear regression
    [m, b] = np.polyfit(filtered_df[x], filtered_df[y], 1)
    # Plot this line
    ax.plot(filtered_df[x], m * filtered_df[x] + b, 'r')
    print(f"Slope for {x} vs {y}: {m}")
    predicted_y = np.polyval([m, b], filtered_df[x])
    residuals = filtered_df[y] - predicted_y
    SSR = np.sum(residuals ** 2)
    mean_y = filtered_df[y].mean()
    difference = filtered_df[y] - mean_y
    SST = np.sum(difference ** 2)
    Rsquared = 1 - (SSR / SST)
    print(f"R^2 value {Rsquared}")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
plt.show()
Slope for EloDifferential vs MistakeDifferential: -0.0019607645692288327
R^2 value 0.0027103792832799956
Slope for EloDifferential vs TimeDifferential: 0.039300400488498095
R^2 value 0.0014882666837612302
Slope for TimeDifferential vs MistakeDifferential: -0.0008304778469412856
R^2 value 0.0005046014153655687
The graphs above provide the correlation between our features and give us a more nuanced look at their impact on winning rates. Hexbin graphs were chosen because there is a significant amount of data, so scatter plots wouldn't work well. We also used linear regression to draw a line of best fit for each graph and, from that line, calculated the R^2 value. The incredibly low R^2 values show the lines' inability to explain the variance.
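As a sanity check, the hand-rolled R^2 above agrees with scikit-learn's r2_score, and for simple linear regression it also equals the squared Pearson correlation (x, y, and predicted_y here are as left by the final loop iteration):

from sklearn.metrics import r2_score
print(r2_score(filtered_df[y], predicted_y))
print(np.corrcoef(filtered_df[x], filtered_df[y])[0, 1] ** 2)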
The first graph between EloDifferential and MistakeDifferential shows that having a greater elo than your opponent can reduce your mistakes slightly. A close look at the hexbins shows this slight correlation. This makes sense intuitively as higher elo players are better than their opponents, and will make fewer mistakes.
The second graph, between EloDifferential and TimeDifferential, also agrees with our past findings: having a greater Elo gives a slight time advantage and therefore increases the chances of winning. This graph also had a significantly greater slope for its line of best fit than the other two, showing a much stronger correlation between these two characteristics.
The third graph, between TimeDifferential and MistakeDifferential, however, defies intuition. One would think that having a time disadvantage would cause a player to panic and make more mistakes, but there seems to be little to no correlation between the two variables. The slope of the line of best fit is about half that of the first graph, and there is no noticeable trend in the hexbin graph. One explanation might be that games with a large time differential often end by timeout, so mistakes wouldn't factor into the result.
6. Machine Learning¶
Now that we have completed the data cleaning and visualization parts of the data science pipeline, we want to train a machine learning model that will allow us to predict the winner of a chess game given features like ELO difference, time difference, blunders, etc., and then examine the feature importances to show players which ones have the greatest impact on determining the winner. We will be training a classifier (as we want to classify games as win, loss, or tie) rather than performing regression.
Preliminary Reading: Classification and Categorical Encoding¶
We recommend a familiarity with classification and categorical encoding to help understand this section. Recommended resources include:
Preparing Data For Training¶
Originally, the data for winning and losing is stored as a numerical value with 1.0 for winning, 0.0 for losing, and 0.5 for a tie. This is converted to a categorical label (rather than numerical) for training to allow for classification.
def convertWinToCategory(val):
    res = ""
    match val:
        case 1.0:
            res = "Win"
        case 0.0:
            res = "Loss"
        case 0.5:
            res = "Tie"
    return res
We first pick out all of the features we will use to train our model. This includes the king safety differential, development differential, average ELO, time control, moves, inaccuracy differential, mistake differential, blunder differential, time differential, Elo differential, opening, and the result. These features are all explained in the sections above. After that, we limit ourselves to openings that make up at least 2% of the database, which helps us avoid the curse of dimensionality (the ECO taxonomy alone has hundreds of codes, and named variations number in the thousands, making naive encoding difficult).
# extract necessary features for training
train_feat_df = df.loc[:,["MobilityDifferential", "MaterialDifferential", "KingSafetyDifferential","DevelopmentDifferential","AverageElo","TimeControl", "Moves", "InaccuracyDifferential", "MistakeDifferential", "BlunderDifferential", "TimeDifferential", "EloDifferential", "ECO", "Result"]]
# convert results to categorical values for training
train_feat_df["Result"] = train_feat_df["Result"].map(convertWinToCategory)
# returns a Series of the value counts of all the openings
value_counts = train_feat_df["ECO"].value_counts()
# keep only openings whose occurrence count exceeds 2% of the dataset, and cut the others out of the training dataset
filtered_values = value_counts[value_counts > len(train_feat_df.index)*.02].index
train_feat_df = train_feat_df[train_feat_df['ECO'].isin(filtered_values)]
Encoding Openings¶
Encoding the chess openings posed a significant challenge due to the large number of possibilities (hundreds of ECO codes and thousands of named variations). Using one-hot encoding without filtering would introduce n new features, where n is the number of openings, leading to the curse of dimensionality. To overcome this obstacle, we implemented a filtering process that limited the openings to those that constitute at least 2% of the dataset. This decision was based on the understanding that openings with lower representation would not provide a statistically significant number of games from which to derive meaningful insights. Once the filtering was complete, we applied one-hot encoding to the remaining openings, as they are categorical variables. This approach allowed us to manage the dimensionality of the dataset while preserving the most relevant information for analysis.
# Use pd.get_dummies to perform one-hot encoding
one_hot_enc_df = pd.get_dummies(train_feat_df["ECO"], prefix='ECO')
train_feat_df = pd.concat([train_feat_df, one_hot_enc_df], axis=1)
train_feat_df = train_feat_df.drop("ECO", axis = 1) #drop old opening column
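The category-encoders package listed in our imports provides an equivalent transformer; a minimal sketch of the same encoding with it (shown against the pre-encoding dataframe, i.e. before the ECO column is dropped):

import category_encoders as ce
# equivalent alternative to pd.get_dummies above
encoder = ce.OneHotEncoder(cols=["ECO"], use_cat_names=True)
encoded_df = encoder.fit_transform(train_feat_df)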
Choosing A Model¶
Now that the data has been prepared for training, a model has to be chosen. To do this, we train a wide variety of classification models with default settings, record the accuracy and precision of each on a held-out test set, and plot the results; hyperparameter tuning with k-fold cross-validation follows in the next section for the chosen model. This lets us find the model with the highest accuracy and precision for our needs.
# set up all classifiers
classifiers = [
RandomForestClassifier(),
GradientBoostingClassifier(),
AdaBoostClassifier(),
DecisionTreeClassifier(),
KNeighborsClassifier(),
SVC(),
GaussianNB(),
LogisticRegression(),
MLPClassifier()
]
# separate out data into training and testing datasets
X = train_feat_df.drop("Result", axis=1)
y = train_feat_df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.25, random_state=42, shuffle=True)
results = []
# loop through all classifiers
for classifier in classifiers:
    clf_name = classifier.__class__.__name__
    clf = classifier
    # fit each classifier
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # extract the accuracy and precision
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    # save the results in a dictionary
    results.append({'classifier': clf_name, 'accuracy': accuracy, 'precision': precision})
# plot using matplot lib
accuracies = [result['accuracy'] for result in results]
precisions = [result['precision'] for result in results]
classifiers = [result['classifier'] for result in results]
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(classifiers, accuracies, label='Accuracy')
ax.bar(classifiers, precisions, label='Precision')
ax.set_xlabel('Classifier')
ax.set_ylabel('Score')
ax.set_title('Classifier Performance')
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
From the graph above, we can see that the GradientBoostingClassifier and random forest perform about equally well, with KNeighbors performing the worst. This is intuitive: gradient boosting and random forests are ensemble methods that combine multiple weak learners (decision trees in both cases) into a strong learner. K-Nearest Neighbors (KNN), on the other hand, is a simpler algorithm that relies on the proximity of data points to make predictions. It may struggle with high-dimensional data and can be sensitive to the choice of the number of neighbors (k).
Given their similar performance, we will go with the gradient boosting classifier. This is because it is an iterative algorithm that progressively improves the model by focusing on the misclassified samples from previous iterations. This allows it to effectively handle difficult cases and achieve higher accuracy. It also has more hyperparameters that can be tuned to improve overall performance. In contrast, KNN's performance heavily depends on the quality and relevance of the selected neighbors. If the neighbors are not representative of the true class distribution or if the features are not well-separated, KNN may struggle to make accurate predictions.
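To make the comparison between the two leading candidates more robust than a single train/test split, k-fold cross-validation can be applied; a minimal sketch using the training split above:

from sklearn.model_selection import cross_val_score
# 5-fold cross-validation on the two strongest candidates from the plot above
for model in [GradientBoostingClassifier(), RandomForestClassifier()]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{model.__class__.__name__}: {scores.mean():.4f} +/- {scores.std():.4f}")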
Hyperparameter Optimization¶
After selecting the best model, we perform more rigorous hyperparameter optimization to get the best accuracy and precision possible. We use GridSearchCV to perform an exhaustive search of the parameter space (with 5-fold cross-validation) and save the best performing model. For additional reading on hyperparameter optimization, please see:
param_grid = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7],
}
# Create a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(random_state=42)
# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=gb_classifier, param_grid=param_grid, cv=5, n_jobs=-1, verbose=True)
grid_search.fit(X_train, y_train)
# Print the best model and its hyperparameters
print("Best Model:")
best_model = grid_search.best_estimator_
print(best_model)
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Model: GradientBoostingClassifier(max_depth=7, random_state=42)
best_model.fit(X_train, y_train)
y_pred_rand = best_model.predict(X_test)
print(classification_report(y_test, y_pred_rand))
              precision    recall  f1-score   support

        Loss       0.82      0.81      0.82    194960
         Tie       0.12      0.43      0.19      3313
         Win       0.84      0.82      0.83    208935

    accuracy                           0.81    407208
   macro avg       0.60      0.69      0.61    407208
weighted avg       0.83      0.81      0.82    407208
Finding the most impactful features¶
By looking at the most impactful features, we can see (generally) what matters most when it comes to predicting the winner of the game. This information is already part of the scikit-learn model, so a plot is shown below.
# combine features and their importances into a dataframe
important_features = zip(X.columns, best_model.feature_importances_)
imp_feat_df = pd.DataFrame(important_features)
imp_feat_df.rename(columns={0: 'features', 1: 'importance'}, inplace=True)
# sort by importance and reconfigure the index
imp_feat_df.sort_values(by=['importance'], inplace=True, ascending=False)
imp_feat_df.reset_index(inplace=True)
imp_feat_df.drop(['index'], axis=1, inplace=True)
imp_feat_df
|    | features                | importance |
|----|-------------------------|------------|
| 0  | BlunderDifferential     | 0.379436   |
| 1  | MistakeDifferential     | 0.233109   |
| 2  | TimeDifferential        | 0.098432   |
| 3  | TimeControl             | 0.077960   |
| 4  | Moves                   | 0.070639   |
| 5  | InaccuracyDifferential  | 0.041791   |
| 6  | AverageElo              | 0.036350   |
| 7  | EloDifferential         | 0.022973   |
| 8  | MobilityDifferential    | 0.015647   |
| 9  | MaterialDifferential    | 0.006348   |
| 10 | DevelopmentDifferential | 0.004517   |
| 11 | KingSafetyDifferential  | 0.004035   |
| 12 | ECO_C41                 | 0.001192   |
| 13 | ECO_A00                 | 0.001131   |
| 14 | ECO_B10                 | 0.001057   |
| 15 | ECO_B01                 | 0.000963   |
| 16 | ECO_C20                 | 0.000834   |
| 17 | ECO_D00                 | 0.000587   |
| 18 | ECO_B00                 | 0.000571   |
| 19 | ECO_A40                 | 0.000550   |
| 20 | ECO_D02                 | 0.000489   |
| 21 | ECO_C50                 | 0.000477   |
| 22 | ECO_C44                 | 0.000458   |
| 23 | ECO_C00                 | 0.000455   |
# plot using matplotlib
plt.figure(figsize=(10, 6))
plt.barh(imp_feat_df['features'], imp_feat_df['importance'], color='darkblue')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importance for Making Predictions')
plt.show()
Using the plot above, we can see that the blunder differential is the most impactful feature for predicting the winner of a chess game. This is intuitive, as a player who blunders worsens their position considerably, but it is also surprising that it is significantly more impactful than the Elo differential. This is relevant to players because it shows that, regardless of the Elo differential, it is important to play calm, collected chess rather than hunting for a "creative" position, which puts you at a much higher risk of losing. Among the positional features, mobility was the most important coming out of the opening.
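Impurity-based importances can overstate features with many distinct values, so as an optional cross-check one could compute permutation importances on the held-out set; a sketch (this can be slow on the full test set):

from sklearn.inspection import permutation_importance
# shuffle each feature on the test set and measure the resulting accuracy drop
perm = permutation_importance(best_model, X_test, y_test, n_repeats=5, random_state=42)
for name, score in sorted(zip(X.columns, perm.importances_mean), key=lambda t: -t[1])[:5]:
    print(f"{name}: {score:.4f}")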
Benchmark¶
To benchmark the performance of our model further, we will look at the confusion matrix to see precision, accuracy, and recall in a more visual way. Additionally, we will compare against a baseline of random guessing and the heuristic of "the player with the higher ELO will win".
# plot confusion matrix
cm = confusion_matrix(y_test, y_pred_rand, labels=best_model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=best_model.classes_)
disp.plot()
plt.show()
Now, we will compare our model against the baseline performance of guessing and the heuristic of "the player with the higher ELO wins".
tolerance = 10
bench1 = X_test['EloDifferential'].apply(lambda x: "Win" if x > tolerance else ("Loss" if x < -tolerance else "Tie"))
bench2 = np.random.choice(["Loss", "Tie", "Win"], size=len(X_test))
# Refit an untuned gradient boosting model so this benchmark reflects gradient
# boosting without tuning, rather than whichever classifier ran last in the loop above
bench3 = GradientBoostingClassifier(random_state=42).fit(X_train, y_train).predict(X_test)
bench4 = y_pred_rand
ground_truth = y_test
data = {'Greater ELO Wins': bench1, 'Random Guess': bench2, 'predictions (no hyperparameter)': bench3, 'predictions (hyperparameter)': bench4, 'ground_truth': ground_truth}
test_df = pd.DataFrame(data)
accuracy_bench1 = accuracy_score(test_df['ground_truth'], test_df['Greater ELO Wins'])
accuracy_bench2 = accuracy_score(test_df['ground_truth'], test_df['Random Guess'])
accuracy_bench3 = accuracy_score(test_df['ground_truth'], test_df['predictions (no hyperparameter)'])
accuracy_bench4 = accuracy_score(test_df['ground_truth'], test_df['predictions (hyperparameter)'])
print(f"Accuracy of Greater ELO Wins Benchmark: {accuracy_bench1*100:.2f}%")
print(f"Accuracy of Randomly Guessing Winner: {accuracy_bench2*100:.2f}%")
print(f"Accuracy of Gradient Boosting Predictions (w/o hyperparameter): {accuracy_bench3*100:.2f}%")
print(f"Accuracy of Gradient Boosting Predictions (w/ hyperparameter): {accuracy_bench4*100:.2f}%")
Accuracy of Greater ELO Wins Benchmark: 42.68%
Accuracy of Randomly Guessing Winner: 33.41%
Accuracy of Gradient Boosting Predictions (w/o hyperparameter): 75.04%
Accuracy of Gradient Boosting Predictions (w/ hyperparameter): 81.22%
As we can see, our model performs significantly better than both benchmarks, meaning we have found a meaningful model for predicting the winner of a chess game. The hyperparameter tuning process allowed us to find the combination of parameters that maximizes the model's performance. Benchmark 1 follows the heuristic that the player with the higher ELO will win (as they are the stronger player); our model surpasses it, indicating that it has learned valuable information from the features and makes more accurate predictions. Benchmark 2, the random guess classifier, randomly assigns class labels and represents the performance achievable by chance alone; our significantly higher accuracy demonstrates that the model has captured meaningful patterns and relationships in the data. The gradient boosting classifier has proven effective at modeling the complexities of chess game outcomes, and the use of cross-validation during hyperparameter tuning ensures that its performance is assessed on multiple splits of the data, providing a more robust estimate of its generalization ability.
7. Insights, Future Work, and Considerations¶
Our current model could be improved with the analysis of additional positional characteristics and better formulas for evaluating them. For example, with the mobility feature, instead of only counting the number of legal moves, we could count the number of safe moves that don't hang the piece, a metric known as Safe Mobility. Unfortunately, calculating more positional characteristics with better formulas is incredibly time-consuming and resource-intensive, so for this project we decided to focus on the basic formulas and features.
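As a minimal sketch of the idea, here is what Safe Mobility could look like for knights alone, reusing the 8x8 string-board representation from the cleaning section; pawnAttackMap and safeKnightMobility are hypothetical helpers, and only enemy pawn attacks are considered:

def pawnAttackMap(gameArr, color):
    # squares attacked by pawns of `color`; white pawns move up the board (increasing row)
    attacked = set()
    step = 1 if color == "W" else -1
    for i in range(8):
        for j in range(8):
            if gameArr[i][j] == color + "P":
                for dj in (-1, 1):
                    r, c = i + step, j + dj
                    if 0 <= r <= 7 and 0 <= c <= 7:
                        attacked.add((r, c))
    return attacked

def safeKnightMobility(gameArr, row, col, color):
    # knight destinations that are empty AND not covered by an enemy pawn
    enemyPawnSquares = pawnAttackMap(gameArr, "B" if color == "W" else "W")
    offsets = [(-1, -2), (-1, 2), (1, -2), (1, 2), (2, 1), (2, -1), (-2, 1), (-2, -1)]
    safe = 0
    for r, c in offsets:
        i, j = row + r, col + c
        if 0 <= i <= 7 and 0 <= j <= 7 and gameArr[i][j] == "" and (i, j) not in enemyPawnSquares:
            safe += 1
    return safe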
Our accuracy of roughly 80% is a testament to how difficult it is to predict the winner of a chess game. Chess is a complex game with many qualitative and quantitative characteristics to consider. From the results, it is evident that reducing unforced errors and watching the clock are crucial. We can also determine that mobility is the most important positional feature out of the opening, an interesting insight that highlights the importance of early and precise piece development in the beginning stages of the game. Players should take away from this study the importance of error reduction, time management, and early piece activity.
Although this project largely affirms popular beliefs in the community, it points toward future chess-related data science projects. For example, an Elo predictor could be developed with many of the same features used here.
8. References and Additional Resources¶
Lichess Database: https://database.lichess.org
Information about how to evaluate positional characteristics:
King Safety: https://www.chessprogramming.org/King_Safety
Development: https://www.chessprogramming.org/Development
Center Control: https://www.chessprogramming.org/Center_Control
Material Balance: https://www.chessprogramming.org/Material
Number of moves for the opening: https://www.chessable.com/blog/opening-guide/