How to use pandas to level up your chess game

Chess is one of the most popular games on the planet, and its history goes back hundreds of years. Players in earlier centuries did not have chess engines or computer analysis to work out the best moves. We have an incredible advantage over our historical counterparts: with so many players in the online era, more data is available than ever. I'll briefly demonstrate how a chess player might use the pandas package to make decisions that improve their own play.

A .pgn (Portable Game Notation) file is a common file type used for chess databases. Below is a sample of a .pgn grabbed from the open-source lichess.org. I've gone ahead and opened it to see what kind of data I'll be dealing with.

I hacked together a function that parses the file into lists, puts them into a DataFrame, and applies some formatting. A small share of games (<1%) is missing rating data for one or both players, so those games are omitted.
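The original function isn't shown here, so the following is only a minimal sketch of what such a parser might look like. The two sample games are made up, the `Moves` column name is my own choice, and the sketch assumes every game starts with an `[Event ...]` tag and that a missing rating is written as `"?"`.

```python
import re
import pandas as pd

# Two made-up games in lichess-style PGN; the second has a missing rating ("?").
SAMPLE_PGN = '''[Event "Rated Blitz game"]
[White "player_a"]
[Black "player_b"]
[Result "1-0"]
[WhiteElo "2105"]
[BlackElo "2050"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0

[Event "Rated Blitz game"]
[White "player_c"]
[Black "player_d"]
[Result "0-1"]
[WhiteElo "?"]
[BlackElo "1480"]

1. d4 d5 2. c4 e6 0-1
'''

# Matches a PGN tag pair such as [WhiteElo "2105"].
TAG_RE = re.compile(r'^\[(\w+) "(.*)"\]$')


def parse_pgn(text):
    """Parse PGN text into a DataFrame with one row per game."""
    games, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            # Assumption: an Event tag marks the start of a new game.
            if match.group(1) == "Event" and current:
                games.append(current)
                current = {}
            current[match.group(1)] = match.group(2)
        else:
            # Movetext may span several lines; join them with spaces.
            current["Moves"] = (current.get("Moves", "") + " " + line).strip()
    if current:
        games.append(current)

    df = pd.DataFrame(games)
    # Ratings arrive as strings; "?" becomes NaN and those rows are dropped.
    for col in ("WhiteElo", "BlackElo"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.dropna(subset=["WhiteElo", "BlackElo"]).reset_index(drop=True)
    return df.astype({"WhiteElo": int, "BlackElo": int})


games_df = parse_pgn(SAMPLE_PGN)
print(games_df[["White", "Black", "WhiteElo", "BlackElo", "Result", "Moves"]])
```

For a full monthly file, a dedicated PGN library would likely be more robust than a hand-rolled regex, but this is enough to get headers and move text into columns.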

With our .pgn parsed, we can import our file as a DataFrame. The important columns I will be examining are the white and black Elo ratings (Elo is a rating system used in chess to measure player strength, with 1200 being an average player, 2000 being an expert, and 2500 being a grandmaster), the result, and the game's move order.

This is an online chess website, which means anyone can play. But if I'm making decisions about what moves to play, I don't want the data from weak players influencing the results. Let's take a look at only the games where both players are experts (over 2000 rating).
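As a rough sketch, the expert filter is a boolean mask over both rating columns. The small DataFrame below is made-up stand-in data, and the column names `WhiteElo` and `BlackElo` are assumed to follow the PGN tag names.

```python
import pandas as pd

# Made-up stand-in for the parsed games.
games_df = pd.DataFrame({
    "WhiteElo": [2105, 1500, 2300, 2000],
    "BlackElo": [2050, 2200, 1990, 2400],
    "Result": ["1-0", "0-1", "1/2-1/2", "1-0"],
})

# Keep a game only if BOTH players are rated above 2000.
both_expert = (games_df["WhiteElo"] > 2000) & (games_df["BlackElo"] > 2000)
experts = games_df[both_expert].reset_index(drop=True)
print(experts)
```

Note the parentheses around each comparison: `&` binds more tightly than `>` in Python, so leaving them out raises an error or gives the wrong mask.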

Every beginning chess player probably asks themselves at some point: "What is the best first move?" There are many approaches, but for a beginner it's often best to focus on learning one opening and then studying the variations that arise. We can look at the first move expert players tend to use.

I'll slice the move list to keep only White's move 1 and put it in its own column.
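One way this could be done, assuming the move text is stored as a single string per game in a column I'm calling `Moves`, is a regular expression that captures the token right after `1.`. The games and the `FirstMove` column name are illustrative.

```python
import pandas as pd

# Made-up move strings in the usual "1. e4 e5 2. ..." layout.
experts = pd.DataFrame({
    "Moves": [
        "1. e4 e5 2. Nf3 Nc6 1-0",
        "1. d4 d5 2. c4 e6 0-1",
        "1. e4 c5 2. Nf3 d6 1/2-1/2",
        "1. Nf3 d5 2. g3 c5 1-0",
    ]
})

# Capture White's first move: the first non-space token after "1.".
experts["FirstMove"] = experts["Moves"].str.extract(r"^1\.\s*(\S+)", expand=False)
print(experts["FirstMove"].tolist())
```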

On turn 1 there are 20 legal moves. Already in this dataset we have narrowed down the moves that experts play to 10.
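Counting distinct first moves and their share of games is a short job for `nunique` and `value_counts`. The data below is made up and much smaller than the real sample, and `FirstMove` is an assumed column name.

```python
import pandas as pd

# Made-up first moves for eight expert games.
experts = pd.DataFrame({
    "FirstMove": ["e4", "e4", "d4", "e4", "Nf3", "d4", "c4", "e4"]
})

# How many different first moves appear at all?
n_distinct = experts["FirstMove"].nunique()

# Share of games per first move, most common first.
share = experts["FirstMove"].value_counts(normalize=True)

print(n_distinct)
print(share)
```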

So we see that half of the 10 moves that our sample of experts play are played extremely rarely. The most common moves on turn 1 for white are 1. e4 and 1. d4 by a large margin, together making up over half of the games in this sample. If you were using usage statistics to validate your gameplay decisions in chess, you would come to the conclusion that 1. e4 and 1. d4 are the best moves to make on turn 1. Let's check the win rate of the games where 1. e4 was played.
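A sketch of that check: filter to the games that opened with 1. e4, then tally the `Result` column. The rows are made up for illustration and `FirstMove` is an assumed column name.

```python
import pandas as pd

# Made-up expert games with the first move already extracted.
experts = pd.DataFrame({
    "FirstMove": ["e4", "e4", "d4", "e4", "e4", "d4"],
    "Result": ["1-0", "0-1", "1-0", "1/2-1/2", "0-1", "0-1"],
})

# Keep only the games that opened with 1. e4.
e4_games = experts[experts["FirstMove"] == "e4"]

# Tally how those games ended.
e4_results = e4_games["Result"].value_counts()
print(e4_results)
```

Swapping `"e4"` for any other move gives the same breakdown for that move.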

In chess there are 3 potential outcomes to a game: 1-0 indicates a win for white, 0-1 a win for black, and 1/2-1/2 a draw. In our sample of expert games, 1. e4 has 130 wins, 140 losses, and 25 draws out of 295 games. So for white, 1. e4 has a win rate of about 44.1%, a loss rate of about 47.5%, and a draw rate of about 8.5%.
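The rates follow directly from the counts quoted above. As a quick check of the arithmetic (the variable names are mine):

```python
# Counts for 1. e4 in the expert sample, from White's point of view.
wins, losses, draws = 130, 140, 25
total = wins + losses + draws

# Convert each count to a percentage of all 1. e4 games.
counts = {"win": wins, "loss": losses, "draw": draws}
rates = {name: round(100 * count / total, 1) for name, count in counts.items()}

print(total)  # 295
print(rates)  # {'win': 44.1, 'loss': 47.5, 'draw': 8.5}
```

The three figures add up to 100.1 rather than 100 only because each one is rounded separately.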

We can repeat this for the second most common move, 1. d4.

So it appears that even though 1. d4 is a less common move than 1. e4, in this set it has a much better win rate, and a lower chance of a draw as well. Data like this might influence some players to choose 1. d4 as their starting move. You would want to compare this across multiple datasets, though. Perhaps January 2013 was just a bad month for 1. e4 players. I also notice that some players are represented more than others in this set, with some appearing in over 10% of the expert games. For example, the results could be heavily skewed if a heavily represented user such as 'Panevis' had unusually strong or weak results.

I chose this dataset because it was the smallest one. While January 2013 has only around 120,000 games, some of the current ones have upwards of 75 million games, with files of around 17 GB. You would likely get a more reliable statistical summary from a dataset like that, which contains thousands more expert games.

You could also select the games where your own username shows up, collect statistics about how you perform against different openings, and use them to bolster your preparation.
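A minimal sketch of that idea is below. The username `me_example`, the opponents, and the results are all made up, and `FirstMove` is an assumed column holding White's first move. It looks only at games played as Black, so the table reads as "how I score against each first move".

```python
import pandas as pd

# Made-up games; "me_example" is a hypothetical username.
games_df = pd.DataFrame({
    "White": ["me_example", "rival_1", "rival_2", "rival_3", "rival_1"],
    "Black": ["rival_1", "me_example", "me_example", "rival_2", "me_example"],
    "Result": ["1-0", "1-0", "0-1", "1/2-1/2", "1/2-1/2"],
    "FirstMove": ["e4", "e4", "d4", "c4", "e4"],
})
username = "me_example"

# Keep games where the user played either colour.
is_mine = (games_df["White"] == username) | (games_df["Black"] == username)
mine = games_df[is_mine].copy()


def outcome_for(row):
    """Translate the PGN result into win/loss/draw from the user's side."""
    if row["Result"] == "1/2-1/2":
        return "draw"
    if row["Result"] == "1-0":
        return "win" if row["White"] == username else "loss"
    if row["Result"] == "0-1":
        return "win" if row["Black"] == username else "loss"
    return "other"  # e.g. unfinished games


mine["Outcome"] = mine.apply(outcome_for, axis=1)

# As Black, tabulate outcomes against each of White's first moves.
as_black = mine[mine["Black"] == username]
summary = pd.crosstab(as_black["FirstMove"], as_black["Outcome"])
print(summary)
```

With only a handful of games per opening the table will be noisy, so it is probably best read as a pointer to what to study rather than as a verdict.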