Chess is one of the most popular games on the planet. It's history goes back hundreds of years. Unlike the online era, hundreds of years ago people did not have chess engines and computer analysis to figure out what the best moves are. We have an incredible advantage over our historical counterparts: with so many players in the online era, more data is available than ever. I'll now briefly demonstrate how a chess player might use the pandas package to make decisions to improve their personal chess play.
A .pgn (portable game notation) is a common file type used for chess databases. Below is a sample of a .pgn grabbed from the open source lichess.org. I've gone ahead an opened it to see what kind of data I'll be dealing with.
open('lichess_db_standard_rated_2013-01.pgn').read()[:300]
'[Event "Rated Classical game"]\n[Site "https://lichess.org/j1dkb5dw"]\n[White "BFG9k"]\n[Black "mamalak"]\n[Result "1-0"]\n[UTCDate "2012.12.31"]\n[UTCTime "23:01:03"]\n[WhiteElo "1639"]\n[BlackElo "1403"]\n[WhiteRatingDiff "+5"]\n[BlackRatingDiff "-8"]\n[ECO "C00"]\n[Opening "French Defense: Normal Variation"]'
I hacked together a function that parses it into lists, puts it into a DataFrame, and gives it some formatting. A small selection of games are missing rating data for one or both players (<1%) so they are omitted.
import pandas as pd
class Chess:
def __init__(self, path):
self.path = path
def pgn_to_df(self):
f = open(self.path).read()
formatted_string = f.replace('[Event', '***[Event').replace('\n\n', '\n').split('***')
blank_values_stripped = [x for x in formatted_string if x != '']
sub_lists = [x.split('\n') for x in blank_values_stripped]
pre_dict_format = [[x.replace('"', ':', 1).replace('[', '').replace(']', '').replace('"', '').replace(' ', '')
for x in row]for row in sub_lists]
[x.pop() for x in pre_dict_format]
dicted_list = [[element.split(':')[1] if len(element.split(':')) == 2 else element for element in row]
for row in pre_dict_format]
df = pd.DataFrame.from_records([x for x in dicted_list if len(x) == 16])
df.rename(columns={0: 'event', 1: 'site', 2: 'white', 3: 'black', 4: 'result'}, inplace = True)
df.rename(columns={5: 'utcdate', 6: 'utctime', 7: 'white_elo', 8: 'black_elo'}, inplace = True)
df.rename(columns={9: 'white_rating_diff', 10: 'black_rating_diff', 11: 'eco', 12: 'opening',
13: 'time_control', 14: 'termination', 15: 'game_notation'}, inplace = True)
df.site = df.site.apply(lambda x: x.replace('Site:',''))
df.utctime = df.utctime.apply(lambda x: x.replace('UTCTime:',''))
df.opening = df.opening.apply(lambda x: x.replace('Opening:',''))
df.white_elo = df.white_elo.astype('int64')
df.black_elo = df.black_elo.astype('int64')
return df
With our .pgn parsed, we can import our file as a DataFrame. The important columns I will be examining are white and black ELO (a rating system used in chess to measure player strength, with 1200 being an average player, 2000 being an expert, and 2500 being a grandmaster), the result, and the game's move order.
c = Chess('lichess_db_standard_rated_2013-01.pgn')
my_chess_table = c.pgn_to_df()
my_chess_table.head()
event | site | white | black | result | utcdate | utctime | white_elo | black_elo | white_rating_diff | black_rating_diff | eco | opening | time_control | termination | game_notation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RatedClassicalgame | https://lichess.org/j1dkb5dw | BFG9k | mamalak | 1-0 | 2012.12.31 | 23:01:03 | 1639 | 1403 | +5 | -8 | C00 | FrenchDefense:NormalVariation | 600+8 | Normal | 1.e4e62.d4b63.a3Bb74.Nc3Nh65.Bxh6gxh66.Be2Qg57... |
1 | RatedClassicalgame | https://lichess.org/a9tcp02g | Desmond_Wilson | savinka59 | 1-0 | 2012.12.31 | 23:04:12 | 1654 | 1919 | +19 | -22 | D04 | Queen'sPawnGame:ColleSystem,Anti-Colle | 480+2 | Normal | 1.d4d52.Nf3Nf63.e3Bf54.Nh4Bg65.Nxg6hxg66.Nd2e6... |
2 | RatedClassicalgame | https://lichess.org/szom2tog | Kozakmamay007 | VanillaShamanilla | 1-0 | 2012.12.31 | 23:03:15 | 1643 | 1747 | +13 | -94 | C50 | FourKnightsGame:ItalianVariation | 420+17 | Normal | 1.e4e52.Nf3Nc63.Bc4Nf64.Nc3Bc55.a3Bxf2+6.Kxf2N... |
3 | RatedBulletgame | https://lichess.org/rklpc7mk | Naitero_Nagasaki | 800 | 0-1 | 2012.12.31 | 23:04:57 | 1824 | 1973 | -6 | +8 | B12 | Caro-KannDefense:GoldmanVariation | 60+1 | Normal | 1.e4c62.Nc3d53.Qf3dxe44.Nxe4Nd75.Bc4Ngf66.Nxf6... |
4 | RatedBulletgame | https://lichess.org/1xb3os63 | nichiren1967 | Naitero_Nagasaki | 0-1 | 2012.12.31 | 23:02:37 | 1765 | 1815 | -9 | +9 | C00 | FrenchDefense:LaBourdonnaisVariation | 60+1 | Normal | 1.e4e62.f4d53.e5c54.Nf3Qb65.c3Nc66.d3Bd77.Be2N... |
This is an online chess website, which means anyone can play. But if I'm making decisions about what moves to play, I dont want the data from weak players influencing the results. Let's take a look at only the games where both players are experts (over 2000 rating).
experts = my_chess_table.loc[(my_chess_table.white_elo > 2000) & (my_chess_table.black_elo > 2000)]
experts.head()
event | site | white | black | result | utcdate | utctime | white_elo | black_elo | white_rating_diff | black_rating_diff | eco | opening | time_control | termination | game_notation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
823 | RatedBulletgame | https://lichess.org/muyx35xi | Yeshi | xiaoqiao | 0-1 | 2013.01.01 | 06:56:45 | 2004 | 2040 | -29 | +12 | A01 | Nimzo-LarsenAttack:ModernVariation#3 | 120+0 | Timeforfeit | 1.b3e52.Bb2Nc63.Nf3d64.e3f55.d4e46.Nfd2Nf67.c4... |
1155 | RatedBlitzgame | https://lichess.org/hmkle9lz | dek | StefanK | 0-1 | 2013.01.01 | 10:57:13 | 2087 | 2042 | -14 | +13 | B12 | Caro-KannDefense:AdvanceVariation,Botvinnik-Ca... | 180+0 | Timeforfeit | 1.e4c62.d4d53.e5c54.c3Nc65.f4g66.Nf3Bg77.Bd3h5... |
1186 | RatedBlitzgame | https://lichess.org/607nglwh | Antarut | dek | 0-1 | 2013.01.01 | 11:03:42 | 2012 | 2073 | -18 | +10 | C46 | ThreeKnightsOpening | 180+0 | Normal | 1.e4e52.Nc3Nc63.Nf3Bc54.Nxe5Bxf2+5.Kxf2Nxe56.d... |
1691 | RatedBulletgame | https://lichess.org/i3xuyu7q | Panevis | StefanK | 0-1 | 2013.01.01 | 14:53:32 | 2062 | 2145 | -10 | +11 | C00 | FrenchDefense:NormalVariation | 60+0 | Normal | 1.e4e62.d4b63.e5Bb74.Nf3c55.c3cxd46.cxd4Nc67.B... |
2003 | RatedBlitzgame | https://lichess.org/kahmgob6 | ostrak11 | celeron17 | 0-1 | 2013.01.01 | 16:22:12 | 2070 | 2240 | -7 | +21 | D00 | Queen'sPawnGame#2 | 240+0 | Normal | 1.d4d52.e3e63.c4f54.Nc3Nf65.Nf3c66.cxd5exd57.B... |
Every beginning chess player at one point probably asks themselves the question: "What is the best first move?." There are many approaches, but for a beginner it's often best to focus on learning one opening and then studying the variations that arise. We can look at the first move expert players tend to use.
experts.game_notation.head()
823 1.b3e52.Bb2Nc63.Nf3d64.e3f55.d4e46.Nfd2Nf67.c4... 1155 1.e4c62.d4d53.e5c54.c3Nc65.f4g66.Nf3Bg77.Bd3h5... 1186 1.e4e52.Nc3Nc63.Nf3Bc54.Nxe5Bxf2+5.Kxf2Nxe56.d... 1691 1.e4e62.d4b63.e5Bb74.Nf3c55.c3cxd46.cxd4Nc67.B... 2003 1.d4d52.e3e63.c4f54.Nc3Nf65.Nf3c66.cxd5exd57.B... Name: game_notation, dtype: object
I'll slice this to only look at move 1. and put it in its own column.
experts['first_move'] = experts.game_notation.apply(lambda x: x[0:4])
experts.first_move
/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy """Entry point for launching an IPython kernel.
823 1.b3 1155 1.e4 1186 1.e4 1691 1.e4 2003 1.d4 ... 118286 1.e4 120499 1.e4 120720 1.e4 120794 1.e4 120933 1.e4 Name: first_move, Length: 609, dtype: object
experts.first_move.unique()
array(['1.b3', '1.e4', '1.d4', '1.Nf', '1.c4', '1.f4', '1.Nc', '1.d3', '1.e3', '1.g3'], dtype=object)
On turn 1 there are 20 legal moves. Already in this dataset we have narrowed down the moves that experts play to 10.
experts.first_move.value_counts()
1.e4 295 1.d4 190 1.c4 49 1.Nf 37 1.Nc 21 1.f4 8 1.b3 5 1.e3 2 1.d3 1 1.g3 1 Name: first_move, dtype: int64
So we see that half of those 10 moves that our sample of experts play are played extremely rarely. The most common moves on turn 1 for white are 1. e4 and 1. d4 by a large margin, making over half of the games in this sample. If you were using usage statistics to validate your gameplay decisions in chess, you would come to the conclusion that 1. e4 and 1. d4 are the best moves to make on turn 1. Let's check the win rate of those games where 1. e4 were played
experts.loc[experts.first_move == '1.e4'].result.value_counts()
0-1 140 1-0 130 1/2-1/2 25 Name: result, dtype: int64
In chess there are 3 potential outcomes to a game. 1-0 indicates a win for white, 0-1 a win for black, and 1/2 -1/2 is a draw. In our sample of expert games, 1. e4 has 130 wins, 140 losses, and 25 draws out of 295 games. So for white, 1. e4 has a:
print(str(130 / 295) + ' chance to win')
print(str(140 / 295) + ' chance to lose')
print(str(25 / 295) + ' chance to draw')
0.4406779661016949 chance to win 0.4745762711864407 chance to lose 0.0847457627118644 chance to draw
Repeating this for the second most common move, 1. d4
experts.loc[experts.first_move == '1.d4'].result.value_counts()
1-0 103 0-1 78 1/2-1/2 9 Name: result, dtype: int64
print(str(103 / 190) + ' chance to win')
print(str(78 / 190) + ' chance to lose')
print(str(9 / 190) + ' chance to draw')
0.5421052631578948 chance to win 0.4105263157894737 chance to lose 0.04736842105263158 chance to draw
So it appears even though 1. d4 is a less common move than 1. e4, in this set it has a much better winrate, and a lower chance to draw as well. Data like this might influence some players to choose 1. d4 as their starting move. You would want to compare this across multiple datasets. Perhaps January, 2013 was just a bad month for 1. e4 players. I also notice that in this set some players are represented more than others, with some players appearing as over 10% of the number of expert games. For example this data could be heavily skewed if the user 'Panevis' had a particular bias in his performance. I did choose this dataset because it was the smallest one. While Jan 2013 has just around 120,000 games, some of the current ones have upwards of 75 million games, with the files being around 17 gigs. You are likely to get a better statistical summary from that dataset, where there are thousands more expert games being played.
You could also select the games where your own username shows up, and collect statistics about how you perform against different openings, and use that to bolster your preparation.
experts.loc[(experts.white == 'Panevis') | (experts.black == 'Panevis')]
event | site | white | black | result | utcdate | utctime | white_elo | black_elo | white_rating_diff | black_rating_diff | eco | opening | time_control | termination | game_notation | first_move | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1691 | RatedBulletgame | https://lichess.org/i3xuyu7q | Panevis | StefanK | 0-1 | 2013.01.01 | 14:53:32 | 2062 | 2145 | -10 | +11 | C00 | FrenchDefense:NormalVariation | 60+0 | Normal | 1.e4e62.d4b63.e5Bb74.Nf3c55.c3cxd46.cxd4Nc67.B... | 1.e4 |
4353 | RatedBulletgame | https://lichess.org/0gthevaz | jhipolito | Panevis | 1-0 | 2013.01.02 | 09:53:45 | 2020 | 2043 | +12 | -14 | B06 | ModernDefense:TwoKnightsVariation,SuttlesVaria... | 60+0 | Normal | 1.e4d62.Nf3c63.Nc3g64.d4Bg75.Be3Nf66.Qd2Ng47.B... | 1.e4 |
4359 | RatedBulletgame | https://lichess.org/nmebt3tl | Panevis | jhipolito | 1-0 | 2013.01.02 | 09:56:00 | 2029 | 2032 | +13 | -11 | B00 | OwenDefense | 60+0 | Normal | 1.e4b62.d4Bb73.Nc3e64.Bd3Bb45.Ne2Nf66.O-Od57.e... | 1.e4 |
4362 | RatedBulletgame | https://lichess.org/xz84vyol | jhipolito | Panevis | 0-1 | 2013.01.02 | 09:59:09 | 2021 | 2042 | -11 | +12 | A11 | EnglishCaro-KannDefensiveSystem | 60+0 | Normal | 1.c4c62.Nc3d63.d4g64.e4Nf65.Be3Bg76.f3Nbd77.Qd... | 1.c4 |
4409 | RatedBulletgame | https://lichess.org/ztleoyko | jhipolito | Panevis | 0-1 | 2013.01.02 | 10:16:01 | 2039 | 2058 | -11 | +12 | B06 | ModernDefense:TwoKnightsVariation,SuttlesVaria... | 60+0 | Normal | 1.e4d62.Nf3c63.Nc3g64.d4Bg75.Be3Nf66.Qd2Ng47.B... | 1.e4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
115169 | RatedBulletgame | https://lichess.org/gwo5b53g | Panevis | RaptoR-Bee | 0-1 | 2013.01.30 | 15:44:19 | 2041 | 2188 | -8 | +9 | C00 | FrenchDefense:KnightVariation | 0+1 | Timeforfeit | 1.e4e62.Nf3d53.exd5exd54.Be2Nf65.O-OBe76.h3O-O... | 1.e4 |
115494 | RatedBulletgame | https://lichess.org/h2uetwa4 | Panevis | RaptoR-Bee | 1-0 | 2013.01.30 | 17:11:27 | 2034 | 2197 | +20 | -20 | C00 | FrenchDefense:KnightVariation | 0+1 | Normal | 1.e4e62.Nf3d53.exd5exd54.d4Nf65.h3Be76.Bd3O-O7... | 1.e4 |
118286 | RatedBulletgame | https://lichess.org/py6a0635 | Panevis | AlexPers | 0-1 | 2013.01.31 | 11:40:18 | 2084 | 2056 | -14 | +13 | B00 | OwenDefense | 0+1 | Timeforfeit | 1.e4b62.Nf3Bb73.Nc3e64.d4Bb45.Bd3Ne76.O-Oh67.Q... | 1.e4 |
120720 | RatedBulletgame | https://lichess.org/g6wi1ebh | Panevis | celeron17 | 0-1 | 2013.01.31 | 21:16:32 | 2084 | 2135 | -12 | +12 | B00 | OwenDefense | 0+1 | Timeforfeit | 1.e4b62.Nf3Bb73.Nc3e64.d4Nf65.Bd3Bb46.Qe2d57.e... | 1.e4 |
120794 | RatedBulletgame | https://lichess.org/cyo8t8mt | Panevis | Romoda | 0-1 | 2013.01.31 | 21:38:06 | 2083 | 2112 | -13 | +12 | B00 | OwenDefense | 0+1 | Timeforfeit | 1.e4b62.Nf3Bb73.Nc3e64.d4Nf65.Bd3Bb46.Qe2c57.O... | 1.e4 |
143 rows × 17 columns