Global AI and Data Science

Analysing Lichess Dataset

By Moloy De posted Thu August 27, 2020 07:58 PM

Data and Tool:
The Lichess Chess Dataset is available as a CSV file on Kaggle. It is a single 7.32 MB file with 20,058 rows (16,155 rated games and 3,903 unrated games) and 16 columns. The columns are as follows:

Id: Game identifier. It looks like a primary key but contains duplicates, and I don’t know the reason for them.
Rated: A logical variable with values TRUE or FALSE.
Created At: A Timestamp I couldn’t decode.
Last Move At: Another Timestamp I couldn’t decode.
Turns: Number of moves.
Victory Status: It has four values – Draw, Mate, Out Of Time, Resigned.
Winner: It has three values – White, Black, Draw.
Increment Code: Time Controls with more than 8,000 categories.
White Id: Player ID.
White Rating: White’s Rating.
Black Id: Player ID.
Black Rating: Black’s Rating.
Moves: The entire PGN.
Opening eco: Opening Codes with more than 15,000 values.
Opening Name: Name of the Opening.
Opening Ply: Number of moves (plies, i.e. half-moves) in the opening phase of the game.
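
The two timestamp columns may be easier to read once converted to calendar dates. The Python sketch below assumes they hold Unix epoch time in milliseconds; the post does not confirm this, so treat it as a guess to check against the data. The function name `decode_epoch_ms` and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def decode_epoch_ms(value):
    """Convert a timestamp to a UTC datetime.

    Assumption (not confirmed by the post): the value is the number of
    milliseconds since 1970-01-01 00:00:00 UTC. It may arrive as a string
    in scientific notation, so it is parsed with float() first.
    """
    seconds = float(value) / 1000.0
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical sample value for illustration, not taken from the post.
sample = "1.50421E+12"
print(decode_epoch_ms(sample).isoformat())
```

If the CSV stores these values in scientific notation with only a few significant digits, the decoded times are rounded accordingly, so differences between Created At and Last Move At may not be reliable.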

I used R in an Anaconda Jupyter Notebook for the analysis below.

Popular Opening:
The Van't Kruijs Opening, which starts with 1.e3, is considered an irregular opening and is rarely played in professional chess. However, it turns out to be the most popular opening (1.83%) in the available Lichess dataset. The ranking of the top five openings changes little between rated games and games in general.
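
A popularity ranking like this can be obtained by counting opening names and converting the counts to percentages. The following is a minimal Python sketch, not the R code used for the post; the function name `top_openings` and the toy list are made up for illustration.

```python
from collections import Counter


def top_openings(opening_names, k=5):
    """Return the k most frequent openings as (name, count, percent) tuples."""
    counts = Counter(opening_names)
    total = sum(counts.values())
    return [
        (name, n, round(100.0 * n / total, 2))
        for name, n in counts.most_common(k)
    ]


# Made-up toy data; in practice this would be the Opening Name column.
toy_names = (
    ["Van't Kruijs Opening"] * 3
    + ["Sicilian Defense"] * 2
    + ["French Defense"]
)

for name, n, pct in top_openings(toy_names, k=2):
    print(f"{name}: {n} games ({pct}%)")
```

Running the same function once on all games and once on the rated subset would show whether the ranking differs between the two.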

Game Length vs Opening Game Length:
Quick summaries and histograms were produced for the two numeric attributes, Game Length (Turns) and Opening Game Length (Opening Ply).

There are around 380 (2.13%) outliers in the dataset, that is, games longer than 142 moves, which is the Tukey upper bound.
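
The Tukey upper bound is conventionally Q3 + 1.5 × IQR, where IQR = Q3 − Q1. The Python sketch below shows the calculation on made-up numbers. The quartile method (`"inclusive"`) is my assumption; other quartile definitions can shift the bound slightly, so this need not reproduce the value 142 exactly.

```python
import statistics


def tukey_upper_bound(values, k=1.5):
    """Return Q3 + k * IQR, the upper Tukey fence for the given values."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Made-up game lengths for illustration only.
toy_turns = [10, 20, 30, 40, 50, 60, 70, 80, 300]

upper = tukey_upper_bound(toy_turns)
outliers = [t for t in toy_turns if t > upper]
share = 100.0 * len(outliers) / len(toy_turns)

print(f"upper bound = {upper}, outliers = {outliers}, share = {share:.2f}%")
```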

The correlation between the above two attributes is expected to be positive. Although the observed correlation (0.0512) is close to zero, a quick significance test confirms that it is significantly different from zero.
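
To see why such a small correlation can still be significant, the usual test statistic for a Pearson correlation is t = r·√(n − 2) / √(1 − r²). The Python sketch below plugs in r = 0.0512 and assumes n = 20,058 (all rows); the post does not say which rows were used, so n is an assumption. The p-value uses a normal approximation to the t distribution, which is reasonable only because n is large.

```python
import math


def correlation_t_test(r, n):
    """Two-sided test of H0: rho = 0 for a Pearson correlation r from n pairs.

    Returns (t statistic, approximate p-value). The p-value uses the
    standard normal in place of Student's t, which is only appropriate
    for large n.
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    p = math.erfc(abs(t) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|t|))
    return t, p


# r is from the post; n = 20058 is an assumption (all rows of the dataset).
t_stat, p_value = correlation_t_test(0.0512, 20058)
print(f"t = {t_stat:.2f}, approximate p-value = {p_value:.2e}")
```

With roughly twenty thousand observations, even a weak correlation produces a large t statistic, so statistical significance here says little about practical strength.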

Analysing the Results – White Win, Black Win, Draw:
A scatter plot of the results matches our intuition. However, I doubt whether it is possible to separate the two clusters any further.

Further, a decision tree was fitted to predict the winning colour, but it reached only 62.21% accuracy.
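
One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most frequent outcome. The Python sketch below shows that comparison on made-up labels; it is not the decision tree from the post, and the label values and function names are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


# Made-up labels for illustration only.
toy_actual = ["white", "white", "black", "white", "black", "draw", "white", "black"]
toy_predicted = ["white", "black", "black", "white", "white", "white", "white", "black"]

print(f"model accuracy    = {accuracy(toy_actual, toy_predicted):.3f}")
print(f"majority baseline = {majority_baseline(toy_actual):.3f}")
```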

“e4 e5 Nf3” is the most played sequence of first three moves, with 4,234 (26%) instances. White’s winning chance is highest (55.01%) with the opening “e4 d5 exd5”, and Black’s winning chance is highest (55.67%) with the opening “e4 c5 Bc4”. The draw percentage reaches its maximum (6.45%) with the opening “d4 Nf6 c4”.
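
Figures like these can be derived by grouping games on the first three entries of the Moves column and tallying results per group. The Python sketch below assumes the moves are separated by spaces and that the winner is one of white, black or draw; both are assumptions about the file layout, and the toy games and function names are made up.

```python
from collections import defaultdict


def opening_prefix_stats(games, plies=3):
    """Tally games and results per opening prefix (first `plies` moves).

    `games` is an iterable of (moves, winner) pairs. Assumptions: moves are
    space-separated, and winner is "white", "black" or "draw" in any case.
    """
    stats = defaultdict(lambda: {"games": 0, "white": 0, "black": 0, "draw": 0})
    for moves, winner in games:
        tokens = moves.split()
        if len(tokens) < plies:
            continue  # game too short to have a full prefix
        prefix = " ".join(tokens[:plies])
        stats[prefix]["games"] += 1
        stats[prefix][winner.lower()] += 1
    return dict(stats)


def outcome_share(stats, prefix, outcome):
    """Percentage of games with the given prefix that ended in `outcome`."""
    entry = stats[prefix]
    return 100.0 * entry[outcome] / entry["games"]


# Made-up games for illustration only.
toy_games = [
    ("e4 e5 Nf3 Nc6 Bb5", "White"),
    ("e4 e5 Nf3 Nf6", "Black"),
    ("e4 d5 exd5 Qxd5", "White"),
    ("d4 Nf6 c4 e6", "Draw"),
    ("e4 e5", "White"),
]

stats = opening_prefix_stats(toy_games)
most_played = max(stats, key=lambda p: stats[p]["games"])
print(most_played, stats[most_played])
print(outcome_share(stats, "e4 e5 Nf3", "white"))
```

On real data it would be sensible to require a minimum number of games per prefix before comparing percentages, since a rare prefix can show an extreme win rate by chance.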

Question I:
Can we assume normality for the two variables, Length of the Game and Length of the Opening Game?

Question II:
What is the benchmark for prediction accuracy in industry?

#GlobalAIandDataScience
#GlobalDataScience

Permalink

https://community.ibm.com/community/user/blogs/moloy-de1/2020/08/27/points-to-ponder
