In the previous post we saw how to scrape raw data from a content-rich webpage. In this post, we will explore how to process that raw data and use machine learning tools to predict the playing role of a cricket player based solely on his career statistics.

Here are the tools that we will use for this exercise. For interactive data analysis and number crunching:

  1. Jupyter
  2. Pandas
  3. Numpy

For visualizing data:

  1. Seaborn
  2. matplotlib

For running Machine Learning models:

  1. Tensorflow
  2. Keras
  3. Scikit-learn

Importing data

First let us load the necessary modules:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

Import the CSV file which we scraped as a pandas data frame and inspect its contents.

data = pd.read_csv('data/players.csv')
data.dtypes

| Column | dtype |
|---|---|
| Bat_100 | object |
| Bat_4s | object |
| Bat_50 | object |
| Bat_6s | object |
| Bat_Ave | object |
| Bat_BF | object |
| Bat_Ct | int64 |
| Bat_HS | object |
| Bat_Inns | object |
| Bat_Mat | int64 |
| Bat_NO | object |
| Bat_Runs | object |
| Bat_SR | object |
| Bat_St | int64 |
| Bio_Also_known_as | object |
| Bio_Batting_style | object |
| Bio_Born | object |
| Bio_Bowling_style | object |
| Bio_Current_age | object |
| Bio_Died | object |
| Bio_Education | object |
| Bio_Fielding_position | object |
| Bio_Full_name | object |
| Bio_Height | object |
| Bio_In_a_nutshell | object |
| Bio_Major_teams | object |
| Bio_Nickname | object |
| Bio_Other | object |
| Bio_Playing_role | object |
| Bio_Relation | object |
| Bowl_10 | object |
| Bowl_4w | object |
| Bowl_5w | object |
| Bowl_Ave | object |
| Bowl_BBI | object |
| Bowl_BBM | object |
| Bowl_Balls | object |
| Bowl_Econ | object |
| Bowl_Inns | object |
| Bowl_Mat | int64 |
| Bowl_Runs | object |
| Bowl_SR | object |
| Bowl_Wkts | object |

dtype: object

We can see that most of the fields are read as the object data type, which is a generic pandas datatype. It gets assigned when a column contains mixed types such as characters and numerals. Some obviously numerical fields are being detected as strings. But before we recast all of them as numeric types, we need to preprocess some of them to extract the numeric values.

For example, let us inspect Bowl_BBI and Bowl_BBM, which stand for best bowling figures in an innings and in a match respectively.

data[['Bowl_BBI','Bowl_BBM']].head(n=10)
Bowl_BBIBowl_BBM
0-
13/31
22/35
3-
4-
5-
6-
7-
86/85
9-

Either field can be understood as a combination of two separate variables: best bowling wickets and best bowling runs. Similarly, if we cast the field Bat_HS as an integer directly, the not-out values will be lost, since they are suffixed with an asterisk, which makes them strings. Let us go ahead and fix these potential issues.

# Best bowling innings wickets
bbi_df = pd.DataFrame(data['Bowl_BBI'].str.replace('-','').str.split('/').tolist(),
                      columns = ['Bowl_BBIW','Bowl_BBIR'])
bbm_df = pd.DataFrame(data['Bowl_BBM'].str.replace('-','').str.split('/').tolist(),
                      columns = ['Bowl_BBMW','Bowl_BBMR'])

data = data.join([bbi_df,bbm_df])

# Identify numeric columns
numeric_cols = ['Bat_100','Bat_4s','Bat_50','Bat_6s','Bat_Ave','Bat_BF',
                'Bat_Ct','Bat_HS','Bat_Inns','Bat_Mat','Bat_NO','Bat_Runs',
                'Bat_SR','Bat_St','Bowl_10','Bowl_4w','Bowl_5w','Bowl_Ave',
                'Bowl_Balls','Bowl_Econ','Bowl_Inns','Bowl_Mat','Bowl_Runs',
                'Bowl_SR', 'Bowl_Wkts','Bowl_BBIW','Bowl_BBIR','Bowl_BBMW','Bowl_BBMR']

# regex replace * in High scores
data['Bat_HS'] = data['Bat_HS'].replace(r'\*$','',regex=True)
data[numeric_cols] = data[numeric_cols].replace('-',0)
data[numeric_cols] = data[numeric_cols].apply(pd.to_numeric, errors='coerce')
data[numeric_cols] = data[numeric_cols].fillna(0)
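
The split-based parsing above relies on every entry having the same shape. As a more defensive sketch, assuming the figures always look like wickets/runs or a placeholder such as a dash, Series.str.extract with two capture groups produces both columns directly and leaves NaN wherever the pattern does not match. The sample values below are made up for illustration and are not taken from the scraped file.

```python
import pandas as pd

# Hypothetical sample of best-bowling figures; '-' and None mean "no figures available"
sample = pd.Series(['-', '3/31', '2/35', None, '6/85'], name='Bowl_BBI')

# Two capture groups: wickets before the slash, runs after it
parsed = sample.str.extract(r'^(\d+)/(\d+)$')
parsed.columns = ['Bowl_BBIW', 'Bowl_BBIR']

# Rows that did not match come back as NaN; convert to numbers and fill with zero
parsed = parsed.apply(pd.to_numeric, errors='coerce').fillna(0).astype(int)
print(parsed)
```

The resulting frame has the same index as the input, so it can be joined back to the main table in the same way as bbi_df and bbm_df.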

If we check the data type again, we will see that all the numerical fields are interpreted as int or float datatype as expected.

Be careful when filling NaN with zeroes. The idea is not to introduce false values into the dataset. In this case, a value of zero is neutral, since it carries the same meaning as an absent number. But for certain types of data, such as temperature, zero introduces a false value into the dataset, since zero is a legitimate reading and temperature values can also be less than zero.
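
To make the caveat concrete, here is a small sketch with made-up numbers. The series names and values are purely illustrative. It compares zero-filling a count with zero-filling a temperature reading.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry each
wickets = pd.Series([2.0, np.nan, 4.0])        # missing means "took no wickets"
temperature = pd.Series([-6.0, np.nan, -4.0])  # missing means "not measured"

# Zero is a faithful stand-in for the absent wicket count
wickets_filled = wickets.fillna(0)

# For temperature, zero is a real reading, so filling with it shifts the mean
temp_zero_filled_mean = temperature.fillna(0).mean()
temp_skipna_mean = temperature.mean()          # NaN is skipped by default

print(temp_zero_filled_mean, temp_skipna_mean)
```

The zero-filled temperature mean is pulled towards zero, while the mean that skips the missing reading stays at the level of the actual measurements.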

Pre-processing

Deriving new features

When using data in our models we have to understand the units in which they are represented. Not all features are directly comparable. For instance, averages and strike rates are already normalized, so they do not grow with the number of matches that a player plays. But other aggregate statistics do. So in effect it would be meaningless to compare the run tally of a player who has played only 10 matches with that of another who has played a hundred matches.

To understand this better, let us plot runs scored against innings played.

sns.jointplot(x="Bat_Runs", y="Bat_Inns", data=data)

Bat Inns vs Bat Runs

Obviously there is a strong correlation between the number of innings played and the number of runs scored. Ideally we want our features to be as independent of each other as possible. To separate the influence of the number of innings played from the batting runs feature, we will divide the aggregate statistics by the number of innings played.

# select aggregate stats such as no. of hundreds, runs scored etc.,
bat_features_raw = ['Bat_100', 'Bat_4s', 'Bat_50', 'Bat_6s', 
                    'Bat_BF', 'Bat_Ct', 'Bat_NO', 'Bat_Runs','Bat_St']

# column names for scaled features
bat_features_scaled = ['Bat_100_sc', 'Bat_4s_sc', 'Bat_50_sc', 'Bat_6s_sc', 
                    'Bat_BF_sc', 'Bat_Ct_sc', 'Bat_NO_sc', 'Bat_Runs_sc','Bat_St_sc']

# leave aside match and innings count and other aggregate stats such as best bowling figures, strike rate and average
bowl_features_raw = ['Bowl_10', 'Bowl_4w', 'Bowl_5w',  
                     'Bowl_Balls', 'Bowl_Runs','Bowl_Wkts']

# column names for scaled features
bowl_features_scaled = ['Bowl_10_sc', 'Bowl_4w_sc', 'Bowl_5w_sc',  
                     'Bowl_Balls_sc', 'Bowl_Runs_sc','Bowl_Wkts_sc']

# divide by innings count since it is more relevant than match count
data[bat_features_scaled] = data[bat_features_raw].apply(lambda x: x/data['Bat_Inns'])
data[bowl_features_scaled] = data[bowl_features_raw].apply(lambda x: x/data['Bowl_Inns'])

# these are the meaningful features which will be the input for our model. 
features = ['Bat_Ave','Bat_HS', 'Bat_SR'] + bat_features_scaled + ['Bowl_Ave','Bowl_Econ','Bowl_SR','Bowl_BBIW', 'Bowl_BBIR', 'Bowl_BBMW', 'Bowl_BBMR'] + bowl_features_scaled

# replace infinities produced by dividing by zero innings, then fill missing values with zero
data[features] = data[features].replace([np.inf, -np.inf], np.nan).fillna(0)
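
Dividing by the innings count deserves a little care, because a player with zero innings produces NaN (0/0) or infinity (x/0). The following is a minimal sketch on a made-up three-player frame. The column names mirror the ones used above, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical players: the second one has never batted (zero innings)
toy = pd.DataFrame({'Bat_Runs': [500, 0, 30],
                    'Bat_Ct': [4, 2, 0],
                    'Bat_Inns': [10, 0, 3]})

# Element-wise division; axis=0 aligns the innings column with each row
per_innings = toy[['Bat_Runs', 'Bat_Ct']].div(toy['Bat_Inns'], axis=0)

# 0/0 gives NaN and 2/0 gives inf; map both to zero
per_innings = per_innings.replace([np.inf, -np.inf], np.nan).fillna(0)
print(per_innings)
```

Treating both cases as zero follows the same reasoning as the earlier zero fill: a player who never batted has no per-innings contribution to report.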

It can be argued that dividing the runs scored by the innings count duplicates the batting average feature. Leaving aside subtle differences in the way batting averages are calculated, we will still keep both features to see how our model learns the difference between them and assigns weights accordingly.

Now let us plot the scaled runs scored value vs the innings played.

sns.jointplot(x="Bat_Runs_sc", y="Bat_Inns", data=data)

Bat Inns vs Bat Runs Scaled

Clearly this is a far better representation of batting capabilities of a player. You can see there is less dependency on the number of innings played. It is not hard to imagine how this scaling affects our final prediction. The impact is obvious when we plot batting runs and bowling wickets (likely to be the most important features) in a KDE plot. Here is the KDE plot before scaling:

sns.jointplot(x="Bowl_Wkts", y="Bat_Runs", data=data, kind='kde')

Bowl Wickets vs Bat Runs KDE plot

There is no clear clustering, which indicates that our classification is not going to be effective. In comparison, if we generate the same chart for the scaled values, there is a clear grouping.

sns.jointplot(x="Bowl_Wkts_sc", y="Bat_Runs_sc", data=data, kind='kde')

Bowl Wickets Scaled vs Bat Runs Scaled KDE plot

This is much more promising. Remember, your model will only perform as well as the data you feed in. If the input data is already confused, there is very little a mathematical model can do. Now that we have almost all that we need, we will extract those records that have playing role information and use them for training and testing. To avoid outliers corrupting our model, we will also exclude players who have played 5 or fewer matches.

# keep players with a known playing role who have played more than 5 matches
df = data[data['Bio_Playing_role'].notnull() & (data['Bat_Mat'] > 5)]

Data Transformation

Next let us look at our target feature, which is the playing role. We need to understand the values it can assume. Let us look at the unique values of the playing role field.

# Check the unique playing roles to identify mapping function
data['Bio_Playing_role'].unique()
array([nan, 'Top-order batsman', 'Bowler', 'Middle-order batsman',
       'Wicketkeeper batsman', 'Allrounder', 'Batsman', 'Opening batsman',
       'Wicketkeeper', 'Bowling allrounder', 'Batting allrounder'], dtype=object)

The playing role definition is too granular. We want fewer distinct roles so that each role gets enough sample data points to train the model. Also, the role tagging done by Cricinfo is not consistent. For example, not all opening batsmen have been tagged with the opening batsman role. So we define a mapping function to group playing roles into 4 categories: ['Batsman','Bowler','Wicketkeeper','Allrounder'].

def get_role(role):
    if  pd.notnull(role):
        if 'keeper' in role:
            return "Wicketkeeper"
        elif 'rounder' in role:
            return "Allrounder"
        elif 'atsman' in role:
            return "Batsman"
        elif 'owler' in role:
            return "Bowler"
        else:
            return ""
    else:
        return ""
    
data['role'] = data['Bio_Playing_role'].apply(get_role)

Note that this feature is categorical data. It is different from numerical data such as height or weight. When we want to use Deep Neural Networks we need to represent the target feature as numerical data. We will assign one column to each playing role and set its value to one when that playing role is the player’s role. The job of our model will then be to assign a value close to 1 to one of these columns and a value close to 0 to the rest.

This is called one-hot encoding. It turns out this is a frequent task, so pandas has a handy built-in function to perform it.


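# Rebuild the training subset so that it includes the new 'role' column
df = data[data['Bio_Playing_role'].notnull() & (data['Bat_Mat'] > 5)]

# y is the categorical target
y = df['role']

# Convert categorical data into numerical columns
y_cat = pd.get_dummies(y)

# X holds the input features. We need to convert it from a pandas DataFrame to a NumPy array to feed into our models
X = df[features].to_numpy()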

Let us see the new y_cat dataframe

y_cat.head()
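
|  | Allrounder | Batsman | Bowler | Wicketkeeper |
|---|---|---|---|---|
| 19 | 0 | 1 | 0 | 0 |
| 20 | 0 | 0 | 1 | 0 |
| 22 | 0 | 0 | 0 | 1 |
| 25 | 1 | 0 | 0 | 0 |
| 28 | 0 | 1 | 0 | 0 |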

One might be tempted to assign a unique number to each category (say 1: Batsman, 2: Bowler, etc.), but that will not work. There is no quantitative relation between categories, and assigning raw numbers implies that there is a numerical progression to them. Sometimes it can work for ordered data such as day of the month, but even then one has to be aware of the bounds and circularity of the target variable.
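
The point about implied progression can be checked numerically. In this sketch, which assumes an arbitrary numbering of the four roles, integer codes place some roles further apart than others, whereas one-hot vectors keep every pair of roles equally distant.

```python
import numpy as np

roles = ['Batsman', 'Bowler', 'Wicketkeeper', 'Allrounder']

# Integer codes: an arbitrary numbering of the four roles
codes = np.arange(len(roles), dtype=float)

# One-hot vectors: one row per role, a single 1 in that role's column
one_hot = np.eye(len(roles))

# Distance from Batsman to Bowler vs. Batsman to Allrounder
int_near, int_far = abs(codes[0] - codes[1]), abs(codes[0] - codes[3])
oh_near = np.linalg.norm(one_hot[0] - one_hot[1])
oh_far = np.linalg.norm(one_hot[0] - one_hot[3])

print(int_near, int_far)   # integer codes: the gaps differ
print(oh_near, oh_far)     # one-hot: both gaps are identical
```

With integer codes a model would see Allrounder as three times further from Batsman than Bowler is, which is an artefact of the numbering rather than anything in the data.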

Scaling data

Some fields vary over a larger range than the rest. Remember we did a preliminary scaling by dividing these values by the number of innings. But that is not sufficient, since it only made sure that one feature (‘no. of innings’) did not overly influence another feature (‘runs scored’). The features themselves still span different ranges. For example, Bowling Wickets Scaled only ranges from 0 to 5, whereas Batting Runs Scaled ranges from 0 to 50. Most machine learning models work best when the features vary within the same range. If we let these data points influence our calculation without modification, wickets taken will have negligible influence.

So we perform another round of scaling on all input features. We will use the MinMaxScaler from the scikit-learn library. This will scale each feature such that its largest value becomes one and its smallest value becomes zero.

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler(feature_range=(0,1)).fit(X)

# X_mms will be our new input array with all values scaled to be between 0 and 1
X_mms = mms.transform(X)
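
Min-max scaling is simple enough to reproduce by hand, which helps to see what the scaler does to each column. The sketch below uses a made-up array in which the first column stands in for wickets per innings and the second for runs per innings, and compares the manual formula with scikit-learn's result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical per-innings values: column 0 ~ wickets, column 1 ~ runs
toy = np.array([[0.0, 10.0],
                [2.5, 30.0],
                [5.0, 50.0]])

# Manual min-max scaling, column by column: (x - min) / (max - min)
col_min, col_max = toy.min(axis=0), toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)
print(manual)
```

After scaling, both columns run from 0 to 1, so neither dominates simply because of its units.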

Training the model

Deep Neural Network

First we will try to run a Deep Neural Network model on this data. Here are the necessary modules to import.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, Adam, Adadelta, RMSprop
from keras.wrappers.scikit_learn import KerasClassifier
import keras.backend as K

Create the Keras Sequential model. I am using a DNN with 1 hidden layer and 1 output layer. The hidden layer has 15 nodes. The number of nodes in the output layer should equal the number of categories, so we will go with 4.

def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(15, input_dim=25, kernel_initializer='he_normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='he_normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])    
    return model

Softmax is a popular activation function for classification problems. In simple words, an activation function is a simple function that turns the weighted input of a node into its output. The softmax function receives an array of values from the previous layer and returns a new array of non-negative values which add up to 1, so they can be read as probabilities. The category with the largest value is deemed the likeliest match for our data.
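
A few lines of NumPy are enough to see softmax at work. This is a standalone sketch rather than the Keras implementation, and the raw scores are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Map raw scores to non-negative values that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp = np.exp(z - z.max())   # subtracting the max avoids overflow
    return exp / exp.sum()

# Hypothetical raw scores for [Allrounder, Batsman, Bowler, Wicketkeeper]
scores = [1.0, 3.0, 0.5, 0.2]
probs = softmax(scores)

print(probs, probs.sum(), probs.argmax())
```

The largest raw score (the second entry here) receives the largest share, and the index returned by argmax is the predicted category.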

Softmax highlights only the likeliest category for a data point. Here, for simplicity’s sake, we assume that our categories are mutually exclusive, i.e., a player can only belong to one category at a time. There are other kinds of data where a single entry can belong to more than one category at a time, and we may need to use a different activation function for that. Again, know your data before deciding on an activation function.

The loss function is categorical_crossentropy. A loss function can be thought of as a course-correction function which measures the error as our model navigates towards an ideal set of weights over multiple iterations. The categorical cross-entropy loss function heavily penalizes predictions that are confident but wrong. It is a common loss function for classification problems.
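
To illustrate how the penalty grows, here is a hand-rolled version of the loss for a single sample. It is a simplified sketch, not Keras's own code, and the three prediction vectors are invented.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

# One-hot target over [Allrounder, Batsman, Bowler, Wicketkeeper]: a Batsman
y_true = np.array([0, 1, 0, 0])

confident_right = np.array([0.05, 0.85, 0.05, 0.05])
unsure = np.array([0.25, 0.25, 0.25, 0.25])
confident_wrong = np.array([0.05, 0.05, 0.85, 0.05])

loss_right = categorical_crossentropy(y_true, confident_right)
loss_unsure = categorical_crossentropy(y_true, unsure)
loss_wrong = categorical_crossentropy(y_true, confident_wrong)

print(loss_right, loss_unsure, loss_wrong)
```

Only the probability assigned to the true class enters the sum, so the loss is smallest for the confident correct prediction and largest for the confident wrong one.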

These are the important attributes that closely follow our problem definition. Most of the other parameters can be fiddled with.

Next we will use Keras to train the model. The result is a Keras Classifier function whose weights are trained on our data. We can use this function to predict values for inputs which we haven’t seen so far.

# evaluate the model on the min-max scaled dataset
# (Keras 2 names the training-length argument `epochs`; the older `nb_epoch` is deprecated)
estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)

We cannot use all of the data to train our model. If we did, the model would closely follow our existing data and would not be useful for predicting values we haven’t seen so far. This is called overfitting.

Overfitting

Example of overfitting - Source Wikipedia

To avoid this, we will split the data into train and test datasets. We will use the former to train the model and compute the scores by testing against the test data in each iteration of cross-validation. Scikit-learn provides a helper function called cross_val_score to assist with this. StratifiedKFold is the generator strategy we will use for selecting the train/test datasets. It splits the data into K folds (set to 10 in our case), trains the model on K-1 folds and tests it against the left-out fold, while preserving the class distribution of the data.

from sklearn.model_selection import StratifiedKFold, cross_val_score

# set the random state to a fixed number for reproducing the results
kfold = StratifiedKFold(n_splits=10, shuffle=True,random_state=42)
results = cross_val_score(estimator, X_mms, y.values, cv=kfold)
print("Results: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

I got a result of 82.2% accuracy. Not bad for a first attempt, particularly since we employed gross simplifications and trained the model with only around 450 records. The results that you get may be slightly different, since the network weights are initialized randomly even though the fold shuffling itself is seeded.
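
To see what stratification does, the sketch below splits a made-up, imbalanced label array into three folds and counts the classes in each test fold. The labels and the fold count are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 12 batsmen, 6 bowlers
labels = np.array(['Batsman'] * 12 + ['Bowler'] * 6)
dummy_X = np.zeros((len(labels), 1))   # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_counts = []
for train_idx, test_idx in skf.split(dummy_X, labels):
    test_labels = labels[test_idx]
    fold_counts.append((int((test_labels == 'Batsman').sum()),
                        int((test_labels == 'Bowler').sum())))

print(fold_counts)
```

Every test fold keeps the same two-to-one ratio of batsmen to bowlers as the full label array, which is what prevents a fold from accidentally containing no examples of a rare role.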

Random Forest Classifier

This time let us try to model the data using a Random Forest classifier. A Random Forest classifier is a much simpler method than a neural network. It relies on building multiple decision trees and combining their results.

from sklearn.ensemble import RandomForestClassifier

# Instantiate model with 1000 decision trees
rf_estimator = RandomForestClassifier(n_estimators = 1000, random_state = 42)

rf_results = cross_val_score(rf_estimator, X_mms, y.values, cv=kfold)
print("Results: %.2f%% (%.2f%%)" % (rf_results.mean()*100, rf_results.std()*100))

You will notice that the Random Forest classifier has performed significantly better than the DNN classifier. I got 87.28% accuracy, which is amazing since Random Forest is several times faster and less resource intensive than the DNN classifier. And I didn’t even have to run it on top of TensorFlow or make use of a GPU. Individual decision trees are quite effective at classification tasks but they tend to overfit, which is what combining many trees helps to counter.

Reviewing the results

Since our Random Forest model has performed significantly better, we will use that model to predict the unseen roles of players.

# Fit the estimator on available data
rf_estimator.fit(X_mms, y.values)

# NumPy array holding the input features for every player
P = data[features].to_numpy(copy=True)

# Guard against infinite values introduced by division by zero during pre-processing
P[np.isinf(P)] = 0

# Apply the scaler that was fitted on the training data (do not refit it here)
P_mms = mms.transform(P)

# Prediction based on the Random forest model
data['predicted_role_rf'] = rf_estimator.predict(P_mms)

Confusion Matrix

A score alone is not a good indicator that our model has performed well. We need to review its performance by plotting a confusion matrix. It is a simple matrix plot based on known labelled data, with predicted values plotted against the true values. The diagonal entries represent correct predictions; the rest represent confused values. Let us plot the confusion matrix for our data.

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(data[(data['role']!='')]['role'], 
                       data[(data['role']!='')]['predicted_role_rf'],
                       labels = ['Batsman','Bowler','Allrounder','Wicketkeeper'])
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')

Confusion Matrix

We can see that the model is quite effective in matching the pure roles such as Batsman or Bowler. When it comes to mixed roles such as Allrounder or Wicketkeeper, it does not fare as well. Part of the problem lies in our assumption that the roles are mutually exclusive, i.e., a player cannot be both Batsman and Bowler at the same time. So we identify only around 37% of the all-rounders successfully. Later we will see that there are other reasons why the predicted role doesn’t match the role marked in Cricinfo.
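
Since the estimator above was fitted on the labelled players before predicting them, the matrix mixes in rows the model has already seen. One alternative, sketched here on synthetic stand-in data rather than the player dataset, is to build the matrix from out-of-fold predictions with cross_val_predict, so that every sample is predicted by a model that did not train on it. The dataset sizes and forest settings below are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the player features and the four roles (not the real data)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=6,
                                   n_classes=4, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each sample is predicted by a model that never saw it during training
oof_pred = cross_val_predict(clf, X_toy, y_toy, cv=cv)

# Rows are true classes, columns are predicted classes
mat_oof = confusion_matrix(y_toy, oof_pred)
print(mat_oof)
```

A matrix built this way can be passed to the same heatmap call and is usually a more conservative picture of how the model handles unseen players.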

Reviewing the results

Let us see the cases where our predictions differed from the roles defined in cricinfo:

data[(data['role'] != data['predicted_role_rf']) & (data['role'] != '') & (data['Bat_Mat'] > 5 )][['Bio_Full_name','predicted_role_rf', 'role', 'Bio_Playing_role']]

I have extracted the differences for popular players of recent times. Subjectively speaking, our model hasn’t performed too badly. There seems to be some merit to the classification offered by the model compared to the playing role assigned on the Cricinfo bio page.

|  | Bio_Full_name | predicted_role_rf | role | Bio_Playing_role |
|---|---|---|---|---|
| 38 | Stephen Norman John O’Keefe | Bowler | Allrounder | Allrounder |
| 44 | Glenn James Maxwell | Batsman | Allrounder | Batting allrounder |
| 88 | Andrew Symonds | Batsman | Allrounder | Allrounder |
| 227 | Shai Diego Hope | Batsman | Wicketkeeper | Wicketkeeper batsman |
| 281 | Brendon Barrie McCullum | Batsman | Wicketkeeper | Wicketkeeper batsman |
| 970 | Angelo Davis Mathews | Batsman | Allrounder | Allrounder |
| 1437 | Abraham Benjamin de Villiers | Batsman | Wicketkeeper | Wicketkeeper batsman |

In most cases, Cricinfo’s playing role is also based on the ODI and T20 formats. Some players, like AB de Villiers and Brendon McCullum, have donned multiple roles but given up the gloves for the game’s longest format. So we can’t really fault the model here for identifying them as batsmen. Then there are other cases of a player being regarded as an all-rounder based on the role he plays in the shorter formats.

Next, to the most interesting part: let us see how our model behaves for the data it hasn’t seen, i.e., the classification of those players whose playing role is missing in their bio page. For ease of identification, I have filtered only those players who have played more than 100 matches.

data[(data['role'] != data['predicted_role_rf']) & (data['role'] == '') & (data['Bat_Mat'] > 100 )][['Bio_Full_name','predicted_role_rf', 'role', 'Bio_Playing_role']]

|  | Bio_Full_name | predicted_role_rf | role | Bio_Playing_role |
|---|---|---|---|---|
| 134 | Mark Edward Waugh | Batsman |  | NaN |
| 137 | Mark Anthony Taylor | Batsman |  | NaN |
| 139 | Ian Andrew Healy | Wicketkeeper |  | NaN |
| 599 | Sourav Chandidas Ganguly | Batsman |  | NaN |
| 743 | Anil Kumble | Bowler |  | NaN |
| 925 | Brian Charles Lara | Batsman |  | NaN |
| 929 | Carl Llewellyn Hooper | Batsman |  | NaN |
| 937 | Courtney Andrew Walsh | Bowler |  | NaN |
| 957 | Desmond Leo Haynes | Batsman |  | NaN |
| 1072 | Kapildev Ramlal Nikhanj | Bowler |  | NaN |
| 1074 | Dilip Balwant Vengsarkar | Batsman |  | NaN |
| 1088 | Sunil Manohar Gavaskar | Batsman |  | NaN |
| 1257 | Cuthbert Gordon Greenidge | Batsman |  | NaN |
| 1258 | Isaac Vivian Alexander Richards | Batsman |  | NaN |
| 1284 | Clive Hubert Lloyd | Batsman |  | NaN |
| 1326 | Warnakulasuriya Patabendige Ushantha Joseph Chaminda Vaas | Bowler |  | NaN |
| 1463 | Makhaya Ntini | Bowler |  | NaN |
| 1474 | Gary Kirsten | Batsman |  | NaN |
| 1676 | Alec James Stewart | Batsman |  | NaN |
| 2020 | Inzamam-ul-Haq | Batsman |  | NaN |
| 2168 | Graham Alan Gooch | Batsman |  | NaN |
| 2205 | Wasim Akram | Bowler |  | NaN |
| 2216 | Saleem Malik | Batsman |  | NaN |
| 2320 | Geoffrey Boycott | Batsman |  | NaN |
| 2366 | Michael Colin Cowdrey | Batsman |  | NaN |
| 2381 | Mohammad Javed Miandad Khan | Batsman |  | NaN |

Even a cursory look tells us that our model worked splendidly. It is surprising how many prominent player bio pages have their playing role information missing. Well, it looks like even a simple ML model can fill that gap.
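
A fitted random forest also reports how much each input feature contributed to its splits, which is one way to check which features matter most. The sketch below uses synthetic data in which only the first column determines the label. The feature names are placeholders, and the same feature_importances_ attribute could be read from rf_estimator and paired with the features list.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)

# Synthetic stand-in: only the first column determines the label
X_toy = rng.rand(300, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Importances are non-negative and sum to 1; larger means more influential
importances = forest.feature_importances_
for name, score in zip(['informative', 'noise_1', 'noise_2'], importances):
    print(name, round(float(score), 3))
```

Sorting the real features by this score would give an empirical ranking to set against the intuition that runs and wickets per innings carry most of the signal.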

Let us see how the two most critical features (Bat_Runs_sc and Bowl_Wkts_sc) affect our predicted role.

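sns.set_palette("bright")
# `height` replaces the older `size` argument in seaborn 0.9 and later
sns.lmplot(x='Bowl_Wkts_sc', y='Bat_Runs_sc', data=data[data['Bat_Mat'] > 5],
           hue='predicted_role_rf', fit_reg=False, height=10)
# diagonal reference line from (0, 100) to (7, 0)
plt.plot([0,7.0],[100,0])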

Bat Runs vs Bowl Wkts

I have plotted a diagonal line, below which most of the points are clustered. It represents a kind of Pareto frontier which only exceptional players can breach. Note that there is no statistical basis for my choice of x and y intercepts; I just based it on visual inspection. Let us see the list of players who reside above this threshold.

data[data['Bat_Mat'] > 5].query('Bowl_Wkts_sc*100 + Bat_Runs_sc*7 > 700 ')[['Bio_Full_name','Bat_Mat','predicted_role_rf','Bowl_Wkts_sc','Bat_Runs_sc']]

|  | Bio_Full_name | Bat_Mat | predicted_role_rf | Bowl_Wkts_sc | Bat_Runs_sc |
|---|---|---|---|---|---|
| 62 | Steven Peter Devereux Smith | 59 | Batsman | 0.288136 | 98.237288 |
| 377 | Ravindrasinh Anirudhsinh Jadeja | 35 | Bowler | 4.714286 | 33.600000 |
| 380 | Ravichandran Ashwin | 55 | Bowler | 5.527273 | 37.363636 |
| 456 | Christopher Lance Cairns | 62 | Allrounder | 3.516129 | 53.548387 |
| 526 | Shakib Al Hasan | 51 | Allrounder | 3.686275 | 70.470588 |
| 720 | Sikandar Raza Butt | 9 | Allrounder | 1.444444 | 84.111111 |
| 832 | Donald George Bradman | 52 | Batsman | 0.038462 | 134.538462 |
| 1133 | Herbert Vivian Hordern | 7 | Bowler | 6.571429 | 36.285714 |
| 1177 | Richard John Hadlee | 86 | Bowler | 5.011628 | 36.325581 |
| 1240 | Yasir Shah | 28 | Bowler | 5.892857 | 15.892857 |
| 1468 | Jacques Henry Kallis | 166 | Allrounder | 1.759036 | 80.054217 |
| 1548 | Charles Thomas Biass Turner | 17 | Bowler | 5.941176 | 19.000000 |
| 1748 | Garfield St Aubrun Sobers | 93 | Allrounder | 2.526882 | 86.365591 |
| 1825 | Michael John Procter | 7 | Bowler | 5.857143 | 32.285714 |
| 1838 | Robert Graeme Pollock | 23 | Batsman | 0.173913 | 98.086957 |
| 1849 | Edgar John Barlow | 30 | Allrounder | 1.333333 | 83.866667 |
| 1860 | Trevor Leslie Goddard | 41 | Allrounder | 3.000000 | 61.365854 |
| 2157 | Ian Terence Botham | 102 | Allrounder | 3.754902 | 50.980392 |
| 2285 | Mulvantrai Himmatlal Mankad | 44 | Allrounder | 3.681818 | 47.931818 |
| 2386 | Imran Khan Niazi | 88 | Allrounder | 4.113636 | 43.261364 |
| 2522 | George Aubrey Faulkner | 25 | Allrounder | 3.280000 | 70.160000 |
| 2741 | George Joseph Thompson | 6 | Allrounder | 3.833333 | 45.500000 |
| 2769 | Sydney Francis Barnes | 27 | Bowler | 7.000000 | 8.962963 |
| 2809 | Thomas Richardson | 14 | Bowler | 6.285714 | 12.642857 |
| 2821 | John James Ferris | 9 | Bowler | 6.777778 | 12.666667 |
| 2845 | George Alfred Lohmann | 18 | Bowler | 6.222222 | 11.833333 |
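
The condition in the query follows directly from the line drawn on the plot. A line with x-intercept 7 and y-intercept 100 satisfies wkts/7 + runs/100 = 1, which is the same as 100*wkts + 7*runs = 700, so points above the line give a value greater than 700. The helper below is a hypothetical restatement of that test, checked against Bradman's values from the table and one invented player.

```python
# Intercepts chosen by eye in the plot: 7 wickets per innings and 100 runs per innings
X_INTERCEPT, Y_INTERCEPT = 7.0, 100.0

def above_line(wkts_per_inns, runs_per_inns):
    """True when the point lies above the line joining (7, 0) and (0, 100)."""
    # Line: wkts/7 + runs/100 = 1, i.e. 100*wkts + 7*runs = 700
    lhs = Y_INTERCEPT * wkts_per_inns + X_INTERCEPT * runs_per_inns
    return lhs > X_INTERCEPT * Y_INTERCEPT

print(above_line(0.038462, 134.538462))  # Bradman's values from the table
print(above_line(1.0, 40.0))             # a hypothetical player below the line
```

Changing either intercept moves the line and therefore the list, which is why the choice of intercepts deserves the caveat given above.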

The list is dominated by exceptional all-rounders. Among specialists, bowlers fare better. Perhaps it is my fault that I set the bar for greatness too high. The Bat_Runs_sc of Bradman is so far ahead of the rest that one tends to choose a higher value for the y-intercept.

Finally let us plot Bat_Runs_sc against predicted playing role using a violin plot. This will show the distribution of runs scored across the categories of playing role. We can see that for batsmen, the bulk of the violin plot is top heavy, whereas for the bowlers it is bottom heavy.

sns.violinplot(x='predicted_role_rf', 
               y='Bat_Runs_sc',
               data=data,
               scale='width')

Bat Runs Violin Plot

Conclusion

If you review the length of the posts, less than 20% is allocated to running the actual machine learning code. That closely reflects the time spent on this project as well. The bulk of the time is spent in collecting and curating the data. Also, the results from the Random Forest classifier are revealing. The right tool for the right job is often more effective than a generic tool which is universally useful.

Machine learning and data science is a vast subject. Despite the length of this post, I have barely scratched the surface of this domain. Apart from the knowledge of tools and procedures, one needs to have a good understanding of the data and be conscious of the inherent biases in the numerical models.

Finally, scikit-learn is an excellent resource for learning and practising Machine learning. It has excellent documentation and helper functions for many of the common tasks. I found Python Data Science Handbook to be another great freely available resource.