3.3. Exercise 2: Training and Fine-Tuning a Decision Tree for the Moons Dataset#
Now we move to creating and training decision trees. Here we will learn how to train and fine-tune a decision tree on a synthetic dataset.
Since decision trees contain multiple trainable hyperparameters, parts of this notebook focus on adjusting and optimizing multiple hyperparameters in the decision tree algorithm.
Goal:
Know how to train a decision tree.
To be proficient in tuning ML model hyperparameters with cross-validation.
Can we grow a tree to predict the moon? 🌳 🌛
The goal of this exercise is to train and adjust the hyperparameters of a decision tree on a synthetic “moons” dataset. The dataset contains two interleaving half circles that we seek to separate via classification.
First, let’s generate a moons dataset using make_moons
.
from sklearn.datasets import make_moons # Import function to make moons
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from sklearn.datasets import make_moons # Import function to make moons
ModuleNotFoundError: No module named 'sklearn'
# Make 10,000 samples with Gaussian noise that has a standard deviation of 0.4
X, y = make_moons(n_samples=10000, noise=0.4)
Q1) Split the moons dataset into a training and a test set
Hint 1: You may use the train_test_split
function of scikit-learn
.
Hint 2: Here we will keep 20% of the dataset for testing.
Hint 3: To ensure you get the same result every time you run the code, we will specify the random_state
option in train_test_split
function to be 42.
# Import the train_test_split function here.
from sklearn.___________ import ______________
# Split the dataset into a training set and a test set
# There will be four outputs in this function. 2 for X (train and test) and 2 for y (train and test)
____,____,____,____ = ___________(_, _, test_size=__, random_state=__)
Q2) Visualize the data, indicating which points belong to each half circle of the moon, and which points belong to the training and test sets
Hint 1: In the moons dataset, X
contains the 2D spatial coordinates of each sample, while y
indicates which half circle of the moon the sample belongs to.
Hint 2: You may randomly subsample your data for visualization purposes. Alternatively, you can adjust the size and transparency of a Matplotlib scatter plot by varying the parameters s
and alpha
. We adopt the latter visualization method in the code snippet. Do not hesitate to experiment with the first visualization method as long as the figure looks good.
# Scatter the moon data and don't forget to add a legend to your figure
import numpy as np
import matplotlib.pyplot as plt
___,__ = plt._______(_,__,figsize=(12,4))
#########################################################################################################
# (1) Plot the whole dataset with plt.scatter
#########################################################################################################
# The X and y indices of each data point can be obtained like this: X[:,0], X[:,1]
# The scatter points should be coloured by y
scatter = _____._________(_______,_____,s=10,c=____,edgecolor='w',linewidths=0.5,alpha=0.5,cmap='RdBu')
# Add legend to the upper right corner of the figure
legend1 = ___.______(*scatter.legend_elements(),loc=_________, title="Classes")
ax[0].set_title('Half Moon')
ax[0].tick_params(axis='both', which='major', labelsize=11)
ax[0].grid(alpha=0.2,ls='--',lw=1)
#########################################################################################################
# (2) Plot the training dataset with plt.scatter
#########################################################################################################
scatter_train = ____.______(________,_________,s=10,c=____,edgecolor='w',linewidths=0.5,alpha=0.5,cmap='RdBu')
ax[1].set_title('Training')
ax[1].tick_params(axis='both', which='major', labelsize=11)
ax[1].grid(alpha=0.2,ls='--',lw=1)
#########################################################################################################
# (2) Plot the test dataset with plt.scatter
#########################################################################################################
scatter_test = ___._____(________,_________,s=10,c=____,edgecolor='w',linewidths=0.5,alpha=0.5,cmap='RdBu')
ax[2].set_title('Test')
ax[2].tick_params(axis='both', which='major', labelsize=11)
ax[2].grid(alpha=0.2,ls='--',lw=1)
plt.show()
Do your data & training/test splits look reasonable?
Q3) Conduct a hyperparameter search to find the two hyperparameters that lead to the best-performing decision tree
Hint 1: You can conduct an exhaustive hyperparameter search over specified parameter values using the GridSearchCV
class documented at this link. We recommend using cross-validation by setting the parameter cv
.
Hint 2: If you choose to train a DecisionTreeClassifier
object, we recommmend conducting the search over the max_leaf_nodes
and min_samples_split
hyperparameters. Consult the DecisionTreeClassifier
documentation to decide which range to search over.
# Import the necessary classes and functions
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# Conduct the grid search to find good hyperparameter values
# for your decision tree
# For 'max_leaf_nodes', search between 2 and 120
# For 'min_samples_split', search between 2 and 10
##############################################################################################################################
# 1. Define hyperparameter search grid
##############################################################################################################################
# In param_grid, define the hyperparameters you would like to test, and the ranges the hyperparameters should be in
param_grid = {__________:___________), ____________:____________}
##############################################################################################################################
# 2. GridSearch
##############################################################################################################################
# Perform GridSearch on your DecisionTree (random_state=42 to ensure same result every time)
# use cv=3, verbose=1 for GridSearchCV()
gsc_tree = GridSearchCV(____________(_______),____________,_____,________)
##############################################################################################################################
# 3. Fit on training set
##############################################################################################################################
gsc_tree.___(_____,________)
# Print the best values you found for the hyperparameters
# using the `best_estimator_` attribute of your grid search object
gsc_tree.___________
Q4) Using the best hyperparameter values you found, train a decision tree over the entire training set and calculate its accuracy over both the training and test sets
Hint 1: GridSearchCV
has a method predict
that automatically selects the best model found during the search.
Hint 2: Using the accuracy classification score, you should find an accuracy of \(\approx\)85% on the test set.
# Make predictions on the training and test sets with your best model
y_pred_train = gsc_tree._________(_____)
y_pred_test = gsc_tree._________(_____)
# Calculate the accuracy of the best model over the training and test sets
from sklearn.metrics import accuracy_score
print(f'Accuracy over training set: {(accuracy_score(_________,____________)):.2%} \n'
f'Accuracy over test set: {(accuracy_score(_________,___________)):.2%}')
Q5) Visualize the errors made by your best model
Hint: You may recycle the visualization scripts you developed in Question 2
# Scatter points for which your best model made erroneous predictions
# and compare them to points for which your best models made correct predictions
y_pred = gsc_tree.predict(______)
corr_X,wrong_X = [],[]
for indx,obj in enumerate(__________): # Pull out predictions on each data points one-by-one
if ________________: # If the predictions is not equal to output truth (y_test)
wrong_X.append(____________) # Model made an error
else:
corr_X.append(__________) # Model correct prediction
# Recycle the visualization scripts earlier
fig,ax = plt.subplots(1,1,figsize=(5,4))
##############################################################################################################################
# 1. Plot correct prediction here
##############################################################################################################################
scatter_corr = ax.scatter(__________,_____________,s=10,color='k',edgecolor='w',linewidths=0.5,alpha=0.5)
##############################################################################################################################
# 2. Plot wrong prediction here
##############################################################################################################################
scatter_wrong = ax.scatter(__________,_____________,s=10,color='r',edgecolor='w',linewidths=0.5,alpha=0.5)
##############################################################################################################################
# 3. Title etc.
##############################################################################################################################
ax.set_title('Half Moon')
ax.tick_params(axis='both', which='major', labelsize=11)
ax.grid(alpha=0.2,ls='--',lw=1)
plt.show()
Can you think of ways to improve your best model?
3.3.1. Bonus Exercise 2: Upgrading the Decision Tree to a Random Forest#
Is a full forest enough to predict the moon? 🌲
Building on the previous exercise, we would like to upgrade the decision tree to a random forest to make more accurate predictions on the moons
dataset.
Q1) Generate 1,000 subsets of the training set, each containing 100 instances selected randomly
Hint: You may use scikit-learn
’s random permutation cross-validator ShuffleSplit with the appropriate value of the n_splits
parameter.
# Import and build the random permutation cross-validator
# Generate 1,000 subsets of the training sets with
# 100 randomly-selected instances
Q2) Train one DecisionTreeClassifier
on each subset, using the best hyperparameter values found above
# Train one decision tree per subset (clone also works! but here I'd like to do this the hard way =])
Q3) Evaluate each decision tree on the test set and visualize their accuracy
Hint 1: You can quickly make histograms by using Matplotlib.pyplot
’s hist
function
Hint 2: The mean accuracy of your decision trees should be approximately 80% because the decision trees are fitted on smaller sets.
# Calculate the mean accuracy
# Visualize the distribution of accuracies
Now comes the magic ❇ 🌲 ❇
Q4) Generate the predictions of the 1,000 decision trees over the test set, and only keep the most frequent prediction. This gives you majority-vote predictions over the test set
Hint: You may use Scipy
’s mode
function to calculate the most frequent prediction.
# Generate the predictions of all trained decision trees over the test set
# For each instance of the test set, calculate the majority-vote prediction
Congratulations!! 😃 You have created a random forest classifier 🌲 🌳 🌲
Q5) Calculate the accuracy of your random forest classifier and visualize its performance
Hint: Your accuracy should be approximately 1% higher than your best decision tree.
# Calculate the accuracy of your random forest classifier
accuracy: 75.90%
# Visualize its errors: Which points did the
# random forest classify correctly
# when the decision tree was making an error?