4.3. Clustering#

Sorting takes a lot of effort. Is there a way we can get computers to do it for us automatically, even when we don’t know where to begin?
This notebook will be used in the lab session for week 4 of the course, covers Chapter 9 of Géron, and builds on the notebooks made available on GitHub.
Need a reminder of last week’s labs? Click here to go to the notebook for week 3 of the course.
Notebook Setup
First, let’s import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
rnd_seed = 2022
rnd_gen = np.random.default_rng(rnd_seed)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "clustering"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
Data Setup
We need to load the MNIST dataset from OpenML. We won’t be loading it as a Pandas dataframe; instead we’ll use the dictionary / ndarray representation.
# Load the MNIST dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist['data']
y = mnist['target'].astype(np.uint8)
We’re going to subsample the digits in the dataset, choosing a random set of digit classes. We won’t even know how many different digits we chose; it will be somewhere between 4 and 8!
# Generating the random set of digit labels we will use to extract samples from
# the MNIST handwritten digit dataset.
digits = rnd_gen.choice(np.arange(10),                             # Digit possibilities
                        int(np.round(rnd_gen.uniform(3.5, 8.5))),  # Number of digits to use
                        replace=False)                             # Can't repeat digits
We will learn on a total of 8000 digits, evenly distributed amongst the randomly selected digits in the dataset. Let’s store the number of samples to be taken from each class.
# Let's find a round number of digits to extract for each digit
num_samples = np.round(8000 / len(digits) + 1).astype(int)
With that out of the way, let’s generate the dataset!
# Placeholder Vars
sub_X = None
sub_y = None
# Looping through digit types
for digit in digits:
    # Find the indices where the target is the digit of interest
    y_idxs = y == digit
    # rnd_gen.choice draws num_samples indices from the samples available for
    # this digit; the matching truth array is simply filled with the current digit
    X_subset = X[y_idxs][rnd_gen.choice(np.arange(y_idxs.sum()), (num_samples,))]
    y_subset = np.full(X_subset.shape[0], digit)
    if sub_X is None:
        sub_X = X_subset
        sub_y = y_subset
    else:
        sub_X = np.vstack([sub_X, X_subset])
        sub_y = np.hstack((sub_y, y_subset))
# Shuffling the dataset, also limiting the number of digits to 8000 so we can't
# cheat and tell how many digits there are by looking at the length of the array
shuffler = rnd_gen.permutation(len(sub_X))
sub_X = sub_X[shuffler][:8000]
sub_y = sub_y[shuffler][:8000]
We now have a set of 8000 random samples that we know belong to somewhere between 4 and 8 clusters 😃
Can we divide them into these groups without knowing the labels beforehand?
Warning: Don’t expect near perfect results this time.
Clustering with KMeans
The first thing we need to do is import the KMeans model from Scikit-Learn. Let’s go ahead and do so.
4.3.1. Q1) Import the KMeans model from Scikit-Learn.#
Hint: Here is the documentation for the KMeans implementation in sklearn.
# Import KMeans class from scikit-learn here
from _____._____ import _____
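If you get stuck, one completion that works, since KMeans is implemented in Scikit-Learn’s cluster module:
# KMeans lives in sklearn.cluster
from sklearn.cluster import KMeans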
We don’t know how many clusters we should use to split the data. Our first instinct would be to use as many clusters as the number of digits we have, but even that is not necessarily optimal. Why don’t we try every \(k\) between 2 and 20?
To do so, we’ll need to begin by training a KMeans algorithm for each value of \(k\) we’re interested in.
How long does it take to train a single KMeans model with 10 initial centroid settings? This will give us an idea as to whether it may be a good idea to apply a dimensionality reduction algorithm before fitting our models.
4.3.2. Q2) Import Python’s time library and measure how long it takes to train a single KMeans model with 3 clusters on the raw data subset.#
Hint 1: Here is the documentation for the function used to get timestamps
# ---------------------------------------------------------------------------------------------------------------
# Import time to time our training process
# ---------------------------------------------------------------------------------------------------------------
import ______
# ---------------------------------------------------------------------------------------------------------------
# Here we will time the KMeans algorithm, much as we did in the first notebook.
# ---------------------------------------------------------------------------------------------------------------
t0 = _____._____()
# --------------------------------------------------------------------------------------------------------------------
# We will now train the KMeans model with 3 clusters and rnd_seed for random_state. Time the training process as well.
# --------------------------------------------------------------------------------------------------------------------
kmeans_test = KMeans(n_clusters=_____,    # Number of clusters to split into
                     random_state=_____)  # Random seed
kmeans_test.fit(_____)                    # Fitting to data subset
t1 = _____._____()
print(f"Training took {(____ - ____):.2f}s")
This is a bit longer than we’d like (3.42 seconds when we tested this notebook). Training many of these models means we could be sitting around doing nothing, possibly for several minutes. Who has time for this?
Let’s reduce the dataset using PCA, capturing 95% of the variability in the data.
4.3.3. Q3) Import PCA from Scikit-Learn and reduce the dimensionality of our input data. 95% of the variance in the data should be captured.#
Hint 1: Here is the documentation for PCA.
Hint 2: .fit_transform() will be very useful
# --------------------------------------------------------------------------------------------------------------------
# Import PCA class
# --------------------------------------------------------------------------------------------------------------------
from sklearn._____ import _____
# --------------------------------------------------------------------------------------------------------------------
# Instantiate PCA, only retain enough components for 95% of variance in data
# --------------------------------------------------------------------------------------------------------------------
pca = _____(_____)
# --------------------------------------------------------------------------------------------------------------------
# Fit model and transform the sub_X dataset
# --------------------------------------------------------------------------------------------------------------------
reduced_X = pca._____(_____)
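One possible completion: passing a float between 0 and 1 as n_components tells PCA to keep enough components to explain that fraction of the variance.
# PCA lives in sklearn.decomposition
from sklearn.decomposition import PCA

# Keep enough components to capture 95% of the variance
pca = PCA(n_components=0.95)

# Fit the model and project sub_X onto the reduced component space
reduced_X = pca.fit_transform(sub_X)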
Let’s try training a KMeans model on the reduced dataset and see if our training time improved…
4.3.4. Q5) Repeat Q2 using the reduced dataset#
Reminder: We’re still splitting the data into 3 clusters
# --------------------------------------------------------------------------------------------------------------------
# We will now train the KMeans model with 3 clusters and rnd_seed for random_state. Time the training process as well.
# The kmeans algorithm should now fit to the reduced_X subset
# --------------------------------------------------------------------------------------------------------------------
t0 = _____._____()
kmeans_test = KMeans(n_clusters=_____,    # Number of clusters to split into
                     random_state=_____)  # Random seed
kmeans_test.fit(_____)                    # Fitting to reduced data subset
t1 = _____._____()
print(f"Training took {(____ - ____):.2f}s")
4.3.5. Q6) Train a KMeans model for each \(2 \le k \le 20\)#
Hint 1: Set up a range using Python’s range function or numpy’s arange
Hint 2: You can store each trained model by appending it to a list as you iterate
# --------------------------------------------------------------------------------------------------------------------
# Create a list of numbers from 2 to 20 with the range() function
# --------------------------------------------------------------------------------------------------------------------
k_list = _____
# --------------------------------------------------------------------------------------------------------------------
# Create an empty list to store the fitted models with different k values
# --------------------------------------------------------------------------------------------------------------------
kmeans_models = _____
# -----------------------------------------------------------------------------------------------------------------------------------
# Use a for loop to train and store different kmeans models. What you fill in here should be similar to what you just did from Q1-Q5.
# ------------------------------------------------------------------------------------------------------------------------------------
t0 = time.time() # Get a timestamp to keep track of time
for k in k_list:
    # Print out a statement stating which k value you are working on
    t1 = time.time()  # Get a current timestamp
    print(f"\r Currently working on k={_____}, elapsed time: {(t1 - t0):.2f}s", end="")
    kmeans = ________(_____=k,         # Set the number of clusters
                      _____=rnd_seed)  # Set the random state
    kmeans._____(_____)                # Fit the model to the reduced data subset
    kmeans_models._____(_____)         # Store the model trained to predict k clusters
# -----------------------------------------------------------------------------------------------------------------------------------
# We spent around 56.5s to train the models
# ------------------------------------------------------------------------------------------------------------------------------------
print(f"\r Finished training the models! It took: {(time.time() - t0):.2f}s")
You should hopefully remember the silhouette score and inertia metrics from the reading. As a quick refresher: a sample’s silhouette coefficient is \((b - a)/\max(a, b)\), where \(a\) is its mean distance to the other members of its own cluster and \(b\) is its mean distance to the members of the nearest other cluster, while a model’s inertia is the sum of squared distances from each sample to its closest centroid. Let’s go ahead and make a \(k\) vs silhouette score plot and a \(k\) vs inertia plot.
4.3.6. Q7) Import the silhouette score metric and generate a silhouette score value for each model we trained#
Hint 1: Here is the documentation for the silhouette score implementation in Scikit-Learn
Hint 2: The silhouette score needs the model input data and the model labels as arguments
Hint 3: The model labels are stored as an attribute in each model. Check the list of available attributes in the sklearn KMeans documentation.
# --------------------------------------------------------------------------------------------------------------------
# Import silhouette_scores metric
# --------------------------------------------------------------------------------------------------------------------
from sklearn.________ import ______
# --------------------------------------------------------------------------------------------------------------------
# Use list comprehension to store the silhouette score for all trained models (check the documentaion if you need help
# setting up in arguments as specified in Hint 2 and 3)
# --------------------------------------------------------------------------------------------------------------------
silhouette_scores = [silhouette_score(_______, model._______)
                     for model in _____]
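One possible completion; the cluster assignments for the training data are stored in each fitted model’s labels_ attribute:
from sklearn.metrics import silhouette_score

# Score each model on the same reduced data it was fitted to
silhouette_scores = [silhouette_score(reduced_X, model.labels_)
                     for model in kmeans_models]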
4.3.7. Q8) Plot comparing \(k\) vs Silhouette Score. Highlight the maximum score#
Hint 1: You’ll need to find the position of the best score in the silhouette score list. Here is the documentation to a numpy function that would be very useful for this.
Hint 2: matplotlib’s pyplot has been imported as plt. Here is the documentation to the subplots() method.
Hint 3: Here is the documentation to the plot() method in matplotlib’s pyplot. Note that it is also implemented as a method in the axes objects created with plt.subplots()
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Find the index of model with the highest score. We recommend converting the silhouette_scores list to numpy array to use the function in Hint 1
# --------------------------------------------------------------------------------------------------------------------------------------------------
best_index = __.______(__._____(______))
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Find out which k value the 'best_index' you just got corresponds to.
# --------------------------------------------------------------------------------------------------------------------------------------------------
best_k = _____[best_index] # Get the best K value, per the silhouette score
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Store the silhouette score corresponding to 'best_index' here.
# --------------------------------------------------------------------------------------------------------------------------------------------------
best_score = silhouette_scores[best_index]
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Create a k-silhouette score diagram (horizontal axis: K values, vertical axis: silhouette score)
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Make a figure with size (12,4)
fig, ax = ___.______(____=____)
ax.plot(_______, ______, "bo-")
ax.set_xlabel("$k$", fontsize=14)
ax.set_ylabel("Silhouette score", fontsize=14)
ax.plot(best_k, best_score, "rs")
plt.show()
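One possible completion, assuming the k_list and silhouette_scores from the previous questions; np.argmax() (the numpy function from Hint 1) returns the index of the largest score:
best_index = np.argmax(np.asarray(silhouette_scores))
best_k = k_list[best_index]  # Get the best k value, per the silhouette score
best_score = silhouette_scores[best_index]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(k_list, silhouette_scores, "bo-")
ax.set_xlabel("$k$", fontsize=14)
ax.set_ylabel("Silhouette score", fontsize=14)
ax.plot(best_k, best_score, "rs")
plt.show()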
We got this figure when we ran the code. Does your figure look the same?
Now we’ll plot the value of \(k\) vs inertia
4.3.8. Q9) Plot comparing \(k\) vs Inertia. Highlight the inertia for the model with the highest silhouette score#
Hint 1: If you followed the previous step as it was written, you have the index for the best model stored in best_index.
Hint 2: The KMeans documentation details the attribute in which the model’s inertia is stored.
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Iterate through the list of kmeans models and store the model inertias
# --------------------------------------------------------------------------------------------------------------------------------------------------
inertias = [model._____ for model in _____]
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Get the inertia for the model with the highest silhouette score
# --------------------------------------------------------------------------------------------------------------------------------------------------
best_inertia = inertias[____]
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Create a k-inertia diagram (horizontal axis: k values, vertical axis: inertia)
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Make a figure with size (12,4)
fig, ax = ___.___(____=____)
ax.plot(_______, ______, "bo-")
ax.set_xlabel("$k$", fontsize=14)
ax.set_ylabel("Inertia", fontsize=14)
ax.plot(best_k, best_inertia, "rs")
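One possible completion; each fitted model exposes its inertia through the inertia_ attribute:
inertias = [model.inertia_ for model in kmeans_models]

# The inertia of the model with the highest silhouette score
best_inertia = inertias[best_index]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(k_list, inertias, "bo-")
ax.set_xlabel("$k$", fontsize=14)
ax.set_ylabel("Inertia", fontsize=14)
ax.plot(best_k, best_inertia, "rs")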
You should see a figure similar to the one below.
If you ran the notebook with the default random seed, you may be surprised to see that the best-performing KMeans model is the one that splits the data into just two clusters! The next best two are those associated with \(k=4\) and \(k=5\). Since we have the actual labels, let’s make some plots to try to make sense of the results.
There aren’t any more questions to answer from here on out, but you will have to change the code if you started out from a different random seed! I’ll try to be good about pointing out what code you’ll need to change.
Let’s begin by using Scikit-Learn’s TSNE implementation to reduce the dimensionality of the dataset for plotting. This will take a bit of computation time!
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2,         # We'll project onto a 2D plane
            random_state=rnd_seed,  # We need a random seed
            learning_rate='auto')   # Let the algorithm handle the learning rate
# And now get the input data in 2-component reduced form
X_plot = tsne.fit_transform(sub_X)
Let’s continue by making a list of the three best models.
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Make an empty list to store the three best models
# --------------------------------------------------------------------------------------------------------------------------------------------------
best_models = []
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Use the best_index to get the best model
# --------------------------------------------------------------------------------------------------------------------------------------------------
best_models.append(kmeans_models[best_index])
# --------------------------------------------------------------------------------------------------------------------------------------------------
# Manually retrieve 2nd and 3rd best models
# First use np.argsort() to sort all silhouette scores. The second to last and third to last indices correspond to the second and third best models.
# --------------------------------------------------------------------------------------------------------------------------------------------------
second_index,third_index = __.______(np.asarray(___________))[__], __.______(np.asarray(________))[__]
best_models.append(kmeans_models[second_index])
best_models.append(kmeans_models[third_index])
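For reference, the blanked line can be completed like this; np.argsort() sorts the scores in ascending order, so the last entries of the sorted index array point at the highest scores:
second_index, third_index = (np.argsort(np.asarray(silhouette_scores))[-2],
                             np.argsort(np.asarray(silhouette_scores))[-3])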
And now, let’s get a set of predictions for each model and store it in a list!
pred_labels = []
for model in best_models:
    pred_labels.append(model.predict(reduced_X))
Pandas will make producing a nice plot a lot simpler. Let’s import it and make a dataframe with the reduced input components, the truth labels, and the predicted cluster labels.
Note that the predicted labels don’t correspond to the digit labels since this is an unsupervised model!
import pandas as pd
plot_data = np.stack([X_plot[:,0], X_plot[:,1], sub_y, *pred_labels],axis=1)
df = pd.DataFrame(plot_data, columns=['X1','X2','truth','pred_1','pred_2','pred_3'])
And now we’ll make a nice, big 2×2 grid of plots that allows us to see the true labels and how our algorithm clustered our data!
fig, axes = plt.subplots(2, 2, figsize=(16,16))

groups = df.groupby('truth')
for label, group in groups:
    axes[0,0].plot(group.X1, group.X2, marker='o', linestyle='', markersize=4, label=int(label))
axes[0,0].legend(fontsize=12)
axes[0,0].set_title('Truth', fontsize=16)
axes[0,0].axis('off')

groups = df.groupby('pred_1')
for label, group in groups:
    axes[0,1].plot(group.X1, group.X2, marker='o', linestyle='', markersize=4)
axes[0,1].set_title('"Best" Clustering', fontsize=16)
axes[0,1].axis('off')

groups = df.groupby('pred_2')
for label, group in groups:
    axes[1,0].plot(group.X1, group.X2, marker='o', linestyle='', markersize=4)
axes[1,0].set_title('$2^{nd}$ Best Clustering', fontsize=16)
axes[1,0].axis('off')

groups = df.groupby('pred_3')
for label, group in groups:
    axes[1,1].plot(group.X1, group.X2, marker='o', linestyle='', markersize=4)
axes[1,1].set_title('$3^{rd}$ Best Clustering', fontsize=16)
axes[1,1].axis('off')
Our figure looks like this.
Assuming you started out from the intended random seed, you’ll be able to see that the “Best” model is trying to separate the 0 digits from the non-zero digits. The \(2^{nd}\) best model is able to separate the digits pretty well, but lumps 4s and 9s into a single cluster!
The \(3^{rd}\) best model begins to group digits into clusters that may not have much significance to us at first glance; its metrics indicate worse performance, and its results would require further analysis to understand.