
1.14. (Exercises) Multivariate linear regression and clustering

What if our dataset has multiple dimensions and we want to find an equation that looks something like this? $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + b$

Here we show how to use scikit-learn to find a linear equation that describes the Kaggle Advertising dataset.
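To make the equation above concrete, here is a minimal sketch of what fitting the weights and the bias means. It uses NumPy's least-squares solver instead of scikit-learn (which the exercise asks you to use), and the data, `w_true`, and `b_true` are made up for illustration; they are not taken from the advertising dataset.

```python
import numpy as np

# Hypothetical, noise-free data: y = 2*x1 - 1*x2 + 0.5*x3 + 4
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
b_true = 4.0
y_demo = X_demo @ w_true + b_true

# Append a column of ones so the bias b is fitted as one more weight
A = np.column_stack([X_demo, np.ones(len(X_demo))])

# Solve the least-squares problem  min ||A @ solution - y||^2
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w_fit, b_fit = solution[:3], solution[3]

print('weights:', w_fit)
print('bias:', b_fit)
```

Because the toy data contains no noise, the fitted weights and bias should match the values used to generate it, up to floating-point error.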

1.14.1. Exercise 1: Multivariate linear regression

1.14.1.1. Q1: Use pandas to import the advertising dataset

import numpy as np
import matplotlib.pyplot as plt
import pooch
import urllib.request
import pandas as pd
myadvertising = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/EeqnI6nF9iBAkUOACbZ3mWUBDZ8N5mVP1oOaFd4vy6tIzw?download=1',
                          known_hash='69104adc017e75d7019f61fe66ca2eb4ab014ee6f2a9b39b452943f209352010')
# Q1a: Use pandas to import the dataset
__ = pd.__(_,_)
# Q1b: Display the first rows of the data with pandas
__.__()

1.14.1.2. Q2: Use a scikit-learn linear regression model to predict “Sales” from the three columns “TV”, “Radio”, and “Newspaper”

This model tells us how sales would change if we put resources into advertising products in three different media.

Hint

Check out the documentation for the linear-model module in scikit-learn before you train the model

from sklearn.____ import ____

# Construct Input / Output matrices
Xall = __[[_,_,_]].values
y = _[].values
linreg = ___
linreg.__(_,_)

1.14.1.3. Q3: Print out the linear equation coefficients and intercept

Hints:

(1) When you print the coefficients and intercept, keep only two decimal places. One way to do this is with the .round() function in NumPy.

(2) Check out the scikit-learn linear-model module for instructions on extracting the coefficients and intercepts of the trained model.
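As a small illustration of hint (1), the sketch below rounds an array and prints it inside an f-string. The numbers are hypothetical placeholders, not the coefficients of the trained model.

```python
import numpy as np

# Hypothetical values standing in for fitted parameters
values = np.array([0.045764, 0.188530, -0.001037])
offset = 2.938889

# np.round(x, 2) keeps two decimal places; arrays also have a .round(2) method
rounded_values = np.round(values, 2)
rounded_offset = np.round(offset, 2)

print(f'Values {rounded_values}, Offset {rounded_offset}')
```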

# Print your coefficients and intercept here.
print(f'Coefficients {}, Intercept {}' )

1.14.2. Exercise 2: Clustering Penguin Dataset

In this exercise, we repeat the k-means clustering procedure introduced in the tutorial, but on a different 2D variable plane.

Let’s try “culmen_length_mm” and “flipper_length_mm”.

Can we tell penguins apart by the lengths of their beaks and their flippers (wings)?

Figure: 20 points and their Voronoi cells, by Balu Ertl (CC BY-SA 4.0).
penguinsize = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA?download=1',
                          known_hash='aa728597b2228a2637e39c6f08e40a80971f4cdac7faf7bc21ff4481ee3e3ae9')

penguins = pd.read_csv(penguinsize)
print(penguins.head())
Downloading data from 'https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA?download=1' to file '/root/.cache/pooch/15990ae8be04e5655e98ecb908600619-ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA'.
  species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen              39.1             18.7              181.0   
1  Adelie  Torgersen              39.5             17.4              186.0   
2  Adelie  Torgersen              40.3             18.0              195.0   
3  Adelie  Torgersen               NaN              NaN                NaN   
4  Adelie  Torgersen              36.7             19.3              193.0   

   body_mass_g     sex  
0       3750.0    MALE  
1       3800.0  FEMALE  
2       3250.0  FEMALE  
3          NaN     NaN  
4       3450.0  FEMALE  

1.14.2.1. Q1: Data cleanup. Remove all rows that contain missing values

Hint: pandas has a convenient method for data cleanup. Check the documentation for details.
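To see what "removing rows with missing values" does, here is a sketch on a made-up table. It deliberately uses an explicit boolean mask and not the one-line convenience method the hint refers to; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table with missing entries (NaN / None) in the second and third rows
toy = pd.DataFrame({
    'length_mm': [39.1, np.nan, 40.3],
    'mass_g': [3750.0, 3800.0, np.nan],
    'sex': ['MALE', 'FEMALE', None],
})

# True for every row that has at least one missing value
has_missing = toy.isna().any(axis=1)

# Keep only the complete rows
clean = toy[~has_missing]

print(clean)
print(clean.shape)
```

Only the first row survives, because it is the only one without a missing entry in any column.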

penguin_df = penguins.___________

1.14.2.2. Q2: Create an input dataset X with the culmen_length_mm and flipper_length_mm data columns

Hints:

(1) The shape of your input data should be (334, 2)
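The shape check in the hint can be sketched on a toy table. The column names `a`, `b`, and `c` are hypothetical; the point is only that selecting two columns and taking `.values` yields an array of shape (number of rows, 2).

```python
import pandas as pd

# Toy table with three rows and three columns
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [4.0, 5.0, 6.0],
    'c': ['x', 'y', 'z'],
})

# Select two numeric columns and convert them to a NumPy array
X_toy = toy[['a', 'b']].values

print(X_toy.shape)
```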

# Create your input for model training here
# Input should contain penguin_df['culmen_length_mm'] and penguin_df['flipper_length_mm']
X = _____________________________________________

1.14.2.3. Q3: Train a k-means clustering algorithm, perform elbow test and silhouette analysis

Hints:

(1) The documentation for KMeans clustering can be found here

(2) Documentation for silhouette score analysis in scikit-learn
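Before running the elbow test, it can help to check what the inertia actually measures: the sum of squared distances from each point to the centre of its assigned cluster. The sketch below verifies this on synthetic blobs, not on the penguin data. The data and the choices `n_init=10` and `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with three groups (not the penguin dataset)
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Centre assigned to each point, then the sum of squared distances to it
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why a smaller inertia indicates tighter clusters.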

# Import KMeans from scikit-learn
from sklearn.________ import ______
# Import silhouette score from scikit-learn
from sklearn._______ import ________________
# Store the K-means inertia in an empty list
_______________________ = []
_ = ___________
for ___________ in ______ :
 ________ = _______(n_clusters=____________)
 _______.___(X)
 ________________._______(___________)
# To finish the elbow method analysis, plot how the inertia changes with the number of clusters used to train the k-means clustering algorithm.
plt.plot(____,__________,marker='s',c='k',lw=2)
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Distances / Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

This is what your TA got after filling in and running the code above.

plot1.png

# Import silhouette_score for analysis
from sklearn._________ import __________________
# Perform silhouette analysis following instructions in the tutorial notebook
______________ = []
for num_clusters in range(2,10):
  # initialise kmeans
  ______ = _______(___________)
  ______.fit(X)
  cluster_labels = _______.__________

  # silhouette score
  _____________.append(silhouette_score(_, ________________))
# Plot your silhouette analysis result here.

Here is a screenshot of what the results should look like.

plot2.png

Based on the silhouette and elbow analysis, it seems that either 2 or 3 clusters are suitable for this penguin dataset.
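For contrast, here is a sketch of how the silhouette score singles out one value of k when the groups are clearly separated. The data is synthetic, with three well-separated blob centres chosen for illustration, so the outcome is less ambiguous than for the penguins.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups (hypothetical centres)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest score is the preferred choice
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On real data such as the penguin measurements, neighbouring values of k can score almost equally well, which is why the elbow plot is worth reading alongside the silhouette scores.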

1.14.2.4. Q4: Perform KMeans clustering with 3 clusters

# Train a k-means clustering model
kmeans = _______(n_clusters=3)
kmeans._____(_)

# Use the model to label all data points
_______ = kmeans._______(_)

Our penguin dataset already comes with species labels, so we can compare our clustering results with the truth to see how well our trained model performed.

First, store the predicted labels in the pandas dataframe.

penguin_df[_______] = _____________

Then extract the culmen_length_mm and flipper_length_mm of the three penguin categories as truth.

# Adelie

# Gentoo

# Chinstrap
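The extraction step can be sketched on a toy table with a boolean mask. The category names `A`, `B`, `C` and the columns `x`, `y` are hypothetical stand-ins for the species and the two length columns.

```python
import pandas as pd

# Toy table: one categorical column and two numeric columns
toy = pd.DataFrame({
    'species': ['A', 'B', 'A', 'C'],
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [10.0, 20.0, 30.0, 40.0],
})

# Keep only the rows belonging to category 'A'
group_a = toy[toy['species'] == 'A']

# Pull out the two numeric columns as NumPy arrays
a_x = group_a['x'].values
a_y = group_a['y'].values

print(a_x, a_y)
```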

Recreate the figure below

plot3.png

#
plt.scatter(_____, _________,c=_______,s=50,marker=___,label='Adelie', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=___,vmax=___)
plt.scatter(_____, _________,c=_______,s=30,marker=___,label='Gentoo', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=___,vmax=___)
plt.scatter(_____, _________,c=_______,s=70,marker=___,label='Chinstrap', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=____,vmax=____)
plt.______(__________)
plt.ylabel(___________)
plt.xlabel(___________)
plt._________
plt.show()
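Besides the figure, one possible way to compare cluster labels with the true categories is a contingency table. The sketch below uses `pd.crosstab` on made-up labels; the `species` and `cluster` values are hypothetical and do not come from the penguin results.

```python
import pandas as pd

# Toy example: true category and predicted cluster for six samples
toy = pd.DataFrame({
    'species': ['A', 'A', 'B', 'B', 'C', 'C'],
    'cluster': [0, 0, 1, 1, 1, 2],
})

# Rows: true categories, columns: cluster labels, cells: counts
table = pd.crosstab(toy['species'], toy['cluster'])
print(table)
```

A category whose samples fall mostly into a single column is well captured by one cluster, while a row spread over several columns indicates samples that the clustering mixes with other groups. Cluster numbers are arbitrary, so only the pattern of counts matters.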

1.14.2.5. Q5: Train the KMeans clustering with 2 clusters

# Train k-means clustering algorithm here

# Predict labels

We got this figure with our model. Can you recreate it?

plot4.png