1.14. (Exercises) Multivariate linear regression and clustering
What if our dataset has multiple dimensions and we want to find an equation that looks something like this? $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + b$
Here we show how to do this with scikit-learn by finding a linear equation that describes the Kaggle Advertising dataset.
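To make the target equation concrete, here is a minimal sketch that recovers known weights from synthetic data. It deliberately uses NumPy's least-squares solver rather than scikit-learn, so it does not replace the exercise below. The data, `true_w`, and `true_b` are made-up values for illustration only.

```python
import numpy as np

# Synthetic, noise-free data (assumption: 200 samples, 3 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])   # illustrative weights w_1, w_2, w_3
true_b = 3.0                          # illustrative intercept b
y = X @ true_w + true_b

# Append a column of ones so the intercept is fitted as a fourth weight
A = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = solution[:3], solution[3]

print('weights:', np.round(w, 2), 'intercept:', np.round(b, 2))
```

Because the synthetic data contain no noise, the fitted weights match the chosen ones up to floating-point error. Real data such as the advertising table will not fit exactly.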
1.14.1. Exercise 1: Multivariate linear regression
1.14.1.1. Q1: Use pandas to import the advertising dataset
import numpy as np
import matplotlib.pyplot as plt
import pooch
import urllib.request
import pandas as pd
myadvertising = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/EeqnI6nF9iBAkUOACbZ3mWUBDZ8N5mVP1oOaFd4vy6tIzw?download=1',
known_hash='69104adc017e75d7019f61fe66ca2eb4ab014ee6f2a9b39b452943f209352010')
# Q1a: Use pandas to import the dataset
__ = pd.__(_,_)
# Q1b: Display the first rows of the data with pandas
__.__()
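The `known_hash` argument in the download call above lets pooch check that the downloaded file is the expected one. By default pooch treats such a string as a SHA256 checksum. As a rough illustration of what that check involves, the sketch below computes a SHA256 digest with the standard library. The helper name `sha256_of_file` and the temporary stand-in file are assumptions for this example, not part of the exercise.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in file (assumption): a one-line CSV header, not the real dataset
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write(b'TV,Radio,Newspaper,Sales\n')
    tmp_path = tmp.name

file_hash = sha256_of_file(tmp_path)
os.remove(tmp_path)
print(file_hash)
```

Comparing such a digest with the expected string is a quick way to detect a corrupted or changed download.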
1.14.1.2. Q2: Use a scikit-learn linear regression model to predict “Sales” from the three columns “TV”, “Radio”, and “Newspaper”
This model will tell us how sales would change if we put resources into advertising products in three different media.
Hint
Check out the documentation for the linear-model module in scikit-learn before you train the model.
from sklearn.____ import ____
# Construct Input / Output matrices
Xall = __[[_,_,_]].values
y = _[].values
linreg = ___
linreg.__(_,_)
1.14.1.3. Q3: Print out the linear equation coefficients and intercept
Hints:
(1) When you print the coefficients and intercept, keep only two decimal places. One way to do this is with the .round() function in numpy.
(2) Check out the scikit-learn linear-model module for instructions on extracting the coefficients and intercept of the trained model.
# Print your coefficients and intercept here.
print(f'Coefficients {}, Intercept {}' )
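The rounding hint can be tried out in isolation before touching the trained model. The sketch below rounds a made-up array and a made-up scalar to two decimal places and embeds them in an f-string. The numbers and the names `example_coefs` and `example_intercept` are arbitrary placeholders, not results from the advertising dataset.

```python
import numpy as np

# Arbitrary placeholder values (assumption), NOT fitted coefficients
example_coefs = np.array([1.23456, -0.98765, 0.00321])
example_intercept = 4.56789

# ndarray.round(2) rounds every element to two decimal places;
# the built-in round() does the same for a plain Python float
rounded_coefs = example_coefs.round(2)
rounded_intercept = round(example_intercept, 2)

message = f'Coefficients {rounded_coefs}, Intercept {rounded_intercept}'
print(message)
```

The same pattern applies once the placeholders are swapped for the values extracted from your trained model.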
1.14.2. Exercise 2: Clustering Penguin Dataset
In this exercise, we will repeat the k-means clustering procedure introduced in the tutorial, but on a different 2D variable plane.
Let’s try “culmen_length_mm” and “flipper_length_mm”.
Can we differentiate penguins by the lengths of their beaks and flippers?
penguinsize = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA?download=1',
known_hash='aa728597b2228a2637e39c6f08e40a80971f4cdac7faf7bc21ff4481ee3e3ae9')
penguins = pd.read_csv(penguinsize)
print(penguins.head())
  species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen              39.1             18.7              181.0       3750.0    MALE
1  Adelie  Torgersen              39.5             17.4              186.0       3800.0  FEMALE
2  Adelie  Torgersen              40.3             18.0              195.0       3250.0  FEMALE
3  Adelie  Torgersen               NaN              NaN                NaN          NaN     NaN
4  Adelie  Torgersen              36.7             19.3              193.0       3450.0  FEMALE
1.14.2.1. Q1: Data cleanup. Remove all rows in the table that contain missing values
Hint: pandas has an easy function for data cleanup. Check out the documentation for details.
penguin_df = penguins.___________
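Before and after cleaning, it can help to count how many values are missing, so you can confirm that the cleanup did what you expected. The sketch below does this on a small hand-made DataFrame. The three rows are invented for illustration and only borrow column names from the penguin table.

```python
import numpy as np
import pandas as pd

# Hand-made toy table (assumption): one row is entirely missing
toy = pd.DataFrame({
    'culmen_length_mm': [39.1, np.nan, 40.3],
    'flipper_length_mm': [181.0, np.nan, 195.0],
    'sex': ['MALE', None, 'FEMALE'],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing entry
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print('rows with missing values:', rows_with_missing)
```

Running the same two checks on `penguins` and then on `penguin_df` should show the second count dropping to zero after a successful cleanup.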
1.14.2.2. Q2: Create an input dataset X with the culmen_length_mm and flipper_length_mm data columns
Hints:
(1) The shape of your input data should be (334, 2)
# Create your input for model training here
# Input should contain penguin_df['culmen_length_mm'] and penguin_df['flipper_length_mm']
X = _____________________________________________
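The shape hint is worth checking explicitly, since clustering estimators in scikit-learn expect a 2D array with one row per sample and one column per feature. As a hedged sketch of one possible approach, the example below stacks two columns of a toy DataFrame side by side and verifies the shape. The four rows are invented, and `np.column_stack` is only one of several ways to build such an array.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned penguin table (assumption: 4 invented rows)
toy_df = pd.DataFrame({
    'culmen_length_mm': [39.1, 39.5, 40.3, 36.7],
    'flipper_length_mm': [181.0, 186.0, 195.0, 193.0],
    'body_mass_g': [3750.0, 3800.0, 3250.0, 3450.0],
})

# Stack two 1D columns into an array of shape (n_samples, 2)
X_toy = np.column_stack([toy_df['culmen_length_mm'],
                         toy_df['flipper_length_mm']])

print(X_toy.shape)
```

With the real data, the same check should report the shape given in the hint above.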
1.14.2.3. Q3: Train a k-means clustering algorithm, then perform the elbow test and silhouette analysis
Hints:
(1) See the scikit-learn documentation for KMeans clustering.
(2) See the scikit-learn documentation for silhouette score analysis.
# Import KMeans from scikit-learn
from sklearn.________ import ______
# Import silhouette score from scikit-learn
from sklearn._______ import ________________
# Store the K-means inertia in an empty list
_______________________ = []
_ = ___________
for ___________ in ______:
    ________ = _______(n_clusters=____________)
    _______.___(X)
    ________________._______(___________)
# To finish the elbow method analysis, plot how the inertia changes with the number of clusters used to train the k-means clustering algorithm.
plt.plot(____,__________,marker='s',c='k',lw=2)
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Distances / Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
This is what your TA got after filling in and running the code above.
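The quantity on the y-axis of the elbow plot, the inertia, is the sum of squared distances between each sample and its closest cluster center. The sketch below computes it by hand with NumPy for four invented points and two invented centers, so the arithmetic is easy to follow. It illustrates the definition only and is not how you should fill in the blanks above.

```python
import numpy as np

# Invented toy data (assumption): two tight pairs of points
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance from every point to every centroid, shape (4, 2)
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)

# Each point is assigned to its nearest centroid
nearest = sq_dists.argmin(axis=1)

# Inertia: sum of the squared distances to the assigned centroids
inertia = sq_dists.min(axis=1).sum()

print(nearest, inertia)
```

Each point lies exactly one unit from its center, so every squared distance is 1 and the inertia is 4. Adding more centers can only keep this sum the same or lower it, which is why the elbow plot looks for the point where the decrease levels off rather than for the minimum.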
# Import silhouette_score for analysis
from sklearn._________ import __________________
# Perform silhouette analysis following instructions in the tutorial notebook
______________ = []
for num_clusters in range(2,10):
    # initialise kmeans
    ______ = _______(___________)
    ______.fit(X)
    cluster_labels = _______.__________
    # silhouette score
    _____________.append(silhouette_score(_, ________________))
# Plot your silhouette analysis result here.
Here is a screenshot of what the results should look like.
Based on the silhouette and elbow analyses, it seems that either 2 or 3 clusters are suitable for this penguin dataset.
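For reference, the silhouette coefficient of one sample is $(b - a) / \max(a, b)$, where $a$ is its mean distance to the other members of its own cluster and $b$ is its mean distance to the members of the nearest other cluster. The silhouette score is the average over all samples. The sketch below implements this definition directly in NumPy on four invented 1D points. The function name `silhouette_by_hand` is hypothetical, and the code assumes at least two clusters.

```python
import numpy as np


def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient, computed from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a single-member cluster gets a score of 0
        # a: mean distance to the other members of the same cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


X_toy = [[0.0], [1.0], [10.0], [11.0]]          # invented points
good = silhouette_by_hand(X_toy, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette_by_hand(X_toy, [0, 1, 0, 1])   # neighbours split apart
print(good, bad)
```

Grouping the neighbouring points together gives a score close to 1, while splitting them apart gives a negative score, which is the behaviour the silhouette plot relies on when comparing different numbers of clusters.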
1.14.2.4. Q4: Perform KMeans clustering with 3 clusters
# Train a k-means clustering model
kmeans = _______(n_clusters=3)
kmeans._____(_)
# Use the model to label all data points
_______ = kmeans._______(_)
Now our penguin dataset is labelled, so we can compare our clustering results with the truth to see how well our trained model performed.
First, store the predicted labels in the pandas DataFrame:
penguin_df[_______] = _____________
Then extract the culmen_length_mm and flipper_length_mm of the three penguin species as the truth:
# Adelie
# Gentoo
# Chinstrap
Recreate the figure below
#
plt.scatter(_____, _________,c=_______,s=50,marker=___,label='Adelie', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=___,vmax=___)
plt.scatter(_____, _________,c=_______,s=30,marker=___,label='Gentoo', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=___,vmax=___)
plt.scatter(_____, _________,c=_______,s=70,marker=___,label='Chinstrap', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=____,vmax=____)
plt.______(__________)
plt.ylabel(___________)
plt.xlabel(___________)
plt._________
plt.show()
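Besides the scatter plot, a contingency table gives a compact numerical view of how the predicted clusters line up with the true species. The sketch below builds one with `pd.crosstab` on six invented rows. The column name `cluster` and the label values are assumptions for illustration, not output from the trained model.

```python
import pandas as pd

# Invented toy labels (assumption): one Chinstrap row lands in cluster 0
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo',
                'Chinstrap', 'Chinstrap'],
    'cluster': [0, 0, 1, 1, 2, 0],
})

# Rows: true species; columns: predicted cluster; cells: sample counts
table = pd.crosstab(toy['species'], toy['cluster'])
print(table)
```

Keep in mind that k-means numbers its clusters arbitrarily, so a cluster index has no built-in link to a species. What matters is whether each species falls mostly into a single column.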
1.14.2.5. Q5: Train the KMeans clustering with 2 clusters
# Train k-means clustering algorithm here
# Predict labels
We got this figure with our model. Can you recreate it?