{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "W4_S1.ipynb",
"provenance": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# (Exercises) Multivariate linear regression and clustering\n",
"\n",
"What if our dataset has multiple dimensions and we want to find an equation that looks like this?\n",
"$$\n",
"y = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + b\n",
"$$\n",
"\n",
"Here we show how to use scikit-learn to find a linear equation that describes the Kaggle [Advertising dataset](https://www.kaggle.com/datasets/bumba5341/advertisingcsv?resource=download)."
],
"metadata": {
"id": "X-Vw_8lxnRiO"
}
},
{
"cell_type": "markdown",
"source": [
"## Exercise 1: Multivariate linear regression"
],
"metadata": {
"id": "RbHHP9LpVn1V"
}
},
{
"cell_type": "markdown",
"source": [
"### Q1: Use pandas to import the advertising dataset"
],
"metadata": {
"id": "ISJd4ihEnVTl"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pooch\n",
"import urllib.request\n",
"import pandas as pd"
],
"metadata": {
"id": "nLNGTIiSLjjO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"myadvertising = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/EeqnI6nF9iBAkUOACbZ3mWUBDZ8N5mVP1oOaFd4vy6tIzw?download=1',\n",
" known_hash='69104adc017e75d7019f61fe66ca2eb4ab014ee6f2a9b39b452943f209352010')"
],
"metadata": {
"id": "65S4_FN0LmtV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Q1a: Use pandas to import the dataset\n",
"__ = pd.__(_,_)\n",
"# Q1b: Display the first rows of the data with pandas\n",
"__.__()"
],
"metadata": {
"id": "wu4ncsMYnStF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Q2: Use scikit-learn's linear regression model to predict \"Sales\" from the three columns \"TV\", \"Radio\", \"Newspaper\"\n",
"\n",
"This model tells us how sales would change if we put resources into advertising products in three different media."
],
"metadata": {
"id": "kFlc8eGznatT"
}
},
{
"cell_type": "markdown",
"source": [
"```{hint}\n",
"Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for `LinearRegression` in scikit-learn's `linear_model` module before you train the model.\n",
"```"
],
"metadata": {
"id": "dxIgpGguyVgT"
}
},
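{
"cell_type": "markdown",
"source": [
"As a rough sketch of what the fit does (the symbols $w_{\\mathrm{TV}}$, $w_{\\mathrm{Radio}}$, $w_{\\mathrm{News}}$, $b$ and the row index $i$ are notation assumed here for illustration, not names from the dataset): for the three advertising columns, the general equation at the top of this notebook becomes\n",
"$$\n",
"\\widehat{\\mathrm{Sales}}_i = w_{\\mathrm{TV}}\\,\\mathrm{TV}_i + w_{\\mathrm{Radio}}\\,\\mathrm{Radio}_i + w_{\\mathrm{News}}\\,\\mathrm{Newspaper}_i + b\n",
"$$\n",
"An ordinary least squares fit, which is what scikit-learn's `LinearRegression` performs, chooses the weights and intercept that minimize the sum of squared residuals over the $n$ rows:\n",
"$$\n",
"\\min_{w,\\,b} \\sum_{i=1}^{n} \\left( \\mathrm{Sales}_i - \\widehat{\\mathrm{Sales}}_i \\right)^2\n",
"$$\n",
"Each weight can then be read as the change in predicted sales per unit increase in that medium's budget, with the other two budgets held fixed."
],
"metadata": {}
},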
{
"cell_type": "code",
"source": [
"from sklearn.____ import ____\n",
"\n",
"# Construct Input / Output matrices\n",
"Xall = __[[_,_,_]].values\n",
"y = _[].values\n",
"linreg = ___\n",
"linreg.__(_,_)"
],
"metadata": {
"id": "e5D0eDP-nfwQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Q3: Print out the linear equation coefficients and intercept\n",
"\n",
"Hints:\n",
"\n",
"(1) When you print the coefficients and intercept, round them to two decimal places. One way to do this is with the `.round()` function in `numpy`.\n",
"\n",
"(2) Check out the scikit-learn `linear_model` documentation for instructions on extracting the coefficients and intercept of the trained model.\n"
],
"metadata": {
"id": "7HVwwqUNniQg"
}
},
{
"cell_type": "code",
"source": [
"# Print your coefficients and intercept here.\n",
"print(f'Coefficients {}, Intercept {}' )"
],
"metadata": {
"id": "AeRFXcB0noMq"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Exercise 2: Clustering Penguin Dataset\n",
"\n",
"In this exercise, we will repeat the k-means clustering procedure introduced in the tutorial, but on a different 2D variable plane.\n",
"\n",
"Let's try \"culmen_length_mm\" and \"flipper_length_mm\".\n",
"\n",
"Can we tell penguin species apart from the lengths of their beaks and flippers?"
],
"metadata": {
"id": "rSAzg7l8GPR-"
}
},
{
"cell_type": "code",
"source": [
"penguinsize = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA?download=1',\n",
" known_hash='aa728597b2228a2637e39c6f08e40a80971f4cdac7faf7bc21ff4481ee3e3ae9')\n",
"\n",
"penguins = pd.read_csv(penguinsize)\n",
"print(penguins.head())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Bm-TBjiBGhEo",
"outputId": "e6b2a85e-5b0f-4f15-fe78-3af5214e6a12"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Downloading data from 'https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA?download=1' to file '/root/.cache/pooch/15990ae8be04e5655e98ecb908600619-ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA'.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
" species island culmen_length_mm culmen_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"3 Adelie Torgersen NaN NaN NaN \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 MALE \n",
"1 3800.0 FEMALE \n",
"2 3250.0 FEMALE \n",
"3 NaN NaN \n",
"4 3450.0 FEMALE \n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Q1: Data cleanup. Remove all rows in the table that contain missing values\n",
"\n",
"Hint: `pandas` has a convenient method for data cleanup. Check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) for details."
],
"metadata": {
"id": "A2pl7msL4k5x"
}
},
{
"cell_type": "code",
"source": [
"penguin_df = penguins.___________"
],
"metadata": {
"id": "sX2FHL_rGpmX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Q2: Create an input dataset `X` with the culmen_length_mm and flipper_length_mm data columns\n",
"\n",
"Hints:\n",
"\n",
"(1) The shape of your input data should be `(334, 2)`"
],
"metadata": {
"id": "LdITW6MVHG1B"
}
},
{
"cell_type": "code",
"source": [
"# Create your input for model training here\n",
"# Input should contain penguin_df['culmen_length_mm'] and penguin_df['flipper_length_mm']\n",
"X = _____________________________________________"
],
"metadata": {
"id": "jksxr7CVHaW1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Q3: Train a k-means clustering algorithm, then perform the elbow test and silhouette analysis\n",
"\n",
"Hints:\n",
"\n",
"(1) The documentation for KMeans clustering can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html?highlight=kmeans#sklearn.cluster.KMeans)\n",
"\n",
"(2) [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html?highlight=silhouette+score#sklearn.metrics.silhouette_score) for silhouette score analysis in scikit-learn"
],
"metadata": {
"id": "PgzlFwFX6Wdj"
}
},
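{
"cell_type": "markdown",
"source": [
"For reference, here is a brief sketch of the two quantities used in this question. The symbols $x_i$ (a data point), $\\mu_k$ (a cluster centroid), $a_i$ and $b_i$ are notation assumed here for illustration. The inertia that the elbow test tracks is the sum of squared distances from each point to its nearest centroid:\n",
"$$\n",
"\\mathrm{inertia} = \\sum_{i=1}^{n} \\min_{k} \\lVert x_i - \\mu_k \\rVert^2\n",
"$$\n",
"The silhouette coefficient of a single point compares $a_i$, its mean distance to the other points in its own cluster, with $b_i$, its mean distance to the points in the nearest other cluster:\n",
"$$\n",
"s_i = \\frac{b_i - a_i}{\\max(a_i, b_i)}\n",
"$$\n",
"`silhouette_score` returns the mean of $s_i$ over all points. Values lie between $-1$ and $1$, and values closer to $1$ indicate compact, well-separated clusters. Inertia always decreases as clusters are added, which is why we look for an elbow rather than a minimum."
],
"metadata": {}
},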
{
"cell_type": "code",
"source": [
"# Import KMeans from scikit-learn\n",
"from sklearn.________ import ______\n",
"# Import the silhouette score from scikit-learn\n",
"from sklearn._______ import ________________"
],
"metadata": {
"id": "gnsvLh4zG6td"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Store the K-means inertia in an empty list\n",
"_______________________ = []\n",
"_ = ___________\n",
"for ___________ in ______ :\n",
" ________ = _______(n_clusters=____________)\n",
" _______.___(X)\n",
" ________________._______(___________)\n"
],
"metadata": {
"id": "57a4Qq2Y9N4Z"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# To finish the elbow method analysis, plot how the inertia changes with the number of clusters used to train the k-means algorithm.\n",
"plt.plot(____,__________,marker='s',c='k',lw=2)\n",
"plt.xlabel('Number of Clusters')\n",
"plt.ylabel('Sum of Squared Distances / Inertia')\n",
"plt.title('Elbow Method for Optimal K')\n",
"plt.show()"
],
"metadata": {
"id": "srkZ0WmL9Dgc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This is what your TA got after filling in and running the code above."
],
"metadata": {
"id": "IyATmA40-y1h"
}
},
{
"cell_type": "markdown",
"source": [
"*(Figure not shown: elbow plot of inertia versus number of clusters.)*"
],
"metadata": {
"id": "j8OEHWC9-u7V"
}
},
{
"cell_type": "code",
"source": [
"# Import silhouette_score for analysis\n",
"from sklearn._________ import __________________"
],
"metadata": {
"id": "FNksFnW29ybf"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Perform silhouette analysis following instructions in the tutorial notebook\n",
"______________ = []\n",
"for num_clusters in range(2,10):\n",
" # initialise kmeans\n",
" ______ = _______(___________)\n",
" ______.fit(X)\n",
" cluster_labels = _______.__________\n",
"\n",
" # silhouette score\n",
" _____________.append(silhouette_score(_, ________________))"
],
"metadata": {
"id": "BIibccJv9_FF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Plot your silhouette analysis result here.\n"
],
"metadata": {
"id": "cgVj3GMO3oyD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Here is a screenshot of what the results should look like."
],
"metadata": {
"id": "od-I1rar_Bx6"
}
},
{
"cell_type": "markdown",
"source": [
"*(Figure not shown: silhouette score versus number of clusters.)*"
],
"metadata": {
"id": "yVBEJyjl-_k5"
}
},
{
"cell_type": "markdown",
"source": [
"Based on the silhouette and elbow analyses, it seems that either 2 or 3 clusters are suitable for this penguin dataset."
],
"metadata": {
"id": "6tKepyxgAiES"
}
},
{
"cell_type": "markdown",
"source": [
"### Q4: Perform KMeans clustering with 3 clusters"
],
"metadata": {
"id": "2gnggr9DI5Ww"
}
},
{
"cell_type": "code",
"source": [
"# Train a k-means clustering model\n",
"kmeans = _______(n_clusters=3)\n",
"kmeans._____(_)\n",
"\n",
"# Use the model to label all data points\n",
"_______ = kmeans._______(_)"
],
"metadata": {
"id": "ABcPMyIHJGgs"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Every data point now has a cluster label, and the dataset also records each penguin's species, so we can compare our clustering results with the truth to see how well our trained model performed."
],
"metadata": {
"id": "8W0ef_tlBR3j"
}
},
{
"cell_type": "markdown",
"source": [
"First, store the predicted labels in the pandas dataframe."
],
"metadata": {
"id": "hmYFHqxZB5rJ"
}
},
{
"cell_type": "code",
"source": [
"penguin_df[_______] = _____________"
],
"metadata": {
"id": "CbIizf0NB4sD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Then extract the culmen_length_mm and flipper_length_mm values for each of the three penguin species as the truth."
],
"metadata": {
"id": "Qf6VRpZ7Cp05"
}
},
{
"cell_type": "code",
"source": [
"# Adelie\n",
"\n",
"# Gentoo\n",
"\n",
"# Chinstrap"
],
"metadata": {
"id": "eY8U1ok6Blry"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
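"Recreate the figure below.\n",
"\n",
"*(Figure not shown: scatter plot of the three penguin species, one marker style per species, with a legend.)*"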
],
"metadata": {
"id": "6ZgZ1HyrDp1F"
}
},
{
"cell_type": "code",
"source": [
"#\n",
"plt.scatter(_____, _________,c=_______,s=50,marker=___,label='Adelie', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=___,vmax=___)\n",
"plt.scatter(_____, _________,c=_______,s=30,marker=___,label='Gentoo', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=___,vmax=___)\n",
"plt.scatter(_____, _________,c=_______,s=70,marker=___,label='Chinstrap', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=____,vmax=____)\n",
"plt.______(__________)\n",
"plt.ylabel(___________)\n",
"plt.xlabel(___________)\n",
"plt._________\n",
"plt.show()"
],
"metadata": {
"id": "hxbKOXfHG4Ek"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Q5: Train the KMeans clustering with 2 clusters"
],
"metadata": {
"id": "3yPfjCszD4Ni"
}
},
{
"cell_type": "code",
"source": [
"# Train k-means clustering algorithm here\n",
"\n",
"# Predict labels"
],
"metadata": {
"id": "U1f2VSwWJu7W"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We got this figure with our model. Can you recreate it?\n",
"\n",
"*(Figure not shown: scatter plot of the penguin data after clustering with 2 clusters.)*"
],
"metadata": {
"id": "I4sZITWIEHwl"
}
}
]
}