{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Regression, Classification, and Clustering with Scikit-learn"
],
"metadata": {
"id": "W4hqLiTUUtcQ"
}
},
{
"cell_type": "markdown",
"source": [
"Today, we are going to introduce three pillars of statistical modeling:\n",
"\n",
"1. *Regression*, which learns the relationship between continuous inputs (or predictors/features) and outputs (or predictands/targets), and\n",
"\n",
"2. *Clustering*, which groups objects in similar clusters.\n",
"\n",
"3. *Classification*, which learns which of a set of classes a new sample belongs to.\n",
"\n",
"For all these tasks, we will use an easy-to-use and versatile Python library for statistical learning: [scikit-learn](https://scikit-learn.org/stable/)."
],
"metadata": {
"id": "7Je6ZIMb17KG"
}
},
{
"cell_type": "markdown",
"source": [
"## Linear Regression\n",
"\n",
"Linear regression is arguably the simplest regression model.\n",
"\n",
"Assuming you have some measurements $y $ of how one physical property depends on another physical property $x $, a linear regression predicts $w*x+b $, where $w $ is the slope and $b $ is the intercept. You can optimize $a $ and $b $ based on the data you collected using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) method of the [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object.\n",
"\n",
"Reference:\n",
"1. Machine Learning Foundations and Practice (CS345_Colorado State University)"
],
"metadata": {
"id": "HiZkD3R1y48S"
}
},
{
"cell_type": "markdown",
"source": [
"First, let's import the required libraries:"
],
"metadata": {
"id": "jVjjlcdR0-Kr"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from numpy.random import default_rng\n",
"import matplotlib.pyplot as plt\n",
"import pooch\n",
"import urllib.request"
],
"metadata": {
"id": "lQGM5Hry037X"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"and create fake data to play with:"
],
"metadata": {
"id": "QvZA9WZAyu6j"
}
},
{
"cell_type": "code",
"source": [
"seed = 40001\n",
"rng = default_rng(seed)\n",
"x = 10*rng.random(50)\n",
"y = 2 * x - 5 + rng.standard_normal(50)*2.5"
],
"metadata": {
"id": "tjCxCzeC1KY9"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"x"
],
"metadata": {
"id": "ZYuJv8Zl-WQl"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We just created 50 random data points in a vector ($x$), and assumed that $x$ is related to another vector $y$ via the following equation:\n",
"$y = 2x-5$\n",
"\n",
"To mimic real observations, we add some normally-distributed noise (of magnitude 2.5) so that these \"observations\" do not exactly follow the equation.\n",
"\n",
"Now let's see how the fake data looks like!"
],
"metadata": {
"id": "Emp3Zmlh1lf_"
}
},
{
"cell_type": "code",
"source": [
"plt.scatter(x,y)\n",
"plt.xlabel('x')\n",
"plt.ylabel('y')\n",
"plt.show()"
],
"metadata": {
"id": "1NqL-bHj08pS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"If we have data that only have information about $x $ and assume that the underlying relationship between $y$ and $x$ holds, we can plug in the observed $x$ into the equation to get the corresponding $y$ value 😀\n",
"\n",
"But can we find this equation $y = 2x-5$ from data?\n",
"\n",
"This is where **regression** comes in."
],
"metadata": {
"id": "szkdvm6BEVIm"
}
},
{
"cell_type": "markdown",
"source": [
"In one dimension, linear regression takes the following form:\n",
"\n",
"$$\n",
"\\large\n",
"y = wx + b\n",
"$$\n",
"\n",
"$w$ and $b$ are the **model parameters** we would like to optimize based on the data we currently have. Once we have the equation, we can use it to make predictions on new data (e.g., the examples we just discussed)\n",
"\n",
"Fortunately, we do not need to code the model optimization and parameters finding procedure from scratch! We can just use the pre-packaged sklearn [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) functions instead.\n",
"\n",
"Let's try it out!"
],
"metadata": {
"id": "DCZWJYijFn6t"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# instantiate, fit, and predict:\n",
"# (Reshape your data either using array.reshape(-1, 1) if your data has a single feature\n",
"# or array.reshape(1, -1) if it contains a single sample.)\n",
"linreg = LinearRegression()\n",
"linreg.fit(x.reshape(-1,1), y.reshape(-1,1))\n",
"y_pred = linreg.predict(x.reshape(-1,1))\n",
"\n"
],
"metadata": {
"id": "ez00hK7KG9KY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# plot the reslts\n",
"plt.scatter(x, y, color='blue', alpha=0.5, label = 'data')\n",
"plt.plot(x, y_pred, color='black', alpha=0.8,linewidth=2, label = 'model')\n",
"plt.legend(loc=\"best\")\n",
"plt.show()"
],
"metadata": {
"id": "Spq2OGWv-1SS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This is what the optimized parameters look like. The values are not exactly what we defined originally, but it's in the same ballpark."
],
"metadata": {
"id": "5VbTvzbXORPz"
}
},
{
"cell_type": "code",
"source": [
"#print(f\"Model Prediction:{linreg.predict(np.asarray(2).reshape(1,-1))}\")\n",
"#print(f\"Real:{2*2-5}\")\n",
"print(f\"The calculated slope w is {np.round(float(linreg.coef_),decimals=2)}\\\n",
", while the calculated intercept b is {np.round(float(linreg.intercept_),decimals=2)}\")"
],
"metadata": {
"id": "8p1obDfpOZdF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"One way to get a more accurate model is to collect more data. The model parameters are significantly more accurate when we use 10000 data points to train the regression model."
],
"metadata": {
"id": "rOUo_J-6QZl_"
}
},
{
"cell_type": "code",
"source": [
"seed = 40001\n",
"rng = default_rng(seed)\n",
"x = 10*rng.random(10000)\n",
"y = 2 * x - 5 + rng.standard_normal(10000)*2.5\n",
"linregN = LinearRegression()\n",
"linregN.fit(x.reshape(-1,1), y.reshape(-1,1))\n",
"print(f\"The calculated slope w is {np.round(float(linregN.coef_),decimals=2)}\\\n",
", while the calculated intercept b is {np.round(float(linregN.intercept_),decimals=2)}\")"
],
"metadata": {
"id": "CockODwdQLlq"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"How are the model parameters derived? In the scikit-learn package, the least-squared method is used.\n",
"\n",
"The parameters $(w,b)$ are chosen as to minimize the sum-squared error (using singular value decomposition so there's a unique solution):\n",
"$$\n",
"J( w,b ) = \\sum_{i=1}^N (y_i - \\hat{y}_i)^2,\n",
"$$\n",
"\n",
"where $y_i$ are the known labels and $\\hat{y}_i = w x_i + b$ are the predicted labels. So the function actually repeats the model prediction multiple times with different `w` and `b`, and ultimately settles on the combination that minimizes the `J`."
],
"metadata": {
"id": "GgQK92OpT09a"
}
},
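{
"cell_type": "markdown",
"source": [
"To make this more concrete, here is a minimal sketch that solves the same least-squares problem directly with `np.linalg.lstsq` (NumPy's SVD-based solver) and checks that it recovers the same parameters scikit-learn found:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: solve the least-squares problem directly and compare with sklearn.\n",
"# The design matrix has one column for x (slope) and one of ones (intercept).\n",
"A = np.column_stack([x, np.ones_like(x)])\n",
"(w_direct, b_direct), *_ = np.linalg.lstsq(A, y, rcond=None)\n",
"print(f\"Direct SVD solution: w = {w_direct:.2f}, b = {b_direct:.2f}\")\n",
"print(f\"Scikit-learn:        w = {linregN.coef_[0][0]:.2f}, b = {linregN.intercept_[0]:.2f}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},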
{
"cell_type": "markdown",
"source": [
"## Clustering\n",
"Now that we have introduced how to use regression tools to find linear relationships in data, we can make predictions, e.g., in new situations or to \"fill in\" incomplete data.\n",
"\n",
"However, there are some data analysis tasks for which it would not make sense to use regression tools. For example, imagine we have a large quantity of unlabeled data with penguins' physical characteristics 🐧 We want to find the optimal way to separate the data into distinct clusters. Each cluster will then represent one specific type of penguins. This ***clusering analysis task*** can be helpful when we try to interpret a complex, high-dimensional dataset.\n",
"\n",
"Here, we will introduce some algorithms that specifically deal with clustering tasks.\n",
"\n",
"Reference:\n",
"1. Python Data Science Handbook\n",
"2. UC Davis - AIX0008 (Introduction to Data Science, summer 2022)\n",
"3. Tutorial for R package (clusterpval; https://www.lucylgao.com/clusterpval/index.html)"
],
"metadata": {
"id": "c4_SixQusPgf"
}
},
{
"cell_type": "markdown",
"source": [
"### K-means clustering\n",
"There is a large number of clustering algorithms, some of which are listed [at this link](https://scikit-learn.org/stable/modules/clustering.html). Among them, *K-means* clustering is the most commonly-used algorithm: it tries to separate data points in an unlabeled multidimensional space based on a pre-determined number of clusters.\n",
"\n",
"The first step in the *K-means* algorithms label data involves K different randomly-initiated centroid points in the data. Based on distance comparison, we will get a first guess of how data points are distributed in the clusters.This separation is not optimal. Therefore, the *k*-means algorithm optimizes by iteratively updating the centroid locations. After the first iteration, centroids are moved to the mean coordinates of different clusters. The process is repeated until the solution converges.\n",
"\n",
"Once the solution converges, each point is closer to its own cluster center than to the other cluster centers.\n",
"\n",
"Here is a visual depiction of the algorithm from the \"Python Data Science Handbook\"\n",
"\n",
"
\n",
"\n",
"First, let's generate a fake two-dimensional dataset containing four distinct blobs to see how the algorithm works."
],
"metadata": {
"id": "4AF766Nlf583"
}
},
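{
"cell_type": "markdown",
"source": [
"To build intuition first, here is a minimal NumPy sketch of the two alternating steps on toy data. This is a simplified illustration only; scikit-learn's implementation adds smarter initialization (k-means++) and convergence checks."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Minimal sketch of the K-means loop on toy data (illustration only)\n",
"rng = default_rng(0)\n",
"pts = rng.random((20, 2))                                     # 20 toy points in 2D\n",
"k = 3\n",
"centroids = pts[rng.choice(len(pts), size=k, replace=False)]  # random initialization\n",
"\n",
"for _ in range(10):  # a few fixed iterations for illustration\n",
"    # 1) assignment step: label each point with its nearest centroid\n",
"    dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)\n",
"    labels = dists.argmin(axis=1)\n",
"    # 2) update step: move each centroid to the mean of its assigned points\n",
"    centroids = np.array([pts[labels == j].mean(axis=0)\n",
"                          if np.any(labels == j) else centroids[j]\n",
"                          for j in range(k)])\n",
"\n",
"print(labels)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},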
{
"cell_type": "code",
"source": [
"from sklearn.datasets import make_blobs\n",
"import matplotlib.pyplot as plt\n",
"X, y_true = make_blobs(n_samples=300, centers=4,\n",
" cluster_std=0.60, random_state=0)\n",
"plt.scatter(X[:, 0], X[:, 1], s=50)\n",
"plt.show()"
],
"metadata": {
"id": "fX5C8FYGkDpE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The K-Means algorithm is available in the [cluster package of the scikit-learn library](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster)!"
],
"metadata": {
"id": "pNn00JZ5kmi5"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.cluster import KMeans\n",
"kmeans = KMeans(n_clusters=4)\n",
"kmeans.fit(X)\n",
"y_kmeans = kmeans.predict(X)"
],
"metadata": {
"id": "IlwGpnWRkk54"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='cividis')\n",
"\n",
"centers = kmeans.cluster_centers_\n",
"plt.scatter(centers[:, 0], centers[:, 1], c='white', s=200, alpha=0.8,edgecolors='k',linewidths=2,marker='s')\n",
"plt.show()"
],
"metadata": {
"id": "0kFkwmZkku0O"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Neat! Now let's move to a real environmental dataset 🧊 🌊 🐧"
],
"metadata": {
"id": "lVSnEBe1lMMl"
}
},
{
"cell_type": "markdown",
"source": [
"### Palmer Penguin Dataset\n",
"\n",
"In this tutorial, we will discuss how to apply clustering tools to a real dataset. The dataset we will be using is the **Palmer Penguin Dataset** (collected by Kristen Gorman).\n",
"\n",
"This dataset contains 344 entries of the physical attributes of penguins residing in the Palmer Archipelego, Antarctica. These penguins belong to three different species.\n",
"\n",
"
\n",
"\n",
"Let's start by reading the dataset and quickly visualizing it."
],
"metadata": {
"id": "oBuMLZR8xE5E"
}
},
{
"cell_type": "code",
"source": [
"penguinsize = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/ETfy8shC_PtBnsYren_f60UBSyn6Zz1CVvE0Z6_z575VZA?download=1',\n",
" known_hash='aa728597b2228a2637e39c6f08e40a80971f4cdac7faf7bc21ff4481ee3e3ae9')"
],
"metadata": {
"id": "P5Q88Pp06_L5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"penguins = pd.read_csv(penguinsize)\n",
"print(penguins.head())"
],
"metadata": {
"id": "iu4DTRuCy8pU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The [dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method drops NaN values."
],
"metadata": {
"id": "an4gY3jMmRBm"
}
},
{
"cell_type": "code",
"source": [
"penguin_df = (penguins.dropna())\n",
"penguin_df"
],
"metadata": {
"id": "j12pI4-Jlzf6"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"penguin_df[penguin_df['species']=='Adelie']['flipper_length_mm']"
],
"metadata": {
"id": "I5apvCHommNR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"Adelie, Gentoo, Chinstrap = penguin_df[penguin_df['species']=='Adelie'], penguin_df[penguin_df['species']=='Gentoo'], \\\n",
"penguin_df[penguin_df['species']=='Chinstrap']\n",
"\n",
"plt.scatter(Adelie['body_mass_g'], Adelie['flipper_length_mm'],s=50,marker='*',label='Adelie',c='#1f77b4',edgecolors='k',linewidths=0.5)\n",
"plt.scatter(Gentoo['body_mass_g'], Gentoo['flipper_length_mm'],s=30,marker='s',label='Gentoo',c='#2ca02c',edgecolors='k',linewidths=0.5)\n",
"plt.scatter(Chinstrap['body_mass_g'], Chinstrap['flipper_length_mm'],s=50,marker='o',label='Chinstrap',c='deepskyblue',edgecolors='k',linewidths=0.5)\n",
"plt.grid(linestyle='--',alpha=0.3)\n",
"plt.ylabel('Flipper Length (mm)')\n",
"plt.xlabel('Body mass (g)')\n",
"plt.legend()\n",
"plt.show()\n"
],
"metadata": {
"id": "Ug1SqIg1mPCk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let's try to use k-means clustering to separate our data.\n",
"\n",
"Since we know that there are three types of penguins in the dataset, we will ask the algorithm to identify 3 clusters for us by giving it the `n_clusters=3` argument."
],
"metadata": {
"id": "oJoKs0uO71us"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"X = np.vstack((penguin_df['body_mass_g'],penguin_df['flipper_length_mm'])).T\n"
],
"metadata": {
"id": "-AOfpAkT6zTr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"X.shape"
],
"metadata": {
"id": "tyWqt3e6Blrt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Call Kmeans\n",
"kmeans = KMeans(n_clusters=3)\n",
"kmeans.fit(X)\n",
"y_kmeans = kmeans.predict(X)"
],
"metadata": {
"id": "bKrBTr4uBkv-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"penguin_df['label'] = y_kmeans"
],
"metadata": {
"id": "cMZADt2z8yzZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"plt.scatter(penguin_df[penguin_df['species']=='Adelie']['body_mass_g'], penguin_df[penguin_df['species']=='Adelie']['flipper_length_mm'],\n",
" c=penguin_df[penguin_df['species']=='Adelie']['label'],\n",
" s=50,marker='o',label='Adelie', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=0,vmax=2)\n",
"plt.scatter(penguin_df[penguin_df['species']=='Gentoo']['body_mass_g'], penguin_df[penguin_df['species']=='Gentoo']['flipper_length_mm'],\n",
" c=penguin_df[penguin_df['species']=='Gentoo']['label'],\n",
" s=30,marker='s',label='Gentoo', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=0,vmax=2)\n",
"plt.scatter(penguin_df[penguin_df['species']=='Chinstrap']['body_mass_g'], penguin_df[penguin_df['species']=='Chinstrap']['flipper_length_mm'],\n",
" c=penguin_df[penguin_df['species']=='Chinstrap']['label'],\n",
" s=70,marker='*',label='Chinstrap', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=0,vmax=2)\n",
"plt.grid(linestyle='--',alpha=0.3)\n",
"plt.ylabel('Flipper Length (mm)')\n",
"plt.xlabel('Body mass (g)')\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {
"id": "jiCOGHsi86qm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The K-Means clustering algorithm successfully separated our data into three distinct categories. But as we can see, there is an abrupt transition between the black and yellow clusters around `body_mass_g=5000`.\n",
"\n",
"When we compare the results to our labels, we see a lot of mismatches between the assigned clusters and penguin species, especially the black and gray clusters.\n",
"\n",
"What if we change the desired cluster number to 2?"
],
"metadata": {
"id": "XIJtyKFi9Zcn"
}
},
{
"cell_type": "code",
"source": [
"kmeans = KMeans(n_clusters=2)\n",
"kmeans.fit(X)\n",
"y_kmeans = kmeans.predict(X)\n",
"penguin_df['label2'] = y_kmeans"
],
"metadata": {
"id": "IOsLcJKY-a46"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"plt.scatter(penguin_df[penguin_df['species']=='Adelie']['body_mass_g'], penguin_df[penguin_df['species']=='Adelie']['flipper_length_mm'],\n",
" c=penguin_df[penguin_df['species']=='Adelie']['label2'],\n",
" s=50,marker='o',label='Adelie', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=0,vmax=2)\n",
"plt.scatter(penguin_df[penguin_df['species']=='Gentoo']['body_mass_g'], penguin_df[penguin_df['species']=='Gentoo']['flipper_length_mm'],\n",
" c=penguin_df[penguin_df['species']=='Gentoo']['label2'],\n",
" s=30,marker='s',label='Gentoo', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=0,vmax=2)\n",
"plt.scatter(penguin_df[penguin_df['species']=='Chinstrap']['body_mass_g'], penguin_df[penguin_df['species']=='Chinstrap']['flipper_length_mm'],\n",
" c=penguin_df[penguin_df['species']=='Chinstrap']['label2'],\n",
" s=70,marker='*',label='Chinstrap', cmap='cividis',edgecolors='k',linewidths=0.5,vmin=0,vmax=2)\n",
"plt.grid(linestyle='--',alpha=0.3)\n",
"plt.ylabel('Flipper Length (mm)')\n",
"plt.xlabel('Body mass (g)')\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {
"id": "yXiM2XOj-h-I"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The clustering results now look much more natural than before. This suggests that `n_clusters=2` is probably the most appropriate number of clusters for the `body_mass_g`-`flipper_length_mm` 2D plane."
],
"metadata": {
"id": "2flDU9O7-s-X"
}
},
{
"cell_type": "markdown",
"source": [
"### Find the most appropriate cluster number in an objective way\n",
"There are several ways to find the most appropriate number of clusters for a given dataset. The first is the so-called `elbow curve method`.\n",
"\n",
"The basic idea behind this method is that when clusters contain data points that should not be included, the distance between these data points and the cluster centroids will be very large!\n",
"\n",
"We can visualize the change in these distances with number of clusters."
],
"metadata": {
"id": "a6JzoYRN_NGZ"
}
},
{
"cell_type": "markdown",
"source": [
"#### Elbow Method"
],
"metadata": {
"id": "zrpVv-svEzOu"
}
},
{
"cell_type": "code",
"source": [
"Sum_of_squared_distances = []\n",
"K = range(1,10)\n",
"for num_clusters in K :\n",
" kmeans = KMeans(n_clusters=num_clusters)\n",
" kmeans.fit(X)\n",
" Sum_of_squared_distances.append(kmeans.inertia_)\n",
"\n",
"plt.plot(K,Sum_of_squared_distances,marker='s',c='k',lw=2)\n",
"plt.xlabel('Number of Clusters')\n",
"plt.ylabel('Sum of Squared Distances / Inertia')\n",
"plt.title('Elbow Method for Optimal K')\n",
"plt.show()"
],
"metadata": {
"id": "y376ZqlbAlY3"
},
"execution_count": null,
"outputs": []
},
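{
"cell_type": "markdown",
"source": [
"What exactly is `inertia_`? It is simply the sum of squared distances between each data point and the centroid of its assigned cluster. As a quick sanity check (a sketch using the last `kmeans` fitted in the loop above), we can recompute it by hand:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: recompute the inertia of the last fitted model by hand\n",
"assigned_centers = kmeans.cluster_centers_[kmeans.labels_]  # centroid of each point's cluster\n",
"manual_inertia = np.sum((X - assigned_centers)**2)\n",
"print(manual_inertia, kmeans.inertia_)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},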
{
"cell_type": "markdown",
"source": [
"Using the \"elbow method\", the optimal number of clusters occur between `n_clusters`$=1$, for which the sum of squared distance is very large, and `n_clusters`$>4$, for which the sum of squared distance starts to saturate, i.e., decrease very slowly.\n",
"\n",
"From this analysis, it is difficult to say whether `n_clusters`$=2$ or `n_clusters`$=3$ is more suitable to cluster the penguin dataset.\n",
"\n",
"To objectively pinpoint the optimal number of clusters, we can use **silhouette analysis**, based on the silhouette score $S$, defined for a cluster $i$ as:\n",
"\n",
"$S(i) = \\frac{b(i)-a(i)}{max(a(i),b(i)}$,\n",
"\n",
"where:\n",
"\n",
"- `a(i)` is the distance between the cluster centroid and all data points **within** the corresponding cluster\n",
"- `b(i)` is the distance between the cluster centroid and all data points **outside** of the corresponding cluster\n",
"\n",
"We perform silhouette analysis for each cluster and average the `S(i)`, thus obtaining a single silhouette score for a given number of cluster `n_clusters`."
],
"metadata": {
"id": "PjCbf9qiCknb"
}
},
{
"cell_type": "markdown",
"source": [
"#### Silhouette Analysis"
],
"metadata": {
"id": "4EK4hOBsEr5R"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.metrics import silhouette_score\n",
"silhouette_avg = []\n",
"for num_clusters in range(2,10):\n",
" # initialise kmeans\n",
" kmeans = KMeans(n_clusters=num_clusters)\n",
" kmeans.fit(X)\n",
" cluster_labels = kmeans.labels_\n",
"\n",
" # silhouette score\n",
" silhouette_avg.append(silhouette_score(X, cluster_labels))\n",
"\n",
"plt.plot(range(2,10),silhouette_avg,marker='s',c='k',lw=2)\n",
"plt.xlabel('Number of Clusters')\n",
"plt.ylabel('Silhouette Score')\n",
"plt.title('Silhouette Analysis for Optimal K')\n",
"plt.show()"
],
"metadata": {
"id": "LxTeeAwSBKHg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The silhouette score should be as large as possible. An average silhouette score `Savg` close to zero means that the clusters are not compact and overlap with each other.\n",
"\n",
"From the plot above, it seems like `n_clusters`=2 is an optimal `n_clusters` number.\n",
"\n",
"**But can it really help us separate the different penguins? It doesn't seem so, at least based on the body mass and flipper length 😞**"
],
"metadata": {
"id": "pVnavM00E4ku"
}
},
{
"cell_type": "code",
"source": [
"Adelie, Gentoo, Chinstrap = penguin_df[penguin_df['species']=='Adelie'], penguin_df[penguin_df['species']=='Gentoo'], \\\n",
"penguin_df[penguin_df['species']=='Chinstrap']\n",
"\n",
"plt.scatter(Adelie['body_mass_g'], Adelie['flipper_length_mm'],s=50,marker='*',label='Adelie',c='#1f77b4',edgecolors='k',linewidths=0.5)\n",
"plt.scatter(Gentoo['body_mass_g'], Gentoo['flipper_length_mm'],s=30,marker='s',label='Gentoo',c='#2ca02c',edgecolors='k',linewidths=0.5)\n",
"plt.scatter(Chinstrap['body_mass_g'], Chinstrap['flipper_length_mm'],s=50,marker='o',label='Chinstrap',c='deepskyblue',edgecolors='k',linewidths=0.5)\n",
"plt.grid(linestyle='--',alpha=0.3)\n",
"plt.ylabel('Flipper Length (mm)')\n",
"plt.xlabel('Body mass (g)')\n",
"plt.legend()\n",
"plt.show()\n"
],
"metadata": {
"id": "ticTdMvMFuXj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Indeed, Gentoo penguins are distinct from the other two penguin species, because they are heavier and have longer flippers. But it is very difficult to separate chinstraps or adelies in this 2-D plane.\n",
"\n",
"This means that body mass and flipper lengths might not be the ideal combination to create clusters that represent penguin species.\n"
],
"metadata": {
"id": "KFe0MBbWqmbl"
}
},
{
"cell_type": "code",
"source": [
"plt.scatter(Adelie['culmen_length_mm'], Adelie['flipper_length_mm'],s=50,marker='*',label='Adelie',c='#1f77b4',edgecolors='k',linewidths=0.5)\n",
"plt.scatter(Gentoo['culmen_length_mm'], Gentoo['flipper_length_mm'],s=30,marker='s',label='Gentoo',c='#2ca02c',edgecolors='k',linewidths=0.5)\n",
"plt.scatter(Chinstrap['culmen_length_mm'], Chinstrap['flipper_length_mm'],s=50,marker='o',label='Chinstrap',c='deepskyblue',edgecolors='k',linewidths=0.5)\n",
"plt.grid(linestyle='--',alpha=0.3)\n",
"plt.ylabel('Flipper Length (mm)')\n",
"plt.xlabel('Culmen Length (mm)')\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {
"id": "eYSUAt_8oagP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The figure above suggests that one way to separate the three penguin species is to look at the combination of the lengths of their bills and their flippers. In particular, the Chinstraps have similar flipper lengths compared to the Adelies, but their culmens are more comparable to the Gentoos.\n",
"\n",
"In the exercise, we will try to repeat the k-means clustering analysis procedure shown above, but using this more appropriate 2D plane."
],
"metadata": {
"id": "r-CPeFtysz5n"
}
},
{
"cell_type": "markdown",
"source": [
"## Linear Classification"
],
"metadata": {
"id": "F5TDRdp3KuCo"
}
},
{
"cell_type": "markdown",
"source": [
"Another physical characteristic, namely the length of the bill, is very informative and can be used to separate adelies from chinstraps! 🐧\n",
"\n",
"Here, we introduce a simple algorithm to classify the two penguins categories with scikit-learn, i.e., **logistic regression**.\n",
"\n",
"(Reference: Scikit-learn tutorial [https://inria.github.io/scikit-learn-mooc/python_scripts/logistic_regression.html])"
],
"metadata": {
"id": "eNFGUffWUTuo"
}
},
{
"cell_type": "markdown",
"source": [
"First let's use the `loc` attribute of the Pandas DataFrame to only keep the data for Adelies and Chinstraps."
],
"metadata": {
"id": "KYMem2JuVa-v"
}
},
{
"cell_type": "code",
"source": [
"# only keep the Adelie and Chinstrap classes\n",
"penguins_small = penguin_df.set_index(\"species\").loc[\n",
" [\"Adelie\", \"Chinstrap\"]].reset_index()\n",
"culmen_columns = [\"culmen_length_mm\", \"culmen_depth_mm\"]\n",
"target_column = \"species\""
],
"metadata": {
"id": "6Qr9NJm5UjVY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"penguins_small"
],
"metadata": {
"id": "U0NpXF3iUN1H"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let's visualize the `culmen_length` and `culmen_depth` columns in the table."
],
"metadata": {
"id": "Kgy-nNuUVZIH"
}
},
{
"cell_type": "code",
"source": [
"for feature_name in culmen_columns:\n",
" plt.figure()\n",
" # plot the histogram for each specie\n",
" penguins.groupby(\"species\")[feature_name].plot.hist(alpha=0.5, legend=True)\n",
" plt.xlabel(feature_name)"
],
"metadata": {
"id": "OMGamFuqV8GP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We can see that when we look at the statistical distributions of the `culmen_length_mm` column, the Adelies are nicely concentrated in the figure's left-hand side, whereas Chinstraps are concentrated in the figure's right-hand side.\n",
"\n",
"On the other hand, `culmen_depth_mm` seems to be not as useful in separating the two species of penguins.\n",
"\n",
"We can phrase this problem as a binary classification. We include the `culmen_depth` and `culmen_depth` columns into a regression model that gives binary outputs. The model will only have two outcomes. One outcome will be *this penguin is an Adelie*, the other will be *this penguin is not an Adelie (a Chinstrap)*.\n",
"\n",
"Scikit-learn conveniently includes a [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) module that is arguably the simplest way to do this task."
],
"metadata": {
"id": "im_fkE7AWmLU"
}
},
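{
"cell_type": "markdown",
"source": [
"As a quick reminder of the idea (a sketch, not scikit-learn's internals): logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function, which squashes any real number into $(0, 1)$. The result is interpreted as the probability of the positive class, and the predicted class is obtained by thresholding that probability at 0.5."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: the logistic (sigmoid) function that gives the model its name\n",
"def sigmoid(z):\n",
"    return 1 / (1 + np.exp(-z))\n",
"\n",
"z = np.linspace(-6, 6, 200)\n",
"plt.plot(z, sigmoid(z), c='k', lw=2)\n",
"plt.axhline(0.5, linestyle='--', alpha=0.5)  # 0.5 decision threshold\n",
"plt.xlabel('linear score z')\n",
"plt.ylabel('sigmoid(z)')\n",
"plt.show()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},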
{
"cell_type": "markdown",
"source": [
"A well-behaved classification model should be able to correctly classify penguins not in the data table.\n",
"\n",
"We ensure our model *generalizes well* by not using certain portions of the data when training the regression model."
],
"metadata": {
"id": "iZY5h6g0Z9tK"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"penguins_train, penguins_test = train_test_split(penguins_small, random_state=0)\n",
"\n",
"data_train = penguins_train[culmen_columns]\n",
"data_test = penguins_test[culmen_columns]\n",
"\n",
"target_train = penguins_train[target_column]\n",
"target_test = penguins_test[target_column]"
],
"metadata": {
"id": "ierm6HVVYROF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let's look at the data used as *inputs* of the model. Doing the `train_test_split` function can automatically separate the dataset into a larger dataframe **for training** and a smaller one **for testing**."
],
"metadata": {
"id": "8drhxRpHarVt"
}
},
{
"cell_type": "code",
"source": [
"data_train"
],
"metadata": {
"id": "AKttf4DNa5ZI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Since `culmen_length_mm` is much larger than `culmen_depth_mm` (almost 2-3 times larger in fact), the weights of the model will be biased towards `culmen_length_mm`. We don't want this!\n",
"\n",
"Here we use `StandardScaler` in scikit-learn to **standardize** the inputs. Standardization means removing the sample mean from each sample values and dividing it by the standard deviation of the samples.\n",
"\n",
"$X_{standard} = \\frac{X-mean(X)}{std(X)}$"
],
"metadata": {
"id": "dp6ILljAcVQm"
}
},
{
"cell_type": "code",
"source": [
"c_length_stand = (data_train['culmen_length_mm']-np.mean(data_train['culmen_length_mm']))/np.std(data_train['culmen_length_mm'])\n",
"c_depth_stand = (data_train['culmen_depth_mm']-np.mean(data_train['culmen_depth_mm']))/np.std(data_train['culmen_depth_mm'])"
],
"metadata": {
"id": "jepTuY6Ld3le"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(np.min(c_length_stand),np.max(c_length_stand))\n",
"print(np.min(c_depth_stand),np.max(c_depth_stand))"
],
"metadata": {
"id": "ccH3tjcdeMxS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"In this tutorial, we redo what we just did with [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) on the two columns, and train a [`LogisticRegression()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model for binary predictions.\n",
"\n",
"Sklearn has a `make_pipeline` to streamline this process."
],
"metadata": {
"id": "AQznp0yZeY6q"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"logistic_regression = make_pipeline(\n",
" StandardScaler(), LogisticRegression(penalty=\"none\")\n",
")\n",
"logistic_regression.fit(data_train, target_train)\n",
"accuracy = logistic_regression.score(data_test, target_test)\n",
"print(f\"Accuracy on test set: {accuracy:.3f}\")"
],
"metadata": {
"id": "P06XdpF5b62J"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"And you can see, our model performs quite well in separating our penguins, only making a few misclassications."
],
"metadata": {
"id": "3LQYabIme0D8"
}
},
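{
"cell_type": "markdown",
"source": [
"To see where those few mistakes happen, here is a quick sketch that tabulates predictions against the true labels on the test set (scikit-learn's `confusion_matrix` would work as well):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: cross-tabulate true vs. predicted species on the test set\n",
"predictions = logistic_regression.predict(data_test)\n",
"print(pd.crosstab(target_test, predictions,\n",
"                  rownames=['true'], colnames=['predicted']))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},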
{
"cell_type": "markdown",
"source": [
"But how did our model classify the samples? The way the model works is encoded in the `coef_` information in the regression model.\n",
"\n",
"Let's visualize them!"
],
"metadata": {
"id": "KQjMyvB0i4DS"
}
},
{
"cell_type": "code",
"source": [
"coefs = logistic_regression[-1].coef_[0] # the coefficients is a 2d array\n",
"weights = pd.Series(coefs, index=culmen_columns)"
],
"metadata": {
"id": "WiBxchT3ivck"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"weights.plot.barh()\n",
"_ = plt.title(\"Weights of the logistic regression\")"
],
"metadata": {
"id": "5NaMyxZFizwi"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(f\"weights: {weights}\\n\")\n",
"print(f\"intercept:{logistic_regression[-1].intercept_}\")"
],
"metadata": {
"id": "LnRhE9itjpOg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This is actually the weights in a linear equation\n",
"\n",
"$w1(culmen length)+w2(culmen depth)+intecept = 0$\n",
"\n",
"In the trained model, this linear equation is actually:\n",
"\n",
"$11.67(culmenlength)-4.18(culmendepth)-3.97 = 0$"
],
"metadata": {
"id": "q6-fjtYscFgK"
}
},
{
"cell_type": "markdown",
"source": [
"The linear equation can separate the penguins quite nicely!"
],
"metadata": {
"id": "dKjP9mQJlY4i"
}
},
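{
"cell_type": "markdown",
"source": [
"As a final check, here is a sketch that draws this decision boundary on top of the training data. Because the pipeline standardizes the inputs, the fitted weights apply to the *scaled* features, so we solve the boundary equation in the scaled space and map it back to the original units using the scaler's mean and standard deviation:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: plot the fitted decision boundary in the original feature units.\n",
"# The boundary satisfies w1*z1 + w2*z2 + b = 0, where z1, z2 are the\n",
"# standardized culmen length and depth; we solve it for culmen depth.\n",
"scaler = logistic_regression[0]\n",
"w1, w2 = logistic_regression[-1].coef_[0]\n",
"b = logistic_regression[-1].intercept_[0]\n",
"mean1, mean2 = scaler.mean_\n",
"std1, std2 = scaler.scale_\n",
"\n",
"x_line = np.linspace(data_train['culmen_length_mm'].min(),\n",
"                     data_train['culmen_length_mm'].max(), 100)\n",
"y_line = mean2 - (std2 / w2) * (b + w1 * (x_line - mean1) / std1)\n",
"\n",
"for species, marker in [('Adelie', 'o'), ('Chinstrap', '*')]:\n",
"    subset = penguins_train[penguins_train['species'] == species]\n",
"    plt.scatter(subset['culmen_length_mm'], subset['culmen_depth_mm'],\n",
"                marker=marker, label=species, edgecolors='k', linewidths=0.5)\n",
"plt.plot(x_line, y_line, 'k--', label='decision boundary')\n",
"plt.grid(linestyle='--', alpha=0.3)\n",
"plt.xlabel('Culmen Length (mm)')\n",
"plt.ylabel('Culmen Depth (mm)')\n",
"plt.legend()\n",
"plt.show()"
],
"metadata": {},
"execution_count": null,
"outputs": []
}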
]
}