{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"source": [
"# Exercise 1: Comparing Different Types of Support Vector Machines for Classification"
],
"metadata": {
"id": "2oP2G-5B5Emv"
}
},
{
"cell_type": "markdown",
"source": [
"Now that you have learned the basics of Support Vector Machines and tuning regularization parameters can make the trained SVMs more generalizable, it is time to learn how to create and train a simple SVM.\n",
"\n",
"We will start with a sample dataset including measurements of different physical characteristics of flowers. We would like to train a support vector machine to automatically differentiate two different types of flowers. After training our first SVM model, we will make additional experiments to see how the decision boundaries of regularized versions of trained SVMs differ from less regularized ones."
],
"metadata": {
"id": "3w4eh3lzwb6F"
}
},
{
"cell_type": "markdown",
"source": [
"**Goal**: Building similar models based on different types of Support Vector Machines (SVMs) to classify linearly separable classes, here [*Iris Setosa*](https://en.wikipedia.org/wiki/Iris_setosa) and [*Iris Versicolor*](https://en.wikipedia.org/wiki/Iris_versicolor) from the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris)."
],
"metadata": {
"id": "hKxrncoO7AYT"
}
},
{
"cell_type": "markdown",
"source": [
""
],
"metadata": {
"id": "0v4gebeh6eqM"
}
},
{
"cell_type": "markdown",
"source": [
"**Caption**: Iris flowers in the evening light. Are they Irises Setosa or Irises Versicolor?\n",
"\n",
"**Source**: Photo by Christina Brinza on Unsplash\n",
" "
],
"metadata": {
"id": "MI7uj-Ye6cVV"
}
},
{
"cell_type": "markdown",
"source": [
"First, let's load the Iris dataset! 💐"
],
"metadata": {
"id": "pBekLF0J8K6M"
}
},
{
"cell_type": "code",
"source": [
"from sklearn import datasets # Import datasets from scikit-learn\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
],
"metadata": {
"id": "xzJxYtxd8bfW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"iris = datasets.load_iris() # Load the Iris dataset specifically\n",
"X = iris[\"data\"][:, (2, 3)] # Features = petal length, petal width\n",
"y = iris[\"target\"] # Target = Iris species"
],
"metadata": {
"id": "58_rKayD8edF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Pwq8I9RPuN5T"
},
"outputs": [],
"source": [
"# The iris dataset contains information of different types of flowers.\n",
"# Here we want to choose two flowers: setosa and versicolor\n",
"\n",
"setosa_or_versicolor = (y == 0) | (y == 1) # Indices of Irises setosa/versicolor\n",
"X = X[setosa_or_versicolor] # Only keep Irises setosa/versicolor in features\n",
"y = y[setosa_or_versicolor] # Only keep Irises setosa/versicolor in target"
]
},
{
"cell_type": "markdown",
"source": [
"Now we have our pre-processed dataset 💐:\n",
"\n",
"Our features are (petal length, petal width) in `X`.\n",
"\n",
"Our target is (Iris species) in `y`."
],
"metadata": {
"id": "sNrBdnen98aJ"
}
},
{
"cell_type": "code",
"source": [
"# (Optional) Expore X and y to familiarize yourself\n",
"# with the pre-processed dataset\n",
"# A couple of things you can try is to look the size of the dataset you just loaded.\n",
"# (e.g., X.shape, y.shape)\n"
],
"metadata": {
"id": "4IRhL3qE-YmS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q1) Train a Linear Support Vector Classification model on the pre-processed dataset**\n",
"\n",
"Hint: The documentation for `LinearSVC` is [at this link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)"
],
"metadata": {
"id": "fHhJAFiI9agD"
}
},
{
"cell_type": "code",
"source": [
"# Import the LinearSVC class from the scikit-learn `svm` library\n",
"from sklearn.___ import ___________"
],
"metadata": {
"id": "BgVbd88GeayE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Fit a LinearSVC object on the Iris dataset\n",
"# (1) Initiate a LinearSVC object\n",
"______ = ________\n",
"# (2) Use the LinearSVC object to fit the dataset you just created. Use .fit() for this task.\n",
"______.___(__,__)"
],
"metadata": {
"id": "fkpDwPCtf0L0"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q2) Plot the decision boundary of this classifier**\n",
"\n",
"Hint: According to the [documentation](https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py), given a SVC object `svc`:\n",
"\n",
"* Weights: `W = svc.coef_[0]`, and\n",
"\n",
"* Intercept: `I = svc.intercept_`\n",
"\n",
"the decision boundary is the line:\n",
"\n",
"$y_{boundary} = -\\frac{W\\left[0\\right] x + I\\left[0\\right]}{W\\left[1\\right]}$\n",
"\n",
"⚠ If you normalized your inputs before feeding it to the SVM in the previous question (e.g., via the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)), the equation above is only valid in \"normalized\" coordinates."
],
"metadata": {
"id": "x-1yMvQFGTxh"
}
},
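{
"cell_type": "markdown",
"source": [
"The boundary formula above comes from setting the linear decision function to zero, `W[0]*x0 + W[1]*x1 + I[0] = 0`, and solving for the second coordinate. The following is a minimal sketch on made-up toy points (not the Iris data), assuming a linear `SVC`; the names `X_toy`, `y_toy`, and `toy_svc` are hypothetical and only serve the illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.svm import SVC\n",
"\n",
"# Hypothetical toy data: two well-separated clusters (not the Iris data)\n",
"X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],\n",
"                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])\n",
"y_toy = np.array([0, 0, 0, 1, 1, 1])\n",
"\n",
"toy_svc = SVC(kernel='linear').fit(X_toy, y_toy)\n",
"W_toy = toy_svc.coef_[0]     # weights, shape (2,)\n",
"I_toy = toy_svc.intercept_   # intercept, shape (1,)\n",
"\n",
"# Points on the boundary satisfy W[0]*x0 + W[1]*x1 + I[0] = 0\n",
"x0 = np.linspace(0.0, 4.0, 5)\n",
"x1_boundary = -(W_toy[0] * x0 + I_toy[0]) / W_toy[1]\n",
"\n",
"# Sanity check: the decision function should vanish along that line\n",
"boundary_points = np.column_stack([x0, x1_boundary])\n",
"residual = toy_svc.decision_function(boundary_points)\n",
"print(np.abs(residual).max())\n",
"```\n",
"\n",
"Evaluating `decision_function` on the computed line is a quick way to check that the weights and intercept were read off correctly before plotting."
],
"metadata": {}
},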
{
"cell_type": "code",
"source": [
"# Use the first part of the hint to get the weights of the fitted SVC object\n",
"W = ____._____\n",
"# Use the second part of the hint to get the intercept of the fitted SVC object\n",
"I = ____._____"
],
"metadata": {
"id": "xAq-ckCuj0dV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now show the decision boundary that you just got in a scatter plot. Can it cleanly separate different flowers?\n",
"\n",
"Hint: (1) We will use `plt.scatter()` to plot the flower data. Check [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) for details.\n",
"(2) We will need to initiate a `X` array to plot the decision boundary. There are many ways to create such an array, but let's use `np.linspace()` for now. [Documentation](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html)"
],
"metadata": {
"id": "IfjB5534gl61"
}
},
{
"cell_type": "code",
"source": [
"# On the same figure: Scatter the features X and plot the decision boundary\n",
"# Don't forget to label the axes and add a legend to your figure\n",
"# Initiate a figure using plt.subplots\n",
"___,ax = plt.subplots(__,___)\n",
"\n",
"# Now plot all feature Xs in a scatter plot. Use these settings:\n",
"# (s = 80, color='r', edgecolor='k', linewidths=1.5)\n",
"# X, y indices of the inputs can be accessed like this X[:,0] [X indices]; X[:,1] [Y indices]\n",
"ax.________(______,________,s=___,color=___,edgecolor=___,linewidths=___)\n",
"\n",
"# Decision boundary: yboundary = - (W[0]*x+I[0])/W[1]\n",
"# Initiate a 1D array X with np.linspace()\n",
"x = np.linspace(_____)\n",
"# Calculate yboundary\n",
"yboundary = ___________\n",
"plt.plot(___,____,c='r',lw=2)\n",
"\n",
"ax.set_xlabel(\"Pedal length\",fontdict={'size':13})\n",
"ax.set_ylabel(\"Pedal width\",fontdict={'size':13})\n",
"ax.set_title('Iris dataset')\n",
"ax.tick_params(axis='both', which='major', labelsize=11)\n",
"ax.grid(alpha=0.2,ls='--',lw=1)\n",
"#[x.set_linewidth(2) for x in ax.spines.values()]\n",
"plt.show()\n"
],
"metadata": {
"id": "yNzN1W9gl-4v"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q3) Train a `SVC` and a `SGDClassifier` for the same task and compare these two models to the `LinearSVC`. Use `kernel = 'Linear'` when instantiating the SVC!**\n",
"\n",
"Hint: Here is the documentation for the [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class and the [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) class."
],
"metadata": {
"id": "tyXjOi0SV43Y"
}
},
{
"cell_type": "code",
"source": [
"# Import SVC and SGDClassifier\n",
"from sklearn.___ import _____ #SVC\n",
"from sklearn.__________ import ______________ #SGDClassifier"
],
"metadata": {
"id": "M0tFVPsen_rO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Fit the `SVC` and the `SGDClassifier` to the pre-processed dataset\n",
"# Instantiate 'SVC', use kernel='linear'\n",
"______ = SVC(_____________)\n",
"# Instantiate 'SGDClassifier'\n",
"______ = __________()\n",
"\n",
"# Fit models with data\n",
"______.____(_,_) # SVC\n",
"______.____(_,_)"
],
"metadata": {
"id": "Zc_yd0MZW7KB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Plot all three decision boundaries on the same labeled figure\n",
"#########################################################################\n",
"# Same as what you did to get the first figure\n",
"___,___ = ______.____________(____,____)\n",
"ax.scatter(________________________)\n",
"\n",
"# Decision boundary: np.dot(W,x)+b = 0\n",
"_ = ________(_______)\n",
"__._____(_,_________,c='r',lw=2,label='LinearSVC')\n",
"#########################################################################\n",
"# Now add the decision boundary for 'SVC'\n",
"__._____(_,______________________,c='k',lw=2,label='SVC')\n",
"#########################################################################\n",
"# Now add the decision boundary for 'SGDClassifier'\n",
"__._____(_,______________________,c='b',lw=2,label='SGDClas.')\n",
"#########################################################################\n",
"# Same as the first figure again\n",
"ax.legend()\n",
"ax.set_xlabel(\"Pedal length\",fontdict={'size':13})\n",
"ax.set_ylabel(\"Pedal width\",fontdict={'size':13})\n",
"ax.set_title('Iris dataset')\n",
"ax.tick_params(axis='both', which='major', labelsize=11)\n",
"ax.grid(alpha=0.2,ls='--',lw=1)\n",
"plt.show()"
],
"metadata": {
"id": "HEKHwcWXXDUg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q4) Create more regularized versions of each model and compare these new models to the previous ones**\n",
"\n",
"Hint: Vary the hyperparameter `C` for the `LinearSVC` and `SVC` models, and vary the hyperparameter `alpha` for the `SGDClassifier` **model**. Consult the documentation to know whether to increase or decrease the regularization parameters."
],
"metadata": {
"id": "So-w9eykXLgo"
}
},
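{
"cell_type": "markdown",
"source": [
"One way to quantify how strongly a linear SVM is regularized is the width of its margin, which for weight vector `w` equals `2 / ||w||`. The sketch below is only an illustration under stated assumptions: it uses randomly generated, overlapping toy clusters (not the Iris data), hypothetical names (`X_toy`, `y_toy`, `margins`), and arbitrarily chosen values of `C`. It prints the margin width for each value so that you can read off the trend yourself.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.svm import SVC\n",
"\n",
"# Hypothetical toy data: two overlapping Gaussian clusters\n",
"rng = np.random.default_rng(0)\n",
"X_toy = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),\n",
"                   rng.normal(2.5, 1.0, size=(30, 2))])\n",
"y_toy = np.array([0] * 30 + [1] * 30)\n",
"\n",
"margins = {}\n",
"for C in (0.001, 1.0, 100.0):   # arbitrary values, for illustration only\n",
"    model = SVC(kernel='linear', C=C).fit(X_toy, y_toy)\n",
"    # Margin width of a linear SVM: 2 / ||w||\n",
"    margins[C] = 2.0 / np.linalg.norm(model.coef_[0])\n",
"    print(C, margins[C])\n",
"```\n",
"\n",
"A wider margin tolerates more training points inside or on the wrong side of it, which is what a more regularized model looks like in this picture."
],
"metadata": {}
},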
{
"cell_type": "code",
"source": [
"# Fit more regularized versions of the `LinearSVC` model\n",
"irisSVC_r = LinearSVC(C=_____).fit(X,y)"
],
"metadata": {
"id": "mxJ0sXC0YIcw"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Fit more regularized versions of the `SVC` model\n",
"irisSVC2_r = SVC(C=_____,______='_____').fit(X,y)"
],
"metadata": {
"id": "pKG3_mVhYXXN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Fit more regularized versions of the `SGDClassifier` model\n",
"irisSGDc_r = SGDClassifier(alpha=_________).fit(X,y)"
],
"metadata": {
"id": "K5Yfley3YaSx"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Compare the new decision boundaries to the old ones\n",
"# Plot all three decision boundaries on the same labeled figure\n",
"#########################################################################\n",
"# Same as what you did to get the second figure\n",
"___,___ = ______.____________(____,____)\n",
"ax.scatter(________________________)\n",
"\n",
"# Decision boundary: np.dot(W,x)+b = 0\n",
"_ = ________(_______)\n",
"__._____(_,_________,c='r',lw=2,label='LinearSVC')\n",
"__._____(_,______________________,c='k',lw=2,label='SVC')\n",
"__._____(_,______________________,c='b',lw=2,label='SGDClas.')\n",
"#########################################################################\n",
"# Now add the decision boundaries for the regularized models\n",
"# Regularized LinearSVC model\n",
"# Find the weights and intercept\n",
"Wlsvc_r,Ilsvc_r = _____________,_____________\n",
"ax.plot(x,_____________________________,c='r',lw=2,ls='--',label='LinearSVC_r')\n",
"# Regularized SVC model\n",
"# Find the weights and intercept\n",
"Wsvc_r,Isvc_r = _____________,_____________\n",
"ax.plot(x,_____________________________,c='k',lw=2,ls='--',label='SVC_r')\n",
"# Regularized SGDClassifier model\n",
"# Find the weights and intercept\n",
"Wsgdc_r,Isgdc_r = _____________,_____________\n",
"ax.plot(x,_____________________________,c='b',lw=2,ls='--',label='SGDClas._r')\n",
"#########################################################################\n",
"# Same as what you did to get the second figure\n",
"ax.legend()\n",
"ax.set_xlabel(\"Pedal length\",fontdict={'size':13})\n",
"ax.set_ylabel(\"Pedal width\",fontdict={'size':13})\n",
"ax.set_title('Iris dataset')\n",
"ax.tick_params(axis='both', which='major', labelsize=11)\n",
"ax.grid(alpha=0.2,ls='--',lw=1)\n",
"plt.show()"
],
"metadata": {
"id": "2ZoYBMaPn3dJ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"How does regularization affect the decision boundary in this simple case?"
],
"metadata": {
"id": "t7MA-YalYivu"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "rVPmUtqiuN5c"
},
"source": [
"## Bonus Exercise 1: Training a SVM Regressor on the California Housing Dataset"
]
},
{
"cell_type": "markdown",
"source": [
""
],
"metadata": {
"id": "XWNEAFYgaBz_"
}
},
{
"cell_type": "markdown",
"source": [
"**Can we use SVMs to predict the price of a house in California** (in 1990) **based on its characteristics** (longitude, latitude, total_rooms, etc.)?\n",
"\n",
"The dataset was originally used in [Pace, R. Kelley, and Ronald Barry. \"Sparse spatial autoregressions.\" *Statistics & Probability Letters* 33.3 (1997): 291-297.](https://www.sciencedirect.com/science/article/pii/S016771529600140X)"
],
"metadata": {
"id": "Gf9Oad-ubNRI"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "HCM719OKuN5c"
},
"source": [
"Let's first load the dataset using Scikit-Learn's `fetch_california_housing()` function:"
]
},
{
"cell_type": "code",
"source": [
"from sklearn.datasets import fetch_california_housing # Import function"
],
"metadata": {
"id": "KVmljCvDcJ3t"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iaZY5nATuN5c"
},
"outputs": [],
"source": [
"housing = fetch_california_housing() # Fetch dataset\n",
"X = housing[\"data\"] # Features\n",
"y = housing[\"target\"] # Targets"
]
},
{
"cell_type": "code",
"source": [
"# Don't hesitate to do some preliminary data analysis here"
],
"metadata": {
"id": "ZznSTRwccTjn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "PSnoIEWFuN5c"
},
"source": [
"Let's split the data into a training set and a test set:"
]
},
{
"cell_type": "code",
"source": [
"from sklearn.model_selection import train_test_split # Import function\n",
"# from scikit-learn"
],
"metadata": {
"id": "fI8NnUUAcrlN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wefK5elSuN5d"
},
"outputs": [],
"source": [
"# Split the dataset into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
" test_size=0.2,\n",
" random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4SONf_ZnuN5d"
},
"source": [
"**Q1) Normalize the features `X` before training the regressor**\n",
"\n",
"Hint: You may use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize `X` using its [z-score](https://en.wikipedia.org/wiki/Standard_score)."
]
},
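{
"cell_type": "markdown",
"source": [
"A common pattern for z-score normalization is to learn the mean and standard deviation on the training set only and to reuse those statistics on the test set, so that no information from the test set leaks into training. The sketch below illustrates this on a tiny made-up array (not the housing data); `X_tr` and `X_te` are hypothetical names.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical toy features with very different scales\n",
"X_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])\n",
"X_te = np.array([[2.0, 200.0]])\n",
"\n",
"scaler = StandardScaler()\n",
"X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std on the training data only\n",
"X_te_scaled = scaler.transform(X_te)      # reuse the training statistics\n",
"\n",
"# Each training column now has zero mean and unit standard deviation\n",
"print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))\n",
"print(X_te_scaled)\n",
"```\n",
"\n",
"Note that the scaled test features are generally not zero-mean, because they are standardized with the training statistics."
],
"metadata": {}
},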
{
"cell_type": "code",
"source": [
"# Define the normalization/scaler"
],
"metadata": {
"id": "4mJKISxqdchn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Normalize the features `X`"
],
"metadata": {
"id": "1SA4NKTrdg3i"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q2) Start by training a simple Linear Support Vector Regression and assess its performance on the training and test sets**\n",
"\n",
"Hint 1: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html) for `scikit-learn`'s `LinearSVR`\n",
"\n",
"Hint 2: You may assess the regressor's performance using the any [scikit-learn regression metric](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) you find interpretable."
],
"metadata": {
"id": "ZXM6cbalduAV"
}
},
{
"cell_type": "code",
"source": [
"# Train a simple linear SVR"
],
"metadata": {
"id": "nXq4bMBoe-mj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Assess its performance on the training and test sets"
],
"metadata": {
"id": "28Z-hZUNfBjf"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q3) How large is the model error in $?**\n",
"\n",
"Hint: The unit of `y` in the dataset is $10,000."
],
"metadata": {
"id": "6wJ06AonfEWD"
}
},
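{
"cell_type": "markdown",
"source": [
"Converting an error from target units to dollars is a single multiplication once the unit is known. The sketch below uses made-up targets and predictions (not model output) and assumes the unit given in the scikit-learn description of `fetch_california_housing`, namely 100,000 dollars; `DOLLARS_PER_UNIT` is a hypothetical helper constant.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"# Made-up targets and predictions, expressed in the dataset's units\n",
"y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])\n",
"y_pred_toy = np.array([1.5, 1.5, 3.5, 3.5])\n",
"\n",
"# Root-mean-squared error in target units\n",
"rmse_units = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))\n",
"\n",
"# Assumption: one target unit is 100,000 dollars (scikit-learn dataset description)\n",
"DOLLARS_PER_UNIT = 100_000\n",
"rmse_dollars = rmse_units * DOLLARS_PER_UNIT\n",
"print(rmse_units, rmse_dollars)\n",
"```\n",
"\n",
"Every prediction here is off by 0.5 units, so the RMSE is 0.5 units, or 50,000 dollars."
],
"metadata": {}
},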
{
"cell_type": "code",
"source": [
"# Estimate the approximate model error in units 10,000$"
],
"metadata": {
"id": "jKUBS0kmfL_e"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q4) Try to beat your linear model using more complex SVM regressors.**\n",
"\n",
"Hint 1: The performance of a model should be assessed using the test set.\n",
"\n",
"Hint 2: Géron's model uses a [`SVR`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) for which the hyperparameters `gamma` and `C` were optimized using a [Randomized Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), and gets the root mean-squared error down to approximately 6,000 dollars on the test dataset."
],
"metadata": {
"id": "O2VQ5cimfSV9"
}
},
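{
"cell_type": "markdown",
"source": [
"A randomized search draws hyperparameter values from distributions instead of scanning a fixed grid. The following is a minimal sketch on small synthetic data (not the housing data). The search ranges, `n_iter`, `cv`, and all variable names are assumptions chosen to keep the example fast; they are not the settings of the model referenced in the hint.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.stats import loguniform, uniform\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from sklearn.svm import SVR\n",
"\n",
"# Hypothetical synthetic regression problem\n",
"rng = np.random.default_rng(42)\n",
"X_toy = rng.uniform(-2.0, 2.0, size=(120, 2))\n",
"y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(0.0, 0.1, size=120)\n",
"\n",
"# Assumed search ranges, for illustration only\n",
"param_distributions = {\n",
"    'gamma': loguniform(1e-3, 1e0),  # log-uniform between 0.001 and 1\n",
"    'C': uniform(1, 10),             # uniform between 1 and 11\n",
"}\n",
"\n",
"search = RandomizedSearchCV(SVR(kernel='rbf'), param_distributions,\n",
"                            n_iter=5, cv=3,\n",
"                            scoring='neg_root_mean_squared_error',\n",
"                            random_state=42)\n",
"search.fit(X_toy, y_toy)\n",
"\n",
"print(search.best_params_)\n",
"print(-search.best_score_)  # cross-validated RMSE of the best candidate\n",
"```\n",
"\n",
"On the real dataset, `search.best_estimator_` can then be evaluated once on the held-out test set, after fitting the search on the normalized training features."
],
"metadata": {}
},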
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dHRXw__VuN5g"
},
"outputs": [],
"source": [
"# Experiment here"
]
},
{
"cell_type": "code",
"source": [
"# Have fun"
],
"metadata": {
"id": "mNY-zE3wqxqe"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# But don't spend too too much time on it as we have two more exercises\n",
"# to go through before we get to the wildfire dataset"
],
"metadata": {
"id": "1nrswu35qyTR"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
},
"nav_menu": {},
"toc": {
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false
},
"colab": {
"provenance": [],
"toc_visible": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}