{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "5Tt5C4PoIRl0" }, "source": [ "# (Exercises) Training Models\n", "\n", "This week's notebook is based off of the exercises in Chapter 4 of Géron's book." ] }, { "cell_type": "markdown", "metadata": { "id": "666iBNeL8-7H" }, "source": [ "## Notebook Setup\n", "Let's begin like in the last notebook: importing a few common modules, ensuring MatplotLib plots figures inline and preparing a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated so once again we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.\n", "\n", "You don't need to worry about understanding everything that is written in this section." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "S_OXSp49IOF2" }, "outputs": [], "source": [ "#@title Run this cell for preliminary requirements. Double click it if you want to check out the source :)\n", "\n", "# Python ≥3.5 is required\n", "import sys\n", "assert sys.version_info >= (3, 5)\n", "\n", "# Is this notebook running on Colab or Kaggle?\n", "IS_COLAB = \"google.colab\" in sys.modules\n", "\n", "# Scikit-Learn ≥0.20 is required\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.20\"\n", "\n", "# Common imports\n", "import numpy as np\n", "import os\n", "\n", "# To make this notebook's output stable across runs\n", "rnd_seed = 42\n", "rnd_gen = np.random.default_rng(rnd_seed)\n", "\n", "# To plot pretty figures\n", "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "mpl.rc('axes', labelsize=14)\n", "mpl.rc('xtick', labelsize=12)\n", "mpl.rc('ytick', labelsize=12)\n", "\n", "# Where to save the figures\n", "PROJECT_ROOT_DIR = \".\"\n", "CHAPTER_ID = \"classification\"\n", "IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n", "os.makedirs(IMAGES_PATH, exist_ok=True)\n", "\n", "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n", " path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n", " print(\"Saving figure\", fig_id)\n", " if tight_layout:\n", " plt.tight_layout()\n", " plt.savefig(path, format=fig_extension, dpi=resolution)\n", "\n", "#Ensure the palmerspenguins dataset is installed\n", "%pip install palmerpenguins --quiet" ] }, { "cell_type": "markdown", "metadata": { "id": "RtuO7Elb9LuC" }, "source": [ " **Data Setup**" ] }, { "cell_type": "markdown", "metadata": { "id": "wKsvLXdmzqD8" }, "source": [ "In this notebook we will be working with the [*Palmer Penguins dataset*](https://allisonhorst.github.io/palmerpenguins/articles/intro.html). Each entry in the dataset includes the penguin's species, island, sex, flipper length, body mass, bill length, bill depth, and the year the study was carried out. Let's take a moment and observe our subjects!
\n", "\n", "
🐧
\n", "In order: Adélie (Pygoscelis adeliae), Chinstrap (Pygoscelis antarcticus), and Gentoo (Pygoscelis papua) penguins
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "As you can imagine, this dataset is normally used to train *multiclass*/*multinomial* classification algorithms and not *binary* classification algorithms, since there *are* more than 2 classes. \n", "\n", "\"*Three classes, even!*\" - an observant TA\n", "\n", "For this exercise, however, we will implement the binary classification algorithm referred to as the *logistic regression* algorithm (also called logit regression)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "emWru72owjEI" }, "outputs": [], "source": [ "# Let's load the Palmer Penguins Dataset!\n", "from palmerpenguins import load_penguins\n", "data = load_penguins()" ] }, { "cell_type": "markdown", "metadata": { "id": "kbk8zvwOf2-g" }, "source": [ "Like with the Titanic dataset in the previous notebook, the data here is loaded as a Pandas DataFrame. Feel free to play around with it in the cell below!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AR2NqgIdTcuk" }, "outputs": [], "source": [ "# The following code will make the dataframe be shown in an interactive table\n", "# inside of Google colab. Use data.head(5) if you're running this locally\n", "\n", "from google.colab import data_table\n", "data_table.enable_dataframe_formatter()\n", "\n", "data" ] }, { "cell_type": "markdown", "metadata": { "id": "nDFjkxZE1BA3" }, "source": [ "As we mentioned before, there are three species of penguin in the dataset. However, today we'll be implementing a _binary classification algorithm_, which means we need to have exactly two target classes! Let's go ahead and filter the data so that we keep the Adelie and Gentoo species." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kTUHBGUT1dTY" }, "outputs": [], "source": [ "# We define the species that we're interested in\n", "species = ['Adelie','Gentoo']\n", "\n", "# And use the .loc method in Pandas to keep only the two species mentioned above\n", "data = data.loc[data['species'].isin(species)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fa4m4iPxgpu9" }, "outputs": [], "source": [ "#@title Today, we'll be learning to classify the penguins based on the length and depth of their bills. Run the cell and take a look at the data! 🔎\n", "\n", "import plotly.express as px\n", "\n", "# Dimensions for interactive plot\n", "dims = ['bill_length_mm', 'bill_depth_mm']\n", "colors = ['orange','black','lightseagreen']\n", "\n", "fig = px.scatter_matrix(\n", " data, \n", " dimensions=dims,\n", " color=\"species\",\n", " color_discrete_sequence = colors\n", " )\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "kGI9Ruq6BRPZ" }, "source": [ "We now have a dataframe with all the information that we need. Let's go ahead and extract the bill length and depth to use as input data, storing it in $x$. \n", "Then we'll store the labels (i.e., the _targets_) in $y$.\n", "\n", "## Q1) Extract the bill length and bill depth to use as the input vector $x$, and store the label (i.e., the target data) in $y$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "9kwuG96d5GrS" }, "outputs": [], "source": [ "#@title Hints - Data Loading and Filtering\n", "\n", "'''\n", "Loading data into X:\n", "\n", "You can access multiple columns of a pandas dataframe using a list! The snippet\n", "below will return the species and island associated with each penguin in the\n", "database. 
\n", "\n", "In the cell below, you want to load the bill length and bull depth columns.\n", "Make sure you use the right column name! Copy it from the dataframe view we\n", "printed before, and make sure there aren't any extra spaces\n", "''';\n", "data[['species','island']];\n", "\n", "'''\n", "Finding the NaN row indices\n", "\n", "Pandas has a built-in function to determine if the value is a NaN (Not a Number)\n", "value. \n", "\n", "mydata.notna() will return True wherever the data isn't a NaN value, but we need\n", "to check if each row has _any_ NaN values - that's what the _all(axis=1)_ does.\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mRognoEcPUmf" }, "outputs": [], "source": [ "# Load the bill length and depth into X\n", "X = data[______]\n", "\n", "# Find out the rows where you don't have an valid input (i.e., rows with a nan value)\n", "indices = X._____().all(axis=1)\n", "\n", "# Filter out the datapoints using the indices we found\n", "X = X[___]\n", "\n", "# We'll also normalize the data using the mean and standard deviation\n", "X = (X - X.mean())/X.std()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0IoLpmkLugIR" }, "outputs": [], "source": [ "# Let's take a look at the input dataset - if you did everything right, you'll \n", "# have 274 entries and printing out x.shape will return (274,2)\n", "print(X.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "S63cMjGKLpd2" }, "source": [ "We have our input data, but we need a target to predict. We previously filtered the data to only include Adélie and Gentoo penguins, but we still have them as strings! Let's convert them to a binary representation (i.e., 0 or 1). Make sure you have the same penguins as in your input!\n", "\n", "## **Q2) Convert the species label to a binary classification, and filter the target data to match the input data.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "s3psibnt08RZ" }, "outputs": [], "source": [ "#@title Hints - Boolean Representation & Type Conversion\n", "\n", "''' \n", "Boolean Representation\n", "\n", "You can access the species data by calling data['species']\n", "\n", "== is the operator that lets you check if the data is equal to another value\n", "\n", "data['island'] == Torgesen \n", "will return True for each row if the penguin was studied in Torgesen, and False\n", "if it was studied in another island\n", "''';\n", "\n", "'''\n", "Type Conversion\n", "\n", "Pandas dataframes include a method to change the type of the data being called.\n", "\n", "data['bill_length_mm'].astype(int) will return the bill length data as integers\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TEJwe-AvLvN2" }, "outputs": [], "source": [ "# Convert species data into boolean form by checking if the species is Adélie\n", "y = (data['_______'] _____ '_______')\n", "\n", "# Filter out the points for which we have NaN values. Reuse the indices from Q1! \n", "y = y[____]\n", "\n", "# Convert the boolean data into an integer\n", "y = y._______(_____)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x9WveIx-1w5g" }, "outputs": [], "source": [ "# Print out y! 
If everything is implemented correctly, you should see a panda \n", "# series full of ones and zeroes with 274 rows\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": { "id": "jvNBaOWZ9fXM" }, "source": [ "We now have a set of binary classification data we can use to train an algorithm.\n", "\n", "As we saw during our reading, we need to define three things in order to train our algorithm: \n", "> $\\cdot$ the type of algorithm we will train, \\\\\n", "> $\\cdot$ the cost function (which will tell us how close our prediction is to the truth), and \\\\\n", "> $\\cdot$ a method for updating the parameters in our model according to the value of the cost function (e.g., the gradient descent method). \n", "\n", "Let's begin by defining the type of algorithm we will use. We will train a logistic regression model to differentiate between two classes. A reminder of how the logistic regression algorithm works is given below.\n", "
\n", "The logistic regression algorithm will thus take an input $t$ that is a linear combination of the features:\n", "\n", "\n", "\n", "
$t_{\\small{n}} = \\beta_{\\small{0}} + \\beta_{\\small{1}} \\cdot X_{1,n} + \\beta_{\\small{2}} \\cdot X_{2,n}$
\n", "\n", "where \n", "* $n$ is the ID of the sample \n", "* $X_{\\small{0}}$ represents the bill length\n", "* $X_{\\small{1}}$ represents the bill width\n", "\n", "This input is then fed into the logistic function, $\\sigma$:\n", "\\begin{align} \n", "\\sigma: t\\mapsto \\dfrac{1}{1+e^ {-t}}\n", "\\end{align}\n", "\n", "Let's define the logistic function for later use." ] }, { "cell_type": "markdown", "metadata": { "id": "PzrhQ2E-zkDr" }, "source": [ "## **Q3) Define the logistic function**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "oBvebKZVDWx0" }, "outputs": [], "source": [ "#@title Hint - Exponential Function\n", "'''\n", "Numpy includes the exponential function in its library as numpy.exp\n", "https://numpy.org/doc/stable/reference/generated/numpy.exp.html\n", "''';\n", "\n", "np.exp(2);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EdelvJlJzuE5" }, "outputs": [], "source": [ "def logistic(in_val):\n", " # Return the value of the logistic function\n", " out_value = _________ \n", " return out_value" ] }, { "cell_type": "markdown", "metadata": { "id": "WqIkC1wZ0gAA" }, "source": [ "Now that the logistic function has been defined, we can plot it (this will help us remember what it looks like!) Run the code below - you won't have to fill anything in for this one 😀 But feel free to show the code and read through it - some of the functions used can be helpful to you down the line!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "lgt9dI6b9Zwa" }, "outputs": [], "source": [ "#@title Run this to plot the logistic function!\n", "# Let's generate an array of 20 points with values from -4 to +4 \n", "t = np.linspace(-4,4,20)\n", "\n", "# Initiate a figure and axes object using matplotlib\n", "fig, ax = plt.subplots()\n", "\n", "# Draw the X and Y axes\n", "ax.axvline(0, c='black', alpha=1)\n", "ax.axhline(0, c='black', alpha=1)\n", "\n", "# Draw the threshold line (y_val=0,5) and asymptote (y=1)\n", "[ax.axhline(y_val, c='black', alpha=0.5, linestyle='dotted') for y_val in (0.5,1)]\n", "\n", "# Scale things to make the graph look nicer\n", "plt.autoscale(axis='x', tight=True)\n", "\n", "# Plot the logistic function. X values from the t vector, y values from logistic(t)\n", "ax.plot(t, logistic(t));\n", "ax.set_xlabel('$t$')\n", "ax.set_ylabel('$\\\\sigma\\\\ \\\\left(t\\\\right)$')\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": { "id": "0Ll1PKpjxqLX" }, "source": [ "With the logistic function, we define inputs resulting in $\\sigma\\geq0.5$ as belonging to the ***one*** class, and any value below that is considered to belong to the ***zero*** class.\n", "\n", "We now have a function which lets us map the value of the bill length and width to the class to which the observation belongs (i.e., whether the length and width correspond to Adélie or Gentoo penguins). However, there is a parameter vector **$\\theta$** with a number of parameters that we do not have a value for:
$\\theta = [ \\beta_{\\small{0}}, \\beta_{\\small{1}}$, $\\beta_{\\small{2}} ]$" ] }, { "cell_type": "markdown", "metadata": { "id": "O_lT4EaK2ICa" }, "source": [ "## **Q4) Set up an array of random numbers between 0 and 1 representing the $\\theta$ vector.**\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "4ULzWzd750RT" }, "outputs": [], "source": [ "#@title Hints: Random Number Generation \n", "''' \n", "Random Number Generation\n", "Use `rnd_gen`! If you're not sure how to use it, consult the `default_rng` \n", "documentation at this address:\n", "https://numpy.org/doc/stable/reference/random/generator.html\n", "\n", "For instance, you may use the `random` method of `rnd_gen`.*\n", "''';\n", "\n", "'''\n", "The theta array should have 3 elements in it! \n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "R0uUE4VbEfgt" }, "outputs": [], "source": [ "#@title Hint: Code Snipppet\n", "'''\n", "rnd_gen.random((___,)) # length of array\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-Vk05y1C2VBs" }, "outputs": [], "source": [ "theta = ______" ] }, { "cell_type": "markdown", "metadata": { "id": "s8KM_CeF2Ven" }, "source": [ "In order to determine whether a set of $\\beta$ values is better than the other, we need to quantify well the values are able to predict the class. This is where the cost function comes in.\n", "\n", "The cost function, $c$, will return a value close to zero when the prediction, $\\hat{p}$, is correct and a large value when it is wrong. In a binary classification problem, we can use the log loss function. For a single prediction and truth value, it is given by:\n", "\\begin{align}\n", " \\text{c}(\\hat{p},y) = \\left\\{\n", " \\begin{array}{cl}\n", " -\\log(\\hat{p})& \\text{if}\\; y=1\\\\\n", " -\\log(1-\\hat{p}) & \\text{if}\\; y=0\n", " \\end{array}\n", " \\right.\n", " \\end{align}\n", "\n", "However, we want to apply the cost function to an n-dimensional set of predictions and truth values. Thankfully, we can find the average value of the log loss function $J$ for an an-dimensional set of $\\hat{y}$ & $y$ as follows:\n", "\n", "\\begin{align}\n", " \\text{J}(\\mathbf{\\hat{p}},y) = - \\dfrac{1}{n} \\sum_{i=1}^{n} \n", " \\left[ y_i\\cdot \\log\\left( \\hat{p}_i \\right) \\right] + \n", " \\left[ \\left( 1 - y_i \\right) \\cdot \\log\\left( 1-\\hat{p}_i \\right) \\right]\n", " \\end{align}\n", "\n", "We now have a formula that can be used to calculate the average cost over the training set of data.\n", "\n", "Now let's code 💻\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XBLxwlSWMoo1" }, "source": [ "## **Q5) Define a log_loss function that takes in an arbitrarily large set of prediction and truths**\n", "\n", "*Hint 1: You need to encode the function $J$ above, for which Numpy's functions may be quite convenient (e.g., [`log`](https://numpy.org/doc/stable/reference/generated/numpy.log.html), [`mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html), etc.)*\n", "\n", "*Hint 2: Asserting the dimensions of the vector is a good way to check that your function is working correctly. [Here's a tutorial on how to use `assert`](https://swcarpentry.github.io/python-novice-inflammation/10-defensive/index.html#assertions). 
For instance, to assert that two vectors `X` and `y` have the same dimension, you may use:*\n", "```\n", "assert X.shape==y.shape\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "Jmnzz4h_Cq01" }, "outputs": [], "source": [ "#@title Hint: Example code snippet\n", "'''\n", "J_vector = -(y * np.log(p_hat + epsilon) + (1-y) * np.log(1-y_hat))\n", "J.mean()\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "H5fDeL36EauO" }, "outputs": [], "source": [ "def log_loss(p_hat, y, epsilon=1e-7):\n", " \n", " # Begin by calculating the two possibilities for the cost function, i.e.\n", " # 1: -log(p_hat + epsilon), and 2: -log(1- p_hat). We added an epsilon term \n", " # to -log(p_hat) because we can run into mathematical problems if p_hat = 0.\n", " term_1 = -np.___( _____ + _____ )\n", " term_2 = -np.___( 1 - ____ )\n", " \n", " # We can almost calculate J! We'll need to 1) multiply term_1 by y, and \n", " # 2) multiply term_2 by (1-y). We then add the new terms together.\n", " # Calculate the value of the cost function (i.e., what's inside the brackets)\n", " inside_brackets = (__) * term_1 + ( ___ - ___ ) * term_2\n", "\n", " #Verify the shape of inside_brackets. \n", " print(f'The size of the term inside the brackets is {inside_brackets.shape}')\n", "\n", " # You should have a cost value for each one of your predictions. We won't\n", " # use the individual values, though. We'll aggregate the information from\n", " # all our predictions by calculating the mean! (i.e., 1/n_terms * terms_sum)\n", " # This single value is J\n", " J = _____.mean()\n", "\n", " return J" ] }, { "cell_type": "markdown", "metadata": { "id": "aO4Bkm1gFV3C" }, "source": [ "We now have a way of quantifying how good our predictions are. The final thing needed for us to train our algorithm is figuring out a way to update the parameters in a way that improves the average quality of our predictions. \n", "\n", "
> **Warning**: we'll go into a bit of math below.
\n", "\n", "Let's look at the change in a single parameter within $\\theta$: $\\beta_1$ (given $X_{1,i} = X_1$, $\\;\\hat{p}_{i} = \\hat{p}$, $\\;y_{i} = y$). If we want to know what the effect of changing the value of $\\beta_1$ will have on the log loss function we can find this with the partial derivative:\n", "
$\n", " \\dfrac{\\partial J}{\\partial \\beta_1}\n", "$
\n", "\n", "This may not seem very helpful by itself - after all, $\\beta_1$ isn't even in the expression of $J$. But if we use the chain rule, we can rewrite the expression as:\n", "
\n", " $\\dfrac{\\partial J}{\\partial \\hat{p}} \\cdot\n", " \\dfrac{\\partial \\hat{p}}{\\partial \\theta} \\cdot\n", " \\dfrac{\\partial \\theta}{\\partial \\beta_1}$\n", "
\n", "\n", "We'll spare you the math (feel free to verify it youself, however!):\n", "\n", "
$\\dfrac{\\partial J}{\\partial \\hat{p}} = \\dfrac{\\hat{p} - y}{\\hat{p}(1-\\hat{p})}, \\quad\n", " \\dfrac{\\partial \\hat{p}}{\\partial \\theta} = \\hat{p} (1-\\hat{p}), \\quad\n", " \\dfrac{\\partial \\theta}{\\partial \\beta_1} = X_1 $\n", "
\n", "\n", "and thus \n", "
$\n", " \\dfrac{\\partial J}{\\partial \\beta_1} = (\\hat{p} - y) \\cdot X_1\n", "$
\n", "\n", "We can calculate the partial derivative for each parameter in $\\theta$ which, as you may have realized, is simply the $\\theta$ gradient of $J$: $\\nabla_{\\theta}(J)$\n", "\n", "With all of this information, we can now write $\\nabla_{\\theta} J$ in terms of the error, the feature vector, and the number of samples we're training on!\n", "\n", "\n", "\n", "
$\\nabla_{\\mathbf{\\theta}^{(k)}} \\, J(\\mathbf{\\theta^{(k)}}) = \\dfrac{1}{n} \\sum\\limits_{i=1}^{n}{ \\left ( \\hat{p}^{(k)}_{i} - y_{i} \\right ) \\mathbf{X}_{i}}$
\n", "\n", "Note that here $k$ represents the iteration of the parameters we are currently on.\n", "\n", "We now have a gradient we can calculate and use in the batch gradient descent method! The updated parameters will thus be:\n", "\n", "\n", "\n", "\\begin{align} \n", "{\\mathbf{\\theta}^{(k+1)}} = {\\mathbf{\\theta}^{(k)}} - \\eta\\,\\nabla_{\\theta^{(k)}}J(\\theta^{(k)})\n", "\\end{align}\n", "\n", "Where $\\eta$ is the learning rate parameter. It's also worth pointing out that $\\;\\hat{p}^{(k)}_i = \\sigma\\left(\\theta^{(k)}, X_i\\right) $" ] }, { "cell_type": "markdown", "metadata": { "id": "ML4uik7sbdMZ" }, "source": [ "In order to easily calculate the input to the logistic regression, we'll multiply the $\\theta$ vector with the X data, and as we have a non-zero bias $\\beta_0$ we'd like to have an X matrix whose first column is filled with ones.\n", "\n", "\\begin{align}\n", " X_{\\small{with\\ bias}} = \\begin{pmatrix}\n", " 1 & X_{1,0} & X_{2,0}\\\\\n", " 1 & X_{1,1} & X_{2,1}\\\\\n", " &...&\\\\\n", " 1 & X_{1,n} & X_{2,n} \n", " \\end{pmatrix}\n", "\\end{align}\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "sqwV5qgrisB-" }, "source": [ "## **Q6) Prepare the `X_with_bias` matrix.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PlipJo530lPp" }, "outputs": [], "source": [ "#@title Hints: Making an an array filled with ones, hints on concatenation\n", "\n", "'''\n", "Making the ones array\n", "\n", "Making an array with ones and the same number of entries as rows in your input \n", "data: You can use numpy.ones( (array_dimensions) ) in order to generate an array with\n", "the given array_dimensions shape. e.g., np.ones((4,)) => array([1,1,1,1])\n", "\n", "Accessing the number of rows: dataframes have the \"shape\" attribute implemented.\n", "For our penguin data, the input vector shape should be (274,2), and so using\n", "shape[0] should return the right length for our ones array\n", "''';\n", "\n", "\n", "'''\n", "Concatenation\n", "\n", "You can quickly concatenate your arrays using np.c_[array1,array2]. Note that\n", "the order matters, so make sure array1 is the array filled with ones :). Also,\n", "np.c_ uses square brackets! [] - you'll get an error if you use regular \n", "brackets ().\n", "\n", "numpy.c_ will automagically understand that the second array is a dataframe - \n", "you don't need to worry about transforming it into a numpy array for today!\n", "\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A_-GIW9cCTFn" }, "outputs": [], "source": [ "# Generate the ones array\n", "ones_array = _______._______(_______.______[___])\n", "\n", "# Make the x_with_bias matrix\n", "x_with_bias = ______._______(_______,_______)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QiqtcQj5IIH6" }, "outputs": [], "source": [ "# Print your x with bias matrix to make sure it looks the way it's supposed to\n", "print(X_with_bias[:10])" ] }, { "cell_type": "markdown", "metadata": { "id": "wN1zdxPLkhO3" }, "source": [ "Our X_with_bias matrix looks like this: \\\\\n", "[[ 1. $\\quad$ -0.69346042 $\\quad$ 0.92572752] \\\\\n", " [ 1. $\\quad$ -0.6164717 $\\quad$ 0.28005659] \\\\\n", " [ 1. $\\quad$ -0.46249427 $\\quad$ 0.57805856] \\\\\n", " [ 1. $\\quad$ -1.15539273 $\\quad$ 1.22372949] \\\\\n", " [ 1. $\\quad$ -0.65496606 $\\quad$ 1.86940041] \\\\\n", " [ 1. $\\quad$ -0.73195478 $\\quad$ 0.47872457] \\\\\n", " [ 1. $\\quad$ -0.67421324 $\\quad$ 1.37273047] \\\\\n", " [ 1. $\\quad$ -1.65581939 $\\quad$ 0.62772555] \\\\\n", " [ 1. $\\quad$ -0.13529222 $\\quad$ 1.67073244] \\\\\n", " [ 1. 
$\\quad$ -0.94367375 $\\quad$ 0.13105561]]" ] }, { "cell_type": "markdown", "metadata": { "id": "cThmFWcB0v-a" }, "source": [ "## **Q7) Write a function called `predict` that takes in the parameter vector $\\theta$ and the `X_with_bias` matrix and evaluates the logistic function for each of the samples.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "opqgiTM2L0jt" }, "outputs": [], "source": [ "#@title Hint: Pseudocode Snippet\n", "\n", "'''\n", "Pseudocode below:\n", "\n", "define predict_function(x_with_bias, theta_vector):\n", " argument_for_logistic_function = dot_product(x_with_bias, theta_vector)\n", " return logistic_function(argument_for_logistic_function)\n", "\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tBLryApsbatR" }, "outputs": [], "source": [ "# Write your predict function here\n", "def predict_function(____, ____):\n", " # Find the dot product of X_with_bias and theta\n", " dot_product = _______._______(_______,_______)\n", "\n", " # Use your logistic function!\n", " output = _______(_______)\n", "\n", " return _____ # Return the value you get" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6QIweEz8OKPO" }, "outputs": [], "source": [ "# Let's test your predict function!\n", "\n", "# Set up debug data and parameters\n", "debug_data = np.c_[np.ones(5), np.linspace(-1,1,10).reshape((-1,2))]\n", "debug_theta = np.array([0.2,0.1,0.9]) \n", "\n", "print(predict_function(debug_data, debug_theta))" ] }, { "cell_type": "markdown", "metadata": { "id": "HQ6oK6ewOcHO" }, "source": [ "If everything is set up correctly and you didn't change the debug data and theta, the output for your predict function should be:\n", "\n", "`[0.35434369 0.46118934 0.57172409 0.67553632 0.76454801]`" ] }, { "cell_type": "markdown", "metadata": { "id": "p6cPbu4LvVES" }, "source": [ "## **Q8) Now that you have a `predict` function, write a `gradient_calc` function that calculates the gradient for the logistic function.**\n", "\n", "*Hint: You'll have to feed `theta`, `X`, and `y` to the `gradient_calc` function.*\n", "\n", "*Hint: You can use [this equation](#grad_eq) to calculate the gradient of the cost function.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "XRQNz-2nPGVZ" }, "outputs": [], "source": [ "#@title Hint: Pseudocode Snippet\n", "\n", "'''\n", "\n", "define gradient_calculator_function(y, X_with_bias, theta_vector):\n", " # predicted values using theta and inputs\n", " prediction = predict(x_with_bias,theta_vector)\n", " \n", " number_of_predictions = len(prediction)\n", "\n", " assert number_of_predictions == len(y)\n", "\n", " error = prediction - y\n", "\n", " X_transpose = transpose(X)\n", "\n", " return dot_product(X_transpose, error) / number_of_predictions\n", "\n", "''';\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BtnANN5WvVuy" }, "outputs": [], "source": [ "def gradient_calculator(_______, _______, _______):\n", " # Find predicted values using the predict function\n", " prediction = _______(_______, _______)\n", "\n", " # Assert that you have the same number of predictions as you do targets\n", " # Otherwise, something went wrong!\n", " assert len(prediction) == __________\n", "\n", " # Calculate the error\n", " error = _______ - _______\n", "\n", " # Find the dot product with the input matrix and divide by the number of \n", " # predictions\n", " output = \n", " return output" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": { "id": "sMs0pmFcVo1h" }, "outputs": [], "source": [ "# Let's test the gradient calculator\n", "# Begin by creating dummy labels\n", "debug_labels = np.array([0,0,0,1,1])\n", "\n", "# And call the function you defined with the dummy labels and data we made before\n", "print(gradient_calculator(debug_labels, debug_data, debug_theta))" ] }, { "cell_type": "markdown", "metadata": { "id": "0wc43bGxV-en" }, "source": [ "If you kept the same dummy data we included by default in the notebook, you should get `[ 0.16546829 -0.19307376 -0.15630302]` as the output of your gradient calculator! 💻" ] }, { "cell_type": "markdown", "metadata": { "id": "PU4A5HVKuAGG" }, "source": [ "We can now write a function that will train a logistic regression algorithm!\n", "\n", "Your `logistic_regression` function needs to:\n", "* Take in a set of training input/output data, validation input/output data, a number of iterations to train for, a set of initial parameters $\\theta$, and a learning rate $\\eta$\n", "* At each iteration:\n", " * Generate a set of predictions on the training data. Hint: You may use your function `predict` on inputs `X_train` from the training set.\n", " * Calculate and store the loss function for the training data at each iteration. Hint: You may use your function `log_loss` on inputs `X_train` and outputs `y_train` from the training set.\n", " * Calculate the gradient. Hint: You may use your function `grad_calc`.\n", " * Update the $\\theta$ parameters. Hint: You need to implement [this equation](#grad_descent).\n", " * Generate a set of predictions on the validation data using the updated parameters. Hint: You may use your function `predict` on inputs `X_valid` from the validation set. \n", " * Calculate and store the loss function for the validation data. Hint: You may use your function `log_loss` on inputs `X_valid` and outputs `y_valid` from the validation set. 
\n", " * Bonus: Calculate and store the accuracy of the model on the training and validation data as a metric!\n", "* Return the final set of parameters $\\theta$ & the stored training/validation loss function values (and the accuracy, if you did the bonus)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "182qtGB7i_vm" }, "source": [ "## **Q9) Write the `logistic_regression` function**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "eNfODgtZYm2V" }, "outputs": [], "source": [ "#@title Hint: Pseudocode Snippet\n", "\n", "'''\n", "define logistic_regression(\n", " X_train,\n", " y_train,\n", " X_validation,\n", " y_validation,\n", " theta_vector,\n", " number_of_iterations,\n", " learning_rate_eta,\n", " ):\n", " #initialize the list of losses\n", " training_losses = list()\n", " validation_losses = list()\n", "\n", " for iteration in range(number_of_iterations):\n", " train_set_predictions = predict(X_train, theta_vector)\n", " train_loss = log_loss(train_set_predictions, y_train)\n", " training_losses.append(train_loss)\n", "\n", " gradient = gradient_calculator(y_train, X_train, theta_vector)\n", " theta_vector = theta_vector - gradient * learning_rate_eta\n", "\n", " validation_set_predictions = predict(X_validation, theta_vector)\n", " validation_loss = log_loss(validation_set_predictions, y_validation)\n", " validation_losses.append(validation_loss)\n", "\n", " print(Completed (iteration)/(number_of_iterations)*100%)\n", "\n", " return [training_losses, validation_losses], theta\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HDsR5TxPt-0Y" }, "outputs": [], "source": [ "def logistic_regression(_______,\n", " _______,\n", " _______,\n", " _______,\n", " _______,\n", " num_iters,\n", " _______,\n", " ):\n", " # Initialize the list of losses\n", " training_losses = _______\n", " validation_losses = _______\n", " \n", " # Loop through as many times as defined in the function call\n", " for iteration in _______(_______):\n", " \n", " #--------Training-------\n", " # Get predictions on training dataset\n", " _______ = _______(_______, _______)\n", " \n", " # Calculate the loss\n", " _______ = _______(_______, _______)\n", "\n", " # Add it to the list of training losses to keep track of it\n", " training_losses._______(_______)\n", " \n", " # Calculate the Gradient\n", " _______ = _______(_______, _______, _______)\n", " \n", " # Find the new value of theta\n", " _______ = _______ - _______ * _______\n", "\n", " #--------Validation-----------\n", " # Get predictions on the validation dataset\n", " _______ = _______(_______, _______)\n", "\n", " # Calculate the validation loss\n", " _______ = _______(_______, _______)\n", "\n", " # Add it to the list of validation losses to keep track of it\n", " validation_losses._______(_______)\n", " \n", " # Progress Indicator\n", " if (iteration/num_iters * 100) % 5 == 0:\n", " print(f'\\rCompleted {(iteration)/(num_iter)*100}%', end='')\n", " \n", " print('\\rCompleted 100%')\n", " return [_______, _______], _______" ] }, { "cell_type": "markdown", "metadata": { "id": "EWMDLk7wFB0f" }, "source": [ "**¡¡¡Important Note!!!**\n", "\n", "The notebook assumes that you will return \n", "1. a Losses list, where Losses[0] is the training loss and Losses[1] is the validation loss\n", "2. 
an array with the 3 final coefficients ($\\beta_0$, $\\beta_1$, $\\beta_2$)\n", "\n", "---------------------" ] }, { "cell_type": "markdown", "metadata": { "id": "2ep5FQYBmqG5" }, "source": [ "Now that we have our logistic regression function, we're all set to train our algorithm! Or _are_ we?\n", "\n", "There's an **important** data step that we've neglected up to this point - we need to **split the data** into train, validation, and test datasets.\n", "\n", "
train ✂️ validation ✂️ test
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CVrXzjYA2iil" }, "outputs": [], "source": [ "test_ratio = 0.2\n", "validation_ratio = 0.2\n", "total_size = len(X_with_bias)\n", "\n", "test_size = int(total_size * test_ratio)\n", "validation_size = int(total_size * validation_ratio)\n", "train_size = total_size - test_size - validation_size\n", "\n", "rnd_indices = rnd_gen.permutation(total_size)\n", "\n", "X_train = X_with_bias[rnd_indices[:train_size]]\n", "y_train = y.iloc[rnd_indices[:train_size]]\n", "X_valid = X_with_bias[rnd_indices[train_size:-test_size]]\n", "y_valid = y.iloc[rnd_indices[train_size:-test_size]]\n", "X_test = X_with_bias[rnd_indices[-test_size:]]\n", "y_test = y.iloc[rnd_indices[-test_size:]]" ] }, { "cell_type": "markdown", "metadata": { "id": "33IhRpME8LOX" }, "source": [ "Now we're ready! \n", "\n", "## **Q10) Train your logistic regression algorithm. We recommend you use 500 iterations, $\\eta$=0.1**\n", "\n", "*Hint: It's time to use the `logistic_regression` function you defined in Q5.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dWAr0ORYEYi2" }, "outputs": [], "source": [ "# Complete the code\n", "losses, coeffs = ________(_______,\n", " _______,\n", " _______,\n", " _______,\n", " _______, \n", " _______,\n", " _______,\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "e7WHcpPiEcIS" }, "source": [ "Let's see how our model did while learning!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "T_ImydMTKkfh" }, "outputs": [], "source": [ "#@title Run this cell to produce the Loss Function Visualization Graphs\n", "fig, ax = plt.subplots(figsize=(9,6), dpi=100)\n", "ax.plot(losses[0], color='blue', label='Training', linewidth=3);\n", "ax.plot(losses[1], color='black', label='Validation', linewidth=3);\n", "ax.legend();\n", "ax.set_ylabel('Log Loss')\n", "ax.set_xlabel('Iterations')\n", "ax.set_title('Loss Function Graph')\n", "ax.autoscale(axis='x', tight=True)\n", "fig.tight_layout();" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4wXFzZPjFjOn" }, "outputs": [], "source": [ "# Let's get predictions from our model for the training, validation, and testing\n", "# datasets\n", "y_hat_train = (predict(X_train, coeffs)>=.5).astype(int)\n", "y_hat_valid = (predict(X_valid, coeffs)>=.5).astype(int)\n", "y_hat_test = (predict(X_test, coeffs)>=.5).astype(int)\n", "\n", "y_sets = [ [y_hat_train, y_train],\n", " [y_hat_valid, y_valid],\n", " [y_hat_test, y_test] ]\n", "\n", "def accuracy_score(y_hat, y):\n", " assert(y_hat.size==y.size)\n", " return (y_hat == y).sum()/y.size\n", "accuracies=[]\n", "[accuracies.append(accuracy_score(y_set[0],y_set[1])) for y_set in y_sets]\n", "\n", "printout= (f'Training Accuracy:{accuracies[0]:.1%} \\n'\n", " f'Validation Accuracy:{accuracies[1]:.1%} \\n'\n", " f'Test Accuracy:{accuracies[2]:.1%} \\n')\n", "\n", "# Add the testing accuracy only once you're sure that your model works!\n", "print(printout)" ] }, { "cell_type": "markdown", "metadata": { "id": "4zfXs8M8Osie" }, "source": [ "Congratulations on training a logistic regression algorithm from scratch! \n", "\n", "Your loss function graph should look something similar to this...\n", "\n", "\n", "And the accuracies we got during development of the notebook are:\n", "\n", "`Training Accuracy:99.4%`
\n", "`Validation Accuracy:100.0%`
\n", "`Test Accuracy:100.0% `\n", "\n", "Once you're done with the upcoming environmental science applications notebook, feel free to come back to take a look at the challenges 😀" ] }, { "cell_type": "markdown", "metadata": { "id": "VAa4bzT7PHRG" }, "source": [ "## Challenges\n", "\n", "* **C1)** Add more features to try to improve our accuracies! \n", "\n", "* **C2)** Add early stopping to the training algorithm! (e.g., stop training when the accuracy is greater than a target accuracy)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }