{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "collapsed_sections": [ "0-WlA6efBRki" ], "toc_visible": true, "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# Dimensionality Reduction" ], "metadata": { "id": "fab2zKXwAinB" } }, { "cell_type": "markdown", "source": [ "\n", "
Caption: Denise diagnoses an overheated CPU at our data center in The Dalles, Oregon.
For more than a decade, we have built some of the world's most efficient servers.

Photo from the Google Data Center gallery
" ], "metadata": { "id": "y7Q5WigQxsVd" } }, { "cell_type": "markdown", "source": [ "*Our world is increasingly filled with data from all sorts of sources, including environmental data. Can we project the data onto a smaller, meaningful space to save on computation time and increase explainability?*" ], "metadata": { "id": "XGGHmOj1ygXe" } }, { "cell_type": "markdown", "source": [ "This notebook will be used in the lab session for week 4 of the course, covers Chapter 8 of Géron, and builds on the [notebooks made available on _GitHub_](https://github.com/ageron/handson-ml2).\n", "\n", "Need a reminder of last week's labs? Click [_here_](https://colab.research.google.com/github/tbeucler/2022_ML_Earth_Env_Sci/blob/main/Lab_Notebooks/Week_3_Decision_Trees_Random_Forests_SVMs.ipynb) to go to the notebook for week 3 of the course." ], "metadata": { "id": "AlTDG-57-aAI" } }, { "cell_type": "markdown", "source": [ "**Notebook Setup**\n", "\n", "First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20." 
], "metadata": { "id": "0-WlA6efBRki" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zw6fcA3O-Uls" }, "outputs": [], "source": [ "# Python ≥3.5 is required\n", "import sys\n", "assert sys.version_info >= (3, 5)\n", "\n", "# Scikit-Learn ≥0.20 is required\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.20\"\n", "\n", "# Common imports\n", "import numpy as np\n", "import os\n", "\n", "# to make this notebook's output stable across runs\n", "rnd_seed = 42\n", "rnd_gen = np.random.default_rng(rnd_seed)\n", "\n", "# To plot pretty figures\n", "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "mpl.rc('axes', labelsize=14)\n", "mpl.rc('xtick', labelsize=12)\n", "mpl.rc('ytick', labelsize=12)\n", "\n", "# Where to save the figures\n", "PROJECT_ROOT_DIR = \".\"\n", "CHAPTER_ID = \"dim_reduction\"\n", "IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n", "os.makedirs(IMAGES_PATH, exist_ok=True)\n", "\n", "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n", "    path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n", "    print(\"Saving figure\", fig_id)\n", "    if tight_layout:\n", "        plt.tight_layout()\n", "    plt.savefig(path, format=fig_extension, dpi=resolution)" ] }, { "cell_type": "markdown", "source": [ "Dimensionality Reduction using PCA\n", "\n", "This week we'll be looking at how to reduce the dimensionality of a large dataset in order to improve our classification algorithm's performance! With that in mind, let's begin the exercise by loading the MNIST dataset.\n", "\n", "## Q1) Load the input features and truth variable into X and y, then split the data into a training and test dataset using scikit's train_test_split method. 
Use *test_size=0.15*, and remember to set the random state to *rnd_seed!*\n", "\n", "*Hint 1: The `'data'` and `'target'` keys for mnist will return X and y.*\n", "\n", "*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for train/test split.*" ], "metadata": { "id": "H3QU33M3D--N" } }, { "cell_type": "code", "source": [ "# Load the mnist dataset\n", "from sklearn.datasets import fetch_openml\n", "mnist = fetch_openml('mnist_784', version=1, as_frame=False)" ], "metadata": { "id": "H9slNfR3D-kg" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Read in the mnist digit images and corresponding numbers\n", "# ---------------------------------------------------------------------------\n", "# The procedure here is similar to the notebooks we did last week. Use Hint 1 to store the input and target data.\n", "X = _____[____]\n", "y = _____[____]" ], "metadata": { "id": "zNcNkJ3u92cW" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Import train_test_split() to create your training and test sets\n", "# ---------------------------------------------------------------------------\n", "from ________._______ import __________\n", "\n", "# Now separate your X and y into training and test sets (use train_test_split)\n", "# ---------------------------------------------------------------------------\n", "_____,_____,____,____ = __________(_,_,___=___,_____=____) # (data,target,test_size,random_state)" ], "metadata": { "id": "yOmYNwuT920P" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We once again have a training and a test dataset to work with. Let's try training a random forest classifier on it. 
You've had experience with them before, so let's have you import the `RandomForestClassifier` from sklearn and instantiate it.\n", "\n", "## Q2) Import the `RandomForestClassifier` model from sklearn. Then, instantiate it with 100 estimators and set the random state to *rnd_seed!*\n", "\n", "*Hint 1: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for `RandomForestClassifier`*\n", "\n", "*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for train/test split.*\n", "\n", "*Hint 3: If you're still confused about **instantiation**, there's a [blurb on Wikipedia](https://en.wikipedia.org/wiki/Instance_(computer_science)) describing it in the context of computer science.*" ], "metadata": { "id": "EhBQOdVxfr2U" } }, { "cell_type": "code", "source": [ "# Import RandomForestClassifier here.\n", "# ---------------------------------------------------------------------------\n", "from sklearn.______ import _______" ], "metadata": { "id": "ZZaWwNGUg9Qb" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Here we instantiate an RF classifier object with custom settings: 100 estimators, random_state=rnd_seed\n", "# ------------------------------------------------------------------------------------------------------\n", "rnd_clf = _____(______=______, #Number of estimators\n", " ______=______) #Random State" ], "metadata": { "id": "qJc0deCO-Ibt" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We're now going to measure how quickly the algorithm is fitted to the mnist dataset! To do this, we'll have to import the `time` library. 
With it, we'll be able to get a timestamp immediately before and after we fit the algorithm, and we'll get the time by calculating the difference.\n", "\n", "## Q3) Import the time library and calculate how long it takes to fit the `RandomForestClassifier` model.\n", "\n", "*Hint 1: [Here's the documentation](https://docs.python.org/3/library/time.html#time.time) to the function used for getting timestamps*\n", "\n", "*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit) for the fitting method used in `RandomForestClassifier`.*" ], "metadata": { "id": "gi1HTS-KjUJ8" } }, { "cell_type": "code", "source": [ "import time" ], "metadata": { "id": "EZaQPn2XkV06" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Use a function in time (check documentation) to load **current** time before training the RF classifier\n", "# ------------------------------------------------------------------------------------------------------\n", "t0 = _____._____()\n", "\n", "# Train the RF classifier\n", "# ------------------------------------------------------------------------------------------------------\n", "rnd_clf.___(_____, _____)\n", "\n", "# Use the same function for t0 to load **current** time **after** training the RF classifier\n", "# ------------------------------------------------------------------------------------------------------\n", "t1 = _____._____()" ], "metadata": { "id": "B4jPNCXl-OIM" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Run this as is, how many seconds did it take to train the classifier?\n", "# ------------------------------------------------------------------------------------------------------\n", "train_t_rf = t1-t0\n", "\n", "print(f\"Training took {train_t_rf:.2f}s\")" ], "metadata": { "id": "LFuLLVWj-PXZ", "colab": { "base_uri": "https://localhost:8080/" }, 
"outputId": "0b9a10bc-6fc1-4b02-f5e2-9386acd2ef90" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Training took 53.15s\n" ] } ] }, { "cell_type": "markdown", "source": [ "We care about more than just how long it took to train the model, however! Let's get an accuracy score for our model.\n", "\n", "## Q4) Get an accuracy score for the predictions from the RandomForestClassifier\n", "\n", "*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) for the `accuracy_score` metric in sklearn.*\n", "\n", "*Hint 2: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict) for the predict method in `RandomForestClassifier`*" ], "metadata": { "id": "X0-hEhlOnLqh" } }, { "cell_type": "code", "source": [ "# Import the accuracy score metric in scikit-learn (check Hint 1 for ideas on how to import metrics)\n", "# ------------------------------------------------------------------------------------------------------\n", "from _____._____ import _____" ], "metadata": { "id": "lscBW_sFnLVS" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Now try to use the trained classifier to generate predictions for the unseen test set (X_test)\n", "# ------------------------------------------------------------------------------------------------------\n", "y_pred = _____._____(_____)" ], "metadata": { "id": "x-93C_-n-cle" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Use the accuracy_score() metric on y_test and y_pred to evaluate the accuracy of our model\n", "# ------------------------------------------------------------------------------------------------------\n", "rf_accuracy = accuracy_score(_____, _____)\n", "\n", "# Run this as is. We got an accuracy of 96.7%. 
Did you get similar scores?\n", "# ------------------------------------------------------------------------------------------------------\n", "print(f\"RF Model Accuracy: {rf_accuracy:.2%}\")" ], "metadata": { "id": "n09PnHuy-cTf" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Let's try doing the same with a logistic regression algorithm to see how it compares.\n", "\n", "## Q5) Repeat Q2-4 with a logistic regression algorithm using sklearn's `LogisticRegression` class. Hyperparameters: `multi_class='multinomial'` and `solver='lbfgs'`\n", "\n", "*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the `LogisticRegression` class.*" ], "metadata": { "id": "XEZX7xBAHJj9" } }, { "cell_type": "code", "source": [ "# Import LogisticRegression class here.\n", "# ---------------------------------------------------------------------------\n", "from _____._____ import _____" ], "metadata": { "id": "kwX8ZwzQI6p6" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Instantiate a LogisticRegression object with custom hyperparameters\n", "# ---------------------------------------------------------------------------\n", "log_clf = _____(_____=\"multinomial\", #Multiclass\n", " _____=\"lbfgs\", #Solver\n", " _____=42) #Random State" ], "metadata": { "id": "CvUwrxtS-mTf" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Timestamp for **current** time before training the LogisticRegression classifier\n", "# ------------------------------------------------------------------------------------------------------\n", "t0 = time.time()\n", "\n", "# Training the LogisticRegression classifier\n", "# ------------------------------------------------------------------------------------------------------\n", "log_clf.fit(_____, _____)\n", "\n", "# Timestamp for **current** time after training the 
LogisticRegression classifier\n", "# ------------------------------------------------------------------------------------------------------\n", "t1 = time.time()\n", "\n", "# Run this as is, how many seconds did it take to train the classifier?\n", "# ------------------------------------------------------------------------------------------------------\n", "train_t_log = t1-t0\n", "print(f\"Training took {train_t_log:.2f}s\")" ], "metadata": { "id": "F6Dr9j1T-mgz" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Now try to use the trained classifier to generate predictions for the unseen test set (X_test)\n", "# ------------------------------------------------------------------------------------------------------\n", "y_pred = _____._____(_____)\n", "\n", "# Run this as is. We got an accuracy of 92.1%. Did you get similar scores?\n", "# ------------------------------------------------------------------------------------------------------\n", "log_accuracy = accuracy_score(_____, _____) # Feed in the truth and predictions\n", "\n", "print(f\"Log Model Accuracy: {log_accuracy:.2%}\")" ], "metadata": { "id": "Armw_a0V-mAs" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Up to now, everything we've done has been covered in previous labs - but now we'll get to try out some algorithms useful for reducing dimensionality! Let's use principal component analysis. 
Here, we'll reduce the space using enough axes to explain over 95% of the variability in the data...\n", "\n", "## Q6) Import scikit's implementation of `PCA` and fit it to the training dataset so that 95% of the variability is explained.\n", "\n", "*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) for scikit's `PCA` class.*\n", "\n", "*Hint 2: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform) for scikit's `.fit_transform()` method.*" ], "metadata": { "id": "b_5XiaQfJ5NV" } }, { "cell_type": "code", "source": [ "# Here we will experiment a bit with reducing the dimensionality of the mnist data.\n", "# First, import the PCA class from scikit-learn\n", "# ------------------------------------------------------------------------------------------------------\n", "from _____._____ import _____ # Importing PCA" ], "metadata": { "id": "rrP5043rJc-1" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# We will now instantiate the PCA algorithm, with a custom hyperparameter to keep only a certain number of principal components\n", "# In the documentation, search for the keywords \"numbers ... 
to keep\"\n", "# ---------------------------------------------------------------------------------------------------------------------------\n", "pca = PCA(_____=_____) # Set number of components to explain 95% of variability" ], "metadata": { "id": "UZAeoAlI_Ok9" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Fit the PCA model and use it to transform our training data (reducing data dimensionality) [fit_transform]\n", "# ---------------------------------------------------------------------------------------------------------------------------\n", "X_train_reduced = pca._____(____) # Fit-transform the training data" ], "metadata": { "id": "b3FHiYMA_OwR" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Transform our test data (reducing data dimensionality) with the pca algorithm (do not fit the model again!)\n", "# ---------------------------------------------------------------------------------------------------------------------------\n", "X_test_reduced = pca._____(____)" ], "metadata": { "id": "zydXZOAV_T1U" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Q7) Repeat Q3 & Q4 using the *reduced* `X_train` dataset instead of `X_train`." 
], "metadata": { "id": "mKXeXWn4M8K1" } }, { "cell_type": "code", "source": [ "# Load current time step, train RF classifier with X_train_reduced, load time step after training\n", "# ------------------------------------------------------------------------------------------------------\n", "t0 = _____._____() # Load the timestamp before running\n", "rnd_clf.___(_____, _____) # Fit the model with the reduced training data\n", "t1 = _____._____() # Load the timestamp after running\n", "\n", "# How many seconds did it take to train the model?\n", "# ------------------------------------------------------------------------------------------------------\n", "train_t_rf = t1-t0\n", "print(f\"Training took {train_t_rf:.2f}s\")" ], "metadata": { "id": "m1oZFFfljH0N" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Use trained classifier to generate predictions from the **reduced** test set (X_test_reduced)\n", "# ------------------------------------------------------------------------------------------------------\n", "y_pred = _____._____(_____)\n", "\n", "# Use accuracy_score to compare truth and prediction. We got 94.7% accuracy.\n", "# ------------------------------------------------------------------------------------------------------\n", "red_rf_accuracy = accuracy_score(_____, _____) # Feed in the truth and predictions\n", "print(f\"RF Model Accuracy on reduced dataset: {red_rf_accuracy:.2%}\")" ], "metadata": { "id": "jNisAXlgnUMe" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Q8) Repeat Q5 using the *reduced* X_train dataset instead of X_train." 
], "metadata": { "id": "46j-guE8NStk" } }, { "cell_type": "code", "source": [ "# Load current time step, train LogisticRegression with X_train_reduced, load time step after training\n", "# ------------------------------------------------------------------------------------------------------\n", "t0 = time.time() # Timestamp before training\n", "log_clf.fit(_____, _____) # Fit the model with the reduced training data\n", "t1 = time.time() # Timestamp after training\n", "\n", "# How many seconds did it take to train the model?\n", "# ------------------------------------------------------------------------------------------------------\n", "train_t_log = t1-t0\n", "print(f\"Training took {train_t_log:.2f}s\")" ], "metadata": { "id": "JerFiDoKMpAx" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Use trained classifier to generate predictions from the **reduced** test set (X_test_reduced)\n", "# ------------------------------------------------------------------------------------------------------\n", "y_pred = _____._____(_____) # Get a set of predictions from the test set\n", "\n", "\n", "# Use accuracy_score to compare truth and prediction. We got 91.38% accuracy.\n", "# ------------------------------------------------------------------------------------------------------\n", "log_accuracy = accuracy_score(_____, _____) # Feed in the truth and predictions\n", "print(f\"Log Model Accuracy on reduced training data: {log_accuracy:.2%}\")" ], "metadata": { "id": "R3Pc9LRK_f4I" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "You can now compare how well the random forest classifier and logistic regression classifier performed on both the full dataset and the reduced dataset. What were you able to observe?" 
], "metadata": { "id": "_P_-tnZstz99" } }, { "cell_type": "markdown", "source": [ "Write your comments on the performance of the algorithms in this box, if you'd like 😀\n", "(Double click to activate editing mode)" ], "metadata": { "id": "6AFlS89UuZTy" } } ] }