{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": [
        "0-WlA6efBRki"
      ],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/tbeucler/2023_MLEES_JB/blob/main/ML_EES/ML/S3_1_Dimensionality.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Dimensionality Reduction"
      ],
      "metadata": {
        "id": "fab2zKXwAinB"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "<img src='https://unils-my.sharepoint.com/:i:/g/personal/tom_beucler_unil_ch/EX7KlNGWYypLnH_53OnJR6oBjfgb_gCZ4gmnOeR68a6zMA?download=1'>\n",
        "<center> Caption: <i>Denise diagnoses an overheated CPU at our data center in The Dalles, Oregon. <br> For more than a decade, we have built some of the world's most efficient servers.</i> <br> Photo from the <a href='https://www.google.com/about/datacenters/gallery/'>Google Data Center gallery</a> </center>"
      ],
      "metadata": {
        "id": "y7Q5WigQxsVd"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "*Our world is increasingly filled with data from all sorts of sources, including environmental data. Can we project these data onto a smaller, meaningful space to save computation time and improve explainability?*"
      ],
      "metadata": {
        "id": "XGGHmOj1ygXe"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "This notebook will be used in the lab session for week 4 of the course. It covers Chapter 8 of Géron and builds on the [notebooks made available on _GitHub_](https://github.com/ageron/handson-ml2).\n",
        "\n",
        "Need a reminder of last week's labs? Click [_here_](https://colab.research.google.com/github/tbeucler/2022_ML_Earth_Env_Sci/blob/main/Lab_Notebooks/Week_3_Decision_Trees_Random_Forests_SVMs.ipynb) to go to the notebook for week 3 of the course."
      ],
      "metadata": {
        "id": "AlTDG-57-aAI"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Notebook Setup**\n",
        "\n",
        "First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20."
      ],
      "metadata": {
        "id": "0-WlA6efBRki"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zw6fcA3O-Uls"
      },
      "outputs": [],
      "source": [
        "# Python ≥3.5 is required\n",
        "import sys\n",
        "assert sys.version_info >= (3, 5)\n",
        "\n",
        "# Scikit-Learn ≥0.20 is required (compare parsed versions, not raw strings)\n",
        "import sklearn\n",
        "from packaging import version\n",
        "assert version.parse(sklearn.__version__) >= version.parse(\"0.20\")\n",
        "\n",
        "# Common imports\n",
        "import numpy as np\n",
        "import os\n",
        "\n",
        "# to make this notebook's output stable across runs\n",
        "rnd_seed = 42\n",
        "rnd_gen = np.random.default_rng(rnd_seed)\n",
        "\n",
        "# To plot pretty figures\n",
        "%matplotlib inline\n",
        "import matplotlib as mpl\n",
        "import matplotlib.pyplot as plt\n",
        "mpl.rc('axes', labelsize=14)\n",
        "mpl.rc('xtick', labelsize=12)\n",
        "mpl.rc('ytick', labelsize=12)\n",
        "\n",
        "# Where to save the figures\n",
        "PROJECT_ROOT_DIR = \".\"\n",
        "CHAPTER_ID = \"dim_reduction\"\n",
        "IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n",
        "os.makedirs(IMAGES_PATH, exist_ok=True)\n",
        "\n",
        "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n",
        "    path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n",
        "    print(\"Saving figure\", fig_id)\n",
        "    if tight_layout:\n",
        "        plt.tight_layout()\n",
        "    plt.savefig(path, format=fig_extension, dpi=resolution)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Dimensionality Reduction using PCA**\n",
        "\n",
        "This week we'll look at how to reduce the dimensionality of a large dataset in order to improve our classification algorithm's performance! With that in mind, let's begin the exercise by loading the MNIST dataset.\n",
        "\n",
        "## Q1) Load the input features and truth variable into X and y, then split the data into training and test sets using scikit-learn's `train_test_split` function. Use *test_size=0.15*, and remember to set the random state to *rnd_seed*!\n",
        "\n",
        "*Hint 1: The `'data'` and `'target'` keys for mnist will return X and y.*\n",
        "\n",
        "*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for train/test split.*"
      ],
      "metadata": {
        "id": "H3QU33M3D--N"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Load the mnist dataset\n",
        "from sklearn.datasets import fetch_openml\n",
        "mnist = fetch_openml('mnist_784', version=1, as_frame=False)"
      ],
      "metadata": {
        "id": "H9slNfR3D-kg"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Read in the mnist digit images and corresponding numbers\n",
        "# ---------------------------------------------------------------------------\n",
        "# The procedure here is similar to the notebooks we did last week. Use Hint 1 to store the input and target data.\n",
        "X = _____[____]\n",
        "y = _____[____]"
      ],
      "metadata": {
        "id": "zNcNkJ3u92cW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Import train_test_split() to create your training and test sets\n",
        "# ---------------------------------------------------------------------------\n",
        "from ________._______ import __________\n",
        "\n",
        "# Now separate your X and y into training and test sets (use train_test_split)\n",
        "# ---------------------------------------------------------------------------\n",
        "_____,_____,____,____ = __________(_,_,___=___,_____=____) # (data,target,test_size,random_state)"
      ],
      "metadata": {
        "id": "yOmYNwuT920P"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We once again have a training set and a test set to work with. Let's try training a random forest classifier on them. You've had experience with random forests before, so go ahead and import the `RandomForestClassifier` from sklearn and instantiate it.\n",
        "\n",
        "## Q2) Import the `RandomForestClassifier` model from sklearn. Then, instantiate it with 100 estimators and set the random state to *rnd_seed!*\n",
        "\n",
        "*Hint 1: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for `RandomForestClassifier`*\n",
        "\n",
        "*Hint 2: If you're still confused about **instantiation**, there's a [blurb on wikipedia](https://en.wikipedia.org/wiki/Instance_(computer_science)) describing it in the context of computer science.*"
      ],
      "metadata": {
        "id": "EhBQOdVxfr2U"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Import RandomForestClassifier here.\n",
        "# ---------------------------------------------------------------------------\n",
        "from sklearn.______ import _______"
      ],
      "metadata": {
        "id": "ZZaWwNGUg9Qb"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Here we instantiate an RF classifier object with custom settings: 100 estimators, random_state=rnd_seed\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "rnd_clf = _____(______=______, #Number of estimators\n",
        "                ______=______) #Random State"
      ],
      "metadata": {
        "id": "qJc0deCO-Ibt"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We're now going to measure how long it takes to fit the algorithm to the MNIST dataset! To do this, we'll import the `time` library. With it, we can record a timestamp immediately before and immediately after fitting; the training time is the difference between the two.\n",
        "\n",
        "## Q3) Import the time library and calculate how long it takes to fit the `RandomForestClassifier` model.\n",
        "\n",
        "*Hint 1: [Here's the documentation](https://docs.python.org/3/library/time.html#time.time) for the function used for getting timestamps*\n",
        "\n",
        "*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit) for the fitting method used in `RandomForestClassifier`.*"
      ],
      "metadata": {
        "id": "gi1HTS-KjUJ8"
      }
    },
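    {
      "cell_type": "markdown",
      "source": [
        "A minimal sketch of the before/after timing pattern, applied to a toy workload rather than to model fitting. It uses `time.perf_counter()`, a clock from the same `time` module that is intended for measuring elapsed intervals. The workload and the variable names `toy_start`, `toy_end`, and `toy_elapsed` are illustrative assumptions, not part of the exercise."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import time\n",
        "\n",
        "# Timestamp immediately before the work starts\n",
        "toy_start = time.perf_counter()\n",
        "\n",
        "# Toy workload (hypothetical stand-in for fitting a model)\n",
        "toy_total = sum(i * i for i in range(1_000_000))\n",
        "\n",
        "# Timestamp immediately after the work ends\n",
        "toy_end = time.perf_counter()\n",
        "\n",
        "# Elapsed time is the difference between the two timestamps\n",
        "toy_elapsed = toy_end - toy_start\n",
        "print(f'Toy workload took {toy_elapsed:.4f}s')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },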
    {
      "cell_type": "code",
      "source": [
        "import time"
      ],
      "metadata": {
        "id": "EZaQPn2XkV06"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Use a function in time (check documentation) to load **current** time before training the RF classifier\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "t0 = _____._____()\n",
        "\n",
        "# Train the RF classifier\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "rnd_clf.___(_____, _____)\n",
        "\n",
        "# Use the same function for t0 to load **current** time **after** training the RF classifier\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "t1 = _____._____()"
      ],
      "metadata": {
        "id": "B4jPNCXl-OIM"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Run this as is, how many seconds did it take to train the classifier?\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "train_t_rf = t1-t0\n",
        "\n",
        "print(f\"Training took {train_t_rf:.2f}s\")"
      ],
      "metadata": {
        "id": "LFuLLVWj-PXZ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "0b9a10bc-6fc1-4b02-f5e2-9386acd2ef90"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Training took 53.15s\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "We care about more than just how long it took to train the model, however! Let's get an accuracy score for our model.\n",
        "\n",
        "## Q4) Get an accuracy score for the predictions from the RandomForestClassifier\n",
        "\n",
        "*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) for the `accuracy_score` metric in sklearn.*\n",
        "\n",
        "*Hint 2: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict) for the predict method in `RandomForestClassifier`*"
      ],
      "metadata": {
        "id": "X0-hEhlOnLqh"
      }
    },
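    {
      "cell_type": "markdown",
      "source": [
        "As a minimal sketch of what an accuracy score measures, the cell below computes the fraction of matching labels by hand with NumPy. The arrays `toy_true` and `toy_pred` hold made-up illustrative values, not MNIST results. By default, scikit-learn's `accuracy_score` returns this same fraction."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "# Hypothetical labels for five samples (illustrative values only)\n",
        "toy_true = np.array([0, 1, 2, 2, 1])\n",
        "toy_pred = np.array([0, 1, 1, 2, 1])\n",
        "\n",
        "# Accuracy is the fraction of predictions that match the truth\n",
        "toy_accuracy = np.mean(toy_true == toy_pred)\n",
        "print(f'Toy accuracy: {toy_accuracy:.2%}')  # 4 of 5 correct -> 80.00%"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },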
    {
      "cell_type": "code",
      "source": [
        "# Import the accuracy score metric in scikit-learn (check Hint 1 for ideas on how to import metrics)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "from _____._____ import _____"
      ],
      "metadata": {
        "id": "lscBW_sFnLVS"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Now try to use the trained classifier to generate predictions for the unseen test set (X_test)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "y_pred = _____._____(_____)"
      ],
      "metadata": {
        "id": "x-93C_-n-cle"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Use the accuracy_score() metric on y_test and y_pred to evaluate the accuracy of our model\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "rf_accuracy = accuracy_score(_____, _____)\n",
        "\n",
        "# Run this as is. We got an accuracy of 96.7%. Did you get similar scores?\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "print(f\"RF Model Accuracy: {rf_accuracy:.2%}\")"
      ],
      "metadata": {
        "id": "n09PnHuy-cTf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's try doing the same with a logistic regression algorithm to see how it compares.\n",
        "\n",
        "## Q5) Repeat Q2-4 with a logistic regression algorithm using sklearn's `LogisticRegression` class. Hyperparameters: `multi_class='multinomial'` and `solver='lbfgs'`\n",
        "\n",
        "*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the `LogisticRegression` class.*"
      ],
      "metadata": {
        "id": "XEZX7xBAHJj9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Import LogisticRegression class here.\n",
        "# ---------------------------------------------------------------------------\n",
        "from _____._____ import _____"
      ],
      "metadata": {
        "id": "kwX8ZwzQI6p6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Instantiate a LogisticRegression object with custom hyperparameters\n",
        "# ---------------------------------------------------------------------------\n",
        "log_clf = _____(_____=\"multinomial\", #Multiclass\n",
        "                _____=\"lbfgs\", #Solver\n",
        "                _____=rnd_seed) #Random State"
      ],
      "metadata": {
        "id": "CvUwrxtS-mTf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Timestamp for **current** time before training the LogisticRegression classifier\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "t0 = time.time()\n",
        "\n",
        "# Training the LogisticRegression classifier\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "log_clf.fit(_____, _____)\n",
        "\n",
        "# Timestamp for **current** time after training the LogisticRegression classifier\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "t1 = time.time()\n",
        "\n",
        "# Run this as is, how many seconds did it take to train the classifier?\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "train_t_log = t1-t0\n",
        "print(f\"Training took {train_t_log:.2f}s\")"
      ],
      "metadata": {
        "id": "F6Dr9j1T-mgz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Now try to use the trained classifier to generate predictions for the unseen test set (X_test)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "y_pred = _____._____(_____)\n",
        "\n",
        "# Run this as is. We got an accuracy of 92.1%. Did you get similar scores?\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "log_accuracy = accuracy_score(_____, _____)  # Feed in the truth and predictions\n",
        "\n",
        "print(f\"Log Model Accuracy: {log_accuracy:.2%}\")"
      ],
      "metadata": {
        "id": "Armw_a0V-mAs"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Everything we've done so far repeats steps from previous labs, but now we'll try out an algorithm for reducing dimensionality! Let's use principal component analysis (PCA). Here, we'll reduce the feature space by keeping just enough principal axes to explain at least 95% of the variability in the data.\n",
        "\n",
        "## Q6) Import scikit-learn's implementation of `PCA` and fit it to the training dataset so that 95% of the variability is explained.\n",
        "\n",
        "*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) for scikit's `PCA` class.*\n",
        "\n",
        "*Hint 2: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform) for scikit's `.fit_transform()` method.*"
      ],
      "metadata": {
        "id": "b_5XiaQfJ5NV"
      }
    },
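    {
      "cell_type": "markdown",
      "source": [
        "To make \"explaining 95% of the variability\" concrete, here is a minimal sketch on synthetic data. It fits a full PCA, accumulates the `explained_variance_ratio_` values, and finds the smallest number of components whose cumulative ratio reaches 95%. The synthetic dataset and every `toy_` name are illustrative assumptions; the exercise itself uses MNIST."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "from sklearn.decomposition import PCA\n",
        "\n",
        "# Synthetic data (assumption): 500 samples, 20 features driven by 3 latent factors plus small noise\n",
        "toy_rng = np.random.default_rng(0)\n",
        "toy_latent = toy_rng.normal(size=(500, 3))\n",
        "toy_mixing = toy_rng.normal(size=(3, 20))\n",
        "toy_X = toy_latent @ toy_mixing + 0.05 * toy_rng.normal(size=(500, 20))\n",
        "\n",
        "# Fit a PCA that keeps all components, then accumulate the explained variance ratios\n",
        "toy_pca = PCA().fit(toy_X)\n",
        "toy_cumvar = np.cumsum(toy_pca.explained_variance_ratio_)\n",
        "\n",
        "# Smallest number of components whose cumulative ratio reaches 95%\n",
        "toy_d = int(np.argmax(toy_cumvar >= 0.95)) + 1\n",
        "print(f'{toy_d} of {toy_X.shape[1]} components reach 95% explained variance')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },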
    {
      "cell_type": "code",
      "source": [
        "# Here we will experiment a bit with reducing the dimensionality of the mnist data.\n",
        "# First, import the PCA class from scikit-learn\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "from _____._____ import _____ # Importing PCA"
      ],
      "metadata": {
        "id": "rrP5043rJc-1"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# We will now instantiate the PCA algorithm, with a custom hyperparameter to keep only a certain number of principal components\n",
        "# In the documentation, search for the keywords \"numbers ... to keep\"\n",
        "# ---------------------------------------------------------------------------------------------------------------------------\n",
        "pca = PCA(_____=_____) # Set number of components to explain 95% of variability"
      ],
      "metadata": {
        "id": "UZAeoAlI_Ok9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Fit the PCA model and use it to transform our training data (reducing data dimensionality) [fit_transform]\n",
        "# ---------------------------------------------------------------------------------------------------------------------------\n",
        "X_train_reduced = pca._____(____) # Fit-transform the training data"
      ],
      "metadata": {
        "id": "b3FHiYMA_OwR"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Transform our test data (reducing data dimensionality) with the pca algorithm (do not fit the model again!)\n",
        "# ---------------------------------------------------------------------------------------------------------------------------\n",
        "X_test_reduced = pca._____(____)"
      ],
      "metadata": {
        "id": "zydXZOAV_T1U"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Q7) Repeat Q3 & Q4 using the *reduced* `X_train` dataset instead of `X_train`."
      ],
      "metadata": {
        "id": "mKXeXWn4M8K1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Record the timestamp, train the RF classifier with X_train_reduced, then record the timestamp again\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "t0 = _____._____() # Load the timestamp before running\n",
        "rnd_clf.___(_____, _____) # Fit the model with the reduced training data\n",
        "t1 = _____._____()  # Load the timestamp after running\n",
        "\n",
        "# How many seconds did it take to train the model? (separate name keeps the full-data timing for comparison)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "train_t_rf_red = t1-t0\n",
        "print(f\"Training took {train_t_rf_red:.2f}s\")"
      ],
      "metadata": {
        "id": "m1oZFFfljH0N"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Use trained classifier to generate predictions from the **reduced** test set (X_test_reduced)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "y_pred = _____._____(_____)\n",
        "\n",
        "# Use accuracy_score to compare truth and prediction. We got 94.7% accuracy.\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "red_rf_accuracy = accuracy_score(_____, _____)  # Feed in the truth and predictions\n",
        "print(f\"RF Model Accuracy on reduced dataset: {red_rf_accuracy:.2%}\")"
      ],
      "metadata": {
        "id": "jNisAXlgnUMe"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Q8) Repeat Q5 using the *reduced* X_train dataset instead of X_train."
      ],
      "metadata": {
        "id": "46j-guE8NStk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Record the timestamp, train LogisticRegression with X_train_reduced, then record the timestamp again\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "t0 = time.time() # Timestamp before training\n",
        "log_clf.fit(_____, _____) # Fit the model with the reduced training data\n",
        "t1 = time.time() # Timestamp after training\n",
        "\n",
        "# How many seconds did it take to train the model? (separate name keeps the full-data timing for comparison)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "train_t_log_red = t1-t0\n",
        "print(f\"Training took {train_t_log_red:.2f}s\")"
      ],
      "metadata": {
        "id": "JerFiDoKMpAx"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Use trained classifier to generate predictions from the **reduced** test set (X_test_reduced)\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "y_pred = _____._____(_____)   # Get a set of predictions from the test set\n",
        "\n",
        "\n",
        "# Use accuracy_score to compare truth and prediction. We got 91.38% accuracy.\n",
        "# ------------------------------------------------------------------------------------------------------\n",
        "red_log_accuracy = accuracy_score(_____, _____)  # Feed in the truth and predictions\n",
        "print(f\"Log Model Accuracy on reduced dataset: {red_log_accuracy:.2%}\")"
      ],
      "metadata": {
        "id": "R3Pc9LRK_f4I"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "You can now compare how the random forest classifier and the logistic regression classifier performed, in both training time and accuracy, on the full dataset and on the reduced dataset. What do you observe?"
      ],
      "metadata": {
        "id": "_P_-tnZstz99"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Write your comments on the performance of the algorithms in this box, if you'd like 😀\n",
        "(Double click to activate editing mode)"
      ],
      "metadata": {
        "id": "6AFlS89UuZTy"
      }
    }
  ]
}