{
"cells": [
{
"cell_type": "markdown",
"source": [
"One of the best ways to improve skills of machine learning models is to combine multiple machine learning models. Since each model in the ensemble will make slightly different predictions, the models could be more robust and generalizable to unseen data. We can also characterize the uncertainty of ML predictions with an ensemble approach.\n",
"\n",
"Here we will create multiple individual classifiers on the MNIST data, a dataset with hand-written digit images, and perform skill evaluation. The skills of individual models will be compared with skills of ensemble models."
],
"metadata": {
"id": "-5uehqDJGlm7"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "TsKRi0KZGlF5"
},
"source": [
"# Exercise 3: Comparing (Ensemble of) Classifiers on MNIST Data"
]
},
{
"cell_type": "markdown",
"source": [
"The goal is to train and compare individual classifiers on [MNIST data](https://en.wikipedia.org/wiki/MNIST_database), before combining them into an ensemble model. Will the power of teamwork shine through? 🔢"
],
"metadata": {
"id": "Cc_giO4--x6A"
}
},
{
"cell_type": "markdown",
"source": [
"Let's start by loading the [MNIST database](https://en.wikipedia.org/wiki/MNIST_database)!"
],
"metadata": {
"id": "NaNNfeQ5AguD"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.datasets import fetch_openml\n",
"import numpy as np"
],
"metadata": {
"id": "LG202nlSAdEt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Setting `as_frame` to False to avoid loading mnist as Pandas dataframe\n",
"mnist = fetch_openml('mnist_784', version=1, as_frame=False)\n",
"# Making sure we are working with 8-bit unsigned integers as targets\n",
"mnist.target = mnist.target.astype(np.uint8)"
],
"metadata": {
"id": "kQ7n6SQuAvc-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Here we are creating our dataset to train our ML model\n",
"# X contains the digit pictures, and y contains the label corresponding to each picture\n",
"X = mnist['data'] # Read digit pictures\n",
"y = mnist['target'].astype(np.uint8) # Read labels\n",
"X.shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "yWk-uMX8B7r8",
"outputId": "51bf26f8-a714-430c-9e58-ffcb84cd5e24"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(70000, 784)"
]
},
"metadata": {},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"source": [
"**Q1) Split the MNIST dataset into a training, a validation, and a test sets**\n",
"\n",
"Hint 1: The documentation for `scikit-learn`'s `train_test_split` function is [at this link](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).\n",
"\n",
"Hint 2: You may use 50k instances for training, 10k instances for validation, and 10k instances for testing."
],
"metadata": {
"id": "mjQ53EhUCRRy"
}
},
{
"cell_type": "code",
"source": [
"# Import the necessary functions and utilities\n",
"# You will need train_test_split() that we used in previous notebooks\n",
"#from sklearn._____________ import ____________________\n",
"from sklearn.model_selection import train_test_split"
],
"metadata": {
"id": "ghURb2XNC1RI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Split the MNIST data into training, validation, and test\n",
"# set data to be used for training and testing/validation\n",
"# train_test_split only allows two-way splits. To get our training, validation, and test set, we will need to call train_test_split() 2 times\n",
"#############################################################################################################################################\n",
"# 1. Split the data into training set and validation-test set. Set the size of the training set to be 50000\n",
"#############################################################################################################################################\n",
"________,_______,__________,__________ = train_test_split(__,__, __________=50000)\n",
"\n",
"#############################################################################################################################################\n",
"# 2. Split the validation-test set into validation set and test set. Set the size of the test set to be 10000\n",
"#############################################################################################################################################\n",
"________,_______,__________,__________ = train_test_split(__,__, __________=10000)\n",
"\n",
"#############################################################################################################################################\n",
"# 3. Print the shape of your training, validation, and test data. You should see (50000,784) [training], (10000,784) [validation], (10000,784) [test]\n",
"#############################################################################################################################################\n",
"print(____________________)"
],
"metadata": {
"id": "8vcomcH2C6kI"
},
"execution_count": null,
"outputs": []
},
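{
"cell_type": "markdown",
"source": [
"One possible way to fill in the blanks above (a sketch only: the variable names and the `random_state` are our choices, not the only valid ones):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: two calls to train_test_split (random_state is optional, added for reproducibility)\n",
"X_train, X_valtest, y_train, y_valtest = train_test_split(X, y, train_size=50000, random_state=42)\n",
"X_val, X_test, y_val, y_test = train_test_split(X_valtest, y_valtest, test_size=10000, random_state=42)\n",
"\n",
"# Expected shapes: (50000, 784), (10000, 784), (10000, 784)\n",
"print(X_train.shape, X_val.shape, X_test.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},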
{
"cell_type": "markdown",
"source": [
"**Q2) Train various classifiers on the training set and compare them on the validation set**\n",
"\n",
"Hint: You may compare a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), an [`ExtraTreesClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html), and a [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), but we encourage you to be creative and include additional classifiers you find promising! The more the merrier 😀\n",
"\n",
"Note from TA: The SVC can be slow to train. Test RandomForest and ExtraTrees first for quick results."
],
"metadata": {
"id": "JAY0mlBsFobL"
}
},
{
"cell_type": "code",
"source": [
"# Import all the classifiers you need\n",
"from sklearn._________ import RandomForestClassifier\n",
"from sklearn._________ import ExtraTreesClassifier\n",
"from sklearn._____ import SVC"
],
"metadata": {
"id": "OgJaMxU9Gtmy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Initiate Classifiers\n",
"rfc = ___________________ # RandomForestClassifier\n",
"etc = ___________________ # ExtraTreesClassifier\n",
"\n",
"# Fit Classifiers on training data\n",
"rfc.___(_________,________) # RandomForestClassifier\n",
"etc.___(_________,________) # ExtraTreesClassifier"
],
"metadata": {
"id": "wjCaCtDiGv1A"
},
"execution_count": null,
"outputs": []
},
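{
"cell_type": "markdown",
"source": [
"A minimal sketch of how the cells above could be completed, assuming the `X_train`/`y_train` split from Q1 (the hyperparameters `n_estimators=100` and `random_state=42` are illustrative choices):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: import, instantiate, and fit two tree-based classifiers\n",
"from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier\n",
"\n",
"rfc = RandomForestClassifier(n_estimators=100, random_state=42) # RandomForestClassifier\n",
"etc = ExtraTreesClassifier(n_estimators=100, random_state=42) # ExtraTreesClassifier\n",
"\n",
"rfc.fit(X_train, y_train)\n",
"etc.fit(X_train, y_train)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},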
{
"cell_type": "code",
"source": [
"# Import accuracy_score module from sklearn\n",
"from sklearn._______ import accuracy_score"
],
"metadata": {
"id": "hD03asiWHLHY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Use the trained classifiers to make predictions on the validation set\n",
"rfc_preds = rfc._______(____)\n",
"etc_preds = etc._______(____)\n",
"\n",
"# Use accuracy_score() to see if our models can successfully classify the validation data.\n",
"# We got around 96-97% accuracy. Did your models perform well on the validation data as well?\n",
"rfc_acc = accuracy_score(______,_______)\n",
"etc_acc = accuracy_score(______,_______)\n",
"print(rfc_acc)\n",
"print(etc_acc)"
],
"metadata": {
"id": "q5aOFZcyGzdR"
},
"execution_count": null,
"outputs": []
},
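{
"cell_type": "markdown",
"source": [
"The validation-set evaluation could then look like this (again a sketch, assuming the variable names used above):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: predict on the validation set and compute the accuracy of each classifier\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"rfc_preds = rfc.predict(X_val)\n",
"etc_preds = etc.predict(X_val)\n",
"\n",
"rfc_acc = accuracy_score(y_val, rfc_preds)\n",
"etc_acc = accuracy_score(y_val, etc_preds)\n",
"print(rfc_acc, etc_acc)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},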
{
"cell_type": "markdown",
"source": [
"Now it's time to make the individual classifiers vote to form an *ensemble* model"
],
"metadata": {
"id": "eCksbNtVHORZ"
}
},
{
"cell_type": "markdown",
"source": [
"**Q3) Combine the classifiers into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier.**\n",
"\n",
"Hint: The documentation for `scikit-learn`'s [`VotingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) class can be found [at this link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Note that its argument `voting` can be changed from `hard` to `soft`."
],
"metadata": {
"id": "FhffZUkUIXxU"
}
},
{
"cell_type": "code",
"source": [
"# Import VotingClassifier module from sklearn\n",
"from sklearn.__________ import VotingClassifier"
],
"metadata": {
"id": "wHSz59irJvmj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Define your voting classifier here\n",
"vc_hard = VotingClassifier(________=[(____,_______), (___,_____)]) # Hard Voting\n",
"vc_soft = VotingClassifier(________=[(____,_______), (___,_____)], ____=____) # Soft Voting"
],
"metadata": {
"id": "CtFMbHkDJyvG"
},
"execution_count": null,
"outputs": []
},
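{
"cell_type": "markdown",
"source": [
"One way to define the two voting classifiers (a sketch: `estimators` takes a list of `(name, estimator)` tuples, and you can include as many classifiers as you trained; soft voting averages the predicted class probabilities, so every estimator must support `predict_proba`):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: hard- and soft-voting ensembles built from the individual classifiers\n",
"vc_hard = VotingClassifier(estimators=[('rfc', rfc), ('etc', etc)]) # Hard Voting (voting='hard' is the default)\n",
"vc_soft = VotingClassifier(estimators=[('rfc', rfc), ('etc', etc)], voting='soft') # Soft Voting"
],
"metadata": {},
"execution_count": null,
"outputs": []
},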
{
"cell_type": "code",
"source": [
"# Train the two voting classifiers\n",
"vc_soft.____(________, _________) # Hard voting\n",
"vc_hard.____(________, _________) # Soft voting\n",
"\n",
"# Evaluate classifier performance on validation set. The evaluation will be based on accuracy_score(), similar to previous notebooks.\n",
"vc_soft_preds = vc_soft._________(_____)\n",
"vc_hard_preds = vc_hard._________(_____)\n",
"\n",
"# Calculate your accuracy scores here.\n",
"vc_soft_acc = ____________(______, ____________)\n",
"vc_hard_acc = ____________(____, _____________)\n"
],
"metadata": {
"id": "6KP32aHdJ4oy"
},
"execution_count": null,
"outputs": []
},
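{
"cell_type": "markdown",
"source": [
"Training and validating the voting classifiers could then look like this (sketch, same assumed variable names as above):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: fit both voting classifiers and score them on the validation set\n",
"vc_soft.fit(X_train, y_train) # Soft voting\n",
"vc_hard.fit(X_train, y_train) # Hard voting\n",
"\n",
"vc_soft_preds = vc_soft.predict(X_val)\n",
"vc_hard_preds = vc_hard.predict(X_val)\n",
"\n",
"vc_soft_acc = accuracy_score(y_val, vc_soft_preds)\n",
"vc_hard_acc = accuracy_score(y_val, vc_hard_preds)\n",
"print(vc_soft_acc, vc_hard_acc)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},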
{
"cell_type": "code",
"source": [
"# How well did our voting classifier perform on the validation data\n",
"# Which of the two was better?"
],
"metadata": {
"id": "n_ZnWI0GWlGC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Compare the accuracy values for the voting classifiers to the accuracy of individual classifiers"
],
"metadata": {
"id": "RbfEORxFXHTJ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Hint: If your ensemble does significantly worse than individual classifiers, consider deleting the individual classifiers negatively affecting the performance of your ensemble using `del Voting_Classifier.estimators_[index_of_model_to_delete]`, where the `estimators_` attribute of your `Voting_Classifier`'s lists the individual classifiers that were trained as part of the ensemble."
],
"metadata": {
"id": "66eirCsCXToh"
}
},
{
"cell_type": "markdown",
"source": [
"**Q4) Does your ensemble clearly outperform your individual classifiers on the test set**"
],
"metadata": {
"id": "xWxq93F7W-4O"
}
},
{
"cell_type": "code",
"source": [
"# Use the classifiers to make classification on the test set\n",
"################################################################################\n",
"# Individual Classifier predictions\n",
"################################################################################\n",
"rfc_preds_test = rfc._______(______)\n",
"etc_preds_test = etc._______(______)\n",
"################################################################################\n",
"# Voting Classifier predictions\n",
"################################################################################\n",
"vc_soft_preds_test = vc_soft._______(______)\n",
"vc_hard_preds_test = vc_hard._______(______)\n",
"################################################################################\n",
"# Compare accuracy scores\n",
"################################################################################\n",
"vc_soft_acc_test = ______________(______________, ______________)\n",
"vc_hard_acc_test = ______________(______________, ______________)\n",
"rfc_acc_test = ______________(______________, ______________)\n",
"etc_acc_test = ______________(______________, ______________)"
],
"metadata": {
"id": "wHzcCAtbXqk0"
},
"execution_count": null,
"outputs": []
},
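{
"cell_type": "markdown",
"source": [
"A sketch of the test-set evaluation, assuming the models and variable names from the previous questions:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: repeat the predict/score pattern on the held-out test set\n",
"rfc_preds_test = rfc.predict(X_test)\n",
"etc_preds_test = etc.predict(X_test)\n",
"vc_soft_preds_test = vc_soft.predict(X_test)\n",
"vc_hard_preds_test = vc_hard.predict(X_test)\n",
"\n",
"rfc_acc_test = accuracy_score(y_test, rfc_preds_test)\n",
"etc_acc_test = accuracy_score(y_test, etc_preds_test)\n",
"vc_soft_acc_test = accuracy_score(y_test, vc_soft_preds_test)\n",
"vc_hard_acc_test = accuracy_score(y_test, vc_hard_preds_test)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},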
{
"cell_type": "code",
"source": [
"# Does it clearly beat the best individual classifier?\n",
"print(\"VC soft:\", vc_soft_acc_test)\n",
"print(\"VC hard:\", vc_hard_acc_test)\n",
"print(\"RandomForest: \", rfc_acc_test)\n",
"print(\"ExtraTrees: \", etc_acc_test)"
],
"metadata": {
"id": "EbtS-G0-ZB7b"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "CurHgNWIGlF7"
},
"source": [
"Your voting classifier may only slightly beat the best model. Maybe voting isn't the best way to get the best prediction!\n",
"\n",
"Let's try the brute-force approach: Training a classifier on the individual model's predictions to beat the voting approach."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dZR3e-4pGlF-"
},
"source": [
"## Bonus Exercise 3: From Individual Classifiers to Ensemble Stacking via Blenders"
]
},
{
"cell_type": "markdown",
"source": [
"Let's learn how to best blend the individual classifiers' predictions!"
],
"metadata": {
"id": "S9gdO_Ydbcex"
}
},
{
"cell_type": "markdown",
"source": [
"**Q1) Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions**\n",
"\n",
"Hint: The target stays the same, but now each training instance is a vector containing the set of predictions from all your individual classifiers. You may group all these vectors into a feature array `X_val_predictions` that should have the [shape](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html): `(Number_of_validation_instances,Number_of_individual_classifiers)`.\n",
"\n",
""
],
"metadata": {
"id": "zCWZUzRBbxwa"
}
},
{
"cell_type": "code",
"source": [
"# Create the new training set"
],
"metadata": {
"id": "lV1OOCt6cnXn"
},
"execution_count": null,
"outputs": []
},
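{
"cell_type": "markdown",
"source": [
"One way to build the new training set (a sketch: `np.column_stack` is just one option, and `individual_classifiers` is an assumed helper list of the models you trained earlier):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: each column holds one classifier's predictions on the validation set\n",
"individual_classifiers = [rfc, etc] # add any other classifiers you trained\n",
"X_val_predictions = np.column_stack([clf.predict(X_val) for clf in individual_classifiers])\n",
"\n",
"# Shape should be (number_of_validation_instances, number_of_individual_classifiers)\n",
"print(X_val_predictions.shape)\n",
"print(X_val_predictions[:5]) # the values should look like digit labels (0-9)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},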
{
"cell_type": "code",
"source": [
"# Make sure it has the right shape and contains sensical values"
],
"metadata": {
"id": "vVsgD6Plcyi7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Q2) Train a classifier on this new training set**\n",
"\n",
"Hint 1: You may train a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).\n",
"\n",
"Hint 2: You could fine-tune this blender or try other types of blenders (e.g., a [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) or an [`MLPClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)), then select the best one using cross-validation."
],
"metadata": {
"id": "Qy2VhbS8c69B"
}
},
{
"cell_type": "code",
"source": [
"# Fit the classifier to the new training set"
],
"metadata": {
"id": "y-GKK3Jrdd5r"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Calculate its mean accuracy"
],
"metadata": {
"id": "Ft17SfYVdkuz"
},
"execution_count": null,
"outputs": []
},
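{
"cell_type": "markdown",
"source": [
"A possible blender trained on the predictions from Bonus Q1 (sketch: the `RandomForestClassifier` hyperparameters are illustrative, and the out-of-bag estimate is just one way to get an honest accuracy without touching the test set):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: fit a blender on the individual classifiers' validation-set predictions\n",
"blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)\n",
"blender.fit(X_val_predictions, y_val)\n",
"\n",
"# Mean accuracy on the blender's own training data, and the out-of-bag estimate\n",
"print(blender.score(X_val_predictions, y_val))\n",
"print(blender.oob_score_)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},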
{
"cell_type": "code",
"source": [
"# (Optional) Try other classifiers on this new training set\n",
"# if you're not satisfied with the new accuracy"
],
"metadata": {
"id": "oPB71Yw6d_SA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Congratulations! 😃\n",
"\n",
"You have just trained a blender, and together with classifiers they form a [stacking ensemble](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking). Now let's evaluate the ensemble on the test set."
],
"metadata": {
"id": "fqG4T-TxeGkx"
}
},
{
"cell_type": "markdown",
"source": [
"**Q3) Evaluate the blender on the test set and compare it to the voting classifier you trained earlier**\n",
"\n",
"Hint 1: You will have to first calculate the predictions of your individual classifiers on the test set, similar to what you did in [Question 1](#Q1).\n",
"\n",
"Hint 2: Make sure you use the same score (e.g., the [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)) to compare both ensemble models."
],
"metadata": {
"id": "sreASWk6e6eK"
}
},
{
"cell_type": "code",
"source": [
"# Calculate the predictions of your individual classifiers on the test set\n",
"# and format them so you can feed them to your blender"
],
"metadata": {
"id": "sWMlCBF9f-uq"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Calculate the mean accuracy of the blender on the test set"
],
"metadata": {
"id": "OmeOsTRygItI"
},
"execution_count": null,
"outputs": []
},
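{
"cell_type": "markdown",
"source": [
"Evaluating the stacking ensemble on the test set could look like this (sketch, reusing `individual_classifiers`, `blender`, and the test-set accuracies computed in Q4):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: stack the individual classifiers' test-set predictions column-wise,\n",
"# feed them to the blender, and compare its accuracy to the voting classifiers\n",
"X_test_predictions = np.column_stack([clf.predict(X_test) for clf in individual_classifiers])\n",
"\n",
"blender_acc_test = blender.score(X_test_predictions, y_test) # mean accuracy of the blender\n",
"print(\"Blender: \", blender_acc_test)\n",
"print(\"VC soft: \", vc_soft_acc_test)\n",
"print(\"VC hard: \", vc_hard_acc_test)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},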
{
"cell_type": "code",
"source": [
"# Compare it to the mean accuracy of individual models and the voting classifier"
],
"metadata": {
"id": "Q_opfKJkgLxS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Is the blender worth the effort?"
],
"metadata": {
"id": "1EepvRHzGlF_"
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
},
"nav_menu": {
"height": "252px",
"width": "333px"
},
"toc": {
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 0
}