{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "nf7hCHXe6lk7" }, "source": [ "# **Exercise 2 – Recurrent Neural Networks for Hydrological Modeling**\n" ] }, { "cell_type": "markdown", "metadata": { "id": "GIjU3VmP7cVw" }, "source": [ "
> *The rain fell alike upon the just and upon the unjust, and for nothing was there a why and a wherefore.* - William Somerset Maugham
\n", "
\n", "\n", "Rain is oft described as having a gloomy beauty to it, and it continues to inspire authors and scientists alike. Can we use machine learning algorithms to predict rainfall runoff with comparable accuracies to other established methods?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "_cUF5bY6xWKy" }, "outputs": [], "source": [ "#@title Run this cell for preliminary requirements. Double click it if you want to check out the source :)\n", "\n", "# Python ≥3.5 is required\n", "import sys\n", "assert sys.version_info >= (3, 5)\n", "\n", "# Is this notebook running on Colab or Kaggle?\n", "IS_COLAB = \"google.colab\" in sys.modules\n", "\n", "# Scikit-Learn ≥0.20 is required\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.20\"\n", "\n", "# Common imports\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "\n", "# To make this notebook's output stable across runs\n", "rnd_seed = 42\n", "rnd_gen = np.random.default_rng(rnd_seed)\n", "\n", "# To plot pretty figures\n", "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "mpl.rc('axes', labelsize=14)\n", "mpl.rc('xtick', labelsize=12)\n", "mpl.rc('ytick', labelsize=12)\n", "\n", "# Where to save the figures\n", "PROJECT_ROOT_DIR = \".\"\n", "CHAPTER_ID = \"classification\"\n", "IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n", "os.makedirs(IMAGES_PATH, exist_ok=True)\n", "\n", "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n", " path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n", " print(\"Saving figure\", fig_id)\n", " if tight_layout:\n", " plt.tight_layout()\n", " plt.savefig(path, format=fig_extension, dpi=resolution)\n", "\n", "# Import pooch - used to handle data downloading\n", "import pooch" ] }, { "cell_type": "markdown", "metadata": { "id": "Aw1dLJvi_rab" }, "source": [ "Welcome to the practical application notebook for this week, where we'll be using a Long-Short Term Memory (LSTM) neural network in order to model rainfall-runoff! Let's begin by loading in our data using pooch and pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "5EfbgXH40jbk" }, "outputs": [], "source": [ "#@title Let's download the data onto the storage on the server using pooch. Remember you can check the source by double cliking on cells like these. \n", "data_url = 'https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/EQ0OKNafmxdJvXUMPfwGyecBwCPZJ1Y8_ATlmbrUFySkzw?download=1'\n", "data_file = pooch.retrieve(data_url, known_hash='3b647bb9318865be5858bccd1539148fbc58f7425d09ac62d8e459682958940a');\n" ] }, { "cell_type": "markdown", "metadata": { "id": "hTPSh07UFntb" }, "source": [ "## **Q1) Go ahead and load the csv file with the data for our project today. The filepath is stored in the `data_file` variable that was defined in the hidden code within the cell above this one!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "8eXgMZlhHPsC" }, "outputs": [], "source": [ "#@title Hints - Loading CSVs with Pandas\n", "\n", "'''\n", "Pandas has a handy .read_csv function!\n", "\n", "There's an argument you can pass which will let you define the column to use as\n", "an index. \n", "\n", "There's also a single argument which when set to true will make pandas try to\n", "automatically parse fields it thinks are dates into datetime variables. 
This\n", "will be usefull when we split the data!\n", "\n", "''';\n", "# Uncomment the line below and run this cell to open up the help dialog in colab\n", "#pd.read_csv??\n", "\n", "# Uncomment the code below and run the cell to get a look at the data \n", "#pd.read_csv(data_file).head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "51yflhMwFHed" }, "outputs": [], "source": [ "# Let's load the CSV file into the \"data\" variable\n", "data = _______._______(_______, # Path to the CSV\n", " _______ = _______, # Define the date column as the index. \n", " _______ = _______, # Tell pandas to parse the date index.\n", " )\n", "\n", "# Print the first 5 rows of data to see if we've loaded it correctly! \n", "data.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "vnnBVmqzU9mh" }, "source": [ "We now have our data in a nice format for us to use. You'll see that our data has an index column, a T column, a P1 column, a P2 column, and a runoff column. The index column is the date of our measurements, the T column is the mean daily temperature at a station, the P1 and P2 columns are the daily precipitations at two stations, and the runoff column is the daily streamflow at one station.\n", "\n", " Let's go ahead and figure out how many years of data we have by looking at the time index." ] }, { "cell_type": "markdown", "metadata": { "id": "aeDbgSQ4_yfV" }, "source": [ "## **Q2) Store the list of available years of data in a variable, and print out the length of the list.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "jeMvYAoZ_71Y" }, "outputs": [], "source": [ "#@title Hint: How to use Pandas datetimeIndex\n", "'''\n", "When loading the data, we used the date column as the index - this means that \n", "we can access the date using the .index attribute of the dataframe.\n", "e.g.: for the df dataframe, df.index will return the index column\n", "\n", "Additionally, we set parse_dates to True, so the dates were transformed from\n", "a simple test representation of the date to a more powerfull pandas datetime\n", "format. Here, we can directly access key parts of the date\n", "e.g.: df.index.month will retrieve the month for each row index.\n", "\n", "This can be useful for verifying or selected seasonal data, such as\n", "for winter (december-january-february, or djf). Using built-in pandas functions\n", "to find the unique values in the index.month column would let us\n", "verify that all relevant month values are present (i.e., 1, 2, 12)\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "K0Vwm9mb_xrs" }, "outputs": [], "source": [ "years_available = _______._______._______._______() # Find the unique year values\n", "print(len(years_available))" ] }, { "cell_type": "markdown", "metadata": { "id": "YNIMPQ-JAjNb" }, "source": [ "Now we need to split the dataset into a training, validation, and test set to use with our algorithm. Since we're dealing with time series data, let's use the last 10 of the years available as the test dataset, the 10 years previous to that as the validation dataset, and the remainder of the data will be used to train our algorithm." 
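, "\n", "\n", "The hint cell below sketches this kind of year-based split on a tiny, made-up series (hypothetical toy data, not our runoff dataset)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of a year-based split, using hypothetical toy data\n", "import pandas as pd\n", "\n", "toy = pd.DataFrame({'value': range(6)},\n", "                   index=pd.to_datetime(['2000-01-01', '2000-06-01',\n", "                                         '2001-01-01', '2001-06-01',\n", "                                         '2002-01-01', '2002-06-01']))\n", "\n", "toy_years = toy.index.year.unique()   # [2000, 2001, 2002]\n", "last_year = toy_years[-1:]            # keep the final year for testing\n", "\n", "toy_test = toy.loc[toy.index.year.isin(last_year)]   # rows from 2002\n", "toy_rest = toy.loc[~toy.index.year.isin(last_year)]  # everything else\n", "print(toy_test.shape, toy_rest.shape)                # (2, 1) (4, 1)"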
] }, { "cell_type": "markdown", "metadata": { "id": "qSR_tbHgYbzi" }, "source": [ "## **Q3) Split the dataset into a training, validation, and test set.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "1-yKwVG_ZpK7" }, "outputs": [], "source": [ "#@title Hint: Review of array slicing\n", "'''\n", "You can select data in an array/DataFrame using an index within square brackets\n", "data[num_index] will return the value of data at the nth num_index entry. This\n", "also works with negative numbers, which are used to count from the end of the \n", "array.\n", "e.g., if we define myArray = [0 1 2 3 4 5 6 7 8 9]\n", "myArray[0] will return 0, myArray[1] will return 1\n", "myArray[-1] will return 9, myArray[-2] will return 8\n", "\n", "We can also select multiple values of the array using colons. Remember that\n", "indices use the [start:end:step] convention\n", "myArray [4:] will return [4, 5, 6, 7, 8, 9]\n", "myArray [:4] will return [0, 1, 2, 3]\n", "myArray [2:5] will return [2,3,4]\n", "myArray [-3:] will return [7, 8, 9]\n", "myArray [-5:-2] will return [5, 6, 7]\n", "myArray [2:-2] will return [2, 3, 4, 5, 6, 7]\n", "\n", "Stepping\n", "myArray [::2] will return [0, 2, 4, 6, 8]\n", "myArray [-4::2] will return [6, 8]\n", "myArray [2:-2:2] will return [2, 4, 6]\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "Zp-9x0lOZpCY" }, "outputs": [], "source": [ "#@title Hint: Finding/Selecting data in a Pandas dataframe\n", "'''\n", "Pandas dataframes have useful functions that allow you to filter values.\n", "One way to filter for specific time values is to rely on the `isin()` method\n", "of datasets. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html\n", "\n", "For example, if we were trying to take a dataset covering every month of the\n", "year and we're only interested in the winter season (DJF), we could use:\n", "boolean_index = df.index.month.isin([12,1,2])\n", "This would return a boolean index that you can use to select the relevant rows\n", "from the dataframe.\n", "\n", "In order to select the relevant datapoints, you can rely on the .loc method: \n", "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html\n", "Using this method, you could select the data as follows:\n", "relevant_datapoints = df.loc[boolean_index]\n", "\n", "If you want to select every month that is _not_ in winter, you would instead\n", "write: \n", "relevant_datapoints = df.loc[~boolean_index]\n", "Since ~ is the bitwise negation operator, it effectively \"nots\" boolean arrays\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-Sk3guB-Iby0" }, "outputs": [], "source": [ "#Select the last ten years for your test set\n", "test_years = years_available[_______:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-6fBBVlsRZR7" }, "outputs": [], "source": [ "# Select ten years at random from the remaining set\n", "validation_years = years_available[_______:_______]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hmTCwLpvRbzx" }, "outputs": [], "source": [ "# define the test set: test_set\n", "_______ = data._______[data._______._______._______(test_years)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7GATUdKdRfUO" }, "outputs": [], "source": [ "# define the validation set: val_set\n", "_______ = data._______[data._______._______._______(validation_years)]" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": { "id": "l4Ct_cMZRg-I" }, "outputs": [], "source": [ "# define the training set: train_set\n", "# Note that the '~' translates to \"not\" - this allows us to find the training \n", "# years by specifying it's those not in the list of test_years and validation_years\n", "_______ = data._______[~data._______._______.([*test_years,*validation_years])]" ] }, { "cell_type": "markdown", "metadata": { "id": "7W1IHq11gb33" }, "source": [ "The data we're using today has already been cleaned up - we know it doesn't have any NaN values. \n", "\n", "Still, it's good practire to check! Do so in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "69WrwcJ0eZLD" }, "outputs": [], "source": [ "# Since booleans are interpreted as 0 or 1 when doing math, we can use the\n", "# isna() method in pandas dataframes and sum the result to check how many NaN\n", "# values are in the dataset. We should get 0 today!\n", "test_set.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E3ML8czNZSif" }, "outputs": [], "source": [ "# Print out the first five rows of each dataset.\n", "# Make sure that your split matches ours!\n", "print(train_set.head(), val_set.head(), test_set.head(), sep='\\n\\n')" ] }, { "cell_type": "markdown", "metadata": { "id": "9p0FKjD6MlnG" }, "source": [ "You should now have your dataset split up as three pandas dataframes. If you print out the first five rows of each dataset, you should get the following:\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "4yo5mVnAip6V" }, "source": [ "We now have our data for training, validating, and testing our algorithm! However, the values are still the original ones from our measurements. We saw before that we generally want to transform these into stardardized values, otherwise we might end up with a _problem of scales._\n", "\n", "

*get it?*
\n", "\n", "With that joke out of our system, let's go ahead and prepare a Scikit-Learn pipeline to scale our data." ] }, { "cell_type": "markdown", "metadata": { "id": "xkNLBF2wE6bm" }, "source": [ "## **Q4) Scale the input data using a StandardScaler (i.e., using the mean and standard deviation).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fg6wC9JvFs7m" }, "outputs": [], "source": [ "#@title Hint: Review of Scikit Learn pipelines and transformers\n", "'''\n", "Scikit include several tools to preprocess data, ranging from utilities for\n", "imputation to utilities for scaling data.\n", "\n", "The column transformer will allow us to apply specific transformers to columns\n", "in the pandas dataframe. We also can specify what to do with the \n", "columns to which we don't want to apply transformations. You can check the\n", "documentation at the link below:\n", "https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html\n", "\n", "For example, take a dataframe including the temperature T, sea level pressure\n", "SST, relative humidity RH, mean sea level pressure MSLP, and surface pressure SP\n", "\n", "We want to train a model that takes T, SST, RH, and MSLP to make a weather \n", "prediction, and want the variables to be scaled to fit strictly between 0 and 1.\n", "Rather than code a script for this, a Column Transformer can be defined that\n", "uses remained='drop' as a parameter (thereby dropping SP), and a MinMaxScaler \n", "function for the transformer.\n", "\n", "The code would look something like:\n", "preprocesser = ColumnTransformer(\n", " remainder='drop', \n", " transformers=[\n", " ('scaling', MinMaxScaler(), ('T', 'SST', 'RH', 'MSLP')),\n", " ])\n", "\n", "The column transformer serves two functions: fitting and transforming\n", "When fitting, the transformer takes the input data and figures out how it will\n", "transform data later. In the example above, this would include calculating the\n", "minimum and maximum value for each column.\n", "\n", "Transforming takes the fitted transformer and calculates the new values of the\n", "column, as well as performing any other operations on the data (e.g., dropping\n", "the SP variable).\n", "\n", "Generally, you want to fit the transformer using the training data, and then\n", "transform the training data, the validation data, and the test data. We avoid\n", "fitting to the validation and test data.\n", "\n", "Additionally, though transformer.fit() and transformer.transform() exist\n", "individually, fitting to and transforming the training data is so common that\n", "a combined method is implemented in sklearn: fit_transform()\n", "\n", "Today, you should use the StandardScaler in order to process your numerical\n", "data. 
You can find more infomation about StandardScaler in the \n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zh3mtuhw8h7b" }, "outputs": [], "source": [ "# Import the parts of Scikit Learn we need for scaling\n", "from sklearn.compose import _______\n", "from sklearn.preprocessing import _______" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iI65j_O9EVmi" }, "outputs": [], "source": [ "# Create a Column Transformer\n", "scaler = _______(\n", " _______='_______', # We leave the non-input columns alone\n", " _______=[\n", " ('_______', _______, (_______, _______, _______)),\n", " ]) #Define the scaler to use and which columns to use it on" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "y9fdLCV8EzUb" }, "outputs": [], "source": [ "train = _______._______(_______) # fit the scaler and transform the train_set\n", "val = _______._______(_______) # transform the val_set\n", "test = _______._______(_______) # transform the test_set\n" ] }, { "cell_type": "markdown", "metadata": { "id": "7e3qvuPVHZ9r" }, "source": [ "If you did everything right, the first two rows of your dataset should return:\n", "\n", "Training: \\\n", "`[[-0.67248997 -0.24609328 -0.32674546 3.91 ]`\\\n", "` [-0.67248997 -0.07489897 0.08869363 4.01 ]]`\n", "\n", "Validation: \\\n", "`[[-1.28097285 -0.47435235 -0.47511657 3.22 ]`\\\n", "`[-1.06462338 -0.10343136 -0.03000326 3.12 ]]`\n", "\n", "Test:\\\n", "`[[-1.02405785 0.38161917 0.6228296 3.6 ]`\\\n", "`[-1.1051889 -0.44581996 -0.47511657 3.4 ]]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aezcolNxHTE_" }, "outputs": [], "source": [ "# Make sure your datasets have been scaled correctly\n", "print(train[:2], val[:2],test[:2], sep='\\n\\n')" ] }, { "cell_type": "markdown", "metadata": { "id": "dGfL67CdJsXV" }, "source": [ "We're almost ready to start training our algorithm! However, we currently have a list of temperature, precipitation, and runoff readings _for each day_ in our datasets. However, we're interested in looking at an $n$-sized __window__ of temperature and precipitation readings to predict a point in the runoff series! \n", "\n", "As an example with $ n= 5$: $\\;$\n", "to predict the runoff on a given Friday, we need the temperature and precipitation readings for Monday, Tuesday, Wednesday, Thursday, and the Friday itself." ] }, { "cell_type": "markdown", "metadata": { "id": "w8on_0--TzkA" }, "source": [ "## **Q5) Turn the training, validation, and test data into a set of series to feed into an LSTM**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QH2pRmsHUmgt" }, "outputs": [], "source": [ "# Let's start by printing out the shape of our training dataset\n", "print(train.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Jbm8fYY8Uzae" }, "outputs": [], "source": [ "# Sequence length will determine how many days we will take into consideration\n", "# when trying to predict the runoff. You can try multiple values, and we \n", "# recommend you check out the results for n= [7, 30, 182, 365]\n", "sequence_length = _______" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pplc9blaZGcb" }, "outputs": [], "source": [ "# In order to get the windows for the input data, we'll rely on numpy's \n", "# sliding window view. 
You can check the documentation for\n", "# numpy.lib.stride_tricks.sliding_window_view for more information\n", "# https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.sliding_window_view.html\n", "\n", "window_size = _______\n", "\n", "\n", "# For the source array, since we're saving X, we just need the columns for T,\n", "# P1, and P2. We don't want to include the runoff!\n", "train_X = np.lib.stride_tricks.sliding_window_view(_______[:,:_______], # source array\n", " _______, # window size \n", " _______=0) # Window Sliding Direction\n", "\n", "val_X = np.lib.stride_tricks.sliding_window_view(_______[:,:_______], # source array\n", " _______, # window size \n", " _______=0) # Window Sliding Direction\n", "\n", "test_X = np.lib.stride_tricks.sliding_window_view(_______[:,:_______], # source array\n", " _______, # window size \n", " _______=0) # Window Sliding Direction" ] }, { "cell_type": "markdown", "metadata": { "id": "hFl4_Sulflr-" }, "source": [ "Let's try printing out the shape of a transformed dataset..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HlefjFDzfrz-" }, "outputs": [], "source": [ "print(train_X.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "w6E68TS-f07D" }, "source": [ "You should find that the dataset has the wrong shape for our purposes! The sliding_window_view function will have returned the data in the shape:\\\n", "(number of samples, **number_of_features**, sequence_length) \n", "\n", "\n", "However, the convention we've followed so far in our course (which is quite common in the field) is features last: \\\n", "(number of samples, sequence_length, **number_of_features**).\n", "\n", "We can fix this with `numpy.moveaxis()`!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wYCIMzgBugY4" }, "outputs": [], "source": [ "train_X = np.moveaxis(_______, # source array\n", " _______, # axis to move \n", " _______) # destination for the axis\n", "\n", "val_X = np.moveaxis(_______, # source array\n", " _______, # axis to move \n", " _______) # destination for the axis\n", "\n", "test_X = np.moveaxis(_______, # source array\n", " _______, # axis to move \n", " _______) # destination for the axis" ] }, { "cell_type": "markdown", "metadata": { "id": "FpYhlxdXLcfT" }, "source": [ "We now need to prepare the target data using the runoff column! \\\n", "Here, we just need to select the column, skipping over the first ($window\\_size - 1$) elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MeFaGrBjMtbj" }, "outputs": [], "source": [ "# For each set, take the data between the window_size-1th element and the last element\n", "train_y = train[_______:,_______] \n", "val_y = val[_______:,_______]\n", "test_y = test[_______:,_______]" ] }, { "cell_type": "markdown", "metadata": { "id": "FrSlKIQxXsnF" }, "source": [ "Let's check the shape of our arrays to make sure that things make sense!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1h9GTFxjNqUg" }, "outputs": [], "source": [ "print(f'Train Shape: X:{train_X.shape}, y:{train_y.shape}',\n", " f'Validation Shape: X:{val_X.shape}, y:{val_y.shape}',\n", " f'Test Shape: X:{test_X.shape}, y:{test_y.shape}', sep='\\n')" ] }, { "cell_type": "markdown", "metadata": { "id": "dYZrn_YAQx9t" }, "source": [ "During development, a window size of 365 days was used. 
With this window size, the shape of our input/output datasets are: \\\n", "`Train Shape: X:(6941, 365, 3), y:(6941,)` \\\n", "`Validation Shape: X:(3289, 365, 3), y:(3289,)` \\\n", "`Test Shape: X:(3288, 365, 3), y:(3288,)`" ] }, { "cell_type": "markdown", "metadata": { "id": "y2aOa4YaRRfj" }, "source": [ "We finally have a dataset that we can easily use to train an LSTM! From this point on, we'll be relying on [PyTorch](https://pytorch.org/) to get our model ready.\n", "\n", "Let's start by importing the parts that we'll need. We won't hide any of the code here so you can see everything that's being done _without_ extra clicking. \\\n", "
😀
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bo0JMDiU0ns-" }, "outputs": [], "source": [ "# Let's import the main PyTorch library\n", "import torch\n", "\n", "# Let's import the pytorch nn module, which gives us access to a simpler API for \n", "# defining our model (It includes layer abstractions and other nifty tools)\n", "import torch.nn as nn\n", "\n", "# And import utilities to help us easily handle batch sizing\n", "from torch.utils.data import DataLoader, TensorDataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "5onopneNPDBc" }, "outputs": [], "source": [ "#@title Run this cell to check if a GPU is available. We'll use a variable called `calc_device` to store either the cpu or gpu to run calculations on. \n", "if torch.cuda.device_count()>0:\n", " calc_device = torch.device('cuda:0')\n", "else:\n", " calc_device = torch.device('cpu')" ] }, { "cell_type": "markdown", "metadata": { "id": "AWgt5u-jJWK4" }, "source": [ "First, let's start by moving away from Pandas Dataframes and NumPy N-Dimensional Arrays and into the realm of PyTorch _Tensors_. " ] }, { "cell_type": "markdown", "metadata": { "id": "s3QqeUcTQNgP" }, "source": [ "## **Q6) Convert the prepared datasets into PyTorch Datasets**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "E79WUAsNqD5o" }, "outputs": [], "source": [ "#@title Hint: Transitioning from Numpy ND-Arrays to Pytorch Tensors\n", "'''\n", "Pytorch includes utilities to change from numpy arrays to tensors. These include\n", "torch.from_numpy(DATA_SOURCE) and torch.FloatTensor(DATA_SOURCE). We recommend\n", "that you use FloatTensor, as its the default precision in PyTorch at the time of\n", "writing. \n", "\n", "Using from_numpy will default to double precision using our scripts so\n", "far, and will raise an exception later unless you change everything to double\n", "precision.\n", "\n", "the .to() method sends the tensor to a device. Importantly, all tensors\n", "should be on the same device. (Trying to do things across devices is very fancy\n", "and is therefore outside the scope of the course!)\n", "''';" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "UDOspjTxrVsf" }, "outputs": [], "source": [ "#@title Hint: Datasets from Tensors\n", "'''\n", "You can make a dataset by calling \n", "torch.TensorDataset(input_tensor, output_tensor)\n", "\n", "This creates an subscriptable iterator (i.e., you can use an [index] to access \n", "the index-th pair of input-output data)\n", "\n", "Using this abstraction ensures that your input and output tensors have the same\n", "size! It's a good sanity check.\n", "''';\n", "#TensorDataset?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mDr3yF_cHE9N" }, "outputs": [], "source": [ "# Let's convert our numpy arrays into tensors\n", "train_Xtensor = _______._______(_______).to(calc_device)\n", "train_ytensor = _______._______(_______).to(calc_device)\n", "\n", "val_Xtensor = _______._______(_______).to(calc_device)\n", "val_ytensor = _______._______(_______).to(calc_device)\n", "\n", "test_Xtensor = _______._______(_______).to(calc_device)\n", "test_ytensor = _______._______(_______).to(calc_device)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bJgYWKW8QpKv" }, "outputs": [], "source": [ "# And group them into a dataset using the TensorDataset utility\n", "train_data = _______(_______, _______)\n", "val_data = _______(_______, _______)\n", "test_data = _______(_______, _______)" ] }, { "cell_type": "markdown", "metadata": { "id": "wMtICIFlQ5hi" }, "source": [ "Let's check that your tensors and datasets have been converted properly. Run the code below and compare it to our results (remember, we used 365 days as the window_size and a batch size of 128 - if your hyperparameters are different your numbers will be different)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xwOQSC7_aYPq" }, "outputs": [], "source": [ "print(f'X tensor size (train): {train_Xtensor.size()}',\n", " f'X tensor size (val): {train_Xtensor.size()}',\n", " f'X tensor size (test): {train_Xtensor.size()}\\n',\n", " f'Train dataset size =: {len(train_data)}',\n", " f'Validation dataset size =: {len(val_data)}',\n", " f'Test dataset size =: {len(test_data)}',\n", " sep='\\n')" ] }, { "cell_type": "markdown", "metadata": { "id": "d5n-AC3ASIyA" }, "source": [ "During development, the output of the cell above was: \\\n", "`X tensor size (train): torch.Size([6941, 365, 3])` \\\n", "`X tensor size (val): torch.Size([6941, 365, 3])` \\\n", "`X tensor size (test): torch.Size([6941, 365, 3])` \\\n", "\n", "`Train dataset size =: 6941` \\\n", "`Validation dataset size =: 3289` \\\n", "`Test dataset size =: 3288` \\" ] }, { "cell_type": "markdown", "metadata": { "id": "ubDLMv1jX_Ml" }, "source": [ "Now that we have our datasets as PyTorch tensors, we can go ahead and load them into a $\\color{Green}{\\textit{DataLoader}}$. \n", "\n", "In PyTorch, DataLoaders are an abstraction that let's you do many powerful things to a dataset, such as: getting samples, shuffling, and other operations that are outside of the scope of this lab / course.\n", "\n", "Before we make the DataLoaders, we need to define some hyperparameters for our training. Specifically, we want to define the __batch size__ \\(i.e., how many sample you train on at a time). We also need to define a setting: the number of workers we'll use.\n", "\n", "We recommend you use a batch size of 128, and two workers. We won't go into details on how large a batch size to use, especially as there is ongoing research on the topic (e.g., [in](https://arxiv.org/abs/1609.04836) [these](https://arxiv.org/abs/2006.08517) [papers](https://arxiv.org/abs/1711.00489)). However, too large a batch size tends to make it harder for ML algorithms to generalize. As for the number of workers, this is generally dictaded by how many cpu/gpu cores you have available." 
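, "\n", "\n", "To recap Q6 and preview how the pieces fit together, the cell below is a minimal end-to-end sketch on made-up toy arrays (hypothetical data, not our runoff tensors): numpy arrays are wrapped into float tensors, grouped into a `TensorDataset`, and then batched by a `DataLoader`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch with hypothetical toy arrays - not our hydrology data\n", "import numpy as np\n", "import torch\n", "from torch.utils.data import DataLoader, TensorDataset\n", "\n", "toy_X = np.arange(30, dtype=np.float32).reshape(10, 3)  # 10 samples, 3 features\n", "toy_y = np.arange(10, dtype=np.float32)                 # 10 targets\n", "\n", "# numpy -> float32 tensors -> dataset of (input, output) pairs\n", "toy_ds = TensorDataset(torch.FloatTensor(toy_X), torch.FloatTensor(toy_y))\n", "print(len(toy_ds), toy_ds[0])\n", "\n", "# A DataLoader cuts the dataset into (optionally shuffled) batches\n", "toy_loader = DataLoader(toy_ds, batch_size=4, shuffle=True)\n", "for batch_X, batch_y in toy_loader:\n", "    print(batch_X.shape, batch_y.shape)  # batches of 4, 4, and 2 samples"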
] }, { "cell_type": "markdown", "metadata": { "id": "qDwKD-h5TezQ" }, "source": [ "## **Q7) Define the hyperparameters and settings for training**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7OhbVw4EdAaI" }, "outputs": [], "source": [ "#Define the hyperparameters\n", "batch_size = _______\n", "num_workers = _______" ] }, { "cell_type": "markdown", "metadata": { "id": "1WGQhEZsT9AL" }, "source": [ "## **Q8) Define the training, validation, and test Dataloaders**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "mQaOa7cer96a" }, "outputs": [], "source": [ "#@title Hint: PyTorch DataLoaders\n", "'''\n", "PyTorch Dataloaders\n", "Now that you have a pytorch dataset, you can feed it into a dataloader.\n", "Dataloaders can handle shuffling (important during training! Not so much\n", "for validation and testing...), and will automatically cut the data into\n", "batches if the batch_size parameter is given.\n", "\n", "Our data is small this time, so this isn't strictly necessary. But it's good\n", "practice.\n", "''';\n", "#DataLoader?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "S5xpw6CqZTBC" }, "outputs": [], "source": [ "train_loader = _______(\n", " _______=_______, # Define the dataset to use\n", " _______ = _______, # Set the batch size\n", " _______ = True/False # Define if shuffling is necessary for this loader\n", ")\n", "\n", "val_loader = _______(\n", " _______=_______, # Define the dataset to use\n", " _______ = _______, # Set the batch size\n", " _______ = True/False # Define if shuffling is necessary for this loader\n", ")\n", "\n", "test_loader = _______(\n", " _______=_______, # Define the dataset to use\n", " _______ = _______, # Set the batch size\n", " _______ = True/False # Define if shuffling is necessary for this loader\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "z4mb8kMUapCK" }, "source": [ "We're now fully ready on the data side - let's go on to setting up the neural network.\n", "\n", "When we were using Tensorflow, we defined a neural network by defining an instance of `tensorflow.keras.Model()`. Now that we're using pytorch, we'll instead _extend_ the equivalent of the model class in PyTorch - the `nn.Module` class. You can read more about the class in the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) online.\n", "\n", "Whenever we want to design a model with an LSTM layer, we'll need to define how many LSTM units we want to use (this will be the first hyperparameter for our simple LSTM model).\n", "\n", "We'll also be adding a dropout layer to our simple LSTM architecture. That is, there will be a fixed chance that the output of each LSTM unit will be zeroed during training - this will make our model more robust. Also, the probability of zeroing an output will be another hyperparameter.\n", "\n", "Finally, the last state of the LSTM layer are going to be combined linearly into a prediction for the runoff at the end of the time series. The default notebook will use a simple linear combination (i.e., we won't be using an activation function on the combination the way we often did before)." 
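, "\n", "\n", "If the `nn.Module` pattern is new to you, here is a minimal sketch of its structure on a toy fully-connected model. This is *not* the LSTM you will build below - it only illustrates the `__init__`/`forward` skeleton." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of subclassing nn.Module - a toy model, not the exercise answer\n", "import torch\n", "import torch.nn as nn\n", "\n", "class TinyNet(nn.Module):\n", "    def __init__(self, n_in, n_hidden):\n", "        super().__init__()  # initialize the parent nn.Module\n", "        self.hidden = nn.Linear(n_in, n_hidden)\n", "        self.out = nn.Linear(n_hidden, 1)\n", "\n", "    def forward(self, X):\n", "        # The forward pass maps inputs to predictions\n", "        return self.out(torch.relu(self.hidden(X)))\n", "\n", "print(TinyNet(n_in=3, n_hidden=8))  # printing a module lists its layers"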
] }, { "cell_type": "markdown", "metadata": { "id": "GR_ILrWjUSFC" }, "source": [ "## **Q9) Define the LSTM model architecture**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "N4ql2CWWKNYl" }, "outputs": [], "source": [ "#@title PyTorch Hints\n", "\n", "\"\"\"\n", "You can uncomment each snippet below to bring up the help pane for each one of\n", "the nn.Module, nn.LSTM, nn.Dropout, and nn.Linear objects.\n", "\"\"\";\n", "\n", "#nn.Module?\n", "#nn.LSTM?\n", "#nn.Dropout?\n", "#nn.Linear?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ehnOQHH-liC8" }, "outputs": [], "source": [ "# We'll start by defining a class extension:\n", "\n", "class MyLSTM(_______): # Define the class which we're extending using our MyLSTM class\n", " # Begin by defining the initialization function for our class\n", " def _______(_______, # The first argument to the class in all methods is itself \n", " _______, # Model Hyperparameter 1\n", " _______): # ModelHyperparameter 2\n", " \n", " #To make sure our class initializes properly, we'll call its superclass\n", " # (i.e., the class it's based off of) and run its \n", " super(MyLSTM, self).__init__()\n", "\n", " # Store the hyperparameters passed into the class during initialization\n", " self._______ = _______\n", " self._______ = _______\n", "\n", " # Define the LSTM layer\n", " self.LSTM_layer = ________.________( # Let's instantiate the pytorch LSTM module \n", " ________ = ________, # The number of features in the input series\n", " ________ = self.________, # The number of LSTM cells per layer\n", " ________ = ________, # The number of LSTM Layers\n", " ________ = ________, # Enable the use of biases\n", " batch_first = True # Let the layer know the input shape is (batch_size, series_len, num_features) \n", " )\n", "\n", " # Define the Dropout Layer\n", " self.dropout_layer = ________.________( #Instantiate the pytorch dropout layer\n", " ________ = self.________ # Set the dropout rate\n", " )\n", "\n", " # Define the the output layer - we'll make a linear aggregation of the LSTM outputs \n", " self.out_layer = ________.________( #Instantiate the pytorch linear layer\n", " ________ = self.________ # Number of inputs\n", " ________ = ________ # Number of predictions per number of inputs\n", " )\n", " \n", " # Define the forward pass\n", " def forward(self, X):\n", " \n", " # Calculate the LSTM output, hidden state, and memory cell\n", " # Note that output is not our model's output!\n", " # Read the documentation here: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html\n", " output, (h_n, c_n) = self.LSTM_layer(_____)\n", " \n", " # Calculate our model's hidden state from the last hidden state of the LSTM\n", " # using our dropout layer\n", " hidden_state = self._______(h_n[0])\n", "\n", " # Make the prediction by using the relu( dense ( hidden_state ) ) \n", " p_hat = torch.flatten(torch._______(self._______(hidden_state)))\n", " return _______" ] }, { "cell_type": "markdown", "metadata": { "id": "CaS8neifMRlD" }, "source": [ "Now that we have the model defined, we'll also be using a metric of performance that is somewhat more uncommon in machine learning applications than in hydrology - the **Nash-Sutcliff-Efficiency (NSE) Coefficient**. You can read more about it [on Wikipedia](https://en.wikipedia.org/wiki/Nash%E2%80%93Sutcliffe_model_efficiency_coefficient)." 
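, "\n", "\n", "In the form implemented in the next cell, with simulated runoff $Q_{sim}^t$, observed runoff $Q_{obs}^t$, and the mean of the observed series $\\overline{Q_{obs}}$:\n", "\n", "$$\\mathrm{NSE} = 1 - \\frac{\\sum_t \\left(Q_{sim}^t - Q_{obs}^t\\right)^2}{\\sum_t \\left(Q_{obs}^t - \\overline{Q_{obs}}\\right)^2}$$\n", "\n", "An NSE of 1 corresponds to a perfect match, while an NSE of 0 means the model is no better than always predicting the observation mean."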
] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "nzO7zA8zP6bj" }, "outputs": [], "source": [ "#@title Run this cell to define the NSE calculation function. It will be accessible as `calc_NSE`\n", "\n", "# Customzied evaluation metric NSE for validation set and test set # \n", "def calc_nse(sim: torch.FloatTensor, obs: torch.FloatTensor, global_obs_mean: torch.FloatTensor) -> float:\n", " \"\"\"Calculate the Nash-Sutcliff-Efficiency coefficient.\n", "\n", " :param obs: Array containing the observations\n", " :param sim: Array containing the simulations\n", " :param global_obs_mean: mean of the whole observation series\n", " :return: NSE value.\n", " \"\"\"\n", " numerator = torch.square(sim - obs).sum()\n", " denominator = torch.square(obs - global_obs_mean).sum()\n", " nse_val = 1 - numerator / denominator\n", "\n", " return nse_val" ] }, { "cell_type": "markdown", "metadata": { "id": "yeK-XEuzWits" }, "source": [ "The next thing we need to think about is the hyperparameters for our model and training. More specifically, we need to choose:\n", "* a number of LSTM units\n", "* the dropout rate \n", "* define our loss function\n", "* choose our optimizer and its parameters. \n", "* define the number of epochs we will train for\n", "\n", "Thankfully, it's not our first rodeo! \\\n", "
🤠
\n", "\n", "Regarding the number of units, we chose 16 units during development of the notebook. Feel free to change this, e.g. to values between 1 and 128.\n", "\n", "The dropout rate is the probability that an LSTM unit will be zeroed. We'll set it to 0.125 (i.e., 1/8), which means we expect to drop the output from ~two of the LSTM cells at random. You can read more [here](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html).\n", "\n", "Since this is a regression problem, we'll rely on MSE as the loss function.\n", "\n", "We've also know that Adam is a reliable optimizer, so we'll go ahead and use that. Adam needs us to define a learning rate, and $1*10^{-3}$ is a common default value. We'll try it out to see if it's appropriate. \n", "\n", " Note that you're free to play around with these hyperparameters - your performance will just be different from the ones we will show if you do so. I'm sure you can find a better solution 😀" ] }, { "cell_type": "markdown", "metadata": { "id": "GasdRSUyUdqP" }, "source": [ "## **Q10) Define the loss function, instantiate the model, define the optimizer, and set the number of epochs to iterate through**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KoFpcBctTa0J" }, "outputs": [], "source": [ "# Define the loss function for training \n", "loss_func = _______._______()\n", "\n", "# Instantiate our LSTM model\n", "model = MyLSTM(_______=_______, # Hyperparameter 1\n", " _______=_______, # Hyperparameter 2\n", " ).to(calc_device) # Make sure the model is on the same device as the Tensors \n", "\n", "# Define our optimizer\n", "optimizer = torch.optim.Adam(_______._______(), # We need to pass the parameters the optimizer will optimize\n", " _______=_______) # and pass the learning rate for the optimizer\n", "\n", "# Define the number of epochs\n", "num_epochs = _______" ] }, { "cell_type": "markdown", "metadata": { "id": "oe6xRxpvVb5_" }, "source": [ "We're almost ready to train our model. Before we move on to the training routine, let's take a minute to define how we will evaluate the performance of the model - both for validation during training and for testing after training! 
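\n", "\n", "One detail worth previewing: evaluation code is typically wrapped in a `torch.no_grad()` context so that PyTorch skips tracking gradients. As a minimal illustration (using the `model` and `val_Xtensor` defined earlier):\n", "\n", "```python\n", "with torch.no_grad():\n", "    val_pred = model(val_Xtensor)  # forward pass only, no autograd graph is built\n", "```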
" ] }, { "cell_type": "markdown", "metadata": { "id": "lCZDtFnwWC6v" }, "source": [ "## **Q11) Define the model evaluation function**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sFG54eNEW3Y_" }, "outputs": [], "source": [ "def eval_model(_______, # the model to be evaluated\n", " _______, # the dataloader for the dataset used for the evaluation\n", " _______, # the main loss function to be used \n", " _______): # the function to be used as a performance metric\n", " # Tell pytorch that we don't need to keep track of the gradients\n", " # After all, gradients are only used during training.\n", "\n", " with torch._______(): \n", " \n", " #Zero the loss and the metric\n", " _______ = 0\n", " _______ = 0\n", "\n", " # We'll start by calculating the mean of the whole dataset\n", " # This will be used in the NSE coefficient calculation\n", " \n", " # Start by defining a placeholder variable for the sum of the \n", " # observations, and another for the number of datapoints in the set\n", " global_sum = 0\n", " label_size = 0\n", "\n", " # Iterate through the features and labels in the dataloader\n", " for _______, _______ in _______:\n", " # add the sum of the labels to the global sum\n", " global_sum += _______._______()\n", "\n", " # Keep track of how many labels we've seen\n", " label_size += len(_______) \n", " \n", " # Calculate the mean of the observations using the information gathered\n", " global_mean = _______/_______\n", "\n", " # Iterate through the features and labels in the dataloader, this time\n", " # for evaluating the model\n", " for _______, _______ in _______:\n", " # get predictions from the model using the features in the batch\n", " _______ = _______(_______)\n", "\n", " # calculate the batch loss\n", " batch_loss = _______(_______, _______)\n", "\n", " #calculate the batch metric\n", " batch_metric = _______(_______, _______, _______)\n", "\n", " # Keep track of the loss and metric. Remember to convert them from \n", " # pytorch tensors to scalars\n", " loss += _______._______()\n", " metric += _______._______()\n", " \n", " # Calculate the number of batches in the dataloader\n", " num_batches = len(_______)\n", "\n", " # Calculate the mean loss\n", " loss = loss/_______\n", " metric = metric/_______\n", " \n", " return (loss, metric)" ] }, { "cell_type": "markdown", "metadata": { "id": "gV0JeIYwV52x" }, "source": [ "With PyTorch, unlike with Keras, we need to write out the training and evaluation routines! We'll do this by using a nested for loops - the outer loop will run for the number of epochs, while the inner loops will iterate over the batches to train and validate the model. 
\n", "\n", "The outer loop will do the following:\n", " \n", "* Set the training loss to zero \n", "* Train the model \n", "* Get the validation loss and metrics using our evaluation function\n", "* Store the training and validation metrics\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "11092VtXbGK4" }, "source": [ "## **Q12) Write the training loop**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gdkT3SMjbCib" }, "outputs": [], "source": [ "# Define empty lists that will be used to keep track of the training losses,\n", "# validation losses, and validation metrics \n", "_______ = _______\n", "_______ = _______\n", "_______ = _______\n", "\n", "#\n", "for epoch in range(_______):\n", " # Zero the training loss\n", " _______ = 0\n", "\n", " # Iterate through the features and labels in the train dataloader \n", " for _______, _______ in _______:\n", " # We need to zero the gradients associated with the model parameters.\n", " # We can do this directly using the optimizer\n", " _______._______()\n", "\n", " # Get predictions from the features using the model\n", " pred = _______(_______)\n", "\n", " # And use the predictions to calculate the batch loss\n", " batch_loss = _______(_______, _______)\n", "\n", " # Do the backprogragation from the batch_loss\n", " _______._______()\n", "\n", " # Step through the optimizer\n", " _______._______()\n", "\n", " # Keep track of the train loss sum. \n", " # Remember to turn the tensor into a scalar!\n", " train_loss += _______._______()\n", "\n", " # Calculate the number of batches in the training dataloader\n", " num_batches = _______(_______)\n", "\n", " # Get the mean training loss over the batches\n", " train_loss = _______/_______\n", "\n", " # And append it into the list we defined to keep track of the loss\n", " _______._______(_______)\n", " \n", " # Calculate the validation loss and metric.\n", " # Use the function we defined before!\n", " val_loss, val_NSE = _______(_______, # The model\n", " _______, # The dataloader\n", " _______, # The loss function\n", " _______) # The metric function\n", "\n", " # Append the metric and loss values into the lists we made to keep track\n", " val_losses._______(_______)\n", " val_NSEs._______(_______)\n", "\n", " # We want to save the model if it's the best version of it we've found\n", " # during training. If the NSE coefficient is the maximum in our training \n", " # history, we'll go ahead and save the model as our best model.\n", " if val_NSE >= max(val_NSEs):\n", " torch.save(model, './best_model.pt')\n", "\n", " # And print out a statement to keep track of our training as we iterate\n", " print(f'\\rEpoch: {_______+1}/{_______},' # Current epoch, total number of epochs\n", " f'train_loss: {_______},' # training loss we found this epoch\n", " f'val_loss: {_______},' # validation loss we found this epoch\n", " f'NSE: {_______}', # NSE coefficient we found this epoch\n", " end=\"\")\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "jRH51uE2aq4f" }, "outputs": [], "source": [ "#@title Run this cell to load the best model from our training before we evaluate the performance.\n", "model = torch.load( './best_model.pt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-GJ6LBV4yuXF" }, "outputs": [], "source": [ "# Plotting out the Training and Validation performance. 
We'll use \n", "# torch.no_grad() since we don't need to calculate the gradients for\n", "# plotting\n", "with torch.no_grad():\n", " fig, ax = plt.subplots(figsize = (20,5), dpi=150)\n", " \n", " # Plot the training losses\n", " ax.plot(train_losses,\n", " c='black', \n", " linewidth=1,\n", " label='Training Loss')\n", " \n", " # Plot the validation losses\n", " ax.plot(val_losses,\n", " c='orange', \n", " linewidth=1,\n", " label='Validation Loss')\n", " \n", " # Empty NSE plot, used for easy legend generation\n", " ax.plot([],\n", " color='teal',\n", " label='Nash-Sutcliff-Efficiency')\n", " \n", " # Copy\n", " metric_ax = ax.twinx()\n", " metric_ax.plot(val_NSEs,\n", " c='teal', \n", " linewidth=1,\n", " label='Nash-Sutcliff-Efficiency')\n", " metric_ax.set_ylim([-1,1])\n", " metric_ax.set_yticks(np.arange(-1,1.01,0.25))\n", " ax.autoscale(enable=True, axis='x', tight=True)\n", " ax.legend(loc='center right')\n", " fig.set_facecolor('lightgrey') " ] }, { "cell_type": "markdown", "metadata": { "id": "pyBZVqG8aldp" }, "source": [ "If you did everything the exact same way we did during development of the notebook (i.e., you chose the same hyperparameters as us) you should$^*$ get a set of training curves that look just like the ones below:\n", "
\n", "\n", "$_{^{*}\\text{GPU calculations are often non-deterministic for performance reasons, so you should get something remarkably similar, though not quite the same}}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JtspqneDGs03" }, "outputs": [], "source": [ "#Plotting out the Predicted Time Series on the Dataset\n", "test_obs_mean = torch.mean(test_ytensor)\n", "with torch.no_grad():\n", " fig, ax = plt.subplots(figsize = (20,5), dpi=150)\n", " ax.plot(test_set.index[window_size-1:], \n", " test_ytensor.cpu().numpy(),\n", " c='teal', \n", " linewidth=1,\n", " label='Observed')\n", " ax.plot(test_set.index[window_size-1:], \n", " model(test_Xtensor).cpu().numpy(), \n", " c='orange', \n", " alpha=0.85, \n", " linewidth=1,\n", " label='Predictions')\n", " ax.axhline(test_ytensor.mean().item(), color='teal', alpha=0.5, linestyle='--', label='Observation Mean')\n", " ax.legend()\n", " mse, NSE = eval_model(model,\n", " test_loader,\n", " loss_func,\n", " calc_nse)\n", " fig.suptitle('Performance on Test Dataset', size=14)\n", " fig.set_facecolor('lightgrey')\n", " ax.set_title(f'MSE: {mse:0.2f} Nash-Sutcliff-Efficiency: {NSE:0.2f}', size=11)\n", " ax.set_ylabel('Discharge ($\\\\frac{mm}{day}$)', size=11)\n", " ax.autoscale(enable=True, axis='x', tight=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "-ml5iZxkdl7-" }, "source": [ "Finally, assuming your decisions and ours were the same, evaluating your model with the code above should$^*$ give you the figure below:\n", "
\n", "\n", "$_{^{*}\\text{Once again, GPU operations are generally non-deterministic - you should get something remarkably similar}}$" ] } ], "metadata": { "accelerator": "GPU", "colab": { "include_colab_link": true, "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }