4.1. Unsupervised Learning for Clustering and Dimensionality Reduction#
4.1.1. Learning objectives:#
Distinguish supervised from unsupervised learning
Understand the necessity of reducing dimensionality for big datasets
Know at least two approaches for dimensionality reduction
Understand the steps of PCA
Distinguish clustering from supervised classification
Know how to implement the K-means algorithm and select the number of clusters
Know how to implement Gaussian mixture models and how to detect anomalies
4.1.2. Unsupervised Learning (UL) vs. Supervised Learning (SL)#
UL
Trains models without labeled output data.
Discovers patterns, groupings, or structures in data.
Includes techniques like clustering, dimensionality reduction, and density estimation.
Useful when specific output labels are unknown or unavailable.
SL
Trains models with labeled examples for predictions.
Classifies data into predefined categories.
Common tasks: classification, regression, object detection.
Requires labeled data for model training.
Semi-SL
Combines elements of both unsupervised and supervised learning.
Uses a small portion of labeled data and a larger amount of unlabeled data.
Aims to leverage labeled data for improved model performance.
Suited for scenarios with limited labeled data availability.
4.1.3. Dimension Reduction#
Dimensionality reduction is the technique of reducing the number of features in a dataset while retaining essential information. It aids data visualization and analysis and can enhance machine learning model performance.
The Curse of Dimensionality
High-dimensional datasets are likely to be very sparse, with training instances far from each other, which increases the risk of overfitting.
The curse of dimensionality refers to the challenges and issues that arise as data dimensionality increases, leading to increased sparsity, computational complexity, and decreased efficiency in various machine learning and data analysis tasks.
Main Methods for Dimension Reduction
Projection Methods: Linearly transform data to lower dimensions, e.g. Principal Component Analysis (PCA)
Manifold Learning: Captures underlying data structure for nonlinear relationships, e.g. t-Distributed Stochastic Neighbor Embedding (t-SNE), Locally Linear Embedding (LLE)
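Below is a minimal sketch of the two manifold-learning methods named above, applied to scikit-learn's synthetic Swiss-roll data (an illustrative choice, not part of the original material; hyperparameters such as perplexity and n_neighbors are assumptions):

```python
# Manifold learning on a nonlinear dataset: the 3D "Swiss roll"
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)

# t-SNE: preserves local neighborhoods, mainly used for 2D visualization
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# LLE: reconstructs each point from its neighbors to "unroll" the manifold
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                               random_state=42).fit_transform(X)

print(X_tsne.shape, X_lle.shape)  # both (1000, 2)
```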
Principal Component Analysis (PCA)
PCA is a dimensionality reduction method that identifies and preserves the most important information in a dataset while reducing its dimensionality. It employs the Singular Value Decomposition (SVD) technique to find the principal components.
PCA identifies orthogonal axes (principal components) that maximize variance in the data. It projects the data onto these components, effectively reducing dimensionality.
Other PCA techniques, such as Incremental PCA, Randomized PCA, and Kernel PCA, offer variations and optimizations for specific use cases.
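As a concrete illustration, here is a minimal scikit-learn PCA sketch on synthetic low-rank data (the data and the 95% variance threshold are assumptions for demonstration only):

```python
# Basic PCA: keep enough components to explain 95% of the variance
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))                     # 3 underlying factors
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

pca = PCA(n_components=0.95)          # float < 1 means "explain this much variance"
X_reduced = pca.fit_transform(X)

print(pca.n_components_)              # number of components retained (here ~3)
print(pca.explained_variance_ratio_)  # variance explained by each component
X_recovered = pca.inverse_transform(X_reduced)  # project back to the original space
```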
PCA, along with other dimension reduction techniques, is applied in environmental sciences to uncover patterns such as El Niño and modes of variance, reduce collinearity among variables, identify pollution sources in ambient air and soil, compare water quality across watersheds, and quantify phenotypic variation among species based on multiple measurements, supporting a wide range of environmental analyses and modeling.
4.1.4. Clustering#
Clustering vs. Classification
Clustering groups data into clusters based on similarities, without predefined labels. Classification classifies data into predefined categories using labeled examples. Clustering discovers patterns and relationships, while classification predicts labels based on known outcomes.
Clustering assists in categorizing ecosystems based on characteristics, supporting climate change and stressor analysis. It also aids in urban planning by identifying built environment patterns.
K-means clustering
K-means clustering divides data into K clusters by minimizing the sum of squared distances between points and cluster centroids. It iteratively assigns points to the nearest centroid and updates centroids.
K-Means Clustering Steps:
Initialization: Randomly select K initial centroids.
Assignment: Assign each point to the nearest centroid.
Update Centroids: Recalculate centroids based on cluster points.
Reassignment: Repeat steps 2 and 3 until convergence.
Convergence: Stop when centroids stabilize or after a set number of iterations.
Clusters: Resulting centroids define distinct clusters in the data.
The K-means algorithm is fast and scalable but struggles with clusters of varying sizes, densities, and nonspherical shapes. Inertia quantifies cluster quality, and the silhouette score and the “elbow” method help determine the ideal number of clusters (k).
Accelerated K-Means and Mini-batch K-Means are advanced variants of K-Means clustering designed to enhance the algorithm’s speed and efficiency, particularly for large datasets.
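A minimal K-Means sketch on synthetic blob data (assumed for illustration) shows how inertia and the silhouette score are used to judge cluster quality, and how the mini-batch variant is called:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(kmeans.inertia_)                      # sum of squared distances to centroids
print(silhouette_score(X, kmeans.labels_))  # closer to 1 = better-separated clusters

# Mini-batch variant: trades a little accuracy for speed on large datasets
mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=10,
                      random_state=42).fit(X)
print(mbk.inertia_)
```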
Gaussian mixture models (GMM)
GMM is a probabilistic model for representing data as a mixture of several Gaussian distributions.
It employs the Expectation-Maximization (EM) algorithm to assign instances either to hard clusters (each instance belongs to exactly one cluster) or to soft clusters (each instance receives an estimated probability of belonging to each cluster).
Selecting the appropriate number of clusters involves minimizing information criteria like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), considering factors such as the number of dimensions, instances, and clusters.
GMM helps anomaly detection by modeling normal data distribution. Anomalies are identified as data points with low probability under the GMM, serving as outliers.
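The sketch below (on assumed synthetic data) illustrates both ideas: choosing the number of components by minimizing the BIC, and flagging low-density points as anomalies; the 2% threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# Choose the number of components by minimizing an information criterion
bics = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
    bics.append(gm.bic(X))
best_k = int(np.argmin(bics)) + 1

gm = GaussianMixture(n_components=best_k, n_init=5, random_state=42).fit(X)
labels = gm.predict(X)        # hard cluster assignments
probs = gm.predict_proba(X)   # soft assignments (per-cluster probabilities)

# Flag the lowest-density 2% of points as anomalies
densities = gm.score_samples(X)
threshold = np.percentile(densities, 2)
anomalies = X[densities < threshold]
print(best_k, anomalies.shape)
```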
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a clustering algorithm that identifies clusters as continuous regions of high data density. It excels when clusters have varying densities and are separated by lower-density regions. DBSCAN categorizes data points into core instances (well within dense areas) and border instances (on cluster fringes) based on their proximity to other data points. Anomalies are data points that are neither core instances nor have nearby core instances. This method is robust to outliers, can handle clusters of different shapes, and is useful for a variety of applications.
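A minimal DBSCAN sketch on scikit-learn's two-moons data (an assumed example where K-Means would struggle with the nonspherical shapes):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                        # -1 marks points treated as noise/anomalies
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # core instances well inside dense regions
```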
Examples of Dimensionality Reduction and Unsupervised Learning in Environmental Science🌄:
Dimensionality reduction:
PCA helps identify pollution sources in ambient air and soil, compare water quality in different watersheds, and quantify phenotypic variations amongst species based on multiple measurements.
Clustering:
Clustering can be used to group different types of ecosystems together, based on their characteristics such as vegetation, wildlife, and climate. This information can be used to understand how different ecosystems respond to climate change and other environmental stressors.
Cluster analysis can be used to identify built environmental patterns.
GMMs could be applied to stratified lake water samples to identify distinct water quality profiles based on their chemical composition.
Tips and Tricks 💡
Exercise 1: Dimensionality Reduction
Does PCA always reduce model training time and increase model performance?
Load the MNIST dataset and split it into a training set and a test set;
Train a Random Forest classifier on the dataset,
Time how long it takes,
Evaluate the resulting model on the test set.
Train a Logistic Regression classifier on the dataset,
Time how long it takes,
Evaluate the resulting model on the test set.
Use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
Train a new Random Forest classifier on the reduced dataset. Was training much faster? Was the performance better?
Train a new Logistic Regression classifier on the reduced dataset. Was training much faster? Was the performance better?
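A possible starting point for this exercise is sketched below (it is not the reference solution; MNIST is assumed to be fetched via fetch_openml, and timings and accuracies will vary by machine):

```python
import time
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

mnist = fetch_openml("mnist_784", as_frame=False)
X_train, X_test = mnist.data[:60000], mnist.data[60000:]
y_train, y_test = mnist.target[:60000], mnist.target[60000:]

def time_and_score(model, Xtr, Xte):
    """Fit the model, returning (training time in seconds, test accuracy)."""
    start = time.time()
    model.fit(Xtr, y_train)
    elapsed = time.time() - start
    return elapsed, accuracy_score(y_test, model.predict(Xte))

# Baselines on the full 784-dimensional data
print(time_and_score(RandomForestClassifier(random_state=42), X_train, X_test))
print(time_and_score(LogisticRegression(max_iter=1000), X_train, X_test))

# Reduce dimensionality while keeping 95% of the variance, then retrain
pca = PCA(n_components=0.95)
X_train_r = pca.fit_transform(X_train)
X_test_r = pca.transform(X_test)

print(time_and_score(RandomForestClassifier(random_state=42), X_train_r, X_test_r))
print(time_and_score(LogisticRegression(max_iter=1000), X_train_r, X_test_r))
```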
Exercise 2: Clustering
How to choose the number of clusters when using K-means?
Load the MNIST dataset;
Time one K-Means training;
Use PCA for dimension reduction;
Train K-Means with multiple ks;
Calculate the performance for different values of k using the silhouette score;
Visualize silhouette score & inertia against k;
Visualize clusters.
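One possible outline for these steps is sketched below (a sketch only; the subsample size, range of k, and other hyperparameters are assumptions):

```python
import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = fetch_openml("mnist_784", as_frame=False).data[:10000]  # subsample to keep it fast

# Time one K-Means run on the raw pixels
start = time.time()
KMeans(n_clusters=10, n_init=10, random_state=42).fit(X)
print("raw pixels:", time.time() - start, "s")

# PCA first, then sweep k, recording inertia and silhouette score
X_r = PCA(n_components=0.95).fit_transform(X)
ks = range(5, 16)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_r)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_r, km.labels_,
                                        sample_size=2000, random_state=42))

# Visualize inertia (elbow method) and silhouette score against k
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, "o-")
ax1.set_xlabel("k"); ax1.set_ylabel("inertia")
ax2.plot(list(ks), silhouettes, "o-")
ax2.set_xlabel("k"); ax2.set_ylabel("silhouette score")
plt.show()
```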
Exercise 3: Application to Dynamical Regime Identification - Tracking the impact of global Heating on Ocean Regimes (THOR)
Reading: a transparent machine learning (ML) method, called Tracking global Heating with Ocean Regimes (THOR), that explains the governing mechanisms of the Atlantic Meridional Overturning Circulation (AMOC).
Transparent ML
Dynamics contributing to AMOC changes under a global heating model
The paper demonstrates practical applications of machine learning techniques, including clustering, to analyze environmental data. Specifically, clustering is employed to categorize distinct dynamical regimes in the North Atlantic Circulation, as depicted in Figure 4. The authors utilize an Ensemble MLP trained with labeled data obtained through unsupervised machine learning, emphasizing six dynamical regimes related to oceanic transport and circulation patterns in the North Atlantic.
Exercise: Step 1 of THOR - Identify 2D dynamical regimes
Data
Reduced to 5 dimensions: (1) curlA, (2) curlB, (3) curlTau, (4) curlCori, (5) BPT,
i.e., with shape (360, 720, 5): five 2D fields on a 360 × 720 grid, so each pixel/cell carries 5 features;
pixels/cells are to be clustered into groups based on these features.
Use Xarray to format data.
Use K-Means to cluster the 5D training data;
Visualize identified clusters.
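A hypothetical sketch of this step is given below: it clusters each grid cell of a (360, 720, 5) field into six regimes with K-Means (six matching the regimes discussed in the reading). The input file name "thor_terms.nc", the dimension names "lat"/"lon", and the preprocessing choices are all assumptions, not the paper's actual pipeline.

```python
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

ds = xr.open_dataset("thor_terms.nc")                        # assumed input file
features = ["curlA", "curlB", "curlTau", "curlCori", "BPT"]
data = xr.concat([ds[v] for v in features], dim="feature")   # (feature, lat, lon), dims assumed
X = data.transpose("lat", "lon", "feature").values.reshape(-1, len(features))

# Drop land / missing cells and standardize the five terms
valid = ~np.isnan(X).any(axis=1)
X_valid = StandardScaler().fit_transform(X[valid])

km = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X_valid)

# Map cluster labels back onto the 360 x 720 grid for plotting
labels = np.full(X.shape[0], np.nan)
labels[valid] = km.labels_
regimes = labels.reshape(360, 720)
```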
Figure: Estimating the Circulation and Climate of the Ocean (ECCO) dynamical regimes: their geographical expanse, area-averaged term magnitudes, and learning contributions. Figure credit: Sonnewald et al. (2019).