{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "e1cdc987", "metadata": { "id": "e1cdc987" }, "source": [ "# Geospatial Data with Geopandas" ] }, { "cell_type": "markdown", "source": [ "![geopandas_logo_green.png]()" ], "metadata": { "id": "gbmLcVZY4V6O" }, "id": "gbmLcVZY4V6O" }, { "cell_type": "markdown", "source": [ "## Installing GeoPandas\n", "Please run the following code blocks in this section to:\n", "* Install the GeoPandas's dependencies and GeoPandas\n", "* Download and unzip the data used in this notebook\n", "* Import GeoPandas and other required modules for the notebook" ], "metadata": { "id": "FGK9JHUkutNy" }, "id": "FGK9JHUkutNy" }, { "cell_type": "code", "source": [ "#Install the GeoPandas's dependencies\n", "!pip install --upgrade pyshp\n", "\n", "!pip install --upgrade shapely\n", "\n", "!pip install --upgrade descartes\n", "\n", "!pip install --upgrade rtree" ], "metadata": { "id": "wcyinTQVuoOm" }, "id": "wcyinTQVuoOm", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "#Install the GeoPandas\n", "\n", "!pip install --upgrade geopandas" ], "metadata": { "id": "tIRleXwFvDC2" }, "id": "tIRleXwFvDC2", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "#Download the data used in this notebook\n", "!gdown 1b1lngOIvuNnZxepbT8RyV3KX1itRky5z" ], "metadata": { "id": "bdKuowGVyGeO" }, "id": "bdKuowGVyGeO", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "#Unzip the data used in this notebook\n", "!unzip '/content/data.zip'" ], "metadata": { "id": "5rkcEABq13_8" }, "id": "5rkcEABq13_8", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "77589768", "metadata": { "id": "77589768" }, "outputs": [], "source": [ "#Import GeoPandas and other required modules for the notebook\n", "import geopandas as gpd\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "id": "f4a3ef56", "metadata": { "id": "f4a3ef56" }, "source": [ "\n", "\n", "\n", "\n", "References: \n", "1. Geopandas official website: Introduction to GeoPandas\n", "https://geopandas.org/en/stable/getting_started/introduction.html \n", "2. Automating GIS process\n", "https://autogis-site.readthedocs.io/en/latest/notebooks/L2/01-geopandas-basics.html \n", "3. Use Data for Earth and Environmental Science in Open Source Python\n", "https://www.earthdatascience.org/courses/use-data-open-source-python/\n", "4. The Shapely User Manual\n", "https://shapely.readthedocs.io/en/stable/manual.html\n", "5. Geospatial Analysis with Python and R\n", "https://kodu.ut.ee/~kmoch/geopython2020/index.html\n", "6. Introduction to Geospatial Data in Python\n", "https://www.datacamp.com/tutorial/geospatial-data-python" ] }, { "cell_type": "markdown", "source": [ "**Learning goals**\n", "\n", "1. Understand the data structure of GeoPandas\n", "2. Read and write geospatial data with GeoPandas\n", "3. Access geospatial data attributes and methods using GeoPandas\n", "4. Perform spatial operations with GeoPandas" ], "metadata": { "id": "_s3ytYuzvg4N" }, "id": "_s3ytYuzvg4N" }, { "cell_type": "markdown", "id": "ecf116e7", "metadata": { "id": "ecf116e7" }, "source": [ "**GeoPandas** extends the data science library pandas for geospatial data by combining:\n", "* **Pandas**: data analysis\n", "* **Shapely**: handle geometries (Deterministic spatial analysis - set theoretic manipulations of planar features)\n", "* **Fiona**: read and write files\n", "* **pyproj**: manage coordinate reference systems\n", "* **matplotlib**: plotting.\n" ] }, { "cell_type": "markdown", "id": "ef9eb787", "metadata": { "id": "ef9eb787" }, "source": [ "## Geopandas Data Structure\n", "* The main difference between GeoDataFrame and DataFrame is that GeoDataFrame contains at least one column **'geometry'**, which contains shapely objects, e.g. points, lines, polygons, multipolygons, etc." ] }, { "cell_type": "code", "execution_count": null, "id": "a6229cfa", "metadata": { "id": "a6229cfa" }, "outputs": [], "source": [ "### Create a GeoPandas GeoDataFrame from a Pandas DataFrame with coordiantes\n", "### Tutorial: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html\n", "\n", "# Create Pandas DataFrame from csv used in the Pandas exercise\n", "path = r'/content/data/usgs_earthquakes_2014.csv'\n", "df_earthquakes = pd.read_csv(path)" ] }, { "cell_type": "code", "execution_count": null, "id": "a5158be7", "metadata": { "id": "a5158be7" }, "outputs": [], "source": [ "#Have a look at the Pandas DataFrame\n", "df_earthquakes.head(2)" ] }, { "cell_type": "code", "execution_count": null, "id": "953af241", "metadata": { "id": "953af241" }, "outputs": [], "source": [ "# Create GeoPandas GeoDataFrame from the Pandas DataFrame\n", "gdf_earthquakes = gpd.GeoDataFrame(df_earthquakes,\n", " geometry=gpd.points_from_xy(df_earthquakes.longitude,\n", " df_earthquakes.latitude))" ] }, { "cell_type": "code", "execution_count": null, "id": "293ce6f8", "metadata": { "id": "293ce6f8" }, "outputs": [], "source": [ "# Have a look at the GeoPandas GeoDataFrame and notice the 'geometry' column\n", "gdf_earthquakes.head(2)" ] }, { "cell_type": "markdown", "source": [ "* The core data structure is **geopandas.GeoDataFrame**, a subclass of pandas.DataFrame." ], "metadata": { "id": "SlUqDTTbrTrH" }, "id": "SlUqDTTbrTrH" }, { "cell_type": "code", "source": [ "# See the type of the geodataframe\n", "type(gdf_earthquakes)" ], "metadata": { "id": "ZaMk0-LyrWia" }, "id": "ZaMk0-LyrWia", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "* **geopandas.GeoSeries** is a subclass of pandas.Series. GeoSeries can contain any geometry type and has a `GeoSeries.crs` attribute (Coordinate Reference System) for projection." ], "metadata": { "id": "iS_EUsz6rjnQ" }, "id": "iS_EUsz6rjnQ" }, { "cell_type": "code", "source": [ "# See the type of a column in the geodataframe\n", "type(gdf_earthquakes['time'])" ], "metadata": { "id": "v_i0iQQ7riMw" }, "id": "v_i0iQQ7riMw", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "* **GeoDataFrame combines** pandas.Series with traditional data (numerical, boolean, text, etc.) and geopandas.GeoSeries, thus can work with geospatial data." ], "metadata": { "id": "totlRGQHqiV1" }, "id": "totlRGQHqiV1" }, { "cell_type": "code", "execution_count": null, "id": "e5153c48", "metadata": { "id": "e5153c48" }, "outputs": [], "source": [ "# Take a quick look at the geospatial data contained in the GeoDataFrame on a map\n", "ax = gdf_earthquakes.plot()" ] }, { "cell_type": "markdown", "id": "6a9e94f8", "metadata": { "id": "6a9e94f8" }, "source": [ "## Read and write files" ] }, { "cell_type": "markdown", "source": [ "### Reading files\n", "* Assuming you have a file containing both data and geometry (e.g. GeoPackage, GeoJSON, Shapefile), you can read it using `geopandas.read_file()`, which automatically detects the filetype and creates a GeoDataFrame.\n", "\n" ], "metadata": { "id": "F_7OvY6_Ph6p" }, "id": "F_7OvY6_Ph6p" }, { "cell_type": "code", "execution_count": null, "id": "5afc2588", "metadata": { "id": "5afc2588" }, "outputs": [], "source": [ "# Set filepath\n", "path = r\"/content/data/damselfish-data/DAMSELFISH_distributions.shp\"" ] }, { "cell_type": "code", "execution_count": null, "id": "8b3d29a5", "metadata": { "id": "8b3d29a5" }, "outputs": [], "source": [ "# Read data into a GeoDataFrame\n", "gdf_DAMSELFISH = gpd.read_file(path)" ] }, { "cell_type": "markdown", "source": [ "### GeoPandas for handing vector format spatial data\n", "\n", "* Three fundamental types of **geometric objects** (the spatial data model implemented by shapely & defined by *interior, boundary, and exterior*):\n", " * **Point**: A single point defined by a pair of x, y coordinates, e.g. locations of trees\n", " * *Point class*. *Interior* - exactly one point. *Boundary* - exactly no points. *Exterior* - all other points\n", " * **Line**: At least two connected points, e.g. roads, streams\n", " * *LineString and LinearRing classes*. *Interior* - the infinitely many points along its length. *Boundary* - Two end points. *Exterior* - all other points.\n", " * **Polygon**: At least three points connected and closed by lines, e.g. boundaries of lakes, countries.\n", " * *Polygon class*. *Interior* - the infinitely many points within. *Boundary* - one or more lines. *Exterior* - all other points. " ], "metadata": { "id": "gwpCspSesUj2" }, "id": "gwpCspSesUj2" }, { "cell_type": "code", "execution_count": null, "id": "c36d351e", "metadata": { "id": "c36d351e" }, "outputs": [], "source": [ "# Check type of the data\n", "type(gdf_DAMSELFISH)" ] }, { "cell_type": "code", "execution_count": null, "id": "b9d825ae", "metadata": { "id": "b9d825ae" }, "outputs": [], "source": [ "# Take a quick look at the data\n", "gdf_DAMSELFISH.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "e6f42f69", "metadata": { "id": "e6f42f69" }, "outputs": [], "source": [ "# Look at the geometries in the GeoDataFrame\n", "gdf_DAMSELFISH['geometry'].head(3)" ] }, { "cell_type": "code", "source": [ "# See the type of geometric objects in the data\n", "type(gdf_DAMSELFISH.iloc[0]['geometry'])" ], "metadata": { "id": "EE9o2RIVsjhu" }, "id": "EE9o2RIVsjhu", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# See the geometric objects in the data\n", "gdf_DAMSELFISH.iloc[0]['geometry']" ], "metadata": { "id": "ilV2ljdnsyGU" }, "id": "ilV2ljdnsyGU", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "2c6385cc", "metadata": { "id": "2c6385cc" }, "outputs": [], "source": [ "# Take a quick look at the data on a map\n", "ax = gdf_DAMSELFISH.plot()" ] }, { "cell_type": "markdown", "source": [ "### Writing files\n", "* To write a GeoDataFrame back to file use GeoDataFrame.to_file(). The default file format is Shapefile, but you can specify your own with the driver keyword." ], "metadata": { "id": "H1TC8Y3usa_H" }, "id": "H1TC8Y3usa_H" }, { "cell_type": "code", "execution_count": null, "id": "901280c2", "metadata": { "id": "901280c2" }, "outputs": [], "source": [ "# Create a output path for the data\n", "out_file_path = r\"/content/data/damselfish-data/DAMSELFISH_distributions_SELECTION.shp\"\n", "\n", "# Select first 50 rows, this a the numpy/pandas syntax to ``slice``\n", "# parts out a dataframe or array, from position 0 until (excluding) 50\n", "selection = gdf_DAMSELFISH[0:50]\n", "\n", "# Write those rows into a new Shapefile (the default output file format is Shapefile)\n", "selection.to_file(out_file_path)" ] }, { "cell_type": "markdown", "id": "b3440cb5", "metadata": { "id": "b3440cb5" }, "source": [ "## Attributes and Methods\n" ] }, { "cell_type": "markdown", "source": [ "### Area and distance\n", "\n", "* To measure the area of each polygon (or MultiPolygon in this specific case), access the [`GeoDataFrame.area`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.area.html) attribute, which returns a pandas.Series. Note that `GeoDataFrame.area` is just `GeoSeries.area` applied to the active geometry column.\n", "* The Euclidian distance between points can be accessed using `GeoDataFrame.distance`, which calls [`GeoDataFrame.distance`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.distance.html) to measure the distance from the point of interest\n" ], "metadata": { "id": "JvxLOhLWPnTQ" }, "id": "JvxLOhLWPnTQ" }, { "cell_type": "code", "execution_count": null, "id": "e1a7ee78", "metadata": { "id": "e1a7ee78" }, "outputs": [], "source": [ "# Calculate and store the areas individual polygons\n", "gdf_DAMSELFISH['area'] = gdf_DAMSELFISH.area\n", "gdf_DAMSELFISH['area'].head(3)" ] }, { "cell_type": "markdown", "source": [ "### Boundary and centroid\n", "\n", "* To get the boundary of each polygon (LineString), access the [`GeoDataFrame.boundary`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.boundary.html)\n", "* The Centroid of a given geometry (line, polygon, etc.) can be accessed via the [`.centroid`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.centroid.html) attribute\n" ], "metadata": { "id": "8dunLtmptbp0" }, "id": "8dunLtmptbp0" }, { "cell_type": "code", "execution_count": null, "id": "92d50011", "metadata": { "id": "92d50011" }, "outputs": [], "source": [ "# Get the boundary of each polygon\n", "gdf_DAMSELFISH['boundary'] = gdf_DAMSELFISH.boundary\n", "gdf_DAMSELFISH['boundary'].head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "8d6526e8", "metadata": { "id": "8d6526e8" }, "outputs": [], "source": [ "# Get the centroid of each polygon\n", "gdf_DAMSELFISH['centroid'] = gdf_DAMSELFISH.centroid\n", "gdf_DAMSELFISH['centroid'].head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "450b9bfb", "metadata": { "id": "450b9bfb" }, "outputs": [], "source": [ "# Measure distance between each centroid and the first centroid\n", "first_point = gdf_DAMSELFISH['centroid'].iloc[0]\n", "gdf_DAMSELFISH['distance'] = gdf_DAMSELFISH['centroid'].distance(first_point)\n", "gdf_DAMSELFISH['distance'].head(3)" ] }, { "cell_type": "markdown", "source": [ "### Projection\n", "\n", "* [Coordinate reference systems (CRS)](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.crs.html) are important because the geometric shapes in a GeoDataFrame are simply a collection of coordinates in an arbitrary space. A CRS tells Python how those coordinates related to places on the Earth. A map projection (or a projected coordinate system) is a systematic transformation of the latitudes and longitudes into a plain surface where units are quite commonly represented as meters (instead of decimal degrees).\n", "\n", "* As map projections of gis-layers are fairly often defined differently (i.e. they do not match), it is a common procedure to redefine the map projections to be identical in both layers. It is important that the layers have the same projection as it makes it possible to analyze the spatial relationships between layers, such as in conducting the Point in Polygon spatial query.\n", "* The EPSG number (“European Petroleum Survey Group”) is a code that tells about the coordinate system of the dataset.\n" ], "metadata": { "id": "3JdAFREztjTF" }, "id": "3JdAFREztjTF" }, { "cell_type": "code", "execution_count": null, "id": "46181537", "metadata": { "id": "46181537" }, "outputs": [], "source": [ "# Reading the data from the Europe_borders.shp file\n", "path = r\"/content/data/Europe_borders/Europe_borders.shp\"\n", "gdf_Europe_borders = gpd.read_file(path)" ] }, { "cell_type": "code", "execution_count": null, "id": "022d2007", "metadata": { "id": "022d2007" }, "outputs": [], "source": [ "# See the current coordinate reference system from .crs attribute\n", "gdf_Europe_borders.crs" ] }, { "cell_type": "code", "execution_count": null, "id": "c0d84473", "metadata": { "id": "c0d84473" }, "outputs": [], "source": [ "# Convert (aka reproject) into Lambert Azimuthal Equal Area projection (EPSG: 3035)\n", "gdf_Europe_borders_proj = gdf_Europe_borders.to_crs(epsg=3035)" ] }, { "cell_type": "code", "execution_count": null, "id": "dfbb8c1f", "metadata": { "id": "dfbb8c1f" }, "outputs": [], "source": [ "# See the current coordinate reference system from .crs attribute\n", "gdf_Europe_borders_proj.crs" ] }, { "cell_type": "markdown", "source": [ "### Plot\n", "\n", "* GeoPandas can plot maps, so we can check how the geometries appear in space. To plot the active geometry, call [GeoDataFrame.plot()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.plot.html)." ], "metadata": { "id": "meXlrhFjttoN" }, "id": "meXlrhFjttoN" }, { "cell_type": "code", "execution_count": null, "id": "d11d7a7e", "metadata": { "id": "d11d7a7e" }, "outputs": [], "source": [ "# Plot GeoDataFrames\n", "# Understand the difference between the projections\n", "fig, axs = plt.subplots(1, 2, figsize=(8, 4))\n", "# Plot the WGS84\n", "gdf_Europe_borders.plot(facecolor='gray', ax=axs[0]);\n", "\n", "# Add title\n", "axs[0].set_title(\"WGS84 CRS\", y=1.05);\n", "\n", "# Plot the one with ETRS-LAEA projection\n", "gdf_Europe_borders_proj.plot(facecolor='blue', ax=axs[1]);\n", "\n", "# Add title\n", "axs[1].set_title(\"ETRS Lambert Azimuthal Equal Area projection\", y=1.05);\n", "\n", "# Remove empty white space around the plot\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "59378219", "metadata": { "id": "59378219" }, "source": [ "## Spatial Relationships and Operations" ] }, { "cell_type": "markdown", "source": [ "## Geometric Manipulations\n", "GeoPandas makes available all the tools for geometric manipulations in the shapely library: buffer, boundary, convex_hull, envelope, unary_union\n" ], "metadata": { "id": "xhQpmrBxPuNZ" }, "id": "xhQpmrBxPuNZ" }, { "cell_type": "code", "execution_count": null, "id": "55e353ac", "metadata": { "id": "55e353ac" }, "outputs": [], "source": [ "# Tutorial: https://geopandas.org/en/stable/docs/user_guide/geometric_manipulations.html\n", "\n", "# Generate a GeoSeries containing 2000 random points\n", "import numpy as np\n", "from shapely.geometry import Point\n", "xmin, xmax, ymin, ymax = 900000, 1080000, 120000, 280000\n", "xc = (xmax - xmin) * np.random.random(2000) + xmin\n", "yc = (ymax - ymin) * np.random.random(2000) + ymin\n", "pts = gpd.GeoSeries([Point(x, y) for x, y in zip(xc, yc)])" ] }, { "cell_type": "code", "execution_count": null, "id": "7ba82ce3", "metadata": { "id": "7ba82ce3" }, "outputs": [], "source": [ "# Draw a circle with fixed radius around each point\n", "circles = pts.buffer(2000)" ] }, { "cell_type": "code", "execution_count": null, "id": "82ef3ea9", "metadata": { "id": "82ef3ea9" }, "outputs": [], "source": [ "# Collapse these circles into a single MultiPolygon geometry with unary_union\n", "# https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.unary_union.html\n", "mp = circles.unary_union" ] }, { "cell_type": "code", "execution_count": null, "id": "a70507ea", "metadata": { "id": "a70507ea" }, "outputs": [], "source": [ "# See unified multipolygon\n", "mp" ] }, { "cell_type": "code", "execution_count": null, "id": "c0dcc61f", "metadata": { "id": "c0dcc61f" }, "outputs": [], "source": [ "# Check the type of the unified multipolygon\n", "type(mp)" ] }, { "cell_type": "markdown", "source": [ "### Set-Operations with Overlay\n", "\n", "When working with multiple spatial datasets – especially multiple polygon or line datasets – users often wish to create new shapes based on places where those datasets overlap (or don’t overlap). These manipulations are often referred using the language of sets – intersections, unions, and differences. These types of operations are made available in the geopandas library through the [overlay()](https://geopandas.org/en/stable/docs/user_guide/set_operations.html?highlight=overlay) method." ], "metadata": { "id": "7OkpKQ_nuBNi" }, "id": "7OkpKQ_nuBNi" }, { "cell_type": "code", "execution_count": null, "id": "708c8b59", "metadata": { "id": "708c8b59" }, "outputs": [], "source": [ "# Tutorial: https://geopandas.org/en/stable/docs/user_guide/set_operations.html\n", "# Create some example data\n", "from shapely.geometry import Polygon\n", "\n", "polys1 = gpd.GeoSeries([Polygon([(0,0), (2,0), (2,2), (0,2)]),\n", " Polygon([(2,2), (4,2), (4,4), (2,4)])])\n", "\n", "\n", "polys2 = gpd.GeoSeries([Polygon([(1,1), (3,1), (3,3), (1,3)]),\n", " Polygon([(3,3), (5,3), (5,5), (3,5)])])\n", "\n", "\n", "df1 = gpd.GeoDataFrame({'geometry': polys1, 'df1':[1,2]})\n", "\n", "df2 = gpd.GeoDataFrame({'geometry': polys2, 'df2':[1,2]})\n", "\n", "ax = df1.plot(color='red');\n", "\n", "df2.plot(ax=ax, color='green', alpha=0.5)" ] }, { "cell_type": "code", "execution_count": null, "id": "9503ce15", "metadata": { "id": "9503ce15" }, "outputs": [], "source": [ "# The overlay() method with how='intersection', it returns only those geometries that are contained by both GeoDataFrames:\n", "res_intersection = df1.overlay(df2, how='intersection')" ] }, { "cell_type": "code", "execution_count": null, "id": "c57bc475", "metadata": { "id": "c57bc475" }, "outputs": [], "source": [ "# Plot and check the result of the overlay\n", "ax = res_intersection.plot(cmap='tab10')\n", "\n", "df1.plot(ax=ax, facecolor='none', edgecolor='k');\n", "\n", "df2.plot(ax=ax, facecolor='none', edgecolor='k')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "colab": { "provenance": [], "toc_visible": true, "include_colab_link": true } }, "nbformat": 4, "nbformat_minor": 5 }