1.11. Geospatial Data with Geopandas#
1.11.1. Installing GeoPandas#
Please run the following code blocks in this section to:
Install the GeoPandas’s dependencies and GeoPandas
Download and unzip the data used in this notebook
Import GeoPandas and other required modules for the notebook
#Install the GeoPandas's dependencies
!pip install --upgrade pyshp
!pip install --upgrade shapely
!pip install --upgrade descartes
!pip install --upgrade rtree
Requirement already satisfied: pyshp in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (2.3.1)
Requirement already satisfied: shapely in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (2.0.6)
Requirement already satisfied: numpy<3,>=1.14 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from shapely) (1.25.1)
Requirement already satisfied: descartes in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (1.1.0)
Requirement already satisfied: matplotlib in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from descartes) (3.7.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (1.1.0)
Requirement already satisfied: cycler>=0.10 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (4.40.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (1.4.4)
Requirement already satisfied: numpy>=1.20 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (1.25.1)
Requirement already satisfied: packaging>=20.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (2.8.2)
Requirement already satisfied: importlib-resources>=3.2.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (6.0.0)
Requirement already satisfied: zipp>=3.1.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from importlib-resources>=3.2.0->matplotlib->descartes) (3.16.0)
Requirement already satisfied: six>=1.5 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from python-dateutil>=2.7->matplotlib->descartes) (1.16.0)
Requirement already satisfied: rtree in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (1.3.0)
#Install the GeoPandas
!pip install --upgrade geopandas
Requirement already satisfied: geopandas in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (1.0.1)
Requirement already satisfied: numpy>=1.22 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (1.25.1)
Requirement already satisfied: pyogrio>=0.7.2 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (0.9.0)
Requirement already satisfied: packaging in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (23.1)
Requirement already satisfied: pandas>=1.4.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (2.0.3)
Requirement already satisfied: pyproj>=3.3.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (3.6.0)
Requirement already satisfied: shapely>=2.0.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (2.0.6)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pandas>=1.4.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pandas>=1.4.0->geopandas) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pandas>=1.4.0->geopandas) (2023.3)
Requirement already satisfied: certifi in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pyogrio>=0.7.2->geopandas) (2024.8.30)
Requirement already satisfied: six>=1.5 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from python-dateutil>=2.8.2->pandas>=1.4.0->geopandas) (1.16.0)
#Download the data used in this notebook
!gdown 1b1lngOIvuNnZxepbT8RyV3KX1itRky5z
'gdown' is not recognized as an internal or external command,
operable program or batch file.
#Unzip the data used in this notebook
!unzip '/content/data.zip'
unzip: cannot find either '/content/data.zip' or '/content/data.zip'.zip.
#Import GeoPandas and other required modules for the notebook
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
References:
Geopandas official website: Introduction to GeoPandas https://geopandas.org/en/stable/getting_started/introduction.html
Automating GIS process https://autogis-site.readthedocs.io/en/latest/notebooks/L2/01-geopandas-basics.html
Use Data for Earth and Environmental Science in Open Source Python https://www.earthdatascience.org/courses/use-data-open-source-python/
The Shapely User Manual https://shapely.readthedocs.io/en/stable/manual.html
Geospatial Analysis with Python and R https://kodu.ut.ee/~kmoch/geopython2020/index.html
Introduction to Geospatial Data in Python https://www.datacamp.com/tutorial/geospatial-data-python
Learning goals
Understand the data structure of GeoPandas
Read and write geospatial data with GeoPandas
Access geospatial data attributes and methods using GeoPandas
Perform spatial operations with GeoPandas
GeoPandas extends the data science library pandas for geospatial data by combining:
Pandas: data analysis
Shapely: handle geometries (Deterministic spatial analysis - set theoretic manipulations of planar features)
Fiona: read and write files
pyproj: manage coordinate reference systems
matplotlib: plotting.
1.11.2. Geopandas Data Structure#
The main difference between GeoDataFrame and DataFrame is that GeoDataFrame contains at least one column ‘geometry’, which contains shapely objects, e.g. points, lines, polygons, multipolygons, etc.
### Create a GeoPandas GeoDataFrame from a Pandas DataFrame with coordiantes
### Tutorial: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
# Create Pandas DataFrame from csv used in the Pandas exercise
path = r'/content/data/usgs_earthquakes_2014.csv'
df_earthquakes = pd.read_csv(path)
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[6], line 6
1 ### Create a GeoPandas GeoDataFrame from a Pandas DataFrame with coordiantes
2 ### Tutorial: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
3
4 # Create Pandas DataFrame from csv used in the Pandas exercise
5 path = r'/content/data/usgs_earthquakes_2014.csv'
----> 6 df_earthquakes = pd.read_csv(path)
File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
899 kwds_defaults = _refine_defaults_read(
900 dialect,
901 delimiter,
(...)
908 dtype_backend=dtype_backend,
909 )
910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)
File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:577, in _read(filepath_or_buffer, kwds)
574 _validate_names(kwds.get("names", None))
576 # Create the parser.
--> 577 parser = TextFileReader(filepath_or_buffer, **kwds)
579 if chunksize or iterator:
580 return parser
File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:1407, in TextFileReader.__init__(self, f, engine, **kwds)
1404 self.options["has_index_names"] = kwds["has_index_names"]
1406 self.handles: IOHandles | None = None
-> 1407 self._engine = self._make_engine(f, self.engine)
File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:1661, in TextFileReader._make_engine(self, f, engine)
1659 if "b" not in mode:
1660 mode += "b"
-> 1661 self.handles = get_handle(
1662 f,
1663 mode,
1664 encoding=self.options.get("encoding", None),
1665 compression=self.options.get("compression", None),
1666 memory_map=self.options.get("memory_map", False),
1667 is_text=is_text,
1668 errors=self.options.get("encoding_errors", "strict"),
1669 storage_options=self.options.get("storage_options", None),
1670 )
1671 assert self.handles is not None
1672 f = self.handles.handle
File ~\.conda\envs\JB\lib\site-packages\pandas\io\common.py:859, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
854 elif isinstance(handle, str):
855 # Check whether the filename is to be opened in binary mode.
856 # Binary mode does not support 'encoding' and 'newline'.
857 if ioargs.encoding and "b" not in ioargs.mode:
858 # Encoding
--> 859 handle = open(
860 handle,
861 ioargs.mode,
862 encoding=ioargs.encoding,
863 errors=errors,
864 newline="",
865 )
866 else:
867 # Binary mode
868 handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/content/data/usgs_earthquakes_2014.csv'
#Have a look at the Pandas DataFrame
df_earthquakes.head(2)
# Create GeoPandas GeoDataFrame from the Pandas DataFrame
gdf_earthquakes = gpd.GeoDataFrame(df_earthquakes,
geometry=gpd.points_from_xy(df_earthquakes.longitude,
df_earthquakes.latitude))
# Have a look at the GeoPandas GeoDataFrame and notice the 'geometry' column
gdf_earthquakes.head(2)
The core data structure is geopandas.GeoDataFrame, a subclass of pandas.DataFrame.
# See the type of the geodataframe
type(gdf_earthquakes)
geopandas.GeoSeries is a subclass of pandas.Series. GeoSeries can contain any geometry type and has a
GeoSeries.crs
attribute (Coordinate Reference System) for projection.
# See the type of a column in the geodataframe
type(gdf_earthquakes['time'])
GeoDataFrame combines pandas.Series with traditional data (numerical, boolean, text, etc.) and geopandas.GeoSeries, thus can work with geospatial data.
# Take a quick look at the geospatial data contained in the GeoDataFrame on a map
ax = gdf_earthquakes.plot()
1.11.3. Read and write files#
1.11.3.1. Reading files#
Assuming you have a file containing both data and geometry (e.g. GeoPackage, GeoJSON, Shapefile), you can read it using
geopandas.read_file()
, which automatically detects the filetype and creates a GeoDataFrame.
# Set filepath
path = r"/content/data/damselfish-data/DAMSELFISH_distributions.shp"
# Read data into a GeoDataFrame
gdf_DAMSELFISH = gpd.read_file(path)
1.11.3.2. GeoPandas for handing vector format spatial data#
Three fundamental types of geometric objects (the spatial data model implemented by shapely & defined by interior, boundary, and exterior):
Point: A single point defined by a pair of x, y coordinates, e.g. locations of trees
Point class. Interior - exactly one point. Boundary - exactly no points. Exterior - all other points
Line: At least two connected points, e.g. roads, streams
LineString and LinearRing classes. Interior - the infinitely many points along its length. Boundary - Two end points. Exterior - all other points.
Polygon: At least three points connected and closed by lines, e.g. boundaries of lakes, countries.
Polygon class. Interior - the infinitely many points within. Boundary - one or more lines. Exterior - all other points.
# Check type of the data
type(gdf_DAMSELFISH)
# Take a quick look at the data
gdf_DAMSELFISH.head(3)
# Look at the geometries in the GeoDataFrame
gdf_DAMSELFISH['geometry'].head(3)
# See the type of geometric objects in the data
type(gdf_DAMSELFISH.iloc[0]['geometry'])
# See the geometric objects in the data
gdf_DAMSELFISH.iloc[0]['geometry']
# Take a quick look at the data on a map
ax = gdf_DAMSELFISH.plot()
1.11.3.3. Writing files#
To write a GeoDataFrame back to file use GeoDataFrame.to_file(). The default file format is Shapefile, but you can specify your own with the driver keyword.
# Create a output path for the data
out_file_path = r"/content/data/damselfish-data/DAMSELFISH_distributions_SELECTION.shp"
# Select first 50 rows, this a the numpy/pandas syntax to ``slice``
# parts out a dataframe or array, from position 0 until (excluding) 50
selection = gdf_DAMSELFISH[0:50]
# Write those rows into a new Shapefile (the default output file format is Shapefile)
selection.to_file(out_file_path)
1.11.4. Attributes and Methods#
1.11.4.1. Area and distance#
To measure the area of each polygon (or MultiPolygon in this specific case), access the
GeoDataFrame.area
attribute, which returns a pandas.Series. Note thatGeoDataFrame.area
is justGeoSeries.area
applied to the active geometry column.The Euclidian distance between points can be accessed using
GeoDataFrame.distance
, which callsGeoDataFrame.distance
to measure the distance from the point of interest
# Calculate and store the areas individual polygons
gdf_DAMSELFISH['area'] = gdf_DAMSELFISH.area
gdf_DAMSELFISH['area'].head(3)
1.11.4.2. Boundary and centroid#
To get the boundary of each polygon (LineString), access the
GeoDataFrame.boundary
The Centroid of a given geometry (line, polygon, etc.) can be accessed via the
.centroid
attribute
# Get the boundary of each polygon
gdf_DAMSELFISH['boundary'] = gdf_DAMSELFISH.boundary
gdf_DAMSELFISH['boundary'].head(3)
# Get the centroid of each polygon
gdf_DAMSELFISH['centroid'] = gdf_DAMSELFISH.centroid
gdf_DAMSELFISH['centroid'].head(3)
# Measure distance between each centroid and the first centroid
first_point = gdf_DAMSELFISH['centroid'].iloc[0]
gdf_DAMSELFISH['distance'] = gdf_DAMSELFISH['centroid'].distance(first_point)
gdf_DAMSELFISH['distance'].head(3)
1.11.4.3. Projection#
Coordinate reference systems (CRS) are important because the geometric shapes in a GeoDataFrame are simply a collection of coordinates in an arbitrary space. A CRS tells Python how those coordinates related to places on the Earth. A map projection (or a projected coordinate system) is a systematic transformation of the latitudes and longitudes into a plain surface where units are quite commonly represented as meters (instead of decimal degrees).
As map projections of gis-layers are fairly often defined differently (i.e. they do not match), it is a common procedure to redefine the map projections to be identical in both layers. It is important that the layers have the same projection as it makes it possible to analyze the spatial relationships between layers, such as in conducting the Point in Polygon spatial query.
The EPSG number (“European Petroleum Survey Group”) is a code that tells about the coordinate system of the dataset.
# Reading the data from the Europe_borders.shp file
path = r"/content/data/Europe_borders/Europe_borders.shp"
gdf_Europe_borders = gpd.read_file(path)
# See the current coordinate reference system from .crs attribute
gdf_Europe_borders.crs
# Convert (aka reproject) into Lambert Azimuthal Equal Area projection (EPSG: 3035)
gdf_Europe_borders_proj = gdf_Europe_borders.to_crs(epsg=3035)
# See the current coordinate reference system from .crs attribute
gdf_Europe_borders_proj.crs
1.11.4.4. Plot#
GeoPandas can plot maps, so we can check how the geometries appear in space. To plot the active geometry, call GeoDataFrame.plot().
# Plot GeoDataFrames
# Understand the difference between the projections
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
# Plot the WGS84
gdf_Europe_borders.plot(facecolor='gray', ax=axs[0]);
# Add title
axs[0].set_title("WGS84 CRS", y=1.05);
# Plot the one with ETRS-LAEA projection
gdf_Europe_borders_proj.plot(facecolor='blue', ax=axs[1]);
# Add title
axs[1].set_title("ETRS Lambert Azimuthal Equal Area projection", y=1.05);
# Remove empty white space around the plot
fig.tight_layout()
1.11.5. Spatial Relationships and Operations#
1.11.6. Geometric Manipulations#
GeoPandas makes available all the tools for geometric manipulations in the shapely library: buffer, boundary, convex_hull, envelope, unary_union
# Tutorial: https://geopandas.org/en/stable/docs/user_guide/geometric_manipulations.html
# Generate a GeoSeries containing 2000 random points
import numpy as np
from shapely.geometry import Point
xmin, xmax, ymin, ymax = 900000, 1080000, 120000, 280000
xc = (xmax - xmin) * np.random.random(2000) + xmin
yc = (ymax - ymin) * np.random.random(2000) + ymin
pts = gpd.GeoSeries([Point(x, y) for x, y in zip(xc, yc)])
# Draw a circle with fixed radius around each point
circles = pts.buffer(2000)
# Collapse these circles into a single MultiPolygon geometry with unary_union
# https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.unary_union.html
mp = circles.unary_union
# See unified multipolygon
mp
# Check the type of the unified multipolygon
type(mp)
1.11.6.1. Set-Operations with Overlay#
When working with multiple spatial datasets – especially multiple polygon or line datasets – users often wish to create new shapes based on places where those datasets overlap (or don’t overlap). These manipulations are often referred using the language of sets – intersections, unions, and differences. These types of operations are made available in the geopandas library through the overlay() method.
# Tutorial: https://geopandas.org/en/stable/docs/user_guide/set_operations.html
# Create some example data
from shapely.geometry import Polygon
polys1 = gpd.GeoSeries([Polygon([(0,0), (2,0), (2,2), (0,2)]),
Polygon([(2,2), (4,2), (4,4), (2,4)])])
polys2 = gpd.GeoSeries([Polygon([(1,1), (3,1), (3,3), (1,3)]),
Polygon([(3,3), (5,3), (5,5), (3,5)])])
df1 = gpd.GeoDataFrame({'geometry': polys1, 'df1':[1,2]})
df2 = gpd.GeoDataFrame({'geometry': polys2, 'df2':[1,2]})
ax = df1.plot(color='red');
df2.plot(ax=ax, color='green', alpha=0.5)
# The overlay() method with how='intersection', it returns only those geometries that are contained by both GeoDataFrames:
res_intersection = df1.overlay(df2, how='intersection')
# Plot and check the result of the overlay
ax = res_intersection.plot(cmap='tab10')
df1.plot(ax=ax, facecolor='none', edgecolor='k');
df2.plot(ax=ax, facecolor='none', edgecolor='k')