Open In Colab

1.11. Geospatial Data with Geopandas#

geopandas_logo_green.png

1.11.1. Installing GeoPandas#

Please run the following code blocks in this section to:

  • Install the GeoPandas’s dependencies and GeoPandas

  • Download and unzip the data used in this notebook

  • Import GeoPandas and other required modules for the notebook

#Install the GeoPandas's dependencies
!pip install --upgrade pyshp

!pip install --upgrade shapely

!pip install --upgrade descartes

!pip install --upgrade rtree
Requirement already satisfied: pyshp in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (2.3.1)
Requirement already satisfied: shapely in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (2.0.6)
Requirement already satisfied: numpy<3,>=1.14 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from shapely) (1.25.1)
Requirement already satisfied: descartes in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (1.1.0)
Requirement already satisfied: matplotlib in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from descartes) (3.7.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (1.1.0)
Requirement already satisfied: cycler>=0.10 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (4.40.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (1.4.4)
Requirement already satisfied: numpy>=1.20 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (1.25.1)
Requirement already satisfied: packaging>=20.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (2.8.2)
Requirement already satisfied: importlib-resources>=3.2.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from matplotlib->descartes) (6.0.0)
Requirement already satisfied: zipp>=3.1.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from importlib-resources>=3.2.0->matplotlib->descartes) (3.16.0)
Requirement already satisfied: six>=1.5 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from python-dateutil>=2.7->matplotlib->descartes) (1.16.0)
Requirement already satisfied: rtree in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (1.3.0)
#Install the GeoPandas

!pip install --upgrade geopandas
Requirement already satisfied: geopandas in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (1.0.1)
Requirement already satisfied: numpy>=1.22 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (1.25.1)
Requirement already satisfied: pyogrio>=0.7.2 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (0.9.0)
Requirement already satisfied: packaging in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (23.1)
Requirement already satisfied: pandas>=1.4.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (2.0.3)
Requirement already satisfied: pyproj>=3.3.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (3.6.0)
Requirement already satisfied: shapely>=2.0.0 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from geopandas) (2.0.6)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pandas>=1.4.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pandas>=1.4.0->geopandas) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pandas>=1.4.0->geopandas) (2023.3)
Requirement already satisfied: certifi in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from pyogrio>=0.7.2->geopandas) (2024.8.30)
Requirement already satisfied: six>=1.5 in c:\users\tbeucler\.conda\envs\jb\lib\site-packages (from python-dateutil>=2.8.2->pandas>=1.4.0->geopandas) (1.16.0)
#Download the data used in this notebook
!gdown 1b1lngOIvuNnZxepbT8RyV3KX1itRky5z
'gdown' is not recognized as an internal or external command,
operable program or batch file.
#Unzip the data used in this notebook
!unzip '/content/data.zip'
unzip:  cannot find either '/content/data.zip' or '/content/data.zip'.zip.
#Import GeoPandas and other required modules for the notebook
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

References:

  1. Geopandas official website: Introduction to GeoPandas https://geopandas.org/en/stable/getting_started/introduction.html

  2. Automating GIS process https://autogis-site.readthedocs.io/en/latest/notebooks/L2/01-geopandas-basics.html

  3. Use Data for Earth and Environmental Science in Open Source Python https://www.earthdatascience.org/courses/use-data-open-source-python/

  4. The Shapely User Manual https://shapely.readthedocs.io/en/stable/manual.html

  5. Geospatial Analysis with Python and R https://kodu.ut.ee/~kmoch/geopython2020/index.html

  6. Introduction to Geospatial Data in Python https://www.datacamp.com/tutorial/geospatial-data-python

Learning goals

  1. Understand the data structure of GeoPandas

  2. Read and write geospatial data with GeoPandas

  3. Access geospatial data attributes and methods using GeoPandas

  4. Perform spatial operations with GeoPandas

GeoPandas extends the data science library pandas for geospatial data by combining:

  • Pandas: data analysis

  • Shapely: handle geometries (Deterministic spatial analysis - set theoretic manipulations of planar features)

  • Fiona: read and write files

  • pyproj: manage coordinate reference systems

  • matplotlib: plotting.

1.11.2. Geopandas Data Structure#

  • The main difference between GeoDataFrame and DataFrame is that GeoDataFrame contains at least one column ‘geometry’, which contains shapely objects, e.g. points, lines, polygons, multipolygons, etc.

### Create a GeoPandas GeoDataFrame from a Pandas DataFrame with coordiantes
### Tutorial: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html

# Create Pandas DataFrame from csv used in the Pandas exercise
path = r'/content/data/usgs_earthquakes_2014.csv'
df_earthquakes = pd.read_csv(path)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 6
      1 ### Create a GeoPandas GeoDataFrame from a Pandas DataFrame with coordiantes
      2 ### Tutorial: https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html
      3 
      4 # Create Pandas DataFrame from csv used in the Pandas exercise
      5 path = r'/content/data/usgs_earthquakes_2014.csv'
----> 6 df_earthquakes = pd.read_csv(path)

File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    899 kwds_defaults = _refine_defaults_read(
    900     dialect,
    901     delimiter,
   (...)
    908     dtype_backend=dtype_backend,
    909 )
    910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)

File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:577, in _read(filepath_or_buffer, kwds)
    574 _validate_names(kwds.get("names", None))
    576 # Create the parser.
--> 577 parser = TextFileReader(filepath_or_buffer, **kwds)
    579 if chunksize or iterator:
    580     return parser

File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:1407, in TextFileReader.__init__(self, f, engine, **kwds)
   1404     self.options["has_index_names"] = kwds["has_index_names"]
   1406 self.handles: IOHandles | None = None
-> 1407 self._engine = self._make_engine(f, self.engine)

File ~\.conda\envs\JB\lib\site-packages\pandas\io\parsers\readers.py:1661, in TextFileReader._make_engine(self, f, engine)
   1659     if "b" not in mode:
   1660         mode += "b"
-> 1661 self.handles = get_handle(
   1662     f,
   1663     mode,
   1664     encoding=self.options.get("encoding", None),
   1665     compression=self.options.get("compression", None),
   1666     memory_map=self.options.get("memory_map", False),
   1667     is_text=is_text,
   1668     errors=self.options.get("encoding_errors", "strict"),
   1669     storage_options=self.options.get("storage_options", None),
   1670 )
   1671 assert self.handles is not None
   1672 f = self.handles.handle

File ~\.conda\envs\JB\lib\site-packages\pandas\io\common.py:859, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    854 elif isinstance(handle, str):
    855     # Check whether the filename is to be opened in binary mode.
    856     # Binary mode does not support 'encoding' and 'newline'.
    857     if ioargs.encoding and "b" not in ioargs.mode:
    858         # Encoding
--> 859         handle = open(
    860             handle,
    861             ioargs.mode,
    862             encoding=ioargs.encoding,
    863             errors=errors,
    864             newline="",
    865         )
    866     else:
    867         # Binary mode
    868         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: '/content/data/usgs_earthquakes_2014.csv'
#Have a look at the Pandas DataFrame
df_earthquakes.head(2)
# Create GeoPandas GeoDataFrame from the Pandas DataFrame
gdf_earthquakes = gpd.GeoDataFrame(df_earthquakes,
                                   geometry=gpd.points_from_xy(df_earthquakes.longitude,
                                                               df_earthquakes.latitude))
# Have a look at the GeoPandas GeoDataFrame and notice the 'geometry' column
gdf_earthquakes.head(2)
  • The core data structure is geopandas.GeoDataFrame, a subclass of pandas.DataFrame.

# See the type of the geodataframe
type(gdf_earthquakes)
  • geopandas.GeoSeries is a subclass of pandas.Series. GeoSeries can contain any geometry type and has a GeoSeries.crs attribute (Coordinate Reference System) for projection.

# See the type of a column in the geodataframe
type(gdf_earthquakes['time'])
  • GeoDataFrame combines pandas.Series with traditional data (numerical, boolean, text, etc.) and geopandas.GeoSeries, thus can work with geospatial data.

# Take a quick look at the geospatial data contained in the GeoDataFrame on a map
ax = gdf_earthquakes.plot()

1.11.3. Read and write files#

1.11.3.1. Reading files#

  • Assuming you have a file containing both data and geometry (e.g. GeoPackage, GeoJSON, Shapefile), you can read it using geopandas.read_file(), which automatically detects the filetype and creates a GeoDataFrame.

# Set filepath
path = r"/content/data/damselfish-data/DAMSELFISH_distributions.shp"
# Read data into a GeoDataFrame
gdf_DAMSELFISH = gpd.read_file(path)

1.11.3.2. GeoPandas for handing vector format spatial data#

  • Three fundamental types of geometric objects (the spatial data model implemented by shapely & defined by interior, boundary, and exterior):

    • Point: A single point defined by a pair of x, y coordinates, e.g. locations of trees

      • Point class. Interior - exactly one point. Boundary - exactly no points. Exterior - all other points

    • Line: At least two connected points, e.g. roads, streams

      • LineString and LinearRing classes. Interior - the infinitely many points along its length. Boundary - Two end points. Exterior - all other points.

    • Polygon: At least three points connected and closed by lines, e.g. boundaries of lakes, countries.

      • Polygon class. Interior - the infinitely many points within. Boundary - one or more lines. Exterior - all other points.

# Check type of the data
type(gdf_DAMSELFISH)
# Take a quick look at the data
gdf_DAMSELFISH.head(3)
# Look at the geometries in the GeoDataFrame
gdf_DAMSELFISH['geometry'].head(3)
# See the type of geometric objects in the data
type(gdf_DAMSELFISH.iloc[0]['geometry'])
# See the geometric objects in the data
gdf_DAMSELFISH.iloc[0]['geometry']
# Take a quick look at the data on a map
ax = gdf_DAMSELFISH.plot()

1.11.3.3. Writing files#

  • To write a GeoDataFrame back to file use GeoDataFrame.to_file(). The default file format is Shapefile, but you can specify your own with the driver keyword.

# Create a output path for the data
out_file_path = r"/content/data/damselfish-data/DAMSELFISH_distributions_SELECTION.shp"

# Select first 50 rows, this a the numpy/pandas syntax to ``slice``
# parts out a dataframe or array, from position 0 until (excluding) 50
selection = gdf_DAMSELFISH[0:50]

# Write those rows into a new Shapefile (the default output file format is Shapefile)
selection.to_file(out_file_path)

1.11.4. Attributes and Methods#

1.11.4.1. Area and distance#

  • To measure the area of each polygon (or MultiPolygon in this specific case), access the GeoDataFrame.area attribute, which returns a pandas.Series. Note that GeoDataFrame.area is just GeoSeries.area applied to the active geometry column.

  • The Euclidian distance between points can be accessed using GeoDataFrame.distance, which calls GeoDataFrame.distance to measure the distance from the point of interest

# Calculate and store the areas individual polygons
gdf_DAMSELFISH['area'] = gdf_DAMSELFISH.area
gdf_DAMSELFISH['area'].head(3)

1.11.4.2. Boundary and centroid#

  • To get the boundary of each polygon (LineString), access the GeoDataFrame.boundary

  • The Centroid of a given geometry (line, polygon, etc.) can be accessed via the .centroid attribute

# Get the boundary of each polygon
gdf_DAMSELFISH['boundary'] = gdf_DAMSELFISH.boundary
gdf_DAMSELFISH['boundary'].head(3)
# Get the centroid of each polygon
gdf_DAMSELFISH['centroid'] = gdf_DAMSELFISH.centroid
gdf_DAMSELFISH['centroid'].head(3)
# Measure distance between each centroid and the first centroid
first_point = gdf_DAMSELFISH['centroid'].iloc[0]
gdf_DAMSELFISH['distance'] = gdf_DAMSELFISH['centroid'].distance(first_point)
gdf_DAMSELFISH['distance'].head(3)

1.11.4.3. Projection#

  • Coordinate reference systems (CRS) are important because the geometric shapes in a GeoDataFrame are simply a collection of coordinates in an arbitrary space. A CRS tells Python how those coordinates related to places on the Earth. A map projection (or a projected coordinate system) is a systematic transformation of the latitudes and longitudes into a plain surface where units are quite commonly represented as meters (instead of decimal degrees).

  • As map projections of gis-layers are fairly often defined differently (i.e. they do not match), it is a common procedure to redefine the map projections to be identical in both layers. It is important that the layers have the same projection as it makes it possible to analyze the spatial relationships between layers, such as in conducting the Point in Polygon spatial query.

  • The EPSG number (“European Petroleum Survey Group”) is a code that tells about the coordinate system of the dataset.

# Reading the data from the Europe_borders.shp file
path =  r"/content/data/Europe_borders/Europe_borders.shp"
gdf_Europe_borders = gpd.read_file(path)
# See the current coordinate reference system from .crs attribute
gdf_Europe_borders.crs
# Convert (aka reproject) into Lambert Azimuthal Equal Area projection (EPSG: 3035)
gdf_Europe_borders_proj = gdf_Europe_borders.to_crs(epsg=3035)
# See the current coordinate reference system from .crs attribute
gdf_Europe_borders_proj.crs

1.11.4.4. Plot#

  • GeoPandas can plot maps, so we can check how the geometries appear in space. To plot the active geometry, call GeoDataFrame.plot().

# Plot GeoDataFrames
# Understand the difference between the projections
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
# Plot the WGS84
gdf_Europe_borders.plot(facecolor='gray', ax=axs[0]);

# Add title
axs[0].set_title("WGS84 CRS", y=1.05);

# Plot the one with ETRS-LAEA projection
gdf_Europe_borders_proj.plot(facecolor='blue', ax=axs[1]);

# Add title
axs[1].set_title("ETRS Lambert Azimuthal Equal Area projection", y=1.05);

# Remove empty white space around the plot
fig.tight_layout()

1.11.5. Spatial Relationships and Operations#

1.11.6. Geometric Manipulations#

GeoPandas makes available all the tools for geometric manipulations in the shapely library: buffer, boundary, convex_hull, envelope, unary_union

# Tutorial: https://geopandas.org/en/stable/docs/user_guide/geometric_manipulations.html

# Generate a GeoSeries containing 2000 random points
import numpy as np
from shapely.geometry import Point
xmin, xmax, ymin, ymax = 900000, 1080000, 120000, 280000
xc = (xmax - xmin) * np.random.random(2000) + xmin
yc = (ymax - ymin) * np.random.random(2000) + ymin
pts = gpd.GeoSeries([Point(x, y) for x, y in zip(xc, yc)])
# Draw a circle with fixed radius around each point
circles = pts.buffer(2000)
# Collapse these circles into a single MultiPolygon geometry with unary_union
# https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.unary_union.html
mp = circles.unary_union
# See unified multipolygon
mp
# Check the type of the unified multipolygon
type(mp)

1.11.6.1. Set-Operations with Overlay#

When working with multiple spatial datasets – especially multiple polygon or line datasets – users often wish to create new shapes based on places where those datasets overlap (or don’t overlap). These manipulations are often referred using the language of sets – intersections, unions, and differences. These types of operations are made available in the geopandas library through the overlay() method.

# Tutorial: https://geopandas.org/en/stable/docs/user_guide/set_operations.html
# Create some example data
from shapely.geometry import Polygon

polys1 = gpd.GeoSeries([Polygon([(0,0), (2,0), (2,2), (0,2)]),
                              Polygon([(2,2), (4,2), (4,4), (2,4)])])


polys2 = gpd.GeoSeries([Polygon([(1,1), (3,1), (3,3), (1,3)]),
                              Polygon([(3,3), (5,3), (5,5), (3,5)])])


df1 = gpd.GeoDataFrame({'geometry': polys1, 'df1':[1,2]})

df2 = gpd.GeoDataFrame({'geometry': polys2, 'df2':[1,2]})

ax = df1.plot(color='red');

df2.plot(ax=ax, color='green', alpha=0.5)
# The overlay() method with how='intersection', it returns only those geometries that are contained by both GeoDataFrames:
res_intersection = df1.overlay(df2, how='intersection')
# Plot and check the result of the overlay
ax = res_intersection.plot(cmap='tab10')

df1.plot(ax=ax, facecolor='none', edgecolor='k');

df2.plot(ax=ax, facecolor='none', edgecolor='k')