1.10. (Exercise) Earthquake Data Analysis
2014 South Napa CA M6 Earthquake - August 24
Continuous “mole-track” running parallel to the strike of the fault indicates some E-W compression in addition to right-lateral faulting. Photo taken near Buhman Rd.
Source: USGS
In this assignment, we will review pandas fundamentals, such as how to:
- Open csv files
- Manipulate dataframe indexes
- Parse date columns
- Examine basic dataframe statistics
- Manipulate text columns and extract values
- Plot dataframe contents using
  - Bar charts
  - Histograms
  - Scatter plots
Data for this assignment in .csv format downloaded from the USGS Earthquakes Database is available at:
You don’t need to download this file. You can open it directly with Pandas, with a little help from Pooch (don’t worry about reading into the Pooch documentation, unless you really want to! 😃).
We’ll download the datafile to a local cache and store the path to the file in the variable datafile
# Pooch Code
import pooch
datafile = pooch.retrieve(
    'https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/Efg089STo25Gq6N_BBn_qGoBIsAOd2yUNBgeTfPR2wxw4g?download=1',
    known_hash='84d455fb96dc8f782fba4b5fbe56cb8970cab678f07c766fcba1b1c4674de1b1',
)
Q1) First, import Numpy, Pandas, and Matplotlib, and (optionally) set the display options.
Hint: Display options are documented at this link
# Import all libraries here
import numpy as __
import pandas as __
import matplotlib.pyplot as ___
Q2) Use Pandas’ read_csv function directly on the datafile to open it as a DataFrame
The dataframe should look something like this
To display the first few rows of the table, use the .head() function. To display information about the columns, use the .info() function.
Check out these tutorials if you have doubts about what these functions do.
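As a warm-up before the exercise cell, here is a minimal sketch on a made-up three-row table. The column names (station, reading_time, value) are invented for illustration and are not those of the USGS file; the point is only to show what read_csv, head and info return.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, invented for this illustration (not the USGS file)
toy_csv = io.StringIO(
    "station,reading_time,value\n"
    "A1,2020-01-01 00:00:00,1.5\n"
    "B2,2020-01-02 06:30:00,2.0\n"
    "C3,2020-01-03 12:00:00,0.7\n"
)

toy = pd.read_csv(toy_csv)  # read_csv also accepts a file path or a URL
print(toy.head(2))          # first two rows only
toy.info()                  # column names, non-null counts and dtypes
```

In the info output, reading_time is listed as a text column rather than as a datetime column.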
# Open the data file as a Pandas DataFrame
df = pd.read_csv(__)
# Display first few rows
df.____()
# Display DataFrame info
df.____()
The dates were not automatically parsed into datetime types! What can we do?
Q3) Re-read the data in such a way that all date columns are identified as dates and the earthquake ID is used as the index
Recreate the screenshot below.
Hint: The documentation for the .read_csv() function is here
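The idea can be sketched on a made-up table, assuming the parse_dates and index_col keyword arguments of read_csv are the relevant ones here. The column names below are hypothetical and do not match the USGS file.

```python
import io
import pandas as pd

# Hypothetical CSV text, invented for this illustration (not the USGS file)
toy_text = (
    "station,reading_time,value\n"
    "A1,2020-01-01 00:00:00,1.5\n"
    "B2,2020-01-02 06:30:00,2.0\n"
    "C3,2020-01-03 12:00:00,0.7\n"
)

toy = pd.read_csv(
    io.StringIO(toy_text),
    parse_dates=["reading_time"],  # columns to convert to datetime
    index_col="station",           # column to use as the row index
)
print(toy.dtypes)  # reading_time is now a datetime column
print(toy.index)   # the station labels form the index
```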
# Re-read the data file
df = pd.read_csv(__,__=__)
# Use the `head` function to check that it worked
df.___()
# Use the `info` function to check that it worked
df.___()
Q4) Use describe to get the basic statistics of all the columns
Hint: The documentation of describe is at this link
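A short sketch of what describe returns, using a made-up table with one numeric and one text column (both names are hypothetical):

```python
import pandas as pd

# Hypothetical table, invented for this illustration
toy = pd.DataFrame({"depth": [1.0, 2.0, 3.0, 4.0],
                    "label": ["a", "b", "c", "d"]})

stats = toy.describe()  # by default, only numeric columns are summarized
print(stats)

# Individual statistics can be looked up by row label and column name
print(stats.loc["mean", "depth"])
```

On a mixed table like this one, the text column is left out of the summary unless describe is called with include="all".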
# Use the `describe` function
___.___()
Q5) Use nlargest to get the top 20 earthquakes by magnitude
Hint: The documentation of nlargest is at this link
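For illustration, nlargest takes the number of rows to keep and the column to rank by. The table and column names below are made up:

```python
import pandas as pd

# Hypothetical table, invented for this illustration
toy = pd.DataFrame({"name": ["a", "b", "c", "d"],
                    "score": [3.2, 5.1, 4.4, 1.0]})

# Keep the 2 rows with the largest values in the "score" column,
# sorted from largest to smallest
top2 = toy.nlargest(2, "score")
print(top2)
```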
# Use `nlargest`
df.___(___,___)
Examine the column titled ‘place’. It seems to contain both state and country information. How would you get it out?
Q6) Extract the state or country using Pandas text data functions, and add it as a new column to the DataFrame
Hint 1: The documentation for Pandas’ text data functions is here
Hint 2: You will use .split() to extract the country names. The documentation of this function can be found here
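A minimal sketch of how str.split behaves with expand=True, on a made-up Series of place strings (not taken from the dataset). Note what happens to an entry that contains no comma.

```python
import pandas as pd

# Hypothetical place strings, invented for this illustration
places = pd.Series(["Geneva, Switzerland", "Lyon, France", "Atlantis"])

# Split at most once (n=1) on the comma; expand=True returns a DataFrame
# with one column per piece instead of a Series of lists
parts = places.str.split(",", n=1, expand=True)
print(parts)

# The second column holds the text after the comma; strip the leading space
region = parts[1].str.strip()
print(region)  # the entry without a comma ends up as a missing value
```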
# Extract the state or country
country = df.__.str.__(',',expand=True,n=_)
# Add it as a new column to the `DataFrame` called `country`
Q7) Display each unique value from the new country column
Hint: You may use the unique function documented at this link
You should see an array with different country and state names (e.g., array(['Alaska', 'Nevada', …]))
# Display unique values
df.___.___()
Q8) Create a filtered dataset that only has earthquakes larger than magnitude 4
Hint: Print the table to see the name of the column containing the earthquake magnitude.
Check out this link to find examples of filtering by column values
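Filtering by column values usually goes through a boolean mask. Here is a sketch on a made-up table (hypothetical column names and threshold):

```python
import pandas as pd

# Hypothetical table, invented for this illustration
toy = pd.DataFrame({"name": ["a", "b", "c", "d"],
                    "score": [3.2, 5.1, 4.4, 1.0]})

mask = toy["score"] > 4  # boolean Series, True where the condition holds
high = toy[mask]         # keep only the rows where the mask is True
print(mask)
print(high)
```

The original table is left unchanged; the filtered rows live in a new DataFrame.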
# Filter the dataset based on the earthquakes' magnitudes
Q9) Using the filtered dataset (magnitude > 4), count the number of earthquakes [Num 1] and the number of earthquakes in each country/state [Num 2]. Make a bar chart of Num 2 for the top 5 locations with the most earthquakes
Hint 1: To get Num 1, Pandas has a count function documented at this link
Hint 2: Check out the value_counts function to get Num 2
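The counting and plotting steps can be sketched on a made-up column of labels. The names region and n are hypothetical, and this is only one possible route from value_counts to a bar chart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical labels, invented for this illustration
toy = pd.DataFrame({"region": ["x", "y", "x", "z", "x", "y"]})

counts = toy["region"].value_counts()  # sorted from most to least frequent
top2 = counts.head(2)                  # the two most frequent labels

# Labels are stored in the index of the result, counts in its values
top2_df = pd.DataFrame({"region": list(top2.index), "n": list(top2.values)})

fig, ax = plt.subplots(figsize=(4, 3))
top2_df.plot.bar(x="region", y="n", rot=0, ax=ax)
```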
# Count the number of earthquakes whose magnitudes are larger than 4
df_filt.__()
# Count the number of >4 earthquakes in each country/state and store the result in `num2`
num2 = df_filt.__.__()
Recreate the bar chart below.
# print the first 5 rows of num2
num2.__[:_]
# convert what you just printed to a DataFrame with 2 columns: 'country' (text) and 'earthquake_num' (number)
# Hint: separate the text and numbers and store them separately in lists
top5_df = pd.DataFrame({'country':list(num2.__[__].__),'earthquake_num':list(num2.__[__]._____)})
top5_df
# Now plot the numbers in the top5_df dataset as a bar chart.
__,__ = plt._______(figsize=(_,_))
top5_df.plot.bar(x=_____,y=_____,rot=0,ax=__)
Q10) Make a histogram for the distribution of the earthquakes’ magnitudes
Hint: Pandas has a histogram function documented at this link and Matplotlib has one documented at this link
__,__ = plt.subplots(figsize=(9,4))
# Make the histogram
____.__.____
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
1.10.1. Use a logarithmic scale for y axis
Hint: Here you can find a tutorial for how to change axis scale.
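As an illustration on synthetic data (random numbers, not earthquake magnitudes), the sketch below draws a histogram on a Matplotlib axis and then switches that axis to a logarithmic y scale. The column name size is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, skewed data invented for this illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({"size": rng.exponential(scale=1.0, size=500)})

fig, ax = plt.subplots(figsize=(6, 3))
toy["size"].plot.hist(bins=20, ax=ax)  # draw the histogram on the given axis
ax.set_xlabel("size")
ax.set_ylabel("count")
ax.set_yscale("log")                   # sparsely populated bins become visible
```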
__,__ = plt.subplots(figsize=(9,4))
# Make the histogram
____.__.____
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
# Use a logarithmic scale for y axis
__.______(___)
1.10.2. Make one histogram for the filtered dataset, and one for the unfiltered dataset
# Make one histogram for the filtered dataset, and one for the unfiltered dataset
fig,ax = plt.subplots(1,2,figsize=(12,4))
______.__.____(___=____,color='#1f77b4')
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
__.________(________) # Add title for the filtered figure on the left
______.__.____(___=____,color='#ff7f0e')
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
__.________(________) # Add title for the unfiltered figure on the right
plt.show()
Q11) Visualize the locations of earthquakes by making a scatterplot of their latitude and longitude
Hint: Consider reading the documentation for plt.scatter to make the scatter plot and that of plt.colorbar to color the points by magnitude.
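A minimal sketch of a scatter plot colored by a third variable, using Matplotlib directly on made-up points. The column names x, y and strength and the viridis colormap are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical points, invented for this illustration
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                    "y": [1.0, 0.5, 2.5, 1.5],
                    "strength": [1.0, 3.0, 5.0, 7.0]})

fig, ax = plt.subplots(figsize=(5, 4))
# c sets the value that drives the color; vmin/vmax fix the color range
sc = ax.scatter(toy["x"], toy["y"], c=toy["strength"],
                cmap="viridis", vmin=0, vmax=8)
fig.colorbar(sc, ax=ax, label="strength")  # legend for the color scale
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Fixing vmin and vmax keeps the colors comparable when the same call is repeated on two different subsets of the data.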
# You can use a two-column subplot with
# both the filtered/unfiltered datasets
# to facilitate their comparison.
__,__ = plt.subplots(_,_,_______=(__,__))
# Filtered data
_____.plot.scatter(x=______,y=_______,c=____,ax=____,cmap=____,vmin=0,vmax=8)
# Unfiltered data
df.plot.scatter(x=______,y=_______,c=____,ax=____,cmap=____,vmin=0,vmax=8)
ax[0].set_title('Filtered')
ax[1].set_title('Unfiltered')
plt.show()
Do you notice a difference between filtered and unfiltered datasets?