
1.10. (Exercise) Earthquake Data Analysis

[Image: South_Napa_Earthquake.jfif]

2014 South Napa CA M6 Earthquake - August 24

Continuous “mole-track” running parallel to the strike of the fault indicates some E-W compression in addition to right-lateral faulting. Photo taken near Buhman Rd.

Source: USGS

In this assignment, we will review pandas fundamentals, such as how to

  • Open csv files

  • Manipulate dataframe indexes

  • Parse date columns

  • Examine basic dataframe statistics

  • Manipulate text columns and extract values

  • Plot dataframe contents using:

      • Bar charts
      • Histograms
      • Scatter plots

Data for this assignment in .csv format downloaded from the USGS Earthquakes Database is available at:

https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/Efg089STo25Gq6N_BBn_qGoBIsAOd2yUNBgeTfPR2wxw4g?download=1

You don’t need to download this file. You can open it directly with Pandas, with a little help from Pooch (don’t worry about reading into the Pooch documentation, unless you really want to! 😃).

We’ll let Pooch download the data file to a local cache and store the path to that file in the variable datafile.

# Pooch Code
import pooch
datafile = pooch.retrieve('https://unils-my.sharepoint.com/:x:/g/personal/tom_beucler_unil_ch/Efg089STo25Gq6N_BBn_qGoBIsAOd2yUNBgeTfPR2wxw4g?download=1',
                          known_hash='84d455fb96dc8f782fba4b5fbe56cb8970cab678f07c766fcba1b1c4674de1b1')
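If import pooch fails with a ModuleNotFoundError, the package is probably not installed in the current environment (for instance, pip install pooch usually fixes this). The known_hash argument above is a SHA-256 checksum that Pooch compares against the downloaded file. The sketch below shows, under simplified assumptions, how such a checksum can be computed with the standard library; sha256_of_file and the toy file are hypothetical names used only for illustration, not part of Pooch.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-in for a downloaded data file
with tempfile.TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "toy.csv"
    toy_file.write_bytes(b"id,mag\nq1,4.2\n")
    toy_hash = sha256_of_file(toy_file)

print(toy_hash)  # a 64-character hexadecimal string
```

If the computed digest differs from the expected one, the file was corrupted or changed, which is the situation known_hash is meant to catch.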

Q1) First, import NumPy, Pandas and Matplotlib and (optionally) set the display options.

Hint: Display options are documented at this link

# Import all libraries here
import numpy as __
import pandas as __
import matplotlib.pyplot as ___

Q2) Use Pandas’ read_csv function directly on the datafile to open it as a DataFrame

The dataframe should look something like this

[Screenshot: expected DataFrame output]

To display the first few rows of the table, use the .head() method. To display a summary of the DataFrame (column names, data types, and non-null counts), use the .info() method.

Check out these tutorials if you have doubts about what these functions do.

.head()

.info()

# Open the downloaded data file as a Pandas DataFrame
df = pd.read_csv(__)
# Display first few rows
df.____()
# Display DataFrame info
df.____()
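As a rough illustration of what read_csv, .head() and .info() do, here is a minimal sketch on a tiny in-memory CSV. The column names and values are made up for this example and are not taken from the USGS file.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file (assumption for illustration)
toy_text = (
    "time,mag,id\n"
    "2014-01-01T10:00:00Z,4.5,q1\n"
    "2014-01-02T11:30:00Z,2.1,q2\n"
    "2014-01-03T12:45:00Z,6.0,q3\n"
)

# read_csv accepts a path, a URL, or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(toy_text))

print(toy.head(2))  # first two rows
toy.info()          # column names, dtypes, non-null counts
```

Note that the time column is read as plain text here, not as a datetime type.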

The dates were not automatically parsed into datetime types! What can we do?

Q3) Re-read the data in such a way that all date columns are identified as dates and the earthquake ID is used as the index

Recreate the screenshot below.

Hint: The documentation for .read_csv() function is here

[Screenshot: expected output with parsed dates and the earthquake ID as index]

# Re-read the downloaded data file
df = pd.read_csv(__,__=__)
# Use the `head` function to check that it worked
df.___()
# Use the `info` function to check that it worked
df.___()
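The following sketch shows one way date parsing and index selection can be requested from read_csv, using the parse_dates and index_col keyword arguments on a made-up CSV. The column names are assumptions for this toy example; the real file's columns may differ.

```python
import io

import pandas as pd

# Made-up CSV content (assumption for illustration)
toy_text = (
    "time,mag,id\n"
    "2014-01-01T10:00:00Z,4.5,q1\n"
    "2014-01-02T11:30:00Z,2.1,q2\n"
    "2014-01-03T12:45:00Z,6.0,q3\n"
)

toy_dates = pd.read_csv(
    io.StringIO(toy_text),
    parse_dates=["time"],  # convert this column to datetimes while reading
    index_col="id",        # use this column as the row index
)

print(toy_dates.head())
toy_dates.info()
```

After this call, .info() should report a datetime dtype for the time column instead of a text dtype.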

Q4) Use describe to get the basic statistics of all the columns

Hint: The documentation of describe is at this link

# Use the `describe` function
___.___()

Q5) Use nlargest to get the top 20 earthquakes by magnitude

Hint: The documentation of nlargest is at this link

[Screenshot: expected output, top 20 earthquakes by magnitude]

# Use `nlargest`
df.___(___,___)
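To make the behaviour of describe and nlargest concrete, here is a small sketch on a made-up table. The column names and values are illustrative assumptions only.

```python
import pandas as pd

# Made-up data (assumption for illustration)
toy = pd.DataFrame(
    {"mag": [4.5, 2.1, 6.0, 5.2], "depth": [10.0, 5.0, 35.0, 12.0]},
    index=["q1", "q2", "q3", "q4"],
)

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
summary = toy.describe()
print(summary)

# The 2 rows with the largest values in the 'mag' column, largest first
top2 = toy.nlargest(2, "mag")
print(top2)
```

nlargest takes the number of rows first and the column to rank by second, and it returns whole rows rather than just the ranked column.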

Examine the column titled ‘place’. It seems to contain both state and country information. How would you get it out?

Q6) Extract the state or country using Pandas text data functions, and add it as a new column to the DataFrame

Hint 1: The documentation for Pandas’ text data functions is here

Hint 2: You will use .split() to extract the country names. The documentation of this function can be found here

[Screenshot: expected output with the new country column]

# Extract the state or country
country = df.__.str.__(',',expand=True,n=_)
# Add it as a new column to the `DataFrame` called `country`
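As a hedged illustration of the string-splitting approach, the sketch below splits made-up place descriptions at the first comma and keeps the part after it. The place strings and the region column name are invented for this example.

```python
import pandas as pd

# Invented place descriptions (assumption for illustration)
toy = pd.DataFrame(
    {
        "place": [
            "10km N of Townsville, Alaska",
            "5km SE of Somewhere, Nevada",
            "Offshore Region",  # no comma at all
        ]
    }
)

# Split at most once on the comma; expand=True returns a DataFrame of pieces
parts = toy["place"].str.split(",", n=1, expand=True)

# Column 1 holds the text after the comma; strip the leading space
toy["region"] = parts[1].str.strip()
print(toy)
```

Rows without a comma end up with a missing value in the new column, so it is worth checking how the real data handle that case.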

Q7) Display each unique value from the new country column

Hint: You may use the unique function documented at this link

You should see an array with different country and state names (e.g., array(['Alaska', 'Nevada', …]))

# Display unique values
df.___.___()

Q8) Create a filtered dataset that only has earthquakes larger than magnitude 4

Hint: Print the table to see the name of the column containing the earthquake magnitude.

Check out this link to find examples of filtering by column values

# Filter the dataset based on the earthquakes' magnitudes and store the result in `df_filt`
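Filtering by column values is usually done with a boolean mask. The sketch below shows the idea on a made-up magnitude column; the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up magnitudes (assumption for illustration)
toy = pd.DataFrame({"mag": [4.5, 2.1, 6.0, 4.0]})

# The comparison yields a boolean Series; indexing with it keeps the True rows
mask = toy["mag"] > 4
strong = toy[mask]
print(strong)
```

Because the comparison is strict, the row with magnitude exactly 4.0 is excluded.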

Q9) Using the filtered dataset (magnitude > 4), count the total number of earthquakes [Num 1] and the number of earthquakes in each country/state [Num 2]. Then make a bar chart of Num 2 for the 5 locations with the most earthquakes.

Hint 1: To get Num 1, Pandas has a count function documented at this link

Hint 2: Check out the value_counts function to get Num 2

[Screenshot: expected output]

# Count the number of earthquakes whose magnitudes are larger than 4
df_filt.__()
# Count the number of >4 earthquakes in each country/state and store the result in `num2`
num2 = df_filt.__.__()

Recreate the bar chart below.

[Figure: expected bar chart]

# print the first 5 rows of num2
num2.__[:_]

# convert what you just printed to a DataFrame with 2 columns: 'country' (text) and 'earthquake_num' (number)
# Hint: separate the text and numbers and store them separately in lists
top5_df = pd.DataFrame({'country':list(num2.__[__].__),'earthquake_num':list(num2.__[__]._____)})
top5_df
# Now plot the numbers in the top5_df dataset with bar chart.
__,__ = plt._______(figsize=(_,_))
top5_df.plot.bar(x=_____,y=_____,rot=0,ax=__)
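The counting and plotting steps can be sketched on toy data as follows. The region labels, the n_events column name, and the choice of the top 2 (rather than top 5) are assumptions made to keep the example small.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up region labels (assumption for illustration)
toy = pd.DataFrame({"region": ["A", "B", "A", "C", "A", "B"]})

# value_counts returns counts per unique value, sorted from most to least frequent
counts = toy["region"].value_counts()

# Keep the two most frequent regions
top2 = counts.iloc[:2]

# Labels live in the index, counts in the values
top2_df = pd.DataFrame(
    {"region": list(top2.index), "n_events": list(top2.values)}
)

fig, ax = plt.subplots(figsize=(4, 3))
top2_df.plot.bar(x="region", y="n_events", rot=0, ax=ax)
```

Separating the index (text) from the values (numbers) is what lets the bar chart use one column for the axis labels and the other for the bar heights.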

Q10) Make a histogram for the distribution of the earthquakes’ magnitudes

Hint: Pandas has a histogram function documented at this link and Matplotlib has one documented at this link

[Figure: expected histogram]

__,__ = plt.subplots(figsize=(9,4))
# Make the histogram
____.__.____
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')

1.10.1. Use a logarithmic scale for y axis

[Figure: expected histogram with a logarithmic y axis]

Hint: Here you can find a tutorial for how to change axis scale.

__,__ = plt.subplots(figsize=(9,4))
# Make the histogram
____.__.____
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
# Use a logarithmic scale for y axis
__.______(___)
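For reference, here is a minimal sketch of a histogram with a logarithmic y axis, drawn from synthetic values rather than the earthquake data. The random data, bin count, and axis labels are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, skewed values standing in for magnitudes (assumption for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"mag": rng.exponential(scale=1.0, size=500)})

fig, ax = plt.subplots(figsize=(6, 3))
toy["mag"].plot.hist(bins=20, ax=ax)      # pandas histogram drawn on our axes
ax.set_xlabel("Magnitude (synthetic)")
ax.set_ylabel("Count")
ax.grid(alpha=0.2, c="b", ls="--")
ax.set_yscale("log")                      # switch the y axis to a log scale
```

A log scale makes the sparsely populated bins at the high end visible, which a linear axis tends to flatten when a few bins dominate.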

1.10.2. Make one histogram for the filtered dataset, and one for the unfiltered dataset

[Figure: expected histograms for the filtered and unfiltered datasets]

# Make one histogram for the filtered dataset, and one for the unfiltered dataset
fig,ax = plt.subplots(1,2,figsize=(12,4))
______.__.____(___=____,color='#1f77b4')
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
__.________(________) # Add title for the filtered figure on the left

______.__.____(___=____,color='#ff7f0e')
__.__________('______') # X axis label
__._______('_____') # Y axis label
__._____(alpha=0.2,c='b',ls='--')
__.________(________) # Add title for the unfiltered figure on the right
plt.show()

Q11) Visualize the locations of earthquakes by making a scatterplot of their latitude and longitude

Hint: Consider reading the documentation for plt.scatter to make the scatter plot and that of plt.colorbar to color the points by magnitude.

[Figure: expected scatter plots]

# You can use a two-column subplot with
# both the filtered/unfiltered datasets
# to facilitate their comparison.
__,__ = plt.subplots(_,_,_______=(__,__))
# Filtered data
_____.plot.scatter(x=______,y=_______,c=____,ax=____,cmap=____,vmin=0,vmax=8)
# Unfiltered data
df.plot.scatter(x=______,y=_______,c=____,ax=____,cmap=____,vmin=0,vmax=8)
ax[0].set_title('Filtered')
ax[1].set_title('Unfiltered')
plt.show()

Do you notice a difference between filtered and unfiltered datasets?
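The side-by-side scatter layout can be sketched with synthetic coordinates as below. The random points, the viridis colormap, and the column names are assumptions for illustration and do not reflect the real catalogue.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic locations and magnitudes (assumption for illustration)
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {
        "longitude": rng.uniform(-180, 180, size=50),
        "latitude": rng.uniform(-90, 90, size=50),
        "mag": rng.uniform(0, 8, size=50),
    }
)
toy_filt = toy[toy["mag"] > 4]

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Passing a column name to `c` colors the points by that column and adds a colorbar
toy_filt.plot.scatter(x="longitude", y="latitude", c="mag",
                      cmap="viridis", vmin=0, vmax=8, ax=ax[0])
toy.plot.scatter(x="longitude", y="latitude", c="mag",
                 cmap="viridis", vmin=0, vmax=8, ax=ax[1])
ax[0].set_title("Filtered")
ax[1].set_title("Unfiltered")
```

Fixing vmin and vmax to the same values in both panels keeps the color scales comparable, so a given color means the same magnitude on the left and on the right.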