1.16. (Exercise) Marathon Data Analysis
(Reference: Python Data Science Handbook)
Here we’ll use Seaborn to visualize and understand finishing results from a marathon.
We will start by downloading the data from the Web and loading it into Pandas:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import datetime
import pandas as pd
We use the Linux command-line tool curl to download the data from a URL.
In Google Colab, prefixing a command with an exclamation mark (!) runs it as a command-line tool directly from the notebook.
Check this website if you are interested in how to use curl.
!curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
def convert_time(s):
    # Parse an 'H:MM:SS' string into a datetime.timedelta.
    h, m, s = map(int, s.split(':'))
    return datetime.timedelta(hours=h, minutes=m, seconds=s)

data = pd.read_csv('marathon-data.csv',
                   converters={'split': convert_time, 'final': convert_time})
data.head()
With the converters applied, the split and final columns are read as time values rather than plain strings. However, time information is quite tricky to visualize. To make our life easier, we will change the time values produced by the convert_time function into plain numbers of seconds.
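As a quick check of what convert_time produces, the sketch below parses a single time string and reads it back as seconds. The sample string '01:05:38' is made up for illustration and is not a row taken from the dataset.

```python
import datetime

def convert_time(s):
    # Same parsing as in the loading step: 'H:MM:SS' -> timedelta.
    h, m, s = map(int, s.split(':'))
    return datetime.timedelta(hours=h, minutes=m, seconds=s)

# Hypothetical sample value, chosen only for illustration.
sample = convert_time('01:05:38')
sample_seconds = sample.total_seconds()  # 1*3600 + 5*60 + 38 = 3938.0
print(sample, sample_seconds)
```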
1.16.1. Q1: Add two new columns to store the split and final in seconds, and name these columns split_sec and final_sec
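The conversion can be made using this expression: data['split'].astype(int)/1e9 (for a nanosecond-resolution time column, the integer value is a count of nanoseconds, so dividing by 1e9 gives seconds).
Hint: Here you can find some ways to add new columns to an existing pandas dataframe.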
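If the astype(int) route misbehaves on your setup, pandas also exposes the duration in seconds through the .dt.total_seconds() accessor. The sketch below uses a made-up toy table (the names toy, lap, and lap_sec are hypothetical) rather than the marathon data, so treat it as an illustration of the pattern only.

```python
import pandas as pd

# Toy table with a timedelta column (values invented for illustration).
toy = pd.DataFrame({'lap': pd.to_timedelta(['00:30:00', '01:05:38'])})

# .dt.total_seconds() returns each duration as a float number of seconds,
# independent of the column's internal time resolution.
toy['lap_sec'] = toy['lap'].dt.total_seconds()
print(toy)
```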
data[__________] = data[________]_____________
data[__________] = data[______]______________
# Print the first few rows to ensure we really appended the table with the new columns.
data._______
Now that we have processed our dataset, can you help us analyze the marathon dataset?
1.16.2. Q2: Use jointplot() to visualize two columns (x axis will be split_sec, y axis will be final_sec)
We will use hex for the 2D histogram part of the figure.
You should see something like this.
Hint: Refer to the jointplot() documentation for details.
with sns.axes_style('white'):
    # Recent seaborn versions expect x, y, and data as keyword arguments.
    g = sns.jointplot(x=_________, y=___________, data=____, kind=____)
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')
The dotted line shows where someone’s time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon. If you have run competitively, you’ll know that those who do the opposite—run faster during the second half of the race—are said to have “negative-split” the race.
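To see why the dotted line represents a steady pace, note that its y values are exactly twice its x values: a runner at constant speed finishes in double the halfway split time. A small sketch with NumPy, using one made-up runner (the times 6000 and 13000 are hypothetical):

```python
import numpy as np

# The two arrays passed to g.ax_joint.plot in the plotting code.
x = np.linspace(4000, 16000)   # split times in seconds
y = np.linspace(8000, 32000)   # final times in seconds

# Every point on the line satisfies final = 2 * split.
line_is_steady_pace = bool(np.allclose(y, 2 * x))

# Hypothetical runner: 6000 s at halfway, 13000 s at the finish.
split_sec, final_sec = 6000, 13000
above_line = final_sec > 2 * split_sec  # True means the second half was slower
print(line_is_steady_pace, above_line)
```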
1.16.3. Q3: Add a new column split_frac to the table
Let’s create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
\(SF = 1 - 2 \times \frac{\mathrm{split\_sec}}{\mathrm{final\_sec}}\)
Hint: This is similar to what you just did for Q1.
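A worked example of the formula on single numbers may help with the sign convention. The helper split_fraction and the runner times below are hypothetical and serve only as an illustration.

```python
def split_fraction(split_sec, final_sec):
    # SF = 1 - 2 * split_sec / final_sec
    return 1 - 2 * split_sec / final_sec

even = split_fraction(6500, 13000)      # both halves equal -> SF is 0
positive = split_fraction(6000, 13000)  # second half slower -> SF > 0
negative = split_fraction(6600, 13000)  # second half faster -> SF < 0
print(even, positive, negative)
```

A negative value therefore marks a runner who negative-split the race.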
data[____________] = ______________________
# Print the first few rows to ensure we really appended the table with the new column.
data.head()
1.16.4. Q4: Print the number of people who had a split_frac that was less than zero
You should see a single number, 251, printed out.
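One common way to count the rows that satisfy a condition is to sum a boolean mask, since each True counts as 1. A sketch on a made-up Series (the values are invented and unrelated to the dataset):

```python
import pandas as pd

# Hypothetical values, for illustration only.
fractions = pd.Series([0.08, -0.02, 0.0, -0.01, 0.15])

mask = fractions < 0          # boolean Series: True where the value is negative
n_negative = int(mask.sum())  # True counts as 1, False as 0
print(n_negative)
```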
print(__________________________)
Out of nearly 40,000 participants, there were only 251 people who negative-split their marathon.
1.16.5. Q5: See if there is any correlation between split fraction and other variables. Use PairGrid() to visualize all these correlations
We will use sns.PairGrid() to visualize the correlations. Let’s try to visualize age, split_sec, final_sec, and split_frac in the table.
We will plot the 2D scatter plots for male and female runners separately using the gender column. Check out the hue option in the PairGrid() tutorial to see how to do that.
Use the palette RdBu_r for a visually pleasing plot.
We would also like to add a legend.
Hint: Check PairGrid on the seaborn website to see how to use scatter plots for our figures.
g = sns.PairGrid(_________, vars=__________________________,hue=_______,palette=_________)
g_____(______________)
# Add legend here.
g.____________;
1.16.6. Q6: Separate the runner stats by gender and explore the differences in split fraction distributions with KDE plots
Hint: You can use the kdeplot() function in seaborn for this question. Check the seaborn documentation for details.
# Visualize male split_frac here.
# Note: recent seaborn versions use fill= in place of the deprecated shade= option.
sns.___________(______________________, label=____, fill=_____, color=__)
# Visualize female split_frac here.
sns.____________(______________________, label=______, fill=_____, color=____)
plt.legend(______)
plt.xlabel(________);
1.16.7. (BONUS) Q7: Compare the gender differences in split fraction again with violinplot(), but now examine these differences as a function of age
Hint:
Create a new column in the dataframe that specifies the decade of age that each person is in. (Equation: 10 * (age // 10))
Add the new column to the Pandas dataframe (data); you could name it “age_decade”.
Visualize your data with this:
sns.violinplot(x="age_decade", y="split_frac",...)
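The floor-division expression in the hint rounds each age down to the start of its decade. A quick sketch on a few made-up ages (the list is hypothetical, not taken from the dataset):

```python
# Hypothetical ages, for illustration only.
ages = [19, 20, 37, 40, 59]

# age // 10 drops the last digit; multiplying by 10 restores the decade.
decades = [10 * (age // 10) for age in ages]
print(decades)  # [10, 20, 30, 40, 50]
```

The same expression works element-wise on a pandas column, which is what the hint relies on.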
Try to explore different ways to produce the plot below. We used violinplot(), but it may be more convenient to use catplot().
# Recreate this figure here