SPL Top 500 Data Exploration (Solution)

pandas
exercise
solution
Published

February 25, 2026

Solution

SPL Top 500 Data Exploration

Exercise Without Solutions

These exercises explore checkout data from the Seattle Public Library for the Top 500 “Greatest” Novels — the novels most widely held in libraries according to OCLC. For more context about the dataset, see the data essay.

Concepts covered:

  • Groupby and aggregation (sum of checkouts)
  • Sorting and ranking (top N values)
  • Time series line plots (monthly checkouts over time)
  • String filtering (finding titles that contain a keyword)
  • Pivot tables
  • Correlation matrices and heatmaps — a correlation coefficient measures how closely two variables move together, ranging from -1 (perfect inverse relationship) to 1 (perfect positive relationship), with 0 meaning no relationship

Load the data

Code
import pandas as pd
import plotly.express as px

spl_df = pd.read_csv("https://responsible-datasets-in-context.s3.us-west-2.amazonaws.com/top_500_spl_df.csv")
spl_df.head()
/tmp/ipykernel_8224/3933414714.py:4: DtypeWarning:

Columns (12) have mixed types. Specify dtype option on import or set low_memory=False.
Unnamed: 0 usageclass checkouttype materialtype checkoutyear checkoutmonth checkouts title subjects creator ... gr_num_ratings gr_num_reviews gr_avg_rating_rank gr_num_ratings_rank oclc_owi author_viaf gr_url wiki_url pg_eng_url pg_orig_url
0 0 Physical Horizon BOOK 2006 1 1 The merry adventures of Robin Hood / Howard Py... Robin Hood Legendary character Legends Juvenil... Pyle, Howard, 1853-1911 ... 64,249 1,428 166 370 4.922740e+05 12428253 https://www.goodreads.com/book/show/1277190.Th... https://en.wikipedia.org/wiki/The_Merry_Advent... https://www.gutenberg.org/cache/epub/10148/pg1... NaN
1 1 Physical Horizon BOOK 2006 1 5 King of the Wind; illus. by Wesley Dennis. Horses Juvenile fiction, Horses Fiction Henry, Marguerite, 1902-1997 ... 26,436 920 67 446 1.025728e+08 91717300 https://www.goodreads.com/book/show/423156.Kin... https://en.wikipedia.org/wiki/King_of_the_Wind NA_not-pub-domain NaN
2 2 Physical Horizon BOOK 2006 1 7 A painted house / a novel by John Grisham. Boys Arkansas Fiction, Cotton growing Arkansas... Grisham, John ... 93,602 4,927 423 331 6.890420e+05 7449742 https://www.goodreads.com/book/show/5360.A_Pai... https://en.wikipedia.org/wiki/A_Painted_House NA_not-pub-domain NaN
3 3 Physical Horizon BOOK 2006 1 1 Don Quijote de la Mancha / Miguel de Cervantes... Cervantes Saavedra Miguel de 1547 1616 Don Qui... Cervantes Saavedra, Miguel de, 1547-1616 ... 269,435 12,053 318 211 1.810748e+09 17220427 https://www.goodreads.com/book/show/3836.Don_Q... https://en.wikipedia.org/wiki/Don_Quixote https://www.gutenberg.org/cache/epub/996/pg996... https://www.gutenberg.org/cache/epub/2000/pg20...
4 4 Physical Horizon BOOK 2006 1 3 Sula. African Americans Juvenile fiction, Friendship... Morrison, Toni ... 97,287 7,449 207 324 1.812330e+05 109406177 https://www.goodreads.com/book/show/11346.Sula... https://en.wikipedia.org/wiki/Sula_(novel) NA_not-pub-domain NaN

5 rows × 46 columns

Exercise 1

Find the top 10 authors and top 10 books by total checkouts in the SPL Top 500 dataset. Display them as styled tables.

Save the results as top_authors and top_books.

Code
top_authors = spl_df.groupby('author')['checkouts'].sum().nlargest(10).reset_index()
top_authors.columns = ['Author', 'Total Checkouts']
top_authors.style.background_gradient(cmap='Blues')
  Author Total Checkouts
0 J.K. Rowling 145161
1 Suzanne Collins 67500
2 Jane Austen 53731
3 J.R.R. Tolkien 52374
4 Stieg Larsson 45227
5 Stephen King 39926
6 Charles Dickens 37749
7 Stephenie Meyer 37370
8 Gillian Flynn 35366
9 John Grisham 33372
Code
top_books = spl_df.groupby('top_500_title')['checkouts'].sum().nlargest(10).reset_index()
top_books.columns = ['Title', 'Total Checkouts']
top_books.style.background_gradient(cmap='Greens')
  Title Total Checkouts
0 Gone Girl 35366
1 The Handmaid's Tale 33240
2 Harry Potter and the Sorcerer's Stone 31386
3 The Book Thief 30549
4 Twilight 29289
5 The Hunger Games 27960
6 The Giver 27906
7 Pride and Prejudice 27250
8 The Girl on the Train 26610
9 A Wrinkle in Time 24762

Discuss/consider: Which authors and books are most popular at the Seattle Public Library? Are there any surprises?

Exercise 2

Create a time series line plot of monthly checkouts for “Pride and Prejudice” over time.

Filter the data for “Pride and Prejudice”, group by year and month, and plot the results.

Code
pride_df = spl_df[spl_df['top_500_title'] == 'Pride and Prejudice']
pride_monthly = pride_df.groupby(['checkoutyear', 'checkoutmonth'])['checkouts'].sum().reset_index()
pride_monthly['date'] = pd.to_datetime(
    pride_monthly['checkoutyear'].astype(str) + '-' +
    pride_monthly['checkoutmonth'].astype(str) + '-01')

fig = px.line(pride_monthly.sort_values('date'), x='date', y='checkouts',
              title='Monthly SPL Checkouts for Pride and Prejudice')
fig.show()

Discuss/consider: What patterns do you notice in the checkout trends? Are there any seasonal patterns or notable changes over time?

Exercise 3

Calculate the correlation between the monthly checkout patterns of Harry Potter books and display the results as a heatmap.

Filter for Harry Potter titles, pivot the data so each book is a column, compute the correlation matrix, and visualize it.

Code
hp_df = spl_df[spl_df['top_500_title'].str.contains('Harry Potter', case=False)]
hp_monthly = hp_df.groupby(['top_500_title', 'checkoutyear', 'checkoutmonth'])['checkouts'].sum().reset_index()
hp_monthly['date'] = pd.to_datetime(
    hp_monthly['checkoutyear'].astype(str) + '-' +
    hp_monthly['checkoutmonth'].astype(str) + '-01')

hp_pivot = hp_monthly.pivot_table(index='date', columns='top_500_title', values='checkouts', fill_value=0)
cor_matrix = hp_pivot.corr()

fig = px.imshow(cor_matrix, text_auto='.2f',
                title='Correlation Between Harry Potter Book Checkouts',
                color_continuous_scale='RdBu_r')
fig.show()

Discuss/consider: Which Harry Potter books have the most correlated checkout patterns? What might explain these correlations?