SPL Top 500 Data Exploration (Exercise)

dplyr
exercise
Published

February 25, 2026

Exercises

SPL Top 500 Data Exploration

Solutions

These exercises explore checkout data from the Seattle Public Library for the Top 500 “Greatest” Novels — the novels most widely held in libraries according to OCLC. For more context about the dataset, see the data essay.

Concepts covered:

  • Groupby and aggregation (sum of checkouts)
  • Sorting and ranking (top N values)
  • Time series line plots (monthly checkouts over time)
  • String filtering (finding titles that contain a keyword)
  • Pivot tables
  • Correlation matrices and heatmaps — a correlation coefficient measures how closely two variables move together, ranging from -1 (perfect inverse relationship) to 1 (perfect positive relationship), with 0 meaning no relationship

Load the data

Code
library(dplyr)
library(ggplot2)
library(tidyr)

spl_df <- read.csv("https://responsible-datasets-in-context.s3.us-west-2.amazonaws.com/top_500_spl_df.csv",
  stringsAsFactors = FALSE)
head(spl_df)

Exercise 1

Find the top 10 authors and top 10 books by total checkouts in the SPL Top 500 dataset. Display them as tables.

Save the results as top_authors and top_books.

Code
# Your code here

Discuss/consider: Which authors and books are most popular at the Seattle Public Library? Are there any surprises?

Exercise 2

Create a time series line plot of monthly checkouts for “Pride and Prejudice” over time.

Filter the data for “Pride and Prejudice”, group by year and month, and plot the results.

Code
# Your code here

Discuss/consider: What patterns do you notice in the checkout trends? Are there any seasonal patterns or notable changes over time?

Exercise 3

Calculate the correlation between the monthly checkout patterns of Harry Potter books and display the results as a heatmap.

Filter for Harry Potter titles, pivot the data so each book is a column, compute the correlation matrix, and visualize it.

Code
# Your code here

Discuss/consider: Which Harry Potter books have the most correlated checkout patterns? What might explain these correlations?