SPL Top 500 Data Exploration (Solution)

dplyr
exercise
solution
Published

February 25, 2026

Solution

SPL Top 500 Data Exploration

Exercise Without Solutions

These exercises explore checkout data from the Seattle Public Library for the Top 500 “Greatest” Novels — the novels most widely held in libraries according to OCLC. For more context about the dataset, see the data essay.

Concepts covered:

  • Groupby and aggregation (sum of checkouts)
  • Sorting and ranking (top N values)
  • Time series line plots (monthly checkouts over time)
  • String filtering (finding titles that contain a keyword)
  • Pivot tables
  • Correlation matrices and heatmaps — a correlation coefficient measures how closely two variables move together, ranging from -1 (perfect inverse relationship) to 1 (perfect positive relationship), with 0 meaning no relationship

Load the data

Code
library(dplyr)
library(ggplot2)
library(tidyr)

spl_df <- read.csv("https://responsible-datasets-in-context.s3.us-west-2.amazonaws.com/top_500_spl_df.csv",
  stringsAsFactors = FALSE)
head(spl_df)
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
X usageclass checkouttype materialtype checkoutyear checkoutmonth checkouts title subjects creator publisher publicationyear isbn year.month last_name first_name total_checkouts top_500_rank top_500_title author pub_year orig_lang genre author_birth author_death author_gender author_primary_lang author_nationality author_field_of_activity author_occupation oclc_holdings oclc_eholdings oclc_total_editions oclc_holdings_rank oclc_editions_rank gr_avg_rating gr_num_ratings gr_num_reviews gr_avg_rating_rank gr_num_ratings_rank oclc_owi author_viaf gr_url wiki_url pg_eng_url pg_orig_url
0 Physical Horizon BOOK 2006 1 1 The merry adventures of Robin Hood / Howard Pyle ; illustrated by Scott McKowen. Robin Hood Legendary character Legends Juvenile literature, Folklore England Juvenile literature Pyle, Howard, 1853-1911 Sterling Pub., 2004. NA 2006-01-01 pyle howard 1191 292 The Merry Adventures of Robin Hood of Great Renown In Nottinghamshire Howard Pyle 1883 English na 1853 1911 male eng US illustrators,authors 6636 2829 401 389 279 4.07 64,249 1,428 166 370 492274 12428253 https://www.goodreads.com/book/show/1277190.The_Merry_Adventures_of_Robin_Hood_of_Great_Renown_in_Nottinghamshire?ref=nav_sb_ss_1_69 https://en.wikipedia.org/wiki/The_Merry_Adventures_of_Robin_Hood https://www.gutenberg.org/cache/epub/10148/pg10148.txt
1 Physical Horizon BOOK 2006 1 5 King of the Wind; illus. by Wesley Dennis. Horses Juvenile fiction, Horses Fiction Henry, Marguerite, 1902-1997 Rand McNally [1948] NA 2006-01-01 henry marguerite 535 418 King of the Wind Marguerite Henry 1948 English history 1902 1997 female eng US childrens literature author 6702 154 176 377 431 4.21 26,436 920 67 446 102572752 91717300 https://www.goodreads.com/book/show/423156.King_of_the_Wind?ref=nav_sb_ss_2_16 https://en.wikipedia.org/wiki/King_of_the_Wind NA_not-pub-domain
2 Physical Horizon BOOK 2006 1 7 A painted house / a novel by John Grisham. Boys Arkansas Fiction, Cotton growing Arkansas Fiction, Tenant farmers Arkansas Fiction, Mexican American migrant agricultural laborers Arkansas Fiction, Migrant agricultural laborers Ozark Mountains Fiction, Baseball stories, Childrens secrets Fiction, Mystery fiction, Bildungsromans Grisham, John Doubleday, c2001. NA 2006-01-01 grisham john 1619 220 A Painted House John Grisham 2001 English na 1955 ALIVE male eng US motion pictures production and direction,motion picture authorship,law,fiction authorship,acting screenwriters,novelists,motion picture producers and directors,lawyers,authors,actors 7091 444 251 334 364 3.75 93,602 4,927 423 331 689042 7449742 https://www.goodreads.com/book/show/5360.A_Painted_House?ref=nav_sb_ss_1_15 https://en.wikipedia.org/wiki/A_Painted_House NA_not-pub-domain
3 Physical Horizon BOOK 2006 1 1 Don Quijote de la Mancha / Miguel de Cervantes ; con las ilustraciones de Gustavo Doré, grabadas por pisan. Cervantes Saavedra Miguel de 1547 1616 Don Quixote, Don Quixote Fictitious character Fiction, Spain Social life and customs 16th century Fiction, Historical fiction, Picaresque literature Cervantes Saavedra, Miguel de, 1547-1616 Edimat Libros, c1998. NA 2006-01-01 150 1 Don Quixote Miguel de Cervantes 1605 Spanish action 1547 1616 male spa ES soldiers,poets spanish,novelists spanish,dramatists spanish,authors spanish 30840 4992 9017 9 1 3.90 269,435 12,053 318 211 1810748013 17220427 https://www.goodreads.com/book/show/3836.Don_Quixote?from_search=true&from_srp=true&qid=Npjwg6DCCM&rank=1 https://en.wikipedia.org/wiki/Don_Quixote https://www.gutenberg.org/cache/epub/996/pg996.txt https://www.gutenberg.org/cache/epub/2000/pg2000.txt
4 Physical Horizon BOOK 2006 1 3 Sula. African Americans Juvenile fiction, Friendship Fiction Morrison, Toni Knopf; [distributed by Random House], 1974 [c1973] NA 2006-01-01 morrison toni 231 290 Sula Toni Morrison 1973 English na 1931 2019 female eng US united states race relations,racism,african americans,african american women,african american girls,literature english teachers,college teachers,novelists,authors,writers,university and college faculty members,literature teachers,english teachers of 8221 390 367 263 294 4.02 97,287 7,449 207 324 181233 109406177 https://www.goodreads.com/book/show/11346.Sula?ref=nav_sb_ss_1_4 https://en.wikipedia.org/wiki/Sula_(novel) NA_not-pub-domain
5 Digital OverDrive EBOOK 2006 1 1 A Tale of Two Cities Classic Literature, Fiction, Historical Fiction Charles Dickens NuVision Publications 2005 NA 2006-01-01 6223 15 A Tale of Two Cities Charles Dickens 1859 English history 1812 1870 male eng GB english literature authors english,writers,novelists 28144 10044 3683 12 17 3.87 943,466 23,243 342 86 2487779497 88666393 https://www.goodreads.com/book/show/1953.A_Tale_of_Two_Cities?ref=nav_sb_ss_1_20 https://en.wikipedia.org/wiki/A_Tale_of_Two_Cities https://www.gutenberg.org/cache/epub/98/pg98.txt

Exercise 1

Find the top 10 authors and top 10 books by total checkouts in the SPL Top 500 dataset. Display them as tables.

Save the results as top_authors and top_books.

Code
top_authors <- spl_df %>%
  group_by(author) %>%
  summarize(total_checkouts = sum(checkouts)) %>%
  arrange(desc(total_checkouts)) %>%
  head(10)

top_authors
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
author total_checkouts
J.K. Rowling 145161
Suzanne Collins 67500
Jane Austen 53731
J.R.R. Tolkien 52374
Stieg Larsson 45227
Stephen King 39926
Charles Dickens 37749
Stephenie Meyer 37370
Gillian Flynn 35366
John Grisham 33372
Code
top_books <- spl_df %>%
  group_by(top_500_title) %>%
  summarize(total_checkouts = sum(checkouts)) %>%
  arrange(desc(total_checkouts)) %>%
  head(10)

top_books
Warning: 'xfun::attr()' is deprecated.
Use 'xfun::attr2()' instead.
See help("Deprecated")
top_500_title total_checkouts
Gone Girl 35366
The Handmaid’s Tale 33240
Harry Potter and the Sorcerer’s Stone 31386
The Book Thief 30549
Twilight 29289
The Hunger Games 27960
The Giver 27906
Pride and Prejudice 27250
The Girl on the Train 26610
A Wrinkle in Time 24762

Discuss/consider: Which authors and books are most popular at the Seattle Public Library? Are there any surprises?

Exercise 2

Create a time series line plot of monthly checkouts for “Pride and Prejudice” over time.

Filter the data for “Pride and Prejudice”, group by year and month, and plot the results.

Code
pride_df <- spl_df %>%
  filter(top_500_title == "Pride and Prejudice") %>%
  group_by(checkoutyear, checkoutmonth) %>%
  summarize(monthly_checkouts = sum(checkouts), .groups = "drop") %>%
  mutate(date = as.Date(paste(checkoutyear, sprintf("%02d", checkoutmonth), "01", sep = "-")))

ggplot(pride_df, aes(x = date, y = monthly_checkouts)) +
  geom_line(color = "darkgreen") +
  labs(title = "Monthly SPL Checkouts for Pride and Prejudice",
       x = "Date", y = "Checkouts") +
  theme_minimal()

Discuss/consider: What patterns do you notice in the checkout trends? Are there any seasonal patterns or notable changes over time?

Exercise 3

Calculate the correlation between the monthly checkout patterns of Harry Potter books and display the results as a heatmap.

Filter for Harry Potter titles, pivot the data so each book is a column, compute the correlation matrix, and visualize it.

Code
hp_df <- spl_df %>%
  filter(grepl("Harry Potter", top_500_title, ignore.case = TRUE)) %>%
  group_by(top_500_title, checkoutyear, checkoutmonth) %>%
  summarize(total_checkouts = sum(checkouts), .groups = "drop") %>%
  mutate(date = paste(checkoutyear, sprintf("%02d", checkoutmonth), sep = "-")) %>%
  select(top_500_title, date, total_checkouts) %>%
  pivot_wider(names_from = top_500_title, values_from = total_checkouts, values_fill = 0)

cor_matrix <- cor(hp_df %>% select(-date), use = "complete.obs")

cor_long <- as.data.frame(as.table(cor_matrix))
names(cor_long) <- c("Book1", "Book2", "Correlation")

ggplot(cor_long, aes(x = Book1, y = Book2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2)), size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Between Harry Potter Book Checkouts",
       x = "", y = "")

Discuss/consider: Which Harry Potter books have the most correlated checkout patterns? What might explain these correlations?