Introduction
This dataset contains library circulation information for books in the Top 500 “Greatest” Novels list, the set of novels most widely held in libraries according to OCLC. The Seattle Public Library (SPL) Checkout Data is one of the only publicly available sources on the circulation and popularity of literary texts (Baccini 2021; Gupta et al. 2025), due to the unavailability of proprietary book sales data (Walsh 2024). The dataset presented here is a combination of the [top 500 “Greatest” novels dataset](https://www.responsible-datasets-in-context.com/posts/top-500-novels/top-500-novels.html) published on this site and a mirrored version of the SPL Checkout Data, recording checkouts from 2005 through February 2025.
Dataset
Download Full Data
Download Table Data (including filtered options)
What’s in the data?
The dataset contains extensive metadata on the top 500 novels list, borrowed from our previous RDIC post.
Basic info on novels:
- TOP_500_RANK: Numeric rank of text in OCLC’s original Top 500 List.
- TOP_500_Title: Title of text, as recorded in OCLC’s original Top 500 List.
- AUTHOR: Author of text, as recorded in OCLC’s original Top 500 List.
- PUB_YEAR: Year of first publication of text, according to Wikipedia.
- ORIG_LANG: Original language of text, according to Wikipedia.
- GENRE: Genre of text, as recorded in OCLC’s original Top 500 List (filtered by the ‘Choose Genre’ dropdown).
Library holdings info:
- OCLC_HOLDINGS: Total physical library holdings listed in WorldCat for an individual work (OWI), according to Classify.
- OCLC_EHOLDINGS: Total digital library holdings listed in WorldCat for an individual work (OWI), according to OCLC.
- OCLC_TOTAL_EDITIONS: Total editions of an individual work (physical and digital) listed in WorldCat, according to OCLC.
- OCLC_HOLDINGS_RANK: Numeric rank of text based on total holdings recorded in WorldCat.
- OCLC_EDITIONS_RANK: Numeric rank of text based on total number of editions recorded in WorldCat.
Online popularity info:
- GR_AVG_RATING: Average star rating for a text on Goodreads.
- GR_NUM_RATINGS: Total number of ratings for a text on Goodreads.
- GR_NUM_REVIEWS: Total number of reviews for a text on Goodreads.
- GR_AVG_RATING_RANK: Numeric rank of text based on average Goodreads rating.
- GR_NUM_RATINGS_RANK: Numeric rank of text based on overall number of ratings on Goodreads.
Unique Identifiers and URLs:
- OCLC_OWI: Work ID on OCLC. A work ID represents a cluster based on “author and title information from bibliographic and authority records.” A title can be represented by multiple clusters, and therefore multiple OWIs. More information about OCLC work clustering can be found here.
- AUTHOR_VIAF: Author VIAF ID.
- GR_URL: URL for text on Goodreads.
- WIKI_URL: URL for text on Wikipedia.
- PG_ENG_URL: URL for English-language text on Project Gutenberg.
- PG_ORIG_URL: URL for original-language text (where applicable) on Project Gutenberg.
- FULL_TEXT: Full text of the novel, if it is in the public domain.
After merging with the Seattle Public Library Checkout Data, we also provide the following columns sourced from their dataset:
- UsageClass: Denotes if the item is “physical” or “digital.”
- CheckoutType: Denotes the vendor tool used to check out the item.
- MaterialType: Describes the type of item checked out (examples: book, song, movie, music, magazine).
- CheckoutYear: The 4-digit year of checkout for this record.
- CheckoutMonth: The month of checkout for this record.
- Checkouts: A count of the number of times the title was checked out within the “Checkout Month.”
- ISBN: A comma-separated list of ISBNs associated with the item record for the checkout. (Text)
- Title: The full title and subtitle of an individual item. (Text)
- Creator: The author or entity responsible for authoring the item according to the SPL.
- Subjects: The subject of the item as it appears in the catalog.
- Publisher: The publisher of the title.
- PublicationYear: The year from the catalog record in which the item was published, printed, or copyrighted.
Note that we provide two title fields and two author fields: one sourced from the Top 500 Novels List, the other from the SPL Checkout Data. Reconciling book data is a persistent problem, and different versions and editions of the same text can have slightly different title and author variants.
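To make the column structure concrete, here is a minimal pandas sketch of loading the merged data and aggregating checkouts per work. The file name and the example title are placeholders; the column names come from the lists above.

```python
import pandas as pd

# Load the merged dataset (placeholder file name -- use your downloaded copy).
df = pd.read_csv("top_500_spl_checkouts.csv")

# Each row is one (item, month) record; `Checkouts` counts checkouts of that
# item within `CheckoutMonth` of `CheckoutYear`. Collapse all editions and
# title variants of each work into yearly totals per Top 500 title.
yearly = (
    df.groupby(["TOP_500_Title", "CheckoutYear"])["Checkouts"]
      .sum()
      .reset_index()
)

# Inspect one title (illustrative -- any title from the Top 500 list works).
print(yearly[yearly["TOP_500_Title"] == "The Great Gatsby"])
```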
Code
# Note on installation: https://statsandr.com/blog/an-efficient-way-to-install-and-load-r-packages/
# install.packages("rmarkdown")
# The full tidyverse is too heavy to load here; we load individual packages instead
# library(tidyverse)
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(ggplot2)
library(purrr)
# Load gubernatorial biography data
gov_dataset <- read.csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/refs/heads/main/datasets/gubernatorial-bios/gubernatorial_bios_final.csv", stringsAsFactors = FALSE)
## Create long form version of the data where each year has the set of governors who governed in that year. We will use this transformed data for a lot of our visualizations
gov_long <- gov_dataset %>%
# Split multiple ranges into separate rows
separate_rows(years_in_office, sep = "\\s(?=\\d{4}\\s-\\s\\d{4}|\\d{4}\\s-)") %>%
# Extract start and end years
mutate(
start = as.numeric(str_extract(years_in_office, "^\\d{4}")),
end = as.numeric(str_extract(years_in_office, "\\d{4}$"))
) %>%
# Expand each range into all years served
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
unnest(year) %>%
ungroup() %>%
filter(year >= 1775, year <= 2025) %>%
distinct()
## Let's get average college attendance rate by elected governors for each year
yearly_college <- gov_long %>%
group_by(year) %>%
summarise(
total = n(),
college = sum(college_attendance == 1, na.rm = TRUE),
pct_college = 100 * college / total,
.groups = "drop"
)
# 2. Bin years into 25-year chunks
binned_college <- yearly_college %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year), # left bin edge (we shift to the bin midpoint when plotting)
pct_college = mean(pct_college, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_college,
aes(x = year, y = pct_college),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_college,
aes(x = first_year + 12.5, y = pct_college), # put point in middle of 25-year bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_college,
aes(x = first_year + 12.5, y = pct_college),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Have elected governors become more college educated?") +
ylab("Percent College") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25)) # Add x-axis ticks every 25 years
Where Did The Data Come From? Who Collected It?
For more details on the Top 500 novels list, refer to the initial post. The SPL dataset was organized by David Christensen, Data Analysis Lead at the Seattle Public Library. The data from 2005 to 2016 in this dataset comes from the digital artwork “Making Visible the Invisible” by the studio of George Legrady.
Why Was The Data Collected? How Is The Data Used?
The data collected here is part of an ongoing project to reveal the dynamics of literary popularity through the SPL checkout data. Studying literary popularity is difficult; book sales data is proprietary and inaccessible to individuals outside of the publishing industry. While a powerful resource, the SPL checkout data, like all book data, struggles with persistent book identifiers: the same underlying work can have different editions, author name variants, and metadata. It is difficult to cluster different versions of the same underlying work together at scale, but we have previously experimented with a combination of semi-automated methods and manual oversight to cluster smaller corpora of interest (Gupta et al. 2025). The Top 500 Novels Dataset is compelling because it presents a set of works that have persisted in popularity and relevance well past their publication dates. These are exceptionally popular titles that have avoided the fate of the majority of books, which often fall out of circulation soon after publication (Sorensen 2007). Studying novels like these helps us understand how popularity and reception work for novels that are cultural touchstones.
Code
## Let's get average military service rate by elected governors for each year
yearly_military <- gov_long %>%
group_by(year) %>%
summarise(
total = n(),
military = sum(military_service == 1, na.rm = TRUE),
pct_military = 100 * military / total,
.groups = "drop"
)
# 2. Bin years into 25-year chunks
binned_military <- yearly_military %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year),
pct_military = mean(pct_military, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_military,
aes(x = year, y = pct_military),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_military,
aes(x = first_year + 12.5, y = pct_military), # center in bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_military,
aes(x = first_year + 12.5, y = pct_military),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Have elected governors served in the military more or less?") +
ylab("Percent Military Service") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25))
How Was The Data Collected?
To reconcile the SPL Checkout Data with the top 500 novel data, we employ a multi-step algorithm to capture all the records in the library data that may match one of the novels in our source list.
We first manipulate the title and author fields in the SPL data to normalize against common variants. For example, many titles have “(unabridged)” appended to them, and many authors are formatted last name, first name (e.g., Collins, Suzanne). Once we’ve reduced the SPL Checkout Data to simplified versions of the title and creator fields, we group by those fields to reduce our dataframe to about 800,000 unique titles.
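Here is a rough Python sketch of the kind of normalization described above; the exact rules we used may differ slightly, and the helper names are illustrative.

```python
import re

def normalize_title(title: str) -> str:
    title = re.sub(r"\(.*?\)", "", title)   # drop parentheticals like "(unabridged)"
    title = re.sub(r"\s+", " ", title).strip()
    return title.lower()

def normalize_creator(creator: str) -> str:
    if "," in creator:                       # flip "Collins, Suzanne" -> "suzanne collins"
        last, first = creator.split(",", 1)
        creator = f"{first.strip()} {last.strip()}"
    return creator.lower()

print(normalize_title("Catching Fire (Unabridged)"))  # "catching fire"
print(normalize_creator("Collins, Suzanne"))          # "suzanne collins"
```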
We then run a two-stage algorithm to pair novels from the Top 500 list with checkout records. We iterate through all 500 of our novels, first comparing the last name of the author to the creator field in the SPL data, checking whether the creator field contains the author’s last name anywhere in its text. We then use the Python RapidFuzz library to fuzzy-match the top 500 title against the title field in the SPL data. We use the Partial Ratio algorithm, which identifies the optimal alignment of the shorter string within the longer string, with a threshold of 85.
For an example of how this algorithm works, consider the novel Catching Fire by Suzanne Collins. A matching SPL record has the title Catching Fire: (movie tie in edition); the Partial Ratio algorithm returns a perfect score of 100 between this title and the true title, Catching Fire. Our algorithm is designed for high recall rather than high precision. In other words, we would rather identify more matches that could be wrong than miss matches that could be right. We prefer recall over precision because there are TONS of books in the dataset: once we have a list of matching records, we can manually inspect the matches to ensure accuracy. It is much more difficult to identify matches in the first place from over 800,000 unique titles!
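To make the two-stage check concrete, here is a minimal sketch using RapidFuzz’s partial_ratio with our threshold of 85; the function and field names are illustrative.

```python
from rapidfuzz import fuzz

def is_match(novel_title, author_last_name, spl_title, spl_creator, threshold=85):
    # Stage 1: the SPL creator field must contain the author's last name.
    if author_last_name.lower() not in spl_creator.lower():
        return False
    # Stage 2: fuzzy-match the Top 500 title against the SPL title.
    # partial_ratio scores the best alignment of the shorter string
    # inside the longer one, on a 0-100 scale.
    return fuzz.partial_ratio(novel_title.lower(), spl_title.lower()) >= threshold

# The Catching Fire example from the text: partial_ratio returns 100 here,
# because "catching fire" aligns perfectly inside the longer SPL title.
print(is_match("Catching Fire", "Collins",
               "Catching Fire: (movie tie in edition)", "Suzanne Collins"))  # True
```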
Importantly, the recall of our algorithm is not perfect, and we will miss some editions that should be clustered in. For example, alternate-language editions that lack the English title will often fail the fuzzy match: Stendhal’s The Red and the Black in French is titled Le Rouge et le Noir, which will not pass our fuzzy-matching threshold (luckily, we’re only missing 11 checkouts from the French edition).
Our algorithm returns matches for every single novel on our list! But because of our high-recall approach, we need to filter out mismatches. For example, Khaled Hosseini’s A Thousand Splendid Suns was matched to the record And the Mountains Echoed by the Bestselling Author of the Kite Runner and a Thousand Splendid Suns. These cases require manual oversight to filter out.
Our approach has a couple of systematic failures beyond odd edge cases. Book series where only the first entry is on the list (Artemis Fowl, Diary of a Wimpy Kid, The Maze Runner, etc.) often return matches for every entry in the series, because each later title references the original title. These were excluded manually.
An interesting case we often see when matching library data to these popular novels is checked-out titles that contain more than one text, such as the entire Lord of the Rings trilogy in one large volume. We decided to map checkouts for the entire Lord of the Rings trilogy to each of its entries in the Top 500 list, meaning that one checkout is counted three separate times in this new dataset.
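Here is a small sketch of how that mapping decision plays out in practice; the record shown is illustrative.

```python
import pandas as pd

# One checkout of an omnibus volume...
omnibus_checkout = {"Title": "The Lord of the Rings Trilogy",
                    "CheckoutYear": 2012, "CheckoutMonth": 6, "Checkouts": 1}
# ...that matches three separate entries on the Top 500 list.
matched_works = ["The Fellowship of the Ring", "The Two Towers",
                 "The Return of the King"]

# One row per matched work: the same checkout is counted three times.
expanded = pd.DataFrame([{**omnibus_checkout, "TOP_500_Title": w}
                         for w in matched_works])
print(expanded)
```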
Code
head(gov_dataset[c("state_territory", "governor", "party", "first_year", "school", "birth_state_territory")])

| state_territory | governor | party | first_year | school | birth_state_territory |
|---|---|---|---|---|---|
| Alabama | Kay Ivey | Republican | 2017 | University of Auburn | Alabama |
| Alaska | Mike Dunleavy | Republican | 2018 | Misericordia University; University of Alaska Fairbanks | Pennsylvania |
| American Samoa | Pula’ali’i Nikolao Pula | Republican | 2025 | Menlo College; Brigham Young University; George Mason University | American Samoa |
| Arizona | Katie Hobbs | Democratic | 2023 | Northern Arizona University, Arizona State University | Arizona |
| Arkansas | Sarah Huckabee Sanders | Republican | 2023 | Ouachita Baptist University | Arkansas |
| California | Gavin Newsom | Democratic | 2019 | Santa Clara University | California |
Uncertainty in the Data
We do not claim our matching to be perfect. Besides alternate-language editions, it is likely that other unusual variants have been missed by our method. Additionally, the SPL Checkout Data is not a perfect proxy for book popularity. Library checkout data is affected by regional dynamics, library programming, and the nature of the library system (Gupta et al. 2025).
Library checkouts have risen over time, partly due to increased participation in the library system and the rise of digital books. As a result, library checkout data does not represent a stable configuration of patrons engaging in consistent checkout behavior. Users should not treat the longitudinal checkout data as homogeneously representing the same relationship to readership across the time series.
What can we do with the data?
Thanks to the Top 500 Dataset Creators, we have tons of metadata variables we can play with to identify trends in our data. In our previous piece (Gupta et al. 2025), we present evidence that authorial death leads to a temporary boost in library checkouts for canonical authors.
We verify that relationship as follows.
Another affordance of the Top 500 Dataset list is its genre table. Although only about half of the entries are filled out, the genre metadata allows us to compare books within the same genre to see if there are any relationships between their receptions.
We can conduct the same correlation test as we did with authors to inspect which genres are the most internally correlated.
Sci-Fi and Horror are the most internally correlated genres, while Political, Bildung, and Mystery are less internally correlated.
Berglund and Steiner (2021) present evidence from Scandinavian fiction that an author’s new releases boost interest in their previous titles. Books by the same author affect each other’s success. The top 500 list has 89 different authors with multiple titles on the list. One question we might ask is whether works by the same author tend to have similar dynamics of interest.
We can test this by looking at the Pearson correlation coefficients between books by the same author. Taking the mean correlation between each author’s books, we get a value of 0.42. In other words, on average, a book’s checkout time series is moderately positively correlated with those of other books by the same author.
It’s hard to interpret that number without context. If we look at just the correlations on average between every book in the top 500 corpus, we only observe a coefficient of 0.16. So books by the same author are almost 3x more correlated than what we would expect on average.
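For readers who want to reproduce this kind of calculation, here is a minimal sketch of the within-author correlation, assuming a wide DataFrame `ts` with one column of monthly checkout counts per novel; the helper names are ours, and the numbers above come from the real data.

```python
import itertools
import numpy as np
import pandas as pd

def mean_pairwise_corr(ts: pd.DataFrame, cols) -> float:
    """Average Pearson correlation over all pairs of the given columns."""
    pairs = itertools.combinations(list(cols), 2)
    return float(np.mean([ts[a].corr(ts[b]) for a, b in pairs]))

def within_author_corr(ts: pd.DataFrame, author_of: dict) -> float:
    """Mean of each multi-book author's average pairwise correlation.

    `author_of` maps each column name (novel) to its author.
    """
    by_author = {}
    for col, author in author_of.items():
        by_author.setdefault(author, []).append(col)
    per_author = [mean_pairwise_corr(ts, cols)
                  for cols in by_author.values() if len(cols) > 1]
    return float(np.mean(per_author))

# Baseline for comparison: the average correlation across *all* pairs.
# baseline = mean_pairwise_corr(ts, ts.columns)
```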
Let’s look at a few specific examples. A lot of books by the same author belong to book series. For example, all of J.K. Rowling’s Harry Potter books are in the top 500 corpus, so we can generate a correlation heatmap for her books.
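A heatmap like this can be drawn directly from the correlation matrix; the sketch below assumes the same wide `ts` DataFrame as above, restricted to the Harry Potter titles.

```python
import matplotlib.pyplot as plt

def plot_corr_heatmap(ts, title="Correlation of monthly checkouts"):
    # Pairwise Pearson correlations between every column (book) in `ts`.
    corr = ts.corr()
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
    ax.set_xticks(range(len(corr)), corr.columns, rotation=90)
    ax.set_yticks(range(len(corr)), corr.columns)
    fig.colorbar(im, ax=ax)
    ax.set_title(title)
    plt.show()
```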
The heatmap is really striking: the middle four books of the series are highly correlated with each other, while the first book and the final two seem to have their own dynamics.
Let’s look at some timeseries to figure out what’s going on with the edge books!
First, the Prisoner of Azkaban vs. the Chamber of Secrets
These have a correlation coefficient of 0.97 and that lines up with what we’re seeing in the timeseries.
How about if we compare the Prisoner of Azkaban to the Sorcerer’s Stone?
They actually track really well together for most of the timeframe, but we see a huge spike for the Sorcerer’s Stone in 2020, which will really confound the correlation coefficient.
How about the Deathly Hallows and the Half-Blood Prince? These have a more obvious explanation: both books were released within the SPL timeframe, and have large numbers of checkouts following their release at the start of their timeseries. This follows what we know about how books function commercially, where a large share of sales comes immediately around release (Sorensen 2007). Outside of their release-date behavior, we see that these books still tend to follow the same dynamics. In fact, if we trim out the first few years of our timeframe and only look at our data after 2007, the correlation coefficients go straight back up!
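A minimal sketch of that trim, again assuming a wide `ts` DataFrame with a DatetimeIndex and one column per book; the helper is illustrative.

```python
import pandas as pd

def corr_after(ts: pd.DataFrame, a: str, b: str, year: int = 2007) -> float:
    """Correlation between two books' checkouts, using only data after `year`."""
    trimmed = ts[ts.index.year > year]  # drop the release-spike years
    return trimmed[a].corr(trimmed[b])
```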
The high correlation between the books in the series suggests that these books are likely complementary goods: the appeal of one increases with the appeal of the others. Of course this makes sense; readers who finish The Prisoner of Azkaban are likely to move straight on to The Goblet of Fire, and the value of any one of these books to a consumer is intrinsically tied to the value of the other books in the series, and to the series as a whole.
The opposite of a complementary good in economics is a substitute good. Can we think of books, or groups of books, as substitutes for each other? Are books within the same genre competing for the same readership? Or does the success of one book in a genre boost other books as well?
The answer, as you might expect, is: it depends.
Looking at correlations is a fun way to discover hidden relationships between seemingly different books. With only 500 novels, we can compare the timeseries of the books to each other to see which novels are correlated. Besides the obvious pairs at the top of the list, we see some more interesting comparisons: The Giver and The Book Thief are highly correlated, and both seem to have peaked around the same time.
Code
## Let's get average % female governors every year
yearly_female <- gov_long %>%
group_by(year) %>%
summarise(
total = n(),
female = sum(gender == "female", na.rm = TRUE),
pct_female = 100 * female / total,
.groups = "drop"
)
# 2. Bin years into 25-year chunks
binned_female <- yearly_female %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year),
pct_female = mean(pct_female, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_female,
aes(x = year, y = pct_female),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_female,
aes(x = first_year + 12.5, y = pct_female), # center in bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_female,
aes(x = first_year + 12.5, y = pct_female),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Have elected governors become more gender diverse?") +
ylab("Percent Female Governors") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25))
The percentage of Ivy League-educated governors has slightly decreased over time, with approximately 25 percent of governors in each period educated at an Ivy League institution. In previous work on political backgrounds, such as Dal Bó, Dal Bó, and Snyder (2009), an Ivy League education is commonly used as a marker of elite status. This definition omits other highly selective institutions, such as Stanford, the University of Chicago, or Georgetown, which also serve as pipelines to elite political careers.
Code
## Let's look at Ivy League rate
yearly_ivy <- gov_long %>%
group_by(year) %>%
summarise(
total = n(),
ivy = sum(ivy_attendance == 1, na.rm = TRUE),
pct_ivy = 100 * ivy / total,
.groups = "drop"
)
# 2. Bin years into 25-year chunks
binned_ivy <- yearly_ivy %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year),
pct_ivy = mean(pct_ivy, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_ivy,
aes(x = year, y = pct_ivy),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_ivy,
aes(x = first_year + 12.5, y = pct_ivy), # center in bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_ivy,
aes(x = first_year + 12.5, y = pct_ivy),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Have elected governors become more Ivy-educated?") +
ylab("Percent with Ivy Attendance") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25))
Have gubernatorial offices become more professionalized over the course of American history? We track the share of governors with law degrees as a simple proxy for how “professionalized” the office is. Law has long been a standard path into politics: lawyers gain legal expertise and entry into elite social networks. The share rises through the mid-19th century and then plateaus. Since roughly 2000, it has declined modestly, consistent with more governors coming from business and other non-law fields.
Code
## Let's look at rate of governors with law degrees
yearly_lawyer <- gov_long %>%
group_by(year) %>%
summarise(
total = n(),
lawyer = sum(lawyer == 1, na.rm = TRUE),
pct_lawyer = 100 * lawyer / total,
.groups = "drop"
)
# 2. Bin years into 25-year chunks
binned_lawyer <- yearly_lawyer %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year),
pct_lawyer = mean(pct_lawyer, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_lawyer,
aes(x = year, y = pct_lawyer),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_lawyer,
aes(x = first_year + 12.5, y = pct_lawyer), # center in bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_lawyer,
aes(x = first_year + 12.5, y = pct_lawyer),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Have elected governors become more likely to be lawyers?") +
ylab("Percent Lawyers") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25))
Which parties have produced the largest number of distinct governors? We count each person once (regardless of how many terms served), tallying their listed party affiliation. On this measure, Democrats make up the largest share, followed by Republicans, with historical parties (Whig, Democratic-Republican, Federalist, etc.) comprising smaller slices.
Code
gov_dataset_deduplicated <- gov_dataset %>%
distinct(governor, .keep_all = TRUE)
party_counts <- gov_dataset_deduplicated %>%
transmute(party = coalesce(party, "")) %>%
mutate(
party = str_replace_all(party, "\\bRepublician\\b", "Republican"),
party = str_replace_all(party, "\\bDemocrat\\b", "Democratic"),
party = str_replace_all(party, "\\bJacksonian Democrat\\b|\\bJackson Democrat\\b", "Democratic"),
party = str_replace_all(party, "\\bAnti-Jacksonian\\b", "National Republican"),
party = str_replace_all(party, "\\bJeffersonian(-| )?Republican\\b", "Democratic-Republican"),
party = str_replace_all(party, "\\bIndependent-?Republican\\b", "Republican"),
party = str_replace_all(party, "\\s*\\(.*?\\)", ""),
party = str_replace_all(party, "\\s+", " ")
) %>%
separate_rows(party, sep = "\\s*(;|,|/|\\band\\b|&)\\s*") %>%
mutate(party = str_squish(party)) %>%
filter(party != "") %>%
count(party, sort = TRUE)
party_counts_final <- party_counts %>%
slice_max(n, n = 10) %>%
mutate(party = fct_reorder(party, n))
# theme_minimal() must come before theme(), or it would reset the title size
ggplot(party_counts_final, aes(x = n, y = party)) +
geom_col(fill = "#6A0DAD") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(size = 8)) +
labs(
title = "Most common party affiliations among U.S. governors",
x = "Count", y = NULL
)
We also want to know if governors were born in the same state that they eventually governed. We plot the share of governors who were born in their state. Surprisingly, this proportion has actually increased over time.
Code
## Let's look at the rate of governors born in the state/territory they governed
yearly_born_in_state <- gov_long %>%
group_by(year) %>%
summarise(
total = n(),
in_state = sum(born_in_state_territory == 1, na.rm = TRUE),
pct_in_state = 100 * in_state / total,
.groups = "drop"
)
# 2. Bin years into 25-year chunks
binned_born_in_state <- yearly_born_in_state %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year),
pct_in_state = mean(pct_in_state, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_born_in_state,
aes(x = year, y = pct_in_state),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_born_in_state,
aes(x = first_year + 12.5, y = pct_in_state), # center in bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_born_in_state,
aes(x = first_year + 12.5, y = pct_in_state),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Percent of Governors Born in the State/Territory They Governed") +
ylab("Percent Born In-State") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25))
Governors, like many politicians, have been getting older over time. In the past 25 years, the average age at entry has been about 55. This pattern likely reflects not only changes in the typical political career but also broader increases in life expectancy.
Code
yearly_age <- gov_long %>%
group_by(year) %>%
summarise(
avg_age = mean(age_at_start, na.rm = TRUE),
.groups = "drop"
)
# 2. Bin years into 25-year chunks and compute mean within each bin
binned_age <- yearly_age %>%
mutate(bin = cut(year, breaks = seq(1775, 2025, by = 25), include.lowest = TRUE, right = FALSE)) %>%
group_by(bin) %>%
summarise(
first_year = min(year),
avg_age = mean(avg_age, na.rm = TRUE),
.groups = "drop"
)
# 3. Plot
ggplot() +
geom_point(
data = yearly_age,
aes(x = year, y = avg_age),
color = "gray50", alpha = 0.4
) +
geom_point(
data = binned_age,
aes(x = first_year + 12.5, y = avg_age), # center in bin
color = "#6A0DAD", size = 2
) +
geom_line(
data = binned_age,
aes(x = first_year + 12.5, y = avg_age),
color = "#6A0DAD"
) +
theme_bw() +
xlab("") +
ggtitle("Average Age of Governors at Start of Office") +
ylab("Average Age") +
theme(plot.title = element_text(size = 8)) +
scale_x_continuous(breaks = seq(1775, 2025, by = 25))
Conclusion
The gubernatorial biography dataset gives us perspective on the range of backgrounds of important politicians in American history. Generating features for tabular data out of web-scraped text is challenging and imperfect. But efforts like this help us notice demographic trends among US governors, and we’ve pointed out a couple of interesting results around gender, professionalization, and governor age. We think future work could refine our methods and extend them to other offices, helping complete a fuller picture of the history of American government.
References
Berglund, Karl, and Ann Steiner. 2021. “Is Backlist the New Frontlist?: Large-Scale Data Analysis of Bestseller Book Consumption in Streaming Services.” Logos 32 (1): 7–24. https://doi.org/10.1163/18784712-03104006.
Cohen, Margaret. 2018. The Sentimental Education of the Novel. Princeton: Princeton University Press. https://muse.jhu.edu/book/61065.
Dal Bó, Ernesto, Pedro Dal Bó, and Jason Snyder. 2009. “Political Dynasties.” The Review of Economic Studies 76 (1): 115–42.
Gupta, Neel, David Christensen, and Melanie Walsh. 2025. “Seattle Public Library’s Open Checkout Data: What Can It Tell Us about Readers and Book Popularity More Broadly?” Journal of Open Humanities Data 11 (August): 46. https://doi.org/10.5334/johd.332.
Gupta, Neel, Daniella Maor, Karalee Harris, Emily Backstrom, Hongyuan Dong, and Melanie Walsh. 2025. “The Canon in Circulation: Tracking the Reception of Norton Anthology Authors in Library Checkout Data.” Edited by Taylor Arnold, Margherita Fantoli, and Ruben Ros. Anthology of Computers and the Humanities 3: 1510–22. https://doi.org/10.63744/P6qPH135jhY2.
Sorensen, Alan T. 2007. “Bestseller Lists and Product Variety.” The Journal of Industrial Economics 55 (4): 715–38. https://doi.org/10.1111/j.1467-6451.2007.00327.x.
Programming Exercises
The Historical Gubernatorial Dataset is useful for presenting interesting examples and prompting independent exploration. Information about the governors in the dataset is widely available online, and finding historical outliers can be a fun way to engage with the history of American government. The exercise we present here introduces basic data exploration skills for looking at common values in a dataframe column, and then prompts the student to explore further on their own.
| Date | Title | Categories |
|---|---|---|
| Aug 1, 2024 | Pandas Value Counts with Gubernatorial Data (Exercise) | pandas, exercise |
| Aug 1, 2024 | Pandas Value Counts with Gubernatorial Data (Solution) | pandas, exercise, solution |