library(Lahman)
library(tidyverse)
library(rvest)
library(magick)
library(glue)
library(here)
library(parallel)
Face of a Superstar: Part 1
Introduction
Hello folks! I’m in the process of building my blog, and I have been converting some older projects to Quarto documents to create content for it. The following project was started roughly 5 years ago on a whim. I’ve rewritten nearly the entire original script because a lot changes over 5 years; namely, I’ve become a bit better with R.
This post is the first of at least a two-part series on predicting MLB Hall of Fame inductees based on their image. Let’s go ahead and dive in!
Motivation
Why in the world did I do this? Growing up a baseball player, I heard all the time “He looks like a superstar”. So, naturally, I decided to take that simple overused phrase literally and model it out. This exercise is meant to be “fun” and not to be taken seriously; there are numerous ways it can go wrong, which we will discuss in Part 2. For now, let’s start off like most Data Science projects and collect some data!
Setup - Required Packages
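First off, we will need to install and load the following packages:
- Lahman - a source for pretty much all your baseball needs
- tidyverse - essential for almost any data science project
- rvest - used to scrape data
- magick - used to read and display the downloaded images
- glue - used to build URLs and file names
- here - used to build file paths relative to the project root
Install any that are missing, then load each one with library().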
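A minimal sketch of the install step, assuming you want to install only what is missing. The helper name missing_packages is hypothetical, and the install call is left commented out so nothing is installed without your say-so.

```r
# Packages used in this post (parallel ships with R itself, so it is not listed).
required_packages <- c("Lahman", "tidyverse", "rvest", "magick", "glue", "here")

# Hypothetical helper: return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required_packages)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install from your configured CRAN mirror:
  # install.packages(to_install)
} else {
  message("All required packages are already installed.")
}
```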
Data Collection
Now that the required packages are loaded, let’s understand how we plan to use them, namely the Lahman package and the associated data.
Lahman Data
The Lahman package is a collection of data sets derived from the Lahman Baseball Database. This database contains a vast amount of statistics and player data. We plan to use this data/package to 1) Gather player IDs to pull images from baseball-reference.com (Baseball-Reference 2023) and 2) Identify the players that have been inducted into the Hall of Fame.
Lahman - People
Lahman has a dataset called People that can be accessed with the following code.
data("People")
dplyr::glimpse(People)
Rows: 21,010
Columns: 26
$ playerID <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo01", "abada…
$ birthYear <int> 1981, 1934, 1939, 1954, 1972, 1985, 1850, 1877, 1869, 186…
$ birthMonth <int> 12, 2, 8, 9, 8, 12, 11, 4, 11, 10, 9, 3, 10, 2, 8, 9, 6, …
$ birthDay <int> 27, 5, 5, 8, 25, 17, 4, 15, 11, 14, 20, 16, 22, 16, 17, 1…
$ birthCity <chr> "Denver", "Mobile", "Mobile", "Orange", "Palm Beach", "La…
$ birthCountry <chr> "USA", "USA", "USA", "USA", "USA", "D.R.", "USA", "USA", …
$ birthState <chr> "CO", "AL", "AL", "CA", "FL", "La Romana", "PA", "PA", "V…
$ deathYear <int> NA, 2021, 1984, NA, NA, NA, 1905, 1957, 1962, 1926, NA, 1…
$ deathMonth <int> NA, 1, 8, NA, NA, NA, 5, 1, 6, 4, NA, 2, 6, NA, NA, NA, N…
$ deathDay <int> NA, 22, 16, NA, NA, NA, 17, 6, 11, 27, NA, 13, 11, NA, NA…
$ deathCountry <chr> NA, "USA", "USA", NA, NA, NA, "USA", "USA", "USA", "USA",…
$ deathState <chr> NA, "GA", "GA", NA, NA, NA, "NJ", "FL", "VT", "CA", NA, "…
$ deathCity <chr> NA, "Atlanta", "Atlanta", NA, NA, NA, "Pemberton", "Fort …
$ nameFirst <chr> "David", "Hank", "Tommie", "Don", "Andy", "Fernando", "Jo…
$ nameLast <chr> "Aardsma", "Aaron", "Aaron", "Aase", "Abad", "Abad", "Aba…
$ nameGiven <chr> "David Allan", "Henry Louis", "Tommie Lee", "Donald Willi…
$ weight <int> 215, 180, 190, 190, 184, 235, 192, 170, 175, 169, 220, 19…
$ height <int> 75, 72, 75, 75, 73, 74, 72, 71, 71, 68, 74, 71, 70, 78, 7…
$ bats <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, R, L, R, L, L, …
$ throws <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, L, L, R, L, R, …
$ debut <chr> "2004-04-06", "1954-04-13", "1962-04-10", "1977-07-26", "…
$ bbrefID <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo01", "abada…
$ finalGame <chr> "2015-08-23", "1976-10-03", "1971-09-26", "1990-10-03", "…
$ retroID <chr> "aardd001", "aaroh101", "aarot101", "aased001", "abada001…
$ deathDate <date> NA, 2021-01-22, 1984-08-16, NA, NA, NA, 1905-05-17, 1957…
$ birthDate <date> 1981-12-27, 1934-02-05, 1939-08-05, 1954-09-08, 1972-08-…
As you can see from the output above, we have a wide selection of information for each baseball player. The most important columns for us will be playerID and bbrefID. With these two columns, we will be able to join to the HallOfFame tibble and pull the MLB Hall of Fame eligible players from baseball-reference.com (Baseball-Reference 2023).
Lahman - HallOfFame
Speaking of the HallOfFame tibble, below is the code to load the data and view it.
data("HallOfFame")
dplyr::glimpse(HallOfFame)
Rows: 6,382
Columns: 9
$ playerID <chr> "aaronha01", "abbotji01", "abreubo01", "abreubo01", "abreu…
$ yearID <int> 1982, 2005, 2020, 2021, 2022, 2023, 2024, 1937, 1938, 1939…
$ votedBy <chr> "BBWAA", "BBWAA", "BBWAA", "BBWAA", "BBWAA", "BBWAA", "BBW…
$ ballots <dbl> 415, 516, 397, 401, 394, 389, 385, 201, 262, 274, 233, 247…
$ needed <dbl> 312, 387, 298, 301, 296, 292, 289, 151, 197, 206, 175, 186…
$ votes <dbl> 406, 13, 22, 35, 34, 60, 57, 8, 11, 11, 11, 7, 6, 22, 4, 5…
$ inducted <fct> Y, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N…
$ category <fct> Player, Player, Player, Player, Player, Player, Player, Pl…
$ needed_note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Players i…
In this table, we are concerned only with playerID (to link to the People tibble), inducted, and category. Let’s move on to some cleanup of these two tibbles.
Lahman Data Cleaning
The Lahman data conveniently gave us a key, playerID, that easily links our two tibbles together. Let’s go ahead and add the information that we want from the People tibble to the HallOfFame tibble. In addition, we will select only the columns previously listed as important from the HallOfFame tibble.
First, let’s split the HallOfFame tibble into Y/N category tibbles.
# Collect the players that are in the HOF
hof_y_people <- HallOfFame %>%
  # select columns that are important
  select(playerID, inducted, category) %>%
  # filter to Player only
  filter(category %in% "Player", inducted %in% "Y") %>%
  # category is no longer needed, remove it
  select(-category) %>%
  distinct()
# Collect the players that are not in the HOF
hof_n_people <- HallOfFame %>%
  # select columns that are important
  select(playerID, inducted, category) %>%
  # filter to Player only, excluding anyone who was eventually inducted
  filter(category %in% "Player", inducted %in% "N", !playerID %in% hof_y_people$playerID) %>%
  # category is no longer needed, remove it
  select(-category) %>%
  distinct()
Our two tibbles separate the eligible players that made the Hall of Fame versus eligible players that did not make the Hall of Fame.
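The exclusion on playerID in the second filter matters because HallOfFame holds one row per ballot appearance, so a player can collect several N rows before a Y row. Here is a small base R sketch with made-up IDs (playerA and playerB are not real Lahman rows) showing what would happen without it.

```r
# Toy ballot history (made-up IDs, not real Lahman rows).
ballots <- data.frame(
  playerID = c("playerA", "playerA", "playerA", "playerB", "playerB"),
  inducted = c("N", "N", "Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Players with at least one successful ballot.
inducted_ids <- unique(ballots$playerID[ballots$inducted == "Y"])

# Without the exclusion, playerA would land in both groups.
naive_n_ids <- unique(ballots$playerID[ballots$inducted == "N"])
overlap <- intersect(naive_n_ids, inducted_ids)

# With the exclusion, each player gets exactly one label.
not_inducted_ids <- setdiff(naive_n_ids, inducted_ids)

print(overlap)
print(not_inducted_ids)
```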
We then join the People data to get the bbrefID, giving us a more complete data set, in the code below.
# Combine both tibbles and join the required people data
hof_people <- bind_rows(hof_y_people, hof_n_people) %>%
  left_join(
    People %>% select(playerID, bbrefID),
    by = "playerID"
  )
glimpse(hof_people)
Rows: 1,396
Columns: 3
$ playerID <chr> "aaronha01", "alexape01", "alomaro01", "ansonca01", "aparilu0…
$ inducted <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
$ bbrefID <chr> "aaronha01", "alexape01", "alomaro01", "ansonca01", "aparilu0…
Now we have a single tibble with the Hall of Fame eligible MLB players as well as their baseball reference ID. We are now able to move on and get the image data for our players.
Baseball-Reference Images
Each MLB player has a baseball-reference page associated with them. For example, here is a link to Robin Yount’s page (Baseball-Reference 2023). When viewing Robin Yount’s baseball-reference page, you’ll see a lot of information on his statistics, accolades, etc. But most importantly, you will see his player image. That image is the data that we want to pull.
Sample Data
Let’s take a look at the Robin Yount example a little more. To determine how to get the image, you will need to inspect the baseball-reference page source and find how the image(s) are tagged. Once you know that, you can download the image data. Below is an example using Robin Yount.
url <- "https://www.baseball-reference.com/players/y/yountro01.shtml"
webpage <- session(url)
# collect every <img> node on the page
link_titles <- webpage %>% html_nodes("img")
# the second <img> node is the first player headshot
img_url_first <- link_titles[2] %>% html_attr("src")
img_url_first
[1] "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_sabr.jpg"
magick::image_read(img_url_first)
Notice how we set this up as img_url_first. If you hover over the image on baseball-reference, you may see multiple images pop up. In practice, this would be cause for concern. We would need to determine which image, if any, should be used for our modeling approach. In the context of this problem, do we go with a picture that looks younger? Or maybe a picture of the player on their first team versus their last team? For the purposes of this exercise, we will choose only the picture that appears first. We’ll save the in-depth approaches for a rainy day.
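If you would rather see every candidate headshot than rely on a position in the list of image nodes, one option is to filter the src values by path. This is only a sketch: pick_headshots is a hypothetical helper, and it assumes headshot URLs contain "/images/headshots/", as the headshot URLs shown in this post do. The site's markup may change.

```r
# Hypothetical helper: keep only image sources that look like player headshots.
# Assumption: headshot URLs contain "/images/headshots/".
pick_headshots <- function(img_srcs) {
  img_srcs <- img_srcs[!is.na(img_srcs)]
  img_srcs[grepl("/images/headshots/", img_srcs, fixed = TRUE)]
}

# Stand-in for the character vector returned by
# webpage %>% html_nodes("img") %>% html_attr("src").
example_srcs <- c(
  "https://www.baseball-reference.com/some/logo.svg",  # made-up non-headshot
  "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_sabr.jpg",
  "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_davis.jpg",
  NA
)

headshots <- pick_headshots(example_srcs)
print(headshots)
```

With this approach, headshots[1] plays the role of img_url_first, and length(headshots) tells you how many alternatives a player has.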
Just to provide tangible evidence of multiple images, below is the second Robin Yount image.
link_titles <- webpage %>% html_nodes("img")
# the third <img> node is the second player headshot
img_url_second <- link_titles[3] %>% html_attr("src")
img_url_second
[1] "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_davis.jpg"
magick::image_read(img_url_second)
WARNING: While this appears to be an older image, baseball-reference does note that images are not necessarily in chronological order. If you take this a step further, do not assume that the first image shows a younger player than the second or third.
Scraping Hall of Fame Eligible Players
With an example created, we need to pull all Hall of Fame eligible players. To do this, let’s break down the URL that we used to pull the data. The URL is listed below:
- https://www.baseball-reference.com/players/y/yountro01.shtml
Let’s break this into different parts:
- root = https://www.baseball-reference.com/players
- first letter last name = /y
- bbrefID = /yountro01
- extension = .shtml
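The four parts above can be sketched as a small base R helper. The name build_bbref_url is hypothetical, and it assumes the first character of the bbrefID always matches the letter used in the URL, which holds for the IDs shown in this post.

```r
# Hypothetical helper: assemble a player page URL from a bbrefID.
# Assumption: the directory letter is the first character of the bbrefID.
build_bbref_url <- function(bbref_id) {
  root <- "https://www.baseball-reference.com/players"
  first_letter <- substr(bbref_id, 1, 1)
  paste0(root, "/", first_letter, "/", bbref_id, ".shtml")
}

print(build_bbref_url("yountro01"))
```

Because substr() and paste0() are vectorized, the same helper also accepts a whole column of IDs at once.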
So we always use the same root and extension while adjusting the first letter of the last name and the bbrefID. To do this, we will create a simple function. Our function takes a single input, the bbrefID, and saves an image into a directory titled Hall_of_Fame_Eligible. The images are stored as eligibility_playerID.jpg (for example, N_heltoto01.jpg). In addition, I have added logic to avoid scraping data that has already been scraped. More on that in just a moment.
scrape_bbref_images <- function(baseball_reference_id){

  # look up the induction status (Y/N) used as the file name prefix
  inducted <- hof_people %>%
    filter(bbrefID %in% baseball_reference_id) %>%
    pull(inducted) %>%
    as.character()

  # the bbrefID starts with the last name, so this is the last-name initial
  first_letter_last_name <- substr(baseball_reference_id, 1, 1)

  file_name <- glue::glue(here::here(), "/data/MLB_Hall_Of_Fame_Project/Hall_of_Fame_Eligible/{inducted}_{baseball_reference_id}.jpg")

  url <- glue::glue("https://www.baseball-reference.com/players/{first_letter_last_name}/{baseball_reference_id}.shtml")

  # try() keeps one failed player from stopping the whole loop
  try(if(!file.exists(file_name)){
    print(baseball_reference_id)
    # pause before each request to go easy on the site
    Sys.sleep(5)
    webpage <- session(url)
    link_titles <- webpage %>% html_nodes("img")
    img_url_first <- link_titles[2] %>% html_attr("src")
    download.file(img_url_first, file_name, mode = "wb")
  } else{
    print(glue::glue("The file for {baseball_reference_id} already exists."))
  }
  )
}
Now with our function, we will loop through all of our eligible players and download their first image. It is important to remember that we are scraping data from a website, and one drawback of doing so is the potential to have your IP blocked. Take things slow; otherwise you may run into unexpected issues while gathering this data. Our function above uses Sys.sleep(5) to pause five seconds before each request, but even that may not be enough to avoid being blocked. Additionally, the file.exists() check skips players whose image has already been downloaded, so an interrupted run can be restarted without repeating requests. Run at your own risk and adjust based on how the website reacts to you.
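If a fixed pause is not enough, one common alternative is to retry a failed request with a growing delay. The sketch below is not what the function above does; with_retry and its arguments are hypothetical, and the demo uses a simulated failure with a zero delay instead of a real request.

```r
# Hypothetical helper: call fn() up to max_attempts times, waiting longer
# after each failure (base_delay, 2 * base_delay, 4 * base_delay, ...).
with_retry <- function(fn, max_attempts = 3, base_delay = 5) {
  stopifnot(max_attempts >= 1)
  for (attempt in seq_len(max_attempts)) {
    # capture an error as a value instead of stopping
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_attempts) {
      Sys.sleep(base_delay * 2^(attempt - 1))
    }
  }
  stop("All ", max_attempts, " attempts failed: ", conditionMessage(result))
}

# Demo with a simulated flaky call that succeeds on the third attempt.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated failure")
  "ok"
}

outcome <- with_retry(flaky, max_attempts = 5, base_delay = 0)
print(outcome)
```

Note that scrape_bbref_images wraps its work in try(), which swallows errors, so a wrapper like this would only see failures if that try() were removed or the error re-raised.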
lapply(hof_people$bbrefID, scrape_bbref_images)
Now we should have data on all eligible Hall of Fame players. A sample structure of the directory can be found below. In my next post, we will discuss how to model this data to determine who truly does “look like a superstar”.
  levelName
1 data
2  °--MLB_Hall_Of_Fame_Project
3      °--Hall_of_Fame_Eligible
4          ¦--N_baergca01.jpg
5          ¦--N_darkal01.jpg
6          ¦--N_delgaca01.jpg
7          ¦--N_doyleja01.jpg
8          °--N_heltoto01.jpg
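Since each file name carries both the label and the player ID, the two can be recovered later with a little string handling. This is a rough base R sketch; parse_image_label is a hypothetical helper, and it assumes every file follows the eligibility_playerID.jpg pattern with no underscore inside the ID.

```r
# Hypothetical helper: split "N_heltoto01.jpg" into its label and bbrefID.
# Assumption: the first underscore separates the label from the ID.
parse_image_label <- function(file_names) {
  base <- sub("\\.jpg$", "", basename(file_names))
  data.frame(
    file = file_names,
    inducted = sub("_.*$", "", base),
    bbrefID = sub("^[^_]*_", "", base),
    stringsAsFactors = FALSE
  )
}

example_files <- c("N_baergca01.jpg", "N_darkal01.jpg", "N_heltoto01.jpg")
labels <- parse_image_label(example_files)
print(labels)
```

In practice the input would come from something like list.files() pointed at the Hall_of_Fame_Eligible directory.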