
Face of a Superstar: Part 1

Gathering Data to predict Hall of Fame Inductees based on Images
Author

Jeffrey Sumner

Published

May 13, 2023

Introduction

Hello folks! I’m in the process of building my blog and have been going through some older projects and converting them to Quarto documents to create content for it. The following project was started roughly 5 years ago on a whim. I’ve rewritten nearly the entire original script because a lot changes over 5 years; namely, I’ve become a bit better with R.

This post will be the first of at least a two-part series on predicting MLB Hall of Fame inductees based on their image. Let’s go ahead and dive in!

Motivation

Why in the world did I do this? Growing up a baseball player, I heard all the time “He looks like a superstar”. So, naturally, I decided to take that simple overused phrase literally and model it out. This exercise is meant to be “fun” and not to be taken seriously; there are numerous ways it can go wrong, which we will discuss in Part 2. For now, let’s start off like most Data Science projects and collect some data!

Setup - Required Packages

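First off, we will need the following packages:

  1. Lahman - a source for pretty much all your baseball needs
  2. tidyverse - essential for almost any data science project
  3. rvest - used to scrape data
  4. magick - used to read and display the images
  5. glue - used to build URLs and file paths from strings
  6. here - used to build paths relative to the project root
  7. parallel - ships with base R, so there is nothing to install

The code below loads them; install any that are missing first with install.packages().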

library(Lahman)
library(tidyverse)
library(rvest)
library(magick)
library(glue)
library(here)
library(parallel)
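
Note that library() only loads packages; it does not install them. Below is a minimal sketch for finding out which ones are missing, assuming the package list above. The helper name missing_packages is made up for this example and is not part of any package, and the install.packages() call is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper (not from any package): return the packages in `pkgs`
# that are not installed in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# parallel is left out because it ships with base R.
required <- c("Lahman", "tidyverse", "rvest", "magick", "glue", "here")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install anything that is missing:
# if (length(to_install) > 0) install.packages(to_install)
```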

Data Collection

Now that the required packages are loaded, let’s understand how we plan to use them, namely the Lahman package and the associated data.

Lahman Data

The Lahman package is a collection of data sets derived from the Lahman Baseball Database. This database contains a vast amount of statistics and player data. We plan to use this data/package to 1) Gather player IDs to pull images from baseball-reference.com (Baseball-Reference 2023) and 2) Identify the players that have been inducted into the Hall of Fame.

Lahman - People

Lahman has a dataset called People that can be accessed with the following code.

data("People")
dplyr::glimpse(People)
Rows: 21,010
Columns: 26
$ playerID     <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo01", "abada…
$ birthYear    <int> 1981, 1934, 1939, 1954, 1972, 1985, 1850, 1877, 1869, 186…
$ birthMonth   <int> 12, 2, 8, 9, 8, 12, 11, 4, 11, 10, 9, 3, 10, 2, 8, 9, 6, …
$ birthDay     <int> 27, 5, 5, 8, 25, 17, 4, 15, 11, 14, 20, 16, 22, 16, 17, 1…
$ birthCity    <chr> "Denver", "Mobile", "Mobile", "Orange", "Palm Beach", "La…
$ birthCountry <chr> "USA", "USA", "USA", "USA", "USA", "D.R.", "USA", "USA", …
$ birthState   <chr> "CO", "AL", "AL", "CA", "FL", "La Romana", "PA", "PA", "V…
$ deathYear    <int> NA, 2021, 1984, NA, NA, NA, 1905, 1957, 1962, 1926, NA, 1…
$ deathMonth   <int> NA, 1, 8, NA, NA, NA, 5, 1, 6, 4, NA, 2, 6, NA, NA, NA, N…
$ deathDay     <int> NA, 22, 16, NA, NA, NA, 17, 6, 11, 27, NA, 13, 11, NA, NA…
$ deathCountry <chr> NA, "USA", "USA", NA, NA, NA, "USA", "USA", "USA", "USA",…
$ deathState   <chr> NA, "GA", "GA", NA, NA, NA, "NJ", "FL", "VT", "CA", NA, "…
$ deathCity    <chr> NA, "Atlanta", "Atlanta", NA, NA, NA, "Pemberton", "Fort …
$ nameFirst    <chr> "David", "Hank", "Tommie", "Don", "Andy", "Fernando", "Jo…
$ nameLast     <chr> "Aardsma", "Aaron", "Aaron", "Aase", "Abad", "Abad", "Aba…
$ nameGiven    <chr> "David Allan", "Henry Louis", "Tommie Lee", "Donald Willi…
$ weight       <int> 215, 180, 190, 190, 184, 235, 192, 170, 175, 169, 220, 19…
$ height       <int> 75, 72, 75, 75, 73, 74, 72, 71, 71, 68, 74, 71, 70, 78, 7…
$ bats         <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, R, L, R, L, L, …
$ throws       <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, L, L, R, L, R, …
$ debut        <chr> "2004-04-06", "1954-04-13", "1962-04-10", "1977-07-26", "…
$ bbrefID      <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo01", "abada…
$ finalGame    <chr> "2015-08-23", "1976-10-03", "1971-09-26", "1990-10-03", "…
$ retroID      <chr> "aardd001", "aaroh101", "aarot101", "aased001", "abada001…
$ deathDate    <date> NA, 2021-01-22, 1984-08-16, NA, NA, NA, 1905-05-17, 1957…
$ birthDate    <date> 1981-12-27, 1934-02-05, 1939-08-05, 1954-09-08, 1972-08-…

As you can see from our data output above, we have a wide selection of information for each baseball player. The most important columns for us will be the playerID and bbrefID. With these two columns, we will be able to join to the HallOfFame tibble and pull the MLB Hall of Fame eligible players from baseball-reference.com (Baseball-Reference 2023).

Lahman - HallOfFame

Speaking of the HallOfFame tibble, below is the code to load the data and view it.

data("HallOfFame")
dplyr::glimpse(HallOfFame)
Rows: 6,382
Columns: 9
$ playerID    <chr> "aaronha01", "abbotji01", "abreubo01", "abreubo01", "abreu…
$ yearID      <int> 1982, 2005, 2020, 2021, 2022, 2023, 2024, 1937, 1938, 1939…
$ votedBy     <chr> "BBWAA", "BBWAA", "BBWAA", "BBWAA", "BBWAA", "BBWAA", "BBW…
$ ballots     <dbl> 415, 516, 397, 401, 394, 389, 385, 201, 262, 274, 233, 247…
$ needed      <dbl> 312, 387, 298, 301, 296, 292, 289, 151, 197, 206, 175, 186…
$ votes       <dbl> 406, 13, 22, 35, 34, 60, 57, 8, 11, 11, 11, 7, 6, 22, 4, 5…
$ inducted    <fct> Y, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N…
$ category    <fct> Player, Player, Player, Player, Player, Player, Player, Pl…
$ needed_note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Players i…

In this table, we are concerned only with the playerID (to link to the People tibble), inducted and category. Let’s move on to some cleanup of these two tibbles.

Lahman Data Cleaning

The Lahman data conveniently gave us a key, playerID, that easily links our two tibbles together. Let’s go ahead and add the information that we want from the People tibble to the HallOfFame tibble. In addition, we will go ahead and select only the columns previously listed as important from the HallOfFame tibble.

First, let’s split the HallOfFame tibble into Y/N category tibbles.

# Collect the players that are in the HOF
hof_y_people <- HallOfFame %>%
  # select columns that are important
  select(playerID, inducted, category) %>%
  # filter to Player only
  filter(category %in% "Player", inducted %in% "Y") %>%
  # category is no longer needed, remove it
  select(-category) %>%
  distinct() 

# Collect the players that are not in the HOF
hof_n_people <- HallOfFame %>%
  # select columns that are important
  select(playerID, inducted, category) %>%
  # filter to Player only
  filter(category %in% "Player", inducted %in% "N", !playerID %in% hof_y_people$playerID) %>%
  # category is no longer needed, remove it
  select(-category) %>%
  distinct() 

Our two tibbles separate the eligible players that made the Hall of Fame versus eligible players that did not make the Hall of Fame.
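
One detail worth spelling out: HallOfFame has one row per player per ballot (abreubo01 shows up more than once in the output above), so a player can collect several "N" rows before an eventual "Y". The !playerID %in% hof_y_people$playerID condition is what keeps those players out of the not-inducted group. Here is a small sketch of the same idea with made-up rows, written in base R so it runs without the tidyverse. The player IDs and the object names are invented for illustration.

```r
# Made-up ballot history: one row per player per vote, as in HallOfFame.
ballots <- data.frame(
  playerID = c("player_a", "player_a", "player_a", "player_b", "player_b"),
  inducted = c("N", "N", "Y", "N", "N")
)

# Players with at least one "Y" row are inducted.
inducted_ids <- unique(ballots$playerID[ballots$inducted == "Y"])

# Everyone else is eligible but not inducted. Excluding inducted_ids keeps
# player_a's earlier "N" rows from leaking into this group.
not_inducted_ids <- setdiff(unique(ballots$playerID), inducted_ids)

labels <- data.frame(
  playerID = c(inducted_ids, not_inducted_ids),
  inducted = c(rep("Y", length(inducted_ids)), rep("N", length(not_inducted_ids)))
)
print(labels)
```

Without the exclusion, player_a would end up with both a Y and an N label.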

We then join the People data to add each player’s bbrefID, which gives us a more complete data set. The code is below.

# Combine both tibbles and join the required people data
hof_people <- bind_rows(hof_y_people, hof_n_people) %>%
  left_join(
    People %>% select(playerID, bbrefID)
    , by = "playerID"
  )

glimpse(hof_people)
Rows: 1,396
Columns: 3
$ playerID <chr> "aaronha01", "alexape01", "alomaro01", "ansonca01", "aparilu0…
$ inducted <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
$ bbrefID  <chr> "aaronha01", "alexape01", "alomaro01", "ansonca01", "aparilu0…

Now we have a single tibble with the Hall of Fame eligible MLB players as well as their baseball reference ID. We are now able to move on and get the image data for our players.
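
Before scraping, a couple of quick checks on the joined data can save some trouble later. The sketch below uses made-up stand-ins for the two tables and base R's merge() in place of left_join(); the object names and player IDs are assumptions for illustration only.

```r
# Made-up stand-ins for the labelled players and the People lookup.
labels <- data.frame(playerID = c("player_a", "player_b", "player_c"),
                     inducted = c("Y", "N", "N"))
people <- data.frame(playerID = c("player_a", "player_b"),
                     bbrefID  = c("player_a", "player_b"))

# all.x = TRUE keeps every labelled player, like left_join().
joined <- merge(labels, people, by = "playerID", all.x = TRUE)

# Each player should appear exactly once.
stopifnot(!anyDuplicated(joined$playerID))

# Players without a bbrefID cannot be scraped, so list them.
unmatched <- joined$playerID[is.na(joined$bbrefID)]
print(unmatched)

# How many players fall in each group.
print(table(joined$inducted))
```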

Baseball-Reference Images

Each MLB player has a baseball-reference page associated with them. For example, here is a link to Robin Yount’s page (Baseball-Reference 2023). When viewing Robin Yount’s baseball reference page, you’ll see a lot of information on his statistics, accolades, etc. But most importantly, you will see his player image. That image is the data that we want to pull.

Sample Data

Let’s take a look at the Robin Yount example a little more. To work out how to get the image, you will need to inspect the page source on baseball-reference and find how the images are tagged. Once you know that, you can download the image data. Below is an example using Robin Yount.

url <- "https://www.baseball-reference.com/players/y/yountro01.shtml"
webpage <- session(url)
link_titles <- webpage %>% html_nodes("img")
img_url_first <- link_titles[2] %>% html_attr("src")
img_url_first
[1] "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_sabr.jpg"
magick::image_read(img_url_first)

Robin Yount Sample Image #1

Notice how we set this up as img_url_first. If you hover over the image on baseball-reference, you may see multiple images pop up. In practice, this would be cause for concern. We would need to determine which image, if any, should be used for our modeling approach. In the context of this problem do we go with a picture that looks younger? Or maybe a picture of the player on their first team versus their last team? For the purposes of this exercise, we will choose only the picture that appears first. We’ll save the in-depth approaches for a rainy day.
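
Picking the image by position (link_titles[2]) depends on the page layout staying the same. A slightly more defensive sketch, assuming headshot URLs keep containing /images/headshots/ as in the output above, filters the src values by that pattern instead. The helper name first_headshot and the logo URL are made up for this example.

```r
# Hypothetical helper: keep only the src values that look like headshots
# and return the first one, or NA if there are none.
first_headshot <- function(img_srcs) {
  hits <- img_srcs[grepl("/images/headshots/", img_srcs, fixed = TRUE)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Stand-in for `link_titles %>% html_attr("src")`; the first URL is made up.
img_srcs <- c(
  "https://example.com/logo.png",
  "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_sabr.jpg",
  "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_davis.jpg"
)
first_headshot(img_srcs)
```

Returning NA when nothing matches makes it easy to skip players whose page has no headshot rather than downloading the wrong image.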

Just to provide tangible evidence of multiple images, below is the second Robin Yount image.

link_titles <- webpage %>% html_nodes("img")
img_url_second <- link_titles[3] %>% html_attr("src")
img_url_second
[1] "https://www.baseball-reference.com/req/202408150/images/headshots/a/aadc0345_davis.jpg"
magick::image_read(img_url_second)

Robin Yount Sample Image #2

WARNING: While this image appears to show an older Robin Yount, baseball-reference does note that images are not necessarily in chronological order. If you take this a step further, do not assume that the first image shows a younger player than the second or third.

Scraping Hall of Fame Eligible Players

With an example created, we need to pull all Hall of Fame eligible players. To do this, let’s break down the URL that we used to pull the data. The URL is listed below:

  • https://www.baseball-reference.com/players/y/yountro01.shtml

Let’s break this into different parts:

  1. root = https://www.baseball-reference.com/players
  2. first letter of last name = /y
  3. bbrefID = /yountro01
  4. extension = .shtml
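
The four parts above can be put back together with a few lines of base R. This is only a sketch of the URL pattern described here; the helper name build_player_url is made up for this example.

```r
# Hypothetical helper: build a player's page URL from their bbrefID.
build_player_url <- function(bbref_id) {
  root <- "https://www.baseball-reference.com/players"
  first_letter <- substr(bbref_id, 1, 1)  # the folder is the ID's first letter
  paste0(root, "/", first_letter, "/", bbref_id, ".shtml")
}

build_player_url("yountro01")
# "https://www.baseball-reference.com/players/y/yountro01.shtml"

# substr() and paste0() are vectorised, so several IDs work too.
build_player_url(c("aaronha01", "heltoto01"))
```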

So we always use the same root and extension while changing the first letter of the last name and the bbrefID. To do this, we will create a simple function. Our function takes a single input, the bbrefID, and saves an image into a directory titled Hall_of_Fame_Eligible. The images are stored as inducted_bbrefID.jpg, where inducted is Y or N (for example, N_heltoto01.jpg). In addition, I have added logic to avoid scraping data that has already been scraped. More on that in just a moment.

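scrape_bbref_images <- function(baseball_reference_id){
  
  # Look up the induction flag ("Y" or "N") for this player
  inducted <- hof_people %>%
    filter(bbrefID %in% baseball_reference_id) %>%
    pull(inducted) %>%
    as.character()
  
  # The URL folder is the first letter of the bbrefID, which begins with the last name
  first_letter_last_name <- substr(baseball_reference_id,1,1)
  
  # Images are saved as {inducted}_{bbrefID}.jpg
  file_name <- glue::glue(here::here(), "/data/MLB_Hall_Of_Fame_Project/Hall_of_Fame_Eligible/{inducted}_{baseball_reference_id}.jpg")
  
  url <- glue::glue("https://www.baseball-reference.com/players/{first_letter_last_name}/{baseball_reference_id}.shtml")
  
  # try() keeps one failed page from stopping the whole run
  try(if(!file.exists(file_name)){
    print(baseball_reference_id)
    # Make sure the output directory exists before downloading
    dir.create(dirname(file_name), recursive = TRUE, showWarnings = FALSE)
    # Pause before each request to go easy on the site
    Sys.sleep(5)
    webpage <- session(url)
    link_titles <- webpage %>% html_nodes("img")
    # The headshot was the second <img> on the sample page above
    img_url_first <- link_titles[2] %>% html_attr("src")
    download.file(img_url_first, file_name,mode = "wb")
  } else{
    print(glue::glue("The file for {baseball_reference_id} already exists."))
  }
  )
  
}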

Now with our function, we will loop through all of our eligible players and download their first image. It is important to remember that we are scraping data from a website, and one drawback of doing so is the potential to have your IP blocked. Take things slow; otherwise you may run into unexpected issues while gathering this data. Our function above uses Sys.sleep(5) to pause before each request, but even that is not enough on its own. The function also skips any player whose image file already exists, so an interrupted run can be restarted without downloading everything again. Run at your own risk and adjust based on how the website reacts to you.
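
A fixed pause is one option. Another common pattern is to retry a failed request with a pause that grows after each failure. Below is a minimal sketch of that idea; the helper name with_retries, the number of tries, and the delays are all assumptions for illustration, and nothing here is tuned to what baseball-reference actually tolerates.

```r
# Hypothetical helper: call `fn()` up to `max_tries` times, doubling the
# pause after each failure. Returns NULL if every attempt fails.
with_retries <- function(fn, max_tries = 3, base_delay = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  NULL
}

# Toy stand-in for a download that fails twice before succeeding.
calls <- 0
flaky_download <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "ok"
}

# base_delay = 0 only so the toy example finishes instantly.
with_retries(flaky_download, max_tries = 5, base_delay = 0)
```

In the scraper, the download step could be wrapped as with_retries(function() download.file(img_url_first, file_name, mode = "wb")). Some failures in download.file() may surface as warnings rather than errors, so treat this as a starting point rather than a complete solution.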

lapply(hof_people$bbrefID, scrape_bbref_images)

Now we should have data on all eligible Hall of Fame players. A sample structure of the directory can be found below. In my next post, we will discuss how to model this data to determine who truly does “look like a superstar”.

                      levelName
1 data
2  °--MLB_Hall_Of_Fame_Project
3      °--Hall_of_Fame_Eligible
4          ¦--N_baergca01.jpg
5          ¦--N_darkal01.jpg
6          ¦--N_delgaca01.jpg
7          ¦--N_doyleja01.jpg
8          °--N_heltoto01.jpg  
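
Because the download is wrapped in try(), a failed page does not stop the loop, so it is worth checking which players ended up without an image. Here is a minimal sketch, assuming the file naming used above. The helper name missing_images is made up, and the example runs in a temporary directory rather than the real project folder.

```r
# Hypothetical helper: return the IDs whose image file is not in `dir`.
# `ids` and `labels` are parallel vectors (bbrefID and "Y"/"N").
missing_images <- function(ids, labels, dir) {
  expected <- file.path(dir, paste0(labels, "_", ids, ".jpg"))
  ids[!file.exists(expected)]
}

# Toy run in a temporary directory with one image "downloaded".
img_dir <- file.path(tempdir(), "Hall_of_Fame_Eligible")
dir.create(img_dir, showWarnings = FALSE)
file.create(file.path(img_dir, "N_heltoto01.jpg"))

# Only the player without a file is returned.
missing_images(c("heltoto01", "yountro01"), c("N", "Y"), img_dir)
```

With the real data, the inputs would be hof_people$bbrefID and as.character(hof_people$inducted), and passing the result back to lapply() retries only the players that are still missing.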

References

Baseball-Reference. 2023. “Baseball-Reference.com.” https://www.baseball-reference.com/.