Final project - create your own R Package
Issaiah Jennings
In this blog post, I’ll walk you through how I created an R package, combinePredictR, to analyze the 2019 NFL Combine data. This package includes functions for cleaning, visualizing, and summarizing key player statistics, such as the 40-yard dash time. Along the way, I’ll explain the steps I took, the challenges I encountered, and how I addressed them.
Step 1: Setting Up the Package
I started by setting up my R package using usethis and devtools, which are great tools for package development in R. Here's a quick rundown of what I did:
I used the usethis::create_package() function to initialize the basic structure of the package (the full set of setup calls is sketched after this list).
I created a new R script for each function:
clean_combine_data.R: For data cleaning.
plot_40yd_by_position.R: To plot the 40-yard dash times by player position.
summarize_40yd_by_position.R: To create a summary table showing average, minimum, and maximum times by position.
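For reference, here is roughly what that setup looks like as a sequence of R calls. The package and script names match the ones above; the usethis::use_r() and usethis::use_package() calls are one convenient way to create the scripts and declare the dplyr/ggplot2 dependencies, not necessarily the exact commands I ran.

# Scaffold the package skeleton (run once from an interactive session)
usethis::create_package("combinePredictR")

# Declare the packages the functions rely on (added to DESCRIPTION)
usethis::use_package("dplyr")
usethis::use_package("ggplot2")

# Create one script per function under R/
usethis::use_r("clean_combine_data")
usethis::use_r("plot_40yd_by_position")
usethis::use_r("summarize_40yd_by_position")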
Step 2: Cleaning the Data
The first task was cleaning the raw NFL Combine data. I loaded the dataset from a CSV file and wrote the clean_combine_data() function to clean it. The function removes rows with missing values for key columns like 40-yard dash times and vertical jump heights. I also filtered out the 9.99 values, which were placeholders for missing data.
Here’s the cleaning code:

clean_combine_data <- function(data) {
  requireNamespace("dplyr", quietly = TRUE)

  data_cleaned <- data %>%
    dplyr::filter(
      !is.na(`40 Yard`),
      !is.na(`Vert Leap (in)`),
      `40 Yard` != 9.99  # Filter out placeholder times
    )

  return(data_cleaned)
}
By doing this, I ensured that I was working with accurate, real data to analyze and visualize.
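As a quick sanity check, here is how the cleaning step might be called from a script; the CSV file name below is just a placeholder for wherever the raw combine data lives.

library(dplyr)  # clean_combine_data() uses the %>% pipe, so attach dplyr first

# check.names = FALSE keeps column names like `40 Yard` and `Vert Leap (in)` intact
combine_raw <- read.csv("2019_nfl_combine.csv", check.names = FALSE)

combine_clean <- clean_combine_data(combine_raw)

nrow(combine_raw)    # rows before cleaning
nrow(combine_clean)  # rows after cleaning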
Step 3: Visualizing the Data
Next, I created a boxplot to visualize the distribution of 40-yard dash times across player positions. This was done using ggplot2, one of my favorite R visualization packages.
The function plot_40yd_by_position() creates a boxplot for each position (e.g., QB, WR, OT), showing the spread of dash times. Here’s the code for this function:

plot_40yd_by_position <- function(data) {
  requireNamespace("ggplot2", quietly = TRUE)
  requireNamespace("dplyr", quietly = TRUE)

  data_filtered <- dplyr::filter(data, `40 Yard` < 7)  # Filter out extreme values
  data_filtered$POS <- factor(
    data_filtered$POS,
    levels = c("QB", "RB", "WR", "TE", "CB", "S", "LB",
               "DE", "DT", "OT", "OG", "C", "K", "P")
  )

  ggplot2::ggplot(data_filtered, ggplot2::aes(x = POS, y = `40 Yard`, fill = POS)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(
      title = "40-Yard Dash Times by Position",
      x = "Position",
      y = "40-Yard Dash Time (seconds)"
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "none")  # Remove legend
}
This function creates a colorful, informative plot that shows how players from different positions compare in their 40-yard dash times.
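Calling it is a one-liner; saving the figure with ggsave() is an optional extra, and the output file name here is purely illustrative.

# Boxplot of 40-yard dash times by position, using the cleaned data from Step 2
p <- plot_40yd_by_position(combine_clean)
print(p)

# Optionally write the figure to disk
ggplot2::ggsave("40yd_by_position.png", p, width = 8, height = 5)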
Step 4: Summarizing the Data
To complement the plot, I wrote the summarize_40yd_by_position() function, which calculates the average, minimum, and maximum 40-yard dash times for each position. It also counts the number of players in each position. Here’s the code:

summarize_40yd_by_position <- function(data) {
  requireNamespace("dplyr", quietly = TRUE)

  summary_table <- data %>%
    dplyr::group_by(POS) %>%
    dplyr::summarise(
      Avg_40Yd = round(mean(`40 Yard`, na.rm = TRUE), 2),
      Min_40Yd = round(min(`40 Yard`, na.rm = TRUE), 2),
      Max_40Yd = round(max(`40 Yard`, na.rm = TRUE), 2),
      Count = dplyr::n()
    ) %>%
    dplyr::arrange(Avg_40Yd)

  return(summary_table)
}
This function provides a clean summary table, giving a quick overview of the 40-yard dash statistics by position. It’s a great way to identify patterns and outliers.
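Used on the cleaned data from Step 2, it looks like this; knitr::kable() is just one convenient way to render the result as a table in a blog post.

# Positions are ordered fastest to slowest by average 40-yard time
summary_40yd <- summarize_40yd_by_position(combine_clean)
knitr::kable(summary_40yd)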
Step 5: Wrapping Up
Once the functions were ready, I documented the package using Roxygen2 comments. I used the devtools::document() function to generate help files in the man/ folder.
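As an illustration, the Roxygen2 header for the cleaning function looks something like the block below. The exact wording of the title and descriptions is paraphrased, but the tag structure is the standard one that devtools::document() turns into an .Rd file.

#' Clean the NFL Combine Data
#'
#' Removes rows with missing 40-yard dash or vertical leap values and
#' drops the 9.99 placeholder times.
#'
#' @param data A data frame of raw 2019 NFL Combine results.
#' @return The cleaned data frame.
#' @export
clean_combine_data <- function(data) {
  # ... body as shown in Step 2 ...
}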
After that, I added the MIT License to the package using the usethis::use_mit_license() function, and finally, I pushed the package to GitHub using usethis::use_github().
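In command form, the wrap-up looks roughly like this; the copyright-holder name passed to use_mit_license() is optional, and use_git() is only needed if the project isn’t already a git repository.

usethis::use_mit_license("Issaiah Jennings")  # add LICENSE files, update DESCRIPTION
usethis::use_git()                            # initialize git, if not already done
usethis::use_github()                         # create the GitHub repo and push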
Conclusion
In this project, I’ve built an R package to analyze NFL Combine data. I cleaned the data, created visualizations, and summarized key statistics. I also took care to document everything and license the code for open-source use.
By following this process, I gained a deeper understanding of R package development, data cleaning, visualization, and GitHub integration. This project has been a great learning experience, and I hope it’s helpful to others working with sports data or building their own R packages!
What's Next?
Now that the package is complete, I plan to extend it by adding:
More data visualizations (e.g., comparing other player statistics).
Advanced analysis techniques (e.g., clustering players by performance).
You can find the full package on GitHub: https://github.com/Ijennin/combinePredictR/tree/master