Final project - create your own R Package

Issaiah Jennings 

In this blog post, I’ll walk you through how I created an R package, combinePredictR, to analyze the 2019 NFL Combine data. This package includes functions for cleaning, visualizing, and summarizing key player statistics, such as the 40-yard dash time. Along the way, I’ll explain the steps I took, the challenges I encountered, and how I addressed them. 

Step 1: Setting Up the Package

I started by setting up my R package using usethis and devtools, which are great tools for package development in R. Here's a quick rundown of what I did:


- I used the usethis::create_package() function to initialize the basic structure of the package.
- I created a new R script for each function:
  - clean_combine_data.R: for data cleaning.
  - plot_40yd_by_position.R: to plot the 40-yard dash times by player position.
  - summarize_40yd_by_position.R: to create a summary table showing the average, minimum, and maximum times by position.
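The setup above boils down to a few interactive commands, sketched below. These create files on disk, so they are run once from the R console rather than kept in a script:

```r
# One-time package scaffolding, run interactively in the R console
# install.packages(c("usethis", "devtools"))  # if not already installed

usethis::create_package("combinePredictR")  # creates DESCRIPTION, NAMESPACE, R/

# One script per function under R/
usethis::use_r("clean_combine_data")
usethis::use_r("plot_40yd_by_position")
usethis::use_r("summarize_40yd_by_position")
```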

Step 2: Cleaning the Data

The first task was cleaning the raw NFL Combine data. I loaded the dataset from a CSV file and wrote the clean_combine_data() function to clean it. The function removes rows with missing values for key columns like 40-yard dash times and vertical jump heights. I also filtered out the 9.99 values, which were placeholders for missing data.

Here’s the cleaning code. One subtlety: requireNamespace() checks that dplyr is installed but does not attach it, so the `%>%` pipe isn’t available inside the function; calling dplyr::filter() directly avoids that problem.

```r
clean_combine_data <- function(data) {
  requireNamespace("dplyr", quietly = TRUE)
  # Call dplyr::filter() directly -- requireNamespace() does not
  # attach dplyr, so %>% would not be found here
  data_cleaned <- dplyr::filter(
    data,
    !is.na(`40 Yard`),
    !is.na(`Vert Leap (in)`),
    `40 Yard` != 9.99  # Filter out placeholder times
  )
  return(data_cleaned)
}
```


By doing this, I ensured that I was working with accurate, real data to analyze and visualize.
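To see the filter in action, here is a quick check on toy data. The rows below are invented for illustration and mirror the logic of clean_combine_data():

```r
library(dplyr)

# Toy rows mimicking the Combine columns (values invented)
combine <- data.frame(
  Name             = c("A", "B", "C", "D"),
  `40 Yard`        = c(4.52, 9.99, NA, 4.80),
  `Vert Leap (in)` = c(36.5, 33.0, 30.0, NA),
  check.names = FALSE  # preserve the non-syntactic column names
)

# Same filter as clean_combine_data()
cleaned <- dplyr::filter(
  combine,
  !is.na(`40 Yard`),
  !is.na(`Vert Leap (in)`),
  `40 Yard` != 9.99
)
nrow(cleaned)  # 1 -- only player "A" has complete, real measurements
```

Player B is dropped for the 9.99 placeholder, and C and D for missing values.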

Step 3: Visualizing the Data

Next, I created a boxplot to visualize the distribution of 40-yard dash times across player positions. This was done using ggplot2, one of my favorite R visualization packages.

The function plot_40yd_by_position() creates a boxplot for each position (e.g., QB, WR, OT), showing the spread of dash times. Here’s the code for this function:

```r
plot_40yd_by_position <- function(data) {
  requireNamespace("ggplot2", quietly = TRUE)
  requireNamespace("dplyr", quietly = TRUE)
  # Filter out extreme values
  data_filtered <- dplyr::filter(data, `40 Yard` < 7)
  # Order positions explicitly so the x-axis groups related roles together
  data_filtered$POS <- factor(
    data_filtered$POS,
    levels = c("QB", "RB", "WR", "TE", "CB", "S", "LB",
               "DE", "DT", "OT", "OG", "C", "K", "P")
  )
  ggplot2::ggplot(data_filtered,
                  ggplot2::aes(x = POS, y = `40 Yard`, fill = POS)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(
      title = "40-Yard Dash Times by Position",
      x = "Position",
      y = "40-Yard Dash Time (seconds)"
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "none")  # Remove legend
}
```


This function creates a colorful, informative plot that shows how players from different positions compare in their 40-yard dash times.

Step 4: Summarizing the Data

To complement the plot, I wrote the summarize_40yd_by_position() function, which calculates the average, minimum, and maximum 40-yard dash times for each position. It also counts the number of players in each position. Since requireNamespace() doesn’t attach dplyr, the base R pipe `|>` (available since R 4.1) is used instead of `%>%`. Here’s the code:

```r
summarize_40yd_by_position <- function(data) {
  requireNamespace("dplyr", quietly = TRUE)
  # Base pipe |> works without attaching dplyr (unlike %>%)
  summary_table <- data |>
    dplyr::group_by(POS) |>
    dplyr::summarise(
      Avg_40Yd = round(mean(`40 Yard`, na.rm = TRUE), 2),
      Min_40Yd = round(min(`40 Yard`, na.rm = TRUE), 2),
      Max_40Yd = round(max(`40 Yard`, na.rm = TRUE), 2),
      Count = dplyr::n()
    ) |>
    dplyr::arrange(Avg_40Yd)
  return(summary_table)
}
```


This function provides a clean summary table, giving a quick overview of the 40-yard dash statistics by position. It’s a great way to identify patterns and outliers.
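To illustrate the shape of the output, here is the same pipeline run on a handful of invented dash times (not real Combine data):

```r
library(dplyr)

# Toy dash times (values invented for illustration)
times <- data.frame(
  POS       = c("WR", "WR", "QB", "QB"),
  `40 Yard` = c(4.40, 4.50, 4.80, 5.00),
  check.names = FALSE
)

# Same pipeline as summarize_40yd_by_position()
summary_table <- times |>
  dplyr::group_by(POS) |>
  dplyr::summarise(
    Avg_40Yd = round(mean(`40 Yard`, na.rm = TRUE), 2),
    Min_40Yd = round(min(`40 Yard`, na.rm = TRUE), 2),
    Max_40Yd = round(max(`40 Yard`, na.rm = TRUE), 2),
    Count = dplyr::n()
  ) |>
  dplyr::arrange(Avg_40Yd)

summary_table  # WR (avg 4.45) sorts ahead of QB (avg 4.90)
```

Sorting by Avg_40Yd puts the fastest position groups at the top of the table.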

Step 5: Wrapping Up

Once the functions were ready, I documented each one with roxygen2 comments and ran devtools::document() to generate the help files in the man/ folder.
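As an illustration, a roxygen2 header for the cleaning function might look like the sketch below. The tags (@param, @return, @export) are standard roxygen2; the exact wording is mine, not the package’s actual documentation:

```r
#' Clean NFL Combine data
#'
#' Removes rows with missing 40-yard dash or vertical leap values
#' and drops the 9.99 placeholder times.
#'
#' @param data A data frame of raw Combine results.
#' @return A data frame with only complete, real measurements.
#' @export
clean_combine_data <- function(data) {
  requireNamespace("dplyr", quietly = TRUE)
  dplyr::filter(
    data,
    !is.na(`40 Yard`),
    !is.na(`Vert Leap (in)`),
    `40 Yard` != 9.99
  )
}
```

devtools::document() turns these #' comments into .Rd help pages and keeps the NAMESPACE exports in sync.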

After that, I added the MIT License to the package using the usethis::use_mit_license() function, and finally, I pushed the package to GitHub using usethis::use_github().

Conclusion

In this project, I’ve built an R package to analyze NFL Combine data. I cleaned the data, created visualizations, and summarized key statistics. I also took care to document everything and license the code for open-source use.

By following this process, I gained a deeper understanding of R package development, data cleaning, visualization, and GitHub integration. This project has been a great learning experience, and I hope it’s helpful to others working with sports data or building their own R packages!

What's Next?

Now that the package is complete, I plan to extend it by adding:
- More data visualizations (e.g., comparing other player statistics).
- Advanced analysis techniques (e.g., clustering players by performance).


https://github.com/Ijennin/combinePredictR/tree/master
