## Introduction

Matt Chapman is my favorite player, and in the spirit of improving my skills, he will be my next victim in a long string of machine learning endeavors. Since I have yet to put together an ensemble model that I’m satisfied with, my goal is to build one that accurately predicts all of Matt Chapman’s swings in the 2019 season. So without further ado, let’s get into it.

## Data Organization

I pulled together Statcast data from Chapman’s 2019 season. To make it a bit of a challenge, I didn’t want to use anything too highly correlated with swings and misses, so I landed on these seven features for the models: count (state), plate_x, plate_y, release speed, spin rate, pitch type, and pitcher. One-hot encoding was used on all the categorical variables (count, pitch type, and pitcher) so that the models could use them. Although there are only seven features, one-hot encoding expanded the total number of variables to 268. Fortunately for us, swings aren’t a concerning minority class: Chappy takes roughly two thirds of his pitches and swings at the remaining third. To keep the class balance the same in the training and testing datasets, we’ll split the data 70/30 using stratified random sampling, then build 10-fold cross-validation sets from the training data. With 2864 observations and 268 features in the original dataset, there is no shortage of data in the training set.
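
To make the two preparation steps concrete, here is a minimal sketch in plain Python (not the Tidymodels code used in the project). The helper names `one_hot` and `stratified_split`, the seed, and the toy pitches are all hypothetical; only the 70/30 split and the roughly one-third swing rate come from the text above.

```python
import random
from collections import defaultdict


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


def stratified_split(rows, label, train_frac=0.7, seed=2019):
    """Split rows so each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is left untouched
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical toy pitches: 30 swings and 60 takes (one third swings).
swings = [("3-2", "FC", 1), ("1-2", "CH", 1), ("0-2", "FF", 1)] * 10
takes = [("0-0", "SL", 0), ("3-0", "FF", 0), ("2-0", "CU", 0)] * 20
pitches = [{"count": c, "pitch_type": p, "swing": s} for c, p, s in swings + takes]

encoded = one_hot(one_hot(pitches, "count"), "pitch_type")
train, test = stratified_split(encoded, "swing")

print(len(encoded[0]), "columns after encoding")
print(len(train), "training rows,", len(test), "testing rows")
```

Two categorical columns with six and five levels already turn into eleven indicator columns here, which is the same effect that takes seven features to 268 once every pitcher gets a column of his own.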

## EDA

It doesn’t come as too much of a surprise that there’s not much meat to pull off of the statistical bone regarding velocity/spin rate and our target variable, but it’s still important to check. What’s of more interest is exploring our target variable of swings (and no swings) by count, pitch type, and pitcher. In what counts did Matt Chapman swing the most in 2019? Are there any pitches that Chapman swung at the most? Being a patient hitter who draws lots of walks, Chappy is the least likely to swing at the first pitch or in 3-0 counts. In fact, he didn’t swing once at a 3-0 pitch in 2019.

Additionally, he’s the most likely to swing in two-strike counts, particularly full counts. It’s hard to tell if this finding comes from Chapman’s willingness to protect the plate in hopes of drawing a walk, or is a symptom of a hitter who also happens to strike out at a high rate. Pitch type shows less variability; however, we can see that Chapman’s favorite pitches to swing at are four-seam fastballs, cutters, and changeups. Chapman’s swing rate against Marco Gonzales happened to be upwards of 60%, the highest for any pitcher he faced in 2019. What were Gonzales’ most used pitches in 2019? That’s right: the four-seam fastball, cutter, and changeup. This information might seem trivial at best and useless at worst, but any information gathered might assist in building the best model possible.

## Random Forest Model

The data used for this model was normalized, and all predictors with only a single value were dropped (pitchers who only faced Chapman once and threw a single pitch). To save time, the same preprocessing was reused for each of the models.

This model used 200 trees and tuned two hyperparameters: mtry (the number of features randomly sampled at each split) and min_n (the minimum number of data points a node needs to be split further). Mtry was tuned over a range of 10-30 features, and min_n over a range of 2-8 data points. Ideally, I would’ve liked to expand these ranges a bit, particularly for mtry considering how many features we have in our data. However, I bumped up against the computational limits of my computer, a common theme in this project, likely due to the large number of predictors. Since results did not change much after 200 trees, I decided to refrain from tuning the number of trees and instead spent the computational power on the other two hyperparameters. The final model was chosen by optimizing AUC (area under the ROC curve).
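
Since AUC decides which candidate wins, it helps to see what it measures. The sketch below is plain Python with hypothetical labels and scores (the names `roc_auc`, `candidate_a`, and `candidate_b` are made up for illustration). It computes AUC as the chance that a randomly chosen swing gets a higher predicted score than a randomly chosen take, and then keeps the candidate with the higher value.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels (1 = swing) and predicted swing probabilities
# from two hyperparameter candidates.
labels = [1, 0, 1, 0, 0, 1]
candidates = {
    "candidate_a": [0.9, 0.2, 0.6, 0.4, 0.1, 0.7],
    "candidate_b": [0.6, 0.5, 0.3, 0.4, 0.2, 0.8],
}

best = max(candidates, key=lambda name: roc_auc(labels, candidates[name]))
for name, scores in candidates.items():
    print(name, round(roc_auc(labels, scores), 3))
print("selected:", best)
```

In a real tuning run the same comparison would be made on AUC averaged across the cross-validation folds for each mtry and min_n pair, rather than on a single set of six pitches.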

Results from the final model:

| Trees | Mtry | Min_n | AUC | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| 200 | 30 | 5 | 0.857 | 78.48% | 0.763 | 0.550 | 0.909 |

The model did a great job classifying pitches that Chapman didn’t swing at and avoiding false positive predictions (specificity = 0.909). Given that there are more non-swings than swings, we would ideally rather see a high sensitivity or precision than a high specificity. Unfortunately, the model struggles to identify the positive class, correctly flagging only 55% of Chapman’s actual swings (sensitivity = 0.550).
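
For reference, all four of these metrics come straight from the confusion matrix. The sketch below uses hypothetical counts, picked only so the pattern resembles a model that is strong on takes and weak on swings; they are not the project’s actual confusion matrix, and `classification_metrics` is a made-up helper name.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (positive class = swing)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # of predicted swings, how many were swings
        "sensitivity": tp / (tp + fn),  # of actual swings, how many were caught
        "specificity": tn / (tn + fp),  # of actual takes, how many were caught
    }


# Hypothetical counts: 100 actual swings and 200 actual takes.
metrics = classification_metrics(tp=55, fp=20, tn=180, fn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity is divided by the number of actual swings while precision is divided by the number of predicted swings, which is why a model can look fine on one and poor on the other.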

## Logistic Regression Model

For the logistic regression model, I only tuned one hyperparameter: penalty (the total amount of regularization applied to the model), using a grid of 30 values spanning 10^-10 to 10^1. Additionally, natural splines (with 5 degrees of freedom) were added to velocity, spin rate, plate coordinates, all pitch types, as well as six pitchers: Marco Gonzales, Wade Miley, Jose Leclerc, Adrian Sampson, Mike Leake, and Jose Berrios. I would’ve liked to add splines to all the pitchers, but admittedly I couldn’t figure out how to do this with Tidymodels, so instead I added the pitchers that Chapman had the highest swing rates against. I also chose a pure lasso penalty (mixture = 1), where regression coefficients can be reduced to exactly zero (completely eliminated) rather than only shrunk toward zero as in ridge regression. This time the final model will be chosen to maximize precision rather than AUC.
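
Here is a rough sketch of the two ideas in that paragraph: the penalty grid and the difference between lasso and ridge shrinkage. It is plain Python, not the Tidymodels code, and it rests on two assumptions: that the 30 grid values are evenly spaced on a log10 scale, and the textbook special case of a single standardized, uncorrelated predictor, where the lasso reduces to soft-thresholding. The function names are hypothetical.

```python
def log_spaced_grid(low_exp, high_exp, n):
    """n penalty values evenly spaced on a log10 scale (assumed spacing)."""
    step = (high_exp - low_exp) / (n - 1)
    return [10 ** (low_exp + i * step) for i in range(n)]


def lasso_shrink(beta, penalty):
    """Soft-thresholding: small coefficients are set to exactly zero."""
    if beta > penalty:
        return beta - penalty
    if beta < -penalty:
        return beta + penalty
    return 0.0


def ridge_shrink(beta, penalty):
    """Ridge scales a coefficient toward zero but never reaches it."""
    return beta / (1.0 + penalty)


penalties = log_spaced_grid(-10, 1, 30)
print("smallest:", penalties[0], "largest:", penalties[-1])
print("19th value:", penalties[18])

# Hypothetical unpenalized coefficients, one small and one large.
for beta in (0.3, 2.0):
    print(beta, "lasso ->", lasso_shrink(beta, 0.5), "ridge ->", ridge_shrink(beta, 0.5))
```

On this grid the 19th value comes out at about 6.723 × 10^-4, which lines up with the penalty reported for the final model, so log spacing seems like a fair guess.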

Results from the final model:

| Penalty | Mixture | AUC | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| 6.723 × 10^-4 | 1 | 0.802 | 75.59% | 0.809 | 0.840 | 0.574 |

It turns out that we didn’t necessarily need to prioritize precision: had we chosen the final model based on AUC, it would have yielded similar results. Nonetheless, this model, unlike the random forest, is quite good at identifying swings and making accurate predictions. This comes, of course, at the expense of specificity, something that our random forest model can compensate for during the ensemble process.

## Boosted Trees Model

Next up is a boosted trees model. I decided to go back to decision trees considering the promising results obtained using random forest. Three hyperparameters were tuned on our cross-validation sets: mtry, learn rate (how much each new tree’s contribution is scaled, which controls how fast or slow the model learns), and trees. The broad goal at this stage is to increase any and all the metrics we can, but more specifically to increase the rate at which the model predicts Chapman’s swings correctly (sensitivity).
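
To show what the learn rate actually does, here is a toy boosting loop in plain Python. It is a deliberately simplified sketch: it fits one-split stumps to the residuals of 0/1 labels under squared error, which is not the engine or loss used in the project, and the data and helper names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x whose two group means best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, trees, learn_rate):
    """Each round fits a stump to the current residuals and adds a scaled copy of it."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the overall swing rate
    for _ in range(trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical data: distance from the middle of the zone and whether he swung.
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.8]
ys = [1, 1, 0, 1, 0, 0]

for trees in (1, 5, 31):
    print(trees, "trees:", round(mse(ys, boost(xs, ys, trees, learn_rate=0.3)), 4))
```

A smaller learn rate means each tree corrects only part of the remaining error, so more trees are needed to reach the same training fit; that is the tradeoff the trees and learn rate hyperparameters are tuned against.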

| Trees | Mtry | Learn Rate | AUC | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| 31 | 100 | 0.3 | 0.843 | 77.70% | 0.824 | 0.854 | 0.610 |

The boosted trees model might be the best model we’ve produced so far on the resampled data. It seems to combine the best of both worlds from our random forest and logistic regression models, making accurate predictions about Chapman’s swings at the expense of a slightly higher false positive rate.

## Ensemble Model

Now to put them all together. Tidymodels allows for fitting a regularized model on the held-out predictions of each individual model. The stacking coefficients are the weights given to each model’s predictions in order to maximize the desired metrics. The penalty of 0.0001 indicates the amount of regularization, just like the lasso penalty in the logistic regression model. The predictions of the boosted trees model were given the most weight, followed by the random forest and logistic regression models.
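
As a rough picture of how stacking coefficients get used, the sketch below blends three member probabilities through a logistic link. It assumes the blended model is a regularized logistic regression over the members’ predicted swing probabilities; the weights and intercept are hypothetical numbers chosen only to follow the ordering described above (boosted trees, then random forest, then logistic regression), not the fitted values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def stacked_probability(member_probs, weights, intercept):
    """Blend member predictions with stacking weights, then map to a probability."""
    z = intercept + sum(weights[name] * member_probs[name] for name in weights)
    return sigmoid(z)


# Hypothetical stacking coefficients, ordered as in the text.
weights = {"boosted_trees": 2.5, "random_forest": 1.5, "logistic_regression": 1.0}
intercept = -2.5

# Hypothetical member predictions for two pitches.
likely_swing = {"boosted_trees": 0.8, "random_forest": 0.6, "logistic_regression": 0.7}
likely_take = {"boosted_trees": 0.1, "random_forest": 0.1, "logistic_regression": 0.1}

print(round(stacked_probability(likely_swing, weights, intercept), 3))
print(round(stacked_probability(likely_take, weights, intercept), 3))
```

Because the boosted trees weight is the largest, a change in that member’s prediction moves the blended probability more than the same change in either of the other two.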

Results from the ensemble model (on test data):

| Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|
| 77.6% | 0.801 | 0.874 | 0.591 |

The model performed well on the test data, accurately predicting 87% of Chapman’s total swings, the highest sensitivity so far, even compared with the models fitted on the training data. Unfortunately, this comes at the cost of a higher than ideal false positive rate, but a tradeoff had to appear somewhere. It excelled at recognizing Chapman’s tendency to take the first pitch, but this led the model to overpredict first-pitch non-swings, which produced a large number of false negatives. Similarly, our analysis earlier revealed that Chapman is quite likely to swing in two-strike counts, particularly 0-2 counts; again we see an overproduction of positive classes in this count as well as elsewhere.

## Takeaways/Notes

This project was more about creating a quality ensemble model than it was about predicting something meaningful at large. There are things I wanted to improve like the splines (which I’m still determined to figure out fully), as well as accounting for the first pitch false negatives, but for my first official ensemble project I think this was a good first try.

Here are some things I learned from doing this project:

- More experience with Tidymodels and classification algorithms
- Improved model visualization skills
- More experience effectively applying and tuning hyperparameters
- Increased understanding of the tradeoff between specificity and sensitivity and how to create models that compensate for each other when put together
- Although I didn’t know about this until too late in the process, I could have used the step_dummy() function to one-hot encode the data in the model building process instead of the data organization process.
- Instead of focusing on metrics like precision, recall, sensitivity, specificity, etc. to evaluate the models, I could have used PPV (Positive Predictive Value) and NPV (Negative Predictive Value) as additions or possibly substitutes to convey similar information in a more concise way.
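
On that last point, here is a small sketch of how PPV and NPV fall out of the same confusion matrix. The counts are hypothetical and `predictive_values` is a made-up helper name; the only facts relied on are the standard definitions.

```python
def predictive_values(tp, fp, tn, fn):
    """PPV: share of predicted swings that were swings.
    NPV: share of predicted takes that were takes."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Hypothetical counts: 100 actual swings and 200 actual takes.
ppv, npv = predictive_values(tp=80, fp=40, tn=160, fn=20)
print(f"PPV: {ppv:.3f}")
print(f"NPV: {npv:.3f}")
```

PPV is the same quantity as precision, so the new information in this pairing is NPV, which answers the question "when the model says he takes, how often is it right?"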

*As always, the code for this project is located on my GitHub. I would greatly appreciate any feedback; I’m always in the process of learning and getting better!*