Our group consists of Luke Albright and James Manzer for the data science project and presentation. We are exploring historical Major League Baseball statistics to determine which variables correlate with the number of wins a team has during a season, as well as their power to predict how many wins an MLB team will have.
The dataset we are interested in studying involves Major League Baseball statistics. Both of us have a history of playing baseball and are now fans of the Chicago Cubs and Texas Rangers. James is even a member of the club baseball team here at Tulane. Given our passion for baseball and its vast records of statistics, we feel compelled to take a deep dive into this space.
We have identified the Lahman Database as our data source for this project. The Lahman Database is a comprehensive collection of baseball statistics that covers Major League Baseball from the 19th century to the present. It is widely used by researchers, analysts and enthusiasts to analyze various teams and players throughout history. The database is freely available and can be downloaded in various formats. We will be using the CSV version of this data to import into our notebook.
The data we downloaded is divided into three main sections: core, upstream, and contrib. We will focus on data from the core and contrib folders, which include player and team batting, fielding, and pitching statistics as well as salary data. More specifically, we will be looking at aggregated salary, pitching, and hitting data by team and by year, some of which will need to be preprocessed. We believe this trove of information will enable us to identify variables that correlate with the number of wins a franchise has in a given year, as well as predict the number of wins a franchise may have given certain historical information.
We will use historical data from 2004-2016 for our model. The salary data we are using is only available through 2016, which sets the upper bound. We chose 2004 as the starting year because of the league's shift to more modern standards at that time: the steroid era of MLB lasted through the 1990s until league-wide PED testing was implemented in 2003, inflating the statistics of a number of players and teams. Because of this, our data will begin in 2004.
This first block of code imports the necessary libraries for use throughout this project.
# Import the necessary libraries
import zipfile

import pandas as pd
import numpy as np
from numpy import mean, std, absolute
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, make_scorer)
from sklearn.model_selection import (train_test_split, cross_val_score, cross_validate,
                                     RepeatedKFold, ShuffleSplit)
from sklearn.preprocessing import scale, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
For ease of use, we created a public GitHub repository containing the raw zip file from the Lahman website. We will now clone that repository, navigate to the proper directory, and extract the zip file.
# Extract zip file from github
%cd /content
!git clone https://github.com/lalbright22/DataScience2023.git
%cd /content/DataScience2023/
zip_file_path = '/content/DataScience2023/baseballdatabank-2023.1.zip'
extract_dir = '/content/DataScience2023/'
# Create a ZipFile object
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the target directory
    zip_ref.extractall(extract_dir)
Now that the file is unzipped, we will navigate to the contrib sub-folder to read the Salaries CSV file into a pandas dataframe.
%cd baseballdatabank-2023.1/contrib
# Read the Salaries CSV into a pandas dataframe
salaries = pd.read_csv("Salaries.csv")
salaries.head()
/content/DataScience2023/baseballdatabank-2023.1/contrib
| | yearID | teamID | lgID | playerID | salary |
|---|---|---|---|---|---|
| 0 | 1985 | ATL | NL | barkele01 | 870000 | 
| 1 | 1985 | ATL | NL | bedrost01 | 550000 | 
| 2 | 1985 | ATL | NL | benedbr01 | 545000 | 
| 3 | 1985 | ATL | NL | campri01 | 633333 | 
| 4 | 1985 | ATL | NL | ceronri01 | 625000 | 
Once that's done, we will navigate to the core sub-folder that contains the Teams CSV file and read it into a pandas dataframe.
%cd ../core
# Read the Teams CSV into a pandas dataframe
teams = pd.read_csv("Teams.csv")
teams
/content/DataScience2023/baseballdatabank-2023.1/core
| | yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | ... | DP | FP | name | park | attendance | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1871 | NaN | BS1 | BNA | NaN | 3 | 31 | NaN | 20 | 10 | ... | 24 | 0.834 | Boston Red Stockings | South End Grounds I | NaN | 103 | 98 | BOS | BS1 | BS1 | 
| 1 | 1871 | NaN | CH1 | CNA | NaN | 2 | 28 | NaN | 19 | 9 | ... | 16 | 0.829 | Chicago White Stockings | Union Base-Ball Grounds | NaN | 104 | 102 | CHI | CH1 | CH1 | 
| 2 | 1871 | NaN | CL1 | CFC | NaN | 8 | 29 | NaN | 10 | 19 | ... | 15 | 0.818 | Cleveland Forest Citys | National Association Grounds | NaN | 96 | 100 | CLE | CL1 | CL1 | 
| 3 | 1871 | NaN | FW1 | KEK | NaN | 7 | 19 | NaN | 7 | 12 | ... | 8 | 0.803 | Fort Wayne Kekiongas | Hamilton Field | NaN | 101 | 107 | KEK | FW1 | FW1 | 
| 4 | 1871 | NaN | NY2 | NNA | NaN | 5 | 33 | NaN | 16 | 17 | ... | 14 | 0.840 | New York Mutuals | Union Grounds (Brooklyn) | NaN | 90 | 88 | NYU | NY2 | NY2 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 3010 | 2022 | NL | SLN | STL | C | 1 | 162 | 81.0 | 93 | 69 | ... | 181 | 0.989 | St. Louis Cardinals | Busch Stadium III | 3320551.0 | 94 | 94 | STL | SLN | SLN | 
| 3011 | 2022 | AL | TBA | TBD | E | 1 | 162 | 81.0 | 86 | 76 | ... | 110 | 0.985 | Tampa Bay Rays | Tropicana Field | 1128127.0 | 95 | 93 | TBR | TBA | TBA | 
| 3012 | 2022 | AL | TEX | TEX | W | 4 | 162 | 81.0 | 68 | 94 | ... | 143 | 0.984 | Texas Rangers | Globe Life Field | 2011361.0 | 100 | 101 | TEX | TEX | TEX | 
| 3013 | 2022 | AL | TOR | TOR | E | 2 | 162 | 81.0 | 92 | 70 | ... | 120 | 0.986 | Toronto Blue Jays | Rogers Centre | 2653830.0 | 100 | 100 | TOR | TOR | TOR | 
| 3014 | 2022 | NL | WAS | WSN | E | 5 | 162 | 81.0 | 55 | 107 | ... | 126 | 0.982 | Washington Nationals | Nationals Park | 2026401.0 | 94 | 96 | WSN | MON | WAS | 
3015 rows × 48 columns
We will now begin preprocessing our data, beginning with the salaries dataframe, aggregating the salary data for each player by team and year.
# Sum salaries by team and year (league stays the same in all cases) in order to merge with teams dataset
team_salaries = salaries.groupby(["teamID", "yearID"])["salary"].sum().reset_index() #Groups and sums the player salaries by team.
team_salaries
| | teamID | yearID | salary |
|---|---|---|---|
| 0 | ANA | 1997 | 31135472 | 
| 1 | ANA | 1998 | 41281000 | 
| 2 | ANA | 1999 | 55388166 | 
| 3 | ANA | 2000 | 51464167 | 
| 4 | ANA | 2001 | 47535167 | 
| ... | ... | ... | ... | 
| 913 | WAS | 2012 | 80855143 | 
| 914 | WAS | 2013 | 113703270 | 
| 915 | WAS | 2014 | 131983680 | 
| 916 | WAS | 2015 | 155587472 | 
| 917 | WAS | 2016 | 141652646 | 
918 rows × 3 columns
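As a quick spot check (an illustrative addition, not part of the original workflow), we can confirm the aggregation by comparing one team-year sum against the raw salary rows; this assumes the 1985 Braves rows shown in the head above are present.
# Spot check: the grouped sum for one team/year should equal the sum of its raw salary rows
raw_sum = salaries[(salaries['teamID'] == 'ATL') & (salaries['yearID'] == 1985)]['salary'].sum()
grouped_sum = team_salaries[(team_salaries['teamID'] == 'ATL') & (team_salaries['yearID'] == 1985)]['salary'].iloc[0]
print(raw_sum == grouped_sum)  # Expect True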
Now we will calculate a few new variables from the teams dataframe, including run differential (differential), batting average (AVG), on base percentage (OBP), singles (1B) and slugging (SLG).
# Calculate run differential (runs scored minus runs allowed)
teams["differential"] = teams['R'] - teams['RA']
# Calculate batting average by the formula below
teams["AVG"] = (teams['H']) / (teams['AB'])
# Calculate on base percentage by the formula below
teams["OBP"] = (teams['H'] + teams['BB'] + teams['HBP']) / (teams['AB'] + teams['BB'] + teams['HBP'] + teams['SF'])
# Calculate singles by the formula below
teams["1B"] = teams['H'] - teams['2B'] - teams['3B'] - teams['HR']
# Calculate slugging percentage by the formula below
teams["SLG"] = (teams['1B'] + 2*teams['2B'] + 3*teams['3B'] + 4*teams['HR'])/teams['AB']
teams
| | yearID | lgID | teamID | franchID | divID | Rank | G | Ghome | W | L | ... | BPF | PPF | teamIDBR | teamIDlahman45 | teamIDretro | differential | AVG | OBP | 1B | SLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1871 | NaN | BS1 | BNA | NaN | 3 | 31 | NaN | 20 | 10 | ... | 103 | 98 | BOS | BS1 | BS1 | 98 | 0.310496 | NaN | 316 | 0.422012 | 
| 1 | 1871 | NaN | CH1 | CNA | NaN | 2 | 28 | NaN | 19 | 9 | ... | 104 | 102 | CHI | CH1 | CH1 | 61 | 0.270067 | NaN | 240 | 0.373746 | 
| 2 | 1871 | NaN | CL1 | CFC | NaN | 8 | 29 | NaN | 10 | 19 | ... | 96 | 100 | CLE | CL1 | CL1 | -92 | 0.276560 | NaN | 246 | 0.391231 | 
| 3 | 1871 | NaN | FW1 | KEK | NaN | 7 | 19 | NaN | 7 | 12 | ... | 101 | 107 | KEK | FW1 | FW1 | -106 | 0.238606 | NaN | 149 | 0.293566 | 
| 4 | 1871 | NaN | NY2 | NNA | NaN | 5 | 33 | NaN | 16 | 17 | ... | 90 | 88 | NYU | NY2 | NY2 | -11 | 0.287037 | NaN | 338 | 0.349715 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 3010 | 2022 | NL | SLN | STL | C | 1 | 162 | 81.0 | 93 | 69 | ... | 94 | 94 | STL | SLN | SLN | 135 | 0.252183 | 0.325268 | 878 | 0.420124 | 
| 3011 | 2022 | AL | TBA | TBD | E | 1 | 162 | 81.0 | 86 | 76 | ... | 95 | 93 | TBR | TBA | TBA | 52 | 0.239098 | 0.308500 | 842 | 0.377125 | 
| 3012 | 2022 | AL | TEX | TEX | W | 4 | 162 | 81.0 | 68 | 94 | ... | 100 | 101 | TEX | TEX | TEX | -36 | 0.238773 | 0.300881 | 866 | 0.395400 | 
| 3013 | 2022 | AL | TOR | TOR | E | 2 | 162 | 81.0 | 92 | 70 | ... | 100 | 100 | TOR | TOR | TOR | 96 | 0.263546 | 0.328667 | 945 | 0.431143 | 
| 3014 | 2022 | NL | WAS | WSN | E | 5 | 162 | 81.0 | 55 | 107 | ... | 94 | 96 | WSN | MON | WAS | -252 | 0.248620 | 0.310229 | 943 | 0.377438 | 
3015 rows × 53 columns
Here we will drop a few unnecessary columns from the dataframe that will not be used in our model.
# Dropping unnecessary columns
teams.drop(columns=["teamIDBR", "teamIDlahman45", "teamIDretro", "IPouts"], inplace=True)
Here we will merge the two dataframes on team and year, resulting in our final baseball dataframe that will be used in our model.
# This results in a dataframe with statistics for each team and year, including total salary, wins, attendance, batting and pitching park factors, and more.
baseball = pd.merge(team_salaries, teams, on=["teamID", "yearID"])
Lastly, we will filter the yearID column of the dataframe to include only the years 2004-2016. We can confirm that our data preprocessing is correct because there are 30 teams, and 30 teams * 13 years = 390 rows, as shown below.
baseball = baseball[(baseball['yearID'] >= 2004) & (baseball['yearID'] <= 2016)]
baseball
| | teamID | yearID | salary | lgID | franchID | divID | Rank | G | Ghome | W | ... | name | park | attendance | BPF | PPF | differential | AVG | OBP | 1B | SLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | ANA | 2004 | 100534667 | AL | ANA | W | 1 | 162 | 81.0 | 92 | ... | Anaheim Angels | Angels Stadium of Anaheim | 3375677.0 | 97 | 97 | 102 | 0.282467 | 0.340760 | 1132 | 0.429075 | 
| 14 | ARI | 2004 | 69780750 | NL | ARI | W | 5 | 162 | 81.0 | 51 | ... | Arizona Diamondbacks | Bank One Ballpark | 2519560.0 | 105 | 107 | -284 | 0.252706 | 0.309889 | 933 | 0.392677 | 
| 15 | ARI | 2005 | 62329166 | NL | ARI | W | 2 | 162 | 81.0 | 77 | ... | Arizona Diamondbacks | Bank One Ballpark | 2059424.0 | 103 | 105 | -160 | 0.255676 | 0.332481 | 910 | 0.421081 | 
| 16 | ARI | 2006 | 59684226 | NL | ARI | W | 4 | 162 | 81.0 | 76 | ... | Arizona Diamondbacks | Chase Field | 2091685.0 | 105 | 105 | -15 | 0.266785 | 0.331313 | 977 | 0.423915 | 
| 17 | ARI | 2007 | 52067546 | NL | ARI | W | 1 | 162 | 81.0 | 90 | ... | Arizona Diamondbacks | Chase Field | 2325249.0 | 107 | 107 | -20 | 0.250093 | 0.320761 | 853 | 0.412931 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 913 | WAS | 2012 | 80855143 | NL | WSN | E | 1 | 162 | 81.0 | 98 | ... | Washington Nationals | Nationals Park | 2370794.0 | 99 | 101 | 137 | 0.261443 | 0.322152 | 948 | 0.427605 | 
| 914 | WAS | 2013 | 113703270 | NL | WSN | E | 2 | 162 | 81.0 | 86 | ... | Washington Nationals | Nationals Park | 2652422.0 | 102 | 101 | 30 | 0.251104 | 0.312594 | 918 | 0.397535 | 
| 915 | WAS | 2014 | 131983680 | NL | WSN | E | 1 | 162 | 81.0 | 96 | ... | Washington Nationals | Nationals Park | 2579389.0 | 104 | 102 | 131 | 0.253158 | 0.320988 | 959 | 0.392999 | 
| 916 | WAS | 2015 | 155587472 | NL | WSN | E | 2 | 162 | 81.0 | 83 | ... | Washington Nationals | Nationals Park | 2619843.0 | 102 | 99 | 68 | 0.251105 | 0.321016 | 908 | 0.402542 | 
| 917 | WAS | 2016 | 141652646 | NL | WSN | E | 1 | 162 | 81.0 | 95 | ... | Washington Nationals | Nationals Park | 2481938.0 | 100 | 98 | 151 | 0.255556 | 0.325532 | 903 | 0.425865 | 
390 rows × 50 columns
baseball.columns
Index(['teamID', 'yearID', 'salary', 'lgID', 'franchID', 'divID', 'Rank', 'G',
       'Ghome', 'W', 'L', 'DivWin', 'WCWin', 'LgWin', 'WSWin', 'R', 'AB', 'H',
       '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER',
       'ERA', 'CG', 'SHO', 'SV', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP',
       'name', 'park', 'attendance', 'BPF', 'PPF', 'differential', 'AVG',
       'OBP', '1B', 'SLG'],
      dtype='object')
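As an optional sanity check (an addition for illustration), we can verify the 30 teams × 13 years structure directly:
# Sanity check: 13 seasons, each with 30 teams, for 390 rows total
print(baseball['yearID'].nunique(), "seasons")          # Expect 13
print(baseball.groupby('yearID')['teamID'].nunique())   # Expect 30 for every year
print(len(baseball), "total rows")                      # Expect 390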
Now that the data is preprocessed, we will visualize it and perform exploratory data analysis to identify trends and correlations and begin to form a hypothesis about which variables correlate most strongly with wins.
baseball_nl = baseball[baseball['lgID'] == 'NL']
baseball_al = baseball[baseball['lgID'] == 'AL']
# Plot for National League
plt.figure(figsize=(12, 8)) #Adjust the figure size
sns.scatterplot(data=baseball_nl, x='yearID', y='W', hue='teamID', palette='plasma', edgecolor='w', s=100)
plt.title('W by NL Teams Each Year')
plt.xlabel('Year')
plt.ylabel('W')
# Add a horizontal line at the .500 mark (81 wins in a 162-game season)
plt.axhline(81, color='green', linestyle='--', label='.500 mark (81)')
plt.legend(title='teamID', bbox_to_anchor=(1.05, 1), loc='upper left')  # Adjust legend position
plt.show()
# Plot for American League
plt.figure(figsize=(12, 8)) #Adjust the figure size
sns.scatterplot(data=baseball_al, x='yearID', y='W', hue='teamID', palette='plasma', edgecolor='w', s=100)
plt.title('W by AL Teams Each Year')
plt.xlabel('Year')
plt.ylabel('W')
# Add a horizontal line at the .500 mark (81 wins in a 162-game season)
plt.axhline(81, color='green', linestyle='--', label='.500 mark (81)')
plt.legend(title='teamID', bbox_to_anchor=(1.05, 1), loc='upper left')  # Adjust legend position
plt.show()
The scatterplots above show the number of wins (our dependent variable) each team had per year, split by NL and AL. This gives us an idea of the distribution of the number of wins a team has in a given year. As the data shows, the distribution appears to have a relatively large standard deviation. Let's take a closer look.
wins_std_by_league_and_year = baseball.groupby(['lgID', 'yearID'])['W'].std()
print("Standard Deviation of Wins by League and Year:")
print(wins_std_by_league_and_year)
Standard Deviation of Wins by League and Year:
lgID  yearID
AL    2004      13.473621
      2005      13.070393
      2006      12.000000
      2007      10.885266
      2008      11.338207
      2009      12.162137
      2010      11.640211
      2011      11.432795
      2012      10.967484
      2013      13.704361
      2014       9.927355
      2015       7.225616
      2016      10.511672
NL    2004      14.051690
      2005       8.830817
      2006       8.035079
      2007       7.957177
      2008      11.002841
      2009      11.094856
      2010      10.776518
      2011      11.763645
      2012      12.992305
      2013      11.093542
      2014       9.523405
      2015      13.097801
      2016      11.089678
Name: W, dtype: float64
The standard deviation of wins by league and year is usually around 11-12 wins. There appears to be no consistent trend in the variation over time.
Heatmaps are a helpful way of visualizing correlation between a number of variables. Below are 4 heatmaps, all of which include wins as the first row, containing all of the variables in our dataframe and their correlations.
sns.heatmap(baseball[['W', 'BPF', 'R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
From the heatmap above, runs, home runs, and walks have positive correlations with wins, while strikeouts and, oddly, triples have negative correlations with wins. These all make sense apart from triples; one possible explanation is that, unlike home runs, a triple does not score the batter and is therefore less valuable.
sns.heatmap(baseball[['W', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER', 'CG', 'SHO']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
From the heatmap above, shutouts and sacrifice flies are the variables most positively correlated with wins, which makes sense since a shutout guarantees a win in that game and sacrifice flies score runs. Runs against and earned runs both have strong negative correlations with wins, which also makes sense, since allowing a high number of runs, earned or not, does not bode well for a team's winning prospects.
sns.heatmap(baseball[['W', 'SV', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP', 'attendance']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
From the heatmap above, saves, attendance, fielding percentage, and strikeouts by the pitching staff (SOA) are all positively correlated with wins. These all make sense, and while attendance is not an in-game statistic, it stands to reason that winning teams draw larger crowds. Hits allowed, walks allowed, errors, and home runs allowed are negatively correlated with wins, which also makes sense.
sns.heatmap(baseball[['W', 'PPF', 'differential', 'AVG', 'OBP', '1B', 'SLG', 'salary']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
In our last heatmap above, run differential is highly positively correlated with wins, as teams that score more runs than they allow tend to win more often. On-base percentage, slugging percentage, salary, and batting average are also positively correlated with wins.
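Rather than reading the four heatmaps separately, we can also rank every numeric variable by the strength of its correlation with wins in one step. This is a small convenience sketch using the same baseball dataframe, not part of our original analysis:
# Rank all numeric columns by the absolute strength of their correlation with wins
corr_with_wins = (baseball.select_dtypes(include='number')
                          .corr()['W']
                          .drop('W')
                          .sort_values(key=abs, ascending=False))
print(corr_with_wins.head(10))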
Next, we will look at regression plots of wins against the variables we identified as important from our heatmap visualizations. For each graph, we use seaborn's regplot function, plotting wins on the y-axis and the selected variable on the x-axis. We also run a simple linear regression of wins on each selected variable, giving us an idea of how statistically significant that variable is with respect to wins in isolation. Because the data is not yet standardized, the coefficients themselves are not directly comparable.
diff_v_win = sns.regplot(x='differential', y='W', data=baseball, scatter_kws={'s':10})
diff_v_win.set(xlabel='Run Differential', ylabel='Wins')
mod = smf.ols(formula='W ~ differential', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.858
Method:                 Least Squares   F-statistic:                     2343.
Date:                Fri, 15 Dec 2023   Prob (F-statistic):          1.68e-166
Time:                        05:39:20   Log-Likelihood:                -1105.0
No. Observations:                 390   AIC:                             2214.
Df Residuals:                     388   BIC:                             2222.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       80.9795      0.209    387.757      0.000      80.569      81.390
differential     0.1008      0.002     48.405      0.000       0.097       0.105
==============================================================================
Omnibus:                        6.362   Durbin-Watson:                   1.819
Prob(Omnibus):                  0.042   Jarque-Bera (JB):                6.148
Skew:                           0.288   Prob(JB):                       0.0462
Kurtosis:                       3.216   Cond. No.                         100.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
on_base = sns.regplot(x='OBP', y='W', data=baseball, scatter_kws={'s':10})
on_base.set(xlabel='On Base Percentage', ylabel='Wins')
mod = smf.ols(formula='W ~ OBP', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.214
Model:                            OLS   Adj. R-squared:                  0.212
Method:                 Least Squares   F-statistic:                     105.9
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           4.00e-22
Time:                        05:39:20   Log-Likelihood:                -1438.4
No. Observations:                 390   AIC:                             2881.
Df Residuals:                     388   BIC:                             2889.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -37.8732     11.559     -3.277      0.001     -60.599     -15.147
OBP          364.7135     35.438     10.292      0.000     295.039     434.388
==============================================================================
Omnibus:                        9.409   Durbin-Watson:                   1.224
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                6.149
Skew:                          -0.156   Prob(JB):                       0.0462
Kurtosis:                       2.470   Cond. No.                         79.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
sns.regplot(x='salary', y='W', data=baseball, scatter_kws={'s':20})
This regression plot shows the relationship between total team salary and wins for a given team in a given year. There is a slight positive trend and little correlation between the two variables.
mod = smf.ols(formula='W ~ salary', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.136
Model:                            OLS   Adj. R-squared:                  0.134
Method:                 Least Squares   F-statistic:                     60.95
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           5.51e-14
Time:                        05:39:21   Log-Likelihood:                -1457.0
No. Observations:                 390   AIC:                             2918.
Df Residuals:                     388   BIC:                             2926.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     71.8865      1.274     56.447      0.000      69.383      74.390
salary      9.759e-08   1.25e-08      7.807      0.000     7.3e-08    1.22e-07
==============================================================================
Omnibus:                       17.415   Durbin-Watson:                   1.332
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.623
Skew:                          -0.142   Prob(JB):                       0.0134
Kurtosis:                       2.329   Cond. No.                     2.52e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.52e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
sns.regplot(x='BPF', y='W', data=baseball, scatter_kws={'s':20})
This regression plot shows the relationship between batting park factor (the tendency for batters to hit above average in a given ballpark) and wins for a given team in a given year. There is no clear trend or correlation between the two variables.
mod = smf.ols(formula='W ~ BPF', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     3.365
Date:                Fri, 15 Dec 2023   Prob (F-statistic):             0.0674
Time:                        05:39:21   Log-Likelihood:                -1483.8
No. Observations:                 390   AIC:                             2972.
Df Residuals:                     388   BIC:                             2980.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     60.9305     10.944      5.568      0.000      39.414      82.447
BPF            0.2002      0.109      1.834      0.067      -0.014       0.415
==============================================================================
Omnibus:                       20.655   Durbin-Watson:                   1.156
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               11.233
Skew:                          -0.238   Prob(JB):                      0.00364
Kurtosis:                       2.319   Cond. No.                     1.99e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.99e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
ppf_v_win = sns.regplot(x='PPF', y='W', data=baseball, scatter_kws={'s':20})
ppf_v_win.set(xlabel='Pitching Park Factor', ylabel='Wins')
This regression plot shows the relationship between pitching park factor (the tendency for hitters to hit below average in a given ballpark) and wins for a given team in a given year. There is a slight negative trend and little correlation between the two variables.
mod = smf.ols(formula='W ~ PPF', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.6023
Date:                Fri, 15 Dec 2023   Prob (F-statistic):              0.438
Time:                        05:39:22   Log-Likelihood:                -1485.2
No. Observations:                 390   AIC:                             2974.
Df Residuals:                     388   BIC:                             2982.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     89.6310     11.161      8.031      0.000      67.687     111.575
PPF           -0.0864      0.111     -0.776      0.438      -0.305       0.132
==============================================================================
Omnibus:                       23.562   Durbin-Watson:                   1.150
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               11.214
Skew:                          -0.203   Prob(JB):                      0.00367
Kurtosis:                       2.275   Cond. No.                     2.02e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
sns.regplot(x='attendance', y='W', data=baseball, scatter_kws={'s':20})
This regression plot shows the relationship between attendance and wins for a given team in a given year. There is a positive trend and some correlation between the two variables.
mod = smf.ols(formula='W ~ attendance', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.232
Model:                            OLS   Adj. R-squared:                  0.230
Method:                 Least Squares   F-statistic:                     117.5
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           4.31e-24
Time:                        05:39:22   Log-Likelihood:                -1433.9
No. Observations:                 390   AIC:                             2872.
Df Residuals:                     388   BIC:                             2880.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     61.3386      1.876     32.698      0.000      57.650      65.027
attendance  7.885e-06   7.27e-07     10.839      0.000    6.45e-06    9.31e-06
==============================================================================
Omnibus:                        3.926   Durbin-Watson:                   1.319
Prob(Omnibus):                  0.140   Jarque-Bera (JB):                3.400
Skew:                          -0.145   Prob(JB):                        0.183
Kurtosis:                       2.646   Cond. No.                     9.97e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.97e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
sns.regplot(x='FP', y='W', data=baseball, scatter_kws={'s':20})
This regression plot shows the relationship between fielding percentage and wins for a given team in a given year. There is a positive trend and little correlation between the two variables.
mod = smf.ols(formula='W ~ FP', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.156
Model:                            OLS   Adj. R-squared:                  0.154
Method:                 Least Squares   F-statistic:                     71.68
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           5.26e-16
Time:                        05:39:23   Log-Likelihood:                -1452.4
No. Observations:                 390   AIC:                             2909.
Df Residuals:                     388   BIC:                             2917.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -1633.7785    202.532     -8.067      0.000   -2031.976   -1235.581
FP          1743.1308    205.883      8.467      0.000    1338.346    2147.916
==============================================================================
Omnibus:                       14.490   Durbin-Watson:                   1.191
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                7.067
Skew:                          -0.069   Prob(JB):                       0.0292
Kurtosis:                       2.355   Cond. No.                         796.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The linear regression outputs point to moderately strong positive and negative correlations between wins and the selected variables. Also of note, several of the outputs warn of strong multicollinearity, which makes sense: from our heatmaps, many of the independent variables are highly correlated with one another, which produces multicollinearity. For our model, it may be best to try alternatives to ordinary linear regression.
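To put a number on that multicollinearity, one option (an addition for illustration, not part of our original analysis) is the variance inflation factor from statsmodels. The sketch below checks an illustrative handful of offensive variables from the baseball dataframe; values well above 10 suggest a variable is largely explained by the others.
# Rough multicollinearity check using variance inflation factors (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
check_cols = ['R', 'H', 'HR', 'BB', 'AVG', 'OBP', 'SLG']  # illustrative subset of features
X_vif = sm.add_constant(baseball[check_cols].astype(float))
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                index=X_vif.columns, name='VIF')
print(vif.drop('const'))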
For the modeling section of our tutorial, we will focus on ridge and lasso regressions to evaluate our data. As shown in the exploratory data analysis, many of the independent variables in our dataset exhibit multicollinearity, which makes the individual coefficients of correlated variables hard to interpret. Because we want to determine which variables are the most important in predicting wins, we will not use ordinary linear regression for our model.
We will use ridge regression because it adds a penalty term to the linear regression objective that controls the size of the coefficients. The penalty term is the sum of the squared coefficients multiplied by a regularization parameter. With a long list of correlated features, overfitting is a concern, so a more constrained model is preferable.
Similarly, we will use lasso regression as an alternative to ridge in order to create a sparser model. It is similar to ridge, but instead of penalizing the sum of squared coefficients, it penalizes the sum of the absolute values of the coefficients. Lasso therefore tends to produce sparse models by setting some coefficients exactly to zero, effectively performing feature selection.
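As a small illustration of the difference between the two penalties (a sketch on an arbitrary subset of columns, not our final model), the snippet below fits ridge and lasso on standardized features over a range of alpha values and reports how many coefficients lasso zeroes out as the penalty grows:
# Sketch: how the ridge (L2) and lasso (L1) penalties shrink coefficients as alpha grows
demo_cols = ['R', 'RA', 'OBP', 'SLG', 'salary']  # illustrative subset, not the final feature set
X_demo = StandardScaler().fit_transform(baseball[demo_cols])
y_demo = baseball['W']
for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_zero = (lasso_demo.coef_ == 0).sum()
    print(f"alpha={alpha:>5}: ridge |coef| sum = {abs(ridge_demo.coef_).sum():.2f}, "
          f"lasso |coef| sum = {abs(lasso_demo.coef_).sum():.2f}, "
          f"lasso zeroed {n_zero}/{len(demo_cols)} coefficients")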
Below is our entire model. The steps are explained in comments, and we analyze the output afterward.
# Select relevant features
features = ['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER', 'CG', 'SHO', 'SV', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP', 'attendance', 'BPF', 'PPF', 'differential', 'AVG', 'OBP', '1B', 'SLG', 'salary']
# Split the data into features (X) and the target variable (y)
X = baseball[features]
y = baseball['W']
# Shuffle X and y to avoid data being grouped by year
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)
# Split the shuffled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_shuffled, y_shuffled, test_size=0.2, random_state=42)
# Standardizing the features, a good all-purpose scaling method
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Lasso Regression with cross-validation
lasso_model = Lasso(alpha=1.0)
lasso_cv_results = cross_validate(lasso_model, X_train_scaled, y_train, cv=5, scoring={'mse': make_scorer(mean_squared_error), 'mae': make_scorer(mean_absolute_error), 'mape': 'neg_mean_absolute_percentage_error'}, return_train_score=True)
# Ridge Regression with cross-validation
ridge_model = Ridge(alpha=1.0)
ridge_cv_results = cross_validate(ridge_model, X_train_scaled, y_train, cv=5, scoring={'mse': make_scorer(mean_squared_error), 'mae': make_scorer(mean_absolute_error), 'mape': 'neg_mean_absolute_percentage_error'}, return_train_score=True)
# Outputting cross-validation results
print("Lasso Regression Cross-Validation Results:")
print("Train MSE:", lasso_cv_results['train_mse'])
print("Test MSE:", lasso_cv_results['test_mse'])
print("Train MAE:", lasso_cv_results['train_mae'])
print("Test MAE:", lasso_cv_results['test_mae'])
print("Train MAPE:", -lasso_cv_results['train_mape'].mean())
print("Test MAPE:", -lasso_cv_results['test_mape'].mean())
print("\nRidge Regression Cross-Validation Results:")
print("Train MSE:", ridge_cv_results['train_mse'])
print("Test MSE:", ridge_cv_results['test_mse'])
print("Train MAE:", ridge_cv_results['train_mae'])
print("Test MAE:", ridge_cv_results['test_mae'])
print("Train MAPE:", -ridge_cv_results['train_mape'].mean())
print("Test MAPE:", -ridge_cv_results['test_mape'].mean())
# Fit final models on the entire training set
lasso_model.fit(X_train_scaled, y_train)
ridge_model.fit(X_train_scaled, y_train)
# Making predictions on the test set
y_pred_lasso = lasso_model.predict(X_test_scaled)
y_pred_ridge = ridge_model.predict(X_test_scaled)
# Evaluating performance on the test set
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
# Approximate MAPE as MAE relative to the mean number of wins, expressed in percent
mape_lasso = mean_absolute_error(y_test, y_pred_lasso) / y_test.mean() * 100
mape_ridge = mean_absolute_error(y_test, y_pred_ridge) / y_test.mean() * 100
# Outputting additional metrics
print("\nLasso Regression Metrics on Test Set:")
print("Test MSE:", mse_lasso)
print("Test MAE:", mae_lasso)
print("Test MAPE:", mape_lasso)
print("\nRidge Regression Metrics on Test Set:")
print("Test MSE:", mse_ridge)
print("Test MAE:", mae_ridge)
print("Test MAPE:", mape_ridge)
# Calculating coefficients
lasso_coefficients = pd.Series(lasso_model.coef_, index=features, name='Lasso Coefficients')
ridge_coefficients = pd.Series(ridge_model.coef_, index=features, name='Ridge Coefficients')
# Outputting the coefficients
print("\nLasso Regression Coefficients:")
print(lasso_coefficients)
print("\nRidge Regression Coefficients:")
print(ridge_coefficients)
# Plotting the coefficients
plt.figure(figsize=(12, 6))
lasso_coefficients.plot(kind='bar', color='blue', alpha=0.7, label='Lasso Regression')
ridge_coefficients.plot(kind='bar', color='orange', alpha=0.7, label='Ridge Regression')
plt.title('Lasso vs. Ridge Regression Coefficients')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.legend()
plt.show()
Lasso Regression Cross-Validation Results:
Train MSE: [11.20666699 11.67411654 11.04421093 11.68885514 10.69664513]
Test MSE: [11.39210868 9.02284629 11.96660232 11.99680211 13.39725325]
Train MAE: [2.67132948 2.77871331 2.67748195 2.7707325 2.66335372]
Test MAE: [2.85383271 2.39444269 2.85633994 2.76982123 2.88783674]
Train MAPE: 0.03444431452354736
Test MAPE: 0.03493001884604956

Ridge Regression Cross-Validation Results:
Train MSE: [7.74448574 8.35934331 7.3459862 7.93124754 7.45480217]
Test MSE: [9.84239331 7.10438773 11.46098082 8.8855374 11.69348803]
Train MAE: [2.206153 2.27575906 2.09535439 2.2363997 2.16573858]
Test MAE: [2.43717243 2.09853272 2.76548542 2.31515811 2.69512577]
Train MAPE: 0.02773242061706115
Test MAPE: 0.031060612906844226

Lasso Regression Metrics on Test Set:
Test MSE: 13.25365719237693
Test MAE: 2.933882513923934
Test MAPE: 3.6301211308068986

Ridge Regression Metrics on Test Set:
Test MSE: 10.203421192219395
Test MAE: 2.56765965016351
Test MAPE: 3.1769900493774395

Lasso and Ridge Regression Coefficients:

| Feature | Lasso Coefficient | Ridge Coefficient |
|---|---|---|
| R | 0.000000 | 3.403701 |
| AB | 0.000000 | -0.366015 |
| H | 0.000000 | -0.056132 |
| 2B | 0.000000 | -0.171711 |
| 3B | -0.000000 | -0.474057 |
| HR | 0.000000 | 0.078274 |
| BB | 0.000000 | 0.389861 |
| SO | -0.000000 | -0.183055 |
| SB | 0.000000 | 0.343619 |
| CS | -0.000000 | -0.123511 |
| HBP | 0.000000 | 0.174036 |
| SF | 0.000000 | -0.157728 |
| RA | -0.000000 | -0.769355 |
| ER | -0.000000 | -1.867679 |
| CG | 0.000000 | 0.390758 |
| SHO | 0.000000 | 0.655094 |
| SV | 2.121406 | 2.912473 |
| HA | -0.000000 | 0.200086 |
| HRA | -0.000000 | -0.363844 |
| BBA | -0.000000 | -0.490196 |
| SOA | 0.000000 | 0.633574 |
| E | -0.000000 | 0.325810 |
| DP | -0.000000 | -0.017997 |
| FP | 0.000000 | 0.701568 |
| attendance | 0.058465 | 0.614488 |
| BPF | -0.000000 | -0.305166 |
| PPF | -0.000000 | -0.003891 |
| differential | 8.178797 | 3.277630 |
| AVG | 0.000000 | 0.598614 |
| OBP | 0.000000 | 0.100328 |
| 1B | 0.000000 | 0.027911 |
| SLG | 0.000000 | 0.617461 |
| salary | 0.000000 | -0.193715 |
A large block of code and an equally large output! Let's dive in and see what our model has discovered.
To recap, we used sklearn's ridge and lasso regression functions to create two models that relate a long list of independent variables from the Lahman baseball database to wins.
We scaled the features using standardization, which transforms each feature to have a mean of 0 and a standard deviation of 1. We then shuffled the data so the train/test split is not grouped by year, and split it into training and testing sets. We used cross-validation to get a more accurate picture of each model's tendency to over- or underfit, and got the following results:
print("Lasso Regression Cross-Validation Results:")
print("Train MSE:", lasso_cv_results['train_mse'])
print("Test MSE:", lasso_cv_results['test_mse'])
print("Train MAE:", lasso_cv_results['train_mae'])
print("Test MAE:", lasso_cv_results['test_mae'])
print("Train MAPE:", -lasso_cv_results['train_mape'].mean())
print("Test MAPE:", -lasso_cv_results['test_mape'].mean())
print("\nRidge Regression Cross-Validation Results:")
print("Train MSE:", ridge_cv_results['train_mse'])
print("Test MSE:", ridge_cv_results['test_mse'])
print("Train MAE:", ridge_cv_results['train_mae'])
print("Test MAE:", ridge_cv_results['test_mae'])
print("Train MAPE:", -ridge_cv_results['train_mape'].mean())
print("Test MAPE:", -ridge_cv_results['test_mape'].mean())
Lasso Regression Cross-Validation Results:
Train MSE: [11.20666699 11.67411654 11.04421093 11.68885514 10.69664513]
Test MSE: [11.39210868 9.02284629 11.96660232 11.99680211 13.39725325]
Train MAE: [2.67132948 2.77871331 2.67748195 2.7707325 2.66335372]
Test MAE: [2.85383271 2.39444269 2.85633994 2.76982123 2.88783674]
Train MAPE: 0.03444431452354736
Test MAPE: 0.03493001884604956

Ridge Regression Cross-Validation Results:
Train MSE: [7.74448574 8.35934331 7.3459862 7.93124754 7.45480217]
Test MSE: [9.84239331 7.10438773 11.46098082 8.8855374 11.69348803]
Train MAE: [2.206153 2.27575906 2.09535439 2.2363997 2.16573858]
Test MAE: [2.43717243 2.09853272 2.76548542 2.31515811 2.69512577]
Train MAPE: 0.02773242061706115
Test MAPE: 0.031060612906844226
As we can see above, ridge appears to be more accurate than lasso in predicting the number of wins an MLB team has. Both are quite accurate, however, with test MAPE just above 3%. The models come within a few games of a team's actual win total, given data about its entire season.
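Note that the test-set MAPE figures in the block above were computed as MAE relative to mean wins. As a cross-check, sklearn's mean_absolute_percentage_error (already imported) computes the per-team percentage error directly; this assumes the fitted models and test split from the modeling block are still in scope.
# Exact MAPE on the held-out test set, expressed in percent
print("Lasso test MAPE:", mean_absolute_percentage_error(y_test, y_pred_lasso) * 100)
print("Ridge test MAPE:", mean_absolute_percentage_error(y_test, y_pred_ridge) * 100)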
Moving on, we will look at the coefficients of each model. The lasso model, which is the sparsest, takes only a few variables into account: saves (2.121406), attendance (0.058465), and run differential (8.178797). Run differential is clearly the most important variable in terms of its relationship with a team's wins. The ridge regression model has coefficients for every variable, but its most significant are run differential (3.277630), runs (3.403701), earned runs against (-1.867679), and saves (2.912473). This agrees with our lasso model, as runs and earned runs against are components of run differential, and saves matter because a save can only be recorded in a win.
The graph above displays the coefficients of each variable for both models, overlaid to compare their magnitudes. It highlights lasso's tendency to produce sparser models and supports the same conclusions as the coefficient values printed above.
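To make the comparison easier to read, a short optional follow-up is to list each model's largest coefficients by magnitude, rebuilding the coefficient Series from the fitted models above:
# Largest-magnitude coefficients from each fitted model
lasso_coefs = pd.Series(lasso_model.coef_, index=features)
ridge_coefs = pd.Series(ridge_model.coef_, index=features)
print("Non-zero lasso features:")
print(lasso_coefs[lasso_coefs != 0])
print("\nTop ridge features by absolute coefficient:")
print(ridge_coefs.sort_values(key=abs, ascending=False).head(5))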
Overall, our model does a good job of predicting the number of wins a baseball team will have in a given year from a set of season statistics. We found the variables most important to a team's success, some of which are self-explanatory, such as run differential, and others more curious, such as park attendance.
These models perform well on historical data; however, it would be harder to predict next year's win total from past years' data alone. This is an avenue we can pursue in future analysis of baseball data, and surprising conclusions would surely arise. However, much of the information that affects such predictions, such as prospects, momentum, and trades, is very hard to quantify, which is part of why baseball is one of the most intriguing sports to analyze.
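As a sketch of how that future-prediction angle could be set up (purely illustrative, and not something we have validated), one could shift each team's statistics forward one season and regress next year's wins on the prior year's numbers:
# Sketch: predict next season's wins from the previous season's team statistics
lagged = baseball.sort_values(['teamID', 'yearID']).copy()
lagged['next_W'] = lagged.groupby('teamID')['W'].shift(-1)   # next season's win total
lagged = lagged.dropna(subset=['next_W'])                    # drop each team's final season
X_lag = StandardScaler().fit_transform(lagged[features])
y_lag = lagged['next_W']
X_tr, X_te, y_tr, y_te = train_test_split(X_lag, y_lag, test_size=0.2, random_state=42)
ridge_next = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("MAE predicting next-season wins:", mean_absolute_error(y_te, ridge_next.predict(X_te)))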
Thank you for reading this tutorial. The link to the Lahman website is provided below.