https://lalbright22.github.io/

Final Tutorial - Keys to Success in Major League Baseball¶

Introduction¶

Our group consists of Luke Albright and James Manzer for the data science project and presentation. We are exploring historical Major League Baseball statistics to determine which variables correlate with the number of wins a team has during a season, as well as how well those variables predict an MLB team's win total.

About the Data:¶

The dataset we are interested in studying involves Major League Baseball statistics. Both of us have a history of playing baseball and are now fans of the Chicago Cubs and Texas Rangers. James is even a member of the club baseball team here at Tulane. Given our passion for baseball and its vast records of statistics, we feel compelled to take a deep dive into this space.

We have identified the Lahman Database as our data source for this project. The Lahman Database is a comprehensive collection of baseball statistics that covers Major League Baseball from the 19th century to the present. It is widely used by researchers, analysts and enthusiasts to analyze various teams and players throughout history. The database is freely available and can be downloaded in various formats. We will be using the CSV version of this data to import into our notebook.

The data we downloaded is divided into three main sections: core, upstream, and contrib. We will focus on data from the core and contrib folders, which include player and team batting, fielding, and pitching statistics as well as salary data. More specifically, we will be looking at aggregated salary, pitching, and hitting data by team and by year, some of which will need to be preprocessed. We believe this trove of information will enable us to identify variables that correlate with the number of wins a franchise has in a given year, as well as predict the future number of wins a franchise may have given certain historical information.

We will be looking at historical data from 2004-2016 for our model. The salary data we are using is only available through 2016, which sets the upper bound. We chose 2004 as the start of our data because of the league's shift to more modern standards around that time: the steroid era lasted through the 1990s until league-wide PED testing was implemented in 2003, inflating the statistics of a number of players and teams. Because of this, our data begins in 2004.

ETL¶

This first block of code imports the libraries used throughout this project.

In [267]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile
import seaborn as sns
import statsmodels.formula.api as smf
from numpy import mean, std, absolute
from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, make_scorer)
from sklearn.model_selection import (train_test_split, cross_val_score, cross_validate,
                                     RepeatedKFold, ShuffleSplit)
from sklearn.preprocessing import scale, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

For ease of use, we created a public GitHub repository containing the raw zip file from the Lahman website. We will now clone that repository, navigate into it, and extract the zip file.

In [268]:
# Extract zip file from github

%cd /content
!git clone https://github.com/lalbright22/DataScience2023.git
%cd /content/DataScience2023/

zip_file_path = '/content/DataScience2023/baseballdatabank-2023.1.zip'
extract_dir = '/content/DataScience2023/'

# Create a ZipFile object
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the target directory
    zip_ref.extractall(extract_dir)
/content
fatal: destination path 'DataScience2023' already exists and is not an empty directory.
/content/DataScience2023

Now that the file is unzipped, we will navigate to the contrib sub-folder and read the Salaries CSV file into a pandas dataframe.

In [269]:
%cd baseballdatabank-2023.1/contrib

# Read the Salaries CSV into a dataframe
salaries = pd.read_csv("Salaries.csv")
salaries.head()
/content/DataScience2023/baseballdatabank-2023.1/contrib
Out[269]:
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000

Once that's done, we will navigate to the core sub-folder that includes the Teams CSV file and read it into a pandas dataframe.

In [270]:
%cd ../core

# Read the Teams CSV into a dataframe
teams = pd.read_csv("Teams.csv")
teams
/content/DataScience2023/baseballdatabank-2023.1/core
Out[270]:
yearID lgID teamID franchID divID Rank G Ghome W L ... DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
0 1871 NaN BS1 BNA NaN 3 31 NaN 20 10 ... 24 0.834 Boston Red Stockings South End Grounds I NaN 103 98 BOS BS1 BS1
1 1871 NaN CH1 CNA NaN 2 28 NaN 19 9 ... 16 0.829 Chicago White Stockings Union Base-Ball Grounds NaN 104 102 CHI CH1 CH1
2 1871 NaN CL1 CFC NaN 8 29 NaN 10 19 ... 15 0.818 Cleveland Forest Citys National Association Grounds NaN 96 100 CLE CL1 CL1
3 1871 NaN FW1 KEK NaN 7 19 NaN 7 12 ... 8 0.803 Fort Wayne Kekiongas Hamilton Field NaN 101 107 KEK FW1 FW1
4 1871 NaN NY2 NNA NaN 5 33 NaN 16 17 ... 14 0.840 New York Mutuals Union Grounds (Brooklyn) NaN 90 88 NYU NY2 NY2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3010 2022 NL SLN STL C 1 162 81.0 93 69 ... 181 0.989 St. Louis Cardinals Busch Stadium III 3320551.0 94 94 STL SLN SLN
3011 2022 AL TBA TBD E 1 162 81.0 86 76 ... 110 0.985 Tampa Bay Rays Tropicana Field 1128127.0 95 93 TBR TBA TBA
3012 2022 AL TEX TEX W 4 162 81.0 68 94 ... 143 0.984 Texas Rangers Globe Life Field 2011361.0 100 101 TEX TEX TEX
3013 2022 AL TOR TOR E 2 162 81.0 92 70 ... 120 0.986 Toronto Blue Jays Rogers Centre 2653830.0 100 100 TOR TOR TOR
3014 2022 NL WAS WSN E 5 162 81.0 55 107 ... 126 0.982 Washington Nationals Nationals Park 2026401.0 94 96 WSN MON WAS

3015 rows × 48 columns

Data Preprocessing¶

We will now begin preprocessing our data, beginning with the salaries dataframe, aggregating the salary data for each player by team and year.

In [271]:
# Sum salaries by team and year (league stays the same in all cases) in order to merge with teams dataset
team_salaries = salaries.groupby(["teamID", "yearID"])["salary"].sum().reset_index() #Groups and sums the player salaries by team.
team_salaries
Out[271]:
teamID yearID salary
0 ANA 1997 31135472
1 ANA 1998 41281000
2 ANA 1999 55388166
3 ANA 2000 51464167
4 ANA 2001 47535167
... ... ... ...
913 WAS 2012 80855143
914 WAS 2013 113703270
915 WAS 2014 131983680
916 WAS 2015 155587472
917 WAS 2016 141652646

918 rows × 3 columns

Now we will calculate a few new variables from the teams dataframe, including run differential (differential), batting average (AVG), on base percentage (OBP), singles (1B) and slugging (SLG).

In [272]:
# Calculate run differential by the formula below
teams["differential"] = teams['R'] - teams['RA']
In [273]:
# Calculate batting average by the formula below
teams["AVG"] = (teams['H']) / (teams['AB'])
In [274]:
# Calculate on base percentage by the formula below
teams["OBP"] = (teams['H'] + teams['BB'] + teams['HBP']) / (teams['AB'] + teams['BB'] + teams['HBP'] + teams['SF'])
In [275]:
# Calculate singles by the formula below
teams["1B"] = teams['H'] - teams['2B'] - teams['3B'] - teams['HR']
In [276]:
# Calculate slugging percentage by the formula below
teams["SLG"] = (teams['1B'] + 2*teams['2B'] + 3*teams['3B'] + 4*teams['HR'])/teams['AB']
teams
Out[276]:
yearID lgID teamID franchID divID Rank G Ghome W L ... BPF PPF teamIDBR teamIDlahman45 teamIDretro differential AVG OBP 1B SLG
0 1871 NaN BS1 BNA NaN 3 31 NaN 20 10 ... 103 98 BOS BS1 BS1 98 0.310496 NaN 316 0.422012
1 1871 NaN CH1 CNA NaN 2 28 NaN 19 9 ... 104 102 CHI CH1 CH1 61 0.270067 NaN 240 0.373746
2 1871 NaN CL1 CFC NaN 8 29 NaN 10 19 ... 96 100 CLE CL1 CL1 -92 0.276560 NaN 246 0.391231
3 1871 NaN FW1 KEK NaN 7 19 NaN 7 12 ... 101 107 KEK FW1 FW1 -106 0.238606 NaN 149 0.293566
4 1871 NaN NY2 NNA NaN 5 33 NaN 16 17 ... 90 88 NYU NY2 NY2 -11 0.287037 NaN 338 0.349715
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3010 2022 NL SLN STL C 1 162 81.0 93 69 ... 94 94 STL SLN SLN 135 0.252183 0.325268 878 0.420124
3011 2022 AL TBA TBD E 1 162 81.0 86 76 ... 95 93 TBR TBA TBA 52 0.239098 0.308500 842 0.377125
3012 2022 AL TEX TEX W 4 162 81.0 68 94 ... 100 101 TEX TEX TEX -36 0.238773 0.300881 866 0.395400
3013 2022 AL TOR TOR E 2 162 81.0 92 70 ... 100 100 TOR TOR TOR 96 0.263546 0.328667 945 0.431143
3014 2022 NL WAS WSN E 5 162 81.0 55 107 ... 94 96 WSN MON WAS -252 0.248620 0.310229 943 0.377438

3015 rows × 53 columns

Here we will drop a few unnecessary columns that will not be used in our model.

In [277]:
# Drop unnecessary columns
teams.drop(columns=["teamIDBR", "teamIDlahman45", "teamIDretro", "IPouts"], inplace=True)

Here we will merge the two dataframes on team and year, resulting in our final baseball dataframe that will be used in our model.

In [278]:
# This results in a dataframe with statistics for each team and year, including total salary, wins, attendance, batting and pitching park factors, and more.
baseball = pd.merge(team_salaries, teams, on=["teamID", "yearID"])

Lastly, we will filter the yearID column of the dataframe to only include data from the years 2004-2016. We can check that our preprocessing is correct because there are 30 teams, and 30 teams × 13 years = 390 rows, as shown below.

In [279]:
baseball = baseball[(baseball['yearID'] >= 2004) & (baseball['yearID'] <= 2016)]
In [280]:
baseball
Out[280]:
teamID yearID salary lgID franchID divID Rank G Ghome W ... name park attendance BPF PPF differential AVG OBP 1B SLG
7 ANA 2004 100534667 AL ANA W 1 162 81.0 92 ... Anaheim Angels Angels Stadium of Anaheim 3375677.0 97 97 102 0.282467 0.340760 1132 0.429075
14 ARI 2004 69780750 NL ARI W 5 162 81.0 51 ... Arizona Diamondbacks Bank One Ballpark 2519560.0 105 107 -284 0.252706 0.309889 933 0.392677
15 ARI 2005 62329166 NL ARI W 2 162 81.0 77 ... Arizona Diamondbacks Bank One Ballpark 2059424.0 103 105 -160 0.255676 0.332481 910 0.421081
16 ARI 2006 59684226 NL ARI W 4 162 81.0 76 ... Arizona Diamondbacks Chase Field 2091685.0 105 105 -15 0.266785 0.331313 977 0.423915
17 ARI 2007 52067546 NL ARI W 1 162 81.0 90 ... Arizona Diamondbacks Chase Field 2325249.0 107 107 -20 0.250093 0.320761 853 0.412931
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
913 WAS 2012 80855143 NL WSN E 1 162 81.0 98 ... Washington Nationals Nationals Park 2370794.0 99 101 137 0.261443 0.322152 948 0.427605
914 WAS 2013 113703270 NL WSN E 2 162 81.0 86 ... Washington Nationals Nationals Park 2652422.0 102 101 30 0.251104 0.312594 918 0.397535
915 WAS 2014 131983680 NL WSN E 1 162 81.0 96 ... Washington Nationals Nationals Park 2579389.0 104 102 131 0.253158 0.320988 959 0.392999
916 WAS 2015 155587472 NL WSN E 2 162 81.0 83 ... Washington Nationals Nationals Park 2619843.0 102 99 68 0.251105 0.321016 908 0.402542
917 WAS 2016 141652646 NL WSN E 1 162 81.0 95 ... Washington Nationals Nationals Park 2481938.0 100 98 151 0.255556 0.325532 903 0.425865

390 rows × 50 columns
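
As an extra sanity check (a minimal sketch using the baseball dataframe built above), we could also confirm the expected shape programmatically:

# Sanity check: 30 teams per season over 13 seasons (2004-2016) should give 390 rows
print(baseball.shape)                      # expect (390, 50)
print(baseball.groupby('yearID').size())   # expect 30 rows for every year
assert len(baseball) == 390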

In [281]:
baseball.columns
Out[281]:
Index(['teamID', 'yearID', 'salary', 'lgID', 'franchID', 'divID', 'Rank', 'G',
       'Ghome', 'W', 'L', 'DivWin', 'WCWin', 'LgWin', 'WSWin', 'R', 'AB', 'H',
       '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER',
       'ERA', 'CG', 'SHO', 'SV', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP',
       'name', 'park', 'attendance', 'BPF', 'PPF', 'differential', 'AVG',
       'OBP', '1B', 'SLG'],
      dtype='object')

Here is a helpful glossary containing the definitions for all of the variables in the merged dataframe.¶

  • yearID = Year
  • lgID = League (National or American)
  • teamID = Team
  • franchID = Franchise
  • divID = Division
  • Rank = Ranking
  • G = Games
  • Ghome = Home Games
  • W = Wins
  • L = Losses
  • DivWin = Division Winner (Y/N)
  • WCWin = Wild Card Winner (Y/N)
  • LgWin = League Champion (Y/N)
  • WSWin = World Series Winner (Y/N)
  • R = Runs
  • AB = At Bats
  • H = Hits
  • 2B = Doubles
  • 3B = Triples
  • HR = Home Runs
  • BB = Walks
  • SO = Strike Outs
  • SB = Stolen Bases
  • CS = Caught Stealing
  • HBP = Hit by Pitch
  • SF = Sacrifice Fly
  • RA = Runs Against
  • ER = Earned Runs
  • ERA = Earned Run Average
  • CG = Complete Game
  • SHO = Shutout
  • SV = Save
  • HA = Hits Allowed
  • HRA = Home Runs Allowed
  • BBA = Walks Allowed
  • SOA = Strikeouts by Pitchers
  • E = Errors
  • DP = Double Play
  • FP = Fielding Percentage
  • name = Team Name
  • park = Ballpark
  • attendance = Attendance
  • BPF = Batting Park Factor
  • PPF = Pitching Park Factor
  • differential = Run Differential
  • AVG = Batting Average
  • OBP = On Base Percentage
  • 1B = Singles
  • SLG = Slugging Percentage
  • salary = Total Team Salary

EDA¶

Now that the data is preprocessed, we will visualize the data and perform exploratory data analysis to identify trends, identify correlations and begin to form a hypothesis as to what variables correlate with wins the most.

In [282]:
baseball_nl = baseball[baseball['lgID'] == 'NL']
baseball_al = baseball[baseball['lgID'] == 'AL']

# Plot for National League
plt.figure(figsize=(12, 8)) #Adjust the figure size
sns.scatterplot(data=baseball_nl, x='yearID', y='W', hue='teamID', palette='plasma', edgecolor='w', s=100)
plt.title('W by NL Teams Each Year')
plt.xlabel('Year')
plt.ylabel('W')
plt.legend(title='teamID', bbox_to_anchor=(1.05, 1), loc='upper left')  # Adjust legend position
# Add a horizontal line at the mean # of wins
plt.axhline(81, color='green', linestyle='--', label='Threshold (81)')
plt.show()

# Plot for American League
plt.figure(figsize=(12, 8)) #Adjust the figure size
sns.scatterplot(data=baseball_al, x='yearID', y='W', hue='teamID', palette='plasma', edgecolor='w', s=100)
plt.title('W by AL Teams Each Year')
plt.xlabel('Year')
plt.ylabel('W')
plt.legend(title='teamID', bbox_to_anchor=(1.05, 1), loc='upper left')  # Adjust legend position
# Add a horizontal line at the mean # of wins
plt.axhline(81, color='green', linestyle='--', label='Threshold (81)')
plt.show()

Shown above are scatterplots of the number of wins, our dependent variable, that each team had per year, split by the NL and AL. This gives us an idea of the distribution of the number of wins a team has in a given year. As the data shows, it appears to have a relatively large standard deviation. Let's take a closer look.

In [283]:
wins_std_by_league_and_year = baseball.groupby(['lgID', 'yearID'])['W'].std()

print("Standard Deviation of Wins by League and Year:")
print(wins_std_by_league_and_year)
Standard Deviation of Wins by League and Year:
lgID  yearID
AL    2004      13.473621
      2005      13.070393
      2006      12.000000
      2007      10.885266
      2008      11.338207
      2009      12.162137
      2010      11.640211
      2011      11.432795
      2012      10.967484
      2013      13.704361
      2014       9.927355
      2015       7.225616
      2016      10.511672
NL    2004      14.051690
      2005       8.830817
      2006       8.035079
      2007       7.957177
      2008      11.002841
      2009      11.094856
      2010      10.776518
      2011      11.763645
      2012      12.992305
      2013      11.093542
      2014       9.523405
      2015      13.097801
      2016      11.089678
Name: W, dtype: float64

The standard deviation of wins by league per year from 2004 to 2016 is usually around 11-12 wins. There appears to be no consistent trend in the variation.

Heatmaps are a helpful way of visualizing correlations between a number of variables. Below are four heatmaps, each including wins as the first row, that together cover the variables in our dataframe and their correlations.

In [284]:
sns.heatmap(baseball[['W', 'BPF', 'R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
Out[284]:
<Axes: >

From the heatmap above, it appears that runs, home runs, and walks have positive correlations with wins, while strikeouts and, oddly, triples have negative correlations with wins. These all make sense apart from triples, which could be explained by the fact that a triple does not score the batter and is not as valuable as a home run.

In [285]:
sns.heatmap(baseball[['W', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER', 'CG', 'SHO']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
Out[285]:
<Axes: >

From the heatmap above, shutouts and sacrifice flies are the variables most positively correlated with wins, which both make sense since a shutout guarantees a win in a given game and sacrifice flies score runs. Runs against and earned runs both have strong negative correlations with wins, which makes sense since allowing a high number of runs, earned or not, does not bode well for a team's winning prospects.

In [286]:
sns.heatmap(baseball[['W', 'SV', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP', 'attendance']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
Out[286]:
<Axes: >

From the heatmap above, saves, attendance, fielding percentage, and strikeouts by the pitching staff are all positively correlated with wins. All of these make sense, and while attendance isn't an in-game statistic, it makes sense that winning teams draw larger crowds. Hits allowed, walks allowed, errors, and home runs allowed are negatively correlated with wins, which all make sense as well.

In [287]:
sns.heatmap(baseball[['W', 'PPF', 'differential', 'AVG', 'OBP', '1B', 'SLG', 'salary']].corr(), annot = True, cmap = 'coolwarm', fmt='.2f')
Out[287]:
<Axes: >

In our last heatmap above, run differential is highly positively correlated with wins, as teams that score more runs than they give up tend to win more often. On-base percentage, slugging percentage, salary, and batting average are all also positively correlated with wins.

Next, we will take a look at some regression plots of wins against the variables we identified as important from our heatmap visualizations. For each graph, we use Seaborn's regplot function, plotting wins on the y-axis and the selected variable on the x-axis. We will also run a simple linear regression of wins on each selected variable, giving us an idea of how statistically significant each variable is with respect to wins in isolation. Because the data is not yet standardized, the coefficients themselves aren't directly comparable.

In [288]:
diff_v_win = sns.regplot(x='differential', y='W', data=baseball, scatter_kws={'s':10})
diff_v_win.set(xlabel='Run Differential', ylabel='Wins')
Out[288]:
[Text(0.5, 0, 'Run Differential'), Text(0, 0.5, 'Wins')]
In [289]:
mod = smf.ols(formula='W ~ differential', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.858
Method:                 Least Squares   F-statistic:                     2343.
Date:                Fri, 15 Dec 2023   Prob (F-statistic):          1.68e-166
Time:                        05:39:20   Log-Likelihood:                -1105.0
No. Observations:                 390   AIC:                             2214.
Df Residuals:                     388   BIC:                             2222.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       80.9795      0.209    387.757      0.000      80.569      81.390
differential     0.1008      0.002     48.405      0.000       0.097       0.105
==============================================================================
Omnibus:                        6.362   Durbin-Watson:                   1.819
Prob(Omnibus):                  0.042   Jarque-Bera (JB):                6.148
Skew:                           0.288   Prob(JB):                       0.0462
Kurtosis:                       3.216   Cond. No.                         100.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [290]:
on_base = sns.regplot(x='OBP', y='W', data=baseball, scatter_kws={'s':10})
on_base.set(xlabel='On Base Percentage', ylabel='Wins')
Out[290]:
[Text(0.5, 0, 'On Base Percentage'), Text(0, 0.5, 'Wins')]
In [291]:
mod = smf.ols(formula='W ~ OBP', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.214
Model:                            OLS   Adj. R-squared:                  0.212
Method:                 Least Squares   F-statistic:                     105.9
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           4.00e-22
Time:                        05:39:20   Log-Likelihood:                -1438.4
No. Observations:                 390   AIC:                             2881.
Df Residuals:                     388   BIC:                             2889.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -37.8732     11.559     -3.277      0.001     -60.599     -15.147
OBP          364.7135     35.438     10.292      0.000     295.039     434.388
==============================================================================
Omnibus:                        9.409   Durbin-Watson:                   1.224
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                6.149
Skew:                          -0.156   Prob(JB):                       0.0462
Kurtosis:                       2.470   Cond. No.                         79.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [292]:
sns.regplot(x='salary', y='W', data=baseball, scatter_kws={'s':20})
Out[292]:
<Axes: xlabel='salary', ylabel='W'>

This is a regression plot showing the relationship between total team salary and wins for a given team in a given year. There is a slight positive trend and weak correlation between the two variables.

In [293]:
mod = smf.ols(formula='W ~ salary', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.136
Model:                            OLS   Adj. R-squared:                  0.134
Method:                 Least Squares   F-statistic:                     60.95
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           5.51e-14
Time:                        05:39:21   Log-Likelihood:                -1457.0
No. Observations:                 390   AIC:                             2918.
Df Residuals:                     388   BIC:                             2926.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     71.8865      1.274     56.447      0.000      69.383      74.390
salary      9.759e-08   1.25e-08      7.807      0.000     7.3e-08    1.22e-07
==============================================================================
Omnibus:                       17.415   Durbin-Watson:                   1.332
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.623
Skew:                          -0.142   Prob(JB):                       0.0134
Kurtosis:                       2.329   Cond. No.                     2.52e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.52e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
In [294]:
sns.regplot(x='BPF', y='W', data=baseball, scatter_kws={'s':20})
Out[294]:
<Axes: xlabel='BPF', ylabel='W'>

This is a regression plot showing the relationship between the batting park factor (the tendency for batters to hit above average in a given ballpark) and wins for a given team in a given year. There is essentially no trend or correlation between the two variables.

In [295]:
mod = smf.ols(formula='W ~ BPF', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     3.365
Date:                Fri, 15 Dec 2023   Prob (F-statistic):             0.0674
Time:                        05:39:21   Log-Likelihood:                -1483.8
No. Observations:                 390   AIC:                             2972.
Df Residuals:                     388   BIC:                             2980.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     60.9305     10.944      5.568      0.000      39.414      82.447
BPF            0.2002      0.109      1.834      0.067      -0.014       0.415
==============================================================================
Omnibus:                       20.655   Durbin-Watson:                   1.156
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               11.233
Skew:                          -0.238   Prob(JB):                      0.00364
Kurtosis:                       2.319   Cond. No.                     1.99e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.99e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [296]:
ppf_v_win = sns.regplot(x='PPF', y='W', data=baseball, scatter_kws={'s':20})
ppf_v_win.set(xlabel='Pitching Park Factor', ylabel='Wins')
Out[296]:
[Text(0.5, 0, 'Pitching Park Factor'), Text(0, 0.5, 'Wins')]

This is a regression plot showing the relationship between the pitching park factor (the tendency for hitters to hit below average in a given ballpark) and wins for a given team in a given year. There is a slight negative trend and little correlation between the two variables.

In [297]:
mod = smf.ols(formula='W ~ PPF', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.6023
Date:                Fri, 15 Dec 2023   Prob (F-statistic):              0.438
Time:                        05:39:22   Log-Likelihood:                -1485.2
No. Observations:                 390   AIC:                             2974.
Df Residuals:                     388   BIC:                             2982.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     89.6310     11.161      8.031      0.000      67.687     111.575
PPF           -0.0864      0.111     -0.776      0.438      -0.305       0.132
==============================================================================
Omnibus:                       23.562   Durbin-Watson:                   1.150
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               11.214
Skew:                          -0.203   Prob(JB):                      0.00367
Kurtosis:                       2.275   Cond. No.                     2.02e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [298]:
sns.regplot(x='attendance', y='W', data=baseball, scatter_kws={'s':20})
Out[298]:
<Axes: xlabel='attendance', ylabel='W'>

This is a regression plot showing the relationship between attendance and wins for a given team in a given year. There is a positive trend and some correlation between the two variables.

In [299]:
mod = smf.ols(formula='W ~ attendance', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.232
Model:                            OLS   Adj. R-squared:                  0.230
Method:                 Least Squares   F-statistic:                     117.5
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           4.31e-24
Time:                        05:39:22   Log-Likelihood:                -1433.9
No. Observations:                 390   AIC:                             2872.
Df Residuals:                     388   BIC:                             2880.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     61.3386      1.876     32.698      0.000      57.650      65.027
attendance  7.885e-06   7.27e-07     10.839      0.000    6.45e-06    9.31e-06
==============================================================================
Omnibus:                        3.926   Durbin-Watson:                   1.319
Prob(Omnibus):                  0.140   Jarque-Bera (JB):                3.400
Skew:                          -0.145   Prob(JB):                        0.183
Kurtosis:                       2.646   Cond. No.                     9.97e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.97e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [300]:
sns.regplot(x='FP', y='W', data=baseball, scatter_kws={'s':20})
Out[300]:
<Axes: xlabel='FP', ylabel='W'>

This is a regression plot showing the relationship between fielding percentage and wins for a given team in a given year. There is a positive trend and weak correlation between the two variables.

In [301]:
mod = smf.ols(formula='W ~ FP', data=baseball)
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.156
Model:                            OLS   Adj. R-squared:                  0.154
Method:                 Least Squares   F-statistic:                     71.68
Date:                Fri, 15 Dec 2023   Prob (F-statistic):           5.26e-16
Time:                        05:39:23   Log-Likelihood:                -1452.4
No. Observations:                 390   AIC:                             2909.
Df Residuals:                     388   BIC:                             2917.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -1633.7785    202.532     -8.067      0.000   -2031.976   -1235.581
FP          1743.1308    205.883      8.467      0.000    1338.346    2147.916
==============================================================================
Omnibus:                       14.490   Durbin-Watson:                   1.191
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                7.067
Skew:                          -0.069   Prob(JB):                       0.0292
Kurtosis:                       2.355   Cond. No.                         796.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the linear regression outputs, many of the results point to moderate positive or negative correlations between wins and the selected variables. Also of note, several of the outputs warn of strong multicollinearity, which makes sense: our heatmaps show that many of the independent variables are highly correlated with one another, which produces multicollinearity. For our model, it may be best to try alternatives to plain linear regression.
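
One way to quantify this multicollinearity, shown here only as a rough sketch over an illustrative subset of the offensive statistics, is the variance inflation factor (VIF): a VIF well above roughly 5-10 suggests a predictor is largely explained by the other predictors.

# Rough sketch: variance inflation factors for a handful of offensive statistics
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

vif_cols = ['R', 'H', 'HR', 'BB', 'AVG', 'OBP', 'SLG']   # illustrative subset
X_vif = sm.add_constant(baseball[vif_cols])
for i, col in enumerate(X_vif.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X_vif.values, i))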

Model¶

For the modeling section of our tutorial, we will focus on ridge and lasso regressions to evaluate our data. As shown in the exploratory data analysis, many of the independent variables in our dataset exhibit multicollinearity, which makes the individual coefficients of correlated variables hard to interpret. Because we want to determine which variables are the most important in predicting wins, we will set plain linear regression aside.

We will use ridge regression because it adds a penalty term to the linear regression objective, controlling the size of the coefficients. The penalty term is the sum of the squared coefficients multiplied by a regularization parameter. Because we have many correlated features, overfitting is a concern, so a more constrained model is preferable.
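
Concretely, ridge regression chooses coefficients by minimizing a penalized least-squares objective along these lines (up to scikit-learn's exact scaling conventions), where α controls the strength of the penalty:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$$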

Similarly, we will use lasso regression as an alternative to ridge regression in order to create a slimmer model. It is similar to ridge, but instead of adding the sum of squared coefficients, it adds the sum of the absolute values of the coefficients. Lasso also has a tendency to produce sparse models by setting some coefficients exactly to zero, effectively performing feature selection.
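
The corresponding lasso objective (again up to scikit-learn's scaling conventions, which divide the squared-error term by 2n) swaps the squared penalty for an absolute-value penalty, which is what drives some coefficients exactly to zero:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$$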

Below is our entire model. The process of our model will be explained in comments, and we will conduct an analysis on the output following.

In [302]:
# Select relevant features
features = ['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'CS', 'HBP', 'SF', 'RA', 'ER', 'CG', 'SHO', 'SV', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'DP', 'FP', 'attendance', 'BPF', 'PPF', 'differential', 'AVG', 'OBP', '1B', 'SLG', 'salary']

# Split the data into features (X) and the target variable (y)
X = baseball[features]
y = baseball['W']

# Shuffle X and y to avoid data being grouped by year
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)

# Split the shuffled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_shuffled, y_shuffled, test_size=0.2, random_state=42)

# Standardizing the features, a good all-purpose scaling method
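# StandardScaler computes z = (x - mean_train) / std_train for each feature, fit on the
# training set only, so all features are on a comparable scale before regularization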
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Lasso Regression with cross-validation
lasso_model = Lasso(alpha=1.0)
lasso_cv_results = cross_validate(lasso_model, X_train_scaled, y_train, cv=5, scoring={'mse': make_scorer(mean_squared_error), 'mae': make_scorer(mean_absolute_error), 'mape': 'neg_mean_absolute_percentage_error'}, return_train_score=True)

# Ridge Regression with cross-validation
ridge_model = Ridge(alpha=1.0)
ridge_cv_results = cross_validate(ridge_model, X_train_scaled, y_train, cv=5, scoring={'mse': make_scorer(mean_squared_error), 'mae': make_scorer(mean_absolute_error), 'mape': 'neg_mean_absolute_percentage_error'}, return_train_score=True)

# Outputting cross-validation results
print("Lasso Regression Cross-Validation Results:")
print("Train MSE:", lasso_cv_results['train_mse'])
print("Test MSE:", lasso_cv_results['test_mse'])
print("Train MAE:", lasso_cv_results['train_mae'])
print("Test MAE:", lasso_cv_results['test_mae'])
print("Train MAPE:", -lasso_cv_results['train_mape'].mean())
print("Test MAPE:", -lasso_cv_results['test_mape'].mean())

print("\nRidge Regression Cross-Validation Results:")
print("Train MSE:", ridge_cv_results['train_mse'])
print("Test MSE:", ridge_cv_results['test_mse'])
print("Train MAE:", ridge_cv_results['train_mae'])
print("Test MAE:", ridge_cv_results['test_mae'])
print("Train MAPE:", -ridge_cv_results['train_mape'].mean())
print("Test MAPE:", -ridge_cv_results['test_mape'].mean())

# Fit final models on the entire training set
lasso_model.fit(X_train_scaled, y_train)
ridge_model.fit(X_train_scaled, y_train)

# Making predictions on the test set
y_pred_lasso = lasso_model.predict(X_test_scaled)
y_pred_ridge = ridge_model.predict(X_test_scaled)

# Evaluating performance on the test set
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
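# Note: the "MAPE" values below normalize MAE by the mean of y_test (MAE / mean(y) * 100),
# an approximation that differs slightly from sklearn's mean_absolute_percentage_error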
mape_lasso = mean_absolute_error(y_test, y_pred_lasso) / y_test.mean() * 100
mape_ridge = mean_absolute_error(y_test, y_pred_ridge) / y_test.mean() * 100

# Outputting additional metrics
print("\nLasso Regression Metrics on Test Set:")
print("Test MSE:", mse_lasso)
print("Test MAE:", mae_lasso)
print("Test MAPE:", mape_lasso)

print("\nRidge Regression Metrics on Test Set:")
print("Test MSE:", mse_ridge)
print("Test MAE:", mae_ridge)
print("Test MAPE:", mape_ridge)

# Calculating coefficients
linear_coefficients = pd.Series(lasso_model.coef_, index=features, name='Lasso Coefficients')
ridge_coefficients = pd.Series(ridge_model.coef_, index=features, name='Ridge Coefficients')

# Outputting the coefficients
print("\nLasso Regression Coefficients:")
print(linear_coefficients)

print("\nRidge Regression Coefficients:")
print(ridge_coefficients)

# Plotting the coefficients
plt.figure(figsize=(12, 6))
linear_coefficients.plot(kind='bar', color='blue', alpha=0.7, label='Lasso Regression')
ridge_coefficients.plot(kind='bar', color='orange', alpha=0.7, label='Ridge Regression')
plt.title('Lasso vs. Ridge Regression Coefficients')
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.legend()
plt.show()
Lasso Regression Cross-Validation Results:
Train MSE: [11.20666699 11.67411654 11.04421093 11.68885514 10.69664513]
Test MSE: [11.39210868  9.02284629 11.96660232 11.99680211 13.39725325]
Train MAE: [2.67132948 2.77871331 2.67748195 2.7707325  2.66335372]
Test MAE: [2.85383271 2.39444269 2.85633994 2.76982123 2.88783674]
Train MAPE: 0.03444431452354736
Test MAPE: 0.03493001884604956

Ridge Regression Cross-Validation Results:
Train MSE: [7.74448574 8.35934331 7.3459862  7.93124754 7.45480217]
Test MSE: [ 9.84239331  7.10438773 11.46098082  8.8855374  11.69348803]
Train MAE: [2.206153   2.27575906 2.09535439 2.2363997  2.16573858]
Test MAE: [2.43717243 2.09853272 2.76548542 2.31515811 2.69512577]
Train MAPE: 0.02773242061706115
Test MAPE: 0.031060612906844226

Lasso Regression Metrics on Test Set:
Test MSE: 13.25365719237693
Test MAE: 2.933882513923934
Test MAPE: 3.6301211308068986

Ridge Regression Metrics on Test Set:
Test MSE: 10.203421192219395
Test MAE: 2.56765965016351
Test MAPE: 3.1769900493774395

Lasso Regression Coefficients:
R               0.000000
AB              0.000000
H               0.000000
2B              0.000000
3B             -0.000000
HR              0.000000
BB              0.000000
SO             -0.000000
SB              0.000000
CS             -0.000000
HBP             0.000000
SF              0.000000
RA             -0.000000
ER             -0.000000
CG              0.000000
SHO             0.000000
SV              2.121406
HA             -0.000000
HRA            -0.000000
BBA            -0.000000
SOA             0.000000
E              -0.000000
DP             -0.000000
FP              0.000000
attendance      0.058465
BPF            -0.000000
PPF            -0.000000
differential    8.178797
AVG             0.000000
OBP             0.000000
1B              0.000000
SLG             0.000000
salary          0.000000
Name: Lasso Coefficients, dtype: float64

Ridge Regression Coefficients:
R               3.403701
AB             -0.366015
H              -0.056132
2B             -0.171711
3B             -0.474057
HR              0.078274
BB              0.389861
SO             -0.183055
SB              0.343619
CS             -0.123511
HBP             0.174036
SF             -0.157728
RA             -0.769355
ER             -1.867679
CG              0.390758
SHO             0.655094
SV              2.912473
HA              0.200086
HRA            -0.363844
BBA            -0.490196
SOA             0.633574
E               0.325810
DP             -0.017997
FP              0.701568
attendance      0.614488
BPF            -0.305166
PPF            -0.003891
differential    3.277630
AVG             0.598614
OBP             0.100328
1B              0.027911
SLG             0.617461
salary         -0.193715
Name: Ridge Coefficients, dtype: float64

A large block of code and an equally large output! Let's dive in and see what our model has discovered.

To recap, we used sklearn's ridge and lasso regression classes to create two models that relate a long list of independent variables from the Lahman baseball database to wins.

We shuffled the data so it is not grouped by year, split it into training and testing sets, and scaled the features using standardization, which transforms each feature to have a mean of 0 and a standard deviation of 1. We used cross-validation to get a more accurate picture of our models' tendency to over- or underfit, and obtained the following results:

In [303]:
print("Lasso Regression Cross-Validation Results:")
print("Train MSE:", lasso_cv_results['train_mse'])
print("Test MSE:", lasso_cv_results['test_mse'])
print("Train MAE:", lasso_cv_results['train_mae'])
print("Test MAE:", lasso_cv_results['test_mae'])
print("Train MAPE:", -lasso_cv_results['train_mape'].mean())
print("Test MAPE:", -lasso_cv_results['test_mape'].mean())

print("\nRidge Regression Cross-Validation Results:")
print("Train MSE:", ridge_cv_results['train_mse'])
print("Test MSE:", ridge_cv_results['test_mse'])
print("Train MAE:", ridge_cv_results['train_mae'])
print("Test MAE:", ridge_cv_results['test_mae'])
print("Train MAPE:", -ridge_cv_results['train_mape'].mean())
print("Test MAPE:", -ridge_cv_results['test_mape'].mean())
Lasso Regression Cross-Validation Results:
Train MSE: [11.20666699 11.67411654 11.04421093 11.68885514 10.69664513]
Test MSE: [11.39210868  9.02284629 11.96660232 11.99680211 13.39725325]
Train MAE: [2.67132948 2.77871331 2.67748195 2.7707325  2.66335372]
Test MAE: [2.85383271 2.39444269 2.85633994 2.76982123 2.88783674]
Train MAPE: 0.03444431452354736
Test MAPE: 0.03493001884604956

Ridge Regression Cross-Validation Results:
Train MSE: [7.74448574 8.35934331 7.3459862  7.93124754 7.45480217]
Test MSE: [ 9.84239331  7.10438773 11.46098082  8.8855374  11.69348803]
Train MAE: [2.206153   2.27575906 2.09535439 2.2363997  2.16573858]
Test MAE: [2.43717243 2.09853272 2.76548542 2.31515811 2.69512577]
Train MAPE: 0.02773242061706115
Test MAPE: 0.031060612906844226

As we can see above, ridge appears to be more accurate than lasso in predicting the number of wins an MLB team has. Both are quite accurate, however, with test MAPE just above 3%; since the average team wins about 81 games, that corresponds to an error of roughly 2-3 games. It appears the models come within a few games of a team's actual win total, given data about its entire season.

Moving on, we will take a look at the coefficients of each model. The lasso model, the slimmer of the two, takes only a few variables into account: saves (2.121406), attendance (0.058465), and run differential (8.178797). Run differential is clearly the coefficient most strongly tied to the number of wins a given team may have. The ridge regression model has a coefficient for every variable, but its most significant are run differential (3.277630), runs (3.403701), earned runs allowed (-1.867679), and saves (2.912473). This agrees with our lasso model, as runs and earned runs allowed are components of run differential, and saves matter because a save can only be recorded in a win.

The graph above displays the coefficients of each variable for both models, overlaid to compare their magnitudes. It highlights lasso's tendency to produce slimmer models and supports the same conclusions as the coefficient values printed above.

Conclusion¶

Overall, our models do a good job of predicting the number of wins a baseball team would have in a given year from a number of statistics. We identified the variables most important to a team's success, some of which are self-explanatory, such as run differential, and others more curious, such as park attendance.

These models perform well on historical data; however, it would be harder to predict next year's win total from past years' data alone. This is an avenue we can pursue in future analysis of baseball data, and surprising conclusions will surely arise. Much of the information that drives such predictions, such as prospects, momentum, and trades, is very hard to quantify, which is part of what makes baseball one of the most intriguing sports to analyze.

Thank you for reading this tutorial. Below is a link to the Lahman website.

Important Links¶

Lahman Database