2024-11-28
- statsmodels is a package for implementing traditional frequentist statistical models
- scikit-learn is a workhorse package for general statistical modeling (inclusive of ML)
- matplotlib and seaborn are packages for visualizing data results
statsmodels
Let's start by simulating data for a linear model in the following way (the estimation interface will feel a lot like R's lm):

import numpy as np

rng = np.random.default_rng(seed=12345)

def dnorm(mean, variance, rng, size=1):
    # draw `size` values from a Normal(mean, variance) distribution
    if isinstance(size, int):
        size = (size,)
    return mean + np.sqrt(variance) * rng.standard_normal(size)

N = 100
X = np.c_[dnorm(0, 0.4, rng, size=N),
          dnorm(0, 0.6, rng, size=N),
          dnorm(0, 0.2, rng, size=N)]
eps = dnorm(0, 0.1, rng, size=N)
beta = [0.1, 0.3, 0.5]
y = np.dot(X, beta) + eps
What are the dimensions of X and y?
Note that this is a linear, homoskedastic model.
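The first five rows shown below presumably come from slicing X:

```python
X[:5]
```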
array([[-0.90050602, -0.18942958, -1.0278702 ],
[ 0.79925205, -1.54598388, -0.32739708],
[-0.55065483, -0.12025429, 0.32935899],
[-0.16391555, 0.82403985, 0.20827485],
[-0.04765129, -0.21314698, -0.04824364]])
As in our numpy lecture, we also want to fit an intercept – statsmodels makes this easy:
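A sketch using sm.add_constant (the variable name X_model is my own):

```python
import statsmodels.api as sm

X_model = sm.add_constant(X)  # prepend a column of ones for the intercept
X_model[:5]
```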
array([[ 1. , -0.90050602, -0.18942958, -1.0278702 ],
[ 1. , 0.79925205, -1.54598388, -0.32739708],
[ 1. , -0.55065483, -0.12025429, 0.32935899],
[ 1. , -0.16391555, 0.82403985, 0.20827485],
[ 1. , -0.04765129, -0.21314698, -0.04824364]])
sm.OLS
We begin by creating a sm.OLS object using our data (this is the array-based method):
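A sketch of the array-based interface (object names are my own; the exog matrix must already contain the intercept column):

```python
model = sm.OLS(y, X_model)
results = model.fit()
```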
summary()
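The regression table below presumably comes from:

```python
results.summary()
```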
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.470
Model: OLS Adj. R-squared: 0.453
Method: Least Squares F-statistic: 28.36
Date: Wed, 20 Nov 2024 Prob (F-statistic): 3.23e-13
Time: 09:48:23 Log-Likelihood: -25.390
No. Observations: 100 AIC: 58.78
Df Residuals: 96 BIC: 69.20
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0208 0.032 -0.653 0.516 -0.084 0.042
x1 0.0658 0.054 1.220 0.226 -0.041 0.173
x2 0.2690 0.043 6.312 0.000 0.184 0.354
x3 0.4494 0.068 6.567 0.000 0.314 0.585
==============================================================================
Omnibus: 0.429 Durbin-Watson: 1.878
Prob(Omnibus): 0.807 Jarque-Bera (JB): 0.296
Skew: 0.133 Prob(JB): 0.863
Kurtosis: 2.995 Cond. No. 2.16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let’s rerun with heteroskedasticity-robust standard errors:
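A sketch matching the "Covariance Type: HC1" line in the output below:

```python
results_robust = model.fit(cov_type='HC1')
results_robust.summary()
```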
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.470
Model: OLS Adj. R-squared: 0.453
Method: Least Squares F-statistic: 25.90
Date: Wed, 20 Nov 2024 Prob (F-statistic): 2.32e-12
Time: 09:48:23 Log-Likelihood: -25.390
No. Observations: 100 AIC: 58.78
Df Residuals: 96 BIC: 69.20
Df Model: 3
Covariance Type: HC1
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0208 0.032 -0.657 0.511 -0.083 0.041
x1 0.0658 0.060 1.097 0.273 -0.052 0.183
x2 0.2690 0.046 5.795 0.000 0.178 0.360
x3 0.4494 0.072 6.231 0.000 0.308 0.591
==============================================================================
Omnibus: 0.429 Durbin-Watson: 1.878
Prob(Omnibus): 0.807 Jarque-Bera (JB): 0.296
Skew: 0.133 Prob(JB): 0.863
Kurtosis: 2.995 Cond. No. 2.16
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)
pd.DataFrame as Input to statsmodels
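A sketch of packing the simulated arrays into a DataFrame with the column names shown below:

```python
import pandas as pd

data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
data['y'] = y
data.head()
```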
col0 col1 col2 y
0 -0.900506 -0.189430 -1.027870 -0.599527
1 0.799252 -1.545984 -0.327397 -0.588454
2 -0.550655 -0.120254 0.329359 0.185634
3 -0.163916 0.824040 0.208275 -0.007477
4 -0.047651 -0.213147 -0.048244 -0.015374
This can now be used much like R's lm – we use a formula to access the API:
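A sketch with the formula interface from statsmodels.formula.api; the output below is the estimated parameters:

```python
import statsmodels.formula.api as smf

results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
results.params
```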
Intercept -0.020799
col0 0.065813
col1 0.268970
col2 0.449419
dtype: float64
Note the intercept is fit by default, but we can drop it – and formulas also support interactions and transformations of the columns:
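One formula consistent with the parameter names below – the exact specification isn't shown, so treat this as an illustration:

```python
smf.ols('y ~ col0 + col1 + col2 + col0:col1 + np.exp(col1) - 1', data=data).fit().params
```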
col0 0.069201
col1 0.341687
col2 0.444801
col0:col1 -0.121129
np.exp(col1) -0.032190
dtype: float64
If you’re doing something in your formula that patsy doesn’t recognize by default, you can pass it within I() in the formula, which tells patsy to let Python handle it.
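A hypothetical example of I(): treat col1 + col2 as literal addition rather than formula syntax:

```python
smf.ols('y ~ col0 + I(col1 + col2)', data=data).fit().params
```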
We can also use the results object to predict on out-of-sample data
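The actual out-of-sample data isn't shown; a sketch of predicting on a new DataFrame with the same columns:

```python
new_data = pd.DataFrame({'col0': dnorm(0, 0.4, rng, size=5),
                         'col1': dnorm(0, 0.6, rng, size=5),
                         'col2': dnorm(0, 0.2, rng, size=5)})
results.predict(new_data)
```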
0 -0.631536
1 -0.475749
2 0.030740
3 0.305839
4 -0.124827
dtype: float64
patsy
In fact, patsy is quite useful for creating design matrices from formulae (again, very similar to R's model.matrix):
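One formula consistent with the design matrix below (intercept, col0, col2, and col2 squared):

```python
import patsy

X_design = patsy.dmatrix('col0 + col2 + I(col2 ** 2)', data)
np.asarray(X_design)[:5]
```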
array([[ 1. , -0.90050602, -1.0278702 , 1.05651715],
[ 1. , 0.79925205, -0.32739708, 0.10718885],
[ 1. , -0.55065483, 0.32935899, 0.10847735],
[ 1. , -0.16391555, 0.20827485, 0.04337841],
[ 1. , -0.04765129, -0.04824364, 0.00232745]])
Exercise: statsmodels
Simulate 1000 observations from
\[ y = 2 + 3\times\sin(x) + \varepsilon \]
where \(x\sim\mathcal{N}(0,1)\) and \(\varepsilon\sim\mathcal{N}(0,2)\). Then fit two statsmodels objects reporting the coefficients from a correctly specified regression (using \(\sin(x)\)) and from an incorrectly specified one (just using \(x\)).
Solution: statsmodels
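A sketch of one possible solution (variable names are my own, reusing rng and dnorm from above):

```python
n = 1000
x = rng.standard_normal(n)
sim = pd.DataFrame({'x': x,
                    'y': 2 + 3 * np.sin(x) + dnorm(0, 2, rng, size=n)})

print(smf.ols('y ~ np.sin(x)', data=sim).fit().summary())  # correctly specified
print(smf.ols('y ~ x', data=sim).fit().summary())          # misspecified
```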
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.678
Model: OLS Adj. R-squared: 0.678
Method: Least Squares F-statistic: 2100.
Date: Wed, 20 Nov 2024 Prob (F-statistic): 9.48e-248
Time: 09:48:23 Log-Likelihood: -1762.2
No. Observations: 1000 AIC: 3528.
Df Residuals: 998 BIC: 3538.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.0459 0.045 45.830 0.000 1.958 2.133
np.sin(x) 3.0660 0.067 45.830 0.000 2.935 3.197
==============================================================================
Omnibus: 0.087 Durbin-Watson: 2.064
Prob(Omnibus): 0.958 Jarque-Bera (JB): 0.055
Skew: 0.017 Prob(JB): 0.973
Kurtosis: 3.012 Cond. No. 1.50
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.579
Model: OLS Adj. R-squared: 0.578
Method: Least Squares F-statistic: 1371.
Date: Wed, 20 Nov 2024 Prob (F-statistic): 1.37e-189
Time: 09:48:23 Log-Likelihood: -1896.3
No. Observations: 1000 AIC: 3797.
Df Residuals: 998 BIC: 3806.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 2.0309 0.051 39.791 0.000 1.931 2.131
x 1.8869 0.051 37.033 0.000 1.787 1.987
==============================================================================
Omnibus: 14.047 Durbin-Watson: 2.019
Prob(Omnibus): 0.001 Jarque-Bera (JB): 23.471
Skew: 0.020 Prob(JB): 8.01e-06
Kurtosis: 3.750 Cond. No. 1.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
scikit-learn
Why scikit-learn? The scikit-learn API style is very accessible.
Our data is from a Kaggle competition about passenger survival rates on the Titanic:
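A sketch of loading the training data (the file name is an assumption):

```python
train = pd.read_csv('titanic_train.csv')
train.head()
```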
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
With a few exceptions, you can’t feed ML algorithms missing data, so we need to see the prevalence of missing data in our predictors:
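Presumably the share of missing values per column:

```python
train.isna().mean()
```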
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.198653
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 0.771044
Embarked 0.002245
dtype: float64
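A sketch of filling Age – median imputation is consistent with the pipeline later in the lecture, but the exact choice here is an assumption:

```python
train['Age'] = train['Age'].fillna(train['Age'].median())
train.isna().mean()
```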
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.000000
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 0.771044
Embarked 0.002245
dtype: float64
Sex
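The Sex column is a string, so we need a numeric indicator; a sketch producing the IsFemale column shown below:

```python
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
train[['IsFemale', 'Sex']].head(6)
```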
|   | IsFemale | Sex |
|---|---|---|
| 0 | 0 | male |
| 1 | 1 | female |
| 2 | 1 | female |
| 3 | 1 | female |
| 4 | 0 | male |
| 5 | 0 | male |
We have to pass np.arrays to sklearn, so we pick features and convert them to numpy:
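A sketch using the three predictors that appear in the array preview below:

```python
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].to_numpy()
y_train = train['Survived'].to_numpy()
X_train[:5]
```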
array([[ 3., 0., 22.],
[ 1., 1., 38.],
[ 3., 1., 26.],
[ 1., 1., 35.],
[ 3., 0., 35.]])
We will fit a logistic regression – this means that we have to import the model from the sklearn module linear_model and create an instance of the model type:
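For example (the instance name model is my own):

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
```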
We can then fit this model using the data and the fit() method, which is consistent across all of sklearn's models:
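A sketch with the arrays built above:

```python
model.fit(X_train, y_train)
```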
LogisticRegression()
Recall that the probability of survival in this case is being modeled as
\[ \mathbb{P}(\text{Survival}_i) = \frac{1}{1 + \exp(-(\beta^{\top}x_i))} = \frac{\exp(\beta^{\top}x_i)}{1 + \exp(\beta^{\top}x_i)} \]
Are the coefficients interpretable?
Finally, we can make predictions using the fitted model:
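The predictions below appear to be for the separate Kaggle test features; the file name and its preparation here are assumptions:

```python
test = pd.read_csv('titanic_test.csv')
test['Age'] = test['Age'].fillna(test['Age'].median())
test['IsFemale'] = (test['Sex'] == 'female').astype(int)
X_test = test[predictors].to_numpy()

model.predict(X_test[:10])
```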
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
We can also assess the fit using the score method (here we have no test data outcomes, so we reuse the training data):
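Presumably a call like:

```python
model.score(X_train, y_train)
```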
0.7878787878787878
This is the percentage of observations for which the model makes the correct prediction.
0.7878787878787878
Since we’re fitting a logistic regression model, we can also retrieve the coefficients:
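Via the coef_ attribute:

```python
model.coef_
```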
array([[-1.14157583, 2.51975386, -0.03272378]])
How should we interpret these coefficients (on ticket class, female, and age, respectively)?
Using the sklearn API More Fully
We did the Sex conversion above by hand with pandas, but scikit-learn has an API for doing this. We have two options – using pandas (pd.get_dummies) or using sklearn (preprocessing.OneHotEncoder). I'll use the former here, but the latter is also fine:
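A sketch with pd.get_dummies; drop_first=True keeps a single indicator per binary category, matching the array preview below:

```python
features = pd.get_dummies(train[['Pclass', 'Age', 'Sex']], columns=['Sex'], drop_first=True)
features.to_numpy()[:5]
```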
array([[3, 22.0, True],
[1, 38.0, False],
[3, 26.0, False],
[1, 35.0, False],
[3, 35.0, True]], dtype=object)
Pipeline
Let’s say that we’re comfortable using median imputation on any features that we pass to our model – then we can use a Pipeline:
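A sketch whose step names match the repr below; which feature matrix was used for fitting is an assumption:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('impute_median', SimpleImputer(strategy='median')),
                       ('logit', LogisticRegression())])
# note: if Age was already imputed above, the imputer is a no-op here
pipe.fit(features.to_numpy(), y_train)
```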
Pipeline(steps=[('impute_median', SimpleImputer(strategy='median')), ('logit', LogisticRegression())])
To access the output, we now have to specify which step in the pipeline we want to access
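For example, the coefficients of the logistic-regression step via named_steps:

```python
pipe.named_steps['logit'].coef_
```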
array([[-1.14120327, -0.03271306, -2.51954002]])
scikit-learn
Next, let's fit a LASSO on simulated data with a sparse coefficient vector (there is also a built-in cross-validated estimator, sklearn.linear_model.LassoCV, which we come back to below).
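The true coefficient vector below is presumably built like this; the simulated design, outcome, and sample size are not shown, so they are assumptions:

```python
p = 100
beta_true = np.zeros(p)
beta_true[:5] = np.arange(1, 6)

n = 50  # assumed; small relative to p so that regularization matters
X_sim = rng.standard_normal((n, p))
y_sim = X_sim @ beta_true + rng.standard_normal(n)
beta_true
```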
array([1., 2., 3., 4., 5., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Note that beta is “sparse” – only its first five entries are nonzero. For comparison, we first fit plain OLS with sklearn's LinearRegression:

LinearRegression()
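A sketch of the fit behind that repr, reusing the names from the simulation sketch above (ols reappears in the plotting exercise later):

```python
from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols.fit(X_sim, y_sim)
```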
Let’s generate unseen data:
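A sketch; the size of the unseen sample is an assumption. The number below is the OLS model's R² on this unseen data:

```python
X_unseen = rng.standard_normal((1000, p))
y_unseen = X_unseen @ beta_true + rng.standard_normal(1000)

ols.score(X_unseen, y_unseen)
```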
-0.40907911433195765
Implementation of five-fold CV for a parameter value of \(\alpha=2\)
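A sketch with cross_val_score (variable names from the simulation sketch above):

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

cross_val_score(Lasso(alpha=2.0), X_sim, y_sim, cv=5)
```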
array([0.05229446, 0.110592 , 0.1352647 , 0.06777857, 0.17660835])
What does each value in this array mean?
We want to look for the best parameter over a grid of possibilities. Let’s say we restrict ourselves to \(\alpha \in \{1,...,10\}\):
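A sketch mirroring the repr below:

```python
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(Lasso(alpha=2.0), param_grid={'alpha': np.arange(1, 11)})
grid.fit(X_sim, y_sim)
```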
GridSearchCV(estimator=Lasso(alpha=2.0), param_grid={'alpha': array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])})
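The full cross-validation results are stored on the fitted object and are easiest to read as a DataFrame:

```python
pd.DataFrame(grid.cv_results_)
```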
mean_fit_time std_fit_time mean_score_time std_score_time param_alpha \
0 0.000799 8.803250e-04 0.000202 0.000032 1
1 0.000295 9.296001e-06 0.000169 0.000001 2
2 0.000305 3.362246e-05 0.000219 0.000068 3
3 0.000388 1.317610e-04 0.000196 0.000037 4
4 0.000282 4.699689e-06 0.000164 0.000003 5
5 0.000269 2.802407e-06 0.000162 0.000002 6
6 0.000479 3.773684e-04 0.000207 0.000077 7
7 0.000279 9.279844e-06 0.000168 0.000006 8
8 0.000268 1.741598e-06 0.000161 0.000002 9
9 0.000265 7.599534e-07 0.000161 0.000002 10
params split0_test_score split1_test_score split2_test_score \
0 {'alpha': 1} 0.055458 0.219426 0.062059
1 {'alpha': 2} 0.052294 0.110592 0.135265
2 {'alpha': 3} 0.030939 0.059064 0.075362
3 {'alpha': 4} -0.002265 0.034352 0.051328
4 {'alpha': 5} -0.031806 0.003537 0.021333
5 {'alpha': 6} -0.047611 -0.006580 -0.001169
6 {'alpha': 7} -0.047611 -0.006580 -0.001169
7 {'alpha': 8} -0.047611 -0.006580 -0.001169
8 {'alpha': 9} -0.047611 -0.006580 -0.001169
9 {'alpha': 10} -0.047611 -0.006580 -0.001169
split3_test_score split4_test_score mean_test_score std_test_score \
0 0.084018 0.204363 0.125065 0.071683
1 0.067779 0.176608 0.108508 0.045115
2 -0.041256 0.074601 0.039742 0.043579
3 -0.135167 -0.028565 -0.016063 0.065751
4 -0.134277 -0.075044 -0.043251 0.056192
5 -0.134277 -0.075044 -0.052936 0.048913
6 -0.134277 -0.075044 -0.052936 0.048913
7 -0.134277 -0.075044 -0.052936 0.048913
8 -0.134277 -0.075044 -0.052936 0.048913
9 -0.134277 -0.075044 -0.052936 0.048913
rank_test_score
0 1
1 2
2 3
3 4
4 5
5 6
6 6
7 6
8 6
9 6
Lasso(alpha=1)
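The repr above is presumably grid.best_estimator_, the refit model at the best alpha; its coefficients (below) come from .coef_:

```python
grid.best_estimator_.coef_
```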
array([ 0.18018531, 0.19411377, 2.31230729, 2.60385006, 3.9903116 ,
0. , -0. , -0. , -0. , -0.55286155,
-0. , 0. , 0.89657092, 0. , -0. ,
0. , 0. , -0. , -0. , 0. ,
0. , 0. , 0. , -0. , -0.27817011,
-0. , -0. , -0. , -0. , -0. ,
-0. , 0. , -0. , 0. , -0. ,
-0. , -0. , 0.41305422, -0. , 0. ,
-0. , 0. , 0. , 0. , 0. ,
0. , 0. , -0. , -0. , 0.45830332,
-0. , 0. , -0. , -0. , 0. ,
-0. , -0.19840069, 0. , -0. , 0. ,
-0. , 0. , 0. , 0. , -0. ,
0. , -0. , -0. , 0. , -0. ,
-0. , -0. , 0. , 0.19424094, 0. ,
-0.99068853, 0. , -0. , -0.20650089, -0. ,
-0. , 0. , 0. , -0. , 0. ,
0. , -0. , -0. , 0.08357251, -0. ,
-0. , 0.0501038 , 0. , -0. , -0. ,
0. , -0. , 0. , 0. , 0. ])
Exercise: LassoCV
Look up the documentation for sklearn.linear_model.LassoCV and fit a cross-validated model directly with the API (feel free to just use the package defaults). What is the optimal value of \(\alpha\) and what is the score on our unseen data?
Solution: LassoCV
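A sketch of one possible solution, using the defaults and the earlier variable names:

```python
from sklearn.linear_model import LassoCV

lcv = LassoCV().fit(X_sim, y_sim)
lcv.alpha_                     # optimal alpha chosen by cross-validation
lcv.score(X_unseen, y_unseen)  # R² on the unseen data
```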
1.2160551211595514
0.23526655630512117
matplotlib
We will always import matplotlib like so:
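Presumably the conventional import:

```python
import matplotlib.pyplot as plt
```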
Basic plots are very easy! Let's generate some data and pass it directly to matplotlib:
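The actual data isn't shown; a minimal sketch:

```python
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
```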
The matplotlib API Broadly
Creating a figure:
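A sketch; the repr below reports a 960×480 figure, but the exact figsize/dpi aren't shown:

```python
fig = plt.figure()
```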
<Figure size 960x480 with 0 Axes>
Plenty more in the documentation
Using our OLS model and LASSO model from before, plot actual vs. predictions on unseen data from both models.
preds_lasso = grid.best_estimator_.predict(X_unseen)
preds_ols = ols.predict(X_unseen)
fig, axes = plt.subplots(1, 2)
axes[0].scatter(y_unseen, preds_ols)
axes[1].scatter(y_unseen, preds_lasso)
axes[0].set_title('OLS')
axes[1].set_title('LASSO')
for i in range(2):
    axes[i].set_xlabel('Actual')
    axes[i].set_ylabel('Predicted')
To illustrate how to annotate, we’ll create a plot of the closing S&P 500 price over the course of the financial crisis, with important dates called out. First, let’s get the data
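A sketch of loading the series (the file name is an assumption):

```python
spx = pd.read_csv('spx.csv', index_col='Date', parse_dates=True)
spx.head()
```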
SPX
Date
1990-02-01 328.79
1990-02-02 330.92
1990-02-05 331.85
1990-02-06 329.66
1990-02-07 333.75
This is enough to create a basic plot of the S&P 500 closing price
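For example:

```python
fig, ax = plt.subplots()
spx['SPX'].plot(ax=ax)
```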
We need to know what happened and when – we can keep the key events and their dates in a list.
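A sketch of the annotation pattern with one illustrative event – the actual list of events used in the lecture isn't shown:

```python
from datetime import datetime

crisis_events = [
    (datetime(2008, 9, 15), 'Lehman bankruptcy'),  # illustrative entry only
]

fig, ax = plt.subplots()
spx['SPX'].plot(ax=ax)
for date, label in crisis_events:
    level = spx['SPX'].asof(date)
    ax.annotate(label, xy=(date, level),
                xytext=(date, level + 300),
                arrowprops=dict(arrowstyle='->'))
```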
seaborn
The matplotlib API can be Inconvenient
We could keep building plots directly in matplotlib, but it's not ideal: seaborn works directly with pandas to leverage DataFrame and Series objects.
Series have a plot Method
plot Has Formatting Options
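A made-up illustration of the Series plot method and a few of its formatting options:

```python
s = pd.Series(rng.standard_normal(100).cumsum())
s.plot(style='k--', title='A random walk', figsize=(8, 4))
```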
A DataFrame is made up of Series…

A B C D
0 1.106728 -0.866214 0.235206 -1.853799
10 2.948470 -0.725233 0.686533 -1.808754
20 5.147404 -2.057531 -0.643525 -2.597403
30 4.132320 -1.565383 -0.225219 -4.784608
40 2.902674 -2.789219 0.611306 -6.792956
50 3.214318 -2.902681 1.248224 -5.860315
60 4.613268 -2.930443 0.319327 -5.002226
70 5.384007 -2.878670 1.494778 -4.990756
80 4.741819 -1.049899 0.849258 -5.290131
90 5.164763 -2.133666 1.681957 -4.424057
plot on a DataFrame
DataFrame Bar Chart
DataFrame Stacked Bar Chart
The next examples use a dataset of NBA shots:

|   | Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | match_id | shotX | shotY | quarter | time_remaining | player | team | made | shot_type | distance | score | opp | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 202203210BRK | 14.8 | 28.9 | 1st quarter | 11:34.0 | Donovan Mitchell | UTA | False | 3-pointer | 27 | 0-0 | 'UTA' | tied |
| 1 | 1 | 1 | 1 | 202203210BRK | 24.8 | 7.3 | 1st quarter | 11:06.0 | Donovan Mitchell | UTA | False | 2-pointer | 4 | 0-0 | 'UTA' | tied |
| 2 | 2 | 2 | 2 | 202203210BRK | 20.0 | 13.7 | 1st quarter | 10:26.0 | Mike Conley | UTA | True | 2-pointer | 11 | 2-2 | 'UTA' | tied |
shot_type made
2-pointer True 526
False 410
3-pointer False 400
True 217
Name: count, dtype: int64
unstack
We can reshape these counts into a wide DataFrame with unstack – this is the shape the bar-chart plot calls expect:

| shot_type | made = False | made = True |
|---|---|---|
| 2-pointer | 410 | 526 |
| 3-pointer | 400 | 217 |
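A sketch of how the counts and the bar charts might be produced (the DataFrame name shots is an assumption):

```python
counts = shots[['shot_type', 'made']].value_counts().unstack()

counts.plot.bar()              # grouped bars
counts.plot.bar(stacked=True)  # stacked bars
```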
Let’s create a heatmap of expected points by location on the floor:
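One way the pivot below could be built, with seaborn drawing the heatmap; the points computation and the lack of binning are assumptions:

```python
import seaborn as sns

shots['points'] = shots['made'] * np.where(shots['shot_type'] == '3-pointer', 3, 2)
pts = shots.pivot_table(values='points', index='shotY', columns='shotX', aggfunc='mean')

sns.heatmap(pts)
```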
(Preview of the points pivot table: 3 rows × 383 columns, indexed by shotY, with one column per raw shotX value from -0.3 to 48.2 – almost entirely NaN at this resolution.)
Note that pd.Series objects have a plot.hist() method.
Note that pd.Series objects have a plot.density() method.
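For example, on the shot distances (column name assumed):

```python
shots['distance'].plot.hist(bins=30)
shots['distance'].plot.density()
```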
\(\beta_{LASSO}\) solves the problem
\[ \min_{\hat{\beta}}\; \underbrace{\sum_{i=1}^n\left(y_i-x_i^{\top}\hat{\beta}\right)^2}_{\text{Sum of Squared Errors}} + \underbrace{\alpha\lVert\hat{\beta}\rVert_1}_{\text{Regularization Penalty}} \]