2024-05-01
Convention: classes get CamelCase names
We have a (kind of useless) class!
__init__()
This code runs whenever the object is created, and always takes at least one argument (self). Here, we’ll give our regression a name so that we can keep track of different specifications. This argument will become an attribute of the object within __init__():
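The snippet for this step is not reproduced in the text, so here is a minimal sketch of the idea. It assumes nothing beyond a single `name` argument; the value `'panel_reg'` is only an illustration.

```python
class LinearRegression:
    """A placeholder regression class that only stores a name."""

    def __init__(self, name):
        # __init__ runs once, when the object is created.
        # `self` is the new object; `name` is stored on it as an attribute.
        self.name = name


my_obj = LinearRegression("panel_reg")
print(my_obj.name)  # prints: panel_reg
```

Two objects created with different names keep separate `name` attributes, which is what lets us tell specifications apart.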
Emulating sklearn, let’s add a method for fitting a linear regression, and add the X and y data as attributes. We do this like so:
fit
array([[-0.02338684, 0.01298824, -0.00793553, ..., -0.02349427,
-0.01361545, -0.00545721],
[-0.08753511, -0.01996627, -0.05883339, ..., -0.08773467,
-0.06938418, -0.05422977],
[ 0.07353455, -0.01142486, 0.03744571, ..., 0.07378548,
0.05071202, 0.03165723],
[-0.4028155 , 0.23644813, -0.13127072, ..., -0.40470355,
-0.23109093, -0.08771625],
[ 0.30115639, -0.14433381, 0.11192219, ..., 0.30247214,
0.18148493, 0.08156994]])
What’s the issue?
((1000, 5), (1000, 1000), (5, 1000))
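One way shapes like these can arise is NumPy broadcasting: adding a 1-D array of length n to an (n, 1) column silently produces an (n, n) matrix. This is an assumption about the setup, since the data-generating code is not shown, and the variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5

X = rng.standard_normal((n, k))
beta_flat = np.arange(1.0, k + 1)   # shape (5,), a 1-D array
eps = rng.standard_normal((n, 1))   # shape (1000, 1), a column

# (1000,) + (1000, 1) broadcasts to (1000, 1000), not (1000, 1)
y_bad = X @ beta_flat + eps
beta_bad = np.linalg.inv(X.T @ X) @ X.T @ y_bad
print(X.shape, y_bad.shape, beta_bad.shape)

# Keeping beta as a column vector keeps y as a column vector
y_good = X @ beta_flat.reshape(-1, 1) + eps
print(y_good.shape)
```

No error is raised in the first case, which is why an explicit dimension check is worth having.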
Write a method for our class to ensure that X and y make sense dimensionally before computing beta
class LinearRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        # OLS closed form: (X'X)^{-1} X'y
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
my_obj = LinearRegression('panel_reg')
my_obj.fit(X, y)
ValueError: Dimensions are not correct
_check_dims
?_check_dims
array([[ 3.00064876],
[ 4.95711242],
[-1.98260416],
[ 6.01959136],
[ 1.50678939]])
class LinearRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y - self.X @ self.beta).flatten()
        # (X'X)^{-1} scaled by the residual sum of squares;
        # note that there is no n - k degrees-of-freedom divisor here
        self.cov = np.linalg.inv(self.X.T @ self.X) * np.inner(self.resid, self.resid)
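The `cov` attribute above multiplies the inverse of X'X by the residual sum of squares. The usual homoskedastic OLS estimator also divides that sum by n - k (observations minus coefficients). The following is a standalone sketch of that variant on simulated data; the coefficient values and the seed are assumptions chosen for illustration, not the data used in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5

# Simulated data with assumed "true" coefficients and unit-variance noise
X = rng.standard_normal((n, k))
beta_true = np.array([[3.0], [5.0], [-2.0], [6.0], [1.5]])
y = X @ beta_true + rng.standard_normal((n, 1))

xtx_inv = np.linalg.inv(X.T @ X)
beta_hat = xtx_inv @ X.T @ y
resid = (y - X @ beta_hat).flatten()

# Estimate the error variance with a degrees-of-freedom correction
sigma2_hat = np.inner(resid, resid) / (n - k)
cov = sigma2_hat * xtx_inv
ses = np.sqrt(np.diag(cov))
print(ses)
```

With unit-variance noise and n = 1000, these standard errors come out on the order of 1/sqrt(n).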
summary Method
class LinearRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y - self.X @ self.beta).flatten()
        self.cov = np.linalg.inv(self.X.T @ self.X) * np.inner(self.resid, self.resid)

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        # Standard errors are the square roots of the diagonal of cov
        ses = np.sqrt(np.diag(self.cov))
        data = {
            'coef': self.beta.flatten(),
            'se': ses,
            't-stat': self.beta.flatten() / ses
        }
        return pd.DataFrame(data)
summary Method
| | coef | se | t-stat |
|---|---|---|---|
| 0 | 3.000649 | 0.971147 | 3.089799 |
| 1 | 4.957112 | 0.962300 | 5.151319 |
| 2 | -1.982604 | 0.964765 | -2.055013 |
| 3 | 6.019591 | 0.973967 | 6.180491 |
| 4 | 1.506789 | 0.968623 | 1.555599 |
Please add a plotting method to our class that shows actual data against fitted data. Include in the title the \(R^2\) value, calculated as
\[ R^2 = 1 - \frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i - \overline{y})^2} \]
where \(\overline{y}\) is the sample mean of the \(y_i\) values.
from matplotlib import pyplot as plt


class LinearRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y - self.X @ self.beta).flatten()
        self.cov = np.linalg.inv(self.X.T @ self.X) * np.inner(self.resid, self.resid)

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        data = {
            'coef': self.beta.flatten(),
            'se': ses,
            't-stat': self.beta.flatten() / ses
        }
        return pd.DataFrame(data)

    def plot(self):
        """Plot fitted values against actual values, with R^2 in the title"""
        y_hat = self.X @ self.beta
        r2 = 1 - np.sum((self.y - y_hat)**2) / np.sum((self.y - np.mean(self.y))**2)
        fig, ax = plt.subplots()
        ax.scatter(self.y, y_hat)
        # 45-degree line: points on it are fitted exactly
        ax.axline((0, 0), slope=1, color='red')
        ax.set_xlabel('Actual')
        ax.set_ylabel('Fitted Value')
        ax.set_title(f'R2: {round(r2, 2)}')
        fig.show()
linear_regression.py
Let’s call this a completed module for our package pyedpyper, which will implement statistical methods:
pyedpyper/linear_regression.py
import pandas as pd
import numpy as np


class LinearRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y - self.X @ self.beta).flatten()
        self.cov = np.linalg.inv(self.X.T @ self.X) * np.inner(self.resid, self.resid)

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        data = {
            'coef': self.beta.flatten(),
            'se': ses,
            't-stat': self.beta.flatten() / ses
        }
        return pd.DataFrame(data)
\[ \min_{\hat{\beta}}\; (y-\mathbb{X}\hat{\beta})^{\top}(y-\mathbb{X}\hat{\beta}) + \alpha \hat{\beta}^{\top}\hat{\beta} \]
Ridge is nice because it has a closed-form solution:
\[ \hat{\beta} = \left(\mathbb{X}^{\top}\mathbb{X} + \alpha \mathbb{I}\right)^{-1}\mathbb{X}^{\top}y \]
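A short sketch of where this comes from: with the objective written in matrix form as \((y-\mathbb{X}\hat{\beta})^{\top}(y-\mathbb{X}\hat{\beta}) + \alpha\hat{\beta}^{\top}\hat{\beta}\), setting its gradient with respect to \(\hat{\beta}\) to zero gives
\[ -2\mathbb{X}^{\top}(y-\mathbb{X}\hat{\beta}) + 2\alpha\hat{\beta} = 0 \quad\Longrightarrow\quad \left(\mathbb{X}^{\top}\mathbb{X} + \alpha \mathbb{I}\right)\hat{\beta} = \mathbb{X}^{\top}y \]
For \(\alpha > 0\) the matrix \(\mathbb{X}^{\top}\mathbb{X} + \alpha \mathbb{I}\) is positive definite and therefore invertible, which yields the closed form. Setting \(\alpha = 0\) recovers the OLS formula whenever \(\mathbb{X}^{\top}\mathbb{X}\) is itself invertible.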
ridge_regression.py
We can adapt our existing code pretty easily.
class RidgeRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions and that alpha is a number"""
        return (
            self.X.shape[0] == self.y.shape[0]
            and self.y.shape[1] == 1
            and isinstance(self.alpha, (float, int))
        )

    def fit(self, X: np.ndarray, y: np.ndarray, alpha: float):
        """Fit a ridge regression"""
        self.X = X
        self.y = y
        self.alpha = alpha
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        # Closed form: (X'X + alpha * I)^{-1} X'y
        denom = np.linalg.inv(self.X.T @ self.X + self.alpha * np.eye(self.X.shape[1]))
        num = self.X.T @ self.y
        self.beta = denom @ num

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        data = {
            'coef': self.beta.flatten()
        }
        return pd.DataFrame(data)
| | coef |
|---|---|
| 0 | 2.869855 |
| 1 | 4.736071 |
| 2 | -1.898266 |
| 3 | 5.737582 |
| 4 | 1.450930 |
ridge_regression.py
import pandas as pd
import numpy as np


class RidgeRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions and that alpha is a number"""
        return (
            self.X.shape[0] == self.y.shape[0]
            and self.y.shape[1] == 1
            and isinstance(self.alpha, (float, int))
        )

    def fit(self, X: np.ndarray, y: np.ndarray, alpha: float):
        """Fit a ridge regression"""
        self.X = X
        self.y = y
        self.alpha = alpha
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        denom = np.linalg.inv(self.X.T @ self.X + self.alpha * np.eye(self.X.shape[1]))
        num = self.X.T @ self.y
        self.beta = denom @ num

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        data = {
            'coef': self.beta.flatten()
        }
        return pd.DataFrame(data)
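__init__.py file
An __init__.py file (it can be empty) marks a directory as a package, so that the modules inside it can be imported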
pyedpyper/
    __init__.py
    models/
        __init__.py
        linear_regression.py
        ridge_regression.py
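To see how a layout like this maps onto import names, here is a self-contained sketch that builds a throwaway package in a temporary directory and imports from it. The package name `demo_pkg` is hypothetical and is not part of pyedpyper.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build demo_pkg/models/linear_regression.py with empty __init__.py files
    models_dir = Path(tmp) / "demo_pkg" / "models"
    models_dir.mkdir(parents=True)
    (models_dir.parent / "__init__.py").write_text("")
    (models_dir / "__init__.py").write_text("")
    (models_dir / "linear_regression.py").write_text(
        "class LinearRegression:\n    pass\n"
    )

    # Make the temporary directory importable, then import by dotted name:
    # folders become packages and the .py suffix is dropped
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    module = importlib.import_module("demo_pkg.models.linear_regression")
    sys.path.remove(tmp)

print(module.__name__)
print(module.LinearRegression.__name__)
```

In normal use the package is installed or sits on the path already, so a plain `from package.subpackage.module import Name` statement does the same job.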
Import: from pyedpyper.models.linear_regression import LinearRegression (linear_regression.py becomes linear_regression)
For RidgeRegression, we just copy-pasted a ton of code from LinearRegression
utils.py
A common module that contains functions that are used by different pieces of code, generally somewhere it’s easily accessible by other modules.
We could put this function in utils.py
kwargs
We might not know which keyword (kw) arguments (args) will be passed by a user, or we’d like to leave all options available. We can do this by having **kwargs be a function argument (essentially an unpacked dictionary) that we use in the function.
<class 'dict'>
name: this, value: is
name: a, value: test
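Output like the three lines above can be produced by a function along these lines. The function itself is not shown in the text, so the name `show_kwargs` is hypothetical.

```python
def show_kwargs(**kwargs):
    """Print the type of kwargs and each name/value pair it holds."""
    # Inside the function, kwargs is an ordinary dict of name -> value
    print(type(kwargs))
    for name, value in kwargs.items():
        print(f"name: {name}, value: {value}")
    return kwargs


received = show_kwargs(this="is", a="test")
```

Because `kwargs` is a plain dictionary, the function can loop over it without knowing the argument names in advance.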
pyedpyper/
    __init__.py
    utils.py
    models/
        __init__.py
        linear_regression.py
        ridge_regression.py
utils.py
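The contents of utils.py are not reproduced here. A minimal sketch of what a shared `summary` function could look like, assuming it takes arbitrary keyword arguments and turns each one into a column (the flattening step is an assumption, added so that column vectors of shape (k, 1) are accepted):

```python
import numpy as np
import pandas as pd


def summary(**kwargs) -> pd.DataFrame:
    """Build a table with one column per keyword argument."""
    # Flatten each value so that (k, 1) column vectors become 1-D columns
    return pd.DataFrame(
        {name: np.asarray(value).flatten() for name, value in kwargs.items()}
    )


table = summary(beta=np.array([[1.0], [2.0]]), se=np.array([0.1, 0.2]))
print(table)
```

Both regression classes can then call the same function with whichever statistics they have, instead of each building its own DataFrame.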
linear_regression.py
import pandas as pd
import numpy as np
from .. import utils as ut


class LinearRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y - self.X @ self.beta).flatten()
        self.cov = np.linalg.inv(self.X.T @ self.X) * np.inner(self.resid, self.resid)

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        # Delegate table construction to the shared utility
        return ut.summary(
            beta=self.beta, se=ses, t_stat=self.beta.flatten() / ses
        )
ridge_regression.py
import pandas as pd
import numpy as np
from .. import utils as ut


class RidgeRegression:
    def __init__(self, name):
        self.name = name

    def _check_dims(self):
        """Check input dimensions and that alpha is a number"""
        return (
            self.X.shape[0] == self.y.shape[0]
            and self.y.shape[1] == 1
            and isinstance(self.alpha, (float, int))
        )

    def fit(self, X: np.ndarray, y: np.ndarray, alpha: float):
        """Fit a ridge regression"""
        self.X = X
        self.y = y
        self.alpha = alpha
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        denom = np.linalg.inv(self.X.T @ self.X + self.alpha * np.eye(self.X.shape[1]))
        num = self.X.T @ self.y
        self.beta = denom @ num

    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        return ut.summary(beta=self.beta)
setup.py
A setup.py file tells pip how to install the package. With it in place, the package can be installed directly from GitHub:
pip install git+<your_github_https_url_here.git>
In this case:
pip install git+https://github.com/lukas-hager/pyedpyper.git
NB: You may need to use pip3 instead of pip depending on your system configuration
pytest
pytest makes this easy to do.
Our directory looks like this
pyedpyper/
    __init__.py
    models/
        __init__.py
        linear_regression.py
        ridge_regression.py
Our new directory will look like this
pyedpyper/
    __init__.py
    models/
        __init__.py
        linear_regression.py
        ridge_regression.py
    tests/
Recall our linear and ridge regression modules. Think of what kind of tests we could run to make sure that we get correct results. In this case, think about “ground truth” results – what are some datasets we could pass our code where we “know” the results?
statsmodels
pytest will look for tests within files that are prefixed by “test_” (and, inside them, functions prefixed by “test”)
Tests are built around assert statements that will error out if the asserted condition is not met
Suppose we had a function like this
No errors
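The function from this slide is not reproduced in the text. As a stand-in, here is a minimal sketch with a hypothetical `add_one` function and a test for it, showing that passing assertions produce no output at all.

```python
def add_one(x):
    """Hypothetical function under test."""
    return x + 1


def test_add_one():
    # Each assert is silent when true and raises AssertionError when false
    assert add_one(1) == 2
    assert add_one(-1) == 0


test_add_one()
print("No errors")
```

If this lived in a file named with the `test_` prefix, pytest would find and run `test_add_one` without the explicit call at the bottom.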
pytest
pytest <filename if you want to test a specific file>
atol: absolute tolerance
rtol: relative tolerance
abs(a - b) <= (atol + rtol * abs(b))
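This is the comparison that `np.allclose` applies element by element. A small sketch with made-up numbers shows how the two tolerances behave when used separately:

```python
import numpy as np

a = np.array([1.0005, 2.0])
b = np.array([1.0, 2.0])

# Relative tolerance only: the allowed gap scales with abs(b)
loose = np.allclose(a, b, rtol=1e-3, atol=0)   # 0.0005 <= 0.001 * 1.0

# Absolute tolerance only: the allowed gap is fixed
strict = np.allclose(a, b, rtol=0, atol=1e-6)  # 0.0005 > 1e-6

print(loose, strict)  # prints: True False
```

Relative tolerance is usually the natural choice when values differ in scale, while absolute tolerance is the one that still works when the expected values are zero or near zero.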
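import numpy as np
import pytest
from pyedpyper.models.linear_regression import LinearRegression
from pyedpyper.models.ridge_regression import RidgeRegression

rng = np.random.default_rng()


def test_error_lr():
    X = rng.standard_normal((1000, 5))
    y1 = rng.standard_normal((1001, 1))  # wrong number of rows
    y2 = rng.standard_normal((1000, 2))  # more than one column
    lr = LinearRegression('ols')
    with pytest.raises(ValueError):
        lr.fit(X, y1)
    with pytest.raises(ValueError):
        lr.fit(X, y2)


def test_error_rr():
    X = rng.standard_normal((1000, 5))
    y1 = rng.standard_normal((1001, 1))
    y2 = rng.standard_normal((1000, 2))
    rr = RidgeRegression('ridge')
    with pytest.raises(ValueError):
        rr.fit(X, y1, 1)
    with pytest.raises(ValueError):
        rr.fit(X, y2, 1)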
tests folder
pytest on our module
test_linear_regression.py
import pytest
import numpy as np
from pyedpyper.models.linear_regression import LinearRegression
from pyedpyper.models.ridge_regression import RidgeRegression

rng = np.random.default_rng()


def test_fit_no_error():
    # With no noise, OLS should recover beta (almost) exactly
    X = rng.random((1000, 5))
    beta = np.arange(1., 6.).reshape(-1, 1)
    y = X @ beta
    lr = LinearRegression('ols')
    lr.fit(X, y)
    assert np.allclose(lr.beta, beta, rtol=.001, atol=0)


def test_comparison():
    # Ridge with alpha = 0 should match OLS
    X = rng.random((1000, 5))
    beta = np.arange(1., 6.).reshape(-1, 1)
    y = X @ beta
    lr = LinearRegression('ols')
    lr.fit(X, y)
    rr = RidgeRegression('ridge')
    rr.fit(X, y, 0)
    assert np.allclose(lr.beta, rr.beta, rtol=0, atol=1e-6)


def test_error_lr():
    X = rng.random((1000, 5))
    y1 = rng.random((1001, 1))
    y2 = rng.random((1000, 2))
    lr = LinearRegression('ols')
    with pytest.raises(ValueError):
        lr.fit(X, y1)
    with pytest.raises(ValueError):
        lr.fit(X, y2)
test_ridge_regression.py
import pytest
import numpy as np
from pyedpyper.models.ridge_regression import RidgeRegression

rng = np.random.default_rng()


def test_error_rr():
    X = rng.random((1000, 5))
    y1 = rng.random((1001, 1))
    y2 = rng.random((1000, 2))
    rr = RidgeRegression('ridge')
    with pytest.raises(ValueError):
        rr.fit(X, y1, 1)
    with pytest.raises(ValueError):
        rr.fit(X, y2, 1)
Within the package folder we can run pytest to get the test results:
=============================================================================================================== test session starts ================================================================================================================
platform darwin -- Python 3.8.5, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/hlukas/git/pyedpyper
plugins: csv-3.0.0
collected 4 items
tests/test_linear_regression.py ... [ 75%]
tests/test_ridge_regression.py . [100%]
DataFrame
Look at the documentation for pd.testing.assert_frame_equal and use it to write a unit test for our utils.summary method.
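One possible shape for such a test is sketched below. Since utils.py is not reproduced in the text, the `summary` function here is an assumed stand-in defined inline; in the package, the test would import the real one instead.

```python
import numpy as np
import pandas as pd


def summary(**kwargs) -> pd.DataFrame:
    """Stand-in for utils.summary (an assumed implementation)."""
    return pd.DataFrame(
        {name: np.asarray(value).flatten() for name, value in kwargs.items()}
    )


def test_summary():
    beta = np.array([[1.0], [2.0], [3.0]])
    se = np.array([0.5, 0.5, 0.5])
    expected = pd.DataFrame({"beta": [1.0, 2.0, 3.0], "se": [0.5, 0.5, 0.5]})
    # Raises AssertionError if values, dtypes, or labels differ
    pd.testing.assert_frame_equal(summary(beta=beta, se=se), expected)


test_summary()
```

`assert_frame_equal` is stricter than comparing values alone: by default it also checks column names, index, and dtypes, so the expected frame has to be built with matching types.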
DataFrame
utils Test File
Again, we can run this with pytest:
=============================================================================================================== test session starts ================================================================================================================
platform darwin -- Python 3.8.5, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/hlukas/git/pyedpyper
plugins: csv-3.0.0
collected 5 items
tests/test_linear_regression.py ... [ 60%]
tests/test_ridge_regression.py . [ 80%]
tests/test_utils.py . [100%]
pytest-cov
pytest can report test coverage (the share of our code that the tests actually execute) directly if we install a second package:
pip install pytest-cov
pytest-cov
pytest --cov
============================================================================= test session starts ==============================================================================
platform darwin -- Python 3.8.5, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/hlukas/git/pyedpyper
plugins: cov-5.0.0, csv-3.0.0
collected 5 items
tests/test_linear_regression.py ... [ 60%]
tests/test_ridge_regression.py . [ 80%]
tests/test_utils.py . [100%]
---------- coverage: platform darwin, python 3.8.5-final-0 -----------
Name Stmts Miss Cover
-----------------------------------------------------------
pyedpyper/models/__init__.py 0 0 100%
pyedpyper/models/linear_regression.py 19 2 89%
pyedpyper/models/ridge_regression.py 19 1 95%
pyedpyper/utils.py 3 0 100%
tests/__init__.py 0 0 100%
tests/test_linear_regression.py 30 0 100%
tests/test_ridge_regression.py 13 0 100%
tests/test_utils.py 7 0 100%
-----------------------------------------------------------
TOTAL 91 3 97%
pip install pre-commit
We’ll add a file called .pre-commit-config.yaml to our package that tells pre-commit what we want it to look for (this is the baseline file that can be found here):
.pre-commit-config.yaml
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
-   repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
    -   id: black
For example, this will remove trailing whitespace and make sure every file ends with a single newline; black additionally reformats Python code to a consistent style.
We can run pre-commit install to set up our new hooks
Check Yaml...............................................................Passed
Fix End of Files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook
Fixing .pre-commit-config.yaml
Fixing tests/test_utils.py
Fixing tests/test_linear_regression.py
Fixing tests/test_ridge_regression.py
Trim Trailing Whitespace.................................................Passed
black....................................................................Failed
- hook id: black
- files were modified by this hook
reformatted tests/test_utils.py
reformatted tests/test_ridge_regression.py
reformatted tests/test_linear_regression.py
All done! ✨ 🍰 ✨
3 files reformatted, 5 files left unchanged.
Remember that we can lint our code to make it conform stylistically – this is something that we can add to our pre-commit hook
.pre-commit-config.yaml
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
-   repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
    -   id: black
-   repo: https://github.com/pylint-dev/pylint
    rev: v3.1.0
    hooks:
    -   id: pylint
We get output like this:
************* Module tests.test_ridge_regression
tests/test_ridge_regression.py:1:0: C0114: Missing module docstring (missing-module-docstring)
tests/test_ridge_regression.py:9:0: C0116: Missing function or method docstring (missing-function-docstring)
tests/test_ridge_regression.py:10:4: C0103: Variable name "X" doesn't conform to snake_case naming style (invalid-name)
environment.yml
A file that lists out the dependencies for our virtual environment:
environment.yml
name: pyedpyper
dependencies:
- numpy
- pandas
- pre-commit
- pylint
conda env create -f environment.yml
conda activate pyedpyper
pyedpyper
The repository with all of these files can be found here