Writing Modules and Testing

Lukas Hager

2024-11-28

Learning Objectives

Understand how to write classes, modules, and packages in Python
Be able to create a comprehensive package of modules
Be familiar with the use of unit testing and how to do so in Python

Class

What is a Class?

Broadly: an object constructor (sometimes people think of it as a blueprint)
The object has attributes and methods
Attributes:
- Data associated with the class
Methods:
- Functions associated with the class

Why?

Bundles functionality and attributes
DRY
Object-Oriented Programming
- Procedural programming is organized around functions and logic
- OOP is based on creating objects that have data and functions

Example

Cars
- Two cars are different, but will share attributes (both have wheels, steering wheels, brakes, etc.)
- The object would be a car, and we can fill in the specific attributes later
Models
- Two regression models are different, but share attributes (both produce coefficients, standard errors, fitted values)
- The object would be the model, and we can fill in the specific attributes later (the data)

Creating a Class

Convention: classes get CamelCase names (also called CapWords)

class LinearRegression:
    def __init__(self):
        return None

We have a (kind of useless) class!

my_obj = LinearRegression()

`__init__()`

This code runs whenever the object is created, and always takes at least one argument (self). Here, we’ll give our regression a name so that we can keep track of different specifications. This argument will become an attribute of the class within __init__():

class LinearRegression:
    def __init__(self, name = 'ols'):
        self.name = name

We can access this attribute once the object is created:

my_obj = LinearRegression('panel_reg')
my_obj.name

'panel_reg'

Adding a Method

Emulating sklearn, let’s add a method for fitting a linear regression, and add the X and y data as attributes. We do this like so:

import numpy as np

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        # OLS closed form: (X'X)^{-1} X'y
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
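The explicit inverse in `fit` mirrors the textbook formula, but it can be numerically fragile when columns of X are nearly collinear. As a minimal sketch (not part of the class above, and using made-up data and the hypothetical names `X_demo` and `y_demo`), the same coefficients can be obtained with `np.linalg.solve` or `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical data: 200 observations, 3 regressors
X_demo = rng.standard_normal(size=(200, 3))
y_demo = X_demo @ np.array([[1.0], [-2.0], [0.5]]) + rng.standard_normal(size=(200, 1))

# Textbook formula, as used in fit()
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo
# Solve the normal equations without forming the inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
# Least-squares solver that works directly on X
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data like this, all three approaches agree to floating-point precision.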

Trying out `fit`

rng = np.random.default_rng(seed=481)
X = rng.standard_normal(size=(1000,5))
epsilon = rng.standard_normal(size=(1000,1))
y = X @ np.array([3., 5., -2, 6., 1.5]) + epsilon

my_obj = LinearRegression('panel_reg')
my_obj.fit(X, y)
my_obj.beta

array([[-0.02338684,  0.01298824, -0.00793553, ..., -0.02349427,
        -0.01361545, -0.00545721],
       [-0.08753511, -0.01996627, -0.05883339, ..., -0.08773467,
        -0.06938418, -0.05422977],
       [ 0.07353455, -0.01142486,  0.03744571, ...,  0.07378548,
         0.05071202,  0.03165723],
       [-0.4028155 ,  0.23644813, -0.13127072, ..., -0.40470355,
        -0.23109093, -0.08771625],
       [ 0.30115639, -0.14433381,  0.11192219, ...,  0.30247214,
         0.18148493,  0.08156994]])

What’s the issue?

Dimensions

my_obj.X.shape, my_obj.y.shape, my_obj.beta.shape

((1000, 5), (1000, 1000), (5, 1000))
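These shapes come from NumPy broadcasting: multiplying X by a 1-D coefficient vector returns a 1-D array, and adding a column vector to a 1-D array expands the result into a square matrix. A small sketch with hypothetical sizes (4 observations, 2 regressors) makes this visible:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X_small = rng.standard_normal(size=(4, 2))
eps_small = rng.standard_normal(size=(4, 1))

# 1-D coefficient vector: the product is 1-D, shape (4,)
y_1d = X_small @ np.array([3.0, 5.0])
# (4,) + (4, 1) broadcasts to (4, 4): every error is added to every fitted value
y_bad = y_1d + eps_small
# Column-vector coefficients keep everything 2-D, shape (4, 1)
y_good = X_small @ np.array([3.0, 5.0]).reshape(-1, 1) + eps_small

print(y_1d.shape, y_bad.shape, y_good.shape)
```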

Exercise: Input Sanitization

Write a method for our class to ensure that X and y make sense dimensionally before computing beta

Solution: Input Sanitization

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1
    def fit(self, X: np.array, y: np.array):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y

my_obj = LinearRegression('panel_reg')
my_obj.fit(X, y)

ValueError: Dimensions are not correct

Why `_check_dims`?

We’d consider this method “private”
- Users should not need to call _check_dims
As such, we put an underscore at the beginning to let them know that they can ignore it
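The leading underscore is a naming convention, not an access restriction. The sketch below uses a hypothetical `Demo` class to show that Python still lets a user call the method directly:

```python
class Demo:
    def _helper(self):
        """Internal helper, 'private' by convention only."""
        return 'called'

    def run(self):
        """Public method that relies on the internal helper."""
        return self._helper()

demo = Demo()
# Python does not enforce privacy: the underscore is only a signal to readers
print(demo.run(), demo._helper())
```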

Trying Method with Correct Dimensions

rng = np.random.default_rng(seed=481)
X = rng.standard_normal(size=(1000,5))
epsilon = rng.standard_normal(size=(1000,1))
y = X @ np.array([3., 5., -2, 6., 1.5]).reshape(-1,1) + epsilon

my_obj = LinearRegression('panel_reg')
my_obj.fit(X, y)
my_obj.beta

array([[ 3.00064876],
       [ 4.95711242],
       [-1.98260416],
       [ 6.01959136],
       [ 1.50678939]])

Adding SEs to Fit

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1
    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y-self.X @ self.beta).flatten()
        # Scale (X'X)^{-1} by the residual variance: SSR over degrees of freedom (n - k)
        self.cov = np.linalg.inv(self.X.T@self.X) * (
            np.inner(self.resid, self.resid) / (self.X.shape[0] - self.X.shape[1])
        )

Adding a `summary` Method

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1
    def fit(self, X: np.array, y: np.array):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y-self.X @ self.beta).flatten()
        # Scale (X'X)^{-1} by the residual variance: SSR over degrees of freedom (n - k)
        self.cov = np.linalg.inv(self.X.T@self.X) * (
            np.inner(self.resid, self.resid) / (self.X.shape[0] - self.X.shape[1])
        )
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        data = {
            'coef': self.beta.flatten(),
            'se': ses,
            't-stat': self.beta.flatten() / ses
        }
        return pd.DataFrame(data)

Trying our `summary` Method

my_obj = LinearRegression('panel_reg')
my_obj.fit(X, y)
my_obj.summary()

	coef	se	t-stat
0	3.000649	0.030787	97.46
1	4.957112	0.030507	162.49
2	-1.982604	0.030585	-64.82
3	6.019591	0.030877	194.96
4	1.506789	0.030707	49.07

Exercise: Plot

Please add a plotting method to our class that shows actual data against fitted data. Include in the title the \(R^2\) value, calculated as

\[ R^2 = 1 - \frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i - \overline{y})^2} \]

where \(\overline{y}\) is the sample mean of the \(y_i\) values.

Solutions: Plot (Class)

from matplotlib import pyplot as plt

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1
    def fit(self, X: np.array, y: np.array):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y-self.X@self.beta).flatten()
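        # Scale (X'X)^{-1} by the residual variance: SSR over degrees of freedom (n - k)
        self.cov = np.linalg.inv(self.X.T@self.X) * (
            np.inner(self.resid, self.resid) / (self.X.shape[0] - self.X.shape[1])
        )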
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        data = {
            'coef': self.beta.flatten(),
            'se': ses,
            't-stat': self.beta.flatten() / ses
        }
        return pd.DataFrame(data)
    def plot(self):
        y_hat = self.X @ self.beta
        r2 = 1 - np.sum((self.y - y_hat)**2) / np.sum((self.y-np.mean(self.y))**2)
        fig,ax = plt.subplots()
        ax.scatter(self.y, self.X @ self.beta)
        ax.axline((0,0), slope=1, color='red')
        ax.set_xlabel('Actual')
        ax.set_ylabel('Fitted Value')
        ax.set_title(f'R2: {round(r2, 2)}')
        fig.show()

Solutions: Plot (Use)

my_reg = LinearRegression('ols')
my_reg.fit(X,y)
my_reg.plot()

Packages

`linear_regression.py`

Let’s call this a completed module for our package pyedpyper which will implement statistical methods:

pyedpyper/linear_regression.py

import pandas as pd
import numpy as np

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1
    def fit(self, X: np.array, y: np.array):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y-self.X@self.beta).flatten()
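        # Scale (X'X)^{-1} by the residual variance: SSR over degrees of freedom (n - k)
        self.cov = np.linalg.inv(self.X.T@self.X) * (
            np.inner(self.resid, self.resid) / (self.X.shape[0] - self.X.shape[1])
        )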
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        data = {
            'coef': self.beta.flatten(),
            'se': ses,
            't-stat': self.beta.flatten() / ses
        }
        return pd.DataFrame(data)

Ridge Regression

Recall LASSO, where we regularized a regression for prediction
Ridge does the same thing, but with a different regularization penalty
- LASSO penalizes the sum of the absolute values of the coefficients
- Ridge penalizes the sum of the squared coefficients

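\[ \min_{\hat{\beta}}\; \sum_{i=1}^n\left(y_i-x_i^{\top}\hat{\beta}\right)^2 + \alpha \sum_{j=1}^k\hat{\beta}_j^2 \]

where \(x_i^{\top}\) is the \(i\)-th row of \(\mathbb{X}\) and \(k\) is the number of columns of \(\mathbb{X}\).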

Ridge Solution

Ridge is nice because it has a closed-form solution:

\[ \hat{\beta} = \left(\mathbb{X}^{\top}\mathbb{X} + \alpha \mathbb{I}\right)^{-1}\mathbb{X}^{\top}y \]
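One way to convince ourselves of the closed form is to check that the gradient of the penalized objective vanishes at the solution. The sketch below uses made-up data and a hypothetical penalty `alpha = 2.0`, and also confirms that the ridge coefficients are shrunk relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical data: 100 observations, 3 regressors
X_demo = rng.standard_normal(size=(100, 3))
y_demo = rng.standard_normal(size=(100, 1))
alpha = 2.0

k = X_demo.shape[1]
# Closed-form ridge solution: (X'X + alpha I)^{-1} X'y
beta_ridge = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(k), X_demo.T @ y_demo)

# Gradient of the penalized objective: -2 X'(y - X beta) + 2 alpha beta
grad = -2 * X_demo.T @ (y_demo - X_demo @ beta_ridge) + 2 * alpha * beta_ridge
# OLS solution for comparison (alpha = 0)
beta_ols = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

print(np.allclose(grad, 0))
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```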

Adding `ridge_regression.py`

We can adapt our existing code pretty easily.

class RidgeRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1\
             and isinstance(self.alpha, (float,int))
    def fit(self, X: np.array, y: np.array, alpha: float):
        """Fit a ridge regression"""
        self.X = X
        self.y = y
        self.alpha = alpha
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        denom = np.linalg.inv(self.X.T @ self.X + self.alpha * np.eye(self.X.shape[1]))
        num = self.X.T @ self.y
        self.beta = denom @ num
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        data = {
            'coef': self.beta.flatten()
        }
        return pd.DataFrame(data)

Trying Our Class

ridge_obj = RidgeRegression('ridge')
ridge_obj.fit(X,y,50)
ridge_obj.summary()

	coef
0	2.869855
1	4.736071
2	-1.898266
3	5.737582
4	1.450930

`ridge_regression.py`

ridge_regression.py

import pandas as pd
import numpy as np

class RidgeRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1\
             and isinstance(self.alpha, (float,int))
    def fit(self, X: np.array, y: np.array, alpha: float):
        """Fit a ridge regression"""
        self.X = X
        self.y = y
        self.alpha = alpha
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        denom = np.linalg.inv(self.X.T @ self.X + self.alpha * np.eye(self.X.shape[1]))
        num = self.X.T @ self.y
        self.beta = denom @ num
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        data = {
            'coef': self.beta.flatten()
        }
        return pd.DataFrame(data)

Creating a Package

It would be nice if we could run something like

from pyedpyper import linear_regression, ridge_regression

To do this, we need to arrange the files properly in a folder and add an __init__.py file

`__init__.py`

This file marks a directory as a package, so Python looks for modules within the folder
It can be empty; its presence alone tells Python to treat the folder as a package

File Structure

pyedpyper/
- __init__.py
- models/
  - __init__.py
  - linear_regression.py
  - ridge_regression.py

Import:

from pyedpyper.models import linear_regression, ridge_regression

Notes

The folder that contains the package needs to be on Python's module search path (sys.path, which is separate from the PATH environment variable). We can add it like so:

import sys
sys.path.append('/my/filepath/to/package')

We import the file names without the extension (linear_regression.py becomes linear_regression)
We can then reference the classes we’ve created within the file:

my_obj = linear_regression.LinearRegression('ols')
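To see the whole import mechanism in one place, the sketch below builds a throwaway package in a temporary folder and imports from it. The package name `demo_pkg` and its contents are hypothetical stand-ins for pyedpyper; the layout mirrors the file structure above.

```python
import importlib
import pathlib
import sys
import tempfile

# Build a throwaway package (hypothetical name demo_pkg) in a temporary folder
root = pathlib.Path(tempfile.mkdtemp())
models_dir = root / 'demo_pkg' / 'models'
models_dir.mkdir(parents=True)
(root / 'demo_pkg' / '__init__.py').write_text('')
(models_dir / '__init__.py').write_text('')
(models_dir / 'linear_regression.py').write_text(
    'class LinearRegression:\n'
    '    def __init__(self, name):\n'
    '        self.name = name\n'
)

# Add the folder that CONTAINS demo_pkg (not demo_pkg itself) to the search path
sys.path.append(str(root))
importlib.invalidate_caches()

from demo_pkg.models import linear_regression

my_obj = linear_regression.LinearRegression('ols')
print(my_obj.name)
```

Note that the path added to `sys.path` is the parent directory of the package, since Python resolves `demo_pkg` by looking for a folder of that name inside each search-path entry.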

Problem

To create RidgeRegression we just copy-pasted a ton of code from LinearRegression
We want to avoid repeating ourselves when possible
How can we do this?

`utils.py`

A common module that contains functions that are used by different pieces of code, generally somewhere it’s easily accessible by other modules.

Example

We could put this function in utils.py

def summary(**kwargs):
    """Produce a regression table"""
    return pd.DataFrame(kwargs)

Aside: `kwargs`

We might not know which keyword (kw) arguments (args) a user will pass, or we’d like to leave all options available. Adding **kwargs to a function's parameters collects any extra keyword arguments into a dictionary that we can use inside the function.

def my_fun(**kwargs):
    print(type(kwargs))
    for name in kwargs.keys():
        print(f'name: {name}, value: {kwargs[name]}')

my_fun(this = 'is', a = 'test')

<class 'dict'>
name: this, value: is
name: a, value: test
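The same `**` syntax works in the other direction: in a function call, it unpacks a dictionary into keyword arguments. A minimal sketch with a hypothetical `describe` function:

```python
def describe(**kwargs):
    """Return the keyword arguments as 'name=value' strings."""
    return [f'{name}={value}' for name, value in kwargs.items()]

options = {'this': 'is', 'a': 'test'}
# ** in a call unpacks a dictionary into keyword arguments
unpacked = describe(**options)
# Equivalent call with the keywords written out
direct = describe(this='is', a='test')
print(unpacked)
```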

New File Structure

pyedpyper/
- __init__.py
- utils.py
- models/
  - __init__.py
  - linear_regression.py
  - ridge_regression.py

`utils.py`

utils.py

import pandas as pd

def summary(**kwargs):
    """Produce a regression table"""
    return pd.DataFrame(kwargs)

New `linear_regression.py`

linear_regression.py

import pandas as pd
import numpy as np

from .. import utils as ut

class LinearRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1
    def fit(self, X: np.array, y: np.array):
        """Fit a linear regression"""
        self.X = X
        self.y = y
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        self.beta = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ self.y
        self.resid = (self.y-self.X@self.beta).flatten()
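        # Scale (X'X)^{-1} by the residual variance: SSR over degrees of freedom (n - k)
        self.cov = np.linalg.inv(self.X.T@self.X) * (
            np.inner(self.resid, self.resid) / (self.X.shape[0] - self.X.shape[1])
        )
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        ses = np.sqrt(np.diag(self.cov))
        # Flatten beta: DataFrame columns must be 1-dimensional
        return ut.summary(
            beta = self.beta.flatten(), se = ses, t_stat= self.beta.flatten() / ses
        )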

New `ridge_regression.py`

ridge_regression.py

import pandas as pd
import numpy as np

from .. import utils as ut

class RidgeRegression:
    def __init__(self, name):
        self.name = name
    def _check_dims(self):
        """Check input dimensions"""
        return self.X.shape[0] == self.y.shape[0] and self.y.shape[1] == 1\
             and isinstance(self.alpha, (float,int))
    def fit(self, X: np.array, y: np.array, alpha: float):
        """Fit a ridge regression"""
        self.X = X
        self.y = y
        self.alpha = alpha
        if not self._check_dims():
            raise ValueError('Dimensions are not correct')
        denom = np.linalg.inv(self.X.T @ self.X + self.alpha * np.eye(self.X.shape[1]))
        num = self.X.T @ self.y
        self.beta = denom @ num
    def summary(self) -> pd.DataFrame:
        """Produce a regression table"""
        # Flatten beta: DataFrame columns must be 1-dimensional
        return ut.summary(beta = self.beta.flatten())

Installation

This is a little bit tedious – we were forced to manually add our package to the path.
It would be better if we could just install the package where the rest of our packages are installed.
We can do this pretty easily using setup.py

`setup.py`

Tells pip how to install the package from GitHub
Lists the package dependencies
Allows others to use your package

Example of `setup.py`

setup.py

from setuptools import setup, find_packages

setup(
    name='pyedpyper',
    version='0.01',

    url='https://github.com/lukas-hager/pyedpyper',
    author='Lukas Hager',
    author_email='lghhager@uw.edu',

    install_requires = [
        'numpy',
        'pandas',
        'matplotlib'
    ],

    packages=find_packages(),
)

GitHub Site

The code for this package can be found at https://github.com/lukas-hager/pyedpyper
To install via pip, we need to copy the HTTPS link (click on the green “Code” button for the link)
Install via

pip install git+<your_github_https_url_here.git>

In this case:

pip install git+https://github.com/lukas-hager/pyedpyper.git

NB: You may need to use pip3 instead of pip depending on your system configuration

Testing

Why Test?

One reason:
- If you have a lot of code, it can be difficult to make edits and ensure things still run properly
- You want to feel confident when you push the edit to a subpart of the code that the rest of the code still works as anticipated
Another reason:
- You have a live model in production that you want to make sure hasn’t “drifted”
- That is, the data-generating process that the model was trained off of has not changed

Common Workflow

Develop code and tests simultaneously
- For example, every time you write a method, write a test for that method
Whenever you add to the codebase or change things, ensure that the code still passes the tests
If you don’t pass, you’re alerted to some issue that you can fix before putting your work into the real world

Alternative to Unit Testing

Pushing to Production

Common Testing Framework – `pytest`

We should add testing to our package – pytest makes this easy to do.
We can define a set of tests and run them to make sure our package still works

Folder Structure Reminder

Our directory looks like this

pyedpyper/
- __init__.py
- utils.py
- models/
  - __init__.py
  - linear_regression.py
  - ridge_regression.py

Adding Tests

Our new directory will look like this

pyedpyper/
- __init__.py
- utils.py
- models/
  - __init__.py
  - linear_regression.py
  - ridge_regression.py
- tests/

Exercise: Test Ideas

Recall our linear and ridge regression modules. Think of what kind of tests we could run to make sure that we get correct results. In this case, think about “ground truth” results – what are some datasets we could pass our code where we “know” the results?

Solutions: Test Ideas

If we have wholly deterministic data, we should be able to recover the regression coefficient exactly
If we pass data in that has the wrong dimensions, we should raise a specific error (the one that we wrote)
Our coefficients should match the coefficients that are generated by a package like statsmodels
If we specify a regularization parameter of 0, we should get the same coefficients from OLS and Ridge

Test Coverage

Formally defined as the percentage of code in our package that is executed when the unit tests run
We want to get as close to 100% as we can – that indicates that there aren’t chunks of code that have no checks
Ideally, we’d write a check for all of our methods in our classes

Writing Tests

pytest collects tests from files named test_*.py or *_test.py, and within them runs functions whose names start with test
The tests are functions containing assert statements, which raise an error if the asserted condition is not met

Example of a Test

Suppose we had a function like this

def addition(a: float, b: float) -> float:
    """Add a and b"""
    return a + b

We could have a test for it:

def test_addition():
    assert addition(3, 5) == 8

Under the hood, pytest will run the test:

test_addition()

No errors

How to Run `pytest`

pytest <filename if you want to test a specific file>

Testing Deterministic Model

rng = np.random.default_rng()

def test_fit_no_error():
    # Generator objects have no randn method; standard_normal is the equivalent
    X = rng.standard_normal((1000, 5))
    beta = np.arange(1.,6.).reshape(-1,1)
    y = X @ beta
    lr = LinearRegression('ols')
    lr.fit(X,y)
    assert np.allclose(lr.beta,beta,rtol=0,atol=1e-6)

atol: absolute tolerance
rtol: relative tolerance
condition: abs(a - b) <= (atol + rtol * abs(b))
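To make the two tolerances concrete, the sketch below compares two hypothetical arrays whose elements differ by 0.05% of the reference value. The absolute check fails on the large element, while the relative check passes on both:

```python
import numpy as np

a = np.array([1.0005, 100.05])
b = np.array([1.0, 100.0])

# Absolute tolerance only: both elements must be within 1e-3 of b
abs_only = np.allclose(a, b, rtol=0, atol=1e-3)
# Relative tolerance only: each element must be within 0.1% of b
rel_only = np.allclose(a, b, rtol=1e-3, atol=0)
# The element-wise condition written out by hand (atol = 0, rtol = 1e-3)
manual = bool(np.all(np.abs(a - b) <= 0 + 1e-3 * np.abs(b)))

print(abs_only, rel_only, manual)
```

An absolute tolerance suits coefficients of known scale, as in the test above; a relative tolerance is the safer choice when magnitudes vary widely.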

Testing Comparison

def test_comparison():
    X = rng.standard_normal((1000, 5))
    beta = np.arange(1.,6.).reshape(-1,1)
    y = X @ beta
    lr = LinearRegression('ols')
    lr.fit(X,y)
    rr = RidgeRegression('ridge')
    rr.fit(X,y,0)
    assert np.allclose(lr.beta,rr.beta,rtol=0,atol=1e-6)

Testing Error

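import pytest

def test_error_lr():
    X = rng.standard_normal((1000, 5))
    y1 = rng.standard_normal((1001, 1))  # wrong number of rows
    y2 = rng.standard_normal((1000, 2))  # more than one column
    lr = LinearRegression('ols')
    with pytest.raises(ValueError):
        lr.fit(X,y1)
    with pytest.raises(ValueError):
        lr.fit(X,y2)

def test_error_rr():
    X = rng.standard_normal((1000, 5))
    y1 = rng.standard_normal((1001, 1))  # wrong number of rows
    y2 = rng.standard_normal((1000, 2))  # more than one column
    rr = RidgeRegression('ridge')
    # fit requires alpha; without it we would get a TypeError, not a ValueError
    with pytest.raises(ValueError):
        rr.fit(X,y1,1)
    with pytest.raises(ValueError):
        rr.fit(X,y2,1)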

Creating Test Files

We can store the tests for each module in their own file within our tests folder
We then have the ability to run pytest on our module

Regression Tests

test_linear_regression.py

import pytest
import numpy as np

from pyedpyper.models.linear_regression import LinearRegression
from pyedpyper.models.ridge_regression import RidgeRegression

rng = np.random.default_rng()

def test_fit_no_error():
    X = rng.random((1000, 5))
    beta = np.arange(1.,6.).reshape(-1,1)
    y = X @ beta
    lr = LinearRegression('ols')
    lr.fit(X,y)
    assert np.allclose(lr.beta,beta,rtol=.001,atol=0)

def test_comparison():
    X = rng.random((1000, 5))
    beta = np.arange(1.,6.).reshape(-1,1)
    y = X @ beta
    lr = LinearRegression('ols')
    lr.fit(X,y)
    rr = RidgeRegression('ridge')
    rr.fit(X,y,0)
    assert np.allclose(lr.beta,rr.beta,rtol=0,atol=1e-6)

def test_error_lr():
    X = rng.random((1000, 5))
    y1 = rng.random((1001, 1))
    y2 = rng.random((1000, 2))
    lr = LinearRegression('ols')
    with pytest.raises(ValueError):
        lr.fit(X,y1)
    with pytest.raises(ValueError):
        lr.fit(X,y2)

Ridge Tests

test_ridge_regression.py

import pytest
import numpy as np

from pyedpyper.models.ridge_regression import RidgeRegression

rng = np.random.default_rng()

def test_error_rr():
    X = rng.random((1000, 5))
    y1 = rng.random((1001, 1))
    y2 = rng.random((1000, 2))
    rr = RidgeRegression('ridge')
    with pytest.raises(ValueError):
        rr.fit(X,y1,1)
    with pytest.raises(ValueError):
        rr.fit(X,y2,1)

Testing

Within the package folder we can run pytest to get the test results:

=============================================================================================================== test session starts ================================================================================================================
platform darwin -- Python 3.8.5, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/hlukas/git/pyedpyper
plugins: csv-3.0.0
collected 4 items                                                                                                                                                                                                                                  

tests/test_linear_regression.py ...                                                                                                                                                                                                          [ 75%]
tests/test_ridge_regression.py .                                                                                                                                                                                                             [100%]

Exercise: Testing a `DataFrame`

Look at the documentation for pd.testing.assert_frame_equal and use it to write a unit test for our utils.summary function.

Solutions: Testing a `DataFrame`

def test_summary():
    df1 = ut.summary(a=np.arange(6), b=np.arange(6,12))
    df2 = pd.DataFrame({'a': np.arange(6), 'b': np.arange(6,12)})
    pd.testing.assert_frame_equal(df1, df2)
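`assert_frame_equal` is stricter than comparing values: by default it also checks column dtypes. The sketch below uses small hypothetical frames to show a dtype mismatch being caught, and the `check_dtype=False` option that relaxes the comparison to values only:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': np.arange(3)})
same = pd.DataFrame({'a': np.arange(3)})
as_float = pd.DataFrame({'a': np.arange(3, dtype=float)})

# Identical frames: passes silently
pd.testing.assert_frame_equal(left, same)

# Same values but integer vs float columns: raises AssertionError by default
try:
    pd.testing.assert_frame_equal(left, as_float)
    dtype_mismatch_caught = False
except AssertionError:
    dtype_mismatch_caught = True

# check_dtype=False compares values only
pd.testing.assert_frame_equal(left, as_float, check_dtype=False)
print(dtype_mismatch_caught)
```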

`utils` Test File

test_utils.py

import pandas as pd
import numpy as np

from pyedpyper import utils as ut

def test_summary():
    df1 = ut.summary(a=np.arange(6), b=np.arange(6,12))
    df2 = pd.DataFrame({'a': np.arange(6), 'b': np.arange(6,12)})
    pd.testing.assert_frame_equal(df1, df2)

Testing

Again, we can run this with pytest:

=============================================================================================================== test session starts ================================================================================================================
platform darwin -- Python 3.8.5, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/hlukas/git/pyedpyper
plugins: csv-3.0.0
collected 5 items                                                                                                                                                                                                                                  

tests/test_linear_regression.py ...                                                                                                                                                                                                          [ 60%]
tests/test_ridge_regression.py .                                                                                                                                                                                                             [ 80%]
tests/test_utils.py .                                                                                                                                                                                                                        [100%]

Code Coverage

A useful metric is how much of our code is unit tested
For example, passing all tests is much more relevant when those tests are testing every method in our codebase compared to when they’re just testing a single function

Installation of `pytest-cov`

pytest has the capability to provide this to us directly if we install a second package

pip install pytest-cov

Usage of `pytest-cov`

pytest --cov

============================================================================= test session starts ==============================================================================
platform darwin -- Python 3.8.5, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/hlukas/git/pyedpyper
plugins: cov-5.0.0, csv-3.0.0
collected 5 items                                                                                                                                                              

tests/test_linear_regression.py ...                                                                                                                                      [ 60%]
tests/test_ridge_regression.py .                                                                                                                                         [ 80%]
tests/test_utils.py .                                                                                                                                                    [100%]

---------- coverage: platform darwin, python 3.8.5-final-0 -----------
Name                                    Stmts   Miss  Cover
-----------------------------------------------------------
pyedpyper/models/__init__.py                0      0   100%
pyedpyper/models/linear_regression.py      19      2    89%
pyedpyper/models/ridge_regression.py       19      1    95%
pyedpyper/utils.py                          3      0   100%
tests/__init__.py                           0      0   100%
tests/test_linear_regression.py            30      0   100%
tests/test_ridge_regression.py             13      0   100%
tests/test_utils.py                         7      0   100%
-----------------------------------------------------------
TOTAL                                      91      3    97%

Pre-Commit Hooks

Why?

We want high-quality commits
- For example, we want the code to conform to a stylistic standard
- We want certain types of files to be committed together
Humans are forgetful and it’s hard to go through a lot of code to make sure it’s all consistent
We’ll add something to our package that will check our commits when we submit them

Installation

pip install pre-commit

YAML

We’ll add a file called .pre-commit-config.yaml to our package that tells pre-commit what we want it to look for (this is the baseline example from the pre-commit documentation):

.pre-commit-config.yaml

repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
-   repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
    -   id: black

For example, this will remove trailing whitespace and make sure every file ends with exactly one newline.

YAML: Aside

YAML stands for “YAML Ain’t Markup Language” (originally “Yet Another Markup Language”)

Installing

We can run

pre-commit install

to set up our new hooks

Commit Output

Check Yaml...............................................................Passed
Fix End of Files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing .pre-commit-config.yaml
Fixing tests/test_utils.py
Fixing tests/test_linear_regression.py
Fixing tests/test_ridge_regression.py

Trim Trailing Whitespace.................................................Passed
black....................................................................Failed
- hook id: black
- files were modified by this hook

reformatted tests/test_utils.py
reformatted tests/test_ridge_regression.py
reformatted tests/test_linear_regression.py

All done! ✨ 🍰 ✨
3 files reformatted, 5 files left unchanged.

Linting

Remember that we can lint our code to make it conform stylistically – this is something that we can add to our pre-commit hook

.pre-commit-config.yaml

repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
-   repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
    -   id: black
-   repo: https://github.com/pylint-dev/pylint
    rev: v3.1.0
    hooks:
    -   id: pylint

Running with Linting

We get output like this:

************* Module tests.test_ridge_regression
tests/test_ridge_regression.py:1:0: C0114: Missing module docstring (missing-module-docstring)
tests/test_ridge_regression.py:9:0: C0116: Missing function or method docstring (missing-function-docstring)
tests/test_ridge_regression.py:10:4: C0103: Variable name "X" doesn't conform to snake_case naming style (invalid-name)

Virtual Environments

Why?

We want to limit the number of issues we run into due to differences in package versions
We want to make sure that the correct packages are installed while developing

What?

A version of Python with specific versions of packages installed
Anyone can recreate this version of Python
Ensures that things are standardized
I generally run them through Anaconda

`environment.yml`

A file that lists out the dependencies for our virtual environment:

environment.yml

name: pyedpyper
dependencies:
  - numpy
  - pandas
  - pre-commit
  - pylint

Can be created via conda env create -f environment.yml
Can be activated via conda activate pyedpyper

`pyedpyper`

The repository with all of these files can be found at https://github.com/lukas-hager/pyedpyper
