2024-04-01
- `numpy`: Workhorse library for vectorized computing
- `pandas`: Data cleaning and manipulation
- `scipy`: Statistical computing methods
Why `numpy`?

`numpy` provides many advantages:

- `numpy` knows how to do an operation properly, and more importantly, in a way that’s numerically stable
- `numpy` gives us the `ndarray`, which makes matrix and array operations much easier

Timing the list-based solutions:

53.6 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
83.1 ms ± 2.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And the same operation with `numpy`:

634 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The `numpy` solution is roughly 100 times faster than the list solution.

A natural question: “Why do we use `np` as a prefix?”
Recall that for two \(n\times 1\) vectors \(x_1\) and \(x_2\), we can compute the inner product as
\[x_1^{\top}x_2 = \sum_{i=1}^nx_{1i}x_{2i}\]
10.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using `np.inner`:
1.13 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
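A sketch of the kind of comparison timed above (the vectors and the loop implementation here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)

def inner_loop(a, b):
    # accumulate the elementwise products one at a time in Python
    total = 0.0
    for ai, bi in zip(a, b):
        total += ai * bi
    return total

# In a notebook: %timeit inner_loop(x1, x2)  vs.  %timeit np.inner(x1, x2)
print(inner_loop(x1, x2), np.inner(x1, x2))
```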
Can you rewrite our factorial function using `numpy`? Is it faster? Hint: look up `np.prod`.
3.43 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
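One possible answer to the exercise, as a sketch:

```python
import numpy as np

def np_factorial(n: int) -> int:
    # product of 1, 2, ..., n; np.prod replaces the explicit loop
    # (note: this overflows for n > 20 with 64-bit integers)
    return np.prod(np.arange(1, n + 1, dtype=np.int64))

print(np_factorial(10))   # 3628800
```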
If a matrix \(A\) has dimension \(n\times k\) and matrix \(B\) has dimension \(k\times \ell\), then they can be multiplied and the resulting matrix \(AB\) has dimension \(n\times \ell\) and elements given by
\[ \begin{bmatrix} a_{11} & ... & a_{1k} \\ a_{21} & ... & a_{2k} \\ \vdots & \ddots & \vdots \\ a_{n1} & ... & a_{nk} \end{bmatrix} \begin{bmatrix} b_{11} & ... & b_{1\ell} \\ b_{21} & ... & b_{2\ell} \\ \vdots & \ddots & \vdots \\ b_{k1} & ... & b_{k\ell} \end{bmatrix}= \begin{bmatrix} \sum_{i=1}^ka_{1i}b_{i1} & ... & \sum_{i=1}^ka_{1i}b_{i\ell} \\ \sum_{i=1}^ka_{2i}b_{i1} & ... & \sum_{i=1}^ka_{2i}b_{i\ell} \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^ka_{ni}b_{i1} & ... & \sum_{i=1}^ka_{ni}b_{i\ell} \\ \end{bmatrix} \]
Let’s verify this in `numpy`:
Matrix A:
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
Matrix B:
[[10 11 12 13 14]
[15 16 17 18 19]]
Product:
[[ 15 16 17 18 19]
[ 65 70 75 80 85]
[115 124 133 142 151]
[165 178 191 204 217]
[215 232 249 266 283]]
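Arrays consistent with the printout above can be built with `np.arange` and `reshape`; the `@` operator (equivalently `np.matmul`) performs the matrix product:

```python
import numpy as np

A = np.arange(10).reshape(5, 2)       # the 5 x 2 matrix printed above
B = np.arange(10, 20).reshape(2, 5)   # the 2 x 5 matrix printed above

print(A @ B)                          # 5 x 5 product; A.dot(B) is equivalent
```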
`numpy` works with arrays – think of these as generalized matrices. Note that `numpy` often initializes vectors with dimension 1 (just a length):

(10,)

The `-1` Argument

We can always leave one dimension as `-1` to tell `numpy` to just make that dimension whatever is necessary to complete the array. For example, if we wanted an array of dimension \(5 \times 2 \times 1\), we could do:

(5, 2, 1)
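A minimal sketch of both points (the array contents are arbitrary):

```python
import numpy as np

a = np.arange(10)
print(a.shape)                       # (10,): a vector with just a length

b = np.arange(10).reshape(5, 2, -1)  # numpy infers the -1 dimension (here, 1)
print(b.shape)                       # (5, 2, 1)
```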
Indexing `numpy` arrays
Define a matrix:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
To access the first row:
To access the first column:
To access the first three rows and first two columns:
Operations on `numpy` arrays
Multiplying an array by a scalar works as expected:
array([[ 0., 5., 10., 15., 20.],
[25., 30., 35., 40., 45.],
[50., 55., 60., 65., 70.],
[75., 80., 85., 90., 95.]])
What about multiplying an array by an array?
array([[ 0, 1, 4, 9, 16],
[ 25, 36, 49, 64, 81],
[100, 121, 144, 169, 196],
[225, 256, 289, 324, 361]])
Let’s create a new array:
array([[4, 1],
[0, 6],
[6, 2],
[6, 9],
[6, 7]])
If we want to subset to only the rows where the sum of the numbers is more than 10, we can do this directly (more on `np.sum` soon). First, we get a boolean array:
array([False, False, False, True, True])
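A sketch of the subsetting step (the variable name is an assumption; the array is the one printed above):

```python
import numpy as np

a = np.array([[4, 1], [0, 6], [6, 2], [6, 9], [6, 7]])
mask = a.sum(axis=1) > 10   # [False, False, False, True, True]
print(a[mask])              # keeps only the rows whose sum exceeds 10
```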
Compute the absolute value of an arbitrary array (do not use any built-in absolute value functions).
array([3, 2, 1, 0, 1, 2])
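One possible answer, using a boolean condition to flip the sign of the negative entries (the input array is assumed from the output shown):

```python
import numpy as np

a = np.array([-3, -2, -1, 0, 1, 2])
result = np.where(a < 0, -a, a)   # flip the sign wherever the entry is negative
print(result)                     # [3 2 1 0 1 2]
```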
`numpy` has built-in vectorized mathematical functions (called universal functions, or `ufuncs`):
array([[1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01],
[1.48413159e+02, 4.03428793e+02, 1.09663316e+03, 2.98095799e+03,
8.10308393e+03],
[2.20264658e+04, 5.98741417e+04, 1.62754791e+05, 4.42413392e+05,
1.20260428e+06],
[3.26901737e+06, 8.88611052e+06, 2.41549528e+07, 6.56599691e+07,
1.78482301e+08]])
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
- `numpy` has a faster language (C/C++/Fortran) run these loops
- If you can write a `numpy` operation using these functions, it will be much faster than an explicit Python loop

Sometimes we have ufuncs that accept two arguments:
[4 0 6 6 6] [1 6 2 9 7]
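These look like the two columns of the array from earlier. Which two-argument ufunc was demonstrated is not shown, so as an assumption here is `np.maximum`, which compares its inputs element by element:

```python
import numpy as np

u = np.array([4, 0, 6, 6, 6])
v = np.array([1, 6, 2, 9, 7])

print(np.maximum(u, v))   # elementwise max: [4 6 6 9 7]
print(np.add(u, v))       # elementwise sum: [ 5  6  8 15 13]
```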
vec shape: (1, 5), mat shape: (3, 5)
We know that we can do matrix multiplication, but what happens if we try to naively divide?
(3, 5)
Let’s look at these arrays to try to diagnose what’s happening:
[[0 1 2 3 4]]
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
[[ inf 1. 0.5 0.33333333 0.25 ]
[ inf 1. 0.5 0.33333333 0.25 ]
[ inf 1. 0.5 0.33333333 0.25 ]]
`numpy` is willing to “fill in” the values of the vector to turn it into the same size as the matrix and then do operations element-wise, if it thinks it knows what we want.
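For concreteness, a minimal sketch consistent with the shapes and printouts above (the variable names are assumptions):

```python
import numpy as np

vec = np.arange(5).reshape(1, 5)   # shape (1, 5): [[0 1 2 3 4]]
mat = np.ones((3, 5))              # shape (3, 5)

# The (1, 5) row is "stretched" into 3 identical rows before the elementwise division,
# which is why the first column above is inf (division by zero).
print(mat / vec)
```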
`numpy` will only broadcast if the shapes are compatible: comparing dimensions from the trailing end, each pair of dimensions must either match or one of them must be 1.

So this operation fails because the trailing dimensions (5 and 3) do not match:

ValueError: operands could not be broadcast together with shapes (3,5) (3,)

And this operation fails because the shapes disagree in a dimension where neither size is 1 (5 vs. 4):

ValueError: operands could not be broadcast together with shapes (5,2,1) (4,1,1)
Lesson: if you’re broadcasting, make it clear to `numpy` what behavior you want!
Broadcasting makes calculating the variance/covariance matrix of a vector of random variables easy:
array([[ 0.9774961 , 0.00404101, -0.0754416 ],
[ 0.00404101, 1.85719771, 0.07401078],
[-0.0754416 , 0.07401078, 2.89976322]])
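A sketch of how broadcasting does the work here (the data below are illustrative, not the draws used above): subtract each column’s mean from every row, then form the cross products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(scale=[1.0, 1.4, 1.7], size=(1000, 3))   # 1000 observations of 3 variables

demeaned = X - X.mean(axis=0)                  # broadcasting: (1000, 3) minus (3,)
cov = demeaned.T @ demeaned / (X.shape[0] - 1)
print(cov)                                     # compare with np.cov(X, rowvar=False)
```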
We now know enough `numpy` to efficiently “standardize” elements of a matrix (or, more accurately, to rewrite `sklearn.preprocessing.StandardScaler`):
\[ z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j} \]
This is useful when your features have different scales, which matters for certain models (e.g. LASSO). Please write a function `standardize` to do this (that is, calculate the mean and standard deviation of each column, and return the array with standardized values).
array([[-0.17149859, -1.31876095],
[-1.88648444, 0.32969024],
[ 0.68599434, -0.98907071],
[ 0.68599434, 1.31876095],
[ 0.68599434, 0.65938047]])
Note here that we are broadcasting.
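One possible answer to the exercise, as a sketch (column-wise means and standard deviations, divided out via broadcasting):

```python
import numpy as np

def standardize(x: np.ndarray) -> np.ndarray:
    """Subtract each column's mean and divide by each column's standard deviation."""
    mu = x.mean(axis=0)       # shape (k,): one mean per column
    sigma = x.std(axis=0)     # shape (k,): one standard deviation per column
    return (x - mu) / sigma   # broadcast across the rows

# Quick check: every column of the result should have mean ~0 and std ~1
z = standardize(np.random.default_rng(0).normal(size=(100, 2)))
print(z.mean(axis=0), z.std(axis=0))
```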
`numpy` allows for easy simulation draws. For example, we can draw 100,000 times from a standard normal distribution:
0.0029345100785475027 0.9966634341413745
As noted by McKinney, this procedure is significantly (at least one order of magnitude) faster than Python’s default `random` module – use `numpy.random`!
Another set of draws using `numpy`:

-0.0016440274114192328 1.00077298022399
We’re skeptical that `numpy` is actually giving us random numbers, so we want to test it by doing the following procedure 1000 times: draw a sample from the standard normal distribution and record its sample mean.
Armed with our 1000 sample means, we’re going to calculate the mean and standard deviation of the sample means. What should they be? Do we get close? Please set a seed for reproducibility.
Via the Central Limit Theorem, we know that the sample mean \(\overline{x}\) of \(n\) i.i.d. random variables with mean \(\mu\) is distributed
\[ \overline{x} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \]
-2.977436149096266e-05 0.031892612678989775
CPU times: user 23.9 ms, sys: 691 µs, total: 24.5 ms
Wall time: 25.3 ms
-2.9774361490963187e-05 0.030131789839109725
CPU times: user 13.7 ms, sys: 298 µs, total: 14 ms
Wall time: 14.1 ms
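A sketch of the vectorized version of this experiment (the seed and the exact sample sizes are assumptions, chosen to be consistent with the output above):

```python
import numpy as np

rng = np.random.default_rng(42)   # seed for reproducibility (value is arbitrary)
n_sims, n_draws = 1000, 1000      # sizes are assumptions

# Draw everything at once, then take the mean of each row (one mean per simulation)
sample_means = rng.normal(size=(n_sims, n_draws)).mean(axis=1)
print(sample_means.mean(), sample_means.std())   # CLT: ~0 and ~1/sqrt(n_draws) = 0.0316
```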
`numpy` for Regressions

Suppose we have data that looks like this:
| Salary  | Experience | SAT Score | College GPA | Master’s Degree |
|---------|------------|-----------|-------------|-----------------|
| 60,000  | 2          | 1200      | 3.2         | 0               |
| 90,000  | 1          | 1400      | 3.8         | 1               |
| 30,000  | 19         | 900       | 2.5         | 0               |
| 200,000 | 10         | 1400      | 3.8         | 1               |
| 150,000 | 5          | 1000      | 3.0         | 0               |
| 30,000  | 4          | 1100      | 3.9         | 0               |
We’d like to understand what drives salary. What should we do?
Given feature (explanatory variables) matrix \(\textbf{X}\) and outcome variable \(y\), define residuals from the model for a choice of coefficient vector \(\beta\) as
\[ \underbrace{e}_{n\times 1} = \underbrace{y}_{n\times 1} - \underbrace{\textbf{X}}_{n\times k}\underbrace{\beta}_{k\times 1} \]
Then the coefficients from an OLS (ordinary least squares) regression are the solution to the problem
\[ \min_{\beta}\; e^{\top}e \]
If \(\mathbf{X}^{\top}\mathbf{X}\) is invertible, then the solution to the problem has a closed-form solution given by
\[ \hat{\beta} = (\underbrace{\mathbf{X}^{\top}\mathbf{X}}_{k\times k})^{-1}\underbrace{\mathbf{X}^{\top}}_{k\times n}\underbrace{y}_{n\times 1} \]
First, we need to load the data:
(6, 4) (6, 1)
array([[ 720.48960622],
[ 114.35428578],
[-18598.07654699],
[ 51613.9979523 ]])
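The estimates above come from the closed-form formula; a minimal sketch of that computation, assuming `X` and `y` are the arrays whose shapes were printed above:

```python
import numpy as np

def ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Closed-form OLS: solve (X'X) beta = X'y (np.linalg.solve beats an explicit inverse)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# With the (6, 4) feature matrix X and (6, 1) outcome y loaded above (names assumed):
# beta_hat = ols(X, y)
```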
What’s missing?
array([[1.0e+00, 2.0e+00, 1.2e+03, 3.2e+00, 0.0e+00],
[1.0e+00, 1.0e+00, 1.4e+03, 3.8e+00, 1.0e+00],
[1.0e+00, 1.9e+01, 9.0e+02, 2.5e+00, 0.0e+00],
[1.0e+00, 1.0e+01, 1.4e+03, 3.8e+00, 1.0e+00],
[1.0e+00, 5.0e+00, 1.0e+03, 3.0e+00, 0.0e+00],
[1.0e+00, 4.0e+00, 1.1e+03, 3.9e+00, 0.0e+00]])
The `axis` Argument

We do need the `axis` argument to put the intercept vector in the right place:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1 and the array at index 1 has size 4
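A sketch of the fix (the variable names are assumptions; the features come from the table above): the default `axis=0` tries to stack rows, which raises the error shown, while `axis=1` appends the intercept as a new first column.

```python
import numpy as np

X = np.array([[ 2, 1200, 3.2, 0],
              [ 1, 1400, 3.8, 1],
              [19,  900, 2.5, 0],
              [10, 1400, 3.8, 1],
              [ 5, 1000, 3.0, 0],
              [ 4, 1100, 3.9, 0]], dtype=float)   # the (6, 4) feature matrix

intercept = np.ones((X.shape[0], 1))
X1 = np.concatenate([intercept, X], axis=1)       # axis=1: glue on a new column
print(X1.shape)                                   # (6, 5)
```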
`np.r_` and `np.c_`

If you know that you’re adding rows or columns to a matrix, `numpy` has two handy functions to do just that:
array([[1.0e+00, 2.0e+00, 1.2e+03, 3.2e+00, 0.0e+00],
[1.0e+00, 1.0e+00, 1.4e+03, 3.8e+00, 1.0e+00],
[1.0e+00, 1.9e+01, 9.0e+02, 2.5e+00, 0.0e+00],
[1.0e+00, 1.0e+01, 1.4e+03, 3.8e+00, 1.0e+00],
[1.0e+00, 5.0e+00, 1.0e+03, 3.0e+00, 0.0e+00],
[1.0e+00, 4.0e+00, 1.1e+03, 3.9e+00, 0.0e+00]])
Note the brackets (this is technically an indexing routine)
`np.hstack` and `np.vstack`

Alternatively, we can perform the operation with `np.hstack`:
array([[1.0e+00, 2.0e+00, 1.2e+03, 3.2e+00, 0.0e+00],
[1.0e+00, 1.0e+00, 1.4e+03, 3.8e+00, 1.0e+00],
[1.0e+00, 1.9e+01, 9.0e+02, 2.5e+00, 0.0e+00],
[1.0e+00, 1.0e+01, 1.4e+03, 3.8e+00, 1.0e+00],
[1.0e+00, 5.0e+00, 1.0e+03, 3.0e+00, 0.0e+00],
[1.0e+00, 4.0e+00, 1.1e+03, 3.9e+00, 0.0e+00]])
Note that we’re passing a `tuple` here.
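Both routines do the same job; a minimal sketch with a small stand-in matrix (the real design matrix is the one above):

```python
import numpy as np

M = np.arange(8, dtype=float).reshape(4, 2)      # stand-in features

M_c = np.c_[np.ones(M.shape[0]), M]              # brackets: an indexing routine
M_h = np.hstack((np.ones((M.shape[0], 1)), M))   # note the tuple of arrays
print(np.allclose(M_c, M_h))                     # True: both prepend a column of ones
```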
array([[ 1.75082781e+05],
[-1.22516556e+03],
[-1.49834437e+01],
[-2.62417219e+04],
[ 9.73509934e+04]])
We can check with `sklearn` (future lectures):
array([[ 1.75082781e+05, -1.22516556e+03, -1.49834437e+01,
-2.62417219e+04, 9.73509934e+04]])
Suppose we observe \(n\) i.i.d. draws \(x_1,...,x_n\) from a binomial distribution with a known number of trials \(k\) and an unknown success probability \(p\). The probability of observing \(\ell\) successes in \(k\) trials is
\[\mathbb{P}(\ell|p) = {k \choose \ell} p^{\ell}(1-p)^{k-\ell}\]
This means that for a given parameter \(p\) we can write the probability of observing our data given \(p\) as
\[ \mathbb{P}(x_1,...x_n|p) = \prod_{i=1}^n\mathbb{P}(x_i|p) \]
It’s often more convenient (and numerically stable) to work with the logarithm of the likelihood function:
\[ \mathcal{L}_n(x_1,...x_n|p) = \log\left(\mathbb{P}(x_1,...x_n|p)\right) = \sum_{i=1}^n\log\left(\mathbb{P}(x_i|p)\right) \]
This is why it’s important for our observations to be independent!
Given a likelihood function, we can define the MLE estimator for our example as the value of \(p\) that maximizes the probability of observing our data. Formally, it is the solution to the problem
\[ \max_{\hat{p}}\; \mathcal{L}_n(x_1,...x_n|\hat{p}) \]
How can we solve this problem?
\[ \begin{split} \frac{\partial \mathcal{L}_n(x_1,...x_n|p)}{\partial p} &= \sum_{i=1}^n\frac{\partial \log\left(\mathbb{P}(x_i|p)\right)}{\partial p} \\ &= \sum_{i=1}^n \frac{\partial}{\partial p} \left\{\log {k\choose x_i} + x_i\log(p) + (k-x_i)\log(1-p)\right\} \\ &= \sum_{i=1}^n \frac{x_i}{p} - \frac{k-x_i}{1-p} \\ &= \frac{\sum_{i=1}^nx_i}{p} - \frac{nk-\sum_{i=1}^nx_i}{1-p} \end{split} \]
Set to zero to find the maximum:
\[ \frac{\sum_{i=1}^nx_i}{\hat{p}_{MLE}} = \frac{nk-\sum_{i=1}^nx_i}{1-\hat{p}_{MLE}} \implies \hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^n\frac{x_i}{k} \]
`numpy` and `scipy` for MLE

Let’s solve this problem using the closed-form solution and a computational solution.
0.349745
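A sketch of the closed-form calculation with simulated stand-ins for `x_i` and `k` (the lecture’s actual data and trial count are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10                                                # stand-in for k
successes = rng.binomial(n=n_trials, p=0.35, size=100_000)   # stand-in for x_i

p_hat = successes.mean() / n_trials   # closed-form MLE: average success rate per trial
print(p_hat)                          # close to the true p of 0.35
```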
def neg_ll(theta: float, data: np.ndarray, k: int) -> float:
    """
    Compute the negative (why?) log-likelihood for a
    binomial distribution given data and a specific parameter (theta).
    The factorial term is omitted (why?)
    """
    p_successes = data * np.log(theta)
    p_failures = (k - data) * np.log(1 - theta)
    return -np.sum(p_successes + p_failures)
print(f'Log-Likelihood when we\'re far away: {-neg_ll(.8, x_i, k)}')
print(f'Log-Likelihood when we\'re close: {-neg_ll(.3, x_i, k)}')
Log-Likelihood when we're far away: -1124588.3911042244
Log-Likelihood when we're close: -653013.134119855
final_simplex: (array([[0.34974365],
[0.34974976]]), array([647288.64111237, 647288.6411581 ]))
fun: 647288.6411123656
message: 'Optimization terminated successfully.'
nfev: 32
nit: 16
status: 0
success: True
x: array([0.34974365])
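The `final_simplex` field above is what Nelder-Mead reports, so the call was presumably along these lines; the exact arguments (starting value, options) are assumptions, and `neg_ll`, `x_i`, and `k` are taken as defined earlier:

```python
from scipy import optimize

# Minimize the negative log-likelihood; x0 and the method are assumptions
result = optimize.minimize(neg_ll, x0=[0.5], args=(x_i, k), method='Nelder-Mead')
print(result.x)   # should land near the closed-form estimate of p
```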
What happens if we let `scipy` use its default solver (L-BFGS-B here, judging by the output)?
fun: 671916.227352009
hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
jac: array([-531973.30590399])
message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 6
nit: 1
njev: 3
status: 0
success: True
x: array([0.25])
Be careful!
Recall Bayes Rule:
\[ \mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)} \]
Rewriting in a more useful format for our application:
\[ \begin{split} \underbrace{\mathbb{P}(\theta|x_1,...,x_n)}_{\text{Posterior}} &= \frac{\mathbb{P}(x_1,...,x_n|\theta)\mathbb{P}(\theta)}{\mathbb{P}(x_1,...,x_n)} \\ &\propto \underbrace{\mathbb{P}(x_1,...,x_n|\theta)}_{\text{Likelihood}}\underbrace{\mathbb{P}(\theta)}_{\text{Prior}} \end{split} \]
Suppose I have two “candidate” draws for my sample from the posterior, \(\theta_1\) and \(\theta_2\). By Bayes Rule, I know the ratio of their posterior probabilities can be calculated as
\[ \frac{\mathbb{P}(\theta_1|x_1,...,x_n)}{\mathbb{P}(\theta_2|x_1,...,x_n)} = \frac{\mathbb{P}(x_1,...,x_n|\theta_1)\mathbb{P}(\theta_1)}{\mathbb{P}(x_1,...,x_n|\theta_2)\mathbb{P}(\theta_2)} \equiv \alpha \]
where \(\alpha\) is the “acceptance rate”. Then to pick which parameter to include in our sample, we just need to simulate a random variable that will pick \(\theta_1\) with probability \(\min\{\alpha, 1\}\) and \(\theta_2\) otherwise.
\[ \frac{\mathbb{P}(\theta_1|x_1,...,x_n)}{\mathbb{P}(\theta_2|x_1,...,x_n)} \equiv \alpha \]
How should we generate \(\theta_1\) and \(\theta_2\) if we don’t know what the distribution looks like?
Formally, a Markov Chain is a sequence with “memorylessness” property \[ \mathbb{P}(X_{n}=x_{n}\mid X_{n-1}=x_{n-1},\dots ,X_{0}=x_{0})=\mathbb{P}(X_{n}=x_{n}\mid X_{n-1}=x_{n-1}) \]
That is, only the most recent value in the sequence impacts the next value in the sequence
For our purposes, \(\theta_i = \theta_{i-1} + \varepsilon_i\) where \(\mathbb{E}[\varepsilon_i]=0\)
\[ x_n = x_{n-1} + \varepsilon_n = x_0 + \sum_{i=1}^n\varepsilon_i \]
Generate 1000 draws from a Markov Chain that starts at zero with your choice of noise term. Set a seed for reproducibility.
CPU times: user 638 µs, sys: 108 µs, total: 746 µs
Wall time: 734 µs
CPU times: user 155 µs, sys: 3 µs, total: 158 µs
Wall time: 161 µs
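One way to answer the exercise, as a sketch (the seed and the noise distribution are my choices, consistent with the loop-vs-vectorized timings above):

```python
import numpy as np

rng = np.random.default_rng(7)   # seed for reproducibility (value is arbitrary)

# Loop version: each value is the previous value plus mean-zero noise
walk_loop = [0.0]
for _ in range(999):
    walk_loop.append(walk_loop[-1] + rng.normal())

# Vectorized version: x_n = x_0 + cumulative sum of the noise terms
walk_vec = np.concatenate(([0.0], np.cumsum(rng.normal(size=999))))
print(len(walk_loop), walk_vec.shape)   # 1000 values each
```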
Put it all together into this (simplified) version of the Metropolis-Hastings algorithm:
def met_hast(n_iterations: int) -> list:
    """
    Run Metropolis-Hastings algorithm for our binomial example.
    """
    post_draws = []
    theta_old = .5
    for _ in range(n_iterations):
        # propose a new parameter by adding mean-zero noise (a random walk)
        theta_new = theta_old + np.random.normal(loc=0, scale=.05)
        ll_old = -neg_ll(theta_old, x_i, k)
        ll_new = -neg_ll(theta_new, x_i, k)
        # ratio of posteriors under a flat prior, computed on the log scale
        acceptance_rate = np.exp(ll_new - ll_old)
        if np.random.uniform() < acceptance_rate:
            post_draws.append(theta_new)
            theta_old = theta_new
        else:
            post_draws.append(theta_old)
    return post_draws
Mean of posterior draws: 0.3498
What’s happening in the early iterations?
Mean of posterior draws: 0.3497
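The early iterations still reflect the arbitrary starting point (\(\theta = 0.5\)), so a common fix is to discard a burn-in period before averaging. A minimal sketch, assuming `met_hast`, `x_i`, and `k` are defined as above and with an assumed chain length and burn-in:

```python
import numpy as np

draws = met_hast(10_000)          # chain length is an assumption
burn_in = 1_000                   # drop the early draws still influenced by theta_old = .5
print(f'Mean of posterior draws: {np.mean(draws[burn_in:]):.4f}')
```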