pandas

Lukas Hager

2024-11-28

Overview

pandas

  • pandas is an open-source library that makes working with tabular data easy
    • numpy: good at arrays/matrix operations (e.g. all float values)
    • pandas: good at heterogeneous data (e.g. some float, some string, etc.)
    • Primary pythonic tool for cleaning data for analysis

Learning Objectives

  • Understand how to load, slice, filter, join, reshape, and clean data using pandas
  • Feel comfortable with the basics of regular expressions

pandas Data Structures

Importing pandas

We’ll always use the pd prefix for pandas. We’ll also import numpy with its standard prefix:

import pandas as pd
import numpy as np

Series

pandas has an object similar to np.array, called the Series:

obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

Note that the object has an index – these are like dict keys in base python:

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2
d    4
b    7
a   -5
c    3
dtype: int64

Indexing Series

The index allows us more descriptive access to values:

obj2['a']
-5
obj2[['a','c']]
a   -5
c    3
dtype: int64
obj2[3]
3
obj2['d'] = 6
obj2[['a', 'd']]
a   -5
d    6
dtype: int64

Boolean Indexing

We can still use boolean indexing without removing the index link:

obj2[obj2>2]
d    6
b    7
c    3
dtype: int64

Note that the index is still there, and both array and index are accessible with Series methods:

obj2[obj2>2].array
<NumpyExtensionArray>
[6, 7, 3]
Length: 3, dtype: int64
obj2[obj2>2].index
Index(['d', 'b', 'c'], dtype='object')

Creating Series from dict

Since we’re arguing that Series is very similar to a dict, it’s logical that we can create one from a dict:

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

And we can also do the reverse operation:

obj3.to_dict()
{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

Providing Index to Series

We can pass an index directly in the creation of a Series object:

states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Why do we have a NaN (not a number)? Where is Utah’s entry?

NaN

Use the pd.isna function to identify these values:

pd.isna(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

We often want to look at arrays that have null values removed:

obj4[~pd.isna(obj4)]
Ohio      35000.0
Oregon    16000.0
Texas     71000.0
dtype: float64

NaN values are ignored by default in pandas aggregation methods (the opposite of R’s default)…

obj4.sum()
122000.0

…but this may hide bugs in our code. To be safe:

obj4.sum(skipna=False)
nan

DataFrame

The power of pandas is in the DataFrame object. There are many ways to construct a DataFrame, but a common one is passing a dict:

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2

head and tail

frame.head(2)
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
frame.tail(2)
state year pop
4 Nevada 2002 2.9
5 Nevada 2003 3.2

Exercise: numpy Array as DataFrame

Look up pd.DataFrame’s documentation and create a \(1000 \times 26\) DataFrame where each column has a letter of the alphabet as its header and contains 1,000 Poisson random draws.

Solutions: numpy Array as DataFrame

rng = np.random.default_rng(seed=481)
random_array = rng.poisson(size = (1000,26))
headers = list('abcdefghijklmnopqrstuvwxyz')
pois = pd.DataFrame(data = random_array, columns = headers)
pois.head(3)
a b c d e f g h i j ... q r s t u v w x y z
0 2 1 0 0 1 0 0 1 1 0 ... 1 0 4 1 1 1 0 2 1 0
1 1 0 1 2 0 0 1 1 0 0 ... 4 2 3 1 0 0 1 0 0 1
2 1 0 2 0 1 2 0 3 0 0 ... 0 2 3 0 0 2 1 1 0 2

3 rows × 26 columns

Selecting a Column

frame.state
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
frame['state']
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

The latter is safer – the former will not work with variable names like “my var”
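For example (a sketch with a hypothetical column name, working on a copy so frame itself is unchanged):

demo = frame.copy()
demo['my var'] = 1.0   # bracket access accepts any string as a column name
demo['my var']         # works
# demo.my var          # not valid Python: attribute access can't contain a space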

Selecting Columns

frame[['state', 'year']]
state year
0 Ohio 2000
1 Ohio 2001
2 Ohio 2002
3 Nevada 2001
4 Nevada 2002
5 Nevada 2003

Note that this is a DataFrame, while the previous objects were Series

Selecting Column as DataFrame

frame[['state']]
state
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
5 Nevada

Creating Variables

frame['debt'] = 1.5
frame['debt2'] = np.arange(len(frame))
frame
state year pop debt debt2
0 Ohio 2000 1.5 1.5 0
1 Ohio 2001 1.7 1.5 1
2 Ohio 2002 3.6 1.5 2
3 Nevada 2001 2.4 1.5 3
4 Nevada 2002 2.9 1.5 4
5 Nevada 2003 3.2 1.5 5

Reindexing

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Suppose we want to set a new index on this Series. What happens?

obj.reindex(["a", "b", "c", "d", "e"])
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

Why is a NaN induced?

reindex for Economic Data

  • One useful application of reindexing for Economics is when we work with time-series data
  • We may want to interpolate incomplete time-series data (fill it forward)
og_df = pd.DataFrame(
    {
        'price': np.random.uniform(size=3), 
        'quantity': np.random.uniform(size=3)
    },
    index=[1,3,5]
)
og_df
price quantity
1 0.054966 0.839750
3 0.606854 0.865037
5 0.140897 0.488995

Default reindex Behavior

What should we do if we want one observation per day?

og_df.reindex(np.arange(6))
price quantity
0 NaN NaN
1 0.054966 0.839750
2 NaN NaN
3 0.606854 0.865037
4 NaN NaN
5 0.140897 0.488995

ffill

How should we interpolate day 2’s data?

filled_df = og_df.reindex(np.arange(6), method='ffill')
filled_df
price quantity
0 NaN NaN
1 0.054966 0.839750
2 0.054966 0.839750
3 0.606854 0.865037
4 0.606854 0.865037
5 0.140897 0.488995

pandas Indexing (numpy Notation)

We can filter data via the same notation as in numpy:

filled_df[~pd.isna(filled_df['price'])]
price quantity
1 0.054966 0.839750
2 0.054966 0.839750
3 0.606854 0.865037
4 0.606854 0.865037
5 0.140897 0.488995

pandas Indexing (.loc)

The preferred method for subsetting data in pandas is using the .loc method:

filled_df.loc[~pd.isna(filled_df['price'])]
price quantity
1 0.054966 0.839750
2 0.054966 0.839750
3 0.606854 0.865037
4 0.606854 0.865037
5 0.140897 0.488995

Indexing using Index Values

Note that we previously were using Boolean indexing – we can of course also index using the index itself

filled_df.loc[[1,3,5]]
price quantity
1 0.054966 0.839750
3 0.606854 0.865037
5 0.140897 0.488995

Select on Both Axes with .loc

filled_df.loc[~pd.isna(filled_df['price']), 'quantity']
1    0.839750
2    0.839750
3    0.865037
4    0.865037
5    0.488995
Name: quantity, dtype: float64

Value Assignment with .loc

filled_df.loc[pd.isna(filled_df['price'])] = 0.
filled_df
price quantity
0 0.000000 0.000000
1 0.054966 0.839750
2 0.054966 0.839750
3 0.606854 0.865037
4 0.606854 0.865037
5 0.140897 0.488995

ufuncs with pandas

frame = pd.DataFrame(
    np.random.standard_normal((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"]
)
frame
b d e
Utah -1.075511 1.574516 -0.988367
Ohio 1.321872 0.379486 -0.361248
Texas 1.529598 0.491796 0.353201
Oregon -1.462351 -0.952698 -0.553078

np.abs

np.abs(frame)
b d e
Utah 1.075511 1.574516 0.988367
Ohio 1.321872 0.379486 0.361248
Texas 1.529598 0.491796 0.353201
Oregon 1.462351 0.952698 0.553078

apply

def f1(x: pd.Series) -> float:
    """
    Return the difference between the largest and smallest 
    value of a `Series`.
    """
    return x.max() - x.min()

frame.apply(f1)
b    2.991949
d    2.527214
e    1.341568
dtype: float64

apply along rows

frame.apply(f1, axis = 'columns')
Utah      2.650027
Ohio      1.683120
Texas     1.176397
Oregon    0.909273
dtype: float64

Note on apply

  • apply is inherently slower than a vectorized function, since it loops over rows or columns in Python rather than operating on whole arrays
  • For example – don’t use apply to take the logarithm of a column in your DataFrame. Use np.log (a comparison is sketched below)
    • See the .map method for how you could do this elementwise if you wanted to
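A small illustration using the pois DataFrame from the earlier exercise (an added sketch; timings will vary, but the lambda version loops in Python while the ufunc works on the whole array at once):

logs_loop = pois['a'].apply(lambda x: np.log1p(x))  # calls the function once per element
logs_vec = np.log1p(pois['a'])                      # vectorized ufunc, same values
# log1p is used here because the Poisson draws include zeros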

Exercise: Distributions

Use your \(1000 \times 26\) DataFrame and return the sample mean and variance for each column. Use apply and numpy for one and only numpy for the other. Is there a time difference?

Solutions: Distributions

%%time 
mean_pois = pois.apply(np.mean, axis = 0)
print(mean_pois.shape)
(26,)
CPU times: user 782 μs, sys: 71 μs, total: 853 μs
Wall time: 790 μs
%%time 
var_pois = np.var(pois, axis = 0)
print(var_pois.shape)
(26,)
CPU times: user 560 μs, sys: 293 μs, total: 853 μs
Wall time: 975 μs

Sorting

  • We often want to sort data by some column or set of columns
    • Useful in time-series analysis
    • Also useful when we need to compute lagged variables
  • We can do this a few ways in pandas
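As a minimal sketch of the lag use case (made-up data; shift is a standard pandas method):

prices = pd.Series([3.0, 1.0, 2.0], index=[2002, 2000, 2001])
prices = prices.sort_index()   # order by year before lagging
lagged = prices.shift(1)       # each year now gets the previous year's value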

Sorting by Index

obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj
d    0
a    1
b    2
c    3
dtype: int64
obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64

Note that this is sorting lexicographically

obj.sort_index(ascending=False)
d    0
c    3
b    2
a    1
dtype: int64

Sorting by Variables

We also want to sort data by specific variables that are not the index

obj.sort_values()
d    0
a    1
b    2
c    3
dtype: int64

NaN values are sorted to the end by default

obj = pd.Series([3., 2., np.nan, 5., np.nan], index=["d", "a", "b", "c", "z"])
obj.sort_values()
a    2.0
d    3.0
c    5.0
b    NaN
z    NaN
dtype: float64

Summary Statistics

As we’ve seen, we can call aggregating functions on DataFrame objects to get some summary stats

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                   index=["a", "b", "c", "d"],
                   columns=["one", "two"])

df    
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

Summary Stats

df.sum()
one    9.25
two   -5.80
dtype: float64
df.mean()
one    3.083333
two   -2.900000
dtype: float64

What are these methods doing with NaN values?
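We can make the answer explicit by disabling the skipping, using the same skipna argument we used on the Series above:

df.sum(skipna=False)   # both columns contain NaN, so both sums come back as NaN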

A fuller list of summary statistics methods for pandas is available in the pandas documentation.

DataFrame.sum versus np.sum

Note that np.sum will have different default behavior depending on the object that it’s called on:

np.sum(random_array)
25866
np.sum(pois).head()
a    1001
b     950
c    1029
d    1003
e    1019
dtype: int64

describe

We can also use the built-in describe method:

df.describe()
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

describe with Non-numeric Data

obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object

unique

We often want to know the unique (or distinct) values in a Series (equivalently, column in our DataFrame)

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
obj.unique()
array(['c', 'a', 'd', 'b'], dtype=object)

We will also want to know how often each distinct value arises in a Series

obj.value_counts()
c    3
a    3
b    2
d    1
Name: count, dtype: int64

isin

We can also check which elements of a Series object match certain values:

obj[obj.isin(["b", "c"])]
0    c
5    b
6    b
7    c
8    c
dtype: object

Loading Data

Loading Data

  • One of the nicest features of pandas is that it has methods for reading an assortment of stored data into a DataFrame
  • We’ll focus on some of the most common ones here, but all the loading methods are relatively similar

pd.read_csv

Reads data from a comma-separated values (CSV) file. The data looks like this:

from io import StringIO # you will not need StringIO if reading from a file
raw_csv_file = StringIO("""
a, b, c
1, 2, europe
3, 5, asia
-2, 6, north america
""")
pd.read_csv(raw_csv_file)
a b c
0 1 2 europe
1 3 5 asia
2 -2 6 north america

Useful pd.read_csv Arguments

  • sep: what the delimiter in the file is (“,” and “|” are common)
  • na_values: list of strings to coerce to NaN (e.g. “NULL”, “N/A”, “NA”)
  • names: pass your own column names instead of inferring from the data
  • skiprows: tell pandas to not read in certain rows of the file
  • index_col: tell pandas which column to use as the index
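A small sketch of several of these arguments together, again using StringIO in place of a real file (the data here are made up):

raw = StringIO("""this first line is junk
1|2|NULL
3|N/A|asia
""")
pd.read_csv(
    raw,
    sep='|',                    # pipe-delimited instead of comma-delimited
    skiprows=1,                 # skip the junk first line
    names=['a', 'b', 'c'],      # supply our own column names
    na_values=['NULL', 'N/A']   # coerce these strings to NaN
)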

pd.read_excel

Reads in data from a specific Excel sheet (remember you can have multiple sheets in a workbook)

url = 'https://www.blm.gov/sites/blm.gov/files/Table2_Number_of_Acres_Leased.xlsx'
pd.read_excel(url, usecols = ['FY 2014', 'FY 2015']).head()
FY 2014 FY 2015
0 80540.0 78570.00
1 1813246.0 1813246.00
2 39560.0 19995.65
3 465407.0 460904.00
4 223206.0 208056.00

Useful pd.read_excel Arguments

  • sheet_name: if a workbook has multiple sheets, access the one that you want
  • names: pass your own column names instead of inferring from the data
  • thousands/decimal: tell pandas what the separator for thousands or decimals is if you want to parse a text field to numeric
  • index_col: tell pandas which column to use as the index
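A sketch of how these arguments fit together (the file path and sheet name here are hypothetical):

pd.read_excel(
    'data/leases.xlsx',     # hypothetical workbook
    sheet_name='FY2015',    # hypothetical sheet within the workbook
    thousands=',',          # parse values like "1,813,246" as numeric
    index_col=0             # use the first column as the index
)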

pd.read_html

  • If a table on a webpage is sufficiently simple, we can use pandas to read it directly
  • Note that this method returns a list of DataFrame objects
  • Often need to specify the match argument to get the table you want
econ_courses = pd.read_html('https://econ.washington.edu/courses', match = 'ECON Courses')
econ_courses[0].head(3)
Course Course Title (click for details) SLN Instructor Meeting Time
0 ECON 200 A Introduction to Microeconomics 13840 Melissa Knox TTh 8:30am - 9:50am
1 ECON 200 AA Introduction to Microeconomics 13841 NaN F 8:30am - 9:50am
2 ECON 200 AB Introduction to Microeconomics 13842 NaN F 9:00am - 10:20am

Useful pd.read_html Arguments

  • All the previous ones from pd.read_csv and pd.read_excel
  • attrs: a dictionary of attributes to identify a table in the webpage’s HTML
    • We’ll discuss this a little more in webscraping, but you should be aware of it if you’re using this method on a page where there are a lot of tables
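A sketch of the attrs argument (the URL and the id attribute here are hypothetical):

tables = pd.read_html(
    'https://example.com/annual-report',   # hypothetical page with many tables
    attrs={'id': 'results-table'}          # keep only tables whose HTML id matches
)
tables[0]   # read_html still returns a list, even if only one table matches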

Data Cleaning in pandas

Duplicate Data

  • Depending on context, we may want to remove duplicate observations from a dataset
  • Recall that we have the .unique() method to get the unique values from a Series
  • If we want to filter out duplicates, we have some direct options for DataFrames

Dealing with Duplicate Data

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4

Discretizing

If we want to bin data, it can be surprisingly tricky to code – luckily, pandas has a built-in method for us:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages,bins)
age_categories
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
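pd.cut also accepts a labels argument, which makes the bins easier to read and to tabulate:

age_series = pd.Series(ages)
named = pd.cut(age_series, bins, labels=['18-25', '26-35', '36-60', '61-100'])
named.value_counts()   # counts of observations per bin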

String Manipulation

  • A huge benefit of the package
  • Working with text data is crucial in Economic research
  • Let’s begin with an example

Class List

Let’s return to our Econ department data

econ_courses[0]['Course']
0      ECON 200 A
1     ECON 200 AA
2     ECON 200 AB
3     ECON 200 AC
4     ECON 200 AD
         ...     
71     ECON 591 A
72     ECON 596 A
73     ECON 600 A
74     ECON 601 A
75     ECON 800 A
Name: Course, Length: 76, dtype: object

How could we classify courses as 200-level, 300-level, 400-level, and graduate level?

First option: str.slice

slice_df = econ_courses[0][['Course']]
slice_df['course_no'] = slice_df['Course'].str.slice(5,8)
slice_df.head(4)
Course course_no
0 ECON 200 A 200
1 ECON 200 AA 200
2 ECON 200 AB 200
3 ECON 200 AC 200

How can we check?

Check: str.slice

slice_df['course_no'].value_counts()
course_no
200    16
201    14
300     7
301     4
483     2
600     1
596     1
591     1
585     1
584     1
485     1
539     1
518     1
503     1
501     1
499     1
497     1
491     1
488     1
601     1
581     1
482     1
472     1
471     1
451     1
448     1
443     1
432     1
426     1
424     1
422     1
421     1
411     1
406     1
399     1
382     1
216     1
800     1
Name: count, dtype: int64

When would this approach fail?

Classification: np.where

slice_df['classification'] = np.where(
    slice_df['course_no'].astype('int') >= 500,
    'grad',
    np.where(
        slice_df['course_no'].astype('int') >= 400,
        '400 level',
        np.where(
            slice_df['course_no'].astype('int') >= 300,
            '300 level',
            '200 level'
        )
    )
)
slice_df.iloc[[30,50]]
Course course_no classification
30 ECON 216 A 216 200 level
50 ECON 432 A 432 400 level

Ideal Approach

  • It would be cool if we could just tell pandas to grab the 3-digit substring within any larger string
  • Then we could apply this to any course catalog (where the prefix isn’t just “ECON”)
  • This is done using regular expressions

Regular Expressions (regex)

  • Strings that are used to match patterns instead of explicit combinations of characters in text
  • In our example, we want to match the three digits of the course number, not the letters
    • In regex, a general digit (0-9) can be represented using “\d”
    • In regex, we can specify the exact number of matches we want via “{3}”
  • So a regex that would work in this application looks like this: \d{3} (match three digits)

Regular Expressions in pandas

These are extremely useful in working with string data in pandas

regex_df = econ_courses[0][['Course']]
regex_df['course_no'] = regex_df['Course'].str.extract(r'(\d{3})')
regex_df.head(4)
Course course_no
0 ECON 200 A 200
1 ECON 200 AA 200
2 ECON 200 AB 200
3 ECON 200 AC 200

Generally Useful Regular Expressions

  • Match any one upper-case letter: [A-Z]
  • Match any one lower-case letter: [a-z]
  • Match any one letter: [A-Za-z]
  • Match any one character: .
  • Match zero or more of the preceding character (greedy): *
  • Match one or more of the preceding character (also greedy): + (append ? to either for a non-greedy match)
  • Match any whitespace character: \s (a literal space also matches a space)
  • Match any text between “a” and “b”: (?<=a).+(?=b)
  • Match “a” or “b”: |
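A couple of these patterns in action on made-up strings:

trips = pd.Series(['Seattle to Portland', 'Tacoma to Spokane'])
trips.str.extract(r'((?<=to ).+)')     # everything after "to ": the destination
trips.str.match(r'(Seattle|Tacoma)')   # alternation: does the string start with either city?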

Example: Match Phone Numbers

messy_phone_nos = pd.Series(['410-662-6967', '292-330-2492', '012-34-5678'])
messy_phone_nos.str.match(r'\d+-\d+-\d+')
0    True
1    True
2    True
dtype: bool
messy_phone_nos.str.match(r'\d{3}-\d{3}-\d{4}')
0     True
1     True
2    False
dtype: bool

Example: Match Washington Cities

wa_cities = pd.Series(
    ['Seattle, WA', 'Bellingham, Washington', 'Portland, Oregon']
)
wa_cities.str.match(r'[A-Za-z]+, (WA|Washington)')
0     True
1     True
2    False
dtype: bool
wa_cities.str.match(r'[A-Za-z]+, W[Aa]+?')
0     True
1     True
2    False
dtype: bool

Exercise: Matching

Write a regular expression that will match each element in its entirety in these arrays:

  1. “285a”, “2a86”, “44b”, “abc”
  2. “1 + 1 = 2”, “10 + 2 = 12”, “3+11=14”
  3. Match any numpy method, but no pandas methods (assume np and pd prefixes)

A useful engine for practice: regex101

Hint: check out what ? does.

Matching Solutions

pd.Series(["285a", "2a86", "44b", "abc"]).str.extract('([a-z\d]+)')[0]
0    285a
1    2a86
2     44b
3     abc
Name: 0, dtype: object
pd.Series(["1 + 1 = 2", "10 + 2 = 12", "3+11=14"])\
    .str.extract(r'(\d+\s?\+\s?\d+\s?\=\s?\d+)')[0]
0      1 + 1 = 2
1    10 + 2 = 12
2        3+11=14
Name: 0, dtype: object
pd.Series(['np.inner', 'pd.read_html']).str.extract(r'(np\.[A-Za-z\_]+)')[0]
0    np.inner
1         NaN
Name: 0, dtype: object

Wrangling Data in pandas

Database-Style Joining

When we want to link data in two DataFrames by keys, we will join or merge them

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})
df1,df2
(  key  data1
 0   b      0
 1   b      1
 2   a      2
 3   c      3
 4   a      4
 5   a      5
 6   b      6,
   key  data2
 0   a      0
 1   b      1
 2   d      2)

pd.merge

pd.merge(df1, df2)
key data1 data2
0 b 0 1
1 b 1 1
2 a 2 0
3 a 4 0
4 a 5 0
5 b 6 1

How does it know how to join? Do we have all of our original observations?

Specifying Merge Keys

A best practice is to explicitly state the columns you want to join on

pd.merge(df1, df2, on="key")
key data1 data2
0 b 0 1
1 b 1 1
2 a 2 0
3 a 4 0
4 a 5 0
5 b 6 1

Different Names in Merge

df1.rename(columns={'key': 'key1'}, inplace=True)
df2.rename(columns={'key': 'key2'}, inplace=True)
pd.merge(df1, df2, left_on='key1', right_on='key2')
key1 data1 key2 data2
0 b 0 b 1
1 b 1 b 1
2 a 2 a 0
3 a 4 a 0
4 a 5 a 0
5 b 6 b 1

Outer Join

pd.merge(df1, df2, left_on='key1', right_on='key2', how='outer')
key1 data1 key2 data2
0 a 2 a 0
1 a 4 a 0
2 a 5 a 0
3 b 0 b 1
4 b 1 b 1
5 b 6 b 1
6 c 3 NaN <NA>
7 NaN <NA> d 2

Left Join

pd.merge(df1, df2, left_on='key1', right_on='key2', how='left')
key1 data1 key2 data2
0 b 0 b 1
1 b 1 b 1
2 a 2 a 0
3 c 3 NaN <NA>
4 a 4 a 0
5 a 5 a 0
6 b 6 b 1

Right Join

pd.merge(df1, df2, left_on='key1', right_on='key2', how='right')
key1 data1 key2 data2
0 a 2 a 0
1 a 4 a 0
2 a 5 a 0
3 b 0 b 1
4 b 1 b 1
5 b 6 b 1
6 NaN <NA> d 2

Non-distinct Join Keys

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": pd.Series(range(6), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": pd.Series(range(5), dtype="Int64")})
df1,df2
(  key  data1
 0   b      0
 1   b      1
 2   a      2
 3   c      3
 4   a      4
 5   b      5,
   key  data2
 0   a      0
 1   b      1
 2   a      2
 3   b      3
 4   d      4)

Cartesian Product

print(pd.merge(df1, df2, on='key', how='left'))
   key  data1  data2
0    b      0      1
1    b      0      3
2    b      1      1
3    b      1      3
4    a      2      0
5    a      2      2
6    c      3   <NA>
7    a      4      0
8    a      4      2
9    b      5      1
10   b      5      3

Inner Joins Don’t Help

print(pd.merge(df1, df2, on='key', how='inner'))
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   a      2      0
5   a      2      2
6   a      4      0
7   a      4      2
8   b      5      1
9   b      5      3

join

To merge on the index of a DataFrame, we can use join

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=["a", "c", "e"],
                     columns=["Ohio", "Nevada"]).astype("Int64")
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=["b", "c", "d", "e"],
                      columns=["Missouri", "Alabama"]).astype("Int64")
left2,right2
(   Ohio  Nevada
 a     1       2
 c     3       4
 e     5       6,
    Missouri  Alabama
 b         7        8
 c         9       10
 d        11       12
 e        13       14)

Using join

print(left2.join(right2, how = 'inner'))
   Ohio  Nevada  Missouri  Alabama
c     3       4         9       10
e     5       6        13       14
print(left2.join(right2, how = 'outer'))
   Ohio  Nevada  Missouri  Alabama
a     1       2      <NA>     <NA>
b  <NA>    <NA>         7        8
c     3       4         9       10
d  <NA>    <NA>        11       12
e     5       6        13       14

Exercise: Joins/Merges

  • Create a DataFrame that has every combination of the integers from 1-10 and their names as strings (i.e. “one”, “two”, “three”, etc) – you should have 100 combinations.
  • Then create a new column which contains the computation of the integer exponentiated to the number of letters in the name for all 100 computations.
  • Finally, filter to rows where that value is divisible by 27.

Solutions: Joins

df1 = pd.DataFrame({'ints': np.arange(1,11), 'key': 1}).set_index('key')
char_list = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
df2 = pd.DataFrame({'chars': char_list, 'key': 1}).set_index('key')
cj_data = df1.join(df2, how = 'left')
cj_data.shape
(100, 2)
cj_data['n_chars'] = cj_data['chars'].str.len()
cj_data['exp'] = cj_data['ints'] ** cj_data['n_chars']
print(cj_data.loc[cj_data['exp'] % 27 == 0].shape)
print(cj_data.loc[cj_data['exp'] % 27 == 0].head(5))
(30, 4)
     ints  chars  n_chars  exp
key                           
1       3    one        3   27
1       3    two        3   27
1       3  three        5  243
1       3   four        4   81
1       3   five        4   81

Stacking Data

  • A common occurrence is that data is stored in individual files, and we want to combine all the files into one dataset
    • E.g. a folder full of CSV files
  • Remember DRY: we don’t want to copy-paste 15 versions of pd.read_csv(...) and then combine them
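A sketch of the pattern we're building toward (the folder name is hypothetical; pd.concat is introduced on the next slide):

from pathlib import Path
csv_files = sorted(Path('data/monthly_sales').glob('*.csv'))       # hypothetical folder of CSVs
combined = pd.concat([pd.read_csv(file) for file in csv_files])    # one read per file, stacked once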

pd.concat

s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")
pd.concat([s1, s2, s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64

Note that we passed a list to pd.concat

Example: Economics Courses in 2023

  • Course offerings for a given year and quarter are listed at https://econ.washington.edu/courses/<year>/<quarter>/all
  • We already know how to download the tables with pd.read_html

Powerful Combination: List Comprehension and pd.concat

def download_econ_courses(year:int, quarter: str) -> pd.DataFrame:
    """
    Download Economics Course Offerings for `year` and `quarter`
    """
    url = f'https://econ.washington.edu/courses/{year}/{quarter}/all'
    df = pd.read_html(url, match = 'ECON Courses')[0]
    df['quarter'] = quarter
    return df

quarters = ['winter', 'spring', 'summer', 'autumn']

data_2023 = pd.concat([download_econ_courses(2023, x) for x in quarters])
print(data_2023.head(2))
        Course Course Title (click for details)    SLN    Instructor  \
0   ECON 200 A   Introduction to Microeconomics  13928  Melissa Knox   
1  ECON 200 AA   Introduction to Microeconomics  13929   Ken Inosaki   

          Meeting Time quarter  
0  TTh 8:30am - 9:50am  winter  
1    F 8:30am - 9:50am  winter  

Pivoting Data

  • We’ll often want to change the fundamental form of the data from “wide” to “long” or the opposite
  • For example, instead of having “year” as a column, we may want each year to be its own column
  • This is always kind of a pain
    • pandas makes it relatively painless with pivot and melt

Econ Course Data

print(data_2023.tail(5))
        Course Course Title (click for details)    SLN         Instructor  \
69  ECON 594 A                  Economic Growth  14241  Stephen Turnovsky   
70  ECON 600 A    Independent Study or Research  14242                NaN   
71  ECON 601 A                       Internship  14243                NaN   
72  ECON 602 A  Teaching Introductory Economics  14244      Yael Midnight   
73  ECON 800 A            Doctoral Dissertation  14245                NaN   

            Meeting Time quarter  
69  MW 10:30am - 11:50am  autumn  
70        to be arranged  autumn  
71        to be arranged  autumn  
72    W 9:30am - 11:20am  autumn  
73        to be arranged  autumn  

What if we wanted to know courses taught by a professor over the year (400 level)?

Data to Pivot

profs = ['Melissa Knox', 'Yael Jacobs', 'Fahad Khalil']
data_subset = data_2023.loc[
    data_2023['Instructor'].isin(profs) & 
    (data_2023['Course'].str.get(5) == '4')
]
print(data_subset)
        Course            Course Title (click for details)    SLN  \
42  ECON 400 A                     Advanced Microeconomics  13971   
56  ECON 485 A  Game Theory with Applications to Economics  13986   
54  ECON 448 A                  Population and Development  13683   

      Instructor          Meeting Time quarter  
42  Melissa Knox   TTh 1:30pm - 3:20pm  winter  
56  Fahad Khalil  MW 10:30am - 12:20pm  winter  
54  Melissa Knox  TTh 8:30am - 10:20am  spring  

Pivoting

pivoted = data_subset.pivot(
    index='Instructor', # what should the index be?
    columns='quarter', # what column should create the columns?
    values='Course' # what values should fill in the new columns?
)
pivoted
quarter           spring      winter
Instructor                          
Fahad Khalil         NaN  ECON 485 A
Melissa Knox  ECON 448 A  ECON 400 A

Full Data Pivot

Will this work for the full dataset?

data_2023.pivot(
    index='Instructor',
    columns='quarter',
    values='Course'
)
ValueError: Index contains duplicate entries, cannot reshape

How can we see this?

dup_entries = data_2023.loc[
    data_2023[['Instructor', 'quarter']].duplicated(keep=False) &
    ~pd.isna(data_2023['Instructor']),
    ['Instructor', 'quarter', 'Course']
]

Duplicates

print(dup_entries.sort_values(['Instructor', 'quarter']))
        Instructor quarter      Course
57   Alan Griffith  autumn  ECON 496 A
67   Alan Griffith  autumn  ECON 587 A
44   Alan Griffith  winter  ECON 410 A
68   Alan Griffith  winter  ECON 590 A
64   Brian Greaney  spring  ECON 509 A
..             ...     ...         ...
74      Yanqin Fan  spring  ECON 599 A
47    Yu-chin Chen  autumn  ECON 426 A
60    Yu-chin Chen  autumn  ECON 502 A
67  Yuya Takahashi  spring  ECON 534 A
73  Yuya Takahashi  spring  ECON 596 B

[71 rows x 3 columns]

Resetting the Index

To unpivot (melt), we need the key variable to not be the index:

print(pivoted.reset_index())
quarter    Instructor      spring      winter
0        Fahad Khalil         NaN  ECON 485 A
1        Melissa Knox  ECON 448 A  ECON 400 A

Melting

Then we can melt:

melted = pd.melt(
    pivoted.reset_index(),
    id_vars = 'Instructor'
)
print(melted)
     Instructor quarter       value
0  Fahad Khalil  spring         NaN
1  Melissa Knox  spring  ECON 448 A
2  Fahad Khalil  winter  ECON 485 A
3  Melissa Knox  winter  ECON 400 A

Aggregation

How Aggregation Works in pandas

pandas aggregation follows the split-apply-combine paradigm: split the data into groups, apply a function to each group, and combine the results into a new object.

groupby

The first step is telling pandas how to split (which groups to use):

grouped_data = data_2023.groupby('Course')
grouped_data
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x15daf2cf0>

This is our dataset, but with instructions about how to group

Aggregating

We can just call some methods directly on the grouped data (this will be fastest)

print(grouped_data.count().head(10))
             Course Title (click for details)  SLN  Instructor  Meeting Time  \
Course                                                                         
ECON 200 A                                  4    4           4             4   
ECON 200 AA                                 3    3           3             3   
ECON 200 AB                                 3    3           3             3   
ECON 200 AC                                 3    3           3             3   
ECON 200 AD                                 3    3           3             3   
ECON 200 AE                                 3    3           3             3   
ECON 200 AF                                 3    3           3             3   
ECON 200 AG                                 3    3           3             3   
ECON 200 AH                                 3    3           3             3   
ECON 200 AI                                 3    3           3             3   

             quarter  
Course                
ECON 200 A         4  
ECON 200 AA        3  
ECON 200 AB        3  
ECON 200 AC        3  
ECON 200 AD        3  
ECON 200 AE        3  
ECON 200 AF        3  
ECON 200 AG        3  
ECON 200 AH        3  
ECON 200 AI        3  

Aggregating On Specific Columns

print(grouped_data['Instructor'].count().head(5))
Course
ECON 200 A     4
ECON 200 AA    3
ECON 200 AB    3
ECON 200 AC    3
ECON 200 AD    3
Name: Instructor, dtype: int64
print(grouped_data[['Instructor']].count().head(5))
             Instructor
Course                 
ECON 200 A            4
ECON 200 AA           3
ECON 200 AB           3
ECON 200 AC           3
ECON 200 AD           3

Load Apple Data

import json
with open('resources/aapl_json.json') as f:
    aapl_dict = json.load(f)
times = pd.to_datetime(aapl_dict['TimeInfo']['Ticks'], unit='ms')
times.name = 'time'
aapl_hfts = pd.DataFrame(
    data = aapl_dict['Series'][0]['DataPoints'], 
    index = times
)
print(aapl_hfts.head(5))
                           0
time                        
2024-04-05 13:30:00  169.540
2024-04-05 13:31:00  169.505
2024-04-05 13:32:00  169.300
2024-04-05 13:33:00  169.132
2024-04-05 13:34:00  169.030

Clean Data

aapl_hfts.reset_index(inplace=True)
aapl_hfts.rename(columns={0: 'price'}, inplace=True)
aapl_hfts['date'] = aapl_hfts['time'].dt.date
print(aapl_hfts.head(5))
                 time    price        date
0 2024-04-05 13:30:00  169.540  2024-04-05
1 2024-04-05 13:31:00  169.505  2024-04-05
2 2024-04-05 13:32:00  169.300  2024-04-05
3 2024-04-05 13:33:00  169.132  2024-04-05
4 2024-04-05 13:34:00  169.030  2024-04-05

Aggregate

print(aapl_hfts.groupby('date')['price'].ohlc())
              open    high     low   close
date                                      
2024-04-05  169.54  170.38  169.03  169.47

If we don’t want the grouped columns as an index, we can pass as_index=False to groupby:

agg_data = aapl_hfts.groupby('date', as_index=False)[['price']].mean()
print(agg_data)
print(agg_data.columns)
         date       price
0  2024-04-05  169.681324
Index(['date', 'price'], dtype='object')

Renaming Aggregate Columns

If we’re using agg, we often don’t like the names it spits out – it’s easy to change them: we just pass (name, function) tuples:

new_agg = aapl_hfts.groupby('date', as_index=False)[['price']].agg(
    [('average_price', 'mean'),
     ('sd', np.std)]
)
new_agg
         date          price          
                average_price        sd
0  2024-04-05     169.681324  0.281085

Aside: Hierarchical Columns

Note that the columns here are now hierarchical

new_agg.columns
MultiIndex([( 'date',              ''),
            ('price', 'average_price'),
            ('price',            'sd')],
           )

If we want to get columns from a level, we can do so:

new_agg.columns.get_level_values(1)
Index(['', 'average_price', 'sd'], dtype='object')

We can also just flatten the index:

new_agg.columns.to_flat_index()
Index([('date', ''), ('price', 'average_price'), ('price', 'sd')], dtype='object')

Or remove a level entirely:

new_agg.columns.droplevel(0)
Index(['', 'average_price', 'sd'], dtype='object')
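One common follow-up (a sketch, not from the slides) is to join the levels into flat column names:

new_agg.columns = ['_'.join(filter(None, col)) for col in new_agg.columns.to_flat_index()]
new_agg.columns   # Index(['date', 'price_average_price', 'price_sd'], dtype='object')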

Aggregate with Multiple Functions

print(aapl_hfts.groupby('date')['price'].agg(['ohlc', 'mean', 'min', 'max']))
              ohlc                                mean     min     max
              open    high     low   close       price   price   price
date                                                                  
2024-04-05  169.54  170.38  169.03  169.47  169.681324  169.03  170.38

This gives you the flavor and use cases that I’ve most encountered, but if you want to get really in-depth (pandas can do some crazy stuff), check out McKinney’s chapter.