pandas
2024-11-28
pandas is an open-source library that makes working with tabular data easy
- numpy: good at arrays/matrix operations (e.g. all float values)
- pandas: good at heterogeneous data (e.g. some float, some character, etc.)
pandas Data Structures
We’ll always use the pd prefix for pandas. We’ll also import numpy with its standard prefix.
Series
pandas has a similar object to np.array – Series:
0 4
1 7
2 -5
3 3
dtype: int64
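A Series like the one above can be built by passing a list to the constructor (a sketch; the variable name `obj` is my own):

```python
import pandas as pd

# a Series pairs an array of values with an index (default: 0, 1, 2, ...)
obj = pd.Series([4, 7, -5, 3])
print(obj)
```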
Series
The index allows us more descriptive access to values:
-5
We can still use boolean indexing without removing the index link:
d 6
b 7
c 3
dtype: int64
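One way to reproduce the outputs above (the variable name and the prior assignment `obj2['d'] = 6` are my reconstruction from the printed values):

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2['a']                   # label-based access: returns -5
obj2['d'] = 6               # labels also support assignment
positive = obj2[obj2 > 0]   # boolean indexing keeps the index labels
print(positive)
print(positive.array)       # the underlying values, without the index
```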
Note that the index is still there, and both array and index are accessible with Series methods:
<NumpyExtensionArray>
[6, 7, 3]
Length: 3, dtype: int64
Series from dict
Since we’re arguing that Series is very similar to a dict, it’s logical that we can create one from a dict:
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
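A sketch of that construction (the dict keys become the index, the values become the data; the variable names are my own):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)  # keys -> index, values -> data
print(obj3)
```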
Series
We can pass an index directly in the creation of a Series object:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
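A sketch of how the output above arises (variable names assumed):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
# 'California' has no entry in sdata -> NaN; 'Utah' is not requested -> dropped
obj4 = pd.Series(sdata, index=states)
```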
Why do we have a NaN (not a number)? Where is Utah’s entry?
NaN
Use the pd.isna function to identify these values:
California True
Ohio False
Oregon False
Texas False
dtype: bool
We often want to look at arrays that have null values removed:
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
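Filtering the nulls out can be sketched with boolean indexing on `notna` (or, equivalently, `dropna`):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])
clean = obj4[obj4.notna()]  # keep only non-null entries (or: obj4.dropna())
```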
DataFrame
The power of pandas is in the DataFrame object. There are many ways to construct a DataFrame, but a common one is passing a dict:
| state | year | pop
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
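A sketch of the dict-of-lists construction behind the table above (variable names are my own):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)  # each dict key becomes a column
```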
head and tail
| state | year | pop
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
numpy Array as DataFrame
Look up pd.DataFrame’s documentation and create a \(1000 \times 26\) DataFrame where each column has a letter of the alphabet as its header, and its rows are 1000 Poisson random variables.
numpy Array as DataFrame
| a | b | c | d | e | f | g | h | i | j | ... | q | r | s | t | u | v | w | x | y | z
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 4 | 1 | 1 | 1 | 0 | 2 | 1 | 0 |
1 | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 4 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 1 | 0 | 2 | 0 | 1 | 2 | 0 | 3 | 0 | 0 | ... | 0 | 2 | 3 | 0 | 0 | 2 | 1 | 1 | 0 | 2 |
3 rows × 26 columns
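One possible solution sketch (the Poisson rate `lam=1` is an assumption consistent with the small counts shown above):

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase

rng = np.random.default_rng()
# 1000 rows of 26 Poisson draws; the rate lam=1 is my assumption
df = pd.DataFrame(rng.poisson(lam=1, size=(1000, 26)),
                  columns=list(ascii_lowercase))
```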
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
5 Nevada
Name: state, dtype: object
Bracket indexing (e.g. df['state']) is safer than attribute access (df.state) – attribute access will not work with variable names like “my var”
| state | year
---|---|---|
0 | Ohio | 2000 |
1 | Ohio | 2001 |
2 | Ohio | 2002 |
3 | Nevada | 2001 |
4 | Nevada | 2002 |
5 | Nevada | 2003 |
Note that this is a DataFrame, while the previous objects were Series.
DataFrame
| state
---|---|
0 | Ohio |
1 | Ohio |
2 | Ohio |
3 | Nevada |
4 | Nevada |
5 | Nevada |
| state | year | pop | debt | debt2
---|---|---|---|---|---|
0 | Ohio | 2000 | 1.5 | 1.5 | 0 |
1 | Ohio | 2001 | 1.7 | 1.5 | 1 |
2 | Ohio | 2002 | 3.6 | 1.5 | 2 |
3 | Nevada | 2001 | 2.4 | 1.5 | 3 |
4 | Nevada | 2002 | 2.9 | 1.5 | 4 |
5 | Nevada | 2003 | 3.2 | 1.5 | 5 |
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Suppose we want to set a new index on this Series. What happens?
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
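A sketch of the reindex call (variable names assumed):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])  # 'e' has no existing value -> NaN
```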
Why is a NaN induced?
reindex for Economic Data
| price | quantity
---|---|---|
1 | 0.054966 | 0.839750 |
3 | 0.606854 | 0.865037 |
5 | 0.140897 | 0.488995 |
reindex Behavior
What should we do if we want one observation per day?
ffill
How should we interpolate day 2’s data?
pandas Indexing (numpy Notation)
We can filter data via the same notation as in numpy:
| price | quantity
---|---|---|
1 | 0.054966 | 0.839750 |
2 | 0.054966 | 0.839750 |
3 | 0.606854 | 0.865037 |
4 | 0.606854 | 0.865037 |
5 | 0.140897 | 0.488995 |
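The forward-filled table above can be sketched like this (the prices here are made-up stand-ins for the random values in the slides):

```python
import pandas as pd

# toy prices observed only on days 1, 3, 5 (made-up numbers)
prices = pd.DataFrame({'price': [0.05, 0.61, 0.14],
                       'quantity': [0.84, 0.87, 0.49]},
                      index=[1, 3, 5])
daily = prices.reindex(range(1, 6), method='ffill')  # forward-fill missing days
```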
pandas Indexing (.loc)
The preferred method for subsetting data in pandas is using the .loc method:
| price | quantity
---|---|---|
1 | 0.054966 | 0.839750 |
2 | 0.054966 | 0.839750 |
3 | 0.606854 | 0.865037 |
4 | 0.606854 | 0.865037 |
5 | 0.140897 | 0.488995 |
Note that we previously were using Boolean indexing – we can of course also index using the index itself
| price | quantity
---|---|---|
1 | 0.054966 | 0.839750 |
3 | 0.606854 | 0.865037 |
5 | 0.140897 | 0.488995 |
.loc
1 0.839750
2 0.839750
3 0.865037
4 0.865037
5 0.488995
Name: quantity, dtype: float64
.loc
| price | quantity
---|---|---|
0 | 0.000000 | 0.000000 |
1 | 0.054966 | 0.839750 |
2 | 0.054966 | 0.839750 |
3 | 0.606854 | 0.865037 |
4 | 0.606854 | 0.865037 |
5 | 0.140897 | 0.488995 |
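The two `.loc` uses above (selecting a column, then assigning a new row labeled 0) can be sketched as follows, again with made-up numbers and my own variable names:

```python
import pandas as pd

df = pd.DataFrame({'price': [0.05, 0.05, 0.61, 0.61, 0.14],
                   'quantity': [0.84, 0.84, 0.87, 0.87, 0.49]},
                  index=range(1, 6))
qty = df.loc[:, 'quantity']  # all rows, one column -> a Series
df.loc[0] = 0                # .loc can assign too: add a new row labeled 0
df = df.sort_index()         # put day 0 first
```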
ufuncs with pandas
| b | d | e
---|---|---|---|
Utah | -1.075511 | 1.574516 | -0.988367 |
Ohio | 1.321872 | 0.379486 | -0.361248 |
Texas | 1.529598 | 0.491796 | 0.353201 |
Oregon | -1.462351 | -0.952698 | -0.553078 |
np.abs
| b | d | e
---|---|---|---|
Utah | 1.075511 | 1.574516 | 0.988367 |
Ohio | 1.321872 | 0.379486 | 0.361248 |
Texas | 1.529598 | 0.491796 | 0.353201 |
Oregon | 1.462351 | 0.952698 | 0.553078 |
apply
b 2.991949
d 2.527214
e 1.341568
dtype: float64
apply along rows
Utah 2.650027
Ohio 1.683120
Texas 1.176397
Oregon 0.909273
dtype: float64
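Both outputs (the column-wise and the row-wise ranges) can be sketched with `apply`; the spread function matches the max-minus-min values printed above, while the data and names are my reconstruction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=['b', 'd', 'e'],
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

spread = lambda x: x.max() - x.min()
by_col = frame.apply(spread)                  # default: once per column
by_row = frame.apply(spread, axis='columns')  # once per row
```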
apply
Use apply to take the logarithm of a column in your DataFrame (use np.log). Check out the .map method for how you could do this if you wanted to.
Use your \(1000 \times 26\) DataFrame and return the sample mean and variance for each column. Use apply and numpy for one, and only numpy for the other. Is there a time difference?
(26,)
CPU times: user 782 μs, sys: 71 μs, total: 853 μs
Wall time: 790 μs
pandas
d 0
a 1
b 2
c 3
dtype: int64
We also want to sort data by specific variables that are not the index
d 0
a 1
b 2
c 3
dtype: int64
As we’ve seen, we can call aggregating functions on DataFrame objects to get some summary stats
| one | two
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
one 9.25
two -5.80
dtype: float64
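The sums above can be sketched like this; note how the missing values are handled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.40, 7.10, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])
totals = df.sum()  # NaNs are skipped by default (skipna=True)
```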
What are these methods doing with NaN values?
A fuller list of summary statistics methods for pandas is available here.
DataFrame.sum versus np.sum
Note that np.sum will have different default behavior depending on the object that it’s called on:
25866
describe
We can also use the built-in describe method:
| one | two
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
describe with Non-numeric Data
count 16
unique 3
top a
freq 8
dtype: object
unique
We often want to know the unique (or distinct) values in a Series (equivalently, a column in our DataFrame):
array(['c', 'a', 'd', 'b'], dtype=object)
isin
We also can check which elements of a Series object match certain values:
0 c
5 b
6 b
7 c
8 c
dtype: object
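A sketch reproducing the output above (the underlying Series is reconstructed from the printed index positions):

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
mask = obj.isin(['b', 'c'])  # True where the value is 'b' or 'c'
matches = obj[mask]          # note the original integer index is preserved
```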
pandas
A major strength of pandas is that it has methods for reading an assortment of stored data into a DataFrame.
pd.read_csv
Reads data from a comma-separated value file (CSV). The data looks like this:
| a | b | c
---|---|---|---|
0 | 1 | 2 | europe |
1 | 3 | 5 | asia |
2 | -2 | 6 | north america |
pd.read_csv Arguments
- sep: what the delimiter in the file is (“,” and “|” are common)
- na_values: list of strings to coerce to NaN (e.g. “NULL”, “N/A”, “NA”)
- names: pass your own column names instead of inferring from the data
- skiprows: tell pandas to not read in certain rows of the file
- index_col: tell pandas which column to use as the index
pd.read_excel
Reads in data from a specific Excel sheet (remember you can have multiple sheets in a workbook)
| FY 2014 | FY 2015
---|---|---|
0 | 80540.0 | 78570.00 |
1 | 1813246.0 | 1813246.00 |
2 | 39560.0 | 19995.65 |
3 | 465407.0 | 460904.00 |
4 | 223206.0 | 208056.00 |
pd.read_excel Arguments
- sheet_name: if a workbook has multiple sheets, access the one that you want
- names: pass your own column names instead of inferring from the data
- thousands/decimal: tell pandas what the separator for thousands or decimals is if you want to parse a text field to numeric
- index_col: tell pandas which column to use as the index
pd.read_html
- If a webpage has a table, we can often get pandas to read it directly
- pd.read_html returns a list of DataFrame objects
- Use the match argument to get the table you want
| Course | Course Title (click for details) | SLN | Instructor | Meeting Time
---|---|---|---|---|---|
0 | ECON 200 A | Introduction to Microeconomics | 13840 | Melissa Knox | TTh 8:30am - 9:50am |
1 | ECON 200 AA | Introduction to Microeconomics | 13841 | NaN | F 8:30am - 9:50am |
2 | ECON 200 AB | Introduction to Microeconomics | 13842 | NaN | F 9:00am - 10:20am |
pd.read_html Arguments
- Many of the same arguments as pd.read_csv and pd.read_excel
- attrs: a dictionary of attributes to identify a table in the webpage’s HTML
pandas
We saw the .unique() method to get the unique values from a Series – what about duplicated rows in DataFrames?
| k1 | k2
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
If we want to bin data, it can be surprisingly tricky to code – luckily, pandas has a built-in method for us:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
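The binning above is `pd.cut`; the ages here are my reconstruction consistent with the printed categories:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)  # right-closed intervals such as (18, 25]
```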
Let’s return to our Econ department data
0 ECON 200 A
1 ECON 200 AA
2 ECON 200 AB
3 ECON 200 AC
4 ECON 200 AD
...
71 ECON 591 A
72 ECON 596 A
73 ECON 600 A
74 ECON 601 A
75 ECON 800 A
Name: Course, Length: 76, dtype: object
How could we classify courses as 200-level, 300-level, 400-level, and graduate level?
str.slice
| Course | course_no
---|---|---|
0 | ECON 200 A | 200 |
1 | ECON 200 AA | 200 |
2 | ECON 200 AB | 200 |
3 | ECON 200 AC | 200 |
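The slicing can be sketched like this (the three-row DataFrame is a stand-in for the full course list):

```python
import pandas as pd

df = pd.DataFrame({'Course': ['ECON 200 A', 'ECON 200 AA', 'ECON 216 A']})
df['course_no'] = df['Course'].str.slice(5, 8)  # characters 5-7: the digits
```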
How can we check?
str.slice
course_no
200 16
201 14
300 7
301 4
483 2
600 1
596 1
591 1
585 1
584 1
485 1
539 1
518 1
503 1
501 1
499 1
497 1
491 1
488 1
601 1
581 1
482 1
472 1
471 1
451 1
448 1
443 1
432 1
426 1
424 1
422 1
421 1
411 1
406 1
399 1
382 1
216 1
800 1
Name: count, dtype: int64
When would this approach fail?
np.where
| Course | course_no | classification
---|---|---|---|
30 | ECON 216 A | 216 | 200 level |
50 | ECON 432 A | 432 | 400 level |
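One way to build that classification with np.where (a sketch; the rounding-to-hundreds rule is my reconstruction of the labels shown above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Course': ['ECON 216 A', 'ECON 432 A', 'ECON 600 A']})
no = df['Course'].str.slice(5, 8).astype(int)
# 500+ is graduate; otherwise round down to the nearest hundred
df['classification'] = np.where(no >= 500, 'graduate',
                                (no // 100 * 100).astype(str) + ' level')
```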
Regular Expressions in pandas
We can use a regular expression to grab the 3-digit substring within any larger string: \d{3} (match three digits)
These are extremely useful in working with string data in pandas
| Course | course_no
---|---|---|
0 | ECON 200 A | 200 |
1 | ECON 200 AA | 200 |
2 | ECON 200 AB | 200 |
3 | ECON 200 AC | 200 |
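The regex-based extraction can be sketched with `str.extract` (a small stand-in DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'Course': ['ECON 200 A', 'ECON 200 AA']})
# expand=False returns a Series instead of a one-column DataFrame
df['course_no'] = df['Course'].str.extract(r'(\d{3})', expand=False)
```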
- [A-Z]: match any uppercase letter
- [a-z]: match any lowercase letter
- [A-Za-z]: match any letter
- .: match any single character
- *: match zero or more of the preceding pattern
- +: match one or more of the preceding pattern
- \s: match a whitespace character (or use a literal space)
- (?<=a).+(?=b): match one or more characters that come after “a” and before “b” (lookbehind/lookahead)
- |: match the pattern on either side (“or”)
Write a regular expression that will match each element in its entirety in these arrays:
(assume the usual np and pd prefixes)
A useful engine for practice: regex101
Hint: check out what ? does.
0 285a
1 2a86
2 44b
3 abc
Name: 0, dtype: object
0 1 + 1 = 2
1 10 + 2 = 12
2 3+11=14
Name: 0, dtype: object
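One possible pair of answers (hedged: other patterns work too; `str.fullmatch` checks that the whole element matches):

```python
import pandas as pd

s1 = pd.Series(['285a', '2a86', '44b', 'abc'])
s2 = pd.Series(['1 + 1 = 2', '10 + 2 = 12', '3+11=14'])

m1 = s1.str.fullmatch(r'\d*[a-z]+\d*')          # optional digits around letters
m2 = s2.str.fullmatch(r'\d+ ?\+ ?\d+ ?= ?\d+')  # '?' makes each space optional
```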
pandas
When we want to link data in two DataFrames by keys, we will join or merge them
( key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6,
key data2
0 a 0
1 b 1
2 d 2)
pd.merge
| key | data1 | data2
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | a | 2 | 0 |
3 | a | 4 | 0 |
4 | a | 5 | 0 |
5 | b | 6 | 1 |
How does it know how to join? Do we have all of our original observations?
A best practice is to explicitly state the columns you want to join on
| key | data1 | data2
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | a | 2 | 0 |
3 | a | 4 | 0 |
4 | a | 5 | 0 |
5 | b | 6 | 1 |
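The explicit-key merge can be sketched like this (df1 and df2 reconstructed from the printout earlier):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
merged = pd.merge(df1, df2, on='key')  # inner join on 'key' by default
```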
| key1 | data1 | key2 | data2
---|---|---|---|---|
0 | b | 0 | b | 1 |
1 | b | 1 | b | 1 |
2 | a | 2 | a | 0 |
3 | a | 4 | a | 0 |
4 | a | 5 | a | 0 |
5 | b | 6 | b | 1 |
| key1 | data1 | key2 | data2
---|---|---|---|---|
0 | a | 2 | a | 0 |
1 | a | 4 | a | 0 |
2 | a | 5 | a | 0 |
3 | b | 0 | b | 1 |
4 | b | 1 | b | 1 |
5 | b | 6 | b | 1 |
6 | c | 3 | NaN | <NA> |
7 | NaN | <NA> | d | 2 |
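The outer join with differently named keys can be sketched as follows; unmatched rows from either side survive with missing values:

```python
import pandas as pd

df3 = pd.DataFrame({'key1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'key2': ['a', 'b', 'd'], 'data2': range(3)})
# outer join: keep every key from both sides, filling misses with NaN
outer = pd.merge(df3, df4, left_on='key1', right_on='key2', how='outer')
```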
| key1 | data1 | key2 | data2
---|---|---|---|---|
0 | b | 0 | b | 1 |
1 | b | 1 | b | 1 |
2 | a | 2 | a | 0 |
3 | c | 3 | NaN | <NA> |
4 | a | 4 | a | 0 |
5 | a | 5 | a | 0 |
6 | b | 6 | b | 1 |
| key1 | data1 | key2 | data2
---|---|---|---|---|
0 | a | 2 | a | 0 |
1 | a | 4 | a | 0 |
2 | a | 5 | a | 0 |
3 | b | 0 | b | 1 |
4 | b | 1 | b | 1 |
5 | b | 6 | b | 1 |
6 | NaN | <NA> | d | 2 |
( key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5,
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4)
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 a 2 0
5 a 2 2
6 c 3 <NA>
7 a 4 0
8 a 4 2
9 b 5 1
10 b 5 3
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 a 2 0
5 a 2 2
6 a 4 0
7 a 4 2
8 b 5 1
9 b 5 3
join
To merge on the index of a DataFrame, we can use join
( Ohio Nevada
a 1 2
c 3 4
e 5 6,
Missouri Alabama
b 7 8
c 9 10
d 11 12
e 13 14)
join
Ohio Nevada Missouri Alabama
c 3 4 9 10
e 5 6 13 14
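The inner join on the index can be sketched with the two frames printed above:

```python
import pandas as pd

left = pd.DataFrame({'Ohio': [1, 3, 5], 'Nevada': [2, 4, 6]},
                    index=['a', 'c', 'e'])
right = pd.DataFrame({'Missouri': [7, 9, 11, 13], 'Alabama': [8, 10, 12, 14]},
                     index=['b', 'c', 'd', 'e'])
joined = left.join(right, how='inner')  # align on index; keep shared labels
```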
Create a DataFrame that has every combination of the integers from 1-10 and their names as strings (i.e. “one”, “two”, “three”, etc.) – you should have 100 combinations.
(100, 2)
(30, 4)
ints chars n_chars exp
key
1 3 one 3 27
1 3 two 3 27
1 3 three 5 243
1 3 four 4 81
1 3 five 4 81
Sometimes we read in multiple datasets with pd.read_csv(...) and then combine them
pd.concat
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: Int64
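A sketch of the stacking above (two Series reconstructed from the printed labels and values):

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=['a', 'b', 'c'], dtype='Int64')
s2 = pd.Series([3, 4, 5, 6], index=['d', 'e', 'f', 'g'], dtype='Int64')
stacked = pd.concat([s1, s2])  # note: pd.concat takes a *list* of objects
```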
Note that we passed a list to pd.concat
Create a DataFrame of all the Economics courses UW offered in 2023 using pd.read_html and pd.concat
def download_econ_courses(year: int, quarter: str) -> pd.DataFrame:
"""
Download Economics Course Offerings for `year` and `quarter`
"""
url = f'https://econ.washington.edu/courses/{year}/{quarter}/all'
df = pd.read_html(url, match = 'ECON Courses')[0]
df['quarter'] = quarter
return df
quarters = ['winter', 'spring', 'summer', 'autumn']
data_2023 = pd.concat([download_econ_courses(2023, x) for x in quarters])
print(data_2023.head(2))
Course Course Title (click for details) SLN Instructor \
0 ECON 200 A Introduction to Microeconomics 13928 Melissa Knox
1 ECON 200 AA Introduction to Microeconomics 13929 Ken Inosaki
Meeting Time quarter
0 TTh 8:30am - 9:50am winter
1 F 8:30am - 9:50am winter
Course Course Title (click for details) SLN Instructor \
69 ECON 594 A Economic Growth 14241 Stephen Turnovsky
70 ECON 600 A Independent Study or Research 14242 NaN
71 ECON 601 A Internship 14243 NaN
72 ECON 602 A Teaching Introductory Economics 14244 Yael Midnight
73 ECON 800 A Doctoral Dissertation 14245 NaN
Meeting Time quarter
69 MW 10:30am - 11:50am autumn
70 to be arranged autumn
71 to be arranged autumn
72 W 9:30am - 11:20am autumn
73 to be arranged autumn
What if we wanted to know courses taught by a professor over the year (400 level)?
Course Course Title (click for details) SLN \
42 ECON 400 A Advanced Microeconomics 13971
56 ECON 485 A Game Theory with Applications to Economics 13986
54 ECON 448 A Population and Development 13683
Instructor Meeting Time quarter
42 Melissa Knox TTh 1:30pm - 3:20pm winter
56 Fahad Khalil MW 10:30am - 12:20pm winter
54 Melissa Knox TTh 8:30am - 10:20am spring
quarter | spring | winter |
---|---|---|
Instructor | ||
Fahad Khalil | NaN | ECON 485 A |
Melissa Knox | ECON 448 A | ECON 400 A |
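The wide table above is a pivot; a sketch on a stand-in three-row extract of the course data:

```python
import pandas as pd

df = pd.DataFrame({'Instructor': ['Melissa Knox', 'Fahad Khalil', 'Melissa Knox'],
                   'quarter': ['winter', 'winter', 'spring'],
                   'Course': ['ECON 400 A', 'ECON 485 A', 'ECON 448 A']})
wide = df.pivot(index='Instructor', columns='quarter', values='Course')
```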
Will this work for the full dataset?
ValueError: Index contains duplicate entries, cannot reshape
Instructor quarter Course
57 Alan Griffith autumn ECON 496 A
67 Alan Griffith autumn ECON 587 A
44 Alan Griffith winter ECON 410 A
68 Alan Griffith winter ECON 590 A
64 Brian Greaney spring ECON 509 A
.. ... ... ...
74 Yanqin Fan spring ECON 599 A
47 Yu-chin Chen autumn ECON 426 A
60 Yu-chin Chen autumn ECON 502 A
67 Yuya Takahashi spring ECON 534 A
73 Yuya Takahashi spring ECON 596 B
[71 rows x 3 columns]
To unpivot (melt), we need the key variable to not be the index:
quarter Instructor spring winter
0 Fahad Khalil NaN ECON 485 A
1 Melissa Knox ECON 448 A ECON 400 A
Then we can melt:
Instructor quarter value
0 Fahad Khalil spring NaN
1 Melissa Knox spring ECON 448 A
2 Fahad Khalil winter ECON 485 A
3 Melissa Knox winter ECON 400 A
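The melt can be sketched like this, starting from the reset wide table (variable names are my own):

```python
import pandas as pd

wide = pd.DataFrame({'Instructor': ['Fahad Khalil', 'Melissa Knox'],
                     'spring': [None, 'ECON 448 A'],
                     'winter': ['ECON 485 A', 'ECON 400 A']})
long = pd.melt(wide, id_vars=['Instructor'], var_name='quarter')
```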
pandas groupby
The first step is telling pandas how to split (which groups to use):
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x15daf2cf0>
This is our dataset, but with instructions about how to group
We can just call some methods directly on the grouped data (this will be fastest)
Course Title (click for details) SLN Instructor Meeting Time \
Course
ECON 200 A 4 4 4 4
ECON 200 AA 3 3 3 3
ECON 200 AB 3 3 3 3
ECON 200 AC 3 3 3 3
ECON 200 AD 3 3 3 3
ECON 200 AE 3 3 3 3
ECON 200 AF 3 3 3 3
ECON 200 AG 3 3 3 3
ECON 200 AH 3 3 3 3
ECON 200 AI 3 3 3 3
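The split-then-count pattern can be sketched on a tiny stand-in for the course data:

```python
import pandas as pd

df = pd.DataFrame({'Course': ['ECON 200 A', 'ECON 200 A', 'ECON 200 AA'],
                   'quarter': ['winter', 'spring', 'winter']})
grouped = df.groupby('Course')  # the "split" step: a DataFrameGroupBy
counts = grouped.count()        # apply + combine: non-null rows per group
```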
quarter
Course
ECON 200 A 4
ECON 200 AA 3
ECON 200 AB 3
ECON 200 AC 3
ECON 200 AD 3
ECON 200 AE 3
ECON 200 AF 3
ECON 200 AG 3
ECON 200 AH 3
ECON 200 AI 3
Course
ECON 200 A 4
ECON 200 AA 3
ECON 200 AB 3
ECON 200 AC 3
ECON 200 AD 3
Name: Instructor, dtype: int64
0
time
2024-04-05 13:30:00 169.540
2024-04-05 13:31:00 169.505
2024-04-05 13:32:00 169.300
2024-04-05 13:33:00 169.132
2024-04-05 13:34:00 169.030
time price date
0 2024-04-05 13:30:00 169.540 2024-04-05
1 2024-04-05 13:31:00 169.505 2024-04-05
2 2024-04-05 13:32:00 169.300 2024-04-05
3 2024-04-05 13:33:00 169.132 2024-04-05
4 2024-04-05 13:34:00 169.030 2024-04-05
open high low close
date
2024-04-05 169.54 170.38 169.03 169.47
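The open/high/low/close table comes from the grouped `ohlc` method; a sketch on four made-up prices for that date:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2024-04-05'] * 4,
                   'price': [169.54, 170.38, 169.03, 169.47]})
ohlc = df.groupby('date')['price'].ohlc()  # open/high/low/close per group
```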
If we’re using agg, we often don’t like the names it spits out – it’s easy to change them, we just need to pass a tuple:
date | price | ||
---|---|---|---|
average_price | sd | ||
0 | 2024-04-05 | 169.681324 | 0.281085 |
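The renaming-via-tuples idea can be sketched as follows; for brevity I group on date directly (the slide's table also keeps date as a regular column, which is what produces its hierarchical column index):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2024-04-05'] * 4,
                   'price': [169.54, 169.505, 169.30, 169.132]})
# each (new_name, function) tuple renames one aggregate
summary = df.groupby('date')['price'].agg([('average_price', 'mean'),
                                           ('sd', 'std')])
```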
Note that the columns here are now hierarchical
MultiIndex([( 'date', ''),
('price', 'average_price'),
('price', 'sd')],
)
If we want to get columns from a level, we can do so:
Index(['', 'average_price', 'sd'], dtype='object')
We can also just flatten the index:
Index([('date', ''), ('price', 'average_price'), ('price', 'sd')], dtype='object')
ohlc mean min max
open high low close price price price
date
2024-04-05 169.54 170.38 169.03 169.47 169.681324 169.03 170.38
This gives you the flavor and use cases that I’ve most encountered, but if you want to get really in-depth (pandas can do some crazy stuff), check out McKinney’s chapter.