def github() -> str:
"""
Some docstrings.
"""
return "https://github.com/<user>/<repo>/blob/main/<filename.py>"
Problem Set 3
This problem set must be submitted on Canvas by 11:59 PM PST on April 17, 2024.
Please point all of your code at the files’ URLs on the course website (that is, do not download them and reference those local files)
Exercise 0
Please write a function that takes no arguments and returns a link to your solutions on GitHub.
Use the following shell:
Exercise 1
Please write a function called import_yearly_data
that takes as its argument a list of years and returns a concatenated DataFrame
of the Direct Emitters
tab of each of those year’s EPA excel sheet. Please add a variable year
that references the year from which the data is pulled. Please use the fourth row as the column names and do not import the first three rows. Do not use any columns as an index.
Use the following shell:
import pandas as pd
def import_yearly_data(years: list) -> pd.DataFrame:
"""
Some docstrings.
"""
return None
Exercise 2
Please write a function called import_parent_companies
that takes as its argument a list of years and returns a concatenated DataFrame
of the corresponding tabs in the parent companies excel sheet1. Please add a variable year
that references the year from which the data is pulled. Finally, please remove any row that is entirely null values. Do not use any columns as an index.
Use the following shell:
import pandas as pd
def import_parent_companies(years: list) -> pd.DataFrame:
"""
Some docstrings.
"""
return None
Exercise 3
Please write a function called n_null
that takes as its arguments a DataFrame
and a column name and returns an integer corresponding to the number of null values in that column. Use the following shell:
def n_null(df: pd.DataFrame, col: str) -> int:
"""
Some docstrings
"""
return None
Not to be turned in, but to be considered: does n_null
suggest that using FRS Id
or Facility Id
is a better choice for joining our data?
Exercise 4
Please write a function called clean_data
that takes as its arguments a concatenated DataFrame
of emissions sheets (the output of Exercise 1) and a concatenated DataFrame
of parent companies (the output of Exercise 2) and outputs a DataFrame
produced using the following steps:
- Left join the parent companies data onto the EPA data using as join keys the
year
andFacility ID
variables in the emissions data and their equivalents in the parent companies data. - Subset the data to the following variables:
Facility Id
year
State
Industry Type (sectors)
Total reported direct emissions
PARENT CO. STATE
PARENT CO. PERCENT OWNERSHIP
- Make all the column names lower-case.
Use the following shell:
def clean_data(emissions_data: pd.DataFrame, parent_data: pd.DataFrame) -> pd.DataFrame:
"""
Some docstrings.
"""
return None
Exercise 5
Please write a function called aggregate_emissions
that takes as input a DataFrame
with the schema of the output of Exercise 4 and a list of variables and produces the minimum, median, mean, and maximum values for the following variables (note the case change) aggregated at the level of the variables supplied in the argument:
total reported direct emissions
parent co. percent ownership
That is, if group_vars = ['state']
, please aggregate the two variables at the state level.
Please return the the data sorted by highest to lowest mean total reported direct emissions. Finally, please use as_index=True
(the default) in your groupby
calls to keep the grouping variables as the index.
Use the following shell:
def aggregate_emissions(df: pd.DataFrame, group_vars: list) -> pd.DataFrame:
"""
Some docstrings.
"""
return None
Optional Bonus
Email me (Lukas) an interesting fact about this data that you obtain from these or other explorations. I will be the sole arbiter of whether or not this fact is interesting enough to merit a bonus point on the assignment.
Footnotes
Note that the file format is
xlsb
– you may need to download an additional library.↩︎