Problem Set 3

Author

Lukas Hager

Published

April 22, 2024

This problem set must be submitted on Canvas by 11:59 PM PST on April 17, 2024.

Warning

Please point all of your code at the files’ URLs on the course website (that is, do not download them and reference those local files)

Exercise 0

Please write a function that takes no arguments and returns a link to your solutions on GitHub.

Use the following shell:

def github() -> str:
    """
    Some docstrings.
    """

    return "https://github.com/<user>/<repo>/blob/main/<filename.py>"

Exercise 1

Please write a function called import_yearly_data that takes as its argument a list of years and returns a concatenated DataFrame of the Direct Emitters tab of each of those year’s EPA excel sheet. Please add a variable year that references the year from which the data is pulled. Please use the fourth row as the column names and do not import the first three rows. Do not use any columns as an index.

Use the following shell:

import pandas as pd

def import_yearly_data(years: list) -> pd.DataFrame:
    """
    Some docstrings.
    """

    return None

Exercise 2

Please write a function called import_parent_companies that takes as its argument a list of years and returns a concatenated DataFrame of the corresponding tabs in the parent companies excel sheet1. Please add a variable year that references the year from which the data is pulled. Finally, please remove any row that is entirely null values. Do not use any columns as an index.

Use the following shell:

import pandas as pd

def import_parent_companies(years: list) -> pd.DataFrame:
    """
    Some docstrings.
    """

    return None

Exercise 3

Please write a function called n_null that takes as its arguments a DataFrame and a column name and returns an integer corresponding to the number of null values in that column. Use the following shell:

def n_null(df: pd.DataFrame, col: str) -> int:
    """
    Some docstrings
    """

    return None
Note

Not to be turned in, but to be considered: does n_null suggest that using FRS Id or Facility Id is a better choice for joining our data?

Exercise 4

Please write a function called clean_data that takes as its arguments a concatenated DataFrame of emissions sheets (the output of Exercise 1) and a concatenated DataFrame of parent companies (the output of Exercise 2) and outputs a DataFrame produced using the following steps:

  1. Left join the parent companies data onto the EPA data using as join keys the year and Facility ID variables in the emissions data and their equivalents in the parent companies data.
  2. Subset the data to the following variables:
    • Facility Id
    • year
    • State
    • Industry Type (sectors)
    • Total reported direct emissions
    • PARENT CO. STATE
    • PARENT CO. PERCENT OWNERSHIP
  3. Make all the column names lower-case.

Use the following shell:

def clean_data(emissions_data: pd.DataFrame, parent_data: pd.DataFrame) -> pd.DataFrame:
    """
    Some docstrings.
    """

    return None

Exercise 5

Please write a function called aggregate_emissions that takes as input a DataFrame with the schema of the output of Exercise 4 and a list of variables and produces the minimum, median, mean, and maximum values for the following variables (note the case change) aggregated at the level of the variables supplied in the argument:

  • total reported direct emissions
  • parent co. percent ownership

That is, if group_vars = ['state'], please aggregate the two variables at the state level.

Please return the the data sorted by highest to lowest mean total reported direct emissions. Finally, please use as_index=True (the default) in your groupby calls to keep the grouping variables as the index.

Use the following shell:

def aggregate_emissions(df: pd.DataFrame, group_vars: list) -> pd.DataFrame:
    """
    Some docstrings.
    """

    return None

Optional Bonus

Email me (Lukas) an interesting fact about this data that you obtain from these or other explorations. I will be the sole arbiter of whether or not this fact is interesting enough to merit a bonus point on the assignment.

Footnotes

  1. Note that the file format is xlsb – you may need to download an additional library.↩︎