2024-05-06
“Sequel”, not “Ess-Queue-Ell”
pandas treats tables as DataFrames, etc.
Why should you use databases instead of CSVs?
We will connect to a sqlite database using sqlalchemy's create_engine function. Our data lives in a .db file called auctions.db that contains data on bidding for 500 North Face clothing items on ShopGoodwill.com. The argument to create_engine tells sqlalchemy what sort of database we're connecting to – the characters before the database location. Listing the table names in the database returns:
['bids', 'items']
So we have two tables, named “bids” and “items”
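One way to list the table names: with sqlalchemy you would typically call `inspect(engine).get_table_names()`; the sketch below instead queries sqlite's own catalog table directly with the stdlib sqlite3 module, against a tiny made-up in-memory database standing in for auctions.db.

```python
import sqlite3

# in-memory stand-in for auctions.db with the two tables the lecture uses
conn = sqlite3.connect(":memory:")
conn.execute("create table bids (itemId integer, bidAmount real)")
conn.execute("create table items (itemId integer, description text)")

# sqlite_master is sqlite's built-in catalog of schema objects
q = "select name from sqlite_master where type = 'table' order by name"
tables = [row[0] for row in conn.execute(q)]
print(tables)  # ['bids', 'items']
```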
Packages like sqlalchemy or pyspark have methods to take the place of querying. To assess the output of our queries, we're going to write a class that will run our query against the database and return a DataFrame as the table output.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import Session


class DataBase:
    def __init__(self, loc: str, db_type: str = "sqlite") -> None:
        """Initialize the class and connect to the database"""
        self.loc = loc
        self.db_type = db_type
        self.engine = create_engine(f'{self.db_type}:///{self.loc}')

    def query(self, q: str) -> pd.DataFrame:
        """Run a query against the database and return a DataFrame"""
        with Session(self.engine) as session:
            df = pd.read_sql(q, session.bind)
        return df
auctions = DataBase(path)
SELECT * Statement
index bidLogId itemId itemPrice bidAmount \
0 50 0 178348858 9.99 20.0
1 51 0 178348858 13.00 12.0
2 52 0 178348858 21.00 23.0
3 53 0 178348858 24.00 35.0
4 54 0 178348858 36.00 48.0
bidTime quantity bidIPAddress adCode serverIP \
0 2023-09-18 16:11:04.587000 1 None None None
1 2023-09-22 14:22:06.700000 1 None None None
2 2023-09-23 12:35:18.157000 1 None None None
3 2023-09-23 18:23:27.993000 1 None None None
4 2023-09-23 18:37:47.213000 1 None None None
retracted bidderName highBidderName isBuyerHighBidder isLoggedInBuyer
0 0 a****9 a****9 0 0
1 0 S****p a****9 0 0
2 0 H****2 H****2 0 0
3 0 b****e b****e 0 0
4 0 t****5 t****5 0 0
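The output above was presumably produced by a `select * from bids` query run through the class. A minimal runnable sketch on an invented two-row stand-in for the bids table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (1, 'S****p', 12.0);
""")

# select * returns every column of every row
rows = conn.execute("select * from bids order by bidAmount").fetchall()
print(rows)  # [(1, 'S****p', 12.0), (1, 'a****9', 20.0)]
```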
SELECT Columns Statement
itemId description isBuyNowUsed
0 179353985 <p><strong>Description:</strong></p>\n<p>Women... 0
1 177087535 <p><strong>Details & Condition</strong></p... 0
2 180876361 <p>The North Face Womens Pink Long Sleeve Mock... 0
3 177763109 <p><br></p><ul><li><span class="ql-size-large"... 0
4 179660197 <p><b>Title: </b>The North Face Mens Red Flat ... 0
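The query behind this output likely named the three columns explicitly rather than using *. A sketch on made-up data (column names follow the output shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, description text, isBuyNowUsed integer);
insert into items values (1, 'jacket', 0), (2, 'fleece', 0);
""")

# list the columns you want instead of select *
q = "select itemId, description, isBuyNowUsed from items order by itemId limit 2"
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 'jacket', 0), (2, 'fleece', 0)]
```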
JOIN Statements
Recall our discussion on joining in pandas – these are SQL-style joins, and SQL has the same types.
itemId description bidderName \
0 179353985 <p><strong>Description:</strong></p>\n<p>Women... None
1 177087535 <p><strong>Details & Condition</strong></p... None
2 180876361 <p>The North Face Womens Pink Long Sleeve Mock... B****4
3 177763109 <p><br></p><ul><li><span class="ql-size-large"... n****m
4 177763109 <p><br></p><ul><li><span class="ql-size-large"... N****0
bidAmount bidTime
0 NaN None
1 NaN None
2 19.99 2023-10-18 05:54:55.327000
3 10.00 2023-09-17 11:52:27.447000
4 14.00 2023-09-17 17:33:48.517000
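The NaN/None rows above are the signature of a LEFT JOIN: items with no bids are kept, with NULLs in the bid columns. A runnable sketch on a tiny invented version of the two tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, description text);
create table bids  (itemId integer, bidderName text, bidAmount real);
insert into items values (1, 'jacket'), (2, 'fleece'), (3, 'hat');
insert into bids values (1, 'a****9', 20.0), (1, 'S****p', 12.0), (2, 'H****2', 23.0);
""")

# every item is kept; items with no bids get NULLs for the bid columns
q = """
select items.itemId, description, bidderName, bidAmount
from items
left join bids
on items.itemId = bids.itemId
order by items.itemId, bidAmount
"""
rows = conn.execute(q).fetchall()
print(rows[-1])  # (3, 'hat', None, None): the unbid item survives the join
```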
JOIN Aliases
Should this run?
OperationalError: (sqlite3.OperationalError) ambiguous column name: itemid
[SQL:
select itemid, description, biddername, bidamount, bidtime
from items
left join bids
on items.itemid = bids.itemid
]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
JOIN Renaming Tables
It's often convenient to rename tables in joins to make your query less verbose (potentially at the cost of readability)
itemId description bidderName \
0 179353985 <p><strong>Description:</strong></p>\n<p>Women... None
1 177087535 <p><strong>Details & Condition</strong></p... None
2 180876361 <p>The North Face Womens Pink Long Sleeve Mock... B****4
3 177763109 <p><br></p><ul><li><span class="ql-size-large"... n****m
4 177763109 <p><br></p><ul><li><span class="ql-size-large"... N****0
bidAmount bidTime
0 NaN None
1 NaN None
2 19.99 2023-10-18 05:54:55.327000
3 10.00 2023-09-17 11:52:27.447000
4 14.00 2023-09-17 17:33:48.517000
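Because itemId exists in both tables, the select list must qualify it; aliasing the tables keeps the query short and fixes the "ambiguous column name" error shown earlier. A sketch on invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, description text);
create table bids  (itemId integer, bidderName text, bidAmount real);
insert into items values (1, 'jacket'), (2, 'fleece');
insert into bids values (1, 'a****9', 20.0), (2, 'H****2', 23.0);
""")

# qualifying itemId as i.itemId resolves the ambiguity between the tables
q = """
select i.itemId, i.description, b.bidderName, b.bidAmount
from items as i
left join bids as b
on i.itemId = b.itemId
order by i.itemId
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 'jacket', 'a****9', 20.0), (2, 'fleece', 'H****2', 23.0)]
```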
For each of the join types supported in sqlite (left, inner, cross), perform the join on the two tables and report the number of observations in the resulting join.
[551, 879, 551]
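The counts above could be produced with a loop over join types. A sketch on a tiny invented pair of tables (three items, three bids), where the same pattern yields [4, 3, 9]:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer);
create table bids  (itemId integer, bidAmount real);
insert into items values (1), (2), (3);
insert into bids values (1, 20.0), (1, 12.0), (2, 23.0);
""")

counts = []
for join in ("left join", "inner join", "cross join"):
    # cross join takes no ON clause: it pairs every item with every bid
    on = "" if join == "cross join" else " on items.itemId = bids.itemId"
    q = f"select count(*) from items {join} bids{on}"
    counts.append(conn.execute(q).fetchone()[0])
print(counts)  # [4, 3, 9]: unmatched item kept, dropped, or paired with all
```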
WHERE
itemId description bidderName \
0 178348858 <p><br></p><ul><li><span class="ql-size-large"... a****9
1 178348858 <p><br></p><ul><li><span class="ql-size-large"... S****p
2 178348858 <p><br></p><ul><li><span class="ql-size-large"... H****2
3 178348858 <p><br></p><ul><li><span class="ql-size-large"... b****e
4 178348858 <p><br></p><ul><li><span class="ql-size-large"... t****5
bidAmount bidTime
0 20.0 2023-09-18 16:11:04.587000
1 12.0 2023-09-22 14:22:06.700000
2 23.0 2023-09-23 12:35:18.157000
3 35.0 2023-09-23 18:23:27.993000
4 48.0 2023-09-23 18:37:47.213000
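The output above keeps only rows for a single itemId, which suggests a WHERE filter. A sketch on invented bids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (1, 'S****p', 12.0), (2, 'H****2', 23.0);
""")

# WHERE filters rows before any aggregation happens
q = "select itemId, bidderName, bidAmount from bids where itemId = 1 order by bidAmount"
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 'S****p', 12.0), (1, 'a****9', 20.0)]
```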
WHERE With Multiple Conditions
itemId description bidderName \
0 180876361 <p>The North Face Womens Pink Long Sleeve Mock... B****4
1 177763109 <p><br></p><ul><li><span class="ql-size-large"... n****m
2 177763109 <p><br></p><ul><li><span class="ql-size-large"... N****0
3 177763109 <p><br></p><ul><li><span class="ql-size-large"... M****y
4 177763109 <p><br></p><ul><li><span class="ql-size-large"... M****y
bidAmount bidTime
0 19.99 2023-10-18 05:54:55.327000
1 10.00 2023-09-17 11:52:27.447000
2 14.00 2023-09-17 17:33:48.517000
3 15.00 2023-09-17 18:27:00.087000
4 15.00 2023-09-17 18:33:37.233000
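Multiple conditions are combined with and / or, much like chained boolean masks in pandas. A sketch on invented bids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (1, 'S****p', 12.0), (2, 'H****2', 23.0);
""")

# both conditions must hold for a row to be kept
q = """
select itemId, bidderName, bidAmount
from bids
where bidAmount > 10 and itemId = 1
order by bidAmount
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 'S****p', 12.0), (1, 'a****9', 20.0)]
```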
GROUP BY
The same as .groupby() in pandas – add aggregating functions to the SELECT clause
itemId n_bidders
0 165561698 1
1 170983900 1
2 172998011 2
3 173907435 1
4 174445924 3
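The n_bidders column above suggests a count of distinct bidders per item. A sketch on invented bids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (1, 'S****p', 12.0), (2, 'H****2', 23.0);
""")

# one output row per itemId; count(distinct ...) aggregates within each group
q = """
select itemId, count(distinct bidderName) as n_bidders
from bids
group by itemId
order by itemId
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```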
COUNT
We can also just count observations without a grouping:
count(*)
0 500
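Without a GROUP BY, count(*) collapses the whole table to a single number (here, the 500 sampled items). A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer);
insert into items values (1), (2), (3);
""")

# count(*) with no GROUP BY returns one row for the entire table
n = conn.execute("select count(*) from items").fetchone()[0]
print(n)  # 3
```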
MIN and MAX
In SQL, MIN and MAX are aggregating functions that work the same way as COUNT. Use them to create a table of the number of bids each bidder submitted for each item, as well as their largest and smallest bid.
MIN and MAX
itemId bidderName n_bids min_bid max_bid
0 165561698 n****4 1 9.91 9.91
1 170983900 c****3 1 9.91 9.91
2 172998011 A****e 1 9.91 9.91
3 172998011 J****m 1 9.91 9.91
4 173907435 M****n 1 14.99 14.99
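A query of this shape would produce the table above; sketched on invented bids where one bidder bids twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (1, 'a****9', 25.0), (2, 'H****2', 23.0);
""")

# grouping on two columns gives one row per (item, bidder) pair
q = """
select itemId, bidderName, count(*) as n_bids,
       min(bidAmount) as min_bid, max(bidAmount) as max_bid
from bids
group by itemId, bidderName
order by itemId, bidderName
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 'a****9', 2, 20.0, 25.0), (2, 'H****2', 1, 23.0, 23.0)]
```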
What if we only care about bid distribution for a bidder when their largest bid is more than $20?
OperationalError: (sqlite3.OperationalError) near "where": syntax error
[SQL:
select itemid, biddername, count(*) as n_bids, min(bidamount) as min_bid,
max(bidamount) as max_bid
from bids
group by itemid, biddername
where max_bid > 20
]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
HAVING
If we want to filter on the aggregate function value, we need to use HAVING instead of WHERE
itemId bidderName n_bids min_bid max_bid
0 174767945 C****2 3 24.44 34.00
1 174767945 b****z 4 25.00 33.00
2 174871788 J****3 1 21.00 21.00
3 174871788 v****l 3 15.00 22.00
4 174901466 c****8 1 39.99 39.99
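Moving the condition into a HAVING clause (which runs after aggregation) fixes the earlier syntax error. A sketch on invented bids, where the low bidder's group is filtered out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values
  (1, 'a****9', 20.0), (1, 'a****9', 25.0),
  (2, 'H****2', 23.0), (2, 'S****p', 12.0);
""")

# HAVING filters on the aggregated value, after GROUP BY has run
q = """
select itemId, bidderName, count(*) as n_bids,
       min(bidAmount) as min_bid, max(bidAmount) as max_bid
from bids
group by itemId, bidderName
having max_bid > 20
order by itemId, bidderName
"""
rows = conn.execute(q).fetchall()
print(rows)  # S****p's group (max bid 12) is dropped
```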
ORDER BY
Sorting works in an intuitive way
itemId bidderName n_bids min_bid max_bid
0 180573534 j****a 1 301.0 301.0
1 180573534 A****3 4 140.0 300.0
2 180601736 c****c 4 180.0 201.0
3 180601736 A****8 2 150.0 200.0
4 180601736 B****a 1 160.0 160.0
LIMIT
We've been asking for the head of our DataFrame to limit output – we can do this directly in the query:
itemId bidderName n_bids min_bid max_bid
0 180573534 j****a 1 301.0 301.0
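ORDER BY and LIMIT combine naturally: sort descending on the aggregate, then keep the top row, as the single-row output above suggests. A sketch on invented bids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (2, 'H****2', 23.0), (2, 'S****p', 12.0);
""")

# sort on the aggregate, then keep only the first row
q = """
select itemId, bidderName, max(bidAmount) as max_bid
from bids
group by itemId, bidderName
order by max_bid desc
limit 1
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(2, 'H****2', 23.0)]
```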
In our sample, how many bidders participate in multiple auctions? And how many auctions do they participate in?
60
We’ll see soon that we could also do this with a “subquery”
count(*)
0 60
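Counting bidders who appear in more than one auction can be done by counting the rows of a grouped query, as hinted by the "subquery" remark. A sketch on invented bids where exactly one bidder spans two items:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidderName text, bidAmount real);
insert into bids values (1, 'a****9', 20.0), (2, 'a****9', 30.0), (2, 'H****2', 23.0);
""")

# inner query: bidders active in more than one auction; outer query: count them
q = """
select count(*)
from (
    select bidderName
    from bids
    group by bidderName
    having count(distinct itemId) > 1
)
"""
n = conn.execute(q).fetchone()[0]
print(n)  # 1: only a****9 bids in two auctions
```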
OVER
If we want to compute operations by group and assign it as a new variable, we need to tell SQL how to organize the groups:
itemId min_bid itemPrice
0 165561698 9.91 9.91
1 170983900 9.91 9.91
2 172998011 9.91 9.91
3 172998011 9.91 9.91
4 173907435 14.99 14.99
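The OVER clause turns an aggregate into a window function: every row keeps its identity but gains the group-level value, rather than collapsing to one row per group. A sketch on invented bids (requires SQLite 3.25+ for window functions, which recent Python builds bundle):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidAmount real);
insert into bids values (1, 20.0), (1, 12.0), (2, 23.0);
""")

# min(...) over (partition by ...) attaches the per-item minimum to each row
q = """
select itemId, bidAmount,
       min(bidAmount) over (partition by itemId) as min_bid
from bids
order by itemId, bidAmount
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 12.0, 12.0), (1, 20.0, 12.0), (2, 23.0, 23.0)]
```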
LAG
Window functions are particularly useful if we need to lag data in SQL
itemId min_bid itemPrice lagged_price
0 165561698 9.91 9.91 NaN
1 170983900 9.91 9.91 NaN
2 172998011 9.91 9.91 NaN
3 172998011 9.91 9.91 9.91
4 173907435 14.99 14.99 NaN
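The NaN pattern above (missing except when a row shares an itemId with the previous one) suggests a LAG partitioned by item. A sketch on invented bids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bids (itemId integer, bidAmount real, bidTime text);
insert into bids values
  (1, 20.0, '2023-09-18'), (1, 12.0, '2023-09-22'), (2, 23.0, '2023-09-23');
""")

# lag() pulls the previous row's value within each partition; the first row
# of each partition has no predecessor, hence NULL
q = """
select itemId, bidAmount,
       lag(bidAmount) over (partition by itemId order by bidTime) as prev_bid
from bids
order by itemId, bidTime
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 20.0, None), (1, 12.0, 20.0), (2, 23.0, None)]
```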
String concatenation in SQL is performed with ||
title itemId \
0 Womens Size M The North Face Zip Up Jacket 179353985
1 The North Face Women's Size 4 Tan/Khaki Lightw... 177087535
2 The North Face Womens Pink Long Sleeve Mock Ne... 180876361
3 The North Face Women's Medium Sweaters/Shirt L... 177763109
4 The North Face Mens Red Flat Front Slash Pocke... 179660197
full_description
0 Womens Size M The North Face Zip Up Jacket <p>...
1 The North Face Women's Size 4 Tan/Khaki Lightw...
2 The North Face Womens Pink Long Sleeve Mock Ne...
3 The North Face Women's Medium Sweaters/Shirt L...
4 The North Face Mens Red Flat Front Slash Pocke...
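The full_description column above looks like title and description glued together with ||. A sketch on invented items:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, title text, description text);
insert into items values (1, 'Jacket', 'warm'), (2, 'Hat', 'wool');
""")

# || concatenates strings; note the explicit separator
q = """
select title || ' ' || description as full_description
from items
order by itemId
"""
rows = conn.execute(q).fetchall()
print(rows)  # [('Jacket warm',), ('Hat wool',)]
```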
itemId currentPrice shipping final_price
0 179353985 10.99 0 10.99
1 177087535 24.98 0 24.98
2 180876361 19.99 0 19.99
3 177763109 15.00 0 15.00
4 179660197 12.99 0 12.99
CASE WHEN
SQL's if-else statement (similar to R's ifelse or case_when verbs)
itemId currentPrice shipping final_price
0 176705357 19.99 2 21.99
1 179025543 14.99 2 16.99
2 179353985 10.99 0 15.99
3 177087535 24.98 0 29.98
4 180876361 19.99 0 24.99
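Judging from the output above, items with zero recorded shipping get a flat charge (here apparently $5) while others add their listed shipping cost; the $5 is an inference from the printed numbers, not stated in the source. A sketch on invented items:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, currentPrice real, shipping real);
insert into items values (1, 10.0, 0), (2, 15.0, 2);
""")

# case when ... then ... else ... end is SQL's inline if-else
q = """
select itemId, currentPrice,
       currentPrice + case when shipping = 0 then 5 else shipping end as final_price
from items
order by itemId
"""
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 10.0, 15.0), (2, 15.0, 17.0)]
```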
We can use LIKE to pattern match – % means zero, one, or multiple characters (this is a bad application – why?)
itemId currentPrice size
0 177087535 24.98 small
1 180876361 19.99 small
2 177763109 15.00 large
3 179660197 12.99 small
4 176601978 9.99 large
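A LIKE filter of this shape would classify items by substring match. A sketch on invented titles (note that sqlite's LIKE is case-insensitive for ASCII, and that substring matching is brittle, which is presumably the "bad application" being hinted at):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, title text);
insert into items values (1, 'Womens Jacket'), (2, 'Mens Hat');
""")

# % matches any run of characters on either side of the pattern
q = "select itemId, title from items where title like '%women%' order by itemId"
rows = conn.execute(q).fetchall()
print(rows)  # [(1, 'Womens Jacket')]
```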
To run statements that modify the database rather than return output, sqlalchemy connections have an execute method that facilitates this for our engine – we can add a corresponding method to our class:
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session


class DataBase:
    def __init__(self, loc: str, db_type: str = "sqlite") -> None:
        """Initialize the class and connect to the database"""
        self.loc = loc
        self.db_type = db_type
        self.engine = create_engine(f'{self.db_type}:///{self.loc}')

    def query(self, q: str) -> pd.DataFrame:
        """Run a query against the database and return a DataFrame"""
        with Session(self.engine) as session:
            df = pd.read_sql(q, session.bind)
        return df

    def execute(self, q: str) -> None:
        """Execute a statement on the database"""
        with self.engine.connect() as conn:
            conn.execute(text(q))
            # commit so the change persists; SQLAlchemy 2.0 rolls back
            # uncommitted work when the connection closes
            conn.commit()
auctions = DataBase(path)
If we want to create a new table that contains only observations with bids where the buy now option wasn’t used, we can execute
a statement to do so.
index buyerCountry buyerCountryCode buyerState buyerStreet buyerZip \
0 12100 None US None None None
categoryParentList defaultShippingResults \
0 10|Clothing|27|Women's Clothing|154|Outerwear None
description \
0 <p>The North Face Womens Pink Long Sleeve Mock...
imageServer ... \
0 https://shopgoodwillimages.azureedge.net/produ... ...
bidTime quantity:1 bidIPAddress adCode serverIP \
0 2023-10-18 05:54:55.327000 1 None None None
retracted bidderName highBidderName isBuyerHighBidder isLoggedInBuyer
0 0 B****4 B****4 0 0
[1 rows x 100 columns]
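The "first statement" referred to below is presumably a drop table. Dropping before creating lets the create table ... as select statement be re-run cleanly. A sketch with stdlib sqlite3 on invented tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table items (itemId integer, description text);
create table bids  (itemId integer, bidAmount real);
insert into items values (1, 'jacket'), (2, 'fleece');
insert into bids values (1, 20.0), (1, 12.0);
""")

# drop first so re-running never hits "table full_data already exists"
conn.execute("drop table if exists full_data")
conn.execute("""
create table full_data as
select items.itemId, description, bidAmount
from items
left join bids
on items.itemId = bids.itemId
""")
n = conn.execute("select count(*) from full_data").fetchone()[0]
print(n)  # 3: two bids on the jacket plus the unbid fleece
```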
Why do we need the first statement? Because SQL won't let us create a table that already has a given name:
OperationalError: (sqlite3.OperationalError) table full_data already exists
[SQL:
create table full_data as
select * from items
]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
index buyerCountry buyerCountryCode buyerState buyerStreet buyerZip \
0 12100 None US None None None
categoryParentList defaultShippingResults \
0 10|Clothing|27|Women's Clothing|154|Outerwear None
description \
0 <p>The North Face Womens Pink Long Sleeve Mock...
imageServer ... \
0 https://shopgoodwillimages.azureedge.net/produ... ...
bidTime quantity:1 bidIPAddress adCode serverIP \
0 2023-10-18 05:54:55.327000 1 None None None
retracted bidderName highBidderName isBuyerHighBidder isLoggedInBuyer
0 0 B****4 B****4 0 0
[1 rows x 100 columns]
OperationalError: (sqlite3.OperationalError) no such table: full_data
[SQL: select * from full_data limit 1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
For each bid, express its time as relative to when the auction ended (endTime). That means that if an auction was 10 hours long (as measured by endTime - startTime) and a bid was placed an hour before the auction ended, it would have a normalized timestamp of .1. Plot this distribution as a histogram.
Hint: to compute the difference in time between two dates, use julianday(time1) - julianday(time2).
itemId startTime endTime length
0 179353985 2023-09-28 17:00:54.000000 2023-10-02 18:14:00.000000 4.050764
1 177087535 2023-09-04 22:54:00.000000 2023-09-12 19:46:00.000000 7.869444
2 180876361 2023-10-14 03:18:40.000000 2023-10-19 04:04:40.000000 5.031944
3 177763109 2023-09-12 08:22:45.000000 2023-09-17 18:34:00.000000 5.424479
itemId bidTime startTime \
0 178348858 2023-09-18 16:11:04.587000 2023-09-18 14:29:56.000000
1 178348858 2023-09-22 14:22:06.700000 2023-09-18 14:29:56.000000
2 178348858 2023-09-23 12:35:18.157000 2023-09-18 14:29:56.000000
3 178348858 2023-09-23 18:23:27.993000 2023-09-18 14:29:56.000000
4 178348858 2023-09-23 18:37:47.213000 2023-09-18 14:29:56.000000
endTime time_norm
0 2023-09-23 18:39:00.000000 0.986422
1 2023-09-23 18:39:00.000000 0.227799
2 2023-09-23 18:39:00.000000 0.048825
3 2023-09-23 18:39:00.000000 0.002085
4 2023-09-23 18:39:00.000000 0.000163
What does this tell us? Are these results surprising?
q = """
select b.itemid, b.bidtime, a.starttime, a.endtime,
(julianday(endtime)-julianday(bidtime)) / a.length as time_norm
from bids as b
inner join (
select itemid, starttime, endtime,
julianday(endtime) - julianday(starttime) as length
from items
) as a
on b.itemid=a.itemid
"""
df = auctions.query(q)
print(df.head(2))
itemId bidTime starttime \
0 178348858 2023-09-18 16:11:04.587000 2023-09-18 14:29:56.000000
1 178348858 2023-09-22 14:22:06.700000 2023-09-18 14:29:56.000000
endtime time_norm
0 2023-09-23 18:39:00.000000 0.986422
1 2023-09-23 18:39:00.000000 0.227799
Using WITH improves readability
q = """
with a as (
select itemid, starttime, endtime,
julianday(endtime) - julianday(starttime) as length
from items
)
select b.itemid, b.bidtime, a.starttime, a.endtime,
(julianday(endtime)-julianday(bidtime)) / a.length as time_norm
from bids as b
inner join a
on b.itemid=a.itemid
"""
df = auctions.query(q)
print(df.head(2))
itemId bidTime starttime \
0 178348858 2023-09-18 16:11:04.587000 2023-09-18 14:29:56.000000
1 178348858 2023-09-22 14:22:06.700000 2023-09-18 14:29:56.000000
endtime time_norm
0 2023-09-23 18:39:00.000000 0.986422
1 2023-09-23 18:39:00.000000 0.227799
There are no unified linting tools for SQL such as pylint for python. Always assume that the code you write today will be inherited by a murderous psychopath who knows where you live!
Why put the comma first?
Align column names with manual spaces
If nesting subqueries, use consistent indentation
Additional suggestions:
Creating the sqlite Database
If you have CSV files, you can create a database like this:
create_db.py
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:////Users/hlukas/git/personal_website/static/econ-481/data/auctions.db")

bids = pd.read_csv('/Users/hlukas/Google Drive/Raw Data/goodwill/final_data/the north face/bidding_data.csv')
items = pd.read_csv('/Users/hlukas/Google Drive/Raw Data/goodwill/final_data/the north face/item_data.csv')

items_small = items.sample(500)
bids_small = bids.loc[bids['itemId'].isin(items_small['itemId'])]

bids_small.to_sql(con=engine, name='bids', if_exists='replace')
items_small.to_sql(con=engine, name='items', if_exists='replace')
update_db.py
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine(f'sqlite:///{path}')
Base = declarative_base()
Base.metadata.create_all(engine)

items = pd.read_csv('/Users/hlukas/Google Drive/Raw Data/goodwill/final_data/the north face/item_data.csv')
items_small = items.sample(500)
items_small.to_sql(con=engine, name='items', if_exists='append')
sqlalchemy
We don't really need to write SQL if we don't want to – we can use the package directly:
import pandas as pd
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine(f'sqlite:///{path}')
Base = declarative_base()
Base.metadata.reflect(engine)
bids = Base.metadata.tables['bids']

query = select(bids.c.itemId, bids.c.bidAmount)\
    .where(bids.c.bidAmount == 10)\
    .limit(5)

with Session(engine) as s:
    print(pd.DataFrame(s.execute(query)))
itemId bidAmount
0 177106026 10.0
1 177963226 10.0
2 178438915 10.0
3 181622497 10.0
4 179719241 10.0
itemId n_bidders
0 165561698 1
1 170983900 1
2 172998011 2
3 173907435 1
4 174445924 3
.. ... ...
167 182760698 1
168 182777527 1
169 182883873 2
170 182894197 1
171 182925212 2
[172 rows x 2 columns]