SQL Fundamentals for Python Engineers

Reading time: ~35 minutes | Level: Intermediate → Engineering

Before reading further, predict the result of this query:

SELECT * FROM orders WHERE user_id = NULL;

There is a user_id column. There are rows where user_id is null. The query runs without error.

How many rows does it return?

Show Answer

Zero rows. Always. Even when there are null user_id values in the table.

The reason is fundamental to how SQL works: NULL is not a value - it is the absence of a value. Comparing anything to NULL with = always produces a third logical result: UNKNOWN, not TRUE or FALSE. The WHERE clause only passes rows where the condition evaluates to TRUE. UNKNOWN fails the filter.

The correct query is:

SELECT * FROM orders WHERE user_id IS NULL;

IS NULL is a special predicate designed specifically for null checking. It returns TRUE or FALSE, never UNKNOWN.

The same applies to != (not equal): WHERE user_id != NULL also returns zero rows, for the same reason. Use IS NOT NULL to filter out nulls.

This surprises every developer the first time they encounter it. It is not a bug in the database - it is three-valued logic (TRUE / FALSE / UNKNOWN), defined in the SQL standard, present in every major RDBMS. Understanding NULL semantics is one of the first things that separates engineers from casual SQL users.

SQL is the language Python engineers interact with more than almost any other external system. You write it in migrations, in query builders, in ORMs that generate it, and directly in cursor.execute() calls. When an ORM query is slow, you debug the SQL it generated. When a report is wrong, you read the SQL behind it. When a data pipeline fails, you trace it to a JOIN that produced unexpected nulls.

This lesson builds SQL intuition from an engineering perspective - not just syntax, but semantics. What does the database actually do with your query? Why is this query fast and that one slow? What are the traps that catch even experienced developers?

What You Will Learn

NULL semantics and three-valued logic
SQL as a set language - the mental model that makes queries intuitive
SELECT, WHERE, ORDER BY, LIMIT - with engineering depth on what the database executes
INNER, LEFT, RIGHT, and FULL OUTER JOINs - with visual table diagrams
GROUP BY, HAVING, and aggregate functions (COUNT, SUM, AVG, MIN, MAX)
Subqueries vs JOINs - performance implications and when to use each
Window functions - ROW_NUMBER, RANK, LAG, LEAD - the foundation of analytics queries
SQL injection - how it works, how to prevent it, why it matters in Python code
Query execution order - the exact order the database processes your query

Prerequisites

Basic Python: variables, functions, loops, dictionaries
No prior SQL experience required - this lesson starts from scratch
Recommended: the Module 07 Overview for context on where SQL fits in the data layer stack

SQL Is a Set Language

The single most important mental model shift when learning SQL is this: SQL describes what data you want, not how to retrieve it. It is declarative, not procedural.

In Python you write:

results = []
for order in all_orders:
    if order["status"] == "completed":
        results.append(order)
results.sort(key=lambda o: o["total"], reverse=True)
results = results[:10]

In SQL you declare:

SELECT *
FROM orders
WHERE status = 'completed'
ORDER BY total DESC
LIMIT 10;

The database figures out how to retrieve it - which index to use, whether to sort before or after filtering, how much memory to allocate. The query optimizer is doing the work that you would have to write manually in Python. Your job is to describe the correct set of rows, joins, and aggregations; the database's job is to find the most efficient execution plan.

This also means SQL operates on tables as sets - unordered collections of rows. A SELECT without ORDER BY has no guaranteed row order. Different executions of the same query can return rows in different sequences. If order matters to you, you must ask for it explicitly.

Part 1 - SELECT, WHERE, ORDER BY, LIMIT

The Anatomy of a SELECT Statement

SELECT
    order_id,
    user_id,
    total,
    status,
    created_at
FROM orders
WHERE status = 'completed'
  AND total > 100.00
ORDER BY created_at DESC
LIMIT 25
OFFSET 0;

Breaking down each clause:

Clause	What it does	Execution order
`FROM orders`	Identifies the source table (or tables)	1st
`WHERE status = 'completed' AND total > 100`	Filters rows - only rows where ALL conditions are TRUE pass	2nd
`SELECT order_id, user_id, total, ...`	Chooses which columns to return	3rd
`ORDER BY created_at DESC`	Sorts the result set	4th
`LIMIT 25 OFFSET 0`	Returns only the first 25 rows	5th

:::warning SELECT * in Production Is an Anti-Pattern SELECT * retrieves every column in the table. If the table has 30 columns and you only need 3, you are transferring 10x the data over the network, consuming 10x the memory in the application, and coupling your code to the table's column list (so every future ALTER TABLE ADD COLUMN might silently change your query's behavior). Always name the columns you need. :::

WHERE Clause Operators

-- Equality and inequality
WHERE status = 'completed'
WHERE status != 'cancelled'
WHERE total > 100
WHERE total BETWEEN 50 AND 200      -- inclusive on both ends

-- Pattern matching (case-sensitive in most databases)
WHERE email LIKE '%@gmail.com'      -- % matches any sequence of characters
WHERE name LIKE 'Al_ce'             -- _ matches exactly one character

-- Set membership
WHERE status IN ('completed', 'shipped', 'processing')
WHERE user_id NOT IN (SELECT user_id FROM banned_users)

-- NULL handling - always use IS NULL / IS NOT NULL
WHERE deleted_at IS NULL            -- row has not been soft-deleted
WHERE coupon_code IS NOT NULL       -- row has a coupon applied

-- Combining conditions
WHERE status = 'completed'
  AND total > 100
  AND created_at >= '2025-01-01'

WHERE status = 'pending'
   OR (status = 'processing' AND priority = 'high')

The NULL Trap in Detail

NULL propagates through expressions in ways that surprise most developers:

-- All three of these evaluate to UNKNOWN, not TRUE or FALSE
NULL = NULL        -- UNKNOWN
NULL != NULL       -- UNKNOWN
NULL + 5           -- NULL (arithmetic on NULL produces NULL)
CONCAT('hello', NULL)  -- NULL (in most databases)

-- The COALESCE function provides a default value when a column is NULL
SELECT
    order_id,
    COALESCE(discount_amount, 0) AS discount,  -- use 0 if NULL
    COALESCE(notes, 'No notes')   AS notes
FROM orders;

-- NULLIF returns NULL when two values are equal (useful to avoid division by zero)
SELECT
    order_id,
    total / NULLIF(quantity, 0) AS price_per_unit  -- returns NULL instead of dividing by zero
FROM order_items;

:::note Why Three-Valued Logic Exists NULL represents missing or unknown data - not zero, not empty string, not false. When you ask "is this unknown value equal to 'completed'?", the honest answer is "unknown." SQL's three-valued logic (TRUE / FALSE / UNKNOWN) is mathematically correct for this scenario. The practical consequence: always check for nulls explicitly with IS NULL, and design schemas to use NOT NULL constraints wherever the business logic guarantees a value exists. A column that is NOT NULL eliminates an entire category of three-valued logic bugs. :::

Part 2 - JOINs

A JOIN combines rows from two or more tables based on a related column. Before writing the SQL, always draw the relationship:

INNER JOIN

Returns only rows that have a match in both tables. Rows with no match on either side are excluded.

-- Get all orders with the customer's name and email
SELECT
    o.order_id,
    u.name        AS customer_name,
    u.email       AS customer_email,
    o.total,
    o.status
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'completed'
ORDER BY o.created_at DESC;

If a user has been deleted from the users table but their orders remain, INNER JOIN will exclude those orphaned orders entirely - they have no match in users.

LEFT JOIN (LEFT OUTER JOIN)

Returns all rows from the left table, plus matching rows from the right table. If there is no match on the right side, the right-side columns are NULL.

-- Get all users, even those who have never placed an order
SELECT
    u.user_id,
    u.name,
    u.email,
    COUNT(o.order_id)    AS total_orders,
    COALESCE(SUM(o.total), 0) AS lifetime_value
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id, u.name, u.email
ORDER BY lifetime_value DESC;

Users with no orders will have total_orders = 0 and lifetime_value = 0. With INNER JOIN, they would be absent from the result entirely.

RIGHT JOIN (RIGHT OUTER JOIN)

Returns all rows from the right table, plus matching rows from the left table. Rarely used - the same result can always be expressed as a LEFT JOIN with the tables swapped. Most engineers prefer LEFT JOIN for readability.

-- Equivalent to a LEFT JOIN with tables reversed
SELECT u.name, o.order_id
FROM orders o
RIGHT JOIN users u ON o.user_id = u.user_id;

-- Preferred equivalent with LEFT JOIN:
SELECT u.name, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;

FULL OUTER JOIN

Returns all rows from both tables. Where there is no match, the missing side's columns are NULL.

-- Find all users and all orders, even unmatched ones
-- Useful for data quality audits
SELECT
    u.user_id,
    u.email,
    o.order_id,
    o.total
FROM users u
FULL OUTER JOIN orders o ON u.user_id = o.user_id
WHERE u.user_id IS NULL   -- orders with no user (orphaned)
   OR o.order_id IS NULL; -- users with no orders

:::note SQLite Does Not Support FULL OUTER JOIN SQLite supports INNER JOIN and LEFT JOIN but not FULL OUTER JOIN. For learning purposes, all examples that require FULL OUTER JOIN can be run on PostgreSQL. The concept applies universally - only the dialect differs. :::

Multi-Table JOIN

-- Get order line items with product names and customer emails
SELECT
    u.email,
    o.order_id,
    o.created_at,
    p.name        AS product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total
FROM order_items oi
INNER JOIN orders   o ON oi.order_id   = o.order_id
INNER JOIN users    u ON o.user_id     = u.user_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.status = 'completed'
ORDER BY o.created_at DESC, o.order_id, oi.item_id;

The table aliases (o, u, oi, p) are essential for readability in multi-table queries. Always alias tables when joining more than two.

Part 3 - GROUP BY, HAVING, and Aggregate Functions

GROUP BY collapses multiple rows with the same column value into a single output row, with aggregate functions summarizing the grouped data.

Aggregate Functions

Function	What it computes	NULL handling
`COUNT(*)`	All rows in the group, including NULLs	Counts everything
`COUNT(column)`	Rows where `column` is not NULL	Ignores NULLs
`SUM(column)`	Sum of non-NULL values	Ignores NULLs; returns NULL if all values are NULL
`AVG(column)`	Average of non-NULL values	Ignores NULLs
`MIN(column)`	Smallest non-NULL value	Ignores NULLs
`MAX(column)`	Largest non-NULL value	Ignores NULLs

GROUP BY Examples

-- Sales summary by status
SELECT
    status,
    COUNT(*)          AS order_count,
    SUM(total)        AS revenue,
    AVG(total)        AS avg_order_value,
    MIN(total)        AS smallest_order,
    MAX(total)        AS largest_order
FROM orders
GROUP BY status
ORDER BY revenue DESC;

status       | order_count | revenue    | avg_order_value | smallest_order | largest_order
-------------|-------------|------------|-----------------|----------------|---------------
completed    | 8241        | 1824050.00 | 221.35          | 12.50          | 9875.00
processing   | 412         | 89234.50   | 216.59          | 18.00          | 3200.00
cancelled    | 1053        | 0.00       | 0.00            | 0.00           | 0.00

The GROUP BY Rule

Every column in SELECT must be either in GROUP BY or inside an aggregate function. This is not a style choice - it is a logical requirement. If you group by status, the database collapses all completed rows into one output row. Which order_id would it show? It is ambiguous - so SQL forbids it.

-- WRONG - order_id is neither in GROUP BY nor aggregated
SELECT status, order_id, COUNT(*)
FROM orders
GROUP BY status;
-- Error: column "order_id" must appear in the GROUP BY clause
-- or be used in an aggregate function

-- CORRECT
SELECT status, COUNT(*) AS order_count
FROM orders
GROUP BY status;

HAVING - Filtering After Aggregation

WHERE filters rows before grouping. HAVING filters groups after aggregation. They solve different problems.

-- Find customers who have spent more than $1,000 in total
SELECT
    u.user_id,
    u.email,
    COUNT(o.order_id)   AS order_count,
    SUM(o.total)        AS lifetime_value
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed'          -- filter ROWS before grouping
GROUP BY u.user_id, u.email
HAVING SUM(o.total) > 1000            -- filter GROUPS after aggregation
ORDER BY lifetime_value DESC
LIMIT 100;

You cannot use WHERE SUM(o.total) > 1000 because WHERE is evaluated before aggregation - SUM does not exist yet at that stage. HAVING is evaluated after GROUP BY, so aggregate functions are available.

Part 4 - Subqueries vs JOINs

A subquery is a SELECT statement embedded inside another query. Both subqueries and JOINs can produce the same results - but they often have different performance characteristics.

Correlated vs Non-Correlated Subqueries

A non-correlated subquery executes once and its result is used by the outer query:

-- Find orders placed by customers who signed up in 2025
SELECT order_id, user_id, total
FROM orders
WHERE user_id IN (
    SELECT user_id
    FROM users
    WHERE created_at >= '2025-01-01'   -- executes once
);

A correlated subquery executes once per row in the outer query - this is the expensive one:

-- For each order, calculate the customer's average order value
-- WARNING: this executes the subquery once per order row
SELECT
    o.order_id,
    o.total,
    (
        SELECT AVG(total)
        FROM orders inner_o
        WHERE inner_o.user_id = o.user_id  -- references outer query
    ) AS customer_avg_order
FROM orders o;

For a table with 100,000 orders, this correlated subquery executes 100,000 times. The JOIN version executes once:

-- Same result, executed as a single JOIN + aggregation pass
SELECT
    o.order_id,
    o.total,
    avg_orders.avg_total AS customer_avg_order
FROM orders o
INNER JOIN (
    SELECT user_id, AVG(total) AS avg_total
    FROM orders
    GROUP BY user_id
) avg_orders ON o.user_id = avg_orders.user_id;

:::tip Prefer JOINs Over Correlated Subqueries Correlated subqueries are often the slowest SQL pattern you can write. For each outer row, the database executes the inner query independently - effectively a nested loop. Rewrite correlated subqueries as JOINs against a derived table or CTE whenever possible. The query plan (via EXPLAIN ANALYZE) will show the difference clearly. :::

Common Table Expressions (CTEs)

CTEs (introduced with WITH) make complex queries readable by naming intermediate result sets:

-- Find the top 10 products by revenue in the last 30 days
WITH recent_orders AS (
    SELECT order_id
    FROM orders
    WHERE created_at >= NOW() - INTERVAL '30 days'
      AND status = 'completed'
),
product_revenue AS (
    SELECT
        oi.product_id,
        SUM(oi.quantity * oi.unit_price) AS revenue,
        COUNT(DISTINCT oi.order_id)       AS times_ordered
    FROM order_items oi
    INNER JOIN recent_orders ro ON oi.order_id = ro.order_id
    GROUP BY oi.product_id
)
SELECT
    p.name,
    pr.revenue,
    pr.times_ordered,
    pr.revenue / pr.times_ordered AS avg_revenue_per_order
FROM product_revenue pr
INNER JOIN products p ON pr.product_id = p.product_id
ORDER BY pr.revenue DESC
LIMIT 10;

CTEs do not always improve performance (in many databases they are not independently optimized), but they dramatically improve readability for multi-step analytical queries.

Part 5 - Window Functions

Window functions compute values across a set of rows related to the current row - without collapsing those rows into one output row (unlike GROUP BY). They are one of the most powerful SQL features and appear in virtually every analytics or reporting query.

The core syntax:

window_function() OVER (
    PARTITION BY column1, column2   -- defines the "window" (like GROUP BY, but rows are not collapsed)
    ORDER BY column3 DESC           -- defines row ordering within the window
    ROWS BETWEEN ... AND ...        -- optional: defines the frame (subset of the window)
)

ROW_NUMBER - Assign a Sequential Integer

-- Assign row numbers to each user's orders, newest first
SELECT
    order_id,
    user_id,
    total,
    created_at,
    ROW_NUMBER() OVER (
        PARTITION BY user_id
        ORDER BY created_at DESC
    ) AS order_sequence
FROM orders;

order_id | user_id | total  | created_at          | order_sequence
---------|---------|--------|---------------------|---------------
1042     | 7       | 219.00 | 2025-12-01 10:00:00 | 1
 891     | 7       | 87.50  | 2025-09-15 08:22:00 | 2
 234     | 7       | 445.00 | 2024-11-30 14:01:00 | 3

Use case: Get each user's most recent order (WHERE order_sequence = 1).

-- Get each user's most recent completed order
WITH ranked_orders AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY created_at DESC
        ) AS rn
    FROM orders
    WHERE status = 'completed'
)
SELECT order_id, user_id, total, created_at
FROM ranked_orders
WHERE rn = 1;

RANK and DENSE_RANK - Handle Ties

-- Rank products by revenue; ties get the same rank
SELECT
    p.name,
    SUM(oi.quantity * oi.unit_price) AS revenue,
    RANK()       OVER (ORDER BY SUM(oi.quantity * oi.unit_price) DESC) AS rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY SUM(oi.quantity * oi.unit_price) DESC) AS rank_no_gaps
FROM order_items oi
INNER JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_id, p.name
ORDER BY revenue DESC;

name          | revenue  | rank_with_gaps | rank_no_gaps
--------------|----------|----------------|-------------
Widget Pro    | 84500.00 | 1              | 1
Widget Basic  | 84500.00 | 1              | 1
Gadget Plus   | 62100.00 | 3              | 2   ← RANK skips 2; DENSE_RANK does not

LAG and LEAD - Access Adjacent Rows

LAG accesses the value from a previous row; LEAD accesses the value from a following row - without a self-JOIN.

-- Calculate month-over-month revenue change
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', created_at)   AS month,
        SUM(total)                         AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month)  AS prev_month_revenue,
    revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_change,
    ROUND(
        100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
        2
    ) AS pct_change
FROM monthly_revenue
ORDER BY month;

month       | revenue    | prev_month_revenue | revenue_change | pct_change
------------|------------|-------------------|----------------|------------
2025-10-01  | 142000.00  | NULL              | NULL           | NULL
2025-11-01  | 168000.00  | 142000.00         | 26000.00       | 18.31
2025-12-01  | 195000.00  | 168000.00         | 27000.00       | 16.07
2026-01-01  | 131000.00  | 195000.00         | -64000.00      | -32.82

Running Totals with Frames

-- Running total of revenue per user, ordered by order date
SELECT
    user_id,
    order_id,
    total,
    created_at,
    SUM(total) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_lifetime_value
FROM orders
WHERE status = 'completed'
ORDER BY user_id, created_at;

:::tip Window Functions Are Not GROUP BY GROUP BY collapses rows: you get one output row per group. Window functions preserve all rows and add a computed column. This is the distinction that makes ROW_NUMBER() OVER (PARTITION BY ...) so useful - you can rank rows within groups without losing the individual row data, then filter in an outer query. :::

Part 6 - SQL Injection

SQL injection is the most consistently exploited vulnerability class in web applications. It is entirely preventable. Yet it still appears in production code written by experienced developers who "just needed a quick fix."

How SQL Injection Works

# DANGEROUS - do not do this
def get_user_by_name(name: str) -> dict:
    query = f"SELECT * FROM users WHERE name = '{name}'"
    cursor.execute(query)
    return cursor.fetchone()

If name = "Alice", the query is:

SELECT * FROM users WHERE name = 'Alice';

That is fine. Now if name = "' OR '1'='1", the query becomes:

SELECT * FROM users WHERE name = '' OR '1'='1';

'1'='1' is always TRUE. The WHERE clause is effectively disabled. This query returns every row in the users table.

A more destructive payload: name = "'; DROP TABLE users; --" produces:

SELECT * FROM users WHERE name = ''; DROP TABLE users; --';

If the database driver permits multiple statements (some do, some do not), this drops the entire users table. The -- comments out the trailing quote so the statement is syntactically valid.

:::danger Parameterized Queries Are Non-Negotiable Never use string formatting, f-strings, or concatenation to build SQL queries with user-supplied input. Never. Not "usually not." Not "not in production." Not in logging statements. Not in admin scripts. Not even for integer values that "can't have SQL in them" - because the habit of string interpolation is what leads to the bug when a string column is added later.

The parameterized query passes the values separately from the SQL text. The database driver handles escaping. The SQL structure cannot be changed by the data. :::

Parameterized Queries in Python

Every major Python database driver supports parameterized queries. Use them for every value that comes from outside the query itself:

import sqlite3

# WRONG - string interpolation with user input
def get_user_bad(conn: sqlite3.Connection, name: str) -> dict | None:
    query = f"SELECT * FROM users WHERE name = '{name}'"  # NEVER
    return conn.execute(query).fetchone()

# CORRECT - parameterized query with sqlite3 (uses ? placeholder)
def get_user_sqlite(conn: sqlite3.Connection, name: str) -> dict | None:
    query = "SELECT * FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchone()

# CORRECT - parameterized query with psycopg2 (uses %s placeholder)
import psycopg2

def get_user_postgres(conn: psycopg2.extensions.connection, name: str) -> dict | None:
    query = "SELECT * FROM users WHERE name = %s"
    with conn.cursor() as cur:
        cur.execute(query, (name,))
        return cur.fetchone()

# CORRECT - multiple parameters
def get_orders_by_status_and_user(
    conn: sqlite3.Connection,
    user_id: int,
    status: str,
    limit: int,
) -> list:
    query = """
        SELECT order_id, total, created_at
        FROM orders
        WHERE user_id = ?
          AND status = ?
        ORDER BY created_at DESC
        LIMIT ?
    """
    return conn.execute(query, (user_id, status, limit)).fetchall()

The placeholders (? in sqlite3, %s in psycopg2) are not string substitution - they are protocol-level parameter binding. The values travel to the database server separately from the SQL text and are never interpreted as SQL syntax.

Dynamic Column and Table Names

Parameterization works for values, not for identifiers (table names, column names, ORDER BY columns). If you need to make those dynamic, validate against an allowlist:

ALLOWED_SORT_COLUMNS = {"created_at", "total", "status"}
ALLOWED_SORT_DIRECTIONS = {"ASC", "DESC"}

def get_orders_sorted(
    conn: sqlite3.Connection,
    sort_by: str,
    direction: str,
) -> list:
    # Validate against allowlist - NEVER pass user input directly into the identifier
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"Invalid sort column: {sort_by!r}")
    if direction not in ALLOWED_SORT_DIRECTIONS:
        raise ValueError(f"Invalid sort direction: {direction!r}")

    # Now safe to interpolate - both values came from our allowlist, not user input
    query = f"SELECT * FROM orders ORDER BY {sort_by} {direction}"
    return conn.execute(query).fetchall()

Query Execution Order

The order you write SQL clauses and the order the database executes them are different. This is one of the most common sources of SQL confusion.

Why this matters in practice:

WHERE cannot reference SELECT aliases because SELECT has not been evaluated yet. You must repeat the expression or use a subquery/CTE.
HAVING can reference aggregate functions but WHERE cannot - GROUP BY has not happened yet at the WHERE stage.
ORDER BY can reference SELECT aliases - it is evaluated after SELECT.
LIMIT is applied last - the database may still need to sort the entire result set before returning the first N rows (unless an index supports the sort).

-- WRONG - cannot reference SELECT alias in WHERE
SELECT total * 0.1 AS tax
FROM orders
WHERE tax > 5;           -- Error: column "tax" does not exist

-- CORRECT - repeat the expression in WHERE
SELECT total * 0.1 AS tax
FROM orders
WHERE total * 0.1 > 5;

-- OR use a subquery / CTE
WITH order_taxes AS (
    SELECT order_id, total * 0.1 AS tax
    FROM orders
)
SELECT order_id, tax
FROM order_taxes
WHERE tax > 5;

Putting It Together - A Real Analytics Query

Here is a complete query combining JOINs, aggregation, and window functions - the kind you will write regularly as a Python engineer working on a product:

-- Monthly cohort retention: for each signup month, what % of users
-- placed a second order within 90 days?
WITH user_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) AS signup_month
    FROM users
),
user_orders AS (
    SELECT
        o.user_id,
        o.order_id,
        o.created_at,
        ROW_NUMBER() OVER (
            PARTITION BY o.user_id
            ORDER BY o.created_at ASC
        ) AS order_rank
    FROM orders o
    WHERE o.status = 'completed'
),
first_second_orders AS (
    SELECT
        uc.signup_month,
        uo_first.user_id,
        uo_first.created_at    AS first_order_date,
        uo_second.created_at   AS second_order_date
    FROM user_cohorts uc
    INNER JOIN user_orders uo_first
        ON uc.user_id = uo_first.user_id
        AND uo_first.order_rank = 1
    LEFT JOIN user_orders uo_second
        ON uc.user_id = uo_second.user_id
        AND uo_second.order_rank = 2
        AND uo_second.created_at <= uo_first.created_at + INTERVAL '90 days'
)
SELECT
    signup_month,
    COUNT(DISTINCT user_id)                                    AS cohort_size,
    COUNT(DISTINCT CASE WHEN second_order_date IS NOT NULL
                        THEN user_id END)                      AS retained_users,
    ROUND(
        100.0 * COUNT(DISTINCT CASE WHEN second_order_date IS NOT NULL
                                    THEN user_id END)
        / NULLIF(COUNT(DISTINCT user_id), 0),
        1
    )                                                          AS retention_pct
FROM first_second_orders
GROUP BY signup_month
ORDER BY signup_month DESC;

This query uses: CTEs for readability, ROW_NUMBER() for ranking within a partition, LEFT JOIN to preserve users with no second order, CASE WHEN inside COUNT(DISTINCT ...) for conditional aggregation, NULLIF to avoid division by zero, and INTERVAL arithmetic for date math. These patterns appear constantly in production analytics code.

Graded Practice Challenges

Beginner - Write the Query

Challenge: Given the schema from the Mermaid diagram above (users, orders, order_items, products), write a query that returns the 5 most recently placed orders, with the customer's email and the number of line items in each order.

Expected output columns: order_id, customer_email, order_date, item_count, order_total

Show Solution

SELECT
    o.order_id,
    u.email         AS customer_email,
    o.created_at    AS order_date,
    COUNT(oi.item_id) AS item_count,
    o.total           AS order_total
FROM orders o
INNER JOIN users u
    ON o.user_id = u.user_id
LEFT JOIN order_items oi
    ON o.order_id = oi.order_id
GROUP BY o.order_id, u.email, o.created_at, o.total
ORDER BY o.created_at DESC
LIMIT 5;

Why LEFT JOIN for order_items? An order might have zero items if it was created but not yet populated (edge case in some systems). LEFT JOIN keeps the order row even in that case with item_count = 0. Use INNER JOIN if you know all orders always have items.

Why GROUP BY o.total? The total column is not an aggregate, so it must appear in GROUP BY even though it is functionally determined by order_id. In PostgreSQL, GROUP BY o.order_id is sufficient if order_id is the primary key (because all other columns are functionally dependent on it). In SQLite and older PostgreSQL, list all non-aggregated SELECT columns explicitly.

Intermediate - Debug the Query

This query is supposed to find products that have never been ordered. It always returns zero rows. Find and fix the bug:

SELECT p.product_id, p.name
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
WHERE oi.order_id = NULL;

Show Solution

Two bugs:

Bug 1 - Wrong JOIN type: INNER JOIN only returns products that have at least one matching order_items row. Products that have never been ordered have no match in order_items, so INNER JOIN excludes them entirely. We need LEFT JOIN to keep all products and produce NULL on the order_items side for unordered products.

Bug 2 - NULL comparison with =: WHERE oi.order_id = NULL always evaluates to UNKNOWN, never TRUE. As we saw in the opening puzzle, NULL must be compared with IS NULL.

Corrected query:

SELECT p.product_id, p.name
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
WHERE oi.order_id IS NULL;   -- oi columns are NULL when there is no matching order_item

Alternative using NOT EXISTS (often faster on large tables):

SELECT p.product_id, p.name
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
);

NOT EXISTS terminates as soon as it finds one matching row, which can be more efficient than LEFT JOIN ... WHERE IS NULL when the order_items table is large and many products are unordered.

Advanced - Design and Optimise

Challenge: Write a query that produces a "customer leaderboard" for the past 12 months. For each customer, show their rank by total spend, their total spend, their order count, their most purchased product category, and whether they are a "new" customer (first order within the past 12 months) or "returning" (first order more than 12 months ago).

Requirements:

Rank by total spend descending (use DENSE_RANK so ties share the same rank)
Only include customers with at least one completed order in the past 12 months
The "most purchased category" is by quantity of items ordered (not revenue)
Output columns: rank, customer_email, total_spend, order_count, top_category, customer_type

Show Reference Solution

WITH period_orders AS (
    -- Completed orders in the last 12 months
    SELECT o.order_id, o.user_id, o.total
    FROM orders o
    WHERE o.status = 'completed'
      AND o.created_at >= NOW() - INTERVAL '12 months'
),
customer_spend AS (
    -- Total spend and order count per customer in the period
    SELECT
        po.user_id,
        SUM(po.total)        AS total_spend,
        COUNT(po.order_id)   AS order_count
    FROM period_orders po
    GROUP BY po.user_id
),
category_quantities AS (
    -- Quantity of items ordered per customer per category, in the period
    SELECT
        o.user_id,
        p.category,
        SUM(oi.quantity) AS qty
    FROM order_items oi
    INNER JOIN period_orders po ON oi.order_id = po.order_id
    INNER JOIN orders          o  ON oi.order_id = o.order_id
    INNER JOIN products        p  ON oi.product_id = p.product_id
    GROUP BY o.user_id, p.category
),
top_categories AS (
    -- Pick the single highest-quantity category per customer
    SELECT
        user_id,
        category AS top_category
    FROM (
        SELECT
            user_id,
            category,
            ROW_NUMBER() OVER (
                PARTITION BY user_id
                ORDER BY qty DESC
            ) AS rn
        FROM category_quantities
    ) ranked
    WHERE rn = 1
),
customer_first_order AS (
    -- Determine when each customer first ordered (ever, not just in the period)
    SELECT
        user_id,
        MIN(created_at) AS first_order_date
    FROM orders
    WHERE status = 'completed'
    GROUP BY user_id
)
SELECT
    DENSE_RANK() OVER (ORDER BY cs.total_spend DESC)   AS rank,
    u.email                                             AS customer_email,
    ROUND(cs.total_spend, 2)                           AS total_spend,
    cs.order_count,
    tc.top_category,
    CASE
        WHEN cfo.first_order_date >= NOW() - INTERVAL '12 months'
        THEN 'new'
        ELSE 'returning'
    END                                                 AS customer_type
FROM customer_spend cs
INNER JOIN users                u   ON cs.user_id  = u.user_id
INNER JOIN top_categories       tc  ON cs.user_id  = tc.user_id
INNER JOIN customer_first_order cfo ON cs.user_id  = cfo.user_id
ORDER BY rank, u.email
LIMIT 500;

Design decisions:

Five CTEs decompose the problem into independently understandable steps. Each CTE does one thing. The final SELECT only joins them together - it contains no complex logic.
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY qty DESC) inside top_categories selects the single best category without a correlated subquery or a messy GROUP BY hack.
customer_first_order uses all completed orders, not just period orders - because "new vs returning" is about the customer's entire history, not just the past 12 months.
DENSE_RANK() rather than RANK() or ROW_NUMBER() - ties in spend share the same rank, and no rank numbers are skipped after a tie.
LIMIT 500 - leaderboards rarely need all rows. Adding a limit also makes the query safe to run repeatedly without accidentally returning millions of rows.

For production: Add an index on orders(user_id, status, created_at) to support the date range filter efficiently. Check EXPLAIN ANALYZE to verify it is used.

Key Takeaways

NULL is not a value - it is the absence of a value. column = NULL always evaluates to UNKNOWN, never TRUE. Always use IS NULL and IS NOT NULL for null checks
SQL is a set language - you describe the data you want, and the query optimizer determines how to retrieve it. Think in sets, not loops
JOIN types control which rows survive: INNER JOIN keeps only matching rows from both sides; LEFT JOIN keeps all left-side rows; FULL OUTER JOIN keeps all rows from both sides with NULLs for non-matches
WHERE filters before grouping; HAVING filters after - they are evaluated at different stages of query execution
The query execution order is: FROM → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT. You cannot reference a SELECT alias in WHERE because SELECT has not been evaluated yet
Correlated subqueries execute once per outer row - for large tables, this is catastrophically slow. Rewrite as a JOIN against a derived table or CTE
Window functions preserve individual rows unlike GROUP BY. They are the correct tool for ranking, running totals, and lag/lead comparisons
SQL injection is fully preventable with parameterized queries. Never build SQL strings with f-strings or concatenation using values from outside the query. Validate column and table names against an allowlist if they must be dynamic
SELECT * is an anti-pattern in production - name every column you need, to control data volume and prevent schema coupling

What's Next

Lesson 02 covers SQLite - Python's built-in, zero-configuration database. You will learn to use the sqlite3 standard library module, manage connections and cursors, run parameterized queries you learned in this lesson, and write a complete test suite using in-memory SQLite databases - a pattern used in every serious Python project to run database tests without a server.

What You Will Learn​

Prerequisites​

SQL Is a Set Language​

Part 1 - SELECT, WHERE, ORDER BY, LIMIT​

The Anatomy of a SELECT Statement​

WHERE Clause Operators​

The NULL Trap in Detail​

Part 2 - JOINs​

INNER JOIN​

LEFT JOIN (LEFT OUTER JOIN)​

RIGHT JOIN (RIGHT OUTER JOIN)​

FULL OUTER JOIN​

Multi-Table JOIN​

Part 3 - GROUP BY, HAVING, and Aggregate Functions​

Aggregate Functions​

GROUP BY Examples​

The GROUP BY Rule​

HAVING - Filtering After Aggregation​

Part 4 - Subqueries vs JOINs​

Correlated vs Non-Correlated Subqueries​

Common Table Expressions (CTEs)​

Part 5 - Window Functions​

ROW_NUMBER - Assign a Sequential Integer​

RANK and DENSE_RANK - Handle Ties​

LAG and LEAD - Access Adjacent Rows​

Running Totals with Frames​

Part 6 - SQL Injection​

How SQL Injection Works​

Parameterized Queries in Python​

Dynamic Column and Table Names​

Query Execution Order​

Putting It Together - A Real Analytics Query​

Graded Practice Challenges​

Beginner - Write the Query​

Intermediate - Debug the Query​

Advanced - Design and Optimise​

Key Takeaways​

What's Next​