Skip to main content

SQL Fundamentals for Python Engineers

Reading time: ~35 minutes | Level: Intermediate → Engineering

Before reading further, predict the result of this query:

SELECT * FROM orders WHERE user_id = NULL;

There is a user_id column. There are rows where user_id is null. The query runs without error.

How many rows does it return?

Show Answer

Zero rows. Always. Even when there are null user_id values in the table.

The reason is fundamental to how SQL works: NULL is not a value - it is the absence of a value. Comparing anything to NULL with = always produces a third logical result: UNKNOWN, not TRUE or FALSE. The WHERE clause only passes rows where the condition evaluates to TRUE. UNKNOWN fails the filter.

The correct query is:

SELECT * FROM orders WHERE user_id IS NULL;

IS NULL is a special predicate designed specifically for null checking. It returns TRUE or FALSE, never UNKNOWN.

The same applies to != (not equal): WHERE user_id != NULL also returns zero rows, for the same reason. Use IS NOT NULL to filter out nulls.

This surprises every developer the first time they encounter it. It is not a bug in the database - it is three-valued logic (TRUE / FALSE / UNKNOWN), defined in the SQL standard, present in every major RDBMS. Understanding NULL semantics is one of the first things that separates engineers from casual SQL users.

SQL is the language Python engineers interact with more than almost any other external system. You write it in migrations, in query builders, in ORMs that generate it, and directly in cursor.execute() calls. When an ORM query is slow, you debug the SQL it generated. When a report is wrong, you read the SQL behind it. When a data pipeline fails, you trace it to a JOIN that produced unexpected nulls.

This lesson builds SQL intuition from an engineering perspective - not just syntax, but semantics. What does the database actually do with your query? Why is this query fast and that one slow? What are the traps that catch even experienced developers?

What You Will Learn

  • NULL semantics and three-valued logic
  • SQL as a set language - the mental model that makes queries intuitive
  • SELECT, WHERE, ORDER BY, LIMIT - with engineering depth on what the database executes
  • INNER, LEFT, RIGHT, and FULL OUTER JOINs - with visual table diagrams
  • GROUP BY, HAVING, and aggregate functions (COUNT, SUM, AVG, MIN, MAX)
  • Subqueries vs JOINs - performance implications and when to use each
  • Window functions - ROW_NUMBER, RANK, LAG, LEAD - the foundation of analytics queries
  • SQL injection - how it works, how to prevent it, why it matters in Python code
  • Query execution order - the exact order the database processes your query

Prerequisites

  • Basic Python: variables, functions, loops, dictionaries
  • No prior SQL experience required - this lesson starts from scratch
  • Recommended: the Module 07 Overview for context on where SQL fits in the data layer stack

SQL Is a Set Language

The single most important mental model shift when learning SQL is this: SQL describes what data you want, not how to retrieve it. It is declarative, not procedural.

In Python you write:

results = []
for order in all_orders:
if order["status"] == "completed":
results.append(order)
results.sort(key=lambda o: o["total"], reverse=True)
results = results[:10]

In SQL you declare:

SELECT *
FROM orders
WHERE status = 'completed'
ORDER BY total DESC
LIMIT 10;

The database figures out how to retrieve it - which index to use, whether to sort before or after filtering, how much memory to allocate. The query optimizer is doing the work that you would have to write manually in Python. Your job is to describe the correct set of rows, joins, and aggregations; the database's job is to find the most efficient execution plan.

This also means SQL operates on tables as sets - unordered collections of rows. A SELECT without ORDER BY has no guaranteed row order. Different executions of the same query can return rows in different sequences. If order matters to you, you must ask for it explicitly.

Part 1 - SELECT, WHERE, ORDER BY, LIMIT

The Anatomy of a SELECT Statement

SELECT
order_id,
user_id,
total,
status,
created_at
FROM orders
WHERE status = 'completed'
AND total > 100.00
ORDER BY created_at DESC
LIMIT 25
OFFSET 0;

Breaking down each clause:

ClauseWhat it doesExecution order
FROM ordersIdentifies the source table (or tables)1st
WHERE status = 'completed' AND total > 100Filters rows - only rows where ALL conditions are TRUE pass2nd
SELECT order_id, user_id, total, ...Chooses which columns to return3rd
ORDER BY created_at DESCSorts the result set4th
LIMIT 25 OFFSET 0Returns only the first 25 rows5th

:::warning SELECT * in Production Is an Anti-Pattern SELECT * retrieves every column in the table. If the table has 30 columns and you only need 3, you are transferring 10x the data over the network, consuming 10x the memory in the application, and coupling your code to the table's column list (so every future ALTER TABLE ADD COLUMN might silently change your query's behavior). Always name the columns you need. :::

WHERE Clause Operators

-- Equality and inequality
WHERE status = 'completed'
WHERE status != 'cancelled'
WHERE total > 100
WHERE total BETWEEN 50 AND 200 -- inclusive on both ends

-- Pattern matching (case-sensitive in most databases)
WHERE email LIKE '%@gmail.com' -- % matches any sequence of characters
WHERE name LIKE 'Al_ce' -- _ matches exactly one character

-- Set membership
WHERE status IN ('completed', 'shipped', 'processing')
WHERE user_id NOT IN (SELECT user_id FROM banned_users)

-- NULL handling - always use IS NULL / IS NOT NULL
WHERE deleted_at IS NULL -- row has not been soft-deleted
WHERE coupon_code IS NOT NULL -- row has a coupon applied

-- Combining conditions
WHERE status = 'completed'
AND total > 100
AND created_at >= '2025-01-01'

WHERE status = 'pending'
OR (status = 'processing' AND priority = 'high')

The NULL Trap in Detail

NULL propagates through expressions in ways that surprise most developers:

-- All three of these evaluate to UNKNOWN, not TRUE or FALSE
NULL = NULL -- UNKNOWN
NULL != NULL -- UNKNOWN
NULL + 5 -- NULL (arithmetic on NULL produces NULL)
CONCAT('hello', NULL) -- NULL (in most databases)

-- The COALESCE function provides a default value when a column is NULL
SELECT
order_id,
COALESCE(discount_amount, 0) AS discount, -- use 0 if NULL
COALESCE(notes, 'No notes') AS notes
FROM orders;

-- NULLIF returns NULL when two values are equal (useful to avoid division by zero)
SELECT
order_id,
total / NULLIF(quantity, 0) AS price_per_unit -- returns NULL instead of dividing by zero
FROM order_items;

:::note Why Three-Valued Logic Exists NULL represents missing or unknown data - not zero, not empty string, not false. When you ask "is this unknown value equal to 'completed'?", the honest answer is "unknown." SQL's three-valued logic (TRUE / FALSE / UNKNOWN) is mathematically correct for this scenario. The practical consequence: always check for nulls explicitly with IS NULL, and design schemas to use NOT NULL constraints wherever the business logic guarantees a value exists. A column that is NOT NULL eliminates an entire category of three-valued logic bugs. :::

Part 2 - JOINs

A JOIN combines rows from two or more tables based on a related column. Before writing the SQL, always draw the relationship:

INNER JOIN

Returns only rows that have a match in both tables. Rows with no match on either side are excluded.

-- Get all orders with the customer's name and email
SELECT
o.order_id,
u.name AS customer_name,
u.email AS customer_email,
o.total,
o.status
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'completed'
ORDER BY o.created_at DESC;

If a user has been deleted from the users table but their orders remain, INNER JOIN will exclude those orphaned orders entirely - they have no match in users.

LEFT JOIN (LEFT OUTER JOIN)

Returns all rows from the left table, plus matching rows from the right table. If there is no match on the right side, the right-side columns are NULL.

-- Get all users, even those who have never placed an order
SELECT
u.user_id,
u.name,
u.email,
COUNT(o.order_id) AS total_orders,
COALESCE(SUM(o.total), 0) AS lifetime_value
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id, u.name, u.email
ORDER BY lifetime_value DESC;

Users with no orders will have total_orders = 0 and lifetime_value = 0. With INNER JOIN, they would be absent from the result entirely.

RIGHT JOIN (RIGHT OUTER JOIN)

Returns all rows from the right table, plus matching rows from the left table. Rarely used - the same result can always be expressed as a LEFT JOIN with the tables swapped. Most engineers prefer LEFT JOIN for readability.

-- Equivalent to a LEFT JOIN with tables reversed
SELECT u.name, o.order_id
FROM orders o
RIGHT JOIN users u ON o.user_id = u.user_id;

-- Preferred equivalent with LEFT JOIN:
SELECT u.name, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;

FULL OUTER JOIN

Returns all rows from both tables. Where there is no match, the missing side's columns are NULL.

-- Find all users and all orders, even unmatched ones
-- Useful for data quality audits
SELECT
u.user_id,
u.email,
o.order_id,
o.total
FROM users u
FULL OUTER JOIN orders o ON u.user_id = o.user_id
WHERE u.user_id IS NULL -- orders with no user (orphaned)
OR o.order_id IS NULL; -- users with no orders

:::note SQLite Does Not Support FULL OUTER JOIN SQLite supports INNER JOIN and LEFT JOIN but not FULL OUTER JOIN. For learning purposes, all examples that require FULL OUTER JOIN can be run on PostgreSQL. The concept applies universally - only the dialect differs. :::

Multi-Table JOIN

-- Get order line items with product names and customer emails
SELECT
u.email,
o.order_id,
o.created_at,
p.name AS product_name,
oi.quantity,
oi.unit_price,
(oi.quantity * oi.unit_price) AS line_total
FROM order_items oi
INNER JOIN orders o ON oi.order_id = o.order_id
INNER JOIN users u ON o.user_id = u.user_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.status = 'completed'
ORDER BY o.created_at DESC, o.order_id, oi.item_id;

The table aliases (o, u, oi, p) are essential for readability in multi-table queries. Always alias tables when joining more than two.

Part 3 - GROUP BY, HAVING, and Aggregate Functions

GROUP BY collapses multiple rows with the same column value into a single output row, with aggregate functions summarizing the grouped data.

Aggregate Functions

FunctionWhat it computesNULL handling
COUNT(*)All rows in the group, including NULLsCounts everything
COUNT(column)Rows where column is not NULLIgnores NULLs
SUM(column)Sum of non-NULL valuesIgnores NULLs; returns NULL if all values are NULL
AVG(column)Average of non-NULL valuesIgnores NULLs
MIN(column)Smallest non-NULL valueIgnores NULLs
MAX(column)Largest non-NULL valueIgnores NULLs

GROUP BY Examples

-- Sales summary by status
SELECT
status,
COUNT(*) AS order_count,
SUM(total) AS revenue,
AVG(total) AS avg_order_value,
MIN(total) AS smallest_order,
MAX(total) AS largest_order
FROM orders
GROUP BY status
ORDER BY revenue DESC;
status | order_count | revenue | avg_order_value | smallest_order | largest_order
-------------|-------------|------------|-----------------|----------------|---------------
completed | 8241 | 1824050.00 | 221.35 | 12.50 | 9875.00
processing | 412 | 89234.50 | 216.59 | 18.00 | 3200.00
cancelled | 1053 | 0.00 | 0.00 | 0.00 | 0.00

The GROUP BY Rule

Every column in SELECT must be either in GROUP BY or inside an aggregate function. This is not a style choice - it is a logical requirement. If you group by status, the database collapses all completed rows into one output row. Which order_id would it show? It is ambiguous - so SQL forbids it.

-- WRONG - order_id is neither in GROUP BY nor aggregated
SELECT status, order_id, COUNT(*)
FROM orders
GROUP BY status;
-- Error: column "order_id" must appear in the GROUP BY clause
-- or be used in an aggregate function

-- CORRECT
SELECT status, COUNT(*) AS order_count
FROM orders
GROUP BY status;

HAVING - Filtering After Aggregation

WHERE filters rows before grouping. HAVING filters groups after aggregation. They solve different problems.

-- Find customers who have spent more than $1,000 in total
SELECT
u.user_id,
u.email,
COUNT(o.order_id) AS order_count,
SUM(o.total) AS lifetime_value
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed' -- filter ROWS before grouping
GROUP BY u.user_id, u.email
HAVING SUM(o.total) > 1000 -- filter GROUPS after aggregation
ORDER BY lifetime_value DESC
LIMIT 100;

You cannot use WHERE SUM(o.total) > 1000 because WHERE is evaluated before aggregation - SUM does not exist yet at that stage. HAVING is evaluated after GROUP BY, so aggregate functions are available.

Part 4 - Subqueries vs JOINs

A subquery is a SELECT statement embedded inside another query. Both subqueries and JOINs can produce the same results - but they often have different performance characteristics.

Correlated vs Non-Correlated Subqueries

A non-correlated subquery executes once and its result is used by the outer query:

-- Find orders placed by customers who signed up in 2025
SELECT order_id, user_id, total
FROM orders
WHERE user_id IN (
SELECT user_id
FROM users
WHERE created_at >= '2025-01-01' -- executes once
);

A correlated subquery executes once per row in the outer query - this is the expensive one:

-- For each order, calculate the customer's average order value
-- WARNING: this executes the subquery once per order row
SELECT
o.order_id,
o.total,
(
SELECT AVG(total)
FROM orders inner_o
WHERE inner_o.user_id = o.user_id -- references outer query
) AS customer_avg_order
FROM orders o;

For a table with 100,000 orders, this correlated subquery executes 100,000 times. The JOIN version executes once:

-- Same result, executed as a single JOIN + aggregation pass
SELECT
o.order_id,
o.total,
avg_orders.avg_total AS customer_avg_order
FROM orders o
INNER JOIN (
SELECT user_id, AVG(total) AS avg_total
FROM orders
GROUP BY user_id
) avg_orders ON o.user_id = avg_orders.user_id;

:::tip Prefer JOINs Over Correlated Subqueries Correlated subqueries are often the slowest SQL pattern you can write. For each outer row, the database executes the inner query independently - effectively a nested loop. Rewrite correlated subqueries as JOINs against a derived table or CTE whenever possible. The query plan (via EXPLAIN ANALYZE) will show the difference clearly. :::

Common Table Expressions (CTEs)

CTEs (introduced with WITH) make complex queries readable by naming intermediate result sets:

-- Find the top 10 products by revenue in the last 30 days
WITH recent_orders AS (
SELECT order_id
FROM orders
WHERE created_at >= NOW() - INTERVAL '30 days'
AND status = 'completed'
),
product_revenue AS (
SELECT
oi.product_id,
SUM(oi.quantity * oi.unit_price) AS revenue,
COUNT(DISTINCT oi.order_id) AS times_ordered
FROM order_items oi
INNER JOIN recent_orders ro ON oi.order_id = ro.order_id
GROUP BY oi.product_id
)
SELECT
p.name,
pr.revenue,
pr.times_ordered,
pr.revenue / pr.times_ordered AS avg_revenue_per_order
FROM product_revenue pr
INNER JOIN products p ON pr.product_id = p.product_id
ORDER BY pr.revenue DESC
LIMIT 10;

CTEs do not always improve performance (in many databases they are not independently optimized), but they dramatically improve readability for multi-step analytical queries.

Part 5 - Window Functions

Window functions compute values across a set of rows related to the current row - without collapsing those rows into one output row (unlike GROUP BY). They are one of the most powerful SQL features and appear in virtually every analytics or reporting query.

The core syntax:

window_function() OVER (
PARTITION BY column1, column2 -- defines the "window" (like GROUP BY, but rows are not collapsed)
ORDER BY column3 DESC -- defines row ordering within the window
ROWS BETWEEN ... AND ... -- optional: defines the frame (subset of the window)
)

ROW_NUMBER - Assign a Sequential Integer

-- Assign row numbers to each user's orders, newest first
SELECT
order_id,
user_id,
total,
created_at,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at DESC
) AS order_sequence
FROM orders;
order_id | user_id | total | created_at | order_sequence
---------|---------|--------|---------------------|---------------
1042 | 7 | 219.00 | 2025-12-01 10:00:00 | 1
891 | 7 | 87.50 | 2025-09-15 08:22:00 | 2
234 | 7 | 445.00 | 2024-11-30 14:01:00 | 3

Use case: Get each user's most recent order (WHERE order_sequence = 1).

-- Get each user's most recent completed order
WITH ranked_orders AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at DESC
) AS rn
FROM orders
WHERE status = 'completed'
)
SELECT order_id, user_id, total, created_at
FROM ranked_orders
WHERE rn = 1;

RANK and DENSE_RANK - Handle Ties

-- Rank products by revenue; ties get the same rank
SELECT
p.name,
SUM(oi.quantity * oi.unit_price) AS revenue,
RANK() OVER (ORDER BY SUM(oi.quantity * oi.unit_price) DESC) AS rank_with_gaps,
DENSE_RANK() OVER (ORDER BY SUM(oi.quantity * oi.unit_price) DESC) AS rank_no_gaps
FROM order_items oi
INNER JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_id, p.name
ORDER BY revenue DESC;
name | revenue | rank_with_gaps | rank_no_gaps
--------------|----------|----------------|-------------
Widget Pro | 84500.00 | 1 | 1
Widget Basic | 84500.00 | 1 | 1
Gadget Plus | 62100.00 | 3 | 2 ← RANK skips 2; DENSE_RANK does not

LAG and LEAD - Access Adjacent Rows

LAG accesses the value from a previous row; LEAD accesses the value from a following row - without a self-JOIN.

-- Calculate month-over-month revenue change
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(total) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_change,
ROUND(
100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
2
) AS pct_change
FROM monthly_revenue
ORDER BY month;
month | revenue | prev_month_revenue | revenue_change | pct_change
------------|------------|-------------------|----------------|------------
2025-10-01 | 142000.00 | NULL | NULL | NULL
2025-11-01 | 168000.00 | 142000.00 | 26000.00 | 18.31
2025-12-01 | 195000.00 | 168000.00 | 27000.00 | 16.07
2026-01-01 | 131000.00 | 195000.00 | -64000.00 | -32.82

Running Totals with Frames

-- Running total of revenue per user, ordered by order date
SELECT
user_id,
order_id,
total,
created_at,
SUM(total) OVER (
PARTITION BY user_id
ORDER BY created_at
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_lifetime_value
FROM orders
WHERE status = 'completed'
ORDER BY user_id, created_at;

:::tip Window Functions Are Not GROUP BY GROUP BY collapses rows: you get one output row per group. Window functions preserve all rows and add a computed column. This is the distinction that makes ROW_NUMBER() OVER (PARTITION BY ...) so useful - you can rank rows within groups without losing the individual row data, then filter in an outer query. :::

Part 6 - SQL Injection

SQL injection is the most consistently exploited vulnerability class in web applications. It is entirely preventable. Yet it still appears in production code written by experienced developers who "just needed a quick fix."

How SQL Injection Works

# DANGEROUS - do not do this
def get_user_by_name(name: str) -> dict:
query = f"SELECT * FROM users WHERE name = '{name}'"
cursor.execute(query)
return cursor.fetchone()

If name = "Alice", the query is:

SELECT * FROM users WHERE name = 'Alice';

That is fine. Now if name = "' OR '1'='1", the query becomes:

SELECT * FROM users WHERE name = '' OR '1'='1';

'1'='1' is always TRUE. The WHERE clause is effectively disabled. This query returns every row in the users table.

A more destructive payload: name = "'; DROP TABLE users; --" produces:

SELECT * FROM users WHERE name = ''; DROP TABLE users; --';

If the database driver permits multiple statements (some do, some do not), this drops the entire users table. The -- comments out the trailing quote so the statement is syntactically valid.

:::danger Parameterized Queries Are Non-Negotiable Never use string formatting, f-strings, or concatenation to build SQL queries with user-supplied input. Never. Not "usually not." Not "not in production." Not in logging statements. Not in admin scripts. Not even for integer values that "can't have SQL in them" - because the habit of string interpolation is what leads to the bug when a string column is added later.

The parameterized query passes the values separately from the SQL text. The database driver handles escaping. The SQL structure cannot be changed by the data. :::

Parameterized Queries in Python

Every major Python database driver supports parameterized queries. Use them for every value that comes from outside the query itself:

import sqlite3

# WRONG - string interpolation with user input
def get_user_bad(conn: sqlite3.Connection, name: str) -> dict | None:
query = f"SELECT * FROM users WHERE name = '{name}'" # NEVER
return conn.execute(query).fetchone()

# CORRECT - parameterized query with sqlite3 (uses ? placeholder)
def get_user_sqlite(conn: sqlite3.Connection, name: str) -> dict | None:
query = "SELECT * FROM users WHERE name = ?"
return conn.execute(query, (name,)).fetchone()

# CORRECT - parameterized query with psycopg2 (uses %s placeholder)
import psycopg2

def get_user_postgres(conn: psycopg2.extensions.connection, name: str) -> dict | None:
query = "SELECT * FROM users WHERE name = %s"
with conn.cursor() as cur:
cur.execute(query, (name,))
return cur.fetchone()

# CORRECT - multiple parameters
def get_orders_by_status_and_user(
conn: sqlite3.Connection,
user_id: int,
status: str,
limit: int,
) -> list:
query = """
SELECT order_id, total, created_at
FROM orders
WHERE user_id = ?
AND status = ?
ORDER BY created_at DESC
LIMIT ?
"""
return conn.execute(query, (user_id, status, limit)).fetchall()

The placeholders (? in sqlite3, %s in psycopg2) are not string substitution - they are protocol-level parameter binding. The values travel to the database server separately from the SQL text and are never interpreted as SQL syntax.

Dynamic Column and Table Names

Parameterization works for values, not for identifiers (table names, column names, ORDER BY columns). If you need to make those dynamic, validate against an allowlist:

ALLOWED_SORT_COLUMNS = {"created_at", "total", "status"}
ALLOWED_SORT_DIRECTIONS = {"ASC", "DESC"}

def get_orders_sorted(
conn: sqlite3.Connection,
sort_by: str,
direction: str,
) -> list:
# Validate against allowlist - NEVER pass user input directly into the identifier
if sort_by not in ALLOWED_SORT_COLUMNS:
raise ValueError(f"Invalid sort column: {sort_by!r}")
if direction not in ALLOWED_SORT_DIRECTIONS:
raise ValueError(f"Invalid sort direction: {direction!r}")

# Now safe to interpolate - both values came from our allowlist, not user input
query = f"SELECT * FROM orders ORDER BY {sort_by} {direction}"
return conn.execute(query).fetchall()

Query Execution Order

The order you write SQL clauses and the order the database executes them are different. This is one of the most common sources of SQL confusion.

Why this matters in practice:

  • WHERE cannot reference SELECT aliases because SELECT has not been evaluated yet. You must repeat the expression or use a subquery/CTE.
  • HAVING can reference aggregate functions but WHERE cannot - GROUP BY has not happened yet at the WHERE stage.
  • ORDER BY can reference SELECT aliases - it is evaluated after SELECT.
  • LIMIT is applied last - the database may still need to sort the entire result set before returning the first N rows (unless an index supports the sort).
-- WRONG - cannot reference SELECT alias in WHERE
SELECT total * 0.1 AS tax
FROM orders
WHERE tax > 5; -- Error: column "tax" does not exist

-- CORRECT - repeat the expression in WHERE
SELECT total * 0.1 AS tax
FROM orders
WHERE total * 0.1 > 5;

-- OR use a subquery / CTE
WITH order_taxes AS (
SELECT order_id, total * 0.1 AS tax
FROM orders
)
SELECT order_id, tax
FROM order_taxes
WHERE tax > 5;

Putting It Together - A Real Analytics Query

Here is a complete query combining JOINs, aggregation, and window functions - the kind you will write regularly as a Python engineer working on a product:

-- Monthly cohort retention: for each signup month, what % of users
-- placed a second order within 90 days?
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', created_at) AS signup_month
FROM users
),
user_orders AS (
SELECT
o.user_id,
o.order_id,
o.created_at,
ROW_NUMBER() OVER (
PARTITION BY o.user_id
ORDER BY o.created_at ASC
) AS order_rank
FROM orders o
WHERE o.status = 'completed'
),
first_second_orders AS (
SELECT
uc.signup_month,
uo_first.user_id,
uo_first.created_at AS first_order_date,
uo_second.created_at AS second_order_date
FROM user_cohorts uc
INNER JOIN user_orders uo_first
ON uc.user_id = uo_first.user_id
AND uo_first.order_rank = 1
LEFT JOIN user_orders uo_second
ON uc.user_id = uo_second.user_id
AND uo_second.order_rank = 2
AND uo_second.created_at <= uo_first.created_at + INTERVAL '90 days'
)
SELECT
signup_month,
COUNT(DISTINCT user_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN second_order_date IS NOT NULL
THEN user_id END) AS retained_users,
ROUND(
100.0 * COUNT(DISTINCT CASE WHEN second_order_date IS NOT NULL
THEN user_id END)
/ NULLIF(COUNT(DISTINCT user_id), 0),
1
) AS retention_pct
FROM first_second_orders
GROUP BY signup_month
ORDER BY signup_month DESC;

This query uses: CTEs for readability, ROW_NUMBER() for ranking within a partition, LEFT JOIN to preserve users with no second order, CASE WHEN inside COUNT(DISTINCT ...) for conditional aggregation, NULLIF to avoid division by zero, and INTERVAL arithmetic for date math. These patterns appear constantly in production analytics code.

Graded Practice Challenges

Beginner - Write the Query

Challenge: Given the schema from the Mermaid diagram above (users, orders, order_items, products), write a query that returns the 5 most recently placed orders, with the customer's email and the number of line items in each order.

Expected output columns: order_id, customer_email, order_date, item_count, order_total

Show Solution
SELECT
o.order_id,
u.email AS customer_email,
o.created_at AS order_date,
COUNT(oi.item_id) AS item_count,
o.total AS order_total
FROM orders o
INNER JOIN users u
ON o.user_id = u.user_id
LEFT JOIN order_items oi
ON o.order_id = oi.order_id
GROUP BY o.order_id, u.email, o.created_at, o.total
ORDER BY o.created_at DESC
LIMIT 5;

Why LEFT JOIN for order_items? An order might have zero items if it was created but not yet populated (edge case in some systems). LEFT JOIN keeps the order row even in that case with item_count = 0. Use INNER JOIN if you know all orders always have items.

Why GROUP BY o.total? The total column is not an aggregate, so it must appear in GROUP BY even though it is functionally determined by order_id. In PostgreSQL, GROUP BY o.order_id is sufficient if order_id is the primary key (because all other columns are functionally dependent on it). In SQLite and older PostgreSQL, list all non-aggregated SELECT columns explicitly.

Intermediate - Debug the Query

This query is supposed to find products that have never been ordered. It always returns zero rows. Find and fix the bug:

SELECT p.product_id, p.name
FROM products p
INNER JOIN order_items oi ON p.product_id = oi.product_id
WHERE oi.order_id = NULL;
Show Solution

Two bugs:

Bug 1 - Wrong JOIN type: INNER JOIN only returns products that have at least one matching order_items row. Products that have never been ordered have no match in order_items, so INNER JOIN excludes them entirely. We need LEFT JOIN to keep all products and produce NULL on the order_items side for unordered products.

Bug 2 - NULL comparison with =: WHERE oi.order_id = NULL always evaluates to UNKNOWN, never TRUE. As we saw in the opening puzzle, NULL must be compared with IS NULL.

Corrected query:

SELECT p.product_id, p.name
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
WHERE oi.order_id IS NULL; -- oi columns are NULL when there is no matching order_item

Alternative using NOT EXISTS (often faster on large tables):

SELECT p.product_id, p.name
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.product_id = p.product_id
);

NOT EXISTS terminates as soon as it finds one matching row, which can be more efficient than LEFT JOIN ... WHERE IS NULL when the order_items table is large and many products are unordered.

Advanced - Design and Optimise

Challenge: Write a query that produces a "customer leaderboard" for the past 12 months. For each customer, show their rank by total spend, their total spend, their order count, their most purchased product category, and whether they are a "new" customer (first order within the past 12 months) or "returning" (first order more than 12 months ago).

Requirements:

  • Rank by total spend descending (use DENSE_RANK so ties share the same rank)
  • Only include customers with at least one completed order in the past 12 months
  • The "most purchased category" is by quantity of items ordered (not revenue)
  • Output columns: rank, customer_email, total_spend, order_count, top_category, customer_type
Show Reference Solution
WITH period_orders AS (
-- Completed orders in the last 12 months
SELECT o.order_id, o.user_id, o.total
FROM orders o
WHERE o.status = 'completed'
AND o.created_at >= NOW() - INTERVAL '12 months'
),
customer_spend AS (
-- Total spend and order count per customer in the period
SELECT
po.user_id,
SUM(po.total) AS total_spend,
COUNT(po.order_id) AS order_count
FROM period_orders po
GROUP BY po.user_id
),
category_quantities AS (
-- Quantity of items ordered per customer per category, in the period
SELECT
o.user_id,
p.category,
SUM(oi.quantity) AS qty
FROM order_items oi
INNER JOIN period_orders po ON oi.order_id = po.order_id
INNER JOIN orders o ON oi.order_id = o.order_id
INNER JOIN products p ON oi.product_id = p.product_id
GROUP BY o.user_id, p.category
),
top_categories AS (
-- Pick the single highest-quantity category per customer
SELECT
user_id,
category AS top_category
FROM (
SELECT
user_id,
category,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY qty DESC
) AS rn
FROM category_quantities
) ranked
WHERE rn = 1
),
customer_first_order AS (
-- Determine when each customer first ordered (ever, not just in the period)
SELECT
user_id,
MIN(created_at) AS first_order_date
FROM orders
WHERE status = 'completed'
GROUP BY user_id
)
SELECT
DENSE_RANK() OVER (ORDER BY cs.total_spend DESC) AS rank,
u.email AS customer_email,
ROUND(cs.total_spend, 2) AS total_spend,
cs.order_count,
tc.top_category,
CASE
WHEN cfo.first_order_date >= NOW() - INTERVAL '12 months'
THEN 'new'
ELSE 'returning'
END AS customer_type
FROM customer_spend cs
INNER JOIN users u ON cs.user_id = u.user_id
INNER JOIN top_categories tc ON cs.user_id = tc.user_id
INNER JOIN customer_first_order cfo ON cs.user_id = cfo.user_id
ORDER BY rank, u.email
LIMIT 500;

Design decisions:

  • Five CTEs decompose the problem into independently understandable steps. Each CTE does one thing. The final SELECT only joins them together - it contains no complex logic.
  • ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY qty DESC) inside top_categories selects the single best category without a correlated subquery or a messy GROUP BY hack.
  • customer_first_order uses all completed orders, not just period orders - because "new vs returning" is about the customer's entire history, not just the past 12 months.
  • DENSE_RANK() rather than RANK() or ROW_NUMBER() - ties in spend share the same rank, and no rank numbers are skipped after a tie.
  • LIMIT 500 - leaderboards rarely need all rows. Adding a limit also makes the query safe to run repeatedly without accidentally returning millions of rows.

For production: Add an index on orders(user_id, status, created_at) to support the date range filter efficiently. Check EXPLAIN ANALYZE to verify it is used.

Key Takeaways

  • NULL is not a value - it is the absence of a value. column = NULL always evaluates to UNKNOWN, never TRUE. Always use IS NULL and IS NOT NULL for null checks
  • SQL is a set language - you describe the data you want, and the query optimizer determines how to retrieve it. Think in sets, not loops
  • JOIN types control which rows survive: INNER JOIN keeps only matching rows from both sides; LEFT JOIN keeps all left-side rows; FULL OUTER JOIN keeps all rows from both sides with NULLs for non-matches
  • WHERE filters before grouping; HAVING filters after - they are evaluated at different stages of query execution
  • The query execution order is: FROMWHEREGROUP BYHAVINGSELECTDISTINCTORDER BYLIMIT. You cannot reference a SELECT alias in WHERE because SELECT has not been evaluated yet
  • Correlated subqueries execute once per outer row - for large tables, this is catastrophically slow. Rewrite as a JOIN against a derived table or CTE
  • Window functions preserve individual rows unlike GROUP BY. They are the correct tool for ranking, running totals, and lag/lead comparisons
  • SQL injection is fully preventable with parameterized queries. Never build SQL strings with f-strings or concatenation using values from outside the query. Validate column and table names against an allowlist if they must be dynamic
  • SELECT * is an anti-pattern in production - name every column you need, to control data volume and prevent schema coupling

What's Next

Lesson 02 covers SQLite - Python's built-in, zero-configuration database. You will learn to use the sqlite3 standard library module, manage connections and cursors, run parameterized queries you learned in this lesson, and write a complete test suite using in-memory SQLite databases - a pattern used in every serious Python project to run database tests without a server.

© 2026 EngineersOfAI. All rights reserved.