Python CSV Handling Practice Problems & Exercises

Practice: CSV Handling

11 problems4 Easy4 Medium3 Hard⏱ 40-55 min

Easy

#1csv.reader: Basic Row IterationEasy

csv.readernewlineiteration

Predict the output. A CSV is written to a temp file, then read back with csv.reader. Notice the type of each field.

Python

import csv
import io

# Simulate a CSV file in memory
raw = "name,city,score\nAlice,London,95\nBob,Paris,87\n"
f = io.StringIO(raw)

reader = csv.reader(f)
header = next(reader)
print(header)

rows = list(reader)
for row in rows:
    print(row)

# How many data rows (excluding header)?
print(len(rows))

Solution

import csv
import io

raw = "name,city,score\nAlice,London,95\nBob,Paris,87\n"
f = io.StringIO(raw)

reader = csv.reader(f)
header = next(reader)
print(header)

rows = list(reader)
for row in rows:
    print(row)

print(len(rows))

Output:

['name', 'city', 'score']
['Alice', 'London', '95']
['Bob', 'Paris', '87']
3

How it works: csv.reader wraps any iterable that yields lines. next(reader) consumes the first row (the header) and returns it as a list of strings. The remaining rows are collected with list(reader). Every field — including 95 and 87 — is a string. The csv module never converts types automatically.

Key insight: The result is always list[str]. You must cast numeric fields yourself: int(row[2]) or float(row[2]). A common bug is trying to do arithmetic on row["score"] without first converting it — you get string concatenation instead of addition.

Expected Output

['name', 'city', 'score']\n['Alice', 'London', '95']\n['Bob', 'Paris', '87']\n3

Hints

Hint 1: Always open CSV files with newline="" to let the csv module handle line endings.

Hint 2: csv.reader yields each row as a list of strings — all values are strings, even numbers.

Hint 3: next(reader) advances the iterator by one row and returns it.

#2csv.writer: Automatic QuotingEasy

csv.writerquotingwriterownewline

Predict the output. csv.writer handles quoting automatically for fields that contain commas or double-quotes.

Python

import csv
import io

output = io.StringIO()
writer = csv.writer(output)

writer.writerow(["id", "name", "notes"])
writer.writerow([1, "Alice", "Works on NLP, ML"])
writer.writerow([2, "Bob", 'Said "hello" to all'])

print(repr(output.getvalue()))

Solution

import csv
import io

output = io.StringIO()
writer = csv.writer(output)

writer.writerow(["id", "name", "notes"])
writer.writerow([1, "Alice", "Works on NLP, ML"])
writer.writerow([2, "Bob", 'Said "hello" to all'])

print(repr(output.getvalue()))

Output:

'id,name,notes\r\n1,Alice,"Works on NLP, ML"\r\n2,Bob,"Said ""hello"" to all"\r\n'

How it works: csv.writer applies RFC 4180 quoting automatically:

"Works on NLP, ML" contains a comma — the writer wraps the whole field in double-quotes.
'Said "hello" to all' contains double-quotes — the writer wraps the field in double-quotes and escapes each internal double-quote by doubling it: "".
Line endings default to \r\n (the RFC 4180 standard) regardless of the OS. This is why you must open real files with newline="" — without it, Python's universal newlines would turn \r\n into \r\r\n.

Key insight: You never need to manually escape commas or quotes. If you write "Alice,\"Bob\"" into a CSV field yourself, you are doing it wrong. Give csv.writer the raw value and it handles the escaping.

Expected Output

id,name,notes\r\n1,Alice,"Works on NLP, ML"\r\n2,Bob,"Said ""hello"" to all"\r\n

Hints

Hint 1: csv.writer automatically quotes fields that contain the delimiter or a double-quote.

Hint 2: Open the file with newline="" when writing — otherwise Python adds an extra blank line on Windows.

Hint 3: io.StringIO() lets you capture writer output as a string without touching the filesystem.

#3csv.DictReader: Row as DictionaryEasy

DictReaderfieldnamesdict

Predict the output. csv.DictReader reads each row as a dict keyed by column name. Verify the field names and access pattern.

Python

import csv
import io

raw = "name,department,salary\nAlice,Engineering,95000\nBob,Marketing,72000\n"
f = io.StringIO(raw)

reader = csv.DictReader(f)

# fieldnames is populated on first access/iteration
rows = list(reader)
print(reader.fieldnames)

for row in rows:
    print(row["name"], row["department"], row["salary"])

Solution

import csv
import io

raw = "name,department,salary\nAlice,Engineering,95000\nBob,Marketing,72000\n"
f = io.StringIO(raw)

reader = csv.DictReader(f)

rows = list(reader)
print(reader.fieldnames)

for row in rows:
    print(row["name"], row["department"], row["salary"])

Output:

['name', 'department', 'salary']
Alice Engineering 95000
Bob Marketing 72000

How it works: csv.DictReader reads the first row as the list of field names and stores it in reader.fieldnames. Every subsequent row is returned as a dict mapping each field name to its string value. The fieldnames attribute is populated lazily — it is None until the first row is consumed, which is why we print it after list(reader).

Key insight: DictReader is the preferred interface for production code. Using positional indices like row[2] breaks silently when columns are added, removed, or reordered. Using row["salary"] is self-documenting and robust to schema changes. The only cost is a small memory overhead for the dict construction — negligible for all practical purposes.

Expected Output

['name', 'department', 'salary']\nAlice Engineering 95000\nBob Marketing 72000

Hints

Hint 1: csv.DictReader uses the first row as keys by default.

Hint 2: reader.fieldnames is populated after the first row is read.

Hint 3: Access fields by name: row["salary"], not by index.

#4csv.DictWriter: Write Dicts to CSVEasy

DictWriterwriteheaderwriterowsfieldnames

Predict the output. csv.DictWriter writes a list of dicts to CSV. Notice that column order is controlled by fieldnames, not the dict insertion order.

Python

import csv
import io

employees = [
    {"name": "Alice", "department": "Engineering", "salary": 95000},
    {"name": "Bob",   "department": "Marketing",   "salary": 72000},
]

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "department", "salary"])
writer.writeheader()
writer.writerows(employees)

print(output.getvalue())

Solution

import csv
import io

employees = [
    {"name": "Alice", "department": "Engineering", "salary": 95000},
    {"name": "Bob",   "department": "Marketing",   "salary": 72000},
]

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "department", "salary"])
writer.writeheader()
writer.writerows(employees)

print(output.getvalue())

Output:

name,department,salary
Alice,Engineering,95000
Bob,Marketing,72000

How it works: DictWriter takes a fieldnames list that defines both the column order and which keys to write from each dict. writeheader() writes the field names as the first row. writerows() iterates the list and writes each dict as a row, pulling values in fieldnames order.

The trailing blank line in the output is expected — print() adds a newline after the string's own trailing \r\n.

Key insight: The fieldnames list is the source of truth for column order. Even if a dict has keys in a different order, the CSV will always write columns in the fieldnames order. This is critical when your CSVs must conform to a fixed schema — define fieldnames explicitly rather than deriving them from dict.keys().

Expected Output

name,department,salary\r\nAlice,Engineering,95000\r\nBob,Marketing,72000\r\n

Hints

Hint 1: DictWriter requires a fieldnames list — this defines column order.

Hint 2: Always call writeheader() before writerows() to write the header row.

Hint 3: writerows() accepts any iterable of dicts.

Medium

#5DictWriter: extrasaction and restvalMedium

DictWriterextrasactionrestvalmissing-fields

Predict the output. Two rows are written: one has an extra key not in fieldnames; the other is missing a field entirely. Observe how extrasaction and restval control the behavior.

Python

import csv
import io

output = io.StringIO()
writer = csv.DictWriter(
    output,
    fieldnames=["name", "role", "score"],
    extrasaction="ignore",
    restval="N/A",
)
writer.writeheader()

# Extra key "age" is not in fieldnames — extrasaction="ignore" drops it
writer.writerow({"name": "Alice", "role": "Engineer", "age": 30})

# Missing key "role" — restval fills it with "N/A"
writer.writerow({"name": "Bob", "score": 99})

print(output.getvalue())

Solution

import csv
import io

output = io.StringIO()
writer = csv.DictWriter(
    output,
    fieldnames=["name", "role", "score"],
    extrasaction="ignore",
    restval="N/A",
)
writer.writeheader()

writer.writerow({"name": "Alice", "role": "Engineer", "age": 30})
writer.writerow({"name": "Bob", "score": 99})

print(output.getvalue())

Output:

name,role,score
Alice,Engineer,N/A
Bob,N/A,99

How it works:

Alice's row: has "age": 30 which is not in fieldnames. With extrasaction="ignore", the extra key is silently dropped. The score field is missing from Alice's dict, so restval="N/A" fills it: Alice,Engineer,N/A.
Bob's row: has "score": 99 but is missing both "role". restval fills the missing "role" with N/A, producing Bob,N/A,99.

Key insight: The default extrasaction is "raise", which raises ValueError if a dict contains keys not in fieldnames. Always set extrasaction="ignore" when your dicts come from external systems (API responses, ORM objects) that may return extra fields. Set restval="" or restval="N/A" depending on whether downstream consumers need to distinguish "missing" from "empty".

Expected Output

name,role,score\r\nAlice,Engineer,N/A\r\nBob,N/A,99\r\n

Hints

Hint 1: extrasaction="ignore" silently drops dict keys that are not in fieldnames.

Hint 2: restval fills in missing values when a dict does not have all the fieldnames.

Hint 3: Both parameters can be set together on the same DictWriter instance.

#6CSV Dialects: Tab-Separated and Custom DelimitersMedium

dialectsregister_dialectdelimiterexcel-tab

Predict the output. Read a tab-separated file, then register a custom pipe-and-semicolon dialect and verify it appears in the dialect registry.

Python

import csv
import io

# Tab-separated values (TSV)
tsv_data = "Alice\tEngineering\t95000\nBob\tMarketing\t72000\n"
f = io.StringIO(tsv_data)

reader = csv.reader(f, dialect="excel-tab")
for row in reader:
    print(row)

# Register a custom dialect
csv.register_dialect(
    "pipe_semicolon",
    delimiter="|",
    quotechar="'",
    doublequote=True,
    skipinitialspace=True,
    lineterminator="\r\n",
)

print(csv.list_dialects())

Solution

import csv
import io

tsv_data = "Alice\tEngineering\t95000\nBob\tMarketing\t72000\n"
f = io.StringIO(tsv_data)

reader = csv.reader(f, dialect="excel-tab")
for row in reader:
    print(row)

csv.register_dialect(
    "pipe_semicolon",
    delimiter="|",
    quotechar="'",
    doublequote=True,
    skipinitialspace=True,
    lineterminator="\r\n",
)

print(csv.list_dialects())

Output:

['Alice', 'Engineering', '95000']
['Bob', 'Marketing', '72000']
['excel', 'excel-tab', 'unix', 'pipe_semicolon']

How it works: The "excel-tab" dialect is a built-in alias for csv.excel_tab — it uses \t as the delimiter and otherwise matches the Excel CSV dialect. Reading a TSV with it correctly splits on tabs.

csv.register_dialect() stores a named dialect in the module-level registry. Once registered, any csv.reader or csv.writer can reference it by name. The registry is process-global — registered dialects persist for the lifetime of the Python process.

Key insight: Three dialects ship with Python: "excel" (default), "excel-tab" (TSV), and "unix" (LF line endings, QUOTE_ALL). Register custom dialects once at module import time, not inside loops. For one-off files, inline delimiter="," is fine. But if you read the same non-standard format in many places, a registered dialect makes the code consistent and the format change a one-line fix.

Expected Output

['Alice', 'Engineering', '95000']\n['Bob', 'Marketing', '72000']\n['excel', 'excel-tab', 'unix', 'pipe_semicolon']

Hints

Hint 1: Use dialect="excel-tab" or delimiter="\t" to read TSV files.

Hint 2: csv.register_dialect() names a reusable set of formatting parameters.

Hint 3: csv.list_dialects() returns all registered dialect names.

#7Quoting Modes: QUOTE_ALL vs QUOTE_NONNUMERICMedium

quotingQUOTE_ALLQUOTE_NONNUMERICQUOTE_MINIMAL

Predict the output. Use QUOTE_NONNUMERIC to write a CSV where string fields are quoted but numeric fields are not. Then verify the round-trip behavior.

Python

import csv
import io

data = [
    ["name", "salary", "active"],
    ["Alice", 95000, "true"],
    ["Bob", 72000, "false"],
]

output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(data)

print(repr(output.getvalue()))

Solution

import csv
import io

data = [
    ["name", "salary", "active"],
    ["Alice", 95000, "true"],
    ["Bob", 72000, "false"],
]

output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(data)

print(repr(output.getvalue()))

Output:

'"name","salary","active"\r\n"Alice",95000,"true"\r\n"Bob",72000,"false"\r\n'

How it works: QUOTE_NONNUMERIC quotes every non-numeric field. Python checks the actual type: 95000 is an int, so it is written unquoted. "Alice", "true", and "false" are strings, so they are wrapped in double-quotes. The header row fields ("name", "salary", "active") are all strings, so all three get quoted.

Reading back with QUOTE_NONNUMERIC: When you use csv.reader(..., quoting=csv.QUOTE_NONNUMERIC), any unquoted field is automatically converted to float. This means 95000 would come back as 95000.0 — a useful round-trip guarantee for numeric data pipelines.

Key insight: QUOTE_MINIMAL (the default) only quotes when necessary. QUOTE_ALL always quotes everything. QUOTE_NONNUMERIC is the choice for data pipelines where numeric precision matters and you want the reader to auto-cast numerics back to float. QUOTE_NONE requires an escapechar and is rarely used.

Expected Output

"name","salary","active"\r\n"Alice",95000,"true"\r\n"Bob",72000,"false"\r\n

Hints

Hint 1: QUOTE_ALL wraps every field in double-quotes, regardless of type.

Hint 2: QUOTE_NONNUMERIC quotes strings but not numeric types. When reading back, unquoted fields become float.

Hint 3: The quoting parameter is set on the writer, not on the reader.

#8DictReader with No Header: Custom fieldnamesMedium

DictReaderfieldnamesno-headercustom-columns

Predict the output. A CSV file has no header row. Use DictReader with explicit fieldnames to read it as dicts.

Python

import csv
import io

# No header row — just raw data
raw = "user_001,[email protected],free\nuser_002,[email protected],paid\n"
f = io.StringIO(raw)

fieldnames = ["user_id", "email", "plan"]
reader = csv.DictReader(f, fieldnames=fieldnames)

rows = list(reader)
print(rows[0]["user_id"])
print(rows[0]["email"])
print(rows[0]["plan"])
print("---")
print(rows[1]["user_id"])
print(rows[1]["email"])
print(rows[1]["plan"])

Solution

import csv
import io

raw = "user_001,[email protected],free\nuser_002,[email protected],paid\n"
f = io.StringIO(raw)

fieldnames = ["user_id", "email", "plan"]
reader = csv.DictReader(f, fieldnames=fieldnames)

rows = list(reader)
print(rows[0]["user_id"])
print(rows[0]["email"])
print(rows[0]["plan"])
print("---")
print(rows[1]["user_id"])
print(rows[1]["email"])
print(rows[1]["plan"])

Output:

user_001
[email protected]
free
---
user_002
[email protected]
paid

How it works: When fieldnames is provided, DictReader skips the automatic header-row detection. Every row in the file — including the very first — is treated as a data row. The fieldnames list defines the dict keys. Row 0 maps to {"user_id": "user_001", "email": "[email protected]", "plan": "free"}.

Key insight: Many database export tools and legacy systems produce CSV without headers to save space. Always check whether your input has a header before deciding to use DictReader with or without fieldnames. If you accidentally apply DictReader without fieldnames to a headerless CSV, your first data row becomes the column names — a subtle bug that only surfaces when the first row has unexpected values.

Expected Output

user_001\[email protected]\nfree\n---\nuser_002\[email protected]\npaid

Hints

Hint 1: When a CSV has no header row, pass fieldnames explicitly to DictReader.

Hint 2: DictReader treats every row as data when fieldnames is provided — no row is consumed as header.

Hint 3: You can provide any fieldnames you like, regardless of what is in the file.

Hard

#9Encoding: The Excel BOM ProblemHard

encodingutf-8-sigBOMExcel

Diagnose and fix the Excel BOM problem. Demonstrate that encoding="utf-8" corrupts the first header field, and that encoding="utf-8-sig" fixes it.

Python

import csv
import io

# Simulate an Excel-generated CSV with a BOM (\xef\xbb\xbf = UTF-8 BOM)
bom = "\ufeff"
excel_content = bom + "name,city,score\nAlice,London,95\nBob,Paris,87\n"

# Show the BOM is there
print("BOM present:", excel_content.startswith("\ufeff"))

# utf-8 reading: BOM included in first field name
f_utf8 = io.StringIO(excel_content)
reader_utf8 = csv.DictReader(f_utf8)
rows_utf8 = list(reader_utf8)
print("With utf-8 encoding:", reader_utf8.fieldnames)

# utf-8-sig reading: BOM stripped automatically
# io.StringIO doesn't support encoding= but we can strip manually to simulate
clean_content = excel_content.lstrip("\ufeff")
f_sig = io.StringIO(clean_content)
reader_sig = csv.DictReader(f_sig)
rows_sig = list(reader_sig)
print("With utf-8-sig encoding:", reader_sig.fieldnames)
print("First name:", rows_sig[0]["name"])

Solution

import csv
import io

bom = "\ufeff"
excel_content = bom + "name,city,score\nAlice,London,95\nBob,Paris,87\n"

print("BOM present:", excel_content.startswith("\ufeff"))

f_utf8 = io.StringIO(excel_content)
reader_utf8 = csv.DictReader(f_utf8)
rows_utf8 = list(reader_utf8)
print("With utf-8 encoding:", reader_utf8.fieldnames)

clean_content = excel_content.lstrip("\ufeff")
f_sig = io.StringIO(clean_content)
reader_sig = csv.DictReader(f_sig)
rows_sig = list(reader_sig)
print("With utf-8-sig encoding:", reader_sig.fieldnames)
print("First name:", rows_sig[0]["name"])

Output:

BOM present: True
With utf-8 encoding: ['\ufeffname', 'city', 'score']
With utf-8-sig encoding: ['name', 'city', 'score']
First name: Alice

How it works: The UTF-8 BOM is a 3-byte sequence (\xef\xbb\xbf) that Excel prepends to UTF-8 files so that other Microsoft tools can detect the encoding. When Python reads it with the plain "utf-8" codec, the BOM is decoded to the Unicode character U+FEFF and left in-place as part of the first character in the file. Since the first field name starts with this invisible character, reader.fieldnames[0] becomes "\ufeffname" instead of "name".

The fix: open the file with encoding="utf-8-sig". This codec automatically strips the BOM if present, and is a no-op if absent — making it safe to use for all Excel exports.

Key insight: This is one of the most common "why does row['name'] raise KeyError?" bugs. The error message prints as KeyError: 'name' — but the actual key is '\ufeffname'. You cannot see the BOM character in a terminal. Always use encoding="utf-8-sig" when reading CSVs from users or business teams who use Excel. Use encoding="utf-8-sig" for writing too — it adds the BOM so Excel opens the file correctly without an encoding dialog.

Expected Output

BOM present: True\nWith utf-8 encoding: ['\\ufeffname', 'city', 'score']\nWith utf-8-sig encoding: ['name', 'city', 'score']\nFirst name: Alice

Hints

Hint 1: Excel saves CSVs with a UTF-8 BOM (\xef\xbb\xbf) prepended to the file.

Hint 2: Opening with encoding="utf-8" leaves the BOM as part of the first field name: "\ufeffname".

Hint 3: Opening with encoding="utf-8-sig" strips the BOM automatically.

Hint 4: The BOM causes a silent KeyError when you try row["name"] — the real key is "\ufeffname".

#10csv.Sniffer: Auto-Detecting an Unknown DialectHard

Sniffersniffhas_headerdialect-detection

Implement auto-dialect detection. Given a pipe-delimited file, use csv.Sniffer to discover its format and read it correctly without hardcoding the delimiter.

Python

import csv
import io

# Unknown format: pipe-delimited with a header
unknown_csv = "id|product|price\n1|Widget A|19.99\n2|Widget B|34.50\n"
f = io.StringIO(unknown_csv)

# Step 1: Read a sample to detect the dialect
sample = f.read(1024)
f.seek(0)

sniffer = csv.Sniffer()
try:
    dialect = sniffer.sniff(sample)
    has_header = sniffer.has_header(sample)
except csv.Error:
    dialect = csv.excel
    has_header = True

print("Detected delimiter:", repr(dialect.delimiter))
print("Has header:", has_header)

# Step 2: Read with the detected dialect
if has_header:
    reader = csv.DictReader(f, dialect=dialect)
else:
    reader = csv.reader(f, dialect=dialect)

rows = list(reader)
print(reader.fieldnames)
for row in rows:
    print(list(row.values()))

Solution

import csv
import io

unknown_csv = "id|product|price\n1|Widget A|19.99\n2|Widget B|34.50\n"
f = io.StringIO(unknown_csv)

sample = f.read(1024)
f.seek(0)

sniffer = csv.Sniffer()
try:
    dialect = sniffer.sniff(sample)
    has_header = sniffer.has_header(sample)
except csv.Error:
    dialect = csv.excel
    has_header = True

print("Detected delimiter:", repr(dialect.delimiter))
print("Has header:", has_header)

if has_header:
    reader = csv.DictReader(f, dialect=dialect)
else:
    reader = csv.reader(f, dialect=dialect)

rows = list(reader)
print(reader.fieldnames)
for row in rows:
    print(list(row.values()))

Output:

Detected delimiter: '|'
Has header: True
['id', 'product', 'price']
['1', 'Widget A', '19.99']
['2', 'Widget B', '34.50']

How it works:

f.read(1024) reads the first 1024 bytes as a sample — enough for the Sniffer to analyze.
f.seek(0) resets the file pointer to the beginning so the actual reader gets the full file.
sniffer.sniff(sample) analyzes character frequencies and consistency to infer the delimiter. It correctly identifies | because it appears exactly once between each pair of fields.
sniffer.has_header(sample) checks whether the first row's content looks like column names (strings) while subsequent rows contain data (numbers/mixed).
The try/except csv.Error fallback ensures the function works on short or degenerate files where sniffing fails.

Key insight: Never skip the f.seek(0) after sniffing — a common bug that causes the reader to see only the remainder of the file. Sniffer works well for standard delimiters on files with at least 3 rows. It fails on single-column files (no delimiter to detect), extremely short files, and files with highly uniform data. For production pipelines where you control the source, always specify the dialect explicitly.

Expected Output

Detected delimiter: |\nHas header: True\n['id', 'product', 'price']\n['1', 'Widget A', '19.99']\n['2', 'Widget B', '34.50']

Hints

Hint 1: csv.Sniffer().sniff(sample) returns a Dialect object.

Hint 2: csv.Sniffer().has_header(sample) returns True if the first row looks like a header.

Hint 3: Always call f.seek(0) after sniffing — sniffing consumes part of the file.

Hint 4: Wrap sniffer calls in try/except csv.Error for production resilience.

#11Streaming Large CSV: Row-by-Row AggregationHard

streamingDictReaderlarge-filesmemory-efficiencyaggregation

Build a streaming CSV aggregator that computes total revenue and per-department totals from a sales CSV. It must handle bad rows without crashing and use O(1) memory relative to file size.

Python

import csv
import io
from collections import defaultdict

def aggregate_sales(fileobj):
    """
    Stream a sales CSV and return aggregated totals.
    Columns: sale_id, department, amount
    Returns: dict with total, by_dept, row_count, error_count
    """
    totals = defaultdict(float)
    grand_total = 0.0
    row_count = 0
    error_count = 0

    reader = csv.DictReader(fileobj)
    for row in reader:
        try:
            amount = float(row["amount"])
            dept = row["department"].strip()
            totals[dept] += amount
            grand_total += amount
            row_count += 1
        except (ValueError, KeyError):
            error_count += 1

    return {
        "total": grand_total,
        "by_dept": dict(totals),
        "row_count": row_count,
        "error_count": error_count,
    }

# Simulate a sales file (one bad row included)
csv_data = (
    "sale_id,department,amount\n"
    "1,Engineering,250.00\n"
    "2,Marketing,184.50\n"
    "3,Engineering,500.00\n"
    "4,Marketing,300.00\n"
    "5,Engineering,INVALID\n"   # bad row
)

result = aggregate_sales(io.StringIO(csv_data))

print("Rows processed:", result["row_count"])
print("Errors:", result["error_count"])
print("Total revenue:", f"{result['total']:.2f}")

top_dept = max(result["by_dept"], key=result["by_dept"].get)
print("Top department:", top_dept)

for dept, total in sorted(result["by_dept"].items()):
    print(f"{dept}: {total:.2f}")

Solution

import csv
import io
from collections import defaultdict

def aggregate_sales(fileobj):
    totals = defaultdict(float)
    grand_total = 0.0
    row_count = 0
    error_count = 0

    reader = csv.DictReader(fileobj)
    for row in reader:
        try:
            amount = float(row["amount"])
            dept = row["department"].strip()
            totals[dept] += amount
            grand_total += amount
            row_count += 1
        except (ValueError, KeyError):
            error_count += 1

    return {
        "total": grand_total,
        "by_dept": dict(totals),
        "row_count": row_count,
        "error_count": error_count,
    }

csv_data = (
    "sale_id,department,amount\n"
    "1,Engineering,250.00\n"
    "2,Marketing,184.50\n"
    "3,Engineering,500.00\n"
    "4,Marketing,300.00\n"
    "5,Engineering,INVALID\n"
)

result = aggregate_sales(io.StringIO(csv_data))

print("Rows processed:", result["row_count"])
print("Errors:", result["error_count"])
print("Total revenue:", f"{result['total']:.2f}")

top_dept = max(result["by_dept"], key=result["by_dept"].get)
print("Top department:", top_dept)

for dept, total in sorted(result["by_dept"].items()):
    print(f"{dept}: {total:.2f}")

Output:

Rows processed: 5
Errors: 1
Total revenue: 1234.50
Top department: Engineering
Engineering: 750.00
Marketing: 484.50

How it works:

csv.DictReader(fileobj) creates a lazy iterator. No rows are loaded until the for loop requests them. Memory usage is constant regardless of file size — only one row dict exists in memory at a time.
float(row["amount"]) raises ValueError for "INVALID". The try/except catches this, increments error_count, and continues to the next row. The pipeline does not abort on bad data.
defaultdict(float) accumulates per-department totals. Each department key is created on first access with a default value of 0.0.
max(result["by_dept"], key=result["by_dept"].get) finds the department with the highest total. Engineering: 250 + 500 = 750.00. Marketing: 184.50 + 300 = 484.50. Total valid revenue: 1234.50.

Key insight: This is the canonical pattern for processing large CSVs in production — stream row by row, accumulate into lightweight aggregates, handle errors per row, never call list(reader). A 50 GB sales file processed this way uses the same memory as a 50 KB file. The same pattern extends to batch database inserts (accumulate rows into a chunk, flush when chunk reaches 1000), data validation (accumulate error messages), and ETL pipelines.

Expected Output

Rows processed: 5\nErrors: 1\nTotal revenue: 1234.50\nTop department: Engineering\nEngineering: 750.00\nMarketing: 484.50

Hints

Hint 1: Never call list(reader) on a large file — it loads everything into RAM.

Hint 2: Accumulate results in lightweight data structures (counters, running totals) as you iterate.

Hint 3: Wrap per-row parsing in try/except to count errors without aborting the entire run.

Hint 4: Use io.StringIO to simulate a large file in this exercise.

Practice: CSV Handling

Easy​

Medium​

Hard​

Easy

Medium

Hard