Python CSV Handling Practice Problems & Exercises
Practice: CSV Handling
← Back to lessonEasy
Predict the output. A CSV is written to a temp file, then read back with csv.reader. Notice the type of each field.
import csv
import io
# Simulate a CSV file in memory
raw = "name,city,score\nAlice,London,95\nBob,Paris,87\n"
f = io.StringIO(raw)
reader = csv.reader(f)
header = next(reader)
print(header)
rows = list(reader)
for row in rows:
print(row)
# How many data rows (excluding header)?
print(len(rows))Solution
import csv
import io
raw = "name,city,score\nAlice,London,95\nBob,Paris,87\n"
f = io.StringIO(raw)
reader = csv.reader(f)
header = next(reader)
print(header)
rows = list(reader)
for row in rows:
print(row)
print(len(rows))
Output:
['name', 'city', 'score']
['Alice', 'London', '95']
['Bob', 'Paris', '87']
3
How it works: csv.reader wraps any iterable that yields lines. next(reader) consumes the first row (the header) and returns it as a list of strings. The remaining rows are collected with list(reader). Every field — including 95 and 87 — is a string. The csv module never converts types automatically.
Key insight: The result is always list[str]. You must cast numeric fields yourself: int(row[2]) or float(row[2]). A common bug is trying to do arithmetic on row["score"] without first converting it — you get string concatenation instead of addition.
Expected Output
['name', 'city', 'score']\n['Alice', 'London', '95']\n['Bob', 'Paris', '87']\n3Hints
Hint 1: Always open CSV files with newline="" to let the csv module handle line endings.
Hint 2: csv.reader yields each row as a list of strings — all values are strings, even numbers.
Hint 3: next(reader) advances the iterator by one row and returns it.
Predict the output. csv.writer handles quoting automatically for fields that contain commas or double-quotes.
import csv import io output = io.StringIO() writer = csv.writer(output) writer.writerow(["id", "name", "notes"]) writer.writerow([1, "Alice", "Works on NLP, ML"]) writer.writerow([2, "Bob", 'Said "hello" to all']) print(repr(output.getvalue()))
Solution
import csv
import io
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["id", "name", "notes"])
writer.writerow([1, "Alice", "Works on NLP, ML"])
writer.writerow([2, "Bob", 'Said "hello" to all'])
print(repr(output.getvalue()))
Output:
'id,name,notes\r\n1,Alice,"Works on NLP, ML"\r\n2,Bob,"Said ""hello"" to all"\r\n'
How it works: csv.writer applies RFC 4180 quoting automatically:
"Works on NLP, ML"contains a comma — the writer wraps the whole field in double-quotes.'Said "hello" to all'contains double-quotes — the writer wraps the field in double-quotes and escapes each internal double-quote by doubling it:"".- Line endings default to
\r\n(the RFC 4180 standard) regardless of the OS. This is why you must open real files withnewline=""— without it, Python's universal newlines would turn\r\ninto\r\r\n.
Key insight: You never need to manually escape commas or quotes. If you write "Alice,\"Bob\"" into a CSV field yourself, you are doing it wrong. Give csv.writer the raw value and it handles the escaping.
Expected Output
id,name,notes\r\n1,Alice,"Works on NLP, ML"\r\n2,Bob,"Said ""hello"" to all"\r\nHints
Hint 1: csv.writer automatically quotes fields that contain the delimiter or a double-quote.
Hint 2: Open the file with newline="" when writing — otherwise Python adds an extra blank line on Windows.
Hint 3: io.StringIO() lets you capture writer output as a string without touching the filesystem.
Predict the output. csv.DictReader reads each row as a dict keyed by column name. Verify the field names and access pattern.
import csv
import io
raw = "name,department,salary\nAlice,Engineering,95000\nBob,Marketing,72000\n"
f = io.StringIO(raw)
reader = csv.DictReader(f)
# fieldnames is populated on first access/iteration
rows = list(reader)
print(reader.fieldnames)
for row in rows:
print(row["name"], row["department"], row["salary"])Solution
import csv
import io
raw = "name,department,salary\nAlice,Engineering,95000\nBob,Marketing,72000\n"
f = io.StringIO(raw)
reader = csv.DictReader(f)
rows = list(reader)
print(reader.fieldnames)
for row in rows:
print(row["name"], row["department"], row["salary"])
Output:
['name', 'department', 'salary']
Alice Engineering 95000
Bob Marketing 72000
How it works: csv.DictReader reads the first row as the list of field names and stores it in reader.fieldnames. Every subsequent row is returned as a dict mapping each field name to its string value. The fieldnames attribute is populated lazily — it is None until the first row is consumed, which is why we print it after list(reader).
Key insight: DictReader is the preferred interface for production code. Using positional indices like row[2] breaks silently when columns are added, removed, or reordered. Using row["salary"] is self-documenting and robust to schema changes. The only cost is a small memory overhead for the dict construction — negligible for all practical purposes.
Expected Output
['name', 'department', 'salary']\nAlice Engineering 95000\nBob Marketing 72000Hints
Hint 1: csv.DictReader uses the first row as keys by default.
Hint 2: reader.fieldnames is populated after the first row is read.
Hint 3: Access fields by name: row["salary"], not by index.
Predict the output. csv.DictWriter writes a list of dicts to CSV. Notice that column order is controlled by fieldnames, not the dict insertion order.
import csv
import io
employees = [
{"name": "Alice", "department": "Engineering", "salary": 95000},
{"name": "Bob", "department": "Marketing", "salary": 72000},
]
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "department", "salary"])
writer.writeheader()
writer.writerows(employees)
print(output.getvalue())Solution
import csv
import io
employees = [
{"name": "Alice", "department": "Engineering", "salary": 95000},
{"name": "Bob", "department": "Marketing", "salary": 72000},
]
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["name", "department", "salary"])
writer.writeheader()
writer.writerows(employees)
print(output.getvalue())
Output:
name,department,salary
Alice,Engineering,95000
Bob,Marketing,72000
How it works: DictWriter takes a fieldnames list that defines both the column order and which keys to write from each dict. writeheader() writes the field names as the first row. writerows() iterates the list and writes each dict as a row, pulling values in fieldnames order.
The trailing blank line in the output is expected — print() adds a newline after the string's own trailing \r\n.
Key insight: The fieldnames list is the source of truth for column order. Even if a dict has keys in a different order, the CSV will always write columns in the fieldnames order. This is critical when your CSVs must conform to a fixed schema — define fieldnames explicitly rather than deriving them from dict.keys().
Expected Output
name,department,salary\r\nAlice,Engineering,95000\r\nBob,Marketing,72000\r\nHints
Hint 1: DictWriter requires a fieldnames list — this defines column order.
Hint 2: Always call writeheader() before writerows() to write the header row.
Hint 3: writerows() accepts any iterable of dicts.
Medium
Predict the output. Two rows are written: one has an extra key not in fieldnames; the other is missing a field entirely. Observe how extrasaction and restval control the behavior.
import csv
import io
output = io.StringIO()
writer = csv.DictWriter(
output,
fieldnames=["name", "role", "score"],
extrasaction="ignore",
restval="N/A",
)
writer.writeheader()
# Extra key "age" is not in fieldnames — extrasaction="ignore" drops it
writer.writerow({"name": "Alice", "role": "Engineer", "age": 30})
# Missing key "role" — restval fills it with "N/A"
writer.writerow({"name": "Bob", "score": 99})
print(output.getvalue())Solution
import csv
import io
output = io.StringIO()
writer = csv.DictWriter(
output,
fieldnames=["name", "role", "score"],
extrasaction="ignore",
restval="N/A",
)
writer.writeheader()
writer.writerow({"name": "Alice", "role": "Engineer", "age": 30})
writer.writerow({"name": "Bob", "score": 99})
print(output.getvalue())
Output:
name,role,score
Alice,Engineer,N/A
Bob,N/A,99
How it works:
-
Alice's row: has
"age": 30which is not infieldnames. Withextrasaction="ignore", the extra key is silently dropped. Thescorefield is missing from Alice's dict, sorestval="N/A"fills it:Alice,Engineer,N/A. -
Bob's row: has
"score": 99but is missing both"role".restvalfills the missing"role"withN/A, producingBob,N/A,99.
Key insight: The default extrasaction is "raise", which raises ValueError if a dict contains keys not in fieldnames. Always set extrasaction="ignore" when your dicts come from external systems (API responses, ORM objects) that may return extra fields. Set restval="" or restval="N/A" depending on whether downstream consumers need to distinguish "missing" from "empty".
Expected Output
name,role,score\r\nAlice,Engineer,N/A\r\nBob,N/A,99\r\nHints
Hint 1: extrasaction="ignore" silently drops dict keys that are not in fieldnames.
Hint 2: restval fills in missing values when a dict does not have all the fieldnames.
Hint 3: Both parameters can be set together on the same DictWriter instance.
Predict the output. Read a tab-separated file, then register a custom pipe-and-semicolon dialect and verify it appears in the dialect registry.
import csv
import io
# Tab-separated values (TSV)
tsv_data = "Alice\tEngineering\t95000\nBob\tMarketing\t72000\n"
f = io.StringIO(tsv_data)
reader = csv.reader(f, dialect="excel-tab")
for row in reader:
print(row)
# Register a custom dialect
csv.register_dialect(
"pipe_semicolon",
delimiter="|",
quotechar="'",
doublequote=True,
skipinitialspace=True,
lineterminator="\r\n",
)
print(csv.list_dialects())Solution
import csv
import io
tsv_data = "Alice\tEngineering\t95000\nBob\tMarketing\t72000\n"
f = io.StringIO(tsv_data)
reader = csv.reader(f, dialect="excel-tab")
for row in reader:
print(row)
csv.register_dialect(
"pipe_semicolon",
delimiter="|",
quotechar="'",
doublequote=True,
skipinitialspace=True,
lineterminator="\r\n",
)
print(csv.list_dialects())
Output:
['Alice', 'Engineering', '95000']
['Bob', 'Marketing', '72000']
['excel', 'excel-tab', 'unix', 'pipe_semicolon']
How it works: The "excel-tab" dialect is a built-in alias for csv.excel_tab — it uses \t as the delimiter and otherwise matches the Excel CSV dialect. Reading a TSV with it correctly splits on tabs.
csv.register_dialect() stores a named dialect in the module-level registry. Once registered, any csv.reader or csv.writer can reference it by name. The registry is process-global — registered dialects persist for the lifetime of the Python process.
Key insight: Three dialects ship with Python: "excel" (default), "excel-tab" (TSV), and "unix" (LF line endings, QUOTE_ALL). Register custom dialects once at module import time, not inside loops. For one-off files, inline delimiter="," is fine. But if you read the same non-standard format in many places, a registered dialect makes the code consistent and the format change a one-line fix.
Expected Output
['Alice', 'Engineering', '95000']\n['Bob', 'Marketing', '72000']\n['excel', 'excel-tab', 'unix', 'pipe_semicolon']Hints
Hint 1: Use dialect="excel-tab" or delimiter="\t" to read TSV files.
Hint 2: csv.register_dialect() names a reusable set of formatting parameters.
Hint 3: csv.list_dialects() returns all registered dialect names.
Predict the output. Use QUOTE_NONNUMERIC to write a CSV where string fields are quoted but numeric fields are not. Then verify the round-trip behavior.
import csv
import io
data = [
["name", "salary", "active"],
["Alice", 95000, "true"],
["Bob", 72000, "false"],
]
output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(data)
print(repr(output.getvalue()))Solution
import csv
import io
data = [
["name", "salary", "active"],
["Alice", 95000, "true"],
["Bob", 72000, "false"],
]
output = io.StringIO()
writer = csv.writer(output, quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(data)
print(repr(output.getvalue()))
Output:
'"name","salary","active"\r\n"Alice",95000,"true"\r\n"Bob",72000,"false"\r\n'
How it works: QUOTE_NONNUMERIC quotes every non-numeric field. Python checks the actual type: 95000 is an int, so it is written unquoted. "Alice", "true", and "false" are strings, so they are wrapped in double-quotes. The header row fields ("name", "salary", "active") are all strings, so all three get quoted.
Reading back with QUOTE_NONNUMERIC: When you use csv.reader(..., quoting=csv.QUOTE_NONNUMERIC), any unquoted field is automatically converted to float. This means 95000 would come back as 95000.0 — a useful round-trip guarantee for numeric data pipelines.
Key insight: QUOTE_MINIMAL (the default) only quotes when necessary. QUOTE_ALL always quotes everything. QUOTE_NONNUMERIC is the choice for data pipelines where numeric precision matters and you want the reader to auto-cast numerics back to float. QUOTE_NONE requires an escapechar and is rarely used.
Expected Output
"name","salary","active"\r\n"Alice",95000,"true"\r\n"Bob",72000,"false"\r\nHints
Hint 1: QUOTE_ALL wraps every field in double-quotes, regardless of type.
Hint 2: QUOTE_NONNUMERIC quotes strings but not numeric types. When reading back, unquoted fields become float.
Hint 3: The quoting parameter is set on the writer, not on the reader.
Predict the output. A CSV file has no header row. Use DictReader with explicit fieldnames to read it as dicts.
import csv import io # No header row — just raw data raw = "user_001,[email protected],free\nuser_002,[email protected],paid\n" f = io.StringIO(raw) fieldnames = ["user_id", "email", "plan"] reader = csv.DictReader(f, fieldnames=fieldnames) rows = list(reader) print(rows[0]["user_id"]) print(rows[0]["email"]) print(rows[0]["plan"]) print("---") print(rows[1]["user_id"]) print(rows[1]["email"]) print(rows[1]["plan"])
Solution
import csv
import io
f = io.StringIO(raw)
fieldnames = ["user_id", "email", "plan"]
reader = csv.DictReader(f, fieldnames=fieldnames)
rows = list(reader)
print(rows[0]["user_id"])
print(rows[0]["email"])
print(rows[0]["plan"])
print("---")
print(rows[1]["user_id"])
print(rows[1]["email"])
print(rows[1]["plan"])
Output:
user_001
free
---
user_002
paid
How it works: When fieldnames is provided, DictReader skips the automatic header-row detection. Every row in the file — including the very first — is treated as a data row. The fieldnames list defines the dict keys. Row 0 maps to {"user_id": "user_001", "email": "[email protected]", "plan": "free"}.
Key insight: Many database export tools and legacy systems produce CSV without headers to save space. Always check whether your input has a header before deciding to use DictReader with or without fieldnames. If you accidentally apply DictReader without fieldnames to a headerless CSV, your first data row becomes the column names — a subtle bug that only surfaces when the first row has unexpected values.
Expected Output
user_001\[email protected]\nfree\n---\nuser_002\[email protected]\npaidHints
Hint 1: When a CSV has no header row, pass fieldnames explicitly to DictReader.
Hint 2: DictReader treats every row as data when fieldnames is provided — no row is consumed as header.
Hint 3: You can provide any fieldnames you like, regardless of what is in the file.
Hard
Diagnose and fix the Excel BOM problem. Demonstrate that encoding="utf-8" corrupts the first header field, and that encoding="utf-8-sig" fixes it.
import csv
import io
# Simulate an Excel-generated CSV with a BOM (\xef\xbb\xbf = UTF-8 BOM)
bom = "\ufeff"
excel_content = bom + "name,city,score\nAlice,London,95\nBob,Paris,87\n"
# Show the BOM is there
print("BOM present:", excel_content.startswith("\ufeff"))
# utf-8 reading: BOM included in first field name
f_utf8 = io.StringIO(excel_content)
reader_utf8 = csv.DictReader(f_utf8)
rows_utf8 = list(reader_utf8)
print("With utf-8 encoding:", reader_utf8.fieldnames)
# utf-8-sig reading: BOM stripped automatically
# io.StringIO doesn't support encoding= but we can strip manually to simulate
clean_content = excel_content.lstrip("\ufeff")
f_sig = io.StringIO(clean_content)
reader_sig = csv.DictReader(f_sig)
rows_sig = list(reader_sig)
print("With utf-8-sig encoding:", reader_sig.fieldnames)
print("First name:", rows_sig[0]["name"])Solution
import csv
import io
bom = "\ufeff"
excel_content = bom + "name,city,score\nAlice,London,95\nBob,Paris,87\n"
print("BOM present:", excel_content.startswith("\ufeff"))
f_utf8 = io.StringIO(excel_content)
reader_utf8 = csv.DictReader(f_utf8)
rows_utf8 = list(reader_utf8)
print("With utf-8 encoding:", reader_utf8.fieldnames)
clean_content = excel_content.lstrip("\ufeff")
f_sig = io.StringIO(clean_content)
reader_sig = csv.DictReader(f_sig)
rows_sig = list(reader_sig)
print("With utf-8-sig encoding:", reader_sig.fieldnames)
print("First name:", rows_sig[0]["name"])
Output:
BOM present: True
With utf-8 encoding: ['\ufeffname', 'city', 'score']
With utf-8-sig encoding: ['name', 'city', 'score']
First name: Alice
How it works: The UTF-8 BOM is a 3-byte sequence (\xef\xbb\xbf) that Excel prepends to UTF-8 files so that other Microsoft tools can detect the encoding. When Python reads it with the plain "utf-8" codec, the BOM is decoded to the Unicode character U+FEFF and left in-place as part of the first character in the file. Since the first field name starts with this invisible character, reader.fieldnames[0] becomes "\ufeffname" instead of "name".
The fix: open the file with encoding="utf-8-sig". This codec automatically strips the BOM if present, and is a no-op if absent — making it safe to use for all Excel exports.
Key insight: This is one of the most common "why does row['name'] raise KeyError?" bugs. The error message prints as KeyError: 'name' — but the actual key is '\ufeffname'. You cannot see the BOM character in a terminal. Always use encoding="utf-8-sig" when reading CSVs from users or business teams who use Excel. Use encoding="utf-8-sig" for writing too — it adds the BOM so Excel opens the file correctly without an encoding dialog.
Expected Output
BOM present: True\nWith utf-8 encoding: ['\\ufeffname', 'city', 'score']\nWith utf-8-sig encoding: ['name', 'city', 'score']\nFirst name: AliceHints
Hint 1: Excel saves CSVs with a UTF-8 BOM (\xef\xbb\xbf) prepended to the file.
Hint 2: Opening with encoding="utf-8" leaves the BOM as part of the first field name: "\ufeffname".
Hint 3: Opening with encoding="utf-8-sig" strips the BOM automatically.
Hint 4: The BOM causes a silent KeyError when you try row["name"] — the real key is "\ufeffname".
Implement auto-dialect detection. Given a pipe-delimited file, use csv.Sniffer to discover its format and read it correctly without hardcoding the delimiter.
import csv
import io
# Unknown format: pipe-delimited with a header
unknown_csv = "id|product|price\n1|Widget A|19.99\n2|Widget B|34.50\n"
f = io.StringIO(unknown_csv)
# Step 1: Read a sample to detect the dialect
sample = f.read(1024)
f.seek(0)
sniffer = csv.Sniffer()
try:
dialect = sniffer.sniff(sample)
has_header = sniffer.has_header(sample)
except csv.Error:
dialect = csv.excel
has_header = True
print("Detected delimiter:", repr(dialect.delimiter))
print("Has header:", has_header)
# Step 2: Read with the detected dialect
if has_header:
reader = csv.DictReader(f, dialect=dialect)
else:
reader = csv.reader(f, dialect=dialect)
rows = list(reader)
print(reader.fieldnames)
for row in rows:
print(list(row.values()))Solution
import csv
import io
unknown_csv = "id|product|price\n1|Widget A|19.99\n2|Widget B|34.50\n"
f = io.StringIO(unknown_csv)
sample = f.read(1024)
f.seek(0)
sniffer = csv.Sniffer()
try:
dialect = sniffer.sniff(sample)
has_header = sniffer.has_header(sample)
except csv.Error:
dialect = csv.excel
has_header = True
print("Detected delimiter:", repr(dialect.delimiter))
print("Has header:", has_header)
if has_header:
reader = csv.DictReader(f, dialect=dialect)
else:
reader = csv.reader(f, dialect=dialect)
rows = list(reader)
print(reader.fieldnames)
for row in rows:
print(list(row.values()))
Output:
Detected delimiter: '|'
Has header: True
['id', 'product', 'price']
['1', 'Widget A', '19.99']
['2', 'Widget B', '34.50']
How it works:
f.read(1024)reads the first 1024 bytes as a sample — enough for the Sniffer to analyze.f.seek(0)resets the file pointer to the beginning so the actual reader gets the full file.sniffer.sniff(sample)analyzes character frequencies and consistency to infer the delimiter. It correctly identifies|because it appears exactly once between each pair of fields.sniffer.has_header(sample)checks whether the first row's content looks like column names (strings) while subsequent rows contain data (numbers/mixed).- The
try/except csv.Errorfallback ensures the function works on short or degenerate files where sniffing fails.
Key insight: Never skip the f.seek(0) after sniffing — a common bug that causes the reader to see only the remainder of the file. Sniffer works well for standard delimiters on files with at least 3 rows. It fails on single-column files (no delimiter to detect), extremely short files, and files with highly uniform data. For production pipelines where you control the source, always specify the dialect explicitly.
Expected Output
Detected delimiter: |\nHas header: True\n['id', 'product', 'price']\n['1', 'Widget A', '19.99']\n['2', 'Widget B', '34.50']Hints
Hint 1: csv.Sniffer().sniff(sample) returns a Dialect object.
Hint 2: csv.Sniffer().has_header(sample) returns True if the first row looks like a header.
Hint 3: Always call f.seek(0) after sniffing — sniffing consumes part of the file.
Hint 4: Wrap sniffer calls in try/except csv.Error for production resilience.
Build a streaming CSV aggregator that computes total revenue and per-department totals from a sales CSV. It must handle bad rows without crashing and use O(1) memory relative to file size.
import csv
import io
from collections import defaultdict
def aggregate_sales(fileobj):
"""
Stream a sales CSV and return aggregated totals.
Columns: sale_id, department, amount
Returns: dict with total, by_dept, row_count, error_count
"""
totals = defaultdict(float)
grand_total = 0.0
row_count = 0
error_count = 0
reader = csv.DictReader(fileobj)
for row in reader:
try:
amount = float(row["amount"])
dept = row["department"].strip()
totals[dept] += amount
grand_total += amount
row_count += 1
except (ValueError, KeyError):
error_count += 1
return {
"total": grand_total,
"by_dept": dict(totals),
"row_count": row_count,
"error_count": error_count,
}
# Simulate a sales file (one bad row included)
csv_data = (
"sale_id,department,amount\n"
"1,Engineering,250.00\n"
"2,Marketing,184.50\n"
"3,Engineering,500.00\n"
"4,Marketing,300.00\n"
"5,Engineering,INVALID\n" # bad row
)
result = aggregate_sales(io.StringIO(csv_data))
print("Rows processed:", result["row_count"])
print("Errors:", result["error_count"])
print("Total revenue:", f"{result['total']:.2f}")
top_dept = max(result["by_dept"], key=result["by_dept"].get)
print("Top department:", top_dept)
for dept, total in sorted(result["by_dept"].items()):
print(f"{dept}: {total:.2f}")Solution
import csv
import io
from collections import defaultdict
def aggregate_sales(fileobj):
totals = defaultdict(float)
grand_total = 0.0
row_count = 0
error_count = 0
reader = csv.DictReader(fileobj)
for row in reader:
try:
amount = float(row["amount"])
dept = row["department"].strip()
totals[dept] += amount
grand_total += amount
row_count += 1
except (ValueError, KeyError):
error_count += 1
return {
"total": grand_total,
"by_dept": dict(totals),
"row_count": row_count,
"error_count": error_count,
}
csv_data = (
"sale_id,department,amount\n"
"1,Engineering,250.00\n"
"2,Marketing,184.50\n"
"3,Engineering,500.00\n"
"4,Marketing,300.00\n"
"5,Engineering,INVALID\n"
)
result = aggregate_sales(io.StringIO(csv_data))
print("Rows processed:", result["row_count"])
print("Errors:", result["error_count"])
print("Total revenue:", f"{result['total']:.2f}")
top_dept = max(result["by_dept"], key=result["by_dept"].get)
print("Top department:", top_dept)
for dept, total in sorted(result["by_dept"].items()):
print(f"{dept}: {total:.2f}")
Output:
Rows processed: 5
Errors: 1
Total revenue: 1234.50
Top department: Engineering
Engineering: 750.00
Marketing: 484.50
How it works:
-
csv.DictReader(fileobj)creates a lazy iterator. No rows are loaded until theforloop requests them. Memory usage is constant regardless of file size — only one row dict exists in memory at a time. -
float(row["amount"])raisesValueErrorfor"INVALID". Thetry/exceptcatches this, incrementserror_count, and continues to the next row. The pipeline does not abort on bad data. -
defaultdict(float)accumulates per-department totals. Each department key is created on first access with a default value of0.0. -
max(result["by_dept"], key=result["by_dept"].get)finds the department with the highest total. Engineering:250 + 500 = 750.00. Marketing:184.50 + 300 = 484.50. Total valid revenue:1234.50.
Key insight: This is the canonical pattern for processing large CSVs in production — stream row by row, accumulate into lightweight aggregates, handle errors per row, never call list(reader). A 50 GB sales file processed this way uses the same memory as a 50 KB file. The same pattern extends to batch database inserts (accumulate rows into a chunk, flush when chunk reaches 1000), data validation (accumulate error messages), and ETL pipelines.
Expected Output
Rows processed: 5\nErrors: 1\nTotal revenue: 1234.50\nTop department: Engineering\nEngineering: 750.00\nMarketing: 484.50Hints
Hint 1: Never call list(reader) on a large file — it loads everything into RAM.
Hint 2: Accumulate results in lightweight data structures (counters, running totals) as you iterate.
Hint 3: Wrap per-row parsing in try/except to count errors without aborting the entire run.
Hint 4: Use io.StringIO to simulate a large file in this exercise.
