Reading Files - open(), Modes, Encodings, and Buffering
Reading time: ~18 minutes | Level: Foundation → Engineering
Here is a question that trips up experienced developers:
# You have a 10 GB server log file.
# What does this code do?
with open("server.log", "r") as f:
    content = f.read()
print(len(content))
The code works - but it loads the entire 10 GB file into RAM as a single string. On a machine with 8 GB of RAM, the process will typically fail with MemoryError (or swap heavily, or be killed by the OS) before it even reaches print.
The "right" answer is almost never f.read() for files of unknown size. This page explains why, what open() actually creates under the hood, and the full toolkit for reading files efficiently - from a 100-byte config file to a multi-gigabyte dataset.
What You Will Learn
- Every parameter of open() - file, mode, buffering, encoding, errors, newline
- All file modes: 'r', 'rb', 'rt', 'r+' - what each means at the OS level
- How Python translates newlines in text mode and when to bypass that
- Encodings: why UTF-8 is the correct default, what UnicodeDecodeError means, and all the errors parameter values
- All reading methods: read(), readline(), readlines(), and iteration - when to use each
- File object internals: buffered I/O layers and the underlying OS file descriptor
- Memory-efficient reading: generators, chunked reads, and memory-mapped files with mmap
- Real-world patterns: streaming 10 GB log files, parsing CSV line-by-line, loading config files
Prerequisites
- Python 3.8+ installed and running
- Basic understanding of Python data types: str, bytes, list
- Familiarity with for loops and with statements at a syntactic level
Mental Model: What open() Actually Creates
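Most developers think open("file.txt") connects your code directly to a file on disk. The reality is a three-layer stack: TextIOWrapper (decoding and newline handling) sits on top of BufferedReader (an in-memory chunk buffer), which sits on top of FileIO (raw OS calls).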
When you call f.read(1024), Python:
- Asks TextIOWrapper for 1024 characters
- TextIOWrapper asks BufferedReader for bytes
- BufferedReader checks its internal buffer - if enough bytes are there, returns them; otherwise asks FileIO for a new chunk (typically 8 KB)
- FileIO issues the OS read() system call
- The OS reads from disk (or page cache) and returns raw bytes
- Bytes flow back up, TextIOWrapper decodes them to str
Understanding this stack explains why buffering=0 only works in binary mode (it removes the BufferedReader layer, which TextIOWrapper depends on), and why large files benefit from chunked reading.
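The buffering=0 constraint can be checked directly. The following is a minimal sketch that assumes nothing beyond the standard library; the scratch file name is an arbitrary placeholder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Throwaway file used only for this demonstration.
    path = os.path.join(tmp_dir, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello\n")

    # Binary mode with buffering=0 hands back the raw FileIO layer.
    with open(path, "rb", buffering=0) as raw:
        raw_type = type(raw).__name__
        first_bytes = raw.read(5)

    # Text mode with buffering=0 is rejected with ValueError.
    try:
        open(path, "r", buffering=0)
        text_error = None
    except ValueError as exc:
        text_error = str(exc)

print(raw_type)     # FileIO
print(first_bytes)  # b'hello'
print(text_error)
```

In binary mode the returned object is the bottom layer of the stack, so there is no buffer and no decoder between your code and the OS.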
Part 1 - The open() Function, All Parameters
Signature
open(
    file, # path (str, bytes, or Path object)
    mode='r', # how to open the file
    buffering=-1, # buffer size: -1 = automatic
    encoding=None, # text encoding (text mode only)
    errors=None, # encoding error handling policy
    newline=None, # newline translation (text mode only)
    closefd=True, # if file is a file descriptor integer
    opener=None # custom opener callable
)
The file Parameter
# String path (most common)
f = open("/var/log/app.log", "r")
# pathlib.Path object - works natively since Python 3.6
from pathlib import Path
f = open(Path.home() / "data" / "config.json", "r")
# Integer file descriptor (advanced: wraps an OS fd)
import os
fd = os.open("/tmp/test.txt", os.O_RDONLY)
f = open(fd, "r", closefd=True) # closefd=True means close fd when f.close() is called
Part 2 - File Modes in Depth
The mode string is one to three characters (for example 'r', 'rb', or 'rb+') that specify intent and format.
| Mode | Meaning |
|---|---|
| 'r' | Read, text mode. File must exist. Default. |
| 'w' | Write, text mode. Truncates (destroys) existing file. |
| 'a' | Append, text mode. Creates if missing. |
| 'x' | Exclusive create. Fails if file exists. |
| 'b' | Binary modifier (combine with r/w/a/x) |
| 't' | Text modifier. Default, rarely written explicitly. |
| '+' | Update (read+write). Combine with r/w/a. |
Common combinations: 'rb' - read binary (images, PDFs, pickle files) · 'wb' - write binary · 'r+' - read AND write (no truncation, must exist) · 'w+' - write AND read (truncates first) · 'a+' - append AND read
Text Mode 'r' (default)
with open("poem.txt", "r", encoding="utf-8") as f:
    content = f.read()
print(type(content)) # <class 'str'>
In text mode, Python decodes bytes from disk into a str object using the specified encoding. Newline sequences are also normalized (more on this below).
Binary Mode 'rb'
with open("image.png", "rb") as f:
    header = f.read(8)
print(header) # b'\x89PNG\r\n\x1a\n'
print(type(header)) # <class 'bytes'>
In binary mode:
- No encoding/decoding happens - you get raw bytes
- No newline translation
- The encoding parameter must not be passed - doing so raises ValueError (binary mode doesn't take an encoding argument)
Use binary mode for: images, audio, video, compiled files, pickle serialization, network protocol parsing, any format where bytes are semantically significant.
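As one illustration of bytes being semantically significant, many formats can be recognised from their first few bytes. This is a sketch only: sniff_format and the SIGNATURES table are hypothetical names introduced here, and the table covers just three well-known signatures.

```python
import os
import tempfile

# Hypothetical lookup table for illustration; real-world detection needs
# many more entries and variants.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"\x1f\x8b": "gzip",
}

def sniff_format(path):
    """Return a format name based on the file's leading bytes, or None."""
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature above is 8 bytes
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None

with tempfile.TemporaryDirectory() as tmp_dir:
    sample = os.path.join(tmp_dir, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"%PDF-1.7\n")
    detected = sniff_format(sample)

print(detected)  # pdf
```

Text mode would be the wrong tool here: decoding and newline translation could alter or reject exactly the bytes the comparison depends on.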
'r+' - Read and Write Without Truncation
with open("counter.txt", "r+", encoding="utf-8") as f:
    value = int(f.read().strip()) # read current value
    f.seek(0) # go back to start
    f.write(str(value + 1)) # write new value
    f.truncate() # remove any leftover bytes
'r+' is rarely used. It requires the file to exist and does not truncate it. You must manage position with seek() manually.
Part 3 - Text vs Binary Mode: Newline Translation
This is one of Python's most surprising behaviors for cross-platform code.
What Happens in Text Mode
Different operating systems use different newline sequences:
- Unix/Linux/macOS: \n (LF, one byte, 0x0A)
- Windows: \r\n (CRLF, two bytes, 0x0D 0x0A)
- Old Mac OS (pre-X): \r (CR, one byte, 0x0D)
In text mode, Python automatically translates on reading:
- \r\n → \n (Windows files read on any platform)
- \r → \n (old Mac files)
And on writing, Python translates \n → the platform's native newline.
# A Windows file containing: "line1\r\nline2\r\n"
with open("windows_file.txt", "rb") as f:
    raw = f.read()
print(repr(raw)) # b'line1\r\nline2\r\n'

with open("windows_file.txt", "r") as f:
    text = f.read()
print(repr(text)) # 'line1\nline2\n' ← \r\n collapsed to \n
When This Matters
# Counting bytes won't match if you use text mode on a Windows file
import os

filename = "windows_file.txt"
file_size = os.path.getsize(filename) # real bytes on disk: includes \r\n
with open(filename, "r") as f:
    content = f.read()
text_len = len(content) # len() of str: \r\n counted as 1 char (\n)
print(file_size) # 14 (len of b'line1\r\nline2\r\n')
print(text_len) # 12 (len of 'line1\nline2\n') - different!
Controlling Newline Behavior with newline=
# newline='' - no translation, preserve all newline bytes
with open("file.txt", "r", newline="") as f:
    content = f.read() # \r\n preserved as-is

# newline='\n' - only recognize \n as line terminator
with open("file.txt", "r", newline="\n") as f:
    content = f.read() # \r\n in file will appear as \r\n in string
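To see how the three settings differ on the same bytes, here is a small self-contained sketch. It writes a scratch file with mixed line endings (the file name is a placeholder) and reads it back with newline=None, newline='' and newline='\n'.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mixed.txt")
    # Write exact bytes so nothing is translated on the way in.
    with open(path, "wb") as f:
        f.write(b"a\r\nb\rc\n")

    results = {}
    for setting in (None, "", "\n"):
        with open(path, "r", encoding="utf-8", newline=setting) as f:
            results[setting] = f.readlines()

print(results[None])  # ['a\n', 'b\n', 'c\n']
print(results[""])    # ['a\r\n', 'b\r', 'c\n']
print(results["\n"])  # ['a\r\n', 'b\rc\n']
```

Note that newline='' still splits lines on every newline style but leaves the terminators untouched, whereas newline='\n' neither translates nor treats a lone \r as a line break.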
:::tip Use newline='' for CSV
Python's csv module documentation explicitly states you should open files with newline='' in text mode. This prevents double-translation of newlines within quoted CSV fields.
import csv

with open("data.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
:::
Part 4 - Encodings: UTF-8 Everywhere
Why Encoding Matters
Files on disk are bytes. A str in Python is a sequence of Unicode code points. Encoding is the mapping between the two.
str: "café" ← Python string (4 Unicode code points)
↓ UTF-8 encode
bytes: b'caf\xc3\xa9' ← 5 bytes on disk (é is 2 bytes in UTF-8)
↓ UTF-8 decode
str: "café" ← back to Python string
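The round trip in the diagram can be reproduced in a few lines. This sketch uses an escape sequence for é so the source stays unambiguous about which code point is meant.

```python
text = "caf\u00e9"                  # 'café' with a precomposed é (U+00E9)

utf8_bytes = text.encode("utf-8")   # é becomes two bytes
latin1_bytes = text.encode("latin-1")  # é becomes one byte

print(len(text), len(utf8_bytes), len(latin1_bytes))  # 4 5 4
print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the same encoding restores the original string.
restored = utf8_bytes.decode("utf-8")
print(restored == text)  # True
```

The same four characters produce different bytes under different encodings, which is why the reader of a file has to know which encoding the writer used.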
The Default Encoding Trap
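import locale
import sys

# What open() actually uses when encoding= is omitted - varies by OS and locale!
print(locale.getpreferredencoding(False)) # e.g. UTF-8 on Linux/macOS, cp1252 on many Windows setups

# Not the same thing: neither of these controls open()
print(sys.getdefaultencoding()) # always utf-8 in Python 3 (default for str.encode()/bytes.decode())
print(sys.getfilesystemencoding()) # encoding used for file *names*, not file contents

# This is dangerous - behavior differs across machines:
with open("data.txt") as f: # no encoding= specified!
    content = f.read()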
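On a Windows machine configured with a Russian locale, the default encoding might be cp1251. A file written on a Linux machine in UTF-8 would then be decoded with the wrong table: non-ASCII text comes back garbled (mojibake), or the read fails with UnicodeDecodeError if a byte has no mapping in cp1251.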
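One way the mismatch can show up is as silently garbled text rather than an exception. The sketch below simulates it with bytes in memory instead of two real machines; the sample word is arbitrary.

```python
original = "Привет"

# What a UTF-8 writer puts on disk: two bytes per Cyrillic letter.
utf8_bytes = original.encode("utf-8")

# What a reader gets when the same bytes are decoded as cp1251.
misread = utf8_bytes.decode("cp1251")

print(len(original), len(utf8_bytes), len(misread))  # 6 12 12
print(misread == original)  # False

# The damage is reversible only if you know both encodings involved.
recovered = misread.encode("cp1251").decode("utf-8")
print(recovered == original)  # True
```

No error is raised for this input, so a program relying on the platform default could process the garbled text without noticing.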
:::warning Always specify encoding explicitly
# Wrong - behavior is platform-dependent:
with open("config.json") as f:
    data = f.read()

# Correct - always specify:
with open("config.json", encoding="utf-8") as f:
    data = f.read()
Make this a team rule. Add a linter rule (W1514 in pylint) to enforce it.
:::
Common Encodings
| Encoding | Use Case |
|---|---|
| utf-8 | Universal. Use this for everything you control. |
| utf-8-sig | UTF-8 with BOM. Needed for Excel CSV compatibility. |
| utf-16 | Windows APIs, some legacy formats. |
| latin-1 | Western European legacy files (ISO-8859-1). |
| cp1252 | Windows Western European (matches latin-1 except in bytes 0x80-0x9F, where it defines printable characters such as € and curly quotes). |
| ascii | 7-bit only. Fails on any non-ASCII character. |
UnicodeDecodeError - Diagnosis and Fix
# Reading a latin-1 file as UTF-8 causes a decode error
try:
    with open("legacy_data.txt", "r", encoding="utf-8") as f:
        content = f.read()
except UnicodeDecodeError as e:
    print(e)
    # 'utf-8' codec can't decode byte 0xe9 in position 42: ...

# Fix 1: use the correct encoding
with open("legacy_data.txt", "r", encoding="latin-1") as f:
    content = f.read()

# Fix 2: detect encoding automatically (requires the third-party chardet package)
import chardet

with open("legacy_data.txt", "rb") as f:
    raw = f.read(10000) # read a sample
detected = chardet.detect(raw)
print(detected) # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

# Detection is a statistical guess: check the confidence, and note that
# detected["encoding"] can be None when chardet cannot decide
with open("legacy_data.txt", "r", encoding=detected["encoding"]) as f:
    content = f.read()
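When adding a dependency is not an option, a common alternative heuristic is to try a short list of candidate encodings in order. This is a sketch rather than a replacement for real detection: decode_with_fallbacks is a hypothetical helper, and the candidate order is an assumption suited to Western European data.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used) for the first candidate that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the list: it maps every byte value.
    raise ValueError("no candidate encoding matched")

text, used = decode_with_fallbacks(b"caf\xe9")
print(used)  # cp1252

_, used_for_utf8 = decode_with_fallbacks("caf\u00e9".encode("utf-8"))
print(used_for_utf8)  # utf-8
```

UTF-8 goes first because it is strict enough to reject most non-UTF-8 input; latin-1 goes last because it never fails, which also means a "successful" decode there proves nothing about correctness.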
The errors Parameter
When a byte sequence cannot be decoded, Python raises UnicodeDecodeError by default. The errors parameter controls this:
# 'strict' (default) - raise UnicodeDecodeError
with open("file.txt", "r", encoding="utf-8", errors="strict") as f:
content = f.read() # raises on bad bytes
# 'ignore' - silently skip undecoded bytes
with open("file.txt", "r", encoding="utf-8", errors="ignore") as f:
content = f.read() # bad bytes disappear; data may be corrupted silently
# 'replace' - replace bad bytes with the Unicode replacement char (U+FFFD: )
with open("file.txt", "r", encoding="utf-8", errors="replace") as f:
content = f.read() # bad bytes become in the output
# 'backslashreplace' - represent bad bytes as \xNN escape sequences
with open("file.txt", "r", encoding="utf-8", errors="backslashreplace") as f:
content = f.read() # bad byte 0xe9 becomes the string \xe9
# 'surrogateescape' - advanced: encode bad bytes as surrogate code points
# Used internally by the OS layer; allows round-trip fidelity
with open("file.txt", "r", encoding="utf-8", errors="surrogateescape") as f:
content = f.read()
:::tip Production recommendation
For parsing unknown-origin files (web scraping, uploaded files), use errors="replace" or errors="ignore". For data pipelines where silent data loss is dangerous, use errors="strict" and catch UnicodeDecodeError explicitly to log the problematic file.
:::
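The policies are easier to compare side by side on one input. bytes.decode() accepts the same errors values as open(), so this sketch uses an in-memory byte string containing a single invalid byte (0xe9) instead of a file.

```python
bad = b"caf\xe9 ok"  # 0xe9 is not valid UTF-8 in this position

results = {
    policy: bad.decode("utf-8", errors=policy)
    for policy in ("ignore", "replace", "backslashreplace", "surrogateescape")
}

print(repr(results["ignore"]))            # 'caf ok'
print(repr(results["replace"]))           # 'caf\ufffd ok'
print(repr(results["backslashreplace"]))  # 'caf\\xe9 ok'
print(repr(results["surrogateescape"]))   # 'caf\udce9 ok'

# Only surrogateescape can reproduce the original bytes exactly.
round_trip = results["surrogateescape"].encode("utf-8", errors="surrogateescape")
print(round_trip == bad)  # True

# The default policy raises instead of returning anything.
try:
    bad.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

Of the lenient policies, only surrogateescape is lossless; the others either drop the byte or replace it with something that cannot be mapped back.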
Part 5 - Reading Methods in Depth
read(size=-1) - Read All or N Characters (N Bytes in Binary Mode)
with open("config.txt", "r", encoding="utf-8") as f:
    # Read entire file into one string
    all_text = f.read() # returns str
    print(type(all_text)) # <class 'str'>
    # After reading to the end, position is at EOF
    print(f.read()) # '' - empty string, not an error
    # Seek back to beginning
    f.seek(0)
    chunk = f.read(100) # read up to 100 characters (fewer if EOF comes first)

# Binary read - returns bytes
with open("image.png", "rb") as f:
    header = f.read(8) # PNG signature: first 8 bytes
print(header) # b'\x89PNG\r\n\x1a\n'
:::warning f.read() loads the entire file into RAM
For a 10 GB log file, f.read() requires at least 10 GB of free memory. On a typical server with 8 GB RAM, this usually ends in MemoryError, heavy swapping, or the process being killed by the OS. Use readline() or iteration for large files.
:::
readline(size=-1) - One Line at a Time
with open("server.log", "r", encoding="utf-8") as f:
    # Read lines one at a time - memory-efficient
    while True:
        line = f.readline()
        if not line: # empty string = EOF (not '\n'!)
            break
        process(line.rstrip("\n")) # process() stands in for your own handler
The distinction between an empty line and EOF:
File content: "line1\nline2\n\nline4\n"
readline() calls:
1st call → "line1\n" ← line1 with newline
2nd call → "line2\n" ← line2 with newline
3rd call → "\n" ← empty line (just a newline)
4th call → "line4\n" ← line4 with newline
5th call → "" ← empty string = EOF
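The sequence above can be reproduced without touching the disk. This sketch uses io.StringIO, which implements the same readline() contract as a text file, and adds one extra case: a final line with no trailing newline.

```python
import io

f = io.StringIO("line1\nline2\n\nline4\n")
calls = [f.readline() for _ in range(6)]
print(calls)  # ['line1\n', 'line2\n', '\n', 'line4\n', '', '']

# A last line without a newline is returned as-is; EOF is still ''.
g = io.StringIO("a\nb")
tail = [g.readline() for _ in range(3)]
print(tail)   # ['a\n', 'b', '']
```

Because an empty line is "\n" and therefore truthy, testing `if not line` is a reliable EOF check; repeated calls after EOF keep returning the empty string.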
readlines() - Read All Lines into a List
with open("config.ini", "r", encoding="utf-8") as f:
    lines = f.readlines() # returns list of str, each with \n
print(lines)
# ['[database]\n', 'host=localhost\n', 'port=5432\n']

# Strip newlines
lines = [line.rstrip("\n") for line in lines]
:::note readlines() vs list(f)
f.readlines() and list(f) both produce a list of all lines. They are equivalent in behavior, but list(f) is slightly more Pythonic. Both load all lines into memory - only use them for files that fit in RAM.
:::
Iteration Over a File Object - The Best Default
# Iteration is the cleanest, most memory-efficient approach for line-by-line reading
with open("access.log", "r", encoding="utf-8") as f:
    for line in f:
        # line includes the trailing \n
        if "ERROR" in line:
            print(line.rstrip())
File objects are iterators - each __next__() call reads one line via the buffer. This is:
- Memory efficient: only one line in RAM at a time
- Faster than readline() in a while loop (less Python call overhead)
- The idiomatic Python pattern
# Practical pattern: filter and transform large log files
def parse_errors(log_path):
    """Yield (timestamp, message) tuples for ERROR lines."""
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip()
            if not line.startswith("[ERROR]"):
                continue
            # "[ERROR] 2024-01-15 14:32:01 - Database connection failed"
            parts = line.split(" ", 3)
            timestamp = f"{parts[1]} {parts[2]}"
            message = parts[3].lstrip("- ")
            yield timestamp, message

for ts, msg in parse_errors("/var/log/app.log"):
    print(f"{ts}: {msg}")
Part 6 - File Object Internals: Buffered I/O
What Buffering Does
Without buffering, every readline() call would issue a system call to the OS - a context switch from user space to kernel space. For a file with 1 million lines, that is 1 million system calls.
With buffering, Python reads a large chunk (default: 8 KB) into RAM in one system call, then serves individual reads from that in-memory buffer. For typical 80-byte log lines, one 8 KB fill serves about 100 lines, so system calls drop by roughly two orders of magnitude.
Without buffering:
Python → OS → disk → OS → Python → OS → disk → OS → Python ...
(one system call per read)
With buffering (8 KB chunks):
Python → OS → disk → OS → Python (8 KB buffer filled)
serve line 1 from buffer (no system call)
serve line 2 from buffer (no system call)
...
serve line 100 from buffer (no system call)
Python → OS → disk → OS → Python (refill buffer)
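The effect can be observed without tracing real system calls. The sketch below substitutes a hypothetical in-memory raw stream, CountingRaw, for FileIO and counts how often BufferedReader asks it for data; the line length and buffer size are assumptions chosen for illustration.

```python
import io

class CountingRaw(io.RawIOBase):
    """In-memory stand-in for FileIO that counts readinto() calls."""

    def __init__(self, data):
        self._data = data
        self._pos = 0
        self.calls = 0  # each call models one read() system call

    def readable(self):
        return True

    def readinto(self, b):
        self.calls += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)  # 0 signals EOF

# 10,000 lines of 80 bytes each (79 characters plus newline).
data = (b"x" * 79 + b"\n") * 10_000

raw = CountingRaw(data)
buffered = io.BufferedReader(raw, buffer_size=8192)
line_count = sum(1 for _ in buffered)

print(line_count)  # 10000
print(raw.calls)   # far fewer than line_count
```

The exact ratio depends on line length and buffer size, but the raw layer is consulted once per buffer fill rather than once per line.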
The buffering Parameter
# buffering=-1 (default): system default, typically 8192 bytes for files
with open("data.txt", "r") as f:
    print(f.buffer.read1.__doc__) # shows BufferedReader internals

# buffering=0: unbuffered (binary mode only!)
# Every read goes directly to the OS - one system call per read(), so small reads are
# expensive; useful when data must be seen as soon as it arrives (tail -f style reading)
with open("data.bin", "rb", buffering=0) as f:
    chunk = f.read(8)

# buffering=1: line buffering (text mode only, used for interactive streams)
# Flushes on every \n - used for stdout when writing to terminal

# buffering=N (N > 1): explicit buffer size in bytes
with open("data.txt", "r", buffering=65536) as f: # 64 KB buffer
    pass
:::note Buffer size for large files
For reading very large files sequentially, a larger buffer (64 KB or 256 KB) reduces system call overhead. The default 8 KB is a good general-purpose size, but sequential large-file workloads benefit from larger values.
:::
Inspecting the IO Stack
import io
with open("test.txt", "r", encoding="utf-8") as f:
    print(type(f)) # <class '_io.TextIOWrapper'>
    print(type(f.buffer)) # <class '_io.BufferedReader'>
    print(type(f.buffer.raw)) # <class '_io.FileIO'>
    # The underlying OS file descriptor
    print(f.fileno()) # integer, e.g. 3
    # Current byte position in the file
    print(f.buffer.tell()) # 0 (at start)
Part 7 - The Context Manager: Always Use with open()
Why finally Alone Is Not Enough
Many tutorials show this "safe" pattern:
f = open("data.txt", "r")
try:
    content = f.read()
finally:
    f.close() # always close, even if read() raises
This works. But it has two problems:
- It is verbose - 5 lines instead of 2
- It is easy to get wrong: open() must stay outside the try block. If it is moved inside and open() itself raises (permissions error, file not found), f is never assigned, and f.close() in finally raises NameError on top of the original error
The context manager handles all of this:
with open("data.txt", "r", encoding="utf-8") as f:
    content = f.read()
# f is guaranteed closed here, even if read() raises an exception
# Even if the exception propagates, the file is closed before it unwinds
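The guarantee is easy to verify. This sketch forces an exception inside the with block and then inspects the file object's closed attribute; the scratch file and its contents are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("not a number\n")

    handle = None
    try:
        with open(path, "r", encoding="utf-8") as f:
            handle = f        # keep a reference to inspect afterwards
            int(f.read())     # raises ValueError inside the with block
    except ValueError:
        pass

    closed_after_error = handle.closed

print(closed_after_error)  # True
```

The file was closed by __exit__ before the except clause ran, so no cleanup code is needed in the handler.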
What with open() Does Under the Hood
import sys

# This code:
with open("data.txt") as f:
    content = f.read()

# Is roughly equivalent to:
_file = open("data.txt")
f = _file.__enter__() # returns the file object itself
try:
    content = f.read()
except:
    # Pass exception info; a true return value would suppress the exception.
    # File objects return None here, so the exception always propagates.
    if not _file.__exit__(*sys.exc_info()):
        raise
else:
    _file.__exit__(None, None, None) # no exception
File objects implement __enter__ (returns self) and __exit__ (calls self.close()). Context managers are covered in depth in the next section.
:::danger Never leave files open
Open file descriptors are system resources. The OS limits the number of open files per process (typically 1024 on Linux). A long-running server that opens files without closing them will hit this limit and crash with OSError: [Errno 24] Too many open files.
Always use with open(). Never rely on garbage collection to close files - CPython uses reference counting and usually closes promptly, but PyPy and other implementations do not guarantee this.
:::
Part 8 - Large File Handling
Strategy 1: Line-by-Line Iteration (Most Common)
def count_errors(log_path):
    """Count ERROR lines in a log file of any size."""
    count = 0
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            if "ERROR" in line:
                count += 1
    return count

# Works on a 10 GB file - only one line in RAM at a time
errors = count_errors("/var/log/production.log")
Strategy 2: Chunked Reading for Binary or Non-Line Data
import hashlib

def compute_sha256(filepath):
    """Compute SHA-256 hash of a file without loading it into memory."""
    hasher = hashlib.sha256()
    chunk_size = 65536 # 64 KB chunks
    with open(filepath, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()

checksum = compute_sha256("/var/backup/database.dump")
print(checksum) # 64-character hex digest
Strategy 3: Generator Pipeline for Stream Processing
def read_lines(path, encoding="utf-8"):
    """Generator that yields lines from a file."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")

def filter_errors(lines):
    """Generator that yields only error lines."""
    for line in lines:
        if line.startswith("[ERROR]"):
            yield line

def parse_timestamp(lines):
    """Generator that yields (timestamp, rest) tuples."""
    for line in lines:
        # "[ERROR] 2024-01-15 14:32:01 - message" → level, date, time, rest
        parts = line.split(" ", 3)
        if len(parts) < 3:
            continue # skip malformed lines instead of raising IndexError
        timestamp = f"{parts[1]} {parts[2]}"
        rest = parts[3].lstrip("- ") if len(parts) > 3 else ""
        yield timestamp, rest

# Compose the pipeline - zero extra memory, pure generator chain
pipeline = parse_timestamp(filter_errors(read_lines("/var/log/app.log")))
for timestamp, message in pipeline:
    print(f" {timestamp}: {message}")
This is the generator pipeline pattern - each stage processes one item at a time. No intermediate lists are created. Memory usage is O(1) regardless of file size.
Strategy 4: Memory-Mapped Files with mmap
For random access into large files (like a binary index file), memory-mapping is the correct tool:
import mmap

def find_in_large_file(filepath, search_bytes):
    """
    Find all byte offsets of search_bytes in a large file
    using memory-mapped I/O for efficient random access.
    """
    offsets = []
    with open(filepath, "rb") as f:
        # Map the entire file into virtual address space
        # The OS loads pages on demand - never all in RAM
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = 0
            while True:
                pos = mm.find(search_bytes, offset)
                if pos == -1:
                    break
                offsets.append(pos)
                offset = pos + 1
    return offsets

# Find all occurrences of b"ERROR" in a 5 GB binary log
positions = find_in_large_file("/data/app.bin", b"ERROR")
print(f"Found {len(positions)} occurrences")
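Result: the OS pages file data in and out on demand, so resident memory stays bounded for any file size (only the offsets list grows, with the number of matches), and random access is fast.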
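For the binary index case mentioned at the start of this strategy, random access usually means jumping straight to record N. The sketch below assumes a made-up fixed-size record layout (a 4-byte id followed by an 8-byte offset); the layout, the RECORD name, and read_record are all hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout for illustration: little-endian uint32 id + uint64 offset.
RECORD = struct.Struct("<IQ")

def read_record(mm, index):
    """Unpack record number `index` directly from the mapped file."""
    return RECORD.unpack_from(mm, index * RECORD.size)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.bin")
    with open(path, "wb") as f:
        for i in range(1000):
            f.write(RECORD.pack(i, i * 4096))

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            record_count = len(mm) // RECORD.size
            middle = read_record(mm, 500)
            last = read_record(mm, record_count - 1)

print(record_count)  # 1000
print(middle)        # (500, 2048000)
print(last)          # (999, 4091904)
```

Reading record 500 touches only the page that contains it; nothing before that offset has to be read or skipped in Python code.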
Part 9 - Real-World Patterns
Pattern 1: Reading a Config File Safely
from pathlib import Path

def load_config(path):
    """Load a key=value config file into a dict."""
    config = {}
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {path}")
    with open(config_path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line or line.startswith("#"): # skip empty lines and comments
                continue
            if "=" not in line:
                raise ValueError(f"Line {lineno}: expected 'key=value', got: {line!r}")
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

# Usage
config = load_config("/etc/myapp/settings.conf")
db_host = config.get("DB_HOST", "localhost")
Pattern 2: Streaming CSV Parsing Without Pandas
import csv

def stream_large_csv(filepath, batch_size=1000):
    """
    Read a large CSV file in batches.
    Yields lists of dicts, batch_size rows at a time.
    """
    with open(filepath, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch # last partial batch

# Process 5 million rows with constant memory
for batch in stream_large_csv("/data/transactions.csv"):
    # batch is a list of dicts, max 1000 rows
    process_batch(batch)
Pattern 3: Tail a Growing Log File
import time

def tail_file(filepath, interval=0.5):
    """
    Yield new lines as they are appended to a file (like Unix tail -f).
    """
    with open(filepath, "r", encoding="utf-8") as f:
        f.seek(0, 2) # seek to end of file (SEEK_END = 2)
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(interval) # no new data, wait

# Monitor a log file in real time
for new_line in tail_file("/var/log/app.log"):
    if "CRITICAL" in new_line:
        send_alert(new_line)
Interview Questions
Q1: What is the difference between text mode and binary mode in Python file I/O?
Answer: In text mode ('r', 'w', 'a'), Python applies encoding/decoding (converting between bytes on disk and str in Python) and performs newline translation (collapsing \r\n or \r to \n on read, and expanding \n to the platform newline on write). In binary mode ('rb', 'wb'), no encoding or newline translation occurs - you receive and send raw bytes objects. Binary mode is required for non-text files (images, compressed data, serialized objects) and any situation where byte-for-byte fidelity matters.
Q2: What causes UnicodeDecodeError when reading a file, and how do you fix it?
Answer: UnicodeDecodeError occurs when a file contains byte sequences that are invalid under the specified encoding. The most common cause is reading a file created with a legacy encoding (Latin-1, Windows-1252) while specifying encoding="utf-8". Fixes: (1) identify the correct encoding using chardet.detect() and specify it; (2) use errors="replace" to substitute bad bytes with U+FFFD; (3) use errors="ignore" to discard bad bytes (risks data loss); (4) use errors="backslashreplace" to represent bad bytes as escape sequences for debugging.
Q3: Why should you always specify encoding= when calling open() in text mode?
Answer: If encoding is omitted, Python uses the locale-dependent default from locale.getpreferredencoding(False). This default varies by OS and system configuration - it is UTF-8 on modern Linux/macOS but may be cp1252 on Windows or another encoding on differently-configured systems. Code that relies on the default encoding works on one developer's machine but silently produces corrupt output or crashes on another's. Always specify encoding="utf-8" (or the specific encoding you need) to make file I/O deterministic across environments.
Q4: What is the difference between f.read(), f.readline(), f.readlines(), and iterating over f?
Answer:
- f.read() reads the entire file into one str object - uses maximum memory, fast for small files
- f.readline() reads one line per call - memory efficient, useful when you need to mix reading and other logic
- f.readlines() reads all lines into a list of str - uses the same memory as f.read() plus list overhead, convenient for random access to lines
- Iterating (for line in f) calls __next__() which reads one line per iteration using the buffer - most memory-efficient, idiomatic, and usually fastest for line-by-line processing
For large files, always prefer iteration or readline(). For small files (configuration, small data), read() or readlines() are fine.
Q5: Explain Python's buffered I/O stack and why buffering=0 only works in binary mode.
Answer: Python's I/O stack has three layers: FileIO (raw OS system calls), BufferedReader/BufferedWriter (in-memory chunk buffer), and TextIOWrapper (encoding/decoding + newline translation). buffering=0 requests unbuffered I/O, bypassing the BufferedReader layer. In binary mode, FileIO alone is sufficient - you get raw bytes directly from OS calls. In text mode, TextIOWrapper requires a buffered source because it needs to read ahead for multi-byte character boundaries. Python raises ValueError: can't have unbuffered text I/O if you attempt buffering=0 with text mode.
Q6: How would you process a 20 GB CSV file on a machine with 4 GB of RAM? Walk through your approach.
Answer: I would use a generator pipeline to stream the file:
- Open the file with open(path, "r", newline="", encoding="utf-8") and wrap it in csv.DictReader
- Iterate over the reader row by row - each iteration reads one line from the buffer
- Process each row immediately or accumulate into small batches (1,000-10,000 rows) for batch DB inserts
- If random access is needed (e.g., binary index file), use mmap to memory-map the file - the OS pages in only the accessed portions
- If aggregation is needed (sums, counts), maintain running accumulators rather than storing all rows
The key principle: never call f.read() or f.readlines() on unknown-size files. Always stream. Memory usage should be O(batch_size), not O(file_size).
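The running-accumulator idea from the answer can be sketched as follows. The column names (region, amount) and the helper total_by_key are hypothetical, and io.StringIO stands in for the 20 GB file so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

def total_by_key(rows, key_field, value_field):
    """Sum value_field per key_field, holding only the totals in memory."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)

# Stand-in for open(path, "r", newline="", encoding="utf-8")
sample = io.StringIO(
    "region,amount\nnorth,10.5\nsouth,4\nnorth,2.5\n",
    newline="",
)
totals = total_by_key(csv.DictReader(sample), "region", "amount")
print(totals)  # {'north': 13.0, 'south': 4.0}
```

Memory grows with the number of distinct keys, not with the number of rows, which is what makes this approach viable on a machine with less RAM than the file size.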
Practice Challenges
Beginner - Read and Filter
Write a function find_lines(filepath, keyword) that reads a text file and returns a list of all lines containing the keyword (case-insensitive).
# Expected behavior:
# File contents:
# "Alice bought apples\n"
# "Bob bought Bananas\n"
# "Alice sold oranges\n"
#
# find_lines("purchases.txt", "alice") → ["Alice bought apples", "Alice sold oranges"]
Solution
def find_lines(filepath, keyword):
    """
    Return a list of lines containing keyword (case-insensitive).
    Lines are returned without trailing newlines.
    """
    keyword_lower = keyword.lower()
    results = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            stripped = line.rstrip("\n")
            if keyword_lower in stripped.lower():
                results.append(stripped)
    return results

# Test it
import tempfile, os

content = "Alice bought apples\nBob bought Bananas\nAlice sold oranges\n"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(content)
    tmp_path = tmp.name
result = find_lines(tmp_path, "alice")
print(result) # ['Alice bought apples', 'Alice sold oranges']
os.unlink(tmp_path)
Key points:
- Use with open() to ensure the file is always closed
- Specify encoding="utf-8" explicitly
- Use line.rstrip("\n") not line.strip() - strip() removes leading spaces too
- Lower-case both sides for case-insensitive comparison
Intermediate - Word Frequency Counter
Write a memory-efficient function word_frequency(filepath) that returns a dict mapping each word to its frequency in a large text file. The file might be too large to fit in RAM.
Requirements:
- Lowercase all words
- Strip punctuation from words
- Use streaming (line-by-line) - do not call f.read()
Solution
import string
from collections import Counter

def word_frequency(filepath):
    """
    Count word frequencies in a file using line-by-line streaming.
    Returns a Counter (dict subclass) mapping word → count.
    Works on files larger than available RAM.
    """
    counts = Counter()
    translator = str.maketrans("", "", string.punctuation)
    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            # Remove punctuation and lowercase
            cleaned = line.translate(translator).lower()
            words = cleaned.split()
            counts.update(words)
    return counts

# Test
import tempfile, os

text = """
Python is great. Python is fast!
Python's ecosystem is huge, and Python is fun.
"""
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(text)
    tmp_path = tmp.name
freq = word_frequency(tmp_path)
print(freq.most_common(5))
# [('is', 4), ('python', 3), ('great', 1), ('fast', 1), ('pythons', 1)]
# Note: stripping the apostrophe turns "Python's" into the separate word "pythons"
os.unlink(tmp_path)
Design notes:
- Counter.update() accepts an iterable and increments counts - O(words per line) memory per call
- str.maketrans with punctuation removal is faster than regex for this use case
- errors="replace" handles corrupted bytes gracefully in real-world files
- The function uses O(unique_words) memory - much less than O(total_words)
Advanced - Streaming Log Aggregator
Build a streaming log aggregator that reads a large log file and produces a summary report without loading the full file into memory.
Log format:
[LEVEL] YYYY-MM-DD HH:MM:SS - message
Requirements:
- Count occurrences by level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Find the first and last timestamp in the file
- Collect the last 10 ERROR/CRITICAL messages
- Work on a file of any size using constant memory (aside from the last-10 buffer)
Solution
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogSummary:
    level_counts: dict = field(default_factory=lambda: defaultdict(int))
    first_timestamp: Optional[str] = None
    last_timestamp: Optional[str] = None
    recent_errors: deque = field(default_factory=lambda: deque(maxlen=10))
    total_lines: int = 0
    parse_errors: int = 0

def analyze_log(filepath):
    """
    Stream a log file and produce a summary report.
    Memory usage: O(1) + O(last_10_errors) regardless of file size.
    """
    summary = LogSummary()
    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            summary.total_lines += 1
            line = line.rstrip()
            if not line:
                continue
            # Parse: "[ERROR] 2024-01-15 14:32:01 - message"
            if not line.startswith("["):
                summary.parse_errors += 1
                continue
            try:
                bracket_end = line.index("]")
                level = line[1:bracket_end]
                rest = line[bracket_end + 2:] # skip "] "
                # rest: "2024-01-15 14:32:01 - message"
                parts = rest.split(" ", 3)
                if len(parts) < 3:
                    summary.parse_errors += 1
                    continue
                timestamp = f"{parts[0]} {parts[1]}"
            except (ValueError, IndexError):
                summary.parse_errors += 1
                continue
            # Update counts
            summary.level_counts[level] += 1
            # Track first/last timestamp
            if summary.first_timestamp is None:
                summary.first_timestamp = timestamp
            summary.last_timestamp = timestamp
            # Collect recent errors (deque with maxlen=10 auto-evicts old entries)
            if level in ("ERROR", "CRITICAL"):
                message = parts[3].lstrip("- ") if len(parts) > 3 else ""
                summary.recent_errors.append(f"[{level}] {timestamp}: {message}")
    return summary

def print_report(summary):
    print(f"Total lines processed: {summary.total_lines:,}")
    print(f"Parse errors: {summary.parse_errors}")
    print(f"Time range: {summary.first_timestamp} → {summary.last_timestamp}")
    print()
    print("Level breakdown:")
    for level in ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]:
        count = summary.level_counts.get(level, 0)
        bar = "#" * min(count, 50)
        print(f" {level:10} {count:8,} {bar}")
    print()
    print("Recent errors (last 10):")
    for err in summary.recent_errors:
        print(f" {err}")

# Demo with generated test data
import tempfile, os, random

def generate_test_log(path, num_lines=10000):
    levels = ["DEBUG"] * 50 + ["INFO"] * 30 + ["WARNING"] * 10 + ["ERROR"] * 8 + ["CRITICAL"] * 2
    with open(path, "w", encoding="utf-8") as f:
        for i in range(num_lines):
            level = random.choice(levels)
            f.write(f"[{level}] 2024-01-15 {i//3600:02d}:{(i%3600)//60:02d}:{i%60:02d} - Event {i}\n")

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp_path = tmp.name
# Write after the temporary handle is closed so the path can be reopened on any OS
generate_test_log(tmp_path, 10000)
summary = analyze_log(tmp_path)
print_report(summary)
os.unlink(tmp_path)
Output (approximate):
Total lines processed: 10,000
Parse errors: 0
Time range: 2024-01-15 00:00:00 → 2024-01-15 02:46:39
Level breakdown:
CRITICAL 200 ##################################################
ERROR 800 ##################################################
WARNING 1,000 ##################################################
INFO 3,000 ##################################################
DEBUG 5,000 ##################################################
Recent errors (last 10):
[CRITICAL] 2024-01-15 02:46:37: Event 9997
...
Design highlights:
- deque(maxlen=10) automatically discards old entries - O(10) memory always
- defaultdict(int) is cleaner than checking key existence
- errors="replace" handles corrupted log files gracefully
- The function is a single pass through the file - O(1) memory
Quick Reference
| Task | Code | Notes |
|---|---|---|
| Open text file for reading | open(path, "r", encoding="utf-8") | Always specify encoding |
| Open binary file for reading | open(path, "rb") | No encoding in binary mode |
| Read entire file | f.read() | Loads all into RAM - small files only |
| Read N characters | f.read(1024) | Returns up to 1024 chars |
| Read one line | f.readline() | Includes \n; "" at EOF |
| Read all lines as list | f.readlines() | Each item includes \n |
| Iterate line by line | for line in f: | Most efficient, idiomatic |
| Replace bad bytes | errors="replace" | Replaces with U+FFFD |
| Ignore bad bytes | errors="ignore" | Silent data loss risk |
| Preserve raw newlines | newline="" | Required for CSV module |
| Get file position | f.tell() | Byte offset in binary mode; opaque number in text mode |
| Move file position | f.seek(offset, whence) | whence: 0=start, 1=current, 2=end |
| Memory-map large file | mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) | Random access without full RAM load |
| Detect encoding | chardet.detect(raw_bytes) | Third-party library |
| Read in 64 KB chunks | f.read(65536) in while loop | For binary or non-line data |
| Open with larger buffer | buffering=65536 | Reduces system calls for large files |
Key Takeaways
- Python's open() creates a three-layer stack: TextIOWrapper (encoding), BufferedReader (chunking), and FileIO (OS calls). Understanding this stack explains most file I/O behavior.
- Always specify encoding="utf-8" (or the correct encoding) in text mode. Platform defaults vary and cause subtle cross-environment bugs.
- Text mode translates newlines; binary mode preserves raw bytes. Use binary mode for anything that is not human-readable text.
- f.read() loads the entire file into RAM. For files of unknown or large size, use for line in f: iteration instead - it is memory-efficient and idiomatic.
- The errors parameter controls UnicodeDecodeError handling: "strict" (default, raises), "replace" (substitutes U+FFFD), "ignore" (drops bad bytes).
- Always use with open() as f: - it guarantees the file is closed even when exceptions occur, preventing file descriptor leaks.
- For very large files, prefer generator pipelines (O(1) memory), chunked reads for binary data, or mmap for random access patterns.
