Reading Files - `open()`, Modes, Encodings, and Buffering

Reading time: ~18 minutes | Level: Foundation → Engineering

Here is a question that trips up experienced developers:

# You have a 10 GB server log file.
# What does this code do?

with open("server.log", "r") as f:
    content = f.read()
    print(len(content))

The code is valid, but it loads the entire 10 GB file into RAM as a single string. On a machine with 8 GB of RAM, it will typically fail with MemoryError (or be killed by the operating system's out-of-memory handler) before it ever reaches print.

The "right" answer is almost never f.read() for files of unknown size. This page explains why, what open() actually creates under the hood, and the full toolkit for reading files efficiently - from a 100-byte config file to a multi-gigabyte dataset.

What You Will Learn

Every parameter of open() - path, mode, encoding, buffering, errors, newline
All file modes: 'r', 'rb', 'rt', 'r+' - what each means at the OS level
How Python translates newlines in text mode and when to bypass that
Encodings: why UTF-8 is the correct default, what UnicodeDecodeError means, and all the errors parameter values
All reading methods: read(), readline(), readlines(), and iteration - when to use each
File object internals: buffered I/O layers and the underlying OS file descriptor
Memory-efficient reading: generators, chunked reads, and memory-mapped files with mmap
Real-world patterns: streaming 10 GB log files, parsing CSV line-by-line, loading config files

Prerequisites

Python 3.8+ installed and running
Basic understanding of Python data types: str, bytes, list
Familiarity with for loops and with statements at a syntactic level

Mental Model: What `open()` Actually Creates

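Most developers think open("file.txt") connects your code directly to a file on disk. The reality is a three-layer stack: TextIOWrapper (decoding and newline translation) sits on top of BufferedReader (an in-memory byte buffer), which sits on top of FileIO (raw OS reads).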

When you call f.read(1024), Python:

Asks TextIOWrapper for 1024 characters
TextIOWrapper asks BufferedReader for bytes
BufferedReader checks its internal buffer - if enough bytes are there, returns them; otherwise asks FileIO for a new chunk (typically 8 KB)
FileIO issues the OS read() system call
The OS reads from disk (or page cache) and returns raw bytes
Bytes flow back up, TextIOWrapper decodes them to str

This stack explains why buffering=0 only works in binary mode: it removes the BufferedReader layer, which binary mode can do without but TextIOWrapper cannot. It also explains why large files benefit from chunked reading.
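
A minimal sketch of the buffering=0 rule described above, using a scratch file created with tempfile so it runs anywhere. The variable names are illustrative only.

```python
import io
import os
import tempfile

# Create a small scratch file so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello\n")
    scratch_path = tmp.name

# Binary mode + buffering=0: open() hands back the raw FileIO layer itself.
with open(scratch_path, "rb", buffering=0) as raw:
    raw_type_name = type(raw).__name__
    is_raw = isinstance(raw, io.FileIO)

# Text mode + buffering=0: rejected with ValueError.
try:
    open(scratch_path, "r", buffering=0)
    text_error = None
except ValueError as exc:
    text_error = str(exc)

os.unlink(scratch_path)
print(raw_type_name)   # FileIO
print(text_error)
```

In binary mode the bottom layer is usable on its own, so Python can return it directly. In text mode there is nothing for TextIOWrapper to sit on, so the call is refused.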

Part 1 - The `open()` Function, All Parameters

Signature

open(
    file,                    # path (str, bytes, or Path object)
    mode='r',                # how to open the file
    buffering=-1,            # buffer size: -1 = automatic
    encoding=None,           # text encoding (text mode only)
    errors=None,             # encoding error handling policy
    newline=None,            # newline translation (text mode only)
    closefd=True,            # if file is a file descriptor integer
    opener=None              # custom opener callable
)

The `file` Parameter

# String path (most common)
f = open("/var/log/app.log", "r")

# pathlib.Path object - works natively since Python 3.6
from pathlib import Path
f = open(Path.home() / "data" / "config.json", "r")

# Integer file descriptor (advanced: wraps an OS fd)
import os
fd = os.open("/tmp/test.txt", os.O_RDONLY)
f = open(fd, "r", closefd=True)   # closefd=True means close fd when f.close() is called

Part 2 - File Modes in Depth

The mode string is one to three characters (for example 'r', 'rb', or 'rb+') that specify intent and format.

Mode	Meaning
`'r'`	Read, text mode. File must exist. Default.
`'w'`	Write, text mode. Truncates (destroys) existing file.
`'a'`	Append, text mode. Creates if missing.
`'x'`	Exclusive create. Fails if file exists.
`'b'`	Binary modifier (combine with r/w/a/x)
`'t'`	Text modifier. Default, rarely written explicitly.
`'+'`	Update (read+write). Combine with r/w/a.

Common combinations: 'rb' - read binary (images, PDFs, pickle files) · 'wb' - write binary · 'r+' - read AND write (no truncation, must exist) · 'w+' - write AND read (truncates first) · 'a+' - append AND read
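
To make the table concrete, here is a small sketch that exercises 'x', 'a', and 'w' against a throwaway file in a temporary directory. The file name modes_demo.txt is made up for the example.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "modes_demo.txt")

# 'x': create the file, fail if it already exists
with open(path, "x", encoding="utf-8") as f:
    f.write("first\n")

try:
    open(path, "x", encoding="utf-8")
    second_create = "succeeded"
except FileExistsError:
    second_create = "FileExistsError"

# 'a': every write lands at the end of the existing content
with open(path, "a", encoding="utf-8") as f:
    f.write("second\n")

with open(path, "r", encoding="utf-8") as f:
    after_append = f.read()

# 'w': the file is truncated as soon as it is opened, even if nothing is written
with open(path, "w", encoding="utf-8") as f:
    pass

size_after_w = os.path.getsize(path)

os.remove(path)
os.rmdir(workdir)

print(second_create)        # FileExistsError
print(repr(after_append))   # 'first\nsecond\n'
print(size_after_w)         # 0
```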

Text Mode `'r'` (default)

with open("poem.txt", "r", encoding="utf-8") as f:
    content = f.read()
    print(type(content))   # <class 'str'>

In text mode, Python decodes bytes from disk into a str object using the specified encoding. Newline sequences are also normalized (more on this below).

Binary Mode `'rb'`

with open("image.png", "rb") as f:
    header = f.read(8)
    print(header)          # b'\x89PNG\r\n\x1a\n'
    print(type(header))    # <class 'bytes'>

In binary mode:

No encoding/decoding happens - you get raw bytes
No newline translation
The encoding parameter is not accepted: passing it raises ValueError (binary mode doesn't take an encoding argument)

Use binary mode for: images, audio, video, compiled files, pickle serialization, network protocol parsing, any format where bytes are semantically significant.
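
One common use of binary mode is identifying a file by its leading "magic bytes", as the PNG header above hints. The sketch below is a hedged illustration: sniff_format and the SIGNATURES table are hypothetical names, and the table covers only a handful of formats.

```python
import os
import tempfile

# Leading bytes for a few common formats (small, non-exhaustive table).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
}

def sniff_format(path):
    """Return a format name based on the file's first bytes, or None."""
    longest = max(len(signature) for signature in SIGNATURES)
    with open(path, "rb") as f:
        head = f.read(longest)   # only a few bytes are read, whatever the file size
    for signature, name in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None

# Demo: a fake file that starts with the PNG signature.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    png_path = tmp.name

detected = sniff_format(png_path)
os.unlink(png_path)
print(detected)   # png
```

Text mode would be the wrong choice here: the signature contains \r\n, which newline translation would rewrite before you could compare it.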

`'r+'` - Read and Write Without Truncation

with open("counter.txt", "r+", encoding="utf-8") as f:
    value = int(f.read().strip())   # read current value
    f.seek(0)                        # go back to start
    f.write(str(value + 1))         # write new value
    f.truncate()                     # remove any leftover bytes

'r+' is rarely used. It requires the file to exist and does not truncate it. You must manage position with seek() manually.

Part 3 - Text vs Binary Mode: Newline Translation

This is one of Python's most surprising behaviors for cross-platform code.

What Happens in Text Mode

Different operating systems use different newline sequences:

Unix/Linux/macOS: \n (LF, one byte, 0x0A)
Windows: \r\n (CRLF, two bytes, 0x0D 0x0A)
Old Mac OS (pre-X): \r (CR, one byte, 0x0D)

In text mode, Python automatically translates on reading:

\r\n → \n (Windows files read on any platform)
\r → \n (old Mac files)

And on writing, Python translates \n → the platform's native newline.

# A Windows file containing: "line1\r\nline2\r\n"
with open("windows_file.txt", "rb") as f:
    raw = f.read()
    print(repr(raw))       # b'line1\r\nline2\r\n'

with open("windows_file.txt", "r") as f:
    text = f.read()
    print(repr(text))      # 'line1\nline2\n'  ← \r\n collapsed to \n

When This Matters

# Counting bytes won't match if you use text mode on a Windows file
import os

filename = "windows_file.txt"
file_size = os.path.getsize(filename)   # real bytes on disk: includes \r\n

with open(filename, "r") as f:
    content = f.read()
    text_len = len(content)   # len() of str: \r\n counted as 1 char (\n)

print(file_size)   # 14 (len of b'line1\r\nline2\r\n')
print(text_len)    # 12 (len of 'line1\nline2\n')  - different!

Controlling Newline Behavior with `newline=`

# newline='' - no translation, preserve all newline bytes
with open("file.txt", "r", newline="") as f:
    content = f.read()    # \r\n preserved as-is

# newline='\n' - only recognize \n as line terminator
with open("file.txt", "r", newline="\n") as f:
    content = f.read()    # \r\n in file will appear as \r\n in string
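
The three newline settings are easiest to compare side by side. This sketch writes one file with mixed line endings in binary mode, then reads it back three ways. The helper read_with is a name invented for the example.

```python
import os
import tempfile

# One file containing all three newline styles, written as raw bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"unix\nwindows\r\nold-mac\rend")
    mixed_path = tmp.name

def read_with(newline):
    """Read the demo file as a list of lines using the given newline= setting."""
    with open(mixed_path, "r", encoding="ascii", newline=newline) as f:
        return f.readlines()

universal = read_with(None)    # default: split on all three, translate to \n
untranslated = read_with("")   # split on all three, keep the original endings
lf_only = read_with("\n")      # split on \n only, keep everything as-is

os.unlink(mixed_path)

print(universal)      # ['unix\n', 'windows\n', 'old-mac\n', 'end']
print(untranslated)   # ['unix\n', 'windows\r\n', 'old-mac\r', 'end']
print(lf_only)        # ['unix\n', 'windows\r\n', 'old-mac\rend']
```

Note the last result: with newline="\n", a lone \r is ordinary data, so "old-mac\rend" stays on one line.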

:::tip Use newline='' for CSV
Python's csv module documentation explicitly states you should open files with newline='' in text mode. This prevents double-translation of newlines within quoted CSV fields.

import csv
with open("data.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

:::

Part 4 - Encodings: UTF-8 Everywhere

Why Encoding Matters

Files on disk are bytes. A str in Python is a sequence of Unicode code points. Encoding is the mapping between the two.

  str: "café"           ← Python string (4 Unicode code points)
   ↓  UTF-8 encode
bytes: b'caf\xc3\xa9'  ← 5 bytes on disk (é is 2 bytes in UTF-8)
   ↓  UTF-8 decode
  str: "café"           ← back to Python string

The Default Encoding Trap

import locale
import sys

# Not what open() uses: this is always 'utf-8' in Python 3
print(sys.getdefaultencoding())

# What open() uses when encoding= is omitted - varies by OS and locale!
print(locale.getpreferredencoding(False))   # e.g. 'UTF-8' on Linux/macOS, 'cp1252' on many Windows setups

# This is dangerous - behavior differs across machines:
with open("data.txt") as f:    # no encoding= specified!
    content = f.read()

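On a Windows machine configured with a Russian locale, the default encoding might be cp1251. A UTF-8 file written on a Linux machine would then be misread: most bytes decode to the wrong characters (mojibake) without any error, and the rare byte that cp1251 leaves undefined raises UnicodeDecodeError.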
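
You can reproduce the cp1251 scenario without touching a file, because open() ultimately does the same thing as bytes.decode(). This is a sketch with an arbitrary sample word:

```python
original = "Привет"                      # 6 Cyrillic letters
utf8_bytes = original.encode("utf-8")    # 12 bytes: each letter takes 2 bytes in UTF-8

# What a cp1251-default machine would do with those bytes:
misread = utf8_bytes.decode("cp1251")

# A stricter codec fails loudly instead of producing garbage:
try:
    utf8_bytes.decode("ascii")
    ascii_result = "decoded"
except UnicodeDecodeError:
    ascii_result = "UnicodeDecodeError"

print(len(original), len(utf8_bytes), len(misread))   # 6 12 12
print(misread == original)                            # False
print(ascii_result)                                   # UnicodeDecodeError
```

The cp1251 decode succeeds and silently returns twelve wrong characters. That silent failure is the more dangerous outcome, because nothing tells you the data is corrupt.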

:::warning Always specify encoding explicitly

# Wrong - the encoding used depends on the platform and locale:
with open("config.json") as f:
    data = f.read()

# Correct - always specify:
with open("config.json", encoding="utf-8") as f:
    data = f.read()

Make this a team rule. Add a linter rule (W1514 in pylint) to enforce it.
:::

Common Encodings

Encoding	Use Case
`utf-8`	Universal. Use this for everything you control.
`utf-8-sig`	UTF-8 with BOM. Needed for Excel CSV compatibility.
`utf-16`	Windows APIs, some legacy formats.
`latin-1`	Western European legacy files (ISO-8859-1).
`cp1252`	Windows Western European. Matches latin-1 except in bytes 0x80-0x9F, where it adds printable characters such as € and curly quotes.
`ascii`	7-bit only. Fails on any non-ASCII character.
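
A quick way to get a feel for the table is to encode the same string with each codec and compare the byte counts. This is a small sketch using the "café" example from earlier:

```python
text = "café"

# Byte length of the same 4-character string under each encoding.
sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-8-sig", "utf-16", "latin-1", "cp1252")
}
print(sizes)
# {'utf-8': 5, 'utf-8-sig': 8, 'utf-16': 10, 'latin-1': 4, 'cp1252': 4}

# ascii cannot represent é at all:
try:
    text.encode("ascii")
    ascii_outcome = "encoded"
except UnicodeEncodeError:
    ascii_outcome = "UnicodeEncodeError"

# utf-8-sig strips the BOM on decode; plain utf-8 leaves it in as U+FEFF.
with_bom = text.encode("utf-8-sig")
clean = with_bom.decode("utf-8-sig")
leaky = with_bom.decode("utf-8")
print(clean == text, leaky.startswith("\ufeff"))   # True True
```

The extra 3 bytes in utf-8-sig and 2 bytes in utf-16 are the byte order mark. Reading a BOM-prefixed file as plain utf-8 leaves an invisible U+FEFF at the start of the first line, which is a frequent cause of "the first column name doesn't match" bugs in CSV handling.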

UnicodeDecodeError - Diagnosis and Fix

# Reading a latin-1 file as UTF-8 causes a decode error
try:
    with open("legacy_data.txt", "r", encoding="utf-8") as f:
        content = f.read()
except UnicodeDecodeError as e:
    print(e)
    # 'utf-8' codec can't decode byte 0xe9 in position 42: ...

# Fix 1: use the correct encoding
with open("legacy_data.txt", "r", encoding="latin-1") as f:
    content = f.read()

# Fix 2: detect encoding automatically (requires chardet)
import chardet
with open("legacy_data.txt", "rb") as f:
    raw = f.read(10000)   # read a sample
    detected = chardet.detect(raw)
    print(detected)   # {'encoding': 'ISO-8859-1', 'confidence': 0.73}

with open("legacy_data.txt", "r", encoding=detected["encoding"]) as f:
    content = f.read()
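
If adding chardet is not an option, one standard-library alternative is to try a short list of candidate encodings in order. The sketch below is an assumption-laden illustration, not a detection algorithm: read_text_with_fallback is a hypothetical helper, the candidate order is a guess that suits Western European data, and it reads the whole file, so it is only for files that fit in memory.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each candidate encoding in order.

    latin-1 maps every byte to a character, so as the last candidate it
    never fails; it may still produce the wrong characters.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            # bytes.decode() does no newline translation, unlike text mode
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings!r} could decode {path!r}")

# Demo: a legacy file written as latin-1.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("café".encode("latin-1"))   # b'caf\xe9' is not valid UTF-8
    legacy_path = tmp.name

text, used = read_text_with_fallback(legacy_path)
os.unlink(legacy_path)
print(text, used)   # café cp1252
```

Trying utf-8 first is deliberate: UTF-8 is strict enough that non-UTF-8 data almost always fails to decode, so a successful decode is strong evidence. The single-byte codecs accept nearly anything and can only be a best guess.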

The `errors` Parameter

When a byte sequence cannot be decoded, Python raises UnicodeDecodeError by default. The errors parameter controls this:

# 'strict' (default) - raise UnicodeDecodeError
with open("file.txt", "r", encoding="utf-8", errors="strict") as f:
    content = f.read()   # raises on bad bytes

# 'ignore' - silently skip undecoded bytes
with open("file.txt", "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()   # bad bytes disappear; data may be corrupted silently

# 'replace' - replace bad bytes with the Unicode replacement char (U+FFFD: )
with open("file.txt", "r", encoding="utf-8", errors="replace") as f:
    content = f.read()   # bad bytes become  in the output

# 'backslashreplace' - represent bad bytes as \xNN escape sequences
with open("file.txt", "r", encoding="utf-8", errors="backslashreplace") as f:
    content = f.read()   # bad byte 0xe9 becomes the string \xe9

# 'surrogateescape' - advanced: decode each bad byte to a lone surrogate (U+DC80-U+DCFF)
# Python uses this for file names and os.environ; re-encoding with the same handler restores the original bytes
with open("file.txt", "r", encoding="utf-8", errors="surrogateescape") as f:
    content = f.read()

:::tip Production recommendation
For parsing unknown-origin files (web scraping, uploaded files), use errors="replace" or errors="ignore". For data pipelines where silent data loss is dangerous, use errors="strict" and catch UnicodeDecodeError explicitly to log the problematic file.
:::
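
The error handlers behave the same way in bytes.decode() as in open(), so you can compare them on a single bad byte without creating a file. Here is a small sketch:

```python
bad = b"caf\xe9"   # "café" encoded as latin-1, which is not valid UTF-8

results = {
    handler: bad.decode("utf-8", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace", "surrogateescape")
}

for handler, text in results.items():
    # ascii() keeps the output printable even for the lone surrogate
    print(f"{handler:18} {ascii(text)}")
# ignore             'caf'
# replace            'caf\ufffd'
# backslashreplace   'caf\\xe9'
# surrogateescape    'caf\udce9'

# Only surrogateescape is reversible: the original byte comes back.
round_trip = results["surrogateescape"].encode("utf-8", errors="surrogateescape")
print(round_trip == bad)   # True
```

With ignore and replace, the information about which byte was bad is gone for good. With backslashreplace it is still readable by a human, and with surrogateescape it is recoverable by code.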

Part 5 - Reading Methods in Depth

`read(size=-1)` - Read All, or Up to N Characters (Bytes in Binary Mode)

with open("config.txt", "r", encoding="utf-8") as f:
    # Read entire file into one string
    all_text = f.read()      # returns str
    print(type(all_text))    # <class 'str'>

    # After reading to the end, position is at EOF
    print(f.read())          # '' - empty string, not an error

    # Seek back to beginning
    f.seek(0)
    chunk = f.read(100)      # read up to 100 characters (fewer only at EOF)

# Binary read - returns bytes
with open("image.png", "rb") as f:
    header = f.read(8)       # PNG signature: first 8 bytes
    print(header)            # b'\x89PNG\r\n\x1a\n'

:::warning f.read() loads the entire file into RAM
For a 10 GB log file, f.read() needs at least 10 GB of free memory. On a typical server with 8 GB RAM, the read usually fails with MemoryError. Use readline() or iteration for large files.
:::

`readline(size=-1)` - One Line at a Time

with open("server.log", "r", encoding="utf-8") as f:
    # Read lines one at a time - memory-efficient
    while True:
        line = f.readline()
        if not line:          # empty string = EOF (not '\n'!)
            break
        process(line.rstrip("\n"))

The distinction between an empty line and EOF:

File content: "line1\nline2\n\nline4\n"

readline() calls:
  1st call → "line1\n"    ← line1 with newline
  2nd call → "line2\n"    ← line2 with newline
  3rd call → "\n"         ← empty line (just a newline)
  4th call → "line4\n"    ← line4 with newline
  5th call → ""           ← empty string = EOF

`readlines()` - Read All Lines into a List

with open("config.ini", "r", encoding="utf-8") as f:
    lines = f.readlines()    # returns list of str, each with \n
    print(lines)
    # ['[database]\n', 'host=localhost\n', 'port=5432\n']

# Strip newlines
lines = [line.rstrip("\n") for line in lines]

:::note readlines() vs list(f)
f.readlines() and list(f) both produce a list of all lines. They are equivalent in behavior, but list(f) is slightly more Pythonic. Both load all lines into memory - only use them for files that fit in RAM.
:::

Iteration Over a File Object - The Best Default

# Iteration is the cleanest, most memory-efficient approach for line-by-line reading
with open("access.log", "r", encoding="utf-8") as f:
    for line in f:
        # line includes the trailing \n
        if "ERROR" in line:
            print(line.rstrip())

File objects are iterators - each __next__() call reads one line via the buffer. This is:

Memory efficient: only one line in RAM at a time
Faster than readline() in a while loop (less Python call overhead)
The idiomatic Python pattern

# Practical pattern: filter and transform large log files
def parse_errors(log_path):
    """Yield (timestamp, message) tuples for ERROR lines."""
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip()
            if not line.startswith("[ERROR]"):
                continue
            # "[ERROR] 2024-01-15 14:32:01 - Database connection failed"
            parts = line.split(" ", 3)
            timestamp = f"{parts[1]} {parts[2]}"
            message = parts[3].lstrip("- ")
            yield timestamp, message

for ts, msg in parse_errors("/var/log/app.log"):
    print(f"{ts}: {msg}")

Part 6 - File Object Internals: Buffered I/O

What Buffering Does

Without buffering, every readline() call would issue a system call to the OS - a context switch from user space to kernel space. For a file with 1 million lines, that is 1 million system calls.

With buffering, Python reads a large chunk (default: 8 KB) into RAM in one system call, then serves individual reads from that in-memory buffer. One 8,192-byte refill holds about 400 lines of 20 bytes each, or about 100 lines of 80 bytes, so system calls drop by roughly two orders of magnitude for typical line lengths.

Without buffering:
  Python → OS → disk → OS → Python → OS → disk → OS → Python ...
  (one system call per read)

With buffering (8 KB chunks):
  Python → OS → disk → OS → Python (8 KB buffer filled)
  serve line 1 from buffer  (no system call)
  serve line 2 from buffer  (no system call)
  ...
  serve line 400 from buffer (no system call)
  Python → OS → disk → OS → Python (refill buffer)
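
You can observe this effect without tracing system calls by putting a counter at the bottom of the stack. In the sketch below, CountingRaw is a hypothetical stand-in for FileIO: it serves bytes from memory and counts how often the buffer layer asks it for data. On a real file, each of those requests would be a read() system call.

```python
import io

class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts readinto() calls (stand-in for FileIO)."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        self.calls += 1
        return self._inner.readinto(buffer)

# 10,000 lines of 20 bytes each (19 characters plus a newline) = 200,000 bytes
data = (b"x" * 19 + b"\n") * 10_000

raw = CountingRaw(data)
buffered = io.BufferedReader(raw, buffer_size=8192)

line_count = sum(1 for _ in buffered)   # iterate line by line through the buffer

print(f"{line_count} lines served by {raw.calls} raw reads")
```

The exact number of raw reads depends on the implementation, but it should land near 200,000 / 8,192 ≈ 25 rather than anywhere near 10,000.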

The `buffering` Parameter

# buffering=-1 (default): a heuristic picks the size, typically io.DEFAULT_BUFFER_SIZE
import io
print(io.DEFAULT_BUFFER_SIZE)       # 8192 on many builds
with open("data.txt", "r", encoding="utf-8") as f:
    print(type(f.buffer).__name__)  # BufferedReader

# buffering=0: unbuffered (binary mode only!)
# Every read() goes straight to the OS with no read-ahead - more system calls,
# sometimes used for tail -f style reading
with open("data.bin", "rb", buffering=0) as f:
    chunk = f.read(8)

# buffering=1: line buffering (text mode only, used for interactive streams)
# Flushes on every \n - used for stdout when writing to terminal

# buffering=N (N > 1): explicit buffer size in bytes
with open("data.txt", "r", buffering=65536) as f:  # 64 KB buffer
    pass

:::note Buffer size for large files
For reading very large files sequentially, a larger buffer (64 KB or 256 KB) reduces system call overhead. The default 8 KB is a good general-purpose size, but sequential large-file workloads benefit from larger values.
:::

Inspecting the IO Stack

import io

with open("test.txt", "r", encoding="utf-8") as f:
    print(type(f))          # <class '_io.TextIOWrapper'>
    print(type(f.buffer))   # <class '_io.BufferedReader'>
    print(type(f.buffer.raw))  # <class '_io.FileIO'>

    # The underlying OS file descriptor
    print(f.fileno())       # integer, e.g. 3

    # Current byte position in the file
    print(f.buffer.tell())  # 0 (at start)

Part 7 - The Context Manager: Always Use `with open()`

Why `finally` Alone Is Not Enough

Many tutorials show this "safe" pattern:

f = open("data.txt", "r")
try:
    content = f.read()
finally:
    f.close()   # always close, even if read() raises

This works. But it has two problems:

It is verbose - 5 lines instead of 2
It is easy to get subtly wrong: if open() is moved inside the try block and then raises (permissions error, file not found), f is never assigned, and f.close() in finally raises NameError, masking the original error

The context manager handles all of this:

with open("data.txt", "r", encoding="utf-8") as f:
    content = f.read()
# f is guaranteed closed here, even if read() raises an exception
# Even if the exception propagates, the file is closed before it unwinds
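
The guarantee is easy to verify. This sketch forces an exception inside the with block and then checks the file's closed attribute. The extra handle variable exists only so the file object can be inspected afterwards.

```python
import os
import tempfile

# Scratch file whose content cannot be parsed as an integer.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.write("not a number\n")
    demo_path = tmp.name

handle = None
closed_after_error = None
try:
    with open(demo_path, "r", encoding="utf-8") as f:
        handle = f
        int(f.read())          # raises ValueError inside the with block
except ValueError:
    # By the time the exception reaches this handler, __exit__ has already run.
    closed_after_error = handle.closed

os.unlink(demo_path)
print(closed_after_error)   # True
```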

What `with open()` Does Under the Hood

# This code:
with open("data.txt") as f:
    content = f.read()

# Is roughly equivalent to this simplified expansion
# (a real with statement also lets __exit__ suppress the exception):
import sys

_file = open("data.txt")
f = _file.__enter__()     # returns the file object itself
try:
    content = f.read()
except:
    _file.__exit__(*sys.exc_info())   # passes exception info
    raise
else:
    _file.__exit__(None, None, None)  # no exception

File objects implement __enter__ (returns self) and __exit__ (calls self.close()). Context managers are covered in depth in the next section.

:::danger Never leave files open
Open file descriptors are system resources. The OS limits the number of open files per process (typically 1024 on Linux). A long-running server that opens files without closing them will hit this limit and crash with OSError: [Errno 24] Too many open files.

Always use with open(). Never rely on garbage collection to close files - CPython uses reference counting and usually closes promptly, but PyPy and other implementations do not guarantee this.
:::

Part 8 - Large File Handling

Strategy 1: Line-by-Line Iteration (Most Common)

def count_errors(log_path):
    """Count ERROR lines in a log file of any size."""
    count = 0
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            if "ERROR" in line:
                count += 1
    return count

# Works on a 10 GB file - only one line in RAM at a time
errors = count_errors("/var/log/production.log")

Strategy 2: Chunked Reading for Binary or Non-Line Data

def compute_sha256(filepath):
    """Compute SHA-256 hash of a file without loading it into memory."""
    import hashlib
    hasher = hashlib.sha256()
    chunk_size = 65536   # 64 KB chunks

    with open(filepath, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)

    return hasher.hexdigest()

checksum = compute_sha256("/var/backup/database.dump")
print(checksum)   # e3b0c44298fc1c149afb...

Strategy 3: Generator Pipeline for Stream Processing

def read_lines(path, encoding="utf-8"):
    """Generator that yields lines from a file."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")

def filter_errors(lines):
    """Generator that yields only error lines."""
    for line in lines:
        if line.startswith("[ERROR]"):
            yield line

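def parse_timestamp(lines):
    """Generator that yields (timestamp, message) tuples."""
    for line in lines:
        # "[ERROR] 2024-01-15 14:32:01 - message" → level, date, time, rest
        parts = line.split(" ", 3)
        if len(parts) < 3:
            continue   # skip lines without a full timestamp
        message = parts[3].lstrip("- ") if len(parts) > 3 else ""
        yield f"{parts[1]} {parts[2]}", message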

# Compose the pipeline - zero extra memory, pure generator chain
pipeline = parse_timestamp(filter_errors(read_lines("/var/log/app.log")))

for timestamp, message in pipeline:
    print(f"  {timestamp}: {message}")

This is the generator pipeline pattern - each stage processes one item at a time. No intermediate lists are created. Memory usage is O(1) regardless of file size.

Strategy 4: Memory-Mapped Files with `mmap`

For random access into large files (like a binary index file), memory-mapping is the correct tool:

import mmap

def find_in_large_file(filepath, search_bytes):
    """
    Find all byte offsets of search_bytes in a large file
    using memory-mapped I/O for efficient random access.
    """
    offsets = []
    with open(filepath, "rb") as f:
        # Map the entire file into virtual address space
        # The OS loads pages on demand - never all in RAM
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = 0
            while True:
                pos = mm.find(search_bytes, offset)
                if pos == -1:
                    break
                offsets.append(pos)
                offset = pos + 1
    return offsets

# Find all occurrences of b"ERROR" in a 5 GB binary log
positions = find_in_large_file("/data/app.bin", b"ERROR")
print(f"Found {len(positions)} occurrences")

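Result: the file's contents are paged in on demand instead of being copied into Python memory, and random access is fast. The only thing that grows is the offsets list, in proportion to the number of matches.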
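
The search above scans the file front to back. For the random-access case mentioned at the start of this strategy, a sketch might look like the following. The record layout (a 4-byte id followed by an 8-byte offset) and the read_record helper are assumptions made up for the example, not a real index format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: little-endian uint32 id + uint64 offset (12 bytes)
RECORD = struct.Struct("<IQ")

def read_record(mm, index):
    """Unpack record number `index` directly from the mapped file."""
    return RECORD.unpack_from(mm, index * RECORD.size)

# Build a small index file with 1,000 records.
with tempfile.NamedTemporaryFile("wb", suffix=".idx", delete=False) as tmp:
    for i in range(1000):
        tmp.write(RECORD.pack(i, i * 4096))
    index_path = tmp.name

with open(index_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        record_count = len(mm) // RECORD.size
        record_500 = read_record(mm, 500)            # jump straight to record 500
        last_record = read_record(mm, record_count - 1)

os.unlink(index_path)

print(record_count)   # 1000
print(record_500)     # (500, 2048000)
print(last_record)    # (999, 4091904)
```

Because every record has the same size, finding record N is a multiplication instead of a scan, and the OS only has to page in the part of the file that holds it. One caveat: mmap cannot map an empty file, so check the size first if that case is possible.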

Part 9 - Real-World Patterns

Pattern 1: Reading a Config File Safely

from pathlib import Path

def load_config(path):
    """Load a key=value config file into a dict."""
    config = {}
    config_path = Path(path)

    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {path}")

    with open(config_path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line or line.startswith("#"):  # skip empty lines and comments
                continue
            if "=" not in line:
                raise ValueError(f"Line {lineno}: expected 'key=value', got: {line!r}")
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()

    return config

# Usage
config = load_config("/etc/myapp/settings.conf")
db_host = config.get("DB_HOST", "localhost")

Pattern 2: Streaming CSV Parsing Without Pandas

import csv

def stream_large_csv(filepath, batch_size=1000):
    """
    Read a large CSV file in batches.
    Yields lists of dicts, batch_size rows at a time.
    """
    with open(filepath, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch   # last partial batch

# Process 5 million rows with constant memory
for batch in stream_large_csv("/data/transactions.csv"):
    # batch is a list of dicts, max 1000 rows
    process_batch(batch)

Pattern 3: Tail a Growing Log File

import time

def tail_file(filepath, interval=0.5):
    """
    Yield new lines as they are appended to a file (like Unix tail -f).
    """
    with open(filepath, "r", encoding="utf-8") as f:
        f.seek(0, 2)   # seek to end of file (SEEK_END = 2)
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(interval)   # no new data, wait

# Monitor a log file in real time
for new_line in tail_file("/var/log/app.log"):
    if "CRITICAL" in new_line:
        send_alert(new_line)

Interview Questions

Q1: What is the difference between text mode and binary mode in Python file I/O?

Answer: In text mode ('r', 'w', 'a'), Python applies encoding/decoding (converting between bytes on disk and str in Python) and performs newline translation (collapsing \r\n or \r to \n on read, and expanding \n to the platform newline on write). In binary mode ('rb', 'wb'), no encoding or newline translation occurs - you receive and send raw bytes objects. Binary mode is required for non-text files (images, compressed data, serialized objects) and any situation where byte-for-byte fidelity matters.

Q2: What causes `UnicodeDecodeError` when reading a file, and how do you fix it?

Answer: UnicodeDecodeError occurs when a file contains byte sequences that are invalid under the specified encoding. The most common cause is reading a file created with a legacy encoding (Latin-1, Windows-1252) while specifying encoding="utf-8". Fixes: (1) identify the correct encoding using chardet.detect() and specify it; (2) use errors="replace" to substitute bad bytes with U+FFFD; (3) use errors="ignore" to discard bad bytes (risks data loss); (4) use errors="backslashreplace" to represent bad bytes as escape sequences for debugging.

Q3: Why should you always specify `encoding=` when calling `open()` in text mode?

Answer: If encoding is omitted, Python uses the locale-dependent default from locale.getpreferredencoding(False). This default varies by OS and system configuration - it is UTF-8 on modern Linux/macOS but may be cp1252 on Windows or another encoding on differently-configured systems. Code that relies on the default encoding works on one developer's machine but silently produces corrupt output or crashes on another's. Always specify encoding="utf-8" (or the specific encoding you need) to make file I/O deterministic across environments.

Q4: What is the difference between `f.read()`, `f.readline()`, `f.readlines()`, and iterating over `f`?

Answer:

f.read() reads the entire file into one str object - uses maximum memory, fast for small files
f.readline() reads one line per call - memory efficient, useful when you need to mix reading and other logic
f.readlines() reads all lines into a list of str - uses the same memory as f.read() plus list overhead, convenient for random access to lines
Iterating (for line in f) calls __next__() which reads one line per iteration using the buffer - most memory-efficient, idiomatic, and usually fastest for line-by-line processing

For large files, always prefer iteration or readline(). For small files (configuration, small data), read() or readlines() are fine.

Q5: Explain Python's buffered I/O stack and why `buffering=0` only works in binary mode.

Answer: Python's I/O stack has three layers: FileIO (raw OS system calls), BufferedReader/BufferedWriter (in-memory chunk buffer), and TextIOWrapper (encoding/decoding + newline translation). buffering=0 requests unbuffered I/O, bypassing the BufferedReader layer. In binary mode, FileIO alone is sufficient - you get raw bytes directly from OS calls. In text mode, TextIOWrapper requires a buffered source because it needs to read ahead for multi-byte character boundaries. Python raises ValueError: can't have unbuffered text I/O if you attempt buffering=0 with text mode.

Q6: How would you process a 20 GB CSV file on a machine with 4 GB of RAM? Walk through your approach.

Answer: I would use a generator pipeline to stream the file:

Open the file with open(path, "r", newline="", encoding="utf-8") and wrap it in csv.DictReader
Iterate over the reader row by row - each iteration reads one line from the buffer
Process each row immediately or accumulate into small batches (1,000-10,000 rows) for batch DB inserts
If random access is needed (e.g., binary index file), use mmap to memory-map the file - the OS pages in only the accessed portions
If aggregation is needed (sums, counts), maintain running accumulators rather than storing all rows

The key principle: never call f.read() or f.readlines() on unknown-size files. Always stream. Memory usage should be O(batch_size), not O(file_size).
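
The running-accumulator idea from the answer above can be sketched as follows. The column names region and amount, and the helper total_by_key, are hypothetical. An in-memory io.StringIO stands in for the 20 GB file so the example is runnable.

```python
import csv
import io
from collections import defaultdict

def total_by_key(rows, key_field, value_field):
    """Sum value_field per key_field while streaming over rows.

    Memory grows with the number of distinct keys, not the number of rows.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)

# Stand-in for open(path, "r", newline="", encoding="utf-8")
sample = io.StringIO("region,amount\neu,10.5\nus,4\neu,2.5\n", newline="")

totals = total_by_key(csv.DictReader(sample), "region", "amount")
print(totals)   # {'eu': 13.0, 'us': 4.0}
```

Each row is discarded as soon as it has been added to its running total, so the same function works unchanged whether the reader wraps three rows or three hundred million.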

Practice Challenges

Beginner - Read and Filter

Write a function find_lines(filepath, keyword) that reads a text file and returns a list of all lines containing the keyword (case-insensitive).

# Expected behavior:
# File contents:
#   "Alice bought apples\n"
#   "Bob bought Bananas\n"
#   "Alice sold oranges\n"
#
# find_lines("purchases.txt", "alice") → ["Alice bought apples", "Alice sold oranges"]

Solution

def find_lines(filepath, keyword):
    """
    Return a list of lines containing keyword (case-insensitive).
    Lines are returned without trailing newlines.
    """
    keyword_lower = keyword.lower()
    results = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            stripped = line.rstrip("\n")
            if keyword_lower in stripped.lower():
                results.append(stripped)
    return results

# Test it
import tempfile, os

content = "Alice bought apples\nBob bought Bananas\nAlice sold oranges\n"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(content)
    tmp_path = tmp.name

result = find_lines(tmp_path, "alice")
print(result)   # ['Alice bought apples', 'Alice sold oranges']

os.unlink(tmp_path)

Key points:

Use with open() to ensure the file is always closed
Specify encoding="utf-8" explicitly
Use line.rstrip("\n") not line.strip() - strip() removes leading spaces too
Lower-case both sides for case-insensitive comparison

Intermediate - Word Frequency Counter

Write a memory-efficient function word_frequency(filepath) that returns a dict mapping each word to its frequency in a large text file. The file might be too large to fit in RAM.

Requirements:

Lowercase all words
Strip punctuation from words
Use streaming (line-by-line) - do not call f.read()

Solution

import string
from collections import Counter

def word_frequency(filepath):
    """
    Count word frequencies in a file using line-by-line streaming.
    Returns a Counter (dict subclass) mapping word → count.
    Works on files larger than available RAM.
    """
    counts = Counter()
    translator = str.maketrans("", "", string.punctuation)

    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            # Remove punctuation and lowercase
            cleaned = line.translate(translator).lower()
            words = cleaned.split()
            counts.update(words)

    return counts


# Test
import tempfile, os

text = """
Python is great. Python is fast!
Python's ecosystem is huge, and Python is fun.
"""

with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(text)
    tmp_path = tmp.name

freq = word_frequency(tmp_path)
print(freq.most_common(5))
# [('is', 4), ('python', 3), ('great', 1), ('fast', 1), ('pythons', 1)]
# Note: "Python's" is counted as "pythons" once the apostrophe is stripped

os.unlink(tmp_path)

Design notes:

Counter.update() accepts an iterable and increments counts - O(words per line) memory per call
str.maketrans with punctuation removal is faster than regex for this use case
errors="replace" handles corrupted bytes gracefully in real-world files
The function uses O(unique_words) memory - much less than O(total_words)

Advanced - Streaming Log Aggregator

Build a streaming log aggregator that reads a large log file and produces a summary report without loading the full file into memory.

Log format:

[LEVEL] YYYY-MM-DD HH:MM:SS - message

Requirements:

Count occurrences by level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
Find the first and last timestamp in the file
Collect the last 10 ERROR/CRITICAL messages
Work on a file of any size using constant memory (aside from the last-10 buffer)

Solution

from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogSummary:
    level_counts: dict = field(default_factory=lambda: defaultdict(int))
    first_timestamp: Optional[str] = None
    last_timestamp: Optional[str] = None
    recent_errors: deque = field(default_factory=lambda: deque(maxlen=10))
    total_lines: int = 0
    parse_errors: int = 0


def analyze_log(filepath):
    """
    Stream a log file and produce a summary report.
    Memory usage: O(1) + O(last_10_errors) regardless of file size.
    """
    summary = LogSummary()

    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            summary.total_lines += 1
            line = line.rstrip()
            if not line:
                continue

            # Parse: "[ERROR] 2024-01-15 14:32:01 - message"
            if not line.startswith("["):
                summary.parse_errors += 1
                continue

            try:
                bracket_end = line.index("]")
                level = line[1:bracket_end]
                rest = line[bracket_end + 2:]  # skip "] "

                # rest: "2024-01-15 14:32:01 - message"
                parts = rest.split(" ", 3)
                if len(parts) < 3:
                    summary.parse_errors += 1
                    continue

                timestamp = f"{parts[0]} {parts[1]}"

            except (ValueError, IndexError):
                summary.parse_errors += 1
                continue

            # Update counts
            summary.level_counts[level] += 1

            # Track first/last timestamp
            if summary.first_timestamp is None:
                summary.first_timestamp = timestamp
            summary.last_timestamp = timestamp

            # Collect recent errors (deque with maxlen=10 auto-evicts old entries)
            if level in ("ERROR", "CRITICAL"):
                message = parts[3].lstrip("- ") if len(parts) > 3 else ""
                summary.recent_errors.append(f"[{level}] {timestamp}: {message}")

    return summary


def print_report(summary):
    print(f"Total lines processed: {summary.total_lines:,}")
    print(f"Parse errors: {summary.parse_errors}")
    print(f"Time range: {summary.first_timestamp} → {summary.last_timestamp}")
    print()
    print("Level breakdown:")
    for level in ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]:
        count = summary.level_counts.get(level, 0)
        bar = "#" * min(count, 50)
        print(f"  {level:10} {count:8,}  {bar}")
    print()
    print("Recent errors (last 10):")
    for err in summary.recent_errors:
        print(f"  {err}")


# Demo with generated test data
import tempfile, os, random

def generate_test_log(path, num_lines=10000):
    levels = ["DEBUG"] * 50 + ["INFO"] * 30 + ["WARNING"] * 10 + ["ERROR"] * 8 + ["CRITICAL"] * 2
    with open(path, "w", encoding="utf-8") as f:
        for i in range(num_lines):
            level = random.choice(levels)
            f.write(f"[{level}] 2024-01-15 {i//3600:02d}:{(i%3600)//60:02d}:{i%60:02d} - Event {i}\n")

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp_path = tmp.name

generate_test_log(tmp_path, 10000)
summary = analyze_log(tmp_path)
print_report(summary)
os.unlink(tmp_path)

Output (approximate; levels are chosen at random, so counts vary between runs, and each bar is capped at 50 characters):

Total lines processed: 10,000
Parse errors: 0
Time range: 2024-01-15 00:00:00 → 2024-01-15 02:46:39

Level breakdown:
  CRITICAL        200  ##################################################
  ERROR           800  ##################################################
  WARNING       1,000  ##################################################
  INFO          3,000  ##################################################
  DEBUG         5,000  ##################################################

Recent errors (last 10):
  [CRITICAL] 2024-01-15 02:46:37: Event 9997
  ...

Design highlights:

deque(maxlen=10) automatically discards old entries - O(10) memory always
defaultdict(int) is cleaner than checking key existence
errors="replace" handles corrupted log files gracefully
The function is a single pass through the file - O(1) memory
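
The two collection types named above can be tried on their own. This is a minimal sketch with made-up events and a window of 3 instead of 10, just to make the eviction visible; the variable names are illustrative and not part of the solution.

```python
from collections import defaultdict, deque

recent = deque(maxlen=3)          # keeps only the 3 newest items
level_counts = defaultdict(int)   # missing keys start at 0

# Hypothetical (level, message) pairs standing in for parsed log lines.
events = [("INFO", "a"), ("ERROR", "b"), ("ERROR", "c"),
          ("CRITICAL", "d"), ("ERROR", "e")]

for level, message in events:
    level_counts[level] += 1
    if level in ("ERROR", "CRITICAL"):
        recent.append(message)    # appending a 4th item evicts the oldest

print(list(recent))          # ['c', 'd', 'e']
print(dict(level_counts))    # {'INFO': 1, 'ERROR': 3, 'CRITICAL': 1}
```

Because the deque never grows past maxlen, the memory used for recent errors stays fixed no matter how many errors the file contains.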

Quick Reference

Task	Code	Notes
Open text file for reading	`open(path, "r", encoding="utf-8")`	Always specify encoding
Open binary file for reading	`open(path, "rb")`	No encoding in binary mode
Read entire file	`f.read()`	Loads all into RAM - small files only
Read N characters	`f.read(1024)`	Returns up to 1024 chars
Read one line	`f.readline()`	Includes `\n`; `""` at EOF
Read all lines as list	`f.readlines()`	Each item includes `\n`
Iterate line by line	`for line in f:`	Most efficient, idiomatic
Replace bad bytes	`errors="replace"`	Substitutes U+FFFD for each undecodable sequence
Drop bad bytes	`errors="ignore"`	Silent data loss risk
Preserve raw newlines	`newline=""`	Required for CSV module
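Get file position	`f.tell()`	Byte offset in binary mode; opaque number in text mode
Move file position	`f.seek(offset, whence)`	whence: 0=start, 1=current, 2=end (nonzero relative seeks need binary mode)
Memory-map large file	`mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)`	Random access without full RAM load; open the file with `"rb"`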
Detect encoding	`chardet.detect(raw_bytes)`	Third-party library
Read in 64 KB chunks	`f.read(65536)` in while loop	For binary or non-line data
Open with larger buffer	`buffering=65536`	Reduces system calls for large files
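
The memory-map row is the one most often gotten wrong in practice, so here is a small sketch of it. It creates its own sample file (hypothetical data, for illustration only) and passes access=mmap.ACCESS_READ, since the default access requests a writable mapping and typically fails on a file opened with "rb".

```python
import mmap
import os
import tempfile

# Create a small sample file (hypothetical data, for illustration only).
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"[INFO] start\n[ERROR] disk full\n[INFO] done\n")
    sample_path = tmp.name

with open(sample_path, "rb") as f:
    # ACCESS_READ matches the read-only file handle.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"[ERROR]")        # byte offset of first match, -1 if absent
        line_end = mm.find(b"\n", pos)   # end of that line
        first_error = mm[pos:line_end].decode("utf-8")

os.unlink(sample_path)
print(pos, first_error)   # 13 [ERROR] disk full
```

Only the pages that are actually touched get loaded, which is why this suits searching or random access in files larger than RAM. Note that mmap cannot map an empty file, so real code should check the file size first.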

Key Takeaways

Python's open() creates a three-layer stack: TextIOWrapper (encoding), BufferedReader (chunking), and FileIO (OS calls). Understanding this stack explains most file I/O behavior.
Always specify encoding="utf-8" (or the correct encoding) in text mode. Platform defaults vary and cause subtle cross-environment bugs.
Text mode translates newlines; binary mode preserves raw bytes. Use binary mode for anything that is not human-readable text.
f.read() loads the entire file into RAM. For files of unknown or large size, use for line in f: iteration instead - it is memory-efficient and idiomatic.
The errors parameter controls UnicodeDecodeError handling: "strict" (default, raises), "replace" (substitutes U+FFFD), "ignore" (drops bad bytes).
Always use with open() as f: - it guarantees the file is closed even when exceptions occur, preventing file descriptor leaks.
For very large files, prefer generator pipelines (O(1) memory), chunked reads for binary data, or mmap for random access patterns.
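
To make the errors takeaway concrete, the following sketch reads the same file three ways. The file contents are invented for the demonstration: the byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.

```python
import os
import tempfile

# b"caf\xe9" is "café" in Latin-1; the lone 0xE9 byte is invalid UTF-8.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"caf\xe9\n")
    bad_path = tmp.name

results = {}
for handler in ("replace", "ignore"):
    with open(bad_path, "r", encoding="utf-8", errors=handler) as f:
        results[handler] = f.read()

try:
    # errors="strict" is the default, so the bad byte raises.
    with open(bad_path, "r", encoding="utf-8") as f:
        f.read()
    results["strict"] = "no error"
except UnicodeDecodeError:
    results["strict"] = "UnicodeDecodeError"

os.unlink(bad_path)
print(results)
# {'replace': 'caf\ufffd\n', 'ignore': 'caf\n', 'strict': 'UnicodeDecodeError'}
```

"replace" leaves a visible marker where data was lost, while "ignore" leaves no trace, which is why "replace" is usually the safer choice for logs and other diagnostics.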

