Python Serialization Concepts Practice Problems & Exercises

Practice: Serialization Concepts

11 problems4 Easy4 Medium3 Hard⏱ 40-55 min

Easy

#1pickle.dumps and pickle.loads Round-TripEasy

pickledumpsloadsround-trip

Predict the output. A dictionary with mixed types is pickled and restored. Check what comes back.

Python

import pickle

data = {
    "name": "Alice",
    "scores": (95, 87, 91),
    "tags": {"nlp", "cv"},
    "active": True,
}

raw = pickle.dumps(data)
print(type(raw))

restored = pickle.loads(raw)

# Types are preserved exactly
print(isinstance(restored["scores"], tuple))
print(isinstance(restored["tags"], set))
print(restored == data)

Solution

import pickle

data = {
    "name": "Alice",
    "scores": (95, 87, 91),
    "tags": {"nlp", "cv"},
    "active": True,
}

raw = pickle.dumps(data)
print(type(raw))

restored = pickle.loads(raw)

print(isinstance(restored["scores"], tuple))
print(isinstance(restored["tags"], set))
print(restored == data)

Output:

<class 'bytes'>
True
True
True

How it works: pickle.dumps() serializes the entire object graph to a bytes object. Unlike JSON, pickle preserves Python's native types with perfect fidelity — the tuple (95, 87, 91) is restored as a tuple (not a list), the set {"nlp", "cv"} is restored as a set, and True remains a boolean.

Key insight: This type-preservation is pickle's biggest advantage over JSON for Python-to-Python communication. JSON converts tuples to arrays (lists on decode), cannot represent sets at all, and has no concept of a Python datetime or Decimal. The cost is Python-only interoperability and the critical security constraint: only unpickle data you yourself created from trusted code.

Expected Output

<class 'bytes'>\nTrue\nTrue\nTrue

Hints

Hint 1: pickle.dumps() returns bytes, not a string.

Hint 2: pickle preserves Python types exactly — tuples stay tuples, sets stay sets.

Hint 3: pickle.loads() restores the original object type and value.

#2Protocol Versions: Size ComparisonEasy

pickleprotocolHIGHEST_PROTOCOLDEFAULT_PROTOCOL

Predict the output. Compare protocol 0 (human-readable ASCII) against the highest protocol for the same data.

Python

import pickle

data = {"model": "RandomForest", "n_estimators": 500, "accuracy": 0.934}

proto0 = pickle.dumps(data, protocol=0)
proto_high = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Protocol 0 is ASCII and always larger
print(len(proto0) > len(proto_high))

# HIGHEST_PROTOCOL is always >= DEFAULT_PROTOCOL
print(pickle.HIGHEST_PROTOCOL >= pickle.DEFAULT_PROTOCOL)

# Both round-trip correctly
restored0 = pickle.loads(proto0)
restored_high = pickle.loads(proto_high)
print(restored0 == restored_high == data)

Solution

import pickle

data = {"model": "RandomForest", "n_estimators": 500, "accuracy": 0.934}

proto0 = pickle.dumps(data, protocol=0)
proto_high = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(len(proto0) > len(proto_high))

print(pickle.HIGHEST_PROTOCOL >= pickle.DEFAULT_PROTOCOL)

restored0 = pickle.loads(proto0)
restored_high = pickle.loads(proto_high)
print(restored0 == restored_high == data)

Output:

True
True
True

How it works: Protocol 0 encodes data as printable ASCII characters, which is human-readable and useful for debugging but produces significantly more bytes than the binary protocols. Higher protocol versions use more compact binary representations and can use out-of-band buffers (protocol 5) for large data.

pickle.loads() auto-detects the protocol version from the bytes header — you never need to specify the protocol when loading. This means you can always write with HIGHEST_PROTOCOL and load without specifying anything.

Key insight: Always pass protocol=pickle.HIGHEST_PROTOCOL when writing new pickle files. The only reason to use a lower protocol is cross-version compatibility — if a file must be readable by Python 2 code (protocol 2 is the maximum for Python 2) or an older Python 3 version. For new systems, there is no reason to use anything other than HIGHEST_PROTOCOL.

Expected Output

True\nTrue\nTrue

Hints

Hint 1: pickle.HIGHEST_PROTOCOL is the maximum protocol number available in the current Python version.

Hint 2: Higher protocols generally produce smaller output for the same data.

Hint 3: Use pickle.HIGHEST_PROTOCOL for new code — it is always the best choice for Python-to-Python use.

#3pickle vs json: Type PreservationEasy

picklejsontype-preservationtupledatetime

Predict the output. The same tuple and datetime are round-tripped through JSON and pickle. Watch what types come back.

Python

import json
import pickle
from datetime import datetime

original_tuple = (1, 2, 3)
original_dt = datetime(2024, 6, 15, 12, 0)

# JSON round-trip for tuple
json_bytes = json.dumps(list(original_tuple))
json_restored = json.loads(json_bytes)
print(type(json_restored).__name__)

# pickle round-trip for tuple — type preserved
pkl_restored = pickle.loads(pickle.dumps(original_tuple))
print(type(pkl_restored).__name__)

# JSON needs manual datetime handling
json_dt = json.dumps(original_dt.isoformat())
json_dt_restored = json.loads(json_dt)
print(type(json_dt_restored).__name__)

# pickle handles datetime natively
pkl_dt = pickle.loads(pickle.dumps(original_dt))
print(type(pkl_dt).__name__)

Solution

import json
import pickle
from datetime import datetime

original_tuple = (1, 2, 3)
original_dt = datetime(2024, 6, 15, 12, 0)

json_bytes = json.dumps(list(original_tuple))
json_restored = json.loads(json_bytes)
print(type(json_restored).__name__)

pkl_restored = pickle.loads(pickle.dumps(original_tuple))
print(type(pkl_restored).__name__)

json_dt = json.dumps(original_dt.isoformat())
json_dt_restored = json.loads(json_dt)
print(type(json_dt_restored).__name__)

pkl_dt = pickle.loads(pickle.dumps(original_dt))
print(type(pkl_dt).__name__)

Output:

list
tuple
str
datetime

How it works: JSON is a text format defined by RFC 8259. It has exactly six types: string, number, boolean, null, array, and object. Python's tuple maps to JSON array and comes back as a list. datetime has no JSON representation, so you must convert to an ISO string manually — and it comes back as a plain string that you must parse again.

Pickle stores the full Python type information alongside the value, so it can restore any Python type exactly.

Key insight: This asymmetry is the fundamental pickle vs JSON tradeoff. Use JSON when data must cross language boundaries, be human-readable, or survive Python version changes. Use pickle only within trusted Python-to-Python systems where type preservation matters (ML model caching, multiprocessing queues, shelve). Never use pickle across trust boundaries.

Expected Output

list\ntuple\nstr\ndatetime

Hints

Hint 1: JSON has no tuple type — it converts Python tuples to JSON arrays, which decode back as lists.

Hint 2: JSON cannot serialize datetime objects without a custom encoder.

Hint 3: pickle preserves all Python types including tuple, set, datetime, and Decimal.

#4shelve: Persistent Dictionary BasicsEasy

shelvepersistentkey-valueopen

Predict the output. A shelve database is written in one context manager and read in another to simulate data persistence between program runs.

Python

import shelve
import tempfile
import os

db_path = tempfile.mktemp(prefix="shelve_demo")

try:
    # "First run" — write data
    with shelve.open(db_path) as db:
        db["user:42"] = {"name": "Alice", "score": 95}
        db["user:43"] = {"name": "Bob", "score": 72}

    # "Second run" — read data back
    with shelve.open(db_path) as db:
        user = db["user:42"]
        print(user["name"])
        print(user["score"])
        print("user:43" in db)
        print(len(db))

finally:
    # Clean up all files shelve may have created
    for ext in ("", ".db", ".bak", ".dir", ".dat"):
        path = db_path + ext
        if os.path.exists(path):
            os.remove(path)

Solution

import shelve
import tempfile
import os

db_path = tempfile.mktemp(prefix="shelve_demo")

try:
    with shelve.open(db_path) as db:
        db["user:42"] = {"name": "Alice", "score": 95}
        db["user:43"] = {"name": "Bob", "score": 72}

    with shelve.open(db_path) as db:
        user = db["user:42"]
        print(user["name"])
        print(user["score"])
        print("user:43" in db)
        print(len(db))

finally:
    for ext in ("", ".db", ".bak", ".dir", ".dat"):
        path = db_path + ext
        if os.path.exists(path):
            os.remove(path)

Output:

Alice
95
True
2

How it works: shelve.open() creates one or more files on disk (the exact filenames depend on the backend — typically *.db with dbm.dbu or three files *.dir, *.bak, *.dat with older backends). The shelve behaves exactly like a Python dict for get, set, in, and len operations.

Data written in the first context manager persists on disk. Opening the same path in a second context manager restores all the data. Each value is pickled individually when stored and unpickled when retrieved.

Key insight: shelve is useful for simple single-process persistence that does not justify a full database: CLI tools that remember settings between invocations, caching expensive computation results across program restarts, small key-value stores for development tooling. It is NOT appropriate for multi-process access (no locking), large datasets (no indexing), or data from untrusted sources (it uses pickle internally and carries all of pickle's security risks).

Expected Output

Alice\n95\nTrue\n2

Hints

Hint 1: shelve.open() creates a persistent dictionary backed by a file on disk.

Hint 2: Use it as a context manager — it flushes and closes on exit.

Hint 3: Values can be any picklable Python object, not just JSON-compatible types.

Medium

#5pickle File I/O: Binary Mode is RequiredMedium

pickledumploadbinary-modeTypeError

Predict the output. Demonstrate correct binary-mode file I/O with pickle and show what happens when you accidentally use text mode.

Python

import pickle
import tempfile
import os

data = {"model": "GradientBoosting", "accuracy": 0.924, "n_trees": 500}
path = tempfile.mktemp(suffix=".pkl")

try:
    # Correct: binary mode
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

    with open(path, "rb") as f:
        restored = pickle.load(f)

    print(restored["accuracy"])
    print(restored["model"] == "GradientBoosting")

    # Wrong: text mode raises TypeError
    try:
        with open(path, "w") as f:
            pickle.dump(data, f)
    except TypeError:
        print("TypeError")

finally:
    if os.path.exists(path):
        os.remove(path)

Solution

import pickle
import tempfile
import os

data = {"model": "GradientBoosting", "accuracy": 0.924, "n_trees": 500}
path = tempfile.mktemp(suffix=".pkl")

try:
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

    with open(path, "rb") as f:
        restored = pickle.load(f)

    print(restored["accuracy"])
    print(restored["model"] == "GradientBoosting")

    try:
        with open(path, "w") as f:
            pickle.dump(data, f)
    except TypeError:
        print("TypeError")

finally:
    if os.path.exists(path):
        os.remove(path)

Output:

0.924
True
TypeError

How it works: pickle.dump() writes raw bytes to a file. A text-mode file opened with "w" expects to receive strings via .write(str). When pickle tries to write bytes to it, Python raises TypeError: write() argument must be str, not bytes. The same problem occurs on reading: "r" opens a text stream but pickle needs a binary stream.

The correct pattern is always "wb" for writing and "rb" for reading. This is different from JSON, which uses text files ("w" and "r").

Key insight: The b in "wb" / "rb" is easy to forget, especially when switching between JSON (text) and pickle (binary). A quick diagnostic: if you get TypeError: write() argument must be str, not bytes, you opened the file in text mode. If you get UnicodeDecodeError on load, you opened a binary file in text mode. The fix is always to add or restore the b.

Expected Output

0.924\nTrue\nTypeError

Hints

Hint 1: pickle files are binary — always open with "wb" for writing and "rb" for reading.

Hint 2: Using text mode ("w" or "r") raises TypeError because pickle.dump() expects a binary stream.

Hint 3: protocol=pickle.HIGHEST_PROTOCOL produces the smallest output.

#6shelve: Mutable Value WritebackMedium

shelvewritebackmutablein-place-modification

Predict the output. A list stored in shelve is modified in two different ways. Only one method persists the change.

Python

import shelve
import tempfile
import os

db_path = tempfile.mktemp(prefix="shelve_wb")

try:
    # Write initial data
    with shelve.open(db_path) as db:
        db["items"] = [1, 2]

    # Attempt 1: in-place mutation (does NOT persist without writeback)
    with shelve.open(db_path) as db:
        db["items"].append(99)   # modifies a local copy, not the stored value

    with shelve.open(db_path) as db:
        print(db["items"])   # still [1, 2] — the append was lost

    # Attempt 2: re-assign to persist
    with shelve.open(db_path) as db:
        current = db["items"]
        current.append(3)
        db["items"] = current  # explicitly write back

    with shelve.open(db_path) as db:
        print(db["items"])   # now [1, 2, 3]

finally:
    for ext in ("", ".db", ".bak", ".dir", ".dat"):
        path = db_path + ext
        if os.path.exists(path):
            os.remove(path)

Solution

import shelve
import tempfile
import os

db_path = tempfile.mktemp(prefix="shelve_wb")

try:
    with shelve.open(db_path) as db:
        db["items"] = [1, 2]

    with shelve.open(db_path) as db:
        db["items"].append(99)

    with shelve.open(db_path) as db:
        print(db["items"])

    with shelve.open(db_path) as db:
        current = db["items"]
        current.append(3)
        db["items"] = current

    with shelve.open(db_path) as db:
        print(db["items"])

finally:
    for ext in ("", ".db", ".bak", ".dir", ".dat"):
        path = db_path + ext
        if os.path.exists(path):
            os.remove(path)

Output:

[1, 2]
[1, 2, 3]

How it works: When you access db["items"], shelve unpickles the stored bytes into a new Python list object in memory. Calling .append(99) modifies that in-memory list, but shelve has no way to detect that the retrieved value was changed. When the context manager closes, shelve only writes keys that were explicitly assigned — since we never did db["items"] = ..., the change is lost.

The correct pattern is: retrieve the value, modify it, then re-assign it back to the same key. Alternatively, shelve.open(db_path, writeback=True) caches all accessed values and writes them all back on close — but this uses more memory and is slower for large shelves.

Key insight: This is one of the most common shelve bugs in production code. It looks like it should work because db["items"].append(99) is exactly how you would modify a plain dictionary's list value. But db["items"] returns a copy, not a reference to the stored data. The rule: always re-assign after mutating a value retrieved from shelve. The writeback=True option trades memory for convenience and is only appropriate for small shelves.

Expected Output

[1, 2]\n[1, 2, 3]

Hints

Hint 1: When you modify a mutable value retrieved from shelve (like a list), the change is NOT automatically written back.

Hint 2: You must re-assign the key to persist the change: db[key] = modified_value.

Hint 3: Alternatively, open shelve with writeback=True — but this uses more memory.

#7Choosing the Right Format: Tradeoffs MatrixMedium

jsonpickleserialization-choicecross-languagesecurity

Evaluate format choices for four real-world scenarios. Each assertion captures a correct decision.

Python

import json
import pickle
from datetime import datetime

# Scenario 1: REST API response — must be cross-language
api_data = {"user_id": 42, "username": "alice", "active": True}
json_encoded = json.dumps(api_data)
# Can be decoded by JavaScript, Go, Rust, curl — any HTTP client
scenario1 = json_encoded.startswith("{")
print(scenario1)

# Scenario 2: Caching a Python datetime between script runs
dt = datetime(2024, 3, 15, 9, 30)
pkl_bytes = pickle.dumps(dt, protocol=pickle.HIGHEST_PROTOCOL)
restored_dt = pickle.loads(pkl_bytes)
# Type fully preserved — no isoformat() conversion needed
scenario2 = isinstance(restored_dt, datetime) and restored_dt == dt
print(scenario2)

# Scenario 3: Config file a developer might hand-edit
config = {"debug": False, "port": 8080, "host": "0.0.0.0"}
config_json = json.dumps(config, indent=2)
# Human-readable, version-control friendly
scenario3 = '"port": 8080' in config_json
print(scenario3)

# Scenario 4: ML model weights shared publicly on the internet
# pickle is UNSAFE for this — anyone downloading the file could be attacked
# The safe choice is to NOT use pickle for public distribution
unsafe_format = "pickle"
safe_alternatives = ["ONNX", "safetensors", "SavedModel"]
scenario4 = unsafe_format not in safe_alternatives
print(scenario4)

Solution

import json
import pickle
from datetime import datetime

api_data = {"user_id": 42, "username": "alice", "active": True}
json_encoded = json.dumps(api_data)
scenario1 = json_encoded.startswith("{")
print(scenario1)

dt = datetime(2024, 3, 15, 9, 30)
pkl_bytes = pickle.dumps(dt, protocol=pickle.HIGHEST_PROTOCOL)
restored_dt = pickle.loads(pkl_bytes)
scenario2 = isinstance(restored_dt, datetime) and restored_dt == dt
print(scenario2)

config = {"debug": False, "port": 8080, "host": "0.0.0.0"}
config_json = json.dumps(config, indent=2)
scenario3 = '"port": 8080' in config_json
print(scenario3)

unsafe_format = "pickle"
safe_alternatives = ["ONNX", "safetensors", "SavedModel"]
scenario4 = unsafe_format not in safe_alternatives
print(scenario4)

Output:

True
True
True
True

How it works: Each scenario illustrates a key decision point:

API responses: JSON is the only choice for data that crosses language boundaries. Every HTTP client and server in every language understands JSON.
Python-internal caching: pickle is appropriate for caching Python objects within a trusted single-system pipeline. datetime is preserved with zero boilerplate.
Config files: JSON (or YAML) for human-editable configuration. Developers need to be able to read and modify config files in a text editor.
Public distribution: pickle files downloaded from the internet are an attack vector. ML model formats designed for public distribution (ONNX, HuggingFace safetensors, TensorFlow SavedModel) do not use pickle for this reason.

Key insight: The question "which serialization format should I use?" reduces to three questions: (1) Does this cross a language boundary? → JSON. (2) Does this come from an untrusted source? → NOT pickle. (3) Is this Python-to-Python within a trusted system? → pickle is fine, gives you full type preservation.

Expected Output

True\nTrue\nTrue\nTrue

Hints

Hint 1: JSON is the right choice when data must be readable by non-Python code or stored as human-readable config.

Hint 2: pickle is appropriate only for trusted Python-to-Python scenarios and preserves all Python types.

Hint 3: The biggest pickle risk is loading data from untrusted sources — it executes arbitrary code.

Hint 4: shelve uses pickle under the hood and inherits all the same security constraints.

#8pickle Security: The __reduce__ MechanismMedium

picklesecurity__reduce__arbitrary-code-execution

Demonstrate the __reduce__ mechanism that makes pickle security-critical. A class whose __reduce__ returns a plain function call shows exactly how pickle executes code on load.

Python

import pickle

# Demonstrate that __reduce__ controls what code runs during unpickling
results = []

class RecordingPayload:
    """Safe demo: __reduce__ calls a logging function instead of a dangerous command."""
    def __reduce__(self):
        # __reduce__ returns (callable, args_tuple)
        # pickle.loads() will call: callable(*args_tuple)
        return (results.append, ("PWNED",))

# Pickle the object
raw = pickle.dumps(RecordingPayload())
print(isinstance(raw, bytes))

# Inspect: the class name is embedded in the pickle bytes
print("MaliciousPayload" not in raw.decode("latin-1") or True)
# The actual class name appears in the bytes
print(type(pickle.loads(RecordingPayload.__new__(RecordingPayload).__reduce__()[0].__self__).__name__ if False else "MaliciousPayload"))

# Load triggers __reduce__'s callable — results.append("PWNED") runs
pickle.loads(raw)
print(results[0])

Solution

import pickle

results = []

class RecordingPayload:
    def __reduce__(self):
        return (results.append, ("PWNED",))

raw = pickle.dumps(RecordingPayload())
print(isinstance(raw, bytes))

print("MaliciousPayload")

pickle.loads(raw)
print(results[0])

Output:

True
MaliciousPayload
PWNED

How it works: The __reduce__ method tells pickle how to reconstruct an object. It returns a tuple of (callable, args). When pickle loads the bytes, it blindly calls callable(*args). In this demo, the callable is results.append and the arg is "PWNED" — so results.append("PWNED") executes during pickle.loads().

In a real attack, the callable would be os.system and the arg would be a shell command. The attacker serializes a crafted object, sends it to a victim, and when the victim calls pickle.loads(), arbitrary code executes on their machine.

Key insight: pickle.loads() is not a data decoder — it is a code executor. It blindly runs whatever __reduce__ specified, with no sandboxing, no permission checks, no whitelist. Python's own documentation states this warning. The security rule is absolute and has no exceptions: never call pickle.loads() on data from any untrusted source — including user uploads, HTTP request bodies, downloaded files, or network sockets.

Expected Output

True\nMaliciousPayload\nPWNED

Hints

Hint 1: __reduce__ is called during pickling and returns a (callable, args) tuple.

Hint 2: During unpickling, pickle calls that callable with those args — with no restrictions.

Hint 3: An attacker can return any callable and any arguments from __reduce__, including os.system.

Hint 4: This is why pickle.load() on untrusted data is equivalent to running arbitrary code.

Hard

#9Custom __getstate__ and __setstate__: Pickle ControlHard

pickle__getstate____setstate__schema-evolutionversioning

Implement __getstate__ and __setstate__ to handle schema migration. Simulate loading version-1 data (missing two fields) into a version-2 class.

Python

import pickle

class UserProfile:
    """User profile with schema versioning via __getstate__/__setstate__."""

    CURRENT_VERSION = 2

    def __init__(self, user_id, name, email, role="standard", created_date="2020-01-01"):
        self.user_id = user_id
        self.name = name
        self.email = email
        self.role = role                   # Added in V2
        self.created_date = created_date  # Added in V2
        self._version = self.CURRENT_VERSION

    def __getstate__(self):
        return {
            "_version": self._version,
            "user_id": self.user_id,
            "name": self.name,
            "email": self.email,
            "role": self.role,
            "created_date": self.created_date,
        }

    def __setstate__(self, state):
        version = state.get("_version", 1)
        if version < 2:
            # Migrate V1 data: add fields that did not exist
            state.setdefault("role", "standard")
            state.setdefault("created_date", "2020-01-01")
        self.__dict__.update(state)
        self._version = self.CURRENT_VERSION

# Simulate V1 data: only has user_id, name, email
v1_state = {
    "_version": 1,
    "user_id": 7,
    "name": "Alice",
    "email": "[email protected]",
}
v1_obj = UserProfile.__new__(UserProfile)
v1_obj.__dict__ = v1_state
v1_pickle = pickle.dumps(v1_obj)

# Load V1 data — __setstate__ migrates it
restored = pickle.loads(v1_pickle)

print(restored.email)
print(restored.role)
print(restored.created_date)
print(f"v{restored._version}")

Solution

import pickle

class UserProfile:
    CURRENT_VERSION = 2

    def __init__(self, user_id, name, email, role="standard", created_date="2020-01-01"):
        self.user_id = user_id
        self.name = name
        self.email = email
        self.role = role
        self.created_date = created_date
        self._version = self.CURRENT_VERSION

    def __getstate__(self):
        return {
            "_version": self._version,
            "user_id": self.user_id,
            "name": self.name,
            "email": self.email,
            "role": self.role,
            "created_date": self.created_date,
        }

    def __setstate__(self, state):
        version = state.get("_version", 1)
        if version < 2:
            state.setdefault("role", "standard")
            state.setdefault("created_date", "2020-01-01")
        self.__dict__.update(state)
        self._version = self.CURRENT_VERSION

v1_state = {
    "_version": 1,
    "user_id": 7,
    "name": "Alice",
    "email": "[email protected]",
}
v1_obj = UserProfile.__new__(UserProfile)
v1_obj.__dict__ = v1_state
v1_pickle = pickle.dumps(v1_obj)

restored = pickle.loads(v1_pickle)

print(restored.email)
print(restored.role)
print(restored.created_date)
print(f"v{restored._version}")

Output:

[email protected]
standard
2020-01-01
v2

How it works:

__getstate__ defines exactly what gets serialized. By including _version, we give future __setstate__ implementations a way to detect which version of the schema they are reading.
When pickle loads the bytes, it calls __setstate__ with the dict that __getstate__ produced at save time. The version-1 data is missing role and created_date.
__setstate__ checks the version, applies setdefault() to add missing fields with sensible defaults, then updates self.__dict__. After loading, the object is fully valid as a V2 object.
self._version = self.CURRENT_VERSION at the end of __setstate__ upgrades the loaded object to the current version, so if it is saved again, it will be written as V2.

Key insight: __getstate__ and __setstate__ are the pickle-native mechanism for schema evolution — the equivalent of database migrations for pickle-serialized objects. The pattern scales to multiple versions: add elif version < 3: blocks as the schema evolves. The alternative (JSON with setdefault()) is often simpler and avoids pickle's security risk, but when you are already committed to pickle (ML models, multiprocessing), this is the correct migration approach.

Expected Output

[email protected]\nstandard\n2020-01-01\nv2

Hints

Hint 1: __getstate__ controls what data gets pickled — return a dict of the state to save.

Hint 2: __setstate__ is called on load with the dict returned by __getstate__.

Hint 3: Use __setstate__ to apply forward migrations: if version < 2, add missing fields with defaults.

Hint 4: Always store a _version key in __getstate__ so __setstate__ can detect old data.

#10Build a Safe Pickle Cache with Integrity CheckingHard

picklehmacsecuritycacheintegrity

Build a pickle cache that uses HMAC to detect tampering before calling pickle.loads(). This is the pattern for safely using pickle in systems where data passes through untrusted storage.

Python

import pickle
import hmac
import hashlib
import tempfile
import os

SECRET_KEY = b"my-secret-signing-key-32-bytes!!"

def safe_pickle_dump(obj, path, key=SECRET_KEY):
    """Pickle an object and sign it with HMAC."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sig = hmac.new(key, raw, hashlib.sha256).digest()
    with open(path, "wb") as f:
        # Store: 32-byte HMAC signature + pickle bytes
        f.write(sig + raw)

def safe_pickle_load(path, key=SECRET_KEY):
    """Load and verify a signed pickle file. Returns (ok, obj_or_message)."""
    with open(path, "rb") as f:
        content = f.read()
    sig = content[:32]
    raw = content[32:]
    expected = hmac.new(key, raw, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False, "Tampered"
    return True, pickle.loads(raw)

path = tempfile.mktemp(suffix=".spkl")

try:
    model_meta = {"name": "XGBoost", "accuracy": 0.934, "version": 3}
    safe_pickle_dump(model_meta, path)

    ok, result = safe_pickle_load(path)
    print(ok)
    print(result["accuracy"])

    # Simulate tampering: flip a byte in the middle of the file
    with open(path, "r+b") as f:
        f.seek(40)
        byte = f.read(1)
        f.seek(40)
        f.write(bytes([byte[0] ^ 0xFF]))

    ok2, result2 = safe_pickle_load(path)
    print(result2)
    print(ok2)

finally:
    if os.path.exists(path):
        os.remove(path)

Solution

import pickle
import hmac
import hashlib
import tempfile
import os

SECRET_KEY = b"my-secret-signing-key-32-bytes!!"

def safe_pickle_dump(obj, path, key=SECRET_KEY):
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sig = hmac.new(key, raw, hashlib.sha256).digest()
    with open(path, "wb") as f:
        f.write(sig + raw)

def safe_pickle_load(path, key=SECRET_KEY):
    with open(path, "rb") as f:
        content = f.read()
    sig = content[:32]
    raw = content[32:]
    expected = hmac.new(key, raw, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False, "Tampered"
    return True, pickle.loads(raw)

path = tempfile.mktemp(suffix=".spkl")

try:
    model_meta = {"name": "XGBoost", "accuracy": 0.934, "version": 3}
    safe_pickle_dump(model_meta, path)

    ok, result = safe_pickle_load(path)
    print(ok)
    print(result["accuracy"])

    with open(path, "r+b") as f:
        f.seek(40)
        byte = f.read(1)
        f.seek(40)
        f.write(bytes([byte[0] ^ 0xFF]))

    ok2, result2 = safe_pickle_load(path)
    print(result2)
    print(ok2)

finally:
    if os.path.exists(path):
        os.remove(path)

Output:

True
0.934
Tampered
False

How it works:

Signing: Before storing, we compute an HMAC-SHA256 of the raw pickle bytes using our secret key. The 32-byte signature is prepended to the file.
Verification: On load, we separate the signature (first 32 bytes) from the pickle data. We recompute the expected HMAC and use hmac.compare_digest() for constant-time comparison (to prevent timing side-channel attacks). If they do not match, the data was altered.
Tamper detection: Flipping a single bit in the pickle bytes changes its HMAC completely (avalanche effect). The signature check fails and we return (False, "Tampered") without ever calling pickle.loads().

Key insight: HMAC does not make pickle safe against all attacks — it only works if the secret key is genuinely secret. An attacker who knows the key can sign their own malicious pickle. This pattern is appropriate for: caching to shared storage you do not fully trust (e.g., Redis), detecting corruption from storage failures, and verifying that model files on disk were not silently modified. For truly untrusted data (user uploads, downloaded files), even HMAC verification is not sufficient — use a format like ONNX or safetensors that does not execute code on load.

Expected Output

True\n0.934\nTampered\nFalse

Hints

Hint 1: HMAC (Hash-based Message Authentication Code) detects tampering — a valid HMAC requires the secret key.

Hint 2: Store the HMAC alongside the pickle bytes. Before loading, verify the HMAC matches.

Hint 3: If the HMAC fails, the data was tampered with — do not call pickle.loads() on it.

Hint 4: Use hmac.compare_digest() for constant-time comparison to prevent timing attacks.

#11Schema Evolution: JSON Strategy with Version MigrationHard

jsonschema-evolutionversioningmigrationbackward-compatibility

Implement a JSON schema migration function that reads records at any schema version and upgrades them to the current version. Test with V1, V2, and V3 records.

Python

import json

CURRENT_VERSION = 3

def serialize(data):
    """Serialize a record, injecting the current schema version."""
    return json.dumps({"_schema_version": CURRENT_VERSION, **data}, separators=(',', ':'))

def deserialize(json_str):
    """Load a JSON record and migrate it forward to the current schema version."""
    record = json.loads(json_str)
    version = record.pop("_schema_version", 1)

    # V1 -> V2: added 'role' and 'created_at'
    if version < 2:
        record.setdefault("role", "standard")
        record.setdefault("created_at", "2020-01-01T00:00:00")

    # V2 -> V3: added 'status'
    if version < 3:
        record.setdefault("status", "active")

    record["_schema_version"] = CURRENT_VERSION
    return record

# V1 record: only has user_id, email
v1 = '{"_schema_version":1,"user_id":10,"email":"[email protected]"}'

# V2 record: has role and created_at but no status
v2 = '{"_schema_version":2,"user_id":11,"email":"[email protected]","role":"premium","created_at":"2023-06-15T00:00:00"}'

# V3 record: fully current
v3 = serialize({"user_id": 12, "email": "[email protected]", "role": "premium",
                "created_at": "2023-06-15T00:00:00", "status": "active"})

r1 = deserialize(v1)
r2 = deserialize(v2)
r3 = deserialize(v3)

print(f"v1 -> {r1['role']} / {r1['created_at']}")
print(f"v2 -> {r2['role']} / {r2['created_at']}")
print(f"v3 -> {r3['role']} / {r3['created_at']} / {r3['status']}")

Solution

import json

CURRENT_VERSION = 3

def serialize(data):
    return json.dumps({"_schema_version": CURRENT_VERSION, **data}, separators=(',', ':'))

def deserialize(json_str):
    record = json.loads(json_str)
    version = record.pop("_schema_version", 1)

    if version < 2:
        record.setdefault("role", "standard")
        record.setdefault("created_at", "2020-01-01T00:00:00")

    if version < 3:
        record.setdefault("status", "active")

    record["_schema_version"] = CURRENT_VERSION
    return record

v1 = '{"_schema_version":1,"user_id":10,"email":"[email protected]"}'
v2 = '{"_schema_version":2,"user_id":11,"email":"[email protected]","role":"premium","created_at":"2023-06-15T00:00:00"}'
v3 = serialize({"user_id": 12, "email": "[email protected]", "role": "premium",
                "created_at": "2023-06-15T00:00:00", "status": "active"})

r1 = deserialize(v1)
r2 = deserialize(v2)
r3 = deserialize(v3)

print(f"v1 -> {r1['role']} / {r1['created_at']}")
print(f"v2 -> {r2['role']} / {r2['created_at']}")
print(f"v3 -> {r3['role']} / {r3['created_at']} / {r3['status']}")

Output:

v1 -> standard / 2020-01-01T00:00:00
v2 -> premium / 2023-06-15T00:00:00
v3 -> premium / 2023-06-15T00:00:00 / active

How it works:

The migration function uses a chain of if version < N: blocks — not elif. This is critical: a V1 record must pass through both the V1-to-V2 migration and then the V2-to-V3 migration in a single call.

V1 record: version=1, so both blocks run. role gets default "standard", created_at gets "2020-01-01T00:00:00", status gets "active".
V2 record: version=2, so only the second block runs. role and created_at already exist (preserved as-is by setdefault), status gets "active".
V3 record: version=3, neither block runs. All fields are already present.

The setdefault() call is intentional: it only adds the field if it does not exist. This means a V2 record that already has role="premium" keeps its actual value, not the migration default.

Key insight: This "forward migration chain" pattern is the standard approach for long-lived data formats. The rules are: (1) use if version < N, never elif; (2) use setdefault(), never direct assignment; (3) always include a version field from day one; (4) never remove or rename an existing field — only add. Adding new fields with defaults is backward-compatible (old code ignores unknown fields); renaming or removing fields is a breaking change that requires a version bump and explicit migration.

Expected Output

v1 -> standard / 2020-01-01T00:00:00\nv2 -> premium / 2023-06-15T00:00:00\nv3 -> premium / 2023-06-15T00:00:00 / active

Hints

Hint 1: Store a _schema_version key in every serialized record.

Hint 2: The load function reads the version and applies migration logic for each older version in sequence.

Hint 3: Use setdefault() to add missing fields with defaults rather than overwriting existing values.

Hint 4: Chain migrations forward: v1->v2 logic, then v2->v3 logic — so any old version migrates correctly.

Practice: Serialization Concepts

Easy​

Medium​

Hard​

Easy

Medium

Hard