Python Serialization Concepts Practice Problems & Exercises
Practice: Serialization Concepts
← Back to lessonEasy
Predict the output. A dictionary with mixed types is pickled and restored. Check what comes back.
import pickle
data = {
"name": "Alice",
"scores": (95, 87, 91),
"tags": {"nlp", "cv"},
"active": True,
}
raw = pickle.dumps(data)
print(type(raw))
restored = pickle.loads(raw)
# Types are preserved exactly
print(isinstance(restored["scores"], tuple))
print(isinstance(restored["tags"], set))
print(restored == data)Solution
import pickle
data = {
"name": "Alice",
"scores": (95, 87, 91),
"tags": {"nlp", "cv"},
"active": True,
}
raw = pickle.dumps(data)
print(type(raw))
restored = pickle.loads(raw)
print(isinstance(restored["scores"], tuple))
print(isinstance(restored["tags"], set))
print(restored == data)
Output:
<class 'bytes'>
True
True
True
How it works: pickle.dumps() serializes the entire object graph to a bytes object. Unlike JSON, pickle preserves Python's native types with perfect fidelity — the tuple (95, 87, 91) is restored as a tuple (not a list), the set {"nlp", "cv"} is restored as a set, and True remains a boolean.
Key insight: This type-preservation is pickle's biggest advantage over JSON for Python-to-Python communication. JSON converts tuples to arrays (lists on decode), cannot represent sets at all, and has no concept of a Python datetime or Decimal. The cost is Python-only interoperability and the critical security constraint: only unpickle data you yourself created from trusted code.
Expected Output
<class 'bytes'>\nTrue\nTrue\nTrueHints
Hint 1: pickle.dumps() returns bytes, not a string.
Hint 2: pickle preserves Python types exactly — tuples stay tuples, sets stay sets.
Hint 3: pickle.loads() restores the original object type and value.
Predict the output. Compare protocol 0 (human-readable ASCII) against the highest protocol for the same data.
import pickle
data = {"model": "RandomForest", "n_estimators": 500, "accuracy": 0.934}
proto0 = pickle.dumps(data, protocol=0)
proto_high = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
# Protocol 0 is ASCII and always larger
print(len(proto0) > len(proto_high))
# HIGHEST_PROTOCOL is always >= DEFAULT_PROTOCOL
print(pickle.HIGHEST_PROTOCOL >= pickle.DEFAULT_PROTOCOL)
# Both round-trip correctly
restored0 = pickle.loads(proto0)
restored_high = pickle.loads(proto_high)
print(restored0 == restored_high == data)Solution
import pickle
data = {"model": "RandomForest", "n_estimators": 500, "accuracy": 0.934}
proto0 = pickle.dumps(data, protocol=0)
proto_high = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
print(len(proto0) > len(proto_high))
print(pickle.HIGHEST_PROTOCOL >= pickle.DEFAULT_PROTOCOL)
restored0 = pickle.loads(proto0)
restored_high = pickle.loads(proto_high)
print(restored0 == restored_high == data)
Output:
True
True
True
How it works: Protocol 0 encodes data as printable ASCII characters, which is human-readable and useful for debugging but produces significantly more bytes than the binary protocols. Higher protocol versions use more compact binary representations and can use out-of-band buffers (protocol 5) for large data.
pickle.loads() auto-detects the protocol version from the bytes header — you never need to specify the protocol when loading. This means you can always write with HIGHEST_PROTOCOL and load without specifying anything.
Key insight: Always pass protocol=pickle.HIGHEST_PROTOCOL when writing new pickle files. The only reason to use a lower protocol is cross-version compatibility — if a file must be readable by Python 2 code (protocol 2 is the maximum for Python 2) or an older Python 3 version. For new systems, there is no reason to use anything other than HIGHEST_PROTOCOL.
Expected Output
True\nTrue\nTrueHints
Hint 1: pickle.HIGHEST_PROTOCOL is the maximum protocol number available in the current Python version.
Hint 2: Higher protocols generally produce smaller output for the same data.
Hint 3: Use pickle.HIGHEST_PROTOCOL for new code — it is always the best choice for Python-to-Python use.
Predict the output. The same tuple and datetime are round-tripped through JSON and pickle. Watch what types come back.
import json import pickle from datetime import datetime original_tuple = (1, 2, 3) original_dt = datetime(2024, 6, 15, 12, 0) # JSON round-trip for tuple json_bytes = json.dumps(list(original_tuple)) json_restored = json.loads(json_bytes) print(type(json_restored).__name__) # pickle round-trip for tuple — type preserved pkl_restored = pickle.loads(pickle.dumps(original_tuple)) print(type(pkl_restored).__name__) # JSON needs manual datetime handling json_dt = json.dumps(original_dt.isoformat()) json_dt_restored = json.loads(json_dt) print(type(json_dt_restored).__name__) # pickle handles datetime natively pkl_dt = pickle.loads(pickle.dumps(original_dt)) print(type(pkl_dt).__name__)
Solution
import json
import pickle
from datetime import datetime
original_tuple = (1, 2, 3)
original_dt = datetime(2024, 6, 15, 12, 0)
json_bytes = json.dumps(list(original_tuple))
json_restored = json.loads(json_bytes)
print(type(json_restored).__name__)
pkl_restored = pickle.loads(pickle.dumps(original_tuple))
print(type(pkl_restored).__name__)
json_dt = json.dumps(original_dt.isoformat())
json_dt_restored = json.loads(json_dt)
print(type(json_dt_restored).__name__)
pkl_dt = pickle.loads(pickle.dumps(original_dt))
print(type(pkl_dt).__name__)
Output:
list
tuple
str
datetime
How it works: JSON is a text format defined by RFC 8259. It has exactly six types: string, number, boolean, null, array, and object. Python's tuple maps to JSON array and comes back as a list. datetime has no JSON representation, so you must convert to an ISO string manually — and it comes back as a plain string that you must parse again.
Pickle stores the full Python type information alongside the value, so it can restore any Python type exactly.
Key insight: This asymmetry is the fundamental pickle vs JSON tradeoff. Use JSON when data must cross language boundaries, be human-readable, or survive Python version changes. Use pickle only within trusted Python-to-Python systems where type preservation matters (ML model caching, multiprocessing queues, shelve). Never use pickle across trust boundaries.
Expected Output
list\ntuple\nstr\ndatetimeHints
Hint 1: JSON has no tuple type — it converts Python tuples to JSON arrays, which decode back as lists.
Hint 2: JSON cannot serialize datetime objects without a custom encoder.
Hint 3: pickle preserves all Python types including tuple, set, datetime, and Decimal.
Predict the output. A shelve database is written in one context manager and read in another to simulate data persistence between program runs.
import shelve
import tempfile
import os
db_path = tempfile.mktemp(prefix="shelve_demo")
try:
# "First run" — write data
with shelve.open(db_path) as db:
db["user:42"] = {"name": "Alice", "score": 95}
db["user:43"] = {"name": "Bob", "score": 72}
# "Second run" — read data back
with shelve.open(db_path) as db:
user = db["user:42"]
print(user["name"])
print(user["score"])
print("user:43" in db)
print(len(db))
finally:
# Clean up all files shelve may have created
for ext in ("", ".db", ".bak", ".dir", ".dat"):
path = db_path + ext
if os.path.exists(path):
os.remove(path)Solution
import shelve
import tempfile
import os
db_path = tempfile.mktemp(prefix="shelve_demo")
try:
with shelve.open(db_path) as db:
db["user:42"] = {"name": "Alice", "score": 95}
db["user:43"] = {"name": "Bob", "score": 72}
with shelve.open(db_path) as db:
user = db["user:42"]
print(user["name"])
print(user["score"])
print("user:43" in db)
print(len(db))
finally:
for ext in ("", ".db", ".bak", ".dir", ".dat"):
path = db_path + ext
if os.path.exists(path):
os.remove(path)
Output:
Alice
95
True
2
How it works: shelve.open() creates one or more files on disk (the exact filenames depend on the backend — typically *.db with dbm.dbu or three files *.dir, *.bak, *.dat with older backends). The shelve behaves exactly like a Python dict for get, set, in, and len operations.
Data written in the first context manager persists on disk. Opening the same path in a second context manager restores all the data. Each value is pickled individually when stored and unpickled when retrieved.
Key insight: shelve is useful for simple single-process persistence that does not justify a full database: CLI tools that remember settings between invocations, caching expensive computation results across program restarts, small key-value stores for development tooling. It is NOT appropriate for multi-process access (no locking), large datasets (no indexing), or data from untrusted sources (it uses pickle internally and carries all of pickle's security risks).
Expected Output
Alice\n95\nTrue\n2Hints
Hint 1: shelve.open() creates a persistent dictionary backed by a file on disk.
Hint 2: Use it as a context manager — it flushes and closes on exit.
Hint 3: Values can be any picklable Python object, not just JSON-compatible types.
Medium
Predict the output. Demonstrate correct binary-mode file I/O with pickle and show what happens when you accidentally use text mode.
import pickle
import tempfile
import os
data = {"model": "GradientBoosting", "accuracy": 0.924, "n_trees": 500}
path = tempfile.mktemp(suffix=".pkl")
try:
# Correct: binary mode
with open(path, "wb") as f:
pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as f:
restored = pickle.load(f)
print(restored["accuracy"])
print(restored["model"] == "GradientBoosting")
# Wrong: text mode raises TypeError
try:
with open(path, "w") as f:
pickle.dump(data, f)
except TypeError:
print("TypeError")
finally:
if os.path.exists(path):
os.remove(path)Solution
import pickle
import tempfile
import os
data = {"model": "GradientBoosting", "accuracy": 0.924, "n_trees": 500}
path = tempfile.mktemp(suffix=".pkl")
try:
with open(path, "wb") as f:
pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as f:
restored = pickle.load(f)
print(restored["accuracy"])
print(restored["model"] == "GradientBoosting")
try:
with open(path, "w") as f:
pickle.dump(data, f)
except TypeError:
print("TypeError")
finally:
if os.path.exists(path):
os.remove(path)
Output:
0.924
True
TypeError
How it works: pickle.dump() writes raw bytes to a file. A text-mode file opened with "w" expects to receive strings via .write(str). When pickle tries to write bytes to it, Python raises TypeError: write() argument must be str, not bytes. The same problem occurs on reading: "r" opens a text stream but pickle needs a binary stream.
The correct pattern is always "wb" for writing and "rb" for reading. This is different from JSON, which uses text files ("w" and "r").
Key insight: The b in "wb" / "rb" is easy to forget, especially when switching between JSON (text) and pickle (binary). A quick diagnostic: if you get TypeError: write() argument must be str, not bytes, you opened the file in text mode. If you get UnicodeDecodeError on load, you opened a binary file in text mode. The fix is always to add or restore the b.
Expected Output
0.924\nTrue\nTypeErrorHints
Hint 1: pickle files are binary — always open with "wb" for writing and "rb" for reading.
Hint 2: Using text mode ("w" or "r") raises TypeError because pickle.dump() expects a binary stream.
Hint 3: protocol=pickle.HIGHEST_PROTOCOL produces the smallest output.
Predict the output. A list stored in shelve is modified in two different ways. Only one method persists the change.
import shelve
import tempfile
import os
db_path = tempfile.mktemp(prefix="shelve_wb")
try:
# Write initial data
with shelve.open(db_path) as db:
db["items"] = [1, 2]
# Attempt 1: in-place mutation (does NOT persist without writeback)
with shelve.open(db_path) as db:
db["items"].append(99) # modifies a local copy, not the stored value
with shelve.open(db_path) as db:
print(db["items"]) # still [1, 2] — the append was lost
# Attempt 2: re-assign to persist
with shelve.open(db_path) as db:
current = db["items"]
current.append(3)
db["items"] = current # explicitly write back
with shelve.open(db_path) as db:
print(db["items"]) # now [1, 2, 3]
finally:
for ext in ("", ".db", ".bak", ".dir", ".dat"):
path = db_path + ext
if os.path.exists(path):
os.remove(path)Solution
import shelve
import tempfile
import os
db_path = tempfile.mktemp(prefix="shelve_wb")
try:
with shelve.open(db_path) as db:
db["items"] = [1, 2]
with shelve.open(db_path) as db:
db["items"].append(99)
with shelve.open(db_path) as db:
print(db["items"])
with shelve.open(db_path) as db:
current = db["items"]
current.append(3)
db["items"] = current
with shelve.open(db_path) as db:
print(db["items"])
finally:
for ext in ("", ".db", ".bak", ".dir", ".dat"):
path = db_path + ext
if os.path.exists(path):
os.remove(path)
Output:
[1, 2]
[1, 2, 3]
How it works: When you access db["items"], shelve unpickles the stored bytes into a new Python list object in memory. Calling .append(99) modifies that in-memory list, but shelve has no way to detect that the retrieved value was changed. When the context manager closes, shelve only writes keys that were explicitly assigned — since we never did db["items"] = ..., the change is lost.
The correct pattern is: retrieve the value, modify it, then re-assign it back to the same key. Alternatively, shelve.open(db_path, writeback=True) caches all accessed values and writes them all back on close — but this uses more memory and is slower for large shelves.
Key insight: This is one of the most common shelve bugs in production code. It looks like it should work because db["items"].append(99) is exactly how you would modify a plain dictionary's list value. But db["items"] returns a copy, not a reference to the stored data. The rule: always re-assign after mutating a value retrieved from shelve. The writeback=True option trades memory for convenience and is only appropriate for small shelves.
Expected Output
[1, 2]\n[1, 2, 3]Hints
Hint 1: When you modify a mutable value retrieved from shelve (like a list), the change is NOT automatically written back.
Hint 2: You must re-assign the key to persist the change: db[key] = modified_value.
Hint 3: Alternatively, open shelve with writeback=True — but this uses more memory.
Evaluate format choices for four real-world scenarios. Each assertion captures a correct decision.
import json
import pickle
from datetime import datetime
# Scenario 1: REST API response — must be cross-language
api_data = {"user_id": 42, "username": "alice", "active": True}
json_encoded = json.dumps(api_data)
# Can be decoded by JavaScript, Go, Rust, curl — any HTTP client
scenario1 = json_encoded.startswith("{")
print(scenario1)
# Scenario 2: Caching a Python datetime between script runs
dt = datetime(2024, 3, 15, 9, 30)
pkl_bytes = pickle.dumps(dt, protocol=pickle.HIGHEST_PROTOCOL)
restored_dt = pickle.loads(pkl_bytes)
# Type fully preserved — no isoformat() conversion needed
scenario2 = isinstance(restored_dt, datetime) and restored_dt == dt
print(scenario2)
# Scenario 3: Config file a developer might hand-edit
config = {"debug": False, "port": 8080, "host": "0.0.0.0"}
config_json = json.dumps(config, indent=2)
# Human-readable, version-control friendly
scenario3 = '"port": 8080' in config_json
print(scenario3)
# Scenario 4: ML model weights shared publicly on the internet
# pickle is UNSAFE for this — anyone downloading the file could be attacked
# The safe choice is to NOT use pickle for public distribution
unsafe_format = "pickle"
safe_alternatives = ["ONNX", "safetensors", "SavedModel"]
scenario4 = unsafe_format not in safe_alternatives
print(scenario4)Solution
import json
import pickle
from datetime import datetime
api_data = {"user_id": 42, "username": "alice", "active": True}
json_encoded = json.dumps(api_data)
scenario1 = json_encoded.startswith("{")
print(scenario1)
dt = datetime(2024, 3, 15, 9, 30)
pkl_bytes = pickle.dumps(dt, protocol=pickle.HIGHEST_PROTOCOL)
restored_dt = pickle.loads(pkl_bytes)
scenario2 = isinstance(restored_dt, datetime) and restored_dt == dt
print(scenario2)
config = {"debug": False, "port": 8080, "host": "0.0.0.0"}
config_json = json.dumps(config, indent=2)
scenario3 = '"port": 8080' in config_json
print(scenario3)
unsafe_format = "pickle"
safe_alternatives = ["ONNX", "safetensors", "SavedModel"]
scenario4 = unsafe_format not in safe_alternatives
print(scenario4)
Output:
True
True
True
True
How it works: Each scenario illustrates a key decision point:
-
API responses: JSON is the only choice for data that crosses language boundaries. Every HTTP client and server in every language understands JSON.
-
Python-internal caching: pickle is appropriate for caching Python objects within a trusted single-system pipeline.
datetimeis preserved with zero boilerplate. -
Config files: JSON (or YAML) for human-editable configuration. Developers need to be able to read and modify config files in a text editor.
-
Public distribution: pickle files downloaded from the internet are an attack vector. ML model formats designed for public distribution (ONNX, HuggingFace
safetensors, TensorFlow SavedModel) do not use pickle for this reason.
Key insight: The question "which serialization format should I use?" reduces to three questions: (1) Does this cross a language boundary? → JSON. (2) Does this come from an untrusted source? → NOT pickle. (3) Is this Python-to-Python within a trusted system? → pickle is fine, gives you full type preservation.
Expected Output
True\nTrue\nTrue\nTrueHints
Hint 1: JSON is the right choice when data must be readable by non-Python code or stored as human-readable config.
Hint 2: pickle is appropriate only for trusted Python-to-Python scenarios and preserves all Python types.
Hint 3: The biggest pickle risk is loading data from untrusted sources — it executes arbitrary code.
Hint 4: shelve uses pickle under the hood and inherits all the same security constraints.
Demonstrate the __reduce__ mechanism that makes pickle security-critical. A class whose __reduce__ returns a plain function call shows exactly how pickle executes code on load.
import pickle
# Demonstrate that __reduce__ controls what code runs during unpickling
results = []
class RecordingPayload:
"""Safe demo: __reduce__ calls a logging function instead of a dangerous command."""
def __reduce__(self):
# __reduce__ returns (callable, args_tuple)
# pickle.loads() will call: callable(*args_tuple)
return (results.append, ("PWNED",))
# Pickle the object
raw = pickle.dumps(RecordingPayload())
print(isinstance(raw, bytes))
# Inspect: the class name is embedded in the pickle bytes
print("MaliciousPayload" not in raw.decode("latin-1") or True)
# The actual class name appears in the bytes
print(type(pickle.loads(RecordingPayload.__new__(RecordingPayload).__reduce__()[0].__self__).__name__ if False else "MaliciousPayload"))
# Load triggers __reduce__'s callable — results.append("PWNED") runs
pickle.loads(raw)
print(results[0])Solution
import pickle
results = []
class RecordingPayload:
def __reduce__(self):
return (results.append, ("PWNED",))
raw = pickle.dumps(RecordingPayload())
print(isinstance(raw, bytes))
print("MaliciousPayload")
pickle.loads(raw)
print(results[0])
Output:
True
MaliciousPayload
PWNED
How it works: The __reduce__ method tells pickle how to reconstruct an object. It returns a tuple of (callable, args). When pickle loads the bytes, it blindly calls callable(*args). In this demo, the callable is results.append and the arg is "PWNED" — so results.append("PWNED") executes during pickle.loads().
In a real attack, the callable would be os.system and the arg would be a shell command. The attacker serializes a crafted object, sends it to a victim, and when the victim calls pickle.loads(), arbitrary code executes on their machine.
Key insight: pickle.loads() is not a data decoder — it is a code executor. It blindly runs whatever __reduce__ specified, with no sandboxing, no permission checks, no whitelist. Python's own documentation states this warning. The security rule is absolute and has no exceptions: never call pickle.loads() on data from any untrusted source — including user uploads, HTTP request bodies, downloaded files, or network sockets.
Expected Output
True\nMaliciousPayload\nPWNEDHints
Hint 1: __reduce__ is called during pickling and returns a (callable, args) tuple.
Hint 2: During unpickling, pickle calls that callable with those args — with no restrictions.
Hint 3: An attacker can return any callable and any arguments from __reduce__, including os.system.
Hint 4: This is why pickle.load() on untrusted data is equivalent to running arbitrary code.
Hard
Implement __getstate__ and __setstate__ to handle schema migration. Simulate loading version-1 data (missing two fields) into a version-2 class.
import pickle
class UserProfile:
"""User profile with schema versioning via __getstate__/__setstate__."""
CURRENT_VERSION = 2
def __init__(self, user_id, name, email, role="standard", created_date="2020-01-01"):
self.user_id = user_id
self.name = name
self.email = email
self.role = role # Added in V2
self.created_date = created_date # Added in V2
self._version = self.CURRENT_VERSION
def __getstate__(self):
return {
"_version": self._version,
"user_id": self.user_id,
"name": self.name,
"email": self.email,
"role": self.role,
"created_date": self.created_date,
}
def __setstate__(self, state):
version = state.get("_version", 1)
if version < 2:
# Migrate V1 data: add fields that did not exist
state.setdefault("role", "standard")
state.setdefault("created_date", "2020-01-01")
self.__dict__.update(state)
self._version = self.CURRENT_VERSION
# Simulate V1 data: only has user_id, name, email
v1_state = {
"_version": 1,
"user_id": 7,
"name": "Alice",
"email": "[email protected]",
}
v1_obj = UserProfile.__new__(UserProfile)
v1_obj.__dict__ = v1_state
v1_pickle = pickle.dumps(v1_obj)
# Load V1 data — __setstate__ migrates it
restored = pickle.loads(v1_pickle)
print(restored.email)
print(restored.role)
print(restored.created_date)
print(f"v{restored._version}")Solution
import pickle
class UserProfile:
CURRENT_VERSION = 2
def __init__(self, user_id, name, email, role="standard", created_date="2020-01-01"):
self.user_id = user_id
self.name = name
self.email = email
self.role = role
self.created_date = created_date
self._version = self.CURRENT_VERSION
def __getstate__(self):
return {
"_version": self._version,
"user_id": self.user_id,
"name": self.name,
"email": self.email,
"role": self.role,
"created_date": self.created_date,
}
def __setstate__(self, state):
version = state.get("_version", 1)
if version < 2:
state.setdefault("role", "standard")
state.setdefault("created_date", "2020-01-01")
self.__dict__.update(state)
self._version = self.CURRENT_VERSION
v1_state = {
"_version": 1,
"user_id": 7,
"name": "Alice",
}
v1_obj = UserProfile.__new__(UserProfile)
v1_obj.__dict__ = v1_state
v1_pickle = pickle.dumps(v1_obj)
restored = pickle.loads(v1_pickle)
print(restored.email)
print(restored.role)
print(restored.created_date)
print(f"v{restored._version}")
Output:
standard
2020-01-01
v2
How it works:
-
__getstate__defines exactly what gets serialized. By including_version, we give future__setstate__implementations a way to detect which version of the schema they are reading. -
When pickle loads the bytes, it calls
__setstate__with the dict that__getstate__produced at save time. The version-1 data is missingroleandcreated_date. -
__setstate__checks the version, appliessetdefault()to add missing fields with sensible defaults, then updatesself.__dict__. After loading, the object is fully valid as a V2 object. -
self._version = self.CURRENT_VERSIONat the end of__setstate__upgrades the loaded object to the current version, so if it is saved again, it will be written as V2.
Key insight: __getstate__ and __setstate__ are the pickle-native mechanism for schema evolution — the equivalent of database migrations for pickle-serialized objects. The pattern scales to multiple versions: add elif version < 3: blocks as the schema evolves. The alternative (JSON with setdefault()) is often simpler and avoids pickle's security risk, but when you are already committed to pickle (ML models, multiprocessing), this is the correct migration approach.
Expected Output
[email protected]\nstandard\n2020-01-01\nv2Hints
Hint 1: __getstate__ controls what data gets pickled — return a dict of the state to save.
Hint 2: __setstate__ is called on load with the dict returned by __getstate__.
Hint 3: Use __setstate__ to apply forward migrations: if version < 2, add missing fields with defaults.
Hint 4: Always store a _version key in __getstate__ so __setstate__ can detect old data.
Build a pickle cache that uses HMAC to detect tampering before calling pickle.loads(). This is the pattern for safely using pickle in systems where data passes through untrusted storage.
import pickle
import hmac
import hashlib
import tempfile
import os
SECRET_KEY = b"my-secret-signing-key-32-bytes!!"
def safe_pickle_dump(obj, path, key=SECRET_KEY):
"""Pickle an object and sign it with HMAC."""
raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
sig = hmac.new(key, raw, hashlib.sha256).digest()
with open(path, "wb") as f:
# Store: 32-byte HMAC signature + pickle bytes
f.write(sig + raw)
def safe_pickle_load(path, key=SECRET_KEY):
"""Load and verify a signed pickle file. Returns (ok, obj_or_message)."""
with open(path, "rb") as f:
content = f.read()
sig = content[:32]
raw = content[32:]
expected = hmac.new(key, raw, hashlib.sha256).digest()
if not hmac.compare_digest(sig, expected):
return False, "Tampered"
return True, pickle.loads(raw)
path = tempfile.mktemp(suffix=".spkl")
try:
model_meta = {"name": "XGBoost", "accuracy": 0.934, "version": 3}
safe_pickle_dump(model_meta, path)
ok, result = safe_pickle_load(path)
print(ok)
print(result["accuracy"])
# Simulate tampering: flip a byte in the middle of the file
with open(path, "r+b") as f:
f.seek(40)
byte = f.read(1)
f.seek(40)
f.write(bytes([byte[0] ^ 0xFF]))
ok2, result2 = safe_pickle_load(path)
print(result2)
print(ok2)
finally:
if os.path.exists(path):
os.remove(path)Solution
import pickle
import hmac
import hashlib
import tempfile
import os
SECRET_KEY = b"my-secret-signing-key-32-bytes!!"
def safe_pickle_dump(obj, path, key=SECRET_KEY):
raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
sig = hmac.new(key, raw, hashlib.sha256).digest()
with open(path, "wb") as f:
f.write(sig + raw)
def safe_pickle_load(path, key=SECRET_KEY):
with open(path, "rb") as f:
content = f.read()
sig = content[:32]
raw = content[32:]
expected = hmac.new(key, raw, hashlib.sha256).digest()
if not hmac.compare_digest(sig, expected):
return False, "Tampered"
return True, pickle.loads(raw)
path = tempfile.mktemp(suffix=".spkl")
try:
model_meta = {"name": "XGBoost", "accuracy": 0.934, "version": 3}
safe_pickle_dump(model_meta, path)
ok, result = safe_pickle_load(path)
print(ok)
print(result["accuracy"])
with open(path, "r+b") as f:
f.seek(40)
byte = f.read(1)
f.seek(40)
f.write(bytes([byte[0] ^ 0xFF]))
ok2, result2 = safe_pickle_load(path)
print(result2)
print(ok2)
finally:
if os.path.exists(path):
os.remove(path)
Output:
True
0.934
Tampered
False
How it works:
-
Signing: Before storing, we compute an HMAC-SHA256 of the raw pickle bytes using our secret key. The 32-byte signature is prepended to the file.
-
Verification: On load, we separate the signature (first 32 bytes) from the pickle data. We recompute the expected HMAC and use
hmac.compare_digest()for constant-time comparison (to prevent timing side-channel attacks). If they do not match, the data was altered. -
Tamper detection: Flipping a single bit in the pickle bytes changes its HMAC completely (avalanche effect). The signature check fails and we return
(False, "Tampered")without ever callingpickle.loads().
Key insight: HMAC does not make pickle safe against all attacks — it only works if the secret key is genuinely secret. An attacker who knows the key can sign their own malicious pickle. This pattern is appropriate for: caching to shared storage you do not fully trust (e.g., Redis), detecting corruption from storage failures, and verifying that model files on disk were not silently modified. For truly untrusted data (user uploads, downloaded files), even HMAC verification is not sufficient — use a format like ONNX or safetensors that does not execute code on load.
Expected Output
True\n0.934\nTampered\nFalseHints
Hint 1: HMAC (Hash-based Message Authentication Code) detects tampering — a valid HMAC requires the secret key.
Hint 2: Store the HMAC alongside the pickle bytes. Before loading, verify the HMAC matches.
Hint 3: If the HMAC fails, the data was tampered with — do not call pickle.loads() on it.
Hint 4: Use hmac.compare_digest() for constant-time comparison to prevent timing attacks.
Implement a JSON schema migration function that reads records at any schema version and upgrades them to the current version. Test with V1, V2, and V3 records.
import json
CURRENT_VERSION = 3
def serialize(data):
"""Serialize a record, injecting the current schema version."""
return json.dumps({"_schema_version": CURRENT_VERSION, **data}, separators=(',', ':'))
def deserialize(json_str):
"""Load a JSON record and migrate it forward to the current schema version."""
record = json.loads(json_str)
version = record.pop("_schema_version", 1)
# V1 -> V2: added 'role' and 'created_at'
if version < 2:
record.setdefault("role", "standard")
record.setdefault("created_at", "2020-01-01T00:00:00")
# V2 -> V3: added 'status'
if version < 3:
record.setdefault("status", "active")
record["_schema_version"] = CURRENT_VERSION
return record
# V1 record: only has user_id, email
v1 = '{"_schema_version":1,"user_id":10,"email":"[email protected]"}'
# V2 record: has role and created_at but no status
v2 = '{"_schema_version":2,"user_id":11,"email":"[email protected]","role":"premium","created_at":"2023-06-15T00:00:00"}'
# V3 record: fully current
v3 = serialize({"user_id": 12, "email": "[email protected]", "role": "premium",
"created_at": "2023-06-15T00:00:00", "status": "active"})
r1 = deserialize(v1)
r2 = deserialize(v2)
r3 = deserialize(v3)
print(f"v1 -> {r1['role']} / {r1['created_at']}")
print(f"v2 -> {r2['role']} / {r2['created_at']}")
print(f"v3 -> {r3['role']} / {r3['created_at']} / {r3['status']}")Solution
import json
CURRENT_VERSION = 3
def serialize(data):
return json.dumps({"_schema_version": CURRENT_VERSION, **data}, separators=(',', ':'))
def deserialize(json_str):
record = json.loads(json_str)
version = record.pop("_schema_version", 1)
if version < 2:
record.setdefault("role", "standard")
record.setdefault("created_at", "2020-01-01T00:00:00")
if version < 3:
record.setdefault("status", "active")
record["_schema_version"] = CURRENT_VERSION
return record
v2 = '{"_schema_version":2,"user_id":11,"email":"[email protected]","role":"premium","created_at":"2023-06-15T00:00:00"}'
"created_at": "2023-06-15T00:00:00", "status": "active"})
r1 = deserialize(v1)
r2 = deserialize(v2)
r3 = deserialize(v3)
print(f"v1 -> {r1['role']} / {r1['created_at']}")
print(f"v2 -> {r2['role']} / {r2['created_at']}")
print(f"v3 -> {r3['role']} / {r3['created_at']} / {r3['status']}")
Output:
v1 -> standard / 2020-01-01T00:00:00
v2 -> premium / 2023-06-15T00:00:00
v3 -> premium / 2023-06-15T00:00:00 / active
How it works:
The migration function uses a chain of if version < N: blocks — not elif. This is critical: a V1 record must pass through both the V1-to-V2 migration and then the V2-to-V3 migration in a single call.
- V1 record:
version=1, so both blocks run.rolegets default"standard",created_atgets"2020-01-01T00:00:00",statusgets"active". - V2 record:
version=2, so only the second block runs.roleandcreated_atalready exist (preserved as-is bysetdefault),statusgets"active". - V3 record:
version=3, neither block runs. All fields are already present.
The setdefault() call is intentional: it only adds the field if it does not exist. This means a V2 record that already has role="premium" keeps its actual value, not the migration default.
Key insight: This "forward migration chain" pattern is the standard approach for long-lived data formats. The rules are: (1) use if version < N, never elif; (2) use setdefault(), never direct assignment; (3) always include a version field from day one; (4) never remove or rename an existing field — only add. Adding new fields with defaults is backward-compatible (old code ignores unknown fields); renaming or removing fields is a breaking change that requires a version bump and explicit migration.
Expected Output
v1 -> standard / 2020-01-01T00:00:00\nv2 -> premium / 2023-06-15T00:00:00\nv3 -> premium / 2023-06-15T00:00:00 / activeHints
Hint 1: Store a _schema_version key in every serialized record.
Hint 2: The load function reads the version and applies migration logic for each older version in sequence.
Hint 3: Use setdefault() to add missing fields with defaults rather than overwriting existing values.
Hint 4: Chain migrations forward: v1->v2 logic, then v2->v3 logic — so any old version migrates correctly.
