JSON Handling - Serialization, Deserialization, and Edge Cases
Reading time: ~18 minutes | Level: Foundation → Engineering
Here is a failure that trips up most developers the first time they hit it in production:
import json
from datetime import datetime
from decimal import Decimal
data = {
"user": "alice",
"created_at": datetime.now(),
"balance": Decimal("99.99"),
}
print(json.dumps(data))
Output:
TypeError: Object of type datetime is not JSON serializable
The json module serializes only the Python types that map onto JSON's six types. Everything else - datetime, UUID, Decimal, bytes, custom objects - raises TypeError. Knowing exactly which types fail and exactly how to handle them is the difference between a working REST API and a production incident at 2 AM.
What You Will Learn
- The six JSON types and their exact Python equivalents
- json.dumps() and json.loads() for string-based serialization
- json.dump() and json.load() for file-based serialization
- indent, sort_keys, and separators parameters and when to use each
- How to handle non-serializable types: datetime, UUID, Decimal, bytes, custom objects
- Custom encoders with json.JSONEncoder and the default() method
- Custom decoders with object_hook for round-trip fidelity
- json.JSONDecodeError: what causes it and how to handle it gracefully
- ensure_ascii=False for Unicode-rich data
- Performance: when to reach for orjson or ujson
Prerequisites
- Python 3.10+ with the json module (standard library - no install needed); several examples use the X | Y union syntax introduced in 3.10
- Understanding of Python dicts, lists, and basic types
- Familiarity with reading and writing files (see lessons 01 and 02 of this module)
- Basic understanding of context managers (lesson 03)
Mental Model: JSON Is a Typed Subset of Python
JSON is not Python. It is a language-independent text format with exactly six types:
| JSON Type | JSON Example | Python Type |
|---|---|---|
| object | {"key": "value"} | dict |
| array | [1, 2, 3] | list |
| string | "hello" | str |
| number | 42 or 3.14 | int or float |
| boolean | true or false | True or False |
| null | null | None |
Not in JSON: datetime, UUID, Decimal, bytes, set, tuple, custom objects, complex, frozenset, ... (The json module does accept tuples, but it writes them as arrays, so they come back as lists.)
This mismatch is the source of most JSON serialization problems. Python's type system is far richer than JSON's. The json module handles the six core mappings automatically. Everything else is your responsibility.
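A minimal sketch of where the mapping is lossy, using only the standard library. The sample data is made up for illustration; the point is that a tuple comes back as a list and a non-string dict key comes back as a string.

```python
import json

# Hypothetical sample data: a tuple value and an int key, neither native to JSON
original = {"point": (1, 2), 7: "seven"}

text = json.dumps(original)
restored = json.loads(text)

print(text)                  # {"point": [1, 2], "7": "seven"}
print(restored)              # {'point': [1, 2], '7': 'seven'}
print(restored == original)  # False: the tuple became a list, 7 became "7"
```

Nothing raised here, which is what makes this kind of loss easy to miss.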
Part 1 - The Four Core Functions
json.dumps() - Python Object to JSON String
import json
data = {
"name": "Alice",
"age": 30,
"scores": [95, 87, 92],
"active": True,
"profile": None,
}
json_string = json.dumps(data)
print(json_string)
# {"name": "Alice", "age": 30, "scores": [95, 87, 92], "active": true, "profile": null}
print(type(json_string))
# <class 'str'>
Notice the automatic type conversions:
- Python True becomes JSON true
- Python None becomes JSON null
- Python dict becomes JSON object
- Python list becomes JSON array
json.loads() - JSON String to Python Object
import json
json_string = '{"name": "Alice", "age": 30, "active": true, "profile": null}'
data = json.loads(json_string)
print(data)
# {'name': 'Alice', 'age': 30, 'active': True, 'profile': None}
print(type(data)) # <class 'dict'>
print(type(data["active"])) # <class 'bool'>
print(data["profile"]) # None
The conversions are symmetric:
- JSON true becomes Python True
- JSON false becomes Python False
- JSON null becomes Python None
json.dump() - Python Object to JSON File
import json
config = {
"database": {
"host": "localhost",
"port": 5432,
"name": "appdb",
},
"debug": False,
"max_connections": 100,
}
with open("config.json", "w", encoding="utf-8") as f:
json.dump(config, f, indent=2)
# config.json now contains:
# {
# "database": {
# "host": "localhost",
# "port": 5432,
# "name": "appdb"
# },
# "debug": false,
# "max_connections": 100
# }
json.load() - JSON File to Python Object
import json
with open("config.json", "r", encoding="utf-8") as f:
config = json.load(f)
print(config["database"]["host"]) # localhost
print(config["debug"]) # False
print(type(config["database"])) # <class 'dict'>
:::note Always specify encoding
Always open JSON files with encoding="utf-8". JSON is defined to be UTF-8 encoded by RFC 8259. Omitting the encoding parameter uses the platform default, which can differ on Windows.
:::
Part 2 - Formatting Parameters
indent - Human-Readable Output
import json
data = {"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}
# Compact (default)
compact = json.dumps(data)
print(compact)
# {"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}
# Indented - for config files, logging, debugging
readable = json.dumps(data, indent=2)
print(readable)
# {
# "users": [
# {
# "id": 1,
# "name": "Alice"
# },
# {
# "id": 2,
# "name": "Bob"
# }
# ]
# }
sort_keys - Deterministic Output
import json
data = {"zebra": 1, "apple": 2, "mango": 3}
print(json.dumps(data))
# {"zebra": 1, "apple": 2, "mango": 3} - dict insertion order (Python 3.7+)
print(json.dumps(data, sort_keys=True))
# {"apple": 2, "mango": 3, "zebra": 1} - alphabetical
:::tip Use sort_keys for reproducible hashing
When you need to hash JSON (e.g., for caching or checksums), use sort_keys=True to ensure the same dict always produces the same JSON string regardless of insertion order.
import hashlib, json
def dict_hash(d: dict) -> str:
canonical = json.dumps(d, sort_keys=True, separators=(',', ':'))
return hashlib.sha256(canonical.encode()).hexdigest()
:::
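To see why the canonical form matters, here is a small sketch that reuses the dict_hash helper from the tip above on two dicts that differ only in insertion order. The sample dicts a and b are illustrative.

```python
import hashlib
import json

def dict_hash(d: dict) -> str:
    # Same canonical form as the tip above: sorted keys, compact separators
    canonical = json.dumps(d, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # same content, different insertion order

print(json.dumps(a) == json.dumps(b))  # False: key order differs in the output
print(dict_hash(a) == dict_hash(b))    # True: canonical form is identical
```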
separators - Compact JSON for Network Transmission
import json
data = {"event": "click", "x": 100, "y": 200}
# Default separators include spaces: (', ', ': ')
default = json.dumps(data)
print(f"Default: {len(default)} bytes → {default}")
# Default: 38 bytes → {"event": "click", "x": 100, "y": 200}
# Compact separators - no extra whitespace
compact = json.dumps(data, separators=(',', ':'))
print(f"Compact: {len(compact)} bytes → {compact}")
# Compact: 33 bytes → {"event":"click","x":100,"y":200}
| Format | Use for |
|---|---|
| indent=2 | Config files, responses for human review |
| separators=(',', ':') | Network APIs, high-throughput logging (compact) |
| Default | General use, debugging |
Part 3 - Non-Serializable Types and How to Handle Each
The Problem
import json
from datetime import datetime
from decimal import Decimal
import uuid
# These all raise TypeError:
json.dumps(datetime.now()) # TypeError: datetime not serializable
json.dumps(Decimal("3.14")) # TypeError: Decimal not serializable
json.dumps(uuid.uuid4()) # TypeError: UUID not serializable
json.dumps(b"raw bytes") # TypeError: bytes not serializable
json.dumps({1, 2, 3}) # TypeError: set not serializable
Solution 1: Manual Conversion Before Serializing
The simplest approach for one-off cases:
import json
from datetime import datetime
from decimal import Decimal
import uuid
data = {
    "user_id": str(uuid.uuid4()),              # UUID → str
    "created_at": datetime.now().isoformat(),  # datetime → str
    "balance": float(Decimal("99.99")),        # Decimal → float
    "tags": sorted({"python", "api"}),         # set → sorted list (sets are unordered)
}
print(json.dumps(data, indent=2))
# {
#   "user_id": "a3f4...",
#   "created_at": "2024-01-15T14:30:00.123456",
#   "balance": 99.99,
#   "tags": [
#     "api",
#     "python"
#   ]
# }
:::warning Float precision loss
Converting Decimal("99.99") to float introduces floating-point representation errors. For financial data, serialize as a string instead: str(Decimal("99.99")) → "99.99". Deserialize back with Decimal(data["balance"]).
:::
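A short sketch of the string route described in the warning, plus one alternative for JSON that already contains bare numbers: the parse_float argument of json.loads, which receives the literal text of each JSON float. The price variable is illustrative.

```python
import json
from decimal import Decimal

price = Decimal("99.99")

# Option 1: store as a string, rebuild explicitly on the way back
as_text = json.dumps({"balance": str(price)})
restored = Decimal(json.loads(as_text)["balance"])
print(restored == price)               # True

# What the float route loses: the nearest binary float is not exactly 99.99
print(Decimal(float(price)) == price)  # False

# Option 2: the JSON holds a bare number; parse_float turns it into a Decimal
parsed = json.loads('{"balance": 99.99}', parse_float=Decimal)
print(parsed["balance"])               # 99.99
print(type(parsed["balance"]))         # <class 'decimal.Decimal'>
```

Option 2 only helps on the reading side. If the writer went through float first, the damage may already be in the text.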
Solution 2: Custom Encoder Class
For systematic handling across your entire application:
import json
from datetime import datetime, date
from decimal import Decimal
import uuid
class EngineeringEncoder(json.JSONEncoder):
"""Production-grade JSON encoder handling common Python types."""
def default(self, obj):
# Called for every object the default encoder cannot handle
if isinstance(obj, datetime):
return obj.isoformat()
if isinstance(obj, date):
return obj.isoformat()
if isinstance(obj, Decimal):
return str(obj) # Preserve exact representation
if isinstance(obj, uuid.UUID):
return str(obj)
if isinstance(obj, bytes):
return obj.decode("utf-8") # Or use base64 for binary data
if isinstance(obj, (set, frozenset)):
return sorted(obj) # Sort for deterministic output
# For any other type, call the parent (raises TypeError)
return super().default(obj)
# Use with cls= parameter
data = {
"event_id": uuid.uuid4(),
"timestamp": datetime.now(),
"amount": Decimal("1234.56"),
"raw": b"hello",
"tags": {"python", "backend"},
}
result = json.dumps(data, cls=EngineeringEncoder, indent=2)
print(result)
# {
# "event_id": "3f2c8b...",
# "timestamp": "2024-01-15T14:30:00.123456",
# "amount": "1234.56",
# "raw": "hello",
# "tags": ["backend", "python"]
# }
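The encoder above decodes bytes as UTF-8, which fails for arbitrary binary data. A minimal sketch of the base64 alternative mentioned in its comment follows; BinarySafeEncoder and the sample payload are hypothetical names used only for illustration.

```python
import base64
import json

class BinarySafeEncoder(json.JSONEncoder):
    """Hypothetical encoder: bytes are stored as base64 text."""

    def default(self, obj):
        if isinstance(obj, (bytes, bytearray)):
            return base64.b64encode(obj).decode("ascii")
        return super().default(obj)

# The first byte (0x89) is not valid UTF-8, so .decode("utf-8") would raise
payload = {"name": "thumbnail", "blob": b"\x89PNG\r\n\x1a\n"}

text = json.dumps(payload, cls=BinarySafeEncoder)
print(text)

# The receiver must know which fields are base64 and decode them itself
restored = base64.b64decode(json.loads(text)["blob"])
print(restored == payload["blob"])  # True
```

Base64 grows the data by roughly a third, so this suits small blobs better than large files.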
Solution 3: default Function Parameter
For lightweight one-off needs without a full class:
import json
from datetime import datetime
from decimal import Decimal
def encode_extended(obj):
if isinstance(obj, datetime):
return {"__type__": "datetime", "value": obj.isoformat()}
if isinstance(obj, Decimal):
return {"__type__": "decimal", "value": str(obj)}
raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
data = {
"created": datetime(2024, 1, 15, 14, 30),
"price": Decimal("29.99"),
}
print(json.dumps(data, default=encode_extended, indent=2))
# {
# "created": {"__type__": "datetime", "value": "2024-01-15T14:30:00"},
# "price": {"__type__": "decimal", "value": "29.99"}
# }
Part 4 - Custom Decoders with object_hook
object_hook is called on every JSON object (dict) after parsing. Use it to restore original Python types - achieving true round-trip serialization.
import json
from datetime import datetime
from decimal import Decimal
def decode_extended(obj):
"""Restore special types encoded with __type__ markers."""
if "__type__" not in obj:
return obj # Regular dict - return as-is
type_name = obj["__type__"]
value = obj["value"]
if type_name == "datetime":
return datetime.fromisoformat(value)
if type_name == "decimal":
return Decimal(value)
return obj # Unknown type - return dict unchanged
# Round-trip example
original = {
"event": "purchase",
"timestamp": datetime(2024, 1, 15, 14, 30),
"amount": Decimal("99.99"),
}
# Encode (encode_extended is the function defined in Solution 3)
json_str = json.dumps(original, default=encode_extended)
# Decode - restores original Python types
restored = json.loads(json_str, object_hook=decode_extended)
print(restored["timestamp"]) # 2024-01-15 14:30:00
print(type(restored["timestamp"])) # <class 'datetime.datetime'>
print(restored["amount"]) # 99.99
print(type(restored["amount"])) # <class 'decimal.Decimal'>
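One detail worth seeing in action: the decoder finishes inner objects before outer ones, so object_hook receives nested dicts innermost first. The sketch below records that order; tracing_hook and seen are illustrative names.

```python
import json

seen = []

def tracing_hook(obj: dict) -> dict:
    # Record each dict as the decoder hands it over, then return it unchanged
    seen.append(obj)
    return obj

json.loads('{"outer": {"inner": {"leaf": 1}}}', object_hook=tracing_hook)

for d in seen:
    print(d)
# {'leaf': 1}
# {'inner': {'leaf': 1}}
# {'outer': {'inner': {'leaf': 1}}}
```

This is why a type-marker hook can rebuild a nested value before the dict that contains it is processed.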
Part 5 - Error Handling
json.JSONDecodeError
json.loads() raises json.JSONDecodeError (a subclass of ValueError) when the input is not valid JSON:
import json
def safe_parse(text: str) -> dict | None:
"""Parse JSON with graceful error handling."""
try:
return json.loads(text)
except json.JSONDecodeError as e:
print(f"JSON parse error at line {e.lineno}, col {e.colno}: {e.msg}")
print(f"Problem text: {e.doc[max(0, e.pos-20):e.pos+20]!r}")
return None
# Common causes of JSONDecodeError:
safe_parse("{'key': 'value'}") # Single quotes - not valid JSON
# JSON parse error at line 1, col 2: Expecting property name enclosed in double quotes
safe_parse('{"key": undefined}') # undefined is JavaScript, not JSON
# JSON parse error at line 1, col 9: Expecting value
safe_parse('{"key": "value",}') # Trailing comma - not allowed in JSON
# JSON parse error at line 1, col 17: Expecting property name enclosed in double quotes
# (Python 3.13+ instead reports "Illegal trailing comma before end of object" at col 16)
safe_parse("") # Empty string
# JSON parse error at line 1, col 1: Expecting value
Defensive Parsing Pattern
import json
import logging
logger = logging.getLogger(__name__)
def parse_api_response(response_text: str, request_id: str) -> dict:
"""
Parse an API response body, always returning a usable dict.
Logs errors with context for debugging production issues.
"""
if not response_text or not response_text.strip():
logger.warning("Empty response body for request %s", request_id)
return {"error": "empty_response"}
try:
return json.loads(response_text)
except json.JSONDecodeError as e:
logger.error(
"Failed to parse JSON for request %s: %s (pos=%d)",
request_id, e.msg, e.pos,
)
# Log a snippet for debugging (avoid logging full response in case it contains PII)
snippet = response_text[:200]
logger.debug("Response snippet: %r", snippet)
return {"error": "json_parse_error", "detail": e.msg}
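Valid JSON is not always an object: a body of [1, 2, 3] or null parses without error, and code that expects a dict then fails later. A minimal sketch of an explicit check follows; parse_json_object is a hypothetical helper, not part of the json module.

```python
import json

def parse_json_object(text: str) -> dict:
    """Parse text and insist that the top-level value is a JSON object."""
    value = json.loads(text)  # may raise json.JSONDecodeError
    if not isinstance(value, dict):
        raise ValueError(f"expected a JSON object, got {type(value).__name__}")
    return value

print(parse_json_object('{"ok": true}'))  # {'ok': True}

for body in ("[1, 2, 3]", "null", '"just a string"'):
    try:
        parse_json_object(body)
    except ValueError as exc:
        print(exc)
# expected a JSON object, got list
# expected a JSON object, got NoneType
# expected a JSON object, got str
```

Because json.JSONDecodeError is a subclass of ValueError, a caller can catch both failure modes with a single except ValueError.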
Part 6 - ensure_ascii for Unicode Data
By default, json.dumps() escapes all non-ASCII characters:
import json
data = {
"message": "こんにちは", # Japanese: "Hello"
"currency": "€100",
"emoji": "✓",
}
# Default: everything escaped to ASCII-safe sequences
print(json.dumps(data))
# {"message": "\u3053\u3093\u306b\u3061\u306f", "currency": "\u20ac100", "emoji": "\u2713"}
# ensure_ascii=False: write Unicode characters directly
print(json.dumps(data, ensure_ascii=False))
# {"message": "こんにちは", "currency": "€100", "emoji": "✓"}
:::tip Use ensure_ascii=False for modern APIs
Both outputs are valid JSON - any compliant parser handles both. But ensure_ascii=False produces smaller output and is human-readable. Use it whenever you're working with multilingual data and writing to UTF-8 files or HTTP responses with Content-Type: application/json; charset=utf-8.
:::
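To put a number on the size difference, this sketch encodes one of the strings from the example above both ways and compares the UTF-8 byte counts.

```python
import json

data = {"message": "こんにちは"}

escaped = json.dumps(data)                     # \uXXXX escapes, pure ASCII
direct = json.dumps(data, ensure_ascii=False)  # characters written as-is

print(len(escaped.encode("utf-8")))  # 45
print(len(direct.encode("utf-8")))   # 30
print(json.loads(escaped) == json.loads(direct) == data)  # True
```

Each of these characters costs six bytes as an escape and three bytes as UTF-8. Characters outside the Basic Multilingual Plane, such as most emoji, cost twelve bytes escaped (a surrogate pair) versus four as UTF-8.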
# Correct pattern for writing international JSON to file
with open("data.json", "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
Part 7 - Serializing Custom Objects
Approach 1: __dict__ Serialization
For simple objects, dump the __dict__ attribute:
import json

class User:
    def __init__(self, user_id, name, email):
        self.user_id = user_id
        self.name = name
        self.email = email

# Placeholder values for illustration
user = User(42, "Alice", "alice@example.com")

# Serialize via __dict__
print(json.dumps(user.__dict__))
# {"user_id": 42, "name": "Alice", "email": "alice@example.com"}
Approach 2: to_dict() Method
Add explicit serialization control to your class:
import json
from datetime import datetime
class Event:
def __init__(self, name, occurred_at, severity):
self.name = name
self.occurred_at = occurred_at # datetime
self.severity = severity
def to_dict(self) -> dict:
return {
"name": self.name,
"occurred_at": self.occurred_at.isoformat(),
"severity": self.severity,
}
@classmethod
def from_dict(cls, data: dict) -> "Event":
return cls(
name=data["name"],
occurred_at=datetime.fromisoformat(data["occurred_at"]),
severity=data["severity"],
)
event = Event("deploy", datetime.now(), "info")
# Serialize
json_str = json.dumps(event.to_dict())
# Deserialize - fully restores the object
restored = Event.from_dict(json.loads(json_str))
print(restored.name) # deploy
print(type(restored.occurred_at)) # <class 'datetime.datetime'>
Approach 3: Encoder with isinstance Dispatch
The cleanest production pattern for systems with many custom types:
import json
from datetime import datetime
from decimal import Decimal
import uuid
from dataclasses import dataclass, asdict
@dataclass
class Product:
product_id: uuid.UUID
name: str
price: Decimal
created_at: datetime
class AppEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, uuid.UUID):
return str(obj)
if isinstance(obj, Decimal):
return str(obj)
if isinstance(obj, datetime):
return obj.isoformat()
# Dataclasses: convert to dict first, then individual fields encode recursively
if hasattr(obj, "__dataclass_fields__"):
return asdict(obj)
return super().default(obj)
product = Product(
product_id=uuid.uuid4(),
name="Widget Pro",
price=Decimal("49.99"),
created_at=datetime.now(),
)
print(json.dumps(product, cls=AppEncoder, indent=2))
# {
# "product_id": "b4c2...",
# "name": "Widget Pro",
# "price": "49.99",
# "created_at": "2024-01-15T14:30:00.123456"
# }
Part 8 - Performance: When the Standard Library Is Not Fast Enough
The standard json module is implemented in C (via _json), but third-party libraries go much further:
| Library | Speed vs stdlib | Cross-lang | Custom types | Install |
|---|---|---|---|---|
| json (stdlib) | 1x (baseline) | Yes | Manual | Built-in |
| orjson | Often 2x–10x (workload-dependent) | Yes | Automatic* | pip install orjson |
| ujson | 2x–5x | Yes | Limited | pip install ujson |
| msgpack | Fast + binary | Yes | Manual | pip install msgpack |

* orjson natively handles datetime, UUID, and dataclasses; numpy arrays require option=orjson.OPT_SERIALIZE_NUMPY.
orjson - The Production Standard for High Throughput
import orjson
from datetime import datetime
from decimal import Decimal
import uuid
data = {
"event_id": uuid.uuid4(),
"timestamp": datetime.now(),
"value": 42,
}
# orjson.dumps returns bytes (not str) - faster for network I/O
json_bytes = orjson.dumps(data)
print(json_bytes)
# b'{"event_id":"b4c2...","timestamp":"2024-01-15T14:30:00.123456","value":42}'
# orjson handles datetime and UUID natively - no custom encoder needed!
# Deserialize
restored = orjson.loads(json_bytes)
print(restored["value"]) # 42
# orjson does NOT restore datetime objects on load - they stay as strings
# This is the same behavior as stdlib json
print(type(restored["timestamp"])) # <class 'str'>
When to Use Each Library
# stdlib json - default choice; zero dependencies
import json
data = json.dumps(payload)
# orjson - high-throughput APIs, event streaming, ML feature stores
# > 10,000 serializations/second, native datetime/UUID/numpy support
import orjson
data = orjson.dumps(payload) # Returns bytes
# ujson - drop-in replacement for stdlib, moderate speedup
import ujson
data = ujson.dumps(payload) # Returns str like stdlib
:::warning orjson returns bytes
orjson.dumps() returns bytes, not str. When writing to a file opened in text mode, you must decode first: f.write(orjson.dumps(data).decode()). Or open the file in binary mode: open("file.json", "wb").
:::
Part 9 - Real-World Patterns
Pattern 1: REST API Response Parsing
import json
import urllib.request
from datetime import datetime
def fetch_github_user(username: str) -> dict:
"""Fetch GitHub user data from the public API."""
url = f"https://api.github.com/users/{username}"
with urllib.request.urlopen(url) as response:
raw = response.read().decode("utf-8")
data = json.loads(raw)
# Extract only what we need; convert types
return {
"login": data["login"],
"id": data["id"],
"repos": data["public_repos"],
# GitHub returns ISO 8601 strings - parse to datetime
"created": datetime.fromisoformat(data["created_at"].replace("Z", "+00:00")),
"bio": data.get("bio"), # May be null → None
}
# user = fetch_github_user("gvanrossum")
# print(user["created"]) # 2011-01-25 18:44:36+00:00
Pattern 2: Append-Only JSON Log (JSONL Format)
JSON Lines (.jsonl) - one JSON object per line - is the standard format for structured logs and ML training data:
import json
from datetime import datetime
def log_event(filepath: str, event_type: str, data: dict) -> None:
"""Append a structured event to a JSON Lines log file."""
record = {
"ts": datetime.utcnow().isoformat() + "Z",
"event": event_type,
**data,
}
with open(filepath, "a", encoding="utf-8") as f:
f.write(json.dumps(record, separators=(',', ':')) + "\n")
def read_log(filepath: str):
"""Read all events from a JSON Lines log file."""
with open(filepath, "r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
yield json.loads(line)
# Usage
log_event("events.jsonl", "user_login", {"user_id": 42, "ip": "10.0.0.1"})
log_event("events.jsonl", "purchase", {"user_id": 42, "amount": 99.99})
for event in read_log("events.jsonl"):
print(event["event"], event["ts"])
# user_login 2024-01-15T14:30:00.000000Z
# purchase 2024-01-15T14:30:01.234567Z
Pattern 3: Config File with Schema Validation
import json
from pathlib import Path
DEFAULT_CONFIG = {
"database": {"host": "localhost", "port": 5432},
"debug": False,
"log_level": "INFO",
}
def load_config(config_path: str | Path) -> dict:
"""
Load JSON config file, falling back to defaults for missing keys.
Raises ValueError if the file is not valid JSON.
"""
path = Path(config_path)
if not path.exists():
return DEFAULT_CONFIG.copy()
with path.open("r", encoding="utf-8") as f:
try:
user_config = json.load(f)
except json.JSONDecodeError as e:
raise ValueError(f"Config file {path} is not valid JSON: {e}") from e
# One-level merge: user values override defaults; nested dicts are merged key by key
config = DEFAULT_CONFIG.copy()
for key, value in user_config.items():
if isinstance(value, dict) and key in config and isinstance(config[key], dict):
config[key] = {**config[key], **value}
else:
config[key] = value
return config
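The loader above merges defaults but does not check the result. A minimal sketch of a validation step that could run on the merged config follows; REQUIRED_KEYS and validate_config are hypothetical names, and the schema (required key mapped to expected type) is an assumption made for this example.

```python
# Hypothetical schema: required key -> expected Python type after parsing
REQUIRED_KEYS = {"database": dict, "debug": bool, "log_level": str}

def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(config[key]).__name__}"
            )
    return problems

good = {"database": {"host": "localhost"}, "debug": False, "log_level": "INFO"}
bad = {"database": "localhost", "debug": False}

print(validate_config(good))
# []
print(validate_config(bad))
# ['database: expected dict, got str', 'missing required key: log_level']
```

Returning every problem at once, rather than raising on the first, tends to make config errors quicker to fix. For anything beyond flat type checks, a dedicated schema library is usually the better fit.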
Pattern 4: Feature Store Serialization (ML Context)
import json
import numpy as np
from datetime import datetime
class FeatureStoreEncoder(json.JSONEncoder):
"""Encoder for ML feature data including numpy types."""
def default(self, obj):
# numpy scalars
if isinstance(obj, (np.integer,)):
return int(obj)
if isinstance(obj, (np.floating,)):
return float(obj)
# numpy arrays - convert to nested lists
if isinstance(obj, np.ndarray):
return obj.tolist()
if isinstance(obj, datetime):
return obj.isoformat()
return super().default(obj)
# Simulated feature vector
features = {
"user_id": np.int64(12345),
"embedding": np.array([0.1, 0.2, 0.3, 0.4]),
"click_rate": np.float32(0.045),
"computed_at": datetime.utcnow(),
}
json_str = json.dumps(features, cls=FeatureStoreEncoder)
print(json_str)
# {"user_id": 12345, "embedding": [0.1, 0.2, 0.3, 0.4], "click_rate": 0.04500000178813934, "computed_at": "2024-01-15T..."}
Interview Questions
Q1: What are the six JSON types, and what do they map to in Python?
Answer: JSON has exactly six types:
- object maps to Python dict
- array maps to Python list
- string maps to Python str
- number maps to Python int (no fraction or exponent) or float (fraction or exponent present, e.g. 3.14 or 1e5)
- true/false map to Python True/False
- null maps to Python None
Everything else in Python - datetime, UUID, Decimal, bytes, set, custom objects - must be explicitly converted before JSON serialization.
Q2: What is the difference between json.dumps() and json.dump()?
Answer: json.dumps() serializes a Python object to a string (the s stands for "string"). json.dump() serializes to a file-like object - any object with a .write() method. Both accept the same keyword arguments (indent, sort_keys, cls, default, etc.). Use dumps() when you need the JSON as a string in memory (e.g., for an HTTP response body, for hashing). Use dump() when writing directly to a file to avoid holding the entire string in memory.
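A quick sketch of the "file-like object" point: io.StringIO stands in for a file here, and dump() produces the same text that dumps() returns for the same arguments.

```python
import io
import json

record = {"id": 1, "tags": ["a", "b"]}

buffer = io.StringIO()  # any object with a .write() method works
json.dump(record, buffer, indent=2)

print(buffer.getvalue() == json.dumps(record, indent=2))  # True
```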
Q3: How do you serialize a datetime object to JSON? How do you deserialize it back?
Answer: datetime is not JSON-serializable by default. There are two main approaches:
- Simple (no round-trip guarantee): datetime.now().isoformat() produces a string like "2024-01-15T14:30:00". Deserialize with datetime.fromisoformat(s).
- Round-trip with type markers:
# Encode
def encode(obj):
if isinstance(obj, datetime):
return {"__type__": "datetime", "value": obj.isoformat()}
raise TypeError
# Decode
def decode(obj):
if obj.get("__type__") == "datetime":
return datetime.fromisoformat(obj["value"])
return obj
json.dumps(data, default=encode)
json.loads(json_str, object_hook=decode)
Use object_hook to restore the Python type during deserialization.
Q4: You need to hash a dict to use as a cache key. How do you do it correctly with JSON?
Answer: Use json.dumps(d, sort_keys=True, separators=(',', ':')) to get a canonical representation. Without sort_keys=True, two dicts with the same content but different insertion order produce different strings, because dicts preserve insertion order (Python 3.7+) and json.dumps() writes keys in that order. The compact separators are not required for correctness; they only drop whitespace that would otherwise be hashed for no benefit. Whichever settings you choose, apply them consistently, since any change to the output changes the hash.
import json, hashlib
def cache_key(params: dict) -> str:
canonical = json.dumps(params, sort_keys=True, separators=(',', ':'))
return hashlib.sha256(canonical.encode()).hexdigest()
Q5: What is object_hook in json.loads() and when would you use it?
Answer: object_hook is a callable that is called for every JSON object (dict) parsed. The return value replaces the default dict. It enables custom deserialization - turning type-annotated dicts back into proper Python objects.
Use it when you control both the encoder and decoder and want true round-trip fidelity. For example, if you encode datetime as {"__type__": "datetime", "value": "..."}, your object_hook checks for "__type__" and reconstructs the datetime. Without object_hook, you would need to walk the deserialized dict manually.
Q6: When should you use orjson instead of the standard json module?
Answer: Use orjson when:
- You are serializing more than ~10,000 JSON objects per second (high-throughput APIs, event streams, ML inference servers)
- Your data contains datetime, UUID, or numpy arrays - orjson handles them without a custom encoder (numpy arrays require option=orjson.OPT_SERIALIZE_NUMPY)
- You are writing JSON to network sockets where bytes output is more efficient than str

orjson is typically several times faster than stdlib json; it is implemented in Rust, and the exact speedup depends on the payload. The main difference is that orjson.dumps() returns bytes, not str. This is fine for file I/O in binary mode or HTTP response bodies, but requires .decode() if you need a string.
Practice Challenges
Beginner: Build a Simple Config File Manager
Write a module that loads a JSON config file on startup and saves updates back to disk.
Requirements:
- load_config(path) - load from file, return dict; create file with defaults if it doesn't exist
- save_config(path, config) - save dict to file with indent=2
- get(path, key, default=None) - get a value from config
- set_value(path, key, value) - update a value and immediately persist (named set_value so it does not shadow the built-in set)
Solution
import json
from pathlib import Path
DEFAULTS = {
"theme": "dark",
"language": "en",
"notifications": True,
"max_retries": 3,
}
def load_config(path: str | Path) -> dict:
"""Load config from JSON file, creating it with defaults if absent."""
path = Path(path)
if not path.exists():
config = DEFAULTS.copy()
save_config(path, config)
return config
with path.open("r", encoding="utf-8") as f:
try:
return json.load(f)
except json.JSONDecodeError as e:
print(f"Warning: config file corrupted ({e}), using defaults")
return DEFAULTS.copy()
def save_config(path: str | Path, config: dict) -> None:
"""Save config dict to JSON file with readable formatting."""
path = Path(path)
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
json.dump(config, f, indent=2, sort_keys=True)
f.write("\n") # Trailing newline - POSIX convention
def get(path: str | Path, key: str, default=None):
"""Get a single value from the config file."""
config = load_config(path)
return config.get(key, default)
def set_value(path: str | Path, key: str, value) -> None:
"""Update a single config value and persist immediately."""
config = load_config(path)
config[key] = value
save_config(path, config)
# Demo
config_path = "/tmp/demo_config.json"
# First load creates the file with defaults
config = load_config(config_path)
print(config)
# {'theme': 'dark', 'language': 'en', 'notifications': True, 'max_retries': 3}  (fresh run: defaults in insertion order)
# Update a value
set_value(config_path, "theme", "light")
set_value(config_path, "max_retries", 5)
# Read back
print(get(config_path, "theme")) # light
print(get(config_path, "max_retries")) # 5
print(get(config_path, "missing", 42)) # 42 (default)
# Verify file contents
with open(config_path, encoding="utf-8") as f:
print(f.read())
# {
# "language": "en",
# "max_retries": 5,
# "notifications": true,
# "theme": "light"
# }
Intermediate: Full Round-Trip Serializer for Custom Types
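Build a SmartJSON class that handles datetime, Decimal, UUID, set, and dataclasses. Deserializing should restore the original Python types for datetime, Decimal, UUID, and set. In the solution below, set elements are stored as strings and dataclasses come back as plain dicts, so those two cases are not yet full round-trips.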
Solution
import json
from datetime import datetime
from decimal import Decimal
import uuid
from dataclasses import dataclass, asdict, fields
# Type marker key
TYPE_KEY = "__python_type__"
class SmartEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, datetime):
return {TYPE_KEY: "datetime", "v": obj.isoformat()}
if isinstance(obj, Decimal):
return {TYPE_KEY: "decimal", "v": str(obj)}
if isinstance(obj, uuid.UUID):
return {TYPE_KEY: "uuid", "v": str(obj)}
if isinstance(obj, (set, frozenset)):
return {TYPE_KEY: "set", "v": sorted(str(i) for i in obj)}
if hasattr(obj, "__dataclass_fields__"):
return {TYPE_KEY: "dataclass", "cls": type(obj).__name__, "v": asdict(obj)}
return super().default(obj)
def smart_decoder(obj: dict):
"""object_hook that restores Python types from type-annotated dicts."""
if TYPE_KEY not in obj:
return obj
kind = obj[TYPE_KEY]
val = obj["v"]
if kind == "datetime":
return datetime.fromisoformat(val)
if kind == "decimal":
return Decimal(val)
if kind == "uuid":
return uuid.UUID(val)
if kind == "set":
return set(val)
if kind == "dataclass":
# Note: restoring to dict since we don't have the class in scope here
# In production, maintain a registry of dataclass types
return val
return obj # Unknown type - pass through
class SmartJSON:
"""Drop-in replacement for json module with extended type support."""
@staticmethod
def dumps(obj, **kwargs) -> str:
return json.dumps(obj, cls=SmartEncoder, **kwargs)
@staticmethod
def loads(s: str, **kwargs):
return json.loads(s, object_hook=smart_decoder, **kwargs)
@staticmethod
def dump(obj, fp, **kwargs) -> None:
json.dump(obj, fp, cls=SmartEncoder, **kwargs)
@staticmethod
def load(fp, **kwargs):
return json.load(fp, object_hook=smart_decoder, **kwargs)
# Test round-trips
@dataclass
class Order:
order_id: str
amount: Decimal
created: datetime
data = {
"session_id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
"timestamp": datetime(2024, 1, 15, 14, 30, 0),
"price": Decimal("1234.56"),
"tags": {"python", "backend", "v2"},
}
# Encode
encoded = SmartJSON.dumps(data, indent=2)
print(encoded)
# Decode - restores all original types
restored = SmartJSON.loads(encoded)
print(type(restored["session_id"])) # <class 'uuid.UUID'>
print(type(restored["timestamp"])) # <class 'datetime.datetime'>
print(type(restored["price"])) # <class 'decimal.Decimal'>
print(type(restored["tags"])) # <class 'set'>
# Verify values survived round-trip exactly
assert restored["price"] == Decimal("1234.56") # No float precision loss!
assert restored["timestamp"] == datetime(2024, 1, 15, 14, 30, 0)
print("All round-trip assertions passed.")
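The decoder above returns dataclasses as plain dicts and suggests a registry for production use. One possible shape for that registry is sketched below. DATACLASS_REGISTRY, register, and Point are hypothetical names, and the sketch assumes flat dataclasses whose fields are already JSON-native, because asdict() flattens nested dataclasses into dicts.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass

TYPE_KEY = "__python_type__"

# Hypothetical registry: class name -> dataclass type
DATACLASS_REGISTRY: dict[str, type] = {}

def register(cls):
    """Class decorator that records a dataclass under its name."""
    DATACLASS_REGISTRY[cls.__name__] = cls
    return cls

@register
@dataclass
class Point:
    x: int
    y: int

def encode(obj):
    # Only dataclass instances are handled, not the dataclass types themselves
    if is_dataclass(obj) and not isinstance(obj, type):
        return {TYPE_KEY: "dataclass", "cls": type(obj).__name__, "v": asdict(obj)}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

def decode(obj: dict):
    if obj.get(TYPE_KEY) == "dataclass":
        cls = DATACLASS_REGISTRY.get(obj["cls"])
        if cls is not None:
            return cls(**obj["v"])  # rebuild the instance from its fields
    return obj  # unregistered or ordinary dict: pass through

text = json.dumps({"origin": Point(0, 0), "target": Point(3, 4)}, default=encode)
restored = json.loads(text, object_hook=decode)
print(restored["target"])  # Point(x=3, y=4)
```

Looking classes up by name in an explicit registry, rather than importing whatever name appears in the payload, keeps untrusted JSON from choosing which class gets instantiated.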
Advanced: High-Throughput JSONL Pipeline
Build an event processing pipeline that reads a JSONL log file, filters and transforms events, and writes results to a new JSONL file. Handle malformed lines gracefully. Benchmark the stdlib json version against orjson.
Solution
import json
import random
import time
from datetime import datetime, timedelta, timezone
from typing import Iterator


def utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string with a trailing Z.

    Replaces datetime.utcnow(), which is deprecated since Python 3.12.
    """
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")


# ── Generate sample data ─────────────────────────────────────────────────────
def generate_events(path: str, count: int = 10_000) -> None:
    """Generate a sample JSONL event log."""
    event_types = ["page_view", "click", "purchase", "search", "logout"]
    base_time = datetime(2024, 1, 1)
    with open(path, "w", encoding="utf-8") as f:
        for i in range(count):
            ts = base_time + timedelta(seconds=i * 0.5)
            event = {
                "id": i,
                "type": random.choice(event_types),
                "user_id": random.randint(1, 1000),
                "ts": ts.isoformat() + "Z",
                "value": round(random.uniform(0, 1000), 2),
            }
            f.write(json.dumps(event, separators=(',', ':')) + "\n")
        # Inject some bad lines
        f.write("not json at all\n")
        f.write('{"incomplete": \n')
        f.write("\n")  # Empty line


# ── Pipeline with stdlib json ─────────────────────────────────────────────────
def read_jsonl(path: str) -> Iterator[dict]:
    """Yield parsed events, skipping malformed lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # Blank lines are skipped silently
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                print(f"  Skipping bad line {line_num}: {e.msg}")


def process_events(
    input_path: str,
    output_path: str,
    event_filter: str,
    min_value: float,
) -> int:
    """
    Filter events by type and minimum value, write to new JSONL file.
    Returns count of events written.
    """
    written = 0
    with open(output_path, "w", encoding="utf-8") as out_f:
        for event in read_jsonl(input_path):
            if event.get("type") != event_filter:
                continue
            if event.get("value", 0) < min_value:
                continue
            # Transform: add processing timestamp
            event["processed_at"] = utc_now_iso()
            out_f.write(json.dumps(event, separators=(',', ':')) + "\n")
            written += 1
    return written


# ── Benchmark ────────────────────────────────────────────────────────────────
def benchmark():
    input_path = "/tmp/events.jsonl"
    output_path = "/tmp/purchases.jsonl"

    print("Generating 10,000 events...")
    generate_events(input_path, 10_000)

    print("\nProcessing with stdlib json:")
    start = time.perf_counter()
    count = process_events(input_path, output_path, "purchase", 100.0)
    elapsed = time.perf_counter() - start
    print(f"  Wrote {count} purchase events in {elapsed:.4f}s")

    # Try orjson if available
    try:
        import orjson

        def process_events_orjson(input_path, output_path, event_filter, min_value):
            written = 0
            # orjson works on bytes, so both files are opened in binary mode
            with open(input_path, "rb") as in_f, open(output_path, "wb") as out_f:
                for line in in_f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        event = orjson.loads(line)
                    except orjson.JSONDecodeError:
                        continue  # Malformed lines are skipped silently here
                    if event.get("type") != event_filter:
                        continue
                    if event.get("value", 0) < min_value:
                        continue
                    event["processed_at"] = utc_now_iso()
                    out_f.write(orjson.dumps(event) + b"\n")
                    written += 1
            return written

        print("\nProcessing with orjson:")
        start = time.perf_counter()
        count = process_events_orjson(input_path, "/tmp/purchases_orjson.jsonl", "purchase", 100.0)
        elapsed = time.perf_counter() - start
        print(f"  Wrote {count} purchase events in {elapsed:.4f}s")
    except ImportError:
        print("\norjson not installed. Install with: pip install orjson")

    # Verify output
    events = list(read_jsonl(output_path))
    print("\nVerification: first purchase event:")
    print(json.dumps(events[0], indent=2))


benchmark()
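# Sample output (counts vary with the random data; timings vary by machine):
# Generating 10,000 events...
# Processing with stdlib json:
#   Skipping bad line 10001: Expecting value
#   Skipping bad line 10002: Expecting value
#   Wrote ~1,800 purchase events in 0.0234s
# Processing with orjson:
#   Wrote ~1,800 purchase events in 0.0031s  (≈7x faster)
#
# About 1 in 5 events is a purchase and about 90% of values are >= 100,
# so roughly 1,800 of the 10,000 events pass the filter.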
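The solution above prints a message for each malformed line. In a real pipeline you often want those failures as data instead, so they can be counted or logged elsewhere. The following is a minimal sketch of that idea, not part of the original solution: `BadLine` and `parse_jsonl` are hypothetical names, and the input is an in-memory `io.StringIO` so the example runs without touching the filesystem. It relies only on the `msg` and `colno` attributes of `json.JSONDecodeError`.

```python
import io
import json
from typing import Iterable, Iterator, List, NamedTuple


class BadLine(NamedTuple):
    """Hypothetical record describing one line that failed to parse."""
    line_num: int
    message: str
    column: int


def parse_jsonl(lines: Iterable[str], errors: List[BadLine]) -> Iterator[dict]:
    """Yield parsed objects; append a BadLine to `errors` for each failure."""
    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # blank lines are not treated as errors
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(BadLine(line_num, e.msg, e.colno))


# Same kinds of bad input that generate_events() injects
sample = io.StringIO('{"id": 1}\nnot json at all\n\n{"incomplete": \n{"id": 2}\n')

errors: List[BadLine] = []
events = list(parse_jsonl(sample, errors))

print(events)
for bad in errors:
    print(f"line {bad.line_num}, column {bad.column}: {bad.message}")
```

Passing the `errors` list in keeps the generator lazy while still letting the caller decide what to do with failures once the stream is consumed, for example aborting the run if too many lines are bad.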
Quick Reference
| Operation | Syntax | Notes |
|---|---|---|
| Object to JSON string | json.dumps(obj) | Returns str |
| JSON string to object | json.loads(s) | Returns Python type |
| Object to JSON file | json.dump(obj, f) | f must be open for writing |
| JSON file to object | json.load(f) | f must be open for reading |
| Pretty print | json.dumps(obj, indent=2) | Indent in spaces |
| Sorted keys | json.dumps(obj, sort_keys=True) | Alphabetical key order |
| Compact output | json.dumps(obj, separators=(',', ':')) | No spaces, smaller payload |
| Unicode direct | json.dumps(obj, ensure_ascii=False) | Write non-ASCII as-is |
| Custom encoder class | json.dumps(obj, cls=MyEncoder) | Subclass json.JSONEncoder |
| Custom encoder function | json.dumps(obj, default=fn) | fn(obj) must return serializable value |
| Custom decoder | json.loads(s, object_hook=fn) | Called for every JSON object |
| Handle parse errors | json.JSONDecodeError | Subclass of ValueError |
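| Python→JSON True | true | JSON literals are lowercase and case-sensitive |
| Python→JSON None | null | JSON literals are lowercase and case-sensitive |
| Python→JSON dict | {} object | int, float, bool, and None keys are converted to strings; other non-str keys raise TypeError |
| Python→JSON tuple | [] array | Tuples become arrays and come back as lists |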
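The last four rows of the table are easy to forget because they only show up on a round trip. Here is a small sketch using only the standard library; the `original` dict is made-up sample data:

```python
import json

original = {1: "one", "point": (3, 4), "active": True, "missing": None}

encoded = json.dumps(original)
print(encoded)
# {"1": "one", "point": [3, 4], "active": true, "missing": null}

decoded = json.loads(encoded)
print(decoded)
# {'1': 'one', 'point': [3, 4], 'active': True, 'missing': None}

# The int key became a str and the tuple became a list,
# so the round trip does not give back an equal object.
print(decoded == original)  # False

# Keys that are not str, int, float, bool, or None are rejected outright.
try:
    json.dumps({(1, 2): "tuple key"})
except TypeError as e:
    print(f"TypeError: {e}")
```

If the original key and container types matter, they have to be restored explicitly after loading, for example with an `object_hook`.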
Key Takeaways
- JSON has exactly six types: object, array, string, number, boolean, null - everything else requires explicit handling
- Use json.dumps() / json.loads() for string round-trips; use json.dump() / json.load() for file I/O
- Always open JSON files with encoding="utf-8" - RFC 8259 requires UTF-8 for JSON exchanged between systems
- indent=2 for human-readable output; separators=(',', ':') for compact network payloads; sort_keys=True for deterministic hashing
- Extend json.JSONEncoder and override default() for systematic custom-type handling across your application
- Use object_hook in json.loads() to achieve full round-trip fidelity - restoring original Python types on deserialization
- For financial data: serialize Decimal as str, not float, to avoid floating-point precision loss
- At high throughput (10k+ ops/sec), reach for orjson - it is often several times faster than the stdlib json module, serializes datetime and UUID natively, and handles numpy arrays when you pass option=orjson.OPT_SERIALIZE_NUMPY
- JSONL (one JSON object per line) is the standard format for structured logs, event streams, and ML training datasets
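Two of the takeaways combine naturally: sort_keys=True for deterministic hashing, and Decimal serialized as str. The following is a minimal sketch under those assumptions; `encode_extra`, `canonical_json`, and `fingerprint` are hypothetical helper names, not part of the json module.

```python
import hashlib
import json
from decimal import Decimal


def encode_extra(obj):
    """default= hook: called only for objects json cannot serialize itself."""
    if isinstance(obj, Decimal):
        return str(obj)  # keep the exact digits instead of converting to float
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def canonical_json(obj) -> str:
    """Serialize with sorted keys and no optional whitespace."""
    return json.dumps(
        obj,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        default=encode_extra,
    )


def fingerprint(obj) -> str:
    """SHA-256 hex digest of the canonical JSON text, encoded as UTF-8."""
    return hashlib.sha256(canonical_json(obj).encode("utf-8")).hexdigest()


a = {"user": "alice", "balance": Decimal("99.99")}
b = {"balance": Decimal("99.99"), "user": "alice"}  # same data, different key order

print(canonical_json(a))                 # {"balance":"99.99","user":"alice"}
print(fingerprint(a) == fingerprint(b))  # True
```

Without sort_keys=True the two dicts would serialize with different key order and hash differently, even though they compare equal in Python.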
