Python Input Validation Practice Problems & Exercises
Practice: Input Validation and Sanitization
Easy
Demonstrate why whitelist validation is superior to blacklist validation for usernames.
import re

def whitelist_username(username: str) -> bool:
    # allow only: a-z, A-Z, 0-9, hyphen, underscore, 3-32 chars
    pass

def blacklist_username(username: str) -> bool:
    # block: < > ' " ; & (naive blacklist)
    pass

tests = ["alice", "alice-123", "alice<script>"]
for t in tests:
    print(f"Whitelist '{t}': {'valid' if whitelist_username(t) else 'invalid'}")

blacklist_tests = ["alice", "alice<script>", "alice<SCRIPT>"]
for t in blacklist_tests:
    print(f"Blacklist '{t}': {'valid' if blacklist_username(t) else 'invalid (blocked)'}")
Solution
import re

def whitelist_username(username: str) -> bool:
    # Whitelist: explicit pattern for allowed characters
    return bool(re.fullmatch(r"[a-zA-Z0-9_\-]{3,32}", username))

def blacklist_username(username: str) -> bool:
    # Blacklist: only blocks known bad characters
    bad_chars = set("<>'\";& ")
    return not any(c in bad_chars for c in username)

tests = ["alice", "alice-123", "alice<script>"]
for t in tests:
    print(f"Whitelist '{t}': {'valid' if whitelist_username(t) else 'invalid'}")

blacklist_tests = ["alice", "alice<script>", "alice<SCRIPT>"]
for t in blacklist_tests:
    print(f"Blacklist '{t}': {'valid' if blacklist_username(t) else 'invalid (blocked)'}")
Why blacklists fail:
- alice<SCRIPT>: case variations bypass case-sensitive keyword blacklists (for example, one that blocks only the lowercase string "script").
- alice%3Cscript%3E: URL encoding bypasses character-based blacklists when the value is decoded after the check.
- alice\u003cscript\u003e: Unicode escapes bypass ASCII-only blacklists.
- Blacklists are a game of whack-a-mole: attackers find the bypass you didn't think of.
- Rule: Whitelist by default. Only use blacklists for rate limiting or audit logging — never as the primary defense.
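The bypasses listed above can be reproduced with a short sketch. It assumes a hypothetical keyword denylist, keyword_denylist_ok, that rejects only the lowercase substring "<script"; that helper is illustrative and is not part of the exercise solution.

```python
import re
from urllib.parse import unquote

ALLOWED_USERNAME = re.compile(r"[a-zA-Z0-9_-]{3,32}")

def keyword_denylist_ok(username: str) -> bool:
    """Hypothetical naive denylist: rejects only the lowercase substring '<script'."""
    return "<script" not in username

def allowlist_ok(username: str) -> bool:
    """Accept only names made entirely of explicitly allowed characters."""
    return ALLOWED_USERNAME.fullmatch(username) is not None

payloads = ["alice<script>", "alice<SCRIPT>", "alice%3Cscript%3E"]
for p in payloads:
    print(f"{p!r}: denylist={keyword_denylist_ok(p)} allowlist={allowlist_ok(p)}")

# The encoded payload turns back into markup once something downstream decodes it.
decoded = unquote("alice%3Cscript%3E")
print(f"decoded: {decoded!r}")
```

The denylist stops only the first payload, while the allowlist rejects all three because neither "<" nor "%" is in the permitted character set.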
Expected Output
Whitelist 'alice': valid
Whitelist 'alice-123': valid
Whitelist 'alice<script>': invalid
Blacklist 'alice': valid
Blacklist 'alice<script>': invalid (blocked)
Blacklist 'alice<SCRIPT>': invalid (blocked)
Hints
Hint 1: A whitelist (allowlist) only permits characters/patterns you explicitly define. Everything else is rejected.
Hint 2: A blacklist (denylist) blocks known bad patterns — but attackers can find variants you forgot.
Hint 3: Use re.fullmatch() for whitelist: the pattern must match the entire input, not just part of it.
Implement validate_email(s) and validate_url(s) functions that apply whitelist-style validation.
import re

def validate_email(email: str) -> bool:
    pass

def validate_url(url: str) -> bool:
    # only allow http:// and https:// schemes
    pass

# Sample inputs (assumed; the first address is a placeholder)
emails = ["user@example.com", "user@", "@example.com"]
for e in emails:
    print(f"{e}: {'valid' if validate_email(e) else 'invalid'} email")

urls = [
    "https://example.com",
    "http://example.com",
    "ftp://example.com",
    "javascript:alert(1)",
]
for u in urls:
    status = "valid" if validate_url(u) else "invalid"
    note = " URL (scheme not allowed)" if not validate_url(u) and "://" in u else " URL"
    print(f"{u}: {status}{note}")
Solution
import re

def validate_email(email: str) -> bool:
    # Reasonable subset of RFC 5321, good enough for most applications
    pattern = r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"
    return bool(re.fullmatch(pattern, email))

def validate_url(url: str) -> bool:
    # Only allow http and https schemes
    pattern = r"^https?://[a-zA-Z0-9.\-_/:%?#&=@\[\]+~]+$"
    return bool(re.fullmatch(pattern, url))

# Sample inputs (assumed; the first address is a placeholder)
emails = ["user@example.com", "user@", "@example.com"]
for e in emails:
    print(f"{e}: {'valid' if validate_email(e) else 'invalid'} email")

urls = [
    "https://example.com",
    "http://example.com",
    "ftp://example.com",
    "javascript:alert(1)",
]
for u in urls:
    valid = validate_url(u)
    status = "valid" if valid else "invalid"
    note = " URL (scheme not allowed)" if not valid and "://" in u else " URL"
    print(f"{u}: {status}{note}")
Production email validation: Use a library like email-validator for full RFC compliance. In most applications, a format check plus a confirmation email is the right approach, because a regex alone cannot confirm that the address is real.
URL validation: Always validate the scheme before rendering URLs as links, to prevent javascript: XSS.
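As an alternative to a regex, the scheme check can be sketched with the standard library's URL parser. The helper name has_allowed_scheme and the ALLOWED_SCHEMES set are illustrative assumptions, not part of the exercise.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}

def has_allowed_scheme(url: str) -> bool:
    """Return True only for http/https URLs that also include a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs, e.g. a bad IPv6 literal
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)

for candidate in ["https://example.com", "ftp://example.com", "javascript:alert(1)"]:
    print(f"{candidate}: {has_allowed_scheme(candidate)}")
```

Parsing first and then comparing the scheme against a small set keeps the decision independent of how the rest of the URL is written.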
Expected Output
[email protected]: valid email
user@: invalid email
@example.com: invalid email
https://example.com: valid URL
http://example.com: valid URL
ftp://example.com: invalid URL (scheme not allowed)
javascript:alert(1): invalid URL
Hints
Hint 1: For email, use a simple pattern: [email protected]. Do not try to fully implement RFC 5321.
Hint 2: For URLs, validate the scheme first (https:// or http:// only) to prevent javascript: and data: XSS.
Hint 3: Use re.fullmatch() to ensure the entire string matches, not just a substring.
Implement a safe HTML rendering function using html.escape.
import html

def safe_render_paragraph(user_input: str) -> str:
    # escape user input and wrap in a paragraph tag
    pass

raw = "<script>alert('XSS')</script>"
user_content = "<b>World</b>!"
print(f"Raw: {raw}")
print(f"Escaped: {html.escape(raw, quote=True)}")
print(f"Safe HTML: {safe_render_paragraph('Hello, ' + user_content)}")
Solution
import html

def safe_render_paragraph(user_input: str) -> str:
    escaped = html.escape(user_input, quote=True)
    return f"<p>{escaped}</p>"

raw = "<script>alert('XSS')</script>"
user_content = "<b>World</b>!"
print(f"Raw: {raw}")
print(f"Escaped: {html.escape(raw, quote=True)}")
print(f"Safe HTML: {safe_render_paragraph('Hello, ' + user_content)}")
Escape on output, not input:
- Store raw data — escaping at storage time means double-escaping if you later escape again.
- Escape at render time — you may need to render the same data in different contexts (HTML, JSON, CSV) each requiring different escaping.
- Context matters: HTML body escaping differs from HTML attribute escaping, JavaScript string escaping, and URL parameter escaping.
- Use a templating engine (Jinja2, Django templates) that auto-escapes by default — only opt out when you're rendering trusted HTML.
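The points above about context and double-escaping can be illustrated with a small sketch. The sample value and variable names are made up for illustration.

```python
import html
import json
from urllib.parse import quote

user_value = 'Tom & "Jerry" <b>'

# The same raw value needs a different encoding in each output context.
as_html = html.escape(user_value)      # HTML body or quoted attribute
as_json = json.dumps(user_value)       # JSON string literal
as_url = quote(user_value, safe="")    # URL query-parameter value

print(as_html)
print(as_json)
print(as_url)

# Escaping at storage time and again at render time corrupts the text.
stored_escaped = html.escape(user_value)
double_escaped = html.escape(stored_escaped)
print(double_escaped)
```

Storing the raw value and choosing the encoder at render time avoids both problems.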
Expected Output
Raw: <script>alert('XSS')</script>
Escaped: &lt;script&gt;alert(&#x27;XSS&#x27;)&lt;/script&gt;
Safe HTML: <p>Hello, &lt;b&gt;World&lt;/b&gt;!</p>
Hints
Hint 1: Use html.escape(text) to convert <, >, and & to HTML entities; with the default quote=True it also converts " and '.
Hint 2: Keep quote=True (the default) when inserting into quoted HTML attributes, so that both double and single quotes are escaped.
Hint 3: Always escape on OUTPUT (when rendering to HTML), not on input storage.
Implement null byte detection and sanitization for file paths.
def validate_filename(name: str) -> bool:
    """Return False if filename contains null bytes or other dangerous characters."""
    pass

def sanitize_filename(name: str) -> str:
    """Remove null bytes and control characters from filename."""
    pass

tests = [
    ("safe_file.txt", True),
    ("file.txt\x00.jpg", False),
]
for name, expected in tests:
    result = validate_filename(name)
    status = "valid" if result else "invalid (null byte)"
    print(f"{repr(name)}: {status}")

malicious = "safe_file.txt\x00.exe"
clean = sanitize_filename(malicious)
print(f"After null byte rejection: {clean}")
# chr(0) avoids a backslash inside the f-string expression, which older Python versions reject
print(f"Truncation attack blocked: {'.exe' not in clean or chr(0) not in clean}")
Solution
import re

def validate_filename(name: str) -> bool:
    if "\x00" in name:
        return False
    if any(ord(c) < 32 for c in name):
        return False
    # Only allow safe filename characters
    return bool(re.fullmatch(r"[a-zA-Z0-9._\-]{1,255}", name))

def sanitize_filename(name: str) -> str:
    # Remove null bytes and control characters
    cleaned = "".join(c for c in name if ord(c) >= 32 and c != "\x00")
    # Keep only safe characters
    cleaned = re.sub(r"[^a-zA-Z0-9._\-]", "_", cleaned)
    return cleaned[:255]

tests = [
    ("safe_file.txt", True),
    ("file.txt\x00.jpg", False),
]
for name, expected in tests:
    result = validate_filename(name)
    status = "valid" if result else "invalid (null byte)"
    print(f"{repr(name)}: {status}")

malicious = "safe_file.txt\x00.exe"
clean = sanitize_filename(malicious)
print(f"After null byte rejection: {clean}")
print(f"Truncation attack blocked: {'.exe' not in clean or chr(0) not in clean}")
Null byte history: PHP's fopen(), include(), and many C library functions treat \x00 as a string terminator. An attacker submitting ../../etc/passwd\x00.jpg could exploit PHP apps that validated only the .jpg extension but passed the raw string to C file-open calls. In modern versions, Python's open() raises ValueError on null bytes in filenames, but external calls (subprocess, ctypes, database drivers) may not.
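The behavior of Python's own open() can be checked with a short sketch. The helper try_open and the sample path are illustrative; the exact exception message may vary between Python versions, so the code only reports it.

```python
def try_open(path: str) -> str:
    """Attempt to open a path and describe what happened."""
    try:
        with open(path, "rb"):
            return "opened"
    except ValueError as exc:
        # Raised before any system call when the path contains a null byte
        return f"rejected: {exc}"
    except OSError as exc:
        return f"os error: {type(exc).__name__}"

print(try_open("report.txt\x00.jpg"))
```

This protection covers only open(); a string with an embedded null byte still needs to be rejected before it reaches other C-level interfaces.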
Expected Output
'safe_file.txt': valid
'file.txt\x00.jpg': invalid (null byte)
After null byte rejection: safe_file.txt.exe
Truncation attack blocked: True
Hints
Hint 1: Null bytes (\x00) can truncate strings in C-based code. "file.txt\x00.jpg" may be treated as "file.txt" by the OS.
Hint 2: Check for \x00 in any string that will be passed to file system, database, or C library calls.
Hint 3: Strip or reject any input containing control characters (ord(c) < 32).
Medium
Implement a safe_file_path(base_dir, user_filename) function that prevents directory traversal.
import os
from pathlib import Path
from urllib.parse import unquote

BASE_DIR = "/var/www/files"

def safe_file_path(base_dir: str, user_filename: str) -> str:
    """
    Return the safe absolute path for user_filename within base_dir.
    Raise ValueError for any traversal attempt.
    """
    pass

# Valid
print(f"Safe path: {safe_file_path(BASE_DIR, 'report.pdf')}")

# Attacks
for filename, label in [
    ("../../../etc/passwd", "Traversal blocked"),
    ("%2e%2e%2fetc%2fpasswd", "Encoded traversal blocked"),
    ("/etc/passwd", "Absolute path blocked"),
]:
    try:
        safe_file_path(BASE_DIR, filename)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")
Solution
import os
from pathlib import Path
from urllib.parse import unquote

BASE_DIR = "/var/www/files"

def safe_file_path(base_dir: str, user_filename: str) -> str:
    # Decode URL encoding first
    decoded = unquote(user_filename)
    # Reject absolute paths
    if os.path.isabs(decoded):
        raise ValueError("absolute path not allowed")
    # Resolve the full path
    base = Path(base_dir).resolve()
    full_path = (base / decoded).resolve()
    # Ensure the resolved path is within base_dir
    try:
        full_path.relative_to(base)
    except ValueError:
        raise ValueError("path traversal detected")
    return str(full_path)

print(f"Safe path: {safe_file_path(BASE_DIR, 'report.pdf')}")

for filename, label in [
    ("../../../etc/passwd", "Traversal blocked"),
    ("%2e%2e%2fetc%2fpasswd", "Encoded traversal blocked"),
    ("/etc/passwd", "Absolute path blocked"),
]:
    try:
        safe_file_path(BASE_DIR, filename)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")
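The resolve() trick: Path.resolve() canonicalizes the path: it follows symlinks and collapses .. components. After resolving, path.relative_to(base) raises ValueError if path is not under base. Checking containment on the resolved path is the robust approach; a string-based check like if '..' in filename is insufficient because of encoding variants and symlink attacks. A symlink created between the check and the later open() can still redirect the access, so the base directory should not be writable by untrusted users.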
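One reason to prefer relative_to() over a plain string prefix test is sketched below. It uses PurePosixPath so the result does not depend on the local filesystem; the helper is_within and the sibling directory name are illustrative assumptions, and the paths are taken to be already resolved.

```python
from pathlib import PurePosixPath

def is_within(base: str, candidate: str) -> bool:
    """Component-wise containment check on already-resolved POSIX paths."""
    try:
        PurePosixPath(candidate).relative_to(PurePosixPath(base))
    except ValueError:
        return False
    return True

base = "/var/www/files"
sibling = "/var/www/files_backup/secret.txt"  # a different directory sharing the prefix

print(f"startswith: {sibling.startswith(base)}")
print(f"relative_to: {is_within(base, sibling)}")
print(f"inside base: {is_within(base, '/var/www/files/report.pdf')}")
```

The prefix test accepts the sibling directory because it compares characters, while relative_to() compares whole path components.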
Expected Output
Safe path: /var/www/files/report.pdf
Traversal blocked: ValueError: path traversal detected
Encoded traversal blocked: ValueError: path traversal detected
Absolute path blocked: ValueError: absolute path not allowed
Hints
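Hint 1: Resolve the full path with os.path.realpath() or pathlib.Path.resolve(), then check that it lies inside the allowed base directory. Compare path components (for example with Path.relative_to()) rather than using str.startswith(), which would also accept a sibling directory whose name merely begins with the base path.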
Hint 2: Decode the filename before checking — %2F is /, %2E%2E is ..
Hint 3: Reject absolute paths in user-supplied filenames — only allow relative paths within the base directory.
Demonstrate unicode normalization attacks and implement a normalizing validator.
import unicodedata

def normalize_input(text: str, form: str = "NFC") -> str:
    """Normalize unicode text so canonically equivalent strings compare equal."""
    pass

def safe_compare(a: str, b: str) -> bool:
    """Compare strings after NFC normalization."""
    pass

# Composed vs decomposed form
composed = "caf\u00e9"  # café (precomposed)
decomposed = "cafe\u0301"  # café (e + combining accent)
print(f"Raw equal: {composed == decomposed}")
print(f"NFC normalized equal: {normalize_input(composed) == normalize_input(decomposed)}")

# Homograph: Cyrillic 'а' (U+0430) looks like Latin 'a' (U+0061)
cyrillic_admin = "\u0430dmin"
latin_admin = "admin"
nfkc_cyrillic = unicodedata.normalize("NFKC", cyrillic_admin)
nfkc_latin = unicodedata.normalize("NFKC", latin_admin)
print(f"Visually identical strings are equal after normalization: {safe_compare(composed, decomposed)}")
print(f"Homograph 'аdmin' vs 'admin': different after NFKC? {nfkc_cyrillic != nfkc_latin}")
Solution
import unicodedata

def normalize_input(text: str, form: str = "NFC") -> str:
    return unicodedata.normalize(form, text)

def safe_compare(a: str, b: str) -> bool:
    return normalize_input(a) == normalize_input(b)

composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(f"Raw equal: {composed == decomposed}")
print(f"NFC normalized equal: {normalize_input(composed) == normalize_input(decomposed)}")

cyrillic_admin = "\u0430dmin"
latin_admin = "admin"
nfkc_cyrillic = unicodedata.normalize("NFKC", cyrillic_admin)
nfkc_latin = unicodedata.normalize("NFKC", latin_admin)
print(f"Visually identical strings are equal after normalization: {safe_compare(composed, decomposed)}")
print(f"Homograph 'аdmin' vs 'admin': different after NFKC? {nfkc_cyrillic != nfkc_latin}")
Real attack: A user registers with username аdmin (Cyrillic а), and a moderator sees it as admin in the UI. The attacker can trick users who see the username into following the account, or exploit access-control checks that compare strings without normalization. Normalize to NFC/NFKC before storing usernames, comparing access-control strings, and rendering in security-sensitive UIs. Normalization alone does not map Cyrillic а to Latin a, so homographs also require a mixed-script or confusable-character check, or an ASCII-only allowlist.
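A rough mixed-script check can be sketched with the standard library alone. The helpers scripts_used and is_mixed_script are illustrative assumptions: they take the first word of each letter's Unicode name as its script, which is only a heuristic and will not catch a name written entirely in one look-alike script.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Rough script detection: the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(text: str) -> bool:
    """Flag identifiers that combine letters from more than one script."""
    return len(scripts_used(text)) > 1

cyrillic_admin = "\u0430dmin"  # first letter is CYRILLIC SMALL LETTER A
print(sorted(scripts_used(cyrillic_admin)))
print(is_mixed_script(cyrillic_admin), is_mixed_script("admin"))
# NFKC leaves the Cyrillic letter untouched, so normalization cannot replace this check.
print(unicodedata.normalize("NFKC", cyrillic_admin) == cyrillic_admin)
```

For production use, the Unicode confusables data offers a more complete basis than this name-based heuristic.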
Expected Output
Raw equal: False
NFC normalized equal: True
Visually identical strings are equal after normalization: True
Homograph 'аdmin' vs 'admin': different after NFKC? True
Hints
Hint 1: Unicode has multiple representations for some characters. "café" (e + combining accent) and "café" (precomposed) are visually identical but not byte-equal.
Hint 2: Use unicodedata.normalize("NFC", s) or "NFKC" to canonicalize before comparison.
Hint 3: NFKC is stricter — it also maps compatibility characters (e.g., Roman numeral Ⅰ to I).
Validate user registration payloads against a JSON schema.
import jsonschema

USER_SCHEMA = {
    "type": "object",
    "required": ["email", "username", "age"],
    "additionalProperties": False,
    "properties": {
        "email": {"type": "string", "format": "email", "maxLength": 254},
        "username": {"type": "string", "minLength": 3, "maxLength": 32, "pattern": "^[a-zA-Z0-9_\\-]+$"},
        "age": {"type": "integer", "minimum": 13, "maximum": 120},
    },
}

def validate_user(payload: dict) -> str:
    try:
        jsonschema.validate(instance=payload, schema=USER_SCHEMA)
        return "OK"
    except jsonschema.ValidationError as e:
        return f"ValidationError: {e.message}"

# Sample payloads: valid, wrong, and extra are assumed for illustration
valid = {"email": "alice@example.com", "username": "alice_99", "age": 25}
missing = {"username": "alice_99", "age": 25}
wrong = {"email": "alice@example.com", "username": 42, "age": 25}
extra = {"email": "alice@example.com", "username": "alice_99", "age": 25, "role": "admin"}

print(f"Valid payload: {validate_user(valid)}")
print(f"Missing required field: {validate_user(missing)}")
print(f"Wrong type: {validate_user(wrong)}")
print(f"Extra field (if additionalProperties=false): {validate_user(extra)}")
Solution
import jsonschema

USER_SCHEMA = {
    "type": "object",
    "required": ["email", "username", "age"],
    "additionalProperties": False,
    "properties": {
        "email": {"type": "string", "format": "email", "maxLength": 254},
        "username": {"type": "string", "minLength": 3, "maxLength": 32, "pattern": "^[a-zA-Z0-9_\\-]+$"},
        "age": {"type": "integer", "minimum": 13, "maximum": 120},
    },
}

def validate_user(payload: dict) -> str:
    try:
        jsonschema.validate(instance=payload, schema=USER_SCHEMA)
        return "OK"
    except jsonschema.ValidationError as e:
        return f"ValidationError: {e.message}"

# Sample payloads: valid, wrong, and extra are assumed for illustration
valid = {"email": "alice@example.com", "username": "alice_99", "age": 25}
missing = {"username": "alice_99", "age": 25}
wrong = {"email": "alice@example.com", "username": 42, "age": 25}
extra = {"email": "alice@example.com", "username": "alice_99", "age": 25, "role": "admin"}

print(f"Valid payload: {validate_user(valid)}")
print(f"Missing required field: {validate_user(missing)}")
print(f"Wrong type: {validate_user(wrong)}")
print(f"Extra field (if additionalProperties=false): {validate_user(extra)}")
additionalProperties: false is a critical security setting. Without it, an attacker can include unexpected fields like {"role": "admin"} in a registration payload. If any downstream code reads role from the validated object without explicitly checking it was allowed, privilege escalation is possible. Always define explicit schemas with additionalProperties: false for security-sensitive inputs.
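The same idea can be sketched without a schema library: reject unknown keys, then copy only the permitted ones. The names ALLOWED_FIELDS and pick_allowed and the sample payload are illustrative assumptions.

```python
ALLOWED_FIELDS = ("email", "username", "age")

def pick_allowed(payload: dict) -> dict:
    """Reject unknown keys, then copy only the explicitly allowed ones."""
    unexpected = set(payload) - set(ALLOWED_FIELDS)
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    return {key: payload[key] for key in ALLOWED_FIELDS if key in payload}

attack = {"email": "a@example.com", "username": "alice_99", "age": 25, "role": "admin"}
try:
    pick_allowed(attack)
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Building the result from the allowed list, rather than deleting known-bad keys from the input, means a field that nobody anticipated can never reach downstream code.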
Expected Output
Valid payload: OK
Missing required field: ValidationError: 'email' is a required property
Wrong type: ValidationError: 42 is not of type 'string'
Extra field (if additionalProperties=false): ValidationError
Hints
Hint 1: Use the jsonschema library: jsonschema.validate(instance, schema) raises jsonschema.ValidationError on failure.
Hint 2: Mark required fields with the "required" array in the schema.
Hint 3: Set "additionalProperties": false to reject unknown fields — prevents parameter pollution attacks.
Implement strict type validators that prevent type coercion and mass assignment attacks.
def strict_string(value, field_name: str) -> str:
    """Accept only str, raise ValueError for other types."""
    pass

def strict_int(value, field_name: str, min_val: int = None, max_val: int = None) -> int:
    """Accept int (or str that parses to int), reject other types."""
    pass

def strict_bool(value, field_name: str) -> bool:
    """Accept only bool, reject int/str coercions."""
    pass

# Valid
print(f"String \"42\": validated as int {strict_int('42', 'age')}")
print(f"Boolean True as \"active\": safe_bool={strict_bool(True, 'active')}")

# Attacks
for value, field, fn, label in [
    (["admin", "user"], "username", strict_string, "List injection blocked"),
    ({"role": "admin"}, "username", strict_string, "Nested object injection blocked"),
]:
    try:
        fn(value, field)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")
Solution
def strict_string(value, field_name: str) -> str:
    if not isinstance(value, str):
        raise ValueError(f"expected str, got {type(value).__name__}")
    return value

def strict_int(value, field_name: str, min_val: int = None, max_val: int = None) -> int:
    if isinstance(value, bool):  # bool is a subclass of int, so reject it explicitly
        raise ValueError("expected int, got bool")
    if isinstance(value, str):
        try:
            value = int(value)
        except ValueError:
            raise ValueError(f"{field_name}: cannot convert '{value}' to int")
    if not isinstance(value, int):
        raise ValueError(f"expected int, got {type(value).__name__}")
    if min_val is not None and value < min_val:
        raise ValueError(f"{field_name}: {value} < minimum {min_val}")
    if max_val is not None and value > max_val:
        raise ValueError(f"{field_name}: {value} > maximum {max_val}")
    return value

def strict_bool(value, field_name: str) -> bool:
    if not isinstance(value, bool):
        raise ValueError(f"expected bool, got {type(value).__name__}")
    return value

print(f"String \"42\": validated as int {strict_int('42', 'age')}")
print(f"Boolean True as \"active\": safe_bool={strict_bool(True, 'active')}")

for value, field, fn, label in [
    (["admin", "user"], "username", strict_string, "List injection blocked"),
    ({"role": "admin"}, "username", strict_string, "Nested object injection blocked"),
]:
    try:
        fn(value, field)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")
Python gotchas: bool is a subclass of int in Python — isinstance(True, int) returns True. Always check for bool before int if you want to reject boolean inputs. None is falsy but isinstance(None, str) is False — so explicit isinstance checks are safer than truthiness checks.
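These gotchas are easy to confirm interactively. The helper is_strict_int below is an illustrative assumption showing one way to exclude bool with an exact type check.

```python
# bool is a subclass of int, so isinstance alone cannot tell them apart.
print(isinstance(True, int))
print(True == 1, True + True)

def is_strict_int(value) -> bool:
    """Exact type check: accepts int but not its subclass bool."""
    return type(value) is int

print(is_strict_int(5), is_strict_int(True))

# Truthiness collapses distinct values that usually deserve different handling.
falsy_values = [None, 0, "", [], {}]
print([bool(v) for v in falsy_values])
print([isinstance(v, str) for v in falsy_values])
```

An exact type check is stricter than isinstance and also rejects other int subclasses, which is usually what a validator for untrusted JSON wants.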
Expected Output
String "42": validated as int 42
Boolean True as "active": safe_bool=True
List injection blocked: ValueError: expected str, got list
Nested object injection blocked: ValueError: expected str, got dict
Hints
Hint 1: JSON parsers return Python objects — {"age": "42"} gives str while {"age": 42} gives int. Always assert the type explicitly.
Hint 2: A malicious client might send a list where you expect a string: {"username": ["admin", "user"]}.
Hint 3: Use isinstance() checks, not just truthiness — None, 0, and "" are all falsy but have different semantic meanings.
Hard
Demonstrate a ReDoS-vulnerable regex and implement a timeout-protected validator.
import re
import threading
import time

# Safe email regex (no nested quantifiers)
SAFE_EMAIL = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$")
# ReDoS-vulnerable pattern: (a+)+ causes catastrophic backtracking
REDOS_PATTERN = re.compile(r"^(a+)+$")

def validate_email_safe(email: str) -> bool:
    return bool(SAFE_EMAIL.fullmatch(email))

def validate_with_timeout(pattern: re.Pattern, text: str, timeout: float = 0.1) -> bool:
    """Run regex match in thread with timeout. Returns False if timeout."""
    result = [None]

    def run():
        result[0] = bool(pattern.fullmatch(text))

    t = threading.Thread(target=run, daemon=True)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        return False  # timeout: treat as no match
    return bool(result[0])

def is_redos_dangerous(pattern_str: str) -> bool:
    """Heuristic: detect obvious nested quantifier patterns."""
    dangerous = [r"\+\)+\+", r"\*\)+\*", r"\+\)+\*", r"\*\)+\+"]
    return any(re.search(d, pattern_str) for d in dangerous)

# 'user@example.com' is an assumed sample address
print(f"Safe regex matches email: {validate_email_safe('user@example.com')}")
print(f"Safe regex rejects invalid: {validate_email_safe('not-an-email')}")
redos_str = r"^(a+)+$"
print(f"ReDoS pattern detected as dangerous: {is_redos_dangerous(redos_str)}")

# ReDoS attack string: 'aaa...a!' cannot match but triggers catastrophic backtracking
attack = "a" * 25 + "!"
start = time.perf_counter()
result = validate_with_timeout(REDOS_PATTERN, attack, timeout=0.5)
elapsed = time.perf_counter() - start
print(f"Timeout protection triggered: {not result and elapsed < 1.0}")
Solution
import re
import threading
import time

SAFE_EMAIL = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$")
REDOS_PATTERN = re.compile(r"^(a+)+$")

def validate_email_safe(email: str) -> bool:
    return bool(SAFE_EMAIL.fullmatch(email))

def validate_with_timeout(pattern: re.Pattern, text: str, timeout: float = 0.1) -> bool:
    result = [None]

    def run():
        result[0] = bool(pattern.fullmatch(text))

    t = threading.Thread(target=run, daemon=True)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        return False
    return bool(result[0])

def is_redos_dangerous(pattern_str: str) -> bool:
    # Detect nested quantifiers: (x+)+, (x*)*, etc.
    nested = re.search(r'[\(\|][^)]*[+*]\)+[+*?]', pattern_str)
    return bool(nested)

# 'user@example.com' is an assumed sample address
print(f"Safe regex matches email: {validate_email_safe('user@example.com')}")
print(f"Safe regex rejects invalid: {validate_email_safe('not-an-email')}")
redos_str = r"^(a+)+$"
print(f"ReDoS pattern detected as dangerous: {is_redos_dangerous(redos_str)}")

attack = "a" * 25 + "!"
start = time.perf_counter()
result = validate_with_timeout(REDOS_PATTERN, attack, timeout=0.5)
elapsed = time.perf_counter() - start
print(f"Timeout protection triggered: {not result and elapsed < 1.0}")
ReDoS in production: Cloudflare suffered a major outage in 2019 caused by a regular expression in a WAF rule. The pattern, which contained the sub-expression .*(?:.*=.*), backtracked catastrophically on certain inputs. Use the re2 Python binding (guaranteed linear time) for user-supplied or complex patterns.
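Where swapping the regex engine is not an option, two cheap mitigations are to remove the nesting from the pattern and to cap input length. The sketch below assumes a hypothetical limit MAX_LEN and compares the nested pattern from this exercise with an equivalent flat one; timings are printed rather than asserted because they depend on the machine.

```python
import re
import time

NESTED = re.compile(r"(a+)+")  # nested quantifier: exponential on near-misses
FLAT = re.compile(r"a+")       # accepts the same strings without nesting
MAX_LEN = 256                  # hypothetical cap on accepted input length

def matches_flat(text: str) -> bool:
    """Length-capped match using the non-nested pattern."""
    if len(text) > MAX_LEN:
        return False
    return FLAT.fullmatch(text) is not None

def time_match(pattern: re.Pattern, text: str) -> float:
    start = time.perf_counter()
    pattern.fullmatch(text)
    return time.perf_counter() - start

# Small sizes keep the demo fast; each extra 'a' roughly doubles the nested time.
for n in (14, 16, 18):
    attack = "a" * n + "!"
    print(f"n={n} nested={time_match(NESTED, attack):.4f}s flat={time_match(FLAT, attack):.6f}s")
```

The flat pattern fails in a single pass, and the length cap bounds the work even for patterns that cannot be rewritten so easily.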
Expected Output
Safe regex matches email: True
Safe regex rejects invalid: False
ReDoS pattern detected as dangerous: True
Timeout protection triggered: True
Hints
Hint 1: Catastrophic backtracking (ReDoS) happens with patterns like (a+)+ — exponential backtracking on no-match strings.
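Hint 2: Run the match in a thread with a timeout, or use the regex module, which supports timeouts. CPython's re engine holds the GIL while matching, so a thread-based timeout may not fire until the match finishes; a separate process or the regex module's timeout is more dependable.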
Hint 3: Avoid nested quantifiers: (a+)+, (a|a)*, (a*)*. Use possessive quantifiers or atomic groups if available.
Build a Pydantic-inspired field validation model from scratch using dataclasses and __post_init__.
import re
from dataclasses import dataclass

class ValidationError(Exception):
    pass

def validate_username(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("username must be a string")
    if len(value) < 3:
        raise ValidationError("username too short")
    if len(value) > 32:
        raise ValidationError("username too long")
    if not re.fullmatch(r"[a-zA-Z0-9_\-]+", value):
        raise ValidationError("username contains invalid characters")
    return value

def validate_email(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("email must be a string")
    if not re.fullmatch(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}", value):
        raise ValidationError("invalid email format")
    return value.lower()

def validate_age(value) -> int:
    try:
        value = int(value)
    except (TypeError, ValueError):
        raise ValidationError("age must be an integer")
    if not (13 <= value <= 120):
        raise ValidationError("age must be 13-120")
    return value

@dataclass
class UserInput:
    username: str
    email: str
    age: int

    def __post_init__(self):
        self.username = validate_username(self.username)
        self.email = validate_email(self.email)
        self.age = validate_age(self.age)

# Tests (sample values other than the "Invalid email" case are assumed for illustration)
try:
    u = UserInput(username="alice", email="alice@example.com", age=25)
    print(f"Valid model: {u}")
except ValidationError as e:
    print(f"Unexpected: {e}")

for kwargs, label in [
    ({"username": "al", "email": "alice@example.com", "age": 25}, "Short username"),
    ({"username": "alice", "email": "not-an-email", "age": 25}, "Invalid email"),
    ({"username": "alice", "email": "alice@example.com", "age": 200}, "Age out of range"),
]:
    try:
        UserInput(**kwargs)
    except ValidationError as e:
        print(f"{label}: ValidationError: {e}")
Solution
import re
from dataclasses import dataclass

class ValidationError(Exception):
    pass

def validate_username(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("username must be a string")
    if len(value) < 3:
        raise ValidationError("username too short")
    if len(value) > 32:
        raise ValidationError("username too long")
    if not re.fullmatch(r"[a-zA-Z0-9_\-]+", value):
        raise ValidationError("username contains invalid characters")
    return value

def validate_email(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("email must be a string")
    if not re.fullmatch(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}", value):
        raise ValidationError("invalid email format")
    return value.lower()

def validate_age(value) -> int:
    try:
        value = int(value)
    except (TypeError, ValueError):
        raise ValidationError("age must be an integer")
    if not (13 <= value <= 120):
        raise ValidationError("age must be 13-120")
    return value

@dataclass
class UserInput:
    username: str
    email: str
    age: int

    def __post_init__(self):
        self.username = validate_username(self.username)
        self.email = validate_email(self.email)
        self.age = validate_age(self.age)

# Sample values other than the "Invalid email" case are assumed for illustration
try:
    u = UserInput(username="alice", email="alice@example.com", age=25)
    print(f"Valid model: {u}")
except ValidationError as e:
    print(f"Unexpected: {e}")

for kwargs, label in [
    ({"username": "al", "email": "alice@example.com", "age": 25}, "Short username"),
    ({"username": "alice", "email": "not-an-email", "age": 25}, "Invalid email"),
    ({"username": "alice", "email": "alice@example.com", "age": 200}, "Age out of range"),
]:
    try:
        UserInput(**kwargs)
    except ValidationError as e:
        print(f"{label}: ValidationError: {e}")
Pydantic v2 in production: This exercise shows the internals of what Pydantic automates. In production, use Pydantic v2: its core validation logic is implemented in Rust (pydantic-core), which the project reports as roughly 5-50x faster than Pydantic v1. The @field_validator and @model_validator decorators provide the same pattern with much less boilerplate.
Expected Output
Valid model: UserInput(username='alice', email='[email protected]', age=25)
Short username: ValidationError: username too short
Invalid email: ValidationError: invalid email format
Age out of range: ValidationError: age must be 13-120
Hints
Hint 1: Build a simple validation framework using __init_subclass__ or descriptors to register field validators.
Hint 2: A field descriptor can hold validation rules (min_length, max_length, pattern) and raise ValueError on bad input.
Hint 3: The __post_init__ pattern with dataclasses is a lightweight alternative to full Pydantic.
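The descriptor approach from Hint 2 can be sketched as follows. The class names ValidatedStr and Account and the specific limits are illustrative assumptions, not the exercise's reference solution.

```python
import re

class ValidatedStr:
    """Data descriptor that validates a string on every assignment."""

    def __init__(self, min_length=0, max_length=255, pattern=None):
        self.min_length = min_length
        self.max_length = max_length
        self.pattern = re.compile(pattern) if pattern else None

    def __set_name__(self, owner, name):
        # Called once at class creation; remembers which attribute this guards
        self.public_name = name
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        if not isinstance(value, str):
            raise ValueError(f"{self.public_name} must be a string")
        if not (self.min_length <= len(value) <= self.max_length):
            raise ValueError(f"{self.public_name} length out of range")
        if self.pattern and not self.pattern.fullmatch(value):
            raise ValueError(f"{self.public_name} has invalid characters")
        setattr(instance, self.private_name, value)

class Account:
    username = ValidatedStr(min_length=3, max_length=32, pattern=r"[a-zA-Z0-9_-]+")

    def __init__(self, username: str):
        self.username = username  # routed through ValidatedStr.__set__

acct = Account("alice")
print(acct.username)
try:
    acct.username = "<script>"
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Unlike __post_init__, which runs only at construction, a descriptor also validates later reassignments of the field.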
Build a composable input sanitization pipeline with pluggable sanitizer steps.
import re
import unicodedata
import html

def make_pipeline(*sanitizers):
    """Compose multiple sanitizer functions into a pipeline."""
    def pipeline(text: str) -> str:
        for san in sanitizers:
            text = san(text)
        return text
    return pipeline

def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def strip_control_chars(text: str) -> str:
    return "".join(c for c in text if ord(c) >= 32 or c in "\n\r\t")

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

def remove_sql_meta(text: str) -> str:
    return re.sub(r"[;'\"\-]", "", text)

def remove_path_components(text: str) -> str:
    return re.sub(r"[/\\.]", "", text)

def truncate(max_len: int):
    return lambda text: text[:max_len]

# Demonstrate individual sanitizers
print(f"Clean input: {normalize_unicode('hello world 123')}")
print(f"XSS cleaned: {remove_html_tags('hello <script>alert(1)</script>')}")
# Bind the input to a name first: escaped quotes are not valid inside an f-string expression
sql_input = "hello' OR 1=1 -- world"
print(f"SQL meta cleaned: {remove_sql_meta(sql_input)}")
print(f"Path traversal cleaned: {remove_path_components('../etc/passwd')}")

# Full pipeline
sanitize_comment = make_pipeline(
    normalize_unicode,
    strip_control_chars,
    remove_html_tags,
    truncate(1000),
)
print(f"Pipeline result: {sanitize_comment('hello world 123')}")
Solution
import re
import unicodedata

def make_pipeline(*sanitizers):
    def pipeline(text: str) -> str:
        for san in sanitizers:
            text = san(text)
        return text
    return pipeline

def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def strip_control_chars(text: str) -> str:
    return "".join(c for c in text if ord(c) >= 32 or c in "\n\r\t")

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

def remove_sql_meta(text: str) -> str:
    return re.sub(r"[;'\"\-]", "", text)

def remove_path_components(text: str) -> str:
    return re.sub(r"[/\\.]", "", text)

def truncate(max_len: int):
    return lambda text: text[:max_len]

print(f"Clean input: {normalize_unicode('hello world 123')}")
print(f"XSS cleaned: {remove_html_tags('hello <script>alert(1)</script>')}")
# Bind the input to a name first: escaped quotes are not valid inside an f-string expression
sql_input = "hello' OR 1=1 -- world"
print(f"SQL meta cleaned: {remove_sql_meta(sql_input)}")
print(f"Path traversal cleaned: {remove_path_components('../etc/passwd')}")

sanitize_comment = make_pipeline(
    normalize_unicode,
    strip_control_chars,
    remove_html_tags,
    truncate(1000),
)
print(f"Pipeline result: {sanitize_comment('hello world 123')}")
Sanitization vs validation:
- Validation: Check if input is acceptable. Reject if not. (Preferred for IDs, emails, amounts.)
- Sanitization: Transform input into a safe form. (Acceptable for free-text comments, display names.)
- Never sanitize SQL input — use parameterized queries instead. Sanitization cannot reliably prevent SQL injection.
- Never sanitize path inputs: use Path.resolve() plus a bounds check instead. Sanitization misses encoding variants.
- Use sanitization only for display content (HTML comments, usernames) where you want to preserve as much of the input as possible.
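The advice on SQL can be made concrete with a parameterized query. This is a minimal sketch using the standard library's sqlite3 module with an in-memory database; the table, rows, and find_user helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "user"), ("bob", "admin")],
)

def find_user(connection: sqlite3.Connection, name: str) -> list:
    # The ? placeholder sends the value separately from the SQL text,
    # so quotes in the input are treated as data, never as syntax.
    return connection.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "alice' OR '1'='1"
print(find_user(conn, "alice"))
print(find_user(conn, payload))
```

The injection payload is compared as a literal name and matches no row, with no character stripping involved.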
Expected Output
Clean input: hello world 123
XSS cleaned: hello alert(1)
SQL meta cleaned: hello OR 1=1  world
Path traversal cleaned: etcpasswd
Pipeline result: hello world 123
Hints
Hint 1: Build a pipeline of sanitizers: each takes a string and returns a cleaned string.
Hint 2: Apply normalization first, then strip dangerous characters, then validate the result.
Hint 3: A pipeline makes it easy to add, remove, or reorder sanitization steps.
