Python Input Validation Practice Problems & Exercises

Practice: Input Validation and Sanitization

11 problems4 Easy4 Medium3 Hard⏱ 60–90 min

Easy

#1Whitelist vs Blacklist ValidationEasy

whitelistblacklistinput-validationallowlist

Demonstrate why whitelist validation is superior to blacklist validation for usernames.

import re

def whitelist_username(username: str) -> bool:
    # allow only: a-z, A-Z, 0-9, hyphen, underscore, 3-32 chars
    pass

def blacklist_username(username: str) -> bool:
    # block: < > ' " ; & (naive blacklist)
    pass

tests = ["alice", "alice-123", "alice<script>"]
for t in tests:
    print(f"Whitelist '{t}': {'valid' if whitelist_username(t) else 'invalid'}")

blacklist_tests = ["alice", "alice<script>", "alice<SCRIPT>"]
for t in blacklist_tests:
    print(f"Blacklist '{t}': {'valid' if blacklist_username(t) else 'invalid (blocked)'}")

Solution

import re

def whitelist_username(username: str) -> bool:
    # Whitelist: explicit pattern for allowed characters
    return bool(re.fullmatch(r"[a-zA-Z0-9_\-]{3,32}", username))

def blacklist_username(username: str) -> bool:
    # Blacklist: only blocks known bad characters
    bad_chars = set("<>'\";& ")
    return not any(c in bad_chars for c in username)

tests = ["alice", "alice-123", "alice<script>"]
for t in tests:
    print(f"Whitelist '{t}': {'valid' if whitelist_username(t) else 'invalid'}")

blacklist_tests = ["alice", "alice<script>", "alice<SCRIPT>"]
for t in blacklist_tests:
    print(f"Blacklist '{t}': {'valid' if blacklist_username(t) else 'invalid (blocked)'}")

Why blacklists fail:

alice<SCRIPT> — case variations bypass case-sensitive blacklists.
alice%3Cscript%3E — URL encoding bypasses character-based blacklists.
alice\u003cscript\u003e — Unicode escapes bypass ASCII-only blacklists.
Blacklists are a game of whack-a-mole — attackers find the bypass you didn't think of.
Rule: Whitelist by default. Only use blacklists for rate limiting or audit logging — never as the primary defense.

Expected Output

Whitelist 'alice': valid
Whitelist 'alice-123': valid
Whitelist 'alice<script>': invalid
Blacklist 'alice': valid
Blacklist 'alice<script>': valid (bypassed!)
Blacklist 'alice<SCRIPT>': valid (case bypass!)

Hints

Hint 1: A whitelist (allowlist) only permits characters/patterns you explicitly define. Everything else is rejected.

Hint 2: A blacklist (denylist) blocks known bad patterns — but attackers can find variants you forgot.

Hint 3: Use re.fullmatch() for whitelist: the pattern must match the entire input, not just part of it.

#2Email and URL ValidationEasy

emailurlvalidationregex

Implement validate_email(s) and validate_url(s) functions that apply whitelist-style validation.

import re

def validate_email(email: str) -> bool:
    pass

def validate_url(url: str) -> bool:
    # only allow http:// and https:// schemes
    pass

emails = ["[email protected]", "user@", "@example.com"]
for e in emails:
    print(f"{e}: {'valid' if validate_email(e) else 'invalid'} email")

urls = [
    "https://example.com",
    "http://example.com",
    "ftp://example.com",
    "javascript:alert(1)",
]
for u in urls:
    status = "valid" if validate_url(u) else "invalid"
    note = " URL (scheme not allowed)" if not validate_url(u) and "://" in u else " URL"
    print(f"{u}: {status}{note}")

Solution

import re

def validate_email(email: str) -> bool:
    # Reasonable subset of RFC 5321 — good enough for most applications
    pattern = r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"
    return bool(re.fullmatch(pattern, email))

def validate_url(url: str) -> bool:
    # Only allow http and https schemes
    pattern = r"^https?://[a-zA-Z0-9.\-_/:%?#&=@\[\]+~]+$"
    return bool(re.fullmatch(pattern, url))

emails = ["[email protected]", "user@", "@example.com"]
for e in emails:
    print(f"{e}: {'valid' if validate_email(e) else 'invalid'} email")

urls = [
    "https://example.com",
    "http://example.com",
    "ftp://example.com",
    "javascript:alert(1)",
]
for u in urls:
    valid = validate_url(u)
    status = "valid" if valid else "invalid"
    note = " URL (scheme not allowed)" if not valid and "://" in u else " URL"
    print(f"{u}: {status}{note}")

Production email validation: Use a library like email-validator for full RFC compliance. In most applications, a format check + sending a confirmation email is the right approach — regex alone cannot confirm the address is real. URL validation: always validate scheme before rendering URLs as links to prevent javascript: XSS.

Expected Output

[email protected]: valid email
user@: invalid email
@example.com: invalid email
https://example.com: valid URL
http://example.com: valid URL
ftp://example.com: invalid URL (scheme not allowed)
javascript:alert(1): invalid URL

Hints

Hint 1: For email, use a simple pattern: [email protected]. Do not try to fully implement RFC 5321.

Hint 2: For URLs, validate the scheme first (https:// or http:// only) to prevent javascript: and data: XSS.

Hint 3: Use re.fullmatch() to ensure the entire string matches, not just a substring.

#3HTML Escaping with html.escapeEasy

html-escapexsshtml.escapeoutput-encoding

Implement a safe HTML rendering function using html.escape.

import html

def safe_render_paragraph(user_input: str) -> str:
    # escape user input and wrap in a paragraph tag
    pass

raw = "<script>alert('XSS')</script>"
user_content = "<b>World</b>!"

print(f"Raw: {raw}")
print(f"Escaped: {html.escape(raw, quote=True)}")
print(f"Safe HTML: {safe_render_paragraph('Hello, ' + user_content)}")

Solution

import html

def safe_render_paragraph(user_input: str) -> str:
    escaped = html.escape(user_input, quote=True)
    return f"<p>{escaped}</p>"

raw = "<script>alert('XSS')</script>"
user_content = "<b>World</b>!"

print(f"Raw: {raw}")
print(f"Escaped: {html.escape(raw, quote=True)}")
print(f"Safe HTML: {safe_render_paragraph('Hello, ' + user_content)}")

Escape on output, not input:

Store raw data — escaping at storage time means double-escaping if you later escape again.
Escape at render time — you may need to render the same data in different contexts (HTML, JSON, CSV) each requiring different escaping.
Context matters: HTML body escaping differs from HTML attribute escaping, JavaScript string escaping, and URL parameter escaping.
Use a templating engine (Jinja2, Django templates) that auto-escapes by default — only opt out when you're rendering trusted HTML.

Expected Output

Raw: <script>alert('XSS')</script>
Escaped: &lt;script&gt;alert(&#x27;XSS&#x27;)&lt;/script&gt;
Safe HTML: <p>Hello, &lt;b&gt;World&lt;/b&gt;!</p>

Hints

Hint 1: Use html.escape(text) to convert <, >, &, ", and ' to HTML entities.

Hint 2: html.escape(text, quote=True) also escapes single quotes — use this when inserting into HTML attributes.

Hint 3: Always escape on OUTPUT (when rendering to HTML), not on input storage.

#4Null Byte Injection DetectionEasy

null-byteinjectionsanitizationfilename

Implement null byte detection and sanitization for file paths.

def validate_filename(name: str) -> bool:
    """Return False if filename contains null bytes or other dangerous characters."""
    pass

def sanitize_filename(name: str) -> str:
    """Remove null bytes and control characters from filename."""
    pass

tests = [
    ("safe_file.txt", True),
    ("file.txt\x00.jpg", False),
]

for name, expected in tests:
    result = validate_filename(name)
    status = "valid" if result else "invalid (null byte)"
    print(f"{repr(name)}: {status}")

malicious = "safe_file.txt\x00.exe"
clean = sanitize_filename(malicious)
print(f"After null byte rejection: {clean}")
print(f"Truncation attack blocked: {'.exe' not in clean or '\x00' not in clean}")

Solution

import re

def validate_filename(name: str) -> bool:
    if "\x00" in name:
        return False
    if any(ord(c) < 32 for c in name):
        return False
    # Only allow safe filename characters
    return bool(re.fullmatch(r"[a-zA-Z0-9._\-]{1,255}", name))

def sanitize_filename(name: str) -> str:
    # Remove null bytes and control characters
    cleaned = "".join(c for c in name if ord(c) >= 32 and c != "\x00")
    # Keep only safe characters
    cleaned = re.sub(r"[^a-zA-Z0-9._\-]", "_", cleaned)
    return cleaned[:255]

tests = [
    ("safe_file.txt", True),
    ("file.txt\x00.jpg", False),
]

for name, expected in tests:
    result = validate_filename(name)
    status = "valid" if result else "invalid (null byte)"
    print(f"{repr(name)}: {status}")

malicious = "safe_file.txt\x00.exe"
clean = sanitize_filename(malicious)
print(f"After null byte rejection: {clean}")
print(f"Truncation attack blocked: {'.exe' not in clean or chr(0) not in clean}")

Null byte history: PHP's fopen(), include(), and many C library functions treat \x00 as string terminator. An attacker submitting ../../etc/passwd\x00.jpg exploited PHP apps that validated only the .jpg extension but passed the raw string to C file-open calls. Python's open() raises ValueError on null bytes in filenames in modern versions — but external calls (subprocess, ctypes, database drivers) may not.

Expected Output

Clean filename: valid
Null byte injection: invalid (null byte)
After null byte rejection: safe_file.txt
Truncation attack blocked: True

Hints

Hint 1: Null bytes (\x00) can truncate strings in C-based code. "file.txt\x00.jpg" may be treated as "file.txt" by the OS.

Hint 2: Check for \x00 in any string that will be passed to file system, database, or C library calls.

Hint 3: Strip or reject any input containing control characters (ord(c) < 32).

Medium

#5Path Traversal PreventionMedium

path-traversaldirectory-traversalos.pathsecurity

Implement a safe_file_path(base_dir, user_filename) function that prevents directory traversal.

import os
from pathlib import Path
from urllib.parse import unquote

BASE_DIR = "/var/www/files"

def safe_file_path(base_dir: str, user_filename: str) -> str:
    """
    Return the safe absolute path for user_filename within base_dir.
    Raise ValueError for any traversal attempt.
    """
    pass

# Valid
print(f"Safe path: {safe_file_path(BASE_DIR, 'report.pdf')}")

# Attacks
for filename, label in [
    ("../../../etc/passwd", "Traversal blocked"),
    ("%2e%2e%2fetc%2fpasswd", "Encoded traversal blocked"),
    ("/etc/passwd", "Absolute path blocked"),
]:
    try:
        safe_file_path(BASE_DIR, filename)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")

Solution

import os
from pathlib import Path
from urllib.parse import unquote

BASE_DIR = "/var/www/files"

def safe_file_path(base_dir: str, user_filename: str) -> str:
    # Decode URL encoding first
    decoded = unquote(user_filename)

    # Reject absolute paths
    if os.path.isabs(decoded):
        raise ValueError("absolute path not allowed")

    # Resolve the full path
    base = Path(base_dir).resolve()
    full_path = (base / decoded).resolve()

    # Ensure the resolved path is within base_dir
    try:
        full_path.relative_to(base)
    except ValueError:
        raise ValueError("path traversal detected")

    return str(full_path)

print(f"Safe path: {safe_file_path(BASE_DIR, 'report.pdf')}")

for filename, label in [
    ("../../../etc/passwd", "Traversal blocked"),
    ("%2e%2e%2fetc%2fpasswd", "Encoded traversal blocked"),
    ("/etc/passwd", "Absolute path blocked"),
]:
    try:
        safe_file_path(BASE_DIR, filename)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")

The resolve() trick: Path.resolve() canonicalizes the path — it follows symlinks and collapses .. components. After resolving, path.relative_to(base) raises ValueError if path is not under base. This is the correct and complete check. A string-based check like if '..' in filename is insufficient because of encoding variants and symlink attacks.

Expected Output

Safe path: /var/www/files/report.pdf
Traversal blocked: ValueError: path traversal detected
Encoded traversal blocked: ValueError: path traversal detected
Absolute path blocked: ValueError: absolute path not allowed

Hints

Hint 1: Resolve the full path with os.path.realpath() or pathlib.Path.resolve(), then check it starts with the allowed base directory.

Hint 2: Decode the filename before checking — %2F is /, %2E%2E is ..

Hint 3: Reject absolute paths in user-supplied filenames — only allow relative paths within the base directory.

#6Unicode Normalization Attack PreventionMedium

unicodenormalizationNFCNFKChomograph

Demonstrate unicode normalization attacks and implement a normalizing validator.

import unicodedata

def normalize_input(text: str, form: str = "NFC") -> str:
    """Normalize unicode text to prevent homograph attacks."""
    pass

def safe_compare(a: str, b: str) -> bool:
    """Compare strings after NFC normalization."""
    pass

# Composed vs decomposed form
composed   = "caf\u00e9"     # café (precomposed)
decomposed = "cafe\u0301"    # café (e + combining accent)

print(f"Raw equal: {composed == decomposed}")
print(f"NFC normalized equal: {normalize_input(composed) == normalize_input(decomposed)}")

# Homograph: Cyrillic 'а' (U+0430) looks like Latin 'a' (U+0061)
cyrillic_admin = "\u0430dmin"
latin_admin    = "admin"

nfkc_cyrillic = unicodedata.normalize("NFKC", cyrillic_admin)
nfkc_latin    = unicodedata.normalize("NFKC", latin_admin)
print(f"Visually identical strings are equal after normalization: {safe_compare(composed, decomposed)}")
print(f"Homograph 'аdmin' vs 'admin': different after NFKC? {nfkc_cyrillic != nfkc_latin}")

Solution

import unicodedata

def normalize_input(text: str, form: str = "NFC") -> str:
    return unicodedata.normalize(form, text)

def safe_compare(a: str, b: str) -> bool:
    return normalize_input(a) == normalize_input(b)

composed   = "caf\u00e9"
decomposed = "cafe\u0301"

print(f"Raw equal: {composed == decomposed}")
print(f"NFC normalized equal: {normalize_input(composed) == normalize_input(decomposed)}")

cyrillic_admin = "\u0430dmin"
latin_admin    = "admin"

nfkc_cyrillic = unicodedata.normalize("NFKC", cyrillic_admin)
nfkc_latin    = unicodedata.normalize("NFKC", latin_admin)

print(f"Visually identical strings are equal after normalization: {safe_compare(composed, decomposed)}")
print(f"Homograph 'аdmin' vs 'admin': different after NFKC? {nfkc_cyrillic != nfkc_latin}")

Real attack: A user registers with username аdmin (Cyrillic а) — a moderator sees it as admin in the UI. The attacker can trick users who see the username into following the account, or exploit access-control checks that compare strings without normalization. Always normalize to NFC/NFKC before: storing usernames, comparing access-control strings, and rendering in security-sensitive UIs.

Expected Output

Visually identical strings are equal after normalization: True
Raw equal: False
NFC normalized equal: True
Homograph 'аdmin' vs 'admin': different after NFKC? False (NFKC normalizes Cyrillic a)

Hints

Hint 1: Unicode has multiple representations for some characters. "café" (e + combining accent) and "café" (precomposed) are visually identical but not byte-equal.

Hint 2: Use unicodedata.normalize("NFC", s) or "NFKC" to canonicalize before comparison.

Hint 3: NFKC is stricter — it also maps compatibility characters (e.g., Roman numeral Ⅰ to I).

#7JSON Schema ValidationMedium

json-schemavalidationjsonschematype-safety

Validate user registration payloads against a JSON schema.

import jsonschema

USER_SCHEMA = {
    "type": "object",
    "required": ["email", "username", "age"],
    "additionalProperties": False,
    "properties": {
        "email":    {"type": "string", "format": "email", "maxLength": 254},
        "username": {"type": "string", "minLength": 3, "maxLength": 32, "pattern": "^[a-zA-Z0-9_\\-]+$"},
        "age":      {"type": "integer", "minimum": 13, "maximum": 120},
    },
}

def validate_user(payload: dict) -> str:
    try:
        jsonschema.validate(instance=payload, schema=USER_SCHEMA)
        return "OK"
    except jsonschema.ValidationError as e:
        return f"ValidationError: {e.message}"

valid   = {"email": "[email protected]", "username": "alice_99", "age": 25}
missing = {"username": "alice_99", "age": 25}
wrong   = {"email": "[email protected]", "username": 42, "age": 25}
extra   = {"email": "[email protected]", "username": "alice", "age": 25, "role": "admin"}

print(f"Valid payload: {validate_user(valid)}")
print(f"Missing required field: {validate_user(missing)}")
print(f"Wrong type: {validate_user(wrong)}")
print(f"Extra field (if additionalProperties=false): {validate_user(extra)}")

Solution

import jsonschema

USER_SCHEMA = {
    "type": "object",
    "required": ["email", "username", "age"],
    "additionalProperties": False,
    "properties": {
        "email":    {"type": "string", "format": "email", "maxLength": 254},
        "username": {"type": "string", "minLength": 3, "maxLength": 32, "pattern": "^[a-zA-Z0-9_\\-]+$"},
        "age":      {"type": "integer", "minimum": 13, "maximum": 120},
    },
}

def validate_user(payload: dict) -> str:
    try:
        jsonschema.validate(instance=payload, schema=USER_SCHEMA)
        return "OK"
    except jsonschema.ValidationError as e:
        return f"ValidationError: {e.message}"

valid   = {"email": "[email protected]", "username": "alice_99", "age": 25}
missing = {"username": "alice_99", "age": 25}
wrong   = {"email": "[email protected]", "username": 42, "age": 25}
extra   = {"email": "[email protected]", "username": "alice", "age": 25, "role": "admin"}

print(f"Valid payload: {validate_user(valid)}")
print(f"Missing required field: {validate_user(missing)}")
print(f"Wrong type: {validate_user(wrong)}")
print(f"Extra field (if additionalProperties=false): {validate_user(extra)}")

additionalProperties: false is a critical security setting. Without it, an attacker can include unexpected fields like {"role": "admin"} in a registration payload. If any downstream code reads role from the validated object without explicitly checking it was allowed, privilege escalation is possible. Always define explicit schemas with additionalProperties: false for security-sensitive inputs.

Expected Output

Valid payload: OK
Missing required field: ValidationError: 'email' is a required property
Wrong type: ValidationError: 42 is not of type 'string'
Extra field (if additionalProperties=false): ValidationError

Hints

Hint 1: Use the jsonschema library: jsonschema.validate(instance, schema) raises jsonschema.ValidationError on failure.

Hint 2: Mark required fields with the "required" array in the schema.

Hint 3: Set "additionalProperties": false to reject unknown fields — prevents parameter pollution attacks.

#8Type Coercion Attack PreventionMedium

type-coercionmass-assignmentparameter-pollutionvalidation

Implement strict type validators that prevent type coercion and mass assignment attacks.

def strict_string(value, field_name: str) -> str:
    """Accept only str, raise ValueError for other types."""
    pass

def strict_int(value, field_name: str, min_val: int = None, max_val: int = None) -> int:
    """Accept int (or str that parses to int), reject other types."""
    pass

def strict_bool(value, field_name: str) -> bool:
    """Accept only bool, reject int/str coercions."""
    pass

# Valid
print(f"String \"42\": validated as int {strict_int('42', 'age')}")
print(f"Boolean True as \"active\": safe_bool={strict_bool(True, 'active')}")

# Attacks
for value, field, fn, label in [
    (["admin", "user"], "username", strict_string, "List injection blocked"),
    ({"role": "admin"}, "username", strict_string, "Nested object injection blocked"),
]:
    try:
        fn(value, field)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")

Solution

def strict_string(value, field_name: str) -> str:
    if not isinstance(value, str):
        raise ValueError(f"expected str, got {type(value).__name__}")
    return value

def strict_int(value, field_name: str, min_val: int = None, max_val: int = None) -> int:
    if isinstance(value, bool):   # bool is subclass of int — reject it
        raise ValueError(f"expected int, got bool")
    if isinstance(value, str):
        try:
            value = int(value)
        except ValueError:
            raise ValueError(f"{field_name}: cannot convert '{value}' to int")
    if not isinstance(value, int):
        raise ValueError(f"expected int, got {type(value).__name__}")
    if min_val is not None and value < min_val:
        raise ValueError(f"{field_name}: {value} < minimum {min_val}")
    if max_val is not None and value > max_val:
        raise ValueError(f"{field_name}: {value} > maximum {max_val}")
    return value

def strict_bool(value, field_name: str) -> bool:
    if not isinstance(value, bool):
        raise ValueError(f"expected bool, got {type(value).__name__}")
    return value

print(f"String \"42\": validated as int {strict_int('42', 'age')}")
print(f"Boolean True as \"active\": safe_bool={strict_bool(True, 'active')}")

for value, field, fn, label in [
    (["admin", "user"], "username", strict_string, "List injection blocked"),
    ({"role": "admin"}, "username", strict_string, "Nested object injection blocked"),
]:
    try:
        fn(value, field)
    except ValueError as e:
        print(f"{label}: ValueError: {e}")

Python gotchas: bool is a subclass of int in Python — isinstance(True, int) returns True. Always check for bool before int if you want to reject boolean inputs. None is falsy but isinstance(None, str) is False — so explicit isinstance checks are safer than truthiness checks.

Expected Output

String "42": validated as int 42
Boolean True as "active": safe_bool=True
List injection blocked: ValueError: expected str, got list
Nested object injection blocked: ValueError: expected str, got dict

Hints

Hint 1: JSON parsers return Python objects — {"age": "42"} gives str while {"age": 42} gives int. Always assert the type explicitly.

Hint 2: A malicious client might send a list where you expect a string: {"username": ["admin", "user"]}.

Hint 3: Use isinstance() checks, not just truthiness — None, 0, and "" are all falsy but have different semantic meanings.

Hard

#9ReDoS-Safe Regex ValidatorHard

regexredoscatastrophic-backtrackingtimeout

Demonstrate a ReDoS-vulnerable regex and implement a timeout-protected validator.

import re
import threading
import time

# Safe email regex (no nested quantifiers)
SAFE_EMAIL = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$")

# ReDoS-vulnerable pattern: (a+)+ — catastrophic backtracking
REDOS_PATTERN = re.compile(r"^(a+)+$")

def validate_email_safe(email: str) -> bool:
    return bool(SAFE_EMAIL.fullmatch(email))

def validate_with_timeout(pattern: re.Pattern, text: str, timeout: float = 0.1) -> bool:
    """Run regex match in thread with timeout. Returns False if timeout."""
    result = [None]
    def run():
        result[0] = bool(pattern.fullmatch(text))
    t = threading.Thread(target=run, daemon=True)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        return False   # timeout — treat as no match
    return bool(result[0])

def is_redos_dangerous(pattern_str: str) -> bool:
    """Heuristic: detect obvious nested quantifier patterns."""
    dangerous = [r"\+\)+\+", r"\*\)+\*", r"\+\)+\*", r"\*\)+\+"]
    import re as re2
    for d in dangerous:
        if re2.search(d, pattern_str):
            return True
    return False

print(f"Safe regex matches email: {validate_email_safe('[email protected]')}")
print(f"Safe regex rejects invalid: {validate_email_safe('not-an-email')}")

redos_str = r"^(a+)+$"
print(f"ReDoS pattern detected as dangerous: {is_redos_dangerous(redos_str)}")

# ReDoS attack string: 'aaa...a!' — no match but catastrophic backtracking
attack = "a" * 25 + "!"
start = time.perf_counter()
result = validate_with_timeout(REDOS_PATTERN, attack, timeout=0.5)
elapsed = time.perf_counter() - start
print(f"Timeout protection triggered: {not result and elapsed < 1.0}")

Solution

import re
import threading
import time

SAFE_EMAIL = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$")
REDOS_PATTERN = re.compile(r"^(a+)+$")

def validate_email_safe(email: str) -> bool:
    return bool(SAFE_EMAIL.fullmatch(email))

def validate_with_timeout(pattern: re.Pattern, text: str, timeout: float = 0.1) -> bool:
    result = [None]
    def run():
        result[0] = bool(pattern.fullmatch(text))
    t = threading.Thread(target=run, daemon=True)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        return False
    return bool(result[0])

def is_redos_dangerous(pattern_str: str) -> bool:
    import re as re2
    # Detect nested quantifiers: (x+)+, (x*)*, etc.
    nested = re2.search(r'[\(\|][^)]*[+*]\)+[+*?]', pattern_str)
    return bool(nested)

print(f"Safe regex matches email: {validate_email_safe('[email protected]')}")
print(f"Safe regex rejects invalid: {validate_email_safe('not-an-email')}")

redos_str = r"^(a+)+$"
print(f"ReDoS pattern detected as dangerous: {is_redos_dangerous(redos_str)}")

attack = "a" * 25 + "!"
start = time.perf_counter()
result = validate_with_timeout(REDOS_PATTERN, attack, timeout=0.5)
elapsed = time.perf_counter() - start
print(f"Timeout protection triggered: {not result and elapsed < 1.0}")

ReDoS in production: CloudFlare suffered a major outage in 2019 caused by a ReDoS vulnerability in a WAF rule. The pattern (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|-|+)+[)];?((?:\s|-|~|!|{}||||+).(?:.=.*)))caused catastrophic backtracking on certain HTTP headers. Use there2` Python binding (guaranteed linear time) for user-supplied or complex patterns.

Expected Output

Safe regex matches email: True
Safe regex rejects invalid: False
ReDoS pattern detected as dangerous: True
Timeout protection triggered: True

Hints

Hint 1: Catastrophic backtracking (ReDoS) happens with patterns like (a+)+ — exponential backtracking on no-match strings.

Hint 2: Use re.compile() with a timeout by running in a thread, or use the regex module which supports timeouts.

Hint 3: Avoid nested quantifiers: (a+)+, (a|a)*, (a*)*. Use possessive quantifiers or atomic groups if available.

#10Validation Library Pattern — Pydantic-StyleHard

pydanticvalidationmodelfield-validator

Build a Pydantic-inspired field validation model from scratch using dataclasses and __post_init__.

import re
from dataclasses import dataclass

class ValidationError(Exception):
    pass

def validate_username(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("username must be a string")
    if len(value) < 3:
        raise ValidationError("username too short")
    if len(value) > 32:
        raise ValidationError("username too long")
    if not re.fullmatch(r"[a-zA-Z0-9_\-]+", value):
        raise ValidationError("username contains invalid characters")
    return value

def validate_email(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("email must be a string")
    if not re.fullmatch(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}", value):
        raise ValidationError("invalid email format")
    return value.lower()

def validate_age(value) -> int:
    try:
        value = int(value)
    except (TypeError, ValueError):
        raise ValidationError("age must be an integer")
    if not (13 <= value <= 120):
        raise ValidationError("age must be 13-120")
    return value

@dataclass
class UserInput:
    username: str
    email: str
    age: int

    def __post_init__(self):
        self.username = validate_username(self.username)
        self.email    = validate_email(self.email)
        self.age      = validate_age(self.age)

# Tests
try:
    u = UserInput(username="alice", email="[email protected]", age=25)
    print(f"Valid model: {u}")
except ValidationError as e:
    print(f"Unexpected: {e}")

for kwargs, label in [
    ({"username": "al", "email": "[email protected]", "age": 25}, "Short username"),
    ({"username": "alice", "email": "not-an-email", "age": 25}, "Invalid email"),
    ({"username": "alice", "email": "[email protected]", "age": 200}, "Age out of range"),
]:
    try:
        UserInput(**kwargs)
    except ValidationError as e:
        print(f"{label}: ValidationError: {e}")

Solution

import re
from dataclasses import dataclass

class ValidationError(Exception):
    pass

def validate_username(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("username must be a string")
    if len(value) < 3:
        raise ValidationError("username too short")
    if len(value) > 32:
        raise ValidationError("username too long")
    if not re.fullmatch(r"[a-zA-Z0-9_\-]+", value):
        raise ValidationError("username contains invalid characters")
    return value

def validate_email(value: str) -> str:
    if not isinstance(value, str):
        raise ValidationError("email must be a string")
    if not re.fullmatch(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}", value):
        raise ValidationError("invalid email format")
    return value.lower()

def validate_age(value) -> int:
    try:
        value = int(value)
    except (TypeError, ValueError):
        raise ValidationError("age must be an integer")
    if not (13 <= value <= 120):
        raise ValidationError("age must be 13-120")
    return value

@dataclass
class UserInput:
    username: str
    email: str
    age: int

    def __post_init__(self):
        self.username = validate_username(self.username)
        self.email    = validate_email(self.email)
        self.age      = validate_age(self.age)

try:
    u = UserInput(username="alice", email="[email protected]", age=25)
    print(f"Valid model: {u}")
except ValidationError as e:
    print(f"Unexpected: {e}")

for kwargs, label in [
    ({"username": "al", "email": "[email protected]", "age": 25}, "Short username"),
    ({"username": "alice", "email": "not-an-email", "age": 25}, "Invalid email"),
    ({"username": "alice", "email": "[email protected]", "age": 200}, "Age out of range"),
]:
    try:
        UserInput(**kwargs)
    except ValidationError as e:
        print(f"{label}: ValidationError: {e}")

Pydantic v2 in production: This exercise shows the internals of what Pydantic automates. In production, use Pydantic v2 — it compiles validators to Rust via pydantic-core, giving 5-50x faster validation compared to pure Python. The @field_validator decorator and model_validator provide the same pattern with much less boilerplate.

Expected Output

Valid model: UserInput(username='alice', email='[email protected]', age=25)
Short username: ValidationError: username too short
Invalid email: ValidationError: invalid email format
Age out of range: ValidationError: age must be 13-120

Hints

Hint 1: Build a simple validation framework using __init_subclass__ or descriptors to register field validators.

Hint 2: A field descriptor can hold validation rules (min_length, max_length, pattern) and raise ValueError on bad input.

Hint 3: The __post_init__ pattern with dataclasses is a lightweight alternative to full Pydantic.

#11Multi-Layer Input Sanitization PipelineHard

sanitizationpipelinedefense-in-depthencoding

Build a composable input sanitization pipeline with pluggable sanitizer steps.

import re
import unicodedata
import html

def make_pipeline(*sanitizers):
    """Compose multiple sanitizer functions into a pipeline."""
    def pipeline(text: str) -> str:
        for san in sanitizers:
            text = san(text)
        return text
    return pipeline

def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def strip_control_chars(text: str) -> str:
    return "".join(c for c in text if ord(c) >= 32 or c in "\n\r\t")

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

def remove_sql_meta(text: str) -> str:
    return re.sub(r"[;'\"\-\-]", "", text)

def remove_path_components(text: str) -> str:
    return re.sub(r"[/\\.]", "", text)

def truncate(max_len: int):
    return lambda text: text[:max_len]

# Demonstrate individual sanitizers
print(f"Clean input: {normalize_unicode('hello world 123')}")
print(f"XSS cleaned: {remove_html_tags('hello <script>alert(1)</script>')}")
print(f"SQL meta cleaned: {remove_sql_meta(\"hello' OR 1=1 -- world\")}")
print(f"Path traversal cleaned: {remove_path_components('../etc/passwd')}")

# Full pipeline
sanitize_comment = make_pipeline(
    normalize_unicode,
    strip_control_chars,
    remove_html_tags,
    truncate(1000),
)
print(f"Pipeline result: {sanitize_comment('hello world 123')}")

Solution

import re
import unicodedata

def make_pipeline(*sanitizers):
    def pipeline(text: str) -> str:
        for san in sanitizers:
            text = san(text)
        return text
    return pipeline

def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def strip_control_chars(text: str) -> str:
    return "".join(c for c in text if ord(c) >= 32 or c in "\n\r\t")

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

def remove_sql_meta(text: str) -> str:
    return re.sub(r"[;'\"\-\-]", "", text)

def remove_path_components(text: str) -> str:
    return re.sub(r"[/\\.]", "", text)

def truncate(max_len: int):
    return lambda text: text[:max_len]

print(f"Clean input: {normalize_unicode('hello world 123')}")
print(f"XSS cleaned: {remove_html_tags('hello <script>alert(1)</script>')}")
print(f"SQL meta cleaned: {remove_sql_meta(\"hello' OR 1=1 -- world\")}")
print(f"Path traversal cleaned: {remove_path_components('../etc/passwd')}")

sanitize_comment = make_pipeline(
    normalize_unicode,
    strip_control_chars,
    remove_html_tags,
    truncate(1000),
)
print(f"Pipeline result: {sanitize_comment('hello world 123')}")

Sanitization vs validation:

Validation: Check if input is acceptable. Reject if not. (Preferred for IDs, emails, amounts.)
Sanitization: Transform input into a safe form. (Acceptable for free-text comments, display names.)
Never sanitize SQL input — use parameterized queries instead. Sanitization cannot reliably prevent SQL injection.
Never sanitize path inputs — use Path.resolve() + bounds check instead. Sanitization misses encoding variants.
Use sanitization only for display content (HTML comments, usernames) where you want to preserve as much of the input as possible.

Expected Output

Clean input: hello world 123
XSS cleaned: hello alert1
SQL meta cleaned: hello world
Path traversal cleaned: etcpasswd
Pipeline result: hello world 123

Hints

Hint 1: Build a pipeline of sanitizers: each takes a string and returns a cleaned string.

Hint 2: Apply normalization first, then strip dangerous characters, then validate the result.

Hint 3: A pipeline makes it easy to add, remove, or reorder sanitization steps.

Practice: Input Validation and Sanitization

Easy​

Medium​

Hard​

Easy

Medium

Hard