The os Module - System Calls and Process Interaction
Reading time: ~18 minutes | Level: Foundation → Engineering
Here is a behavior that surprises most Python developers:
import os
os.environ["MY_SECRET"] = "hunter2"
import subprocess
result = subprocess.run(
["python3", "-c", "import os; print(os.environ.get('MY_SECRET'))"],
capture_output=True, text=True
)
print(result.stdout.strip()) # hunter2
Setting os.environ["MY_SECRET"] mutates the environment of the current process, and every child process spawned afterwards inherits a copy of it by default. The parent shell is not affected, but any subprocess you launch can read the value. This is how configuration leaks and secret exposure happen in production.
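One way to keep a value out of a child is to hand subprocess.run() an explicit env mapping. The sketch below is illustrative only: run_without_variable is a hypothetical helper name, and sys.executable stands in for the python3 command used above.

```python
import os
import subprocess
import sys


def run_without_variable(name):
    """Run a child Python process whose environment omits one variable.

    Hypothetical helper for illustration; returns what the child printed.
    """
    # Copy the parent's environment, leaving out the variable to hide.
    child_env = {k: v for k, v in os.environ.items() if k != name}
    code = "import os, sys; print(os.environ.get(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=child_env,          # replaces the inherited environment entirely
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


os.environ["MY_SECRET"] = "hunter2"
print(run_without_variable("MY_SECRET"))  # None
```

The parent still holds the value in os.environ; only the child's copy is filtered.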
The os module is the thin wrapper between your Python code and the operating system kernel. Understanding it means understanding how processes, files, permissions, and system resources actually work - not just how Python abstracts them.
What You Will Learn
- The architectural difference between os, pathlib, and shutil - and when to use each
- How os.path works and why pathlib replaces most of it in modern code
- Directory listing with os.listdir() and os.scandir() - and why scandir is dramatically faster
- Recursive traversal with os.walk() and how topdown controls the traversal order
- File metadata with os.stat() - permissions, size, timestamps
- Changing file permissions with os.chmod() and reading them with stat.S_IMODE
- Process identity with os.getpid() and os.getppid()
- Why os.system() is dangerous and how subprocess.run() replaces it safely
- Cryptographically secure random bytes from os.urandom()
Prerequisites
- Familiarity with Python file I/O (open(), read(), write())
- Understanding of Python strings and f-strings
- Basic knowledge of what a file system is (files, directories, paths)
- Having completed the pathlib module (topic 04) is helpful but not required
The Big Picture: os vs pathlib vs shutil
These three modules are often confused. Here is when to use each:
| Module | Use for | Key APIs |
|---|---|---|
| pathlib | Path manipulation, file read/write, directory creation, glob patterns | Path("/a/b/c"), p.exists(), p.read_text(), p.glob("*.py"), p.stat() |
| os | Process/system info, permissions, environment vars, walking trees | os.getpid(), os.environ, os.walk(), os.chmod(), os.urandom(), os.cpu_count() |
| shutil | Copy/move/delete, high-level FS ops, archive handling, finding executables | shutil.copy(), shutil.move(), shutil.rmtree(), shutil.which(), shutil.disk_usage() |
:::tip Rule of Thumb
For path manipulation, prefer pathlib. For system-level operations (process info, permissions, environment, random bytes), use os. For copying, moving, and deleting directory trees, use shutil.
:::
Part 1 - os.path: The Classic Path Toolkit
os.path provides string-based path manipulation. It predates pathlib by decades. In Python 3.4+, pathlib is preferred for path manipulation - but os.path is still everywhere in existing codebases, so you must know it.
import os
path = "/home/alice/projects/myapp/config.yaml"
# Core os.path operations
print(os.path.basename(path)) # config.yaml
print(os.path.dirname(path)) # /home/alice/projects/myapp
print(os.path.splitext(path)) # ('/home/alice/projects/myapp/config', '.yaml')
print(os.path.split(path)) # ('/home/alice/projects/myapp', 'config.yaml')
# Building paths safely (handles OS-specific separators)
joined = os.path.join("/home/alice", "projects", "myapp", "config.yaml")
print(joined) # /home/alice/projects/myapp/config.yaml
# Checking path properties
print(os.path.exists(path)) # True or False depending on disk
print(os.path.isfile(path)) # True if it's a file
print(os.path.isdir(path)) # True if it's a directory
print(os.path.isabs(path)) # True - path starts with /
print(os.path.abspath("config.yaml")) # /current/working/dir/config.yaml
The pathlib Equivalents
Most os.path operations have a pathlib equivalent. The pathlib version is usually more readable because you work with attributes and methods on a path object instead of passing strings through nested function calls:
from pathlib import Path
import os
path_str = "/home/alice/projects/myapp/config.yaml"
path = Path(path_str)
# os.path → pathlib
os.path.basename(path_str) # config.yaml
path.name # config.yaml ← cleaner
os.path.dirname(path_str) # /home/alice/projects/myapp
path.parent # PosixPath('/home/alice/projects/myapp')
os.path.splitext(path_str) # ('.../config', '.yaml')
path.stem, path.suffix # 'config', '.yaml'
os.path.exists(path_str) # True/False
path.exists() # True/False
:::note When os.path Still Makes Sense
os.path is still useful when you are working with code that passes around plain strings, integrating with legacy APIs that only accept strings, or writing library code that must work without importing pathlib.
:::
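When string-based and object-based code have to meet, the two styles convert cleanly. A minimal sketch, assuming a POSIX-style path; PurePosixPath is used here only so the output is the same on every OS, and real code would normally use Path:

```python
import os
from pathlib import PurePosixPath

# PurePosixPath keeps the example deterministic on any OS.
config = PurePosixPath("/home/alice/projects/myapp/config.yaml")

# os.fspath() returns the plain string that a string-only API expects.
as_text = os.fspath(config)
print(as_text)                        # /home/alice/projects/myapp/config.yaml

# os.path functions accept path objects directly (Python 3.6+).
print(os.path.basename(config))       # config.yaml

# Going the other way: wrap a string from a legacy API in a path object.
print(PurePosixPath(as_text).suffix)  # .yaml
```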
Part 2 - Current Working Directory
import os
# Get the current working directory
cwd = os.getcwd()
print(cwd) # /Users/alice/projects/myapp
# Change the working directory
os.chdir("/tmp")
print(os.getcwd()) # /tmp
# Change back
os.chdir(cwd)
print(os.getcwd()) # /Users/alice/projects/myapp
:::danger os.chdir is a Code Smell
os.chdir() mutates the process-wide working directory. If any other thread or code calls os.getcwd() after your chdir, it sees the new directory. This causes hard-to-debug race conditions in multithreaded applications.
The correct pattern is to build absolute paths with os.path.join(base, filename) or Path(base) / filename rather than changing directories. Reserve os.chdir() for short scripts where you control the entire process.
:::
The Safe Pattern
import os
from pathlib import Path
# BAD: changing the global working directory
def process_files(directory):
    os.chdir(directory) # global mutation - dangerous
    for f in os.listdir("."):
        process(f)
# GOOD: build absolute paths, never change directory
def process_files(directory):
    base = Path(directory).resolve()
    for f in base.iterdir():
        process(f) # f is an absolute Path - safe
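A related case is running an external command "in" a directory. Rather than calling os.chdir() first, subprocess.run() accepts a cwd argument that applies to the child only. The sketch below assumes a throwaway temporary directory; child_cwd is a hypothetical helper name.

```python
import os
import subprocess
import sys
import tempfile


def child_cwd(directory):
    """Return the working directory a child process sees when started with cwd=."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=directory,          # applies to the child only
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    seen = child_cwd(tmp)
    # realpath() resolves symlinks such as /tmp -> /private/tmp on macOS.
    print(os.path.realpath(seen) == os.path.realpath(tmp))  # True
print(os.getcwd() == before)  # True: the parent never moved
```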
Part 3 - Listing Directory Contents
os.listdir(): Simple but Dumb
import os
entries = os.listdir("/tmp")
print(entries)
# ['file1.txt', 'file2.log', 'subdir', '.hidden']
# Returns: list of strings, names only, no metadata
os.listdir() returns a plain list of names. To get file type or size, you must make a separate os.stat() call for each entry - which means one system call per file.
os.scandir(): Faster and Smarter
import os
# scandir returns DirEntry objects - already have type and stat info
with os.scandir("/tmp") as entries:
    for entry in entries:
        print(f"{entry.name:30} is_file={entry.is_file()} is_dir={entry.is_dir()}")
# file1.txt is_file=True is_dir=False
# subdir is_file=False is_dir=True
os.scandir() returns DirEntry objects that cache the file type information from the OS directory listing. On most filesystems, this means zero extra system calls to determine is_file() and is_dir().
For a directory with 1000 files:
- os.listdir() + os.path.isfile() per file: 1 readdir syscall + 1000 stat syscalls = 1001 total syscalls
- os.scandir(): 1 readdir syscall (DirEntry caches d_type from the dirent struct) = 1 total syscall on most Linux filesystems
The gap can be large: PEP 471, which introduced scandir, reports os.walk() speedups of 2-20x depending on platform and file system.
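To see the difference on your own machine, a rough timing sketch follows. It assumes a temporary directory filled with 1000 empty files; the two function names are hypothetical, and the measured times will vary with OS, filesystem, and cache state, so no numbers are shown.

```python
import os
import tempfile
import time


def count_files_listdir(directory):
    """Count regular files using listdir plus one isfile() check per name."""
    return sum(
        1 for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def count_files_scandir(directory):
    """Count regular files using the type information scandir already has."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())


with tempfile.TemporaryDirectory() as tmp:
    for i in range(1000):
        with open(os.path.join(tmp, f"file_{i}.txt"), "w"):
            pass
    os.mkdir(os.path.join(tmp, "subdir"))

    for func in (count_files_listdir, count_files_scandir):
        start = time.perf_counter()
        total = func(tmp)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{func.__name__}: {total} files in {elapsed_ms:.2f} ms")
```

Both functions return the same count; only the number of stat calls behind them differs.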
Practical os.scandir() Usage
import os
def list_python_files(directory):
    """List all Python files in a directory (non-recursive)."""
    py_files = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".py"):
                stat = entry.stat()
                py_files.append({
                    "name": entry.name,
                    "path": entry.path, # directory joined with name; absolute only if directory is
                    "size": stat.st_size,
                    "modified": stat.st_mtime,
                })
    return sorted(py_files, key=lambda x: x["name"])
# Usage
files = list_python_files("/Users/alice/myproject")
for f in files:
    print(f"{f['name']:30} {f['size']:8} bytes")
:::tip DirEntry Attributes
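A DirEntry object has: name (filename), path (the scandir argument joined with name, so it is absolute only if that argument was absolute), is_file(), is_dir(), is_symlink(), and stat(). The stat() result is cached on the entry after the first call. By default it follows symlinks, so call stat(follow_symlinks=False) when you need the metadata of the link itself.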
:::
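These attributes combine naturally in a recursive helper. The sketch below computes the total size of a tree with scandir; tree_size is a hypothetical name, and the choice to skip symlinks entirely (follow_symlinks=False) is an assumption made to avoid loops and double counting.

```python
import os
import tempfile


def tree_size(directory):
    """Return the total size in bytes of regular files under directory."""
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += tree_size(entry.path)      # recurse into real directories
            elif entry.is_file(follow_symlinks=False):
                # Size of the entry itself, never of a symlink target.
                total += entry.stat(follow_symlinks=False).st_size
    return total


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(tmp, "sub", "b.bin"), "wb") as f:
        f.write(b"y" * 250)
    print(tree_size(tmp))  # 350
```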
Part 4 - Recursive Directory Traversal with os.walk()
os.walk() is one of the most useful functions in Python's standard library. It generates (dirpath, dirnames, filenames) tuples for every directory in a tree.
import os
# Basic traversal
for dirpath, dirnames, filenames in os.walk("/Users/alice/projects"):
    print(f"DIR: {dirpath}")
    for fname in filenames:
        print(f" FILE: {os.path.join(dirpath, fname)}")
How os.walk() Works Internally
/project/
├── main.py
├── config.yaml
└── src/
    ├── models.py
    └── utils/
        └── helpers.py
topdown=True (default) yields root-first:
- ("/project", ["src"], ["main.py", "config.yaml"])
- ("/project/src", ["utils"], ["models.py"])
- ("/project/src/utils", [], ["helpers.py"])
topdown=False yields deepest-first:
- ("/project/src/utils", [], ["helpers.py"])
- ("/project/src", ["utils"], ["models.py"])
- ("/project", ["src"], ["main.py", "config.yaml"])
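The two orders can be checked against a real tree. This sketch rebuilds the example layout in a temporary directory and records only the directory paths, since the order of names inside each filenames list depends on the filesystem. walk_order is a hypothetical helper name.

```python
import os
import tempfile


def walk_order(root, topdown):
    """Return the directories os.walk() yields, as paths relative to root."""
    return [
        os.path.relpath(dirpath, root).replace(os.sep, "/")
        for dirpath, _dirnames, _filenames in os.walk(root, topdown=topdown)
    ]


with tempfile.TemporaryDirectory() as project:
    os.makedirs(os.path.join(project, "src", "utils"))
    for rel in ("main.py", "config.yaml", "src/models.py", "src/utils/helpers.py"):
        with open(os.path.join(project, *rel.split("/")), "w"):
            pass

    print(walk_order(project, topdown=True))   # ['.', 'src', 'src/utils']
    print(walk_order(project, topdown=False))  # ['src/utils', 'src', '.']
```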
Controlling Traversal: Pruning Subdirectories
With topdown=True, you can modify dirnames in-place to skip directories:
import os
def find_python_files(root, skip_dirs=None):
    """
    Recursively find all .py files, skipping specified directories.
    Modifying dirnames in-place prunes the traversal - no wasted work.
    """
    skip_dirs = skip_dirs or {".git", "__pycache__", ".venv", "node_modules"}
    py_files = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Prune: remove directories we don't want to descend into
        # Must modify in-place (slice assignment), not reassign
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for fname in filenames:
            if fname.endswith(".py"):
                full_path = os.path.join(dirpath, fname)
                py_files.append(full_path)
    return py_files
# Find all Python files in a project
files = find_python_files("/Users/alice/projects/myapp")
for f in files:
print(f)
# /Users/alice/projects/myapp/main.py
# /Users/alice/projects/myapp/src/models.py
# /Users/alice/projects/myapp/src/utils/helpers.py
:::warning Modifying dirnames In-Place
Use dirnames[:] = [...] (slice assignment), not dirnames = [...] (rebinding). Slice assignment modifies the original list object that os.walk holds a reference to. Rebinding creates a new list and leaves the original untouched - so os.walk still descends into all directories.
:::
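The difference is easy to demonstrate. The sketch below walks the same small tree twice, once pruning with slice assignment and once with rebinding; the directory names and the visited helper are hypothetical.

```python
import os
import tempfile


def visited(root, prune_in_place):
    """Walk root, trying to skip 'skipme'; return the directory names visited."""
    seen = []
    for dirpath, dirnames, _filenames in os.walk(root):
        seen.append(os.path.basename(dirpath))
        kept = [d for d in dirnames if d != "skipme"]
        if prune_in_place:
            dirnames[:] = kept   # mutates the list os.walk will consult
        else:
            dirnames = kept      # rebinds a local name; os.walk never sees it
    return sorted(seen)


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "root")
    os.makedirs(os.path.join(root, "keep"))
    os.makedirs(os.path.join(root, "skipme"))
    print(visited(root, prune_in_place=True))   # ['keep', 'root']
    print(visited(root, prune_in_place=False))  # ['keep', 'root', 'skipme']
```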
topdown=False: When You Need to Delete Directories
topdown=False yields deepest directories first. This is the correct mode for deleting directory trees - you must delete files before deleting their parent directory:
import os
def delete_empty_directories(root):
    """Remove all empty directories in a tree (bottom-up)."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        # dirnames was collected before the children were visited, so it can
        # still list subdirectories removed a moment ago; re-check on disk.
        if not os.listdir(dirpath):
            try:
                os.rmdir(dirpath)
                print(f"Removed empty dir: {dirpath}")
            except OSError as e:
                print(f"Could not remove {dirpath}: {e}")
Part 5 - File Metadata and Permissions
os.stat(): Everything About a File
import os
import stat
import time
info = os.stat("/etc/hosts")
print(f"Size: {info.st_size} bytes")
print(f"Mode: {oct(info.st_mode)}") # e.g., 0o100644
print(f"UID: {info.st_uid}") # owner user ID
print(f"GID: {info.st_gid}") # owner group ID
print(f"Modified: {time.ctime(info.st_mtime)}") # last modification time
print(f"Accessed: {time.ctime(info.st_atime)}") # last access time
print(f"Changed: {time.ctime(info.st_ctime)}") # metadata change time
Understanding Unix File Permissions
The st_mode octal 0o100644 breaks down as: file type (10 = regular file, 04 = directory, 12 = symlink) · special bits (setuid/setgid/sticky) · user perms · group perms · other perms. For 0o100644 that reads 10 · 0 · 6 · 4 · 4.
Permission bit values: 4 = read (r) · 2 = write (w) · 1 = execute (x) · 6 = rw- · 7 = rwx
| Octal | Symbolic | Meaning |
|---|---|---|
| 0o644 | rw-r--r-- | Owner rw, group r, other r - typical file |
| 0o755 | rwxr-xr-x | Owner rwx, group/other rx - executable/directory |
| 0o700 | rwx------ | Owner only - private directory such as ~/.ssh (private key files themselves typically use 0o600) |
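The standard stat module can do this decoding for you. A short sketch using literal mode values rather than real files, so the output does not depend on what is on disk:

```python
import stat

# stat.filemode() renders a full st_mode value the way `ls -l` does.
print(stat.filemode(0o100644))    # -rw-r--r--  (regular file)
print(stat.filemode(0o040755))    # drwxr-xr-x  (directory)

mode = 0o100644
print(stat.S_ISREG(mode))         # True: the file-type bits say "regular file"
print(oct(stat.S_IFMT(mode)))     # 0o100000: file-type bits only
print(oct(stat.S_IMODE(mode)))    # 0o644: permission and special bits only
```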
os.chmod(): Changing Permissions
import os
import stat
# Make a file executable
os.chmod("deploy.sh", 0o755)
# Make a private key file owner-read-only
os.chmod("id_rsa", 0o600)
# Using stat constants (more readable)
os.chmod("script.py", stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR)
# stat.S_IRUSR = 0o400 (owner read)
# stat.S_IWUSR = 0o200 (owner write)
# stat.S_IXUSR = 0o100 (owner execute)
# Combined: 0o700
# Check current permissions
info = os.stat("deploy.sh")
permissions = stat.S_IMODE(info.st_mode) # Extract permission bits only
print(oct(permissions)) # 0o755
print(bool(permissions & stat.S_IXUSR)) # True - owner can execute
Practical: Audit Files With Insecure Permissions
import os
import stat
def find_world_writable(directory):
    """Find files that are writable by anyone - a security risk."""
    risky_files = []
    for dirpath, dirnames, filenames in os.walk(directory):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            try:
                mode = os.stat(fpath).st_mode
                if mode & stat.S_IWOTH: # World-writable bit
                    risky_files.append(fpath)
            except OSError:
                pass # Unreadable entry or broken symlink - skip it
    return risky_files
# Usage
risky = find_world_writable("/var/www/html")
for path in risky:
print(f"RISKY: {path}")
Part 6 - File System Operations
Creating Directories
import os
# Create a single directory
os.mkdir("/tmp/mydir") # Fails if parent doesn't exist
# Create nested directories (like mkdir -p)
os.makedirs("/tmp/a/b/c") # Creates all intermediate dirs
os.makedirs("/tmp/a/b/c", exist_ok=True) # No error if already exists
:::tip Always Use exist_ok=True
In production code, always use os.makedirs(path, exist_ok=True). Without it, you get a FileExistsError if another process or thread creates the directory between your check and your creation - a classic TOCTOU (time-of-check-time-of-use) race condition.
:::
Renaming and Moving Files
import os
# Rename/move within same filesystem - atomic on POSIX
os.rename("/tmp/old_name.txt", "/tmp/new_name.txt")
# For cross-filesystem moves, use shutil.move() instead
import shutil
shutil.move("/tmp/file.txt", "/mnt/storage/file.txt")
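Atomic renames are the basis of a common safe-write pattern: write to a temporary file in the same directory, then rename it over the target. The sketch below is one way to do it, not a canonical implementation; atomic_write_text is a hypothetical name, and os.replace() is used in place of os.rename() because it also overwrites an existing target on Windows.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())      # push the data to disk before renaming
        os.replace(tmp_path, path)    # rename over the target in one step
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "config.yaml")
    atomic_write_text(target, "mode: staging\n")
    atomic_write_text(target, "mode: production\n")
    with open(target, encoding="utf-8") as f:
        print(f.read().strip())  # mode: production
```

Note that mkstemp() creates the file readable and writable by the owner only, so call os.chmod() afterwards if the target needs wider permissions.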
Removing Files and Directories
import os
os.remove("file.txt") # Remove a file (raises OSError if the path is a directory)
os.unlink("other.txt") # Same operation under its traditional Unix name
os.rmdir("empty_dir") # Remove EMPTY directory only
# For non-empty directories:
import shutil
shutil.rmtree("non_empty_dir") # USE WITH CAUTION - no recycle bin
:::danger shutil.rmtree is Permanent
shutil.rmtree() deletes the directory and all its contents permanently - there is no recycle bin or undo. Always double-check the path. A common catastrophic bug: shutil.rmtree(base_dir + suffix) where suffix is empty and base_dir is /. Validate the resolved path before deleting, and start with a dry run that only prints what would be removed.
:::
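A guard function makes that validation explicit. The sketch below is an assumption about how such a guard might look, not a standard API: remove_tree_within and its dry_run flag are hypothetical, and the rule it enforces is that the target must sit strictly inside an allowed root.

```python
import shutil
import tempfile
from pathlib import Path


def remove_tree_within(target, allowed_root, dry_run=True):
    """Delete target only if it lies strictly inside allowed_root."""
    target = Path(target).resolve()
    allowed_root = Path(allowed_root).resolve()
    # allowed_root must be a proper ancestor: this rejects the root itself
    # and anything outside it, including "/" produced by an empty suffix.
    if allowed_root not in target.parents:
        raise ValueError(f"refusing to delete {target}: not inside {allowed_root}")
    if dry_run:
        print(f"[dry run] would remove {target}")
        return False
    shutil.rmtree(target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp, "build", "cache")
    cache.mkdir(parents=True)
    remove_tree_within(cache, tmp)                 # dry run: prints only
    remove_tree_within(cache, tmp, dry_run=False)  # actually deletes
    print(cache.exists())  # False
```

Defaulting dry_run to True means a caller has to opt in to deletion.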
Part 7 - Process Information
import os
# Current process ID
pid = os.getpid()
print(f"This process ID: {pid}") # e.g., 12345
# Parent process ID
ppid = os.getppid()
print(f"Parent process ID: {ppid}") # e.g., 12300
# System info
cpus = os.cpu_count()
print(f"CPU cores: {cpus}") # e.g., 8
# System load average (Unix only - not available on Windows)
try:
    load = os.getloadavg()
    print(f"Load avg (1m, 5m, 15m): {load}") # e.g., (1.5, 1.2, 0.9)
except (AttributeError, OSError): # missing on Windows, or load average unobtainable
    print("getloadavg not available on this platform")
Why Process IDs Matter
import os
# Writing PID files (used by daemons to prevent duplicate instances)
pid_file = "/var/run/myapp.pid"
def write_pid_file():
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))
def check_running():
    try:
        with open(pid_file) as f:
            old_pid = int(f.read().strip())
        # Check if process is still running (Unix only: on Windows,
        # os.kill() with this signal value terminates the process)
        os.kill(old_pid, 0) # Signal 0 = check existence, don't kill
        return True # Process exists
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False # No PID file, unreadable PID, or no such process
    except PermissionError:
        return True # Process exists but we can't signal it
# Common in web servers, background workers, schedulers
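Checking for a PID file and then writing it leaves a gap in which two instances can both start. One way to close it, sketched below under the assumption of a local filesystem, is to create the file with O_CREAT | O_EXCL so that the check and the creation are a single operation. acquire_pid_file is a hypothetical name, and cleanup of stale files left by crashed processes is deliberately left out.

```python
import os
import tempfile


def acquire_pid_file(path):
    """Create path containing our PID; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so "check and create"
        # happens as one atomic step.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


with tempfile.TemporaryDirectory() as tmp:
    pid_path = os.path.join(tmp, "myapp.pid")
    print(acquire_pid_file(pid_path))  # True: we created it
    print(acquire_pid_file(pid_path))  # False: already held
```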
Part 8 - Environment Variables
import os
# Read environment variables
path = os.environ["PATH"] # KeyError if missing
home = os.environ.get("HOME") # None if missing
port = os.environ.get("PORT", "8080") # Default value
# Set environment variable (affects current process and future child processes)
os.environ["MY_APP_MODE"] = "production"
# Delete an environment variable
del os.environ["TEMP_VAR"]
# or
os.environ.pop("TEMP_VAR", None) # Safe - no error if missing
# Get all environment variables as a dict
env_dict = dict(os.environ)
for key, value in sorted(env_dict.items()):
print(f"{key}={value}")
:::note Full Coverage in Next Topic
Environment variables have their own dedicated topic (06-Environment-Variables) covering the 12-factor app pattern, python-dotenv, Pydantic Settings, and security practices. This section covers just the os module mechanics.
:::
Part 9 - os.urandom(): Cryptographically Secure Random Bytes
import os
# Generate 16 bytes of cryptographically secure random data
random_bytes = os.urandom(16)
print(random_bytes) # b'\x8f\xc3\xb2...' (16 random bytes)
print(len(random_bytes)) # 16
# Generate a secure token (common for session IDs, CSRF tokens)
import secrets # Python 3.6+ preferred API wrapping os.urandom
token = secrets.token_hex(32) # 64-character hex string
print(token) # e.g., "a3f8c9d1e2..."
api_key = secrets.token_urlsafe(32) # URL-safe base64
print(api_key) # e.g., "wI4Qp8..."
:::note os.urandom vs random
os.urandom() reads from the OS cryptographically secure random number generator (the getrandom() syscall or /dev/urandom on Unix; BCryptGenRandom on Windows since Python 3.11, CryptGenRandom before that). The random module is not cryptographically secure - never use random to generate passwords, tokens, or keys. Use os.urandom() directly or the secrets module (which wraps it with a friendlier API).
:::
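Generating a token is only half of the job; comparing it safely matters too. A brief sketch, where tokens_match is a hypothetical helper wrapping secrets.compare_digest(), which compares in constant time:

```python
import os
import secrets

# 16 random bytes rendered as 32 hex characters.
raw = os.urandom(16)
print(len(raw.hex()))               # 32

# secrets.token_hex(n) produces n random bytes in hex form.
print(len(secrets.token_hex(16)))   # 32


def tokens_match(expected, provided):
    """Compare two tokens in constant time to avoid timing side channels."""
    return secrets.compare_digest(expected, provided)


session_token = secrets.token_urlsafe(32)
print(tokens_match(session_token, session_token))  # True
print(tokens_match(session_token, "guess"))        # False
```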
Part 10 - os.system() vs subprocess.run()
os.system() is one of those functions that exists in Python and should essentially never be used in production code.
Why os.system() Is Dangerous
import os
# os.system - DO NOT USE
filename = "report 2024.pdf"
os.system(f"ls -la {filename}") # The shell splits this into two arguments: report and 2024.pdf
# This passes the string to the shell, which interprets it.
# If filename = "file.pdf; rm -rf /", you get shell injection!
# Worse: no way to capture output
# os.system returns only a status (0 = success); on Unix it is an encoded wait status
ret = os.system("ls /tmp") # Prints to stdout directly
print(ret) # 0 (success) - output is gone
With os.system(): user_input = "report.pdf; rm -rf ~" → os.system(f"open {user_input}") → shell executes open report.pdf; rm -rf ~ → deletes home directory.
With subprocess.run(["open", user_input]): arguments are passed as a list, never interpreted by the shell. No shell metacharacters (;, &&, |, >) are processed.
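This can be checked without running anything destructive. In the sketch below a child Python process simply prints the argument it received; the hostile string is a harmless stand-in, and echo_argument is a hypothetical helper name.

```python
import subprocess
import sys


def echo_argument(value):
    """Pass value as a single argv entry to a child and return what it received."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


hostile = "report.pdf; echo INJECTED"
print(echo_argument(hostile))  # report.pdf; echo INJECTED
```

The semicolon arrives as ordinary text inside one argument; no second command runs.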
The Correct Way: subprocess.run()
import subprocess
# Safe - arguments are a list, no shell injection possible
result = subprocess.run(
["ls", "-la", "/tmp"],
capture_output=True, # Capture stdout and stderr
text=True, # Decode bytes to str
check=True # Raise CalledProcessError if exit code != 0
)
print(result.stdout) # The ls output as a string
print(result.returncode) # 0
# Handling errors
try:
result = subprocess.run(
["python3", "nonexistent.py"],
capture_output=True,
text=True,
check=True
)
except subprocess.CalledProcessError as e:
print(f"Command failed with code {e.returncode}")
print(f"stderr: {e.stderr}")
# Passing user input safely - no string formatting needed
user_filename = "report 2024.pdf"
result = subprocess.run(
["wc", "-l", user_filename], # Each argument is separate
capture_output=True, text=True
)
Part 11 - Real-World: Build Tool Integration
Here is a complete, production-quality script combining os.walk, os.stat, os.makedirs, and subprocess.run to build a project report:
import os
import subprocess
import json
from datetime import datetime, timezone
def analyze_project(root_dir, output_dir):
"""
Analyze a Python project: count files, find large files,
run pylint, and write a JSON report.
"""
os.makedirs(output_dir, exist_ok=True)
stats = {
"root": root_dir,
"analyzed_at": datetime.now(timezone.utc).isoformat(), # utcnow() is deprecated since Python 3.12
"file_count": 0,
"total_size_bytes": 0,
"large_files": [], # Files over 100KB
"file_types": {},
"pylint_score": None,
}
skip_dirs = {".git", "__pycache__", ".venv", "node_modules", ".mypy_cache"}
for dirpath, dirnames, filenames in os.walk(root_dir, topdown=True):
dirnames[:] = [d for d in dirnames if d not in skip_dirs]
for fname in filenames:
fpath = os.path.join(dirpath, fname)
try:
info = os.stat(fpath)
size = info.st_size
ext = os.path.splitext(fname)[1].lower() or "(no ext)"
stats["file_count"] += 1
stats["total_size_bytes"] += size
stats["file_types"][ext] = stats["file_types"].get(ext, 0) + 1
if size > 100_000: # 100KB
stats["large_files"].append({
"path": fpath,
"size_kb": round(size / 1024, 1),
})
except PermissionError:
pass
# Run pylint on the project (safely, without shell=True)
try:
result = subprocess.run(
["python3", "-m", "pylint", root_dir, "--score=y"],
capture_output=True,
text=True,
timeout=60
)
# Parse pylint score from last line: "Your code has been rated at 9.50/10"
for line in result.stdout.splitlines():
if "rated at" in line:
score_part = line.split("rated at")[1].strip()
stats["pylint_score"] = score_part.split("/")[0].strip()
except (subprocess.TimeoutExpired, FileNotFoundError):
stats["pylint_score"] = "unavailable"
# Write report
report_path = os.path.join(output_dir, "project_report.json")
with open(report_path, "w", encoding="utf-8") as f:
json.dump(stats, f, indent=2)
print(f"Report written to {report_path}")
print(f"Files analyzed: {stats['file_count']}")
print(f"Total size: {stats['total_size_bytes'] / 1024:.1f} KB")
return stats
# Run it
# analyze_project("/Users/alice/projects/myapp", "/tmp/reports")
Interview Questions
Q1: What is the difference between os.listdir() and os.scandir(), and when would you use each?
Answer: os.listdir() returns a plain list of filenames as strings. To determine file type (file vs directory) or size, you must make a separate os.stat() system call for each entry - O(n) additional syscalls.
os.scandir() returns DirEntry objects. On Linux and most other Unix systems, the directory entry structure from the OS already includes the file type (the d_type field in struct dirent); on Windows, the directory listing returns file attributes along with each name. scandir exposes this as entry.is_file() and entry.is_dir(), usually without additional syscalls; PEP 471 reports directory walks running 2-20x faster as a result. Use scandir when you need to filter by type or access stat information. Use listdir only when you need a simple list of names and nothing else.
Q2: How does os.walk() allow you to prune directories, and why must you use slice assignment?
Answer: With topdown=True (the default), os.walk yields (dirpath, dirnames, filenames) and checks dirnames to decide which subdirectories to descend into. If you modify dirnames before the next iteration, os.walk respects the change.
The modification must be in-place using dirnames[:] = [...] (slice assignment). If you write dirnames = [...], you rebind the local variable to a new list object, but os.walk still holds a reference to the original list and will descend into all original directories. Slice assignment modifies the contents of the existing list object that both your code and os.walk share.
Q3: Why should you never use os.system() in production code?
Answer: Three reasons:
- Shell injection: os.system() passes the command string to the shell for interpretation. User-controlled input in the string can contain shell metacharacters (;, &&, |, `) that execute arbitrary commands.
- No output capture: os.system() writes stdout/stderr directly to the terminal and returns only the exit status. You cannot capture or process the output programmatically.
- No timeout or error handling: os.system() offers neither. subprocess.run() with check=True, timeout=N, and capture_output=True gives you structured error handling, output capture, and timeout protection. Use subprocess.run(["cmd", "arg1", "arg2"]) with a list of arguments - no shell interpretation, no injection risk.
Q4: What is the difference between os.remove() and shutil.rmtree()?
Answer: os.remove() (also os.unlink()) deletes a single file. It raises an OSError if called on a directory (IsADirectoryError on Linux, PermissionError on some other platforms) and FileNotFoundError if the file does not exist.
os.rmdir() deletes a single empty directory. It raises OSError if the directory contains any files or subdirectories.
shutil.rmtree() recursively deletes an entire directory tree - all files, subdirectories, and their contents. There is no confirmation, no recycle bin, and no undo. Always validate the path before calling it in production code.
Q5: What does os.stat() return, and what is the difference between st_mtime, st_atime, and st_ctime?
Answer: os.stat() returns a stat_result object with fields from the underlying POSIX stat(2) system call:
- st_size: file size in bytes
- st_mode: file type and permission bits (use stat.S_IMODE() to extract permissions)
- st_uid, st_gid: owner user ID and group ID
- st_mtime: modification time - when the file content was last changed
- st_atime: access time - when the file was last read (on Linux often updated lazily or not at all, via the relatime and noatime mount options, for performance)
- st_ctime: metadata change time on Unix (NOT creation time) - when permissions, owner, or link count changed. On Windows it has historically been the creation time; Python 3.12 adds st_birthtime for that purpose and deprecates this use of st_ctime.
The common confusion: on Unix/Linux, st_ctime is not creation time. Use st_mtime to detect file changes in build tools and cache invalidators.
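A build-tool style staleness check based on st_mtime might look like the sketch below. is_stale and the file names are hypothetical, and os.utime() is used to set timestamps explicitly so the result does not depend on clock resolution.

```python
import os
import tempfile


def is_stale(source, target):
    """Return True if target is missing or older than source (by st_mtime)."""
    try:
        target_mtime = os.stat(target).st_mtime
    except FileNotFoundError:
        return True
    return os.stat(source).st_mtime > target_mtime


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "app.py")
    target = os.path.join(tmp, "app.pyc")
    open(source, "w").close()
    print(is_stale(source, target))     # True: target does not exist yet

    open(target, "w").close()
    os.utime(source, (1_000, 1_000))    # (atime, mtime) in seconds since the epoch
    os.utime(target, (2_000, 2_000))
    print(is_stale(source, target))     # False: target is newer

    os.utime(source, (3_000, 3_000))
    print(is_stale(source, target))     # True: source changed afterwards
```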
Q6: How is os.urandom() different from the random module, and when must you use os.urandom()?
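Answer: random is a pseudo-random number generator (Mersenne Twister). By default it is seeded from OS entropy when available, falling back to the system time, but that does not make it secure: its output is statistically high-quality yet cryptographically predictable - given enough output (624 consecutive 32-bit values are sufficient), an attacker can reconstruct the internal state and predict all future values.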
os.urandom() reads from the OS cryptographically secure random number generator - the kernel's CSPRNG on Unix (via getrandom() or /dev/urandom, fed by hardware entropy sources and interrupt timing), or BCryptGenRandom on Windows (CryptGenRandom before Python 3.11). The output is computationally infeasible to predict.
You must use os.urandom() (or the secrets module, which wraps it) for: session tokens, CSRF tokens, password reset links, API keys, encryption keys, nonces, and any value whose unpredictability has security implications. Use random only for simulations, games, and non-security random sampling.
Practice Challenges
Beginner - Directory File Counter
Write a function count_by_extension(directory) that returns a dictionary mapping each file extension (e.g., ".py", ".txt") to the number of files with that extension in the directory (non-recursive). Use os.scandir().
Solution
import os
def count_by_extension(directory):
"""
Count files in a directory grouped by extension.
Non-recursive. Uses os.scandir for efficiency.
Args:
directory: path to the directory to scan
Returns:
dict mapping extension -> count
e.g., {'.py': 12, '.txt': 3, '(no ext)': 1}
"""
counts = {}
with os.scandir(directory) as entries:
for entry in entries:
if entry.is_file():
# os.path.splitext returns ('name', '.ext') or ('name', '')
_, ext = os.path.splitext(entry.name)
ext = ext.lower() if ext else "(no ext)"
counts[ext] = counts.get(ext, 0) + 1
return counts
# Demo
if __name__ == "__main__":
import sys
target = sys.argv[1] if len(sys.argv) > 1 else "."
result = count_by_extension(target)
print(f"File types in {target}:")
for ext, count in sorted(result.items(), key=lambda x: -x[1]):
print(f" {ext:15} {count:4} files")
# Example output for a Python project directory:
# File types in /Users/alice/myproject:
# .py 47 files
# .md 8 files
# .yaml 3 files
# (no ext) 2 files
# .json 1 files
Intermediate - Recursive Duplicate Finder
Write a function find_duplicates(directory) that recursively scans a directory and returns a dict where keys are file sizes (in bytes) and values are lists of file paths that share that size. Include only sizes with more than one file - these are potential duplicates. Skip hidden directories and __pycache__.
Solution
import os
from collections import defaultdict
def find_duplicates(directory):
"""
Find potential duplicate files by matching file size.
Files with the same size are candidates for deduplication.
(True deduplication requires content hashing - this is step 1.)
Args:
directory: root directory to scan recursively
Returns:
dict: {size_bytes: [list_of_paths]} for sizes with 2+ files
"""
skip_dirs = {"__pycache__", ".git", ".venv", "node_modules", ".mypy_cache"}
size_map = defaultdict(list) # size -> [paths]
for dirpath, dirnames, filenames in os.walk(directory, topdown=True):
# Prune hidden dirs and known noisy dirs
dirnames[:] = [
d for d in dirnames
if d not in skip_dirs and not d.startswith(".")
]
for fname in filenames:
if fname.startswith("."):
continue # Skip hidden files
fpath = os.path.join(dirpath, fname)
try:
size = os.stat(fpath).st_size
if size > 0: # Skip empty files
size_map[size].append(fpath)
except (PermissionError, FileNotFoundError):
pass # Skip inaccessible files
# Keep only sizes with multiple files
duplicates = {
size: paths
for size, paths in size_map.items()
if len(paths) > 1
}
return duplicates
def report_duplicates(directory):
"""Print a human-readable duplicate report."""
dupes = find_duplicates(directory)
if not dupes:
print("No potential duplicates found.")
return
total_wasted = 0
print(f"Potential duplicates in {directory}:\n")
for size, paths in sorted(dupes.items(), key=lambda x: -x[0]):
size_kb = size / 1024
wasted = size * (len(paths) - 1) # Could save this many bytes
total_wasted += wasted
print(f" Size: {size_kb:.1f} KB - {len(paths)} files")
for path in paths:
print(f" {path}")
print()
print(f"Potential savings if deduplicated: {total_wasted / 1024:.1f} KB")
# Demo usage
# report_duplicates("/Users/alice/Downloads")
# Note: size matching is not conclusive - two files can have the same size
# but different contents. For reliable deduplication, hash the content:
import hashlib
def hash_file(path, chunk_size=65536):
"""Return MD5 hash of file contents."""
h = hashlib.md5()
with open(path, "rb") as f:
while chunk := f.read(chunk_size):
h.update(chunk)
return h.hexdigest()
def find_true_duplicates(directory):
"""Find files with identical content (two-pass: size then hash)."""
# First pass: group by size (cheap)
size_candidates = find_duplicates(directory)
# Second pass: hash only the candidate files (expensive for large files)
hash_map = defaultdict(list)
for paths in size_candidates.values():
for path in paths:
try:
digest = hash_file(path)
hash_map[digest].append(path)
except (PermissionError, FileNotFoundError):
pass
return {h: paths for h, paths in hash_map.items() if len(paths) > 1}
Advanced - Secure Deployment Script with Permission Auditing
Write a deploy_static_files(src_dir, dest_dir) function that:
- Copies all non-hidden files from src_dir to dest_dir recursively using os.walk, os.makedirs, and shutil.copy2
- After copying, audits each file and sets permissions: directories get 0o755, regular files get 0o644, files ending in .sh get 0o755
- Detects any files that end up world-writable (stat.S_IWOTH) and raises a RuntimeError listing them
- Returns a summary dict: {"copied": N, "permission_errors": [...]}
Solution
import os
import stat
import shutil
from pathlib import Path
class DeploymentError(RuntimeError):
"""Raised when deployment encounters security violations."""
pass
def deploy_static_files(src_dir, dest_dir):
    """
    Deploy static files from src_dir to dest_dir with secure permissions.

    Steps:
    1. Walk src_dir, skip hidden files/dirs
    2. Recreate directory structure in dest_dir
    3. Copy each file (preserving metadata with shutil.copy2)
    4. Set permissions: dirs=0o755, .sh files=0o755, others=0o644
    5. Audit for world-writable files and raise DeploymentError if found

    Args:
        src_dir: source directory path (str or Path)
        dest_dir: destination directory path (str or Path)

    Returns:
        dict: {"copied": int, "permission_errors": list[str]}

    Raises:
        DeploymentError: if any deployed file ends up world-writable
    """
    src_dir = str(src_dir)
    dest_dir = str(dest_dir)
    os.makedirs(dest_dir, exist_ok=True)
    summary = {"copied": 0, "permission_errors": []}
    skip_dirs = {".git", "__pycache__", ".venv", "node_modules"}
    # ── Phase 1: Copy files ───────────────────────────────────────────
    for dirpath, dirnames, filenames in os.walk(src_dir, topdown=True):
        # Skip hidden and noisy directories (slice assignment prunes the walk)
        dirnames[:] = [
            d for d in dirnames
            if d not in skip_dirs and not d.startswith(".")
        ]
        # Compute the relative path from src_dir; relpath returns "." for
        # the top level, so normpath drops it from the joined result
        rel_path = os.path.relpath(dirpath, src_dir)
        dest_subdir = os.path.normpath(os.path.join(dest_dir, rel_path))
        os.makedirs(dest_subdir, exist_ok=True)
        for fname in filenames:
            if fname.startswith("."):
                continue  # Skip hidden files
            src_file = os.path.join(dirpath, fname)
            dest_file = os.path.join(dest_subdir, fname)
            try:
                shutil.copy2(src_file, dest_file)  # copy2 also copies timestamps and mode bits
                summary["copied"] += 1
            except PermissionError as e:
                summary["permission_errors"].append(f"copy failed: {src_file}: {e}")
    # ── Phase 2: Set permissions ──────────────────────────────────────
    for dirpath, dirnames, filenames in os.walk(dest_dir, topdown=True):
        # Set directory permissions
        try:
            os.chmod(dirpath, 0o755)
        except PermissionError as e:
            summary["permission_errors"].append(f"chmod dir failed: {dirpath}: {e}")
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            # Shell scripts need execute bit; everything else gets 0o644
            target_mode = 0o755 if fname.endswith(".sh") else 0o644
            try:
                os.chmod(fpath, target_mode)
            except PermissionError as e:
                summary["permission_errors"].append(
                    f"chmod failed: {fpath}: {e}"
                )
    # ── Phase 3: Security audit ───────────────────────────────────────
    # Only files whose chmod failed in Phase 2 can still be world-writable here
    world_writable = []
    for dirpath, dirnames, filenames in os.walk(dest_dir):
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            try:
                mode = os.stat(fpath).st_mode
                if mode & stat.S_IWOTH:  # world-writable bit set
                    world_writable.append(fpath)
            except PermissionError:
                pass
    if world_writable:
        file_list = "\n ".join(world_writable)
        raise DeploymentError(
            f"Security violation: {len(world_writable)} world-writable files found "
            f"after deployment:\n {file_list}"
        )
    print("Deployment complete:")
    print(f" Files copied: {summary['copied']}")
    print(f" Perm errors: {len(summary['permission_errors'])}")
    if summary["permission_errors"]:
        for err in summary["permission_errors"]:
            print(f" WARNING: {err}")
    return summary
# Demo usage
if __name__ == "__main__":
    import tempfile
    # Set up a test source directory
    with tempfile.TemporaryDirectory() as src:
        # Create some files
        Path(src, "index.html").write_text("<h1>Hello</h1>")
        Path(src, "style.css").write_text("body { margin: 0; }")
        Path(src, "deploy.sh").write_text("#!/bin/bash\necho deploying")
        Path(src, "subdir").mkdir()
        Path(src, "subdir", "app.js").write_text("console.log('app');")
        with tempfile.TemporaryDirectory() as dest:
            result = deploy_static_files(src, dest)
            print(f"\nResult: {result}")
            # Verify permissions
            for dirpath, _, filenames in os.walk(dest):
                for fname in filenames:
                    fpath = os.path.join(dirpath, fname)
                    mode = stat.S_IMODE(os.stat(fpath).st_mode)
                    expected = 0o755 if fname.endswith(".sh") else 0o644
                    status = "OK" if mode == expected else "MISMATCH"
                    print(f" [{status}] {fname}: {oct(mode)}")
# Example output on a POSIX system (file order within a directory may vary;
# top-level files are listed before those in subdir):
# Deployment complete:
# Files copied: 4
# Perm errors: 0
#
# Result: {'copied': 4, 'permission_errors': []}
# [OK] index.html: 0o644
# [OK] style.css: 0o644
# [OK] deploy.sh: 0o755
# [OK] app.js: 0o644
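The demo never exercises the Phase 3 audit, because Phase 2 has already reset every file to 0o644 or 0o755. As a minimal sketch, assuming a POSIX system where os.chmod honours all nine permission bits, the world-writable check can be tried in isolation on a deliberately loose file. The helper name is_world_writable and the file name upload.txt are illustrative and not part of the solution.

```python
import os
import stat
import tempfile


def is_world_writable(path):
    """Return True if the 'other' write bit (stat.S_IWOTH) is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "upload.txt")  # hypothetical file name
    with open(target, "w") as fh:
        fh.write("data")

    os.chmod(target, 0o666)             # rw-rw-rw-: anyone may write
    loose = is_world_writable(target)

    os.chmod(target, 0o644)             # rw-r--r--: only the owner may write
    tight = is_world_writable(target)

print(f"0o666 world-writable: {loose}")
print(f"0o644 world-writable: {tight}")
```

On Windows, os.chmod only toggles the read-only flag, so the second check would not behave the same way there.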
Quick Reference
| Operation | Code | Notes |
|---|---|---|
| Current directory | os.getcwd() | Returns absolute path string |
| Change directory | os.chdir(path) | Avoid in production - mutates global state |
| List directory | os.listdir(path) | Returns list of name strings |
| Scan directory | os.scandir(path) | Returns an iterator of DirEntry objects - faster when you need type or stat info |
| Walk recursively | os.walk(path, topdown=True) | Yields (dirpath, dirnames, filenames) |
| Prune walk | dirnames[:] = [...] | Slice assignment in place; only takes effect with topdown=True |
| File exists | os.path.exists(path) | Prefer Path(p).exists() |
| Is file | os.path.isfile(path) | Prefer Path(p).is_file() |
| Is directory | os.path.isdir(path) | Prefer Path(p).is_dir() |
| Join paths | os.path.join(a, b, c) | Prefer Path(a) / b / c |
| Split extension | os.path.splitext(name) | Returns ('stem', '.ext') |
| File metadata | os.stat(path) | Returns stat_result |
| File permissions | stat.S_IMODE(os.stat(p).st_mode) | Requires import stat |
| Set permissions | os.chmod(path, 0o644) | Octal mode |
| Create directory | os.makedirs(path, exist_ok=True) | Creates all intermediate dirs |
| Remove file | os.remove(path) | Single file only |
| Remove dir tree | shutil.rmtree(path) | Permanent - no undo |
| Rename/move | os.rename(src, dst) | Atomic on POSIX if same filesystem |
| Current PID | os.getpid() | Integer process ID |
| Parent PID | os.getppid() | Integer parent process ID |
| CPU count | os.cpu_count() | Number of logical CPUs; None if it cannot be determined |
| System load | os.getloadavg() | Unix only: (1m, 5m, 15m) tuple |
| Secure random | os.urandom(n) | n bytes of cryptographic entropy |
| Run command | subprocess.run([...], capture_output=True, text=True, check=True) | Never use os.system() |
| Read env var | os.environ.get("KEY", "default") | Safe - returns default if missing |
| Set env var | os.environ["KEY"] = "value" | Affects current process and children |
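The "Prune walk" row depends on slice assignment specifically. The sketch below, which uses a hypothetical helper walk_dirs and a throwaway directory layout, compares dirnames[:] = [...] with plain rebinding to show why only the former stops os.walk from descending into skipped directories.

```python
import os
import tempfile


def walk_dirs(root, prune_in_place):
    """Return the sorted relative paths of every directory os.walk visits."""
    skip = {".git", "node_modules"}
    visited = []
    for dirpath, dirnames, _ in os.walk(root, topdown=True):
        visited.append(os.path.relpath(dirpath, root))
        kept = [d for d in dirnames if d not in skip]
        if prune_in_place:
            dirnames[:] = kept  # mutates the list object os.walk holds
        else:
            dirnames = kept     # rebinds a local name; os.walk never sees it
    return sorted(visited)


with tempfile.TemporaryDirectory() as root:
    for parts in (("src",), (".git", "objects"), ("node_modules", "pkg")):
        os.makedirs(os.path.join(root, *parts))

    pruned = walk_dirs(root, prune_in_place=True)
    unpruned = walk_dirs(root, prune_in_place=False)

print(pruned)    # only the root (".") and src are visited
print(unpruned)  # .git and node_modules are still traversed
```

Rebinding is an easy mistake to make because it raises no error; the walk simply visits more directories than intended.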
Key Takeaways
- os is the thin wrapper around POSIX/Win32 system calls - it handles process info, permissions, environment variables, and filesystem operations that pathlib does not cover
- Use os.scandir() instead of os.listdir() whenever you need file type or stat information - it avoids extra system calls by caching directory entry metadata
- os.walk() with topdown=True lets you prune directory traversal by assigning to dirnames[:] in place; use topdown=False for bottom-up operations like directory deletion
- os.chdir() mutates global process state - avoid it in library code; build absolute paths instead
- os.chmod() and os.stat() give you full control over Unix file permissions; stat.S_IMODE() extracts the permission bits from the full mode value
- Never use os.system() - it is vulnerable to shell injection and cannot capture output; use subprocess.run(["cmd", "arg"], capture_output=True, text=True, check=True) instead
- os.urandom() provides cryptographically secure random bytes; the secrets module (Python 3.6+) wraps it with a friendlier API for tokens and keys
- os.makedirs(path, exist_ok=True) is the safe way to create nested directories - because it does not fail if the directory already exists, it avoids the check-then-create (TOCTOU) race of testing for the directory first
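The last takeaway can be sketched briefly. The paths below are hypothetical and live in a temporary directory; the snippet contrasts the racy check-then-create pattern with exist_ok=True, and also shows that exist_ok only tolerates an existing directory, not an existing file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "cache", "images", "thumbs")

    # Racy pattern: another process could create the directory between
    # the exists() check and the makedirs() call, raising FileExistsError.
    if not os.path.exists(nested):
        os.makedirs(nested)

    # Race-free pattern: no prior check, and repeating the call is harmless.
    os.makedirs(nested, exist_ok=True)
    created = os.path.isdir(nested)

    # exist_ok=True still raises if the path exists but is a regular file.
    blocker = os.path.join(tmp, "report.txt")
    with open(blocker, "w") as fh:
        fh.write("x")
    try:
        os.makedirs(blocker, exist_ok=True)
        file_conflict = False
    except FileExistsError:
        file_conflict = True

print(f"directory created: {created}")
print(f"file conflict detected: {file_conflict}")
```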
