The os Module - System Calls and Process Interaction

Reading time: ~18 minutes | Level: Foundation → Engineering

Here is a behavior that surprises most Python developers:

import os

os.environ["MY_SECRET"] = "hunter2"

import subprocess
result = subprocess.run(
["python3", "-c", "import os; print(os.environ.get('MY_SECRET'))"],
capture_output=True, text=True
)
print(result.stdout.strip()) # hunter2

Setting os.environ["MY_SECRET"] changes the environment of your own process, and every child process started afterwards inherits a copy of it, including anything launched through subprocess. This is how configuration leaks and secret exposure happen in production.
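One way to keep a value out of a child process is to pass an explicit env mapping to subprocess.run(). The following is a minimal sketch, not the only approach: it reuses the illustrative MY_SECRET name from above and assumes sys.executable is an acceptable way to start the child interpreter.

```python
import os
import subprocess
import sys

os.environ["MY_SECRET"] = "hunter2"

# Copy the parent's environment, then drop what the child should not see.
child_env = {k: v for k, v in os.environ.items() if k != "MY_SECRET"}

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('MY_SECRET'))"],
    capture_output=True,
    text=True,
    env=child_env,  # replaces the inherited environment entirely
)
leaked = result.stdout.strip()
print(leaked)  # None
```

Filtering a copy of os.environ, rather than building an empty mapping, keeps variables such as PATH that the child may need to start at all.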

The os module is the thin wrapper between your Python code and the operating system kernel. Understanding it means understanding how processes, files, permissions, and system resources actually work - not just how Python abstracts them.

What You Will Learn

  • The architectural difference between os, pathlib, and shutil - and when to use each
  • How os.path works and why pathlib replaces most of it in modern code
  • Directory listing with os.listdir() and os.scandir() - and why scandir is dramatically faster
  • Recursive traversal with os.walk() and how topdown controls the traversal order
  • File metadata with os.stat() - permissions, size, timestamps
  • Changing file permissions with os.chmod() and reading them with stat.S_IMODE
  • Process identity with os.getpid() and os.getppid()
  • Why os.system() is dangerous and how subprocess.run() replaces it safely
  • Cryptographically secure random bytes from os.urandom()

Prerequisites

  • Familiarity with Python file I/O (open(), read(), write())
  • Understanding of Python strings and f-strings
  • Basic knowledge of what a file system is (files, directories, paths)
  • Having completed the pathlib module (topic 04) is helpful but not required

The Big Picture: os vs pathlib vs shutil

These three modules are often confused. Here is when to use each:

| Module | Use for | Key APIs |
| --- | --- | --- |
| pathlib | Path manipulation, file read/write, directory creation, glob patterns | Path("/a/b/c"), p.exists(), p.read_text(), p.glob("*.py"), p.stat() |
| os | Process/system info, permissions, environment vars, walking trees | os.getpid(), os.environ, os.walk(), os.chmod(), os.urandom(), os.cpu_count() |
| shutil | Copy/move/delete, high-level FS ops, archive handling, finding executables | shutil.copy(), shutil.move(), shutil.rmtree(), shutil.which(), shutil.disk_usage() |

:::tip Rule of Thumb For path manipulation, prefer pathlib. For system-level operations (process info, permissions, environment, random bytes), use os. For copying, moving, and deleting directory trees, use shutil. :::

Part 1 - os.path: The Classic Path Toolkit

os.path provides string-based path manipulation. It predates pathlib by decades. In Python 3.4+, pathlib is preferred for path manipulation - but os.path is still everywhere in existing codebases, so you must know it.

import os

path = "/home/alice/projects/myapp/config.yaml"

# Core os.path operations
print(os.path.basename(path)) # config.yaml
print(os.path.dirname(path)) # /home/alice/projects/myapp
print(os.path.splitext(path)) # ('/home/alice/projects/myapp/config', '.yaml')
print(os.path.split(path)) # ('/home/alice/projects/myapp', 'config.yaml')

# Building paths safely (handles OS-specific separators)
joined = os.path.join("/home/alice", "projects", "myapp", "config.yaml")
print(joined) # /home/alice/projects/myapp/config.yaml

# Checking path properties
print(os.path.exists(path)) # True or False depending on disk
print(os.path.isfile(path)) # True if it's a file
print(os.path.isdir(path)) # True if it's a directory
print(os.path.isabs(path)) # True - path starts with /
print(os.path.abspath("config.yaml")) # /current/working/dir/config.yaml

The pathlib Equivalents

Most os.path operations have a pathlib equivalent. The pathlib version is usually more readable because you compose operations with attributes and methods on a path object instead of nesting function calls on strings:

from pathlib import Path
import os

path_str = "/home/alice/projects/myapp/config.yaml"
path = Path(path_str)

# os.path → pathlib
os.path.basename(path_str) # config.yaml
path.name # config.yaml ← cleaner

os.path.dirname(path_str) # /home/alice/projects/myapp
path.parent # PosixPath('/home/alice/projects/myapp')

os.path.splitext(path_str) # ('.../config', '.yaml')
path.stem, path.suffix # 'config', '.yaml'

os.path.exists(path_str) # True/False
path.exists() # True/False

:::note When os.path Still Makes Sense os.path is still useful when you are working with code that passes around plain strings, integrating with legacy APIs that only accept strings, or writing library code that must work without importing pathlib. :::

Part 2 - Current Working Directory

import os

# Get the current working directory
cwd = os.getcwd()
print(cwd) # /Users/alice/projects/myapp

# Change the working directory
os.chdir("/tmp")
print(os.getcwd()) # /tmp

# Change back
os.chdir(cwd)
print(os.getcwd()) # /Users/alice/projects/myapp

:::danger os.chdir is a Code Smell os.chdir() mutates the process-wide working directory. If any other thread or code calls os.getcwd() after your chdir, it sees the new directory. This causes hard-to-debug race conditions in multithreaded applications.

The correct pattern is to build absolute paths with os.path.join(base, filename) or Path(base) / filename rather than changing directories. Reserve os.chdir() for short scripts where you control the entire process. :::

The Safe Pattern

import os
from pathlib import Path

# BAD: changing the global working directory
def process_files(directory):
    os.chdir(directory)  # global mutation - dangerous
    for f in os.listdir("."):
        process(f)

# GOOD: build absolute paths, never change directory
def process_files(directory):
    base = Path(directory).resolve()
    for f in base.iterdir():
        process(f)  # f is an absolute Path - safe
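The same idea applies when an external command has to run in a particular directory: subprocess.run() accepts a cwd argument, so only the child starts there. A small sketch, assuming a temporary directory stands in for the real working directory:

```python
import os
import subprocess
import sys
import tempfile

before = os.getcwd()

with tempfile.TemporaryDirectory() as workdir:
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        capture_output=True,
        text=True,
        cwd=workdir,  # only the child process starts in workdir
        check=True,
    )
    child_cwd = result.stdout.strip()
    # samefile() tolerates symlinked temp locations (e.g. /tmp on macOS).
    same_dir = os.path.samefile(child_cwd, workdir)

after = os.getcwd()
print(same_dir, before == after)  # True True
```

The parent's working directory never changes, so other threads are unaffected.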

Part 3 - Listing Directory Contents

os.listdir(): Simple but Dumb

import os

entries = os.listdir("/tmp")
print(entries)
# ['file1.txt', 'file2.log', 'subdir', '.hidden']
# Returns: list of strings, names only, no metadata

os.listdir() returns a plain list of names. To get file type or size, you must make a separate os.stat() call for each entry - which means one system call per file.

os.scandir(): Faster and Smarter

import os

# scandir returns DirEntry objects - already have type and stat info
with os.scandir("/tmp") as entries:
    for entry in entries:
        print(f"{entry.name:30} is_file={entry.is_file()} is_dir={entry.is_dir()}")
        # file1.txt                      is_file=True is_dir=False
        # subdir                         is_file=False is_dir=True

os.scandir() returns DirEntry objects that cache the file type information from the OS directory listing. On most filesystems, this means zero extra system calls to determine is_file() and is_dir(). A stat call is still needed when the filesystem reports the type as unknown or when the entry is a symlink.

For a directory with 1000 files:

  • os.listdir() + os.path.isfile() per file: the directory read plus 1000 stat syscalls, one per entry
  • os.scandir(): only the directory read (DirEntry caches d_type from the dirent struct), with no per-file stat on most Linux filesystems

When os.walk() was rebuilt on top of scandir in Python 3.5, the release notes reported it running 3 to 5 times faster on POSIX systems and 7 to 20 times faster on Windows.
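To see the difference on your own machine, the two approaches can be timed side by side. This is a rough sketch rather than a rigorous benchmark: the helper names are made up for illustration, and the timings depend heavily on the filesystem and on caching, so no expected numbers are shown.

```python
import os
import tempfile
import time

def count_files_listdir(directory):
    """Count regular files with one os.path.isfile() check per entry."""
    return sum(
        1 for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )

def count_files_scandir(directory):
    """Count regular files using the type info cached on each DirEntry."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())

with tempfile.TemporaryDirectory() as tmp:
    for i in range(1000):
        open(os.path.join(tmp, f"file_{i}.txt"), "w").close()
    os.mkdir(os.path.join(tmp, "subdir"))  # must not be counted

    start = time.perf_counter()
    n_listdir = count_files_listdir(tmp)
    t_listdir = time.perf_counter() - start

    start = time.perf_counter()
    n_scandir = count_files_scandir(tmp)
    t_scandir = time.perf_counter() - start

print(n_listdir, n_scandir)  # 1000 1000
print(f"listdir: {t_listdir:.4f}s  scandir: {t_scandir:.4f}s")
```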

Practical os.scandir() Usage

import os

def list_python_files(directory):
    """List all Python files in a directory (non-recursive)."""
    py_files = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".py"):
                stat = entry.stat()
                py_files.append({
                    "name": entry.name,
                    "path": entry.path,  # directory + name; absolute only if directory is absolute
                    "size": stat.st_size,
                    "modified": stat.st_mtime,
                })
    return sorted(py_files, key=lambda x: x["name"])

# Usage
files = list_python_files("/Users/alice/myproject")
for f in files:
    print(f"{f['name']:30} {f['size']:8} bytes")

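:::tip DirEntry Attributes A DirEntry object has: name (filename), path (the scandir argument joined with the name), is_file(), is_dir(), is_symlink(), and stat(). stat() follows symlinks by default and caches its result on the entry; on Windows the data usually comes from the directory listing itself, with no extra system call. Pass follow_symlinks=False to get metadata for the link rather than its target. :::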

Part 4 - Recursive Directory Traversal with os.walk()

os.walk() is one of the most useful functions in Python's standard library. It generates (dirpath, dirnames, filenames) tuples for every directory in a tree.

import os

# Basic traversal
for dirpath, dirnames, filenames in os.walk("/Users/alice/projects"):
    print(f"DIR: {dirpath}")
    for fname in filenames:
        print(f"  FILE: {os.path.join(dirpath, fname)}")

How os.walk() Works Internally

/project/
├── main.py
├── config.yaml
└── src/
    ├── models.py
    └── utils/
        └── helpers.py

topdown=True (default) yields root-first:

  1. ("/project", ["src"], ["main.py", "config.yaml"])
  2. ("/project/src", ["utils"], ["models.py"])
  3. ("/project/src/utils", [], ["helpers.py"])

topdown=False yields deepest-first:

  1. ("/project/src/utils", [], ["helpers.py"])
  2. ("/project/src", ["utils"], ["models.py"])
  3. ("/project", ["src"], ["main.py", "config.yaml"])
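The two orders can be reproduced with a throwaway copy of that tree. The sketch below builds it in a temporary directory; walk_order is a hypothetical helper that records only the directory paths, since the order of names within a single directory is not guaranteed.

```python
import os
import tempfile

def walk_order(root, topdown):
    """Return directory paths relative to root, in the order os.walk yields them."""
    return [
        os.path.relpath(dirpath, root)
        for dirpath, _dirnames, _filenames in os.walk(root, topdown=topdown)
    ]

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src", "utils"))
    for rel in ("main.py", "config.yaml", "src/models.py", "src/utils/helpers.py"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    top_down = walk_order(root, topdown=True)
    bottom_up = walk_order(root, topdown=False)

print(top_down)   # ['.', 'src', 'src/utils'] on POSIX
print(bottom_up)  # ['src/utils', 'src', '.'] on POSIX
```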

Controlling Traversal: Pruning Subdirectories

With topdown=True, you can modify dirnames in-place to skip directories:

import os

def find_python_files(root, skip_dirs=None):
    """
    Recursively find all .py files, skipping specified directories.
    Modifying dirnames in-place prunes the traversal - no wasted work.
    """
    skip_dirs = skip_dirs or {".git", "__pycache__", ".venv", "node_modules"}
    py_files = []

    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Prune: remove directories we don't want to descend into
        # Must modify in-place (slice assignment), not reassign
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]

        for fname in filenames:
            if fname.endswith(".py"):
                full_path = os.path.join(dirpath, fname)
                py_files.append(full_path)

    return py_files

# Find all Python files in a project
files = find_python_files("/Users/alice/projects/myapp")
for f in files:
    print(f)
# /Users/alice/projects/myapp/main.py
# /Users/alice/projects/myapp/src/models.py
# /Users/alice/projects/myapp/src/utils/helpers.py

:::warning Modifying dirnames In-Place Use dirnames[:] = [...] (slice assignment), not dirnames = [...] (rebinding). Slice assignment modifies the original list object that os.walk holds a reference to. Rebinding creates a new list and leaves the original untouched - so os.walk still descends into all directories. :::
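The difference between the two assignments is easy to observe. In this sketch, visited_dirs is a hypothetical helper that tries to skip node_modules either by slice assignment or by rebinding, and records which directories the walk actually reached.

```python
import os
import tempfile

def visited_dirs(root, prune_in_place):
    """Walk root, trying to skip 'node_modules' in place or by rebinding."""
    seen = []
    for dirpath, dirnames, _filenames in os.walk(root):
        seen.append(os.path.basename(dirpath))
        kept = [d for d in dirnames if d != "node_modules"]
        if prune_in_place:
            dirnames[:] = kept  # mutates the list os.walk reads next
        else:
            dirnames = kept     # rebinds the local name only; os.walk is unaffected
    return seen

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    os.makedirs(os.path.join(root, "node_modules", "pkg"))

    pruned = visited_dirs(root, prune_in_place=True)
    not_pruned = visited_dirs(root, prune_in_place=False)

print("node_modules" in pruned)      # False
print("node_modules" in not_pruned)  # True
```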

topdown=False: When You Need to Delete Directories

topdown=False yields deepest directories first. This is the correct mode for deleting directory trees - you must delete files before deleting their parent directory:

import os

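def delete_empty_directories(root):
    """Remove all empty directories in a tree (bottom-up)."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        # dirnames was collected before the children were visited, so it may
        # still name subdirectories this loop has already removed. Ask the
        # filesystem instead of trusting the stale list.
        if not os.listdir(dirpath):
            try:
                os.rmdir(dirpath)
                print(f"Removed empty dir: {dirpath}")
            except OSError as e:
                print(f"Could not remove {dirpath}: {e}")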

Part 5 - File Metadata and Permissions

os.stat(): Everything About a File

import os
import stat
import time

info = os.stat("/etc/hosts")

print(f"Size: {info.st_size} bytes")
print(f"Mode: {oct(info.st_mode)}") # e.g., 0o100644
print(f"UID: {info.st_uid}") # owner user ID
print(f"GID: {info.st_gid}") # owner group ID
print(f"Modified: {time.ctime(info.st_mtime)}") # last modification time
print(f"Accessed: {time.ctime(info.st_atime)}") # last access time
print(f"Changed: {time.ctime(info.st_ctime)}") # metadata change time

Understanding Unix File Permissions

The st_mode value 0o100644 breaks down, from the most significant octal digits, as: file type (0o100000 = regular file, 0o040000 = directory, 0o120000 = symlink) · special bits (setuid/setgid/sticky) · user perms · group perms · other perms.

Permission bit values: 4 = read (r) · 2 = write (w) · 1 = execute (x) · 6 = rw- · 7 = rwx

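| Octal | Symbolic | Meaning |
| --- | --- | --- |
| 0o644 | rw-r--r-- | Owner rw, group r, other r - typical file |
| 0o755 | rwxr-xr-x | Owner rwx, group/other rx - executable/directory |
| 0o700 | rwx------ | Owner only - private directory such as ~/.ssh |
| 0o600 | rw------- | Owner rw only - private files such as SSH keys |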
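The standard stat module can do this decoding for you. A short sketch using two hand-written mode values (no files are touched):

```python
import stat

regular = 0o100644    # regular file, rw-r--r--
directory = 0o040755  # directory, rwxr-xr-x

print(stat.filemode(regular))      # -rw-r--r--
print(stat.filemode(directory))    # drwxr-xr-x

print(oct(stat.S_IFMT(regular)))   # 0o100000 (file-type bits)
print(oct(stat.S_IMODE(regular)))  # 0o644    (permission bits)
print(stat.S_ISDIR(directory))     # True
```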

os.chmod(): Changing Permissions

import os
import stat

# Make a file executable
os.chmod("deploy.sh", 0o755)

# Make a private key file readable and writable by its owner only
os.chmod("id_rsa", 0o600)

# Using stat constants (more readable)
os.chmod("script.py", stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR)
# stat.S_IRUSR = 0o400 (owner read)
# stat.S_IWUSR = 0o200 (owner write)
# stat.S_IXUSR = 0o100 (owner execute)
# Combined: 0o700

# Check current permissions
info = os.stat("deploy.sh")
permissions = stat.S_IMODE(info.st_mode) # Extract permission bits only
print(oct(permissions)) # 0o755
print(bool(permissions & stat.S_IXUSR)) # True - owner can execute
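Passing a full octal mode to os.chmod() replaces every permission bit. To switch on a single bit and leave the rest alone, read the current mode first and OR the new bit in. A sketch with a hypothetical helper name, assuming a POSIX system:

```python
import os
import stat
import tempfile

def add_owner_execute(path):
    """Turn on the owner-execute bit while leaving every other bit unchanged."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IXUSR)
    return stat.S_IMODE(os.stat(path).st_mode)

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "deploy.sh")
    with open(script, "w") as f:
        f.write("#!/bin/sh\necho deployed\n")
    os.chmod(script, 0o644)
    new_mode = add_owner_execute(script)

print(oct(new_mode))  # 0o744 on POSIX systems
```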

Practical: Audit Files With Insecure Permissions

import os
import stat

def find_world_writable(directory):
    """Find files that are writable by anyone - a security risk."""
    risky_files = []
    for dirpath, dirnames, filenames in os.walk(directory):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            try:
                mode = os.stat(fpath).st_mode
                if mode & stat.S_IWOTH:  # World-writable bit
                    risky_files.append(fpath)
            except OSError:
                # Covers PermissionError and broken symlinks (FileNotFoundError)
                pass
    return risky_files

# Usage
risky = find_world_writable("/var/www/html")
for path in risky:
    print(f"RISKY: {path}")

Part 6 - File System Operations

Creating Directories

import os

# Create a single directory
os.mkdir("/tmp/mydir") # Fails if parent doesn't exist

# Create nested directories (like mkdir -p)
os.makedirs("/tmp/a/b/c") # Creates all intermediate dirs
os.makedirs("/tmp/a/b/c", exist_ok=True) # No error if already exists

:::tip Always Use exist_ok=True In production code, always use os.makedirs(path, exist_ok=True). Without it, you get a FileExistsError if another process or thread creates the directory between your check and your creation - a classic TOCTOU (time-of-check-time-of-use) race condition. :::

Renaming and Moving Files

import os

# Rename/move within same filesystem - atomic on POSIX
os.rename("/tmp/old_name.txt", "/tmp/new_name.txt")

# For cross-filesystem moves, use shutil.move() instead
import shutil
shutil.move("/tmp/file.txt", "/mnt/storage/file.txt")
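Atomic renames are what make the "write to a temp file, then swap it in" pattern safe. Here is a minimal sketch of that pattern; atomic_write_text is a hypothetical helper, and it uses os.replace(), which, unlike os.rename(), also overwrites an existing destination on Windows.

```python
import os
import tempfile

def atomic_write_text(path, text):
    """Write text so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())    # push data to disk before the rename
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "config.yaml")
    atomic_write_text(target, "mode: staging\n")
    atomic_write_text(target, "mode: production\n")
    with open(target, encoding="utf-8") as f:
        final_text = f.read()
    leftovers = [n for n in os.listdir(tmp) if n.startswith(".tmp-")]

print(repr(final_text), leftovers)  # 'mode: production\n' []
```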

Removing Files and Directories

import os

os.remove("file.txt") # Remove a file (raises if directory)
os.unlink("other.txt") # Same function under its traditional Unix name

os.rmdir("empty_dir") # Remove EMPTY directory only

# For non-empty directories:
import shutil
shutil.rmtree("non_empty_dir") # USE WITH CAUTION - no recycle bin

:::danger shutil.rmtree is Permanent shutil.rmtree() deletes the directory and all its contents permanently - there is no recycle bin or undo. Always double-check the path. A common catastrophic bug: shutil.rmtree(base_dir + suffix) where suffix is empty and base_dir is /. In production code, validate the resolved path first and do a dry run that only logs what would be deleted. :::
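One possible guard is to refuse any deletion whose resolved path is not strictly inside a directory you control. The sketch below is an illustration under that assumption; safe_rmtree and allowed_root are hypothetical names, not part of shutil.

```python
import os
import shutil
import tempfile

def safe_rmtree(target, allowed_root):
    """Delete target only if it resolves to a path strictly inside allowed_root."""
    target_real = os.path.realpath(target)
    root_real = os.path.realpath(allowed_root)
    inside = os.path.commonpath([target_real, root_real]) == root_real
    if not inside or target_real == root_real:
        raise ValueError(
            f"Refusing to delete {target_real!r}: not strictly inside {root_real!r}"
        )
    shutil.rmtree(target_real)

with tempfile.TemporaryDirectory() as sandbox:
    build_dir = os.path.join(sandbox, "build")
    os.makedirs(os.path.join(build_dir, "cache"))

    safe_rmtree(build_dir, sandbox)  # inside the sandbox: removed
    removed = not os.path.exists(build_dir)

    try:
        safe_rmtree(sandbox + "", sandbox)  # empty suffix: would delete the root
        refused = False
    except ValueError:
        refused = True

print(removed, refused)  # True True
```

Resolving both paths with os.path.realpath() first means symlinks and ".." segments cannot be used to step outside the allowed root.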

Part 7 - Process Information

import os

# Current process ID
pid = os.getpid()
print(f"This process ID: {pid}") # e.g., 12345

# Parent process ID
ppid = os.getppid()
print(f"Parent process ID: {ppid}") # e.g., 12300

# System info
cpus = os.cpu_count()
print(f"CPU cores: {cpus}") # e.g., 8

# System load average (Unix only - not available on Windows)
try:
    load = os.getloadavg()
    print(f"Load avg (1m, 5m, 15m): {load}")  # e.g., (1.5, 1.2, 0.9)
except (AttributeError, OSError):
    # AttributeError: function missing on this platform
    # OSError: the load average could not be obtained
    print("getloadavg not available on this platform")

Why Process IDs Matter

import os

# Writing PID files (used by daemons to prevent duplicate instances)
pid_file = "/var/run/myapp.pid"

def write_pid_file():
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))

def check_running():
    try:
        with open(pid_file) as f:
            old_pid = int(f.read().strip())
        # Check if process is still running.
        # POSIX only: on Windows, os.kill() with this value terminates the process.
        os.kill(old_pid, 0)  # Signal 0 = check existence, don't kill
        return True  # Process exists
    except (FileNotFoundError, ProcessLookupError, ValueError):
        return False  # No PID file, stale PID, or unreadable contents
    except PermissionError:
        return True  # Process exists but we can't signal it

# Common in web servers, background workers, schedulers

Part 8 - Environment Variables

import os

# Read environment variables
path = os.environ["PATH"] # KeyError if missing
home = os.environ.get("HOME") # None if missing
port = os.environ.get("PORT", "8080") # Default value

# Set environment variable (affects current process and future child processes)
os.environ["MY_APP_MODE"] = "production"

# Delete an environment variable
del os.environ["TEMP_VAR"]
# or
os.environ.pop("TEMP_VAR", None) # Safe - no error if missing

# Get all environment variables as a dict
env_dict = dict(os.environ)
for key, value in sorted(env_dict.items()):
print(f"{key}={value}")

:::note Full Coverage in Next Topic Environment variables have their own dedicated topic (06-Environment-Variables) covering the 12-factor app pattern, python-dotenv, Pydantic Settings, and security practices. This section covers just the os module mechanics. :::

Part 9 - os.urandom(): Cryptographically Secure Random Bytes

import os

# Generate 16 bytes of cryptographically secure random data
random_bytes = os.urandom(16)
print(random_bytes) # b'\x8f\xc3\xb2...' (16 random bytes)
print(len(random_bytes)) # 16

# Generate a secure token (common for session IDs, CSRF tokens)
import secrets # Python 3.6+ preferred API wrapping os.urandom
token = secrets.token_hex(32) # 64-character hex string
print(token) # e.g., "a3f8c9d1e2..."

api_key = secrets.token_urlsafe(32) # URL-safe base64
print(api_key) # e.g., "wI4Qp8..."

:::note os.urandom vs random os.urandom() reads from the OS cryptographically secure random number generator (the getrandom() system call on Linux when available, /dev/urandom on other Unix-like systems, BCryptGenRandom on Windows since Python 3.11 and CryptGenRandom before that). The random module is not cryptographically secure - never use random to generate passwords, tokens, or keys. Use os.urandom() directly or the secrets module (which wraps it with a friendlier API). :::
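Raw bytes usually need to be turned into something printable or comparable. A brief sketch of three common follow-up steps, with variable names chosen for illustration:

```python
import os
import secrets

raw = os.urandom(16)

hex_token = raw.hex()                # 32 hex characters
as_int = int.from_bytes(raw, "big")  # integer in [0, 2**128)

# Compare secrets in constant time to avoid leaking information via timing.
expected = secrets.token_hex(16)
supplied = expected
matches = secrets.compare_digest(supplied, expected)

print(len(hex_token), as_int < 2**128, matches)  # 32 True True
```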

Part 10 - os.system() vs subprocess.run()

os.system() is one of those functions that exists in Python and should essentially never be used in production code.

Why os.system() Is Dangerous

import os

# os.system - DO NOT USE
filename = "report 2024.pdf"
os.system(f"ls -la {filename}")
# This passes the string to the shell, which interprets it.
# If filename = "file.pdf; rm -rf /", you get shell injection!

# Worse: no way to capture output
# os.system returns only a status number (on Unix an encoded wait status, not the raw exit code; 0 = success)
ret = os.system("ls /tmp") # Prints to stdout directly
print(ret) # 0 (success) - output is gone

With os.system(): user_input = "report.pdf; rm -rf ~" → os.system(f"open {user_input}") → the shell executes open report.pdf; rm -rf ~ → the home directory is deleted.

With subprocess.run(["open", user_input]): arguments are passed as a list, never interpreted by the shell. No shell metacharacters (;, &&, |, >) are processed.
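A harmless way to confirm this is to hand a child process an argument full of shell metacharacters and have it echo back what it received. In this sketch the child is simply another Python interpreter started via sys.executable:

```python
import subprocess
import sys

user_input = "report.pdf; echo INJECTED"

# The child prints its first argument exactly as it received it.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()
print(received)  # report.pdf; echo INJECTED
```

The semicolon arrives as ordinary text inside one argument; no second command runs.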

The Correct Way: subprocess.run()

import subprocess

# Safe - arguments are a list, no shell injection possible
result = subprocess.run(
    ["ls", "-la", "/tmp"],
    capture_output=True,  # Capture stdout and stderr
    text=True,            # Decode bytes to str
    check=True            # Raise CalledProcessError if exit code != 0
)

print(result.stdout)      # The ls output as a string
print(result.returncode)  # 0

# Handling errors
try:
    result = subprocess.run(
        ["python3", "nonexistent.py"],
        capture_output=True,
        text=True,
        check=True
    )
except subprocess.CalledProcessError as e:
    print(f"Command failed with code {e.returncode}")
    print(f"stderr: {e.stderr}")

# Passing user input safely - no string formatting needed
user_filename = "report 2024.pdf"
result = subprocess.run(
    ["wc", "-l", user_filename],  # Each argument is separate
    capture_output=True, text=True
)

Part 11 - Real-World: Build Tool Integration

Here is a complete, production-quality script combining os.walk, os.stat, os.makedirs, and subprocess.run to build a project report:

import os
import subprocess
import json
from datetime import datetime, timezone

def analyze_project(root_dir, output_dir):
    """
    Analyze a Python project: count files, find large files,
    run pylint, and write a JSON report.
    """
    os.makedirs(output_dir, exist_ok=True)

    stats = {
        "root": root_dir,
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
        "analyzed_at": datetime.now(timezone.utc).isoformat(),
        "file_count": 0,
        "total_size_bytes": 0,
        "large_files": [],  # Files over 100KB
        "file_types": {},
        "pylint_score": None,
    }

    skip_dirs = {".git", "__pycache__", ".venv", "node_modules", ".mypy_cache"}

    for dirpath, dirnames, filenames in os.walk(root_dir, topdown=True):
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]

        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            try:
                info = os.stat(fpath)
                size = info.st_size
                ext = os.path.splitext(fname)[1].lower() or "(no ext)"

                stats["file_count"] += 1
                stats["total_size_bytes"] += size
                stats["file_types"][ext] = stats["file_types"].get(ext, 0) + 1

                if size > 100_000:  # 100KB
                    stats["large_files"].append({
                        "path": fpath,
                        "size_kb": round(size / 1024, 1),
                    })
            except PermissionError:
                pass

    # Run pylint on the project (safely, without shell=True)
    try:
        result = subprocess.run(
            ["python3", "-m", "pylint", root_dir, "--score=y"],
            capture_output=True,
            text=True,
            timeout=60
        )
        # Parse pylint score from last line: "Your code has been rated at 9.50/10"
        for line in result.stdout.splitlines():
            if "rated at" in line:
                score_part = line.split("rated at")[1].strip()
                stats["pylint_score"] = score_part.split("/")[0].strip()
    except (subprocess.TimeoutExpired, FileNotFoundError):
        stats["pylint_score"] = "unavailable"

    # Write report
    report_path = os.path.join(output_dir, "project_report.json")
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=2)

    print(f"Report written to {report_path}")
    print(f"Files analyzed: {stats['file_count']}")
    print(f"Total size: {stats['total_size_bytes'] / 1024:.1f} KB")
    return stats

# Run it
# analyze_project("/Users/alice/projects/myapp", "/tmp/reports")

Interview Questions

Q1: What is the difference between os.listdir() and os.scandir(), and when would you use each?

Answer: os.listdir() returns a plain list of filenames as strings. To determine file type (file vs directory) or size, you must make a separate os.stat() system call for each entry - O(n) additional syscalls.

os.scandir() returns DirEntry objects. On Linux and most other POSIX systems, the directory entry structure from the OS usually includes the file type (the d_type field in struct dirent); on Windows, the directory listing API returns the type together with other attributes. scandir exposes this as entry.is_file() and entry.is_dir() without additional syscalls in the common case. According to the Python 3.5 release notes, this made os.walk() 3 to 5 times faster on POSIX systems and 7 to 20 times faster on Windows. Use scandir when you need to filter by type or access stat information. Use listdir only when you need a simple list of names and nothing else.

Q2: How does os.walk() allow you to prune directories, and why must you use slice assignment?

Answer: With topdown=True (the default), os.walk yields (dirpath, dirnames, filenames) and checks dirnames to decide which subdirectories to descend into. If you modify dirnames before the next iteration, os.walk respects the change.

The modification must be in-place using dirnames[:] = [...] (slice assignment). If you write dirnames = [...], you rebind the local variable to a new list object, but os.walk still holds a reference to the original list and will descend into all original directories. Slice assignment modifies the contents of the existing list object that both your code and os.walk share.

Q3: Why should you never use os.system() in production code?

Answer: Three reasons:

  1. Shell injection: os.system() passes the command string to the shell for interpretation. User-controlled input in the string can contain shell metacharacters (;, &&, |, `) that execute arbitrary commands.
  2. No output capture: os.system() writes stdout/stderr directly to the terminal and returns only the exit code. You cannot capture or process the output programmatically.
  3. No timeout or error handling: subprocess.run() with check=True, timeout=N, and capture_output=True gives you structured error handling, output capture, and timeout protection. Use subprocess.run(["cmd", "arg1", "arg2"]) with a list of arguments - no shell interpretation, no injection risk.
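The timeout behavior mentioned in the third point can be sketched as follows. run_with_timeout is a hypothetical wrapper, and the child commands are Python one-liners chosen so the example runs anywhere Python does.

```python
import subprocess
import sys

def run_with_timeout(args, seconds):
    """Run a command; return (returncode, stdout), or (None, '') on timeout."""
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, timeout=seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising TimeoutExpired.
        return None, ""
    return result.returncode, result.stdout

fast = run_with_timeout([sys.executable, "-c", "print('done')"], seconds=30)
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], seconds=1
)

print(fast)  # (0, 'done\n')
print(slow)  # (None, '')
```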

Q4: What is the difference between os.remove() and shutil.rmtree()?

Answer: os.remove() (also os.unlink()) deletes a single file. Called on a directory, it raises an OSError subclass (IsADirectoryError on Linux; other platforms may raise PermissionError instead). It raises FileNotFoundError if the file does not exist.

os.rmdir() deletes a single empty directory. It raises OSError if the directory contains any files or subdirectories.

shutil.rmtree() recursively deletes an entire directory tree - all files, subdirectories, and their contents. There is no confirmation, no recycle bin, and no undo. Always validate the path before calling it in production code.

Q5: What does os.stat() return, and what is the difference between st_mtime, st_atime, and st_ctime?

Answer: os.stat() returns a stat_result object with fields from the underlying POSIX stat(2) system call:

  • st_size: file size in bytes
  • st_mode: file type and permission bits (use stat.S_IMODE() to extract permissions)
  • st_uid, st_gid: owner user ID and group ID
  • st_mtime: modification time - when the file content was last changed
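  • st_atime: access time - when the file was last read (on Linux, commonly updated lazily under the relatime mount option, or not at all under noatime)
  • st_ctime: metadata change time on Unix (NOT creation time) - when permissions, owner, or link count changed. On Windows it has historically held the creation time; Python 3.12 deprecates that meaning in favor of st_birthtime.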

The common confusion: on Unix/Linux, st_ctime is not creation time. Use st_mtime to detect file changes in build tools and cache invalidators.
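A build-tool style staleness check based on st_mtime might look like the sketch below. is_stale is a hypothetical helper, and os.utime() is used only to set timestamps deterministically for the demonstration.

```python
import os
import tempfile

def is_stale(source, target):
    """Return True if target is missing or older than source (by st_mtime)."""
    try:
        target_mtime = os.stat(target).st_mtime
    except FileNotFoundError:
        return True
    return os.stat(source).st_mtime > target_mtime

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "app.py")
    out = os.path.join(tmp, "app.pyc")
    open(src, "w").close()

    missing = is_stale(src, out)         # no target yet

    open(out, "w").close()
    os.utime(src, (1_000, 1_000))        # set (atime, mtime) explicitly
    os.utime(out, (2_000, 2_000))
    fresh = is_stale(src, out)           # target newer than source

    os.utime(src, (3_000, 3_000))
    rebuild_needed = is_stale(src, out)  # source changed after target

print(missing, fresh, rebuild_needed)  # True False True
```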

Q6: How is os.urandom() different from the random module, and when must you use os.urandom()?

Answer: random is a pseudo-random number generator (Mersenne Twister). By default it is seeded from OS randomness, falling back to the system time, but seeding is not the weak point: the generator is statistically high-quality yet cryptographically predictable. After observing 624 consecutive 32-bit outputs, an attacker can reconstruct the internal state and predict all future values.

os.urandom() reads from the OS cryptographically secure random number generator: the getrandom() system call on Linux when available, /dev/urandom on other Unix-like systems, and BCryptGenRandom on Windows (CryptGenRandom before Python 3.11). The kernel CSPRNG behind these interfaces mixes hardware entropy sources and interrupt timing, and its output is computationally infeasible to predict.

You must use os.urandom() (or the secrets module, which wraps it) for: session tokens, CSRF tokens, password reset links, API keys, encryption keys, nonces, and any value whose unpredictability has security implications. Use random only for simulations, games, and non-security random sampling.

Practice Challenges

Beginner - Directory File Counter

Write a function count_by_extension(directory) that returns a dictionary mapping each file extension (e.g., ".py", ".txt") to the number of files with that extension in the directory (non-recursive). Use os.scandir().

Solution
import os

def count_by_extension(directory):
    """
    Count files in a directory grouped by extension.
    Non-recursive. Uses os.scandir for efficiency.

    Args:
        directory: path to the directory to scan

    Returns:
        dict mapping extension -> count
        e.g., {'.py': 12, '.txt': 3, '(no ext)': 1}
    """
    counts = {}

    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                # os.path.splitext returns ('name', '.ext') or ('name', '')
                _, ext = os.path.splitext(entry.name)
                ext = ext.lower() if ext else "(no ext)"
                counts[ext] = counts.get(ext, 0) + 1

    return counts


# Demo
if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    result = count_by_extension(target)

    print(f"File types in {target}:")
    for ext, count in sorted(result.items(), key=lambda x: -x[1]):
        print(f"  {ext:15} {count:4} files")

# Example output for a Python project directory:
# File types in /Users/alice/myproject:
# .py 47 files
# .md 8 files
# .yaml 3 files
# (no ext) 2 files
# .json 1 files

Intermediate - Recursive Duplicate Finder

Write a function find_duplicates(directory) that recursively scans a directory and returns a dict where keys are file sizes (in bytes) and values are lists of file paths that share that size. Include only sizes with more than one file - these are potential duplicates. Skip hidden directories and __pycache__.

Solution
import os
from collections import defaultdict

def find_duplicates(directory):
    """
    Find potential duplicate files by matching file size.
    Files with the same size are candidates for deduplication.
    (True deduplication requires content hashing - this is step 1.)

    Args:
        directory: root directory to scan recursively

    Returns:
        dict: {size_bytes: [list_of_paths]} for sizes with 2+ files
    """
    skip_dirs = {"__pycache__", ".git", ".venv", "node_modules", ".mypy_cache"}
    size_map = defaultdict(list)  # size -> [paths]

    for dirpath, dirnames, filenames in os.walk(directory, topdown=True):
        # Prune hidden dirs and known noisy dirs
        dirnames[:] = [
            d for d in dirnames
            if d not in skip_dirs and not d.startswith(".")
        ]

        for fname in filenames:
            if fname.startswith("."):
                continue  # Skip hidden files
            fpath = os.path.join(dirpath, fname)
            try:
                size = os.stat(fpath).st_size
                if size > 0:  # Skip empty files
                    size_map[size].append(fpath)
            except (PermissionError, FileNotFoundError):
                pass  # Skip inaccessible files

    # Keep only sizes with multiple files
    duplicates = {
        size: paths
        for size, paths in size_map.items()
        if len(paths) > 1
    }

    return duplicates


def report_duplicates(directory):
    """Print a human-readable duplicate report."""
    dupes = find_duplicates(directory)

    if not dupes:
        print("No potential duplicates found.")
        return

    total_wasted = 0
    print(f"Potential duplicates in {directory}:\n")

    for size, paths in sorted(dupes.items(), key=lambda x: -x[0]):
        size_kb = size / 1024
        wasted = size * (len(paths) - 1)  # Could save this many bytes
        total_wasted += wasted

        print(f"  Size: {size_kb:.1f} KB - {len(paths)} files")
        for path in paths:
            print(f"    {path}")
        print()

    print(f"Potential savings if deduplicated: {total_wasted / 1024:.1f} KB")


# Demo usage
# report_duplicates("/Users/alice/Downloads")

# Note: size matching is not conclusive - two files can have the same size
# but different contents. For reliable deduplication, hash the content:
import hashlib

def hash_file(path, chunk_size=65536):
    """Return MD5 hash of file contents."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_true_duplicates(directory):
    """Find files with identical content (two-pass: size then hash)."""
    # First pass: group by size (cheap)
    size_candidates = find_duplicates(directory)

    # Second pass: hash only the candidate files (expensive for large files)
    hash_map = defaultdict(list)
    for paths in size_candidates.values():
        for path in paths:
            try:
                digest = hash_file(path)
                hash_map[digest].append(path)
            except (PermissionError, FileNotFoundError):
                pass

    return {h: paths for h, paths in hash_map.items() if len(paths) > 1}

Advanced - Secure Deployment Script with Permission Auditing

Write a deploy_static_files(src_dir, dest_dir) function that:

  1. Copies all non-hidden files from src_dir to dest_dir recursively using os.walk, os.makedirs, and shutil.copy2
  2. After copying, audits each file and sets permissions: directories get 0o755, regular files get 0o644, files ending in .sh get 0o755
  3. Detects any files that end up world-writable (stat.S_IWOTH) and raises a RuntimeError listing them
  4. Returns a summary dict: {"copied": N, "permission_errors": [...]}
Solution
import os
import stat
import shutil
from pathlib import Path


class DeploymentError(RuntimeError):
    """Raised when deployment encounters security violations."""


def deploy_static_files(src_dir, dest_dir):
    """
    Deploy static files from src_dir to dest_dir with secure permissions.

    Steps:
    1. Walk src_dir, skip hidden files/dirs
    2. Recreate directory structure in dest_dir
    3. Copy each file (preserving metadata with shutil.copy2)
    4. Set permissions: dirs=0o755, .sh files=0o755, others=0o644
    5. Audit for world-writable files and raise DeploymentError if found

    Args:
        src_dir: source directory path (str or Path)
        dest_dir: destination directory path (str or Path)

    Returns:
        dict: {"copied": int, "permission_errors": list[str]}

    Raises:
        DeploymentError: if any deployed file ends up world-writable
    """
    src_dir = str(src_dir)
    dest_dir = str(dest_dir)
    os.makedirs(dest_dir, exist_ok=True)

    summary = {"copied": 0, "permission_errors": []}
    skip_dirs = {".git", "__pycache__", ".venv", "node_modules"}

    # ── Phase 1: Copy files ───────────────────────────────────────────
    for dirpath, dirnames, filenames in os.walk(src_dir, topdown=True):
        # Skip hidden and noisy directories (in-place so os.walk sees the change)
        dirnames[:] = [
            d for d in dirnames
            if d not in skip_dirs and not d.startswith(".")
        ]

        # Compute the relative path from src_dir ("." for the root itself)
        rel_path = os.path.relpath(dirpath, src_dir)
        dest_subdir = os.path.normpath(os.path.join(dest_dir, rel_path))
        os.makedirs(dest_subdir, exist_ok=True)

        for fname in filenames:
            if fname.startswith("."):
                continue  # Skip hidden files

            src_file = os.path.join(dirpath, fname)
            dest_file = os.path.join(dest_subdir, fname)

            try:
                # copy2 preserves timestamps and mode bits; Phase 2 resets the mode
                shutil.copy2(src_file, dest_file)
                summary["copied"] += 1
            except PermissionError as e:
                summary["permission_errors"].append(f"copy failed: {src_file}: {e}")

    # ── Phase 2: Set permissions ──────────────────────────────────────
    for dirpath, dirnames, filenames in os.walk(dest_dir, topdown=True):
        # Set directory permissions
        try:
            os.chmod(dirpath, 0o755)
        except PermissionError as e:
            summary["permission_errors"].append(f"chmod dir failed: {dirpath}: {e}")

        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            # Shell scripts need execute bit; everything else gets 0o644
            target_mode = 0o755 if fname.endswith(".sh") else 0o644
            try:
                os.chmod(fpath, target_mode)
            except PermissionError as e:
                summary["permission_errors"].append(
                    f"chmod failed: {fpath}: {e}"
                )

    # ── Phase 3: Security audit ───────────────────────────────────────
    world_writable = []

    for dirpath, dirnames, filenames in os.walk(dest_dir):
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            try:
                mode = os.stat(fpath).st_mode
                if mode & stat.S_IWOTH:  # world-writable bit set
                    world_writable.append(fpath)
            except PermissionError:
                pass

    if world_writable:
        file_list = "\n  ".join(world_writable)
        raise DeploymentError(
            f"Security violation: {len(world_writable)} world-writable files found "
            f"after deployment:\n  {file_list}"
        )

    print("Deployment complete:")
    print(f"  Files copied: {summary['copied']}")
    print(f"  Perm errors: {len(summary['permission_errors'])}")
    for err in summary["permission_errors"]:
        print(f"  WARNING: {err}")

    return summary


# Demo usage
if __name__ == "__main__":
    import tempfile

    # Set up a test source directory
    with tempfile.TemporaryDirectory() as src:
        # Create some files
        Path(src, "index.html").write_text("<h1>Hello</h1>")
        Path(src, "style.css").write_text("body { margin: 0; }")
        Path(src, "deploy.sh").write_text("#!/bin/bash\necho deploying")
        Path(src, "subdir").mkdir()
        Path(src, "subdir", "app.js").write_text("console.log('app');")

        with tempfile.TemporaryDirectory() as dest:
            result = deploy_static_files(src, dest)
            print(f"\nResult: {result}")

            # Verify permissions
            for dirpath, _, filenames in os.walk(dest):
                for fname in filenames:
                    fpath = os.path.join(dirpath, fname)
                    mode = stat.S_IMODE(os.stat(fpath).st_mode)
                    expected = 0o755 if fname.endswith(".sh") else 0o644
                    status = "OK" if mode == expected else "MISMATCH"
                    print(f"  [{status}] {fname}: {oct(mode)}")

# Example output on a POSIX system (file order may vary):
# Deployment complete:
#   Files copied: 4
#   Perm errors: 0
#
# Result: {'copied': 4, 'permission_errors': []}
#   [OK] index.html: 0o644
#   [OK] style.css: 0o644
#   [OK] app.js: 0o644
#   [OK] deploy.sh: 0o755
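
The Phase 3 audit comes down to a single bit test against the file mode. As a minimal sketch of that check in isolation (the helper name is_world_writable is hypothetical, and the example assumes a POSIX system where os.chmod honors all permission bits):

```python
import os
import stat
import tempfile


def is_world_writable(path):
    """Return True if the 'other' write bit is set (hypothetical helper)."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IWOTH)  # S_IWOTH is the 0o002 bit


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "config.ini")
    with open(target, "w") as fh:
        fh.write("[app]\n")

    os.chmod(target, 0o666)  # rw-rw-rw-: any user can write
    loose = is_world_writable(target)

    os.chmod(target, 0o644)  # rw-r--r--: only the owner can write
    tight = is_world_writable(target)

print(loose, tight)
```

On a POSIX system this should print True False. On Windows, os.chmod only toggles the read-only flag, so the result there is not a meaningful audit.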

Quick Reference

| Operation | Code | Notes |
|---|---|---|
| Current directory | os.getcwd() | Returns absolute path string |
| Change directory | os.chdir(path) | Avoid in production - mutates global state |
| List directory | os.listdir(path) | Returns list of name strings |
| Scan directory | os.scandir(path) | Returns DirEntry objects - faster |
| Walk recursively | os.walk(path, topdown=True) | Yields (dirpath, dirnames, filenames) |
| Prune walk | dirnames[:] = [...] | Slice assignment in-place |
| File exists | os.path.exists(path) | Prefer Path(p).exists() |
| Is file | os.path.isfile(path) | Prefer Path(p).is_file() |
| Is directory | os.path.isdir(path) | Prefer Path(p).is_dir() |
| Join paths | os.path.join(a, b, c) | Prefer Path(a) / b / c |
| Split extension | os.path.splitext(name) | Returns ('stem', '.ext') |
| File metadata | os.stat(path) | Returns stat_result |
| File permissions | stat.S_IMODE(os.stat(p).st_mode) | Requires import stat |
| Set permissions | os.chmod(path, 0o644) | Octal mode |
| Create directory | os.makedirs(path, exist_ok=True) | Creates all intermediate dirs |
| Remove file | os.remove(path) | Single file only |
| Remove dir tree | shutil.rmtree(path) | Permanent - no undo |
| Rename/move | os.rename(src, dst) | Atomic on POSIX if same filesystem |
| Current PID | os.getpid() | Integer process ID |
| Parent PID | os.getppid() | Integer parent process ID |
| CPU count | os.cpu_count() | Number of logical CPUs |
| System load | os.getloadavg() | Unix only: (1m, 5m, 15m) tuple |
| Secure random | os.urandom(n) | n bytes of cryptographic entropy |
| Run command | subprocess.run([...], capture_output=True, text=True, check=True) | Never use os.system() |
| Read env var | os.environ.get("KEY", "default") | Safe - returns default if missing |
| Set env var | os.environ["KEY"] = "value" | Affects current process and children |
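
The "Rename/move" row is the basis of a common pattern: write to a temporary file, then rename it over the target so readers never see a half-written file. A minimal sketch, assuming the temp file can live in the target's directory; the helper name atomic_write_text is hypothetical, and it uses os.replace rather than os.rename because os.replace overwrites an existing destination on every platform:

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path via temp file + rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target,
    # so create it in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Remove the temp file if anything failed before the swap
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as workdir:
    settings = os.path.join(workdir, "settings.json")
    atomic_write_text(settings, '{"debug": false}')
    atomic_write_text(settings, '{"debug": true}')  # replaces the old file in one step
    with open(settings, encoding="utf-8") as fh:
        final_text = fh.read()
    leftovers = os.listdir(workdir)

print(final_text, leftovers)
```

One caveat with this sketch: tempfile.mkstemp creates the file with mode 0o600, so the target ends up owner-only unless you call os.chmod on it afterwards.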

Key Takeaways

  • os is the thin wrapper around POSIX/Win32 system calls - it handles process info, permissions, environment variables, and filesystem operations that pathlib does not cover
  • Use os.scandir() instead of os.listdir() whenever you need file type or stat information - it avoids extra system calls by caching directory entry metadata
  • os.walk() with topdown=True lets you prune the traversal by reassigning dirnames in place (dirnames[:] = [...]); use topdown=False for bottom-up operations such as deleting a directory tree
  • os.chdir() mutates global process state - avoid it in library code; build absolute paths instead
  • os.chmod() and os.stat() give you full control over Unix file permissions; stat.S_IMODE() extracts the permission bits from the full mode value
  • Never use os.system() - it passes the whole command string to a shell, so any interpolated input is open to shell injection, and it returns only an exit status, not the output; use subprocess.run(["cmd", "arg"], capture_output=True, text=True, check=True) instead
  • os.urandom() provides cryptographically secure random bytes; the secrets module (Python 3.6+) wraps it with a friendlier API for tokens and keys
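  • os.makedirs(path, exist_ok=True) is the safe way to create nested directories - it replaces the check-then-create pattern (if not os.path.exists(path): os.makedirs(path)), which has a TOCTOU race when another process creates the directory between the check and the call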
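
The subprocess.run takeaway above is easiest to see with hostile-looking input. A minimal sketch, assuming a child Python interpreter is available via sys.executable; the user_input value is made up for illustration:

```python
import subprocess
import sys

# A shell would treat ';' as a command separator and run the second command.
user_input = "report.txt; echo INJECTED"

# With a list, each element becomes exactly one argv entry; no shell parses it.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError on a non-zero exit code
)
echoed = result.stdout.strip()
print(echoed)  # report.txt; echo INJECTED
```

The child receives the whole string as one argument and prints it back unchanged. The same string interpolated into os.system() would be handed to the shell, which would run echo INJECTED as a second command.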
© 2026 EngineersOfAI. All rights reserved.