
Shell Scripting for ML Workflows

Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, MLOps Engineer, Research Engineer

The Training Launch That Corrupted 3 Days of Work

A research engineer at a mid-size ML company kicks off a distributed training run across 16 GPUs. The job runs for 72 hours. On hour 71, one of the workers crashes due to a disk-full error. The training loop catches the exception and exits gracefully - but the checkpoint manager had already deleted the previous checkpoint to save space. The run is unrecoverable. Three days of GPU time, gone.

The root cause was not the disk filling up. That is expected. The root cause was that the launch script had no error handling, the checkpoint cleanup ran unconditionally before saving the new checkpoint, and there was no monitoring script watching disk utilization. A 20-line bash script watching df -h and sending a Slack alert when disk usage exceeded 80% would have caught this 12 hours before the crash.
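As a concrete illustration, a watcher of that kind really can be this small. The sketch below assumes a SLACK_WEBHOOK environment variable and a /data mount point - both stand-ins for this example, not details from the incident:

#!/usr/bin/env bash
# disk_watch.sh - minimal sketch of the disk monitor described above
set -euo pipefail

MOUNT_POINT="${1:-/data}"       # assumed mount point
THRESHOLD_PCT="${2:-80}"
CHECK_INTERVAL=300              # seconds between checks

while true; do
    # Column 5 of df output is "Use%"; strip the trailing %
    used_pct=$(df "${MOUNT_POINT}" | awk 'NR==2 {print $5}' | tr -d '%')

    if (( used_pct >= THRESHOLD_PCT )); then
        msg="Disk usage on ${MOUNT_POINT} is at ${used_pct}% (threshold ${THRESHOLD_PCT}%)"
        echo "[$(date -Is)] ALERT: ${msg}"
        # SLACK_WEBHOOK is an assumed environment variable
        if [[ -n "${SLACK_WEBHOOK:-}" ]]; then
            curl -s -X POST "${SLACK_WEBHOOK}" \
                -H 'Content-type: application/json' \
                --data "{\"text\": \"${msg}\"}" || true
        fi
    fi

    sleep "${CHECK_INTERVAL}"
done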

Shell scripting is not glamorous. Every ML engineer would rather write Python. But bash is the language of the environment - it is what runs when you SSH into a training node, what CI systems execute, what Kubernetes init containers run. When a distributed training job fails at 3 AM, you are debugging with bash, not Jupyter notebooks. The engineers who move fastest in ML infrastructure are the ones who are genuinely fluent in bash.

This lesson covers the bash features that matter most for ML workflows: robust error handling with set -euo pipefail, parallel execution with GNU parallel and background jobs, argument parsing patterns, remote execution for multi-node training, monitoring loops that watch GPU utilization and disk space, and checkpoint management scripts that do not destroy your data when they fail.

Every code example in this lesson reflects patterns used in production ML systems. The training launcher script alone has saved real teams from the exact scenario described above.

Why This Exists - The Gap Between Python and the Environment

Python is excellent for ML logic. It is not excellent for orchestrating system-level operations. Consider what happens when you launch a distributed training job: you need to check which GPUs are available on which nodes, set environment variables like MASTER_ADDR and MASTER_PORT, start processes on remote nodes via SSH, wait for all processes to finish, handle partial failures, and clean up if anything goes wrong.

You could write all of this in Python using subprocess, paramiko, and multiprocessing. Teams do. But the result is hundreds of lines of Python that replicate what bash does natively in 30 lines. Bash is designed for process orchestration - every operator (|, &, >, <(...)) is a concurrency or I/O primitive.

The other reason is ubiquity. Every Linux system has bash. Not every system has Python 3.11 with your specific ML library versions. When you are initializing a fresh GPU node, the first thing you run is a bash script that installs your dependencies - you cannot use Python for that.
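To make this concrete, a node bootstrap script might look like the sketch below. The package list, virtualenv path, and requirements.txt location are assumptions for illustration, not a prescribed setup:

#!/usr/bin/env bash
# bootstrap_node.sh - illustrative sketch of first-boot setup on a fresh GPU node
set -euo pipefail

sudo apt-get update -y
sudo apt-get install -y git tmux htop python3-venv python3-pip

# Isolated Python environment for the training code
python3 -m venv "${HOME}/venvs/train"
source "${HOME}/venvs/train/bin/activate"

pip install --upgrade pip
pip install -r requirements.txt   # assumed to exist in the synced repo

# Sanity check: is the GPU driver visible?
nvidia-smi || echo "WARNING: nvidia-smi failed - check the driver installation" >&2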

Historical Context - The Shell as the Original Orchestrator

The Unix shell predates every ML framework by decades. Ken Thompson wrote the first Unix shell in 1971. The Bourne shell, which bash (Bourne Again SHell) is named after, was written by Stephen Bourne at Bell Labs in 1979. The design philosophy was composability: small tools (grep, sed, awk, sort) that each do one thing well, connected by pipes.

GNU parallel was released by Ole Tange in 2011, adding the ability to run shell commands in parallel across multiple CPU cores or remote machines, with a syntax nearly identical to xargs. GNU parallel is effectively map-reduce in bash: parallel command ::: input1 input2 input3 runs command on all three inputs concurrently.

The tmux terminal multiplexer (2007) solved the persistent session problem for ML engineers - you can start a 3-day training run in a tmux session, detach, and reattach from anywhere. Combined with remote SSH, this is still how most interactive ML training is managed even in the era of cloud platforms.

Core Concepts

Robust Error Handling

The single most important pattern in production bash scripts is set -euo pipefail at the top of every script. Without it, bash cheerfully continues executing after errors.

#!/usr/bin/env bash
# ALWAYS start scripts with this line
set -euo pipefail

# What each flag does:
# -e exit immediately if any command returns non-zero exit code
# -u treat unset variables as errors (prevents typo bugs like $MDOEL_DIR)
# -o pipefail a pipe fails if ANY stage fails, not just the last one

# Without -o pipefail, this would silently succeed even if python fails:
# python train.py | tee train.log
# With -o pipefail, the pipe fails if python fails.

# trap: run cleanup on exit (even on error or SIGINT/SIGTERM)
cleanup() {
local exit_code=$?
echo "[$(date -Is)] Script exiting with code ${exit_code}"
# Kill all background jobs started by this script
jobs -p | xargs -r kill 2>/dev/null || true
exit "${exit_code}"
}
trap cleanup EXIT

# trap SIGINT/SIGTERM specifically for graceful shutdown during training
handle_interrupt() {
echo "[$(date -Is)] Received interrupt signal, cleaning up..."
# Save partial checkpoint if training is running
if [[ -n "${TRAINING_PID:-}" ]]; then
kill -SIGUSR1 "${TRAINING_PID}" 2>/dev/null || true
wait "${TRAINING_PID}" || true
fi
exit 130 # standard exit code for SIGINT
}
trap handle_interrupt SIGINT SIGTERM

Variables, Arrays, and Conditionals

#!/usr/bin/env bash
set -euo pipefail

# ---------------------------------------------------------------
# Variable declaration patterns
# ---------------------------------------------------------------

# Default values: use ${VAR:-default} to avoid -u errors
EXPERIMENT_NAME="${EXPERIMENT_NAME:-unnamed_experiment}"
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/training_output}"
NUM_WORKERS="${NUM_WORKERS:-4}"

# Integer arithmetic
NUM_GPUS=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
BATCH_SIZE=$(( 32 * NUM_GPUS )) # scale batch size with GPU count
echo "Using ${NUM_GPUS} GPUs, batch size ${BATCH_SIZE}"

# String manipulation
MODEL_PATH="/models/experiment_2024/checkpoint_final.pt"
MODEL_DIR="${MODEL_PATH%/*}" # remove everything after last /
MODEL_NAME="${MODEL_PATH##*/}" # remove everything before last /
STEM="${MODEL_NAME%.pt}" # remove .pt suffix
echo "Dir: ${MODEL_DIR}, Name: ${STEM}"

# Arrays
GPU_IDS=( 0 1 2 3 )
NODES=( node-01 node-02 node-03 node-04 )

# Iterate over array
for gpu_id in "${GPU_IDS[@]}"; do
echo "GPU ${gpu_id}"
done

# Array length
echo "Number of GPUs: ${#GPU_IDS[@]}"

# Slicing: first 2 GPUs
echo "First 2: ${GPU_IDS[@]:0:2}"

# ---------------------------------------------------------------
# Conditionals
# ---------------------------------------------------------------

# File/directory checks
if [[ ! -d "${OUTPUT_DIR}" ]]; then
mkdir -p "${OUTPUT_DIR}"
echo "Created output directory: ${OUTPUT_DIR}"
fi

if [[ ! -f "configs/experiment.yaml" ]]; then
echo "ERROR: config file not found" >&2
exit 1
fi

# Integer comparison
if (( NUM_GPUS == 0 )); then
echo "WARNING: no GPUs found, falling back to CPU" >&2
elif (( NUM_GPUS < 4 )); then
echo "WARNING: fewer than 4 GPUs available (${NUM_GPUS})"
fi

# String comparison
if [[ "${EXPERIMENT_NAME}" == *"debug"* ]]; then
echo "Debug mode: reducing dataset size"
MAX_SAMPLES=1000
fi

# Command availability check
if ! command -v nvcc &>/dev/null; then
echo "ERROR: nvcc not found - CUDA not installed" >&2
exit 1
fi

Argument Parsing with getopts

#!/usr/bin/env bash
set -euo pipefail

# Script: train_launcher.sh
# Usage: ./train_launcher.sh -e <experiment_name> -g <num_gpus> [-c <config>] [-d]

usage() {
cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Options:
-e EXPERIMENT Experiment name (required)
-g NUM_GPUS Number of GPUs to use (required)
-c CONFIG Path to config file (default: configs/default.yaml)
-o OUTPUT_DIR Output directory (default: outputs/EXPERIMENT)
-d Dry run - print commands without executing
-h Show this help message

Examples:
$(basename "$0") -e resnet50_baseline -g 4
$(basename "$0") -e bert_finetune -g 8 -c configs/bert.yaml
EOF
}

# Defaults
CONFIG="configs/default.yaml"
DRY_RUN=false
EXPERIMENT=""
NUM_GPUS=""
OUTPUT_DIR=""

# getopts: parse single-character flags
while getopts "e:g:c:o:dh" opt; do
case "${opt}" in
e) EXPERIMENT="${OPTARG}" ;;
g) NUM_GPUS="${OPTARG}" ;;
c) CONFIG="${OPTARG}" ;;
o) OUTPUT_DIR="${OPTARG}" ;;
d) DRY_RUN=true ;;
h) usage; exit 0 ;;
*) usage; exit 1 ;;
esac
done

shift $(( OPTIND - 1 )) # remove parsed flags from $@

# Validate required arguments
if [[ -z "${EXPERIMENT}" ]]; then
echo "ERROR: -e EXPERIMENT is required" >&2
usage
exit 1
fi

if [[ -z "${NUM_GPUS}" ]]; then
echo "ERROR: -g NUM_GPUS is required" >&2
usage
exit 1
fi

# Set derived defaults
OUTPUT_DIR="${OUTPUT_DIR:-outputs/${EXPERIMENT}}"

# run_or_dry: print command, execute only if not dry run
run_or_dry() {
echo "+ $*"
if [[ "${DRY_RUN}" == false ]]; then
"$@"
fi
}

echo "Experiment: ${EXPERIMENT}"
echo "GPUs: ${NUM_GPUS}"
echo "Config: ${CONFIG}"
echo "Output: ${OUTPUT_DIR}"
echo "Dry run: ${DRY_RUN}"

Pipelines, Redirections, and Process Substitution

#!/usr/bin/env bash
set -euo pipefail

# ---------------------------------------------------------------
# Basic redirections
# ---------------------------------------------------------------

# Redirect stdout AND stderr to a log file
python train.py > training.log 2>&1

# Redirect stderr to stdout (combine them), pipe to tee
python train.py 2>&1 | tee training.log

# Here-string: pass a string as stdin to a command
python -c "import sys; print(sys.stdin.read().strip())" <<< "hello from a here-string"

# Here-doc: multi-line stdin
python - <<'PYTHON'
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
PYTHON

# ---------------------------------------------------------------
# Process substitution: treat command output as a file
# ---------------------------------------------------------------

# Compare two sorted lists without creating temp files
diff <(sort file1.txt) <(sort file2.txt)

# Pass the output of a command as a file argument
# Useful for commands that require a file path, not stdin
python evaluate.py \
--predictions <(python predict.py --model model.pt) \
--labels data/test_labels.csv

# ---------------------------------------------------------------
# Pipelines for log processing
# ---------------------------------------------------------------

# Parse training loss from log file (real-time)
tail -f training.log \
| grep --line-buffered "train_loss" \
| awk '{print $NF}' # print last field (the loss value)

# Extract best validation accuracy across all experiments
for log in outputs/*/training.log; do
experiment=$(dirname "${log}" | xargs basename)
best_acc=$(grep "val_accuracy" "${log}" \
| awk '{print $NF}' \
| sort -n \
| tail -1)
printf "%-30s %s\n" "${experiment}" "${best_acc}"
done

# Count failed training runs
grep -l "Error\|Exception\|Killed" outputs/*/training.log \
| wc -l

Parallel Execution

Parallel execution is critical for ML workflows - you often need to preprocess thousands of files, run multiple experiments, or download dozens of data shards simultaneously.

#!/usr/bin/env bash
set -euo pipefail

# ---------------------------------------------------------------
# Background jobs (&) and wait
# ---------------------------------------------------------------

# Launch 4 preprocessing jobs in parallel
for shard in data/raw/shard_{0,1,2,3}.parquet; do
python preprocess.py --input "${shard}" &
done

# Wait for all background jobs to complete
# IMPORTANT: plain 'wait' (no arguments) returns 0 even if some jobs failed.
# Use the pattern below to catch ANY failure:
wait_all() {
local exit_code=0
for pid in "$@"; do
wait "${pid}" || exit_code=$?
done
return "${exit_code}"
}

pids=()
for shard in data/raw/shard_*.parquet; do
python preprocess.py --input "${shard}" &
pids+=( $! ) # $! is PID of the most recent background job
done
wait_all "${pids[@]}"

# ---------------------------------------------------------------
# xargs -P: parallel execution with concurrency limit
# ---------------------------------------------------------------

# Download 20 files, 8 at a time
# -P 8 = max 8 parallel jobs
# -I {} = replace {} with the input line
cat urls.txt | xargs -P 8 -I {} wget -q {}

# Preprocess all shards with 8 workers
ls data/raw/*.parquet \
| xargs -P 8 -I {} python preprocess.py --input {}

# ---------------------------------------------------------------
# GNU parallel: more powerful, handles complex cases
# ---------------------------------------------------------------

# Install: apt-get install parallel / brew install parallel

# Process all images with 8 workers, show progress
parallel --progress -j 8 \
python scripts/resize_image.py --input {} --output data/resized/ \
::: data/raw/images/*.jpg

# Parameter sweep: all combinations of lr and batch_size
parallel -j 4 \
python train.py --lr {1} --batch-size {2} --name "lr{1}_bs{2}" \
::: 0.001 0.0001 0.00001 \
::: 32 64 128

# Run jobs on remote machines (requires SSH key auth)
parallel -j 2 -S node-01,node-02,node-03 \
python preprocess.py --shard {} \
::: $(seq 0 23)

# ---------------------------------------------------------------
# Job control: fg, bg, jobs
# ---------------------------------------------------------------

# Send a running process to the background
# (Ctrl+Z to suspend, then bg to resume in background)
python train.py & # start in background
jobs # list background jobs: [1]+ Running python train.py
fg %1 # bring job 1 back to foreground
wait %1 # wait for job 1 without bringing to foreground

Complete Training Launcher Script

#!/usr/bin/env bash
# train_launcher.sh
# Launches distributed PyTorch training with GPU detection,
# environment setup, and robust error handling.

set -euo pipefail

# ---------------------------------------------------------------
# Argument parsing
# ---------------------------------------------------------------
usage() {
cat <<EOF
Usage: $(basename "$0") [OPTIONS]

-e EXPERIMENT Experiment name (required)
-c CONFIG Config file (default: configs/default.yaml)
-n NUM_NODES Number of nodes (default: 1)
-g GPUS_PER_NODE GPUs per node (default: auto-detect)
-o OUTPUT_DIR Output directory (default: outputs/EXPERIMENT)
-r RESUME Path to checkpoint to resume from
-d Dry run
-h Show help

Examples:
$(basename "$0") -e bert_base -c configs/bert.yaml -n 2 -g 8
$(basename "$0") -e debug_run -c configs/tiny.yaml -d
EOF
}

EXPERIMENT=""
CONFIG="configs/default.yaml"
NUM_NODES=1
GPUS_PER_NODE=""
OUTPUT_DIR=""
RESUME=""
DRY_RUN=false

while getopts "e:c:n:g:o:r:dh" opt; do
case "${opt}" in
e) EXPERIMENT="${OPTARG}" ;;
c) CONFIG="${OPTARG}" ;;
n) NUM_NODES="${OPTARG}" ;;
g) GPUS_PER_NODE="${OPTARG}" ;;
o) OUTPUT_DIR="${OPTARG}" ;;
r) RESUME="${OPTARG}" ;;
d) DRY_RUN=true ;;
h) usage; exit 0 ;;
*) usage >&2; exit 1 ;;
esac
done

if [[ -z "${EXPERIMENT}" ]]; then
echo "ERROR: -e EXPERIMENT is required" >&2
usage >&2
exit 1
fi

# ---------------------------------------------------------------
# GPU detection
# ---------------------------------------------------------------
detect_gpus() {
if ! command -v nvidia-smi &>/dev/null; then
echo "0"
return
fi
nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null \
| wc -l \
| tr -d ' '
}

AVAILABLE_GPUS=$(detect_gpus)
GPUS_PER_NODE="${GPUS_PER_NODE:-${AVAILABLE_GPUS}}"

if (( GPUS_PER_NODE == 0 )); then
echo "WARNING: no GPUs detected, using CPU" >&2
DEVICE="cpu"
else
DEVICE="cuda"
echo "Detected ${AVAILABLE_GPUS} GPUs, using ${GPUS_PER_NODE} per node"
fi

WORLD_SIZE=$(( NUM_NODES * GPUS_PER_NODE ))
OUTPUT_DIR="${OUTPUT_DIR:-outputs/${EXPERIMENT}}"
LOG_FILE="${OUTPUT_DIR}/train.log"

# ---------------------------------------------------------------
# Environment validation
# ---------------------------------------------------------------
validate_environment() {
echo "Validating environment..."

# Check config exists
if [[ ! -f "${CONFIG}" ]]; then
echo "ERROR: config file not found: ${CONFIG}" >&2
exit 1
fi

# Check Python environment
if ! python -c "import torch" 2>/dev/null; then
echo "ERROR: PyTorch not importable" >&2
exit 1
fi

local torch_version
torch_version=$(python -c "import torch; print(torch.__version__)")
echo "PyTorch version: ${torch_version}"

# Check disk space (require at least 50 GB free).
# OUTPUT_DIR may not exist yet at this point, so check its parent
# directory and fall back to the current directory.
local df_target
df_target=$(dirname "${OUTPUT_DIR}")
[[ -d "${df_target}" ]] || df_target="."
local free_gb
free_gb=$(df -BG "${df_target}" | awk 'NR==2 {print $4}' | tr -d 'G')
if (( free_gb < 50 )); then
echo "WARNING: only ${free_gb}GB free disk space" >&2
fi
}

# ---------------------------------------------------------------
# Setup
# ---------------------------------------------------------------
setup_output_dir() {
mkdir -p "${OUTPUT_DIR}"
cp "${CONFIG}" "${OUTPUT_DIR}/config.yaml"

# Save launch metadata
cat > "${OUTPUT_DIR}/launch_info.json" <<JSON
{
"experiment": "${EXPERIMENT}",
"config": "${CONFIG}",
"num_nodes": ${NUM_NODES},
"gpus_per_node": ${GPUS_PER_NODE},
"world_size": ${WORLD_SIZE},
"launched_at": "$(date -Is)",
"git_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
"hostname": "$(hostname)"
}
JSON
}

# ---------------------------------------------------------------
# Launch training
# ---------------------------------------------------------------
launch_training() {
local resume_args=()
if [[ -n "${RESUME}" ]]; then
resume_args=( --resume "${RESUME}" )
fi

# Build the torchrun arguments as an array
local torchrun_args=()
if (( NUM_NODES == 1 )); then
# Single-node: standalone rendezvous
torchrun_args=( --standalone --nproc-per-node "${GPUS_PER_NODE}" )
else
# Multi-node: requires MASTER_ADDR to be set
if [[ -z "${MASTER_ADDR:-}" ]]; then
echo "ERROR: MASTER_ADDR must be set for multi-node training" >&2
exit 1
fi
MASTER_PORT="${MASTER_PORT:-29500}"
torchrun_args=(
--nnodes "${NUM_NODES}"
--nproc-per-node "${GPUS_PER_NODE}"
--rdzv-backend c10d
--rdzv-endpoint "${MASTER_ADDR}:${MASTER_PORT}"
)
fi

local cmd=(
torchrun "${torchrun_args[@]}"
src/train.py
--config "${CONFIG}"
--output "${OUTPUT_DIR}"
--experiment "${EXPERIMENT}"
)
# Append the resume flag only when set (avoids empty-array issues on bash < 4.4)
if (( ${#resume_args[@]} > 0 )); then
cmd+=( "${resume_args[@]}" )
fi

echo "+ ${cmd[*]}"
if [[ "${DRY_RUN}" == true ]]; then
return 0
fi

# Background torchrun directly, writing to the log file, so that $! is the
# real training PID. Backgrounding a "... | tee log &" pipeline instead would
# make $! the PID of tee: kill/wait would target the wrong process and the
# training exit code would be lost. Follow progress with: tail -f "${LOG_FILE}"
"${cmd[@]}" > "${LOG_FILE}" 2>&1 &

TRAINING_PID=$!
export TRAINING_PID

echo "Training PID: ${TRAINING_PID}"
echo "${TRAINING_PID}" > "${OUTPUT_DIR}/training.pid"
}

# ---------------------------------------------------------------
# Monitor disk space during training
# ---------------------------------------------------------------
monitor_disk() {
local threshold_pct=85
local check_interval=60

while kill -0 "${TRAINING_PID}" 2>/dev/null; do
local used_pct
used_pct=$(df "${OUTPUT_DIR}" | awk 'NR==2 {print $5}' \
| tr -d '%')

if (( used_pct >= threshold_pct )); then
echo "[$(date -Is)] WARNING: disk at ${used_pct}% - cleaning old checkpoints..."
# Remove all but the 2 most recent checkpoints
# (|| true keeps the monitor alive if no checkpoints exist yet)
ls -t "${OUTPUT_DIR}"/checkpoint_step_*.pt 2>/dev/null \
| tail -n +3 \
| xargs -r rm -v || true
fi

sleep "${check_interval}"
done
}

# ---------------------------------------------------------------
# Cleanup on exit
# ---------------------------------------------------------------
cleanup() {
local code=$?
if [[ -n "${TRAINING_PID:-}" ]] && kill -0 "${TRAINING_PID}" 2>/dev/null; then
echo "Sending SIGTERM to training process ${TRAINING_PID}..."
kill "${TRAINING_PID}"
wait "${TRAINING_PID}" 2>/dev/null || true
fi
echo "[$(date -Is)] Exit code: ${code}"
exit "${code}"
}
trap cleanup EXIT SIGINT SIGTERM

# ---------------------------------------------------------------
# Main
# ---------------------------------------------------------------
main() {
echo "=============================================="
echo " Training Launcher"
echo " Experiment: ${EXPERIMENT}"
echo " Nodes: ${NUM_NODES}, GPUs/node: ${GPUS_PER_NODE}"
echo " World size: ${WORLD_SIZE}"
echo "=============================================="

validate_environment
setup_output_dir
launch_training

# Start disk monitor in background
monitor_disk &
MONITOR_PID=$!

# Wait for training to finish.
# (|| is needed so set -e does not abort before we capture the exit code.)
TRAIN_EXIT=0
wait "${TRAINING_PID}" || TRAIN_EXIT=$?

# Stop disk monitor
kill "${MONITOR_PID}" 2>/dev/null || true

if (( TRAIN_EXIT == 0 )); then
echo "Training completed successfully"
echo "Output: ${OUTPUT_DIR}"
else
echo "Training FAILED with exit code ${TRAIN_EXIT}" >&2
exit "${TRAIN_EXIT}"
fi
}

main

Multi-Node Training Coordinator

#!/usr/bin/env bash
# multi_node_train.sh
# Coordinates a multi-node PyTorch training run by SSH-ing into
# each worker node and launching the training process.

set -euo pipefail

NODES_FILE="${1:-nodes.txt}" # file containing one hostname per line
CONFIG="${2:-configs/default.yaml}"
EXPERIMENT="${3:-distributed_run}"

if [[ ! -f "${NODES_FILE}" ]]; then
echo "ERROR: nodes file not found: ${NODES_FILE}" >&2
exit 1
fi

# Read nodes into array
mapfile -t NODES < "${NODES_FILE}"
NUM_NODES=${#NODES[@]}
MASTER_NODE="${NODES[0]}"

echo "Launching ${NUM_NODES}-node training"
echo "Master node: ${MASTER_NODE}"
echo "All nodes: ${NODES[*]}"

# Detect GPUs on master node (assume homogeneous cluster)
GPUS_PER_NODE=$(ssh "${MASTER_NODE}" \
"nvidia-smi --query-gpu=name --format=csv,noheader | wc -l")
echo "GPUs per node: ${GPUS_PER_NODE}"

# Sync code to all nodes
for node in "${NODES[@]}"; do
echo "Syncing code to ${node}..."
rsync -azq --delete \
--exclude='.git' \
--exclude='outputs/' \
--exclude='data/' \
--exclude='*.pyc' \
--exclude='__pycache__' \
. "${node}:~/training/"
done
echo "Code sync complete"

# Launch training on each node
pids=()
for i in "${!NODES[@]}"; do
node="${NODES[$i]}"
rank=$i

echo "Launching on ${node} (rank ${rank})..."

ssh "${node}" "
cd ~/training
export MASTER_ADDR=${MASTER_NODE}
export MASTER_PORT=29500
export NODE_RANK=${rank}

torchrun \
--nnodes ${NUM_NODES} \
--nproc-per-node ${GPUS_PER_NODE} \
--node-rank ${rank} \
--rdzv-backend c10d \
--rdzv-endpoint ${MASTER_NODE}:29500 \
src/train.py \
--config ${CONFIG} \
--output outputs/${EXPERIMENT} \
> outputs/${EXPERIMENT}/node_${rank}.log 2>&1
" &

pids+=( $! )
done

echo "All ${NUM_NODES} nodes launched. Waiting for completion..."

# Wait for all nodes, capture failures
failed=0
for i in "${!pids[@]}"; do
pid="${pids[$i]}"
node="${NODES[$i]}"

if wait "${pid}"; then
echo "Node ${node}: SUCCESS"
else
echo "Node ${node}: FAILED" >&2
failed=$(( failed + 1 ))
fi
done

if (( failed > 0 )); then
echo "ERROR: ${failed} nodes failed" >&2
exit 1
fi

echo "All nodes completed successfully"

GPU Monitoring Script

#!/usr/bin/env bash
# gpu_monitor.sh
# Continuously monitors GPU utilization, memory, and temperature.
# Alerts if utilization drops (indicating a training stall).

set -euo pipefail

EXPERIMENT="${1:-unknown}"
LOG_FILE="${2:-/tmp/gpu_monitor.log}"
ALERT_THRESHOLD_UTIL="${GPU_ALERT_UTIL:-20}" # alert if util drops below 20%
CHECK_INTERVAL=30
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"

log() {
echo "[$(date -Is)] $*" | tee -a "${LOG_FILE}"
}

send_alert() {
local message="$1"
log "ALERT: ${message}"

if [[ -n "${SLACK_WEBHOOK}" ]]; then
curl -s -X POST "${SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data "{\"text\": \"[${EXPERIMENT}] ${message}\"}" \
|| true
fi
}

monitor_gpus() {
local consecutive_low=0
local low_util_limit=3 # alert after 3 consecutive low readings

while true; do
# Query all GPUs: index, utilization%, memory_used_MiB, memory_total_MiB, temp_C
local gpu_stats
gpu_stats=$(nvidia-smi \
--query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
--format=csv,noheader,nounits 2>/dev/null)

if [[ -z "${gpu_stats}" ]]; then
send_alert "nvidia-smi returned no output - GPU driver issue?"
sleep "${CHECK_INTERVAL}"
continue
fi

local any_low=false
while IFS=', ' read -r idx util mem_used mem_total temp; do
log "GPU ${idx}: util=${util}% mem=${mem_used}/${mem_total}MiB temp=${temp}C"

if (( util < ALERT_THRESHOLD_UTIL )); then
any_low=true
fi

if (( temp > 85 )); then
send_alert "GPU ${idx} temperature critical: ${temp}C"
fi

# Memory pressure warning
local mem_pct=$(( mem_used * 100 / mem_total ))
if (( mem_pct > 95 )); then
send_alert "GPU ${idx} memory at ${mem_pct}% (${mem_used}/${mem_total} MiB)"
fi
done <<< "${gpu_stats}"

# Track consecutive low utilization readings
if [[ "${any_low}" == true ]]; then
consecutive_low=$(( consecutive_low + 1 ))
if (( consecutive_low >= low_util_limit )); then
send_alert "GPU utilization below ${ALERT_THRESHOLD_UTIL}% for " \
"$((consecutive_low * CHECK_INTERVAL)) seconds - training may have stalled"
consecutive_low=0 # reset to avoid spam
fi
else
consecutive_low=0
fi

sleep "${CHECK_INTERVAL}"
done
}

log "Starting GPU monitor for experiment: ${EXPERIMENT}"
log "Alert threshold: util < ${ALERT_THRESHOLD_UTIL}%"
monitor_gpus

Parallel Data Download Script

#!/usr/bin/env bash
# download_dataset.sh
# Downloads a sharded dataset in parallel, verifies checksums,
# and retries failed downloads.

set -euo pipefail

BASE_URL="${1:?Usage: $0 BASE_URL OUTPUT_DIR}"
OUTPUT_DIR="${2:?Usage: $0 BASE_URL OUTPUT_DIR}"
NUM_WORKERS="${3:-8}"
MANIFEST="${OUTPUT_DIR}/manifest.json"

mkdir -p "${OUTPUT_DIR}"

# Fetch the manifest (list of shards + expected checksums)
echo "Fetching manifest..."
curl -fsSL "${BASE_URL}/manifest.json" -o "${MANIFEST}"

# Parse manifest with jq to extract shard names and checksums
mapfile -t SHARDS < <(jq -r '.shards[].filename' "${MANIFEST}")
echo "Found ${#SHARDS[@]} shards to download"

# Download a single shard with retry logic
download_shard() {
local filename="$1"
local url="${BASE_URL}/${filename}"
local output="${OUTPUT_DIR}/${filename}"
local expected_sha256
expected_sha256=$(jq -r --arg f "${filename}" \
'.shards[] | select(.filename == $f) | .sha256' "${MANIFEST}")

# Skip if already downloaded with correct checksum
if [[ -f "${output}" ]]; then
local actual_sha256
actual_sha256=$(sha256sum "${output}" | awk '{print $1}')
if [[ "${actual_sha256}" == "${expected_sha256}" ]]; then
echo "SKIP (already complete): ${filename}"
return 0
else
echo "CHECKSUM MISMATCH, re-downloading: ${filename}"
rm "${output}"
fi
fi

# Download with retry
local max_retries=3
local attempt=0
while (( attempt < max_retries )); do
attempt=$(( attempt + 1 ))
echo "Downloading ${filename} (attempt ${attempt}/${max_retries})..."

if curl -fsSL --retry 3 "${url}" -o "${output}"; then
# Verify checksum
local actual_sha256
actual_sha256=$(sha256sum "${output}" | awk '{print $1}')
if [[ "${actual_sha256}" == "${expected_sha256}" ]]; then
echo "OK: ${filename}"
return 0
else
echo "CHECKSUM FAILED for ${filename}" >&2
rm "${output}"
fi
fi

(( attempt < max_retries )) && sleep $(( attempt * 5 ))
done

echo "FAILED after ${max_retries} attempts: ${filename}" >&2
return 1
}

export -f download_shard
export BASE_URL OUTPUT_DIR MANIFEST

# Download all shards in parallel
echo "Downloading with ${NUM_WORKERS} workers..."
printf '%s\n' "${SHARDS[@]}" \
| parallel --progress -j "${NUM_WORKERS}" \
download_shard {}

echo "All ${#SHARDS[@]} shards downloaded and verified"

Checkpoint Cleanup Script

#!/usr/bin/env bash
# cleanup_checkpoints.sh
# Safely removes old checkpoints, keeping only the N most recent
# and any checkpoints marked as "best" by the training script.

set -euo pipefail

CHECKPOINT_DIR="${1:?Usage: $0 CHECKPOINT_DIR [KEEP_N]}"
KEEP_N="${2:-5}"

if [[ ! -d "${CHECKPOINT_DIR}" ]]; then
echo "ERROR: directory not found: ${CHECKPOINT_DIR}" >&2
exit 1
fi

# Find all step checkpoints (not "best" or "final" checkpoints)
mapfile -t STEP_CHECKPOINTS < <(
find "${CHECKPOINT_DIR}" \
-name "checkpoint_step_*.pt" \
-not -name "*best*" \
-not -name "*final*" \
| sort -t_ -k3 -n # sort by step number
)

TOTAL=${#STEP_CHECKPOINTS[@]}
TO_DELETE=$(( TOTAL - KEEP_N ))

if (( TO_DELETE <= 0 )); then
echo "Only ${TOTAL} checkpoints found, keeping all (threshold: ${KEEP_N})"
exit 0
fi

echo "Found ${TOTAL} step checkpoints, deleting ${TO_DELETE} oldest..."

# Delete oldest checkpoints (first TO_DELETE in sorted list)
for (( i=0; i < TO_DELETE; i++ )); do
ckpt="${STEP_CHECKPOINTS[$i]}"
echo "Deleting: $(basename "${ckpt}")"
rm "${ckpt}"
done

echo "Done. Kept ${KEEP_N} most recent step checkpoints."
echo "Best/final checkpoints were preserved."

Log Parsing with awk, sed, jq

#!/usr/bin/env bash
set -euo pipefail

TRAINING_LOG="${1:-training.log}"

# ---------------------------------------------------------------
# Extract training loss trend using awk
# (the 3-argument match() used in these awk programs is a GNU awk extension)
# ---------------------------------------------------------------
echo "=== Training Loss ==="
grep "train_loss" "${TRAINING_LOG}" \
| awk '{
# Extract step and loss from lines like: step=1000 train_loss=0.342
match($0, /step=([0-9]+)/, step_arr)
match($0, /train_loss=([0-9.]+)/, loss_arr)
if (step_arr[1] && loss_arr[1])
printf "Step %6s: %.4f\n", step_arr[1], loss_arr[1]
}'

# ---------------------------------------------------------------
# Find the step with best validation accuracy
# ---------------------------------------------------------------
echo ""
echo "=== Best Validation Accuracy ==="
grep "val_accuracy" "${TRAINING_LOG}" \
| awk '{
match($0, /step=([0-9]+)/, s)
match($0, /val_accuracy=([0-9.]+)/, a)
if (s[1] && a[1] && a[1]+0 > best+0) {
best = a[1]+0
best_step = s[1]
}
}
END { printf "Best: %.4f at step %s\n", best, best_step }'

# ---------------------------------------------------------------
# Parse JSON metrics file with jq
# ---------------------------------------------------------------
METRICS_FILE="metrics/eval.json"

if [[ -f "${METRICS_FILE}" ]]; then
echo ""
echo "=== Evaluation Metrics ==="
jq -r '
"Accuracy: " + (.accuracy | tostring),
"F1 Score: " + (.f1 | tostring),
"Throughput: " + (.throughput_samples_per_sec | tostring) + " samples/s"
' "${METRICS_FILE}"

# Extract per-class F1 scores
echo ""
echo "=== Per-Class F1 ==="
jq -r '.per_class_f1 | to_entries[] | "\(.key): \(.value)"' \
"${METRICS_FILE}" \
| sort -t: -k2 -n
fi

# ---------------------------------------------------------------
# Count errors and warnings in the log
# ---------------------------------------------------------------
echo ""
echo "=== Log Summary ==="
echo "Errors: $(grep -c "ERROR\|Exception\|Traceback" "${TRAINING_LOG}" || true)"
echo "Warnings: $(grep -c "WARNING\|UserWarning" "${TRAINING_LOG}" || true)"
echo "OOM events: $(grep -c "out of memory\|CUDA out of memory" "${TRAINING_LOG}" || true)"


Production Engineering Notes

Always use set -euo pipefail: This is non-negotiable for production scripts. Without -e, a failed command is silently ignored. Without -u, typos in variable names ($MDOEL_DIR) expand to empty string and cause subtle bugs. Without -o pipefail, a failed command in a pipeline only fails the pipeline if it is the last command.
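A quick way to see the pipefail difference for yourself (safe to run in an interactive shell; false stands in for a failing training command):

# Without pipefail, a pipeline's exit status is the last command's:
set +o pipefail
false | wc -l
echo "exit without pipefail: $?"   # 0 - the failure of `false` is masked by wc

# With pipefail, any failing stage fails the whole pipeline:
set -o pipefail
false | wc -l
echo "exit with pipefail: $?"      # 1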

Logging with timestamps: Every echo in a production script should include a timestamp: echo "[$(date -Is)] message". When you are debugging a failure at 3 AM and the log file has a thousand lines, timestamps tell you exactly when the failure happened relative to other events.

SSH and remote execution: Always use SSH key authentication (never password authentication in scripts). Use ssh -o ConnectTimeout=10 -o BatchMode=yes to fail fast if a node is unreachable. The BatchMode=yes flag disables interactive prompts - essential for scripts that cannot wait for user input.
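For example, a pre-flight reachability check over a node list (using the same one-hostname-per-line nodes.txt convention as the coordinator script above) might look like this:

# Fail fast if any worker node is unreachable before launching anything
while read -r node; do
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "${node}" true 2>/dev/null; then
        echo "OK: ${node}"
    else
        echo "UNREACHABLE: ${node}" >&2
        exit 1
    fi
done < nodes.txt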

tmux for long-running jobs: The standard workflow for interactive training on a remote machine is: tmux new -s training, start the training script, Ctrl+B D to detach, and tmux attach -t training to reconnect. Always name your tmux sessions after the experiment. Use tmux ls to list running sessions.
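In commands, that workflow looks like this (the session name is just an example):

tmux new -s resnet50_baseline            # create a named session
./train_launcher.sh -e resnet50_baseline -g 8
# Ctrl+B then D detaches; training keeps running on the remote machine

tmux ls                                  # list running sessions
tmux attach -t resnet50_baseline         # reattach later, even from a new SSH login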

Signal handling and checkpoint safety: Before deleting an old checkpoint, always verify that the new checkpoint has been fully written by checking file size and optionally loading the state dict. A pattern like save_checkpoint_tmp() then mv (atomic on the same filesystem) is safer than writing directly to the final path.
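A sketch of that pattern from the shell side - the filenames, the --save-path flag, and the 1 KB minimum-size check are illustrative assumptions:

# Save-then-swap: never leave a window where no valid checkpoint exists
save_and_rotate() {
    local ckpt_dir="$1" step="$2"
    local tmp="${ckpt_dir}/checkpoint_step_${step}.pt.tmp"
    local final="${ckpt_dir}/checkpoint_step_${step}.pt"

    # 1. The training code writes to the temporary path first
    #    (e.g. python train.py ... --save-path "${tmp}" - an assumed flag)

    # 2. Verify the file exists and is non-trivially sized before trusting it
    if [[ ! -s "${tmp}" ]] || (( $(stat -c%s "${tmp}") < 1024 )); then
        echo "ERROR: new checkpoint looks incomplete, keeping old ones" >&2
        return 1
    fi

    # 3. Atomic rename (same filesystem), then - and only then - delete old ones
    mv "${tmp}" "${final}"
    ls -t "${ckpt_dir}"/checkpoint_step_*.pt | tail -n +3 | xargs -r rm -v
}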

:::danger Dangerous Patterns to Avoid
Do not use rm -rf with unvalidated variables. A bug that leaves OUTPUT_DIR empty turns rm -rf ${OUTPUT_DIR}/checkpoints into rm -rf /checkpoints. Double-quoting ("${OUTPUT_DIR}/checkpoints") protects against word splitting and globbing, but not against the empty-variable case - for that, use ${OUTPUT_DIR:?} (which aborts if the variable is unset or empty) or explicitly validate that the variable is non-empty before running destructive commands.

Do not run parallel or xargs -P without a concurrency limit on shared infrastructure. Running parallel -j 0 (unlimited parallelism) on a shared NFS server will saturate I/O and impact everyone using that server. Always set a reasonable concurrency limit.

Do not capture sensitive values in log files. Training scripts often log their full argument list. If your script accepts --api-key or --password flags, those values will appear in the training log. Use environment variables for secrets and never log $* or the full argument list. :::
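For instance, pass credentials through the environment rather than flags (WANDB_API_KEY and the /run/secrets path are stand-in names for this sketch):

# BAD: the key lands in ps output, shell history, and the training log
# python train.py --api-key "sk-abc123"

# BETTER: export the secret from a file the script never prints
export WANDB_API_KEY="$(cat /run/secrets/wandb_key)"
python train.py --config configs/default.yaml 2>&1 | tee training.log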

:::warning Common Pitfalls
Glob patterns and spaces in filenames: The pattern for f in data/*.pt; do python process.py $f; done breaks if any filename contains a space. Always quote variables: "$f". Or use find ... -print0 | xargs -0 for the most robust handling.

wait without a PID hides failures: Plain wait (without a PID argument) waits for all background jobs but returns 0 even if some of them failed, regardless of bash version. Always capture PIDs with $! immediately after launching a background job and pass them to wait.

SSH StrictHostKeyChecking in CI: When your CI script SSH-es into a worker node it has never connected to before, SSH will prompt "Are you sure you want to continue connecting?" and hang. Add -o StrictHostKeyChecking=accept-new (OpenSSH 7.6+) to accept unknown hosts non-interactively, or -o StrictHostKeyChecking=no for older clients - and understand the security implications of both. :::

Interview Questions and Answers

Q1: Explain what set -euo pipefail does and why each flag matters for ML scripts.

A: -e causes the shell to exit immediately when any command returns a non-zero exit code, preventing the script from continuing after a failure. -u treats unset variable references as errors - without this, a typo like ${MDOEL_DIR} silently expands to empty string, which can cause rm -rf to run in the wrong directory. -o pipefail makes a pipeline fail if any stage in the pipe fails - without it, python train.py | tee training.log would report success even if the Python script failed, because tee succeeds. Together, these three flags make bash scripts fail fast and loudly instead of silently continuing with corrupted state.

Q2: You have 10,000 image files to preprocess. How would you do this in parallel in bash?

A: There are three main approaches. Using xargs -P: find data/images -name "*.jpg" | xargs -P 8 -I {} python preprocess.py {} - this limits concurrency to 8 workers and is available everywhere. Using GNU parallel: parallel -j 8 --progress python preprocess.py {} ::: data/images/*.jpg - this has better progress reporting and error handling. Using background jobs with manual PID tracking: start workers with &, collect PIDs with $!, and use wait PID to detect individual failures. For production use, GNU parallel is preferred because it handles argument quoting correctly, provides progress bars, and collects exit codes from all jobs.

Q3: How do you safely handle checkpoint cleanup in a training script so you never lose the most recent checkpoint?

A: The pattern is: (1) save the new checkpoint to a temporary filename first (checkpoint_step_1000.pt.tmp), (2) verify it was written correctly (check file size is non-zero, optionally load and verify state dict), (3) atomically rename it to the final name with mv (which is atomic when source and destination are on the same filesystem), and (4) only then delete old checkpoints. The key insight is that mv on the same filesystem is atomic - there is no moment where the file does not exist. Never delete the old checkpoint before the new one is fully written and verified.

Q4: You are running a distributed training job across 4 nodes. Node 2 fails halfway through training. How do you detect this and what should your bash script do?

A: Each node's SSH command is run as a background job with its PID captured. The coordinator script calls wait PID for each node. If any wait call returns non-zero, that node failed. The script should: (1) send an alert (Slack, email, PagerDuty) immediately; (2) kill all other nodes' training processes with SSH to avoid them hanging waiting for the dead worker; (3) preserve all available checkpoints; and (4) exit with a non-zero exit code so the CI/CD system knows the training failed. With PyTorch's c10d rendezvous backend, you can also configure automatic node fault tolerance - but the bash wrapper should still detect and report failures independently.

Q5: What is the difference between 2>&1 | tee log.txt and > log.txt 2>&1, and when would you use each for ML training?

A: > log.txt 2>&1 redirects both stdout and stderr to the file, with nothing visible on the terminal. 2>&1 | tee log.txt sends stderr to stdout (merging them), then pipes to tee which writes to the file AND the terminal simultaneously. For ML training, 2>&1 | tee training.log is almost always preferred - you want to see progress bars and loss values in real time on the terminal while also preserving them in the log file. The only downside is that tee adds a tiny amount of overhead, which is negligible for training. One subtlety: with set -o pipefail, if the Python training script fails, the pipe will also fail, and your script will exit with an error as expected.

Q6: How would you write a bash script that monitors GPU utilization every 30 seconds and sends an alert if utilization drops below 20% for more than 5 minutes?

A: The key elements are: (1) a loop that calls nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits and parses the output; (2) a consecutive-low counter that increments each time utilization is below threshold and resets when utilization recovers; (3) an alert condition when the counter reaches 5 minutes / 30 seconds = 10 consecutive readings; and (4) a mechanism to send the alert (curl to Slack webhook, mail, or writing to a file that another process monitors). The script runs as a background process alongside training, and both are controlled by the launcher script's trap cleanup EXIT handler.

Q7: Explain process substitution and give an ML-specific example of where it is useful.

A: Process substitution (<(command)) makes the output of a command appear as if it were a file - bash creates a named pipe (like /dev/fd/63) that the outer command reads from. This avoids creating temporary files. An ML-specific example: comparing model predictions from two checkpoints without saving intermediate files: diff <(python predict.py --model checkpoint_step_1000.pt --data test.csv) <(python predict.py --model checkpoint_step_2000.pt --data test.csv). Another use case: some evaluation scripts require a file path argument and cannot read from stdin. You can use process substitution to pass streaming data: python evaluate.py --predictions <(python generate_predictions.py --model model.pt).

© 2026 EngineersOfAI. All rights reserved.