
Shell Scripting for ML Workflows

Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, MLOps Engineer, Research Engineer

The Training Launch That Corrupted 3 Days of Work

A research engineer at a mid-size ML company kicks off a distributed training run across 16 GPUs. The job runs for 72 hours. On hour 71, one of the workers crashes due to a disk-full error. The training loop catches the exception and exits gracefully - but the checkpoint manager had already deleted the previous checkpoint to save space. The run is unrecoverable. Three days of GPU time, gone.

The root cause was not the disk filling up. That is expected. The root cause was that the launch script had no error handling, the checkpoint cleanup ran unconditionally before saving the new checkpoint, and there was no monitoring script watching disk utilization. A 20-line bash script watching df -h and sending a Slack alert when disk usage exceeded 80% would have caught this 12 hours before the crash.
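As a concrete illustration, a watcher of that kind really can be this small. The sketch below assumes a SLACK_WEBHOOK environment variable and a /data mount point - both stand-ins for this example, not details from the incident:

#!/usr/bin/env bash
# disk_watch.sh - minimal sketch of the disk monitor described above
set -euo pipefail

MOUNT_POINT="${1:-/data}"       # assumed mount point
THRESHOLD_PCT="${2:-80}"
CHECK_INTERVAL=300              # seconds between checks

while true; do
    # Column 5 of df output is "Use%"; strip the trailing %
    used_pct=$(df "${MOUNT_POINT}" | awk 'NR==2 {print $5}' | tr -d '%')

    if (( used_pct >= THRESHOLD_PCT )); then
        msg="Disk usage on ${MOUNT_POINT} is at ${used_pct}% (threshold ${THRESHOLD_PCT}%)"
        echo "[$(date -Is)] ALERT: ${msg}"
        # SLACK_WEBHOOK is an assumed environment variable
        if [[ -n "${SLACK_WEBHOOK:-}" ]]; then
            curl -s -X POST "${SLACK_WEBHOOK}" \
                -H 'Content-type: application/json' \
                --data "{\"text\": \"${msg}\"}" || true
        fi
    fi

    sleep "${CHECK_INTERVAL}"
done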

Shell scripting is not glamorous. Every ML engineer would rather write Python. But bash is the language of the environment - it is what runs when you SSH into a training node, what CI systems execute, what Kubernetes init containers run. When a distributed training job fails at 3 AM, you are debugging with bash, not Jupyter notebooks. The engineers who move fastest in ML infrastructure are the ones who are genuinely fluent in bash.

This lesson covers the bash features that matter most for ML workflows: robust error handling with set -euo pipefail, parallel execution with GNU parallel and background jobs, argument parsing patterns, remote execution for multi-node training, monitoring loops that watch GPU utilization and disk space, and checkpoint management scripts that do not destroy your data when they fail.

Every code example in this lesson reflects patterns used in production ML systems. The training launcher script alone has saved real teams from the exact scenario described above.

Why This Exists - The Gap Between Python and the Environment

Python is excellent for ML logic. It is not excellent for orchestrating system-level operations. Consider what happens when you launch a distributed training job: you need to check which GPUs are available on which nodes, set environment variables like MASTER_ADDR and MASTER_PORT, start processes on remote nodes via SSH, wait for all processes to finish, handle partial failures, and clean up if anything goes wrong.

You could write all of this in Python using subprocess, paramiko, and multiprocessing. Teams do. But the result is hundreds of lines of Python that replicate what bash does natively in 30 lines. Bash is designed for process orchestration - every operator (|, &, >, <(...)) is a concurrency or I/O primitive.

The other reason is ubiquity. Every Linux system has bash. Not every system has Python 3.11 with your specific ML library versions. When you are initializing a fresh GPU node, the first thing you run is a bash script that installs your dependencies - you cannot use Python for that.
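To make this concrete, a node bootstrap script might look like the sketch below. The package list, virtualenv path, and requirements.txt location are assumptions for illustration, not a prescribed setup:

#!/usr/bin/env bash
# bootstrap_node.sh - illustrative sketch of first-boot setup on a fresh GPU node
set -euo pipefail

sudo apt-get update -y
sudo apt-get install -y git tmux htop python3-venv python3-pip

# Isolated Python environment for the training code
python3 -m venv "${HOME}/venvs/train"
source "${HOME}/venvs/train/bin/activate"

pip install --upgrade pip
pip install -r requirements.txt   # assumed to exist in the synced repo

# Sanity check: is the GPU driver visible?
nvidia-smi || echo "WARNING: nvidia-smi failed - check the driver installation" >&2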

Historical Context - The Shell as the Original Orchestrator

The Unix shell predates every ML framework by decades. Ken Thompson wrote the first Unix shell in 1971. The Bourne shell, which bash (Bourne Again SHell) is named after, was written by Stephen Bourne at Bell Labs in 1979. The design philosophy was composability: small tools (grep, sed, awk, sort) that each do one thing well, connected by pipes.

GNU parallel was released by Ole Tange in 2011, adding the ability to run shell commands in parallel across multiple CPU cores or remote machines, with a syntax nearly identical to xargs. GNU parallel is effectively map-reduce in bash: parallel command ::: input1 input2 input3 runs command on all three inputs concurrently.

The tmux terminal multiplexer (2007) solved the persistent session problem for ML engineers - you can start a 3-day training run in a tmux session, detach, and reattach from anywhere. Combined with remote SSH, this is still how most interactive ML training is managed even in the era of cloud platforms.

Core Concepts

Robust Error Handling

The single most important pattern in production bash scripts is set -euo pipefail at the top of every script. Without it, bash cheerfully continues executing after errors.

#!/usr/bin/env bash
# ALWAYS start scripts with this line
set -euo pipefail

# What each flag does:
# -e exit immediately if any command returns non-zero exit code
# -u treat unset variables as errors (prevents typo bugs like $MDOEL_DIR)
# -o pipefail a pipe fails if ANY stage fails, not just the last one

# Without -o pipefail, this would silently succeed even if python fails:
# python train.py | tee train.log
# With -o pipefail, the pipe fails if python fails.

# trap: run cleanup on exit (even on error or SIGINT/SIGTERM)
cleanup() {
local exit_code=$?
echo "[$(date -Is)] Script exiting with code ${exit_code}"
# Kill all background jobs started by this script
jobs -p | xargs -r kill 2>/dev/null || true
exit "${exit_code}"
}
trap cleanup EXIT

# trap SIGINT/SIGTERM specifically for graceful shutdown during training
handle_interrupt() {
echo "[$(date -Is)] Received interrupt signal, cleaning up..."
# Save partial checkpoint if training is running
if [[ -n "${TRAINING_PID:-}" ]]; then
kill -SIGUSR1 "${TRAINING_PID}" 2>/dev/null || true
wait "${TRAINING_PID}" || true
fi
exit 130 # standard exit code for SIGINT
}
trap handle_interrupt SIGINT SIGTERM

Variables, Arrays, and Conditionals

#!/usr/bin/env bash
set -euo pipefail

# ---------------------------------------------------------------
# Variable declaration patterns
# ---------------------------------------------------------------

# Default values: use ${VAR:-default} to avoid -u errors
EXPERIMENT_NAME="${EXPERIMENT_NAME:-unnamed_experiment}"
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/training_output}"
NUM_WORKERS="${NUM_WORKERS:-4}"

# Integer arithmetic
NUM_GPUS=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
BATCH_SIZE=$(( 32 * NUM_GPUS )) # scale batch size with GPU count
echo "Using ${NUM_GPUS} GPUs, batch size ${BATCH_SIZE}"

# String manipulation
MODEL_PATH="/models/experiment_2024/checkpoint_final.pt"
MODEL_DIR="${MODEL_PATH%/*}" # remove everything after last /
MODEL_NAME="${MODEL_PATH##*/}" # remove everything before last /
STEM="${MODEL_NAME%.pt}" # remove .pt suffix
echo "Dir: ${MODEL_DIR}, Name: ${STEM}"

# Arrays
GPU_IDS=( 0 1 2 3 )
NODES=( node-01 node-02 node-03 node-04 )

# Iterate over array
for gpu_id in "${GPU_IDS[@]}"; do
echo "GPU ${gpu_id}"
done

# Array length
echo "Number of GPUs: ${#GPU_IDS[@]}"

# Slicing: first 2 GPUs
echo "First 2: ${GPU_IDS[@]:0:2}"

# ---------------------------------------------------------------
# Conditionals
# ---------------------------------------------------------------

# File/directory checks
if [[ ! -d "${OUTPUT_DIR}" ]]; then
mkdir -p "${OUTPUT_DIR}"
echo "Created output directory: ${OUTPUT_DIR}"
fi

if [[ ! -f "configs/experiment.yaml" ]]; then
echo "ERROR: config file not found" >&2
exit 1
fi

# Integer comparison
if (( NUM_GPUS == 0 )); then
echo "WARNING: no GPUs found, falling back to CPU" >&2
elif (( NUM_GPUS < 4 )); then
echo "WARNING: fewer than 4 GPUs available (${NUM_GPUS})"
fi

# String comparison
if [[ "${EXPERIMENT_NAME}" == *"debug"* ]]; then
echo "Debug mode: reducing dataset size"
MAX_SAMPLES=1000
fi

# Command availability check
if ! command -v nvcc &>/dev/null; then
echo "ERROR: nvcc not found - CUDA not installed" >&2
exit 1
fi

Argument Parsing with getopts

#!/usr/bin/env bash
set -euo pipefail

# Script: train_launcher.sh
# Usage: ./train_launcher.sh -e <experiment_name> -g <num_gpus> [-c <config>] [-d]

usage() {
cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Options:
-e EXPERIMENT Experiment name (required)
-g NUM_GPUS Number of GPUs to use (required)
-c CONFIG Path to config file (default: configs/default.yaml)
-o OUTPUT_DIR Output directory (default: outputs/EXPERIMENT)
-d Dry run - print commands without executing
-h Show this help message

Examples:
$(basename "$0") -e resnet50_baseline -g 4
$(basename "$0") -e bert_finetune -g 8 -c configs/bert.yaml
EOF
}

# Defaults
CONFIG="configs/default.yaml"
DRY_RUN=false
EXPERIMENT=""
NUM_GPUS=""
OUTPUT_DIR=""

# getopts: parse single-character flags
while getopts "e:g:c:o:dh" opt; do
case "${opt}" in
e) EXPERIMENT="${OPTARG}" ;;
g) NUM_GPUS="${OPTARG}" ;;
c) CONFIG="${OPTARG}" ;;
o) OUTPUT_DIR="${OPTARG}" ;;
d) DRY_RUN=true ;;
h) usage; exit 0 ;;
*) usage; exit 1 ;;
esac
done

shift $(( OPTIND - 1 )) # remove parsed flags from $@

# Validate required arguments
if [[ -z "${EXPERIMENT}" ]]; then
echo "ERROR: -e EXPERIMENT is required" >&2
usage
exit 1
fi

if [[ -z "${NUM_GPUS}" ]]; then
echo "ERROR: -g NUM_GPUS is required" >&2
usage
exit 1
fi

# Set derived defaults
OUTPUT_DIR="${OUTPUT_DIR:-outputs/${EXPERIMENT}}"

# run_or_dry: print command, execute only if not dry run
run_or_dry() {
echo "+ $*"
if [[ "${DRY_RUN}" == false ]]; then
"$@"
fi
}

echo "Experiment: ${EXPERIMENT}"
echo "GPUs: ${NUM_GPUS}"
echo "Config: ${CONFIG}"
echo "Output: ${OUTPUT_DIR}"
echo "Dry run: ${DRY_RUN}"

Pipelines, Redirections, and Process Substitution

#!/usr/bin/env bash
set -euo pipefail

# ---------------------------------------------------------------
# Basic redirections
# ---------------------------------------------------------------

# Redirect stdout AND stderr to a log file
python train.py > training.log 2>&1

# Redirect stderr to stdout (combine them), pipe to tee
python train.py 2>&1 | tee training.log

# Here-string: pass a string as stdin to a command
python -c "import sys; print(sys.stdin.read().strip())" <<< "hello from a here-string"

# Here-doc: multi-line stdin
python - <<'PYTHON'
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
PYTHON

# ---------------------------------------------------------------
# Process substitution: treat command output as a file
# ---------------------------------------------------------------

# Compare two sorted lists without creating temp files
diff <(sort file1.txt) <(sort file2.txt)

# Pass the output of a command as a file argument
# Useful for commands that require a file path, not stdin
python evaluate.py \
--predictions <(python predict.py --model model.pt) \
--labels data/test_labels.csv

# ---------------------------------------------------------------
# Pipelines for log processing
# ---------------------------------------------------------------

# Parse training loss from log file (real-time)
tail -f training.log \
| grep --line-buffered "train_loss" \
| awk '{print $NF}' # print last field (the loss value)

# Extract best validation accuracy across all experiments
for log in outputs/*/training.log; do
experiment=$(dirname "${log}" | xargs basename)
best_acc=$(grep "val_accuracy" "${log}" \
| awk '{print $NF}' \
| sort -n \
| tail -1)
printf "%-30s %s\n" "${experiment}" "${best_acc}"
done

# Count failed training runs
grep -l "Error\|Exception\|Killed" outputs/*/training.log \
| wc -l

Parallel Execution

Parallel execution is critical for ML workflows - you often need to preprocess thousands of files, run multiple experiments, or download dozens of data shards simultaneously.

#!/usr/bin/env bash
set -euo pipefail

# ---------------------------------------------------------------
# Background jobs (&) and wait
# ---------------------------------------------------------------

# Launch 4 preprocessing jobs in parallel
for shard in data/raw/shard_{0,1,2,3}.parquet; do
python preprocess.py --input "${shard}" &
done

# Wait for all background jobs to complete
# IMPORTANT: plain 'wait' (no arguments) returns 0 even if some jobs failed.
# Use the pattern below to catch ANY failure:
wait_all() {
local exit_code=0
for pid in "$@"; do
wait "${pid}" || exit_code=$?
done
return "${exit_code}"
}

pids=()
for shard in data/raw/shard_*.parquet; do
python preprocess.py --input "${shard}" &
pids+=( $! ) # $! is PID of the most recent background job
done
wait_all "${pids[@]}"

# ---------------------------------------------------------------
# xargs -P: parallel execution with concurrency limit
# ---------------------------------------------------------------

# Download 20 files, 8 at a time
# -P 8 = max 8 parallel jobs
# -I {} = replace {} with the input line
cat urls.txt | xargs -P 8 -I {} wget -q {}

# Preprocess all shards with 8 workers
ls data/raw/*.parquet \
| xargs -P 8 -I {} python preprocess.py --input {}

# ---------------------------------------------------------------
# GNU parallel: more powerful, handles complex cases
# ---------------------------------------------------------------

# Install: apt-get install parallel / brew install parallel

# Process all images with 8 workers, show progress
parallel --progress -j 8 \
python scripts/resize_image.py --input {} --output data/resized/ \
::: data/raw/images/*.jpg

# Parameter sweep: all combinations of lr and batch_size
parallel -j 4 \
python train.py --lr {1} --batch-size {2} --name "lr{1}_bs{2}" \
::: 0.001 0.0001 0.00001 \
::: 32 64 128

# Run jobs on remote machines (requires SSH key auth)
parallel -j 2 -S node-01,node-02,node-03 \
python preprocess.py --shard {} \
::: $(seq 0 23)

# ---------------------------------------------------------------
# Job control: fg, bg, jobs
# ---------------------------------------------------------------

# Send a running process to the background
# (Ctrl+Z to suspend, then bg to resume in background)
python train.py & # start in background
jobs # list background jobs: [1]+ Running python train.py
fg %1 # bring job 1 back to foreground
wait %1 # wait for job 1 without bringing to foreground

Complete Training Launcher Script

#!/usr/bin/env bash
# train_launcher.sh
# Launches distributed PyTorch training with GPU detection,
# environment setup, and robust error handling.

set -euo pipefail

# ---------------------------------------------------------------
# Argument parsing
# ---------------------------------------------------------------
usage() {
cat <<EOF
Usage: $(basename "$0") [OPTIONS]

-e EXPERIMENT Experiment name (required)
-c CONFIG Config file (default: configs/default.yaml)
-n NUM_NODES Number of nodes (default: 1)
-g GPUS_PER_NODE GPUs per node (default: auto-detect)
-o OUTPUT_DIR Output directory (default: outputs/EXPERIMENT)
-r RESUME Path to checkpoint to resume from
-d Dry run
-h Show help

Examples:
$(basename "$0") -e bert_base -c configs/bert.yaml -n 2 -g 8
$(basename "$0") -e debug_run -c configs/tiny.yaml -d
EOF
}

EXPERIMENT=""
CONFIG="configs/default.yaml"
NUM_NODES=1
GPUS_PER_NODE=""
OUTPUT_DIR=""
RESUME=""
DRY_RUN=false

while getopts "e:c:n:g:o:r:dh" opt; do
case "${opt}" in
e) EXPERIMENT="${OPTARG}" ;;
c) CONFIG="${OPTARG}" ;;
n) NUM_NODES="${OPTARG}" ;;
g) GPUS_PER_NODE="${OPTARG}" ;;
o) OUTPUT_DIR="${OPTARG}" ;;
r) RESUME="${OPTARG}" ;;
d) DRY_RUN=true ;;
h) usage; exit 0 ;;
*) usage >&2; exit 1 ;;
esac
done

if [[ -z "${EXPERIMENT}" ]]; then
echo "ERROR: -e EXPERIMENT is required" >&2
usage >&2
exit 1
fi

# ---------------------------------------------------------------
# GPU detection
# ---------------------------------------------------------------
detect_gpus() {
if ! command -v nvidia-smi &>/dev/null; then
echo "0"
return
fi
nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null \
| wc -l \
| tr -d ' '
}

AVAILABLE_GPUS=$(detect_gpus)
GPUS_PER_NODE="${GPUS_PER_NODE:-${AVAILABLE_GPUS}}"

if (( GPUS_PER_NODE == 0 )); then
echo "WARNING: no GPUs detected, using CPU" >&2
DEVICE="cpu"
else
DEVICE="cuda"
echo "Detected ${AVAILABLE_GPUS} GPUs, using ${GPUS_PER_NODE} per node"
fi

WORLD_SIZE=$(( NUM_NODES * GPUS_PER_NODE ))
OUTPUT_DIR="${OUTPUT_DIR:-outputs/${EXPERIMENT}}"
LOG_FILE="${OUTPUT_DIR}/train.log"

# ---------------------------------------------------------------
# Environment validation
# ---------------------------------------------------------------
validate_environment() {
echo "Validating environment..."

# Check config exists
if [[ ! -f "${CONFIG}" ]]; then
echo "ERROR: config file not found: ${CONFIG}" >&2
exit 1
fi

# Check Python environment
if ! python -c "import torch" 2>/dev/null; then
echo "ERROR: PyTorch not importable" >&2
exit 1
fi

local torch_version
torch_version=$(python -c "import torch; print(torch.__version__)")
echo "PyTorch version: ${torch_version}"

# Check disk space (require at least 50 GB free).
# OUTPUT_DIR may not exist yet at this point, so check its parent
# directory and fall back to the current directory.
local df_target
df_target=$(dirname "${OUTPUT_DIR}")
[[ -d "${df_target}" ]] || df_target="."
local free_gb
free_gb=$(df -BG "${df_target}" | awk 'NR==2 {print $4}' | tr -d 'G')
if (( free_gb < 50 )); then
echo "WARNING: only ${free_gb}GB free disk space" >&2
fi
}

# ---------------------------------------------------------------
# Setup
# ---------------------------------------------------------------
setup_output_dir() {
mkdir -p "${OUTPUT_DIR}"
cp "${CONFIG}" "${OUTPUT_DIR}/config.yaml"

# Save launch metadata
cat > "${OUTPUT_DIR}/launch_info.json" <<JSON
{
"experiment": "${EXPERIMENT}",
"config": "${CONFIG}",
"num_nodes": ${NUM_NODES},
"gpus_per_node": ${GPUS_PER_NODE},
"world_size": ${WORLD_SIZE},
"launched_at": "$(date -Is)",
"git_commit": "$(git rev-parse HEAD 2>/dev/null || echo unknown)",
"hostname": "$(hostname)"
}
JSON
}

# ---------------------------------------------------------------
# Launch training
# ---------------------------------------------------------------
launch_training() {
local resume_args=()
if [[ -n "${RESUME}" ]]; then
resume_args=( --resume "${RESUME}" )
fi

# Build the torchrun arguments as an array
local torchrun_args=()
if (( NUM_NODES == 1 )); then
# Single-node: standalone rendezvous
torchrun_args=( --standalone --nproc-per-node "${GPUS_PER_NODE}" )
else
# Multi-node: requires MASTER_ADDR to be set
if [[ -z "${MASTER_ADDR:-}" ]]; then
echo "ERROR: MASTER_ADDR must be set for multi-node training" >&2
exit 1
fi
MASTER_PORT="${MASTER_PORT:-29500}"
torchrun_args=(
--nnodes "${NUM_NODES}"
--nproc-per-node "${GPUS_PER_NODE}"
--rdzv-backend c10d
--rdzv-endpoint "${MASTER_ADDR}:${MASTER_PORT}"
)
fi

local cmd=(
torchrun "${torchrun_args[@]}"
src/train.py
--config "${CONFIG}"
--output "${OUTPUT_DIR}"
--experiment "${EXPERIMENT}"
)
# Append the resume flag only when set (avoids empty-array issues on bash < 4.4)
if (( ${#resume_args[@]} > 0 )); then
cmd+=( "${resume_args[@]}" )
fi

echo "+ ${cmd[*]}"
if [[ "${DRY_RUN}" == true ]]; then
return 0
fi

# Background torchrun directly, writing to the log file, so that $! is the
# real training PID. Backgrounding a "... | tee log &" pipeline instead would
# make $! the PID of tee: kill/wait would target the wrong process and the
# training exit code would be lost. Follow progress with: tail -f "${LOG_FILE}"
"${cmd[@]}" > "${LOG_FILE}" 2>&1 &

TRAINING_PID=$!
export TRAINING_PID

echo "Training PID: ${TRAINING_PID}"
echo "${TRAINING_PID}" > "${OUTPUT_DIR}/training.pid"
}

# ---------------------------------------------------------------
# Monitor disk space during training
# ---------------------------------------------------------------
monitor_disk() {
local threshold_pct=85
local check_interval=60

while kill -0 "${TRAINING_PID}" 2>/dev/null; do
local used_pct
used_pct=$(df "${OUTPUT_DIR}" | awk 'NR==2 {print $5}' \
| tr -d '%')

if (( used_pct >= threshold_pct )); then
echo "[$(date -Is)] WARNING: disk at ${used_pct}% - cleaning old checkpoints..."
# Remove all but the 2 most recent checkpoints
# (|| true keeps the monitor alive if no checkpoints exist yet)
ls -t "${OUTPUT_DIR}"/checkpoint_step_*.pt 2>/dev/null \
| tail -n +3 \
| xargs -r rm -v || true
fi

sleep "${check_interval}"
done
}

# ---------------------------------------------------------------
# Cleanup on exit
# ---------------------------------------------------------------
cleanup() {
local code=$?
if [[ -n "${TRAINING_PID:-}" ]] && kill -0 "${TRAINING_PID}" 2>/dev/null; then
echo "Sending SIGTERM to training process ${TRAINING_PID}..."
kill "${TRAINING_PID}"
wait "${TRAINING_PID}" 2>/dev/null || true
fi
echo "[$(date -Is)] Exit code: ${code}"
exit "${code}"
}
trap cleanup EXIT SIGINT SIGTERM

# ---------------------------------------------------------------
# Main
# ---------------------------------------------------------------
main() {
echo "=============================================="
echo " Training Launcher"
echo " Experiment: ${EXPERIMENT}"
echo " Nodes: ${NUM_NODES}, GPUs/node: ${GPUS_PER_NODE}"
echo " World size: ${WORLD_SIZE}"
echo "=============================================="

validate_environment
setup_output_dir
launch_training

# Start disk monitor in background
monitor_disk &
MONITOR_PID=$!

# Wait for training to finish.
# (|| is needed so set -e does not abort before we capture the exit code.)
TRAIN_EXIT=0
wait "${TRAINING_PID}" || TRAIN_EXIT=$?

# Stop disk monitor
kill "${MONITOR_PID}" 2>/dev/null || true

if (( TRAIN_EXIT == 0 )); then
echo "Training completed successfully"
echo "Output: ${OUTPUT_DIR}"
else
echo "Training FAILED with exit code ${TRAIN_EXIT}" >&2
exit "${TRAIN_EXIT}"
fi
}

main

Multi-Node Training Coordinator

#!/usr/bin/env bash
# multi_node_train.sh
# Coordinates a multi-node PyTorch training run by SSH-ing into
# each worker node and launching the training process.

set -euo pipefail

NODES_FILE="${1:-nodes.txt}" # file containing one hostname per line
CONFIG="${2:-configs/default.yaml}"
EXPERIMENT="${3:-distributed_run}"

if [[ ! -f "${NODES_FILE}" ]]; then
echo "ERROR: nodes file not found: ${NODES_FILE}" >&2
exit 1
fi

# Read nodes into array
mapfile -t NODES < "${NODES_FILE}"
NUM_NODES=${#NODES[@]}
MASTER_NODE="${NODES[0]}"

echo "Launching ${NUM_NODES}-node training"
echo "Master node: ${MASTER_NODE}"
echo "All nodes: ${NODES[*]}"

# Detect GPUs on master node (assume homogeneous cluster)
GPUS_PER_NODE=$(ssh "${MASTER_NODE}" \
"nvidia-smi --query-gpu=name --format=csv,noheader | wc -l")
echo "GPUs per node: ${GPUS_PER_NODE}"

# Sync code to all nodes
for node in "${NODES[@]}"; do
echo "Syncing code to ${node}..."
rsync -azq --delete \
--exclude='.git' \
--exclude='outputs/' \
--exclude='data/' \
--exclude='*.pyc' \
--exclude='__pycache__' \
. "${node}:~/training/"
done
echo "Code sync complete"

# Launch training on each node
pids=()
for i in "${!NODES[@]}"; do
node="${NODES[$i]}"
rank=$i

echo "Launching on ${node} (rank ${rank})..."

ssh "${node}" "
cd ~/training
export MASTER_ADDR=${MASTER_NODE}
export MASTER_PORT=29500
export NODE_RANK=${rank}

torchrun \
--nnodes ${NUM_NODES} \
--nproc-per-node ${GPUS_PER_NODE} \
--node-rank ${rank} \
--rdzv-backend c10d \
--rdzv-endpoint ${MASTER_NODE}:29500 \
src/train.py \
--config ${CONFIG} \
--output outputs/${EXPERIMENT} \
> outputs/${EXPERIMENT}/node_${rank}.log 2>&1
" &

pids+=( $! )
done

echo "All ${NUM_NODES} nodes launched. Waiting for completion..."

# Wait for all nodes, capture failures
failed=0
for i in "${!pids[@]}"; do
pid="${pids[$i]}"
node="${NODES[$i]}"

if wait "${pid}"; then
echo "Node ${node}: SUCCESS"
else
echo "Node ${node}: FAILED" >&2
failed=$(( failed + 1 ))
fi
done

if (( failed > 0 )); then
echo "ERROR: ${failed} nodes failed" >&2
exit 1
fi

echo "All nodes completed successfully"

GPU Monitoring Script

#!/usr/bin/env bash
# gpu_monitor.sh
# Continuously monitors GPU utilization, memory, and temperature.
# Alerts if utilization drops (indicating a training stall).

set -euo pipefail

EXPERIMENT="${1:-unknown}"
LOG_FILE="${2:-/tmp/gpu_monitor.log}"
ALERT_THRESHOLD_UTIL="${GPU_ALERT_UTIL:-20}" # alert if util drops below 20%
CHECK_INTERVAL=30
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"

log() {
echo "[$(date -Is)] $*" | tee -a "${LOG_FILE}"
}

send_alert() {
local message="$1"
log "ALERT: ${message}"

if [[ -n "${SLACK_WEBHOOK}" ]]; then
curl -s -X POST "${SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data "{\"text\": \"[${EXPERIMENT}] ${message}\"}" \
|| true
fi
}

monitor_gpus() {
local consecutive_low=0
local low_util_limit=3 # alert after 3 consecutive low readings

while true; do
# Query all GPUs: index, utilization%, memory_used_MiB, memory_total_MiB, temp_C
local gpu_stats
gpu_stats=$(nvidia-smi \
--query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
--format=csv,noheader,nounits 2>/dev/null)

if [[ -z "${gpu_stats}" ]]; then
send_alert "nvidia-smi returned no output - GPU driver issue?"
sleep "${CHECK_INTERVAL}"
continue
fi

local any_low=false
while IFS=', ' read -r idx util mem_used mem_total temp; do
log "GPU ${idx}: util=${util}% mem=${mem_used}/${mem_total}MiB temp=${temp}C"

if (( util < ALERT_THRESHOLD_UTIL )); then
any_low=true
fi

if (( temp > 85 )); then
send_alert "GPU ${idx} temperature critical: ${temp}C"
fi

# Memory pressure warning
local mem_pct=$(( mem_used * 100 / mem_total ))
if (( mem_pct > 95 )); then
send_alert "GPU ${idx} memory at ${mem_pct}% (${mem_used}/${mem_total} MiB)"
fi
done <<< "${gpu_stats}"

# Track consecutive low utilization readings
if [[ "${any_low}" == true ]]; then
consecutive_low=$(( consecutive_low + 1 ))
if (( consecutive_low >= low_util_limit )); then
send_alert "GPU utilization below ${ALERT_THRESHOLD_UTIL}% for " \
"$((consecutive_low * CHECK_INTERVAL)) seconds - training may have stalled"
consecutive_low=0 # reset to avoid spam
fi
else
consecutive_low=0
fi

sleep "${CHECK_INTERVAL}"
done
}

log "Starting GPU monitor for experiment: ${EXPERIMENT}"
log "Alert threshold: util < ${ALERT_THRESHOLD_UTIL}%"
monitor_gpus

Parallel Data Download Script

#!/usr/bin/env bash
# download_dataset.sh
# Downloads a sharded dataset in parallel, verifies checksums,
# and retries failed downloads.

set -euo pipefail

BASE_URL="${1:?Usage: $0 BASE_URL OUTPUT_DIR}"
OUTPUT_DIR="${2:?Usage: $0 BASE_URL OUTPUT_DIR}"
NUM_WORKERS="${3:-8}"
MANIFEST="${OUTPUT_DIR}/manifest.json"

mkdir -p "${OUTPUT_DIR}"

# Fetch the manifest (list of shards + expected checksums)
echo "Fetching manifest..."
curl -fsSL "${BASE_URL}/manifest.json" -o "${MANIFEST}"

# Parse manifest with jq to extract shard names and checksums
mapfile -t SHARDS < <(jq -r '.shards[].filename' "${MANIFEST}")
echo "Found ${#SHARDS[@]} shards to download"

# Download a single shard with retry logic
download_shard() {
local filename="$1"
local url="${BASE_URL}/${filename}"
local output="${OUTPUT_DIR}/${filename}"
local expected_sha256
expected_sha256=$(jq -r --arg f "${filename}" \
'.shards[] | select(.filename == $f) | .sha256' "${MANIFEST}")

# Skip if already downloaded with correct checksum
if [[ -f "${output}" ]]; then
local actual_sha256
actual_sha256=$(sha256sum "${output}" | awk '{print $1}')
if [[ "${actual_sha256}" == "${expected_sha256}" ]]; then
echo "SKIP (already complete): ${filename}"
return 0
else
echo "CHECKSUM MISMATCH, re-downloading: ${filename}"
rm "${output}"
fi
fi

# Download with retry
local max_retries=3
local attempt=0
while (( attempt < max_retries )); do
attempt=$(( attempt + 1 ))
echo "Downloading ${filename} (attempt ${attempt}/${max_retries})..."

if curl -fsSL --retry 3 "${url}" -o "${output}"; then
# Verify checksum
local actual_sha256
actual_sha256=$(sha256sum "${output}" | awk '{print $1}')
if [[ "${actual_sha256}" == "${expected_sha256}" ]]; then
echo "OK: ${filename}"
return 0
else
echo "CHECKSUM FAILED for ${filename}" >&2
rm "${output}"
fi
fi

(( attempt < max_retries )) && sleep $(( attempt * 5 ))
done

echo "FAILED after ${max_retries} attempts: ${filename}" >&2
return 1
}

export -f download_shard
export BASE_URL OUTPUT_DIR MANIFEST

# Download all shards in parallel
echo "Downloading with ${NUM_WORKERS} workers..."
printf '%s\n' "${SHARDS[@]}" \
| parallel --progress -j "${NUM_WORKERS}" \
download_shard {}

echo "All ${#SHARDS[@]} shards downloaded and verified"

Checkpoint Cleanup Script

#!/usr/bin/env bash
# cleanup_checkpoints.sh
# Safely removes old checkpoints, keeping only the N most recent
# and any checkpoints marked as "best" by the training script.

set -euo pipefail

CHECKPOINT_DIR="${1:?Usage: $0 CHECKPOINT_DIR [KEEP_N]}"
KEEP_N="${2:-5}"

if [[ ! -d "${CHECKPOINT_DIR}" ]]; then
echo "ERROR: directory not found: ${CHECKPOINT_DIR}" >&2
exit 1
fi

# Find all step checkpoints (not "best" or "final" checkpoints)
mapfile -t STEP_CHECKPOINTS < <(
find "${CHECKPOINT_DIR}" \
-name "checkpoint_step_*.pt" \
-not -name "*best*" \
-not -name "*final*" \
| sort -t_ -k3 -n # sort by step number
)

TOTAL=${#STEP_CHECKPOINTS[@]}
TO_DELETE=$(( TOTAL - KEEP_N ))

if (( TO_DELETE <= 0 )); then
echo "Only ${TOTAL} checkpoints found, keeping all (threshold: ${KEEP_N})"
exit 0
fi

echo "Found ${TOTAL} step checkpoints, deleting ${TO_DELETE} oldest..."

# Delete oldest checkpoints (first TO_DELETE in sorted list)
for (( i=0; i < TO_DELETE; i++ )); do
ckpt="${STEP_CHECKPOINTS[$i]}"
echo "Deleting: $(basename "${ckpt}")"
rm "${ckpt}"
done

echo "Done. Kept ${KEEP_N} most recent step checkpoints."
echo "Best/final checkpoints were preserved."

Log Parsing with awk, sed, jq

#!/usr/bin/env bash
set -euo pipefail

TRAINING_LOG="${1:-training.log}"

# ---------------------------------------------------------------
# Extract training loss trend using awk
# (the 3-argument match() used in these awk programs is a GNU awk extension)
# ---------------------------------------------------------------
echo "=== Training Loss ==="
grep "train_loss" "${TRAINING_LOG}" \
| awk '{
# Extract step and loss from lines like: step=1000 train_loss=0.342
match($0, /step=([0-9]+)/, step_arr)
match($0, /train_loss=([0-9.]+)/, loss_arr)
if (step_arr[1] && loss_arr[1])
printf "Step %6s: %.4f\n", step_arr[1], loss_arr[1]
}'

# ---------------------------------------------------------------
# Find the step with best validation accuracy
# ---------------------------------------------------------------
echo ""
echo "=== Best Validation Accuracy ==="
grep "val_accuracy" "${TRAINING_LOG}" \
| awk '{
match($0, /step=([0-9]+)/, s)
match($0, /val_accuracy=([0-9.]+)/, a)
if (s[1] && a[1] && a[1]+0 > best+0) {
best = a[1]+0
best_step = s[1]
}
}
END { printf "Best: %.4f at step %s\n", best, best_step }'

# ---------------------------------------------------------------
# Parse JSON metrics file with jq
# ---------------------------------------------------------------
METRICS_FILE="metrics/eval.json"

if [[ -f "${METRICS_FILE}" ]]; then
echo ""
echo "=== Evaluation Metrics ==="
jq -r '
"Accuracy: " + (.accuracy | tostring),
"F1 Score: " + (.f1 | tostring),
"Throughput: " + (.throughput_samples_per_sec | tostring) + " samples/s"
' "${METRICS_FILE}"

# Extract per-class F1 scores
echo ""
echo "=== Per-Class F1 ==="
jq -r '.per_class_f1 | to_entries[] | "\(.key): \(.value)"' \
"${METRICS_FILE}" \
| sort -t: -k2 -n
fi

# ---------------------------------------------------------------
# Count errors and warnings in the log
# ---------------------------------------------------------------
echo ""
echo "=== Log Summary ==="
echo "Errors: $(grep -c "ERROR\|Exception\|Traceback" "${TRAINING_LOG}" || true)"
echo "Warnings: $(grep -c "WARNING\|UserWarning" "${TRAINING_LOG}" || true)"
echo "OOM events: $(grep -c "out of memory\|CUDA out of memory" "${TRAINING_LOG}" || true)"


Production Engineering Notes

Always use set -euo pipefail: This is non-negotiable for production scripts. Without -e, a failed command is silently ignored. Without -u, typos in variable names ($MDOEL_DIR) expand to empty string and cause subtle bugs. Without -o pipefail, a failed command in a pipeline only fails the pipeline if it is the last command.
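A quick way to see the pipefail difference for yourself (safe to run in an interactive shell; false stands in for a failing training command):

# Without pipefail, a pipeline's exit status is the last command's:
set +o pipefail
false | wc -l
echo "exit without pipefail: $?"   # 0 - the failure of `false` is masked by wc

# With pipefail, any failing stage fails the whole pipeline:
set -o pipefail
false | wc -l
echo "exit with pipefail: $?"      # 1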

Logging with timestamps: Every echo in a production script should include a timestamp: echo "[$(date -Is)] message". When you are debugging a failure at 3 AM and the log file has a thousand lines, timestamps tell you exactly when the failure happened relative to other events.

SSH and remote execution: Always use SSH key authentication (never password authentication in scripts). Use ssh -o ConnectTimeout=10 -o BatchMode=yes to fail fast if a node is unreachable. The BatchMode=yes flag disables interactive prompts - essential for scripts that cannot wait for user input.
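For example, a pre-flight reachability check over a node list (using the same one-hostname-per-line nodes.txt convention as the coordinator script above) might look like this:

# Fail fast if any worker node is unreachable before launching anything
while read -r node; do
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "${node}" true 2>/dev/null; then
        echo "OK: ${node}"
    else
        echo "UNREACHABLE: ${node}" >&2
        exit 1
    fi
done < nodes.txt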

tmux for long-running jobs: The standard workflow for interactive training on a remote machine is: tmux new -s training, start the training script, Ctrl+B D to detach, and tmux attach -t training to reconnect. Always name your tmux sessions after the experiment. Use tmux ls to list running sessions.
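In commands, that workflow looks like this (the session name is just an example):

tmux new -s resnet50_baseline            # create a named session
./train_launcher.sh -e resnet50_baseline -g 8
# Ctrl+B then D detaches; training keeps running on the remote machine

tmux ls                                  # list running sessions
tmux attach -t resnet50_baseline         # reattach later, even from a new SSH login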

Signal handling and checkpoint safety: Before deleting an old checkpoint, always verify that the new checkpoint has been fully written by checking file size and optionally loading the state dict. A pattern like save_checkpoint_tmp() then mv (atomic on the same filesystem) is safer than writing directly to the final path.
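A sketch of that pattern from the shell side - the filenames, the --save-path flag, and the 1 KB minimum-size check are illustrative assumptions:

# Save-then-swap: never leave a window where no valid checkpoint exists
save_and_rotate() {
    local ckpt_dir="$1" step="$2"
    local tmp="${ckpt_dir}/checkpoint_step_${step}.pt.tmp"
    local final="${ckpt_dir}/checkpoint_step_${step}.pt"

    # 1. The training code writes to the temporary path first
    #    (e.g. python train.py ... --save-path "${tmp}" - an assumed flag)

    # 2. Verify the file exists and is non-trivially sized before trusting it
    if [[ ! -s "${tmp}" ]] || (( $(stat -c%s "${tmp}") < 1024 )); then
        echo "ERROR: new checkpoint looks incomplete, keeping old ones" >&2
        return 1
    fi

    # 3. Atomic rename (same filesystem), then - and only then - delete old ones
    mv "${tmp}" "${final}"
    ls -t "${ckpt_dir}"/checkpoint_step_*.pt | tail -n +3 | xargs -r rm -v
}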

:::danger Dangerous Patterns to Avoid
Do not use rm -rf with unvalidated variables. A bug that leaves OUTPUT_DIR empty turns rm -rf ${OUTPUT_DIR}/checkpoints into rm -rf /checkpoints. Double-quoting ("${OUTPUT_DIR}/checkpoints") protects against word splitting and globbing, but not against the empty-variable case - for that, use ${OUTPUT_DIR:?} (which aborts if the variable is unset or empty) or explicitly validate that the variable is non-empty before running destructive commands.

Do not run parallel or xargs -P without a concurrency limit on shared infrastructure. Running parallel -j 0 (unlimited parallelism) on a shared NFS server will saturate I/O and impact everyone using that server. Always set a reasonable concurrency limit.

Do not capture sensitive values in log files. Training scripts often log their full argument list. If your script accepts --api-key or --password flags, those values will appear in the training log. Use environment variables for secrets and never log $* or the full argument list. :::
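For instance, pass credentials through the environment rather than flags (WANDB_API_KEY and the /run/secrets path are stand-in names for this sketch):

# BAD: the key lands in ps output, shell history, and the training log
# python train.py --api-key "sk-abc123"

# BETTER: export the secret from a file the script never prints
export WANDB_API_KEY="$(cat /run/secrets/wandb_key)"
python train.py --config configs/default.yaml 2>&1 | tee training.log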

:::warning Common Pitfalls
Glob patterns and spaces in filenames: The pattern for f in data/*.pt; do python process.py $f; done breaks if any filename contains a space. Always quote variables: "$f". Or use find ... -print0 | xargs -0 for the most robust handling.

wait without a PID hides failures: Plain wait (without a PID argument) waits for all background jobs but returns 0 even if some of them failed, regardless of bash version. Always capture PIDs with $! immediately after launching a background job and pass them to wait.

SSH StrictHostKeyChecking in CI: When your CI script SSH-es into a worker node it has never connected to before, SSH will prompt "Are you sure you want to continue connecting?" and hang. Add -o StrictHostKeyChecking=accept-new (OpenSSH 7.6+) to accept unknown hosts non-interactively, or -o StrictHostKeyChecking=no for older clients - and understand the security implications of both. :::

Interview Questions and Answers

Q1: Explain what set -euo pipefail does and why each flag matters for ML scripts.

A: -e causes the shell to exit immediately when any command returns a non-zero exit code, preventing the script from continuing after a failure. -u treats unset variable references as errors - without this, a typo like ${MDOEL_DIR} silently expands to empty string, which can cause rm -rf to run in the wrong directory. -o pipefail makes a pipeline fail if any stage in the pipe fails - without it, python train.py | tee training.log would report success even if the Python script failed, because tee succeeds. Together, these three flags make bash scripts fail fast and loudly instead of silently continuing with corrupted state.

Q2: You have 10,000 image files to preprocess. How would you do this in parallel in bash?

A: There are three main approaches. Using xargs -P: find data/images -name "*.jpg" | xargs -P 8 -I {} python preprocess.py {} - this limits concurrency to 8 workers and is available everywhere. Using GNU parallel: parallel -j 8 --progress python preprocess.py {} ::: data/images/*.jpg - this has better progress reporting and error handling. Using background jobs with manual PID tracking: start workers with &, collect PIDs with $!, and use wait PID to detect individual failures. For production use, GNU parallel is preferred because it handles argument quoting correctly, provides progress bars, and collects exit codes from all jobs.

Q3: How do you safely handle checkpoint cleanup in a training script so you never lose the most recent checkpoint?

A: The pattern is: (1) save the new checkpoint to a temporary filename first (checkpoint_step_1000.pt.tmp), (2) verify it was written correctly (check file size is non-zero, optionally load and verify state dict), (3) atomically rename it to the final name with mv (which is atomic when source and destination are on the same filesystem), and (4) only then delete old checkpoints. The key insight is that mv on the same filesystem is atomic - there is no moment where the file does not exist. Never delete the old checkpoint before the new one is fully written and verified.

Q4: You are running a distributed training job across 4 nodes. Node 2 fails halfway through training. How do you detect this and what should your bash script do?

A: Each node's SSH command is run as a background job with its PID captured. The coordinator script calls wait PID for each node. If any wait call returns non-zero, that node failed. The script should: (1) send an alert (Slack, email, PagerDuty) immediately; (2) kill all other nodes' training processes with SSH to avoid them hanging waiting for the dead worker; (3) preserve all available checkpoints; and (4) exit with a non-zero exit code so the CI/CD system knows the training failed. With PyTorch's c10d rendezvous backend, you can also configure automatic node fault tolerance - but the bash wrapper should still detect and report failures independently.

Q5: What is the difference between 2>&1 | tee log.txt and > log.txt 2>&1, and when would you use each for ML training?

A: > log.txt 2>&1 redirects both stdout and stderr to the file, with nothing visible on the terminal. 2>&1 | tee log.txt sends stderr to stdout (merging them), then pipes to tee which writes to the file AND the terminal simultaneously. For ML training, 2>&1 | tee training.log is almost always preferred - you want to see progress bars and loss values in real time on the terminal while also preserving them in the log file. The only downside is that tee adds a tiny amount of overhead, which is negligible for training. One subtlety: with set -o pipefail, if the Python training script fails, the pipe will also fail, and your script will exit with an error as expected.

Q6: How would you write a bash script that monitors GPU utilization every 30 seconds and sends an alert if utilization drops below 20% for more than 5 minutes?

A: The key elements are: (1) a loop that calls nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits and parses the output; (2) a consecutive-low counter that increments each time utilization is below threshold and resets when utilization recovers; (3) an alert condition when the counter reaches 5 minutes / 30 seconds = 10 consecutive readings; and (4) a mechanism to send the alert (curl to Slack webhook, mail, or writing to a file that another process monitors). The script runs as a background process alongside training, and both are controlled by the launcher script's trap cleanup EXIT handler.

Q7: Explain process substitution and give an ML-specific example of where it is useful.

A: Process substitution (<(command)) makes the output of a command appear as if it were a file - bash creates a named pipe (like /dev/fd/63) that the outer command reads from. This avoids creating temporary files. An ML-specific example: comparing model predictions from two checkpoints without saving intermediate files: diff <(python predict.py --model checkpoint_step_1000.pt --data test.csv) <(python predict.py --model checkpoint_step_2000.pt --data test.csv). Another use case: some evaluation scripts require a file path argument and cannot read from stdin. You can use process substitution to pass streaming data: python evaluate.py --predictions <(python generate_predictions.py --model model.pt).

© 2026 EngineersOfAI. All rights reserved.