
Privacy and Air-Gapped Deployment

The 3 AM Security Audit That Changed Everything

It is 3 AM on a Tuesday at a mid-sized law firm in Chicago. A senior partner is drafting a motion involving a sensitive merger that will close in 72 hours. She has been using an AI assistant to check her arguments, find holes in opposing counsel's briefs, and suggest case precedents. The tool is fast, accurate, and indispensable.

Then the firm's outside security consultant - brought in to audit AI tool usage before a major client's annual review - pulls a network log and walks into her office. "Every prompt you typed in the last three months went to a server in Oregon. The client's merger strategy, the confidential financial terms, the negotiating positions. All of it. Sitting in a vendor's training pipeline."

The partner goes pale. The client is a Fortune 50 company. The other party to the merger had recently hired from the same cloud AI vendor. Nothing is provably compromised - but the attorney-client privilege question alone could unravel the firm's relationship with that client permanently. The firm's malpractice carrier raises rates by 40% at the next renewal.

This scenario is not hypothetical. It plays out across regulated industries every year, often quietly, without public disclosure. The pattern is the same: engineers and business users reach for convenient cloud AI tools and forget that "the model" lives somewhere else - and that "somewhere else" means the data leaves their control the moment they hit enter. For healthcare providers, that moment can constitute a HIPAA breach. For defense contractors, it can violate ITAR regulations. For banks processing non-public information, it can be a securities violation.

The solution is not to avoid AI. The solution is to run the model where the data lives - behind your firewall, on your hardware, with no network path to the outside world. This is air-gapped deployment: the oldest security pattern in computing, now applied to language models.

This lesson walks through every layer of that deployment - from the regulatory landscape that makes it necessary, to the practical steps of getting model weights onto an isolated machine, to the automation scripts that make the process repeatable and auditable.


Why This Exists

The Problem: Cloud AI Leaks Data by Design

Cloud-hosted AI APIs are not the problem. The business model is. When you call openai.chat.completions.create(), your prompt travels over the network to a server you do not control, runs through a model you cannot inspect, and the provider retains the right to log, analyze, and potentially use that interaction for model improvement. The terms of service change. The vendor's security posture changes. The regulatory interpretation of what constitutes "sharing" changes.

For most use cases this is an acceptable trade. For a subset - healthcare, legal, financial trading, defense, intelligence - it is not acceptable under any circumstances. These aren't edge cases. Healthcare alone is a $4.3 trillion industry in the United States with strict data handling requirements. The combined market for privacy-requiring AI is enormous, and until open-source models reached competitive quality, that market was simply locked out of modern AI tooling.

Why "Private Cloud" Is Not Enough

A common response is "we use a private cloud instance - our VPC, our keys, our data." This addresses some risks but not all. The hyperscaler still sees your network traffic metadata. The model provider may still have contractual access to logs for debugging. Most importantly, "private cloud" still requires internet connectivity - which is precisely what certain regulated environments prohibit. A classified government facility cannot route traffic to AWS GovCloud even if the data never leaves a sovereign boundary. A hospital in an area with unreliable connectivity needs inference to work when the internet is down.

Air-gapped deployment eliminates the network dependency entirely. The model runs on hardware that has no physical or logical connection to the internet. Data never crosses a boundary it should not cross because there is no path for it to do so.

What This Solves

Running models locally in true air-gap mode provides:

  • Data residency guarantees - the prompt and response never leave the physical hardware
  • Regulatory compliance - HIPAA, ITAR, GDPR Article 44, FIPS 140-2 can all be satisfied
  • Availability independence - model works when internet is down, vendor is down, or vendor is acquired
  • Audit completeness - every inference call is logged locally, no third-party logs exist
  • Supply chain control - model weights are versioned and verified by checksum before use

Historical Context: From SIPRNET to LLMs

The concept of an air-gapped network predates the internet. The U.S. Department of Defense operated early computing systems in the 1960s and 1970s on physically isolated networks. The Secret Internet Protocol Router Network (SIPRNet) - which handles SECRET-classified U.S. government communications - is isolated from the public internet. The technique was formalized not because someone invented it, but because the threat model demanded it: if sensitive data must not leak, the simplest guarantee is no physical path for leakage.

The "aha moment" for AI practitioners came in 2022-2023 as two trends collided. First, LLaMA-1 was released by Meta in February 2023, demonstrating that a 65B parameter model could achieve near-GPT-3.5 quality. Second, Samsung's internal ChatGPT leak in April 2023 became public: engineers had pasted proprietary chip design source code into ChatGPT prompts, accidentally disclosing trade secrets to OpenAI. Samsung banned ChatGPT internally within days. The story ran in Bloomberg and the Wall Street Journal. Every CISO in every Fortune 500 company forwarded it to their AI working group.

Within three months, the combination of LLaMA-class quality and the Samsung incident created an entirely new market segment: enterprise-grade, privacy-preserving, locally-hosted LLMs. Tools like Ollama, LM Studio, and the broader llama.cpp ecosystem were built precisely to serve this market. The regulatory frameworks had existed for decades. The models were finally good enough to meet them.


Core Concepts

Threat Modeling for LLM Privacy

Before deploying anything, you need to understand which threats you are defending against. Air-gap deployment addresses different threat classes than encryption or access control.

The four primary threat classes for LLM deployments are:

Network exfiltration - data leaving your perimeter via network connections. Air-gap eliminates this entirely.

Vendor access - the model provider having contractual or technical access to your data. Air-gap eliminates this because you are running the weights, not calling an API.

Side-channel inference - an adversary inferring the contents of your prompts from metadata (timing, packet sizes). Air-gap eliminates this - no packets leave.

Model poisoning - the model itself has been tampered with to exfiltrate data or produce biased outputs. Air-gap does not eliminate this, which is why SHA256 verification of model weights before deployment is essential.

The threat model informs the deployment architecture. If you only need to defend against network exfiltration, a local install with HF_HUB_OFFLINE=1 may suffice. If you need to defend against all four threats including model tampering, you need a full chain-of-custody deployment with verified checksums and possibly a separate trusted platform module (TPM) attestation.

Regulatory Landscape

Understanding which regulation applies to your situation determines the technical requirements.

HIPAA (Health Insurance Portability and Accountability Act) - applies to Protected Health Information (PHI) in the United States. PHI includes patient names, dates, geographic data, phone numbers, device identifiers, and anything that could identify a patient. The key question: does your LLM prompt contain PHI? If a doctor asks "What is the standard treatment for the condition described in this patient note: [note]", and the note contains the patient's name or a unique identifier, that is PHI. HIPAA's Security Rule requires that PHI be protected in transit and at rest. Sending PHI to a cloud LLM requires a Business Associate Agreement (BAA) with the vendor, which most AI providers do not offer or offer only in enterprise tiers. Running locally eliminates the need for a BAA entirely.

ITAR (International Traffic in Arms Regulations) - applies to defense articles and technical data. If you are an aerospace or defense contractor using AI to analyze technical documents that fall under USML (United States Munitions List) categories, those documents cannot be processed on systems accessible to non-U.S. persons or routed through foreign-operated infrastructure. Air-gap deployment combined with proper access controls satisfies ITAR's technical data handling requirements.

GDPR Article 44 - restricts transfers of EU personal data to countries outside the EU/EEA unless adequate safeguards exist. Most U.S. cloud AI providers process data in U.S. datacenters, which technically constitutes a Chapter V transfer. On-premise deployment in EU jurisdiction eliminates the transfer question entirely.

FIPS 140-2 - Federal Information Processing Standard for cryptographic modules. U.S. federal agencies and contractors handling Controlled Unclassified Information (CUI) must use FIPS-validated cryptographic modules. This affects how you handle model weight encryption, log encryption, and API TLS. Running RHEL or Ubuntu with FIPS mode enabled addresses the operating system layer; your inference stack must also be validated or use validated libraries.

SOC 2 Type II - not a regulation but an audit standard. If your company has SOC 2 Type II certification, your AI deployment must fall within the audit scope with documented controls. Local deployment makes the control evidence much cleaner: inference logs are local, access is controlled by your IAM, no third-party sub-processors exist.

The Air-Gap Deployment Architecture

┌─────────────────────────────────────────────────────────────┐
│ SECURE FACILITY │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Air-Gapped Network Segment │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Inference │ │ Application │ │ │
│ │ │ Server │◄───│ Server │ │ │
│ │ │ (GPU host) │ │ (FastAPI / LangChain) │ │ │
│ │ └─────────────┘ └─────────────────────────┘ │ │
│ │ ▲ ▲ │ │
│ │ │ │ │ │
│ │ ┌──────┴──────┐ ┌───────┴──────┐ │ │
│ │ │ Model │ │ User │ │ │
│ │ │ Storage │ │ Workstations│ │ │
│ │ │ (NAS/SAN) │ │ │ │ │
│ │ └─────────────┘ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ (One-way data diode or │
│ │ physical media transfer only) │
│ ┌───────────┴──────────────────────────────────────────┐ │
│ │ Data Transfer Workstation │ │
│ │ (not on air-gapped segment) │ │
│ └──────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ USB / encrypted drive │
└──────────────┼──────────────────────────────────────────────┘

(external world)

The critical element is the one-way transfer boundary. Data flows into the air-gapped segment (new model weights, software updates) only via deliberate, controlled action - physical media, data diodes, or manual transfer workstations. No data flows out automatically.


Pre-Downloading Everything: The Complete Checklist

The hardest part of air-gap deployment is not the runtime configuration - it is ensuring every dependency is available offline before you disconnect. Missing a single file means the model fails to load and you cannot just pip install your way out of it.

Model Weights and Tokenizer Files

HuggingFace stores models as a collection of files: model weights (sharded .safetensors or .bin files), a config.json, a tokenizer.json, tokenizer_config.json, and sometimes additional vocabulary files. You need all of them.

# download_model_for_airgap.py
# Run this on an internet-connected machine before transport

from huggingface_hub import snapshot_download
import hashlib
import json
import os
from pathlib import Path

def download_and_verify(
    model_id: str,
    local_dir: str,
    revision: str = "main"
) -> dict:
    """
    Download a complete model snapshot and generate a manifest
    with SHA256 checksums for every file.
    """
    print(f"Downloading {model_id} to {local_dir} ...")

    # snapshot_download fetches every file in the repo
    downloaded_path = snapshot_download(
        repo_id=model_id,
        local_dir=local_dir,
        revision=revision,
        ignore_patterns=["*.msgpack", "flax_model*", "tf_model*"],  # skip non-PyTorch weights
    )

    print("Download complete. Building checksum manifest ...")

    manifest = {
        "model_id": model_id,
        "revision": revision,
        "files": {}
    }

    base = Path(downloaded_path)
    for filepath in sorted(base.rglob("*")):
        if filepath.is_file():
            relative = str(filepath.relative_to(base))
            sha256 = hashlib.sha256()
            with open(filepath, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    sha256.update(chunk)
            manifest["files"][relative] = {
                "sha256": sha256.hexdigest(),
                "size_bytes": filepath.stat().st_size
            }
            print(f"  {relative}: {sha256.hexdigest()[:16]}...")

    manifest_path = Path(local_dir) / "airgap_manifest.json"
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

    print(f"\nManifest written to {manifest_path}")
    print(f"Total files: {len(manifest['files'])}")
    total_bytes = sum(v['size_bytes'] for v in manifest['files'].values())
    print(f"Total size: {total_bytes / 1e9:.2f} GB")

    return manifest


if __name__ == "__main__":
    # Example: download Llama-3.2-3B-Instruct
    download_and_verify(
        model_id="meta-llama/Llama-3.2-3B-Instruct",
        local_dir="./models/llama-3.2-3b-instruct",
        revision="main"
    )

For GGUF models (used with llama.cpp and Ollama), the download is simpler - a single file per quantization level:

# Download GGUF quantizations for offline use
# Run on internet-connected machine

MODEL_BASE_URL="https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main"
OUTPUT_DIR="./models/gguf/llama-3.2-3b"

mkdir -p "$OUTPUT_DIR"

# Download multiple quantization levels
for QUANT in Q4_K_M Q5_K_M Q8_0; do
    FILENAME="Llama-3.2-3B-Instruct-${QUANT}.gguf"
    echo "Downloading $FILENAME ..."
    curl -L \
        --progress-bar \
        -o "$OUTPUT_DIR/$FILENAME" \
        "$MODEL_BASE_URL/$FILENAME"

    # Generate checksum immediately after download
    sha256sum "$OUTPUT_DIR/$FILENAME" >> "$OUTPUT_DIR/checksums.sha256"
    echo "  Checksum recorded."
done

echo "All downloads complete. Checksum file: $OUTPUT_DIR/checksums.sha256"

Python Package Offline Cache

Every Python package your inference stack needs must be downloaded to a local cache. pip download is the correct tool - it downloads wheel files and their dependencies without installing them.

# pip_offline_cache.sh
# Creates a self-contained pip cache directory for offline installation

set -e

CACHE_DIR="./pip_cache"
mkdir -p "$CACHE_DIR"

# Core inference dependencies
pip download \
--dest "$CACHE_DIR" \
torch==2.3.0 \
torchvision==0.18.0 \
torchaudio==2.3.0 \
--index-url https://download.pytorch.org/whl/cu121

# HuggingFace ecosystem
pip download \
--dest "$CACHE_DIR" \
transformers==4.42.0 \
accelerate==0.31.0 \
huggingface_hub==0.23.4 \
tokenizers==0.19.1 \
safetensors==0.4.3 \
sentencepiece==0.2.0

# Serving layer
pip download \
--dest "$CACHE_DIR" \
fastapi==0.111.0 \
uvicorn[standard]==0.30.0 \
pydantic==2.7.4

# Utilities
pip download \
--dest "$CACHE_DIR" \
numpy==1.26.4 \
scipy==1.13.1 \
tqdm==4.66.4

echo "Package cache ready at $CACHE_DIR"
echo "Transfer this directory to the air-gapped machine."
echo ""
echo "To install on air-gapped machine:"
echo " pip install --no-index --find-links=$CACHE_DIR transformers accelerate ..."

On the air-gapped machine, installation uses --no-index to prevent any network calls:

# On the air-gapped machine - no internet access
pip install \
--no-index \
--find-links=/path/to/pip_cache \
transformers \
accelerate \
fastapi \
uvicorn

Verifying Model Integrity on Arrival

After transferring model files via physical media (USB, encrypted hard drive, or verified network share), verify every file before use:

# verify_airgap_manifest.py
# Run on the air-gapped machine after receiving model files

import hashlib
import json
import sys
from pathlib import Path

def verify_manifest(model_dir: str, manifest_path: str = None) -> bool:
    """
    Verify all files in a model directory against the manifest
    generated during download. Returns True if all checksums match.
    """
    base = Path(model_dir)

    if manifest_path is None:
        manifest_path = base / "airgap_manifest.json"

    with open(manifest_path) as f:
        manifest = json.load(f)

    print(f"Verifying {len(manifest['files'])} files for {manifest['model_id']} ...")
    print(f"Revision: {manifest['revision']}")
    print()

    failures = []

    for relative_path, expected in manifest["files"].items():
        filepath = base / relative_path

        if not filepath.exists():
            failures.append(f"MISSING: {relative_path}")
            continue

        # Compute actual checksum
        sha256 = hashlib.sha256()
        with open(filepath, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                sha256.update(chunk)

        actual = sha256.hexdigest()

        if actual != expected["sha256"]:
            failures.append(
                f"CHECKSUM MISMATCH: {relative_path}\n"
                f"  expected: {expected['sha256']}\n"
                f"  actual:   {actual}"
            )
        else:
            print(f"  OK  {relative_path}")

    print()

    if failures:
        print("VERIFICATION FAILED:")
        for failure in failures:
            print(f"  {failure}")
        return False
    else:
        print(f"All {len(manifest['files'])} files verified successfully.")
        return True


if __name__ == "__main__":
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "./model"
    ok = verify_manifest(model_dir)
    sys.exit(0 if ok else 1)

Configuring HuggingFace for Fully Offline Operation

HuggingFace's Transformers library makes network calls in several places that are easy to miss. Setting the right environment variables before any import ensures the library never attempts an outbound connection.

# Set in your shell profile, systemd unit, or Docker ENV
export HF_HUB_OFFLINE=1 # Prevents huggingface_hub from making any network requests
export TRANSFORMERS_OFFLINE=1 # Same for transformers library specifically
export HF_DATASETS_OFFLINE=1 # For datasets library if used
export TOKENIZERS_PARALLELISM=false # Prevents fork-related warnings in offline mode

# Optionally: point HF_HOME to your local model cache
export HF_HOME=/opt/ai/model_cache
export HF_HUB_CACHE=/opt/ai/model_cache/hub
warning

Setting HF_HUB_OFFLINE=1 is not enough on its own. If you install a newer version of transformers that adds a new network call (e.g., for model card fetching), your deployment will silently fail in new ways. Pin your dependency versions and test every upgrade in a staging environment before deploying to air-gapped production.
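One way to make that pinning enforceable is pip's hash-checking mode: pin exact versions with recorded wheel hashes on the preparation machine, then install with --require-hashes on the target so any substituted or missing wheel fails loudly rather than silently. A minimal sketch, assuming pip-tools is available on the preparation machine and using requirements.in / requirements-airgap.txt as example filenames:

# On the internet-connected preparation machine:
# pin exact versions and record a SHA256 hash for every wheel
pip install pip-tools
pip-compile --generate-hashes -o requirements-airgap.txt requirements.in

# Download the pinned, hashed set into the offline cache
pip download --dest ./pip_cache -r requirements-airgap.txt

# On the air-gapped machine: any wheel whose hash does not match fails loudly
pip install \
    --no-index \
    --find-links=./pip_cache \
    --require-hashes \
    -r requirements-airgap.txt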

# airgap_inference.py
# Complete inference script with all offline guards set

import os

# These must be set BEFORE importing transformers or huggingface_hub
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

# Now safe to import
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def load_model_offline(model_path: str):
    """
    Load a model from a local path, guaranteed no network access.
    The local_files_only=True is belt-and-suspenders alongside the env vars.
    """
    print(f"Loading tokenizer from {model_path} ...")
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        local_files_only=True,    # explicit belt-and-suspenders
        trust_remote_code=False,  # never execute remote code in air-gap context
    )

    print(f"Loading model from {model_path} ...")
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        trust_remote_code=False,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    return model, tokenizer


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template using local tokenizer only
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    model, tokenizer = load_model_offline("/opt/ai/models/llama-3.2-3b-instruct")
    response = generate(model, tokenizer, "Summarize the key provisions of HIPAA.")
    print(response)

Ollama in Offline Mode

Ollama also makes network calls by default. To run it in fully offline mode:

# Start Ollama bound to localhost, serving models from a pre-populated local directory
OLLAMA_HOST=127.0.0.1:11434 \
OLLAMA_ORIGINS="*" \
OLLAMA_MODELS=/opt/ai/ollama_models \
ollama serve &

# Load a model that was previously pulled and transferred
# (ollama pull must have been done on an internet-connected machine first)
# The model blobs are in OLLAMA_MODELS directory - transfer the entire directory

# Verify it works without network
ollama run llama3.2:3b "Hello, are you working offline?"
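The transfer itself can be as simple as archiving the OLLAMA_MODELS directory (it contains the blobs/ and manifests/ subdirectories Ollama needs) and restoring it on the isolated machine. A sketch, with /staging and /opt/ai as example paths:

# On the internet-connected machine: serve with a staging model directory, then pull
OLLAMA_MODELS=/staging/ollama_models ollama serve &
ollama pull llama3.2:3b

# Package the whole model store (blobs/ and manifests/) for transport
tar -czf ollama_models.tar.gz -C /staging ollama_models
sha256sum ollama_models.tar.gz > ollama_models.tar.gz.sha256

# On the air-gapped machine: verify, unpack, and serve from the same directory
sha256sum --check ollama_models.tar.gz.sha256
sudo mkdir -p /opt/ai
sudo tar -xzf ollama_models.tar.gz -C /opt/ai
# /opt/ai/ollama_models is the OLLAMA_MODELS path used in the serve command above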

Sneakernet Model Delivery

"Sneakernet" - the practice of physically carrying data on removable media - is the standard model delivery mechanism for air-gapped environments. The term is slightly tongue-in-cheek but the practice is serious and requires its own procedures.

Transfer Media Options

| Media | Capacity | Speed | Security Considerations |
| --- | --- | --- | --- |
| USB 3.2 encrypted drive | up to 4 TB | 400 MB/s | VeraCrypt or hardware encryption; must be wiped after transfer |
| LTO tape | 12-45 TB native | 400 MB/s | Industry standard for classified; requires tape drive on both ends |
| Encrypted NVMe in secure enclosure | up to 8 TB | 3+ GB/s | Fastest option; physical chain of custody required |
| Optical (M-DISC) | 25-100 GB | 20-30 MB/s | Archival quality; read-only reduces tampering risk |

For most enterprise deployments, an encrypted NVMe drive with hardware-enforced encryption (Samsung T7 Shield or IronKey) is the practical choice - fast, portable, and the encryption is auditable.
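If you prefer software encryption you can audit end to end, a LUKS2 volume on the transfer drive is one option. A minimal sketch, where /dev/sdX is a placeholder for the actual device:

# Create a LUKS2-encrypted transfer volume (destroys any existing data on /dev/sdX)
sudo cryptsetup luksFormat --type luks2 /dev/sdX

# Open, format, and mount it at the path the transfer script expects
sudo cryptsetup open /dev/sdX transfer_drive
sudo mkfs.ext4 /dev/mapper/transfer_drive
sudo mkdir -p /mnt/transfer_drive
sudo mount /dev/mapper/transfer_drive /mnt/transfer_drive

# ... copy files, then unmount and close before physical transport
sudo umount /mnt/transfer_drive
sudo cryptsetup close transfer_drive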

Transfer Procedure

# prepare_transfer_drive.sh
# Run on internet-connected preparation machine
# Assumes encrypted drive is mounted at /mnt/transfer_drive

TRANSFER_DRIVE="/mnt/transfer_drive"
SOURCE_MODELS_DIR="./models"
SOURCE_PIP_CACHE="./pip_cache"
SOURCE_SCRIPTS="./scripts"

# Create organized directory structure on drive
mkdir -p "$TRANSFER_DRIVE/models"
mkdir -p "$TRANSFER_DRIVE/pip_cache"
mkdir -p "$TRANSFER_DRIVE/scripts"
mkdir -p "$TRANSFER_DRIVE/docs"

# Copy model files
echo "Copying model files ..."
rsync -av --progress \
"$SOURCE_MODELS_DIR/" \
"$TRANSFER_DRIVE/models/"

# Copy pip cache
echo "Copying pip cache ..."
rsync -av \
"$SOURCE_PIP_CACHE/" \
"$TRANSFER_DRIVE/pip_cache/"

# Copy deployment scripts
cp "$SOURCE_SCRIPTS"/*.sh "$TRANSFER_DRIVE/scripts/"
cp "$SOURCE_SCRIPTS"/*.py "$TRANSFER_DRIVE/scripts/"

# Record transfer metadata
cat > "$TRANSFER_DRIVE/TRANSFER_METADATA.txt" << EOF
Transfer date: $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Prepared by: $(whoami)@$(hostname)
Purpose: Air-gapped LLM deployment
Models included: $(ls "$TRANSFER_DRIVE/models/")
File count: $(find "$TRANSFER_DRIVE" -type f | wc -l)
Total size: $(du -sh "$TRANSFER_DRIVE" | cut -f1)
EOF

# Generate top-level manifest last so it also covers the metadata file.
# The manifest excludes itself; otherwise verification would always fail.
echo "Generating transfer manifest ..."
find "$TRANSFER_DRIVE" -type f ! -name 'TRANSFER_MANIFEST.sha256' -exec sha256sum {} \; \
    | sort > "$TRANSFER_DRIVE/TRANSFER_MANIFEST.sha256"

# Sync to ensure everything is written
sync

echo "Transfer drive ready."
echo "Physical chain of custody required from this point."


On-Premise Deployment Patterns

Pattern 1: Bare Metal Single Server

Best for: small teams, high-GPU-utilization use cases, simplicity of operations.

┌──────────────────────────────────────┐
│ Single bare metal server │
│ (e.g., Dell R750xa, 2x A100 80GB) │
│ │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │
│ │ A100 80G │ │ A100 80G │ │
│ └──────────┘ └──────────────────┘ │
│ │
│ Model: Llama-3-70B in FP16 │
│ (140 GB across both GPUs) │
│ │
│ Serving: vLLM on port 8000 │
│ (internal network only) │
└──────────────────────────────────────┘

Setup:

# vllm_airgap_server.sh
# Start vLLM in offline mode, no telemetry

export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python -m vllm.entrypoints.openai.api_server \
    --model /opt/ai/models/llama-3-70b-instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name "llama-3-70b" \
    --disable-log-requests \
    --gpu-memory-utilization 0.90
# --disable-log-requests is optional and only reduces log verbosity

Pattern 2: Air-Gapped Kubernetes

For larger deployments where multiple teams need access and high availability matters, Kubernetes in an air-gapped configuration is the right pattern. This requires pre-loading all container images as well as model weights.

# Pre-pull all required container images on internet-connected machine
docker pull vllm/vllm-openai:v0.5.0
docker pull nvidia/cuda:12.1.0-base-ubuntu22.04
docker pull python:3.11-slim

# Save to tar archives for transfer
docker save vllm/vllm-openai:v0.5.0 | gzip > vllm-v0.5.0.tar.gz
docker save nvidia/cuda:12.1.0-base-ubuntu22.04 | gzip > cuda-12.1.0-base.tar.gz
docker save python:3.11-slim | gzip > python-3.11-slim.tar.gz

# On air-gapped machine, load images
docker load < vllm-v0.5.0.tar.gz
docker load < cuda-12.1.0-base.tar.gz
docker load < python-3.11-slim.tar.gz

# Push to internal registry (Harbor, Nexus, or registry:2)
docker tag vllm/vllm-openai:v0.5.0 internal-registry.local/vllm:v0.5.0
docker push internal-registry.local/vllm:v0.5.0

The Kubernetes deployment manifest references the internal registry and mounts the model weights via a PersistentVolume backed by the internal NAS:

# vllm-deployment.yaml (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ai-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: internal-registry.local/vllm:v0.5.0  # no external registry
          env:
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
          volumeMounts:
            - name: model-weights
              mountPath: /opt/models
              readOnly: true
          resources:
            limits:
              nvidia.com/gpu: "1"
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights-pvc  # backed by internal NAS
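The claim above assumes a PersistentVolume has already been provisioned. A minimal sketch of an NFS-backed volume and matching claim, where nas.internal.local and /exports/models are placeholder values for the internal NAS:

# model-weights-pv.yaml (sketch - NFS server and export path are placeholders)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-weights-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadOnlyMany
  nfs:
    server: nas.internal.local
    path: /exports/models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-pvc
  namespace: ai-platform
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""
  volumeName: model-weights-pv
  resources:
    requests:
      storage: 500Gi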

FIPS Compliance for Government Deployments

FIPS 140-2 (transitioning to FIPS 140-3) specifies requirements for cryptographic modules used in U.S. federal information systems. For LLM deployments, this primarily affects:

  1. The TLS layer protecting the inference API (must use FIPS-approved cipher suites)
  2. Any encrypted storage for model weights or logs
  3. Authentication mechanisms for the inference API

Enabling FIPS mode on Ubuntu 22.04:

# Enable FIPS mode (requires Ubuntu Pro subscription or DISA STIG hardening)
sudo pro enable fips    # 'pro' is the current CLI name; older releases used 'ua enable fips'
sudo reboot

# Verify FIPS mode is active
cat /proc/sys/crypto/fips_enabled
# Should output: 1

# Verify Python crypto uses FIPS-compliant backend
python3 -c "import ssl; print(ssl.OPENSSL_VERSION)"
# Should show OpenSSL with FIPS enabled

# Generate FIPS-compliant TLS cert for inference API (RSA 3072 or ECDSA P-384)
openssl req -x509 \
-newkey ec \
-pkeyopt ec_paramgen_curve:P-384 \
-keyout server.key \
-out server.crt \
-days 365 \
-nodes \
-subj "/CN=inference.internal.example.com"
tip

For DISA STIG hardening requirements, Red Hat Enterprise Linux (RHEL) has the most mature FIPS implementation for AI workloads. RHEL 9 with the crypto-policies package set to FIPS mode passes most government security scans without additional configuration.
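On RHEL 9 the equivalent of the Ubuntu steps above is roughly the following, using the standard crypto-policies tooling:

# RHEL 9: switch the system-wide crypto policy to FIPS and enable FIPS mode
sudo fips-mode-setup --enable
sudo reboot

# After reboot, confirm both the kernel flag and the crypto policy
cat /proc/sys/crypto/fips_enabled        # expect: 1
sudo fips-mode-setup --check
update-crypto-policies --show            # expect: FIPS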


Complete Air-Gap Deployment Automation Script

#!/bin/bash
# airgap_deploy.sh
# Complete deployment script for air-gapped LLM inference server
# Run on the air-gapped target machine after receiving transfer media

set -euo pipefail

# ---- Configuration ----
TRANSFER_MEDIA="/mnt/transfer_drive"
INSTALL_BASE="/opt/ai"
MODEL_NAME="llama-3.2-3b-instruct"
MODEL_DIR="$INSTALL_BASE/models/$MODEL_NAME"
PIP_CACHE="$TRANSFER_MEDIA/pip_cache"
VENV_DIR="$INSTALL_BASE/venv"
SERVICE_USER="aiinference"
PORT=8080

echo "================================================================"
echo " Air-Gapped LLM Deployment Script"
echo " Target: $INSTALL_BASE"
echo " Model: $MODEL_NAME"
echo "================================================================"
echo ""

# Step 1: Verify transfer media checksums
echo "[1/7] Verifying transfer media integrity ..."
cd "$TRANSFER_MEDIA"
sha256sum --check --quiet TRANSFER_MANIFEST.sha256
echo " Transfer media integrity: OK"

# Step 2: Create service user and directory structure
echo "[2/7] Creating service user and directory structure ..."
id -u "$SERVICE_USER" &>/dev/null || sudo useradd --system --shell /usr/sbin/nologin "$SERVICE_USER"
sudo mkdir -p "$INSTALL_BASE"/{models,venv,logs,scripts}
sudo chown -R "$SERVICE_USER:$SERVICE_USER" "$INSTALL_BASE"

# Step 3: Copy model files and deployment scripts
echo "[3/7] Copying model files and scripts ..."
rsync -av "$TRANSFER_MEDIA/models/$MODEL_NAME/" "$MODEL_DIR/"
cp "$TRANSFER_MEDIA/scripts/"*.py "$INSTALL_BASE/scripts/"

# Step 4: Verify model checksums
echo "[4/7] Verifying model file integrity ..."
python3 - << 'PYEOF'
import sys
sys.path.insert(0, '/opt/ai/scripts')
from verify_airgap_manifest import verify_manifest
ok = verify_manifest('/opt/ai/models/llama-3.2-3b-instruct')
if not ok:
    print("ERROR: Model verification failed. Do not proceed.")
    sys.exit(1)
PYEOF
echo " Model integrity: OK"

# Step 5: Create virtualenv and install packages offline
echo "[5/7] Installing Python packages from offline cache ..."
python3 -m venv "$VENV_DIR"
"$VENV_DIR/bin/pip" install \
--no-index \
--find-links="$PIP_CACHE" \
--quiet \
transformers \
accelerate \
fastapi \
uvicorn \
torch \
safetensors \
sentencepiece
echo " Python packages installed: OK"

# Step 6: Create systemd service
echo "[6/7] Installing systemd service ..."
sudo tee /etc/systemd/system/ai-inference.service > /dev/null << SVCEOF
[Unit]
Description=Air-Gapped LLM Inference Service
After=network.target

[Service]
Type=simple
User=$SERVICE_USER
WorkingDirectory=$INSTALL_BASE
Environment="HF_HUB_OFFLINE=1"
Environment="TRANSFORMERS_OFFLINE=1"
Environment="HF_HOME=$INSTALL_BASE/hf_cache"
Environment="PATH=$VENV_DIR/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=$VENV_DIR/bin/python $INSTALL_BASE/scripts/airgap_inference.py
Restart=on-failure
RestartSec=10
StandardOutput=append:$INSTALL_BASE/logs/inference.log
StandardError=append:$INSTALL_BASE/logs/inference_error.log

[Install]
WantedBy=multi-user.target
SVCEOF

sudo systemctl daemon-reload
sudo systemctl enable ai-inference
echo " Systemd service installed: OK"

# Step 7: Start service and verify
echo "[7/7] Starting inference service ..."
sudo systemctl start ai-inference
sleep 5

if systemctl is-active --quiet ai-inference; then
echo " Service status: RUNNING"
echo ""
echo "Deployment complete."
echo "Inference endpoint: http://localhost:$PORT"
echo "Logs: $INSTALL_BASE/logs/inference.log"
else
echo " Service status: FAILED"
echo " Check logs: journalctl -u ai-inference -n 50"
exit 1
fi

Production Engineering Notes

Logging and Audit Trails

Air-gapped deployments need complete local audit trails since you cannot rely on cloud logging services. Every inference request should be logged with enough detail for compliance review:

# audit_logger.py
import json
import hashlib
import time
from datetime import datetime, timezone
from pathlib import Path

class AuditLogger:
    """
    Immutable append-only audit log for LLM inference calls.
    Each entry is hash-chained to the previous for tamper detection.
    """
    def __init__(self, log_path: str):
        self.log_path = Path(log_path)
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
        self._prev_hash = "genesis"

    def log_inference(
        self,
        user_id: str,
        prompt_hash: str,  # SHA256 of prompt - do NOT log prompt text in PHI contexts
        response_tokens: int,
        latency_ms: float,
        model_id: str,
    ):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "prompt_hash": prompt_hash,
            "response_tokens": response_tokens,
            "latency_ms": round(latency_ms, 2),
            "model_id": model_id,
            "prev_hash": self._prev_hash,
        }

        entry_str = json.dumps(entry, sort_keys=True)
        entry_hash = hashlib.sha256(entry_str.encode()).hexdigest()
        entry["entry_hash"] = entry_hash
        self._prev_hash = entry_hash

        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

        return entry_hash
note

In HIPAA contexts, you should log prompt hashes (not the prompt text itself) unless the log store is itself HIPAA-compliant. A SHA256 hash of the prompt allows you to correlate audit records without exposing PHI in the log file.
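Putting the two together, a calling application hashes the prompt itself before logging. A minimal usage sketch of the AuditLogger above, with example paths and IDs:

import hashlib
import time

from audit_logger import AuditLogger

logger = AuditLogger("/opt/ai/logs/audit.jsonl")

prompt = "Summarize the attached discharge note."  # may contain PHI - never logged verbatim
start = time.monotonic()
# ... run inference here ...
response_text = "..."

logger.log_inference(
    user_id="clinician-042",
    prompt_hash=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    response_tokens=len(response_text.split()),  # or the exact count from the tokenizer
    latency_ms=(time.monotonic() - start) * 1000,
    model_id="llama-3.2-3b-instruct",
)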

Model Update Procedures

When a new model version needs to be deployed to an air-gapped environment, the process must be documented and auditable:

  1. Download new model weights on internet-connected preparation machine
  2. Generate new manifest and checksums
  3. Record the model version, source commit hash, and download date in a change log
  4. Obtain security officer approval (for government / defense contexts)
  5. Transfer via sneakernet using the procedures above
  6. Verify checksums on arrival
  7. Deploy to staging first, run smoke tests
  8. Promote to production with a documented rollback procedure
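Steps 7 and 8 can be implemented with a versioned directory layout and a symlink that the service always loads; promotion repoints the symlink and rollback repoints it back. A sketch, assuming the directory layout and service name used earlier in this lesson (version names are examples):

# Versioned layout: /opt/ai/models/llama-3.2-3b-instruct-v2024.07/, ...
# The inference service always loads /opt/ai/models/current

NEW_VERSION="/opt/ai/models/llama-3.2-3b-instruct-v2024.07"   # example path

# Stage: verify checksums (and run smoke tests) against the new directory first
python3 /opt/ai/scripts/verify_airgap_manifest.py "$NEW_VERSION"

# Promote: repoint the symlink, then restart the service
sudo ln -sfn "$NEW_VERSION" /opt/ai/models/current
sudo systemctl restart ai-inference

# Rollback is the same operation pointed at the previous version
# sudo ln -sfn /opt/ai/models/llama-3.2-3b-instruct-v2024.04 /opt/ai/models/current
# sudo systemctl restart ai-inference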

Common Mistakes

danger

Forgetting trust_remote_code=False

Many models on HuggingFace ship with custom model code in modeling_*.py files. Loading these with trust_remote_code=True executes arbitrary Python code from the model repository. In an air-gap context where you have done a careful chain-of-custody transfer, you then execute unvetted code. Always use trust_remote_code=False and use only models with standard architectures (LLaMA, Mistral, Gemma) that do not require remote code execution.

danger

Incomplete dependency audit before air-gap

The most common failure mode in air-gap deployment: you transfer the model and pip cache, install everything, start the inference server, and it crashes because a transitive dependency was missed. pip download fetches transitive dependencies, but only for the platform you run it on. If your preparation machine is x86-64 Linux and your target is also x86-64 Linux, you are fine. If they differ, you will get wrong-platform wheels. Always run pip download on a machine with the same OS and architecture as the target, or use --platform flags explicitly.

warning

Setting HF_HUB_OFFLINE=1 but not testing it

It is easy to set the environment variable and assume everything is offline. But if your code calls hf_hub_download() directly anywhere - in a custom script, a LangChain integration, or a model's config.json processing - you may still get network calls depending on the library version. Always test your full inference stack in a network-blocked environment (e.g., a VM with all outbound network disabled via iptables -P OUTPUT DROP) before deploying to actual air-gap hardware.
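A minimal sketch of such a network-blocked smoke test on a disposable staging VM (not your workstation), using a default-deny iptables policy:

# WARNING: this drops ALL outbound traffic - run on a throwaway staging VM only

sudo iptables -P OUTPUT DROP                 # default-deny every outbound packet
sudo iptables -A OUTPUT -o lo -j ACCEPT      # keep loopback so local services still work

# Run the full inference path; any hidden network call now fails fast
python3 /opt/ai/scripts/airgap_inference.py

# Restore normal networking on the staging VM afterwards
sudo iptables -P OUTPUT ACCEPT
sudo iptables -D OUTPUT -o lo -j ACCEPT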

warning

Not verifying checksums end-to-end

Bit rot on physical media is real. USB drives can develop bad sectors. Encrypted drives have firmware bugs. A model file with a single corrupted byte will either fail to load silently or produce garbled outputs with no obvious error. Always verify SHA256 checksums after transfer and before loading the model, not just after download. The verification script takes 5-10 minutes for a 7B model and has saved countless debugging sessions.


Interview Q&A

Q1: What is the difference between "offline mode" in HuggingFace and a true air-gapped deployment? When would each be appropriate?

HuggingFace offline mode (HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1) prevents the HuggingFace libraries from making network calls. The machine can still reach the internet via other channels - a developer could curl a URL, a logging library could phone home, a different process could establish a connection. It is a software-enforced constraint.

A true air-gapped deployment is a hardware-enforced constraint: there is no physical network path from the machine to the internet. The NIC may be disabled, the network switch port may be on an isolated VLAN with no uplink to the internet, or the machine may literally have no network connection at all. No software running on the machine can establish an outbound internet connection because the packets have nowhere to go.

Offline mode is appropriate for: compliance postures where the primary concern is accidental data leakage via AI libraries, developer machines in a corporate environment, or cost control (avoiding accidental downloads). Air-gap is appropriate for: classified environments, ITAR-controlled facilities, environments where the threat includes a compromised application that might exfiltrate data, or environments with strict regulatory requirements that mandate network isolation.

Q2: A government client requires FIPS 140-2 compliance for their LLM deployment. What does that actually mean in practice, and what stack components need to be validated?

FIPS 140-2 specifies requirements for cryptographic modules. In practice, for an LLM deployment, the relevant components are:

  • Operating system crypto layer: The OS must run in FIPS mode (RHEL 9, Ubuntu 22.04 with FIPS module, or equivalent). This ensures that all kernel-level crypto operations use FIPS-validated algorithms.
  • TLS for the inference API: The HTTP server (uvicorn, nginx, etc.) must be configured to use FIPS-approved cipher suites (AES-256-GCM, AES-128-GCM) and disallow non-FIPS ciphers (RC4, DES, 3DES, export ciphers).
  • At-rest encryption for model weights and logs: If stored on encrypted volumes, the encryption must use FIPS-validated implementations (AES-256-XTS for disk encryption, LUKS2 with OpenSSL FIPS backend on Linux).
  • Authentication: Any API keys, tokens, or certificates used to authenticate to the inference endpoint must use FIPS-approved algorithms.

Note what FIPS does NOT cover: the correctness or safety of the model's outputs, the security of the model weights themselves, or the application logic. FIPS is purely about cryptographic operations. A FIPS-compliant deployment can still have insecure application code.

Q3: How do you handle model updates in an air-gapped environment without disrupting production?

The key is treating model weights like immutable artifacts with a formal change control process. The pattern is:

  1. Maintain a versioned model registry (even just a directory structure with version numbers) on the internal NAS.
  2. New model versions are transferred via the standard sneakernet procedure into a staging area, not directly into production.
  3. Run automated smoke tests against the new version: can it load without errors, does it produce coherent output, do latency benchmarks fall within acceptable bounds?
  4. Promote to production via a symlink swap or a blue-green deployment in Kubernetes (two deployments, switch the service selector). This allows instant rollback.
  5. Keep the previous version on disk for at least 30 days before deletion, with a documented retention policy.

For zero-downtime updates in Kubernetes, you run two deployments simultaneously, shift 10% of traffic to the new version, verify behavior, then shift 100%. The model storage is read-only via PVC, so both versions coexist on the NAS.

Q4: What are the main security risks specific to air-gapped LLM deployments that engineers often overlook?

Three risks that are consistently underestimated:

First, model tampering before transfer. The air-gap protects against runtime data exfiltration but not against a malicious model file. An attacker with access to the preparation machine (or the download source) could modify model weights to produce backdoored outputs - answering certain trigger prompts in predetermined ways or subtly biasing outputs. SHA256 verification mitigates this only if you verify against a trusted source. Ideally, verify the checksum against the model card on HuggingFace AND against a checksum provided by the model author via a separate channel.

Second, output channel exfiltration. The model cannot send data out, but the application using the model can. If a user submits a prompt and the response is displayed in a browser, and that browser has internet access, a malicious application could encode data into the response content that the user then unintentionally transmits. This is a data diode problem - inputs come in, outputs go out, but outputs should be audited. Full output logging with manual or automated review helps.

Third, physical access. Air-gap security is only as strong as physical security. A USB drive plugged into the inference server by a malicious insider bypasses all network controls. Physical access controls, port blocking (disabling USB ports in BIOS or using port blockers), and tamper-evident seals are part of a complete air-gap security posture.

Q5: How do you handle tokenizer updates and security patches when the deployment is air-gapped?

This is one of the most underappreciated operational challenges. Tokenizer vulnerabilities exist - there have been CVEs related to certain tokenizer implementations that could cause crashes or unexpected behavior. In a cloud deployment, you update the library and restart. In an air-gap deployment, every update is a full sneakernet procedure.

The practical answer is: establish a regular update cadence (quarterly is common for government deployments) where you prepare a full updated pip cache on a preparation machine, transfer it via the standard procedure, verify it in staging, and deploy to production. This is the same cadence as OS patch management in air-gapped environments.

For critical security vulnerabilities, you need an emergency out-of-band procedure - typically a dedicated security officer who can approve and execute an expedited transfer outside the normal quarterly cycle. Document this procedure before you need it.

The deeper lesson is: air-gapped deployments do not eliminate the need for updates. They make updates slower and more expensive. This cost should be factored into the build-vs-buy decision when evaluating whether to build an air-gapped AI deployment.

Q6: Walk through what TRANSFORMERS_OFFLINE=1 actually does in the HuggingFace codebase. What network calls does it block?

The TRANSFORMERS_OFFLINE flag, when set to "1", is checked in several places in the transformers library source:

  • In AutoConfig.from_pretrained: prevents fetching the config from the HuggingFace Hub if only a model ID (not a local path) is provided
  • In AutoTokenizer.from_pretrained: prevents downloading tokenizer files from the Hub
  • In AutoModel.from_pretrained: prevents downloading model weights from the Hub
  • In model-specific config classes: prevents fetching configuration updates or special token definitions

HF_HUB_OFFLINE (at the huggingface_hub library level) is broader - it also prevents hf_hub_download(), list_repo_files(), and other Hub API calls that might be made by LangChain integrations or custom code.

The key thing both flags do NOT block: explicit requests or urllib calls in your own code or in third-party libraries that do not use HuggingFace APIs. A complete offline guarantee requires either network-layer blocking (firewall rules or the actual air-gap) or a comprehensive audit of every network call made by your entire dependency tree. The environment variables are a best-effort control, not a hard guarantee.


Summary

Air-gapped LLM deployment is not exotic - it is the required architecture for any organization that handles regulated data and wants to use AI. The key points:

  • Identify which regulations apply (HIPAA, ITAR, GDPR, FIPS) before choosing a deployment pattern
  • Download everything before air-gapping: model weights, tokenizer files, all Python packages
  • Generate and verify SHA256 checksums at every transfer step - before and after physical media transfer
  • Set HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1, and local_files_only=True in code as belt-and-suspenders
  • Test your full stack with outbound network blocked before trusting it on actual air-gapped hardware
  • Build update procedures and audit logging before going to production - they are not optional in regulated environments
  • Physical security is part of air-gap security - USB port lockdown, tamper-evident seals, access controls

The operational overhead is real. The alternative - a data breach in a regulated context - is worse by orders of magnitude. The Samsung incident cost reputational capital that took years to recover. A HIPAA breach can cost $1.9 million per incident under the 2024 penalty structure. Air-gap deployment is expensive. Breaches are more expensive.
