108 docs tagged with "ai-engineering"

Active Learning

Selecting the most informative samples for labeling - uncertainty sampling, diversity strategies, query-by-committee, and LLM-based active learning for text classification.

Advanced Prompting Techniques

Master self-refinement, Tree of Thought, ReAct, meta-prompting, and other advanced techniques for reliable, sophisticated LLM behavior in production.

Adversarial Examples

Crafting inputs that reliably cause model failures - attack techniques, transferability, and robust defense strategies for production AI systems.

Agentic RAG

Build RAG systems that reason, iterate, and self-correct - covering Self-RAG, FLARE, ReAct tool-augmented RAG, RAPTOR, and Corrective RAG with full production implementations using the Anthropic SDK.

AI Engineering - Production Track

The complete engineering track for building, shipping, and operating production AI systems - LLMOps, observability, gateways, synthetic data, compression, and security.

AI Error Handling and Fallbacks

Graceful degradation, retry logic, circuit breakers, fallback model chains, and user-facing error messages for production AI systems.

AI Product Design Principles

Principles for designing AI products that build trust, degrade gracefully, and solve the last-mile problem between model capability and user value.

AI Security Governance

Organizational security policies, risk classification frameworks, compliance programs, lifecycle governance, model cards, incident response, and vendor risk management for responsible AI system deployment.

Alerting on LLM Quality Degradation

Build production alerting systems for LLM quality - threshold alerts, statistical process control, anomaly detection, deployment correlation, runbooks, and Prometheus/Grafana integration.

Annotation Pipelines

Data labeling workflows, annotation guidelines, inter-annotator agreement, conflict resolution, and quality control for training data that powers AI systems.

Async LLM Calls

Asynchronous LLM call patterns for high-throughput applications - concurrency control with semaphores, producer-consumer queues, token bucket rate limiting, circuit breakers, and async orchestration patterns.

AWQ: Activation-Aware Weight Quantization

AWQ protects the 1% of weights that matter most - how activation statistics reveal salient weights, how scaling preserves them without extra memory, why AWQ outperforms GPTQ at INT4 for production inference, and how to configure Marlin kernels for maximum throughput.

Batch Processing with LLMs

Efficiently processing large document sets with LLM batch APIs - Anthropic Batch API, cost optimization, monitoring, checkpointing, and production patterns for overnight and large-scale LLM workloads.

Benchmarking Compressed Models

How to systematically evaluate accuracy-efficiency tradeoffs in quantized, pruned, and distilled models - perplexity, task-specific capabilities, latency, throughput, and automated regression detection.

Building Golden Datasets

Learn how to construct, annotate, validate, and maintain golden datasets that serve as the ground truth foundation for all AI system evaluation - covering annotation guidelines, inter-annotator agreement, adversarial generation, dataset versioning, and drift detection.

Context Management at Scale

Managing context windows, conversation history, and state across sessions - sliding window, summarization compression, hierarchical context, KV cache management, and context budget allocation for production LLM systems.

Continuous Eval in CI/CD

Design and implement a full CI/CD pipeline for AI systems - covering PR-level linting, merge-level regression, pre-deployment evaluation gates, production monitoring with statistical process control, anomaly detection, automated rollback, and observability tracing from query to feedback.

Cost Management and Budget Alerts

Track LLM spend per user, team, and feature in real time. Enforce hard budget limits and trigger alerts before costs spiral - because the invoice arrives 30 days too late.

Cost Optimization Patterns

Practical LLM cost reduction - semantic caching, model routing, prompt compression, Anthropic prompt caching, output length control, cost attribution, and monitoring for production AI systems.

Data Poisoning

Attacks that corrupt training or fine-tuning data to embed backdoors, trigger unexpected behaviors, or degrade model performance in production.

Data Quality and Filtering

Systematic approaches to filtering synthetic data for quality, diversity, safety, and alignment - the layered pipeline that separates fine-tuned models that work from models that regress.

Dataset Curation for Fine-Tuning

How to build high-quality fine-tuning datasets - sourcing, deduplication, quality filtering, LLM-as-judge scoring, and a complete curation pipeline. Why 5K curated examples beat 500K raw ones.

Distillation Datasets

Building distillation datasets: capturing frontier model knowledge, reasoning traces, and calibration into training data for smaller, efficient models - from Orca to Phi.

Document Ingestion and Chunking

Master every chunking strategy from fixed-size to semantic and structure-aware splitting. Learn how to parse PDFs, DOCX, and HTML, enrich metadata, evaluate chunk quality, and build a production-grade ingestion pipeline.

Embedding Models in Production

How to choose, deploy, and manage embedding models at scale - including versioning, caching, batching, and migration strategies for production RAG systems.

Escalation and Handoff Patterns

Designing AI systems that know when to stop and hand off to humans - confidence thresholds, sentiment detection, topic-based routing, context transfer, and escalation orchestration.

Evaluation-Driven Development

Building AI systems test-first - write evals before writing prompts. The EDD loop, eval strategies, golden dataset construction, LLM-as-judge calibration, and a full EvalSuite implementation ready for CI integration.

Evol-Instruct

Evol-Instruct: systematically evolving instruction datasets to create complex, diverse training data that produces stronger instruction-following models - the technique behind WizardLM and WizardCoder.

Feedback Collection for LLM Systems

Build production-grade feedback collection systems for AI products - explicit signals, implicit behavioral signals, data schemas, bias mitigation, and closed-loop improvement pipelines.

Fine-Tuning Pipelines

End-to-end fine-tuning pipeline engineering - from data collection and curation to training, evaluation, and deployment. When to fine-tune vs RAG vs prompt engineering, and how to build the pipeline that makes it repeatable and production-safe.

GPTQ: Post-Training Quantization

GPTQ explained from first principles - how Hessian-based error compensation quantizes 175B models to 4-bit in hours, the role of calibration data, group size, activation reordering, and how to deploy GPTQ models in production with vLLM and AutoGPTQ.

Handling LLM Latency

Perceived latency, progressive rendering, streaming, prompt caching, and UX patterns for making slow AI responses feel fast.

Human Feedback Collection

Collecting preference data, thumbs ratings, and corrections for RLHF pipelines - preference interface design, feedback quality controls, DPO data formats, and Elo-based model ranking.

Hybrid Search and Reranking

How to combine BM25 sparse retrieval with dense vector search using Reciprocal Rank Fusion, and how to apply cross-encoder reranking for precision that neither method achieves alone.

Idempotency and Retries

Making LLM-powered workflows robust with idempotency keys, smart retries, distributed deduplication, workflow state persistence, and failure-tolerant pipeline design for production AI systems.

Jailbreaks and Bypasses

Taxonomy of jailbreak techniques, why they work, evaluation frameworks, and layered defense strategies for production LLM systems.

Knowledge Distillation for LLMs

Training smaller student models to match larger teacher models - soft labels, temperature scaling, intermediate representation matching, API-based distillation, and a complete production pipeline for task-specific deployment.

Langfuse - Open-Source LLM Observability

Master Langfuse for production LLM observability - self-hosted tracing, evaluation datasets, prompt management, cost attribution by feature, and full data sovereignty for regulated industries.

LangSmith Deep Dive

Master LangSmith for LLM observability - production tracing, dataset curation, evaluation pipelines, prompt versioning, annotation queues, and deployment gating for AI systems.

LiteLLM

Deploy LiteLLM as a universal LLM proxy supporting 100+ providers. Configure routing, load balancing, fallbacks, semantic caching, and cost tracking through a single OpenAI-compatible endpoint.

LLM as Data Generator

Use frontier LLMs to generate high-quality instruction-following, reasoning, and preference datasets - sampling strategies, diversity maximization, and quality vs. quantity tradeoffs.

LLM CI/CD

CI/CD pipelines for LLM applications - handling non-deterministic outputs with LLM-judge gates, canary deployments with quality monitoring, automated rollback triggers, and full GitHub Actions implementation.

LLM-as-Judge

Build calibrated, bias-corrected LLM judges that approximate human judgment at scale - pointwise scoring, pairwise comparison, bias mitigation, and ensemble techniques.

LLMOps Platforms

Comprehensive guide to LLMOps platforms - LangSmith, Langfuse, W&B Weave, Arize Phoenix, Helicone, and PromptLayer. When to build vs buy, integration patterns, abstraction layers, and production-grade Python examples using the Anthropic SDK.

Load Balancing Across Providers

Distribute LLM traffic across multiple API keys and providers using round-robin, weighted, least-connections, and latency-based routing to scale throughput beyond single-key limits.

LoRA for Efficient Fine-Tuning

LoRA and QLoRA: fine-tune 70B models on a single GPU by freezing the base model and training only small low-rank adapter matrices - the technique that democratized LLM customization.

Measuring AI Product Quality

Build a production-grade quality measurement system for AI products using explicit feedback, implicit behavioral signals, LLM-as-judge, and composite scoring.

Measuring HITL Effectiveness

End-to-end metrics for human-in-the-loop systems - false positive/negative rates, confidence calibration, inter-rater reliability, reviewer performance tracking, ROI computation, and system-level effectiveness dashboards.

Membership Inference

Determining whether specific data was used in model training - privacy risks, attack techniques, and defenses for production ML systems.

Model Extraction

Querying a model API to reconstruct its weights, replicate its behavior, or steal proprietary training data through systematic probing.

Model Fallback and Retry

Design resilient LLM clients with configurable fallback chains, exponential backoff with jitter, and circuit breakers that handle provider failures gracefully without any user-facing impact.

Module 01: LLMOps - Overview

An overview of LLMOps - the engineering discipline for building, shipping, and operating production LLM applications reliably and at scale.

Module 03: LLM Gateways

Learn how to build and operate a production LLM gateway - the unified infrastructure layer for routing, caching, cost control, and observability across every AI service your team runs.

Module 04: Synthetic Data

Learn to generate, filter, and use synthetic training data at scale - from Self-Instruct bootstrapping to Evol-Instruct complexity evolution, distillation datasets, and RAG evaluation corpora.

Module 05: Model Compression

Master the full spectrum of model compression techniques - quantization, pruning, distillation, and LoRA - to deploy large language models efficiently in production.

Module 07: Production AI Patterns

Battle-tested engineering patterns for deploying LLM applications at scale - context management, streaming, async calls, batching, retries, cost optimization, multi-tenancy, and AI product architecture.

Module 08: AI Product Engineering

Design, build, and ship AI-powered products that users trust - streaming UX, latency management, error handling, rollout strategies, personalization, and quality measurement.

Module 09: Human-in-the-Loop

Master human-in-the-loop AI systems - annotation pipelines, active learning, feedback collection, escalation patterns, and measuring HITL effectiveness.

Module 06: AI Security

Comprehensive coverage of AI security threats, attack vectors, and defenses for production AI systems.

Offline vs. Online Evaluation

Design an evaluation strategy that bridges static datasets and production signals - A/B testing, shadow evaluation, implicit signals, and the evaluation flywheel.

OpenTelemetry for AI Systems

Apply OpenTelemetry to AI and LLM applications - GenAI semantic conventions, auto-instrumentation, OTel Collector routing, sampling strategies, context propagation through async queues, and multi-backend production setups.

Personalisation and Memory

User preference learning, conversation memory architecture, and personalised AI experiences that persist across sessions.

Portkey

Use Portkey as a managed LLM gateway with built-in observability, virtual keys, guardrails, request tracing, feedback collection, and automated fallbacks across Claude, GPT-4o, and 250+ providers.

Privacy and Ethics in Synthetic Data

Copyright exposure, memorization risks, differential privacy, bias auditing, terms-of-service compliance, and the governance processes required for defensible synthetic data pipelines.

Prompt Debugging Methodology

Systematic methodology for diagnosing and fixing prompt failures - isolation, ablation, root cause analysis, and building a regression test suite.

Prompt Design Fundamentals

Master the first principles of prompt engineering - clarity, specificity, task framing, structural markers, and the systematic principles behind effective LLM instructions.

Prompt Injection

How prompt injection attacks work, why they are the most critical AI vulnerability in production, and how to defend against them with layered mitigations.

Prompt Injection Defense

Understand prompt injection attack taxonomy, detection strategies, defense layers, and sanitization techniques for production LLM systems.

Prompt Templates and Composition

Build maintainable, production-grade prompt systems with Jinja2 templates, variable injection, modular composition, and reusable prompt libraries.

Prompt UX Patterns

Prompt scaffolding, slash commands, context transparency, and mode switching in production AI interfaces.

Prompt Versioning

Treating prompts as first-class code artifacts - versioning, branching, review gates, A/B testing, and rollback for production LLM prompts. Build a complete prompt registry from scratch.

Prompt Versioning and Management

Treat prompts as code - semantic versioning, A/B testing, rollback strategies, and prompt registries for production LLM systems.

Quality Metrics in Production LLM Systems

Define, measure, and operationalize quality metrics for production LLM applications - faithfulness, answer relevance, hallucination rate, coherence, toxicity, BLEU vs LLM-as-judge, SLO definitions, and async evaluation pipelines.

Quantization Deep Dive

INT8, INT4, NF4, FP8, and block-wise quantization explained from first principles - how floating point becomes integer, what accuracy you lose, and how to tune quantization for production LLM inference.

Query Transformation and HyDE

Master query transformation techniques - HyDE, multi-query retrieval, step-back prompting, query decomposition, and routing - to solve the vocabulary mismatch problem that breaks naive RAG systems in production.

RAG Engineering - Module Overview

Build production-grade Retrieval-Augmented Generation systems - from document ingestion and chunking through hybrid search, query transformation, agentic RAG, and RAGAS evaluation.

RAG Evaluation and RAGAS

Build a continuous RAG evaluation pipeline using the RAGAS framework - faithfulness, answer relevance, context precision, and context recall - with full production implementations using the Anthropic SDK and automated regression detection.

RAG-Specific Evaluation

Master the full evaluation stack for Retrieval-Augmented Generation systems - covering RAGAS metrics, hallucination type classification, citation accuracy, retrieval precision/recall/nDCG, and production-grade benchmarking with complete Python implementations.

Rate Limiting and Quotas

Protect your LLM infrastructure from abuse and cost overruns with token bucket rate limiting and sliding window quotas per user, team, and feature - enforced at the gateway before any tokens are consumed.

Red Teaming AI Systems

Systematic adversarial testing of AI systems - methodology, automated red teaming, documentation, and building a continuous red team program.

Regression Testing for Prompts

Build a production-grade regression testing system for LLM prompts - covering test case design, LLM-as-judge pass/fail evaluation, flaky test detection, caching, differential testing, and CI gates that block regressions before they reach users.

Review Queues and Tooling

Building production review interfaces, priority queues, audit trails, reviewer dashboards, and HITL tooling - from Redis-backed queue management to Label Studio integration.

Securing RAG Systems

Attack surfaces unique to RAG architectures - document poisoning, retrieval hijacking, indirect prompt injection, embedding collision, cross-tenant leakage, and defense-in-depth strategies for production RAG deployments.

Self-Instruct

How the Self-Instruct paper bootstrapped instruction-following datasets from a tiny seed set using GPT-3, enabling the Alpaca era of aligned models - and how to implement it today.

Semantic Caching

Return cached LLM responses for semantically similar queries using embedding-based vector similarity. Cut costs 40–60% by never paying for the same question twice regardless of how it is phrased.

Streaming Responses

Implementing and optimizing streaming for real-time LLM response delivery - SSE, chunking strategies, backpressure, tool use streaming, and production patterns for perceived performance.

Streaming UX for LLMs

Server-sent events, streaming tokens, TTFT optimization, and building responsive AI chat interfaces that feel instant even under production load.

Structured Pruning

Remove entire attention heads, MLP neurons, and transformer layers to achieve real hardware latency improvements - with production-grade code for Taylor importance, angular distance layer scoring, iterative recovery, and combined compression pipelines.

Synthetic Data for RAG

Generating question-answer pairs, evaluation datasets, and retrieval test cases from documents to build, evaluate, and systematically improve RAG systems.

System Prompts and Personas

Design production-grade system prompts and AI personas - the 6-component anatomy, dynamic context injection, behavioral constraints, tone configuration, and persona stability testing.

Tracing LLM Applications

What tracing means for LLM apps - capturing every prompt, completion, latency, cost, and error in queryable traces. Why traditional APM fails for AI, OpenTelemetry GenAI semantic conventions, and a complete production-grade tracer implementation.

Unstructured Pruning

Weight-level sparsity, the Lottery Ticket Hypothesis, SparseGPT, Wanda, and 2:4 structured sparsity - why unstructured pruning is theoretically elegant but practically limited for LLMs.

Vector Search in Practice

How approximate nearest neighbor search works, how to choose the right vector database, and how to build production-grade retrieval pipelines that stay fast at millions of documents.

What is LLMOps

LLMOps defined - the operational discipline for managing LLM-powered applications in production, why it differs from MLOps, and the full lifecycle every AI engineering team must master.

Why AI Evaluation Is Hard

Understanding the fundamental gap between software testing and AI evaluation - non-determinism, no oracle, emergent failures, and how to build a multi-layered evaluation strategy.

Why an LLM Gateway

The case for centralizing all LLM traffic through a single gateway layer - routing, cost control, fallbacks, and observability without rewriting application code.

Why Human-in-the-Loop Matters

Understand why full automation fails, where the alignment gap lives, what regulations demand, and how to design the right level of human oversight for any AI system.

Why Model Compression Matters

The memory wall, inference costs, edge deployment, and latency requirements that make model compression essential for production AI systems - with real cost math, a full compression taxonomy, and decision frameworks for choosing the right technique.

Why RAG Exists

Understand why Retrieval-Augmented Generation was invented, what problems it solves that fine-tuning and prompt stuffing cannot, and how to architect a minimal RAG pipeline from scratch.

Why Synthetic Data

Understand why synthetic data has become central to AI engineering - the labeled data bottleneck, privacy constraints, rare events, LLMs as generators, landmark case studies, and when synthetic beats real.