Reading time: ~40 min | Interview relevance: Critical | Roles: AI Engineer, LLM Engineer, GenAI Engineer, Applied AI Engineer
The AI Engineer role barely existed two years ago. Now it is one of the hottest positions in tech, with companies scrambling to build LLM-powered products. But because the role is new, interview formats vary wildly. One company might ask you to build a RAG pipeline live, another might quiz you on prompt engineering strategies, and a third might focus on traditional coding with an LLM twist.
This list of 45 problems covers the full spectrum of what AI Engineer candidates face. It emphasizes the unique skills that define this role: working with large language models, building retrieval-augmented systems, designing AI agents, and shipping production AI applications.
AI Engineer Interview Structure
| Round | Duration | What They Test | Weight |
|---|---|---|---|
| Coding | 45-60 min | DSA + API design + LLM integration | 20-25% |
| LLM / AI Depth | 45-60 min | Prompt engineering, RAG, fine-tuning, evaluation | 25-30% |
| System Design | 45-60 min | Production AI systems, architecture | 25-30% |
| Take-Home / Live Build | 2-4 hours | Build an AI feature end-to-end | 15-20% |
| Behavioral | 30-45 min | Product sense, collaboration, shipping velocity | 10% |
:::tip The AI Engineer Differentiator
AI Engineers are builders, not researchers. Interviewers care about: Can you ship an AI-powered feature? Can you make it reliable? Can you iterate quickly? Deep ML theory is less important than practical system-building skills.
:::
Round 1: Coding & API Problems (12 Problems)
AI Engineer coding rounds emphasize API design, data transformation, and LLM integration over pure algorithms.
Core Coding
| # | Problem | Difficulty | Time | Key Pattern | Why AI Engineers Need It | Company Tags |
|---|---|---|---|---|---|---|
| 1 | Design a REST API for a Chat Application | Medium | 25 min | RESTful design, WebSocket for streaming | AI apps need clean APIs for model interaction | OpenAI, Anthropic, Startups |
| 2 | Implement a Token Counter and Text Chunker | Medium | 20 min | Tokenization, sliding window chunking | Core building block of any RAG system | AI Labs, Big Tech |
| 3 | Build a Retry Handler with Exponential Backoff | Easy | 15 min | Error handling, backoff strategy | LLM APIs fail; robust error handling is essential | All |
| 4 | Implement a Concurrent API Call Manager | Medium | 25 min | Async/await, rate limiting, batching | Parallel LLM calls for throughput | Startups, Big Tech |
| 5 | Parse and Validate Structured Output from an LLM | Medium | 20 min | JSON parsing, schema validation, error recovery | LLMs return messy outputs; parsing is critical | All |
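For problem 3, a minimal sketch of a retry handler with exponential backoff might look like the following. The function name, the default delays, and the "full jitter" choice (a random wait up to the exponential cap) are illustrative assumptions, not a prescribed interview answer. The `sleep` parameter exists only so the wait can be swapped out in tests.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again.

    The wait before retry n (0-based) is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)], i.e. exponential backoff
    with full jitter. All defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

In an interview it is worth saying which errors are retryable (timeouts, rate limits) and which are not (validation errors), and passing only the former through `retry_on`.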
LLM-Flavored Coding
| # | Problem | Difficulty | Time | Key Pattern | Why AI Engineers Need It | Company Tags |
|---|---|---|---|---|---|---|
| 6 | Implement a Prompt Template Engine | Medium | 25 min | String interpolation, variable injection, escaping | Prompts are code; they need proper templating | Startups, AI Labs |
| 7 | Build a Simple Vector Search with Cosine Similarity | Easy | 20 min | Embedding comparison, top-K retrieval | Foundation of semantic search and RAG | All |
| 8 | Implement a Conversation Memory Manager | Medium | 25 min | Sliding window, summarization triggers, token budgeting | Long conversations exceed context windows | Anthropic, OpenAI, Startups |
| 9 | Build a Function-Calling Router | Medium | 30 min | Intent classification, argument extraction, dispatch | Agentic systems need function routing | OpenAI, Anthropic, Cohere |
| 10 | Implement a Response Streaming Handler | Medium | 20 min | Server-sent events, token-by-token processing | Streaming is standard for LLM UX | All |
| 11 | Build a Simple Evaluation Harness | Medium | 25 min | Test case management, metric computation, result aggregation | Evaluation drives AI product quality | AI Labs, Startups |
| 12 | Implement Document Deduplication with MinHash | Hard | 30 min | Locality-sensitive hashing, Jaccard similarity | Data quality for RAG knowledge bases | Google, Databricks |
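Problem 7 can be sketched in plain Python as a brute-force scan. The function names and the `corpus` shape (a dict from document id to embedding vector) are assumptions made for illustration. A production system would typically use a vector index rather than scoring every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return the k (doc_id, score) pairs most similar to query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A natural follow-up is complexity: this is O(N * d) per query, which is the usual motivation for approximate nearest-neighbor indexes.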
Round 2: LLM & AI Depth Problems (18 Problems)
These problems test your understanding of how LLMs work, how to use them effectively, and how to build reliable AI systems.
Prompt Engineering & Design
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 13 | Design a Prompt Chain for Complex Document Summarization | Medium | 25 min | Multi-step prompting, context management | Complex tasks require decomposition | All |
| 14 | Implement Few-Shot Classification with Dynamic Example Selection | Medium | 25 min | Embedding similarity for example retrieval | Static examples underperform dynamic selection | AI Labs, Startups |
| 15 | Design a Prompt for Structured Data Extraction from Unstructured Text | Medium | 20 min | Output formatting, schema enforcement | Information extraction is a top AI use case | All |
| 16 | Build a Self-Correcting Prompt Pipeline | Hard | 30 min | Output validation, error feedback, retry with corrections | LLMs make mistakes; self-correction improves reliability | Anthropic, OpenAI |
| 17 | Compare and Evaluate Prompt Strategies for a Classification Task | Medium | 25 min | A/B testing prompts, metric comparison | Systematic prompt optimization beats guessing | All |
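The validate, feed back, and retry pattern from problem 16 can be sketched as below. The `llm` argument is a hypothetical callable that takes a prompt string and returns a string; it stands in for whatever client you use. The validation rule (a JSON object with required keys) is also an assumption chosen to keep the example small.

```python
import json


def extract_with_self_correction(llm, prompt, required_keys, max_rounds=3):
    """Ask llm for a JSON object; on invalid output, feed the error back and retry."""
    message = prompt
    for _ in range(max_rounds):
        raw = llm(message)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("top-level value must be a JSON object")
            missing = [key for key in required_keys if key not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            # Show the model its own output and the reason it was rejected.
            message = (
                f"{prompt}\n\nYour previous answer was:\n{raw}\n"
                f"It was rejected because: {err}\n"
                "Reply with corrected JSON only."
            )
    raise RuntimeError(f"no valid output after {max_rounds} rounds")
```

Bounding the loop with `max_rounds` matters: each round costs tokens, and a model that fails three times on the same input rarely succeeds on the fourth.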
RAG Systems
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 18 | Design a RAG Pipeline for a Technical Documentation System | Medium | 35 min | Chunking, embedding, retrieval, generation | The canonical AI Engineer system design problem | All |
| 19 | Implement Hybrid Search (Keyword + Semantic) | Medium | 25 min | BM25 + embedding search, reciprocal rank fusion | Semantic search alone misses keyword matches | Google, Startups |
| 20 | Design a Multi-Document QA System with Source Attribution | Hard | 35 min | Cross-document retrieval, citation generation | Users need to verify AI answers | Anthropic, Google, Startups |
| 21 | Handle Stale and Conflicting Information in a RAG System | Hard | 30 min | Temporal filtering, conflict resolution, freshness scoring | Real knowledge bases have contradictions | All |
| 22 | Evaluate RAG Quality: Design Metrics and Test Suite | Medium | 25 min | Faithfulness, relevance, completeness metrics | You cannot improve what you cannot measure | AI Labs, Big Tech |
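For problem 19, reciprocal rank fusion merges a keyword ranking and a semantic ranking without having to calibrate their raw scores against each other. The sketch below assumes each retriever returns a list of document ids, best first. The constant `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

With a BM25 ranking of `["d1", "d2", "d3"]` and an embedding ranking of `["d3", "d1", "d4"]`, documents found by both retrievers (`d1`, `d3`) rise above those found by only one.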
AI Agents
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 23 | Design an AI Agent with Tool Use Capabilities | Hard | 35 min | Planning, tool selection, execution, error recovery | Agents are the next frontier of AI applications | Anthropic, OpenAI, Startups |
| 24 | Implement a ReAct-Style Reasoning Agent | Medium | 30 min | Thought-action-observation loop | One of the most widely used agent architectures | AI Labs, Startups |
| 25 | Design a Multi-Agent Collaboration System | Hard | 35 min | Agent roles, communication protocol, consensus | Complex tasks benefit from specialized agents | AI Labs, Startups |
| 26 | Build an Agent with Memory and Learning | Hard | 30 min | Short-term context, long-term storage, retrieval | Persistent agents need memory management | Anthropic, OpenAI |
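The thought-action-observation loop from problem 24 can be sketched in a few lines. Everything here is an assumption made for illustration: `llm` is a hypothetical callable from transcript string to the next step, `tools` maps tool names to single-argument functions, and the `Action: name[input]` and `Final Answer:` text formats are just one possible convention.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(llm, tools, question, max_steps=5):
    """Alternate model steps and tool calls until a final answer or the step budget ends."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # expected to contain a Thought plus an Action or Final Answer
        transcript += step + "\n"
        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(step)
        if action is None:
            observation = "error: expected 'Action: tool[input]' or 'Final Answer: ...'"
        else:
            name, arg = action.groups()
            tool = tools.get(name)
            observation = tool(arg) if tool else f"error: unknown tool {name!r}"
        # Feed the tool result back so the next step can reason over it.
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer
```

Points worth raising in an interview: the step budget prevents infinite loops, malformed steps are reported back to the model instead of crashing, and a real system would also catch exceptions raised inside tools.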
Fine-Tuning & Model Customization
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 27 | Design a Fine-Tuning Pipeline for Domain Adaptation | Medium | 30 min | Data preparation, training config, evaluation | When prompting is not enough, fine-tuning is next | AI Labs, Big Tech |
| 28 | Compare Fine-Tuning vs. RAG vs. Prompt Engineering for a Use Case | Medium | 25 min | Tradeoff analysis: cost, quality, latency, maintenance | The most common AI architecture decision | All |
| 29 | Implement LoRA Fine-Tuning Configuration and Explain the Approach | Medium | 25 min | Parameter-efficient fine-tuning | LoRA is the standard for efficient fine-tuning | AI Labs, Startups |
| 30 | Design a Training Data Curation Pipeline for Fine-Tuning | Medium | 25 min | Data quality, diversity, deduplication, formatting | Garbage in, garbage out -- especially for fine-tuning | All |
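When explaining LoRA (problem 29), it helps to show the parameter arithmetic. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors instead, `B` of shape `d_out x r` and `A` of shape `r x d_in`. The layer size and rank below are illustrative values, not a recommendation.

```python
def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two LoRA factors: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


# Illustrative example: one 4096 x 4096 projection with rank 8.
d_in, d_out, rank = 4096, 4096, 8
trainable = lora_trainable_params(d_in, d_out, rank)  # 65,536
frozen = full_params(d_in, d_out)                     # 16,777,216
print(f"trainable fraction: {trainable / frozen:.4%}")  # 0.3906%
```

The fraction scales as `r * (d_in + d_out) / (d_in * d_out)`, so small ranks on large layers yield the biggest savings.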
:::warning The RAG vs. Fine-Tuning Question
This comes up in nearly every AI Engineer interview. Have a clear, nuanced framework:
- RAG when: the knowledge changes frequently, you need source attribution, or you want to avoid training
- Fine-tuning when: you need a consistent style or format, you need domain-specific reasoning, or latency matters
- Both when: the task needs domain knowledge and retrieval together (a fine-tuned model with RAG)
:::
Round 3: System Design Problems (15 Problems)
AI Engineer system design focuses on production AI applications -- how to make them reliable, fast, and cost-effective.
Production AI Systems
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 31 | Design a Customer Support Chatbot System | Medium | 40 min | Intent routing, knowledge retrieval, escalation | The most common AI Engineer project | All |
| 32 | Design a Code Review Assistant | Medium | 35 min | Code parsing, context retrieval, LLM analysis | Developer tools are a hot AI application area | GitHub, Anthropic, Google |
| 33 | Design a Content Generation Pipeline | Medium | 35 min | Template system, LLM generation, quality checks, human review | Content generation at scale needs guardrails | Jasper, Copy.ai, Big Tech |
| 34 | Design an AI-Powered Search Engine | Hard | 45 min | Query understanding, hybrid retrieval, LLM-augmented results | Next-gen search combines retrieval and generation | Google, Perplexity, You.com |
| 35 | Design a Document Processing and Analysis System | Medium | 35 min | OCR, parsing, extraction, classification, summarization | Enterprise AI = document processing | Amazon, Google, Startups |
Reliability & Safety
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 36 | Design a Guardrails System for LLM Output | Hard | 35 min | Input validation, output filtering, safety classification | Safety is non-negotiable for production AI | Anthropic, OpenAI, All |
| 37 | Design an LLM Evaluation and Monitoring Platform | Hard | 40 min | Automated eval, human eval, drift detection, alerting | Production AI needs continuous evaluation | AI Labs, Big Tech |
| 38 | Design a Cost Optimization Strategy for LLM Applications | Medium | 30 min | Caching, model routing, prompt optimization, batching | LLM API costs can spiral quickly | Startups, All |
| 39 | Design a Fallback and Degradation Strategy for AI Features | Medium | 25 min | Graceful degradation, fallback models, cached responses | AI systems fail; plan for it | All |
| 40 | Design a Prompt Versioning and Deployment System | Medium | 25 min | Version control, A/B testing, rollback | Prompts are code and need software engineering practices | AI Labs, Startups |
Scaling & Infrastructure
| # | Problem | Difficulty | Time | Key Pattern | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 41 | Design a Multi-Tenant AI Platform | Hard | 40 min | Isolation, resource allocation, custom models per tenant | B2B AI products serve many customers | Startups, Big Tech |
| 42 | Design a Semantic Caching Layer for LLM Responses | Medium | 30 min | Embedding-based cache keys, similarity threshold, invalidation | Reduce cost and latency for similar queries | Startups, AI Labs |
| 43 | Design an AI-Powered Data Pipeline | Medium | 30 min | LLM for transformation, validation, error handling | AI augments data engineering workflows | Databricks, Startups |
| 44 | Design a Real-Time AI Translation System | Hard | 40 min | Streaming translation, context preservation, quality monitoring | Real-time AI needs careful latency management | Google, Meta, Startups |
| 45 | Design an AI Feature Flag System | Medium | 25 min | Feature flags for AI capabilities, gradual rollout, metrics | Ship AI features safely with controlled exposure | All |
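The core of problem 42, a semantic cache keyed by embeddings with a similarity threshold, can be sketched as follows. The class name, the linear scan, and the `0.9` threshold are assumptions for illustration. `embed` is a hypothetical function from text to a vector that you would back with a real embedding model. Invalidation (TTL, versioning) is omitted here.

```python
import math


class SemanticCache:
    """Return a cached response when a new query is close enough to a stored one."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # hypothetical: text -> list of floats
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (unit_vector, response)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else list(vec)

    def get(self, query):
        q = self._normalize(self.embed(query))
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            # Dot product of unit vectors equals cosine similarity.
            score = sum(a * b for a, b in zip(q, vec))
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self._normalize(self.embed(query)), response))
```

The threshold is the main design lever: set it too low and users get answers to a different question, too high and the hit rate collapses. It is usually tuned against a labeled set of query pairs.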
4-Week AI Engineer Study Plan
| Week | Focus | Problems | Daily Load |
|---|---|---|---|
| Week 1 | Coding + API design | #1-12 | 2 problems/day |
| Week 2 | LLM depth + RAG | #13-22 | 2 problems/day |
| Week 3 | Agents + Fine-Tuning + System Design | #23-35 | 1-2 per day |
| Week 4 | Reliability + Polish | #36-45 + review | 1 problem + 1 mock/day |

:::tip Build, Don't Just Study
AI Engineer interviews often include live builds or take-home projects. Set up a development environment with an LLM API and actually build small projects as you study. Reading about RAG is not the same as building one.
:::
Key Frameworks for AI Engineer Interviews
RAG Architecture Decision Framework
| Dimension | Simple RAG | Advanced RAG | Fine-Tuned + RAG |
|---|---|---|---|
| Setup cost | Low | Medium | High |
| Maintenance | Low | Medium | High |
| Quality | Good for simple queries | Better for complex queries | Best for domain-specific |
| Latency | Medium (retrieval + generation) | Higher (multi-step) | Lower (no retrieval for common patterns) |
| When to use | MVP, proof of concept | Production systems | High-stakes, domain-specific |
LLM Selection Framework
| Factor | Small Models (7-13B) | Medium Models (30-70B) | Large Models (100B+) | API Models (GPT-4, Claude) |
|---|---|---|---|---|
| Latency | Very low | Low | Medium | Variable |
| Cost | Low (self-hosted) | Medium | High | Per-token |
| Quality | Good for narrow tasks | Good for most tasks | Great for complex tasks | State-of-the-art |
| Privacy | Full control | Full control | Full control | Data sent to provider |
| Best for | High-volume, simple | Balanced | Research, complex | Quick iteration, quality |
Evaluation Framework for AI Applications
| Metric | What It Measures | How to Compute |
|---|---|---|
| Faithfulness | Does the output match retrieved context? | LLM-as-judge or NLI model |
| Relevance | Is the output useful for the user's query? | Human eval or LLM-as-judge |
| Completeness | Does the output cover all relevant points? | Checklist comparison |
| Harmlessness | Is the output safe and appropriate? | Safety classifier + human review |
| Latency | How fast is the response? | P50, P95, P99 timing |
| Cost | How much does each query cost? | Token counting + API pricing |
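The two operational metrics in the table, latency and cost, reduce to simple arithmetic. The sketch below uses the nearest-rank percentile definition, which is one of several in common use. The per-token prices passed to `query_cost` in any example are placeholders that you would replace with your provider's actual pricing.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def query_cost(input_tokens, output_tokens, usd_per_1m_input, usd_per_1m_output):
    """Cost of one call in USD, given prices per one million tokens."""
    return (input_tokens * usd_per_1m_input
            + output_tokens * usd_per_1m_output) / 1_000_000
```

For ten latency samples such as `[120, 150, 180, 200, 250, 300, 400, 800, 1200, 3000]` milliseconds, P50 is 250 while P95 and P99 are both 3000, which shows why tail percentiles need far more samples than the median to be meaningful.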
AI Engineer vs. MLE: Problem Differences
| Dimension | AI Engineer Focus | MLE Focus |
|---|---|---|
| Coding | API design, integrations, parsing | Algorithm implementation, optimization |
| Models | Using LLMs effectively, prompt engineering | Training models, architecture design |
| Data | Retrieval, chunking, embedding | Feature engineering, data pipelines |
| Systems | AI application architecture | ML infrastructure, training systems |
| Evaluation | Output quality, safety, user satisfaction | Model metrics (AUC, F1, NDCG) |
| Deployment | API serving, caching, cost management | Model optimization, A/B testing |
Progress Tracker
| # | Problem | Status | Date | Time | Notes |
|---|---|---|---|---|---|
| 1 | Chat API Design | [ ] | | | |
| 2 | Token Counter & Chunker | [ ] | | | |
| 3 | Retry with Backoff | [ ] | | | |
| 4 | Concurrent API Manager | [ ] | | | |
| 5 | Structured Output Parser | [ ] | | | |
| 6 | Prompt Template Engine | [ ] | | | |
| 7 | Vector Search | [ ] | | | |
| 8 | Conversation Memory | [ ] | | | |
| 9 | Function-Calling Router | [ ] | | | |
| 10 | Response Streaming | [ ] | | | |
| 11 | Evaluation Harness | [ ] | | | |
| 12 | Document Deduplication | [ ] | | | |
| 13 | Prompt Chain Summarization | [ ] | | | |
| 14 | Dynamic Few-Shot | [ ] | | | |
| 15 | Data Extraction Prompt | [ ] | | | |
| 16 | Self-Correcting Pipeline | [ ] | | | |
| 17 | Prompt Strategy Evaluation | [ ] | | | |
| 18 | RAG Pipeline Design | [ ] | | | |
| 19 | Hybrid Search | [ ] | | | |
| 20 | Multi-Doc QA with Citations | [ ] | | | |
| 21 | Stale Info in RAG | [ ] | | | |
| 22 | RAG Quality Metrics | [ ] | | | |
| 23 | Agent with Tools | [ ] | | | |
| 24 | ReAct Agent | [ ] | | | |
| 25 | Multi-Agent System | [ ] | | | |
| 26 | Agent Memory | [ ] | | | |
| 27 | Fine-Tuning Pipeline | [ ] | | | |
| 28 | FT vs RAG vs Prompting | [ ] | | | |
| 29 | LoRA Configuration | [ ] | | | |
| 30 | Training Data Curation | [ ] | | | |
| 31 | Support Chatbot | [ ] | | | |
| 32 | Code Review Assistant | [ ] | | | |
| 33 | Content Generation Pipeline | [ ] | | | |
| 34 | AI Search Engine | [ ] | | | |
| 35 | Document Processing | [ ] | | | |
| 36 | LLM Guardrails | [ ] | | | |
| 37 | Eval & Monitoring Platform | [ ] | | | |
| 38 | Cost Optimization | [ ] | | | |
| 39 | Fallback Strategy | [ ] | | | |
| 40 | Prompt Versioning | [ ] | | | |
| 41 | Multi-Tenant AI Platform | [ ] | | | |
| 42 | Semantic Caching | [ ] | | | |
| 43 | AI Data Pipeline | [ ] | | | |
| 44 | Real-Time Translation | [ ] | | | |
| 45 | AI Feature Flags | [ ] | | | |
Next Steps
After completing the AI Engineer problem list:
© 2026 EngineersOfAI. All rights reserved.