What is review queue?

Building production review interfaces, priority queues, audit trails, reviewer dashboards, and HITL tooling - from Redis-backed queue management to Label Studio integration.

How does HITL tooling work in practice?

Review Queues and Tooling covers review queue, HITL tooling, Label Studio from first principles with code examples. Free lesson at https://engineersofai.com/docs/ai-engineering/human-in-the-loop/review-queues-and-tooling

What is the difference between review queue and Label Studio?

See the full breakdown at https://engineersofai.com/docs/ai-engineering/human-in-the-loop/review-queues-and-tooling

:::tip 🎮 Interactive Playground Visualize this concept: Try the Human Evaluation Process demo on the EngineersOfAI Playground - no code required. :::

Review Queues and Tooling

When the Queue Becomes the Problem

A social media platform deployed an AI content moderation system that automatically approved 92% of posts and flagged 8% for human review. On paper, this was a success: 92% automation meant dramatically reduced moderation cost and faster content clearance.

The math told a different story. The platform processed 80,000 posts per day. Eight percent flagged meant 6,400 human reviews per day. With a moderation team of 14 people working 8-hour shifts, each moderator needed to complete 457 reviews per day - one every 63 seconds. The average review included a post with user context, prior violation history, and the AI's rationale for flagging. Reading and understanding all of that in 63 seconds was not possible.

The team's response was predictable: they started skimming. Context was skipped. Prior history was ignored. Decisions were made based on the surface content alone. Within three months, two things happened simultaneously: the false rejection rate (legitimate content blocked) had tripled, generating user complaints. And a coordinated harassment campaign that relied on individually-borderline posts was approved wholesale because each individual post seemed marginal in isolation.

The review queue had become a bottleneck that defeated the purpose of the entire HITL system. The AI was making routing decisions that were theoretically correct, but the downstream human process could not absorb the volume. The failure was not in the AI - it was in the queue design, the reviewer capacity planning, and the interface that forced reviewers into an impossible throughput target.

This lesson covers how to build review queues that work at scale: priority management, reviewer assignment, interface design for throughput and quality, audit trails for compliance, and the tooling ecosystem that supports it all.

Review Queue Architecture

Every stage matters. Errors in flagging logic fill the queue with items that should not be there. Poor priority assignment means critical items wait behind low-stakes ones. Bad interface design cuts reviewer throughput in half. Missing audit trails create compliance gaps. A weak feedback loop means human review labor produces no model improvement.

Priority Queue Implementation

The queue must guarantee that critical items are always processed before low-priority items, regardless of arrival time. Redis sorted sets provide exactly this property with O(log N) enqueue and dequeue operations.

import asyncio
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Optional


class ReviewPriority(int, Enum):
    """
    Priority levels for review queue ordering.
    Lower numeric value = higher priority = dequeued first.

    Score in Redis = priority.value * 1_000_000 + arrival_timestamp
    This ensures:
    - CRITICAL always before HIGH (primary sort by priority tier)
    - Within the same tier: FIFO (secondary sort by arrival time)
    """
    CRITICAL = 1   # Crisis content, legal threats - SLA: 5 minutes
    HIGH = 2       # High-confidence AI flag, time-sensitive - SLA: 30 minutes
    MEDIUM = 3     # Standard review - SLA: 4 hours
    LOW = 4        # Borderline cases, sampling - SLA: 24 hours


@dataclass
class ReviewItem:
    """A single item in the review queue."""
    item_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)
    priority: ReviewPriority = ReviewPriority.MEDIUM
    item_type: str = ""          # "content_post", "ai_response", "annotation", etc.

    # Content to review
    content: str = ""
    content_metadata: dict = field(default_factory=dict)

    # AI context - shown to reviewer to support their decision
    ai_prediction: Optional[str] = None
    ai_confidence: Optional[float] = None
    ai_reasoning: Optional[str] = None

    # Routing metadata
    escalation_trigger: Optional[str] = None
    required_skill: Optional[str] = None  # "medical", "legal", "general", "de" (German)
    assigned_reviewer: Optional[str] = None

    # SLA tracking
    sla_deadline: Optional[float] = None   # Unix timestamp - alert if exceeded
    age_at_assignment: Optional[float] = None

    def compute_sla_deadline(self) -> float:
        """Compute SLA deadline based on priority."""
        sla_hours = {
            ReviewPriority.CRITICAL: 5/60,    # 5 minutes
            ReviewPriority.HIGH: 0.5,          # 30 minutes
            ReviewPriority.MEDIUM: 4.0,        # 4 hours
            ReviewPriority.LOW: 24.0,          # 24 hours
        }
        return self.created_at + sla_hours[self.priority] * 3600


class ReviewQueue:
    """
    Production-ready review queue using Redis sorted sets.

    Key design decisions:
    - Priority score = tier * 1_000_000 + timestamp (FIFO within tier)
    - Atomic dequeue via Lua script (prevents double-assignment)
    - Skill matching in dequeue (route to qualified reviewer)
    - Automatic SLA deadline computation on enqueue
    """

    SLA_HOURS = {
        ReviewPriority.CRITICAL: 5 / 60,
        ReviewPriority.HIGH: 0.5,
        ReviewPriority.MEDIUM: 4.0,
        ReviewPriority.LOW: 24.0,
    }

    def __init__(self, redis_client, queue_name: str = "hitl:review_queue"):
        self.redis = redis_client
        self.queue_name = queue_name
        self.items_key = f"{queue_name}:items"
        self.stats_key = f"{queue_name}:stats"

    def _priority_score(self, item: ReviewItem) -> float:
        """
        Compute Redis sorted set score.
        Lower score = dequeued first.
        Priority tier dominates, timestamp breaks ties within tier.
        """
        return item.priority.value * 1_000_000 + item.created_at

    async def enqueue(self, item: ReviewItem) -> None:
        """Add an item to the review queue with SLA deadline."""
        if item.sla_deadline is None:
            item.sla_deadline = item.compute_sla_deadline()

        score = self._priority_score(item)
        item_json = json.dumps(asdict(item))

        pipe = self.redis.pipeline()
        pipe.zadd(self.queue_name, {item.item_id: score})
        pipe.hset(self.items_key, item.item_id, item_json)

        # Update stats
        pipe.hincrby(self.stats_key, "total_enqueued", 1)
        pipe.hincrby(self.stats_key, f"enqueued_{item.priority.name}", 1)

        await pipe.execute()

    async def dequeue(
        self,
        reviewer_id: str,
        required_skill: Optional[str] = None,
    ) -> Optional[ReviewItem]:
        """
        Atomically dequeue the highest-priority item for this reviewer.
        Skill matching ensures medical items go to medical reviewers.
        """
        # Lua script: atomic scan-and-assign
        lua_script = """
        local queue = KEYS[1]
        local items = KEYS[2]
        local reviewer_id = ARGV[1]
        local required_skill = ARGV[2]
        local current_time = tonumber(ARGV[3])

        -- Scan from highest priority (lowest score)
        local candidates = redis.call('ZRANGE', queue, 0, 50, 'WITHSCORES')

        for i = 1, #candidates, 2 do
            local item_id = candidates[i]
            local item_json = redis.call('HGET', items, item_id)

            if item_json then
                local item = cjson.decode(item_json)

                -- Check skill match
                local skill_match = (required_skill == '' or
                                     item.required_skill == nil or
                                     item.required_skill == cjson.null or
                                     item.required_skill == required_skill)

                -- Skip already assigned items
                local not_assigned = (item.assigned_reviewer == nil or
                                      item.assigned_reviewer == cjson.null)

                if skill_match and not_assigned then
                    -- Assign to reviewer
                    item.assigned_reviewer = reviewer_id
                    item.age_at_assignment = current_time - item.created_at

                    -- Remove from sorted queue (in-progress)
                    redis.call('ZREM', queue, item_id)

                    -- Update item record
                    redis.call('HSET', items, item_id, cjson.encode(item))

                    return cjson.encode(item)
                end
            end
        end

        return nil
        """

        result = await self.redis.eval(
            lua_script,
            2,
            self.queue_name,
            self.items_key,
            reviewer_id,
            required_skill or "",
            str(time.time()),
        )

        if not result:
            return None

        data = json.loads(result)
        data["priority"] = ReviewPriority(data["priority"])
        return ReviewItem(**{k: v for k, v in data.items() if k in ReviewItem.__dataclass_fields__})

    async def return_to_queue(
        self,
        item: ReviewItem,
        reason: str = "reviewer_abandoned",
    ) -> None:
        """
        Return an item to the queue when a reviewer abandons it.
        Maintains original priority - does not penalize for return.
        """
        item.assigned_reviewer = None
        score = self._priority_score(item)
        item_json = json.dumps(asdict(item))

        pipe = self.redis.pipeline()
        pipe.zadd(self.queue_name, {item.item_id: score})
        pipe.hset(self.items_key, item.item_id, item_json)
        await pipe.execute()

    async def get_queue_stats(self) -> dict:
        """Real-time queue depth and SLA compliance statistics."""
        total = await self.redis.zcard(self.queue_name)

        priority_counts = {}
        sla_violations = 0
        now = time.time()

        for p in ReviewPriority:
            min_score = p.value * 1_000_000
            max_score = (p.value + 1) * 1_000_000 - 1
            count = await self.redis.zcount(self.queue_name, min_score, max_score)
            priority_counts[p.name] = count

        # Check SLA violations (items past deadline still in queue)
        all_items = await self.redis.zrange(self.queue_name, 0, -1)
        for item_id in all_items[:100]:  # Sample first 100 for efficiency
            item_json = await self.redis.hget(self.items_key, item_id)
            if item_json:
                item_data = json.loads(item_json)
                deadline = item_data.get("sla_deadline")
                if deadline and now > deadline:
                    sla_violations += 1

        # Age of oldest item in queue
        oldest_item = await self.redis.zrange(self.queue_name, 0, 0, withscores=True)
        oldest_age = None
        if oldest_item:
            score = oldest_item[0][1]
            created_at = score % 1_000_000
            oldest_age = round((now - created_at) / 3600, 2)  # hours

        return {
            "total_pending": total,
            "by_priority": priority_counts,
            "sla_violations_sampled": sla_violations,
            "oldest_item_age_hours": oldest_age,
            "requires_attention": (
                priority_counts.get("CRITICAL", 0) > 0
                or sla_violations > 0
            ),
        }

    async def get_sla_compliance_report(self) -> dict:
        """
        Compute SLA compliance rate from historical data.
        Requires separate storage of completion timestamps.
        """
        # In production: query completion records from database
        # This method documents the expected shape of the report
        return {
            "period_hours": 24,
            "sla_by_priority": {
                "CRITICAL": {"target_minutes": 5, "compliance_rate": 0.0},
                "HIGH": {"target_minutes": 30, "compliance_rate": 0.0},
                "MEDIUM": {"target_hours": 4, "compliance_rate": 0.0},
                "LOW": {"target_hours": 24, "compliance_rate": 0.0},
            },
            "note": "Connect to completion records store for real data",
        }

The Review Interface: Throughput and Accuracy by Design

The review interface design determines reviewer throughput and decision quality. A well-designed interface where a reviewer can make a high-quality decision in 45 seconds supports 80 reviews per hour. A poorly designed interface where they must reconstruct context from raw data supports 20 reviews per hour - a 4x throughput difference that determines whether your queue clears or backs up.

from fastapi import FastAPI, HTTPException, Depends
from fastapi.responses import JSONResponse
from typing import Optional
import time
import json

app = FastAPI()


@app.get("/review/next")
async def get_next_item(
    reviewer_id: str,
    skill: Optional[str] = None,
) -> dict:
    """
    Fetch the next review item for this reviewer.

    The response includes all context the reviewer needs - they should
    not have to make additional API calls or search for context.
    """
    queue = get_queue()
    item = await queue.dequeue(reviewer_id, required_skill=skill)

    if not item:
        return {"status": "empty", "message": "No items available in your skill queue"}

    # Compute age for context
    age_minutes = (time.time() - item.created_at) / 60

    # Format AI confidence as percentage for display
    confidence_display = (
        f"{item.ai_confidence:.0%}"
        if item.ai_confidence is not None else None
    )

    # SLA urgency signal for the UI
    sla_urgency = "ok"
    if item.sla_deadline:
        remaining = item.sla_deadline - time.time()
        if remaining < 0:
            sla_urgency = "violated"
        elif remaining < 300:  # < 5 minutes
            sla_urgency = "critical"
        elif remaining < 1800:  # < 30 minutes
            sla_urgency = "warning"

    return {
        "status": "ok",
        "item": {
            "item_id": item.item_id,
            "item_type": item.item_type,
            "content": item.content,
            "content_metadata": item.content_metadata,
            "ai_context": {
                "prediction": item.ai_prediction,
                "confidence": confidence_display,
                "reasoning": item.ai_reasoning,
            },
            "queue_context": {
                "priority": item.priority.name,
                "age_minutes": round(age_minutes, 1),
                "escalation_trigger": item.escalation_trigger,
                "sla_urgency": sla_urgency,
            },
        },
        "decisions_available": ["approve", "reject", "escalate", "skip"],
        "keyboard_shortcuts": {
            "a": "approve",
            "r": "reject",
            "e": "escalate",
            "s": "skip",
        },
    }


@app.post("/review/submit")
async def submit_decision(
    item_id: str,
    reviewer_id: str,
    decision: str,
    rationale: Optional[str] = None,
    time_spent_seconds: float = 0,
    corrected_label: Optional[str] = None,
) -> dict:
    """
    Submit a review decision with full audit capture.

    Required fields are minimal (item_id, reviewer_id, decision).
    Optional rationale is required for high-priority items via
    a separate validation layer.
    """
    valid_decisions = {"approve", "reject", "escalate", "skip"}
    if decision not in valid_decisions:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid decision '{decision}'. Must be one of {valid_decisions}"
        )

    # Validate minimum time spent
    if time_spent_seconds < 5:
        raise HTTPException(
            status_code=400,
            detail="Review completed too quickly. Minimum 5 seconds required."
        )

    # Build complete review record
    review_record = {
        "item_id": item_id,
        "reviewer_id": reviewer_id,
        "decision": decision,
        "rationale": rationale,
        "corrected_label": corrected_label,
        "time_spent_seconds": time_spent_seconds,
        "timestamp": time.time(),
    }

    # Log to audit trail (synchronous - must succeed before returning)
    audit_logger = get_audit_logger()
    audit_record = await audit_logger.log(review_record)

    # Trigger downstream actions (asynchronous - can fail without blocking)
    try:
        await _execute_downstream_actions(item_id, decision, review_record)
    except Exception as e:
        # Log error but don't fail the review submission
        print(f"Downstream action failed for {item_id}: {e}")

    # Queue for training feedback if it's a correction
    if corrected_label and decision in ("reject", "approve"):
        await _queue_for_training(item_id, corrected_label, reviewer_id)

    return {
        "status": "recorded",
        "item_id": item_id,
        "audit_record_id": audit_record.record_id,
        "decision": decision,
    }


@app.get("/review/queue/stats")
async def queue_statistics() -> dict:
    """Real-time queue statistics for monitoring dashboard."""
    queue = get_queue()
    stats = await queue.get_queue_stats()
    return stats


@app.get("/reviewer/{reviewer_id}/stats")
async def reviewer_statistics(reviewer_id: str, period_hours: float = 24) -> dict:
    """Per-reviewer performance metrics."""
    tracker = get_reviewer_tracker()
    metrics = tracker.compute_metrics(reviewer_id, period_hours)
    issues = tracker.detect_quality_issues(reviewer_id)

    return {
        "reviewer_id": reviewer_id,
        "period_hours": period_hours,
        "metrics": {
            "total_reviews": metrics.total_reviews,
            "avg_reviews_per_hour": round(metrics.avg_reviews_per_hour, 1),
            "avg_review_time_seconds": round(metrics.avg_review_time_seconds, 1),
            "gold_task_accuracy": f"{metrics.gold_accuracy:.1%}",
            "ai_override_rate": f"{metrics.override_rate:.1%}",
        },
        "quality_issues": issues,
        "status": "issues_detected" if issues else "ok",
    }


async def _execute_downstream_actions(
    item_id: str,
    decision: str,
    record: dict,
) -> None:
    """Execute downstream actions based on review decision."""
    if decision == "approve":
        # Publish content, release AI response, etc.
        await _publish_item(item_id)
    elif decision == "reject":
        # Remove content, block user if threshold exceeded, notify uploader
        await _remove_item(item_id, record.get("rationale"))
    elif decision == "escalate":
        # Move to higher-priority specialist queue
        await _escalate_item(item_id, record)

async def _publish_item(item_id: str): pass
async def _remove_item(item_id: str, reason: Optional[str]): pass
async def _escalate_item(item_id: str, record: dict): pass
async def _queue_for_training(item_id: str, label: str, reviewer_id: str): pass
def get_queue(): return ReviewQueue(redis_client=None)
def get_audit_logger(): return AuditLogger()
def get_reviewer_tracker(): return ReviewerPerformanceTracker({})

Reviewer Interface: UX Principles for Speed and Quality

// ReviewInterface.tsx - Key design decisions documented as code

const KEYBOARD_SHORTCUTS: Record<string, string> = {
  "a": "approve",
  "r": "reject",
  "e": "escalate",
  "s": "skip",
  "ArrowLeft": "previous",
  "ArrowRight": "next_without_deciding",
};

// Design principle: keyboard shortcuts should cover 90%+ of review actions
// Mouse-only interfaces cut throughput by 30-40%

interface ReviewItemDisplay {
  item_id: string;
  content: string;
  content_metadata: Record<string, unknown>;
  ai_context: {
    prediction: string | null;
    confidence: string | null;
    reasoning: string | null;
  };
  queue_context: {
    priority: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
    age_minutes: number;
    sla_urgency: "ok" | "warning" | "critical" | "violated";
  };
}

const SLA_COLORS = {
  ok: "#dcfce7",
  warning: "#fef9c3",
  critical: "#ffedd5",
  violated: "#fee2e2",
};

function ReviewPanel({
  item,
  onDecision,
  showAIPrediction = true,  // Design choice: show or hide AI prediction
}: {
  item: ReviewItemDisplay;
  onDecision: (decision: string, rationale?: string) => void;
  showAIPrediction?: boolean;
}) {
  const startTime = React.useRef(Date.now());
  const [rationale, setRationale] = React.useState("");
  const [requiresRationale] = React.useState(
    item.queue_context.priority === "CRITICAL" ||
    item.queue_context.priority === "HIGH"
  );

  React.useEffect(() => {
    const handler = (e: KeyboardEvent) => {
      // Never capture when user is typing in text fields
      if (e.target instanceof HTMLInputElement ||
          e.target instanceof HTMLTextAreaElement) return;

      const action = KEYBOARD_SHORTCUTS[e.key];
      if (!action) return;

      if (action === "approve" || action === "reject" || action === "escalate") {
        if (requiresRationale && !rationale.trim()) {
          // Flash the rationale field - don't silently fail
          document.getElementById("rationale-field")?.focus();
          return;
        }
        handleDecision(action);
      } else if (action === "skip") {
        handleDecision("skip");
      }
    };

    window.addEventListener("keydown", handler);
    return () => window.removeEventListener("keydown", handler);
  }, [rationale, requiresRationale]);

  const handleDecision = (decision: string) => {
    const timeSpent = (Date.now() - startTime.current) / 1000;
    onDecision(decision, rationale.trim() || undefined);
  };

  return (
    <div className="review-panel" style={{ maxWidth: 800 }}>

      {/* Priority and SLA indicator */}
      <div style={{
        backgroundColor: SLA_COLORS[item.queue_context.sla_urgency],
        padding: "8px 16px",
        borderRadius: 6,
        marginBottom: 12,
        display: "flex",
        justifyContent: "space-between",
      }}>
        <span>Priority: <strong>{item.queue_context.priority}</strong></span>
        <span>Age: {item.queue_context.age_minutes.toFixed(0)} minutes</span>
        {item.queue_context.sla_urgency !== "ok" && (
          <span style={{ color: "#dc2626" }}>
            SLA: {item.queue_context.sla_urgency.toUpperCase()}
          </span>
        )}
      </div>

      {/* AI context - shown after content to reduce anchoring bias */}
      {/* Design choice: content first, then AI prediction */}
      <div className="content-section" style={{ marginBottom: 16 }}>
        <h3>Content to Review</h3>
        <div style={{
          background: "#f9fafb",
          border: "1px solid #e5e7eb",
          borderRadius: 6,
          padding: 16,
          fontFamily: "monospace",
          whiteSpace: "pre-wrap",
          wordBreak: "break-word",
        }}>
          {item.content}
        </div>

        {/* Metadata tags */}
        {Object.entries(item.content_metadata).map(([k, v]) => (
          <span key={k} style={{
            display: "inline-block",
            background: "#e0e7ff",
            color: "#3730a3",
            padding: "2px 8px",
            borderRadius: 4,
            fontSize: 12,
            marginRight: 6,
            marginTop: 8,
          }}>
            {k}: {String(v)}
          </span>
        ))}
      </div>

      {/* AI recommendation - shown below content */}
      {showAIPrediction && item.ai_context.prediction && (
        <div style={{
          background: "#dbeafe",
          border: "1px solid #2563eb",
          borderRadius: 6,
          padding: 12,
          marginBottom: 16,
        }}>
          <div style={{ fontWeight: "bold", marginBottom: 4 }}>
            AI Recommendation: {item.ai_context.prediction}
            {item.ai_context.confidence && (
              <span style={{ fontWeight: "normal", marginLeft: 8, color: "#374151" }}>
                ({item.ai_context.confidence} confidence)
              </span>
            )}
          </div>
          {item.ai_context.reasoning && (
            <div style={{ fontSize: 13, color: "#374151" }}>
              Reasoning: {item.ai_context.reasoning}
            </div>
          )}
        </div>
      )}

      {/* Rationale field - required for high-priority items */}
      <div style={{ marginBottom: 16 }}>
        <label style={{ display: "block", marginBottom: 6, fontWeight: "bold" }}>
          Rationale {requiresRationale ? "(required)" : "(optional)"}
        </label>
        <textarea
          id="rationale-field"
          value={rationale}
          onChange={(e) => setRationale(e.target.value)}
          placeholder="Briefly explain your decision..."
          style={{
            width: "100%",
            padding: 10,
            borderRadius: 6,
            border: requiresRationale && !rationale.trim()
              ? "2px solid #dc2626"
              : "1px solid #d1d5db",
            minHeight: 60,
          }}
        />
      </div>

      {/* Decision buttons with keyboard hints */}
      <div style={{ display: "flex", gap: 10 }}>
        {[
          { action: "approve", label: "Approve", key: "A", color: "#16a34a" },
          { action: "reject", label: "Reject", key: "R", color: "#dc2626" },
          { action: "escalate", label: "Escalate", key: "E", color: "#d97706" },
          { action: "skip", label: "Skip", key: "S", color: "#6b7280" },
        ].map(({ action, label, key, color }) => (
          <button
            key={action}
            onClick={() => handleDecision(action)}
            style={{
              flex: 1,
              padding: "12px 0",
              borderRadius: 6,
              border: `2px solid ${color}`,
              background: "white",
              color: color,
              fontWeight: "bold",
              cursor: "pointer",
            }}
          >
            {label}
            <span style={{
              display: "block",
              fontSize: 11,
              fontWeight: "normal",
              color: "#9ca3af",
            }}>
              Press [{key}]
            </span>
          </button>
        ))}
      </div>
    </div>
  );
}

Immutable Audit Trail

Compliance requirements in regulated industries demand audit records that cannot be modified or deleted. Hash-chaining provides tamper detection: any modification to a record changes its hash, which breaks the chain for all subsequent records.

import hashlib
import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AuditRecord:
    """
    A single immutable review decision record.
    Hash-chained to detect tampering.
    """
    record_id: str
    item_id: str
    reviewer_id: str
    decision: str
    rationale: Optional[str]
    corrected_label: Optional[str]
    time_spent_seconds: float
    timestamp: float

    # Compliance fields
    item_content_hash: str    # Hash of item content at review time
    ai_prediction: Optional[str]
    ai_confidence: Optional[float]

    # Chain integrity
    previous_record_hash: Optional[str]
    record_hash: str = ""     # Computed on initialization

    def __post_init__(self):
        if not self.record_hash:
            self.record_hash = self._compute_hash()

    def _compute_hash(self) -> str:
        """
        Deterministic hash of record content.
        Changes in any field change the hash.
        Chain integrity: each record's previous_record_hash must match
        the prior record's record_hash.
        """
        canonical = {
            "record_id": self.record_id,
            "item_id": self.item_id,
            "reviewer_id": self.reviewer_id,
            "decision": self.decision,
            "timestamp": self.timestamp,
            "previous_record_hash": self.previous_record_hash,
            "item_content_hash": self.item_content_hash,
        }
        return hashlib.sha256(
            json.dumps(canonical, sort_keys=True).encode()
        ).hexdigest()

    def verify_integrity(self, previous_record: Optional["AuditRecord"] = None) -> bool:
        """Verify this record's hash and chain link."""
        # Verify own hash
        recomputed = self._compute_hash()
        if recomputed != self.record_hash:
            return False

        # Verify chain link
        if previous_record is not None:
            if self.previous_record_hash != previous_record.record_hash:
                return False

        return True


class AuditLogger:
    """
    Append-only audit logger with hash-chain tamper detection.

    Storage backend should be:
    - Append-only (no update/delete operations)
    - Replicated (for durability)
    - Separately permissioned from operational data

    Options: S3 with Object Lock, Kafka (consumer group reads only),
    PostgreSQL with row-level security and no DELETE privilege.
    """

    def __init__(self, storage_backend=None):
        self.storage = storage_backend
        self._last_hash: Optional[str] = None
        self._buffer: list[AuditRecord] = []
        self._buffer_max = 100

    async def log(self, review_record: dict) -> AuditRecord:
        """
        Log a review decision.

        The item_content_hash is computed from the content at review time,
        providing evidence that the reviewer saw the same content that is
        being recorded - important for cases where content may be modified.
        """
        content = review_record.get("item_content", "")
        content_hash = hashlib.sha256(content.encode()).hexdigest()[:16]

        record = AuditRecord(
            record_id=str(uuid.uuid4()),
            item_id=review_record["item_id"],
            reviewer_id=review_record["reviewer_id"],
            decision=review_record["decision"],
            rationale=review_record.get("rationale"),
            corrected_label=review_record.get("corrected_label"),
            time_spent_seconds=review_record.get("time_spent_seconds", 0),
            timestamp=review_record.get("timestamp", time.time()),
            item_content_hash=content_hash,
            ai_prediction=review_record.get("ai_prediction"),
            ai_confidence=review_record.get("ai_confidence"),
            previous_record_hash=self._last_hash,
        )

        self._last_hash = record.record_hash
        self._buffer.append(record)

        if len(self._buffer) >= self._buffer_max:
            await self.flush()

        return record

    async def flush(self) -> int:
        """Persist buffer to storage backend."""
        if not self._buffer:
            return 0

        count = len(self._buffer)
        records = [asdict(r) for r in self._buffer]

        if self.storage:
            await self.storage.append_batch(records)
        else:
            # Development fallback: print to stdout as JSON lines
            for record in records:
                print(json.dumps(record))

        self._buffer.clear()
        return count

    def verify_chain(self, records: list[AuditRecord]) -> dict:
        """
        Verify the integrity of an audit chain.
        Used during compliance audits to demonstrate records were not modified.
        """
        errors = []
        for i, record in enumerate(records):
            prev = records[i - 1] if i > 0 else None
            if not record.verify_integrity(prev):
                errors.append({
                    "record_id": record.record_id,
                    "position": i,
                    "error": "Hash mismatch or chain broken",
                })

        return {
            "total_records": len(records),
            "verified": len(records) - len(errors),
            "failed": len(errors),
            "chain_valid": len(errors) == 0,
            "errors": errors,
        }


class ReviewerPerformanceTracker:
    """Tracks reviewer performance for quality management."""

    def __init__(self, gold_labels: dict):
        self.gold = gold_labels
        from collections import defaultdict
        self._reviews = defaultdict(list)

    def record_review(
        self,
        reviewer_id: str,
        item_id: str,
        decision: str,
        time_spent: float,
        ai_prediction: Optional[str] = None,
    ):
        is_gold = item_id in self.gold
        is_correct = self.gold.get(item_id) == decision if is_gold else None
        is_override = (ai_prediction is not None) and (ai_prediction != decision)

        self._reviews[reviewer_id].append({
            "item_id": item_id,
            "decision": decision,
            "time_spent": time_spent,
            "timestamp": time.time(),
            "is_gold": is_gold,
            "is_correct": is_correct,
            "is_override": is_override,
        })

    def compute_metrics(self, reviewer_id: str, period_hours: float = 24) -> dict:
        import statistics
        reviews = self._reviews.get(reviewer_id, [])
        cutoff = time.time() - period_hours * 3600
        recent = [r for r in reviews if r["timestamp"] > cutoff]

        if not recent:
            return {"reviewer_id": reviewer_id, "total_reviews": 0}

        times = [r["time_spent"] for r in recent]
        gold = [r for r in recent if r["is_gold"]]
        overrides = [r for r in recent if r.get("is_override") is not None]

        gold_accuracy = (
            sum(1 for r in gold if r["is_correct"]) / len(gold)
            if gold else None
        )

        override_rate = (
            sum(1 for r in overrides if r["is_override"]) / len(overrides)
            if overrides else None
        )

        return {
            "reviewer_id": reviewer_id,
            "total_reviews": len(recent),
            "avg_reviews_per_hour": round(len(recent) / period_hours, 1),
            "avg_time_seconds": round(statistics.mean(times), 1),
            "median_time_seconds": round(statistics.median(times), 1),
            "gold_accuracy": round(gold_accuracy, 3) if gold_accuracy else None,
            "override_rate": round(override_rate, 3) if override_rate else None,
        }

    def detect_quality_issues(self, reviewer_id: str) -> list[str]:
        metrics = self.compute_metrics(reviewer_id)
        issues = []

        if metrics.get("avg_time_seconds", 999) < 10:
            issues.append(
                f"Avg review time {metrics['avg_time_seconds']:.1f}s is too fast "
                f"(minimum: 10s). Quality risk."
            )

        gold_acc = metrics.get("gold_accuracy")
        if gold_acc is not None and gold_acc < 0.80 and metrics["total_reviews"] >= 20:
            issues.append(
                f"Gold task accuracy {gold_acc:.1%} below threshold (80%). "
                f"Retraining recommended."
            )

        override_rate = metrics.get("override_rate")
        if override_rate is not None and override_rate < 0.02:
            issues.append(
                f"AI override rate {override_rate:.1%} is very low. "
                f"Reviewer may be rubber-stamping AI decisions."
            )

        return issues

Label Studio Integration

Label Studio is the standard open-source annotation platform for teams that do not need a fully custom interface. It provides collaborative annotation, project management, IAA computation, and export functionality.

import json
import requests
from typing import Optional
from dataclasses import dataclass


@dataclass
class LabelStudioProject:
    """Represents a Label Studio annotation project."""
    project_id: int
    title: str
    label_config: str
    total_tasks: int
    completed_tasks: int


class LabelStudioClient:
    """
    Python client for Label Studio REST API.
    Manages projects, task import, and annotation export.
    """

    def __init__(self, host: str, api_key: str):
        self.host = host.rstrip("/")
        self.headers = {
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        }

    def create_project(
        self,
        title: str,
        label_config: str,
        description: str = "",
        min_annotations_per_task: int = 1,
    ) -> LabelStudioProject:
        """Create a Label Studio project with specified label configuration."""
        response = requests.post(
            f"{self.host}/api/projects/",
            headers=self.headers,
            json={
                "title": title,
                "description": description,
                "label_config": label_config,
                "minimum_annotations": min_annotations_per_task,
                "show_instruction": True,
            },
        )
        response.raise_for_status()
        data = response.json()

        return LabelStudioProject(
            project_id=data["id"],
            title=data["title"],
            label_config=label_config,
            total_tasks=0,
            completed_tasks=0,
        )

    def import_tasks(
        self,
        project_id: int,
        tasks: list[dict],
        batch_size: int = 500,
    ) -> dict:
        """
        Import annotation tasks into a project.
        Batches large imports to avoid API timeouts.
        """
        total_imported = 0

        for i in range(0, len(tasks), batch_size):
            batch = tasks[i:i + batch_size]
            response = requests.post(
                f"{self.host}/api/projects/{project_id}/import",
                headers=self.headers,
                json=batch,
            )
            response.raise_for_status()
            total_imported += len(batch)

        return {"imported": total_imported}

    def get_annotations(
        self,
        project_id: int,
        export_format: str = "JSON",
        include_only_completed: bool = True,
    ) -> list[dict]:
        """Export completed annotations from a project."""
        params = {
            "exportType": export_format,
            "download_all_tasks": "false" if include_only_completed else "true",
        }
        response = requests.get(
            f"{self.host}/api/projects/{project_id}/export",
            headers=self.headers,
            params=params,
        )
        response.raise_for_status()
        return response.json()

    def get_project_statistics(self, project_id: int) -> dict:
        """Get completion and agreement statistics for a project."""
        response = requests.get(
            f"{self.host}/api/projects/{project_id}/",
            headers=self.headers,
        )
        response.raise_for_status()
        data = response.json()

        return {
            "total_tasks": data.get("task_count", 0),
            "total_annotations": data.get("total_annotations_number", 0),
            "completed_tasks": data.get("num_tasks_with_annotations", 0),
            "completion_rate": (
                data.get("num_tasks_with_annotations", 0)
                / max(data.get("task_count", 1), 1)
            ),
        }


# Standard label configurations for common review tasks

TEXT_CLASSIFICATION_CONFIG = """
<View>
  <Header value="AI Recommendation: $ai_prediction (Confidence: $ai_confidence)"/>
  <Text name="text" value="$text"/>
  <Choices name="label" toName="text" choice="single" showInLine="true">
    <Choice value="approve" style="background: #dcfce7; padding: 8px 16px; border-radius: 4px"/>
    <Choice value="reject" style="background: #fee2e2; padding: 8px 16px; border-radius: 4px"/>
    <Choice value="needs_escalation" style="background: #fef9c3; padding: 8px 16px; border-radius: 4px"/>
  </Choices>
  <TextArea name="rationale" toName="text"
            placeholder="Explain your decision (required for reject/escalate)"
            rows="3"/>
</View>
"""

PREFERENCE_COMPARISON_CONFIG = """
<View>
  <Header value="Which response better addresses this prompt?"/>
  <Text name="prompt" value="$prompt"/>
  <PairwiseComparison name="comparison" toName="response_a,response_b">
    <Text name="response_a" value="$response_a" label="Response A"/>
    <Text name="response_b" value="$response_b" label="Response B"/>
  </PairwiseComparison>
  <Choices name="preferred" toName="prompt" choice="single">
    <Choice value="A">Response A is better</Choice>
    <Choice value="B">Response B is better</Choice>
    <Choice value="tie">About equal</Choice>
  </Choices>
  <Rating name="confidence" toName="prompt" maxRating="5" icon="star"
          hotkey="1,2,3,4,5"/>
  <TextArea name="reason" toName="prompt" placeholder="Why is this better?" rows="2"/>
</View>
"""

MEDICAL_SEVERITY_CONFIG = """
<View>
  <Header value="MEDICAL ANNOTATION - Apply absolute severity, not relative to patient age"/>
  <Text name="clinical_note" value="$text"/>
  <Choices name="severity" toName="clinical_note" choice="single">
    <Choice value="5_severe">5 - Severe: Life-threatening, immediate intervention</Choice>
    <Choice value="4_high">4 - High: Urgent care required today</Choice>
    <Choice value="3_moderate">3 - Moderate: Non-urgent medical attention needed</Choice>
    <Choice value="2_mild">2 - Mild: Self-care appropriate</Choice>
    <Choice value="1_minimal">1 - Minimal: No medical attention needed</Choice>
  </Choices>
  <TextArea name="clinical_reasoning" toName="clinical_note"
            placeholder="Required: explain your severity assessment" rows="3"/>
</View>
"""


def export_to_training_data(
    annotations: list[dict],
    min_agreement: float = 0.70,
) -> tuple[list[dict], dict]:
    """
    Convert Label Studio annotations to training data format.
    Filters by annotator agreement threshold.
    """
    from collections import Counter

    training = []
    stats = {"total": 0, "included": 0, "excluded_low_agreement": 0, "excluded_no_annotations": 0}

    for task in annotations:
        stats["total"] += 1
        text = task.get("data", {}).get("text", "")
        annotation_results = task.get("annotations", [])

        if not annotation_results:
            stats["excluded_no_annotations"] += 1
            continue

        # Extract labels from all annotations
        labels = []
        for ann in annotation_results:
            if ann.get("result"):
                for result in ann["result"]:
                    if result.get("type") == "choices":
                        choices = result["value"].get("choices", [])
                        if choices:
                            labels.append(choices[0])

        if not labels:
            stats["excluded_no_annotations"] += 1
            continue

        # Majority vote with agreement check
        counts = Counter(labels)
        majority_label, majority_count = counts.most_common(1)[0]
        agreement = majority_count / len(labels)

        if agreement < min_agreement:
            stats["excluded_low_agreement"] += 1
            continue

        training.append({
            "text": text,
            "label": majority_label,
            "agreement_rate": round(agreement, 3),
            "n_annotators": len(labels),
            "label_distribution": dict(counts),
        })
        stats["included"] += 1

    stats["inclusion_rate"] = round(stats["included"] / max(stats["total"], 1), 3)
    return training, stats

Review Queue Design Principles Table

Design Decision	Best Practice	Anti-Pattern
Priority queue	Redis sorted set with tier scoring	Single FIFO queue (critical items wait)
SLA monitoring	Automated alerts at 50% and 80% of SLA	Manual daily checks
Skill routing	Metadata-based matching at dequeue	Random assignment
Audit logging	Immutable, hash-chained, separate storage	Application DB with DELETE privilege
AI prediction display	Show AFTER content (reduce anchoring)	Show AI prediction first
Rationale requirement	Mandatory for high-priority, optional for low	No rationale ever required
Reviewer feedback	Real-time accuracy on gold tasks	Monthly spreadsheet reports
Queue overflow	Auto-scale reviewer capacity alerts	Let queue grow indefinitely

:::tip Show Reviewer Performance Metrics in Real Time Display a running accuracy indicator to reviewers during their session: "Your decisions match the majority 87% of the time today. You disagree most with the majority on [category X]." Reviewers self-correct when they have real-time feedback. A dashboard they check monthly has no impact on in-session quality. Consider a daily accuracy recap email rather than a live dashboard if real-time display feels too pressuring. :::

:::warning Audit Logs Must Be Immutable - Not Just Write-Protected A write-protected database table can be modified by a database administrator. True immutability requires either append-only storage (S3 with Object Lock, Kafka streams) or hash-chain verification that makes tampering detectable even if the storage allows modification. For regulatory compliance: document your immutability mechanism explicitly and test it during compliance audits by attempting to modify a record and verifying the chain verification catches it. :::

:::danger Do Not Let Reviewers See Queue Depth During Sessions Displaying "4,582 items in queue" to a reviewer puts them under implicit pressure to review faster. Throughput increases, quality drops, and you end up with worse outcomes than a slower, more careful review process. Show reviewers their own progress (items completed today, accuracy) but hide queue depth and backlog numbers from their session view. :::

Interview Q&A

Q1: How do you design a review queue that scales from 1,000 to 100,000 reviews per day?

The architecture must be horizontally scalable. Use Redis sorted sets as the queue backend: O(log N) enqueue and dequeue with atomicity via Lua scripts prevents race conditions at high throughput. Multiple review application servers can dequeue concurrently without double-assignment.

Capacity planning: first, compute the sustainable throughput per reviewer by task type. Medical document review: 50-80/day. Binary content moderation: 300-500/day. Preference comparison: 200-400/day. Then compute required reviewers = daily_volume × escalation_rate / per_reviewer_throughput. Plan for 20% headroom and define the alert threshold at which you begin recruiting additional reviewers.

Skill routing prevents specialist bottlenecks. Medical items must only go to medical-certified reviewers. Language-specific items must go to fluent reviewers. A general pool cannot compensate for specialty shortfalls. Track queue depth per skill separately and alert independently.

For extreme scale (100,000+ reviews/day): partition the queue by content type into separate Redis instances. Use a queue management service (SQS, Pub/Sub) for cross-region distribution. Pre-compute AI context summaries before items enter the queue so reviewer time is not spent waiting for LLM calls.

Q2: What must be included in an audit trail for human review decisions?

Minimum for compliance: item_id, reviewer_id, decision, timestamp, time_spent_seconds. This is insufficient for most regulatory contexts.

Production compliance audit trail: (1) The item content at the exact time of review - content may be edited later, but the review record must capture what the reviewer actually saw. Compute a hash of the content. (2) The AI prediction and confidence that led to flagging - this demonstrates that the human was reviewing AI-generated content. (3) The reviewer's decision with rationale - "reject" alone is not auditable without a stated reason. (4) Hash-chain linking to previous record - provides tamper detection. (5) Record retention metadata - when this record expires, who is authorized to read it.

Storage requirements: append-only, replicated, separately permissioned from operational data. Hash-chain must be independently verifiable. Test chain verification during audits by modifying a test record and confirming the verification detects it.

Q3: How do you prevent reviewer fatigue from degrading decision quality over long shifts?

Fatigue detection uses three signals: accuracy on gold tasks by session quartile, review speed within session, and override rate patterns.

Gold task accuracy by quartile: if a reviewer's accuracy on gold tasks is 88% in the first quarter of their session but 74% in the last quarter, fatigue is degrading quality. This is detected by tagging gold task completions with session timestamp and computing per-quartile accuracy.

Speed monitoring: a reviewer whose average time per review drops from 45s to 18s in the second half of their shift is rushing. Set a minimum review time gate (5-15 seconds, depending on task complexity) that the interface enforces - the "next item" button does not appear until the minimum time has elapsed.

Design countermeasures: mandatory breaks every 90 minutes (interface shows break reminder and locks for 10 minutes), maximum session length of 4 hours for complex tasks, rotation between easy and complex item types within the session (monotony accelerates fatigue), progress displays that acknowledge effort ("You've reviewed 87 items today - great work") rather than queue depth.

Q4: How do you integrate Label Studio into a production HITL pipeline?

Label Studio fits between your flagging system and your downstream actions. The integration has five steps:

Import: Export flagged items from your system as JSON tasks with precomputed metadata fields. The Label Studio label_config XML references these field names. Run import via the Label Studio API - batch imports of 500 items at a time avoid timeouts.

Assignment: Label Studio supports manual project assignment and automatic round-robin. For skill-based routing, maintain separate projects per specialty. Use the API to manage project membership - a medical reviewer joins only medical projects.

Monitor: Query the project statistics endpoint to track completion rate and time-in-queue. Set up alerts when tasks age beyond SLA thresholds.

Export: Query completed annotations periodically. Run agreement filtering in your pipeline (Label Studio computes IAA but you control the threshold for training inclusion). Items with agreement below your threshold go to adjudication, not training data.

Feedback: Write filtered annotations back to your labeling database with provenance metadata (Label Studio project ID, annotation IDs). Use this provenance to track which training examples came from which annotation rounds.

Q5: How do you handle reviewer disagreement on the same item in a review queue?

Multi-annotator disagreement is handled through a resolution hierarchy: majority vote when majority is clear, weighted voting when accuracy data is available, expert adjudication for genuine ambiguity.

For content moderation queues: assign 2-3 reviewers to borderline items (items where the AI confidence was in the 40-70% range). If 2 of 3 agree, use their decision with the majority label. If 3 reviewers split evenly (impossible with 3) or the margin is very small, escalate to a senior reviewer.

Weight reviewers by their gold task accuracy: a reviewer with 90% gold accuracy gets 1.5× weight versus a reviewer at 75%. This down-weights noisy reviewers without excluding them entirely.

Track which item categories generate the most disagreement. Category X with 40% inter-reviewer agreement is a signal that your review guidelines for category X are ambiguous or insufficient. Update the guidelines, re-train reviewers on the updated criteria, and flag existing items in that category for priority re-review.

Disagreement data is valuable: items where reviewers split are often the hardest, most informative examples for model training. Active learning approaches can prioritize these for inclusion in fine-tuning datasets - they sit near the model's decision boundary and produce the highest information gain per label.

When the Queue Becomes the Problem​

Review Queue Architecture​

Priority Queue Implementation​

The Review Interface: Throughput and Accuracy by Design​

Reviewer Interface: UX Principles for Speed and Quality​

Immutable Audit Trail​

Label Studio Integration​

Review Queue Design Principles Table​

Interview Q&A​

Q1: How do you design a review queue that scales from 1,000 to 100,000 reviews per day?​

Q2: What must be included in an audit trail for human review decisions?​

Q3: How do you prevent reviewer fatigue from degrading decision quality over long shifts?​

Q4: How do you integrate Label Studio into a production HITL pipeline?​

Q5: How do you handle reviewer disagreement on the same item in a review queue?​