LLM-as-Judge

Q: What is LLM-as-judge?

Use powerful LLMs to automatically evaluate other models - with position bias mitigation, CoT judging, and cost analysis.

The Scaling Problem

Your RLHF pipeline needs 50,000 labeled preference pairs per training run. Human annotators cost $0.50 per comparison and can deliver 500 comparisons per day per annotator. To label 50,000 pairs, you need 100 days of a single annotator, or 50 annotators working for 2 days. Either way, you are looking at $25,000 per training run just for the preference data. And you run training four times per week.

The math does not work. Human evaluation at the scale required for modern LLM training loops - hundreds of thousands of comparisons per week - is financially impossible for all but the largest labs.

This is the problem that LLM-as-judge solves. If a powerful LLM can reliably evaluate the outputs of other models, you can scale your evaluation pipeline to millions of comparisons per day at a fraction of the cost. GPT-4 as a judge costs roughly $0.03 per evaluation. At that rate, you can run 1 million evaluations for $30,000 - about the cost of one day of a human annotation team.
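
The arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in this section; the function name `annotation_budget` and its parameters are illustrative assumptions, not part of any library.

```python
def annotation_budget(n_pairs: int, cost_per_item: float,
                      items_per_day: int, workers: int = 1) -> dict:
    """Return total cost and wall-clock days for a labeling job."""
    total_cost = n_pairs * cost_per_item
    days = n_pairs / (items_per_day * workers)
    return {"total_cost": total_cost, "days": days}


# Human annotation: 50,000 pairs at $0.50 each, 500 comparisons per annotator-day
human = annotation_budget(50_000, 0.50, 500, workers=50)
print(human)  # {'total_cost': 25000.0, 'days': 2.0}

# Same volume with an LLM judge at roughly $0.03 per evaluation
llm_cost = 50_000 * 0.03
print(round(llm_cost, 2))  # 1500.0
```

Adding workers shortens the wall-clock time but leaves the total cost unchanged, which is why the human option does not get cheaper with parallelism.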

But this only works if the LLM judge is reliable. If it has systematic biases - preferring longer responses, preferring the first response it sees, preferring its own style - you are scaling up a broken signal. The entire practice of LLM-as-judge rests on a chain of assumptions that need to be validated, and most teams do not validate them rigorously enough.

This lesson covers what makes a good LLM judge, what makes a bad one, and the engineering practices that make LLM-as-judge reliable enough for production evaluation pipelines.

Why This Exists - The Bridge Between Human and Automatic

Before LLM-as-judge, there were two options: slow and expensive human evaluation, or fast and unreliable automatic metrics (BLEU, ROUGE). LLM-as-judge occupies a middle ground that was not possible until GPT-4-class models arrived in 2023.

The key insight: a capable LLM can understand evaluation criteria, consider context, and make nuanced judgments that BLEU cannot. It is not limited by human reading speed; it processes text at machine speed. It can be given precise, detailed rubrics that constrain its judgment. And it can explain its reasoning - giving you not just a score but a rationale that can be reviewed and improved.

The limitation: LLMs are not neutral observers. They have preferences baked in from training - aesthetic preferences, biases toward familiar styles, tendencies to reward confidence. These biases must be understood and mitigated, not ignored.

MT-Bench: The First Systematic LLM Judge Study

MT-Bench (Zheng et al., 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") was the paper that established LLM-as-judge as a credible evaluation methodology.

The Setup

MT-Bench consists of 80 high-quality, multi-turn questions across 8 categories:

Writing (10 questions)
Roleplay (10 questions)
Reasoning (10 questions)
Math (10 questions)
Coding (10 questions)
Extraction (10 questions)
STEM (10 questions)
Humanities (10 questions)

Each question has a follow-up turn, testing whether the model can maintain coherent multi-turn conversation. GPT-4 evaluates each response on a 1-10 scale.
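
Once each response has a 1-10 score, the scores are usually rolled up per category and overall. The sketch below shows one way to do that; the record layout (`category`, `turn`, `score`) is an assumption for illustration and is not the official MT-Bench data format.

```python
from collections import defaultdict


def summarize_scores(judgments: list[dict]) -> dict:
    """Average 1-10 judge scores per category and overall.

    Each judgment is a dict with hypothetical keys:
    {"category": str, "turn": 1 or 2, "score": int}
    """
    by_category = defaultdict(list)
    for j in judgments:
        by_category[j["category"]].append(j["score"])

    summary = {
        cat: round(sum(scores) / len(scores), 2)
        for cat, scores in by_category.items()
    }
    all_scores = [j["score"] for j in judgments]
    summary["overall"] = round(sum(all_scores) / len(all_scores), 2)
    return summary


judgments = [
    {"category": "math", "turn": 1, "score": 6},
    {"category": "math", "turn": 2, "score": 4},
    {"category": "writing", "turn": 1, "score": 9},
    {"category": "writing", "turn": 2, "score": 8},
]
print(summarize_scores(judgments))
# {'math': 5.0, 'writing': 8.5, 'overall': 6.75}
```

Reporting the per-category breakdown alongside the overall mean makes it easier to see where a model is weak, since a single average can hide a poor math or coding score.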

Key Findings

GPT-4 as judge agrees with human experts 80%+ of the time - comparable to human-human agreement
Position bias is real: when shown two responses, GPT-4 prefers the first position 56% of the time vs 44% for the second
Verbosity bias is real: GPT-4 tends to rate longer, more detailed responses higher
Weaker models make worse judges: GPT-3.5 as judge shows more biases and lower correlation with human judgments

MT-Bench scores for common models (at time of publication):

GPT-4: 8.99/10
Claude-v1: 7.90/10
GPT-3.5-turbo: 7.94/10
LLaMA-13B-fine-tuned: 6.27/10

Single-Score vs Pairwise Comparison

Two main judgment formats:

Single-Score (Absolute)

Present one response and ask the judge to rate it on a scale:

Rate the following response on a scale of 1-10 for helpfulness:
[Response A]
Score:

Advantages: Can be used without a baseline; enables longitudinal tracking. Disadvantages: Inconsistent calibration (the judge's internal "5/10" shifts with context); high variance.
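
One common way to cope with the variance of absolute scores is to score the same response several times and look at the spread. The following is a minimal sketch under that assumption; `aggregate_scores` and the `max_spread` cutoff of 1.0 are illustrative choices, not values from the MT-Bench paper.

```python
from statistics import mean, pstdev


def aggregate_scores(scores: list[int], max_spread: float = 1.0) -> dict:
    """Summarize repeated 1-10 scores given to the same response.

    "stable" is True when the population standard deviation of the
    repeated scores is at most max_spread (an assumed cutoff).
    """
    if not scores:
        raise ValueError("need at least one score")
    spread = pstdev(scores)
    return {"mean": mean(scores), "spread": spread, "stable": spread <= max_spread}


print(aggregate_scores([7, 7, 7])["stable"])  # True
print(aggregate_scores([3, 8, 6])["stable"])  # False
```

Responses flagged as unstable are good candidates for pairwise comparison or human review instead of a single absolute score.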

Pairwise Comparison (Relative)

Present two responses and ask which is better:

Which response is better? Respond with "Response A", "Response B", or "Tie".
[Response A]
[Response B]

Advantages: More reliable than absolute scores; swapping positions and re-judging cancels most position bias; directly usable for Elo rating computation. Disadvantages: Requires a baseline; cannot track absolute quality over time; O(n²) comparisons if evaluating n models.
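
To make the Elo connection concrete, here is a minimal sketch of the standard Elo update applied to judge verdicts. The K-factor of 32 and the 400-point scale are the conventional chess defaults and are assumptions here; the model names are placeholders.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, outcome_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    outcome_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - exp_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - exp_a))
    return new_a, new_b


ratings = {"model_x": 1000.0, "model_y": 1000.0}
# Judge verdicts for model_x vs model_y, expressed from model_x's side
for outcome in [1.0, 1.0, 0.5, 0.0]:
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], outcome
    )
print(ratings)
```

Sequential Elo depends on the order in which comparisons are processed, so for a fixed batch of verdicts it is worth shuffling the order several times and averaging the resulting ratings.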

The Systematic Biases

Understanding LLM judge biases is not academic - they can completely invalidate your evaluation results if uncorrected.

Position Bias

When presented with "Response A" and "Response B," LLM judges systematically prefer the response in the first position (A). This bias persists even for GPT-4, with studies showing 56%+ preference for position A even when the responses are equivalent.

Why it happens: LLMs are trained on text where the "first" answer tends to be the relevant one (QA datasets, chat logs where the helpful response comes first in human-written examples).

Mitigation: Always evaluate both orderings and accept a winner only when the two verdicts agree:

# Evaluate (A, B) and (B, A) separately, then aggregate.
# judge() returns a positional label: "A wins" means the response shown FIRST won.
score_AB = judge(prompt, response_A, response_B)  # response_A shown first
score_BA = judge(prompt, response_B, response_A)  # response_B shown first

# Only trust the result if the same underlying response wins in both orderings
if score_AB == "A wins" and score_BA == "B wins":
    final = "A wins"  # response_A won from both positions
elif score_AB == "B wins" and score_BA == "A wins":
    final = "B wins"  # response_B won from both positions
else:
    final = "Tie"  # Inconsistent or tied → call it a tie
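
Before relying on a judge, it helps to measure how strong its position bias actually is on your own data. The sketch below assumes you have already collected forward and reverse verdicts with positional labels ("A" means the response shown first in that call); the function name and input layout are illustrative assumptions.

```python
def first_position_rate(verdicts: list[tuple[str, str]]) -> float:
    """Fraction of non-tie verdicts that favored the response shown first.

    Each item is (forward, reverse) with positional labels "A", "B", or "tie".
    With swapped orderings, an unbiased judge lands near 0.5.
    """
    first, decided = 0, 0
    for forward, reverse in verdicts:
        for label in (forward, reverse):
            if label in ("A", "B"):
                decided += 1
                first += label == "A"
    return first / decided if decided else 0.0


verdicts = [("A", "B"), ("A", "A"), ("tie", "B")]
print(first_position_rate(verdicts))  # 0.6
```

A consistent pair always contributes one "A" and one "B", so any value well above 0.5 comes from pairs where the judge picked the first slot in both orderings.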

Verbosity Bias

LLM judges prefer longer, more detailed responses even when the extra length adds no value or introduces errors.

Evidence: In controlled experiments, the same response padded with irrelevant sentences scored 0.3–0.8 points higher on 10-point scales.

Mitigation:

Explicitly instruct the judge not to prefer length: "Length alone is not a measure of quality. A concise correct answer is better than a verbose incorrect one."
Add evaluation criteria for conciseness: "Penalize unnecessary padding or repetition."
Normalize responses to similar lengths for critical comparisons.
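
A quick diagnostic for verbosity bias is to check how often the longer response wins. This is a rough sketch under stated assumptions: the item keys are hypothetical, and length is counted in whitespace-separated words as a proxy for tokens.

```python
def longer_response_win_rate(items: list[dict]) -> float:
    """Fraction of decided comparisons won by the longer response.

    Each item has hypothetical keys: response_a, response_b,
    and winner ("A", "B", or "tie").
    """
    longer_wins, decided = 0, 0
    for item in items:
        len_a = len(item["response_a"].split())
        len_b = len(item["response_b"].split())
        if item["winner"] not in ("A", "B") or len_a == len_b:
            continue  # skip ties and equal-length pairs
        decided += 1
        longer = "A" if len_a > len_b else "B"
        longer_wins += item["winner"] == longer
    return longer_wins / decided if decided else 0.0
```

A high rate is not proof of bias on its own, because longer answers are sometimes genuinely better. It becomes informative when compared with the same statistic computed from human labels on the same pairs.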

Self-Preference

LLMs rate outputs that match their own style and training higher. A GPT-4-based judge shows mild preference for GPT-4 outputs. A Claude-based judge shows mild preference for Claude outputs.

Effect size: studies report 5–15% higher scores for "self" compared to equivalent outputs from other models.

Mitigation:

Use a judge from a different model family than the models being evaluated
Use multiple judges from different families and aggregate
For critical comparisons, include human judges to calibrate
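
Aggregating several judges can be as simple as a strict majority vote. The sketch below assumes each judge has already produced a verdict in the original A/B labeling; the judge names and the "no_consensus" label are illustrative.

```python
from collections import Counter


def majority_verdict(verdicts: dict[str, str]) -> str:
    """Combine per-judge verdicts ("A", "B", "tie") into one label.

    Returns the label with a strict majority, otherwise "no_consensus".
    """
    if not verdicts:
        return "no_consensus"
    label, count = Counter(verdicts.values()).most_common(1)[0]
    return label if count > len(verdicts) / 2 else "no_consensus"


print(majority_verdict({"judge_1": "A", "judge_2": "A", "judge_3": "B"}))  # A
print(majority_verdict({"judge_1": "A", "judge_2": "B"}))  # no_consensus
```

Items that end in "no_consensus" are natural candidates for the human calibration set.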

Sycophancy

If you include the model's previous output or any hint of what the "expected" answer is, judges tend to validate it. They are trained to be agreeable.

Mitigation: Present responses without any context about which is "expected" to be better.
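
One way to keep such hints out of the prompt is to strip model names and randomize which response gets which neutral label. This is a minimal sketch; `blind_pair` is a hypothetical helper, and the mapping it returns should be stored away from anything the judge sees.

```python
import random


def blind_pair(outputs: dict[str, str],
               rng: random.Random) -> tuple[dict[str, str], dict[str, str]]:
    """Return (presented, key) for exactly two model outputs.

    presented maps neutral labels "A"/"B" to response text (shown to the judge).
    key maps "A"/"B" back to model names (kept away from the judge).
    """
    if len(outputs) != 2:
        raise ValueError("expected exactly two outputs")
    names = list(outputs)
    rng.shuffle(names)  # random order so no model is always shown first
    key = dict(zip(("A", "B"), names))
    presented = {label: outputs[name] for label, name in key.items()}
    return presented, key


presented, key = blind_pair(
    {"baseline": "Paris.", "candidate": "The capital is Paris."},
    random.Random(0),
)
# presented contains only text; key is needed to map the verdict back to a model
```

Passing an explicit `random.Random` instance keeps the assignment reproducible when a run has to be audited later.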

Chain-of-Thought Judging

Having the judge reason step-by-step before giving a score significantly improves reliability. This was established empirically in the MT-Bench paper and subsequent work.

COT_JUDGE_PROMPT = """
You are an expert evaluator assessing the quality of AI assistant responses.

[Prompt given to the AI]
{prompt}

[AI Response to evaluate]
{response}

Please evaluate this response following these steps:
1. What was the user asking for? (be specific)
2. Does the response directly address the request? What is missing, if anything?
3. Is the information accurate? Note any factual errors.
4. Is the response appropriately concise? Note any padding or unnecessary length.
5. Are there any safety issues?

Based on your analysis above, provide a final score from 1-10 where:
1-2: Does not address the request, contains major errors, or is harmful
3-4: Partially addresses the request with significant gaps
5-6: Addresses the request with minor gaps or errors
7-8: Fully addresses the request with high quality
9-10: Exceptional - better than most human experts would do

Reasoning: [Your step-by-step analysis]
Score: [Single integer 1-10]
""".strip()

Chain-of-thought judging improves:

Consistency (same response gets same score on repeated runs)
Calibration (scores are distributed more meaningfully across the scale)
Interpretability (you can review reasoning to catch errors)
Human-judge agreement (typically 5–10% higher correlation)
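
The judge's reply still has to be parsed into a reasoning string and a score. The sketch below assumes the reply follows the "Reasoning: ... Score: N" layout requested by the prompt above; `parse_cot_judgment` is an illustrative helper, and real replies may need more defensive handling.

```python
import re
from typing import Optional


def parse_cot_judgment(output: str) -> dict:
    """Split a judge reply into reasoning text and an integer score."""
    score: Optional[int] = None
    # Use the last line starting with "Score:" so that numbers mentioned
    # inside the reasoning are not mistaken for the final score.
    matches = re.findall(r"^Score:\s*(\d{1,2})\b", output, flags=re.MULTILINE)
    if matches:
        candidate = int(matches[-1])
        if 1 <= candidate <= 10:
            score = candidate
    reasoning = re.split(r"^Score:", output, flags=re.MULTILINE)[0]
    reasoning = reasoning.replace("Reasoning:", "", 1).strip()
    return {"reasoning": reasoning, "score": score}


reply = "Reasoning: Accurate but padded.\nScore: 7"
print(parse_cot_judgment(reply))
# {'reasoning': 'Accurate but padded.', 'score': 7}
```

Returning `None` for an out-of-range or missing score, instead of a default value, keeps parse failures visible in downstream statistics.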

Reference-Guided Evaluation

For tasks where there is a known-good answer (closed-ended QA, factual queries), provide the reference answer to the judge:

REFERENCE_JUDGE_PROMPT = """
You are evaluating whether an AI response correctly answers a question.

Question: {question}
Correct Answer: {reference_answer}
AI Response: {ai_response}

Evaluate whether the AI response:
1. Contains the correct answer (even if phrased differently)
2. Contains any factually incorrect claims
3. Is appropriately concise or unnecessarily verbose

Verdict: "correct", "partially_correct", or "incorrect"
Explanation: [Brief explanation]
"""

Reference-guided evaluation reduces hallucination in the judge itself - the judge does not need to know the answer, just whether the response matches the reference.

Full Implementation: LLM Judge with Bias Mitigation

import anthropic
import openai
import json
from typing import Literal, Optional
from dataclasses import dataclass
import time

@dataclass
class JudgmentResult:
    winner: Literal["A", "B", "tie", "inconsistent"]
    score_A_forward: Optional[int]  # Score when A is presented first
    score_A_reverse: Optional[int]  # Score when B is presented first
    reasoning_forward: str
    reasoning_reverse: str
    position_bias_detected: bool
    confidence: Literal["high", "medium", "low"]


class LLMJudge:
    """
    Production-ready LLM judge with position bias mitigation,
    chain-of-thought reasoning, and reference support.
    """

    PAIRWISE_PROMPT = """You are an expert AI evaluator. Your task is to compare two AI responses to the same prompt and determine which is better.

User Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

{reference_section}

Evaluation Criteria:
- Accuracy: Are the claims factually correct?
- Completeness: Does the response fully address the request?
- Clarity: Is the response clear and well-organized?
- Conciseness: Is the response appropriately brief (not unnecessarily verbose)?
- Safety: Does the response avoid harmful content?

IMPORTANT: Do not prefer a response simply because it is longer. Concise correct answers are better than verbose incorrect ones.

Think step by step:
1. What does the user need?
2. Does Response A address it? How well?
3. Does Response B address it? How well?
4. Compare the key differences.
5. Which is better overall?

Format your response as JSON:
{{
  "reasoning": "Your step-by-step analysis",
  "winner": "A" or "B" or "tie",
  "score_A": integer 1-10,
  "score_B": integer 1-10
}}
"""

    def __init__(
        self,
        judge_model: str = "claude-3-5-sonnet-20241022",
        judge_provider: Literal["anthropic", "openai"] = "anthropic",
    ):
        self.judge_model = judge_model
        self.judge_provider = judge_provider

        if judge_provider == "anthropic":
            self.client = anthropic.Anthropic()
        else:
            self.client = openai.OpenAI()

    def _call_judge(self, prompt: str) -> str:
        """Make a single API call to the judge model."""
        if self.judge_provider == "anthropic":
            message = self.client.messages.create(
                model=self.judge_model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return message.content[0].text
        else:
            completion = self.client.chat.completions.create(
                model=self.judge_model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=1024,
                response_format={"type": "json_object"},
            )
            return completion.choices[0].message.content

    def _parse_judgment(self, response: str) -> dict:
        """Parse JSON judgment from the model response."""
        # Extract JSON from response (model might include preamble)
        try:
            # Try direct parse first
            return json.loads(response)
        except json.JSONDecodeError:
            # Find JSON block in response
            import re
            json_match = re.search(r'\{[^{}]*\}', response, re.DOTALL)
            if json_match:
                try:
                    return json.loads(json_match.group())
                except json.JSONDecodeError:
                    pass
            return {
                "reasoning": response,
                "winner": "tie",
                "score_A": 5,
                "score_B": 5,
            }

    def judge_pairwise(
        self,
        prompt: str,
        response_a: str,
        response_b: str,
        reference_answer: Optional[str] = None,
        mitigate_position_bias: bool = True,
        max_retries: int = 2,  # Reserved: retry logic is not implemented in this version
    ) -> JudgmentResult:
        """
        Judge two responses with optional position bias mitigation.

        If mitigate_position_bias=True, evaluates both (A,B) and (B,A)
        orderings and checks for consistency.
        """
        reference_section = ""
        if reference_answer:
            reference_section = f"\nReference Answer (ground truth):\n{reference_answer}\n"

        # Forward evaluation: A presented first
        forward_prompt = self.PAIRWISE_PROMPT.format(
            prompt=prompt,
            response_a=response_a,
            response_b=response_b,
            reference_section=reference_section,
        )

        forward_raw = self._call_judge(forward_prompt)
        forward = self._parse_judgment(forward_raw)
        forward_winner = forward.get("winner", "tie")

        if not mitigate_position_bias:
            return JudgmentResult(
                winner=forward_winner,
                score_A_forward=forward.get("score_A"),
                score_A_reverse=None,
                reasoning_forward=forward.get("reasoning", ""),
                reasoning_reverse="",
                position_bias_detected=False,
                confidence="medium",
            )

        # Reverse evaluation: original B presented first (as "Response A").
        # If a different underlying response wins in each ordering, the verdict
        # depended on position rather than content.
        time.sleep(0.5)  # Rate limiting

        reverse_prompt = self.PAIRWISE_PROMPT.format(
            prompt=prompt,
            response_a=response_b,  # Swap!
            response_b=response_a,  # Swap!
            reference_section=reference_section,
        )

        reverse_raw = self._call_judge(reverse_prompt)
        reverse = self._parse_judgment(reverse_raw)
        reverse_winner_from_judge = reverse.get("winner", "tie")

        # Translate reverse result back to original A/B labeling
        if reverse_winner_from_judge == "A":
            reverse_winner = "B"  # Judge picked "A" in reverse = original B
        elif reverse_winner_from_judge == "B":
            reverse_winner = "A"  # Judge picked "B" in reverse = original A
        else:
            reverse_winner = "tie"

        # Score for original A in the reverse evaluation
        score_A_reverse = reverse.get("score_B")  # B in reverse = original A

        # Determine consistency
        position_bias_detected = False
        if forward_winner != reverse_winner and "tie" not in [forward_winner, reverse_winner]:
            position_bias_detected = True

        # Aggregate winner
        if forward_winner == reverse_winner:
            final_winner = forward_winner
            confidence = "high"
        elif forward_winner == "tie" or reverse_winner == "tie":
            # One says tie, one has a preference - call it a tie
            final_winner = "tie"
            confidence = "medium"
        else:
            # Inconsistent - genuine disagreement due to position
            final_winner = "inconsistent"
            confidence = "low"

        return JudgmentResult(
            winner=final_winner,
            score_A_forward=forward.get("score_A"),
            score_A_reverse=score_A_reverse,
            reasoning_forward=forward.get("reasoning", ""),
            reasoning_reverse=reverse.get("reasoning", ""),
            position_bias_detected=position_bias_detected,
            confidence=confidence,
        )

    def batch_evaluate(
        self,
        evaluation_items: list[dict],
        mitigate_position_bias: bool = True,
    ) -> list[JudgmentResult]:
        """
        Evaluate a batch of (prompt, response_a, response_b) items.
        Prints summary statistics and returns the list of JudgmentResults.
        """
        results = []
        position_bias_count = 0

        for i, item in enumerate(evaluation_items):
            print(f"Evaluating item {i+1}/{len(evaluation_items)}...")

            result = self.judge_pairwise(
                prompt=item["prompt"],
                response_a=item["response_a"],
                response_b=item["response_b"],
                reference_answer=item.get("reference"),
                mitigate_position_bias=mitigate_position_bias,
            )
            results.append(result)

            if result.position_bias_detected:
                position_bias_count += 1

        # Summary
        winners = [r.winner for r in results]
        print(f"\n=== Batch Evaluation Summary ===")
        print(f"Items evaluated:     {len(results)}")
        print(f"Model A wins:        {winners.count('A')}")
        print(f"Model B wins:        {winners.count('B')}")
        print(f"Ties:                {winners.count('tie')}")
        print(f"Inconsistent:        {winners.count('inconsistent')}")
        print(f"Position bias detected: {position_bias_count}/{len(results)}")

        return results


# Usage example
def run_model_comparison():
    judge = LLMJudge(judge_model="claude-3-5-sonnet-20241022", judge_provider="anthropic")

    evaluation_items = [
        {
            "prompt": "Explain the difference between supervised and unsupervised learning.",
            "response_a": "Supervised learning uses labeled data to train models that predict outputs for new inputs. Unsupervised learning finds patterns in unlabeled data.",
            "response_b": "Supervised learning: you provide examples of input-output pairs (labeled data), and the model learns the mapping. Examples: classification, regression. Unsupervised learning: you only provide inputs, and the model discovers structure. Examples: clustering, dimensionality reduction. The key difference is whether human-labeled targets are available during training.",
        },
        {
            "prompt": "What is the capital of Australia?",
            "response_a": "The capital of Australia is Canberra.",
            "response_b": "Australia's capital city is Canberra, which was purpose-built as a compromise between Sydney and Melbourne, which both wanted to be the capital. It became the capital in 1913.",
            "reference": "Canberra",
        },
    ]

    results = judge.batch_evaluate(evaluation_items)
    return results

Prometheus: Open-Source Evaluation LLM

Prometheus (Kim et al., 2023) is an open-source LLM specifically trained for evaluation tasks. Unlike GPT-4-as-judge, which uses a general-purpose model, Prometheus was fine-tuned on a large dataset of evaluation instances to be a more calibrated judge.

Key features:

Based on Llama-2-13B fine-tuned on 100K evaluation instances
Produces evaluation scores AND detailed feedback
Can be run locally (no API costs)
Shows lower position bias than general-purpose models

# Prometheus evaluation using Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

PROMETHEUS_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response accurate and helpful?]
Score 1: The response is inaccurate, irrelevant, or harmful.
Score 2: The response is partially relevant but contains significant errors.
Score 3: The response is relevant and mostly accurate but incomplete.
Score 4: The response is accurate and helpful with minor gaps.
Score 5: The response is completely accurate, comprehensive, and exceptionally helpful.

###Feedback:"""


def evaluate_with_prometheus(
    instruction: str,
    response: str,
    reference_answer: str,
    model_name: str = "kaist-ai/prometheus-7b-v2.0",
) -> dict:
    """
    Evaluate a response using Prometheus open-source judge.
    Returns score and feedback.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    prompt = PROMETHEUS_PROMPT.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,  # Greedy decoding: deterministic for evaluation
        )

    generated = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    )

    # Parse score from "[RESULT] N" format
    import re
    score_match = re.search(r'\[RESULT\]\s*(\d)', generated)
    score = int(score_match.group(1)) if score_match else None

    feedback = generated.split("[RESULT]")[0].strip() if "[RESULT]" in generated else generated

    return {
        "score": score,
        "feedback": feedback,
        "raw_output": generated,
    }

Cost Analysis: GPT-4 as Judge vs Human Annotation

def compute_evaluation_costs(
    n_comparisons: int,
    avg_prompt_tokens: int = 200,
    avg_response_tokens: int = 300,
    judge_prompt_overhead: int = 500,  # System prompt + formatting
) -> dict:
    """
    Compare costs of different evaluation approaches.
    Prices as of 2024 - update for current pricing.
    """
    # Token counts per evaluation
    tokens_per_eval = avg_prompt_tokens + 2 * avg_response_tokens + judge_prompt_overhead
    judge_output_tokens = 200  # Reasoning + score

    # GPT-4o pricing (input/output per 1M tokens)
    gpt4o_input_price = 2.50 / 1_000_000
    gpt4o_output_price = 10.00 / 1_000_000

    # GPT-4o-mini pricing
    mini_input_price = 0.15 / 1_000_000
    mini_output_price = 0.60 / 1_000_000

    # Claude 3.5 Sonnet pricing
    claude_input_price = 3.00 / 1_000_000
    claude_output_price = 15.00 / 1_000_000

    # Human annotation cost
    human_cost_per_comparison = 0.50  # $0.50/comparison (crowdsourced)
    expert_cost_per_comparison = 5.00  # $5/comparison (expert annotator)

    costs = {
        "gpt4o_judge": n_comparisons * (
            tokens_per_eval * gpt4o_input_price +
            judge_output_tokens * gpt4o_output_price
        ) * 2,  # x2 for position bias mitigation (forward + reverse)
        "gpt4o_mini_judge": n_comparisons * (
            tokens_per_eval * mini_input_price +
            judge_output_tokens * mini_output_price
        ) * 2,
        "claude_judge": n_comparisons * (
            tokens_per_eval * claude_input_price +
            judge_output_tokens * claude_output_price
        ) * 2,
        "human_crowdsource": n_comparisons * human_cost_per_comparison,
        "human_expert": n_comparisons * expert_cost_per_comparison,
    }

    print(f"\n=== Evaluation Cost Analysis ({n_comparisons:,} comparisons) ===")
    for method, cost in sorted(costs.items(), key=lambda x: x[1]):
        print(f"{method:<25} ${cost:>10,.2f}")

    print(f"\nCost multiplier vs GPT-4o:")
    for method, cost in costs.items():
        ratio = cost / costs["gpt4o_judge"]
        print(f"  {method:<25} {ratio:.2f}x")

    return costs


# Example
compute_evaluation_costs(10_000)
# Cost table printed for the default token counts (LLM judges are called twice per comparison):
# gpt4o_mini_judge          $      6.30
# gpt4o_judge               $    105.00
# claude_judge              $    138.00
# human_crowdsource         $  5,000.00
# human_expert              $ 50,000.00

Production Engineering Notes

When to Trust LLM-as-Judge

LLM-as-judge is reliable for:

Open-ended helpfulness comparisons between similar-quality models
Identifying obviously bad responses (hallucinations, refusals, non-sequiturs)
Style and format compliance
Initial screening before human review

LLM-as-judge is unreliable for:

Safety-critical decisions (always require human review)
Specialized technical domains where the judge lacks expertise
Very close comparisons (delta less than 1 point on 10-point scale is noise)
Detecting subtle hallucinations in specialized domains
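
The same caution about noise applies to pairwise win rates: a 55% win rate over 100 comparisons may not mean much. The sketch below uses a normal-approximation margin to flag results that are indistinguishable from a coin flip; the function name and the z value of 1.96 (roughly a 95% interval) are assumptions for illustration.

```python
import math


def win_rate_with_margin(wins_a: int, wins_b: int, z: float = 1.96) -> dict:
    """Win rate of A over decided comparisons, with an approximate margin.

    Ties and inconsistent verdicts should be excluded before calling this.
    """
    decided = wins_a + wins_b
    if decided == 0:
        raise ValueError("no decided comparisons")
    rate = wins_a / decided
    margin = z * math.sqrt(rate * (1 - rate) / decided)
    return {
        "win_rate": rate,
        "margin": margin,
        "distinguishable": abs(rate - 0.5) > margin,
    }


print(win_rate_with_margin(55, 45)["distinguishable"])    # False
print(win_rate_with_margin(700, 300)["distinguishable"])  # True
```

The approximation is rough for small samples or win rates near 0 or 1, so treat it as a sanity check, not a formal significance test.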

Calibration Against Human Judgments

Before using LLM-as-judge in production, calibrate:

def calibrate_judge_against_humans(
    human_preferences: list[dict],  # Ground truth human preferences
    judge: LLMJudge,
) -> dict:
    """
    Measure how well the LLM judge agrees with human preferences.
    human_preferences: list of {prompt, response_a, response_b, human_winner}
    """
    # Judge every item once; the results feed both statistics below
    results = []
    for item in human_preferences:
        results.append(judge.judge_pairwise(
            item["prompt"],
            item["response_a"],
            item["response_b"],
        ))

    # Agreement rate, skipping items without a usable verdict
    valid_pairs = [
        (r.winner, item["human_winner"])
        for r, item in zip(results, human_preferences)
        if r.winner != "inconsistent" and item["human_winner"] != "inconsistent"
    ]
    valid = len(valid_pairs)
    agreements = sum(j == h for j, h in valid_pairs)
    agreement_rate = agreements / valid if valid > 0 else 0.0

    # Position bias rate, reusing the results above instead of re-calling the judge
    bias_rate = (
        sum(r.position_bias_detected for r in results) / len(results)
        if results else 0.0
    )

    return {
        "human_judge_agreement": round(agreement_rate, 4),
        "n_valid_comparisons": valid,
        "position_bias_rate": round(bias_rate, 4),
        "acceptable": agreement_rate >= 0.75,  # Assumed cutoff; MT-Bench reports 80%+ for GPT-4
    }

:::warning Do Not Use Weaker Models as Judges

GPT-3.5-turbo and similar models show significantly more position bias and lower human-agreement rates than GPT-4-class models. The cost savings of using a cheaper judge often are not worth the unreliability. If budget is a constraint, use GPT-4o-mini with explicit anti-bias instructions rather than GPT-3.5.

:::

:::danger LLM Judges Cannot Evaluate Safety

LLM judges should not be the final arbiter on safety decisions. They can miss subtle harms, can be prompted to approve harmful content, and have no accountability. Always route borderline safety evaluations to human reviewers.

:::

Common Mistakes

:::danger Forgetting Position Swap

If you evaluate (A, B) but not (B, A), your results are contaminated by position bias. This is especially critical when you are trying to detect small differences between similar-quality models. Always run both orderings.

:::

:::warning Using the Same Model Family to Judge Itself

Using GPT-4 to judge GPT-4 outputs introduces self-preference bias. If you are comparing GPT-4 and Claude, run the evaluation with each of them as the judge and aggregate the verdicts. Better yet: use a third model family as the judge, or use a specialized evaluation model like Prometheus.

:::

:::danger Over-Indexing on LLM Judge Scores

LLM judge scores are proxies. A model that improves from 7.2 to 7.4 on MT-Bench may or may not be actually better for users. Always validate judge improvements with human evaluation on a sample before drawing product conclusions.

:::

Interview Q&A

Q1: What is LLM-as-judge and why did it become popular?

LLM-as-judge is using a powerful language model (typically GPT-4 or equivalent) to automatically evaluate the outputs of other models. It became popular because: (1) human annotation is too expensive and slow for the volume required by modern RLHF training loops - you need tens of thousands of labeled comparisons per training run, which adds up to hundreds of thousands per week when you train repeatedly; (2) GPT-4-class models have demonstrated 80%+ agreement with expert human judgments in controlled studies (MT-Bench), making them reliable enough for many evaluation scenarios; (3) the cost is more than an order of magnitude lower - roughly $0.03 per evaluation vs $0.50+ for human annotation; (4) it enables iterative development - you can evaluate every model checkpoint automatically without waiting for human annotation batches.

Q2: What is position bias in LLM-as-judge and how do you mitigate it?

Position bias is the tendency of LLM judges to prefer the response that appears first (in the A position). Studies show GPT-4 prefers position A 56% of the time when both responses are equivalent. A likely explanation is that LLMs are trained on data where "the relevant item is first" (QA formats, ordered lists) and this bias persists in their judging behavior. Mitigation: evaluate every comparison in both orderings - (Response A, Response B) and (Response B, Response A). If the same underlying response wins in both orderings (it is picked as "A" in the forward pass and as "B" in the reverse pass), the result is trustworthy. If the two verdicts disagree, call it a tie or flag it as inconsistent. This doubles your API cost but removes most of the position bias.

Q3: What are the failure modes of LLM-as-judge and when should you not use it?

Key failure modes: (1) Verbosity bias - judges favor longer responses; mitigate by explicitly instructing the judge to evaluate quality independently of length. (2) Self-preference - model families prefer their own style; mitigate by using a different judge family. (3) Sycophancy - judges tend to validate responses that seem confidently stated, even if wrong. (4) Domain blindness - the judge cannot evaluate specialized technical content it was not trained on; a GPT-4 judge is unreliable for evaluating cutting-edge mathematics or obscure domain knowledge. Do not use LLM-as-judge for: safety decisions, evaluation of specialized technical domains the judge was not trained on, very close comparisons where differences are smaller than judge noise, and any decision with significant real-world consequences.

Q4: How does MT-Bench differ from other LLM evaluation approaches?

MT-Bench uses GPT-4 as a judge to evaluate multi-turn conversations across 8 task categories on a 1-10 scale. What makes it different: (1) it evaluates multi-turn capability, not just single responses - crucial for assistant use cases; (2) it uses a carefully curated set of 80 questions designed to stress-test capabilities across diverse categories; (3) it established the empirical foundation for LLM-as-judge by measuring GPT-4's agreement with human judgments; (4) it includes reference answers for challenging categories (math, coding) where the judge alone might be unreliable. Limitations: 80 questions is a small benchmark, and GPT-4 as a judge is biased toward GPT-4-style responses.

Q5: Design an LLM-as-judge pipeline for evaluating a medical question-answering system.

Medical QA requires special care because errors can cause harm. My pipeline: (1) Use reference-guided evaluation with verified medical answers (reviewed by MDs) as ground truth. (2) Use two independent judge models from different families (e.g., GPT-4 and Claude) and only trust high-confidence results where both agree. (3) Add a specialized accuracy dimension: "Does the response contain any medically incorrect claims?" with explicit medical knowledge in the system prompt. (4) Always apply position bias mitigation. (5) Flag any response where the judge is uncertain (tie or inconsistent) for human review by a medical professional. (6) Never let the LLM judge be the final authority for safety - all responses flagged as potentially harmful go to human review. (7) Calibrate the pipeline against a gold set of 200 human-rated medical QA pairs before deploying it in the evaluation loop.
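
The routing rules in steps (2), (5), and (6) can be sketched as a small decision function. This is an illustration under stated assumptions: the verdict labels match the `JudgmentResult.winner` values used earlier, while `route_medical_judgment`, the `safety_flag` input, and the returned route names are hypothetical.

```python
def route_medical_judgment(verdict_1: str, verdict_2: str, safety_flag: bool) -> str:
    """Decide whether two judge verdicts can be accepted or need a human.

    verdict_1 / verdict_2: "A", "B", "tie", or "inconsistent", from two
    judges of different model families.
    safety_flag: True if either judge flagged potentially harmful content.
    """
    if safety_flag:
        return "human_review"  # safety is never decided by the judges alone
    uncertain = ("tie", "inconsistent")
    if verdict_1 in uncertain or verdict_2 in uncertain:
        return "human_review"  # at least one judge was unsure
    if verdict_1 != verdict_2:
        return "human_review"  # the two judge families disagree
    return f"accept_{verdict_1}"


print(route_medical_judgment("A", "A", safety_flag=False))  # accept_A
print(route_medical_judgment("A", "B", safety_flag=False))  # human_review
print(route_medical_judgment("A", "A", safety_flag=True))   # human_review
```

Checking the safety flag first means a confident, unanimous verdict can never bypass human review of a potentially harmful response.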

