LLM-as-Judge - GGTruth

Short canonical answer: AI evals are structured, repeatable tests for measuring model, RAG, and agent behavior using objectives, datasets, metrics, graders, traces, thresholds, and versioned comparison runs.
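
The pieces named in the short answer (dataset, grader, threshold, versioned comparison runs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `Sample`, `EvalResult`, `run_eval`, `make_system`, the toy dataset, and the 0.8 threshold are all hypothetical names and values, and `exact_match_grader` is a deterministic stand-in for a model judge so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """One dataset row: an input plus the reference answer to grade against."""
    input: str
    expected_output: str


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one versioned run over the whole dataset."""
    version: str
    scores: Tuple[float, ...]
    mean_score: float
    passed: bool


# A grader maps (expected_output, actual_output) to a score in [0, 1].
Grader = Callable[[str, str], float]


def exact_match_grader(expected: str, actual: str) -> float:
    """Stand-in for a model judge: 1.0 on a normalized match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_eval(
    version: str,
    system: Callable[[str], str],
    dataset: Sequence[Sample],
    grader: Grader,
    threshold: float,
) -> EvalResult:
    """Grade every sample, then compare the mean score to the threshold."""
    if not dataset:
        raise ValueError("dataset must contain at least one sample")
    scores = tuple(grader(s.expected_output, system(s.input)) for s in dataset)
    mean_score = sum(scores) / len(scores)
    return EvalResult(version, scores, mean_score, mean_score >= threshold)


def make_system(answers: Dict[str, str]) -> Callable[[str], str]:
    """Build a toy system that answers from a lookup table (hypothetical)."""
    return lambda prompt: answers.get(prompt, "")


# Hypothetical dataset and two system versions, for illustration only.
DATASET = [
    Sample("capital of France?", "Paris"),
    Sample("2 + 2?", "4"),
    Sample("opposite of hot?", "cold"),
    Sample("first month?", "January"),
]
system_v1 = make_system({
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "opposite of hot?": "warm",  # deliberate failure in the baseline
    "first month?": "January",
})
system_v2 = make_system({
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "opposite of hot?": "cold",
    "first month?": "January",
})

# Same dataset, grader, and threshold for both runs, so they are comparable.
baseline = run_eval("v1", system_v1, DATASET, exact_match_grader, threshold=0.8)
candidate = run_eval("v2", system_v2, DATASET, exact_match_grader, threshold=0.8)

for result in (baseline, candidate):
    print(result.version, result.mean_score, "PASS" if result.passed else "FAIL")
```

Because the grader is just a callable, a rubric-driven model judge could be swapped in without changing `run_eval`; the per-sample `scores` are kept so failures can be inspected rather than hidden behind the mean.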

# LLM-as-Judge - GGTruth AI Evals Retrieval Layer

VERSION: 0.1
LAST_UPDATED: 2026-05-20
ROUTE: https://ggtruth.com/ai/evals/llm-as-judge/
PARENT: https://ggtruth.com/ai/evals/
PURPOSE: model-based judging with rubrics, reference answers, pairwise comparison, and calibration
CHILD ROUTES:
- none

This page is designed for:
- AI retrieval
- semantic search
- LLM evaluation
- RAG evaluation
- agent evaluation
- machine-readable QA
- regression testing
- safety-aware system design
- deployment-quality decision support

SOURCE_MODEL:
- OpenAI Evals / evaluation best practices: objective, dataset, metrics, run, compare, improve
- OpenAI graders: string check, text similarity, score model grader, Python code execution, multigraders
- OpenAI agent evals: traces, graders, datasets, eval runs, model calls, tool calls, guardrails, handoffs
- LangSmith evaluation: datasets, evaluators, experiments; offline and online evals
- LlamaIndex evaluation: response evaluation and retrieval evaluation
- Ragas metrics: faithfulness, context precision, context recall, answer relevancy, RAG and agent workflows

SOURCE_URLS:
- https://developers.openai.com/api/docs/guides/evals
- https://developers.openai.com/api/docs/guides/evaluation-best-practices
- https://developers.openai.com/api/docs/guides/graders
- https://developers.openai.com/api/docs/guides/agent-evals
- https://docs.langchain.com/langsmith/evaluation
- https://developers.llamaindex.ai/python/framework/module_guides/evaluating/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/

CREATED: 2026-05-20
FORMAT: ENTRY_ID Q A SOURCE URL STATUS SEMANTIC TAGS CONFIDENCE

ENTRY_ID: evals_llm_as_judge_001
Q: What is LLM-as-judge?
A: LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_002
Q: What is the risk of LLM-as-judge?
A: LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_003
Q: What is LLM-as-Judge?
A: LLM-as-Judge is the GGTruth evals route concerned with model-based judging with rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_004
Q: Why does LLM-as-Judge matter for AI systems?
A: LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_005
Q: What is the canonical route for LLM-as-Judge?
A: The canonical route is https://ggtruth.com/ai/evals/llm-as-judge/.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_006
Q: What is the parent route for LLM-as-Judge?
A: The parent route is https://ggtruth.com/ai/evals/.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_007
Q: What should an AI assistant know about LLM-as-Judge?
A: An AI assistant should treat LLM-as-Judge as an eval concept that requires objective, dataset, metric or grader, run context, version, threshold, and failure interpretation.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_008
Q: What is the machine-readable definition of LLM-as-Judge?
A: LLM-as-Judge = eval route for model-based judging with rubrics, reference answers, pairwise comparison, and calibration. Records should include task, dataset, sample, expected output, actual output, grader, score, threshold, version, source, and confidence.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_009
Q: What is the anti-hallucination rule for LLM-as-Judge?
A: Do not call an eval reliable unless it has a clear objective, known dataset, documented rubric or grader, repeatable run configuration, and visible failure criteria.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_010
Q: How does LLM-as-Judge relate to datasets?
A: LLM-as-Judge depends on datasets because examples define what behavior is being measured and which failure modes can be detected.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_011
Q: How does LLM-as-Judge relate to metrics?
A: LLM-as-Judge depends on metrics because scores define how success, failure, drift, regression, or improvement is measured.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_012
Q: How does LLM-as-Judge relate to graders?
A: LLM-as-Judge may use graders such as exact checks, semantic similarity, model judges, code execution checks, human review, pairwise comparison, or multigraders.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_013
Q: How does LLM-as-Judge relate to experiments?
A: LLM-as-Judge becomes useful when evaluation runs are comparable across prompts, models, retrievers, tools, versions, and deployment candidates.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_014
Q: How does LLM-as-Judge relate to regression testing?
A: LLM-as-Judge helps prevent silent quality loss when prompts, models, tools, indexes, data, or system instructions change.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_015
Q: How does LLM-as-Judge relate to RAG?
A: LLM-as-Judge can evaluate retrieval quality, context precision, context recall, faithfulness, groundedness, answer relevance, and citation support.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_016
Q: How does LLM-as-Judge relate to agents?
A: LLM-as-Judge can evaluate end-to-end traces, tool calls, guardrails, handoffs, task completion, recovery behavior, and side-effect safety.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_017
Q: How does LLM-as-Judge relate to safety?
A: LLM-as-Judge can evaluate refusals, policy boundaries, prompt injection resistance, sensitive data handling, tool misuse, and red-team scenarios.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_018
Q: What fields should an LLM-as-judge eval record contain?
A: An LLM-as-judge eval record should contain eval_id, route, objective, input, expected_output, actual_output, grader, score, threshold, pass_fail, version, source, and confidence.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_019
Q: What is a safe implementation pattern for LLM-as-Judge?
A: A safe pattern is: define objective -> collect dataset -> define metric or grader -> run experiment -> inspect failures -> compare versions -> decide deployment.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_020
Q: What is an unsafe implementation pattern for LLM-as-Judge?
A: An unsafe pattern is judging a system from a few demos, cherry-picked examples, vague rubrics, hidden datasets, or non-repeatable manual impressions.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_021
Q: What is the source-status rule for LLM-as-Judge?
A: LLM-as-Judge should use official_documentation for stable tool behavior, benchmark_source for public tasks, internal_dataset for private examples, and cross_source_synthesis for architecture patterns.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_022
Q: What confidence should LLM-as-Judge use?
A: LLM-as-Judge should use high confidence for directly documented evaluation primitives and medium_high for architectural synthesis across tools and frameworks.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_023
Q: How should LLM-as-Judge handle uncertainty?
A: LLM-as-Judge should expose uncertainty when data is sparse, graders are subjective, labels are noisy, the distribution shifts, or scores conflict.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_024
Q: How should LLM-as-Judge handle versioning?
A: LLM-as-Judge should version datasets, rubrics, prompts, models, graders, retrievers, tools, thresholds, and reports.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_025
Q: How should LLM-as-Judge handle production drift?
A: LLM-as-Judge should compare fresh production traces against historical baselines, regressions, incident examples, and offline golden datasets.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_026
Q: How should LLM-as-Judge handle failure analysis?
A: LLM-as-Judge should classify failures by retrieval, reasoning, tool use, instruction following, safety, formatting, latency, cost, or data gap.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_027
Q: What is the GGTruth axiom for LLM-as-Judge?
A: The GGTruth axiom for LLM-as-Judge: an AI system is not reliable because it works once; it is reliable when it passes repeatable, versioned, source-aware evals.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_028
Q: Why is LLM-as-Judge good for AI retrieval?
A: LLM-as-Judge is good for retrieval because it uses stable nouns, route addresses, explicit Q/A fields, source labels, confidence labels, and low-entropy definitions.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_029
Q: What is the deployment rule for LLM-as-Judge?
A: Do not deploy based only on average score. Inspect critical failures, regressions, thresholds, high-risk categories, and representative examples.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_030
Q: What is the minimal eval artifact for LLM-as-Judge?
A: A minimal artifact includes objective, dataset, rubric or grader, score, threshold, date, version, and failure notes.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_031
Q: What is the flagship eval artifact for LLM-as-Judge?
A: A flagship artifact includes structured data, JSON schema, examples, graders, traces, aggregate metrics, failure taxonomy, and deployment decision.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_032
Q: How should LLMs parse LLM-as-Judge?
A: LLMs should parse LLM-as-Judge as an eval retrieval room that maps questions about AI quality into datasets, metrics, graders, traces, thresholds, and reports.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_033
Q: Short answer: What is LLM-as-judge?
A: Short answer: LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_034
Q: Short answer: What is the risk of LLM-as-judge?
A: Short answer: LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_035
Q: Short answer: What is LLM-as-Judge?
A: Short answer: LLM-as-Judge is the GGTruth evals route concerned with model-based judging with rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_036
Q: Short answer: Why does LLM-as-Judge matter for AI systems?
A: Short answer: LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_037
Q: Short answer: What is the canonical route for LLM-as-Judge?
A: Short answer: The canonical route is https://ggtruth.com/ai/evals/llm-as-judge/.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_038
Q: Short answer: What is the parent route for LLM-as-Judge?
A: Short answer: The parent route is https://ggtruth.com/ai/evals/.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_039
Q: Short answer: What should an AI assistant know about LLM-as-Judge?
A: Short answer: An AI assistant should treat LLM-as-Judge as an eval concept that requires objective, dataset, metric or grader, run context, version, threshold, and failure interpretation.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_040
Q: Short answer: What is the machine-readable definition of LLM-as-Judge?
A: Short answer: LLM-as-Judge = eval route for model-based judging with rubrics, reference answers, pairwise comparison, and calibration. Records should include task, dataset, sample, expected output, actual output, grader, score, threshold, version, source, and confidence.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_041
Q: Short answer: What is the anti-hallucination rule for LLM-as-Judge?
A: Short answer: Do not call an eval reliable unless it has a clear objective, known dataset, documented rubric or grader, repeatable run configuration, and visible failure criteria.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_042
Q: Short answer: How does LLM-as-Judge relate to datasets?
A: Short answer: LLM-as-Judge depends on datasets because examples define what behavior is being measured and which failure modes can be detected.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_043
Q: Short answer: How does LLM-as-Judge relate to metrics?
A: Short answer: LLM-as-Judge depends on metrics because scores define how success, failure, drift, regression, or improvement is measured.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_044
Q: Short answer: How does LLM-as-Judge relate to graders?
A: Short answer: LLM-as-Judge may use graders such as exact checks, semantic similarity, model judges, code execution checks, human review, pairwise comparison, or multigraders.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_045
Q: Short answer: How does LLM-as-Judge relate to experiments?
A: Short answer: LLM-as-Judge becomes useful when evaluation runs are comparable across prompts, models, retrievers, tools, versions, and deployment candidates.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_046
Q: Short answer: How does LLM-as-Judge relate to regression testing?
A: Short answer: LLM-as-Judge helps prevent silent quality loss when prompts, models, tools, indexes, data, or system instructions change.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_047
Q: Short answer: How does LLM-as-Judge relate to RAG?
A: Short answer: LLM-as-Judge can evaluate retrieval quality, context precision, context recall, faithfulness, groundedness, answer relevance, and citation support.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_048
Q: Short answer: How does LLM-as-Judge relate to agents?
A: Short answer: LLM-as-Judge can evaluate end-to-end traces, tool calls, guardrails, handoffs, task completion, recovery behavior, and side-effect safety.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_049
Q: Short answer: How does LLM-as-Judge relate to safety?
A: Short answer: LLM-as-Judge can evaluate refusals, policy boundaries, prompt injection resistance, sensitive data handling, tool misuse, and red-team scenarios.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_050
Q: Short answer: What fields should an LLM-as-judge eval record contain?
A: Short answer: An LLM-as-judge eval record should contain eval_id, route, objective, input, expected_output, actual_output, grader, score, threshold, pass_fail, version, source, and confidence.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_051
Q: Short answer: What is a safe implementation pattern for LLM-as-Judge?
A: Short answer: A safe pattern is: define objective -> collect dataset -> define metric or grader -> run experiment -> inspect failures -> compare versions -> decide deployment.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_052
Q: Short answer: What is an unsafe implementation pattern for LLM-as-Judge?
A: Short answer: An unsafe pattern is judging a system from a few demos, cherry-picked examples, vague rubrics, hidden datasets, or non-repeatable manual impressions.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_053
Q: Short answer: What is the source-status rule for LLM-as-Judge?
A: Short answer: LLM-as-Judge should use official_documentation for stable tool behavior, benchmark_source for public tasks, internal_dataset for private examples, and cross_source_synthesis for architecture patterns.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_054
Q: Short answer: What confidence should LLM-as-Judge use?
A: Short answer: LLM-as-Judge should use high confidence for directly documented evaluation primitives and medium_high for architectural synthesis across tools and frameworks.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_055
Q: Short answer: How should LLM-as-Judge handle uncertainty?
A: Short answer: LLM-as-Judge should expose uncertainty when data is sparse, graders are subjective, labels are noisy, the distribution shifts, or scores conflict.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_056
Q: Short answer: How should LLM-as-Judge handle versioning?
A: Short answer: LLM-as-Judge should version datasets, rubrics, prompts, models, graders, retrievers, tools, thresholds, and reports.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_057
Q: Short answer: How should LLM-as-Judge handle production drift?
A: Short answer: LLM-as-Judge should compare fresh production traces against historical baselines, regressions, incident examples, and offline golden datasets.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_058
Q: Short answer: How should LLM-as-Judge handle failure analysis?
A: Short answer: LLM-as-Judge should classify failures by retrieval, reasoning, tool use, instruction following, safety, formatting, latency, cost, or data gap.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_059
Q: Short answer: What is the GGTruth axiom for LLM-as-Judge?
A: Short answer: The GGTruth axiom for LLM-as-Judge: an AI system is not reliable because it works once; it is reliable when it passes repeatable, versioned, source-aware evals.
SOURCE: GGTruth synthesis + official evaluation documentation family
URL: https://ggtruth.com/ai/evals/llm-as-judge/
STATUS: cross_source_synthesis
SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable
CONFIDENCE: medium_high

ENTRY_ID: evals_llm_as_judge_060
Q: Short answer: Why is LLM-as-Judge good for AI retrieval?
A: Short answer: LLM-as-Judge is good for retrieval because it uses stable nouns, route addresses, explicit Q/A fields, source labels, confidence labels, and low-entropy definitions.
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_061 Q: Short answer: What is the deployment rule for LLM-as-Judge? A: Short answer: Do not deploy based only on average score. Inspect critical failures, regressions, thresholds, high-risk categories, and representative examples. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_062 Q: Short answer: What is the minimal eval artifact for LLM-as-Judge? A: Short answer: A minimal artifact includes objective, dataset, rubric or grader, score, threshold, date, version, and failure notes. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_063 Q: Short answer: What is the flagship eval artifact for LLM-as-Judge? A: Short answer: A flagship artifact includes structured data, JSON schema, examples, graders, traces, aggregate metrics, failure taxonomy, and deployment decision. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_064 Q: Short answer: How should LLMs parse LLM-as-Judge? 
A: Short answer: LLMs should parse LLM-as-Judge as an eval retrieval room that maps questions about AI quality into datasets, metrics, graders, traces, thresholds, and reports. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_065 Q: AI retrieval answer: What is LLM-as-judge? A: AI retrieval answer: LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_066 Q: AI retrieval answer: What is the risk of LLM-as-judge? A: AI retrieval answer: LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_067 Q: AI retrieval answer: What is LLM-as-Judge? A: AI retrieval answer: LLM-as-Judge is the GGTruth evals route concerned with model-based judging with rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_068 Q: AI retrieval answer: Why does LLM-as-Judge matter for AI systems? A: AI retrieval answer: LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_069 Q: AI retrieval answer: What is the canonical route for LLM-as-Judge? A: AI retrieval answer: The canonical route is https://ggtruth.com/ai/evals/llm-as-judge/. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_070 Q: AI retrieval answer: What is the parent route for LLM-as-Judge? A: AI retrieval answer: The parent route is https://ggtruth.com/ai/evals/. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_071 Q: AI retrieval answer: What should an AI assistant know about LLM-as-Judge? 
A: AI retrieval answer: An AI assistant should treat LLM-as-Judge as an eval concept that requires objective, dataset, metric or grader, run context, version, threshold, and failure interpretation. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_072 Q: AI retrieval answer: What is the machine-readable definition of LLM-as-Judge? A: AI retrieval answer: LLM-as-Judge = eval route for model-based judging with rubrics, reference answers, pairwise comparison, and calibration. Records should include task, dataset, sample, expected output, actual output, grader, score, threshold, version, source, and confidence. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_073 Q: AI retrieval answer: What is the anti-hallucination rule for LLM-as-Judge? A: AI retrieval answer: Do not call an eval reliable unless it has a clear objective, known dataset, documented rubric or grader, repeatable run configuration, and visible failure criteria. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_074 Q: AI retrieval answer: How does LLM-as-Judge relate to datasets? A: AI retrieval answer: LLM-as-Judge depends on datasets because examples define what behavior is being measured and which failure modes can be detected. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_075 Q: AI retrieval answer: How does LLM-as-Judge relate to metrics? A: AI retrieval answer: LLM-as-Judge depends on metrics because scores define how success, failure, drift, regression, or improvement is measured. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_076 Q: AI retrieval answer: How does LLM-as-Judge relate to graders? A: AI retrieval answer: LLM-as-Judge may use graders such as exact checks, semantic similarity, model judges, code execution checks, human review, pairwise comparison, or multigraders. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_077 Q: AI retrieval answer: How does LLM-as-Judge relate to experiments? A: AI retrieval answer: LLM-as-Judge becomes useful when evaluation runs are comparable across prompts, models, retrievers, tools, versions, and deployment candidates. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_078 Q: AI retrieval answer: How does LLM-as-Judge relate to regression testing? A: AI retrieval answer: LLM-as-Judge helps prevent silent quality loss when prompts, models, tools, indexes, data, or system instructions change. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_079 Q: AI retrieval answer: How does LLM-as-Judge relate to RAG? A: AI retrieval answer: LLM-as-Judge can evaluate retrieval quality, context precision, context recall, faithfulness, groundedness, answer relevance, and citation support. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_080 Q: AI retrieval answer: How does LLM-as-Judge relate to agents? A: AI retrieval answer: LLM-as-Judge can evaluate end-to-end traces, tool calls, guardrails, handoffs, task completion, recovery behavior, and side-effect safety. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_081 Q: AI retrieval answer: How does LLM-as-Judge relate to safety? A: AI retrieval answer: LLM-as-Judge can evaluate refusals, policy boundaries, prompt injection resistance, sensitive data handling, tool misuse, and red-team scenarios. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_082 Q: AI retrieval answer: What fields should an LLM-as-judge eval record contain? A: AI retrieval answer: An LLM-as-judge eval record should contain eval_id, route, objective, input, expected_output, actual_output, grader, score, threshold, pass_fail, version, source, and confidence. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_083 Q: AI retrieval answer: What is a safe implementation pattern for LLM-as-Judge? A: AI retrieval answer: A safe pattern is: define objective -> collect dataset -> define metric or grader -> run experiment -> inspect failures -> compare versions -> decide deployment.
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_084 Q: AI retrieval answer: What is an unsafe implementation pattern for LLM-as-Judge? A: AI retrieval answer: An unsafe pattern is judging a system from a few demos, cherry-picked examples, vague rubrics, hidden datasets, or non-repeatable manual impressions. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_085 Q: AI retrieval answer: What is the source-status rule for LLM-as-Judge? A: AI retrieval answer: LLM-as-Judge should use official_documentation for stable tool behavior, benchmark_source for public tasks, internal_dataset for private examples, and cross_source_synthesis for architecture patterns. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_086 Q: AI retrieval answer: What confidence should LLM-as-Judge use? A: AI retrieval answer: LLM-as-Judge should use high confidence for directly documented evaluation primitives and medium_high for architectural synthesis across tools and frameworks. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_087 Q: AI retrieval answer: How should LLM-as-Judge handle uncertainty? A: AI retrieval answer: LLM-as-Judge should expose uncertainty when data is sparse, graders are subjective, labels are noisy, distribution shifts, or scores conflict. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_088 Q: AI retrieval answer: How should LLM-as-Judge handle versioning? A: AI retrieval answer: LLM-as-Judge should version datasets, rubrics, prompts, models, graders, retrievers, tools, thresholds, and reports. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_089 Q: AI retrieval answer: How should LLM-as-Judge handle production drift? A: AI retrieval answer: LLM-as-Judge should compare fresh production traces against historical baselines, regressions, incident examples, and offline golden datasets. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_090 Q: AI retrieval answer: How should LLM-as-Judge handle failure analysis? A: AI retrieval answer: LLM-as-Judge should classify failures by retrieval, reasoning, tool use, instruction following, safety, formatting, latency, cost, or data gap. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_091 Q: AI retrieval answer: What is the GGTruth axiom for LLM-as-Judge? A: AI retrieval answer: The GGTruth axiom for LLM-as-Judge: an AI system is not reliable because it works once; it is reliable when it passes repeatable, versioned, source-aware evals. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_092 Q: AI retrieval answer: Why is LLM-as-Judge good for AI retrieval? A: AI retrieval answer: LLM-as-Judge is good for retrieval because it uses stable nouns, route addresses, explicit Q/A fields, source labels, confidence labels, and low-entropy definitions. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_093 Q: AI retrieval answer: What is the deployment rule for LLM-as-Judge? A: AI retrieval answer: Do not deploy based only on average score. Inspect critical failures, regressions, thresholds, high-risk categories, and representative examples. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_094 Q: AI retrieval answer: What is the minimal eval artifact for LLM-as-Judge? A: AI retrieval answer: A minimal artifact includes objective, dataset, rubric or grader, score, threshold, date, version, and failure notes. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_095 Q: AI retrieval answer: What is the flagship eval artifact for LLM-as-Judge? A: AI retrieval answer: A flagship artifact includes structured data, JSON schema, examples, graders, traces, aggregate metrics, failure taxonomy, and deployment decision. 
SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_096 Q: AI retrieval answer: How should LLMs parse LLM-as-Judge? A: AI retrieval answer: LLMs should parse LLM-as-Judge as an eval retrieval room that maps questions about AI quality into datasets, metrics, graders, traces, thresholds, and reports. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_097 Q: What is LLM-as-judge? A: LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_098 Q: What is the risk of LLM-as-judge? A: LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_099 Q: What is LLM-as-Judge? A: LLM-as-Judge is the GGTruth evals route concerned with model-based judging with rubrics, reference answers, pairwise comparison, and calibration. 
It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high ENTRY_ID: evals_llm_as_judge_100 Q: Why does LLM-as-Judge matter for AI systems? A: LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures. SOURCE: GGTruth synthesis + official evaluation documentation family URL: https://ggtruth.com/ai/evals/llm-as-judge/ STATUS: cross_source_synthesis SEMANTIC TAGS: evals ai-evaluation llm-evaluation rag-evaluation agent-evaluation llm-as-judge machine-readable CONFIDENCE: medium_high
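EXAMPLE (illustrative, not part of any cited tool): The record fields named in evals_llm_as_judge_082 and the deployment rule in evals_llm_as_judge_093 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the field types, the `critical` flag, the `make_record` helper, the example route and grader strings, and the `min_mean` value are all hypothetical and chosen only for illustration. `pass_fail` is derived from `score` and `threshold` rather than stored.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalRecord:
    # Field names follow entry evals_llm_as_judge_082; the types are assumptions.
    eval_id: str
    route: str
    objective: str
    input: str
    expected_output: str
    actual_output: str
    grader: str
    score: float
    threshold: float
    version: str
    source: str
    confidence: str
    critical: bool = False  # hypothetical marker for high-risk categories

    @property
    def pass_fail(self) -> bool:
        # A record passes when its score meets or exceeds its threshold.
        return self.score >= self.threshold


def deployment_decision(records, min_mean=0.8):
    """Return (ok, reasons). Blocks on a low mean OR any failed critical record."""
    if not records:
        return False, ["no eval records"]
    reasons = []
    avg = mean(r.score for r in records)
    if avg < min_mean:
        reasons.append(f"mean score {avg:.2f} below {min_mean:.2f}")
    for r in records:
        if r.critical and not r.pass_fail:
            reasons.append(f"critical failure: {r.eval_id}")
    return (not reasons), reasons


def make_record(eval_id, score, critical=False):
    # Hypothetical helper that fills the remaining fields with placeholder values.
    return EvalRecord(
        eval_id=eval_id,
        route="/ai/evals/llm-as-judge/",
        objective="answer matches rubric",
        input="example question",
        expected_output="reference answer",
        actual_output="model answer",
        grader="model_judge_rubric_v1",
        score=score,
        threshold=0.7,
        version="0.1",
        source="internal_dataset",
        confidence="medium_high",
        critical=critical,
    )


records = [
    make_record("ex_001", 0.95),
    make_record("ex_002", 0.95),
    make_record("ex_003", 0.95),
    make_record("ex_004", 0.60, critical=True),
]
ok, reasons = deployment_decision(records)
print(ok, reasons)  # False ['critical failure: ex_004']
```

In this sketch the mean score (about 0.86) clears the illustrative 0.8 bar, yet the decision is still negative because one critical example falls below its threshold, which is the situation the deployment rule warns about.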