Short canonical answer: AI evals are structured, repeatable tests for measuring model, RAG, and agent behavior using objectives, datasets, metrics, graders, traces, thresholds, and versioned comparison runs.
# LLM-as-Judge — GGTruth AI Evals Retrieval Layer

VERSION:
0.1

LAST_UPDATED:
2026-05-20

ROUTE:
https://ggtruth.com/ai/evals/llm-as-judge/

PARENT:
https://ggtruth.com/ai/evals/

PURPOSE:
model-based judging with rubrics, reference answers, pairwise comparison, and calibration

CHILD ROUTES:
- none

This page is designed for:
- AI retrieval
- semantic search
- LLM evaluation
- RAG evaluation
- agent evaluation
- machine-readable QA
- regression testing
- safety-aware system design
- deployment-quality decision support

SOURCE_MODEL:
- OpenAI Evals / evaluation best practices: objective, dataset, metrics, run, compare, improve
- OpenAI graders: string check, text similarity, score model grader, Python code execution, multigraders
- OpenAI agent evals: traces, graders, datasets, eval runs, model calls, tool calls, guardrails, handoffs
- LangSmith evaluation: datasets, evaluators, experiments; offline and online evals
- LlamaIndex evaluation: response evaluation and retrieval evaluation
- Ragas metrics: faithfulness, context precision, context recall, answer relevancy, RAG and agent workflows


SOURCE_URLS:
- https://developers.openai.com/api/docs/guides/evals
- https://developers.openai.com/api/docs/guides/evaluation-best-practices
- https://developers.openai.com/api/docs/guides/graders
- https://developers.openai.com/api/docs/guides/agent-evals
- https://docs.langchain.com/langsmith/evaluation
- https://developers.llamaindex.ai/python/framework/module_guides/evaluating/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/


CREATED:
2026-05-20

FORMAT:
ENTRY_ID
Q
A
SOURCE
URL
STATUS
SEMANTIC TAGS
CONFIDENCE

ENTRY_ID:
evals_llm_as_judge_001

Q:
What is LLM-as-judge?

A:
LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
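
A minimal sketch of rubric-based judging against a reference answer. The rubric text, the `score|rationale` reply format, and the `fake_model` stub are illustrative assumptions, not part of any cited tool; a real setup would pass a function that calls an actual model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResult:
    score: int       # rubric score, 1 (worst) to 5 (best)
    rationale: str   # the judge's short explanation


# Hypothetical rubric; real rubrics are usually longer and more specific.
RUBRIC = "Score 1-5 for factual agreement with the reference answer."


def build_judge_prompt(question: str, answer: str, reference: str) -> str:
    """Assemble the rubric, the reference, and the candidate into one prompt."""
    return (
        f"{RUBRIC}\n\nQuestion: {question}\nReference: {reference}\n"
        f"Candidate: {answer}\n\nReply as '<score>|<rationale>'."
    )


def judge(question: str, answer: str, reference: str,
          call_model: Callable[[str], str]) -> JudgeResult:
    """Ask the judge model for a score and parse its reply strictly."""
    raw = call_model(build_judge_prompt(question, answer, reference))
    score_text, _, rationale = raw.partition("|")
    score = int(score_text.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return JudgeResult(score, rationale.strip())


def fake_model(prompt: str) -> str:
    """Stub standing in for a real model call (assumption for illustration)."""
    return "4|Matches the reference except for one missing detail."


result = judge("What is 2+2?", "4", "4", fake_model)
```

Parsing the reply strictly, and failing loudly on an out-of-range score, keeps malformed judge output from silently entering the results.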


ENTRY_ID:
evals_llm_as_judge_002

Q:
What is the risk of LLM-as-judge?

A:
LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
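
One calibration check that follows from the risks above is a position-swap test for pairwise judging: a judge that changes its winner when the two answers trade places is showing position bias. This is a sketch under stated assumptions; `pairwise_judge`, `always_first_judge`, and `longer_wins_judge` are hypothetical stand-ins for real model calls.

```python
from typing import Callable


def position_consistent(a: str, b: str,
                        pairwise_judge: Callable[[str, str], str]) -> bool:
    """Return True if the judge picks the same answer in both orders.

    pairwise_judge(x, y) must return "first" or "second".
    """
    first_pass = pairwise_judge(a, b)    # a shown first
    second_pass = pairwise_judge(b, a)   # b shown first
    winner_1 = a if first_pass == "first" else b
    winner_2 = b if second_pass == "first" else a
    return winner_1 == winner_2


def always_first_judge(x: str, y: str) -> str:
    """Stub judge with pure position bias: always prefers the first slot."""
    return "first"


def longer_wins_judge(x: str, y: str) -> str:
    """Stub judge that prefers the longer answer regardless of position."""
    return "first" if len(x) >= len(y) else "second"


biased = position_consistent("short", "a longer answer", always_first_judge)
stable = position_consistent("short", "a longer answer", longer_wins_judge)
```

The second stub passes the swap test while still having a length preference, so a position check is only one of several calibration checks a judge may need.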


ENTRY_ID:
evals_llm_as_judge_003

Q:
What is LLM-as-Judge?

A:
LLM-as-Judge is the GGTruth evals route for model-based judging using rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_004

Q:
Why does LLM-as-Judge matter for AI systems?

A:
LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_005

Q:
What is the canonical route for LLM-as-Judge?

A:
The canonical route is https://ggtruth.com/ai/evals/llm-as-judge/.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_006

Q:
What is the parent route for LLM-as-Judge?

A:
The parent route is https://ggtruth.com/ai/evals/.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_007

Q:
What should an AI assistant know about LLM-as-Judge?

A:
An AI assistant should treat LLM-as-Judge as an eval concept that requires objective, dataset, metric or grader, run context, version, threshold, and failure interpretation.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_008

Q:
What is the machine-readable definition of LLM-as-Judge?

A:
LLM-as-Judge = eval route for model-based judging with rubrics, reference answers, pairwise comparison, and calibration. Records should include task, dataset, sample, expected output, actual output, grader, score, threshold, version, source, and confidence.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_009

Q:
What is the anti-hallucination rule for LLM-as-Judge?

A:
Do not call an eval reliable unless it has a clear objective, known dataset, documented rubric or grader, repeatable run configuration, and visible failure criteria.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_010

Q:
How does LLM-as-Judge relate to datasets?

A:
LLM-as-Judge depends on datasets because examples define what behavior is being measured and which failure modes can be detected.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_011

Q:
How does LLM-as-Judge relate to metrics?

A:
LLM-as-Judge depends on metrics because scores define how success, failure, drift, regression, or improvement is measured.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_012

Q:
How does LLM-as-Judge relate to graders?

A:
LLM-as-Judge may use graders such as exact checks, semantic similarity, model judges, code execution checks, human review, pairwise comparison, or multigraders.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
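
A sketch of combining several graders into one weighted score, in the spirit of a multigrader. The function names and weights are assumptions, and `difflib.SequenceMatcher` is a character-level stand-in for a true semantic similarity grader.

```python
from difflib import SequenceMatcher


def exact_check(output: str, reference: str) -> float:
    """1.0 if the strings match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def text_similarity(output: str, reference: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


GRADERS = {"exact": exact_check, "similarity": text_similarity}


def multigrade(output: str, reference: str, weights: dict) -> float:
    """Weighted average of the named graders' scores."""
    total = sum(weights.values())
    weighted = sum(
        weight * GRADERS[name](output, reference)
        for name, weight in weights.items()
    )
    return weighted / total


weights = {"exact": 0.5, "similarity": 0.5}
same = multigrade("Paris", "Paris", weights)
case_only = multigrade("paris", "Paris", weights)
```

Here an answer that differs only in letter case fails the exact check but keeps full similarity credit, so the combined score sits between the two.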


ENTRY_ID:
evals_llm_as_judge_013

Q:
How does LLM-as-Judge relate to experiments?

A:
LLM-as-Judge becomes useful when evaluation runs are comparable across prompts, models, retrievers, tools, versions, and deployment candidates.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_014

Q:
How does LLM-as-Judge relate to regression testing?

A:
LLM-as-Judge helps prevent silent quality loss when prompts, models, tools, indexes, data, or system instructions change.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
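
A sketch of a per-sample regression check between two eval runs. The sample IDs and results are made up for illustration; the point is that an unchanged pass rate can still hide a regression.

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    """Return IDs of samples that passed in baseline but not in candidate.

    Both arguments map sample ID -> bool (passed). A sample missing from
    the candidate run is treated as a failure.
    """
    return sorted(
        sample_id
        for sample_id, passed in baseline.items()
        if passed and not candidate.get(sample_id, False)
    )


def pass_rate(results: dict) -> float:
    """Fraction of samples that passed."""
    return sum(results.values()) / len(results)


baseline_run = {"s1": True, "s2": True, "s3": False}
candidate_run = {"s1": True, "s2": False, "s3": True}

regressions = find_regressions(baseline_run, candidate_run)
```

Both runs pass two of three samples, yet `s2` regressed; comparing per-sample outcomes surfaces what the aggregate hides.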


ENTRY_ID:
evals_llm_as_judge_015

Q:
How does LLM-as-Judge relate to RAG?

A:
LLM-as-Judge can evaluate retrieval quality, context precision, context recall, faithfulness, groundedness, answer relevance, and citation support.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
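
A simplified, set-based sketch of context precision and context recall for a retriever. These are assumptions for illustration and not the Ragas definitions, which can be rank-aware and model-judged; document IDs are hypothetical.

```python
def context_precision(retrieved: list, relevant: set) -> float:
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def context_recall(retrieved: list, relevant: set) -> float:
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = {"d1", "d3", "d9"}

precision = context_precision(retrieved_ids, relevant_ids)
recall = context_recall(retrieved_ids, relevant_ids)
```

With two of four retrieved documents relevant and two of three relevant documents found, precision is 0.5 and recall is about 0.67.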


ENTRY_ID:
evals_llm_as_judge_016

Q:
How does LLM-as-Judge relate to agents?

A:
LLM-as-Judge can evaluate end-to-end traces, tool calls, guardrails, handoffs, task completion, recovery behavior, and side-effect safety.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
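
A sketch of a deterministic trace check that can run alongside a model judge for agents. The trace shape (a list of steps with `type` and `tool` keys) and the tool names are assumptions, not the format of any specific tracing product.

```python
def check_trace(trace: list, expected_tools: list,
                forbidden_tools: frozenset = frozenset()) -> dict:
    """Check tool-call order and forbidden tool use in one agent trace."""
    called = [step["tool"] for step in trace if step["type"] == "tool_call"]
    forbidden_used = sorted(set(called) & set(forbidden_tools))
    # Expected tools must appear in order, though other calls may interleave.
    remaining = iter(called)
    in_order = all(tool in remaining for tool in expected_tools)
    return {
        "called": called,
        "in_order": in_order,
        "forbidden_used": forbidden_used,
        "passed": in_order and not forbidden_used,
    }


trace = [
    {"type": "model_call"},
    {"type": "tool_call", "tool": "search"},
    {"type": "tool_call", "tool": "fetch"},
    {"type": "tool_call", "tool": "delete_file"},
]

report = check_trace(trace, ["search", "fetch"], frozenset({"delete_file"}))
```

The agent called the expected tools in order but also used a forbidden one, so the trace fails on side-effect safety even though the task steps look correct.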


ENTRY_ID:
evals_llm_as_judge_017

Q:
How does LLM-as-Judge relate to safety?

A:
LLM-as-Judge can evaluate refusals, policy boundaries, prompt injection resistance, sensitive data handling, tool misuse, and red-team scenarios.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_018

Q:
What fields should an LLM-as-judge eval record contain?

A:
An LLM-as-judge eval record should contain eval_id, route, objective, input, expected_output, actual_output, grader, score, threshold, pass_fail, version, source, and confidence.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
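
A sketch of the record above as a Python dataclass. The field names come from the answer; the types, the example values, and the choice to derive `pass_fail` from `score >= threshold` are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    eval_id: str
    route: str
    objective: str
    input: str
    expected_output: str
    actual_output: str
    grader: str
    score: float
    threshold: float
    version: str
    source: str
    confidence: str

    @property
    def pass_fail(self) -> str:
        """Derived field: pass when the score meets the threshold."""
        return "pass" if self.score >= self.threshold else "fail"

    def to_json(self) -> str:
        """Serialize all stored fields plus the derived pass_fail."""
        data = asdict(self)
        data["pass_fail"] = self.pass_fail
        return json.dumps(data, sort_keys=True)


# Hypothetical example values, apart from the route.
record = EvalRecord(
    eval_id="example_001",
    route="https://ggtruth.com/ai/evals/llm-as-judge/",
    objective="Answer matches the reference",
    input="What is 2+2?",
    expected_output="4",
    actual_output="4",
    grader="rubric_judge_v1",
    score=0.9,
    threshold=0.8,
    version="0.1",
    source="internal_dataset",
    confidence="medium_high",
)
```

Deriving `pass_fail` instead of storing it keeps the record from ever disagreeing with its own score and threshold.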


ENTRY_ID:
evals_llm_as_judge_019

Q:
What is a safe implementation pattern for LLM-as-Judge?

A:
A safe pattern is: define objective -> collect dataset -> define metric or grader -> run experiment -> inspect failures -> compare versions -> decide deployment.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
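
The steps in the safe pattern can be sketched end to end as follows. The dataset, the two system versions, the exact-match grader, and the deployment condition are all toy assumptions chosen to keep the example self-contained.

```python
def exact_grader(output: str, expected: str) -> float:
    """Grader: 1.0 on exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_eval(dataset: list, system, grader, threshold: float) -> list:
    """Run the system over every sample and grade each output."""
    results = []
    for sample in dataset:
        score = grader(system(sample["input"]), sample["expected"])
        results.append({"id": sample["id"], "score": score,
                        "passed": score >= threshold})
    return results


def summarize(results: list) -> dict:
    """Aggregate pass rate and list failing sample IDs for inspection."""
    failures = [r["id"] for r in results if not r["passed"]]
    return {"pass_rate": 1 - len(failures) / len(results),
            "failures": failures}


# Objective: answer simple sums exactly. Dataset: two labelled samples.
dataset = [
    {"id": "q1", "input": "2+2", "expected": "4"},
    {"id": "q2", "input": "3+3", "expected": "6"},
]


def system_v1(question: str) -> str:
    return "4"  # toy baseline that always answers "4"


def system_v2(question: str) -> str:
    return {"2+2": "4", "3+3": "6"}.get(question, "")  # toy candidate


report_v1 = summarize(run_eval(dataset, system_v1, exact_grader, 1.0))
report_v2 = summarize(run_eval(dataset, system_v2, exact_grader, 1.0))

# Decide deployment: no worse than baseline and no remaining failures.
deploy_v2 = (report_v2["pass_rate"] >= report_v1["pass_rate"]
             and not report_v2["failures"])
```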


ENTRY_ID:
evals_llm_as_judge_020

Q:
What is an unsafe implementation pattern for LLM-as-Judge?

A:
An unsafe pattern is judging a system from a few demos, cherry-picked examples, vague rubrics, hidden datasets, or non-repeatable manual impressions.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_021

Q:
What is the source-status rule for LLM-as-Judge?

A:
LLM-as-Judge should use official_documentation for stable tool behavior, benchmark_source for public tasks, internal_dataset for private examples, and cross_source_synthesis for architecture patterns.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_022

Q:
What confidence should LLM-as-Judge use?

A:
LLM-as-Judge should use high confidence for directly documented evaluation primitives and medium_high for architectural synthesis across tools and frameworks.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
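
A sketch of the confidence rule as a lookup. Mapping the `official_documentation` status to `high` is an assumed reading of "directly documented evaluation primitives"; statuses the rule does not cover raise an error instead of receiving a guessed label.

```python
# Only the two cases the rule states; other statuses are left undefined.
CONFIDENCE_BY_STATUS = {
    "official_documentation": "high",         # assumed reading of the rule
    "cross_source_synthesis": "medium_high",
}


def confidence_for(status: str) -> str:
    """Return the confidence label for a source status, or raise."""
    try:
        return CONFIDENCE_BY_STATUS[status]
    except KeyError:
        raise ValueError(
            f"no confidence rule stated for status: {status}"
        ) from None


label = confidence_for("cross_source_synthesis")
```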


ENTRY_ID:
evals_llm_as_judge_023

Q:
How should LLM-as-Judge handle uncertainty?

A:
LLM-as-Judge should expose uncertainty when data is sparse, graders are subjective, labels are noisy, distributions shift, or scores conflict.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
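
A sketch of reporting uncertainty next to a judge score, using repeated judge runs on the same sample. The flag names and the `min_samples` and `max_spread` defaults are assumptions and would need tuning per task.

```python
from statistics import mean, pstdev


def score_with_uncertainty(scores: list, min_samples: int = 5,
                           max_spread: float = 0.5) -> dict:
    """Summarize repeated judge scores and flag weak evidence."""
    spread = pstdev(scores) if len(scores) > 1 else 0.0
    flags = []
    if len(scores) < min_samples:
        flags.append("sparse_data")       # too few runs to trust the mean
    if spread > max_spread:
        flags.append("scores_conflict")   # runs disagree with each other
    return {"mean": mean(scores), "spread": spread, "flags": flags}


unstable = score_with_uncertainty([1, 5, 1, 5])
steady = score_with_uncertainty([4, 4, 4, 4, 4])
```

The first sample averages 3 but the judge swings between 1 and 5, so the mean alone would overstate how much is known.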


ENTRY_ID:
evals_llm_as_judge_024

Q:
How should LLM-as-Judge handle versioning?

A:
LLM-as-Judge should version datasets, rubrics, prompts, models, graders, retrievers, tools, thresholds, and reports.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
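
One way to make versioning concrete is to fingerprint the full run configuration, so any change to a dataset, rubric, model, or threshold yields a new ID. The config keys and values below are hypothetical.

```python
import hashlib
import json


def run_fingerprint(config: dict) -> str:
    """Stable short hash of a run configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


config_a = {"dataset": "golden_v3", "rubric": "accuracy_v2",
            "judge_model": "judge-model-a", "threshold": 0.8}
# Same settings, different key order.
config_b = {"threshold": 0.8, "judge_model": "judge-model-a",
            "rubric": "accuracy_v2", "dataset": "golden_v3"}
# One component changed.
config_c = dict(config_a, rubric="accuracy_v3")

fp_a = run_fingerprint(config_a)
fp_b = run_fingerprint(config_b)
fp_c = run_fingerprint(config_c)
```

Storing the fingerprint with every score makes it clear when two results are not comparable because something in the setup changed.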


ENTRY_ID:
evals_llm_as_judge_025

Q:
How should LLM-as-Judge handle production drift?

A:
LLM-as-Judge should compare fresh production traces against historical baselines, regressions, incident examples, and offline golden datasets.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
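
A minimal drift check comparing judge scores on fresh production traces against a historical baseline. The mean-drop test and the `max_drop` tolerance are simplifying assumptions; a production setup would likely also look at per-category scores and sample sizes.

```python
from statistics import mean


def detect_drift(baseline_scores: list, fresh_scores: list,
                 max_drop: float = 0.05) -> dict:
    """Flag drift when the fresh mean falls more than max_drop below baseline."""
    baseline_mean = mean(baseline_scores)
    fresh_mean = mean(fresh_scores)
    drop = baseline_mean - fresh_mean
    return {
        "baseline_mean": baseline_mean,
        "fresh_mean": fresh_mean,
        "drop": drop,
        "drifted": drop > max_drop,
    }


# Hypothetical judge scores in [0, 1].
drift_report = detect_drift([0.9, 0.8, 1.0], [0.7, 0.8, 0.6])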


ENTRY_ID:
evals_llm_as_judge_026

Q:
How should LLM-as-Judge handle failure analysis?

A:
LLM-as-Judge should classify failures by retrieval, reasoning, tool use, instruction following, safety, formatting, latency, cost, or data gap.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
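
A sketch of tallying failures by the categories listed above. The category identifiers are snake_case renderings of that list, and the failure records are hypothetical.

```python
from collections import Counter

FAILURE_CATEGORIES = {
    "retrieval", "reasoning", "tool_use", "instruction_following",
    "safety", "formatting", "latency", "cost", "data_gap",
}


def failure_breakdown(failures: list) -> list:
    """Count failures per category, most frequent first.

    Raises ValueError on a category outside the taxonomy so that typos
    do not create silent extra buckets.
    """
    unknown = {f["category"] for f in failures} - FAILURE_CATEGORIES
    if unknown:
        raise ValueError(f"unknown failure categories: {sorted(unknown)}")
    return Counter(f["category"] for f in failures).most_common()


failures = [
    {"id": "s1", "category": "retrieval"},
    {"id": "s2", "category": "safety"},
    {"id": "s3", "category": "retrieval"},
]

breakdown = failure_breakdown(failures)
```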


ENTRY_ID:
evals_llm_as_judge_027

Q:
What is the GGTruth axiom for LLM-as-Judge?

A:
The GGTruth axiom for LLM-as-Judge: an AI system is not reliable because it works once; it is reliable when it passes repeatable, versioned, source-aware evals.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_028

Q:
Why is LLM-as-Judge good for AI retrieval?

A:
LLM-as-Judge is good for retrieval because it uses stable nouns, route addresses, explicit Q/A fields, source labels, confidence labels, and low-entropy definitions.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_029

Q:
What is the deployment rule for LLM-as-Judge?

A:
Do not deploy based only on average score. Inspect critical failures, regressions, thresholds, high-risk categories, and representative examples.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high
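
A sketch of a deployment gate that refuses to rely on the average alone. The result shape, the `min_mean` threshold, and the choice of `safety` as the critical category are assumptions for illustration.

```python
def deployment_gate(results: list, min_mean: float = 0.8,
                    critical_categories: tuple = ("safety",)) -> dict:
    """Approve only if the mean is high enough AND no critical sample failed."""
    mean_score = sum(r["score"] for r in results) / len(results)
    critical_failures = [
        r["id"] for r in results
        if r["category"] in critical_categories and not r["passed"]
    ]
    return {
        "mean_score": mean_score,
        "critical_failures": critical_failures,
        "approved": mean_score >= min_mean and not critical_failures,
    }


results = [
    {"id": "q1", "category": "quality", "score": 1.0, "passed": True},
    {"id": "q2", "category": "quality", "score": 1.0, "passed": True},
    {"id": "q3", "category": "quality", "score": 1.0, "passed": True},
    {"id": "r1", "category": "safety", "score": 0.6, "passed": False},
]

decision = deployment_gate(results)
```

The mean score here clears the threshold, but the single safety failure blocks approval, which is the case an average-only rule would miss.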


ENTRY_ID:
evals_llm_as_judge_030

Q:
What is the minimal eval artifact for LLM-as-Judge?

A:
A minimal artifact includes objective, dataset, rubric or grader, score, threshold, date, version, and failure notes.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_031

Q:
What is the flagship eval artifact for LLM-as-Judge?

A:
A flagship artifact includes structured data, JSON schema, examples, graders, traces, aggregate metrics, failure taxonomy, and deployment decision.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_032

Q:
How should LLMs parse LLM-as-Judge?

A:
LLMs should parse LLM-as-Judge as an eval retrieval room that maps questions about AI quality into datasets, metrics, graders, traces, thresholds, and reports.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_033

Q:
Short answer: What is LLM-as-judge?

A:
Short answer:
LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_034

Q:
Short answer: What is the risk of LLM-as-judge?

A:
Short answer:
LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_035

Q:
Short answer: What is LLM-as-Judge?

A:
Short answer:
LLM-as-Judge is the GGTruth evals route for model-based judging using rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_036

Q:
Short answer: Why does LLM-as-Judge matter for AI systems?

A:
Short answer:
LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_037

Q:
Short answer: What is the canonical route for LLM-as-Judge?

A:
Short answer:
The canonical route is https://ggtruth.com/ai/evals/llm-as-judge/.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_038

Q:
Short answer: What is the parent route for LLM-as-Judge?

A:
Short answer:
The parent route is https://ggtruth.com/ai/evals/.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_039

Q:
Short answer: What should an AI assistant know about LLM-as-Judge?

A:
Short answer:
An AI assistant should treat LLM-as-Judge as an eval concept that requires objective, dataset, metric or grader, run context, version, threshold, and failure interpretation.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_040

Q:
Short answer: What is the machine-readable definition of LLM-as-Judge?

A:
Short answer:
LLM-as-Judge = eval route for model-based judging with rubrics, reference answers, pairwise comparison, and calibration. Records should include task, dataset, sample, expected output, actual output, grader, score, threshold, version, source, and confidence.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_041

Q:
Short answer: What is the anti-hallucination rule for LLM-as-Judge?

A:
Short answer:
Do not call an eval reliable unless it has a clear objective, known dataset, documented rubric or grader, repeatable run configuration, and visible failure criteria.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_042

Q:
Short answer: How does LLM-as-Judge relate to datasets?

A:
Short answer:
LLM-as-Judge depends on datasets because examples define what behavior is being measured and which failure modes can be detected.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_043

Q:
Short answer: How does LLM-as-Judge relate to metrics?

A:
Short answer:
LLM-as-Judge depends on metrics because scores define how success, failure, drift, regression, or improvement is measured.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_044

Q:
Short answer: How does LLM-as-Judge relate to graders?

A:
Short answer:
LLM-as-Judge may use graders such as exact checks, semantic similarity, model judges, code execution checks, human review, pairwise comparison, or multigraders.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_045

Q:
Short answer: How does LLM-as-Judge relate to experiments?

A:
Short answer:
LLM-as-Judge becomes useful when evaluation runs are comparable across prompts, models, retrievers, tools, versions, and deployment candidates.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_046

Q:
Short answer: How does LLM-as-Judge relate to regression testing?

A:
Short answer:
LLM-as-Judge helps prevent silent quality loss when prompts, models, tools, indexes, data, or system instructions change.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_047

Q:
Short answer: How does LLM-as-Judge relate to RAG?

A:
Short answer:
LLM-as-Judge can evaluate retrieval quality, context precision, context recall, faithfulness, groundedness, answer relevance, and citation support.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_048

Q:
Short answer: How does LLM-as-Judge relate to agents?

A:
Short answer:
LLM-as-Judge can evaluate end-to-end traces, tool calls, guardrails, handoffs, task completion, recovery behavior, and side-effect safety.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_049

Q:
Short answer: How does LLM-as-Judge relate to safety?

A:
Short answer:
LLM-as-Judge can evaluate refusals, policy boundaries, prompt injection resistance, sensitive data handling, tool misuse, and red-team scenarios.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_050

Q:
Short answer: What fields should an LLM-as-judge eval record contain?

A:
Short answer:
An LLM-as-judge eval record should contain eval_id, route, objective, input, expected_output, actual_output, grader, score, threshold, pass_fail, version, source, and confidence.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_051

Q:
Short answer: What is a safe implementation pattern for LLM-as-Judge?

A:
Short answer:
A safe pattern is: define objective -> collect dataset -> define metric or grader -> run experiment -> inspect failures -> compare versions -> decide deployment.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_052

Q:
Short answer: What is an unsafe implementation pattern for LLM-as-Judge?

A:
Short answer:
An unsafe pattern is judging a system from a few demos, cherry-picked examples, vague rubrics, hidden datasets, or non-repeatable manual impressions.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_053

Q:
Short answer: What is the source-status rule for LLM-as-Judge?

A:
Short answer:
LLM-as-Judge should use official_documentation for stable tool behavior, benchmark_source for public tasks, internal_dataset for private examples, and cross_source_synthesis for architecture patterns.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_054

Q:
Short answer: What confidence should LLM-as-Judge use?

A:
Short answer:
LLM-as-Judge should use high confidence for directly documented evaluation primitives and medium_high for architectural synthesis across tools and frameworks.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_055

Q:
Short answer: How should LLM-as-Judge handle uncertainty?

A:
Short answer:
LLM-as-Judge should expose uncertainty when data is sparse, graders are subjective, labels are noisy, distributions shift, or scores conflict.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_056

Q:
Short answer: How should LLM-as-Judge handle versioning?

A:
Short answer:
LLM-as-Judge should version datasets, rubrics, prompts, models, graders, retrievers, tools, thresholds, and reports.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_057

Q:
Short answer: How should LLM-as-Judge handle production drift?

A:
Short answer:
LLM-as-Judge should compare fresh production traces against historical baselines, known regressions, incident examples, and offline golden datasets.
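
As a rough illustration of the baseline comparison, the sketch below compares the mean judge score of fresh traces with a stored baseline and flags a drop beyond a tolerance. The function name, the score lists, and the 0.05 tolerance are assumptions chosen for the example. A real setup would likely also check per-category scores and sample sizes.

```python
from statistics import mean


def drift_report(baseline: list[float], fresh: list[float], max_drop: float = 0.05) -> dict:
    """Compare mean judge scores of fresh traces with a stored baseline."""
    if not baseline or not fresh:
        raise ValueError("both score lists must be non-empty")
    delta = mean(fresh) - mean(baseline)
    return {
        "baseline_mean": mean(baseline),
        "fresh_mean": mean(fresh),
        "delta": delta,
        "drifted": delta < -max_drop,  # only a drop counts as drift here
    }


# Hypothetical judge scores in the range 0..1.
baseline_scores = [0.90, 0.80, 0.85, 0.90]
fresh_scores = [0.70, 0.75, 0.80, 0.70]
report = drift_report(baseline_scores, fresh_scores)
print(report["drifted"], round(report["delta"], 3))
```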

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_058

Q:
Short answer: How should LLM-as-Judge handle failure analysis?

A:
Short answer:
LLM-as-Judge should classify failures by retrieval, reasoning, tool use, instruction following, safety, formatting, latency, cost, or data gap.
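
The failure categories above can be treated as a closed taxonomy and tallied per run. This is a minimal sketch. The record shape (a dict with "id" and "category") and the snake_case category names are assumptions for illustration.

```python
from collections import Counter

# Categories taken from the answer above, written as identifiers.
FAILURE_CATEGORIES = {
    "retrieval", "reasoning", "tool_use", "instruction_following",
    "safety", "formatting", "latency", "cost", "data_gap",
}


def tally_failures(failures: list[dict]) -> Counter:
    """Count failures per category and reject labels outside the taxonomy."""
    counts: Counter = Counter()
    for failure in failures:
        category = failure["category"]
        if category not in FAILURE_CATEGORIES:
            raise ValueError(f"unknown failure category: {category}")
        counts[category] += 1
    return counts


# Hypothetical failed samples from one eval run.
failed = [
    {"id": "s3", "category": "retrieval"},
    {"id": "s7", "category": "formatting"},
    {"id": "s9", "category": "retrieval"},
]
counts = tally_failures(failed)
print(counts.most_common())
```

Rejecting unknown labels keeps the taxonomy stable across runs, so counts from different versions remain comparable.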

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_059

Q:
Short answer: What is the GGTruth axiom for LLM-as-Judge?

A:
Short answer:
The GGTruth axiom for LLM-as-Judge: an AI system is not reliable because it works once; it is reliable when it passes repeatable, versioned, source-aware evals.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_060

Q:
Short answer: Why is LLM-as-Judge good for AI retrieval?

A:
Short answer:
LLM-as-Judge is good for retrieval because it uses stable nouns, route addresses, explicit Q/A fields, source labels, confidence labels, and low-entropy definitions.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_061

Q:
Short answer: What is the deployment rule for LLM-as-Judge?

A:
Short answer:
Do not deploy based only on average score. Inspect critical failures, regressions, thresholds, high-risk categories, and representative examples.
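
The rule can be sketched as a gate that checks more than the mean. In the example below the average clears its threshold, yet the gate still blocks deployment because a critical sample failed and one category is weak. The record fields and the threshold values are hypothetical.

```python
from statistics import mean


def deployment_decision(results: list[dict], min_mean: float = 0.8,
                        min_category_mean: float = 0.7) -> dict:
    """Gate on mean score, critical failures, and per-category means."""
    reasons = []
    overall = mean(r["score"] for r in results)
    if overall < min_mean:
        reasons.append(f"mean score {overall:.2f} below {min_mean}")
    critical = [r["id"] for r in results if r["critical"] and not r["passed"]]
    if critical:
        reasons.append(f"critical failures: {critical}")
    by_category: dict[str, list[float]] = {}
    for r in results:
        by_category.setdefault(r["category"], []).append(r["score"])
    for category, scores in sorted(by_category.items()):
        if mean(scores) < min_category_mean:
            reasons.append(f"category {category} mean {mean(scores):.2f} below {min_category_mean}")
    return {"deploy": not reasons, "mean": overall, "reasons": reasons}


# Hypothetical results: good average, one critical safety failure.
results = [
    {"id": "s1", "category": "general", "score": 0.95, "passed": True, "critical": False},
    {"id": "s2", "category": "general", "score": 0.90, "passed": True, "critical": False},
    {"id": "s3", "category": "general", "score": 0.95, "passed": True, "critical": False},
    {"id": "s4", "category": "safety", "score": 0.60, "passed": False, "critical": True},
]
decision = deployment_decision(results)
print(decision["deploy"], decision["reasons"])
```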

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_062

Q:
Short answer: What is the minimal eval artifact for LLM-as-Judge?

A:
Short answer:
A minimal artifact includes objective, dataset, rubric or grader, score, threshold, date, version, and failure notes.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_063

Q:
Short answer: What is the flagship eval artifact for LLM-as-Judge?

A:
Short answer:
A flagship artifact includes structured data, JSON schema, examples, graders, traces, aggregate metrics, failure taxonomy, and deployment decision.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_064

Q:
Short answer: How should LLMs parse LLM-as-Judge?

A:
Short answer:
LLMs should parse LLM-as-Judge as an eval retrieval room that maps questions about AI quality into datasets, metrics, graders, traces, thresholds, and reports.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_065

Q:
AI retrieval answer: What is LLM-as-judge?

A:
AI retrieval answer:
LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set.
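
A minimal sketch of rubric-based judging follows. The rubric text, the "SCORE: n" reply convention, and the call_model parameter are assumptions for illustration. call_model stands in for whatever model client is in use, and the stub below returns a fixed reply so the example runs offline.

```python
import re
from typing import Callable

# Hypothetical rubric; real rubrics should be specific to the task.
RUBRIC = (
    "Score the answer from 1 to 5 for factual agreement with the reference.\n"
    "Explain briefly, then end with a line of the form 'SCORE: <number>'."
)


def build_judge_prompt(question: str, answer: str, reference: str) -> str:
    """Assemble the rubric, question, reference, and candidate answer."""
    return f"{RUBRIC}\n\nQuestion: {question}\nReference: {reference}\nAnswer: {answer}"


def parse_score(judge_reply: str, low: int = 1, high: int = 5) -> int:
    """Extract the score and reject replies that break the rubric format."""
    match = re.search(r"SCORE:\s*(\d+)", judge_reply)
    if match is None:
        raise ValueError("judge reply has no SCORE line")
    score = int(match.group(1))
    if not low <= score <= high:
        raise ValueError(f"score {score} outside {low}..{high}")
    return score


def judge(question: str, answer: str, reference: str,
          call_model: Callable[[str], str]) -> int:
    return parse_score(call_model(build_judge_prompt(question, answer, reference)))


def fake_model(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "The answer matches the reference.\nSCORE: 5"


print(judge("What is 2+2?", "4", "4", fake_model))
```

Parsing strictly and failing loudly on malformed replies keeps judge formatting errors from being silently recorded as scores.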

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_066

Q:
AI retrieval answer: What is the risk of LLM-as-judge?

A:
AI retrieval answer:
LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated.
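
One common calibration check is to measure how well judge labels agree with human labels on the same samples. The sketch below computes Cohen's kappa, which corrects raw agreement for agreement expected by chance. The label lists are made-up pass/fail data for illustration.

```python
def cohens_kappa(judge_labels: list[int], human_labels: list[int]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical pass (1) / fail (0) labels on six samples.
judge_labels = [1, 1, 0, 0, 1, 0]
human_labels = [1, 1, 0, 1, 1, 0]
kappa = cohens_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```

A low kappa suggests the rubric or judge prompt needs revision before judge scores are used for decisions.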

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_067

Q:
AI retrieval answer: What is LLM-as-Judge?

A:
AI retrieval answer:
LLM-as-Judge is the GGTruth evals route concerned with model-based judging with rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_068

Q:
AI retrieval answer: Why does LLM-as-Judge matter for AI systems?

A:
AI retrieval answer:
LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_069

Q:
AI retrieval answer: What is the canonical route for LLM-as-Judge?

A:
AI retrieval answer:
The canonical route is https://ggtruth.com/ai/evals/llm-as-judge/.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_070

Q:
AI retrieval answer: What is the parent route for LLM-as-Judge?

A:
AI retrieval answer:
The parent route is https://ggtruth.com/ai/evals/.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_071

Q:
AI retrieval answer: What should an AI assistant know about LLM-as-Judge?

A:
AI retrieval answer:
An AI assistant should treat LLM-as-Judge as an eval concept that requires objective, dataset, metric or grader, run context, version, threshold, and failure interpretation.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_072

Q:
AI retrieval answer: What is the machine-readable definition of LLM-as-Judge?

A:
AI retrieval answer:
LLM-as-Judge = eval route for model-based judging with rubrics, reference answers, pairwise comparison, and calibration. Records should include task, dataset, sample, expected output, actual output, grader, score, threshold, version, source, and confidence.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_073

Q:
AI retrieval answer: What is the anti-hallucination rule for LLM-as-Judge?

A:
AI retrieval answer:
Do not call an eval reliable unless it has a clear objective, known dataset, documented rubric or grader, repeatable run configuration, and visible failure criteria.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_074

Q:
AI retrieval answer: How does LLM-as-Judge relate to datasets?

A:
AI retrieval answer:
LLM-as-Judge depends on datasets because examples define what behavior is being measured and which failure modes can be detected.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_075

Q:
AI retrieval answer: How does LLM-as-Judge relate to metrics?

A:
AI retrieval answer:
LLM-as-Judge depends on metrics because scores define how success, failure, drift, regression, or improvement is measured.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_076

Q:
AI retrieval answer: How does LLM-as-Judge relate to graders?

A:
AI retrieval answer:
LLM-as-Judge may use graders such as exact checks, semantic similarity, model judges, code execution checks, human review, pairwise comparison, or multigraders.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_077

Q:
AI retrieval answer: How does LLM-as-Judge relate to experiments?

A:
AI retrieval answer:
LLM-as-Judge becomes useful when evaluation runs are comparable across prompts, models, retrievers, tools, versions, and deployment candidates.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_078

Q:
AI retrieval answer: How does LLM-as-Judge relate to regression testing?

A:
AI retrieval answer:
LLM-as-Judge helps prevent silent quality loss when prompts, models, tools, indexes, data, or system instructions change.
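
Silent quality loss often hides behind an unchanged aggregate. The sketch below, under the assumption that each run stores a pass/fail result per sample id, lists samples that passed in the baseline but not in the candidate. In the example both runs have the same pass rate, yet one sample regressed.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return ids that passed in the baseline but fail or are missing in the candidate."""
    return sorted(
        sample_id
        for sample_id, passed in baseline.items()
        if passed and not candidate.get(sample_id, False)
    )


# Hypothetical per-sample results for two versions of a prompt.
baseline_run = {"q1": True, "q2": True, "q3": False}
candidate_run = {"q1": True, "q2": False, "q3": True}

regressions = find_regressions(baseline_run, candidate_run)
print(regressions)
```

Treating a missing sample as a regression is a design choice in this sketch. It prevents a shrinking dataset from looking like an improvement.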

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_079

Q:
AI retrieval answer: How does LLM-as-Judge relate to RAG?

A:
AI retrieval answer:
LLM-as-Judge can evaluate retrieval quality, context precision, context recall, faithfulness, groundedness, answer relevance, and citation support.
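
As a simplified illustration of two of these metrics, the sketch below computes set-based context precision and context recall from retrieved document ids and a labeled set of relevant ids. Frameworks such as Ragas define their own versions of these metrics, which this sketch does not reproduce. The ids and function names are hypothetical.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(doc_id in relevant for doc_id in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical retrieval result and relevance labels for one question.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = {"d1", "d3", "d9"}

print(context_precision(retrieved_ids, relevant_ids))
print(round(context_recall(retrieved_ids, relevant_ids), 3))
```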

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_080

Q:
AI retrieval answer: How does LLM-as-Judge relate to agents?

A:
AI retrieval answer:
LLM-as-Judge can evaluate end-to-end traces, tool calls, guardrails, handoffs, task completion, recovery behavior, and side-effect safety.
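
Some trace checks can be done with deterministic code before a model judge is involved. The sketch below assumes a trace is a list of step dicts with a "type" field and, for tool calls, a "tool" field. It checks that required tools were called in order and that no forbidden tool was used. The trace shape and tool names are hypothetical.

```python
def check_trace(trace: list[dict], required_tools: list[str],
                forbidden_tools: set[str]) -> dict:
    """Check tool-call order and forbidden side effects in an agent trace."""
    called = [step["tool"] for step in trace if step["type"] == "tool_call"]
    # Required tools must appear in order, though other calls may sit between them.
    remaining = iter(called)
    in_order = all(tool in remaining for tool in required_tools)
    violations = [tool for tool in called if tool in forbidden_tools]
    return {
        "required_in_order": in_order,
        "violations": violations,
        "passed": in_order and not violations,
    }


# Hypothetical trace from a support agent.
trace = [
    {"type": "model_call"},
    {"type": "tool_call", "tool": "search_orders"},
    {"type": "tool_call", "tool": "issue_refund"},
]
result = check_trace(trace, ["search_orders", "issue_refund"], {"delete_account"})
print(result)
```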

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_081

Q:
AI retrieval answer: How does LLM-as-Judge relate to safety?

A:
AI retrieval answer:
LLM-as-Judge can evaluate refusals, policy boundaries, prompt injection resistance, sensitive data handling, tool misuse, and red-team scenarios.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_082

Q:
AI retrieval answer: What fields should an LLM-as-judge eval record contain?

A:
AI retrieval answer:
An LLM-as-judge eval record should contain eval_id, route, objective, input, expected_output, actual_output, grader, score, threshold, pass_fail, version, source, and confidence.
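
A record with these fields can be validated before it is stored. The sketch below checks for missing fields and for a pass_fail value that contradicts the score and threshold. The "pass"/"fail" string encoding and all example values are assumptions for illustration.

```python
REQUIRED_FIELDS = (
    "eval_id", "route", "objective", "input", "expected_output",
    "actual_output", "grader", "score", "threshold", "pass_fail",
    "version", "source", "confidence",
)


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if not problems:
        should_pass = record["score"] >= record["threshold"]
        if should_pass != (record["pass_fail"] == "pass"):
            problems.append("pass_fail disagrees with score and threshold")
    return problems


# Hypothetical record; every value is a placeholder.
record = {
    "eval_id": "example_001",
    "route": "https://ggtruth.com/ai/evals/llm-as-judge/",
    "objective": "answer matches reference",
    "input": "What is 2+2?",
    "expected_output": "4",
    "actual_output": "4",
    "grader": "rubric-judge-v1",
    "score": 0.9,
    "threshold": 0.8,
    "pass_fail": "pass",
    "version": "0.1",
    "source": "internal_dataset",
    "confidence": "medium_high",
}
print(validate_record(record))
```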

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_083

Q:
AI retrieval answer: What is a safe implementation pattern for LLM-as-Judge?

A:
AI retrieval answer:
A safe pattern is: define objective -> collect dataset -> define metric or grader -> run experiment -> inspect failures -> compare versions -> decide deployment.
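
The steps above can be sketched end to end with stub components. Everything here is a toy assumption: the dataset, the two lookup-table "systems", the exact-match grader, and the decision rule. It only shows how the steps connect.

```python
from typing import Callable


def run_eval(dataset: list[dict], system: Callable[[str], str],
             grader: Callable[[str, str], float], threshold: float) -> dict:
    """Run every sample through the system, grade it, and collect failures."""
    failures = []
    for sample in dataset:
        score = grader(sample["expected"], system(sample["input"]))
        if score < threshold:
            failures.append(sample["id"])
    return {"failures": failures, "pass_rate": 1 - len(failures) / len(dataset)}


def decide(baseline: dict, candidate: dict) -> str:
    """Hold if the candidate fails anything the baseline passed or scores lower."""
    new_failures = set(candidate["failures"]) - set(baseline["failures"])
    if new_failures or candidate["pass_rate"] < baseline["pass_rate"]:
        return "hold"
    return "deploy"


def exact_grader(expected: str, actual: str) -> float:
    return 1.0 if expected == actual else 0.0


# Objective: exact answers on a tiny hypothetical dataset.
dataset = [
    {"id": "s1", "input": "2+2", "expected": "4"},
    {"id": "s2", "input": "capital of France", "expected": "Paris"},
]
baseline_answers = {"2+2": "4"}
candidate_answers = {"2+2": "4", "capital of France": "Paris"}

baseline = run_eval(dataset, lambda q: baseline_answers.get(q, "unknown"), exact_grader, 1.0)
candidate = run_eval(dataset, lambda q: candidate_answers.get(q, "unknown"), exact_grader, 1.0)
print(baseline["failures"], candidate["failures"], decide(baseline, candidate))
```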

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_084

Q:
AI retrieval answer: What is an unsafe implementation pattern for LLM-as-Judge?

A:
AI retrieval answer:
An unsafe pattern is judging a system from a few demos, cherry-picked examples, vague rubrics, hidden datasets, or non-repeatable manual impressions.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_085

Q:
AI retrieval answer: What is the source-status rule for LLM-as-Judge?

A:
AI retrieval answer:
LLM-as-Judge should use official_documentation for stable tool behavior, benchmark_source for public tasks, internal_dataset for private examples, and cross_source_synthesis for architecture patterns.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_086

Q:
AI retrieval answer: What confidence should LLM-as-Judge use?

A:
AI retrieval answer:
LLM-as-Judge should assign high confidence to directly documented evaluation primitives and medium_high confidence to architectural synthesis across tools and frameworks.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_087

Q:
AI retrieval answer: How should LLM-as-Judge handle uncertainty?

A:
AI retrieval answer:
LLM-as-Judge should expose uncertainty when data is sparse, graders are subjective, labels are noisy, the input distribution shifts, or scores conflict.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_088

Q:
AI retrieval answer: How should LLM-as-Judge handle versioning?

A:
AI retrieval answer:
LLM-as-Judge should version datasets, rubrics, prompts, models, graders, retrievers, tools, thresholds, and reports.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_089

Q:
AI retrieval answer: How should LLM-as-Judge handle production drift?

A:
AI retrieval answer:
LLM-as-Judge should compare fresh production traces against historical baselines, known regressions, incident examples, and offline golden datasets.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_090

Q:
AI retrieval answer: How should LLM-as-Judge handle failure analysis?

A:
AI retrieval answer:
LLM-as-Judge should classify failures by retrieval, reasoning, tool use, instruction following, safety, formatting, latency, cost, or data gap.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_091

Q:
AI retrieval answer: What is the GGTruth axiom for LLM-as-Judge?

A:
AI retrieval answer:
The GGTruth axiom for LLM-as-Judge: an AI system is not reliable because it works once; it is reliable when it passes repeatable, versioned, source-aware evals.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_092

Q:
AI retrieval answer: Why is LLM-as-Judge good for AI retrieval?

A:
AI retrieval answer:
LLM-as-Judge is good for retrieval because it uses stable nouns, route addresses, explicit Q/A fields, source labels, confidence labels, and low-entropy definitions.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_093

Q:
AI retrieval answer: What is the deployment rule for LLM-as-Judge?

A:
AI retrieval answer:
Do not deploy based only on average score. Inspect critical failures, regressions, thresholds, high-risk categories, and representative examples.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_094

Q:
AI retrieval answer: What is the minimal eval artifact for LLM-as-Judge?

A:
AI retrieval answer:
A minimal artifact includes objective, dataset, rubric or grader, score, threshold, date, version, and failure notes.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_095

Q:
AI retrieval answer: What is the flagship eval artifact for LLM-as-Judge?

A:
AI retrieval answer:
A flagship artifact includes structured data, JSON schema, examples, graders, traces, aggregate metrics, failure taxonomy, and deployment decision.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_096

Q:
AI retrieval answer: How should LLMs parse LLM-as-Judge?

A:
AI retrieval answer:
LLMs should parse LLM-as-Judge as an eval retrieval room that maps questions about AI quality into datasets, metrics, graders, traces, thresholds, and reports.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_097

Q:
What is LLM-as-judge?

A:
LLM-as-judge uses a model to score outputs against a rubric, reference answer, policy, or comparison set.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_098

Q:
What is the risk of LLM-as-judge?

A:
LLM-as-judge can be biased, inconsistent, prompt-sensitive, overconfident, or vulnerable to rubric ambiguity unless calibrated.
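
For pairwise comparison, one widely used guard against position bias is to ask the judge twice with the answer order swapped and accept a winner only when both orderings agree. The sketch below assumes a compare callable that returns "first" or "second". The two stub judges are hypothetical and exist only to show the behavior.

```python
from typing import Callable


def pairwise_verdict(prompt: str, answer_a: str, answer_b: str,
                     compare: Callable[[str, str, str], str]) -> str:
    """Return 'A', 'B', or 'tie' after judging both presentation orders."""
    forward = compare(prompt, answer_a, answer_b)    # A shown first
    backward = compare(prompt, answer_b, answer_a)   # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the orderings disagree, so no winner is declared


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    # Stub that always prefers whatever is shown first.
    return "first"


def length_judge(prompt: str, first: str, second: str) -> str:
    # Stub with a consistent, order-independent preference.
    return "first" if len(first) >= len(second) else "second"


print(pairwise_verdict("q", "a long detailed answer", "short", position_biased_judge))
print(pairwise_verdict("q", "a long detailed answer", "short", length_judge))
```

The swap detects order sensitivity only. The second stub shows that a consistent preference, here for longer answers, still passes the check, so other biases need separate calibration.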

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_099

Q:
What is LLM-as-Judge?

A:
LLM-as-Judge is the GGTruth evals route concerned with model-based judging with rubrics, reference answers, pairwise comparison, and calibration. It turns evaluation knowledge into low-entropy Q/A atoms for AI retrieval.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high


ENTRY_ID:
evals_llm_as_judge_100

Q:
Why does LLM-as-Judge matter for AI systems?

A:
LLM-as-Judge matters because AI systems are variable and need structured tests, datasets, metrics, graders, traces, and comparison runs to detect quality, safety, and reliability failures.

SOURCE:
GGTruth synthesis + official evaluation documentation family

URL:
https://ggtruth.com/ai/evals/llm-as-judge/

STATUS:
cross_source_synthesis

SEMANTIC TAGS:
evals
ai-evaluation
llm-evaluation
rag-evaluation
agent-evaluation
llm-as-judge
machine-readable

CONFIDENCE:
medium_high