Summary
This paper introduces a diagnostic toolkit for assessing the per-instance reliability of LLM-as-judge frameworks, which are widely used for automatic natural language generation (NLG) evaluation. Applying the toolkit to SummEval, the authors found widespread per-input inconsistencies and transitivity violations: 33-67% of documents exhibited at least one directed 3-cycle in the judge's pairwise preferences, even though aggregate violation rates were low. The takeaway is that an LLM judge can be unreliable on individual evaluations even when its overall metrics look acceptable.
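To make the transitivity check concrete: a directed 3-cycle occurs when a judge prefers A over B, B over C, and C over A for the same input document. The sketch below is illustrative only and does not use the paper's toolkit; the function and data names are hypothetical.

```python
from itertools import permutations

def has_directed_3cycle(prefs):
    """Return True if the pairwise-preference relation contains a directed
    3-cycle, i.e. systems a, b, c with a>b, b>c, and c>a.

    `prefs` maps an ordered pair (x, y) to True when the judge preferred
    x's output over y's output for a given input document.
    """
    systems = {s for pair in prefs for s in pair}
    for a, b, c in permutations(systems, 3):
        if prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a)):
            return True
    return False

# Hypothetical judgments for one document: an intransitive triple.
cyclic = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
# A consistent (transitive) set of judgments for comparison.
acyclic = {("A", "B"): True, ("B", "C"): True, ("A", "C"): True}
```

Running this per document, as opposed to only on aggregate win rates, is what surfaces the per-instance violations the paper reports.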
Editorial note
AI Dose summarizes public reporting and links to original sources when they are available. Review the Editorial Policy, Disclaimer, or Contact page if you need to flag a correction or understand how this site handles sources.
Related Articles
- [Paper] Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning
March 30, 2026
- [Paper] MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
March 25, 2026
- [Paper] MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
April 17, 2026
- [Paper] From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
April 16, 2026