This paper introduces a diagnostic toolkit for assessing the per-instance reliability of LLM-as-judge frameworks, which are widely used for automatic Natural Language Generation (NLG) evaluation. Applying the toolkit to SummEval, the authors find widespread per-input inconsistencies and transitivity violations: 33-67% of documents exhibit at least one directed 3-cycle, even though aggregate violation rates are low. This underscores that LLM judges can be unreliable on individual evaluations even when overall metrics appear acceptable.
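The paper's exact cycle-counting procedure is not reproduced here, but a directed 3-cycle is straightforward to illustrate: if, for the same document, the judge prefers summary A over B and B over C yet also prefers C over A, the three preferences cannot be ordered consistently. The following is a minimal sketch, assuming (hypothetically) that per-document pairwise judge preferences are stored as directed (winner, loser) edges, of how such a check might look.

```python
from itertools import permutations

def has_directed_3cycle(edges: set[tuple[str, str]]) -> bool:
    """Return True if any ordered triple (a, b, c) forms a directed cycle
    a -> b -> c -> a among the pairwise-preference edges (winner, loser)."""
    nodes = {n for edge in edges for n in edge}
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edges and (b, c) in edges and (c, a) in edges:
            return True
    return False

# Toy example: the judge prefers A over B and B over C, but also C over A,
# so the preferences are intransitive for this document.
judgments = {("A", "B"), ("B", "C"), ("C", "A")}
print(has_directed_3cycle(judgments))  # True
```

Counting the fraction of documents for which this check returns True yields a per-instance statistic of the kind the paper reports, as opposed to an aggregate violation rate pooled over all comparisons.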