- SUT ID: STAC250701
- STAC-AI
LLM-Based RAG Evaluation Metrics: Model Relatedness and Consistency
Type: Research Note
Specs: STAC-AI™ LANG6
This study investigates how the choice of LLM evaluator impacts scoring-based metrics such as Faithfulness, using generated responses from the STAC-AI™ LANG6 (Inference-Only) benchmark. As scoring-based evaluation becomes a core component of RAG system development, understanding the stability and trustworthiness of LLM evaluators is critical.
We analyze how the relatedness of generator and evaluator models (all from OpenAI) influences scoring outcomes in complex, rubric-based evaluation tasks. Key findings show that while generator-evaluator relatedness can amplify scoring bias, inter-generational (e.g., GPT-4.1 vs GPT-4o) and intra-generational (e.g., GPT-4.1 vs GPT-4.1-mini) differences among related evaluators better explain score variability. We also examine the relationship between scoring accuracy and variance across repeated trials, revealing cases where models fundamentally misinterpret evaluation rubrics.
The report introduces a new framework for classifying scoring variation as tolerable (arising from task ambiguity) or non-tolerable (arising from model miscomprehension), and provides best practices for minimizing evaluator bias and inconsistency during metric development.
These findings offer important guidance for anyone using LLMs to evaluate generative model quality.
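For illustration, the sketch below shows one way the tolerable vs. non-tolerable distinction could be operationalized over repeated evaluator trials. The function name, thresholds, and 0-10 scoring scale are assumptions made for this example, not values taken from the report.

```python
# Hypothetical sketch: classifying score variation across repeated evaluator trials.
# Thresholds and the scoring scale are illustrative assumptions, not report values.
from statistics import mean, pstdev

def classify_variation(trial_scores, reference_score,
                       spread_tol=1.0, error_tol=1.0):
    """Label one test case's repeated-trial evaluator scores.

    trial_scores    : scores the LLM evaluator gave across repeated runs of the same prompt.
    reference_score : human / rubric-anchored score for the same response.
    spread_tol      : max standard deviation treated as ordinary run-to-run noise.
    error_tol       : max |mean - reference| treated as acceptable accuracy.
    """
    spread = pstdev(trial_scores)
    error = abs(mean(trial_scores) - reference_score)

    if error <= error_tol:
        # Scores center on the reference; remaining spread looks like task ambiguity.
        return "tolerable"
    if spread <= spread_tol:
        # Low variance but consistently off-target: evaluator likely misreads the rubric.
        return "non-tolerable"
    # High variance and off-target: flag for manual rubric review.
    return "non-tolerable (high variance)"

# Example: five repeated Faithfulness scores on a 0-10 scale for one response.
print(classify_variation([4, 4, 5, 4, 4], reference_score=8))  # -> "non-tolerable"
print(classify_variation([7, 8, 9, 8, 8], reference_score=8))  # -> "tolerable"
```

Under these assumptions, low variance with low accuracy is the signature of rubric miscomprehension, while variance around an accurate mean is treated as tolerable noise.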