LLM-Based RAG Evaluation Metrics: Model Relatedness and Consistency

Type: Research Note

Specs: STAC-AI™ LANG6

This study investigates how the choice of LLM evaluator impacts scoring-based metrics such as Faithfulness, using generated responses from the STAC-AI™ LANG6 (Inference-Only) benchmark. As scoring-based evaluation becomes a core component of RAG system development, understanding the stability and trustworthiness of LLM evaluators is critical.
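
To make concrete what such a scoring-based metric involves, the sketch below shows a minimal LLM-as-judge Faithfulness scorer: an evaluator model is given the retrieved context and the generated answer and returns a score indicating how well the answer is supported by the context. The prompt wording, 0-1 scale, and model name are illustrative assumptions rather than the LANG6 rubric, and the call assumes the OpenAI Python SDK (v1+).

```python
# Minimal sketch of an LLM-as-judge Faithfulness scorer.
# Assumptions: openai Python SDK v1+, an illustrative 0-1 scale, and a
# simplified prompt -- this is NOT the STAC-AI LANG6 rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading the faithfulness of an answer.
Context:
{context}

Answer:
{answer}

Return only a number between 0 and 1, where 1 means every claim in the
answer is supported by the context and 0 means none are."""


def faithfulness_score(context: str, answer: str,
                       evaluator_model: str = "gpt-4o") -> float:
    """Ask an evaluator model for a faithfulness score of `answer` given `context`."""
    resp = client.chat.completions.create(
        model=evaluator_model,
        temperature=0,  # reduce run-to-run variation; scores may still vary
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
    )
    return float(resp.choices[0].message.content.strip())
```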

We analyze how relatedness between the generator and evaluator models (both from OpenAI) influences scoring outcomes in complex, rubric-based evaluation tasks. Key findings highlight that while generator-evaluator relatedness may introduce scoring bias, inter-generational (e.g., GPT-4.1 vs GPT-4o) and intra-generational (e.g., GPT-4.1 vs GPT-4.1-mini) differences among related evaluators better explain score variability. We also examine the relationship between scoring accuracy and variance across repeated trials, revealing cases where models fundamentally misinterpret evaluation rubrics.
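
To illustrate the repeated-trial analysis, the sketch below collects scores from several evaluator models over multiple trials and summarizes the mean and spread per evaluator. The evaluator list, trial count, and the `faithfulness_score` helper from the previous sketch are illustrative assumptions, not the study's actual protocol.

```python
# Sketch: repeat the same grading task across evaluators and trials, then
# compare per-evaluator means and variability. Model names, trial count,
# and the faithfulness_score helper are illustrative assumptions.
from statistics import mean, pstdev

EVALUATORS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4o"]  # related OpenAI evaluators
N_TRIALS = 10


def score_distribution(context: str, answer: str) -> dict[str, tuple[float, float]]:
    """Return {evaluator: (mean score, std dev)} over repeated trials."""
    results = {}
    for model in EVALUATORS:
        scores = [faithfulness_score(context, answer, evaluator_model=model)
                  for _ in range(N_TRIALS)]
        results[model] = (mean(scores), pstdev(scores))
    return results
```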

The report introduces a new framework to classify scoring variation as tolerable (task ambiguity) or non-tolerable (model miscomprehension), and provides best practices for minimizing evaluator bias and inconsistency during metric development.
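
The framework itself is defined in the report; purely as a rough illustration, the sketch below shows one way such a classification might be operationalized by checking whether repeated scores cluster around a human reference score. The threshold and decision rule are hypothetical assumptions, not the report's definitions.

```python
# Hypothetical illustration (not the report's framework): label variation
# across repeated scores of the same item as tolerable or non-tolerable.
from statistics import mean


def classify_variation(scores: list[float], reference: float,
                       error_tol: float = 0.2) -> str:
    """Classify score variation relative to a human reference score."""
    if abs(mean(scores) - reference) > error_tol:
        # Scores center away from the reference: the evaluator likely
        # misreads the rubric -> non-tolerable (model miscomprehension).
        return "non-tolerable"
    # Scores scatter around the reference: variation plausibly reflects
    # genuine task ambiguity -> tolerable.
    return "tolerable"
```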

These findings offer important guidance for anyone using LLMs to evaluate generative model quality.


The STAC-AI Working Group focuses on benchmarking artificial intelligence (AI) technologies in finance. This includes deep learning, large language models (LLMs), and other AI-driven approaches that help firms unlock new efficiencies and insights.