LLM-Based RAG Evaluation Metrics: Model Relatedness and Consistency

Type: Research Note

Specs: STAC-AI™ LANG6

This study investigates how the choice of LLM evaluator impacts scoring-based metrics such as Faithfulness, using generated responses from the STAC-AI™ LANG6 (Inference-Only) benchmark. As scoring-based evaluation becomes a core component of RAG system development, understanding the stability and trustworthiness of LLM evaluators is critical.
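
To make concrete what such a scoring-based metric involves, the sketch below shows a minimal LLM-as-judge Faithfulness scorer: an evaluator model is given the retrieved context and the generated answer and returns a score indicating how well the answer is supported by the context. The prompt wording, 0-1 scale, and model name are illustrative assumptions rather than the LANG6 rubric, and the call assumes the OpenAI Python SDK (v1+).

```python
# Minimal sketch of an LLM-as-judge Faithfulness scorer.
# Assumptions: openai Python SDK v1+, an illustrative 0-1 scale, and a
# simplified prompt -- this is NOT the STAC-AI LANG6 rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading the faithfulness of an answer.
Context:
{context}

Answer:
{answer}

Return only a number between 0 and 1, where 1 means every claim in the
answer is supported by the context and 0 means none are."""


def faithfulness_score(context: str, answer: str,
                       evaluator_model: str = "gpt-4o") -> float:
    """Ask an evaluator model for a faithfulness score of `answer` given `context`."""
    resp = client.chat.completions.create(
        model=evaluator_model,
        temperature=0,  # reduce run-to-run variation; scores may still vary
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
    )
    return float(resp.choices[0].message.content.strip())
```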

We analyze how relatedness between the generator and evaluator models (both from OpenAI) influences scoring outcomes in complex, rubric-based evaluation tasks. Key findings highlight that while generator-evaluator relatedness may introduce scoring bias, inter-generational (e.g., GPT-4.1 vs GPT-4o) and intra-generational (e.g., GPT-4.1 vs GPT-4.1-mini) differences among related evaluators better explain score variability. We also examine the relationship between scoring accuracy and variance across repeated trials, revealing cases where models fundamentally misinterpret evaluation rubrics.
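
To illustrate the repeated-trial analysis, the sketch below collects scores from several evaluator models over multiple trials and summarizes the mean and spread per evaluator. The evaluator list, trial count, and the `faithfulness_score` helper from the previous sketch are illustrative assumptions, not the study's actual protocol.

```python
# Sketch: repeat the same grading task across evaluators and trials, then
# compare per-evaluator means and variability. Model names, trial count,
# and the faithfulness_score helper are illustrative assumptions.
from statistics import mean, pstdev

EVALUATORS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4o"]  # related OpenAI evaluators
N_TRIALS = 10


def score_distribution(context: str, answer: str) -> dict[str, tuple[float, float]]:
    """Return {evaluator: (mean score, std dev)} over repeated trials."""
    results = {}
    for model in EVALUATORS:
        scores = [faithfulness_score(context, answer, evaluator_model=model)
                  for _ in range(N_TRIALS)]
        results[model] = (mean(scores), pstdev(scores))
    return results
```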

The report introduces a new framework to classify scoring variation as tolerable (task ambiguity) or non-tolerable (model miscomprehension), and provides best practices for minimizing evaluator bias and inconsistency during metric development.
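
The framework itself is defined in the report; purely as a rough illustration, the sketch below shows one way such a classification might be operationalized by checking whether repeated scores cluster around a human reference score. The threshold and decision rule are hypothetical assumptions, not the report's definitions.

```python
# Hypothetical illustration (not the report's framework): label variation
# across repeated scores of the same item as tolerable or non-tolerable.
from statistics import mean


def classify_variation(scores: list[float], reference: float,
                       error_tol: float = 0.2) -> str:
    """Classify score variation relative to a human reference score."""
    if abs(mean(scores) - reference) > error_tol:
        # Scores center away from the reference: the evaluator likely
        # misreads the rubric -> non-tolerable (model miscomprehension).
        return "non-tolerable"
    # Scores scatter around the reference: variation plausibly reflects
    # genuine task ambiguity -> tolerable.
    return "tolerable"
```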

These findings offer important guidance for anyone using LLMs to evaluate generative model quality.


The STAC-AI Working Group focuses on benchmarking artificial intelligence (AI) technologies in finance. This includes deep learning, large language models (LLMs), and other AI-driven approaches that help firms unlock new efficiencies and insights.