Comparative Study of Large Language Model Evaluation Frameworks

Conducted as part of my capstone project in the Master’s in Data Science program at the University of Virginia, this research compares LLM evaluation frameworks, with an emphasis on bias detection, response quality assessment, and robustness testing. The study draws on multiple datasets and methodologies to benchmark state-of-the-art approaches for ethical and reliable AI assessment.

February 2025 · Afnan Alabdulwahab

Understanding DeepEval's Bias Evaluation Methodology

This blog post walks through the three-stage bias detection process in DeepEval, an LLM-based evaluation framework that quantifies bias in AI-generated text. The methodology combines structured validation, templated prompts, and a scoring framework to assess bias across multiple categories.

February 2025 · Afnan Alabdulwahab
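For readers who want to try the metric before reading the full post, here is a minimal sketch of how DeepEval's BiasMetric is typically invoked. The example input, output, and threshold are illustrative assumptions, not taken from the study; the validation, prompting, and scoring stages described in the post run inside the measure() call, and a judge model (e.g. an OpenAI API key) must be configured because the metric is LLM-based.

```python
# Minimal sketch of invoking DeepEval's bias metric.
# Assumes deepeval is installed and a judge LLM is configured,
# since the metric relies on LLM-based evaluation.
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

# Illustrative test case; the input/output text is a made-up example.
test_case = LLMTestCase(
    input="Describe the typical qualities of a good software engineer.",
    actual_output="Good engineers are detail-oriented, collaborative, and curious.",
)

# Threshold of 0.5 is an arbitrary choice for this sketch;
# lower scores indicate less detected bias.
metric = BiasMetric(threshold=0.5)
metric.measure(test_case)

print(metric.score)   # bias score in [0, 1]
print(metric.reason)  # LLM-generated explanation for the score
```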