Overview

As part of my UVA capstone project, I collaborated with an industry sponsor to analyze and compare evaluation frameworks for large language models (LLMs). The study focused on key metrics including response accuracy, retrieval effectiveness, bias detection, toxicity, hallucination, and tone identification.

I led the bias detection evaluation, implementing counterfactual data testing with the WinoBias dataset to measure the consistency of LLM responses across sensitive attributes. I also developed and applied custom bias detection methods using promptfoo, DeepEval, and RAGAS, running comparative analyses on 1,500+ sentence pairs from the CrowS-Pairs dataset. By combining counterfactual data testing with contextual sensitivity analysis, our research aimed to strengthen gender bias evaluation in LLMs and contribute to more ethical AI assessment methodologies.
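
The core idea of the counterfactual consistency check can be sketched roughly as below. The term list, the `query_model` wrapper, and the exact-match comparison are simplifying assumptions for illustration, not the project's actual WinoBias evaluation, which requires fuller term lists and coreference-aware scoring.

```python
# Minimal counterfactual-consistency sketch. `query_model(prompt)` is a
# hypothetical wrapper around whatever LLM is under test.

# Simplified bidirectional term map (a real evaluation uses a fuller list).
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "his": "her", "her": "his",
    "him": "her",
    "man": "woman", "woman": "man",
}

def swap_gender_terms(sentence: str) -> str:
    """Build the counterfactual variant of a prompt by swapping gendered terms."""
    return " ".join(GENDER_SWAPS.get(t.lower(), t) for t in sentence.split())

def counterfactual_consistency(prompts, query_model) -> float:
    """Fraction of prompts where the model answers the original and the
    gender-swapped variant identically (higher = less sensitive to gender)."""
    consistent = 0
    for prompt in prompts:
        original = query_model(prompt)
        counterfactual = query_model(swap_gender_terms(prompt))
        consistent += int(original.strip().lower() == counterfactual.strip().lower())
    return consistent / len(prompts)
```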
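
The CrowS-Pairs comparison can likewise be sketched as a pseudo-log-likelihood test over each (stereotypical, anti-stereotypical) sentence pair. The choice of `bert-base-uncased` and the simplified scoring that masks every token (rather than only the tokens shared by both sentences, as the original CrowS-Pairs metric does) are illustrative assumptions rather than the exact setup we used.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log P(token | rest of sentence), masking one token at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def stereotype_preference_rate(pairs) -> float:
    """Fraction of (stereotypical, anti-stereotypical) pairs the model scores
    as more likely in the stereotypical direction; ~0.5 suggests less bias."""
    prefers = sum(
        pseudo_log_likelihood(stereo) > pseudo_log_likelihood(anti)
        for stereo, anti in pairs
    )
    return prefers / len(pairs)
```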