Links
- Paper (Extended Version)
- SIEDS Conference Paper (IEEE)
- Code
- Presentation Slides
- LLM Evaluation Framework Leaderboard
Overview
🏆 Winner of “Most Innovative Analytical Solution” Award
As part of my UVA capstone project, I collaborated with Deloitte’s Anthropic Alliance to analyze and compare evaluation frameworks for large language models (LLMs). This award-winning study focused on key metrics such as response accuracy, retrieval effectiveness, bias detection, toxicity, hallucination, and tone identification. I led the bias detection evaluation, implementing counterfactual data testing with the WinoBias dataset to measure how consistently LLM responses hold up when sensitive attributes are swapped. I also developed and applied custom bias detection methods using promptfoo, DeepEval, and RAGAS, running comparative analyses on 1,500+ sentence pairs from the CrowS-Pairs dataset. By integrating counterfactual data testing with contextual sensitivity analysis, our research aimed to strengthen bias evaluation in LLMs and contribute to more ethical AI assessment methodologies.
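To illustrate the counterfactual data testing idea, here is a minimal sketch: each prompt is paired with a version whose sensitive attribute (here, a gendered pronoun) is swapped, and the model is scored on how often its answer stays the same. The `query_llm` callable, the swap table, and the toy demo model are placeholder assumptions for illustration only, not the project's actual harness or the promptfoo/DeepEval/RAGAS implementations.

```python
# Minimal sketch of counterfactual data testing for bias evaluation, in the
# spirit of a WinoBias-style setup. `query_llm` and the toy demo model are
# placeholders, not the project's actual evaluation harness.
from typing import Callable, Dict, List

# Illustrative sensitive-term swap; a real setup would cover many more terms
# and handle casing, possessives, and ambiguous pronouns properly.
SWAP: Dict[str, str] = {"he": "she", "she": "he"}

def make_counterfactual(prompt: str) -> str:
    """Return the prompt with gendered pronouns swapped (naive, lowercase-only)."""
    out: List[str] = []
    for word in prompt.split():
        core = word.strip(".,!?").lower()
        out.append(word.lower().replace(core, SWAP[core]) if core in SWAP else word)
    return " ".join(out)

def consistency_rate(query_llm: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompts whose answer is unchanged after the counterfactual swap."""
    if not prompts:
        return 0.0
    consistent = sum(
        query_llm(p).strip().lower() == query_llm(make_counterfactual(p)).strip().lower()
        for p in prompts
    )
    return consistent / len(prompts)

if __name__ == "__main__":
    # Toy stand-in model so the sketch runs end to end without an API call.
    toy_llm = lambda prompt: "the nurse" if "nurse" in prompt else "unsure"
    demo = ["The physician thanked the nurse because he was overwhelmed."]
    print(f"Consistency: {consistency_rate(toy_llm, demo):.2f}")
```

A higher consistency rate under this kind of swap suggests the model's answer does not hinge on the sensitive attribute; the actual study also layered contextual sensitivity analysis on top of simple response matching.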
Publications: This research was presented at the 2025 Systems and Information Engineering Design Symposium (SIEDS) at UVA, and a condensed six-page version of our findings was published in the IEEE conference proceedings.
Related material
- 🏛️ UVA Master’s in Data Science Students Showcase Real-World Solutions in 2025 Capstone Presentations
- Deloitte and Anthropic Collaborate to Bring Safe, Reliable and Trusted AI to Commercial and Government Organizations
- Deloitte Collaborates With Anthropic to Advance Enterprise AI Capabilities Through AI Training and Certification Program