Overview

As part of my UVA capstone project, I collaborated with an industry sponsor to analyze and compare evaluation frameworks for large language models (LLMs). The study focused on key metrics including response accuracy, retrieval effectiveness, bias detection, toxicity, hallucination, and tone identification.

I led the bias detection evaluation, implementing counterfactual data testing with the WinoBias dataset to measure the consistency of LLM responses across sensitive attributes. I also developed and applied custom bias detection methods using promptfoo, DeepEval, and RAGAS, running comparative analyses on 1,500+ sentence pairs from the CrowS-Pairs dataset. By combining counterfactual data testing with contextual sensitivity analysis, our research aimed to strengthen gender bias evaluation in LLMs and contribute to more ethical AI assessment methodologies.
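
The core idea of the counterfactual consistency check can be sketched roughly as below. The term list, the `query_model` wrapper, and the exact-match comparison are simplifying assumptions for illustration, not the project's actual WinoBias evaluation, which requires fuller term lists and coreference-aware scoring.

```python
# Minimal counterfactual-consistency sketch. `query_model(prompt)` is a
# hypothetical wrapper around whatever LLM is under test.

# Simplified bidirectional term map (a real evaluation uses a fuller list).
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "his": "her", "her": "his",
    "him": "her",
    "man": "woman", "woman": "man",
}

def swap_gender_terms(sentence: str) -> str:
    """Build the counterfactual variant of a prompt by swapping gendered terms."""
    return " ".join(GENDER_SWAPS.get(t.lower(), t) for t in sentence.split())

def counterfactual_consistency(prompts, query_model) -> float:
    """Fraction of prompts where the model answers the original and the
    gender-swapped variant identically (higher = less sensitive to gender)."""
    consistent = 0
    for prompt in prompts:
        original = query_model(prompt)
        counterfactual = query_model(swap_gender_terms(prompt))
        consistent += int(original.strip().lower() == counterfactual.strip().lower())
    return consistent / len(prompts)
```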
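
The CrowS-Pairs comparison can likewise be sketched as a pseudo-log-likelihood test over each (stereotypical, anti-stereotypical) sentence pair. The choice of `bert-base-uncased` and the simplified scoring that masks every token (rather than only the tokens shared by both sentences, as the original CrowS-Pairs metric does) are illustrative assumptions rather than the exact setup we used.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log P(token | rest of sentence), masking one token at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def stereotype_preference_rate(pairs) -> float:
    """Fraction of (stereotypical, anti-stereotypical) pairs the model scores
    as more likely in the stereotypical direction; ~0.5 suggests less bias."""
    prefers = sum(
        pseudo_log_likelihood(stereo) > pseudo_log_likelihood(anti)
        for stereo, anti in pairs
    )
    return prefers / len(pairs)
```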