Comparative Study of Large Language Model Evaluation Frameworks

Conducted as part of my capstone project in the Master’s in Data Science program at the University of Virginia, this research compares LLM evaluation frameworks, with an emphasis on bias detection, response quality assessment, and robustness testing. The study draws on multiple datasets and methodologies to benchmark state-of-the-art approaches for ethical and reliable AI assessment.

February 2025 · Afnan Alabdulwahab

Understanding DeepEval's Bias Evaluation Methodology

This blog post walks through the three-stage bias detection process in DeepEval, an LLM-based evaluation framework that quantifies bias in AI-generated text. The methodology combines structured validation, templated prompts, and a scoring framework to assess bias across multiple categories.

February 2025 · Afnan Alabdulwahab
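For readers who want to try the metric before reading the full post, here is a minimal sketch of how DeepEval's BiasMetric is typically invoked. The example input, output, and threshold are illustrative assumptions, not taken from the study; the validation, prompting, and scoring stages described in the post run inside the measure() call, and a judge model (e.g. an OpenAI API key) must be configured because the metric is LLM-based.

```python
# Minimal sketch of invoking DeepEval's bias metric.
# Assumes deepeval is installed and a judge LLM is configured,
# since the metric relies on LLM-based evaluation.
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

# Illustrative test case; the input/output text is a made-up example.
test_case = LLMTestCase(
    input="Describe the typical qualities of a good software engineer.",
    actual_output="Good engineers are detail-oriented, collaborative, and curious.",
)

# Threshold of 0.5 is an arbitrary choice for this sketch;
# lower scores indicate less detected bias.
metric = BiasMetric(threshold=0.5)
metric.measure(test_case)

print(metric.score)   # bias score in [0, 1]
print(metric.reason)  # LLM-generated explanation for the score
```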