Comparative Study of Large Language Model Evaluation Frameworks

Conducted as my capstone project for the Master’s in Data Science program at the University of Virginia, this research compares LLM evaluation frameworks, with an emphasis on bias detection, response quality assessment, and robustness testing. The study draws on multiple datasets and methodologies to benchmark state-of-the-art approaches to ethical and reliable AI assessment.

February 2025 · Afnan Alabdulwahab

Understanding DeepEval's Bias Evaluation Methodology

This blog post explores the three-stage bias detection process in DeepEval, an LLM-based evaluation system that quantifies bias in AI-generated text. The methodology combines structured validation, templated prompts, and a scoring framework to assess bias across multiple categories.
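For a quick feel for the workflow before reading the full post, here is a minimal sketch of scoring a single response with DeepEval's BiasMetric. The prompt and response strings are hypothetical placeholders, the exact API may differ across DeepEval versions, and the metric assumes a judge model is configured (by default, an OpenAI API key in the environment).

```python
# Minimal sketch: scoring one model response with DeepEval's BiasMetric.
# Assumes `pip install deepeval` and a judge model configured via OPENAI_API_KEY.
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

# Hypothetical prompt/response pair used purely for illustration.
test_case = LLMTestCase(
    input="What do you think about remote work?",
    actual_output=(
        "Remote work suits disciplined people; younger employees "
        "usually lack that discipline."
    ),
)

metric = BiasMetric(threshold=0.5)  # passes if the bias score stays at or below 0.5
metric.measure(test_case)

print(metric.score)   # fraction of extracted opinions the judge deems biased (lower is better)
print(metric.reason)  # the judge model's explanation for the score
```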

February 2025 · Afnan Alabdulwahab

The Turing Test: Detecting LLM-Generated Content

For my Decoding LLMs course, our team investigates the challenges of detecting AI-generated text in academic settings. We explore state-of-the-art detection methods, benchmark existing classifiers, and develop a detection model trained on diverse datasets. The goal is to uphold academic integrity by providing reliable AI-text detection tools.
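As a point of reference for what benchmarking a detector can look like at its simplest, below is a hypothetical baseline sketch (TF-IDF character n-grams plus logistic regression via scikit-learn). It is not the model the team developed, and the toy texts and labels are placeholders for a real corpus of human-written versus LLM-generated passages.

```python
# Illustrative baseline only: TF-IDF character n-grams + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy placeholder data: label 0 = human-written, label 1 = LLM-generated.
texts = [
    "I honestly forgot the deadline and wrote this at 2 a.m.",
    "Furthermore, this essay will explore the multifaceted implications thereof.",
    "The results in Table 2 were collected over three rushed trials.",
    "In conclusion, it is important to note that many factors are at play.",
]
labels = [0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# Character n-grams are a common, robust stylistic signal for this kind of task.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```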

February 2025 · Afnan Alabdulwahab