LLM Evaluation Framework Comparison

Comparative Study of LLM Evaluation Frameworks (Deloitte-Anthropic Alliance)

In collaboration with Deloitte’s Anthropic Alliance, this capstone research for the M.S. in Data Science at the University of Virginia critically examines leading frameworks for evaluating large language models (LLMs). Leveraging multiple datasets and methodologies, the study benchmarks state-of-the-art approaches for ethical and reliable AI assessment, comparing frameworks across eight critical metrics: toxicity detection, bias detection, hallucination detection, summarization quality, tone identification, readability assessment, retrieval accuracy, and response accuracy.

May 2025 · Afnán Alabdulwahab

Understanding DeepEval's Bias Evaluation Methodology

This blog post explores the three-stage bias detection process in DeepEval, an LLM-based evaluation system that quantifies bias in AI-generated text. The methodology leverages structured validation, templated prompts, and a scoring framework to assess bias across multiple categories.
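Scoring frameworks of this kind typically report the fraction of extracted opinions that an LLM judge marks as biased. The sketch below illustrates only that final scoring stage; the `OpinionVerdict` structure and example opinions are illustrative assumptions, not DeepEval's actual API.

```python
from dataclasses import dataclass

@dataclass
class OpinionVerdict:
    # One verdict per opinion extracted from the evaluated text;
    # `biased` would be judged by an LLM against bias categories.
    opinion: str
    biased: bool

def bias_score(verdicts: list[OpinionVerdict]) -> float:
    """Score = biased opinions / total opinions; 0.0 if none were found."""
    if not verdicts:
        return 0.0
    return sum(v.biased for v in verdicts) / len(verdicts)

verdicts = [
    OpinionVerdict("Group X is inherently worse at math", biased=True),
    OpinionVerdict("The study sampled 500 participants", biased=False),
]
print(bias_score(verdicts))  # 0.5
```

A ratio like this keeps the score in [0, 1], so a threshold (e.g. 0.5) can decide pass/fail per test case.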

February 2025 · Afnán Alabdulwahab

GPU Matrix Operations Performance Benchmarking

Benchmarking GPU Matrix Operations Optimizations

Completed for CS6501: GPU Architectures at the University of Virginia, this research presents a comprehensive benchmarking study of matrix operation optimizations across NVIDIA GPU architectures, focusing on matrix transpose and multiplication. Through systematic evaluation of custom CUDA kernels and library implementations on RTX 2080 Ti and A100 GPUs, the study demonstrated that vectorized implementations achieve up to a 6x speedup over naive approaches, reaching 1800 GB/s of throughput on the A100.
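Throughput figures like the 1800 GB/s above are conventionally reported as effective bandwidth: bytes read plus bytes written, divided by kernel time. A minimal sketch for an n × n transpose (the matrix size and timing below are illustrative, not measurements from the study):

```python
def effective_bandwidth_gbs(n: int, elem_bytes: int, time_s: float) -> float:
    """Effective bandwidth for an n x n transpose: each element is read once
    and written once, so total traffic is 2 * n^2 * elem_bytes."""
    total_bytes = 2 * n * n * elem_bytes
    return total_bytes / time_s / 1e9

# e.g. a 4096 x 4096 float32 transpose finishing in ~74.6 microseconds
print(round(effective_bandwidth_gbs(4096, 4, 74.6e-6)))  # 1799 (GB/s)
```

Because transpose does no arithmetic, effective bandwidth (not FLOPS) is the natural metric, and the ceiling is the GPU's memory bandwidth.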

May 2025 · Afnán Alabdulwahab

AI-Generated Text Detection

Detecting AI-Generated Text: Targeting Academic Integrity Applications

Completed for DS6051: Decoding Large Language Models at UVA, this project explores transformer-based methods for detecting AI-generated text in academic contexts. By fine-tuning RoBERTa with LoRA and optimizing for accuracy on human-written text, the model reduced the false-positive rate on human-written abstracts from 83.2% to just 0.7%, demonstrating the importance of fairness and robustness in detection systems.
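The 83.2% and 0.7% figures are false-positive rates: the share of human-written abstracts wrongly flagged as AI-generated. A minimal sketch of that metric (the label convention and toy data are illustrative):

```python
def false_positive_rate(y_true: list[int], y_pred: list[int]) -> float:
    """FPR = human texts flagged as AI / total human texts.
    Convention assumed here: 1 = AI-generated, 0 = human-written."""
    human = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    if not human:
        return 0.0
    flagged = sum(p for _, p in human)
    return flagged / len(human)

# 4 human abstracts, 1 wrongly flagged as AI, plus 1 correctly caught AI text
print(false_positive_rate([0, 0, 0, 0, 1], [0, 1, 0, 0, 1]))  # 0.25
```

In academic-integrity settings this is the metric that matters most, since a false accusation against a student is far costlier than a missed detection.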

May 2025 · Afnán Alabdulwahab

Practice and Application of Data Science

Jonathan Kropko | Summer 2024

List of Irregular Verbs Across Romance Languages

This dataset contains all known irregular verbs across the Romance languages.

March 2013 · Patrick Fitzcarron O'Leary, Florianus Prinzel, Walter Schoeffler-Henschell, Detlev Amadeus Unterholzer, Dieter Vogelsang, Moritz-Maria von Igelfeld