Overview

Conducted comprehensive benchmarking of CUDA matrix operations to explore opportunities for reducing latency and improving GPU resource utilization in AI workloads across NVIDIA Turing and Ampere architectures (RTX 2080 Ti and A100 GPUs). I focused on matrix transpose analysis, implementing multiple custom CUDA kernels using advanced techniques including shared memory tiling with bank conflict avoidance, vectorized float4 memory access, and warp shuffle primitives. I systematically compared the custom implementations against industry-standard libraries (cuBLAS, CUTLASS, CuTe) across matrix dimensions from 32×32 to 8192×8192, identifying architecture-specific optimization strategies for transformer model acceleration.
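
The shared memory tiling with bank conflict avoidance mentioned above can be sketched as follows. This is a minimal illustration in the style of NVIDIA's well-known transpose sample, not the exact kernels benchmarked here; the kernel name and tile/block dimensions are assumptions for the example:

```cuda
#include <cuda_runtime.h>

#define TILE_DIM 32    // tile is one full warp wide
#define BLOCK_ROWS 8   // each thread handles TILE_DIM / BLOCK_ROWS elements

// Tiled transpose: the "+1" column of padding shifts each shared-memory
// row into a different bank, so the column-wise reads during write-out
// do not incur 32-way bank conflicts.
__global__ void transposeTiled(float *out, const float *in,
                               int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced load of a 32x32 tile, BLOCK_ROWS rows per iteration.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so the global-memory write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

A typical launch would use a `dim3(TILE_DIM, BLOCK_ROWS)` thread block over a `(width / TILE_DIM, height / TILE_DIM)` grid; without the padding column, the transposed shared-memory reads would serialize across all 32 banks.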