Overview

Conducted comprehensive benchmarking of CUDA matrix operations to explore opportunities for reducing latency and improving GPU resource utilization in AI workloads across NVIDIA Turing and Ampere architectures (RTX 2080 Ti and A100 GPUs). I focused on matrix transpose analysis, implementing multiple custom CUDA kernels using advanced techniques including shared memory tiling with bank conflict avoidance, vectorized float4 memory access, and warp shuffle primitives. I systematically compared the custom implementations against industry-standard libraries (cuBLAS, CUTLASS, CuTe) across matrix dimensions from 32×32 to 8192×8192, identifying architecture-specific optimization strategies for transformer model acceleration.
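
The shared memory tiling with bank conflict avoidance mentioned above can be sketched as follows. This is a minimal illustration in the style of NVIDIA's well-known transpose sample, not the exact kernels benchmarked here; the kernel name and tile/block dimensions are assumptions for the example:

```cuda
#include <cuda_runtime.h>

#define TILE_DIM 32    // tile is one full warp wide
#define BLOCK_ROWS 8   // each thread handles TILE_DIM / BLOCK_ROWS elements

// Tiled transpose: the "+1" column of padding shifts each shared-memory
// row into a different bank, so the column-wise reads during write-out
// do not incur 32-way bank conflicts.
__global__ void transposeTiled(float *out, const float *in,
                               int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced load of a 32x32 tile, BLOCK_ROWS rows per iteration.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so the global-memory write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

A typical launch would use a `dim3(TILE_DIM, BLOCK_ROWS)` thread block over a `(width / TILE_DIM, height / TILE_DIM)` grid; without the padding column, the transposed shared-memory reads would serialize across all 32 banks.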