Accelerating Transformer Attention with Custom CUDA Kernels
As part of my GPU Architectures course, this project profiles and optimizes the attention mechanism in transformers using custom CUDA extensions. The focus is on reducing inference and training latency through kernel-level optimizations that improve GPU resource utilization for deep learning workloads.
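When optimizing attention at the kernel level, it helps to have a slow but unambiguous reference to validate custom kernels against. The sketch below is a minimal pure-Python implementation of scaled dot-product attention, softmax(QKᵀ/√d)·V; the function names and list-of-lists layout are illustrative choices, not part of the project's actual codebase.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on plain lists.

    Q has shape (n, d), K has shape (m, d), V has shape (m, dv);
    returns an (n, dv) result: softmax(Q K^T / sqrt(d)) V.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # Scaled dot products of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A fused CUDA kernel should reproduce these outputs to within floating-point tolerance, which makes this a convenient ground truth for unit tests while profiling the GPU version.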