Implementing a Fast Tensor Core Matmul on the Ada Architecture

jhlee525 14 hours ago

This is incredibly useful. Thanks for making the kernels public.

Has anyone tried generalizing this to batched matmuls or to sparse inputs on Ada?