Unlike the legacy cublasGemmStridedBatchedEx, which requires all matrices in a batch to have the same dimensions, cuBLASLt Grouped GEMM supports variable dimensions per group.
By "grouping" these operations, cuBLASLt packs multiple problems into a single grid, improving GPU utilization and minimizing communication bottlenecks.

Key Features of cuBLASLt Grouped GEMM
For variable-size workloads, NVIDIA introduced specific Grouped GEMM features, often used via newer cuBLASLt extensions. In cuBLASLt, if you need variable sizes, you typically must either process the problems in sub-groups of identical sizes or use cublasLtMatmul with specific "Grouped" descriptors. Check CUBLASLT_NUMERICAL_IMPL_FLAGS and the specific Grouped GEMM extensions in the latest CUDA 12.x documentation to see what your version supports.
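The "sub-groups of identical sizes" workaround can be sketched on the host side without touching any GPU API. The snippet below is a minimal illustration, not NVIDIA's method: `GemmShape`, `ShapeKey`, and `bucketByShape` are hypothetical names introduced here, and the sketch assumes each problem is fully described by its (m, n, k) triple. Every resulting bucket holds problems of one shape, so it could be handed to a uniform batched call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical descriptor for one problem: C (m x n) = A (m x k) * B (k x n).
struct GemmShape {
    int m, n, k;
};

// Bucket key; std::tuple provides the ordering std::map needs.
using ShapeKey = std::tuple<int, int, int>;

// Groups problem indices by identical (m, n, k).
// Each bucket contains only same-shaped problems, so it is a candidate
// for a single uniform batched GEMM call.
std::map<ShapeKey, std::vector<std::size_t>>
bucketByShape(const std::vector<GemmShape>& problems) {
    std::map<ShapeKey, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < problems.size(); ++i) {
        const GemmShape& p = problems[i];
        assert(p.m > 0 && p.n > 0 && p.k > 0);  // reject degenerate shapes
        buckets[std::make_tuple(p.m, p.n, p.k)].push_back(i);
    }
    return buckets;
}
```

The number of buckets is the number of separate batched launches this workaround would need. When nearly every problem has a unique shape, the bucket count approaches the problem count, which is the case a true grouped GEMM is meant to handle.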
Imagine training a recommendation system with embedding tables of varying sizes, or running inference on a transformer model with variable sequence lengths. In these scenarios, you might have 1,024 independent GEMM operations, each with different M, N, or K dimensions.
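To pin down what such a workload computes, here is a plain CPU reference for a group of independent GEMMs whose shapes all differ. It is a sketch of the semantics only, under stated assumptions: `GemmProblem` and `groupedGemmReference` are hypothetical names, matrices are row-major `float`, and each problem computes C = A * B with no scaling factors. A grouped GPU implementation would be expected to produce the same results in far fewer launches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor: one independent problem C = A * B (row-major),
// with A of size m x k, B of size k x n, and C of size m x n.
struct GemmProblem {
    int m, n, k;
    std::vector<float> a, b, c;
};

// Reference loop over a group whose problems may all have different shapes.
// It only defines the expected results; it is not tuned for performance.
void groupedGemmReference(std::vector<GemmProblem>& group) {
    for (GemmProblem& p : group) {
        assert(p.a.size() == static_cast<std::size_t>(p.m) * p.k);
        assert(p.b.size() == static_cast<std::size_t>(p.k) * p.n);
        p.c.assign(static_cast<std::size_t>(p.m) * p.n, 0.0f);
        for (int i = 0; i < p.m; ++i) {
            for (int l = 0; l < p.k; ++l) {
                const float aval = p.a[i * p.k + l];
                for (int j = 0; j < p.n; ++j) {
                    p.c[i * p.n + j] += aval * p.b[l * p.n + j];
                }
            }
        }
    }
}
```

A reference like this is mainly useful for validating a GPU path on small inputs, since every problem carries its own m, n, and k rather than sharing one shape across the batch.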
According to NVIDIA Developer documentation, the grouped GEMM API in cuBLASLt offers several advanced capabilities:

- Each matrix in the group can have its own M_i, N_i, and K_i.
- Mixed Precision