pub fn gemm_concurrent<const M: usize, const N: usize, L: LoopOrder>(
    alpha: bool,
    a: MatrixTileSlice<'_>,
    b: MatrixTileSlice<'_>,
    beta: bool,
    c: MatrixTileSliceMut<'_>,
)
Performs tile-level GEMM with recursive parallelization.
The matrix is split recursively, along rows while it spans more than M blocks of rows or along columns while it spans more than N blocks of columns, until the tiles are small enough. All resulting tiles are then processed in parallel using rayon.
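The split-then-parallelize scheme described above can be sketched roughly as follows. This is an illustration, not the crate's implementation: `Region`, `split`, and `for_each_tile` are hypothetical names, halving (rows first, then columns) is an assumed split rule, and `std::thread::scope` stands in for rayon so the example compiles without dependencies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// A rectangular region measured in blocks (hypothetical type, not from the crate).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Region {
    row: usize,
    col: usize,
    rows: usize,
    cols: usize,
}

/// Halve `r` along rows while it has more than `M` block rows, otherwise along
/// columns while it has more than `N` block columns. Halving is an assumption.
fn split<const M: usize, const N: usize>(r: Region, out: &mut Vec<Region>) {
    assert!(M >= 1 && N >= 1, "thresholds must be at least one block");
    if r.rows > M {
        let top = r.rows / 2;
        split::<M, N>(Region { rows: top, ..r }, out);
        split::<M, N>(Region { row: r.row + top, rows: r.rows - top, ..r }, out);
    } else if r.cols > N {
        let left = r.cols / 2;
        split::<M, N>(Region { cols: left, ..r }, out);
        split::<M, N>(Region { col: r.col + left, cols: r.cols - left, ..r }, out);
    } else {
        out.push(r);
    }
}

/// Collect all leaf tiles, then run `work` on each one concurrently.
/// One scoped thread per tile keeps the sketch short; rayon would use a pool.
fn for_each_tile<const M: usize, const N: usize, F>(whole: Region, work: F) -> usize
where
    F: Fn(Region) + Sync,
{
    let mut tiles = Vec::new();
    split::<M, N>(whole, &mut tiles);
    thread::scope(|s| {
        for &t in &tiles {
            let work = &work;
            s.spawn(move || work(t));
        }
    });
    tiles.len()
}

fn main() {
    let covered = AtomicUsize::new(0);
    let whole = Region { row: 0, col: 0, rows: 4, cols: 64 };
    let n = for_each_tile::<1, 16, _>(whole, |t| {
        // Stand-in for the per-tile GEMM kernel: count the blocks visited.
        covered.fetch_add(t.rows * t.cols, Ordering::Relaxed);
    });
    println!("{n} tiles, {} blocks covered", covered.load(Ordering::Relaxed));
    // prints "16 tiles, 256 blocks covered"
}
```

Because the leaf tiles are disjoint, each task can write to its own part of the output without locking, which is presumably why the split is performed before any work is dispatched.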
§Type Parameters
M - Minimum block rows before parallelization stops
N - Minimum block columns before parallelization stops
L - Loop ordering strategy (see LoopOrder)
§Performance
For best performance, choose M and N based on your matrix sizes. The defaults used in the codebase are M=1, N=16, which work well for many workloads.
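To get a feel for how M and N trade task count against task size, the following sketch counts the leaf tiles produced for a given grid of blocks. It assumes the region is halved until it has at most M block rows and N block columns; `leaf_tiles` is a hypothetical helper and the real split rule may differ.

```rust
/// Number of leaf tiles when a `rows` x `cols` grid of blocks is halved until
/// it has at most `M` block rows and at most `N` block columns (assumed rule).
fn leaf_tiles<const M: usize, const N: usize>(rows: usize, cols: usize) -> usize {
    assert!(M >= 1 && N >= 1, "thresholds must be at least one block");
    if rows > M {
        leaf_tiles::<M, N>(rows / 2, cols) + leaf_tiles::<M, N>(rows - rows / 2, cols)
    } else if cols > N {
        leaf_tiles::<M, N>(rows, cols / 2) + leaf_tiles::<M, N>(rows, cols - cols / 2)
    } else {
        1
    }
}

fn main() {
    // The same 8 x 128 grid of blocks under three settings.
    println!("M=1, N=16:  {}", leaf_tiles::<1, 16>(8, 128)); // 64 tiles
    println!("M=4, N=32:  {}", leaf_tiles::<4, 32>(8, 128)); // 8 tiles
    println!("M=8, N=128: {}", leaf_tiles::<8, 128>(8, 128)); // 1 tile, no parallelism
}
```

Smaller thresholds yield more, smaller tasks, which helps load balancing but adds scheduling overhead. Larger thresholds do the opposite, so a setting that leaves at least a few tiles per worker thread is a reasonable starting point to benchmark from.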