BLAS-like operations for F_2 matrices.
This module provides highly optimized matrix multiplication kernels using a hierarchical tiling approach.
§Architecture
- Tiles: Matrices are divided into tiles, where each tile contains multiple 64 x 64 bit blocks
- Blocks: The fundamental unit of computation (64 x 64 bits = 64 rows x 64 columns of bits)
- SIMD kernels: Block-level operations use AVX-512 or scalar fallbacks
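To make the block-level operation concrete, here is a minimal scalar sketch of a multiply-accumulate on one 64 x 64 bit block. The `Block` type, its row-per-`u64` layout, and the name `block_mul_acc` are assumptions made for this illustration and are not taken from the module's actual API.

```rust
/// A 64 x 64 bit block. Hypothetical layout for illustration:
/// `rows[i]` is row i, and bit j of that word is the entry in column j.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Block {
    pub rows: [u64; 64],
}

impl Block {
    /// The all-zero block.
    pub fn zero() -> Self {
        Block { rows: [0; 64] }
    }

    /// The identity block (bit i set in row i).
    pub fn identity() -> Self {
        let mut rows = [0u64; 64];
        for (i, row) in rows.iter_mut().enumerate() {
            *row = 1u64 << i;
        }
        Block { rows }
    }
}

/// Scalar kernel computing `c += a * b` over F_2.
///
/// Addition in F_2 is XOR, so row i of the product is the XOR of
/// every row k of `b` for which bit k of `a.rows[i]` is set.
pub fn block_mul_acc(c: &mut Block, a: &Block, b: &Block) {
    for i in 0..64 {
        let mut bits = a.rows[i];
        let mut acc = 0u64;
        while bits != 0 {
            let k = bits.trailing_zeros() as usize;
            acc ^= b.rows[k];
            bits &= bits - 1; // clear the lowest set bit
        }
        c.rows[i] ^= acc;
    }
}
```

Because accumulation is XOR, applying the same product twice to the same accumulator cancels it out. A SIMD variant would process several rows or several words per instruction, but the arithmetic is the same.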
§Performance Strategy
- Loop ordering: Six different loop orderings (CIR, CRI, ICR, IRC, RCI, RIC) to optimize cache locality depending on matrix dimensions
- Parallelization: Recursive divide-and-conquer using rayon for large matrices
- Vectorization: AVX-512 intrinsics for significant speedup on supported CPUs
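The loop orderings can be pictured as the six permutations of a three-level loop nest over block indices. The sketch below assumes that R is the output row index, I the inner (shared) dimension, and C the output column index, and that the letters are listed from the outermost loop to the innermost; the module itself may define them differently. The `LoopOrder` enum and `for_each_block_product` are hypothetical names for this illustration.

```rust
/// Hypothetical enum naming the six loop orderings.
/// Assumption: letters are listed outermost to innermost, with
/// R = output row, I = inner dimension, C = output column.
#[derive(Clone, Copy, Debug)]
pub enum LoopOrder {
    Cir,
    Cri,
    Icr,
    Irc,
    Rci,
    Ric,
}

/// Calls `kernel(r, i, c)` once per block product `C[r][c] += A[r][i] * B[i][c]`,
/// visiting the index triples in the nesting chosen by `order`.
pub fn for_each_block_product(
    order: LoopOrder,
    rows: usize,
    inner: usize,
    cols: usize,
    mut kernel: impl FnMut(usize, usize, usize),
) {
    // Axis codes: 0 = R, 1 = I, 2 = C, listed outermost first.
    let axes: [usize; 3] = match order {
        LoopOrder::Cir => [2, 1, 0],
        LoopOrder::Cri => [2, 0, 1],
        LoopOrder::Icr => [1, 2, 0],
        LoopOrder::Irc => [1, 0, 2],
        LoopOrder::Rci => [0, 2, 1],
        LoopOrder::Ric => [0, 1, 2],
    };
    let extent = [rows, inner, cols];
    let mut idx = [0usize; 3];
    for x in 0..extent[axes[0]] {
        for y in 0..extent[axes[1]] {
            for z in 0..extent[axes[2]] {
                idx[axes[0]] = x;
                idx[axes[1]] = y;
                idx[axes[2]] = z;
                kernel(idx[0], idx[1], idx[2]);
            }
        }
    }
}
```

Every ordering performs the same set of block products, so the result is identical. Only the memory access pattern changes, which is why the best choice can depend on the matrix shape.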
§Implementation Notes
- Only `prime = 2` is optimized; other primes fall back to naive multiplication
- The optimal loop order and tile size depend on matrix dimensions (see benchmarks)
- Default configuration uses RIC ordering with 1 x 16 tiles for best average performance
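For comparison with the optimized F_2 path, a naive fallback for a general prime could look like the following sketch. It assumes dense row-major matrices with entries already reduced to the range 0..p; the function name `naive_mul_mod_p` and the `Vec<Vec<u32>>` representation are illustrative choices, not the module's actual types.

```rust
/// Naive cubic-time product of two row-major matrices over F_p.
/// Assumes every entry of `a` and `b` is already in `0..p`.
pub fn naive_mul_mod_p(a: &[Vec<u32>], b: &[Vec<u32>], p: u32) -> Vec<Vec<u32>> {
    let inner = b.len();
    let cols = if inner == 0 { 0 } else { b[0].len() };
    a.iter()
        .map(|row| {
            assert_eq!(row.len(), inner, "inner dimensions must match");
            (0..cols)
                .map(|j| {
                    // Accumulate in u64 and reduce after every term so the
                    // running sum cannot overflow.
                    let sum: u64 = row
                        .iter()
                        .zip(b)
                        .map(|(&x, b_row)| x as u64 * b_row[j] as u64)
                        .fold(0, |acc, term| (acc + term) % p as u64);
                    sum as u32
                })
                .collect()
        })
        .collect()
}
```

This version touches one entry at a time, whereas the F_2 kernels can combine 64 entries with a single XOR, which is one reason the `prime = 2` case benefits so much from a dedicated implementation.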