Module blas

BLAS-like operations for F_2 matrices.

This module provides highly optimized matrix multiplication kernels built on a hierarchical tiling approach.

§Architecture

Tiles: Matrices are divided into tiles, where each tile contains multiple 64 x 64 bit blocks
Blocks: The fundamental unit of computation (64 x 64 bits = 64 rows x 64 columns of bits)
SIMD kernels: Block-level operations use AVX-512 where available, with a scalar fallback otherwise
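
To make the block level concrete, here is a minimal sketch of a scalar multiply-accumulate on one 64 x 64 block. The `Block` type, its row-per-`u64` layout, and the name `block_mul_acc` are assumptions made for illustration; they are not taken from this module's `block` submodule, whose actual types and kernels may differ.

```rust
/// A 64 x 64 block over F_2 (hypothetical layout): row `i` is one `u64`,
/// and bit `j` of that word, counted from the least significant bit,
/// is the entry in column `j`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Block(pub [u64; 64]);

impl Block {
    /// The all-zero block.
    pub const ZERO: Block = Block([0; 64]);

    /// The identity block: row `i` has only bit `i` set.
    pub fn identity() -> Block {
        let mut rows = [0u64; 64];
        for (i, row) in rows.iter_mut().enumerate() {
            *row = 1u64 << i;
        }
        Block(rows)
    }
}

/// Scalar kernel sketch: `c += a * b` over F_2.
///
/// Row `i` of the product is the XOR of those rows of `b` whose index
/// appears as a set bit in row `i` of `a`. Addition in F_2 is XOR.
pub fn block_mul_acc(c: &mut Block, a: &Block, b: &Block) {
    for i in 0..64 {
        let mut bits = a.0[i];
        let mut acc = 0u64;
        while bits != 0 {
            // Index of the lowest set bit selects a row of `b`.
            let j = bits.trailing_zeros() as usize;
            acc ^= b.0[j];
            // Clear the lowest set bit and continue.
            bits &= bits - 1;
        }
        c.0[i] ^= acc;
    }
}
```

Because the kernel accumulates into `c`, applying the same product twice cancels it out, which is a cheap sanity check when experimenting with alternative kernels.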

§Performance Strategy

Loop ordering: Six different loop orderings (CIR, CRI, ICR, IRC, RCI, RIC) to optimize cache locality depending on matrix dimensions
Parallelization: Recursive divide-and-conquer using rayon for large matrices
Vectorization: AVX-512 intrinsics for significant speedup on supported CPUs
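
The effect of loop ordering can be sketched at the block level. The example below assumes that R, C and I stand for the output block row, the output block column and the inner (shared) dimension; the source does not spell this out, so treat that reading as a guess. `BlockMatrix`, `Order` and `mul` are hypothetical names, only two of the six orderings are shown, and the kernel is a scalar stand-in with no rayon or AVX-512.

```rust
/// One 64 x 64 block over F_2, one `u64` per row (assumed layout).
pub type Block = [u64; 64];

/// Scalar stand-in for the block kernel: `c += a * b` over F_2.
fn mul_acc(c: &mut Block, a: &Block, b: &Block) {
    for (c_row, &a_row) in c.iter_mut().zip(a.iter()) {
        let mut bits = a_row;
        while bits != 0 {
            *c_row ^= b[bits.trailing_zeros() as usize];
            bits &= bits - 1;
        }
    }
}

/// A matrix stored as a row-major grid of blocks (hypothetical type).
pub struct BlockMatrix {
    pub rows: usize, // number of block rows
    pub cols: usize, // number of block columns
    pub blocks: Vec<Block>,
}

impl BlockMatrix {
    pub fn zero(rows: usize, cols: usize) -> Self {
        BlockMatrix { rows, cols, blocks: vec![[0u64; 64]; rows * cols] }
    }

    fn idx(&self, r: usize, c: usize) -> usize {
        r * self.cols + c
    }
}

/// Two of the six orderings, listed from outermost to innermost loop.
#[derive(Clone, Copy, Debug)]
pub enum Order {
    Ric,
    Cir,
}

/// Multiplies `a * b`, visiting the same (r, i, c) triples in the order
/// selected by `order`. Only the memory access pattern changes.
pub fn mul(a: &BlockMatrix, b: &BlockMatrix, order: Order) -> BlockMatrix {
    assert_eq!(a.cols, b.rows, "inner block dimensions must agree");
    let (nr, ni, nc) = (a.rows, a.cols, b.cols);
    let mut out = BlockMatrix::zero(nr, nc);
    let mut step = |r: usize, i: usize, c: usize| {
        let k = out.idx(r, c);
        mul_acc(&mut out.blocks[k], &a.blocks[a.idx(r, i)], &b.blocks[b.idx(i, c)]);
    };
    match order {
        Order::Ric => {
            for r in 0..nr { for i in 0..ni { for c in 0..nc { step(r, i, c); } } }
        }
        Order::Cir => {
            for c in 0..nc { for i in 0..ni { for r in 0..nr { step(r, i, c); } } }
        }
    }
    out
}
```

Every ordering performs the same set of block products, so the results are identical. What differs is which operand stays hot in cache: with the innermost loop over C, a block of `a` is reused across a whole output row, while with the innermost loop over R, a block of `b` is reused down an output column.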

§Implementation Notes

Only prime = 2 is optimized; other primes fall back to naive multiplication
The optimal loop order and tile size depend on matrix dimensions (see benchmarks)
Default configuration uses RIC ordering with 1 x 16 tiles for best average performance
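
The split between the optimized prime = 2 path and the naive fallback might look roughly like the sketch below. All names (`mul`, `mul_f2`, `mul_naive`) and the `Vec<Vec<u32>>` representation are assumptions chosen to keep the example self-contained; the bit-packed path here is a simple word-wise XOR and does not use the tiled kernels this module describes.

```rust
/// Naive cubic-time product modulo `p`. Entries are assumed to be reduced.
pub fn mul_naive(a: &[Vec<u32>], b: &[Vec<u32>], p: u32) -> Vec<Vec<u32>> {
    let (n, k) = (a.len(), b.len());
    let m = b.first().map_or(0, |row| row.len());
    let mut out = vec![vec![0u32; m]; n];
    for i in 0..n {
        for l in 0..k {
            let x = a[i][l] as u64;
            for j in 0..m {
                // Widen to u64 so the product cannot overflow.
                let t = out[i][j] as u64 + x * b[l][j] as u64;
                out[i][j] = (t % p as u64) as u32;
            }
        }
    }
    out
}

/// Bit-packed product over F_2: rows of `b` become `u64` words, and each
/// output row is the XOR of the rows of `b` selected by a row of `a`.
pub fn mul_f2(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
    let m = b.first().map_or(0, |row| row.len());
    let words = (m + 63) / 64;
    let packed: Vec<Vec<u64>> = b
        .iter()
        .map(|row| {
            let mut w = vec![0u64; words];
            for (j, &x) in row.iter().enumerate() {
                if x & 1 == 1 {
                    w[j / 64] |= 1u64 << (j % 64);
                }
            }
            w
        })
        .collect();
    a.iter()
        .map(|row| {
            let mut acc = vec![0u64; words];
            for (l, &x) in row.iter().enumerate() {
                if x & 1 == 1 {
                    for (dst, src) in acc.iter_mut().zip(&packed[l]) {
                        *dst ^= *src;
                    }
                }
            }
            // Unpack the accumulated bits back into one entry per column.
            (0..m).map(|j| ((acc[j / 64] >> (j % 64)) & 1) as u32).collect()
        })
        .collect()
}

/// Dispatch sketch: the fast path is taken only for `p == 2`.
pub fn mul(a: &[Vec<u32>], b: &[Vec<u32>], p: u32) -> Vec<Vec<u32>> {
    if p == 2 { mul_f2(a, b) } else { mul_naive(a, b, p) }
}
```

Comparing `mul(a, b, 2)` against `mul_naive(a, b, 2)` on random inputs is a straightforward way to check a fast path like this against the fallback.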

Modules§

block
tile