Module blas


BLAS-like operations for F_2 matrices.

This module provides optimized matrix multiplication kernels built on a hierarchical tiling scheme.

§Architecture

  • Tiles: Matrices are divided into tiles, where each tile contains multiple 64 x 64 bit blocks
  • Blocks: The fundamental unit of computation (64 x 64 bits = 64 rows x 64 columns of bits)
  • SIMD kernels: Block-level operations use AVX-512 where the CPU supports it and fall back to a scalar implementation otherwise
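
To make the block-level operation concrete, here is a minimal scalar sketch of multiplying two 64 x 64 blocks over F_2. It is not this module's kernel: the `Block` alias, the row-per-`u64` layout, and the names `block_mul_add` and `identity_block` are assumptions made for illustration. Addition over F_2 is XOR, so row i of the product is the XOR of the rows of `b` selected by the set bits in row i of `a`.

```rust
/// Hypothetical layout: one u64 per row, where bit j of row i is entry (i, j).
type Block = [u64; 64];

/// Computes c += a * b over F_2, where addition is XOR.
fn block_mul_add(c: &mut Block, a: &Block, b: &Block) {
    for i in 0..64 {
        let mut bits = a[i];
        let mut acc = 0u64;
        // Visit each set bit k of row i of `a` and XOR in row k of `b`.
        while bits != 0 {
            let k = bits.trailing_zeros() as usize;
            acc ^= b[k];
            bits &= bits - 1; // clear the lowest set bit
        }
        c[i] ^= acc;
    }
}

/// The 64 x 64 identity block: row i has only bit i set.
fn identity_block() -> Block {
    let mut m = [0u64; 64];
    for i in 0..64 {
        m[i] = 1u64 << i;
    }
    m
}
```

Multiplying by `identity_block()` leaves a block unchanged, and accumulating the same product twice cancels it, since XOR is its own inverse.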

§Performance Strategy

  1. Loop ordering: Six different loop orderings (CIR, CRI, ICR, IRC, RCI, RIC) to optimize cache locality depending on matrix dimensions
  2. Parallelization: Recursive divide-and-conquer using rayon for large matrices
  3. Vectorization: AVX-512 intrinsics for significant speedup on supported CPUs
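
The effect of loop ordering can be sketched with two of the six orderings. This sketch assumes the letters stand for R = block row of the output, I = inner (shared) dimension, and C = block column of the output; that reading, the `BlockMatrix` type, and every function name here are hypothetical and not taken from this module. Both orderings compute the same product and differ only in the order in which blocks are visited, which is what affects cache behavior.

```rust
type Block = [u64; 64];

/// Scalar c += a * b on 64 x 64 bit blocks over F_2 (XOR is addition).
fn block_mul_add(c: &mut Block, a: &Block, b: &Block) {
    for i in 0..64 {
        let mut bits = a[i];
        while bits != 0 {
            c[i] ^= b[bits.trailing_zeros() as usize];
            bits &= bits - 1;
        }
    }
}

/// Hypothetical row-major grid of blocks.
struct BlockMatrix {
    rows: usize,
    cols: usize,
    blocks: Vec<Block>,
}

impl BlockMatrix {
    fn zero(rows: usize, cols: usize) -> Self {
        BlockMatrix { rows, cols, blocks: vec![[0u64; 64]; rows * cols] }
    }

    fn at(&self, r: usize, c: usize) -> &Block {
        &self.blocks[r * self.cols + c]
    }
}

/// RIC: output rows outermost, inner dimension next, output columns innermost.
fn mul_ric(a: &BlockMatrix, b: &BlockMatrix) -> BlockMatrix {
    assert_eq!(a.cols, b.rows);
    let mut out = BlockMatrix::zero(a.rows, b.cols);
    for r in 0..a.rows {
        for i in 0..a.cols {
            for c in 0..b.cols {
                let idx = r * out.cols + c;
                block_mul_add(&mut out.blocks[idx], a.at(r, i), b.at(i, c));
            }
        }
    }
    out
}

/// CIR: output columns outermost, inner dimension next, output rows innermost.
fn mul_cir(a: &BlockMatrix, b: &BlockMatrix) -> BlockMatrix {
    assert_eq!(a.cols, b.rows);
    let mut out = BlockMatrix::zero(a.rows, b.cols);
    for c in 0..b.cols {
        for i in 0..a.cols {
            for r in 0..a.rows {
                let idx = r * out.cols + c;
                block_mul_add(&mut out.blocks[idx], a.at(r, i), b.at(i, c));
            }
        }
    }
    out
}

/// Fills a matrix with xorshift pseudo-random bits (test helper only).
fn fill(rows: usize, cols: usize, seed: u64) -> BlockMatrix {
    let mut s = seed.max(1); // xorshift needs a nonzero state
    let mut m = BlockMatrix::zero(rows, cols);
    for block in m.blocks.iter_mut() {
        for word in block.iter_mut() {
            s ^= s << 13;
            s ^= s >> 7;
            s ^= s << 17;
            *word = s;
        }
    }
    m
}
```

In the RIC sketch the innermost loop walks consecutive output blocks of one row while reusing a single block of `a`; in CIR it reuses a single block of `b` instead. Which pattern is cheaper depends on the matrix shape.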

§Implementation Notes

  • Only prime = 2 is optimized; other primes fall back to naive multiplication
  • The optimal loop order and tile size depend on matrix dimensions (see benchmarks)
  • Default configuration uses RIC ordering with 1 x 16 tiles for best average performance
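
For reference, the naive multiplication that other primes fall back to might look like the following. This is a sketch under assumptions, not the module's code: the function name `naive_mul_mod_p` and the `Vec<Vec<u32>>` representation are hypothetical, and entries are assumed to be already reduced modulo `p`.

```rust
/// Naive triple-loop product of a (n x k) and b (k x m) over F_p.
/// Hypothetical helper for illustration only.
fn naive_mul_mod_p(a: &[Vec<u32>], b: &[Vec<u32>], p: u32) -> Vec<Vec<u32>> {
    let inner = b.len();
    let cols = if inner == 0 { 0 } else { b[0].len() };
    let p = p as u64;
    a.iter()
        .map(|row| {
            assert_eq!(row.len(), inner);
            (0..cols)
                .map(|j| {
                    // Accumulate in u64 and reduce at every step to avoid overflow.
                    let mut acc = 0u64;
                    for k in 0..inner {
                        acc = (acc + row[k] as u64 * b[k][j] as u64) % p;
                    }
                    acc as u32
                })
                .collect()
        })
        .collect()
}
```

Unlike the F_2 path, this version handles one entry at a time, so it gets no benefit from packing 64 entries into a machine word.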

Modules§

block
tile