This project involved analyzing cache behavior, inferring cache parameters, and optimizing a matrix transpose function to minimize cache misses. Through careful implementation and testing, efficient ...
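One common way to cut cache misses in a matrix transpose is loop tiling (blocking). The summary above does not say which technique the project used, so the following is only a minimal sketch under stated assumptions: a toy direct-mapped cache with 32 sets and 32-byte blocks, 4-byte elements, and the destination matrix stored directly after the source. The names `DirectMappedCache` and `transpose` are hypothetical and exist only for this illustration.

```python
ELEM_BYTES = 4  # assumed element size (e.g. a 32-bit int)


class DirectMappedCache:
    """Toy direct-mapped cache that only tracks tags and counts misses."""

    def __init__(self, num_sets=32, block_bytes=32):  # assumed parameters
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.tags = [None] * num_sets
        self.misses = 0

    def access(self, addr):
        block = addr // self.block_bytes   # which memory block holds addr
        index = block % self.num_sets      # the one set this block can live in
        tag = block // self.num_sets       # distinguishes blocks sharing a set
        if self.tags[index] != tag:        # cold or conflict miss
            self.misses += 1
            self.tags[index] = tag


def transpose(n, tile, cache):
    """Transpose an n x n matrix tile by tile.

    Every load from the source and store to the destination is reported
    to `cache`. With tile == n this degenerates to the plain row-by-row
    transpose. Returns (transposed matrix, miss count).
    """
    a = [[i * n + j for j in range(n)] for i in range(n)]
    b = [[0] * n for _ in range(n)]
    a_base = 0
    b_base = n * n * ELEM_BYTES  # assumption: b is laid out right after a
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    cache.access(a_base + (i * n + j) * ELEM_BYTES)  # read a[i][j]
                    cache.access(b_base + (j * n + i) * ELEM_BYTES)  # write b[j][i]
                    b[j][i] = a[i][j]
    return b, cache.misses


naive_b, naive_misses = transpose(32, 32, DirectMappedCache())
blocked_b, blocked_misses = transpose(32, 8, DirectMappedCache())
print("naive misses:", naive_misses)
print("blocked misses:", blocked_misses)
```

Under these assumed parameters, one cache block holds eight elements, so an 8 x 8 tile lets each block of the source and destination be reused before it is evicted. The untiled loop instead walks the destination column by column and keeps evicting blocks that map to the same set. Real hardware, other cache geometries, or other matrix sizes may favor a different tile size.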
Abstract: The rising popularity of deep learning algorithms demands specialized accelerators for matrix-matrix multiplication. Most matrix multipliers are based on the systolic array ...