Cache-efficient matrix transposition

Chatterjee, S. ; Sen, Sandeep (2000) Cache-efficient matrix transposition Proceedings. Sixth International Symposium on High-Performance Computer Architecture, 2000. HPCA-6 . 195 - 205.

PDF - Publisher Version

Official URL:


We investigate the memory system performance of several algorithms for transposing an N×N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high: low-level performance tuning "hacks", such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical non-linear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem

Item Type:Article
Source:Copyright of this article belongs to Conference Publications.
ID Code:90083
Deposited On:04 May 2012 14:30
Last Modified:19 May 2016 04:25

Repository Staff Only: item control page