ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores

Link: https://dl.acm.org/doi/pdf/10.1145/3627535.3638476
Citation: Yuetao Chen, Kun Li, Yuhao Wang, Donglin Bai, Lei Wang, Lingxiao Ma, Liang Yuan, Yunquan Zhang, Ting Cao, and Mao Yang. 2024. ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '24). Association for Computing Machinery, New York, NY, USA, 333–347. https://doi.org/10.1145/3627535.3638476


Don't get bogged down in minutiae; the fine details don't need much time for now.




A100 shared memory: 164 KB per SM × 108 SMs (about 17.3 MB in total)


Stencil computation is inherently a memory-bound algorithm.

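Drawbacks of the im2col approach, and how ConvStencil addresses each:

  1. Memory footprint grows by a multiple -> Stencil2row transformation
  2. Tensor Core Units (TCUs) are partly wasted: in stencil computation the kernel fills only a single column of the matrix -> compute adaptation (Dual Tessellation)
  3. Conflicts between the algorithm and the hardware -> conflict removal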
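
To make the first two drawbacks concrete, here is a rough sketch of im2col applied to a 1D stencil. It is only an illustration in plain Python, not the paper's implementation; the names `im2col_1d` and `apply_stencil`, the 10-point input, and the 3-point weights are all assumptions chosen for the example.

```python
def im2col_1d(signal, kernel_size):
    """Build one row per output point; each row is the window the kernel sees."""
    n_out = len(signal) - kernel_size + 1
    return [signal[i:i + kernel_size] for i in range(n_out)]


def apply_stencil(rows, weights):
    """Multiply the im2col matrix by the weights.

    The weights form a single column, so this is a matrix-vector product:
    a matrix-multiply unit with a multi-column tile would use only one column.
    """
    return [sum(x * w for x, w in zip(row, weights)) for row in rows]


signal = list(range(10))   # hypothetical input: 0, 1, ..., 9
weights = [1, -2, 1]       # hypothetical 3-point stencil (second difference)

rows = im2col_1d(signal, len(weights))

# Memory expansion: elements stored after im2col vs. elements in the input.
expansion = sum(len(r) for r in rows) / len(signal)

# Adjacent rows overlap in kernel_size - 1 of their kernel_size elements.
overlap_ok = rows[0][1:] == rows[1][:-1]

result = apply_stencil(rows, weights)

print(len(rows), "rows of", len(weights), "-> expansion factor", expansion)
print("adjacent rows share kernel_size - 1 elements:", overlap_ok)
print("stencil output:", result)
```

With these assumed inputs the 10-element signal becomes an 8 × 3 matrix (24 stored elements), and the weights occupy just one column, which is the situation items 1 and 2 describe.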

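After the input channel is converted to matrix form, at least a fraction (kernel_size - 1) / kernel_size of the elements in any two adjacent rows are duplicates,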