Felix

@felixkao

Joined on Oct 17, 2016

  • FLAT-Attention vs. FlashAttention. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks, ASPLOS'23. Summary of FLAT-Attention, i.e. FLAT (Fused Logit and Attend Tiling): the quadratic complexity of the Logit and Attend operators in the attention layer causes two major challenges: low performance due to memory boundedness, and a large on-chip buffer requirement for staging intermediate activations. A minimal sketch of the fused, row-tiled computation appears after this list.
  • FLAT-Attention was made public on arXiv in July 2021 and published at ASPLOS in March 2023. We are aware of the concurrent work FlashAttention. In short, we take a different route to tackle the same problem: the proposed solutions differ, but the key idea (tiling and scheduling) is the same. We summarize the key differences below; for the details, please refer to our Colab demo. Qualitative Comparisons: comparisons of FLAT-Attention and FlashAttention. Tiling Strategy Comparisons: FlashAttention uses block tiling and a weight-stationary dataflow, while FLAT-Attention uses row tiling (row granularity) and an output-stationary dataflow. Sketches contrasting the two tiling strategies follow this list.
  • Felix Kao. Up-to-date website: GitHub, LinkedIn. Skills: proficient in Python, PyTorch, JAX, GCP, Cloud TPU, Verilog; experienced in TensorFlow, C/C++, MATLAB. Research Interests and Experience [ML]: ML-based automation, RL, GA-based optimization, Transformers, efficient attention for long sequences, pruning, quantization, neural architecture search.
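As referenced in the first note above, here is a minimal NumPy sketch of the idea behind FLAT: fusing the Logit (Q K^T) and Attend (softmax-weighted sum over V) operators and tiling over query rows, so the full L x L logit matrix is never staged on chip. The function name, shapes, and tile size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fused_row_tiled_attention(Q, K, V, row_tile=64):
    # Q, K, V: (L, d) arrays for a single attention head (illustrative shapes).
    L, d = Q.shape
    out = np.empty((L, V.shape[1]))
    for start in range(0, L, row_tile):
        q = Q[start:start + row_tile]                 # tile of query rows
        logits = q @ K.T                              # logits for this row tile only
        logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[start:start + row_tile] = probs @ V       # attend immediately; logits are discarded
    return out
```

Because each query tile sees a complete row of logits, the softmax here is an ordinary one, and the output tile is accumulated in place (output stationary, in the terminology of the second note).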
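For contrast, here is a hedged sketch of the block-tiling strategy attributed to FlashAttention in the second note: tiling over both the query and key dimensions means no query tile ever sees a full row of logits, so the softmax must be computed online, with running statistics that rescale the partial sums as new key blocks arrive. Again, names and tile sizes are illustrative assumptions rather than the library's actual kernel.

```python
import numpy as np

def block_tiled_attention(Q, K, V, q_tile=64, k_tile=64):
    # Q, K, V: (L, d) arrays for a single attention head (illustrative shapes).
    L, d = Q.shape
    out = np.zeros((L, V.shape[1]))
    for qs in range(0, L, q_tile):
        q = Q[qs:qs + q_tile]
        m = np.full((q.shape[0], 1), -np.inf)       # running row maximum
        s = np.zeros((q.shape[0], 1))               # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[1]))    # unnormalized output accumulator
        for ks in range(0, L, k_tile):
            k = K[ks:ks + k_tile]
            v = V[ks:ks + k_tile]
            logits = q @ k.T                        # one block of logits
            m_new = np.maximum(m, logits.max(axis=-1, keepdims=True))
            scale = np.exp(m - m_new)               # rescale previous partial sums
            p = np.exp(logits - m_new)
            s = s * scale + p.sum(axis=-1, keepdims=True)
            acc = acc * scale + p @ v
            m = m_new
        out[qs:qs + q_tile] = acc / s               # normalize once per query tile
    return out
```

The two sketches produce the same result; the difference is purely in scheduling, which is the point of the tiling-strategy comparison above.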