Flat-Attention vs. FlashAttention
FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks, ASPLOS'23
Summary of FLAT-Attention
FLAT (Fused Logit and Attend Tiling)
The quadratic complexity of the Logit and Attend operators in the attention layer causes two major challenges (see the sketch after the list below):
Low performance from memory boundedness
Large on-chip buffer requirement for staging intermediate activations
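To make the two challenges concrete, here is a minimal NumPy sketch of plain, unfused attention (not FLAT's dataflow). The function name and the N = 4096, d = 64 shapes are illustrative assumptions; the point is that the intermediate score matrix is (N, N), so its size and memory traffic grow quadratically with sequence length while the Q/K/V inputs grow only linearly.

```python
# Minimal sketch of UNFUSED attention (illustrative only, not FLAT's dataflow).
# The Logit operator (Q @ K^T) produces an (N, N) intermediate that must be
# staged before the Attend operator (softmax(scores) @ V) can consume it:
# quadratic memory footprint and memory-bound behavior.
import numpy as np

def naive_attention(q, k, v):
    """Unfused attention: materializes the full (N, N) score matrix."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # Logit: O(N^2) intermediate
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # Attend: O(N^2) reads

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4096, 64                                  # assumed seq length, head dim
    q, k, v = (rng.standard_normal((n, d)).astype(np.float32) for _ in range(3))
    out = naive_attention(q, k, v)
    # The (N, N) intermediate dwarfs the O(N*d) inputs/outputs:
    print(f"score matrix: {n * n * 4 / 2**20:.0f} MiB "
          f"vs Q/K/V combined: {3 * n * d * 4 / 2**20:.1f} MiB")
```

At N = 4096 the fp32 score matrix alone is 64 MiB versus 3 MiB for Q, K, and V combined, which is why staging it on-chip is impractical and why fusing the Logit and Attend operators (as FLAT does) pays off.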