We thank all the reviewers for their helpful comments!
Reviewer1:
Thanks for your positive review!
Reviewer2:
1. We propose a novel search space that combines the number of heads with tensor-level mixed-precision MXInt quantization. For example, the search space size in Fig.2 is 6x5^(2x8x12): 6 head-number choices per layer, 5 precision choices per operand, 2 operands per GEMM, 8 GEMMs per layer, and 12 layers. We will add this to the caption.
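As a minimal sketch of this arithmetic (assuming the head-number choice is shared across layers, matching the single factor of 6 in the formula above):

```python
# Sketch of the Fig.2 search-space size; the factor of 6 assumes one shared
# head-number choice for the whole model, as in the formula above.
head_choices = 6
prec_choices = 5            # precision choices per operand
operands_per_gemm = 2
gemms_per_layer = 8
layers = 12
size = head_choices * prec_choices ** (operands_per_gemm * gemms_per_layer * layers)
print(f"{size:.2e}")        # ~9.6e134 design points
```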
2(Q). Our MXInt quantization focuses on mixed mantissa bitwidths (averaging ~5 bits) for better scalability (Sec.III.C). Additionally, PTQ only applies to quantization, whereas changing the number of heads requires training. Fig.7&8 illustrate that searching over the number of heads further improves area efficiency compared to quantization-only approaches.
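For reference, a simplified illustration of tensor-level MXInt-style quantization with a shared power-of-two scale and m-bit integer mantissas (a sketch only, not our exact implementation):

```python
import numpy as np

def mxint_quantize(x, mantissa_bits=5):
    # Illustrative MXInt-style quantization (sketch, not our exact scheme):
    # one shared power-of-two scale per tensor, signed integer mantissas.
    qmax = 2 ** (mantissa_bits - 1) - 1            # e.g. 15 for 5-bit mantissas
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(x)) / qmax + 1e-12)))
    scale = 2.0 ** shared_exp                      # shared power-of-two scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized tensor

x = np.random.randn(64, 64).astype(np.float32)
print(np.max(np.abs(x - mxint_quantize(x, mantissa_bits=5))))  # quantization error
```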
3. a) Hardware design choices such as mapping onto AIEs or PL have non-linear trade-offs between area and performance, making existing search algorithms such as TPE (https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.TPESampler.html) inapplicable. Instead, we use random search to improve hardware design efficiency, as shown in Sec.IV (a minimal sketch follows below).
b) We follow existing NAS work (https://arxiv.org/pdf/1906.11829, https://arxiv.org/pdf/2001.01233) and train one epoch per variant. We will clarify that in Sec.III.
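As a minimal illustration of the random hardware-design search in 3(a), with a purely hypothetical design space and placeholder cost model (our actual search scores candidates with the AIE/PL implementation results in Sec.IV):

```python
import random

random.seed(0)

def evaluate(mapping):
    # Placeholder cost model with hypothetical numbers; the real evaluation
    # uses post-implementation area/throughput reports.
    area = sum(3.0 if unit == "AIE" else 5.5 for unit in mapping)
    latency = sum(1.0 if unit == "AIE" else 0.7 for unit in mapping)
    return area, latency

def random_search(num_gemms=8, trials=1000, latency_budget=7.0):
    best = None
    for _ in range(trials):
        # Randomly map each GEMM either to AIEs or to PL.
        mapping = [random.choice(["AIE", "PL"]) for _ in range(num_gemms)]
        area, latency = evaluate(mapping)
        if latency <= latency_budget and (best is None or area < best[0]):
            best = (area, latency, mapping)
    return best

print(random_search())   # most area-efficient mapping meeting the latency budget
```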
4. Existing GPUs do not support MXInt or mixed precision below 4 bits, so a direct quantization comparison against GPUs would be unfair.
5. We will add tokens/Joule results to Tables III&IV.
6. We will add the following related work for comparison (we are the first to target designs with AIEs):
https://ieeexplore.ieee.org/document/9218596
https://ieeexplore.ieee.org/document/9586121
https://ieeexplore.ieee.org/document/9218749
https://ieeexplore.ieee.org/document/9502478
https://ieeexplore.ieee.org/document/9712541
https://ieeexplore.ieee.org/document/9774574
https://ieeexplore.ieee.org/document/8953587
https://ieeexplore.ieee.org/document/9060902
https://ieeexplore.ieee.org/document/9912074
https://ieeexplore.ieee.org/document/9586250
https://ieeexplore.ieee.org/document/10396202
https://ieeexplore.ieee.org/document/9444059
https://ieeexplore.ieee.org/document/10137094
https://ieeexplore.ieee.org/document/10505904
https://dl.acm.org/doi/10.1145/3490422.3502364
Reviewer3:
(Q) The trend is 1 head.
We agree that this is an interesting but *FPGA-specific* observation.
1. Inside a dataflow-style implementation, increasing the number of attention heads introduces additional logic & interface overhead, as more hardware blocks are instantiated.
2. However, reducing the number of heads can hurt model accuracy. The current Tables III&IV present results on a simple task, *ag_news*, where the impact on accuracy is not significant. However, we have also evaluated our methodology on more complex tasks, *emotion*, *tweet_eval-sentiment* and *imdb*, in Fig.7&8. In fact, only 2 designs (out of 6) on *emotion*, 1 on *imdb*, and none on *tweet_eval-sentiment* are single-head. We will label all head numbers in Fig.7&8 to clarify this.
(Q) Training details
We train all variants for 1 epoch and further train the top 10% of variants (ranked by 1-epoch accuracy) for another 4 epochs. For datasets that already have train, validation, and test splits, such as *tweet_eval-sentiment*, we reuse the splits directly. For datasets that only have a train and a test split, such as *ag_news* and *imdb*, we create new train and validation splits from the original train split with train:validation = 8:2. Both the batch size and the sequence length are 128 for all training. For example, with 120k training samples in *ag_news*, we split them into 96k and 24k, giving 750 training steps per epoch. We will add these details to Sec.IV.
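For clarity, a minimal sketch of the *ag_news* split and step arithmetic (assuming the HuggingFace datasets library; the seed and tooling are illustrative):

```python
from datasets import load_dataset  # assumes the HuggingFace "datasets" library

raw = load_dataset("ag_news")                  # 120k train / 7.6k test samples
split = raw["train"].train_test_split(test_size=0.2, seed=42)  # 8:2 split
train, valid, test = split["train"], split["test"], raw["test"]

batch_size, seq_len = 128, 128
steps_per_epoch = len(train) // batch_size     # 96000 // 128 = 750
print(len(train), len(valid), steps_per_epoch)
```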
Reviewer4:
-- (Q) We change the number of heads and the size per head to keep the total weight size the same across variants.
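For example (illustrative dimensions only, not our exact model configurations), holding heads x head_dim constant keeps the projection weight count unchanged:

```python
# Illustrative only: hypothetical hidden size, not our exact configurations.
hidden = 768
for heads in (12, 6, 3, 1):
    head_dim = hidden // heads                    # size per head grows as heads shrink
    qkv_params = 3 * hidden * (heads * head_dim)  # Q/K/V projection weights
    print(heads, head_dim, qkv_params)            # total weight size is constant
```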
-- (Q) We agree that searching over more hyper-parameters is interesting, but it raises different challenges when combined with hardware design constraints. We leave this to future work.
-- Our output is a design in which the AIEs operate in FP32 and the PL uses tensor-level mixed-precision MXInt.
-- We observed that the order of quantization and head search may affect the final results. Here we empirically choose an efficient order. We will clarify that in Sec.III.
-- We target a fixed device count and TPS, and search for an area-efficient mapping. For simplicity, we focus on data and pipeline parallelism. We will clarify that in Sec.IV.
-- Thanks for your advice. We will add more discussion (see our response to Reviewer3) to Sec.IV.