We thank all the reviewers for their helpful comments!
Reviewer1:
Thanks for your positive review!
Reviewer2:
1. We propose a novel search space that combines the number of heads with tensor-level mixed-precision MXInt quantization. For example, the search space size in Fig.2 is 6x5^(2x8x12): 6 head-number choices per layer, 5 precision choices per operand, 2 operands per GEMM, 8 GEMMs per layer, and 12 layers. We will add this to the caption.
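As a minimal sketch of this arithmetic (assuming the head-number choice is shared across layers, matching the single factor of 6 in the formula above):

```python
# Sketch of the Fig.2 search-space size; the factor of 6 assumes one shared
# head-number choice for the whole model, as in the formula above.
head_choices = 6
prec_choices = 5            # precision choices per operand
operands_per_gemm = 2
gemms_per_layer = 8
layers = 12
size = head_choices * prec_choices ** (operands_per_gemm * gemms_per_layer * layers)
print(f"{size:.2e}")        # ~9.6e134 design points
```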
2(Q). Our MXInt quantization focuses on mixed mantissa bitwidths (averaging ~5 bits) for better scalability (Sec.III.C). Additionally, PTQ only applies to quantization, whereas changing the number of heads requires training. Fig.7&8 illustrate that searching over the number of heads further improves area efficiency compared to quantization-only approaches.
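For reference, a simplified illustration of tensor-level MXInt-style quantization with a shared power-of-two scale and m-bit integer mantissas (a sketch only, not our exact implementation):

```python
import numpy as np

def mxint_quantize(x, mantissa_bits=5):
    # Illustrative MXInt-style quantization (sketch, not our exact scheme):
    # one shared power-of-two scale per tensor, signed integer mantissas.
    qmax = 2 ** (mantissa_bits - 1) - 1            # e.g. 15 for 5-bit mantissas
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(x)) / qmax + 1e-12)))
    scale = 2.0 ** shared_exp                      # shared power-of-two scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized tensor

x = np.random.randn(64, 64).astype(np.float32)
print(np.max(np.abs(x - mxint_quantize(x, mantissa_bits=5))))  # quantization error
```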
3. a) Hardware design choices such as mapping onto AIEs or PL have non-linear trade-offs between area and performance, making existing search algorithms such as TPE (https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.TPESampler.html) inapplicable. Instead, we use random search to improve hardware design efficiency, as shown in Sec.IV (a minimal sketch follows below).
b) We follow existing NAS work (https://arxiv.org/pdf/1906.11829, https://arxiv.org/pdf/2001.01233) and train one epoch per variant. We will clarify that in Sec.III.
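As a minimal illustration of the random hardware-design search in 3(a), with a purely hypothetical design space and placeholder cost model (our actual search scores candidates with the AIE/PL implementation results in Sec.IV):

```python
import random

random.seed(0)

def evaluate(mapping):
    # Placeholder cost model with hypothetical numbers; the real evaluation
    # uses post-implementation area/throughput reports.
    area = sum(3.0 if unit == "AIE" else 5.5 for unit in mapping)
    latency = sum(1.0 if unit == "AIE" else 0.7 for unit in mapping)
    return area, latency

def random_search(num_gemms=8, trials=1000, latency_budget=7.0):
    best = None
    for _ in range(trials):
        # Randomly map each GEMM either to AIEs or to PL.
        mapping = [random.choice(["AIE", "PL"]) for _ in range(num_gemms)]
        area, latency = evaluate(mapping)
        if latency <= latency_budget and (best is None or area < best[0]):
            best = (area, latency, mapping)
    return best

print(random_search())   # most area-efficient mapping meeting the latency budget
```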
4. Existing GPUs do not support MXInt or mixed precision below 4 bits, so a direct quantization comparison against GPUs would be unfair.
5. We will add tokens/Joule results to Tables III&IV.
6. We will add the following related work for comparison (we are the first to target designs with AIEs):
https://ieeexplore.ieee.org/document/9218596
https://ieeexplore.ieee.org/document/9586121
https://ieeexplore.ieee.org/document/9218749
https://ieeexplore.ieee.org/document/9502478
https://ieeexplore.ieee.org/document/9712541
https://ieeexplore.ieee.org/document/9774574
https://ieeexplore.ieee.org/document/8953587
https://ieeexplore.ieee.org/document/9060902
https://ieeexplore.ieee.org/document/9912074
https://ieeexplore.ieee.org/document/9586250
https://ieeexplore.ieee.org/document/10396202
https://ieeexplore.ieee.org/document/9444059
https://ieeexplore.ieee.org/document/10137094
https://ieeexplore.ieee.org/document/10505904
https://dl.acm.org/doi/10.1145/3490422.3502364
Reviewer3:
(Q) The trend is 1 head.
We agree that this is an interesting but *FPGA-specific* observation.
1. Inside a dataflow-style implementation, increasing the number of attention heads introduces additional logic & interface overhead, as more hardware blocks are instantiated.
2. However, reducing the number of heads can hurt model accuracy. The current Tables III&IV present results on a simple task, *ag_news*, where the impact on accuracy is not significant. However, we have also evaluated our methodology on more complex tasks, *emotion*, *tweet_eval-sentiment* and *imdb*, in Fig.7&8. In fact, only 2 designs (out of 6) on *emotion*, 1 on *imdb*, and none on *tweet_eval-sentiment* are single-head. We will label all head numbers in Fig.7&8 to clarify this.
(Q) Training details
We train all variants for 1 epoch and further train the top 10% of variants (ranked by 1-epoch accuracy) for another 4 epochs. For datasets that already have train, validation, and test splits, such as *tweet_eval-sentiment*, we reuse the splits directly. For datasets that only have a train and a test split, such as *ag_news* and *imdb*, we create new train and validation splits from the original train split with train:validation = 8:2. Both the batch size and the sequence length are 128 for all training. For example, with 120k training samples in *ag_news*, we split them into 96k and 24k, giving 750 training steps per epoch. We will add these details to Sec.IV.
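For clarity, a minimal sketch of the *ag_news* split and step arithmetic (assuming the HuggingFace datasets library; the seed and tooling are illustrative):

```python
from datasets import load_dataset  # assumes the HuggingFace "datasets" library

raw = load_dataset("ag_news")                  # 120k train / 7.6k test samples
split = raw["train"].train_test_split(test_size=0.2, seed=42)  # 8:2 split
train, valid, test = split["train"], split["test"], raw["test"]

batch_size, seq_len = 128, 128
steps_per_epoch = len(train) // batch_size     # 96000 // 128 = 750
print(len(train), len(valid), steps_per_epoch)
```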
Reviewer4:
-- (Q) We change the number of heads and the size per head to keep the total weight size the same across variants.
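For example (illustrative dimensions only, not our exact model configurations), holding heads x head_dim constant keeps the projection weight count unchanged:

```python
# Illustrative only: hypothetical hidden size, not our exact configurations.
hidden = 768
for heads in (12, 6, 3, 1):
    head_dim = hidden // heads                    # size per head grows as heads shrink
    qkv_params = 3 * hidden * (heads * head_dim)  # Q/K/V projection weights
    print(heads, head_dim, qkv_params)            # total weight size is constant
```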
-- (Q) We agree that searching over more hyper-parameters is interesting, but it raises different challenges when combined with hardware design constraints. We leave this to future work.
-- Our output is a design in which the AIEs operate in FP32 and the PL uses tensor-level mixed-precision MXInt.
-- We observed that the order of quantization and head search may affect the final results. Here we empirically choose an efficient order. We will clarify that in Sec.III.
-- We target a fixed device count and TPS, and search for an area-efficient mapping. For simplicity, we focus on data and pipeline parallelism. We will clarify that in Sec.IV.
-- Thanks for your advice. We will add more discussion (see our response to Reviewer3) to Sec.IV.