We are grateful to the reviewers for their helpful feedback.
We begin by addressing some concerns raised by multiple reviewers and then respond to each reviewer in turn.
## Common Concerns
> Clarification on novelties.
To clarify, our paper makes the following contributions:
* The first **mixed-precision MXInt** FPGA accelerator system for LLM inference in the cloud, together with the corresponding design space exploration and precision search strategy. The previous BrainWave paper [10] from Microsoft did not support mixed precision.
* Our approach produces efficient hardware systems with **4-bit weights on average** and 8-bit activations, which is extremely low-precision inference compared to *existing hardware implementations* of LLM accelerators (see Table 1). We also report the accuracy results under these quantizations.
* Compared to multi-GPU systems, our multi-FPGA approach achieves better **energy efficiency**.
To our knowledge, no existing work has explored a mixed-precision MXInt implementation, and our paper provides performance and energy-efficiency results. These results are valuable and provide references for future ASIC MXInt accelerator designs.
## Reviewer A
> Inconsistency about average 4b and under 8b
Sorry about the confusion. We will change them to "average 4b".
> VSQuant [9]
Thanks for the advice. We will add the following results (OPT-6.7B on ARC) to the revised version:
| Approaches | Accuracy (%) | Average bits |
|---------------|--------------|:------------:|
| Float | 65.6 | 32 |
| LLM.int4() | 64.5 | 16 |
| VS-Quant | 38.2 | 8 |
| VS-Quant | 26.1 | 4 |
| SmoothQuant-c | 65.3 | 8 |
| MiniFloat | 64.9 | 6 |
| Mixed MXInt | 64.9 | 4.4 |
It is known that VS-Quant is not well suited to LLM quantization because it is coarse-grained and does not account for activation outliers. SmoothQuant points this out in its paper [https://arxiv.org/abs/2211.10438] and shows better accuracy than VS-Quant (per-token and per-channel); later work such as LLM.int [14] has also noted this. We compare our method against both existing state-of-the-art LLM quantization methods and VS-Quant. We will also add the table above to Section IV.
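For intuition on the "Average bits" column, the per-element storage cost of a block format such as MXInt can be sketched as below; the exponent width, block sizes, and layer mix here are illustrative assumptions, not the exact configuration in our paper:

```python
# Illustrative only: effective storage cost per element of a block format,
# where `block_size` elements share one `exp_bits`-wide exponent/scale.
def mxint_bits_per_element(mantissa_bits: int, exp_bits: int, block_size: int) -> float:
    return mantissa_bits + exp_bits / block_size

# e.g., 4-bit mantissas sharing an 8-bit exponent over a block of 16 elements:
print(mxint_bits_per_element(4, 8, 16))  # 4.5 bits/element

# A mixed-precision model's average bits is then the size-weighted mean over
# layers; the layer sizes and configs below are hypothetical placeholders.
layers = [(1_000_000, (4, 8, 16)), (500_000, (3, 8, 32)), (250_000, (8, 8, 16))]
total = sum(n * mxint_bits_per_element(*cfg) for n, cfg in layers)
print(total / sum(n for n, _ in layers))  # weighted average bits
```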
## Reviewer B
> being the first implementation of the MXInt.
Sorry about the confusion. We will update it to "the first open-source mixed-precision MXInt implementation".
> “Max Model Size”
Thanks for the comments. We will instead emphasize the "many-device" setting and focus on the scalability of the hardware design.
> Cocotb simulation v.s. running on board
Verifying a full 32-FPGA hardware system implementation faces technical challenges, mainly because of limited hardware resources. We physically verified that the latency of our point-to-point cross-board communication is consistent in a real multi-FPGA network, which enables us to deduce system performance from the measured data.
On the other hand, the hardware simulations provide cycle-accurate results, and the synthesis tools report the exact hardware clock frequency and area. Together, these provide reliable hardware results for evaluation.
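To make the deduction concrete, a simplified first-order sketch of how the per-board cycle counts and the measured link latency combine is shown below; the function and the additive pipeline model are our illustrative assumptions, not the paper's exact analytical model:

```python
# Simplified model: end-to-end latency across boards = on-chip compute time
# (cycle-accurate simulation + synthesized clock) + cross-board hops
# (measured point-to-point link latency).
def estimated_latency_s(cycles_per_board: list[int], clock_hz: float,
                        link_latency_s: float) -> float:
    compute = sum(c / clock_hz for c in cycles_per_board)
    comm = (len(cycles_per_board) - 1) * link_latency_s
    return compute + comm

# Hypothetical numbers: 4 boards, 200k cycles each at 250 MHz,
# 2 us measured per point-to-point hop.
print(estimated_latency_s([200_000] * 4, 250e6, 2e-6))
```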
> BrainWave implementation ([12])
Sorry about the confusion. As Reviewer B suggests, we will update the text to "There is no extensive comparison to [12] because its hardware implementation is not publicly available."
## Reviewer D
Thank you for your comments.
## Reviewer C
> Practical challenges and complexities
Yes, thanks for the advice. We will add a paragraph clarifying the benefits and limitations of our approach. In summary, MXInt quantization has a larger search space, and in the paper we propose an efficient search approach for mixed precision. On the hardware side, mapping across multiple devices poses performance challenges, and we use a dataflow architecture to preserve high throughput.
> Applicability of the findings to other data formats, and impact on the broader field of ML systems?
Yes, we plan to explore mixed-precision and mixed-type hardware systems in future work. We will clarify this in the conclusion section.
> Potential impact
Thanks for the comments. We see our work as a starting point for MXInt-based hardware accelerator designs. We hope that the results presented in this paper provide references for future ASIC/cloud accelerator designs.
## Reviewer E
> Existing mixed-precision techniques
Existing quantization techniques cannot be directly applied, mainly because the search space is different: traditional mixed-precision quantization searches over total bit widths and their components, whereas our search space spans multiple levels, including the **block sizes** of MXInt types and their **hardware operator costs**.
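As an illustration of why the search space differs, below is a minimal sketch of per-layer candidates that couple bit width with MXInt block size and a hardware cost term; all names, widths, block sizes, and the cost model are hypothetical placeholders, not the search strategy in our paper:

```python
from itertools import product

# Traditional mixed precision searches only over `widths`; the MXInt space
# also varies `block_sizes`, and each (width, block) pair has a different
# hardware operator cost.
widths = (3, 4, 6, 8)          # mantissa bits
block_sizes = (8, 16, 32)      # elements sharing one exponent

def hw_cost(width: int, block: int) -> float:
    # Placeholder cost model: wider mantissas and smaller blocks cost more.
    return width + 8 / block

candidates = [(w, b, hw_cost(w, b)) for w, b in product(widths, block_sizes)]

# A search would then pick one (width, block) per layer, trading the summed
# hardware cost against measured model accuracy.
for w, b, c in sorted(candidates, key=lambda t: t[2])[:3]:
    print(f"width={w}, block={b}, cost={c:.2f}")
```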
> Comparison with ASIC accelerator on performance/energy efficiency
That is an interesting question. On the methodology side, ASIC accelerators may achieve better performance or energy efficiency but are less flexible in accommodating arbitrary models than FPGA systems. On the technical side, there was no known ASIC implementation of an LLM accelerator with MXInt formats at the time of submission. The results presented in this paper provide references for future ASIC accelerator designs.