We're really grateful to the reviewers for their careful reviews and helpful feedback.
We begin by addressing concerns raised by multiple reviewers, and then respond to each reviewer in turn. We have included additional results in the attachment as an extra page.
## Common Concerns:
### Software and Hardware Co-optimization in MASE
**In the attachment**, we've included a figure illustrating how MASE IR enables automatic and concurrent software-hardware co-optimization using multi-objective search for improved results.
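To make the search concrete, below is a minimal Python sketch (illustrative only, not MASE's implementation) of the core idea behind such a multi-objective search: each candidate pairs a software choice with a hardware cost estimate, and only Pareto-optimal configurations are kept.

```python
# Minimal sketch of multi-objective (accuracy vs. hardware cost) selection.
# Candidate values are illustrative placeholders, not measured MASE results.

def pareto_front(candidates):
    """Keep configurations that no other candidate beats on both objectives."""
    front = []
    for c in candidates:
        dominated = any(
            o is not c
            and o["accuracy"] >= c["accuracy"]
            and o["latency"] <= c["latency"]
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

candidates = [
    {"precision": "fp16",  "accuracy": 0.92, "latency": 2.1},
    {"precision": "int8",  "accuracy": 0.91, "latency": 1.0},
    {"precision": "int16", "accuracy": 0.91, "latency": 1.8},  # dominated by int8
    {"precision": "msfp6", "accuracy": 0.90, "latency": 0.6},
]
print(pareto_front(candidates))  # fp16, int8 and msfp6 survive
```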
### GPU Comparison
**In the attachment**, we've added a comparison with multi-GPU works using int8 for 7B LLMs. MASE adopts a custom type named MSFP6 (block size of 16, 9-bit exponent, and 6-bit mantissa), which outperforms *all other works* in energy efficiency. The state-of-the-art `LLM.int8` on GPUs (https://arxiv.org/abs/2208.07339) uses *tensor core hardware with manually implemented CUDA kernels* for the "gather" and "scatter" operations and performs floating-point multiplications, while the FPGA solution generated by MASE achieves high energy efficiency by automatically deploying new arithmetic types.
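For intuition on the arithmetic, here is a toy NumPy sketch of block floating-point quantization in the spirit of MSFP (the exact bit layout of MASE's MSFP6 differs; the parameters below are illustrative): a block of values shares one exponent, and each element keeps only a short signed mantissa.

```python
import numpy as np

def block_quantize(x, block_size=16, mantissa_bits=6):
    """Toy block floating-point: one shared exponent per block of values,
    each element reduced to a signed mantissa of `mantissa_bits` bits."""
    blocks = x.reshape(-1, block_size)                 # length must divide evenly
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    shared_exp = np.ceil(np.log2(max_abs + 1e-30))     # shared exponent per block
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissa = np.clip(np.round(blocks / scale), -limit - 1, limit)
    return (mantissa * scale).reshape(x.shape)

w = np.random.randn(64).astype(np.float32)
print(np.max(np.abs(w - block_quantize(w))))           # worst-case quantization error
```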
### Results for CNN Models
Table VII demonstrates MASE's ability to map a variety of models, from CNNs to LLMs. MASE is particularly suited to mapping large models, maximizing parallelism across multiple accelerators, as clarified in the revision.
### Writing
We've revised the draft and improved the writing. Many thanks to Reviewers B and D for the valuable suggestions on improving our writing!
---
## Reviewer E:
> TVM, Vitis AI and ScaleHLS.
There is a misunderstanding here. TVM, Vitis AI and ScaleHLS do not support multi-device systems, while MASE does. We have clarified that in the revised related work section.
> Training in ONNX dialect MLIR + Hardware dialect MLIR in Scale HLS
There is another misunderstanding here. First, the ONNX dialect in MLIR only supports inference, not training, as stated in the official ONNX-MLIR documentation (https://github.com/onnx/onnx-mlir/blob/main/docs/mnist_example/README.md). This agrees with Table I in our paper.
Secondly, while the ONNX runtime (outside MLIR) does offer support for training ONNX-transformed models, there is currently no straightforward method or existing implementation to integrate hardware feedback into this training loop. The training-in-MLIR approach might work *in theory*, but MASE is the first working implementation.
> Section IV.B and IV.D, employ existing tools without new contribution.
In these sections, MASE only orchestrates PyTorch and a Verilog simulator named XSIM; it is unnecessary to reinvent these wheels.
Section IV.B proposes a set of training-capable MASE passes. Other frameworks such as MLIR cannot fine-tune between transformations.
Section IV.D proposes a novel hardware equivalence check flow integrated in PyTorch.
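As a conceptual illustration of this equivalence check (a hypothetical sketch; function names are placeholders, not MASE's actual interface): the output of a PyTorch layer is compared element-wise with the output of the simulated generated hardware under a quantization-aware tolerance.

```python
import torch

def check_equivalence(sw_layer, hw_simulate, x, atol=1e-3):
    """Compare a software layer's output with the simulated hardware output.
    `hw_simulate` stands in for driving the generated Verilog through XSIM
    (writing input stimuli, running the simulator, reading results back)."""
    sw_out = sw_layer(x)
    hw_out = hw_simulate(x)
    worst = (sw_out - hw_out).abs().max().item()
    assert worst <= atol, f"hardware/software mismatch: {worst}"
    return worst

layer = torch.nn.Linear(8, 8)
# A stand-in "hardware" model: here just the same layer, so outputs match exactly.
print(check_equivalence(layer, layer, torch.randn(4, 8)))
```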
> On-chip and cross-device communication is common practice
Mapping hardware on multi-accelerator systems presents significant challenges. Despite increasing efforts in model partitioning and parallelism on accelerators such as GPUs (https://dl.acm.org/doi/10.1145/3341301.3359630), the FPGA domain remains under-explored. MASE is the first attempt to automatically map an LLM of up to 7B parameters onto an efficient multi-FPGA system. We have clarified this in the revision.
> Tiling and double buffering among kernels
These techniques target different granularities than MASE: they exploit instruction-level parallelism, while MASE exploits module-level parallelism, as shown in Table I. We believe these techniques are useful (for example HIDA, accepted at ASPLOS 2024, which we have cited and now included in our related work) and could further improve MASE, but they are out of scope for this work.
> MASE relies on commercial HLS tools
As illustrated in Table III, only 10.8% of the hardware is generated by HLS tools. Most of the hardware components are manually optimized Verilog templates and will be open-sourced with MASE, contributing to the broader hardware community.
---
## Reviewer C:
> Differences between what you are proposing and current HLS workflow?
Existing HLS workflows do not involve training during hardware optimization. MASE IR, as a hardware-aware and trainable IR, allows the model to be adjusted during hardware synthesis thanks to its layer-wise hardware transparency, generating better hardware.
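As a minimal sketch of what "adjusting the model during hardware synthesis" means in practice (an illustrative PyTorch example, not MASE's API): after a hardware-motivated transformation such as quantization, a short fine-tuning step recovers accuracy before the next hardware decision is made.

```python
import torch
import torch.nn as nn

def fake_quantize(t, bits=8):
    """Uniform fake quantization: rounds values onto a 2**bits-level grid."""
    scale = t.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    return torch.round(t / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

# Illustrative loop: transform (quantize), then briefly fine-tune to recover accuracy.
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 4)

with torch.no_grad():                      # hardware-motivated transformation
    model.weight.copy_(fake_quantize(model.weight))

for _ in range(10):                        # fine-tune between transformations
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```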
> [34] outperformed MASE
We apologize for the missing clarification. The performance gap between MASE and [34] arises because [34] manually optimized small models. For instance, they quantized the model to fewer than 5 bits, while MASE uses int8 for CNN models. [34] achieves worse accuracy because of the lower bit-width, and its manual effort does not scale to large models. We have clarified this in the revision.
> How convenient is it to use MASE over conventional HLS tools?
Thanks a lot for the advice; this is a very interesting point. We highlighted in the revision that MASE can be orchestrated with a few lines of code integrated into PyTorch, as illustrated in the top-right of Fig. 2.
> New in quantization optimization
MASE as a tool flow opens up opportunities for hardware-aware customized quantization, for instance, with new arithmetic types (MSFP) and layer-wise precision, as sketched below. This is not supported by any existing general tool. (MLIR supports custom types but cannot fine-tune the model.) We have clarified this in the revision.
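For illustration, layer-wise precision means that each layer can carry its own arithmetic type and bit-width; the sketch below uses a hypothetical configuration format (not MASE's actual one) on a plain PyTorch model.

```python
import torch.nn as nn

# Hypothetical layer-wise precision assignment; the config format is illustrative.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

precision_config = {
    "0": {"type": "msfp", "block_size": 16, "mantissa_bits": 6},  # first Linear
    "2": {"type": "int", "width": 8},                             # second Linear
}

for name, module in model.named_modules():
    if name in precision_config:
        print(f"layer {name} ({type(module).__name__}) -> {precision_config[name]}")
```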
---
## Reviewer B:
> Why not inference on GPUs?
There is a trade-off here. We observed that FPGAs are beneficial for large models such as LLMs, as shown in Table VI.
> Comparison with TVM
Thanks for pointing us to this! We have added the comparison to the related work section.
> Misleading terms
Changed "mixed-precision arithmetic" to "layer-wise multi-precision"
Changed "software and hardware co-simulation" to "equivalence check"
---
## Reviewer A:
> open-source?
Yes, MASE will be fully open-sourced, as a contribution to the ML hardware community.
> Largest models and the limiting factor?
The largest model we tried is LLaMA 7B on 32 FPGAs, restricted by the number of available FPGA devices. We have added the latest results to the revision.
> Specialized IP compared TPU or Xilinx?
The TPU and Vitis AI (Xilinx) use general components, while our specialized IPs are tailored to be layer-specific, potentially leading to higher efficiency. For example, MASE allows computation in arbitrary precision, while TPUs only compute in a small set of available precisions.
> training on distributed set of fpgas
There is a misunderstanding here. MASE trains a model on GPUs and maps its inference on FPGAs.
> The QSFP has 32GB/s vs 900GB/s for NVlink.
MASE hardware is either device-count-bound or memory-bandwidth-bound. A higher-bandwidth interface would lift the latter restriction, leading to potentially better performance. In our experiments, all results are device-count-bound because of the limited number of devices we could access.
> user interface example, the usage and system design that you are proposing.
MASE provides a platform as shown in Fig. 2: software ML developers without hardware backgrounds can use MASE to efficiently accelerate their ML models in FPGA data centers. We have highlighted that in the revision.
---
## Reviewer D:
> Lower-end FPGAs
Yes, it is possible, but we have not tried it.
> Changing targets
MASE and MASE IR can map models onto GPUs and CPUs, but we focus our optimizations on FPGAs in this work.
> heterogeneous devices and multi-nodes
MASE does not currently support these, and there are open challenges. This is a very interesting direction and will be part of our future work.