# ASPLOS 2024 Rebuttal Response
We are grateful to the reviewers for their careful reviews and helpful feedback.
We especially appreciate the enthusiasm for the approach shown by Reviewer B: "Equality saturation and a set of tastefully designed heuristics for approximating code quality seem to be the right answer."
We begin by addressing some concerns raised by multiple reviewers, and then respond to each reviewer in turn.
### Common Concerns
#### Benchmarks
Finding suitable benchmarks is a perennial problem for papers that push the limits of HLS, in part because existing benchmarks tend to be tailored to what HLS tools can already comfortably handle.
SEER targets programs with complex datapath blocks, control flow, or memory access patterns. In this work, we include the subset of MachSuite benchmarks (8 out of 19) for which current HLS tools are unable to achieve optimal results. For the remaining MachSuite benchmarks, HLS tools already match expert human designers. We will describe the benchmark selection more explicitly in Section 5.1.
These benchmarks are drawn from realistic application domains such as signal processing and high-performance computing, which demonstrates that our approach is broadly applicable.
Scalability is an open challenge in both super-optimization and equality saturation. In hardware design, modular design principles greatly limit the size and scope of each optimization problem. Moreover, as Reviewer B observed, "the bar for compile times is set so low" for hardware compilation that users are willing to wait for quality results. We will clarify scalability concerns in Section 5.3 and add further comments in the future work section.
#### Why alternate between iterations of control-flow rewriting and datapath rewriting rather than combining them? Why dynamic rewrites?
This is a good question, and exploring both concurrently may offer advantages. We separated them because combining them would require significant engineering effort to reimplement all the relevant MLIR passes in Rust. To avoid reinventing the wheel, we adopt existing MLIR rewrites and orchestrate them in egg.
Specifically, we use dynamic rewrites in egg as an interface to external tools. Each dynamic rewrite will call a set of MLIR passes for analysis or transformation. We will clarify that in Section 4.3.
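To make the orchestration concrete, the sketch below shows the shape of such a dynamic rewrite: a term is extracted from the e-graph, handed to an external pass pipeline, and the result is re-inserted. This is an illustrative toy, not SEER's actual code; the names `extract_as_mlir`, `run_mlir_passes`, and `dynamic_rewrite` are hypothetical, and the external pass invocation is stubbed out.

```rust
// Hypothetical sketch of a dynamic rewrite that defers to an external
// MLIR pass pipeline instead of a syntactic pattern.

/// Stand-in for extracting a term from the e-graph and printing it as MLIR.
fn extract_as_mlir(term: &str) -> String {
    term.to_string()
}

/// Stub for invoking an external tool such as `mlir-opt` with a pass
/// pipeline; here we just tag the IR so the data flow is visible.
fn run_mlir_passes(ir: &str, passes: &[&str]) -> String {
    format!("{} // after: {}", ir, passes.join(","))
}

/// A dynamic rewrite: extract, transform externally, re-insert the result.
fn dynamic_rewrite(term: &str, passes: &[&str]) -> String {
    let ir = extract_as_mlir(term);
    run_mlir_passes(&ir, passes)
}

fn main() {
    let out = dynamic_rewrite("affine.for %i = 0 to 8 { ... }",
                              &["affine-loop-unroll"]);
    println!("{}", out);
}
```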
#### Equivalence Checks
The external rewrites adopted by SEER are not themselves verified, which could potentially lead to non-equivalent code. To ensure correctness, we use the Synopsys equivalence checker to verify equivalence between the input and output code. We will clarify this in Section 4.7.
#### Clarification of minor issues
Thank you for helping to improve the quality of the paper. We will address these issues in the revised draft.
---
### Reviewer A
> SeerLang and $\theta$ nodes
We did not use $\theta$ nodes because we do not analyze data dependences in egg. Instead, we use symbols in SeerLang to represent induction variables and memory so that they can be translated back to MLIR for analysis. We will clarify this in Section 4.2.
> One cost function for extracting to enable static analyses and the other for extraction for hardware synthesis.
Correct.
> How often is extracting in SEER?
Extraction for static analysis or transformation happens every time SEER interfaces with MLIR. This extraction is fast because it does not need to account for common sub-expressions.
Extraction for hardware synthesis is slow because this stage must account for common sub-expressions, but it happens only once, after the exploration. We will clarify this in Section 4.6.
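The distinction between the two costs can be sketched as follows. This is an illustrative toy, not SEER's cost model: a tree cost pays for every occurrence of a node (fast to compute, fine for feeding MLIR analyses), while a DAG cost counts each distinct sub-expression once, modelling operator sharing in hardware.

```rust
// Illustrative sketch of two extraction costs: a tree cost that ignores
// sharing, and a DAG cost that counts each distinct sub-expression once.

use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash)]
enum Expr {
    Var(&'static str),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Tree cost: every occurrence of an operator is paid for again.
fn tree_cost(e: &Expr) -> usize {
    match e {
        Expr::Var(_) => 0,
        Expr::Add(a, b) | Expr::Mul(a, b) => 1 + tree_cost(a) + tree_cost(b),
    }
}

/// DAG cost: structurally identical sub-expressions are counted once,
/// since they can share one hardware operator.
fn dag_cost(e: &Expr) -> usize {
    fn visit(e: &Expr, seen: &mut HashSet<Expr>) {
        match e {
            Expr::Add(a, b) | Expr::Mul(a, b) => {
                visit(a, seen);
                visit(b, seen);
            }
            Expr::Var(_) => {}
        }
        seen.insert(e.clone());
    }
    let mut seen = HashSet::new();
    visit(e, &mut seen);
    seen.iter().filter(|n| !matches!(**n, Expr::Var(_))).count()
}

fn main() {
    // (a*b) + (a*b): the multiply appears twice in the tree,
    // but only one multiplier is needed in hardware.
    let ab = Expr::Mul(Box::new(Expr::Var("a")), Box::new(Expr::Var("b")));
    let e = Expr::Add(Box::new(ab.clone()), Box::new(ab));
    assert_eq!(tree_cost(&e), 3); // two muls + one add
    assert_eq!(dag_cost(&e), 2);  // one shared mul + one add
}
```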
> SEER for FPGAs
This is an interesting direction. Yes, SEER can be used for exploring FPGA designs. The difference would be an FPGA-specific cost function for extracting an efficient datapath implementation.
> The implementation details
> LOC, language, key components of the implementation, etc.?
The key components include an egg-based framework (written in ~600 lines of Rust) and a set of MLIR passes for orchestration (written in C++ and modified MLIR passes). We will clarify that in Section 4.1.
> Missing citations
Thanks for pointing them out. We will add them.
---
### Reviewer B
> Discussion about Intel example
The example is an IP-free version of a resource-combining algorithm that has several similar but slightly different instantiations across Intel's IP. It is commonly used as write-combining logic, as part of a larger state machine that tries to maximize bus traffic. The example uses the minimization of bus accesses as its criterion and uses bits to track these resources; in the IP-free case these are byte enables.
These small variants force re-implementation, and because such resource-management state machines lie on timing-critical paths, the logic must be fast enough that dispatch decisions can be made within a few clock cycles. We will add further context in Section 5.2.
---
### Reviewer C
> Non-equivalence in traditional optimization passes.
That is a very interesting point. We only considered equivalent rewrites, but refinement would be great to discuss at the conference.
> How much time does it take to verify?
> Is it any faster than doing end-to-end equivalence checking between the original and final programs?...
The translation validation approach increases robustness, because formal equivalence-checking tools are often unable to prove a complex sequence of datapath transformations end-to-end. In many cases per-step checking may be unnecessary. We did not measure verification runtime, as it is relatively fast (minutes) and can run in parallel with downstream tools.
> Detailed optimization time
We will add further datapoints to the manuscript in Table 5.
---
### Reviewer D
> Is the tool only complete with respect to the power of the static analysis? If one cannot find an invariant, the rewrite is discarded and not explored. How often does this happen?
We cannot make any completeness claims in this work. However, as the e-graph grows, representations are often added that allow the analysis tools to discover an invariant, which in turn enables further rewrites. We could investigate rewrite discard rates in the full paper.
> Adding power to cost function
Power is challenging to estimate statically because it depends on low-level technology information from vendors. We leave this as future work.
---
### Reviewer E
> Long-running kernels
The problem scales with code size and complexity. Long-running kernels may have a loop with a large iteration count but a small code size. SEER can handle these kernels efficiently.
> Control nodes
Control nodes refer to a subset of SeerLang operations, such as `for` and `if` statements. SEER has pre-defined patterns for the cost function, so that the extractor can directly identify and extract these control nodes.
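A minimal sketch of what such pattern-based costs for control nodes might look like is below. The node names and cost model are purely illustrative (not SEER's): a `for` node is charged its body latency times the trip count, and an `if` node the slower of its branches plus one cycle for the condition.

```rust
// Hypothetical pattern-based latency costs for control nodes.

enum Node {
    Op(u64),                               // datapath op with fixed latency
    For { trips: u64, body: Box<Node> },   // counted loop
    If { then_: Box<Node>, else_: Box<Node> },
}

/// The extractor matches control nodes directly and applies a
/// control-specific cost, rather than a per-operator datapath cost.
fn latency(n: &Node) -> u64 {
    match n {
        Node::Op(l) => *l,
        Node::For { trips, body } => trips * latency(body),
        Node::If { then_, else_ } => 1 + latency(then_).max(latency(else_)),
    }
}

fn main() {
    let loop_ = Node::For { trips: 10, body: Box::new(Node::Op(3)) };
    println!("{}", latency(&loop_)); // 30

    let branch = Node::If {
        then_: Box::new(Node::Op(5)),
        else_: Box::new(Node::Op(2)),
    };
    println!("{}", latency(&branch)); // 6
}
```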
> Function calls and global variables.
Function calls are treated as black-box operations for now, which also supports modular design for better scalability. Local variables come in two forms: temporary variables are canonicalized in MLIR, and stack-allocated local variables are treated as memory. Global variables are not commonly used in HLS applications: they require a register to store their value, whose reset cannot be assigned unless it resides in a SystemC module, and they can create scheduling problems. We have not observed global variables in our benchmarks. We will add a discussion of this in the future work section.
> variable loop trip count for the cost functions
In the MLIR `affine` and `scf` dialects, the loop trip count is a compile-time constant, so we extract it directly using static analysis.
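The computation for an `scf.for`-style loop `for %i = lb to ub step s` (half-open interval, positive step, constant bounds) can be sketched as follows; the function name is illustrative, not SEER's actual code.

```rust
// Static trip count for `for %i = lb to ub step s` with half-open
// bounds and a positive, compile-time-constant step (a sketch).
fn trip_count(lb: i64, ub: i64, step: i64) -> i64 {
    assert!(step > 0);
    if ub <= lb {
        0
    } else {
        // ceil((ub - lb) / step) without floating point
        (ub - lb + step - 1) / step
    }
}

fn main() {
    assert_eq!(trip_count(0, 8, 1), 8);
    assert_eq!(trip_count(0, 10, 3), 4); // i = 0, 3, 6, 9
}
```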