# FPGA2020 Review Response
###### Tags: `Discussions`
## Review 1
>### Detailed Comments
>"Recent work [16] proposes a toolflow named Dynamatic that generates synchronous dataflow circuits from HLS code." What is HLS code? Did you mean C code?
>
>"For instance, the hardware may be stalled waiting for a new input after taking the last input data in the whole operation, resulting in a permanent stall." This is not clear to me.
>
>### Questions for Authors
>1. Please explain why the "Total Cycles" column in Table 1 sometimes gives a range, even for the SS case.
>2. For the cases where the Wall Clock Time is a range, how exactly did you choose which point to show in Figure 9?
>3. No prior work aimed at combining dynamic and static scheduling was cited. Is this an oversight, or is there really no such work? For instance, Luca Carloni of Columbia Univ. has published extensively on latency insensitive design. I believe that work proposed wrapping conventional statically scheduled blocks to serve as a node in a higher-level latency insensitive system.
### <span style="color:red">Drafted answer 1 </span>
Thank you very much for your review. In answer to your questions:
1. In general, the 'Total Cycles' column is a range because the number of cycles required depends on whether inputs take the short path or the long path through the circuit. For the SS case specifically, the cycle count should be fixed; however, when the initiation interval (II) is too large, Vivado HLS makes the circuit sequential (non-pipelined), and the cycle count then also depends on the input data distribution.
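As an illustration of why the cycle count becomes data-dependent in the non-pipelined case, the following minimal sketch models a sequential loop whose per-iteration latency depends on which branch each input takes (the latencies and input counts are invented for illustration, not measured values from the paper):

```python
# Minimal model of a non-pipelined (sequential) loop whose per-iteration
# latency depends on the branch each input takes.
# The latencies below are illustrative, not values from the paper.
SHORT_PATH_CYCLES = 3   # hypothetical latency of the short branch
LONG_PATH_CYCLES = 10   # hypothetical latency of the long branch

def total_cycles(inputs_take_long_path):
    """Sum per-iteration latencies for a given input distribution."""
    return sum(LONG_PATH_CYCLES if long else SHORT_PATH_CYCLES
               for long in inputs_take_long_path)

n = 100
best = total_cycles([False] * n)   # all inputs take the short path
worst = total_cycles([True] * n)   # all inputs take the long path
# 'Total Cycles' for this circuit is then the range [best, worst]
```

Any input mix falls between these two extremes, which is exactly why a single number cannot be reported.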
2. Fig. 9 is intended to give an overview of the effectiveness of our approach, complementing the detailed results in Table 1. In the figure, we show three arrows for each benchmark: the best case (all inputs take the short path), the worst case (all inputs take the long path), and a middle case (half short, half long). These arrows are normalised to the corresponding SS solutions. Prompted by your remark, we have now improved this graph so that each benchmark's arrows have a distinct colour.
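The derivation of the three points per benchmark can be sketched as follows (the wall-clock numbers are hypothetical placeholders, not results from Table 1):

```python
# Sketch of how the three arrows per benchmark in Fig. 9 are derived:
# wall-clock times for the best / middle / worst input mixes, each
# normalised to the statically scheduled (SS) baseline.
# All numbers below are hypothetical.
def normalised_points(dss_best, dss_middle, dss_worst, ss_baseline):
    """Return (best, middle, worst) DSS times relative to SS."""
    return tuple(t / ss_baseline for t in (dss_best, dss_middle, dss_worst))

# e.g. a benchmark where DSS is faster than SS for every input mix:
points = normalised_points(40.0, 60.0, 80.0, 100.0)
# values below 1.0 mean DSS outperforms the SS baseline
```

Normalising all three points to the same SS baseline keeps the arrows for different benchmarks directly comparable on one axis.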
3. Thank you for pointing out the relevant work of Carloni. We will include it in our Background section. Carloni described how to encapsulate static modules into a latency-insensitive system, and we use a similar integration philosophy. Yet, as far as we can tell, nothing in Carloni's work relates to the automatic generation of circuits from high-level code or to the mixing of the two HLS paradigms within a single synthesis, which is our contribution here.
## Review 2
>### Detailed Comments
>
>This paper is an interesting extension of the recent work in this conference on dynamic scheduling techniques in HLS. In this paper, the authors contend there are places where static scheduling is more efficient (primarily in area) and others where dynamic scheduling improves performance. This paper does experiments illustrating that hypothesis, and then builds a methodology to express the dynamic/static boundary and connect those two domains.
>
>I'm disappointed that this work does not yet fully automate the process: the user determines the Iteration Interval, as well as the Dynamic/Static boundaries. This is left to future work, and even if it was complete, probably would not fit in 10 pages.
>
>This paper's contributions are (1) the illustration of the case for hybrid dynamic/static solutions (2) discussion of the interface synthesis between the static and dynamic.
## Review 3
>### Detailed Comments
>Overall, this paper is well-written. Experimental results seem to be comprehensive. Moreover, the key idea is quite easy to understand and the target problem is also quite important.
>
>However, the pros and cons are quite well-known. Also, the Merge, Fork, Join, and Branch classifications are extensively studied in asynchronous logic design research. I suggest the authors can reduce the amount of space to discuss all the above. Instead focusing on the key novelty of this research. In my opinion, maybe the most important technical idea is the part of how to partition a given logic design into SS part and DS part. Unfortunately, there is very little discussion about how to make the best choice of this partition. Clearly, this partitioning is very challenging to optimize because it involves solving a quite complex optimization problem.
>
>My suggestions:
>
>1. Focus much more on discussing how to optimize the DS/SS partition and justify why.
>
>2. The results presented are quite extensive. However, the depth of results analysis can be improved. For example, how does one further understand the correlation between performance improvements and the nature of the circuits under consideration. How the compute and memory access patterns of these benchmark designs impact the partitioning choices.
### <span style="color:red">Drafted answer 3 </span>
Thank you for the suggestions for improving our evaluation. When revising the paper, we will rank our benchmarks based on the amount of code that is synthesised using SS, the resource-sharing opportunities, the presence of irregular control flow, and the use of variable-latency operations, and we will include a graph showing how these metrics affect the quality of the generated hardware. The table below summarises these properties for our benchmarks:
| Benchmark | Resource sharing opportunities | Irregular control flow | Variable latency |
| --- | --- | --- | --- |
| sparseMatrixPower | No | Yes | Yes |
| histogram | No | Yes | Yes |
| BNNKernel | Yes | Yes | No |
| getTanh(double) | Yes | Yes | Yes |
| gSum | Yes | Yes | Yes |
| gSumIf | Yes | Yes | Yes |
| getTanh | Yes | Yes | Yes |
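One simple way such a ranking could be computed is sketched below; the property flags come from the table above, but the equal weighting of properties is an assumption for illustration, not the criterion we will necessarily use in the revision:

```python
# Hypothetical scoring sketch: rank benchmarks by how many DSS-friendly
# properties they exhibit. Flags mirror the table above (resource sharing,
# irregular control flow, variable latency); equal weighting is assumed.
BENCHMARKS = {
    "sparseMatrixPower": (False, True, True),
    "histogram":         (False, True, True),
    "BNNKernel":         (True,  True, False),
    "getTanh(double)":   (True,  True, True),
    "gSum":              (True,  True, True),
    "gSumIf":            (True,  True, True),
    "getTanh":           (True,  True, True),
}

def dss_score(flags):
    """Count how many DSS-relevant properties a benchmark has."""
    return sum(flags)

# Benchmarks with more DSS-friendly properties rank first.
ranked = sorted(BENCHMARKS, key=lambda b: dss_score(BENCHMARKS[b]),
                reverse=True)
```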
### <span style="color:red">Drafted answer for all reviews</span>
We thank all the reviewers for the time and expertise they have invested in these reviews.
Reviewers 2 and 3 both point out the problem of partitioning a program between DS and SS. Now that we have established that DSS can be of great value, our natural next step is to automate this partitioning. We already make some general observations (in Section 4.1 and at the end of Section 4.3) about which properties of a program make it most amenable to DSS; the challenge will now be to identify these properties automatically, which we expect to be difficult given the range of patterns possible in arbitrary input code.
(word count: 402)