# FPL 2021 Rebuttal
----------------------- REVIEW 1 ---------------------
SUBMISSION: 24
TITLE: (Short Paper) Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining
AUTHORS: Jianyi Cheng, John Wickerson and George Constantinides
----------- Overall evaluation -----------
> The paper addresses an important problem using a simple, yet elegant, idea. Its introduction and motivation is well-written. The rest of the paper suffers from a slight lack of rigor. Especially the mathematical formulation of the problem could be more precise up to the point where the correctness of specific statements is questionable. In particular, in contrast to its claim in the conclusions, the paper does not formally prove the correctness of the schedule, although a sufficient description is given to make the proof obvious. The performed experiments are short, but convinving.
The following points should be reviewed again by the authors.
> (1) "The same operation in two consecutive iterations has a constant time difference"
This is not necessarily the case. The II only gives a (constant) upper bound for the time difference.
> (2) "There are also dynamic pipelining...", "Then condition of the traditional..."
These sentences are broken
> (3) "A statement k always takes Lk' cycles to execute."
Lk' is only an upper bound. The model assumes that a statement takes Lk' cycles to execute.
> (4) "the total execution time of the loop is"
Again, this is only an upper bound. I would suggest "is bounded by".
> (5) "The dependences are also restricted as D'"
I do not see what the "also" refers to. I think it should be removed.
> (6) "The dependence constraints that need to be solved are approximated as Eq. 6 by substituting ..."
This argumentation goes into the wrong direction. It suggests that Eq. 6 follows from Eq. 2, while actually only the reverse is true (and only the reverse is important for the correctness).
> (7) "A novelty of our formulation is that we introduce D''..."
D'' is the same as D, the novelty lies in changing the constraints from (2) - or, respectively, (6) - to (7)
> (8) "that have the code patterns where Eq. 7 exists"
The equation always exists.
> (9) Typos etc
> - "returns an arbitrary (k,n) from loop"
> - "two call instructions returns two arbitrarily"
> - "the same array at the asme address"
> - "program excludes the cases"
> - "exists and appear"
> - "in realistic applications 1." (should there be a blank?)
> - "or gives the wrong results" (plural)
> - "the distance distance"
To Reviewer 1:
Thanks for Reviewer 1's comments. We agree that the goal of this paper is hardware optimization. We have now rephrased that in the conclusions.
(1) Thanks. Fixed.
(2-5) Fixed.
(6) We really appreciate your efforts on this, and it has now been corrected.
(7) Thanks a lot for helping with this. We have now reclaimed it as reviewer 1 suggested.
(8) Thanks for pointing it out. This is a simple typo, which should be the unnumbered equation right after Eq. 7 (now numbered to Eq. 8).
(9) All fixed!
----------------------- REVIEW 2 ---------------------
SUBMISSION: 24
TITLE: (Short Paper) Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining
AUTHORS: Jianyi Cheng, John Wickerson and George Constantinides
----------- Overall evaluation -----------
> This short paper proposes a technique and implementation to prove correct initiation intervals for modulo scheduling of data dependent operations in a loop. Traditional HLS techniques are conservative in analyzing such data access patterns. The authors describe how to encode the program semantics, in particular the memory access patterns, for Boogie, an SMT solver. They describe the scheduling constraint Eqn. 7 in Boogie and the SMT solver proves II validity. The authors show significant speedups on four kernels that exhibit such data access patterns.
> Some of the terms in the equations in Section IV.B are not introduced or clearly explained. For example, "d" and D' in Equation 5. What is P_n,alpha_k?
> It isn't clear how Equations 6 and 7 are derived. Could the authors describe them with more intermediate steps to improve clarity?
> The idea of encoding program semantics and proving the modulo scheduling constraint via an SMT solver is interesting. The paper could be improved with a clearer description and a more thorough evaluation against more kernels.
To Reviewer 2:
$d$ is $n_1 - n_2$, also known as the dependence distance.
$D'$ is an approximated set of D.
$P$ is the II and $\alpha_k$ is the offset of a statement.
We have now added the definitions to the paper.
Here is the additional clarification:
Substituting the approximation of modulo scheduling (3):
$t_{n, k} = \alpha_k + Pn, \alpha_k \geq 0$
into (2):
$t_{n_2, k_2} \geq t_{n_1, k_1} + L_{n_1, k_1}$
The dependence constraint becomes:
$\alpha_{k_2} + Pn_2 \geq \alpha_{k_1} + Pn_1 + L_{n_1, k_1}$
i.e.:
$\alpha_{k_2} \geq \alpha_{k_1} + P(n_1-n_2) + L_{n_1, k_1}$
The approximation of modulo scheduling (4):
$L'_k = \max_n L_{n, k}$
restricts the dependence constraint into (7):
$\alpha_{k_2} \geq \alpha_{k_1} + P(n_1-n_2) + L'_{k_1} \geq \alpha_{k_1} + P(n_1-n_2) + L_{n_1, k_1}$
The approximation of modulo scheduling has $d = \min_{n_1, n_2} (n_2 - n_1)$, so the dependence constraint becomes (6):
$\alpha_{k_2} \geq \alpha_{k_1} + P(n_1 - n_2) + L'_{k_1} \geq \alpha_{k_1} - Pd + L'_{k_1}$
We have now added further clarification in the paper.
----------------------- REVIEW 3 ---------------------
SUBMISSION: 24
TITLE: (Short Paper) Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining
AUTHORS: Jianyi Cheng, John Wickerson and George Constantinides
----------- Overall evaluation -----------
> This paper an extension to prior work on static schedulers with a fully automated pass that exposes and takes advantage of the correlation between values to improve performance but with a substantial area overhead. The paper is fairly well written and motivated overall, and this review considers that this is a short paper submission.
> Even with the consideration of a short paper, the theoretical side of the paper could be stronger, and the formal proof should be included.
> Experimental evaluations could cover more benchmarks. Comparisons should be shown as a graph to provide an easier comparison. There are concerns over the large area overhead.
To Reviewer 3:
Thanks for reviewer 3's comments. We have now added more detailed proof (included in the response to Reviewer 2) and further clarification in the paper.
Thanks again for your advice. We have now added a graph comparison of our experiments. The reviewer is quite right that there is indeed an area cost incurred by our approach. However, as the graph clearly shows, the reduction in delay greatly outweighs this (at least on our benchmarks).

----------------------- REVIEW 4 ---------------------
SUBMISSION: 24
TITLE: (Short Paper) Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining
AUTHORS: Jianyi Cheng, John Wickerson and George Constantinides
----------- Overall evaluation -----------
> This impressive paper presents a new method for loop pipelining
where the loop body has a range of execution times with associated
data dependencies and other standard unwinding techniques cannot
be applied.
> The analysis is well presented and seems sound. The performance gain
with the presented example is impressive, although how often these
situations really arise is not explored, but then this is a short paper.
> "Then condition of the traditional modulo" --> "The condition of the traditional modulo"
To Reviewer 4:
Fixed. Thanks for pointing it out!