# Message to AC: complaint about reviewer 4fV3
Dear AC,
We reach out to you concerning the behavior of reviewer **4fV3**, which, as far as we can tell, does not comply with the basic [reviewer guidelines](https://neurips.cc/Conferences/2022/ReviewerGuidelines/). In short, the review is not factual: it is vague and at times impolite. Furthermore, its arguments are flawed and contradict both the results of our work (while not claiming any mistake in them) and the reviewer's own earlier comments. Finally, the reviewer asks for arbitrary extensions.
Therefore, we would like to kindly point to the reviewer guidelines, which mention that
> superficial, uninformed reviews are worse than no review as they may contribute noise to the review process.
It also seems very surprising to us that such a reviewer assigned themselves a confidence level of 5.
Let us be factual.
First and foremost, the reviewer implies, in a rather impolite manner, that we did not do our job carefully:
> If you read the complexity theory of convex optimization more carefully.
We quote here the entire sentence:
> I disagree when the authors say that their case should be compared with the non-smooth case. If you read the complexity theory of convex optimization more carefully, you will realize the main issue for worse complexity for non-smooth problems is that function can be non-smooth at the optimal solution. Hence, if the setting removes that possibility, it is a way to allow acceleration.
Secondly, the reviewer's statements are factually wrong and unsupported by any evidence or reference. More precisely, the reviewer claims that the difficulty of the non-smooth convex optimization setting comes from non-smoothness at the optimum. This claim is factually wrong, as shown by the following arguments:
- First of all, the lower-bound functions used in the non-smooth setting are generally based on the infinity norm, which is non-smooth not only at the optimum, and the lower-bound proof heavily relies on this non-smoothness. Non-smoothness away from the optimum is exploited in the proof; see, e.g., Appendix A of Drori, Yoel, and Marc Teboulle, "An optimal variant of Kelley's cutting-plane method," Mathematical Programming 160.1 (2016): 321-351 ([arXiv link](https://arxiv.org/pdf/1409.2636.pdf)). It is of course easy to exhibit convergence problems when the non-smoothness appears at the optimum, and any lecture on non-smooth convex optimization contains a one-dimensional counter-example of this kind. However, the story does not stop there, and the reviewer does not seem to be aware of that.
- Secondly, we prove precisely the opposite in our paper. Indeed, we consider convex functions that are possibly non-smooth and only assume smoothness at the optimum. According to the reviewer, this removes all the difficulty caused by non-smoothness. Yet we show (Theorem 2.3) that no method can accelerate on $QG^+$ up to the accelerated $O(1/t^2)$ guarantee achieved on smooth functions. As the reviewer did not claim any mistake in our results, the flaw lies in the reviewer's own reasoning.
As far as we know, no work in the literature contains this "non-smooth at the optimum" vs. "non-smooth away from the optimum" discussion. In the completely non-smooth case, the best rate is $O(1/\sqrt{t})$; on smooth functions, it is $O(1/t^2)$. We prove precisely that removing non-smoothness at the optimum leads to a better rate, and that one nevertheless cannot reach the smooth rate. We therefore bring evidence that both regions of non-smoothness (at the optimum and away from it) matter.
In conclusion, we feel that the reviewer made incorrect claims with confidence, without trying to justify their point of view, and in a rather unpleasant and unprofessional manner. Let us also quote:
> Overall, I think with the right steps (some of which authors already allude to), exploring QG function class might be worthwhile.
which seems to suggest that the reviewer could do a better job themselves. Of course, no indication of how this could be done is provided.
Moreover, after we introduced a class of functions, provided interpolation conditions, tight guarantees for gradient descent with and without Polyak-Ruppert averaging, tight guarantees for heavy-ball with and without knowledge of the class, and a tight lower bound for first-order optimization, then extended all of this to an even larger class of functions and provided linear convergence rates under a lower-bound assumption playing the role of strong convexity (but much weaker), the reviewer considers that this work does not bring sufficiently many new results and decides that we must additionally include constrained first-order optimization on this class of functions:
> Also, (in my opinion), the ideas for expanding for set constraints should be included in this paper because otherwise, the QG^+ formulation is too narrow. This will require a non-trivial extension of current ideas proposed in the authors’ response.
In our opinion, this is a completely different piece of work, and the reviewer has no particular authority to decree what is interesting and what is not. This again goes against the reviewer guidelines, as those statements are entirely subjective and not backed up by any fact.
Therefore, we hope you will take these arguments into consideration when assessing our work and making the final decision.
We apologize for the length of this message and thank you in advance for your time.
Best regards.
# General answer to all reviewers:
We thank all the reviewers for their careful assessments of our paper. We are glad that they appreciated and underlined several of its qualities, including technical contributions (particularly the adaptivity of the proposed heavy-ball method) as well as the writing clarity, soundness, and honesty.
Reviewers LChT and oxLY gave positive reviews (6), and reviewers 4fV3 and iuCD gave borderline scores (5 and 4); we hope that our answers will improve their assessment of our work.
Before going into detailed answers for all reviewers, we would like to emphasize two points that motivated this work and that we feel were somewhat underrated by the reviewers, probably due to our writing.
(1) First, a lot of emphasis seems to be put on comparing our results to the case of smooth convex minimization, although the setup itself has more to do with the non-smooth convex minimization setting. As the class of $QG^+$ functions contains non-smooth functions, we believe that the reference convergence rate should be that of subgradient methods, $O(1/\sqrt{t})$, and not the $O(1/t^2)$ of smooth convex minimization. Very interestingly, our heavy-ball algorithm is adaptive to the class of problems: it simultaneously achieves the best rate on Lipschitz non-smooth functions and on the class of $QG^+$ functions, in both cases without requiring knowledge of the problem class parameters. As a consequence, one can use the adaptive heavy-ball on non-smooth functions and benefit from the fact that the objective function might as well belong to $QG^+$ (and hence benefit from a "free" $O(1/k)$ rate). **In short, the heavy-ball algorithm does not need to know which class of problems it is applied to.**
(2) Second, we wanted to introduce a class strictly larger than the class of smooth convex functions (note that the class of Lipschitz convex functions is not strictly larger, as it does not contain all smooth convex functions) that does not suffer from the condition discontinuity discussed in [Guille-Escuret et al. (2021)].
**Line-search**: One question shared by reviewers 4fV3 and iuCD was **the difficulty of implementing a line-search within the heavy-ball method**. One-dimensional line-searches are generally not hard to implement for convex problems: they can be performed by a large variety of methods, such as binary search or golden-section search.
Moreover, the specific line-search used in the heavy-ball method does not actually need to be very accurate: the condition of Theorem 2.3 (which allows guaranteeing the target convergence property) is satisfied for a whole interval of stepsize values (denoted by $\alpha$), which includes the value of $\alpha$ obtained by an exact line-search. In other words, the line-search is a priori somewhat robust.
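As an illustration, the kind of one-dimensional search we have in mind can be sketched as follows (a minimal golden-section search of our own; the function `phi` stands for the hypothetical restriction $\alpha \mapsto f(y_k + \alpha v_k)$):

```python
import math

def golden_section(phi, lo, hi, tol=1e-10):
    """Minimize a one-dimensional convex function phi over [lo, hi]
    by golden-section search: the bracket shrinks by ~0.618 per iteration."""
    invphi = (math.sqrt(5) - 1) / 2  # inverse golden ratio
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)  # interior probe points, c < d
        d = a + invphi * (b - a)
        if phi(c) < phi(d):
            b = d                 # by convexity, the minimizer lies in [a, d]
        else:
            a = c                 # otherwise it lies in [c, b]
    return (a + b) / 2

# Example on a convex, non-smooth 1D function minimized at t = 1.3:
alpha = golden_section(lambda t: (t - 1.3) ** 2 + abs(t - 1.3), 0.0, 3.0)
```

Note that this sketch re-evaluates both probe points at every iteration for simplicity; the classical variant reuses one of the two evaluations.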
Please note that a few updates were incorporated into the paper and highlighted in green, while respecting the page-limit constraint. If the paper is accepted, the extra page will be used to address all remaining concerns.
We thank the reviewers again and hope that our answers clarify the corresponding points and resolve their concerns. We would be happy to provide further insights in case we misunderstood a question, or if the reviewers have any other remaining questions. Should our answers be found convincing, we would be happy if the reviewers could update their scores.
# Specific answer to official review of paper12116 by reviewer LChT
We thank the reviewer for their careful reading and assessment of our work; we appreciate the very constructive feedback.
(1) (**Title of section 2.3 is not proper**) We are not sure what the reviewer means by “not proper”, but we would be happy to take any suggestion.
(2) (**Why Algorithm 3 satisfies Eq. (7) is not very clear**) Algorithm 3 verifies the condition of Theorem 2.4 because, under the Algorithm 3 update, the RHS of the inner product term in Eq. (7) is a multiple of $\sum_{i=0}^{k-1} g_i$, which is precisely the search direction. Therefore, one can choose $g_k$ as a subgradient of $f$ at $x_k$ that is orthogonal to the search direction, and this is precisely the requirement made in Algorithm 3. We have added a proof of this in Appendix C, following that of Theorem 2.4 (updates are visible in green).
(3) (**I believe that section 2 includes many results. Sections 2.4 seems not a fit in this section**) We will improve the presentation and flow of Section 2 as suggested by the reviewer. In particular, we will move Section 2.4 to a new Section 3 in the camera-ready version of the paper. The result of Section 2.4 is very important, as it allowed us to derive all the bounds presented in this work using the performance estimation problem (PEP) framework, but we acknowledge that Section 2 was probably not the best place for it.
We thank the reviewer again and hope that our answers clarify the corresponding points and resolve their concerns. We would be happy to provide further insights in case we misunderstood a question, or if the reviewer has any other remaining questions. Should our answers be found convincing, we would be happy if the reviewer could update their score.
# Specific answer to official review of paper12116 by reviewer 4fV3
We thank the reviewer for their careful reading and assessment of our work; their summary and questions are very accurate. We would like to emphasize a few elements hereafter: (i) the class of $QG^+$ problems we study already contains non-smooth problems; (ii) the heavy-ball algorithm we propose is adaptive to Lipschitz and $QG^+$ properties, but not to smoothness; (iii) under an additional weak assumption, we obtain linear convergence rates that can beat those obtained under strong convexity and smoothness, even as $T$ tends to infinity. We also discuss (iv) the constrained case and (v) details of the line-search. In more detail:
(i) (**Non-smoothness**) First and foremost, we would like to clarify that $QG^+$ functions are not necessarily smooth: they only need to be "smooth" at the optimum. The $h$-$RG^+$ class relaxes the latter assumption.
(ii) (**Can you discuss the $L_S$ vs $L_Q$ issue above?**) The reviewer raises the fact that our proposed algorithm does not reach $O(1/T^2)$ on smooth convex functions. We agree: as is, our proposed algorithms do not achieve $O(1/T^2)$ on smooth convex functions. The important point to stress, we believe, is that our class should rather be compared to classes of non-smooth functions, such as the Lipschitz continuous ones. On the latter, first-order methods can reach at best $O(1/\sqrt{t})$, while on our class we achieve $O(1/t)$. To the best of our knowledge, no known algorithm tuned for Lipschitz functions accelerates on smooth functions; similarly, the algorithms we propose, which achieve the optimal rate on $QG^+$ functions, do not accelerate on smooth functions either. If one really wants an algorithm reaching $O(1/T)$ over $QG^+$ convex functions as well as $O(1/T^2)$ over smooth convex functions, one can consider Algorithm UM (Corollary 6) of [13] (Drori, Taylor 2020), which is optimal on both Lipschitz and smooth functions, and therefore also on our class, achieving optimality on all three classes (at the cost of three-dimensional span searches). We are happy to provide more elements on this discussion if the reviewer wishes.
(iii) (**Hence, even if they have $L_S >> L_Q$ (line 57), the effect of LS can be reduced effectively using accelerated convergence of $L_S/T^2$ compared to $L_Q / T$**)
First, we agree that for $T$ large enough, $L_S/T^2$ will always be smaller than $L_Q/T$. However, $L_S/L_Q$ can be arbitrarily large, and hence the minimal number of iterations to perform before observing $L_S/T^2 < L_Q/T$ can be arbitrarily large as well (and our work is applicable even when $L_S$ does not exist).
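The crossover horizon is in fact explicit: $L_S/T^2 < L_Q/T$ exactly when $T > L_S/L_Q$. A small illustrative sketch (the parameter values are hypothetical, chosen only to show how large the crossover can be):

```python
import math

def crossover_iteration(L_S, L_Q):
    """Smallest integer T for which the accelerated smooth bound L_S / T**2
    strictly beats the QG+ bound L_Q / T, i.e. the smallest T > L_S / L_Q."""
    return math.floor(L_S / L_Q) + 1

# With an ill-conditioned smoothness constant, the crossover can be huge:
T = crossover_iteration(L_S=1e6, L_Q=10.0)  # T = 100001 iterations
```

Before iteration $T$, the $QG^+$ bound $L_Q/T$ is the tighter of the two.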
Perhaps more importantly, even though our paper focuses on the sublinear convergence regime, we briefly show in Appendix G that when the functions are also $QG^-$, we obtain a linear convergence rate. In that case, the impact of the problem parameters is *independent of the horizon*. Indeed, the worst-case convergence rates for (a) $L_S$-smooth and (b) $QG^+$ functions are driven by (a) $\sqrt{L_S/\mu}$ and (b) $L_Q/\mu$, respectively. Consequently, if $L_Q/\mu < \sqrt{L_S/\mu}$, our algorithm leads to better performance than the optimal algorithm for smooth strongly convex functions, whatever the number of iterations.
(iv) (**Can you also discuss the possibility of extending this framework to at least set constrained problems?**) The constrained case is very interesting but indeed not studied in this paper. We thank the reviewer for pointing out that, in general, Definition 1.1 can be irrelevant for constrained minimization problems. Indeed, if the optimum is not an interior point, then it may not admit the zero vector as a subgradient. Therefore, to study this setting, we need to slightly modify the definition of our class to take this into account while keeping its spirit. This can be done by simply adding a term: $f(x) \leq f_\star + \left< g_\star, x - x_\star \right> + \frac{L}{2}\|x-x_\star\|^2$. This leads to replacing Equation 9 of the interpolation conditions (Theorem 2.6) by $\frac{1}{2L}\|g_i - g_\star\|^2 \leq f_\star - f_i - \left< g_i, x_\star - x_i \right> + \frac{L}{2}\mathrm{dist}(x_\star + \frac{1}{L}(g_i - g_\star), \mathcal{X})^2$, where $\mathcal{X}$ is the convex constraint set. This can be used in the PEP framework, but we decided to leave it for future work, as the analysis will be completely different from what we did in this work.
(v) (**The line-search in the heavy-ball method is difficult to implement. Can you comment on using the Armijo rule in place of exact line-search in the heavy-ball method?**) While this question is legitimate, we believe there is no specific reason for the Armijo rule to work here. Indeed, the Armijo rule works when a Lipschitz parameter can be "checked" between consecutive iterates; here, no such Lipschitz constant is even assumed to exist.
That being said, we do not agree with the reviewer on the difficulty of running the line-searches, which are one-dimensional and convex: one can use many methods, ranging from binary search or golden-section search to closed-form formulas, depending on the problem at hand. Furthermore, to satisfy Eq. (7), we only need $\alpha$ in Algorithm 3 to lie between $1/L$ and the exact minimizer $\arg\min_\alpha f(y_k + \alpha v_k)$, so there is a range of acceptable parameters for which the guarantee holds (meaning that our guarantee is not tied to the very particular choice of the exact line-search parameter: it is "more robust" than that in practice). Therefore, it is very realistic to replace the exact line-search by an inexact one in practice; the algorithm is expected to be robust to this change. We leave the study of specific line-search techniques (say, requiring the directional subgradients to be smaller than some $\epsilon_k$) for future work, but an advantage of our technique relying on SDPs is that this study can be carried out directly using those SDPs as well.
We thank the reviewer again and hope that our answers clarify the corresponding points and resolve their concerns. We would be happy to provide further insights in case we misunderstood a question, or if the reviewer has any other remaining questions. Should our answers be found convincing, we would be happy if the reviewer could update their score.
# Specific answer to official review of paper12116 by reviewer oxLY
We thank the reviewer for their careful reading and constructive comments, which we did our best to address below. Their summary is also accurate, and we thank them for the time spent.
**Weaknesses 1-3** (**Examples**): We agree with the reviewer that this work is mostly theoretical. It nevertheless introduces a class of functions strictly larger than the class of smooth convex functions (which is not the case of the class of Lipschitz convex functions). All the construction examples provided after Definition 1.1 are $QG^+$ convex but not smooth convex. One can easily build many $QG^+$ convex functions that are not smooth (see the answer to **Weakness 4**), but we acknowledge that most well-known machine learning models already have their own specialized optimization techniques.
**Weakness 4** (**$L_S$ vs $L_Q$**): $L$ is a generic way of denoting the class parameter; $L_Q$ and $L_S$ denote the parameters for $QG^+$ and for smooth functions, respectively. We decided not to dive into the details of the condition discontinuity, as this is already explained in [Guille-Escuret et al., 2021]. An example of condition discontinuity is provided by Guille-Escuret et al. (2021, Equation 4), who consider a function $f_\varepsilon$ whose second derivative is $2 + 2/\varepsilon$ on $[1, 1+\varepsilon^2]$ and $2$ elsewhere. As $\varepsilon$ approaches $0$, the smoothness parameter of this function diverges to $\infty$, while the function remains close (both in function and gradient values) to the $2$-smooth function $x \mapsto x^2$. On the other hand, its $QG^+$ parameter remains close to $2$ for any $\varepsilon$ close to $0$. As a consequence, for such a function, applying the optimal algorithm for the $QG^+(L_Q=2)$ class leads to *faster* convergence than applying an optimal algorithm for smooth functions (e.g., NAG), which would use a very large smoothness parameter $L_S$ and thus a very small stepsize.
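For the record, this discontinuity is easy to check numerically. The sketch below (our own illustration, not taken from the paper) reconstructs the gradient of $f_\varepsilon$ from its second derivative and compares it to that of $x \mapsto x^2$: the smoothness constant $\sup f_\varepsilon'' = 2 + 2/\varepsilon$ blows up, while the gradient gap stays of order $2\varepsilon$:

```python
import numpy as np

for eps in [0.1, 0.05, 0.01]:
    xs = np.linspace(0.0, 2.0, 400_001)          # grid of step 5e-6 on [0, 2]
    bump = (xs >= 1.0) & (xs <= 1.0 + eps**2)
    f2 = np.where(bump, 2.0 + 2.0 / eps, 2.0)    # second derivative of f_eps
    L_S = f2.max()                               # smoothness constant = 2 + 2/eps
    grad = np.cumsum(f2) * (xs[1] - xs[0])       # f_eps'(x) as the integral of f_eps''
    grad_gap = np.abs(grad - 2.0 * xs).max()     # deviation from (x^2)' = 2x, ~ 2*eps
    print(f"eps={eps:g}: L_S={L_S:g}, gradient gap ~ {grad_gap:.3g}")
```

The gap between $f_\varepsilon'$ and $2x$ is the integral of $2/\varepsilon$ over an interval of length $\varepsilon^2$, hence roughly $2\varepsilon$, which vanishes even as $L_S$ diverges.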
**Weakness 5** (**Gathering terms in Theorem 2.2 statement**): In Theorem 2.2, we separated the two occurrences of $L$ because we wanted to group $L$ and $\gamma$ together, as $L\gamma$ is dimension-free. Writing $L^2 \gamma$ sometimes leads to the conclusion that the guarantee grows quadratically with $L$, which is not true, as $\gamma$ is typically proportional to $1/L$. However, we do not have a very strong opinion on this, and if the reviewers think it would be better to gather all occurrences of $L$ together, we will do so for the camera-ready version.
**Weakness 6** (**Presentation of section 3**): We will reorganize Section 3 as well as add previously discussed explanations as suggested by the reviewer and thanks to the extra page of the camera ready version.
**Weakness 7** (**Interest of $h$-$RG^+$ class**): Concerning the $h$-$RG^+$ class, typical examples are $\ell_1$-regularized problems (or TV-regularized ones), as non-smoothness is enforced at the optimum. The discussion on the $h$-$RG^+$ class will be placed in a separate section starting with an introductory example (e.g., the TV-$L_2$ denoising problem).
**Question 1** (**Verifying assumptions in practice**): Verifying that a function belongs to this class and estimating $L$ is not an easy task in general, just as it is not for smooth functions. Sometimes, $L$ can be estimated from the design of the problem; but in general, if the optimized function is a black box, there is no way to be sure it belongs to this class. Note that the same problem holds for smoothness, even though some workarounds have been proposed (smoothness only needs to be estimated "between the iterates" in practice, e.g., using backtracking). Note, however, that in this work we propose one algorithm (Algorithm 3) that does not require the knowledge of $L$.
**Question 2** (**How to choose between the heavy-ball method and the subgradient method?**): The heavy-ball algorithm converges at least as fast as the GD algorithm in the worst case. Therefore, we would use the former in general. Moreover, there exists a line-search version of HB, while we showed that the line-search version of GD does not converge.
**Question 3** (**About constrained optimization**): This is a really good question: constrained optimization is of great interest, and we thank the reviewer for asking. As reviewer 4fV3 pointed out, if the optimum is not an interior point, then it may not admit the zero vector as a subgradient. Therefore, to study this setting, we need to slightly modify the definition of our class to take this into account while keeping its spirit. This can be done by simply adding a term: $f(x) \leq f_\star + \left< g_\star, x - x_\star \right> + \frac{L}{2}\|x-x_\star\|^2$. This leads to replacing Equation 9 of the interpolation conditions (Theorem 2.6) by $\frac{1}{2L}\|g_i - g_\star\|^2 \leq f_\star - f_i - \left< g_i, x_\star - x_i \right> + \frac{L}{2}\mathrm{dist}(x_\star + \frac{1}{L}(g_i - g_\star), \mathcal{X})^2$, where $\mathcal{X}$ is the convex constraint set. This can be used in the PEP framework, but we decided to leave it for future work, as the analysis will be quite different from what we did in this work.
We thank the reviewer again and hope that our answers clarify the corresponding points and resolve their concerns. We would be happy to provide further insights in case we misunderstood a question, or if the reviewer has any other remaining questions. Should our answers be found convincing, we would be happy if the reviewer could update their score.
# Specific answer to official review of paper12116 by reviewer iuCD
We thank the reviewer for their careful reading and for underlining the soundness and clarity of our work. We also thank them for pointing out the typos.
**Regarding the novelty and impact of our work**
We find the claim about the lack of novelty and the incrementality of the paper rather harsh: the class of functions under consideration is new, as are the interpolation conditions, convergence rates, and lower bounds. The proof techniques are indeed standard, but it follows from performance estimation techniques that any worst-case result for such first-order methods can be formulated in these terms (i.e., as linear combinations of inequalities).
We hereafter provide some elements answering the questions raised by the reviewer:
**Question 1** (**Orthogonal subgradient in line-search**): (a) By definition of the line-search, such a subgradient orthogonal to the search direction indeed exists whenever the line-search problem has an optimal point (the problem is closed and convex). We thank the reviewer for pointing this out, as we realized we had forgotten to write down an assumption: the set of minimizers is compact (now visible in green in the paper at l.17). This assumption ensures that the objective function is coercive (by convexity), so that each line-search problem has a solution; this in turn implies the existence of a subgradient orthogonal to the search direction. (b) In general, there might not be a universal recipe to find it, and the technique must be adapted to the problem. For instance, when the objective function is differentiable at the point of interest (note that a convex function is differentiable almost everywhere), the gradient automatically verifies the desired condition. For problems such as maxima of quadratics, the subdifferential is easy to characterize, and finding a subgradient orthogonal to the desired direction is a tractable linear problem. More generally, if the problem is piecewise differentiable, then the set of subgradients at a given point is the convex hull of the gradients of the pieces active at that point, and one should search within that convex hull.
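For the two-piece case mentioned above (a kink of $f = \max(f_1, f_2)$ with both pieces active), the search reduces to one scalar equation over the segment $\{\theta g_1 + (1-\theta) g_2 : \theta \in [0,1]\}$. A minimal sketch (the helper name and the example gradients are ours, purely illustrative):

```python
import numpy as np

def orthogonal_subgradient(g1, g2, v):
    """At a kink of f = max(f1, f2) the subdifferential is conv{g1, g2}.
    Return an element g = theta*g1 + (1-theta)*g2 with <g, v> = 0, or None."""
    a, b = float(np.dot(g1, v)), float(np.dot(g2, v))
    if a == b:
        return np.asarray(g1) if a == 0.0 else None  # <g, v> is constant on the segment
    theta = b / (b - a)                              # solves theta*a + (1 - theta)*b = 0
    if 0.0 <= theta <= 1.0:
        return theta * np.asarray(g1) + (1 - theta) * np.asarray(g2)
    return None                                      # no orthogonal element in the hull

# f(x) = max(x1^2, x2^2) at x = (1, 1): active gradients (2, 0) and (0, 2);
# along the direction v = (1, -1), the subgradient (1, 1) is orthogonal to v.
g = orthogonal_subgradient(np.array([2.0, 0.0]), np.array([0.0, 2.0]),
                           np.array([1.0, -1.0]))
```

With more than two active pieces, the same idea becomes a small linear feasibility problem over the convex hull of the active gradients.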
**Question 2** (**How to perform the line-search in practice?**): We have not precisely studied inexact line-search techniques for this specific method, but a first element of answer is that we do not really need the exact optimal $\alpha$ for Eq. (7) to hold. We only need $\alpha$ in Algorithm 3 to lie between $1/L$ and the exact minimizer $\arg\min_\alpha f(y_k + \alpha v_k)$, so there is a range of acceptable parameters for which the guarantee holds (meaning that our guarantee is not tied to the very particular choice of the exact line-search parameter: it is "more robust" than that in practice). Therefore, it is very realistic to replace the exact line-search by an inexact one in practice; the algorithm is expected to be robust to this change.
We leave the study of specific line-search techniques (say, requiring the directional subgradients to be smaller than some $\epsilon_k$) for future work, but an advantage of our technique relying on SDPs is that this study can be carried out directly using those SDPs.
**Question 3** (**Using additional assumption to obtain geometric rates**): We did not study the sharpness condition, but we added a remark in Appendix G about the $QG^-$ condition, which is essentially a quadratic lower-bound condition. Under this additional assumption, the convergence rates are indeed geometric.
**Question 4** (**Practical examples**): This question is very legitimate, and we thank the reviewer for asking it. However, we believe it is normal to have gaps between theory and practice: algorithms with the best theoretical worst-case convergence rates are often not the best ones in practice, as their theoretical setups might not match practical requirements. We therefore believe that exploring the space of theoretically possible methods is of interest independently of direct practical applications. One example is the exploitation of smaller (e.g., componentwise) Lipschitz constants, which allows using larger stepsizes.
Regarding the impact of our work, we indeed do not aim at tackling any new problem, but at proposing methods and analyzing their complexity from a new perspective: a novel class of functions. From what we can tell, our heavy-ball algorithm is a nice contribution, as it is adaptive to the class of functions (non-smooth or $QG^+$) and requires neither the knowledge of the problem parameters nor that of the particular class the problem belongs to. There are indeed methods with better worst-case properties, but they usually do not have such adaptivity, and some of them (e.g., accelerated methods) typically require knowledge of more problem-specific parameters.
Furthermore, even though Algorithms 2 and 3 are not the fastest methods for some smooth + non-smooth problems (methods exploiting the structure can be more efficient), we provide answers to simple questions such as:
"Does the fact that the loss is differentiable at the optimum result in a faster convergence rate for heavy-ball with line-search (for example, for an $\ell_1$-regularized problem)?"
Moreover, the extension to $h$-$RG^+$ (described in Section 3.3) offers the flexibility to tackle several non-smooth machine learning problems, including, for example, ReLU activations and $\ell_1$ or TV regularization. As specific methods have often been designed for these problems, our approach does not bring a systematic improvement, and we thus focused on providing a thorough analysis of optimization on the $QG^+$ class. Yet, we believe it paves the way for adaptive methods that could do so.
As an example, the classical TV-$L_2$ denoising problem, which combines a term with quadratic growth and a term that is non-smooth (even at the optimum), is neither Lipschitz nor smooth, but belongs to our class $h$-$RG^+$ with $h: z \mapsto M \sqrt{z}+\frac{L z}{2}$ (Remark 3.2 and Theorem 3.3, item 3). Moreover, note that our Algorithm 3 is adaptive and optimal both for the class of $QG^+(L)$ functions and for the class of $M$-Lipschitz continuous ones. Extending our analysis to obtain an adaptive algorithm for $h$-$RG^+$, which we leave as an open direction, would efficiently tackle TV-$L_2$-type problems in a parameter-free way.
We thank the reviewer again and hope that our answers clarify the corresponding points and resolve their concerns. We would be happy to provide further insights in case we misunderstood a question, or if the reviewer has any other remaining questions. Should our answers be found convincing, we would be happy if the reviewer could update their score.