# Reviewer fbqb (Rating 4, Confidence 4)
## General response
Thank you for your valuable time and efforts spent on reviewing our paper.
We appreciate your feedback.
Overall, the reviewer seems to have the following two general concerns:
1. The assumptions made in the paper are too strong (mainly the bounded domain and the knowledge of its diameter).
2. There is not enough explanation of why the Hölder class is interesting to consider.
Regarding 1, please note that the problem we are addressing, namely proving that gradient methods can adapt to the Hölder smoothness when the oracle is stochastic, is already quite hard under our set of assumptions, which are fairly standard in the study of adaptive methods for convex problems.
Indeed, no such result exists in the literature, and certainly not for the case when the knowledge of the diameter/distance from the initial point to the solution is not available. We aim to provide a reasonable first step for this open problem. Relaxing our assumptions and further improving the methods is indeed an interesting and important question, but this is already the next (potentially big) step.
As for 2., the Hölder class is rather well-known in optimization and has been studied for decades. In particular, it is one of the standard ways to continuously connect the two important classes of smooth and nonsmooth functions. In the realm of complex and large models, such generalizations are valued for robustness and tuning-efficiency reasons. Nevertheless, we agree to elaborate more on the motivation as needed. Indeed, we provide a set of examples for this class of functions below.
We therefore kindly ask you to reconsider your score and we are more than happy to continue our discussion should you need further information and clarification.
Below you can find more detailed answers to the specific questions / remarks that you raised.
## Strengths and weaknesses
> 1. The biggest weakness is that Assumption 2.1 seems too strong for (Holder) smooth functions. For instance, the analysis for the exact case becomes much easier under Assumption 2.1, if not a direct extension of the existing analyses that use a much less restrictive assumption on the bounded initial distance to the optima. The paper does not discuss whether Assumption 2.1 is necessary (at least for some values of $\nu$, say differentiable functions) by giving a lower-bound argument or even a simple example. Such a discussion would be helpful to understand the relevance of the problem. Under Assumption 2.1, for instance, optimization on an unbounded domain is not feasible (according to the upper bound), which seems very unlikely for smooth (and not just differentiable) functions. Thus, the only relevant setting is when $\psi$ is a hard constraint function: a setting much more specific than the one considered in previous works. To me, this seems like an analytical gap, not a real one, because intuitively, for smooth functions, the only relevant scale is the size of the target optimum.
We agree that it would be interesting to understand if our methods work when the domain is unbounded and $D$ is replaced by $R_0$—an upper bound on the distance from the initial point $x_0$ to the solution $x^*$. But this does not seem to be very simple.
In any case, if we know $R_0$, we can easily convert our original problem $\min_x [f(x) + \psi(x)]$ into an equivalent one, $\min_x [f(x) + \psi_D(x)]$, where $\psi_D$ is the restriction of $\psi$ onto the ball $\| x - x_0 \| \leq R_0$ (i.e., $\psi_D$ is the sum of $\psi$ and the indicator function of the ball). For this new problem, we can run our methods with diameter $D = 2 R_0$, and they would work absolutely fine even when the domain $\mathrm{dom} \, \psi$ in the original problem is unbounded.
The only detail that one needs to address is how to compute the proximal-point step for the function $\psi_D$ via the corresponding operation for $\psi$. But this is usually not difficult and requires solving a certain one-dimensional nonlinear equation, which can be done very efficiently by Newton's method (at no extra queries to the stochastic gradient oracle). In some special cases, this equation can even be solved analytically, e.g., when the original problem is unconstrained, one simply needs to perform the projection on the $\| x - x_0 \| \leq R_0$ ball.
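For illustration, in the simplest unconstrained case ($\psi \equiv 0$), the proximal-point step for $\psi_D$ is just the Euclidean projection onto the ball $\| x - x_0 \| \leq R_0$; below is a minimal NumPy sketch (the function name is ours and purely illustrative):

```python
import numpy as np

def prox_psi_D(x, x0, R0):
    """Proximal step for psi_D in the unconstrained case (psi = 0):
    the Euclidean projection of x onto the ball ||x - x0|| <= R0.
    Illustrative sketch only."""
    d = x - x0
    norm_d = np.linalg.norm(d)
    if norm_d <= R0:
        return x                      # x is already inside the ball
    return x0 + (R0 / norm_d) * d     # rescale x - x0 onto the boundary
```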
> 2. The paper also does not do an excellent job of motivating why the function class being studied is interesting. For instance, are there simple functions in the Holder smooth class but not in well-studied extremes of the class?
We agree with the reviewer that we should add more motivation for the Hölder class of functions. In general, it is a sufficiently rich class containing both nonsmooth and smooth functions.
Concerning examples for intermediate values of $\nu \in (0, 1)$, the simplest one is probably (the $p$-th power of) the $\ell_p$-norm of the residual for a system of linear equations, $f_p(x) = \sum_{i = 1}^m |\langle a_i, x \rangle - b_i|^p$ for $p \in (1, 2)$,
which generalizes the classical least-squares (corresponding to $p = 2$); the gradient of $f_p$ is Hölder with $\nu = p - 1$, but not Lipschitz (unless $p = 2$).
Another simple example is a similar construction but for linear inequalities: $h_p(x) = \sum_{i = 1}^m [\langle a_i, x \rangle - b_i]_+^p$, where $[t]_+ \equiv \max\{ t, 0 \}$; this is a smooth counterpart of the classical hinge loss used by SVMs.
More generally, there is a duality relationship between *Hölder smoothness* and *uniform convexity*: if the convex function $f$ is uniformly convex with degree $q \geq 2$, then its (Fenchel) dual $f_*$ has Hölder gradient with $\nu = \frac{1}{q - 1}$ (see, e.g., Lemma 1 in [1]), and vice versa.
In particular, for any convex function $F_*$, the function $F_q(x) = \max_s [\langle s, x \rangle - F_*(s) - \frac{\sigma}{q} \| s \|^q]$ is Hölder smooth with $\nu = \frac{1}{q - 1}$.
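As a concrete illustration of the first example above, here is a minimal NumPy sketch of $f_p$ and its gradient (the code and function names are ours, purely for illustration):

```python
import numpy as np

def f_p(x, A, b, p):
    """f_p(x) = sum_i |<a_i, x> - b_i|^p for p in (1, 2);
    its gradient is Hoelder continuous with exponent nu = p - 1."""
    r = A @ x - b
    return np.sum(np.abs(r) ** p)

def grad_f_p(x, A, b, p):
    """grad f_p(x) = sum_i p * |r_i|^(p-1) * sign(r_i) * a_i, with r = A x - b."""
    r = A @ x - b
    return A.T @ (p * np.abs(r) ** (p - 1) * np.sign(r))
```

For $p \in (1, 2)$, the scalar map $r \mapsto p |r|^{p - 1} \operatorname{sign}(r)$ is Hölder continuous with exponent $p - 1$, which is exactly the exponent $\nu = p - 1$ mentioned above.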
> 3.What are the connections of this assumption with other smoothness assumptions that control higher-order derivatives (and are relevant to learning with logistic or exponential loss)?
Could you please elaborate on this comment so that we can provide a concrete answer? From what we understand, Hölder continuity has extensions to higher-order regularity, but our definition only concerns the range between a Lipschitz continuous objective and a Lipschitz continuous gradient.
> 4.In a learning setting, where the loss function is fixed, and stochasticity arises from the sampling from the data distribution, what do the Holder smoothness assumption and Assumption 2.1 imply about the stochastic variance parameter $\sigma$?
In fact, neither Hölder continuity nor Assumption 2.1 has any implications/limitations on the variance bound: $\sigma$ is governed only by the stochastic oracle, specifically by the sampling strategy itself.
From a practical point of view, $\sigma$ largely depends on the particular loss function we consider. In general, $\sigma$ is bounded by the magnitude of the stochastic gradients divided by the square root of the minibatch size, which has no connection with Hölder continuity or Assumption 2.1.
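For instance, under the standard i.i.d. sampling model, if a single stochastic gradient has variance bounded by $\sigma_1^2$, then the minibatch average of size $B$ satisfies
$$
g_B(x) = \frac{1}{B} \sum_{j = 1}^{B} g(x, \xi_j),
\qquad
\mathbb{E} \| g_B(x) - \nabla f(x) \|^2
= \frac{1}{B} \, \mathbb{E} \| g(x, \xi) - \nabla f(x) \|^2
\leq \frac{\sigma_1^2}{B},
$$
i.e., $\sigma = \sigma_1 / \sqrt{B}$, regardless of the smoothness properties of $f$.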
> 5. Can the analysis be extended to the case with only bounded variance at the optima, i.e., usual $\sigma^*$ using analysis based on quadratic growth assumptions? If not, can this be done for at least some $\nu$ but not all of them?
Yes.
This can be done for any $\nu$ and is, in fact, not very difficult.
Let us briefly illustrate the idea considering Algorithm 2 and $\nu = 1$
(any other $\nu$ can be handled similarly).
Assume that our objective function is given in the form of the expectation:
$f(x) = \mathbb{E}_{\xi} [f(x, \xi)]$, where each function $f(x, \xi)$
is convex and its gradient $g(x, \xi)$ is $L$-smooth.
For simplicity, we shall use the same letter $L$ to denote the smoothness
constant of $f$ (although the real constant $L_f$ can potentially be much
smaller).
In this case, we can upper bound the variance
$\sigma^2(x) := \mathbb{E}_{\xi} [\| g(x, \xi) - \nabla f(x) \|^2]$
via $\sigma_*^2 := \sigma^2(x^*)$ plus the Bregman distance for $f$:
$$
\sigma^2(x)
\leq
2 \sigma_*^2
+
2 \mathbb{E} \|
[g(x, \xi) - g(x^*, \xi)] - [\nabla f(x) - \nabla f(x^*)]
\|^2
\leq
2 \sigma_*^2 + 2 \cdot 2 L \beta_f(x^*, x)
$$
(the final inequality upper bounds the second term via
$\mathbb{E}_{\xi} [\| g(x, \xi) - g(x^*, \xi) \|^2]
\leq
2 L \mathbb{E}_{\xi} [\beta_{f(\cdot, \xi)}(x^*, x)] =
2 L \beta_{f}(x^*, x)$).
The general convergence guarantee for Algorithm 2 is given by
inequality (27) (but in the form where we have not yet passed to the average point $\bar{x}_k$) combined with the general bound on $H_k$ from
lines 373-374:
$$
S_k
:=
\sum_{i = 1}^k F_i
\lesssim
D^2 \mathbb{E}[H_k]
\lesssim
L D^2 + D \Bigl( \sum_{i = 1}^k \sigma_i^2 \Bigr)^{1 / 2},
$$
where $F_i := \mathbb{E} F(x_i) - F^*$ and
$\sigma_i^2 \sim \sigma^2(x_i) + \sigma^2(x_{i - 1})$
(here and below, $\sim$ and $\lesssim$ denote equality/inequality up to absolute constants).
Instead of using the uniform boundedness of $\sigma_i$, we can now use
the bound via $\sigma_*$ and the Bregman distance.
This will upper bound $\sum_{i = 1}^k \sigma_i^2$ via $k \sigma_*^2$
plus the summation of Bregman distances
$\tilde{F}_i := \mathbb{E}[\beta_{f}(x^*, x_i)]$ multiplied by $L$.
Each such $\tilde{F}_i$ can be upper bounded by the actual residual $F_i$;
the first term $\tilde{F}_0$ can be upper bounded by
$L \| x_0 - x^* \|^2 \leq L D^2$.
Thus:
$$
\sum_{i = 1}^k \sigma_i^2 \lesssim k \sigma_*^2 + L^2 D^2 + L S_k.
$$
Substituting this into the previous display, we obtain the following
recurrence for $S_k$:
$$
S_k
\lesssim
L D^2
+ \sigma_* D \sqrt{k}
+
\sqrt{L D^2 S_k},
$$
which implies that
$$
S_k \lesssim L D^2 + \sigma_* D \sqrt{k}.
$$
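For completeness, the last implication is the standard resolution of a recurrence of the form $S \leq a + \sqrt{b S}$: by the AM-GM inequality,
$$
S_k
\lesssim
a + \sqrt{b S_k}
\leq
a + \frac{b + S_k}{2}
\quad \Longrightarrow \quad
S_k \lesssim 2 a + b,
$$
where $a \sim L D^2 + \sigma_* D \sqrt{k}$ and $b \sim L D^2$, which gives exactly the bound above.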
Passing now to the average point $\bar{x}_k$, we finally get
$$
\mathbb{E}[F(\bar{x}_k)] - F^*
\lesssim
\frac{L D^2}{k} + \frac{\sigma_* D}{\sqrt{k}}.
$$
This is essentially the same convergence rate as in Theorem 4.2
but with $\sigma_*$ instead of $\sigma$, and $L$ being the worst Lipschitz
constant among each $f(\cdot, \xi)$ (instead of $L = L_f$).
For general $\nu$, a similar analysis leads to
$$
S_k
\lesssim
A_k + \sigma_* D \sqrt{k} + A_k^{1 / (1 + \nu)} S_k^{\nu / (1 + \nu)},
\qquad
A_k := L_{\nu} D^{1 + \nu} k^{(1 - \nu) / 2},
$$
which gives $S_k \lesssim A_k + \sigma_* D \sqrt{k}$ and
$$
\mathbb{E}[F(\bar{x}_k)] - F^*
\lesssim
\frac{L_{\nu} D^{1 + \nu}}{k^{(1 + \nu) / 2}}
+
\frac{\sigma_* D}{\sqrt{k}}.
$$
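For completeness, the step from this recurrence to $S_k \lesssim A_k + \sigma_* D \sqrt{k}$ follows from Young's inequality applied to the last term:
$$
A_k^{1 / (1 + \nu)} S_k^{\nu / (1 + \nu)}
\leq
\frac{1}{1 + \nu} A_k + \frac{\nu}{1 + \nu} S_k
\leq
A_k + \frac{1}{2} S_k
\qquad (\text{since } \nu \leq 1),
$$
so the $S_k$ term can be absorbed into the left-hand side, exactly as in the case $\nu = 1$ above.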
> 6. Even for experiments, understanding and quantifying the Holder smoothness can shed more light on how the algorithm behaves in practice (and if the bounds are "tight"). This can be done via some synthetic problems.
We would greatly appreciate it if the reviewer could clarify parts of their comment for us. What do you mean by "quantifying the Holder smoothness can shed more light on how the algorithm behaves"? To clarify, our bounds are known to be tight, as discussed in (Nesterov, 2015). Could you please walk us through your suggestion for designing a synthetic experiment?
[1] Nesterov, Y.
Universal gradient methods for convex optimization problems.
Math. Program., 152:381-404, 2015.
## Minor comments
> 1. There is a missing discussion on proximal update algorithms, a general class of the algorithms considered in this paper that have been heavily studied theoretically. How do the paper's results change if given access to a stochastic prox-oracle, which directly solves the stochastic version of the problem (8) for some $H_k$? Some discussion would be nice; giving new results is not needed. This can also aid the presentation.
Due to space constraints, we chose to focus on the algorithmic and analysis details in the main text. We agree that we could focus more on motivating the composite problem formulation and classical solution methods (such as the family of proximal (gradient) methods). We will add new sections on composite convex optimization. For instance, let us draw the attention of the reviewer to line 5 of Algorithm 3. This step is equivalent to computing the proximal operator of $\psi$.
Would you be able to define what you mean by a stochastic proximal oracle? The literature on stochastic composite problems mostly considers the case where $f$ is accessed through stochastic first-order oracles and $\psi$ is sufficiently simple that its proximal operator is efficiently computable (as in line 5 of Algorithm 3).
> 2. The term "sufficiently simple" is not defined in the paper and is used to mean many different things in analysis.
We explain what we mean by simple function in page 3, line 151, left column. We will make sure we explain what sufficiently simple means in the rest of the manuscript, appropriate to the context of its use.
> 3. There is a typo while defining $r_{k + 1}$ on page 3.
Thank you for pointing out the typo, we will fix it accordingly.
> 4. The discussion of the algorithm's analysis is too technical and can be shortened to make space for some motivation.
We solve an open problem which hasn't been answered over the past 8 years: designing a universal method that automatically adapts to noisy gradients. As a theory paper, we want to achieve a clear and open display of our design process, intuitions, novelties, and also possible drawbacks. We believe this will enable the readers to get a better sense of the theory behind our results. That said, we keep your suggestion in mind and will add more discussion on the motivation of our problem setting.
# Reviewer rET6 (Rating 3, Confidence 5)
## Strengths and weaknesses
1. > If we have knowledge of $D$, we can just use AdaGrad and it will also yield
the same rates.
Theorem 1.1 in [1] gives a guarantee that essentially yields the optimal
rate for any Holder-smooth function (with Holder smoothness defined as in
this paper).
It also works in both the deterministic and stochastic settings.
The last point essentially means that the main result of the paper (in Thm
3.1) is not really about achieving the best rate but extending Nesterov's
method specifically.
That said, I do not really see why this method is any more desirable to use
than AdaGrad.
It seems that there is a certain misunderstanding regarding the principal
contributions of our paper.
Note that the main results of the paper are Theorems 4.2 and 5.1,
**not** Theorem 3.1.
The only purpose of Section 3 is to explain the main idea behind our
proposed formula for updating step-size coefficients in the (simplest)
case when the oracle is deterministic.
The deterministic case itself is not the focus of the paper since there
already exist good methods using line search.
That being said, it would be unfair to claim that the convergence rates
presented in [1] for the AdaGrad method are the same as those for our
Algorithms 2 and 3:
1. Please note that [1] studies AdaGrad only in the **deterministic**
setting.
There is no extension of Theorem 1.1 to the stochastic oracle.
1. There are no **accelerated** gradient methods in [1].
In contrast, our Algorithm 3 is accelerated with optimal worst-case
complexity bounds.
1. The classical AdaGrad studied in [1] does not work correctly for
constrained optimization when the function is smooth
(see the explanation below).
1. > In lines 341-346 col. 2 the paper mentions that the AdaGrad stepsizes may
not work if the function does not have unbounded gradients, but this is not
really true.
It is true, even for the deterministic oracle.
Please note that all existing theoretical results for AdaGrad on smooth
functions have an additional (sometimes hidden) assumption that
the gradient at the minimizer is zero, $\nabla f(x^*) = 0$, which is usually
not the case for constrained optimization (otherwise, there is no point in
keeping the constraints as the problem is equivalent to the unconstrained
one).
For instance, Theorem 2.1 from the work [1] that you cited before has the following clause:
"Moreover, if $f$ is also $\beta$-smooth and the global minimum
$x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)$ belongs to $\mathcal{K}$
(the constraint set), then ..."—this is exactly the assumption that
$\nabla f(x^*) = 0$.
To understand why this is indeed a problem, look, e.g., at the
convergence guarantee from Theorem 1.1 in [1]:
$\Delta_T
\equiv
\sum_{t = 1}^T [f(x_t) - f^*]
\leq
\sqrt{2 D^2 \sum_{t = 1}^T \| \nabla f(x_t) \|^2}.$
If $f$ is $L$-smooth and $\nabla f(x^*) = 0$, one can use the fact that
$\| \nabla f(x_t) \|^2 \leq 2 L [f(x_t) - f^*]$ (\#) to obtain
$\Delta_T \leq \sqrt{4 L D^2 \Delta_T}$, which immediately gives the
standard $O(\frac{L D^2}{T})$ convergence for $f(\bar{x}_T) - f^*$,
where $\bar{x}_T$ is the average point.
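Explicitly, squaring both sides gives $\Delta_T \leq 4 L D^2$, and then, by convexity of $f$ and Jensen's inequality for the average point $\bar{x}_T := \frac{1}{T} \sum_{t = 1}^T x_t$,
$$
f(\bar{x}_T) - f^*
\leq
\frac{1}{T} \sum_{t = 1}^T [f(x_t) - f^*]
=
\frac{\Delta_T}{T}
\leq
\frac{4 L D^2}{T}.
$$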
However, if $\nabla f(x^*) \neq 0$, then (\#) is no longer true.
Note that this problem does not arise in the algorithms which
use the difference of gradients to set up their step-size coefficients
(e.g., our Algorithms 3 and 4).
1. > The experiments don't really show any particular advantage of one over the
other.
We would like to remind the reviewer that our method demonstrates
competitive performance against established baselines such as AdaGrad.
Our main goal is to contribute to the theory by introducing a novel
algorithm capable of adapting in the stochastic setting; beating the
state of the art experimentally is not the primary focus.
[1] Levy, Kfir Y.
Online to Offline Conversions, Universality and Adaptive Minibatch Sizes,
Advances in Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA,
USA (2017).
## Questions
1. > Why is extending Nesterov's UGM a particularly interesting problem, given
that we have many methods that already adapt to the Holder smoothness?
We do not fully understand what you mean by "extending Nesterov's UGM".
Our methods are indeed inspired by Nesterov's UGMs but we would not call
them "extensions".
The key idea of Nesterov's UGMs is the use of **line search**, which is
possible only for the deterministic oracle.
Our algorithms are line-search-free, so, at best, they can be called
"alternatives" to Nesterov's UGMs, not "extensions".
Of course, in the deterministic case, there is not much point in developing
any alternatives to Nesterov's UGMs—they already work very well,
and have strong theoretical guarantees.
But this does make sense in the stochastic case—this is exactly what
adaptive / parameter-free methods are all about.
Note that many existing algorithms that adapt to Hölder smoothness do it
provably only when the oracle is deterministic.
There are no such theoretical results for the **stochastic** gradient
oracle.
This is exactly the gap which we attempt to close in our work.
1. > Does your method require a bounded domain even in the deterministic
setting?
Can that be relaxed?
If not, why not?
We do not know if our method still works when the domain is unbounded and the
oracle is deterministic.
At least, we do not have any specific proofs (experimentally the method seems
to work fine even for nonconvex problems such as neural networks).
However, we believe this should not be of primary concern:
1. The purpose of our work is to prove that **stochastic** gradient methods
can indeed adapt to Hölder smoothness, enjoying the same worst-case
efficiency estimates as if they knew all the necessary parameters of the
function class.
This is already a significant contribution since **no such results exist,
even when the domain is bounded and we know its diameter $D$**.
2. Except for a very small handful of the most advanced parameter-free methods
for stochastic optimization (such as [2]), all other algorithms need
to know either the diameter of the domain, or an upper bound $R_0$ on the
distance from the initial point $x_0$ to the solution $x^*$.
If we know such $R_0$, we can easily convert our original problem
$\min_x [f(x) + \psi(x)]$ into an equivalent one,
$\min_x [f(x) + \psi_D(x)]$, where $\psi_D$ is the restriction of $\psi$
onto the ball $\| x - x_0 \| \leq R_0$ (i.e., $\psi_D$ is the sum of
$\psi$ and the indicator function of the ball).
For this new problem, we can run our method with diameter $D = 2 R_0$.
The only detail that one needs to address is how to compute the
proximal-point step for the function $\psi_D$ via the corresponding
operation for $\psi$.
But this is usually not difficult and requires solving a certain
one-dimensional nonlinear equation, which can be done very efficiently
by Newton's method (at no extra queries to the stochastic gradient
oracle).
In some special cases, this equation can even be solved analytically,
e.g., when the original problem is unconstrained, one simply needs
to perform the projection on the $\| x - x_0 \| \leq R_0$ ball.
3. Finally, we are not claiming that the specific methods we propose
   are ideal or final, or that one should use only them from now on.
   What is valuable are the particular ideas and techniques that we use in
   the theoretical analysis of our algorithms.
We believe these ideas are of interest in a much broader context,
including the works trying to remove the need for knowing $D$ (or $R_0$).
[2] Ivgi, M., Hinder, O., Carmon, Y.
DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule,
ICML (2023).
## Minor points
Thanks for noticing a couple of typos. We will correct them.
Regarding the notation for the Bregman divergence, there seems to be no general
consensus on one specific notation.
Our particular choice is not self-invented but follows the standard notation
used by Nesterov (see, e.g., eq. (2.9) in [3]).
[3] Nesterov, Y.
Universal gradient methods for convex optimization problems.
Math. Program., 152:381-404, 2015.
## Final comment
We kindly ask you to reconsider your score taking into account the detailed
explanations we provided above.
## Second response
Dear Reviewer,
The works you have cited (in both replies) do not contain any specific results for the Hölder smooth class.
The fact that it might be possible to obtain results similar to ours by combining and extending many existing techniques scattered in the literature does not change the fact that our results are novel. Let us also respectfully underline that it is not clear how combining the existing theory would give the same results as ours. We would greatly appreciate it if you could point out the specific references and derivations so we can understand your perspective. Otherwise, any paper can be criticized as "a trivial extension" of the previous work.
Moreover, one of our main contributions is an alternative analysis paradigm to the existing theory of adaptive methods. Our method is inspired by the line-search analysis and combines it with a new type of data-adaptive step-size, whereas the existing literature heavily relies on online learning techniques. Note that online learning analysis is not naturally adapted to any notion of smoothness, which makes the extension of the existing theory to Hölder smooth problems even harder, so we take an alternative approach.
Let us briefly point out why UnixGrad, which is an accelerated algorithm, cannot be trivially extended to the Hölder smooth class. Their analysis assumes that the function is either Lipschitz smooth or Lipschitz continuous ($\nu = 1$ or $\nu = 0$, respectively). The way they achieve acceleration is by balancing the "regret term" and showing that it is constant, then dividing it by the sum of the averaging sequence $\alpha_t = t$, which gives the desirable $1/T^2$ rate. However, when $\nu \in (0,1)$, they would not be able to show constant regret *without explicitly changing* the averaging sequence from $\alpha_t = t$ to $\alpha_t = t^\nu$. As it stands, their analysis requires prior knowledge of $\nu$, while we solve this problem with an alternative algorithm and an alternative analysis.
Respectfully, we do not agree with your assessment that it is not difficult to modify the already existing proofs to fit the Hölder setting, especially with a unified analysis which would be suitable for stochastic/deterministic, composite/noncomposite problems, and accelerated/non-accelerated methods at the same time (which is exactly what our work is offering). There are many details that need to be addressed carefully with explicitly written down rigorous proofs.
# Reviewer vQGC
## General response
Thank you for the kind words.
We appreciate your feedback and value the time and efforts you spent on
reviewing our paper.
Please find below the answers to your questions, as well as some minor comments
regarding your remarks.
## Questions
1. > Line 163, second column: it should be $r_{k + 1} = \| x_{k + 1} - x_k \|$.
Thank you for noticing this typo, we will fix it.
1. > Lines 128-132, second column.
Could you please provide some explanation of why
$L_{\nu_1} D^{\nu_1} \leq L_{\nu_2} D^{\nu_2}$?
This claim does not seem obvious to me.
Yes, of course.
According to the definition in eq. (6),
$L_{\nu} D^{\nu} =
\sup_{x, y} \| g(x) - g(y) \| ( \frac{D}{\| x - y \|} )^{\nu}$.
Since $\| x - y \| \leq D$, the quantity in the parentheses is $\geq 1$,
and hence the whole expression is monotonically increasing in $\nu$.
Thanks for asking, we will add the corresponding clarification to the paper.
1. > Lines 118-127, second column.
The same here, please provide some explanations.
We are not sure which explanations one would expect here.
Most of the claims are just straightforward consequences of the definition
or the introduction of some common terminology (Lipschitz continuous
gradient, bounded variation of subgradients, etc.).
The only slightly non-obvious statement is perhaps the fact that
$f$ must be differentiable if $L_{\nu} < +\infty$ for some $\nu > 0$.
But this is a simple consequence of the definition:
since $\| g(x) - g(y) \| \leq L_{\nu} \| x - y \|^{\nu}$
for any points $x, y$ and any subgradients $g(x) \in \partial f(x)$,
$g(y) \in \partial f(y)$, we can take $y = x$ and conclude that
the subdifferential $\partial f(x)$ is a singleton (contains only one
element)—this means that $f$ is differentiable.
1. > Line 1169-1170.
If $H_k \leq M < H_{k + 1}$, then the right-hand side should be zero,
shouldn't it?
And you write, that "the right-hand side is nonnegative".
Yes, of course, it should read "the right-hand side is zero".
Thanks for noticing this typo, we will fix it.
## Strengths and weaknesses
1. > Lack of discussion of experimental results.
It would be beneficial to discuss the pros and cons of your algorithm and
the ones you compare within the experimental section.
Thanks for the suggestion.
We will elaborate on this in the final version.
1. > Need to know the diameter of the domain as an input parameter of the
algorithm, which is rather restrictive.
Moreover, the assumption of a bounded domain prevents the algorithm from
being used in unbounded case.
This is true.
However, we can still run our algorithm even in the unbounded case,
provided we know a certain upper bound $R_0$ on the distance from the
initial point $x_0$ to the minimizer $x^*$.
To that end, we simply need to (artificially) transform our original problem
$\min_x [f(x) + \psi(x)]$ into an equivalent one,
$\min_x [f(x) + \psi_D(x)]$, where $\psi_D$ is the restriction of $\psi$
onto the ball $\| x - x_0 \| \leq R_0$
(i.e., $\psi_D$ is the sum of $\psi$ and the indicator function of the ball).
This problem would satisfy all our assumptions with diameter $D = 2 R_0$.
The only detail one needs to address is how to compute the
proximal-point step for the function $\psi_D$ via the corresponding
operation for $\psi$.
But this is usually not difficult and requires solving a certain
one-dimensional nonlinear equation, which can be done very efficiently
by Newton's method (at no extra queries to the stochastic gradient
oracle).
In some special cases, this equation can even be solved analytically,
e.g., when the original problem is unconstrained, one simply needs
to perform the projection on the $\| x - x_0 \| \leq R_0$ ball.
# Reviewer jbR1
## General response
Thank you for the kind words.
We appreciate your feedback and value the time and efforts you spent on
reviewing our paper.
Please find below the answers to your questions, as well as our comments regarding your remarks.
## Weaknesses
> Like Adagrad, the method can only decrease the stepsize. This is in contrast to the line-search methods, which the authors are trying to improve upon. Since the method cannot increase the stepsize, it might eventually slowdown, and become slower than a line search.
The stepsize requires having a bounded domain and, on top of that, knowing the diameter. This assumption does not hold in the majority of applications.
Let us emphasize that our design allows the step-size to stay the same across iterations, while the AdaGrad step-size monotonically decreases. To design a parameter-adaptive method (with simultaneous adaptation to the Hölder constant, the Hölder exponent, and the variance), we develop a new approach inspired by line search. As you might agree, stochastic line search requires additional assumptions on the stochastic function oracle. Also note that the deterministic and stochastic line-search mechanisms are completely different from each other, while we propose a single strategy that works in all cases.
Regarding your comment on the need for a bounded domain, we can relax it when we have a rough estimate of the initial distance; we refer the reviewer to the corresponding proposition in our answer to Reviewer fbqb.
We also suspect that our method should work without knowing $D$, but we have not yet had a chance to prove this rigorously. If we only have an initial guess $\hat D$, we would likely suffer an additional multiplicative factor depending on the ratio $D / \hat D$.
Last but not least, the literature on adaptive methods indicates the difficulty of simultaneous adaptation to smoothness, variance, and diameter. A recent work by (Carmon and Hinder, 2024) studies parameter-free methods for the non-smooth case and shows that adapting to all the parameters comes at the cost of a multiplicative error of $\log(D / \hat D)$. Hence, we want to emphasize the difficulty of achieving acceleration without knowing any problem parameters while adapting to Hölder smoothness and stochasticity at the same time. We regard removing the diameter dependence as an independent and challenging future direction.
> The experiments are quite limited, especially for a paper on adaptive optimization. I appreciate the diversity of problems: log regression, least squares, and a non-nonconvex neural network. However, the MNIST dataset is highly outdated, and doesn't represent what people are using adaptive methods for. A small transformer model would be much more appropriate and it wouldn't take more than a couple of hours to train it on a small dataset on a single GPU. I'd also suggest running multiple random seeds and presenting the average with confidence intervals, since otherwise results on neural networks can be rather random.
Following your suggestion, we ran a couple of experiments training a ResNet on CIFAR-10. The limited compute available during the rebuttal period restricts our ability to run extensive experiments, but our new plot shows competitive performance of our method against AdaGrad and Adam. We share the plots in a separate comment, visible to all, at the top.
## Minor
> In line 174, right column, I think it'd be more consistent with the other notation to denote the best value of $\bar H_k$ as $\bar H^*$ rather than $\bar H_*$.
Thank you for the suggestion, we will implement it for the final version of our paper.
> The parentheses starting in line 217 on page 4 seem unnecessary, this sentence does not comment a part of a sentence. The impact statement has a typo: "none which" -> "none of which"
Thank you for noticing the typos. We correct all the typos indicated by the reviewers for the final version of the submission.
## Questions
> One major missing component of the current empirical evaluations is the sensitivity of the method to how we choose D. This can be tested on both convex and nonconvex examples. In the current implementation of convex problems, the authors use the fact that the diameter D is known, however, we can try plugging-in other values in the method to see how well it performs. This way, we will now the real value of D, so we'll be able to know how far off this estimate can be. For nonconvex problems, the authors report sweeping D over values {50, 35, 20, 10, 5}, but I do not see any comments on how this choice affected the performance. Can the authors comment on that?
Let us note that our step-size is capable of recovering from a bad estimate of $D$. Unless the estimate is prohibitively small or large, the method converges with similar performance. We pick the best performance over a sweep and report it in the plots. The set of values we reported yields convergent curves, but the performance fluctuates, as is the case for other methods with different initial step-size values.
> Why there is no comparison to Polyak stepsize? Gradient descent with Polyak stepsize is also universal, so it would make sense to include it in the comparison.
As the reviewer might recognize, the Polyak step-size requires knowledge of the optimal objective value, which is usually not available; hence, it is very difficult to implement in practice.
# References for adaptive methods with boundedness
[1] Wang, Abernethy. Acceleration through Optimistic No-Regret Dynamics, *NeurIPS* (2018).
> Studies smooth convex problems in the deterministic setting. The non-adaptive algorithm must know $L$. They make a boundedness assumption, and the algorithm does not need to know the diameter.
[2] Cutkosky. Anytime Online-to-Batch, Optimism and Acceleration, *ICML* (2019).
> Studies non-smooth or smooth convex problems under deterministic and stochastic settings (bounded variance). The algorithm doesn't need the knowledge of $L$ and $\sigma$. Assumes bounded domain and the diameter ($B$ in their algorithm) appears in the step-size.
[3] Joulani, Raj, Gyorgy, Szepesvari. A Simpler Approach to Accelerated Stochastic Optimization: Iterative Averaging Meets Optimism, *ICML* (2020).
> Studies non-smooth or smooth convex problems under deterministic and stochastic settings (bounded variance). They propose 2 algorithms. The **non-adaptive** algorithm must know $L$ to set its step-size and achieves $1/T^2$ rate without bounded domain assumption. The adaptive algorithm doesn't need to know $L$, but requires boundedness ($R = \max \| x_t - x^* \|$). They argue that their algorithm could work without knowing $R$ but the dependence on $R$ would be sub-optimal. Our algorithm could also work without knowing $D$ and achieve the best dependence on $T$, $L$ or $\sigma$.
[4] Antonakopoulos, Kavis, Cevher. Extra-Newton: A First Approach to Noise Adaptive Accelerated Second-Order Methods, *NeurIPS* (2022).
> Studies second-order smooth convex problems in the stochastic setting with bounded variance. The algorithm doesn't need the knowledge of $L$ and $\sigma$. Assumes bounded domains but doesn't need to know $D$ to achieve best dependence on $T$, $L$ or $\sigma$. They study second-order methods, but their reasoning applies to our setting.
[5] Ene, Nguyen, Vladu. Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities. *AAAI* (2021).
> Studies non-smooth or smooth convex problems under deterministic and stochastic settings (bounded variance). Their algorithm is adaptive and does not need the knowledge of $L$ and $\sigma$. They assume a bounded domain, and the step-size is set using the diameter.
[6] Carmon, Hinder. The Price of Adaptivity in Stochastic Convex Optimization. *arXiv* (2024).
> This recent paper theoretically analyzes the difficulty of designing completely parameter-free (adaptive) methods for non-smooth convex optimization. Their results imply that when the algorithm knows neither the Lipschitz constant nor the initial distance, achieving the optimal parameter dependence is not possible and the algorithm must suffer some multiplicative error.
# References for parameter-free optimization and D-estimation
To explain the individual difficulty of adapting to the initial distance while remaining adaptive, we point to a separate line of research called parameter-free optimization.
[7] Ivgi, Hinder, Carmon. DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule, *ICML* (2023).
> They study general convex functions under stochastic gradients with bounded norms. They show that their parameter-free SGD variant suffers an additional $\log(\| x_1 - x^* \| / r_\epsilon)$ factor, where $r_\epsilon$ is a user-specified initial estimate.
[8] Defazio, Mishchenko. Learning-Rate-Free Learning by D-Adaptation, *ICML*, (2023).
> They focus on non-smooth convex minimization, a much simpler problem formulation. They do not assume boundedness, but they make use of a delicate mechanism that lower bounds $\| x_1 - x^* \|$. Their rate is suboptimal in terms of the initial distance dependence and suffers a $\log(\| x_1 - x^* \| / d_0)$ multiplicative factor, where $d_0$ is an initial estimate of the initial distance, which corresponds to an estimate of $D$ in our case.
[9] Mishchenko, Defazio. Prodigy: An Expeditiously Adaptive
Parameter-Free Learner, *arXiv* (2024).
> They study the same setting as D-adaptation paper. They improve the $\log(\| x_1 - x^* \| / d_0)$ dependence to $\sqrt{\log(\| x_1 - x^* \| / d_0)}$.
# New response to rET6
Dear Reviewer,
What you are proposing in your notes is a reasonable approach.
Note, however, that you cannot simply upper bound $r_t \leq D$ without modifying
the method by adding, e.g., an additional projection step onto the ball of
diameter $D$.
To cover composite optimization problems (for which, in general,
$\nabla f(x^*) \neq 0$), you also need additional modifications to the method
such as using $\| g_k - g_{k - 1} \|$ in $v_t$ instead of the current
$\| g_k \|$.
There are also several other details that need to be addressed more carefully,
e.g., the possibility of $v_t = 0$ in which case $\eta_t$ is ill-defined, etc.
But, most importantly, it does not seem that this type of analysis can be
directly extended to accelerated methods such as UnixGrad (which you mentioned
previously).
The main problem is that, for accelerated methods, the right-hand side of the
convergence rate bound $A_t \mathbb{E} [F(x_t) - F^*] \leq \ldots$ contains the
summation of gradients at some auxiliary points $y_t$, which are not directly
related to $F(x_t) - F^*$ in the left-hand side.
Clearly, all of this is not just a "trivial extension" of the existing analysis
from a certain paper, which would require a couple of paragraphs to describe.
As it stands, it also contributes to our argument that justifying
parameter-agnostic, optimal, universal rates for the class of Hölder smooth
functions under stochasticity is not a trivial task.
Let us stress one more time that we disagree with your harsh evaluation of our
work.
We propose new interesting algorithms and ideas and, most importantly,
**we establish new complexity results** (for stochastic gradient methods, both
accelerated and nonaccelerated, on the Hölder class) **which are not present in
the existing literature**.
**The fact that the reviewer might have some alternative ideas how to attack the
same problem should not be a reason for rejecting our work.**
If you believe that our approach can be improved upon, this is great; please
feel free to write your own paper tackling the same problem and taking our
results one step further.
This will be more beneficial to the community than having an inconclusive
argument that the problem is trivial solely because the reviewer already has
some ideas on how to approach it.
Regarding the discussion of existing work, please note that it is already
present in our paper, including the detailed comparison of our method with
AdaGrad (which uses the difference of successive gradients), see Section 4.3.
Specifically, we explain that our analysis can be seen as a more precise version
of that of AdaGrad (see lines 348-360, right column) because we apply one less
inequality compared to AdaGrad.
One particular implication of this is that, for our Algorithm 2, we have the
following convergence rate bound which is similar to that from your notes:
$\sum_{i = 1}^k \mathbb{E} [F(x_i) - F^*]
\leq
2 D \mathbb{E}\bigl[ (\sum_{i = 1}^k \| g_i - g_{i - 1} \|^2)^{1 / 2} \bigr]$
(combine $\sum_{i = 1}^k \mathbb{E} [F(x_i) - F^*] \leq 2 \mathbb{E}[H_k] D^2$,
which is (27) before applying the convexity of $F$, with (29)).
This means that, instead of proving Theorem 4.2 as we currently do by estimating
the growth rate of $\mathbb{E}[H_k]$ (see lines 357-378, left column), we could
also do it by using the technique from your notes (bounding
$\mathbb{E}[\| g_i - g_{i - 1} \|^2]$ via the function residual plus $\sigma^2$
and then resolving the obtained recurrent inequality).
However, as we already pointed out in the first paragraph of our response, this
type of analysis is not directly extendable to the accelerated Algorithm 3,
while our idea of bounding $\mathbb{E}[H_k]$ using the balance equation (68),
see lines 956-1022, still works exactly the same as for the non-accelerated
Algorithm 2.
Of course, we will add all these remarks/citations to the paper if the reviewer
finds them helpful for a better understanding of our work.