## Reviewer ZQXi

We sincerely thank the reviewer for the constructive comments and valuable insights, which help improve the quality of our work significantly. We have carefully revised our paper according to your comments and suggestions. Please see our revised submission, where we have highlighted all major changes in **blue**. Please also see our point-by-point responses below:

> **Your Comment 1:** Performing the forward and backward knowledge transfer has been done in the existing works.

**Our Response:** Thanks for the comment. Although it is true that both forward knowledge transfer (FWT) and backward knowledge transfer (BWT) have been studied in the continual learning (CL) literature, the FWT and BWT performance of existing work in this area *remains far from satisfactory*. More specifically, FWT and BWT in existing CL methods require either an extensive amount of resources for computing gradient projections (in orthogonal-projection-based CL) or storing a large amount of old tasks' data (in experience-replay-based and regularization-based CL). The limitations of these existing works motivate us to propose a new CL method that improves FWT and BWT performance. Specifically, in this paper we focus on reducing the $\mathcal{O}(n^3)$ computational complexity of SVD and increasing the scalability of orthogonal-projection-based CL methods, which have the appealing property of not needing access to old tasks' data. Toward this end, we propose a local model space projection (LMSP) approach that achieves $\mathcal{O}(n^2)$ instead of $\mathcal{O}(n^3)$ complexity.

<!-- (from Jin Shang: [store a lot of data seems not to be the pain point]) -->
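To make the complexity comparison concrete, the sketch below is illustrative only: the plain row split, the helper `rank_r_sketch`, and all sizes are our stand-ins for this response, not the paper's actual anchor-point construction. It contrasts the single full SVD used by global methods ($\mathcal{O}(n^3)$) with $m$ independent rank-$r$ sketches whose total cost is $\mathcal{O}(n^2 r)$:

```python
import numpy as np

def rank_r_sketch(B, r, oversample=5, seed=0):
    """Randomized rank-r approximation of B in O(rows * cols * r) time."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((B.shape[1], r + oversample))
    Q, _ = np.linalg.qr(B @ G)                 # range finder
    U, S, Vt = np.linalg.svd(Q.T @ B, full_matrices=False)
    return (Q @ U)[:, :r], S[:r], Vt[:r]

n, m, r = 1024, 8, 5                           # layer width, local models, local rank

W = np.random.randn(n, n)
_ = np.linalg.svd(W)                           # global route: one full SVD, O(n^3)

# Local route: m independent rank-r sketches, O(n^2 * r) in total; the m jobs
# have no dependencies, so they can also run in parallel.
local_models = [rank_r_sketch(B, r) for B in np.array_split(W, m, axis=0)]
```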
-------------

> **Your Comment 2:** The proposed approach relies on the task information, which can not be used in task-free continual learning.

**Our Response:** Thanks for your comment. We clarify that our focus in this paper is the standard task-based CL setting, i.e., tasks arrive at the learner *sequentially* with clear task boundaries (see, e.g., [R1] for a description of this standard setting). We would also like to point out that our work focuses on the *orthogonal-projection-based* CL approach, which requires the *least* (in fact, almost zero) amount of task information, since orthogonal-projection-based CL methods do *not* need to save any old tasks' data. All we need is to compute the new null space of the model parameters upon finishing the learning of the previous task. On the other hand, we note that "task-free continual learning" is a newer CL paradigm referring to CL systems with **no** clear boundaries between tasks, where the data distributions of tasks change gradually and continuously (see [R2] for a detailed description). Clearly, task-free CL is a more complex paradigm: how to conduct CL without requiring previous tasks' information is a far more challenging open problem in the community, which deserves an independent paper dedicated to the topic. It is beyond the scope of our current work, but could be an interesting and important future direction, and we thank the reviewer for suggesting it.

[R1] L. Wang et al., "A Comprehensive Survey of Continual Learning: Theory, Method and Application," arXiv:2302.00487.
[R2] R. Aljundi et al., "Task-Free Continual Learning," in Proc. CVPR 2019.

<!-- <span style="color:blue">[Kevin: I'm not sure the reviewer really understands what he is talking about; he seems to be clueless about the definition of "task-free continual learning." Task-free CL refers to CL systems where there is no clear boundary between each pair of successive tasks and the data distributions of tasks continually shift over time (see https://arxiv.org/pdf/1812.03596.pdf). This is actually an even more challenging setting than the regular CL we consider, which requires even more task information. For this reason, I modified our response; Jin's original response is kept below. Also, I think we may complain to the AC that this reviewer seems rather unqualified.]</span> <span style="color:red">Thanks for the feedback. We conclude this in our future work. Given that we usually have many pretrained models, it is more practical and useful to first focus on how to run task-based continual learning efficiently. That is also the purpose of this paper.</span> -->
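As additional context on how little task information the orthogonal-projection family keeps: only a small column-orthonormal basis of each old task's representation subspace is stored, and new gradients are projected onto its orthogonal complement. A minimal sketch (our illustration, with made-up sizes; not the paper's code):

```python
import numpy as np

def project_out(grad, M):
    """Remove the component of `grad` in span(M): g <- g - M (M^T g)."""
    return grad - M @ (M.T @ grad)

n, k = 256, 20                                     # layer width, retained basis size
rng = np.random.default_rng(0)
M, _ = np.linalg.qr(rng.standard_normal((n, k)))   # stand-in for old tasks' basis
g = rng.standard_normal((n, 1))                    # a new task's layer gradient

g_proj = project_out(g, M)
assert np.allclose(M.T @ g_proj, 0.0, atol=1e-10)  # update leaves old subspace intact
```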
-------------

> **Your Comment 3:** The proposed approach does not always achieve the best performance in some datasets.

**Our Response:** Thanks for your comments. We would like to clarify that, due to the information loss of using local model approximation in our LMSP method, LMSP may occasionally be outperformed by other baseline methods. However, we want to emphasize that our goal in this paper is to significantly reduce the computational complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ by using local model approximation, even though this could lead to a slight performance loss. In other words, we pursue *low-complexity CL algorithmic design* by potentially and slightly trading off learning performance. Interestingly, our experiments show that, due to other complex factors in CL systems, our LMSP approach actually *outperforms* the baseline approaches in most scenarios (cf. Table 1). It is also worth noting that we theoretically characterize the conditions under which our LMSP approach can achieve better results.

<!-- Using local models to approximate the global one does necessarily lose some information. We aim to reduce the complexity without sacrificing too much performance. Considering the huge complexity reduction from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$, we think the minor performance loss is acceptable. We also provide some theoretical analysis on the scenario when our approach could get even better results. -->

-------------

> **Your Comment 4:** Although the proposed approach can reduce computational costs but would increase more parameters.

**Our Response:** Thanks for the comments. It appears that there are some misunderstandings, perhaps due to the relatively complex mathematical notation in our algorithm. We would like to clarify that our local model projection approach does *not* increase the number of model parameters (i.e., our LMSP approach has the same number of parameters as CUBER, the most closely related work). Specifically, we apply local model approximation to each layer's output representation by partitioning the layer's matrix into smaller submatrices (defined by anchor points), which allows faster processing of these smaller submatrices *in parallel*. During this process, the total number of parameters remains the same (cf. the description at the bottom of Page 4 and Eq. (3)). Then, our LMSP method updates the new weights $\mathbf{W}^l$ from the previous model weights using the LMSP-based projected gradients and scaling parameters, so the number of parameters remains the same as in CUBER (cf. Eq. (7)).

<!-- throughout the whole learning process (in order to not only learn the new information but also keep the existing knowledge). -->
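A quick sanity check of the "no extra parameters" point: partitioning a weight matrix into anchor-defined submatrices only re-indexes existing entries. The toy check below (our illustration; a plain index split stands in for the anchor regions) confirms the entry count is unchanged:

```python
import numpy as np

W = np.random.randn(512, 512)
row_parts = np.array_split(np.arange(512), 4)   # 4 "anchor regions" per axis
col_parts = np.array_split(np.arange(512), 4)
submatrices = [W[np.ix_(r, c)] for r in row_parts for c in col_parts]

# 16 submatrices, but exactly the same 512 * 512 parameters in total.
assert sum(S.size for S in submatrices) == W.size
```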
-------------

## Reviewer Shvn

> **Your Comment 1:** Although the proposed method seems quite competitive in terms of experimental results, there is no report on the performance of forward transfer. This is extremely relevant as forward and backward transfer are usually in trade-off (the more forward, the less backward transfer and vice-versa). How can you guarantee that the good results in backward transfer do not require sacrificing forward transfer, or even just the fact of learning the new task reasonably well?

**Our Response:** Thanks for this suggestion. In this rebuttal period, we have added experiments to evaluate the forward knowledge transfer (FWT) performance. As shown in the table below, we compare the FWT performance of our LMSP approach against GPM, TRGP, and CUBER, the works most closely related to ours, on four public datasets. The value for GPM is zero because we treat GPM as the baseline and report the relative FWT improvement over it. The table shows that the FWT performance of our LMSP approach beats TRGP and CUBER (the two most closely related state-of-the-art methods) on PMNIST, CIFAR-100 Split, and 5-Dataset, and is comparable to TRGP and CUBER on MiniImageNet. Clearly, the good BWT performance of our LMSP method is **not** achieved at the cost of sacrificing FWT performance.

| FWT (%) | PMNIST | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- | --- |
| GPM | 0 | 0 | 0 | 0 |
| TRGP | 0.18 | 2.01 | 1.98 | 2.36 |
| CUBER | 0.80 | 2.79 | 1.96 | **3.13** |
| **LMSP (Ours)** | **0.92** | **2.89** | **2.43** | 2.79 |

<!-- <span style="color:purple">[Jin: The TRGP and CUBER papers already report FWT; we just copied the numbers.]</span> -->

<!--
| ACC (%) | PMNIST | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- | --- |
| TRGP | 96.26 | 74.98 | 92.41 | 64.46 |
| CUBER | 97.04 | 75.29 | 92.85 | 63.67 |
| LMSP (forward-only) | 97.42 | 74.82 | 92.78 | 63.90 |
| LMSP | 97.48 | 74.21 | 93.78 | 64.20 |

| BWT (%) | PMNIST | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- | --- |
| TRGP | -1.01 | -0.15 | -0.08 | -0.89 |
| CUBER | -0.11 | 0.14 | -0.13 | 0.11 |
| LMSP (forward-only) | -0.10 | -0.09 | -0.13 | -0.35 |
| LMSP | 0.16 | 0.94 | 0.07 | 1.55 |
-->

<!-- We conducted this experiment as an ablation study, and these new results are provided below. It shows that forward transfer only with LMSP also provides competitive results. We list the ACC and BWT below; LMSP (forward-only) is our approach with only forward knowledge transfer. The baseline TRGP only adopts forward knowledge transfer, while CUBER adopts both forward and backward knowledge transfer. -->

-------------

> **Your Comment 2:** Can you provide actual performance numbers for forward transfer?

**Our Response:** Thanks for the suggestion. Please see the table above.

-------------

## Reviewer CGvu

> **Your Comment 1:** The problem definition, framework, and convergence analysis of this work are derived from existing work. While the efficiency approach is intuitive and easy to understand, its novelty causes me concern.

**Our Response:** Thanks for the comments. We would like to point out that, although the continual learning (CL) problem definition and framework in this paper are not completely new, this does *not* mean that there is no more research to be done. CL is a broad and very active research field (see [R1] for the latest survey), yet a large number of fundamental problems remain open and the performance of existing CL methods is still far from satisfactory. In this paper, we observe that even the state-of-the-art orthogonal-projection-based CL approaches (TRGP and CUBER) suffer from *high computational complexity* due to their use of expensive SVD operations. This problem is further exacerbated by the ever-growing size and depth of vision and language models in the CL regime (i.e., sequential multi-task training). Therefore, our goal in this paper is to develop **new** orthogonal-projection-based CL methods with significantly lower computational complexity. Toward this end, we propose the local model space projection (LMSP) approach, which is **new** in the CL literature. Regarding the convergence analysis of our LMSP method, it is true that our proof follows the general framework of first-order optimization convergence analysis: bound the one-step descent, telescope it, and rearrange to obtain a stationarity-gap bound. However, the similarity to existing analyses ends there. The complications arising from the use of local model projection approximations render our convergence proof significantly different from those of existing CL methods. Specifically, our proof involves a new notion called "local relative orthogonality" (see Definition 5 in our revised paper). Theorem 1 establishes the convergence of our local algorithm, and Theorem 2 proves that, under these conditions and Definition 5, our LMSP can achieve even better results than its global counterparts (CUBER and TRGP).

[R1] L. Wang et al., "A Comprehensive Survey of Continual Learning: Theory, Method and Application," arXiv:2302.00487.

<!-- <span style="color:blue">[Kevin: Starting from here, it may be better to pinpoint which parts of our proof in the appendix (e.g., Lemma XXX, Proposition XXX, or Eq. (XXX)) differ from existing works and provide some detailed explanations.]</span> <span style="color:purple">[Jin: added some sentences below.]</span> -->
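For reference, the generic first-order template we refer to above is standard; it is only a template, as the paper's actual bound additionally carries the local-projection error terms and Definition 5:

```latex
% Descent lemma for L-smooth f with step size \eta \le 1/L:
%   f(W_{t+1}) \le f(W_t) - (\eta / 2) \, \|\nabla f(W_t)\|^2.
% Telescoping over t = 0, ..., T-1 and rearranging yields the stationarity gap:
\min_{0 \le t < T} \|\nabla f(\mathbf{W}_t)\|^2
  \;\le\; \frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(\mathbf{W}_t)\|^2
  \;\le\; \frac{2\,\bigl(f(\mathbf{W}_0) - f^{*}\bigr)}{\eta\, T}.
```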
-------------

> **Your Comment 2:** The authors use local low-rank matrices defined by anchor points to approximate each layer parameter matrix. However, the accuracy of this approximation, and in particular how it is affected by $m$, is not discussed. Moreover, the proposed framework and analysis also ignore this issue.

**Our Response:** Thanks for your comments. In Section 5, we provide in-depth ablation studies on the impact of different rank values $r$ in the low-rank approximations and different numbers of anchor points $m$ on learning accuracy (ACC) and backward knowledge transfer (BWT). More specifically, when using local model approximation to reduce computational complexity, some information loss relative to the original learning model is inevitable. Our goal in this paper is to reduce the computational complexity without sacrificing too much performance. <!-- <span style="color:red">However, the prediction results of the new task are affected not only by the top-K correlated old tasks' model weights but also by learning with the new task's data, so a theoretical analysis of this becomes non-trivial. However, we do show some theoretical analysis of the scenario in which our approach could get even better results. Also, the new task could find more correlated candidate old tasks with more anchor points selected, which leads to better knowledge transfer from old tasks.</span> --> Also, we did not ignore the impact of $m$ in our theoretical analysis, since these local model approximation errors are already implicitly captured in $\bar{\mathbf{g}}_i(\mathbf{W})$. As shown in the paper, the local model approximation affects the convergence analysis through $\bar{\mathbf{g}}_i(\mathbf{W})$, $i = 1, 2$. Thus, by choosing the top-$K$ correlated local tasks, the more anchor points we use, the smaller the approximation error of $\bar{\mathbf{g}}_i(\mathbf{W})$ relative to its true counterpart $\ddot{\mathbf{g}}_i(\mathbf{W})$, $i = 1, 2$. Moreover, with the approximation error bound from [R2], we can theoretically characterize the impact of $m$ in our analysis.

[R2] J. Lee et al., "Local Low-Rank Matrix Approximation," in Proc. ICML 2013.

<!-- <span style="color:blue">[Kevin: Jin, the paragraph above doesn't seem very relevant to the reviewer's question, in my opinion. I quickly went over the proof. It seems to me that the local model approximation should affect the convergence analysis through $\bar{\mathbf{g}}_i(\mathbf{W})$, $i=1,2$. The more anchor points you have, the smaller the approximation error of $\bar{\mathbf{g}}_i(\mathbf{W})$ compared to their true versions $\ddot{\mathbf{g}}_i(\mathbf{W})$, $i=1,2$ (note: I don't like the use of "double dots" here, since this notation is usually reserved for second-order derivatives (Newton's notation)). Therefore, I think you can say "We did not ignore the impacts of $m$ in our analysis, since these local model approximation errors have already been implicitly captured in $\bar{\mathbf{g}}_i(\mathbf{W})$. Moreover, with the approximation error bound from [Lee et al. ICML'13], we can theoretically characterize the impact of $m$ in our analysis."]</span> <span style="color:purple">[Jin: added some sentences like this below.]</span> -->
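To illustrate the qualitative trend for the reviewer (more anchor points, smaller approximation error), the toy experiment below applies a rank-$r$ approximation to each of $m$ row blocks of a random matrix; the block split and the sizes are our stand-ins, not the kernel-smoothed local models of [R2]:

```python
import numpy as np

def blockwise_error(W, m, r):
    """Relative Frobenius error when each of m row blocks gets a rank-r SVD."""
    err2 = 0.0
    for B in np.array_split(W, m, axis=0):
        U, S, Vt = np.linalg.svd(B, full_matrices=False)
        err2 += np.linalg.norm(B - (U[:, :r] * S[:r]) @ Vt[:r]) ** 2
    return np.sqrt(err2) / np.linalg.norm(W)

W = np.random.default_rng(0).standard_normal((512, 512))
for m in (1, 2, 4, 8, 16):
    print(m, round(blockwise_error(W, m, r=5), 4))  # error shrinks as m grows
```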
-------------

> **Your Comment 3:** The author introduces LLRA to improve computational efficiency. However, they do not perform experiments to evaluate the computational complexity and specifically do not show the saved wall-clock time compared with the LRA method.

**Our Response:** Thanks for your comments. We note that we do have experimental results evaluating the computational complexity of our LMSP approach. In the table below, we summarize the wall-clock training times of our LMSP algorithm and several baselines, normalized by the wall-clock training time of GPM (additional wall-clock training time results can also be found in [R3]). Here, we set the rank $r$ to 5 for each local model. The wall-clock time of our LMSP method with *only one anchor point* already reduces the total wall-clock training time of CUBER by 86% on average. Moreover, because our LMSP approach admits a distributed implementation that runs different local models in parallel, the total wall-clock training time with $m$ anchor points is similar to the single-anchor-point case above.

| Training time (normalized to GPM) | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- |
| OWM | 2.41 | - | - |
| EWC | 1.76 | 1.52 | 1.22 |
| HAT | 1.62 | 1.47 | 0.91 |
| A-GEM | 3.48 | 2.41 | 1.79 |
| ER_Res | 1.49 | 1.40 | 0.82 |
| GPM | 1.00 | 1.00 | 1.00 |
| TRGP | 1.65 | 1.21 | 1.34 |
| CUBER | 1.86 | 1.55 | 1.61 |
| **LMSP (Ours)** | **0.24** | **0.42** | **0.18** |

[R3] G. Saha et al., "Gradient Projection Memory for Continual Learning," in Proc. ICLR 2021.

<!-- <span style="color:red">As we already show the complexity in the paper, we didn't list the exact experiment running time here, considering it is also affected by how the parallel mechanism is implemented on distributed machines and how the results are gathered together.</span> <span style="color:blue">[Kevin: Jin, I remember we do have experimental results for wall-clock time, right? If not, I still suggest we provide some values here (I believe this should be pretty easy). Having this will help significantly to convince the reviewer.]</span> -->

<!-- As mentioned in the paper, our LMSP approach reduces the computational complexity of projections from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ compared with baseline methods such as TRGP and CUBER. -->

-------------

> **Your Comment 4:** The author states that there is no significant difference between the two methods in selecting anchor points. Can you give some intuitive explanation?

**Our Response:** Thanks for the suggestion. As discussed in the paper, if the new task is strongly correlated with some old tasks, there should be better knowledge transfer from those correlated old tasks to the new task. Thus, performance largely relies on finding correlated tasks. Assuming the data is not biased, simply choosing enough anchor points provides enough candidates for the new task to choose from, regardless of how the anchor points are selected.

-------------

> **Your Comment 5:** Is there some relationship between ranking and the number of anchors?

**Our Response:** Thanks for your question. In theory, the rank and the number of anchor points can be chosen independently and arbitrarily in our LMSP approach. In the extreme case where we use full rank and a single anchor point, our LMSP method reduces to the CUBER baseline. In practice, we prefer a small rank value, since it significantly reduces the computational complexity. Also, when computational resources permit, choosing more anchor points is preferable, since this yields a better approximation of the original model. Moreover, since the local model approximations can run in parallel (thanks to our LMSP method's distributed implementation), having more anchor points does not significantly increase the wall-clock time.
-------------

## Reviewer apk6

> **Your Comment 1:** This paper writing needs to be further improved. It would be better to directly state the intuitive idea and its illustration. This would make the main idea clearer and easier to understand.

**Our Response:** Thanks for your suggestions. We fully agree that adding more discussion of the intuition and rationale behind our proposed LMSP approach makes the presentation of our key idea clearer and easier to understand. In this revision, we have added such a discussion to the introduction to further clarify our key idea: use local low-rank approximation to reduce the complexity of continual learning with forward and backward knowledge transfer, while not sacrificing too much performance.

<!-- <span style="color:purple">[Jin: Do you mean the location in this rebuttal? If so, it should be the answer to the first question from the 3rd reviewer.]</span> -->

-------------

> **Your Comment 2:** The authors argue that SVD decomposition is computationally costly. This is true but it seems not an important problem in GPM since SVD decomposition only happens after finishing training each task, not every iteration. Therefore, the computation cost of SVD decomposition is minor compared to the overall training cost.

**Our Response:** Thanks for your comments. The reviewer is correct that one round of layer-wise SVD operations is performed in GPM after finishing the learning of a new task. In fact, all orthogonal-projection-based CL approaches (not only GPM, but also TRGP, CUBER, and our LMSP method) perform SVD once per layer after the training of each task. However, such a one-round, layer-wise, per-task SVD is **not** necessarily cheap; it remains highly expensive. Specifically, SVD must be performed for every layer, and with the ever-increasing widths and depths of large deep learning models, computing even one layer's SVD becomes more and more difficult due to the $\mathcal{O}(n^3)$ complexity as the layer width $n$ grows. On the other hand, the training cost of each task is **not** necessarily higher than that of the SVDs, since the total number of iterations of most first-order methods typically does *not* scale with the model size/dimension. In our experiments, we find that the SVD processing time is significantly higher than that of the other components of the model. This is also evidenced by our newly added wall-clock-time comparison in the table in our response to your Comment 3 below, which shows that the wall-clock training time of our LMSP algorithm is much **shorter** than those of the baselines with full SVDs. All these results and analyses suggest that the computational complexity of SVD is the pain point of orthogonal-projection-based CL approaches. In addition, CUBER (Lin et al., 2022) also mentions that running a full SVD is time-consuming, which is consistent with our observations.

<!-- Thanks for your comments. We're not directly improving GPM. GPM does not achieve the best performance among the baseline methods and is weaker than TRGP and CUBER in our experiments. The most related work to our approach is CUBER (Beyond Not-Forgetting: Continual Learning with Backward Knowledge Transfer), and the authors of CUBER also mention that running a full SVD is a limitation of CUBER, since it extracts the bases of the task subspaces via SVD and may thus incur high computational cost for high-dimensional data. <span style="color:blue">[Kevin: Jin, we need to be careful here. I think the reviewer has a valid point. In fact, GPM, TRGP, CUBER, and our LMSP methods all perform SVD once after finishing the learning of each task. I think we need to argue that even doing one SVD operation per task is still highly expensive, particularly with the ever-increasing sizes of large and deep learning models. On the other hand, the training cost of each task is not necessarily higher than SVD, as the total number of iterations of most first-order methods typically doesn't depend on the model size/dimension.]</span> <span style="color:purple">[Jin: agree, added some sentences below.]</span> -->
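A quick way to see the cubic blow-up described above (timings are machine-dependent; the sizes are illustrative, chosen only to expose the scaling):

```python
import time
import numpy as np

# Cost of one full SVD as the layer width n doubles: roughly 8x per doubling,
# consistent with the O(n^3) complexity of SVD.
for n in (256, 512, 1024, 2048):
    W = np.random.randn(n, n)
    t0 = time.perf_counter()
    np.linalg.svd(W)
    print(f"n = {n:5d}: {time.perf_counter() - t0:.3f} s")
```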
-------------

> **Your Comment 3:** The authors state that their method could reduce the complexity of SVD basis computation, but there is no empirical evaluation of the overall training efficiency improvement with the proposed method compared to the GPM itself.

**Our Response:** Thanks for your comments. In this rebuttal period, we have added a set of experiments evaluating the wall-clock training time of our LMSP approach against several closely related baselines. In the table below, we summarize the wall-clock training times of our LMSP algorithm and several baselines, normalized by the wall-clock training time of GPM (additional wall-clock training time results can also be found in [R1]). Here, we set the rank $r$ to 5 for each local model. The wall-clock time of our LMSP method with *only one anchor point* already reduces the total wall-clock training time of CUBER by 86% on average. Moreover, because our LMSP approach admits a distributed implementation that runs different local models in parallel, the total wall-clock training time with $m$ anchor points is similar to the single-anchor-point case above.

| Training time (normalized to GPM) | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- |
| OWM | 2.41 | - | - |
| EWC | 1.76 | 1.52 | 1.22 |
| HAT | 1.62 | 1.47 | 0.91 |
| A-GEM | 3.48 | 2.41 | 1.79 |
| ER_Res | 1.49 | 1.40 | 0.82 |
| GPM | 1.00 | 1.00 | 1.00 |
| TRGP | 1.65 | 1.21 | 1.34 |
| CUBER | 1.86 | 1.55 | 1.61 |
| **LMSP (Ours)** | **0.24** | **0.42** | **0.18** |

[R1] G. Saha et al., "Gradient Projection Memory for Continual Learning," in Proc. ICLR 2021.

<!-- Our LMSP approach reduces the complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ compared with baseline methods such as TRGP and CUBER. -->

-------------

> **Your Comment 4:** From the empirical results, LMSP improves the backward transfer, but the overall accuracy drops in some cases. The paper states that LMSP can improve both the forward and backward transfer, which does not support the claim.

**Our Response:** Thanks for your comments. We would like to clarify that, due to the information loss of using local model approximation in our LMSP method, the overall accuracy of LMSP may occasionally be lower than that of other baseline methods. However, we want to emphasize that our goal in this paper is to significantly reduce the computational complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ by using local model approximation, even though this could lead to a slight performance loss. In other words, we pursue *low-complexity CL algorithmic design* by potentially and slightly trading off learning performance. Also, in this rebuttal period, we have added experiments evaluating the forward knowledge transfer (FWT) performance. As shown in the table below, we compare the FWT performance of our LMSP approach against GPM, TRGP, and CUBER, the works most closely related to ours, on four public datasets. The value for GPM is zero because we treat GPM as the baseline and report the relative FWT improvement over it. The table shows that the FWT performance of LMSP outperforms TRGP and CUBER (the two most closely related state-of-the-art methods) on PMNIST, CIFAR-100 Split, and 5-Dataset, and is comparable to TRGP and CUBER on MiniImageNet. This shows that LMSP does improve both FWT and BWT in most cases.

| FWT (%) | PMNIST | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- | --- |
| GPM | 0 | 0 | 0 | 0 |
| TRGP | 0.18 | 2.01 | 1.98 | 2.36 |
| CUBER | 0.80 | 2.79 | 1.96 | **3.13** |
| **LMSP (Ours)** | **0.92** | **2.89** | **2.43** | 2.79 |

<!--
Thanks for the feedback. Using local models to approximate the global one does necessarily lose some information. We want to clarify that our LMSP aims to reduce the complexity without sacrificing too much performance. Considering the huge complexity reduction from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$, we think the minor performance loss is acceptable. We also provide some theoretical analysis of the scenario in which our approach could get even better results. We add an additional ablation study here to show our LMSP with only forward knowledge transfer, which provides more evidence that our approach sacrifices neither forward nor backward knowledge transfer. We list the ACC and BWT below; LMSP (forward-only) is our approach with only forward knowledge transfer. The baseline TRGP only adopts forward knowledge transfer, while CUBER adopts both forward and backward knowledge transfer.

| ACC (%) | PMNIST | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- | --- |
| TRGP | 96.26 | 74.98 | 92.41 | 64.46 |
| CUBER | 97.04 | 75.29 | 92.85 | 63.67 |
| LMSP (forward-only) | 97.42 | 74.82 | 92.78 | 63.90 |
| LMSP | 97.48 | 74.21 | 93.78 | 64.20 |

| BWT (%) | PMNIST | CIFAR-100 Split | 5-Dataset | MiniImageNet |
| --- | --- | --- | --- | --- |
| TRGP | -1.01 | -0.15 | -0.08 | -0.89 |
| CUBER | -0.11 | 0.14 | -0.13 | 0.11 |
| LMSP (forward-only) | -0.10 | -0.09 | -0.13 | -0.35 |
| LMSP | 0.16 | 0.94 | 0.07 | 1.55 |
-->
-------------

## Reviewer CGvu (2nd Reply)

> **Your Comment 1:** To clarify, I did not intend to suggest that the field of continual learning has been fully explored. I agree with Reviewer apk6 regarding the presentation of the manuscript. A concise explanation of existing frameworks and a focused discourse on the unique aspects of your method, would enhance the paper's readability and effectively highlight its novel contributions.

**Our Response:** Thanks for agreeing that the field of continual learning still has many open and foundational problems to study. In this revision, we have added discussion highlighting the differences and novelty of our work in the introduction (in **blue**); please see our revised submission.

-------------
> **Your Comment 2:** Regarding the relationship between rank and the number of anchor points, my understanding is as follows: The rank reflects the number of local modes within the representation matrix. The number of anchor points influences the accuracy of this local approximation. For matrices with few information, a lower rank suggests that fewer anchor points are needed to accurately represent the information. Both your experiments and response suggest that these two elements function independently.

**Our Response:** Thanks for your comments. It appears that there are still some misunderstandings about the local model approximation approach. We want to emphasize that the number of anchor points $m$ and the rank $r$ are indeed two **independent** parameters. To perform local low-rank approximation (cf. [Lee et al., ICML 2013]), one first decides the number of anchor points $m$, which defines the number of local models into which the original high-rank matrix is projected (for a visual illustration, see Fig. 1 of [Lee et al.] at https://jmlr.org/papers/volume17/14-301/14-301.pdf). Then, for each projected local matrix (i.e., each local model), we compute a rank-$r$ approximation, where the rank $r$ is a separate parameter. From this procedure, it can be seen that $m$ and $r$ are chosen **independently** ($m$ is not determined by $r$, nor vice versa), and they jointly determine the overall accuracy of the local low-rank approximation: the more anchor points $m$ in use and the higher the chosen rank $r$, the smaller the overall approximation error. In the special case of full rank and a single anchor point, our LMSP-based CL method reduces to the CUBER baseline. As mentioned in our first response, in practice we prefer a small rank $r$, since it significantly reduces the computational complexity. Also, subject to computational resource limits, more anchor points are preferable, since they yield a better approximation of the original model. Moreover, since the local model approximations can run in parallel (thanks to our LMSP method's distributed implementation), having more anchor points does not significantly increase the wall-clock time.

<!-- The rationale for using local low-rank approximation in our LMSP-based CL algorithm is as follows: since performing SVD on a high-rank, large representation matrix is computationally costly, we project this high-rank representation matrix onto several local low-rank matrices defined by a set of anchor points. Each projected matrix can be viewed as the collection of entries neighboring its anchor point and should have lower rank, since entries far from the anchor point become relatively small under the smoothing kernels. Our LMSP-based method performs orthogonal-projection-based CL on these low-rank local models, so the computation cost can be dramatically reduced, while more anchor points further reduce the error. Since the anchor points can be processed in parallel, the total processing time with multiple local models is similar to that of a single low-rank local model. This explains why the rank and the number of anchor points can indeed be set independently. -->
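To make the two-knob structure explicit, the toy sketch below (our illustration under simplifying assumptions: random anchor rows and a hard nearest-anchor assignment in place of the kernel smoothing of [Lee et al.]) builds $m$ local models of rank $r$, where $m$ and $r$ are set independently:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
m, r = 4, 5                                  # two independent knobs

# 1) Pick m anchor rows; assign every row to its nearest anchor (a smoothing
#    kernel would replace this hard assignment in a faithful implementation).
anchors = rng.choice(W.shape[0], size=m, replace=False)
dists = np.linalg.norm(W[:, None, :] - W[anchors][None, :, :], axis=2)
assign = dists.argmin(axis=1)

# 2) Fit a rank-r model to each anchor's local matrix, then reassemble.
W_hat = np.zeros_like(W)
for t in range(m):
    rows = np.where(assign == t)[0]
    U, S, Vt = np.linalg.svd(W[rows], full_matrices=False)
    W_hat[rows] = (U[:, :r] * S[:r]) @ Vt[:r]

print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```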
-------------

> **Your Comment 3:** Concerning the two methods of selecting anchor points: random selection may result in points that are too similar or possess overlapping information, whereas pre-clustering to find centroids is likely to provide a more distinct and diverse representation. I am unsure why both methods are deemed equally viable. Additionally, I'm curious about the role of data bias in relation to these selection methods.

**Our Response:** Thanks for your comments. We agree that pre-clustering to find centroids is likely to provide a more distinct and diverse representation, as has also been shown in works such as [R1], where the reported accuracy improvement is around 2% on MovieLens-1M. However, as long as the randomly chosen anchor points are relatively uniform, the empirical difference between the two selection methods is not significant in our numerical experience. Considering the additional computational cost introduced by clustering methods (e.g., k-means), such a marginal improvement (at least in the CL applications and experiments we conducted) may not justify the extra cost of pre-clustering. We therefore adopted random anchor point selection in our experiments for its lower implementation complexity. That said, we do not rule out the possibility that pre-clustering is preferable in other applications of local low-rank approximation, but this is beyond the scope of the continual learning applications we focus on in this paper.

[R1] M. Zhang et al., "Local Low-Rank Matrix Approximation with Preference Selection of Anchor Points," in Proc. WWW 2017.

-------------

## To Area Chair

Dear Area Chair:

We thank you and all the reviewers for the constructive and insightful comments! We are writing to express our concerns about the quality of the review provided by Reviewer ZQXi.

The main contribution of our paper is a local model space projection (LMSP) approach that avoids full SVD operations, significantly reducing the computational complexity of orthogonal-projection-based continual learning (CL) while achieving strong forward and backward knowledge transfer performance. However, some comments from Reviewer ZQXi completely miss the point of this paper and suggest an incorrect basic understanding of CL.

In particular, Reviewer ZQXi's second comment reads: *"The proposed approach relies on the task information, which can not be used in task-free continual learning."* We point out that not only is this comment unfair, it also shows that Reviewer ZQXi does not have a correct understanding of "task-free continual learning." First, we emphasize that our focus in this paper is the task-based CL setting, i.e., tasks arrive at the learner sequentially with clear task boundaries (see, e.g., [R1] for a description of this standard setting). The task-based CL setting, although under active research in recent years, still exhibits unsatisfactory performance. More importantly, our work focuses on the orthogonal-projection-based approach to task-based CL, which requires the **least** (in fact, almost zero) amount of task information, because orthogonal-projection-based CL methods do *not* need to store any old tasks' data; all that is needed is to compute the new null space of the model parameters upon finishing the learning of the previous task. Reviewer ZQXi appears unaware of this basic property of orthogonal-projection-based CL, and the comment "The proposed approach relies on the task information" is inaccurate. Moreover, "task-free continual learning" is a newer CL paradigm referring to CL systems with no clear boundaries between tasks, where the data distributions of tasks change gradually and continuously (see [R2] for a detailed description).
Clearly, task-free CL is a more complex CL paradigm and requires even more information from the data of earlier tasks. Reviewer ZQXi appears not to understand this basic characteristic of task-free continual learning and incorrectly assumed that task-free CL should not use task information.

[R1] L. Wang et al., "A Comprehensive Survey of Continual Learning: Theory, Method and Application," arXiv:2302.00487.
[R2] R. Aljundi et al., "Task-Free Continual Learning," in Proc. CVPR 2019.

In contrast to the comments by Reviewer ZQXi, the other reviewers acknowledged our work: *"The paper proposes an original method that exploits the idea of orthogonal projections to learn new tasks whilst controlling forgetting and encouraging forward and backward transfer"* (Reviewer Shvn); *"this study presents a novel local model space projection approach, optimizing continual learning"* (Reviewer CGvu). Moreover, Reviewer CGvu raised his/her score after we clarified his/her doubts.

We have made a sincere attempt to address all the reviewers' comments and have incorporated their suggestions. However, we are unsure how to appropriately address Reviewer ZQXi's comments, and we would like to raise our concerns about the bias and limited understanding of CL in Reviewer ZQXi's review. We hope the AC weighs the varying quality of the reviews and conducts a fair and fruitful post-rebuttal discussion.

Thank you,
The Authors
