# Mod-Squad V2 Rebuttal
## General Response
We sincerely appreciate all reviewers’ time and efforts in reviewing our paper. We are glad to find that reviewers generally recognized our contributions:
* **Problem.** MTHL is not a new setting but it is practical, important and valuable [2zbi, mA5t, MBQv, o6VS, aYYd].
* **Method.** The proposed method is efficient [2zbi, mA5t, MBQv, o6VS, aYYd]. The modular design allows for easy expansion, efficient adaptation, and extension to the continual learning setting [mA5t, MBQv, o6VS, aYYd].
* **Experiments.** The proposed method obtains good and convincing performance with many popular network architectures and different datasets [mA5t, MBQv, o6VS, aYYd].
* **Writing.** The paper is clearly presented and easy to follow [2zbi, mA5t, MBQv, o6VS, aYYd].
We also thank all reviewers for their insightful and constructive suggestions, which help a lot in further improving our paper. In addition to the pointwise responses below, we summarize the supporting experiments added in the rebuttal according to reviewers' suggestions.
**New Experiments**
* Extending our model to depth estimation, normal estimation, and keypoint detection [MBQv]
* Increasing attention experts in the continual learning setting [o6VS]
* Incorporating more advanced techniques for detection [2zbi]
* Self-distillation experiments with pseudo labels [mA5t]
* Incorporating other multi-task models into our MTHL framework [mA5t]
* Comparing with multi-task methods: [MuST](https://arxiv.org/pdf/2108.11353.pdf) and [XLearner](https://arxiv.org/pdf/2203.08764.pdf) [MBQv, 2zbi]
* An ablation study on the approximation for probability estimation [aYYd]
* Releasing the link to our anonymous code in a separate "official comment" to the AC, as per NeurIPS regulations [o6VS]
We hope our pointwise responses below clarify the reviewers' concerns. We thank all reviewers again for their time.
## Reviewer 2zbi (3)
Thank you very much for the constructive comments.
**Q1. [Similar to Mod-Squad.]**
> 1.1 The novelty compared with the Mod-Squad work is small.
>
Mod-Squad is a recently proposed multi-task training framework based on mixture of experts. It has shown promising results on multi-task multi-label training. However, several limitations of Mod-Squad restrict its practical usage in real-world computer vision and robotics applications: 1) it heavily relies on multi-label datasets, 2) it oversimplifies task-specific network design, and 3) it is hard to generalize to downstream tasks/datasets. Mod-Squad is more like **a proof of concept on an ideal multi-label dataset** created by pseudo-labels, not a general-purpose vision backbone that can be directly used in real-world applications. In contrast, our work focuses on a general-purpose vision backbone, and our efforts really **push it to the limit for practical application**. One piece of supporting evidence is the downstream performance (MTHL vs. Mod-Squad) shown in Tab.2: Mod-Squad is worse than the ImageNet-pretrained model, and MTHL beats both of them.
We claim the major contribution compared to Mod-Squad in L36: “However, Mod-Squad oversimplifies some task-specific network designs and the success of this model heavily relies on multi-label datasets, which are difficult to obtain and scale up. Therefore, it remains unclear in Mod-Squad:
1) How to **scale up this MTL model for multi-task heterogeneous training on conventional computer vision datasets**;
2) Whether this model can be utilized **as a general-purpose vision backbone that can be easily adapted to many downstream tasks**;
3) Whether we can **leverage the success of single-task methods instead of removing complicated modules and simplifying the task-specific sub-network**.”
> 1.2 The Mod-Squad work can handle the so-called "heterogeneous" cases as well.
>
While it may be true that GPT is just an extension of transformer models, it is non-trivial to extend successful work from a proof of concept to real-world applications. Similarly, we agree that Mod-Squad or any multi-task model can be extended to MTHL, but we cannot ignore the effort required to explore the scaling laws of multi-task models.
Besides, our work **calls for a paradigm change in computer vision**: the commonly used pre-train then finetune scheme could be replaced by multi-task heterogeneous training. We believe this is a promising direction toward a general-purpose vision foundation model. **We do not claim the technical difference compared to Mod-Squad as our novelty.** Instead, our novelties lie in several aspects, as mentioned in the paper:
1. "exploring how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes" (L8),
2. "demonstrate strong generalization on downstream tasks" (L16),
3. "due to its emergent modularity, this general-purpose model decomposes into high-performing components, efficiently adapting to downstream tasks"(L17),
4. "easy expansion in continual-learning-without-forgetting scenarios"(L20).
This work therefore addresses the abovementioned problems. We hope this rebuttal clarifies our contributions in relation to Mod-Squad and highlights the unique aspects of our work.
**Q2. [The results (e.g. on COCO detection) seem to be under-par compared to the state-of-the-art performances.]** \
The premise of this question is that a fair comparison has been established between our model and state-of-the-art detection methods, which is not the case. In short, **the methods on the COCO leaderboard use external large-scale datasets and huge models**. For example, InternImage-H [E] achieves 65.0 mAP by pre-training on **ImageNet-22K** and fine-tuning, together with the DINO detector [A], on **Objects365** and COCO. Also, InternImage-H has **2.18B params** while Swin-B has 86.7M params.
In contrast to those state-of-the-art works, which rely on various modules and datasets specially designed for detection, our work is a general multi-task backbone model with a simpler decoder and supervised losses, no bells and whistles.
- Our model employs a baseline decoder, UperNet, which may appear less powerful compared to state-of-the-art methods that use more advanced decoders such as DINO [A].
- We also do not leverage self-supervision (such as contrastive learning, mask modeling, and autoregressive losses) or visual-language pre-training techniques used in SOTA models (like DINO [A], GLIP [B,C], and Florence [D]).
- While SOTA models do achieve higher performance by employing complex modules, tricks, and **large-scale pretraining datasets**, we intentionally use a baseline decoder to demonstrate the effectiveness of our approach on different vision tasks and datasets.
By utilizing a baseline decoder and a simpler pipeline, we aim to demonstrate the generality of our method across multiple tasks. In this way, our work is **apples-to-apples comparable to the backbone models listed in our paper, DaViT and Swin Transformer, with the only difference being multi-task heterogeneous training**. Using the same UperNet decoder and without any other tricks, our model achieves comparable or even better performance than the original DaViT and Swin Transformer, validating the effectiveness of our training framework.
Moreover, based on your suggestion, we conduct an experiment using InternImage-L with Cascade Mask R-CNN as the decoder and perform multi-task heterogeneous training on our pre-training dataset (**no ImageNet-22K or Objects365 due to computing resource limitations**). The results are shown in the table below. They demonstrate that our algorithm is orthogonal to SOTA decoder designs and can be further improved with additional training strategies and modules.
| Method | Backbone | Decoder | Params(M)| COCO Detection mAP |
| ----------| ----------| ------- |------- |------- |
| Pre-train then Finetune | Swin-S | UperNet | 48.9 | 42.0 |
| MTHL.D | Swin-S | UperNet | 48.9 | 45.0 |
| Pre-train then Finetune | InternImage-L | Cascade Mask R-CNN | 223 | 56.1 |
| MTHL.D | InternImage-L | Cascade Mask R-CNN | 223 | 57.7 |
[A] Zhang, Hao, et al. "DINO: DETR with improved denoising anchor boxes for end-to-end object detection." ICLR 2023.
[B] Li, Liunian Harold, et al. "Grounded language-image pre-training." CVPR 2022.
[C] Zhang, Haotian, et al. "GLIPv2: Unifying localization and vision-language understanding." NeurIPS 2022.
[D] Yuan, Lu, et al. "Florence: A new foundation model for computer vision." arXiv preprint.
[E] Wang, Wenhai, et al. "InternImage: Exploring large-scale vision foundation models with deformable convolutions." CVPR 2023.
\
Thanks again for your time and effort! For any other questions, please feel free to let us know during the rebuttal window.
## Reviewer mA5t (3)
Thank you for listing a great number of valuable related works. We will include all of them in the revision!
**Q1. [The setting explored in this paper is not new. The method is limited to Mod-Squad.]**
We agree that there are works related to multi-task heterogeneous training, and **we do not claim this is a new problem/setting**. In the revision, we will include a comprehensive discussion and comparisons with these related methods.
A key difference of our work compared to prior work is that we **push the limit of multi-task models toward practical application and explore the scaling laws of multi-task MoE models**. This is supported by the following evidence:
1) one model to perform three tasks can be on par with single-task state-of-the-art (L63),
2) directly compared with the commonly used pre-train then finetune training scheme on both pre-training and downstream tasks (Tab.1 and Tab.2),
3) truly efficient adaptation to downstream tasks without performance drop as a general-purpose vision backbone (Tab.3), and
4) scaling up on mainstream vision datasets (L8).
To achieve this goal, we borrow the success in multi-task learning (Mod-Squad) to alleviate task conflicts and single-task learning to incorporate task-specific modules. "Our multi-task heterogeneous training is a general framework that is orthogonal to model architecture design. All Transformer or MLP-based structures are applicable" (L133).
MTHL is not limited to a certain multi-task model (e.g., Mod-Squad). For example, our approach can also incorporate MuST [ref4] by taking teachers' pseudo-labels as ground truth on several datasets. The results are shown below:
|Backbone | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | ------- |
|Swin-T| MuST | 79.1 | 41.1 | 41.8 |
|Swin-T| MTHL w/ MuST | **79.7** | **43.6** | **42.9** |
In summary, **we do not focus on a certain method or improve on a specific problem setting**. The scope of this paper is to explore the scaling laws of multi-task MoE models and new training schemes for a general-purpose vision backbone model. While it is true that existing works may have similar settings and architectures, this does not impact the main contribution of our paper.
**Q2. [Compare with multi-task models (CNN-based and Transformer-based) to verify the effectiveness.]** \
Good suggestion! [ref10, ref11, ref18, ref19] are conducted on different datasets with different tasks and backbones with specific task designs (e.g., task-specific preliminary decoders in ref11), so it is hard to make an apples-to-apples comparison. We do our best and compare with XLearner [ref2] and MuST [ref4]. Note that for XLearner, the comparison is not entirely fair since XLearner is CNN-based; we include it to provide some insight into the performance of different architectures. The comparison with MuST is fairer as we use the same network architecture and dataset, demonstrating the advantages of our MTHL approach.
|Backbone | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | ------- |
| ResNet-50 | XLearner| 77.3 | 39.9 | 40.3 |
| Swin-T | MuST | 79.1 | 41.1 | 41.8 |
|Swin-T| MTHL.D | 79.7 | 43.8 | 44.4 |
|Swin-T| MTHL | **80.3** | **45.0** | **44.6** |
**Q3. [Including a more complete related literature review and discussion.]** \
Thanks for listing all the related work in an organized order. We will include all of them in the revision to provide a better context for our work!
**Q4. [Leveraging information of unlabeled tasks could be interesting.]** \
That is an interesting direction! We have explored activating our detection head and segmentation head on ImageNet to do further self-distillation with pseudo labels to see if it improves detection and segmentation. The results are shown in the table below. Further extension will be left as our future work.
|Backbone | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | ------- |
|Swin-T| MTHL | **80.3** | **45.0** | 44.6 |
|Swin-T| MTHL w/ self-distillation | 79.8 | 44.6 | **44.8** |
**Q5. [Do the authors also consider other fundamental and structured problems like depth estimation, surface normal estimation, and keypoint detection?]**
Great idea! We add depth estimation, surface normal estimation, and keypoint detection on the Taskonomy dataset as shown below and compare our MTHL with Mod-Squad under the same setting. We can see that our model achieves overall better performance than Mod-Squad.
|Backbone | Method | Depth RMSE ↓ | Normal L1 ↓| Keyp. L1↓|
| ------- | -------- | -------- | -------- | ------- |
|ViT-B| Mod-Squad | 6.59 | 0.374 | 0.0275 |
|ViT-B| MTHL | **6.48** | **0.357** | **0.0267** |
Our ultimate goal is to develop a model capable of efficiently handling a wider range of vision tasks. We are committed to further building upon our work to achieve this objective.
**Q6. [There might be an issue of unifying labels among datasets, e.g. mouse and computer mouse.]** \
Good question! Incorporating multiple datasets into MTHL does not result in conflicts, as the framework includes task-specific modules and each dataset has its own classification head. During training, samples from different datasets are routed through their respective classification heads. This means that even if two datasets contain similar categories, such as "mouse" and "computer mouse," they are treated as distinct categories, thereby avoiding any conflicts.
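For clarity, here is a minimal PyTorch-style sketch of this dataset-specific head routing (module and dataset names are illustrative, not our exact implementation):

```python
import torch
import torch.nn as nn

class MultiDatasetClassifier(nn.Module):
    """Shared backbone with one classification head per dataset.

    Sketch: "mouse" in dataset A and "computer mouse" in dataset B live in
    different heads, so their label spaces never collide.
    """
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: dict):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n_cls) for name, n_cls in num_classes.items()
        })

    def forward(self, images: torch.Tensor, dataset_name: str) -> torch.Tensor:
        feats = self.backbone(images)           # shared features
        return self.heads[dataset_name](feats)  # dataset-specific logits

# usage sketch: model = MultiDatasetClassifier(backbone, 768, {"dataset_a": 1000, "dataset_b": 200})
```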
**Q7. [The number of experts increases when learning from more tasks/datasets.]** \
Good question! In general, deep learning models require more capacity to handle complex tasks and large datasets, which creates a tradeoff between capacity and performance. Our framework is no exception and faces a similar tradeoff. However, it is worth noting that our framework can still function effectively without requiring additional experts or resources (see MTHL.D in Tab.1).
\
Thanks again for your time and effort! For any other questions, please feel free to let us know during the rebuttal window.
## Reviewer MBQv (5)
We would like to begin by expressing our sincere gratitude for your thorough review of our paper. The questions you raised are insightful, and your suggestions are crucial in improving the quality of our paper. We thank you profoundly for your effort.
**Q1. [Multi-task heterogeneous learning is not a novel task. It seems that the authors ignore [A] and [B]. They solve the same problem.]** \
We agree that there are works related to multi-task heterogeneous training, and we do not claim this is a new problem/setting. We will include [A] and [B] in the related work section to provide a comprehensive review. A comparison with [A] and [B] can be found in Q3.
A key difference of our work compared to prior work is that we **push the limit of multi-task models toward practical application and explore the scaling laws of multi-task MoE models**. This is supported by the following evidence:
1) one model to perform three tasks can be on par with single-task state-of-the-art (L63),
2) directly compared with the commonly used pre-train then finetune training scheme on both pre-training and downstream tasks (Tab.1 and Tab.2),
3) truly efficient adaptation to downstream tasks without performance drop as a general-purpose vision backbone (Tab.3), and
4) scaling up multi-task models by leveraging mainstream vision datasets designed for different purposes (L8).
In summary, **we are not exploring a new problem setting**. The scope of this paper is to explore the scaling laws of multi-task models and new training schemes for a general-purpose vision backbone model. While it is true that existing works may have similar settings, this does not impact the main contribution of our paper.
**Q2. [This method is computationally expensive for training since it takes 96 Tesla V100 GPUs.]**
It is common for methods that handle multiple complex tasks/datasets and large models to require significant computational resources. This is also true for multi-task learning with a base vision transformer and the MoE technique. For example, **Mod-Squad uses 240 V100 GPUs**. As emphasized in line 159 of our paper, "One challenge in optimization is the presence of gradient conflicts between different tasks. These conflicts interfere with the joint optimization and slow down the convergence." Additionally, since our goal is to develop a general-purpose vision backbone with MTHL, large computing resources are required, similar to many other foundation models.
**Q3. [Advantage compared to [A] and [B].]**
Compared to [A, B], our approach differs in that we conduct one-stage joint training, while [A] requires learning teacher models and then conducting pseudo-labeling, and [B] conducts multi-stage pre-training by first learning the backbone on image-text data and then learning experts on specific tasks. We highlight our advantages as being **more straightforward, closer to practical application, and easier to scale up to multiple datasets/tasks**.
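To make the one-stage joint training concrete, here is a minimal PyTorch-style sketch of the training loop (the loader, loss, and model-signature names are illustrative assumptions, not our exact implementation):

```python
import random

def train_one_stage(model, optimizer, loaders, criteria, sampling_weights, num_steps):
    """One-stage joint training over heterogeneous datasets.

    `loaders` and `criteria` are dicts keyed by task name (e.g. "cls", "det",
    "seg"); each step samples one task according to `sampling_weights` and
    updates the shared model with that task's supervised loss.
    """
    iterators = {t: iter(dl) for t, dl in loaders.items()}
    tasks = list(loaders.keys())
    weights = [sampling_weights[t] for t in tasks]

    for _ in range(num_steps):
        task = random.choices(tasks, weights=weights, k=1)[0]
        try:
            batch = next(iterators[task])
        except StopIteration:                        # restart an exhausted loader
            iterators[task] = iter(loaders[task])
            batch = next(iterators[task])

        outputs = model(batch["inputs"], task=task)  # task-specific head inside the model
        loss = criteria[task](outputs, batch["targets"])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```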
In terms of performance and efficiency, here are some results directly comparing with MuST [A]:
|Backbone | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | ------- |
| Swin-T | MuST | 79.1 | 41.1 | 41.8 |
|Swin-T| MTHL.D | 79.7 | 43.8 | 44.4 |
|Swin-T| MTHL | **80.3** | **45.0** | **44.6** |
We also compare with MuST [A] on downstream tasks (same as in Tab.2, including P365, iNat18, PASC., and so on):
|Backbone | Method | Average on downstream|
| ------- | -------- | -------- |
| Swin-T | MuST | 73.6 |
|Swin-T| MTHL.D | 77.7 |
|Swin-T| MTHL | **78.2** |
MTHL also offers a unique advantage in adapting efficiently to downstream tasks while saving training memory, model capacity, and inference cost, as demonstrated in Table 3. These advantages are not offered by other methods.
We hope our explanation addresses your concerns about the performance and efficiency compared to related work.
**Q4. [More discussion of limitation and broader impact.]**\
Thanks for the suggestion. In the revision, we will incorporate additional discussions regarding limitations and broader impacts. One such limitation is the need for additional effort to balance learning from various tasks, such as tuning the sampling frequency, when presented with a new set of learning tasks. This process is heavily dependent on the specific task. However, our paper successfully scaled up heterogeneous training on mainstream vision datasets and demonstrated its effectiveness, which could serve as a valuable reference for new training schemes for large vision models.
One broader impact of our work is that we call for a paradigm change in vision model pre-training: the commonly used pre-train then finetune scheme could be replaced by multi-task heterogeneous training. We demonstrate that MTHL is a more straightforward way to handle multiple tasks/datasets, is easy to scale up, and transfers efficiently to downstream tasks. We will further work on incorporating more vision tasks and datasets.
Thanks again for your time and effort! For any other questions, please feel free to let us know during the rebuttal window.
## Reviewer o6VS (6)
Thank you for the positive comments and insightful suggestions. Your insightful questions and valuable suggestions have been immensely helpful in enhancing the paper's quality.
**Q1. [In Tab. 1, it appears that MTHL generally underperforms its dense counterpart MTHL.D when model capacity increases.]** \
The improvement from introducing a sparse model with extra capacity gradually diminishes as model capacity increases. This is because extra capacity may no longer bring more performance when the model is already large, and it may make training harder. However, MTHL does have a clear advantage over MTHL.D when model capacity is not extensive (e.g., **see Swin-T and DaViT-T in Tab.1**). Further, even with increased model capacity, we want to highlight that MTHL consistently outperforms MTHL.D in **downstream performance**, as shown in Tab.2, which is a main advantage of the MoE model.
**Q2. [As for continual learning (Sec 3.4), while the MLP block can be updated by adding new experts, other parameters such as the Attention block are frozen, which may limit the generalization ability of the whole model.]** \
Good question! To address this concern, we can also add experts to other components, such as the attention block, using the MoE technique (e.g., MoE attention). In our experiments, we found that adding MLP experts often provides sufficient capacity for new knowledge, but this may depend on the specific dataset. The results of adding attention experts (4 attention experts at the beginning and 1 new attention expert added per new task) are shown below (same as in Tab.3, evaluated on P365, iNat18, PASC., and other datasets):
| Method | New params per task (M) | Average Performance |
| -------- | -------- | -------- |
| Add 2 Mlp.Ex | 10.4 | 80.6 |
| Add 2 Mlp.Ex & 1 Att.Ex | 12.5 | 80.8 |
This demonstrates the potential of expanding experts in various model components to improve generalization in continual learning.
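For clarity, below is a minimal PyTorch-style sketch of how new experts can be added to an MoE layer while previously learned parameters stay frozen (module names and initialization are illustrative, not our exact continual-learning code):

```python
import copy
import torch.nn as nn

def expand_moe_layer(moe_layer: nn.Module, num_new_experts: int) -> None:
    """Add new experts to an MoE layer for a new task while freezing everything
    learned so far, so knowledge of old tasks is preserved.

    Assumes `moe_layer.experts` is an nn.ModuleList (MLP or attention experts).
    In practice the new task also gets its own trainable router/task head so it
    can learn to route to the new experts.
    """
    for p in moe_layer.parameters():          # freeze previously learned experts
        p.requires_grad = False

    for _ in range(num_new_experts):
        new_expert = copy.deepcopy(moe_layer.experts[0])
        for p in new_expert.parameters():
            p.requires_grad = True            # only the new experts are trainable
            if p.dim() > 1:
                nn.init.trunc_normal_(p, std=0.02)
            else:
                nn.init.zeros_(p)
        moe_layer.experts.append(new_expert)
```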
**Q3. [What is the input resolution of detection and segmentation tasks? Are they the same as classification's or not?]** \
They are not the same. Each image has its own original resolution and is resized to different scales for different tasks. The classification resolution is 224\*224. Detection and segmentation have a multi-scale task-specific module, and one image can be resized to several input resolutions, e.g., the resolutions (480, 1333), (512, 1333), (544, 1333), (576, 1333), (608, 1333), (640, 1333), (672, 1333), (704, 1333), (736, 1333), (768, 1333), (800, 1333) may be used for the COCO dataset.
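As an illustration, here is a minimal sketch of the per-task resizing logic (the scale list follows the COCO multi-scale convention above; the function and task names are illustrative):

```python
import random
from PIL import Image

CLS_SIZE = (224, 224)
DET_SEG_SCALES = [(480, 1333), (512, 1333), (544, 1333), (576, 1333), (608, 1333),
                  (640, 1333), (672, 1333), (704, 1333), (736, 1333), (768, 1333),
                  (800, 1333)]

def resize_for_task(img: Image.Image, task: str) -> Image.Image:
    """Resize one image according to its task: a fixed 224x224 size for
    classification, or a randomly sampled (shorter-side, longer-side-cap) scale
    for detection/segmentation, keeping the aspect ratio."""
    if task == "cls":
        return img.resize(CLS_SIZE)
    short, long_cap = random.choice(DET_SEG_SCALES)
    w, h = img.size
    scale = min(short / min(w, h), long_cap / max(w, h))  # respect both constraints
    return img.resize((round(w * scale), round(h * scale)))
```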
**Q4. [Is there any plan to release the code and checkpoints for reproduction?]** \
We release the link to our anonymous code in a separate "official comment" to the AC, according to the NeurIPS rules.
\
Thanks again for your time and effort! For any other questions, please feel free to let us know during the rebuttal window.
## Reviewer aYYd (6)
Thank you for the positive comments and insightful suggestions. We are sincerely grateful for your efforts and the time you dedicated to providing feedback.
**Q1. [The figure 2 can be improved such that it can be less confusing, e.g. in A do different tasks share some of the experts?]** \
Yes, they can share experts. We will modify the figure and further clarify that in revision.
**Q2. [Lack of an ablation study on the approximation for probability estimation in Line 172.]** \
The approximation in L172 addresses the issue of biased joint distribution estimation for a single task within a sampled batch, which is crucial for the successful use of the mutual information loss in multi-task heterogeneous training.
Without the approximation, the mutual information loss has to be computed within a single batch of data, which comes from only one task/dataset. In that case, for every dataset/task, the model evenly uses all experts, which constrains its performance. The result (MTHL w/o app.) is shown below:
|Backbone | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | ------- |
|Swin-T| MTHL.D | 79.7 | 43.8 | 44.4 |
|Swin-T| MTHL | **80.3** | **45.0** | **44.6** |
|Swin-T| MTHL w/o app. | 79.6 | 44.0 | 44.1 |
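For intuition, the loss maximizes the mutual information $I(T;E)=\sum_{t,e} p(t,e)\log\frac{p(t,e)}{p(t)\,p(e)}$ between tasks $T$ and experts $E$, which requires an estimate of the joint distribution $p(t,e)$. Below is a minimal PyTorch-style sketch of one way such a cross-batch estimate can be maintained (an EMA over per-task expert usage); this is illustrative and not necessarily the exact form of the approximation in L172:

```python
import torch

class JointDistEstimator:
    """Cross-batch estimate of the task-expert joint distribution p(t, e).

    A single heterogeneous batch contains only one task, so p(t, e) estimated
    from that batch alone is degenerate. Each task's expert-usage distribution
    p(e|t) is tracked with an exponential moving average across batches and
    combined with the task sampling frequencies p(t).
    """
    def __init__(self, num_tasks: int, num_experts: int, momentum: float = 0.99):
        self.usage = torch.full((num_tasks, num_experts), 1.0 / num_experts)  # p(e|t)
        self.p_t = torch.full((num_tasks,), 1.0 / num_tasks)                  # p(t)
        self.m = momentum

    def mi_loss(self, task_id: int, gate_probs: torch.Tensor) -> torch.Tensor:
        """gate_probs: (batch, num_experts) routing probabilities of the current batch."""
        cur = gate_probs.mean(dim=0)  # differentiable expert usage for the current task
        rows = [cur if t == task_id else self.usage[t] for t in range(self.usage.shape[0])]
        p_te = self.p_t.unsqueeze(1) * torch.stack(rows)     # joint p(t, e)
        p_e = p_te.sum(dim=0, keepdim=True)                  # marginal p(e)
        mi = (p_te * (p_te / (self.p_t.unsqueeze(1) * p_e) + 1e-9).log()).sum()
        # update the running estimate without tracking gradients
        self.usage[task_id] = self.m * self.usage[task_id] + (1 - self.m) * cur.detach()
        return -mi  # minimize the negative, i.e. maximize I(T; E)
```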
**Q3. [Wall-clock runtime comparison in addition to the Params and FLOPs.]** \
We will add it in the revision. We measure the wall-clock runtime for the model to complete training and inference on all datasets. For training from scratch and Pre. & FT. (including pre-training time), we sum up the time on all datasets. The results are shown below:
| Backbone | Model | training GPU hours | inference GPU minutes|
| -------- | -------- | -------- | -------- |
| Swin-T | Scratch | 420 | 2.1 |
| Swin-T | Pre. & FT. | 390 | 2.1 |
| Swin-T | MTHL.D | 550 | 2.1 |
| Swin-T | MTHL | 700 | 2.3 |
**Q4. [It is not very clear the exact difference between MTHL.D and MTHL.]** \
In L240, "MTHL.D: our multi-task heterogeneous learner using a dense model (no MoE). MTHL: our multi-task heterogeneous learner using a sparse model (with MoE)". The difference is whether using MoE or not.
**Q5. [In Line 60, it would be better to clarify the technical contributions of the paper apart from the effectiveness.]** \
Thanks for the suggestion. We will make this clearer. The technical contributions of our paper include:
- Proposing a method to address challenges arising from large intrinsic differences among vision tasks, such as data distribution, architectures, task-specific modules, dataset scales, and sampling strategies (L9).
- Introducing a modified and scaled-up mixture-of-experts (MoE) vision transformer, leveraging its strong modularity to easily decompose into high-performing components and allowing for a more flexible selection of semantically meaningful components when transferring to downstream tasks (L192).
- Introducing a new mutual information loss for multi-task heterogeneous training that can handle batches with only one task (L168).
- Introducing several ways to enable efficient adaptation and continual learning (Fig.2).
\
Thanks again for your time and effort! For any other questions, please feel free to let us know during the rebuttal window.
############################ Discussion ################################
## Reviewer mA5t (3)
Thank you for your response and thoughtful suggestions. We will definitely **cite all related papers** and conduct a more **comprehensive literature review**. Additionally, we will **include MuST and other relevant works** in our main experiment table for better comparison. Further, we have run a new experiment directly **incorporating MuST** into our MTHL framework, and we are happy to find that it actually improves the performance. We hope that our efforts address your concerns.
**Q1. [MTHL is a standard way of learning a multi-task model from multiple datasets.]**
We respectfully push back on the premise of assessing the paper from a general multi-task learning perspective. Our approach, while sharing similarities with traditional multi-task learning, holds particular significance in the era of large models and large-scale data. We present our whole framework, including MTHL, the modularized design, downstream task transfer, and even continual learning, as a feasible route toward general-purpose models. This distinctive approach sets it apart from the "general multi-task learning perspective". However, we fully agree that we should compare with traditional MTL, which is addressed in Q3.
We never claim that learning a multi-task model from multiple datasets is a new setting. Instead, we aim to investigate how to train such a model while ensuring task performance comparable to or even better than state-of-the-art single-task learning. This is a key challenge that has not been resolved yet: whether a multi-task model can be effectively scaled up for practical applications.
This is very challenging, since multi-task learning can suffer from task conflicts, and mainstream vision datasets have powerful baselines with tricks and task-specific designs. To the best of our knowledge, our proposed MTHL is the first framework to address this fundamental challenge.
More fundamentally, our findings demonstrate that the dominant learning scheme in computer vision (pre-train from ImageNet, then fine-tune on detection and segmentation tasks) can be replaced directly by MTHL. Previous MTL work has rarely shown that it can be used in foundation model pre-training and compete with commonly used pre-training schemes on downstream tasks (See Table 2 in our main paper). For example, even though Mod-Squad is trained on the large-scale Taskonomy dataset, it has never demonstrated its generalizability on downstream tasks like other general-purpose models (e.g., ImageNet pre-trained Transformer). Our work, for the first time, shows that it is feasible to learn a general-purpose vision backbone with MTHL.
**Q2. [Adapting models for downstream problems with modules/adapters has been explored in CV.]**
While downstream adaptation with modules/adapters has indeed been explored, our contribution goes beyond this perspective. To the best of our knowledge, our work is the first to propose a model capable of flexibly tuning training parameters, model parameters, and computation cost specifically for downstream tasks.
We do compare our model with Adapters in the Tab.3 of our main paper, demonstrating the effectiveness of our work. Additionally, we are not aware of any prior work utilizing modular design for continual learning, distinguishing our approach from existing research.
**Q3. [Comparison of Mod-Squad and MTL methods.]**
Great suggestion! We will add other MTL methods to Tab.1 in the revision. Here we include two tables comparing MTL methods with Mod-Squad and MTHL.
The first table is **an overall comparison on PASCAL-Context**. All results are from their papers. InvPT reports maxF while other works report mIoU for saliency, so we leave it blank. We also add NDDR-CNN (ref A) and M³ViT (ref B) for your reference. We also report a new result of Mod-Squad using Swin-T by running their publicly available code. Note that Mod-Squad and M³ViT both use MoE ViT but differ slightly in expert numbers.
| Method | Backbone | Params | Seg.↑ | Norm.↓ | H.Parts↑ | Sal.↑ | Edge↑ |
| ------- | ---------- | ------ | -------- | -------- | -------- | -------- | -------- |
| MTI-Net | HRNet-18 | 8M | 64.2 | 14.7 | 62.0 | 68.0 | 73.4 |
| MTAN | ResNet-18 | 11M | 63.7 | 14.8 | 58.9 | 65.4 | 69.6 |
| InvPT | Swin-T | 27.5M | 73.9 | 14.1 | 62.7 | - | **72.6** |
| NDDR-C | ResNet-18 | 11M | 65.4 | 13.9 | 60.5 | 66.8 | 69.8 |
| Mod-S | MoE Swin-T | 55.4M | **74.3** | **13.5** | **63.0** | **67.2** | 72.4 |
| M³ViT | MoE ViT-T | 42M | 72.8 | 14.5 | 62.1 | 66.3 | 71.7 |
| Mod-S | MoE ViT-T | 50M | 74.1 | 13.7 | 62.7 | 66.9 | 72.0 |
Based on the table, Mod-Squad exhibits the best overall performance with a powerful backbone. Therefore, it is logical for us to use it as the foundation of our work, given its current high level of performance. Additionally, Mod-Squad has conducted experiments on the large Taskonomy dataset, demonstrating its reliability at large data scale, while most other MTL works are conducted on relatively small datasets (e.g., NYUv2, PASCAL-Context). This gives us an additional reason to explore the practical applications of Mod-Squad.
Here is the second table comparing some great MTL works (e.g., MuST, XLearner) with our MTHL. Note that our MTHL.D baseline is a plain transformer with task-specific modules; it can serve as a general framework to incorporate new methods and is also a strong baseline. We also implement MTAN by replacing our backbone with the MTAN structure based on ResNet-50. Further, we report the result of MTHL+MuST, which takes both the pseudo-labels from the single-task models and the ground truth as learning objectives.
|Backbone | Params | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | -------- | ------- |
| ResNet-50 | 23.5M | MTAN| 77.5 | 40.4 | 40.1 |
| ResNet-50 | 23.5M | XLearner| 77.3 | 39.9 | 40.3 |
| Swin-T |27.5M | MuST | 79.1 | 41.1 | 41.8 |
|Swin-T| 27.5M| MTHL.D | 79.7 | 43.8 | 44.4 |
|MoE Swin-T| 50.9M| MTHL | 80.3 | **45.0** | 44.6 |
|MoE Swin-T| 50.9M| MTHL+MuST| **80.8** | 44.9 | **44.9** |
From the table, we can see that MTHL still has a clear advantage over previous MTL methods. Also, it is worth noting that MuST is orthogonal to our MTHL and can further improve the performance.
**Q4. [Use Mod-Squad with additional pseudo-labels for training from multiple datasets? ]**
We are happy to include new results that further explore training with additional pseudo-labels, following the idea in MuST. Specifically, we first train three single-task models (with the same backbone and task-specific design as our MTHL approach) on ImageNet, COCO, and ADE20K, respectively, and use them as teachers to further train our MTHL on the three datasets with pseudo-labels only (no ground-truth labels). We also train a variant that uses both the pseudo-labels and the ground-truth labels, which can be seen as a variant of MuST under the multi-task heterogeneous training setting.
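For clarity, here is a minimal PyTorch-style sketch of how the ground-truth and pseudo-label signals can be combined for the classification task (the weighting and temperature are illustrative hyper-parameters, not our exact values):

```python
import torch
import torch.nn.functional as F

def gt_plus_pl_loss(student_logits, labels, teacher_logits, alpha=0.5, temperature=1.0):
    """Combine ground-truth supervision (GT) with pseudo-labels (PL) distilled
    from a frozen single-task teacher, shown here for the classification task."""
    gt_loss = F.cross_entropy(student_logits, labels)
    with torch.no_grad():
        soft_targets = (teacher_logits / temperature).softmax(dim=-1)
    pl_loss = F.kl_div(
        (student_logits / temperature).log_softmax(dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * gt_loss + (1.0 - alpha) * pl_loss
```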
Here are some results of using ground truth (GT), pseudo-label only (PL), and both (GT+PL):
|Backbone | Method | ImageNet Acc | COCO Det mAP | ADE20k mIoU|
| ------- | -------- | -------- | -------- | ------- |
|MoE Swin-T| GT | 80.3 | **45.0** | 44.6 |
|MoE Swin-T| PL | 79.1 | 41.1 | 41.8 |
|MoE Swin-T| GT+PL | **80.8** | 44.9 | **44.9** |
From the table, we can see that incorporating MuST further improves the results. This shows that MTHL is a general technique that is compatible with advanced MTL methods like MuST. We will add the results of incorporating MuST in our revision.
**Q5. [Discussion of cross-task relations.]**
Good suggestion! We agree that cross-task relation and how to utilize unlabeled part of data could be an exciting direction. We will discuss it in our revision.
**Q6. [For including other tasks (response to Q5), do the authors train the model on ImageNet, COCO, ADE20k and Taskonomy?]**
Yes, we train our model on ImageNet, COCO, ADE20K, and Taskonomy. Following your suggestion, and because Taskonomy is expansive (19 tasks), we focus on its depth, normal, and keypoint tasks.
**Q7. [Regarding the cost comparisons, I would recommend including the comparisons (params) with MuST [ref4] and XLearner [ref2].]**
We appreciate your input and will certainly add these results in our revision. You can refer to our second table in the response to Q3.
[A] Gao, Ma, et al. "NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction." CVPR 2019.
[B] Liang, Fan, et al. "M³ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design." NeurIPS 2022.
Thank you for taking the time and effort to review our work. We are pleased to see that, following your suggestion, incorporating MuST has improved the performance. If you have any further concerns, please do not hesitate to let us know, and we will be happy to address them.
### Response from authors
Thanks for your quick response. Here are our efforts to address your remaining concerns.
**Q1[Authors claim that they are the first one to solve the task conflict challenge if I understand correctly.]**
"This is a key challenge that has not been resolved yet -- whether a multi-task model can be effectively scaled up for practical applications."
"This is very challenging since multi-task learning can suffer from task conflicts."
We do not claim that we are the first to solve task conflicts. We only say that task conflicts make it challenging to scale up multi-task models with efficiency for practical applications.
**Q2[Pre-train multi-datasets with multiple tasks benefits performance on downstream tasks have been shown in MuST and XLearner.]**
We agree and will further clarify this in the revision.
**Q3[The results of MTI-Net on Sal. and Edge should be in bold.]**
Thanks for reminding us. It should be in bold.
**Q4[Experiments on Taskonomy.]**
Yes, the model is pre-trained on the union of ImageNet, COCO, ADE20K, and Taskonomy, and from Taskonomy we only use the depth, normal, and keypoint annotations during training. We evaluate on all datasets, and the results are here:
|Backbone | Method |IN-1K top-1| COCO mAP| ADE mIoU| Depth RMSE ↓ | Normal L1 ↓| Keyp. L1↓|
| ------- | -------- | -------- | -------- | ------- | -------- | -------- | ------- |
|ViT-B| Mod-Squad |- | - | - | 6.59 | 0.374 | 0.0275 |
|ViT-B| MTHL w/o Taskon.| **82.2** | 47.6| 49.5 | - | - | - |
|ViT-B| MTHL |82.0 | **47.8** | **50.3** | **6.48** | **0.357** | **0.0267** |
Thanks again for your time. Please let us know if there is still any confusion.