Backup on ICML 2023 (SUT).
# TODO
1. If no LM results, discuss the possibilities SUT opens up.
2. Include overall contributions and criticisms in the general response before presenting the new experiments and addressing individual concerns.
3.
---
### Response to all reviewers:
We thank all reviewers and ACs for their time and effort in reviewing the paper. We are glad that the reviewers generally recognized the following contributions.
* **Method**. The proposed method seamlessly integrates Sparse Mixture of Experts, Mixture of Attention Heads, and Universal Transformers, and introduces a novel MIM loss and dynamic halting mechanism to improve the scalability and practicality of Universal Transformers.
> This paper shows the effort to integrate SMoE and MoA with Universal Transformers, with an additive contribution in introducing the MIM loss and the stick-breaking dynamic halting method. (**ckQn**)
> It seems like a useful combination that works well to make universal transformers scalable. (**qSx7**)
> The method is a novel and well-designed architecture. The formulation of dynamic halting is an exciting future direction, addressing some main weaknesses of current transformer architectures. (**yqQz**)
* **Experiments**. Experiments were conducted across a variety of tasks, together with extensive ablation studies, demonstrating consistent advantages.
> Various tasks and comprehensive ablation studies have been carried out, showing consistent improvements. (**HrbX**)
> The experiments clearly show that SUT achieves similar accuracy as UT with far fewer MACs. Moreover, the ablation study shows the usefulness of the different components. (**ckQn**)
> The method leads to significant improvement over baselines on CFQ and performs competitively with fewer parameters on WMT'14. (**qSx7**)
> The authors showcase the SUT's ability to handle recursive language modeling and reasoning, and achieve competitive performance on translation. (**yqQz**)
* **Presentation**. The paper is well presented.
> This manuscript is well organized with sufficient details about the proposed methods. (**HrbX**)
> The motivation and ideas presented in the paper are clear and experiments support the claims made in the paper. (**ckQn**)
> The paper is well written and the general narrative of the paper is easy to follow. The problem is well-motivated and the experiments/results are well-explained. (**qSx7**)
> This paper is well written and clear. (**yqQz**)
#### On novelty
Many of the contributions in this paper address the various aspects of lowering the UT's computational footprint so that it can be used at scale, including:
1. Integrating Sparse MoE techniques into the components of the Transformer block.
2. Introducing a principled auxiliary loss to aid Sparse MoE training.
3. A stick-breaking halting mechanism, allowing us to reduce the per-token compute cost by adjusting the threshold of the halting mechanism (see the sketch below).
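To make the mechanism concrete, below is a minimal sketch (not the paper's exact implementation; tensor shapes and names are illustrative) of how stick-breaking halting probabilities can be computed and how a threshold turns them into a per-token stopping decision:

```python
import torch

def stick_breaking_halting(halt_logits, threshold=0.999):
    """Illustrative stick-breaking halting (hypothetical shapes/names).

    halt_logits: [num_layers, batch, seq] raw per-layer halting scores.
    Returns the halting distribution p_t = a_t * prod_{s<t}(1 - a_s) over
    layers and a boolean mask of tokens that may stop at each layer.
    """
    alpha = torch.sigmoid(halt_logits)                   # a_t in (0, 1)
    survive = torch.cumprod(1.0 - alpha, dim=0)          # prod_{s<=t}(1 - a_s)
    survive = torch.cat([torch.ones_like(survive[:1]),   # shift so the product runs over s < t
                         survive[:-1]], dim=0)
    p = alpha * survive                                   # probability of halting exactly at layer t
    halted = torch.cumsum(p, dim=0) >= threshold          # lower threshold => earlier halting, less compute
    return p, halted
```

Lowering `threshold` at inference time is what lets per-token compute be traded off against performance.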
To our knowledge, there is no prior work on scaling up the UT to fit the LLM regime. We believe this is an important step for further improvements to be made. While the techniques may seem straightforward, there are various challenges in the development of SUT:
1. While there is prior work on using Sparse MoEs in the attention mechanism, we have extended it to allow multiple heads per expert, and we find that this is a crucial enhancement.
2. As the SUT's parameter count is increased to match larger dense models, more experts are required. This leads to sparser gradients and can result in over-reliance on a few experts. We found that the MIM loss helps mitigate these issues (a toy sketch of the loss follows), but we believe more work in this area will further improve Sparse MoEs.
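For reference, a toy sketch of what a mutual-information-style load-balancing objective can look like (this is an assumed form for illustration, not necessarily the exact loss used in the paper):

```python
import torch

def mim_load_balancing_loss(router_probs, eps=1e-9):
    """Toy mutual-information-style auxiliary loss (assumed form, for illustration).

    router_probs: [num_tokens, num_experts] softmax outputs of the router.
    Maximizing I(expert; token) = H(p(e)) - H(p(e | token)) encourages a
    balanced marginal load across experts and confident per-token routing.
    """
    marginal = router_probs.mean(dim=0)                                            # p(e), averaged over tokens
    h_marginal = -(marginal * (marginal + eps).log()).sum()                        # H(p(e)): want this high
    h_conditional = -(router_probs * (router_probs + eps).log()).sum(-1).mean()    # H(p(e | token)): want this low
    return -(h_marginal - h_conditional)                                           # minimizing this maximizes the MI
```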
#### Additional experimental results
##### CFQ MACs
We will update Table 2 with a column reflecting the computational cost of each model:
| Model | MCD1 | MCD2 | MCD3 | Avg. | MACs |
|-----------------------|------|------|------|------|--------|
| Universal Transformer | 42.7 | 9.5 | 11.6 | 21.3 | 1154M |
| Edge Transformer | 47.7 | 13.1 | 13.2 | 24.7 | 6504M |
| Transformer | 42.5 | 11.2 | 10.6 | 21.4 | 1154M |
| T5 | 61.6 | 31.3 | 33.3 | 42.1 | 1154M |
| Roberta | 60.6 | 33.6 | 36.0 | 43.4 | 1660M |
| Dangle | 78.3 | 59.5 | 60.4 | 66.1 | 51033M |
| SUT w/o halting | 71.0 | 48.6 | 51.3 | 56.9 | 654M |
| SUT with halting | 72.4 | 51.1 | 51.7 | 58.4 | 654M |
MACs here are computed against a dummy input of 20 tokens for both source and target sequences.
Generally, SUTs have a smaller computational cost than prior methods, which have larger parameter counts. Dangle, which outperforms SUT, requires a forward pass of a RoBERTa model for every output token. We believe the better compositionality results (particularly on CFQ) arise from more recursive steps (the commonality between SUTs and Dangle) rather than from larger models or simply more compute.
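To illustrate why parameter count and MACs decouple under sparse routing, here is a back-of-the-envelope MAC count for a dense feed-forward block versus a top-k MoE block (the sizes below are hypothetical, chosen only to make the arithmetic concrete):

```python
def ffn_macs(d_model, d_ff, num_tokens):
    """MACs for a dense feed-forward block: two matmuls per token."""
    return num_tokens * (d_model * d_ff + d_ff * d_model)

def moe_ffn_macs(d_model, d_ff_expert, num_tokens, top_k):
    """MACs for a top-k MoE feed-forward block: only k small experts run per token."""
    return num_tokens * top_k * (d_model * d_ff_expert + d_ff_expert * d_model)

# Hypothetical sizes: dense d_ff = 2048 vs. 16 experts of d_ff = 256 with top-2 routing,
# over the 20-token dummy input used above.
print(ffn_macs(512, 2048, 20))        # 41,943,040 MACs
print(moe_ffn_macs(512, 256, 20, 2))  # 10,485,760 MACs, even though total parameters can be larger
```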
##### ListOps results
Several reviewers have asked for additional experiments to be done on algorithmic tasks, and given the limited time frame, we've managed to run some experiments on [ListOps](https://arxiv.org/abs/1804.06028) and the [Long-range Arena variant of ListOps](https://arxiv.org/abs/2011.04006) with the SUT.
**TODO add in the other LRA benchmarks**
| Dataset | Acc.|
|--------------------------|-----|
| Long-range Arena ListOps | 49% |
**TODO preliminary results on language modelling**
---
> ### Official Review by Reviewer HrbX
> Summary:
> This work proposed Sparse Universal Transformer(SUT), which can effectively scale universal transformers for superior performance with less computation cost. SUT consists of three components: (1) Mixture of Multi-head Attention layers; (2) a new regularization to improve loading balance; (3) the Dynamic halting mechanism. Experiments validate the effectiveness and parameter/computation efficiency on formal language tasks and standard natural language benchmarks.
> Strengths And Weaknesses:
> Strengths
> This manuscript is well organized with sufficient details about the proposed methods.
> Experiments are conducted across various tasks and comprehensive ablation studies have been carried out.
> The sparse universal transformer shows consistent improvement across different tasks.
> Weaknesses
> The novelty of this work is moderate. Multiple previous works have demonstrated that MoE can effectively scale transformers. It seems natural we can also use MoE to scale up universal transformers.
> What is the computational cost of different methods in Table 2?
> grammar: line 327: "Overall, these experiments suggest using halting performs slightly better on compositional generalization than without".
> Questions:
> line 77-78: " This sparsity brings extra difficulty in training due to discrete gradients." Could the authors elaborate more about this sentence? why the gradients are discrete?
> Limitations:
> None
> Ethics Flag: No
> Ethics Review Area: I don't know
> Soundness: 2 fair
> Presentation: 3 good
> Contribution: 2 fair
> Rating: 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
> Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
> Code Of Conduct: Yes
----
### Response to Reviewer #1
Thank you for your comments!
>Q1. The novelty of this work is moderate. Multiple previous works have demonstrated that MoE can effectively scale transformers. It seems natural we can also use MoE to scale up universal transformers.
**TODO**
1. Scaling up is not the only contribution.
While this is true, Sparse MoEs are still an active area of research, and scaling up UTs with MoEs is not straightforward. Applying MoA and extending it to multiple attention heads per expert was necessary to obtain results competitive with existing non-UT methods.
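As a rough illustration of the "multiple heads per expert" idea (a simplified sketch with hypothetical module names, computing every expert densely for clarity rather than dispatching tokens sparsely as a real implementation would):

```python
import torch
import torch.nn as nn

class AttentionExpert(nn.Module):
    """One attention expert that itself contains several heads (illustrative only)."""
    def __init__(self, d_model, heads_per_expert):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads_per_expert, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class MixtureOfMultiHeadAttention(nn.Module):
    """Top-k routing over attention experts, each with multiple heads."""
    def __init__(self, d_model, n_experts=4, heads_per_expert=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            AttentionExpert(d_model, heads_per_expert) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                       # x: [batch, seq, d_model]
        gates = self.router(x).softmax(dim=-1)                  # per-token expert probabilities
        topv, topi = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            weight = (topv * (topi == e)).sum(-1, keepdim=True)  # gate weight if expert e is selected
            out = out + weight * expert(x)                       # dense for clarity; real code dispatches sparsely
        return out
```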
We also contributed several principled perspectives on known prior methods: a new load-balancing loss based on mutual information, and a stick-breaking based halting mechanism.
These contributions, together with the experimental results of SUTs, demonstrate a plausible path towards combining the compositional properties of UTs with the scalability of VTs.
>Q2. What is the computational cost of different methods in Table 2?
We have included the MACs for the CFQ experiments in the general response above, where we can see that Dangle, the method which outperforms SUT, requires much more computation.
>Q3. line 77-78: " This sparsity brings extra difficulty in training due to discrete gradients." Could the authors elaborate more about this sentence? why the gradients are discrete?
We apologise; it would have been more accurate to describe the gradients as sparse rather than discrete: with top-k routing, only the selected experts receive gradient for a given token. We will make the appropriate changes to those lines.
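A toy example of the effect (illustrative code, not from the paper): with top-1 routing, only the chosen expert appears in the autograd graph for a token, so the other experts' parameters receive no gradient from it.

```python
import torch

experts = [torch.nn.Linear(8, 8) for _ in range(4)]
router = torch.nn.Linear(8, 4)

x = torch.randn(1, 8)
gates = torch.softmax(router(x), dim=-1)
chosen = int(gates.argmax(dim=-1))           # top-1 expert for this token
y = gates[0, chosen] * experts[chosen](x)    # only the selected expert runs
y.sum().backward()

# Only the chosen expert has a gradient; the rest are untouched for this token.
print([e.weight.grad is not None for e in experts])
```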
Please let us know if you have further questions or changes you would like to see.
---
> Official Review by Reviewer ckQn
> Summary:
> Universal Transformers (UT) generalizes transformers to handle tasks that RNN handles with ease. This generalization comes at high computation cost. This paper proposes a sparse mutli-head attention universal transformer that reduces the computational complexity of UTs. To this end the authors propose to optimize a mutual information based loss on a sparse mixture of multi-head attention architecture. The proposed loss overcomes the difficulty in training due to discreteness of the gradients. They also propose to extend the halting process using stick breaking. The paper shows empirically that the proposed SUT method generalizes on formal language tasks such as CFQ while being computationally efficient. The paper also provides an ablation to study which of the different components proposed in the paper contributes to reducing training loss.
> Strengths And Weaknesses:
> Originality
> This paper proposes to combine the efforts such as Sparse Mixture of Experts (SMoE), Mixture of Attention Heads (MoA) with Universal Transformers. This is an additive contribution which is supplemented by the introduction of the MIM loss and the stick breaking dynamic halting method.
> Quality
> The problem is well motivated and the SUT method is clear. The experiments presented in the paper clearly show that SUT achieves similar accuracy as UT, but with far fewer MACs than UT. Moreover the ablation study shows the usefulness of the different components in SUT.
> Clarity
> The motivation and ideas presented in the paper are clear and experiments support the claims made in the paper.
> Significance
> The paper is attempting at resolving a significant computational challenge in UTs. The significance of the contribution would have significantly increased if the authors could show what kind of applications that previously UTs could not be applied to is now unlocked due to the lower computational complexity of SUT
> Questions:
> Why does UT perform so poorly when compared to SUT in table 1?
> In table 2, I would have liked to see a comparison of UT MACs to SUT’s MAC at the base level. A comparison/ plot of MACs vs accuracy with increasing size will also be pretty interesting.
> Have the authors done any ablation study to compare MACs with model size for both UT and SUT. Since the authors are proposing the sparse variants, it will be better to give guidelines as to when or at what size SUT should be preferred over UT.
> As mentioned in the previous section, it would be very helpful for the paper, if the authors can comment/provide evidence on new capabilities that has been unlocked by SUTs eg better accuracy using deeper models, since they now have reduced MACs
> Minor Comments: Line 152, 153 second column typo W^{q}{e,i} \in -> W^{O}{e,i} \in ...
> Limitations:
> Some comments on when to use UT vs SUT will be very helpful
> Ethics Flag: No
> Soundness: 3 good
> Presentation: 4 excellent
> Contribution: 2 fair
> Rating: 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
> Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
> Code Of Conduct: Yes
---
### Response to Reviewer #2
Thank you for your comments!
> Why does UT perform so poorly when compared to SUT in table 1?
The Transformer in Table 1 is a vanilla Transformer, and its results are taken from the Ordered Memory (Shen et al., 2019) paper.
> In table 2, I would have liked to see a comparison of UT MACs to SUT’s MAC at the base level.
We will include an extra column in Table 2 to reflect the MACs for the respective models. We have provided the numbers in a comment above.
> A comparison/ plot of MACs vs accuracy with increasing size will also be pretty interesting.
We can include this plot in the camera-ready, but we believe that performance on CFQ is not a function of model size (which also influences MACs). This is evidenced by the MAC values we have computed, the parameter sizes of the models, and their performance on CFQ.
>Have the authors done any ablation study to compare MACs with model size for both UT and SUT.
These comparisons are made in the final experiment, where we provide both the parameter counts and MACs. The SUT's MACs stay constant despite the increase in parameter count.
> As mentioned in the previous section, it would be very helpful for the paper, if the authors can comment/provide evidence on new capabilities that has been unlocked by SUTs eg better accuracy using deeper models, since they now have reduced MACs
The main capability the SUT unlocks is the possibility of further scaling up UTs to benefit from their generalisation properties. This is a first but important step towards training an SUT-based LLM.
One hypothesis we hold is that the smaller components (experts) can now be used at any layer of the computation, unlike in a standard sparse (non-universal) Transformer, where each expert is tied to a specific layer. This modularity may lead to better generalisation. We have included a diagram in the appendix showing the specialisation of expert roles on the Propositional Logic task, which suggests that this sort of modularisation and composition is possible.
> Since the authors are proposing the sparse variants, it will be better to give guidelines as to when or at what size SUT should be preferred over UT.
Thank you for this excellent point. We will emphasise it in the conclusion of the paper by adding the following:
**We are able to scale the UT without incurring an increase in computation (see Figure 1). In our translation experiments, where we compare the SUT against an equivalently sized UT, the dense UT tends to perform better. Given this evidence, SUTs are useful when hardware constraints prevent scaling up the dense UT.**
---
> Official Review by Reviewer qSx7
> Summary:
> Premise. This work extends the universal Transformer architecture by making 3 modifications to reduce the computational complexity and allow scalability. They show that their proposed modifications lead to higher performance on two compositional benchmark datasets and the WMT'14 translation datasets without excessive computational costs.
> Method. The universal Transformer architecture is theoretically more expressive than vanilla Transformers (due to unlimited depth in theory) but computational costs are quadratic with the number of layers when they have the same number of parameters as vanilla Transformers (due to parameter sharing across layers). To address that, this work adapts the sparse mixture of experts approach and replaces the standard feedforward and attention blocks with mixtures. They introduce an auxiliary loss adapted from recent work to mitigate issues with training sparse networks. Lastly, they propose a different halting mechanism which allows them to penalize layer use.
> Results. They show that their proposed approach SUT outperforms several baselines in the logical inference benchmark task and the CFQ benchmark. On WMT'14 their method performs competitively with relatively lower computational costs. They conduct ablation studies on WMT'14 to show the effectiveness of each component. They also demonstrate that the computational costs of the model can be reduced post-training by changing the threshold for halting.
> Strengths And Weaknesses:
> Strengths
> Interesting Approach. As mentioned in the paper, universal Transformers have certain desirable properties in terms of expressiveness but are more difficult to scale. This work takes a step towards making them scalable and more usable from a practical point of view. Although the individual components of the method (sparsity, auxiliary loss, and halting) themselves are not novel, it seems like a useful combination that works well to improve the original universal Transformers architecture.
> Significant Improvements in Performance. On tasks such as CFQ, the method seems to lead to significant improvement in performance over baselines. Even on WMT'14, the approach performs quite competitively with a reasonably lower number of parameters.
> Writing. The paper is well written and the general narrative of the paper is easy to follow. The problem is well-motivated and the experiments/results are well-explained.
> Weaknesses
> Choice of Tasks and Baselines. The choice of tasks seems quite narrow (2 synthetic and 1 translation) to confidently ascertain the efficacy of the proposed approach. In my opinion, since the proposed method is aimed to scale universal Transformers, it would be important to see how the method performs on tasks where the original paper [1] claimed to perform better than vanilla Transformers. This involves algorithmic tasks and language modelling tasks.
> The paper cites [2] stating that they build on universal transformers to improve compositional generalization capabilities. There are a few questions related to that. I believe the adaptive halting mechanism separates universal transformers from other variants. Based on my understanding, [2] does not use adaptive halting and their method involves a copy gate mechanism. Since the SUT method performs well on CFQ, it would be useful to see how it would perform on (some) tasks used in [2] such as listops or compositional lookup table. Secondly, it is imperative to understand how important the halting mechanism is in improving performance. It would be very useful to see how the (extremely simple) copy gate in [2] performs in comparison with the halting mechanism proposed here. In [2], they reported that they did not observe any gains via previous halting mechanisms in their experiments.
> Novelty. The individual components of the method are derived from previous works. As mentioned earlier, the combination still seems useful from the perspective of scaling universal transformers but the insights that can be drawn from their method seems quite limited.
> Unclear Results. Some aspects of the results and motivation are unclear. I have posed them as questions in the next section.
>
> Other Comments. Some of the motivations relating to expressiveness and generalization do not seem technically sound. If a model is more expressive or Turing complete then it does not necessarily imply that it will generalize better. Models cannot learn what they cannot express but they cannot necessarily learn functions that they can express. Lines 015 - 017 in the abstract do not seem correct. The expressiveness of a model does not have any relation with systematic/compositional generalization. They are separate properties. For sequence-to-sequence tasks (such as translation considered in the paper) involving encoder-decoder models, even vanilla Transformers are Turing-complete. The theoretical limitations in terms of expressivity apply to Transformer encoders. It would be useful to rephrase some of these sentences.
> I have given a score of 5 to the paper. I am open to increasing my score based on the authors' responses.
> [1] Universal Transformers. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser. ICLR 2019
> [2] Csorda ́s, R., Irie, K., and Schmidhuber, J. The neural data router: Adaptive control flow in transformers improves systematic generalization. ICLR 2022
> [3] On the Turing Completeness of Modern Neural Network Architectures. Jorge Pérez, Javier Marinković, Pablo Barceló. ICLR 2019
>
> Typos:
> L048: Feedforword -> Feedforward
> L070: condition -> conditioned
> L124: 4-tuple -> 3-tuple ?
> L190: auxilliary -> auxiliary
> L236: it these
> Eq (1): S is not defined prior to usage. It is clear from context but could be clearer for the reader if defined earlier.
> Questions:
> The complete ablation is provided for WMT but it would be useful to see that for the other two tasks as well where there are significant gains due to the proposed approach. That would help the reader understand which component leads to major gains on tasks like CFQ. Currently, the results with and without halting are provided in table 2.
> The performance of the original UT on CFQ (table 2) is quite poor. In lines 317-319, it is mentioned that it could be due to differences in the experimental setup/implementation. Would it not be more useful if you could run the original UT with parameters that closely follow your setup for a fairer comparison?
> Line 348: SUT big 787M MAC, Is that a typo? Since SUT base in Line 340 also seems to have 787M.
> Limitations:
> NA
> Ethics Flag: No
> Soundness: 3 good
> Presentation: 3 good
> Contribution: 3 good
> Rating: 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
> Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
> Code Of Conduct: Yes
---
### Response to Reviewer #3
Thank you for your in-depth and insightful comments!
**TODO summarise reviewers comments somewhat**
> The paper cites [2] stating that they build on universal transformers to improve compositional generalization capabilities. There are a few questions related to that. I believe the adaptive halting mechanism separates universal transformers from other variants. Based on my understanding, [2] does not use adaptive halting and their method involves a copy gate mechanism.
**TODO clear up references**
Sorry for the confusion. There are actually two papers from the same year by the same authors [1,2]:
1. *The devil is in the detail: Simple tricks improve systematic generalization of transformers*, which showed that parameter sharing and relative position embeddings yield better performance on compositional generalisation tasks.
2. *The neural data router: Adaptive control flow in transformers improves systematic generalization*.
In both cases, they do not use adaptive halting as used in the UT paper. However, because the UT paper itself treats the halting mechanism as “optional” (some experiments use it while others do not), common usage of UT tends to refer to parameter sharing across layers (Bergen et al., 2021, for example).
You are right that this makes the point we are making unclear, and we will add in a sentence noting that:
**these works do not use dynamic halting as part of their description of UT, but rather demonstrate the usefulness of using shared parameters across layers.**
**TODO elaborate on halting**
> Secondly, it is imperative to understand how important the halting mechanism is in improving performance. It would be very useful to see how the (extremely simple) copy gate in [2] performs in comparison with the halting mechanism proposed here. In [2], they reported that they did not observe any gains via previous halting mechanisms in their experiments.
- Sorry for the confusion.
- We find the same result: halting affords a slight but insignificant performance gain.
- We show in the final experiments that the halting mechanism can, however, be used to trade computation for performance.
- Quote: "Thanks to the new halting mechanism, SUT can reduce up to 50\% of the computation cost during inference time while maintaining the same level of performance."
>The choice of tasks seems quite narrow (2 synthetic and 1 translation) to confidently ascertain the efficacy of the proposed approach. In my opinion, since the proposed method is aimed to scale universal Transformers, it would be important to see how the method performs on tasks where the original paper [1] claimed to perform better than vanilla Transformers. This involves algorithmic tasks and language modelling tasks.
> Since the SUT method performs well on CFQ, it would be useful to see how it would perform on (some) tasks used in [2] such as listops or compositional lookup table.
Given the time we had, we ran experiments on the standard ListOps dataset and its Long-range Arena variant:
| Dataset | Acc.|
|---------|-----|
| ListOps | 91% |
| Long-range Arena ListOps | 49% |
On the Long-range Arena ListOps benchmark, this is, to our knowledge, currently the best-performing Transformer-based model.
> Some of the motivations relating to expressiveness and generalization do not seem technically sound. If a model is more expressive or Turing complete then it does not necessarily imply that it will generalize better. Models cannot learn what they cannot express but they cannot necessarily learn functions that they can express. Lines 015 - 017 in the abstract do not seem correct. The expressiveness of a model does not have any relation with systematic/compositional generalization. They are separate properties.
> For sequence-to-sequence tasks (such as translation considered in the paper) involving encoder-decoder models, even vanilla Transformers are Turing-complete.
**TODO**
- We agree that Turing completeness does not imply compositional generalisation.
- However, only with sufficient expressiveness can a model learn to generalise to more complex problems _given enough data_...
- We will modify the text to reflect this...
> **Unlike Vanilla Transformers (VTs), UTs can be shown to be Turing-complete under certain assumptions. This feature allows for better compositional generalization than VTs in formal language tasks.**
to
> **Unlike Vanilla Transformers (VTs), UTs can be shown to be Turing-complete under certain assumptions.**
- The Turing-completeness conclusion rests on initial assumptions and very specific adjustments to the use of Transformers that are contrary to the general practice of using them...
- We believe the Turing-completeness of Transformers is still a hotly discussed topic; we will modify the abstract and introduction to the following...
From our understanding, this result is from *On the Computational Power of Transformers and its Implications in Sequence Modeling* (Bhattamishra et al., 2020).
Our reading of prior theoretical work is that the results on theoretical limitations differ based on starting assumptions (arbitrary vs. limited precision), and empirical evidence suggests the limited-precision assumption is more grounded in reality.
We will add some discussion about the topic in the appendix, as it may be useful.
According to your suggestion, we have modified:
**Unlike Vanilla Transformers (VTs), UTs can be shown to be Turing-complete under certain assumptions. This feature allows for better compositional generalization than VTs in formal language tasks.**
to
**Unlike Vanilla Transformer (VT) encoders, UT encoders can be shown to be Turing-complete under certain assumptions. There is also empirical evidence that UTs have better compositional generalization than VTs in formal language tasks.**
> The performance of the original UT on CFQ (table 2) is quite poor. In lines 317-319, it is mentioned that it could be due to differences in the experimental setup/implementation. Would it not be more useful if you could run the original UT with parameters that closely follow your setup for a fairer comparison?
We've since realised there are several reasons for the difference beyond the experimental setup and implementation. The Bergen et al. (2021) UT benchmark is based on a T5 architecture with all layers shared. T5 has slight differences in its Transformer block (pre-layernorm vs. post-layernorm) that result in different training dynamics.
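For concreteness, the two block orderings differ roughly as follows (a minimal, illustrative sketch, not either model's exact code; T5 additionally simplifies the normalization itself):

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-layernorm (original Transformer): normalize after the residual addition."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-layernorm (T5-style): normalize the sublayer input; the residual path stays unnormalized."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```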
> Line 348: SUT big 787M MAC, Is that a typo? Since SUT base in Line 340 also seems to have 787M.
There is a negligible increase in MACs (relative to the overall computation required) from scaling up the number of experts, because the only additional compute comes from the MoE router. This is a feature of using MoEs to scale: computation cost stays relatively constant while capacity increases. This is helpful not only during training but also when deploying on more limited resources.
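To see why the router term is negligible, a quick back-of-the-envelope count (hypothetical sizes, for illustration only):

```python
def router_macs(d_model, num_experts, num_tokens):
    """MACs of the routing layer: a single d_model x num_experts projection per token."""
    return num_tokens * d_model * num_experts

# Doubling the number of experts only doubles this small term:
print(router_macs(512, 32, 20))  # 327,680 MACs
print(router_macs(512, 64, 20))  # 655,360 MACs -- tiny next to the totals reported above (hundreds of millions)
```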
We will fix the other mistakes you pointed out in the camera-ready; thank you for the attention to detail! We hope our response is satisfactory and that you are willing to increase your score. Do let us know if you have any further questions or suggestions for improving the paper.
---
> Official Review by Reviewer yqQz
> Summary:
> The Universal Transformer is a variant of the transformer that shares parameters between layers and runs for a variable number of layers. It uses more FLOPs and fewer parameters, and is better at some language tasks that involve recurrence or recursion. Sparse models and mixture-of-experts (SMOEs) allow more parameters to be used with the same FLOPs.
> The authors combine these two techniques, implementing a SMOEs Universal Transformer. The concept is straightforward, and advances SOTA in two ways. First, it provides more flexibility in balancing the FLOPs vs parameters cost of language models. Second, it provides recursive capabilities to larger models, which the vanilla transformer lacks.
> The authors also make several architectural improvements. (1) They apply SMOEs to multi-head attention, not just the FFD portion of the transformer. (2) They introduce a better load-balancing loss for MOE, and (3) they introduce a soft-halting mechanism which has better gradient flow. These are all straightforward extensions, based on prior published work, but have not been combined in this way before.
> Strengths And Weaknesses:
> This paper is well written and clear. The authors have done an excellent job of pulling together a number of different ideas from the literature, and combining them into a novel and well-designed architecture. Universal Transformers, SMOEs, MOE attention, and MIM scheduling were all previously published, but they have not been combined in this way before. The author's formulation of dynamic halting, while conceptually simple, also seems quite important to me. I think that this is an exciting direction for future research, which addresses some of the main weaknesses of current transformer architectures.
> The authors do a good job of finding tasks that showcase the Universal Transformer's ability to handling recursive language modeling and reasoning. They also show that it is competitive on at least one traditional language modeling task: translation.
> I would have like to see a comparison of the Sparse Universal Transformer on other tasks, like autoregressive language modeling, or some of the downstream tasks that LLMs are currently being evaluated on. I would also like to see scaling studies, especially since the main advantage of the SUT is its scalability.
> Questions:
> (1) Because each token in the sequence may halt at a different time, the number of tokens that need to be evaluated in each layer will steadily decrease, thus saving additional compute, at the cost of implementation complexity. Does your implementation do this, or does it merely mask out the halted instructions?
> (2) Have you done any language-modeling experiments besides translation?
> (3) Many scaling studies have been done with vanilla transformers, but fewer have been done with other architectures. In particular, it is not at all clear whether FLOPs (compute) or parameters (memory) are the limiting factor in large language models, because in a vanilla transformer, both parameters and FLOPs are scaled together.
> Universal Transformers allow a model to increase compute, without increasing the number of parameters, while SMOEs increase parameters, without increasing compute. In addition, there are variations on a theme: e.g. one could have partial sharing of parameters between layers (some shared experts, some non-shared). It is not clear that the UT uses additional compute as effectively, or that SMOEs use additional parameters as effectively, when compared to a vanilla transformer.
> The scaling space here is admittedly much harder to characterize, since there are more variables involved. Have you thought about doing any scaling experiments to characterize the tradeoffs between parameters and compute, or are you aware of other studies in the literature?
> Limitations:
> No. The authors do not consider societal impact.
> Ethics Flag: No
> Soundness: 4 excellent
> Presentation: 4 excellent
> Contribution: 3 good
> Rating: 8: Strong Accept: Technically strong paper, with novel ideas, excellent impact on at least one area, or high-to-excellent impact on multiple areas, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
> Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
> Code Of Conduct: Yes
---
### Response to Reviewer #4
Thank you for your comments!
> I would have like to see a comparison of the Sparse Universal Transformer on other tasks, like autoregressive language modeling, or some of the downstream tasks that LLMs are currently being evaluated on. I would also like to see scaling studies, especially since the main advantage of the SUT is its scalability.
Yes, we agree that these are important benchmarks, and we are applying SUTs to these tasks (particularly language modelling) with some success. Given our limited compute, the SUT unlocks capabilities that we can now explore further.
> (1) Because each token in the sequence may halt at a different time, the number of tokens that need to be evaluated in each layer will steadily decrease, thus saving additional compute, at the cost of implementation complexity. Does your implementation do this, or does it merely mask out the halted instructions?
Yes, it does. This is a convenient side-effect of having to implement routing at each module within the Transformer block. Since the hidden embeddings are split into $E$ separate tensors each time routing is performed, we can drop the halted states before they are routed to their respective experts, saving compute.
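A minimal sketch of what this looks like (hypothetical names and shapes; the real implementation is more involved):

```python
import torch

def dispatch_unhalted(hidden, expert_idx, halted, num_experts):
    """Group tokens by expert, skipping tokens that have already halted (illustrative only).

    hidden:     [num_tokens, d_model] flattened token states
    expert_idx: [num_tokens] expert chosen for each token by the router
    halted:     [num_tokens] boolean mask of tokens that have already halted
    """
    active = ~halted
    buckets = []
    for e in range(num_experts):
        take = active & (expert_idx == e)                   # only active tokens reach expert e
        buckets.append((take.nonzero(as_tuple=True)[0],     # positions, to scatter results back later
                        hidden[take]))                       # the (smaller) batch this expert processes
    return buckets
```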
> (2) Have you done any language-modeling experiments besides translation?
**TODO add preliminary results on WikiText**
This is currently underway, but it involves other novel contributions beyond the SUT presented in this paper. For this work, we believed that translation still had room for improvement from the inductive biases of UTs combined with scale, which is why we chose WMT'14.
> (3) ... The scaling space here is admittedly much harder to characterize, since there are more variables involved. Have you thought about doing any scaling experiments to characterize the tradeoffs between parameters and compute, or are you aware of other studies in the literature?
We are not aware of other work in the literature, but there is indeed something unique about SUTs: we can increase computation (more depth) without increasing parameter count, and increase parameter count (more experts) without increasing computation, decoupling the two factors.
The experiments in the paper, however, indicate that when an SUT is compared against an equivalently sized dense UT, the dense UT tends to perform better. There are other confounding factors, such as the router, that could possibly close this performance gap; that, too, is ongoing work.
---
#### Extra notes
The authors make several assumptions to arrive at that conclusion:
1. Siegelmann & Sontag (1992) show that RNNs are Turing-complete.
2. They prove that an encoder-decoder Transformer can simulate an RNN by further assuming that the previous hidden state of a transformer decoder is fed into the current time-step of the decoder.
With both (1) & (2) they show that Transformers are Turing complete.
For (1), Siegelmann & Sontag start with the premise that the values of the RNN hidden states can be arbitrarily precise. Without this assumption, the standard Elman RNN can recognise only regular languages (https://aclanthology.org/2020.acl-main.43.pdf).
For (2), standard usage of the Transformer does not follow this pattern, using teacher-forcing of the input _tokens_ to achieve parallelism during training. To achieve the outcome in (2), training can no longer be efficiently parallelised as every hidden state will have to be computed before computing the next hidden state, much like an RNN.
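To make point (2) concrete, here is a sketch (illustrative only) of the decoding pattern that assumption implies: each step consumes the previous hidden state, so the decoder passes must run one after another, whereas teacher-forced training processes all target positions in a single parallel pass over ground-truth tokens.

```python
def sequential_decode(decoder_step, h0, num_steps):
    """Unroll a decoder that feeds its own previous hidden state back in (illustrative sketch)."""
    h = h0
    states = []
    for _ in range(num_steps):   # inherently serial: step t+1 depends on step t
        h = decoder_step(h)      # each call is a full decoder forward pass
        states.append(h)
    return states
```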
In fairness, standard usage of the UT uses finite depth, so standard usage and theoretical capabilities will also differ. However, due to parameter sharing, arbitrary depth is possible, though we did not find this to help on the tasks we applied the SUT to.
There are several theoretical results that do not make the arbitrary precision assumption (albeit with their own set of assumptions):
https://arxiv.org/abs/1906.06755
https://arxiv.org/abs/2106.16213
https://arxiv.org/abs/2204.06618
And several empirical tests with transformers on synthetic formal languages that roughly confirm those results:
https://arxiv.org/abs/2207.02098
https://arxiv.org/abs/2009.11264