# Memory safe computations with XLA compiler (NeurIPS 2022 Rebuttal)

## Scores

* Mdhs (6 weak accept, confidence 4)
* esE8 (6 weak accept, confidence 3)
* 1TM1 (4 borderline reject, confidence 3)

<!--
#### Common reply

We would like to thank all reviewers for the time spent and the valuable feedback on the paper.

* Our work addresses situations where a user faces feasibility problems when running machine learning models.
* We would like to clarify that XLA is a general-purpose machine learning compiler. The XLA abbreviation stands for accelerated linear algebra [1]. The TensorFlow, PyTorch and JAX frameworks (the driving force behind JAX is XLA) are used broadly and successfully in machine learning, and ...
* kNN and Gaussian process models are in many ways either computationally more complicated than, or comparable to, deep learning models. For example, Gaussian processes include non-trivial linear algebra operations such as Cholesky and triangular solve, and the naive version of kNN querying relies on sorting methods in the top-k operation.
* Historically, deep learning is one of the reasons for the appearance and popularity of machine learning frameworks.

Questions:

* There seems to be some confusion about performance in the kNN experiments and KeOps. Should we discuss the implementation details of KeOps and show its restrictions, such as:
  * PyTorch-only implementation; users who do not use KeOps cannot benefit from it.
  * It creates its own computational kernels, which adds a maintainability burden: reduction operations, matrix products, etc., and a custom efficient argkmin operation created specifically for kNN.
  * Constrained to 3d tensors maximum (?)
  * Dependency on a specific C++ compiler, cmake, etc. Slow compilation for large tensors (up to a minute at size 1e6 for the same squared exponential kernel, whereas XLA is almost instant).
* 1TM1 has multiple strange claims, but my favourite is the downplay of the kNN and SGPR models. How can we address it?
* All reviewers pointed out the lack of deep learning experiments. How do we address them?
  * Convolutional models are out of consideration. In principle we could implement automatic splitting on the batching dimension in convolutional NNs, but that would take a lot of time.
  * Attention mechanism. Look at the discoveries below.
  * Single-layer MLP: increase the size of the hidden layer and train the model with full-batch Adam?

Discoveries:

* The attention function is splittable. We should try to run a transformer from the TensorFlow tutorial [3].

TODOs:

* Upload a new pdf with the new (arXiv) algorithm description (don't forget to remove links to the code and repos).

[1] https://www.tensorflow.org/xla
[2] Chern, Felix and Hechtman, Blake and Davis, Andy and Guo, Ruiqi and Majnemer, David and Kumar, Sanjiv, TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s, 2022
[3] https://www.tensorflow.org/text/tutorials/transformer
-->

<!--
Slice of machine learning, geometric deep learning, repeat the point. Short is better, brainstorm, cut down to the crucial bits, arguments should be as short as possible.
Missing KeOps performance: we are as generic as possible, to be able to optimise as much code as possible, whereas KeOps cannot optimise SGPR. Challenge the claim of outperforming KeOps. Make tiling transparent.
-->

## Review: Mdhs (6 weak accept, confidence 4)

Summary: This paper extends the XLA compiler with several memory optimizations. Users can run much larger scale computations without changing any code.
The evaluation shows that this extension enables k-nearest neighbor and sparse Gaussian process regression methods to run at a much larger scale on a single device.

Strengths And Weaknesses:

Strengths:
* Good direction: integrate memory optimization as compiler passes.
* Impressive results: run much larger scale computations on a single device without code change.

Weaknesses:
* It is still a prototype with many missing optimizations, as it cannot outperform KeOps.

Questions: Tiling/blocking is a widely used technique to save memory. This paper [1] applies this idea to self-attention, which is a very important and memory-intensive operator for long sequence transformers. Can the technique in eXLA also automatically discover the tiling pattern in [1]?

[1] Self-attention Does Not Need O(n^2) Memory, arXiv 2021

Limitations: There are other memory optimization techniques such as swapping [2] and gradient-checkpointing/rematerialization [3][4]. These techniques naturally fit into compiler optimizations. You can also try these methods in XLA.

[2] Swapadvisor: Pushing deep learning beyond the GPU memory limit via smart swapping, ASPLOS 2020.
[3] Training deep nets with sublinear memory cost, arXiv 2016.
[4] Dynamic Tensor Rematerialization, ICLR 2021.

### Reply ideas to Mdhs

- "It is still a prototype with many missing optimizations, as it cannot outperform KeOps"
  - While it is true that we do not outperform KeOps for k-NN, this is not true in general.
  - Our goal was to introduce optimisations that could be applied in very generic situations.
  - As a consequence, eXLA can optimise code that KeOps cannot, for example SGPR. We provide an optimised solution that KeOps simply cannot provide.
  - This is because KeOps was designed specifically with optimising k-NN in mind.
  - However, their solution is very inflexible; for example, they cannot optimise when tensors have rank >3, which is necessary in many machine learning settings.
- "Tiling/Blocking is a widely used technique to save memory."
  - Yes, it is widely used, and it has been proven to work.
  - Our contribution is to apply these optimisations automatically, so you don't need to write cumbersome custom code anymore.
  - This makes these optimisations more accessible and less time-consuming to use.

### Openreview reply to Mdhs

Dear Mdhs,

Thank you for your feedback on the paper. We will address your comments point by point, starting with the emphasized weakness.

> ... it cannot outperform KeOps

The focus of our optimisations is memory safety and minimal user involvement. Both KeOps and eXLA satisfy the memory safety requirement, but KeOps is not as generic as eXLA in applying memory optimisations. For example, the memory optimisations applied by eXLA to SGPR are not possible with KeOps. The performance of KeOps is explained by smart code generation using symbolic representations, together with efficient implementations of reduction schemes like `argkmin` (indices of the minimal k elements in an array), followed by native code compilation. Another drawback of KeOps is that users who want to benefit from the KeOps kNN implementation have to use the PyTorch and KeOps interfaces. With eXLA, users are free to select any framework that supports XLA and are not constrained to specific interfaces. We will make those points clearer in the paper.
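As a side note for ourselves: a toy numpy sketch of the `argkmin` reduction behind a naive kNN query, and of how splitting the query dimension bounds the intermediate distance matrix. This is purely illustrative, neither the KeOps nor the eXLA implementation, and the chunk size is arbitrary.

```python
import numpy as np

def argkmin(queries, points, k):
    # Naive kNN query: materialises the full [n_queries, n_points] distance matrix.
    d2 = (queries ** 2).sum(1)[:, None] - 2 * queries @ points.T + (points ** 2).sum(1)[None, :]
    return np.argpartition(d2, k, axis=1)[:, :k]  # indices of the k nearest points per query

def argkmin_split(queries, points, k, chunk=256):
    # Same result, but only a [chunk, n_points] slice of distances is live at any time.
    return np.concatenate(
        [argkmin(queries[i:i + chunk], points, k) for i in range(0, len(queries), chunk)]
    )

rng = np.random.default_rng(0)
q, p = rng.standard_normal((1000, 3)), rng.standard_normal((5000, 3))
assert np.array_equal(np.sort(argkmin(q, p, 5), axis=1),
                      np.sort(argkmin_split(q, p, 5), axis=1))
```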
<!--
(ARTEM: version 2)

The focus of our optimisations is memory safety and minimal user involvement. Primarily, the kNN case satisfies the following comparison criteria between KeOps and eXLA:

* *Accessibility*, i.e. how many platforms and machine learning frameworks benefit from the optimisations introduced by either KeOps or eXLA. KeOps is restricted to GPUs and CPUs. eXLA, in turn, can potentially also be used on two more architectures, such as TPUs or IPUs. Additionally, eXLA is available to frameworks that have support for XLA: TensorFlow, JAX, PyTorch (and partly the Julia and Swift languages).
* *User-friendliness*. The eXLA optimisations are implicit for users, while a KeOps user would need to adapt the code to the KeOps interfaces.
* *Memory safety*. Both satisfy this requirement. For KeOps, however, this feature is limited to a much smaller scope of operations, and models such as SGPR cannot be optimised with KeOps.

Performance-wise, KeOps does have an advantage over XLA solutions in certain situations like kNN, as KeOps has full control of the code generation and produces efficient code for reduction schemes such as `argkmin`, which is the core procedure in kNN's query search requests. eXLA modifies the internal representation of the computational graph (HLO IR) and does not have control over the instruction executor, whereas KeOps generates C++ code directly, which is compiled into native code. In the future, the integration of XLA with MLIR, and exploiting LLVM native compilation, could help improve performance.

We will make those points clearer in the paper.
-->

> Tiling/Blocking is a widely used technique to save memory.

Indeed, tiling, blocking and splitting are widely used and proven to work well. The advantage of our approach is that we can apply these optimisations automatically, without needing the user to change any code. So, while these optimisations are already known, our solution allows them to be used without additional effort, which we believe will be useful to a large range of people. [1] is a handcrafted solution and not incorporated into a compiler, which restricts its usage.

> Can the technique in eXLA also automatically discover the tiling pattern in [1]?

We ran the Transformer model from [5] using eXLA. For the attention block of the transformer model, eXLA generated a computation similar to the one proposed in [1]. The advantage of our method is that eXLA automatically adjusts the computational graph depending on the tensors' sizes and shapes. I.e., for an attention block with a short sequence length and a large batch dimension, eXLA could go for splitting the batch dimension. When the sequence length is large and the batch is small, eXLA could decide to split the computational graph at the sequence dimension; a toy sketch of such a sequence-dimension split is included after the image links below.

- [7] is an image of the attention HLO IR graph *before* eXLA optimisations
- [8] is an image of the attention HLO IR graph *after* eXLA optimisations
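The split referred to above, as a toy numpy sketch: chunking standard softmax attention over the query/sequence dimension reproduces the unsplit result while bounding the live score matrix. This is for intuition only, not the HLO rewrite that eXLA emits, and the chunk size is arbitrary.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: [n, d]; materialises the full [n, n] score matrix.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def attention_split(q, k, v, chunk=128):
    # Same computation, but only a [chunk, n] block of scores is live at any time.
    outs = [attention(q[i:i + chunk], k, v) for i in range(0, len(q), chunk)]
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((1024, 64)) for _ in range(3))
assert np.allclose(attention(q, k, v), attention_split(q, k, v))
```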
> There are other memory optimization techniques such as swapping [2] and gradient-checkpoint/rematerialization [3][4]. These techniques naturally fit into compiler optimizations. You can also try these methods in XLA.

Thank you for sharing [2, 3, 4]. They resemble the lazy tensors idea, where tensors are loaded when needed. We considered that idea at the beginning of the project. However, in practice, implementing lazy tensors in a compiler is very challenging. Lazy tensors would require significant changes to the XLA compiler, in particular to the internals of the executor and the memory allocation scheduler. The level at which we introduce eXLA optimisations is too abstract for that feature.

[1] Self-attention Does Not Need O(n^2) Memory, arXiv 2021
[2] Swapadvisor: Pushing deep learning beyond the GPU memory limit via smart swapping, ASPLOS 2020.
[3] Training deep nets with sublinear memory cost, arXiv 2016.
[4] Dynamic Tensor Rematerialization, ICLR 2021.
[5] https://www.tensorflow.org/text/tutorials/transformer
[6] Chern, Felix and Hechtman, Blake and Davis, Andy and Guo, Ruiqi and Majnemer, David and Kumar, Sanjiv, TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s, 2022
[7] https://imgur.com/b3hBIHn
[8] https://imgur.com/grQwYQE

<!-- The kNN experiment demonstrates the feasibility of running the naive kNN code at a larger scale while re-using the same user code. The comparison criteria are the following: accessibility, user-friendliness and memory safety. -->

## Review: esE8 (6 weak accept, confidence 3)

Summary: Authors propose and implement an XLA compiler extension that replaces algorithms using a lot of memory with more memory-efficient algorithms. This allows computations that use much larger data sizes to run on a single limited-memory device. Authors demonstrate their tool on kNN and sparse Gaussian process regression methods.

Strengths And Weaknesses:

Strengths:
* Authors implement an XLA compiler extension that automatically restructures the underlying computation graph to use less memory and therefore execute larger computations on a single device
* Authors demonstrate that the system can run computations that would have failed with out-of-memory without optimization
* Authors demonstrate that the system can be comparable to manual optimization

Weaknesses:
* The optimization is limited and runs much slower than the KeOps system (authors acknowledge and explain this limitation)
* Limited explanation of the splitting algorithm

Questions:
* It would be good if authors ran their system on a set of models that may or may not contain optimizable code and reported whether the system did not yield any negative results. What I am asking for is a testing run: does the system crash, fail, break the underlying code, slow down the underlying code when run on inputs that it may not expect?
* It would be good if authors explained or walked through Algorithm 1. While it is clear what the algorithm is supposed to do, the details as written are hard to follow. Some detailed questions: What exactly does line 5 of the algorithm mean? What does "define best_split_dim" mean in line 13? Line 14 says to iterate over best_split_dim, but best_split_dim was not assigned anything. Line 14 bullets are not clear either.

### Reply ideas to esE8

- Write something similar to the previous review's reply about KeOps.
- "Limited explanation of splitting algorithm"
  - "We agree that it would be good to add more details to the splitting algorithm."
  - Update the PDF.
  - Promise a bit more on the writing.
- "It would be good if authors ran their system on a set of models that may or may not contain optimizable code and reported whether the system did not yield any negative results. What I am asking for is a testing run: does the system crash, fail, break the underlying code, slow down the underlying code when run on inputs that it may not expect?"
  - Translation: will eXLA break code that would otherwise run well without it?
  - Ideally we would say that the algorithm will *only* split computations when the result remains unchanged. A good description of the algorithm will:
    - Define unambiguously which situations are splittable.
    - Write an algorithm that recognises this.
    - Write an algorithm that introduces a correct splitting.
  - Todo: Write in the rebuttal that what you implemented does this.
  - If the algorithm doesn't recognise a splittable operation, it leaves the graph unchanged.
  - Say that you ran something that is (currently) **NOT** splittable, and that eXLA didn't break it. Artem said dynamic shapes.

### Openreview reply to esE8

We thank reviewer esE8 for the valuable feedback. We will address your listed weaknesses first, and then answer your questions.

> The optimization is limited and runs much slower than KeOps system (authors acknowledge and explain this limitation)

Reviewer Mdhs made a similar point. The main motivation for this work is user convenience in overcoming memory bottlenecks. Although KeOps also achieves this goal, it is more specific in its application area, such as geometric problems. KeOps focuses on a certain type of operation, like the `argkmin` that is the core of kNN. eXLA, on the contrary, is as generic as possible; KeOps optimisations would be ineffective for the SGPR model. It is also worth mentioning that KeOps is not available to TensorFlow and JAX users, and even PyTorch users would need to modify existing code to benefit from KeOps.

<!--
(ARTEM: version 2)

The main motivation for this work was user convenience in overcoming memory bottlenecks. Usually, in situations when a user is experiencing OOM errors, performance becomes secondary, as running less performant code is preferable to not being able to run the code at all. KeOps optimisations target reduction schemes such as `argkmin` (indices of the minimal k elements in an array) by generating highly performant C++ code. In eXLA we have control over the internal representation (HLO IR) of the computational graph, but not over the execution or memory allocation process. Performance gains could be seen in the future when the integration of MLIR and XLA is finalised, such that LLVM could produce efficient native code similar to KeOps. In the case of kNN we found that recent work [1] gives a more performant `argkmin` procedure (TPU only), which we might utilise to catch up with the KeOps performance.
-->

> Limited explanation of splitting algorithm

We will improve the writing of the splitting section in the paper to the extent the page limit allows us. We also hope that the published code will clarify the algorithm even further. We have updated Algorithm 1; please check the screenshot [2].

> What I am asking for is a testing run: does the system crash, fail, break the underlying code, slow down the underlying code when run on inputs that it may not expect?

So far we have tested eXLA on multiple use cases: sparse Gaussian processes, kNN, and transformers. We also tested MLPs with large numbers of hidden units (up to 1e6) in the hidden layer with full-batch training on CIFAR-10 and CIFAR-100. We tested those models up to the extreme where the out-of-the-box implementation failed with OOM. At a smaller scale we tested the correctness of the computation after eXLA optimisations: the eXLA results were equal to the results of the original computation compiled only with XLA, up to numerical precision (small perturbations are inevitable when the computational graph changes). eXLA recognises input graphs which are not optimisable and leaves these inputs unchanged. For example, the splitting of a triangular solve $Lx = y$ is possible only on the batch dimension, i.e. $x, y \in R^{n \times \text{batch}}$. If eXLA's splitting algorithm detects that the memory limits set by a user cannot be satisfied by splitting the batch dimension of the triangular solve, eXLA will not try to modify that part of the graph any further. In addition, XLA has safety checks: if eXLA introduced incompatible shapes or float types into the input graph, XLA would report that immediately.
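To make the triangular-solve example above concrete, here is a small numpy/scipy sketch of splitting $Lx = y$ over the batch dimension of $y$. It is purely illustrative (eXLA performs the equivalent rewrite on the HLO graph, and the stripe size here is arbitrary); the point is that the split leaves the result unchanged.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, batch = 500, 4000
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)   # well-conditioned lower-triangular matrix
y = rng.standard_normal((n, batch))

x_full = solve_triangular(L, y, lower=True)                # one big solve over all right-hand sides

# Split the batch dimension into stripes; each stripe is an independent solve.
stripes = [solve_triangular(L, y[:, i:i + 1000], lower=True) for i in range(0, batch, 1000)]
x_split = np.concatenate(stripes, axis=1)

assert np.allclose(x_full, x_split)
```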
> It would be good if authors explained or walked through Algorithm 1. While it is clear what algorithm is supposed to do, the details as written are hard to follow. Some detailed questions ...

We will update Algorithm 1 in the paper with [2]. We hope it adds more clarity.

* Algorithm 1, line 5: When the lhs and rhs inputs to the `dot` product are not the same tensor, and both the lhs and rhs tensors fail to satisfy the user's constraint on memory, the splitting algorithm will not attempt to split the computational path starting from that `dot` product. The `dot` product in that case does not perform a size reduction.
* Algorithm 1, line 13: `best_split_dim` is the dimension over which the algorithm performs splitting. The splitting is done on the longest continuous path in the graph that contains that dimension. The splitting path starts at a reduction operation, such as `dot` or `reduce_(sum, max, prod, ...)`, and ends at the nodes which satisfy the memory limits and generate the `best_split_dim` dimension. To define `best_split_dim` we use a heuristic.
* Algorithm 1, line 14: Once the splitting path and `best_split_dim` in the graph are defined, the algorithm replaces that path with a loop which iterates over slices of the `best_split_dim` dimension. For example, suppose the splitting algorithm decides to split a tensor of shape `[10, 1000, 1]` at the 2nd dimension of size 1000 into stripes of size 100. The 2nd dimension is `best_split_dim`, and a `while` loop will iterate with step size 100 over that dimension. In XLA the `while` loop is a procedure. The nodes at the start of the splitting paths become arguments to the `while` loop procedure. The splitting paths themselves become the body of the `while` loop and act on a slice. The result of the `while` loop is combined according to the reduction scheme; a toy sketch of this slice-and-combine pattern follows below.
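The slice-and-combine pattern from the line 14 walkthrough, as a toy numpy sketch. This is for intuition only: the real rewrite operates on HLO `while` and slice operations, and the stripe size is chosen by the heuristic rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 1000, 1))   # tensor split at its 2nd dimension (best_split_dim)
b = rng.standard_normal((10, 1000, 1))

# Unsplit computation: a dot-like reduction over the 1000-sized dimension.
full = (a * b).sum(axis=1)               # shape [10, 1]

# Split computation: iterate over stripes of size 100 and combine the partial
# results with the same reduction scheme (sum), so only a [10, 100, 1] slice
# of the intermediate product is live per step.
acc = np.zeros((10, 1))
for start in range(0, 1000, 100):
    sl = slice(start, start + 100)
    acc += (a[:, sl] * b[:, sl]).sum(axis=1)

assert np.allclose(full, acc)
```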
[1] Chern, Felix and Hechtman, Blake and Davis, Andy and Guo, Ruiqi and Majnemer, David and Kumar, Sanjiv, TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s, 2022
[2] Updated version of Algorithm 1, https://imgur.com/a/RYEWTxe

## Review: 1TM1 (4 borderline reject, confidence 3)

Summary: To save memory usage in machine learning workloads, this paper presents a graph-level rewriting algorithm that is integrated with the XLA compiler. The algorithm is tested against two non-deep learning workloads, k-nearest neighbor (kNN) and sparse Gaussian process regression (SGPR).

Strengths And Weaknesses:

Strengths:
* The primary strength of this paper is that it proposes to solve the memory constraint problem, which is an existing pain point of deep learning systems;
* The algorithm is integrated with XLA, a mature deep learning compiler, which makes it easier for productionalization;
* The paper optimizes kNN and SGPR and delivers positive results.

Weaknesses:
* There exists quite a lot of work reducing memory footprint in deep learning [1, 2, 3], all of which are quite non-trivial and made significant contributions to the field of deep learning compilers. This paper does not seem to provide a comparison with those existing literatures.
* The algorithms presented in this paper, including match-and-replace, reordering and dataflow graph splitting, seem to be very familiar but less novel. For example, changing computation order (Sec 4.2) (AB)v = A(Bv) seems to be a classic trick to save memory.
* Even though XLA is a deep learning compiler, and also dataflow rewriting (Sec 4.3) is a classic mechanism in deep learning compilation, the experiments are conducted only on much simpler workloads like kNN, which only contains a few matrix operators and does not have a strong motivation to conduct generic graph-level rewriting. Therefore, it's less clear to the reviewer that the proposed algorithm could be a perfect fit for deep learning compilation.

Questions: The first confusion point to the reviewer is that the work seems to be focusing very much on deep learning compilation, as mentioned on lines 20-21, lines 48-49, lines 98-100, and Section 4.4, all of which are deep learning frameworks and compilers. However, the experiments are only conducted on two much simpler non-deep learning algorithms. Is it possible to demonstrate any experiments with deep learning workloads? The second question is about the literature review. The paper seems to propose KeOps as the major prior work, while ignoring all significant works on memory reduction techniques in deep learning. Is there any consideration not to do so?

Limitations: As mentioned before, the limitation includes the following components:
* The novelty of this work, if only limited to various rewriting, is less significant.
* While focusing on deep learning compilers and frameworks in the paper writing, the experiments are not relevant to deep learning workloads.

### Reply ideas to 1TM1

- "There exists quite a lot of work reducing memory footprint in deep learning [1, 2, 3], all of which are quite non-trivial and made significant contributions to the field of deep learning compilers. This paper does not seem to provide a comparison with those existing literatures."
  - We are interested in seeing the exact references.
  - The unique point of our paper is that we argue that these optimisations need to live inside a compiler, so they can be applied transparently to existing code without needing modification.
  - Based on our literature review, our approach is unique, as other papers that describe optimisation techniques require significant changes in the code to make them work.
- "The algorithms presented in this paper, including match-and-replace, reordering and dataflow graph splitting, seem to be very familiar but less novel. For example, changing computation order (Sec 4.2) (AB)v = A(Bv) seems to be a classic trick to save memory."
  - The whole point is that in our paper it is done automatically.
  - Give an example where you may want to specify the code with one order but run it with a different order, e.g. a specified matrix that turns out to be low rank. In that case the compiler automatically changes the order to make the computation more efficient, without extra effort.
- "Even though XLA is a deep learning compiler, and also dataflow rewriting (Sec 4.3) is a classic mechanism in deep learning compilation, the experiments are conducted only on much simpler workloads like kNN, which only contains a few matrix operators and does not have a strong motivation to conduct generic graph-level rewriting. Therefore, it's less clear to the reviewer that the proposed algorithm could be a perfect fit for deep learning compilation."
  - XLA is a linear algebra compiler, and both k-NN and SGPR rely heavily on linear algebra operations.
  - SGPR does require significant whole-graph rewriting, particularly for the gradients. Picture of how complicated these graphs look (imgur?)
  - k-NN is a sub-algorithm of many geometric deep learning algorithms [cite]

### Openreview reply to 1TM1

Thank you for your comments and useful feedback.

> There exists quite a lot of work reducing memory footprint in deep learning [1, 2, 3]

<!--
We are interested in seeing your references, as it seems you forgot to add the footnotes. Based on our literature review, our solution (eXLA) is unique, as we apply memory optimisations automatically, without the need for modifying code. We argue that for the transparent resolution of out-of-memory (OOM) issues in existing code, the optimisations should be done at the compiler level. Other papers that discuss similar techniques require significant changes in the existing code to make them work, and often these techniques are not considered for OOM-free computations.
-->

We would be interested in seeing your references, as it appears you forgot to add the footnotes to your review. Based on our own literature review, our solution (eXLA) is unique in the manner in which the optimisations are made. eXLA performs its optimisations at the HLO level, which gives the user the ability to inspect the optimisations made in a straightforward manner. In addition, the memory optimisations are performed automatically given bounds on maximum tensor sizes; there is no need for the user to modify their code. Other papers that discuss similar techniques require significant changes in the existing code, and often these techniques do not focus on memory safety and instead focus on speed.

> The algorithms presented in this paper, including ..., seem to be very familiar but less novel. For example, changing computation order (Sec 4.2) (AB)v = A(Bv) seems to be a classic trick to save memory.

<!--
Our eXLA optimisations are additions to the XLA compiler, which means that they are applied automatically so that the user does not have to worry about implementing them in their own code. Therefore, the user can continue to write their own code in a mainstream framework (e.g., TensorFlow, PyTorch, JAX) and have our memory optimisations automatically applied. Even the classic and simple case of ordering matrix-vector products can be extremely useful when the readability and the design of the algorithm are preferable over speed. Our optimisations allow a user to focus more on design choices and think less about efficient computations. Let us consider an example such as the conjugate gradient (CG) algorithm for solving a system of linear equations $Ax = y$. CG is an iterative procedure for finding the solution $x$, with matrix-vector products at the core of the algorithm. To achieve faster convergence of CG, practitioners apply different preconditioners to the matrix $A$. A developer implementing the CG algorithm has two choices when designing an interface for the algorithm: either provide the user with a flexible interface for any type of preconditioner without efficiency guarantees, e.g. matrix-matrix multiplications before the vector product, or provide a restrictive interface for some preconditioners. Our eXLA approach allows the former interface to be provided in code, while still running the fast reordered procedure after optimisation.
-->

We agree that many of the optimisations incorporated in our eXLA optimiser are known approaches for reducing the memory footprint. A key benefit of incorporating these optimisations into the eXLA compiler is that they can then be applied automatically, without the user having to manually tune their code, such as by manually ordering computations in the manner you describe. Performing these optimisations manually is inherently error-prone and time-consuming; the eXLA approach alleviates these issues. As such, the user can continue to write their own code in any of a range of mainstream linear-algebra frameworks (e.g., TensorFlow, PyTorch, JAX) and have our memory optimisations automatically applied.

<!-- ?Introduced set of our eXLA optimisations: the matrix chain multiplication optimisation, the replacement of inefficient expressions and the splitting optimisation all become transparent to the end user of the mainstream frameworks. -->

As an example, in more elaborate cases of ordering matrix-vector products than the one you provided, it is common that readability is sacrificed for memory and speed improvements. This is not desirable; we would like the user to be able to write clear and concise code that works regardless of the specific ordering. Our optimisations allow a user to focus more on design choices and think less about efficient computations, as these optimisations are performed automatically. Let us consider an example such as the conjugate gradient (CG) algorithm for solving a system of linear equations $Ax = y$. CG is an iterative procedure for finding the solution $x$, with matrix-vector products at the core of the algorithm. To achieve faster convergence of CG, practitioners apply different preconditioners to the matrix $A$. A developer implementing the CG algorithm has two choices when designing an interface for the algorithm: either provide the user with a flexible interface for any type of preconditioner without efficiency guarantees, e.g. matrix-matrix multiplications before the vector product, or provide a restrictive interface for some preconditioners. Our eXLA approach allows the former interface to be provided in code, while still running the fast reordered procedure after optimisation; the sketch below illustrates the reordering.
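A toy numpy illustration of the reordering, purely for intuition (eXLA applies the equivalent rewrite on the HLO graph, and the sizes here are arbitrary): the user writes the product in the readable `(A @ B) @ v` form, while the reordered `A @ (B @ v)` form avoids materialising the large intermediate matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 2000, 10
A = rng.standard_normal((n, r))     # e.g. a tall factor of a low-rank operator
B = rng.standard_normal((r, n))
v = rng.standard_normal(n)

naive = (A @ B) @ v                 # materialises an n x n matrix: O(n^2) memory
reordered = A @ (B @ v)             # only n- and r-sized vectors: O(n) memory

assert np.allclose(naive, reordered)
```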
> Even though XLA is a deep learning compiler, and also dataflow rewriting (Sec 4.3) is a classic mechanism in deep learning compilation, the experiments are conducted only on much simpler workloads like kNN, which only contains a few matrix operators and does not have a strong motivation to conduct generic graph-level rewriting. Therefore, it's less clear to the reviewer that the proposed algorithm could be a perfect fit for deep learning compilation.

XLA is designed to be more general than just optimising deep learning, which is indicated by its name standing for accelerated linear algebra [1]. Additionally, TensorFlow positions itself as a system for large-scale machine learning [2], and JAX's description emphasizes that it is a tool for high-performance machine learning research [3]. Both the kNN and SGPR models depend heavily on linear algebra operations, and are still widely used in machine learning problems:

* kNN plays an important role in geometric problems and 3D computer graphics, and it is a sub-algorithm of many existing machine learning models. kNN is used in geometric deep learning [4], for example in models such as Point CNNs [5] and Dynamic Graph CNNs [6].
* It is arguable whether SGPR is a simple model. It is a single "layer", and relies on elegant mathematical concepts and linear algebra that allow it to cope with noisy data, while giving indications of how to adjust the computational complexity through inducing points. The GPflow implementation [15] is fairly elegant, but the final computational graph is far from simple [13]; the snippet below shows roughly how little user code stands behind that graph.
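Roughly the user-level code in question: a GPflow 2.x-style sketch of building SGPR and compiling its ELBO with XLA. This is illustrative only, with random toy data; the argument names follow the GPflow API as we recall it and may differ slightly between versions, and the eXLA memory limits are configured on the compiler side rather than in this code.

```python
import numpy as np
import tensorflow as tf
import gpflow

# Random toy data; the experiments in the paper use much larger datasets.
X = np.random.rand(10_000, 8)
Y = np.random.rand(10_000, 1)
Z = X[:500].copy()   # inducing points

model = gpflow.models.SGPR(
    data=(X, Y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)

# XLA-compiled ELBO; with eXLA enabled, the same user code respects the memory limits.
elbo = tf.function(model.elbo, jit_compile=True)
print(elbo())
```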
> The first confusion point to the reviewer is that the work seems to be focusing very much on deep learning compilation, as mentioned on lines 20-21, lines 48-49, lines 98-100, and Section 4.4, all of which are deep learning frameworks and compilers.

<!--
We are sorry that the framing of our work is not clear, as we target linear algebra.
* Lines 20-21 and lines 48-49 mention TensorFlow, PyTorch, JAX and XLA, which are general-purpose machine learning frameworks and an accelerator of linear algebra.
* Lines 98-100 additionally mention TVM and Glow. The citations for TVM and Glow are focused on deep learning, whereas the websites [8] and [9] state that TVM and Glow are machine learning compilers. We tested TVM on the kNN and SGPR problems without success.
* Section 4.4 discusses the limitations of XLA. In that section we cited the Julia and Linnea languages, which support strong type systems for tensors, and we advocate that a strong type system would be a powerful tool that could pave the way to even more efficient speed and memory optimisations in computational graphs.
-->

We apologise for not being clear in our initial submission, but our work is focused on improvements to the existing XLA compiler that is widely used by a number of general-purpose machine learning frameworks and linear algebra accelerators, including TensorFlow, PyTorch and JAX. Although these are widely used as a basis for deep-learning frameworks, they are not exclusively applicable to deep learning, as illustrated in our experiments.

> Is it possible to demonstrate any experiments with deep learning workloads?

Yes. We tested eXLA on the transformer model implemented in the TensorFlow website tutorial [10]. We successfully ran that model on random data. The model consists of an encoder and a decoder, each containing 2 multi-headed attention blocks. In our test run we changed only the input sequence length, which we varied from 250 up to 7000. The rest of the configuration of the transformer model is the same as in the tutorial: the batch size is 64, the dimensionality of the representations is 512, and the number of attention heads is 8. The experiment was run on one Nvidia V100 GPU with 32GB. We created a plot [14] with 3 curves representing different compilation settings for the same code: TensorFlow, i.e. the code with `@tf.function()`; XLA, i.e. the code with jit compilation, `@tf.function(jit_compile=True)`; and eXLA, i.e. the code with jit compilation, `@tf.function(jit_compile=True)`, and with our eXLA optimisations turned on, where the limit on tensor sizes was set to 20GB and the maximum splitting size to 5GB. Both TF and XLA failed to run with a sequence length of more than 2000, whereas the same code with eXLA ran successfully with sequence lengths above 2000. The left side of the plot [14] shows the maximum memory allocated on the GPU device vs increasing sequence length, and the right side shows the elapsed time of forward propagation vs increasing sequence length. The kink in the left plot at value 2000 is where eXLA started applying optimisations to the graph; the smaller memory consumption there is a sign that the splitting sizes are too small. The linear memory increase after 2000 indicates that the memory allocator can still allocate multiple tensors at the same time. A minimal sketch of the three compilation settings follows the graph links below.

* [11] is the Transformer HLO graph *before* eXLA optimisations.
* [12] is the Transformer HLO graph *after* eXLA optimisations.
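For reference, a minimal sketch of the three compilation settings, with a single Keras attention layer standing in for the full tutorial transformer and small demo shapes. The eXLA tensor-size limits are configured on the compiler side, so the XLA and eXLA curves share exactly the same source.

```python
import tensorflow as tf

# A single attention block as a stand-in for the tutorial transformer.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

@tf.function                     # "TensorFlow" curve: graph execution without XLA
def forward_tf(x):
    return mha(x, x)

@tf.function(jit_compile=True)   # "XLA" and "eXLA" curves: identical user code;
def forward_xla(x):              # eXLA only changes the compiler passes underneath
    return mha(x, x)

x = tf.random.normal([64, 256, 512])   # batch 64, sequence length 256, model dim 512
_ = forward_tf(x)
_ = forward_xla(x)
```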
> The second question is about literature review. The paper seems to propose KeOps as the major prior work, while ignoring all significant works on memory reduction techniques in deep learning. Is there any consideration not to do so?

In our related work section, we discuss the frameworks that reduce memory cost and provide some degree of automation in doing so. We chose this focus because the contribution of our paper is to automatically perform memory optimization on a specific method, without the need for a user to modify code. It is true that we did not explicitly discuss papers that introduce new deep learning methods with lower memory costs. These new methods always require new implementations, and usually introduce approximations which change the output of the code. This lies outside of our contribution, and so we did not discuss them. However, if there are specific papers that you think we should discuss, please do mention them. We are also willing to consider widening the scope of our related work if you think this is necessary.

[1] https://www.tensorflow.org/xla
[2] https://research.google/pubs/pub45381/
[3] https://github.com/google/jax#what-is-jax
[4] M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data, 2017.
[5] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points, 2018.
[6] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph CNN for learning on point clouds, 2019.
[7] Barthels, H., Psarras, C., and Bientinesi, P. Linnea: Automatic generation of efficient linear algebra programs, 2021.
[8] https://tvm.apache.org/
[9] https://github.com/pytorch/glow
[10] https://www.tensorflow.org/text/tutorials/transformer
[11] Transformer HLO graph before eXLA. This is an svg file; please click `Raw` to see a zoomed-in version. https://gist.github.com/YToKOq80mh/833c0a7ae9b035cd5a2058626e0a73fb
[12] Transformer HLO graph after eXLA. This is an svg file; please click `Raw` to see a zoomed-in version. https://gist.github.com/YToKOq80mh/261480d9b116eeefb6dc55878a06d2a5
[13] SGPR HLO graph. This is an svg file; please click `Raw` to see a zoomed-in version. https://gist.github.com/YToKOq80mh/0349b1d66847f7759a7ee9ee29b7abc
[14] Memory and time metrics for the Transformer forward pass: https://imgur.com/a/R9i7VH8
[15] GPflow, https://github.com/GPflow/GPflow/

## New openreview reply

Thank you for your comments and useful feedback.

> There exists quite a lot of work reducing memory footprint in deep learning [1, 2, 3]

We would be interested in seeing your references, as it appears you forgot to add the footnotes to your review. Based on our own literature review, our solution (eXLA) is unique in the manner in which the optimisations are made. eXLA performs its optimisations at the HLO level, which gives the user the ability to inspect the optimisations made in a straightforward manner.
In addition, the memory optimisations are performed automatically given bounds on maximum tensor sizes; there is no need for the user to modify their code. Other papers that discuss similar techniques require significant changes in the existing code, and often these techniques do not focus on memory safety.

> For example, changing computation order (Sec 4.2) (AB)v = A(Bv) seems to be a classic trick to save memory.

We agree that many of the optimisations incorporated in our eXLA optimiser are known approaches for reducing the memory footprint. A key benefit of incorporating these optimisations into the eXLA compiler is that they can then be applied automatically, without the user having to manually tune their code, such as by manually ordering computations in the manner you describe. Performing these optimisations manually is inherently error-prone and time-consuming; the eXLA approach alleviates these issues. As such, the user can continue to write their own code in any of a range of mainstream linear-algebra frameworks (e.g., TensorFlow, PyTorch, JAX) and have our memory optimisations automatically applied.

> Even though XLA is a deep learning compiler, and also dataflow rewriting (Sec 4.3) is a classic mechanism in deep learning compilation, the experiments are conducted only on much simpler workloads like kNN, which only contains a few matrix operators and does not have a strong motivation to conduct generic graph-level rewriting.

XLA is designed to be more general than just optimising deep learning, which is indicated by its name standing for accelerated linear algebra [1]. Additionally, TensorFlow positions itself as a system for large-scale machine learning [2], and JAX's description emphasizes that it is a tool for high-performance machine learning research [3]. Both the kNN and SGPR models depend heavily on linear algebra operations, and are still widely used in machine learning problems:

* kNN plays an important role in geometric problems and 3D computer graphics, and it is a sub-algorithm of many existing machine learning models. kNN is used in geometric deep learning [4], for example in models such as Point CNNs [5] and Dynamic Graph CNNs [6].
* It is arguable whether SGPR is a simple model. SGPR relies on elegant mathematical concepts that allow it to cope with noisy data, while giving indications of how to adjust the computational complexity through inducing points. The GPflow implementation is fairly elegant, but the final computational graph is far from simple [13].

> The first confusion point to the reviewer is that the work seems to be focusing very much on deep learning compilation, as mentioned on lines 20-21, lines 48-49, lines 98-100, and Section 4.4, all of which are deep learning frameworks and compilers.

We apologise for not being clear in our initial submission, but our work is focused on improvements to the existing XLA compiler that is widely used by a number of general-purpose machine learning frameworks and linear algebra accelerators, including TensorFlow, PyTorch and JAX. Although these are widely used as a basis for deep-learning frameworks, they are not exclusively applicable to deep learning, as illustrated in our experiments.

> Is it possible to demonstrate any experiments with deep learning workloads?

Yes. We tested eXLA on the transformer model implemented in the TensorFlow website tutorial [10]. In our test we ran the Transformer's forward pass on random data. We changed only the input sequence length, which we varied from 250 up to 7000.
The rest of the configuration of the transformer model is the same as in the tutorial. The experiment was run on one Nvidia V100 GPU with 32GB. We created a plot [14] running the same code with TensorFlow, XLA, and eXLA. TF and XLA failed to run with a sequence length of more than 2000, whereas the eXLA code kept working. The left side of the plot shows the maximum memory allocated on the GPU device, and the right side shows the elapsed time of forward propagation vs increasing sequence length. The kink in the left plot at value 2000 is where eXLA started applying optimisations to the graph; the smaller memory consumption there is a sign that the splitting sizes are too small. The linear memory increase after 2000 indicates that the memory allocator can still allocate multiple tensors at the same time.

> The second question is about literature review. The paper seems to propose KeOps as the major prior work, while ignoring all significant works on memory reduction techniques in deep learning. Is there any consideration not to do so?

In our related work section, we discuss the frameworks that reduce memory cost and provide some degree of automation in doing so. We chose this focus because the contribution of our paper is to automatically perform memory optimization on a specific method, without the need for a user to modify code. It is true that we did not explicitly discuss papers that introduce new deep learning methods with lower memory costs. These new methods always require new implementations, and usually introduce approximations which change the output of the code. This lies outside of our contribution, and so we did not discuss them. However, if there are specific papers that you think we should discuss, please do mention them. We are also willing to consider widening the scope of our related work if you think this is necessary.

[1] https://www.tensorflow.org/xla
[2] https://research.google/pubs/pub45381/
[3] https://github.com/google/jax#what-is-jax
[4] M. Bronstein et al. Geometric deep learning: going beyond Euclidean data, 2017.
[5] Y. Li et al. PointCNN: Convolution on X-transformed points, 2018.
[6] Y. Wang et al. Dynamic graph CNN for learning on point clouds, 2019.
[7] Barthels et al. Linnea: Automatic generation of efficient linear algebra programs, 2021.
[10] https://www.tensorflow.org/text/tutorials/transformer
[13] SGPR HLO graph. https://gist.github.com/YToKOq80mh/0349b1d66847f7759a7ee9ee29b7abc
[14] Memory and time metrics for the Transformer: https://imgur.com/a/R9i7VH8

## Links

* Algorithm 1: https://imgur.com/a/RYEWTxe
* Attention: https://imgur.com/a/PlsBfSs
  * after: https://imgur.com/grQwYQE
  * before: https://imgur.com/b3hBIHn
* Transformer:
  * after: svg file, https://gist.github.com/YToKOq80mh/261480d9b116eeefb6dc55878a06d2a5 (click Raw to see a zoomed-in version)
  * before: svg file, https://gist.github.com/YToKOq80mh/833c0a7ae9b035cd5a2058626e0a73fb (click Raw to see a zoomed-in version)
* SGPR HLO IR: svg file, https://gist.github.com/YToKOq80mh/0349b1d66847f7759a97ee9ee29b7abc (click Raw to see a zoomed-in version)
* The transformer plot: https://imgur.com/a/R9i7VH8