
# Improving Babel's performance

## Profiling

Babel is a complex JavaScript library. In a 100K+ line codebase you might find many opportunities for performance improvements. However, one should not start optimizing before the performance bottleneck has been identified; otherwise any improvement may be too small to notice. To find the bottleneck we can [profile](https://nodejs.org/en/docs/guides/simple-profiling/) the following Babel use case.

`transform.js`

```javascript
const babel = require("/path/to/local/babel-core");
// Read the source as a string; without an encoding `readFileSync` returns a Buffer.
const code = require("fs").readFileSync("big-javascript-source.js", "utf8");
babel.transform(code, { filename: "example.js" });
```

```shell
# Generate an `isolate-0xnnnnnnnnnnnn-v8.log` (n is digit)
node --prof ./transform.js
node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed-transform.txt
```

Among the generated profiles, we can examine the bottom-up profile. We will analyze a snippet of the bottom-up profile line by line.

```
 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 1.0% are not shown.

   ticks parent  name
  10296   74.9%  t __ZN2v88internal12_GLOBAL__N_118DefineDataPropertyEPNS0_7IsolateENS0_6HandleINS0_8JSObjectEEENS4_INS0_4NameEEENS4_INS0_6ObjectEEENS0_18PropertyAttributesE
   3791   36.8%    t __ZN2v88internal12_GLOBAL__N_118DefineDataPropertyEPNS0_7IsolateENS0_6HandleINS0_8JSObjectEEENS4_INS0_4NameEEENS4_INS0_6ObjectEEENS0_18PropertyAttributesE
    243    6.4%      t __ZN2v88internal12_GLOBAL__N_118DefineDataPropertyEPNS0_7IsolateENS0_6HandleINS0_8JSObjectEEENS4_INS0_4NameEEENS4_INS0_6ObjectEEENS0_18PropertyAttributesE
     37   15.2%        LazyCompile: ~get /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:99:13
     37  100.0%          LazyCompile: ~create /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:49:9
     37  100.0%            LazyCompile: ~visitMultiple /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:73:16
     25   10.3%        LazyCompile: *visitMultiple /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:73:16
     25  100.0%          LazyCompile: *traverse.node /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/index.js:87:26
     25  100.0%            LazyCompile: *visitQueue /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:96:13
--- more profile is elided for demonstration purposes ---
```

Lines 1:3

```
  10296   74.9%  t __ZN2v88internal12_GLOBAL__N_118DefineDataPropertyEPNS0_7IsolateENS0_6HandleINS0_8JSObjectEEENS4_INS0_4NameEEENS4_INS0_6ObjectEEENS0_18PropertyAttributesE
   3791   36.8%    t __ZN2v88internal12_GLOBAL__N_118DefineDataPropertyEPNS0_7IsolateENS0_6HandleINS0_8JSObjectEEENS4_INS0_4NameEEENS4_INS0_6ObjectEEENS0_18PropertyAttributesE
    243    6.4%      t __ZN2v88internal12_GLOBAL__N_118DefineDataPropertyEPNS0_7IsolateENS0_6HandleINS0_8JSObjectEEENS4_INS0_4NameEEENS4_INS0_6ObjectEEENS0_18PropertyAttributesE
```

reveal that a function called `__ZN2v88internal12_/* tl;dr */` (I will explain it later) consumes 74.9% of the CPU execution time.
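
To get a feel for what a 74.9% share means in practice, here is a minimal sketch of the standard Amdahl's law formula, which the next paragraph appeals to. The function name `amdahlSpeedup` and the 5% "cold helper" are illustrative assumptions, not measurements from the profile; only the 0.749 figure comes from the output above.

```javascript
// Best-case overall speedup when a fraction `p` of the runtime is made
// `s` times faster and everything else is left untouched (Amdahl's law).
function amdahlSpeedup(p, s) {
  if (p < 0 || p > 1) throw new RangeError("p must be between 0 and 1");
  if (s <= 0) throw new RangeError("s must be positive");
  return 1 / ((1 - p) + p / s);
}

// Share of ticks attributed to the hot native function in the profile above.
const hotShare = 0.749;

// Halving the time spent in the hot spot...
console.log(amdahlSpeedup(hotShare, 2).toFixed(2)); // "1.60"
// ...versus halving a hypothetical helper that only accounts for 5% of the runtime.
console.log(amdahlSpeedup(0.05, 2).toFixed(2)); // "1.03"
// Upper bound if the hot spot could be made free.
console.log(amdahlSpeedup(hotShare, Infinity).toFixed(2)); // "3.98"
```

Under these assumptions, the same local 2x win is worth roughly 1.6x overall in the hot spot but only about 1.03x in the cold helper, which is why the bottleneck deserves attention first.
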
The [Amdahl's Law] tells us that we are more likely to gain a performance boost by optimizing the most time-consuming part than by randomly optimizing any seemingly optimizable code. To reduce the total execution time of `__ZN2v88internal12_/* tl;dr */`, we could

1) reduce the number of `__ZN2v88internal12_/* tl;dr */` calls
2) reduce the execution time of a single `__ZN2v88internal12_/* tl;dr */` call

It turns out `__ZN2v88internal12_/* tl;dr */` does not appear in Babel's source code; it comes from V8's native API [`DefineDataProperty`] and its inlined subroutine calls. So we can do nothing about option 2), as it is controlled by the JavaScript engine. From now on it suffices to know that this API will be called whenever we assign a data property to a JavaScript object, i.e. `foo.bar = something`, so we can try to reduce the number of `.bar = something` assignments in the JavaScript sources.

Object properties are used almost everywhere in the source, so we still need to narrow down the bottleneck. That is where lines 4:6 and lines 7:9 help us.

```
     37   15.2%        LazyCompile: ~get /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:99:13
     37  100.0%          LazyCompile: ~create /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:49:9
     37  100.0%            LazyCompile: ~visitMultiple /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:73:16
```

Lines 4:6 tell us that the `LazyCompile: ~get` routine accounts for 15.2% of its parent entry's ticks. The `LazyCompile:` prefix marks compiled JavaScript source. The `~` prefix indicates that this function is interpreted by V8's [Ignition interpreter] but _not_ yet optimized. We say code is `optimized` when it is compiled by the [Turbofan compiler] into machine code, a sequence of CPU instructions. You may be tempted to investigate why `get` is not `optimized`, but in most cases a function stays unoptimized simply because the compiler considers it too _early_ to optimize.
```
     25   10.3%        LazyCompile: *visitMultiple /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:73:16
     25  100.0%          LazyCompile: *traverse.node /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/index.js:87:26
     25  100.0%            LazyCompile: *visitQueue /Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/context.js:96:13
```

Lines 7:9 report that another routine, `LazyCompile: *visitMultiple`, accounts for 10.3% of the parent entry's ticks. Here the `*` prefix indicates that this function is _compiled_ by the [Turbofan compiler]. You may wonder why V8 generates both an unoptimized `~visitMultiple` (line 6) and an optimized `*visitMultiple`. The reason is that `babel-traverse` is a depth-first tree traversal implemented by recursion.

```
traverse.node => context.visit => visit
                                  |   => visitMultiple/visitSingle
                                  |      => create node, visitQueue
                                  |                          |
                                  |--------------------------|
```

V8, or more specifically Turbofan, will optimize different sections (i.e. `visit => visitMultiple => create node` or `visitMultiple => create node, visitQueue => visit`) of the execution chain above, inline the sub calls and generate multiple compiled variants of a specific routine.

Now let's take a look at babel-traverse's [`get`] function.

```javascript
static get({ hub, parentPath, parent, container, listKey, key }): NodePath {
  if (!hub && parentPath) {
    hub = parentPath.hub;
  }

  if (!parent) {
    throw new Error("To get a node path the parent needs to exist");
  }

  const targetNode = container[key];

  const paths = pathCache.get(parent) || [];
  if (!pathCache.has(parent)) {
    pathCache.set(parent, paths);
  }

  let path;

  for (let i = 0; i < paths.length; i++) {
    const pathCheck = paths[i];
    if (pathCheck.node === targetNode) {
      path = pathCheck;
      break;
    }
  }

  if (!path) {
    path = new NodePath(hub, parent);
    paths.push(path);
  }

  path.setup(parentPath, container, listKey, key);

  return path;
}
```

One can verify that the code above contains no object property assignment.
However, it is not sufficient to examine the code of `get` only. As V8 will try to inline all the subroutine calls, we should also examine every nested function call _inside_ `get`. To know which parts of the inlined code are on the critical path, we can turn to code coverage tools.

## Code coverage

We use [istanbul] to reveal how often each statement is executed and find the hot path. A simple way to do this is to run

```shell
jest --coverage transform.test.js
```

on an artificial test file `transform.test.js`

```javascript
describe("test", () => {
  it("should work", () => {
    const babel = require("/path/to/local/babel-core");
    // Read the source as a string; without an encoding `readFileSync` returns a Buffer.
    const code = require("fs").readFileSync("big-javascript-source.js", "utf8");
    babel.transform(code, { filename: "example.js" });
  });
});
```

The generated `lcov` report of the `get` function reveals that both `new NodePath` and `path.setup` are executed frequently.

![](https://i.imgur.com/mggwCSt.png)

We can then check `new NodePath`. The constructor is a sequence of object property assignments. Let's see if we can reduce the number of assignments here.

![](https://i.imgur.com/nIhNF9A.png)

## Compress boolean flags

There are 3 boolean flags in the `NodePath` class.

```javascript
this.shouldSkip = false;
this.shouldStop = false;
this.removed = false;
```

A classic performance improvement we may consider is to use a [Bit Array]. But since JavaScript is a high-level language, we are uncertain

1) whether the compiler already optimizes boolean properties for us
2) whether a bit array leads to a performance gain once compiled

Let's dive into the compiled instructions to find the answers. On Node.js, one can generate the compiled code using the following V8 flags.

```
node --predictable \
  --print-opt-code --code-comments \
  --redirect-code-traces --redirect-code-traces-to=./transform.asm \
  ./transform.js
```

`transform.asm` is a very large, code-like file that is likely to choke most modern code editors, so we open it in `vim`'s read-only mode.
```shell
vim -R transform.asm
```

We can navigate to the compiled instructions of `new NodePath`. The first part allocates memory for the `NodePath` instance. One can see that the special `undefined` value is used to fill the memory slots.

```asm
0x1fafdf31d339 699 4983c301 REX.W addq r11,0x1
0x1fafdf31d33d 69d 49bf990fac56a8090000 REX.W movq r15,0x9a856ac0f99 ;; object: 0x09a856ac0f99 <Map(HOLEY_ELEMENTS)>
0x1fafdf31d347 6a7 4d897bff REX.W movq [r11-0x1],r15
0x1fafdf31d34b 6ab 4d8bbda0000000 REX.W movq r15,[r13+0xa0] (root (empty_fixed_array))
0x1fafdf31d352 6b2 4d897b07 REX.W movq [r11+0x7],r15
0x1fafdf31d356 6b6 4d897b0f REX.W movq [r11+0xf],r15
0x1fafdf31d35a 6ba 498b5dd8 REX.W movq rbx,[r13-0x28] (root (undefined_value))
0x1fafdf31d35e 6be 49895b17 REX.W movq [r11+0x17],rbx
0x1fafdf31d362 6c2 49895b1f REX.W movq [r11+0x1f],rbx
0x1fafdf31d366 6c6 49895b27 REX.W movq [r11+0x27],rbx
0x1fafdf31d36a 6ca 49895b2f REX.W movq [r11+0x2f],rbx
0x1fafdf31d36e 6ce 49895b37 REX.W movq [r11+0x37],rbx
0x1fafdf31d372 6d2 49895b3f REX.W movq [r11+0x3f],rbx
0x1fafdf31d376 6d6 49895b47 REX.W movq [r11+0x47],rbx
0x1fafdf31d37a 6da 49895b4f REX.W movq [r11+0x4f],rbx
0x1fafdf31d37e 6de 49895b57 REX.W movq [r11+0x57],rbx
0x1fafdf31d382 6e2 49895b5f REX.W movq [r11+0x5f],rbx
0x1fafdf31d386 6e6 49895b67 REX.W movq [r11+0x67],rbx
0x1fafdf31d38a 6ea 49895b6f REX.W movq [r11+0x6f],rbx
0x1fafdf31d38e 6ee 49895b77 REX.W movq [r11+0x77],rbx
0x1fafdf31d392 6f2 49895b7f REX.W movq [r11+0x7f],rbx
0x1fafdf31d396 6f6 49899b87000000 REX.W movq [r11+0x87],rbx
0x1fafdf31d39d 6fd 49899b8f000000 REX.W movq [r11+0x8f],rbx
0x1fafdf31d3a4 704 49899b97000000 REX.W movq [r11+0x97],rbx
0x1fafdf31d3ab 70b 49899b9f000000 REX.W movq [r11+0x9f],rbx
0x1fafdf31d3b2 712 49899ba7000000 REX.W movq [r11+0xa7],rbx
0x1fafdf31d3b9 719 49899baf000000 REX.W movq [r11+0xaf],rbx
0x1fafdf31d3c0 720 49899bb7000000 REX.W movq [r11+0xb7],rbx
```

The second part is interesting.
For the sake of convenience I have added a comment to each instruction.

```asm
-- </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:80:21> inlined at </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:134:14> --
;; Read JavaScript `false` value into register `r12`
0x1fafdf31d523 883 4d8b65f8 REX.W movq r12,[r13-0x8] (root (false_value))
;; Write contents of `r12` to `this.shouldSkip`
0x1fafdf31d527 887 4d896337 REX.W movq [r11+0x37],r12
-- </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:81:21> inlined at </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:134:14> --
;; Write contents of `r12` to `this.shouldStop`
0x1fafdf31d52b 88b 4d89633f REX.W movq [r11+0x3f],r12
-- </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:82:18> inlined at </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:134:14> --
;; Write contents of `r12` to `this.removed`
0x1fafdf31d52f 88f 4d896347 REX.W movq [r11+0x47],r12
```

With 64-bit (8-byte) fields, the three slots from `[r11+0x37]` to `[r11+0x47]` occupy 24 bytes. Apparently the JavaScript engine does not pack a sequence of boolean properties into a bit array, which would occupy no more than 8 bytes. By reading the instructions we also know that if we replace these 3 booleans with a bit array, we save 16 bytes per `NodePath`, at least 45961 * 16 bytes = 718 KiB of memory in total.

## Lazy initialization

Another improvement concerns the initialization of `this.data`

```javascript
this.data = Object.create(null);
```

It is a one-liner and you may doubt whether it is much slower than

```javascript
this.data = null
```

until we read the compiled instructions.
```asm
;; Object.create(null)
-- </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:79:24> inlined at </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:134:14> --
0x1fafdf31d440 7a0 49bcb109dc9ea8090000 REX.W movq r12,0x9a89edc09b1 ;; object: 0x09a89edc09b1 <JSFunction Object (sfi = 0x9a8a0447b49)>
0x1fafdf31d44a 7aa 4d8b642407 REX.W movq r12,[r12+0x7]
0x1fafdf31d44f 7af 48bf0114dc9ea8090000 REX.W movq rdi,0x9a89edc1401 ;; object: 0x09a89edc1401 <JSFunction create (sfi = 0x9a8a0448169)>
0x1fafdf31d459 7b9 49397c244f REX.W cmpq [r12+0x4f],rdi
0x1fafdf31d45e 7be 0f85f00a0000 jnz 0x1fafdf31df54 <+0x12b4>
0x1fafdf31d464 7c4 4c8da698000000 REX.W leaq r12,[rsi+0x98]
0x1fafdf31d46b 7cb 4d89a520eca400 REX.W movq [r13+0xa4ec20],r12
0x1fafdf31d472 7d2 488d5601 REX.W leaq rdx,[rsi+0x1]
0x1fafdf31d476 7d6 498b8d90010000 REX.W movq rcx,[r13+0x190] (root (name_dictionary_map))
0x1fafdf31d47d 7dd 48894aff REX.W movq [rdx-0x1],rcx
0x1fafdf31d481 7e1 48b90000000011000000 REX.W movq rcx,0x1100000000
0x1fafdf31d48b 7eb 48894a07 REX.W movq [rdx+0x7],rcx
0x1fafdf31d48f 7ef 48c7420f00000000 REX.W movq [rdx+0xf],0x0
0x1fafdf31d497 7f7 48c7421700000000 REX.W movq [rdx+0x17],0x0
0x1fafdf31d49f 7ff 488b0dcff8ffff REX.W movq rcx,[rip+0xfffff8cf]
0x1fafdf31d4a6 806 48894a1f REX.W movq [rdx+0x1f],rcx
0x1fafdf31d4aa 80a 48b90000000001000000 REX.W movq rcx,0x100000000
0x1fafdf31d4b4 814 48894a27 REX.W movq [rdx+0x27],rcx
0x1fafdf31d4b8 818 48c7422f00000000 REX.W movq [rdx+0x2f],0x0
0x1fafdf31d4c0 820 48895a37 REX.W movq [rdx+0x37],rbx
0x1fafdf31d4c4 824 48895a3f REX.W movq [rdx+0x3f],rbx
0x1fafdf31d4c8 828 48895a47 REX.W movq [rdx+0x47],rbx
0x1fafdf31d4cc 82c 48895a4f REX.W movq [rdx+0x4f],rbx
0x1fafdf31d4d0 830 48895a57 REX.W movq [rdx+0x57],rbx
0x1fafdf31d4d4 834 48895a5f REX.W movq [rdx+0x5f],rbx
0x1fafdf31d4d8 838 48895a67 REX.W movq [rdx+0x67],rbx
0x1fafdf31d4dc 83c 48895a6f REX.W movq [rdx+0x6f],rbx
0x1fafdf31d4e0 840 48895a77 REX.W movq [rdx+0x77],rbx
0x1fafdf31d4e4 844 48895a7f REX.W movq [rdx+0x7f],rbx
0x1fafdf31d4e8 848 48899a87000000 REX.W movq [rdx+0x87],rbx
0x1fafdf31d4ef 84f 48899a8f000000 REX.W movq [rdx+0x8f],rbx
0x1fafdf31d4f6 856 498d5c2418 REX.W leaq rbx,[r12+0x18]
0x1fafdf31d4fb 85b 49899d20eca400 REX.W movq [r13+0xa4ec20],rbx
0x1fafdf31d502 862 4983c401 REX.W addq r12,0x1
0x1fafdf31d506 866 48bb792ad4eaa8090000 REX.W movq rbx,0x9a8ead42a79 ;; object: 0x09a8ead42a79 <Map(HOLEY_ELEMENTS)>
0x1fafdf31d510 870 49895c24ff REX.W movq [r12-0x1],rbx
0x1fafdf31d515 875 4989542407 REX.W movq [r12+0x7],rdx
0x1fafdf31d51a 87a 4d897c240f REX.W movq [r12+0xf],r15
;; assign to this.data
-- </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:79:15> inlined at </Users/jh/code/parser_performance/node_modules/@babel/traverse/lib/path/index.js:134:14> --
0x1fafdf31d51f 87f 4d89632f REX.W movq [r11+0x2f],r12
```

Under the hood `Object.create(null)` compiles to dozens of instructions, most of them `movq`. It is therefore bound to be an order of magnitude slower than `this.data = null`, which involves only 2 `movq` operations.

```asm
REX.W movq r15,[r13-0x18] (root (null_value))
REX.W movq [r11+0x2f],r15
```

In `@babel/traverse`, `this.data` provides customized `data` storage on the `NodePath` class. Babel core and the official plugins do not use this storage, which means that in practice it is highly likely to hold nothing. Therefore we can set `this.data` to `null` and lazily initialize `this.data` only when it is requested:

```javascript
function getData() {
  // Allocate the storage on first use.
  if (this.data == null) {
    this.data = Object.create(null);
  }
}
```

The improvements described above make Babel's transform 10% faster with a 20% lower memory footprint. You can check out [this PR](https://github.com/babel/babel/pull/10480) if you are interested.

## Take away

1. Profiling and code coverage tools can help you identify the bottleneck
2. Read the compiled instructions to know the cost of a specific JavaScript statement
3. Optimizing the memory layout of a frequently created object can lead to both a speed improvement and a smaller memory footprint

[Amdahl's Law]: https://en.wikipedia.org/wiki/Amdahl%27s_law
[speculative optimization]: https://ponyfoo.com/articles/an-introduction-to-speculative-optimization-in-v8
[`DefineDataProperty`]: https://github.com/v8/v8/blob/26fd582d854ec378bda1168cdc660651b368ae0c/src/api/api-natives.cc#L105
[`get`]: https://github.com/babel/babel/blob/b7333ea97ae50a2b5f2aa747c485579b28082f26/packages/babel-traverse/src/path/index.js#L73
[Bit Array]: https://en.wikipedia.org/wiki/Bit_array
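
To make the third takeaway concrete, here is a minimal sketch that combines the two ideas discussed above: packing the three boolean flags into one integer field and allocating `data` lazily. The class `PathSketch`, the mask constants, the `flags` field and the `ensureData` method are all hypothetical names chosen for illustration; this is not Babel's `NodePath` and not necessarily how the linked PR implements it.

```javascript
// Hypothetical bit masks, one bit per former boolean property.
const SHOULD_SKIP = 1 << 0;
const SHOULD_STOP = 1 << 1;
const REMOVED = 1 << 2;

// `PathSketch` is an illustrative stand-in, not Babel's `NodePath`.
class PathSketch {
  constructor() {
    this.flags = 0; // one field instead of three; all flags start out false
    this.data = null; // storage is allocated on first use
  }

  // Set or clear a single bit in the packed field.
  setFlag(mask, value) {
    this.flags = value ? this.flags | mask : this.flags & ~mask;
  }

  // Accessors keep the original property names readable and writable.
  get shouldSkip() { return (this.flags & SHOULD_SKIP) !== 0; }
  set shouldSkip(value) { this.setFlag(SHOULD_SKIP, value); }

  get shouldStop() { return (this.flags & SHOULD_STOP) !== 0; }
  set shouldStop(value) { this.setFlag(SHOULD_STOP, value); }

  get removed() { return (this.flags & REMOVED) !== 0; }
  set removed(value) { this.setFlag(REMOVED, value); }

  // Hypothetical accessor that creates the storage only when it is requested.
  ensureData() {
    if (this.data == null) {
      this.data = Object.create(null);
    }
    return this.data;
  }
}

const path = new PathSketch();
path.shouldSkip = true;
path.removed = true;
console.log(path.flags.toString(2)); // "101"
console.log(path.shouldStop); // false
console.log(path.data); // null until ensureData() is called
```

Whether the accessors are as cheap as plain field reads once optimized is something to confirm with the same `--print-opt-code` inspection used above rather than assume.
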