---
robots: noindex, nofollow
tags: pitch, build
---

<!-- For instructions on shaping a project see here: [Shaping a Project](/kX02SXVbS6KzMOQd56i6Cg) -->

# CI stability and perf

## Problem

<!--
*From: [Problem guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-1-problem)*

*The best problem definition consists of a single specific story that shows why the status quo doesn’t work.*
-->

CI builds are slow and some steps are unreliable.

(On all the graphs below, it's best to zoom out to the 1 month or 180 day view, since the last two weeks of August had many extra failures due to issues we've already resolved.)

[Main PR/CI build times](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/duration?definitionId=84&contextType=build): avg 30-50 minutes (hard to tell for sure due to outliers and the combination of PR and CI builds). The graphs show a long time is spent waiting on Screener (v0 and v7), but otherwise they aren't very informative.

[Main PR/CI build failure reasons](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/stageawareoutcome?definitionId=84&contextType=build): In the 180 day view, roughly 1/4 to 1/3 of failures are in "run FUI VR Test" (other steps had reliability issues recently, but those should mostly be resolved).

[Size auditor build times](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/duration?definitionId=115&contextType=build): avg 50 minutes. The primary issue is that this build must run on Windows, the default Windows VMs are slow, and we don't have custom fast Windows VMs.

[Perf test build times](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/duration?definitionId=146&contextType=build): avg 38 minutes. v7 perf test times appear to have doubled in the last week? Running separate jobs for v0 and v7 might help.
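One way to shape the "separate jobs" idea: two sibling jobs in the perf-test pipeline instead of one. This is an illustrative Azure Pipelines sketch only; the job, pool, and script names are assumptions, not taken from the actual pipeline definition:

```yaml
# Illustrative sketch only: job, pool, and script names are assumptions,
# not the real perf-test pipeline definition.
jobs:
  - job: perf_test_v0
    displayName: perf-test (v0)
    pool: self-hosted-pool          # hypothetical agent pool name
    steps:
      - script: yarn install --frozen-lockfile
        displayName: Install
      - script: yarn perf:test:v0   # hypothetical script name
        displayName: Run v0 perf tests

  # With no dependsOn, this job runs in parallel with perf_test_v0
  # whenever enough agents are available.
  - job: perf_test_v7
    displayName: perf-test (v7)
    pool: self-hosted-pool
    steps:
      - script: yarn install --frozen-lockfile
        displayName: Install
      - script: yarn perf:test:v7   # hypothetical script name
        displayName: Run v7 perf tests
```

Since neither job declares `dependsOn`, the two run concurrently, so wall-clock time tracks the slower of the two suites instead of their sum (at the cost of tying up a second agent and duplicating the install step).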
[Release build times](https://dev.azure.com/uifabric/UI%20Fabric/_pipeline/analytics/duration?definitionId=104&contextType=build) were averaging 40-50 minutes but are down to 25-35 after scoping down the build.

## Appetite

<!--
*From: [Appetite guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-2-appetite)*

*Think of this as another part of the problem definition. We want to solve this problem, but we also need to do it in a way that will leave time to solve other problems. Here we depart from Shape Up, and allow for timeframes or appetites of 1-6 weeks.*
-->

?

## Solution

Ideas for improvements (Shift: feel free to add or remove things):

### E2E & Unit tests in v0

- ✅ E2E tests are not stable and fail randomly for no apparent reason
- ✅ Some unit tests in v0 are flaky

### Screener

- ✅ Update v0 Screener to use Shift/Miro's GitHub App to send status with webhooks
  - ✅ The GitHub App is in use; almost two weeks without issues
- ✅ v0 Screener reliability improvements
  - ✅ Investigate issue with flaky Popper tests (ChatExampleScrollable, ToolbarVariables)
  - ✅ There was one other flaky test, and it was fixed
- Investigate getting rid of `screener-ngrok` (in v0 and v7) by uploading the storybook site to Azure storage and running against that
  - OOUI did something similar we can partly copy
  - ✅ Challenge: how to ensure the PR status updates when the build is done; maybe Shift/Miro's Screener webhook app will help (*Shift: it will help, as Screener now reports the build state*)
  - Eliminating ngrok plus sending status with webhooks might allow us to "fire and forget" Screener builds instead of having 2 VMs tied up for ~20 minutes waiting

### Build

- Faster release builds?
- `yarn install` takes almost 3 minutes
  - Investigate removing deps with postinstall scripts (in CI or always):
    - Remove `screener-ngrok`
    - Disable puppeteer's chromium download in CI when not needed
    - Electron
    - Others?
- Do not build typings/typecheck twice for v7 packages
  - https://github.com/microsoft/fluentui/issues/14449
- Other?

### Lage

- Use lage to only run the appropriate Screener (v0/v7) for each PR
  - How will required checks work in this case?
  - Some work may be needed to split v0 Screener stories and the runner out into another package
- Enable lage build caching? (ask Ken for details; currently disabled due to security concerns about exposing the Azure storage token)
- VM improvements
  - Could add even more VMs to the self-hosted agent pool (currently 20) or increase specs
  - VMSS (the VM scale set solution JD was working on) was disabled because of a weird `screener-ngrok` postinstall failure and possibly other reliability issues. If we can fix this (or get rid of `screener-ngrok`), re-enabling VMSS could help with perf.
- Activate lage scoping for v0 packages?
- Broken artifacts (sometimes there is no *Deployed sites for PR*)

### `perf-test`

- Run separate jobs for v0 and v7 builds?
- Investigate why v7 build times doubled (if the trend continues and it's not just a reporting anomaly from a week of build issues)
- Size auditor: if we decide this build time is a problem, we might have to look into making faster custom Windows agent VMs

[Build wish list](/mjciSB_aTqGUU_ox3o7eCQ): for improvements that might be out of scope for now or need more investigation

#### ✅ Already done

- Add more VMs to the self-hosted agent pool to reduce queue time (speeds up all builds)
- Faster release builds: scope build/test/bundle to only beachball-published packages. Reduced time from 40-50 minutes to 25-35.
- Faster install: Remove `ngrok` (but not `screener-ngrok` yet) and instruct people to use a global install for the one local test script where it's used

### Risks (Rabbit holes)

<!--
*From: [Rabbit hole guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-4-rabbit-holes)*

*Another key aspect of shaping is de-risking. This involves identifying potential issues and complications in the solution.
These may be non-obvious cases where the solution doesn't work. These could be constraints from other parts of the system (dependencies or dependent code). This aspect of shaping is what typically requires the most experience and understanding of the domain. This is likely something we will all collectively get better at with practice.*
-->

### Out of scope (No-gos)

<!--
*From: [No-gos guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-5-no-gos)*

*A key way to deal with complicated risks or issues with the solution is to decide that a particular functionality is out of scope. If reducing the scope to remove a risk or issue does not prevent us from fulfilling the original problem, then it is fine. Reducing scope may require reaching out to the original customer that had the problem we are solving, or working with a representative of the customer on our team.*
-->
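As a reference for the webhook-based status reporting idea in the Screener section: when a Screener build finishes, an app (like Shift/Miro's GitHub App) can post the result to the PR's head commit via GitHub's commit status API, so no CI agent has to wait around for it. This is a hedged TypeScript sketch; the `ScreenerResult` shape, function names, and the status context string are illustrative assumptions, not the actual app's code.

```typescript
// Sketch of webhook-driven Screener status reporting. The payload shape,
// names, and "Screener v0" context string are assumptions for illustration.

interface ScreenerResult {
  sha: string;      // head commit of the PR
  passed: boolean;
  buildUrl: string; // link back to the Screener build
}

// Pure helper: translate a Screener result into a GitHub commit status payload.
function buildStatusPayload(result: ScreenerResult) {
  return {
    state: result.passed ? "success" : "failure",
    target_url: result.buildUrl,
    description: result.passed ? "Screener passed" : "Screener failed",
    context: "Screener v0", // hypothetical status-check name
  };
}

// POST the status to GitHub; requires a token authorized for the repo.
async function reportScreenerStatus(
  result: ScreenerResult,
  token: string,
): Promise<void> {
  const res = await fetch(
    `https://api.github.com/repos/microsoft/fluentui/statuses/${result.sha}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/vnd.github+json",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(buildStatusPayload(result)),
    },
  );
  if (!res.ok) {
    throw new Error(`Failed to set commit status: ${res.status}`);
  }
}
```

With the status delivered this way, the CI job that launches Screener can exit immediately ("fire and forget") instead of holding two VMs while polling for the result.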