---
robots: noindex, nofollow
tags: pitch, build
---

<!-- For instructions on shaping a project see here: [Shaping a Project](/kX02SXVbS6KzMOQd56i6Cg) -->

# CI stability and perf

## Problem

<!--
*From: [Problem guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-1-problem)*

*The best problem definition consists of a single specific story that shows why the status quo doesn’t work.*
-->

CI builds are slow and some steps are unreliable.

(On all the graphs below, it's best to zoom out to the 1 month or 180 day view, since the last two weeks of August had many extra failures due to issues we've already resolved.)

[Main PR/CI build times](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/duration?definitionId=84&contextType=build): avg 30-50 minutes (hard to tell for sure due to outliers and the combination of PR and CI builds). The graphs show a long time is spent waiting on Screener (v0 and v7), but otherwise they aren't very informative.

[Main PR/CI build failure reasons](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/stageawareoutcome?definitionId=84&contextType=build): In the 180 day view, roughly 1/4 to 1/3 of failures are in "run FUI VR Test" (other steps had reliability issues recently, but those should mostly be resolved).

[Size auditor build times](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/duration?definitionId=115&contextType=build): avg 50 minutes. The primary issue is that this build must run on Windows, the default Windows VMs are slow, and we don't have custom fast Windows VMs.

[Perf test build times](https://dev.azure.com/uifabric/fabricpublic/_pipeline/analytics/duration?definitionId=146&contextType=build): avg 38 minutes. v7 perf test times appear to have doubled in the last week? Running separate jobs for v0 and v7 might help.
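One way to shape the "separate jobs" idea: two sibling jobs in the perf-test pipeline instead of one. This is an illustrative Azure Pipelines sketch only; the job, pool, and script names are assumptions, not taken from the actual pipeline definition:

```yaml
# Illustrative sketch only: job, pool, and script names are assumptions,
# not the real perf-test pipeline definition.
jobs:
  - job: perf_test_v0
    displayName: perf-test (v0)
    pool: self-hosted-pool          # hypothetical agent pool name
    steps:
      - script: yarn install --frozen-lockfile
        displayName: Install
      - script: yarn perf:test:v0   # hypothetical script name
        displayName: Run v0 perf tests

  # With no dependsOn, this job runs in parallel with perf_test_v0
  # whenever enough agents are available.
  - job: perf_test_v7
    displayName: perf-test (v7)
    pool: self-hosted-pool
    steps:
      - script: yarn install --frozen-lockfile
        displayName: Install
      - script: yarn perf:test:v7   # hypothetical script name
        displayName: Run v7 perf tests
```

Since neither job declares `dependsOn`, the two run concurrently, so wall-clock time tracks the slower of the two suites instead of their sum (at the cost of tying up a second agent and duplicating the install step).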
[Release build times](https://dev.azure.com/uifabric/UI%20Fabric/_pipeline/analytics/duration?definitionId=104&contextType=build) were averaging 40-50 minutes but are down to 25-35 after scoping down the build.

## Appetite

<!--
*From: [Appetite guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-2-appetite)*

*Think of this as another part of the problem definition. We want to solve this problem, but we also need to do it in a way that will leave time to solve other problems. Here we depart from Shape Up, and allow for timeframes or appetites of 1-6 weeks.*
-->

?

## Solution

Ideas for improvements (Shift: feel free to add or remove things):

### E2E & Unit tests in v0

- ✅ E2E tests are not stable and fail randomly for no apparent reason
- ✅ Some unit tests in v0 are flaky

### Screener

- ✅ Update v0 Screener to use Shift/Miro's GitHub App to send status with webhooks
  - ✅ The GitHub App is in use; almost two weeks without issues
- ✅ v0 Screener reliability improvements
  - ✅ Investigate issue with flaky Popper tests (ChatExampleScrollable, ToolbarVariables)
  - ✅ There was one other flaky test, and it was fixed
- Investigate getting rid of `screener-ngrok` (in v0 and v7) by uploading the storybook site to Azure storage and running against that
  - OOUI did something similar we can partly copy
  - ✅ Challenge: how to ensure the PR status updates when the build is done; maybe Shift/Miro's Screener webhook app will help (*Shift: it will help, as Screener now reports the build state*)
  - Eliminating ngrok plus sending status with webhooks might allow us to "fire and forget" Screener builds instead of having 2 VMs tied up for ~20 minutes waiting

### Build

- Faster release builds?
- `yarn install` takes almost 3 minutes
  - Investigate removing deps with postinstall scripts (in CI or always):
    - Remove `screener-ngrok`
    - Disable puppeteer's chromium download in CI when not needed
    - Electron
    - Others?
- Do not build typings/typecheck twice for v7 packages
  - https://github.com/microsoft/fluentui/issues/14449
- Other?

### Lage

- Use lage to only run the appropriate Screener (v0/v7) for each PR
  - How will required checks work in this case?
  - Some work may be needed to split v0 Screener stories and the runner out into another package
- Enable lage build caching? (ask Ken for details; currently disabled due to security concerns about exposing the Azure storage token)
- VM improvements
  - Could add even more VMs to the self-hosted agent pool (currently 20) or increase specs
  - VMSS (the VM scale set solution JD was working on) was disabled because of a weird `screener-ngrok` postinstall failure and possibly other reliability issues. If we can fix this (or get rid of `screener-ngrok`), re-enabling VMSS could help with perf.
- Activate lage scoping for v0 packages?
- Broken artifacts (sometimes there is no *Deployed sites for PR*)

### `perf-test`

- Run separate jobs for v0 and v7 builds?
- Investigate why v7 build times doubled (if the trend continues and it's not just a reporting anomaly from a week of build issues)
- Size auditor: if we decide this build time is a problem, we might have to look into making faster custom Windows agent VMs

[Build wish list](/mjciSB_aTqGUU_ox3o7eCQ): for improvements that might be out of scope for now or need more investigation

#### ✅ Already done

- Add more VMs to the self-hosted agent pool to reduce queue time (speeds up all builds)
- Faster release builds: scope build/test/bundle to only beachball-published packages. Reduced time from 40-50 minutes to 25-35.
- Faster install: Remove `ngrok` (but not `screener-ngrok` yet) and instruct people to use a global install for the one local test script where it's used

### Risks (Rabbit holes)

<!--
*From: [Rabbit hole guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-4-rabbit-holes)*

*Another key aspect of shaping is de-risking. This involves identifying potential issues and complications in the solution.
These may be non-obvious cases where the solution doesn't work. These could be constraints from other parts of the system (dependencies or dependent code). This aspect of shaping is what typically requires the most experience and understanding of the domain. This is likely something we will all collectively get better at with practice.*
-->

### Out of scope (No-gos)

<!--
*From: [No-gos guidance](https://basecamp.com/shapeup/1.5-chapter-06#ingredient-5-no-gos)*

*A key way to deal with complicated risks or issues with the solution is to decide that a particular functionality is out of scope. If reducing the scope to remove a risk or issue does not prevent us from fulfilling the original problem, then it is fine. Reducing scope may require reaching out to the original customer that had the problem we are solving, or working with a representative of the customer on our team.*
-->
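As a reference for the webhook-based status reporting idea in the Screener section: when a Screener build finishes, an app (like Shift/Miro's GitHub App) can post the result to the PR's head commit via GitHub's commit status API, so no CI agent has to wait around for it. This is a hedged TypeScript sketch; the `ScreenerResult` shape, function names, and the status context string are illustrative assumptions, not the actual app's code.

```typescript
// Sketch of webhook-driven Screener status reporting. The payload shape,
// names, and "Screener v0" context string are assumptions for illustration.

interface ScreenerResult {
  sha: string;      // head commit of the PR
  passed: boolean;
  buildUrl: string; // link back to the Screener build
}

// Pure helper: translate a Screener result into a GitHub commit status payload.
function buildStatusPayload(result: ScreenerResult) {
  return {
    state: result.passed ? "success" : "failure",
    target_url: result.buildUrl,
    description: result.passed ? "Screener passed" : "Screener failed",
    context: "Screener v0", // hypothetical status-check name
  };
}

// POST the status to GitHub; requires a token authorized for the repo.
async function reportScreenerStatus(
  result: ScreenerResult,
  token: string,
): Promise<void> {
  const res = await fetch(
    `https://api.github.com/repos/microsoft/fluentui/statuses/${result.sha}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/vnd.github+json",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(buildStatusPayload(result)),
    },
  );
  if (!res.ok) {
    throw new Error(`Failed to set commit status: ${res.status}`);
  }
}
```

With the status delivered this way, the CI job that launches Screener can exit immediately ("fire and forget") instead of holding two VMs while polling for the result.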