Grayscale Testing For Serverless Platform

Problems to solve

The grayscale feature lets Serverless Platform move users to a new version (the canary version) gradually. This confines unexpected exceptions in production to a limited group of users, and it makes rollback easy when needed.

Testing in the dev environment is not enough to cover some real use cases that involve actual user data in production.

Example: service unavailability for a sizable share (30%) of users during changes. During an earlier data migration, a rule that the name field in the YAML configuration cannot take the form xx.xx caused service unavailability that affected 194 users (of 2,590 MAU and 9,760 total users). There have been 5 similar cases (system unavailable) in the past half year.

Phase 1 Quick Implementation

Here are the points we agreed on in previous meetings. The following design is based on these agreements.

  • Publish two versions of the Serverless Cloud Functions: one canary and one stable.
  • Use appId to identify the users who take part in grayscale tests.
  • In Phase 1, there is no canary CLI, and users are not aware of grayscale tests.
  • In Phase 1, Serverless Components templates are not included in grayscale tests (a tag such as @2.0.1_rc2 will probably be used in the future).
  • The design is based on the current release approach: each release ships a new version of Serverless Platform, which includes the Engine, Registry, and Events services.


Tencent SCF grayscale fundamental

Tencent SCF supports grayscale testing, and the proposed solution builds on that existing capability. Each new release of an SCF function deploys a new version of the function code in SCF. The grayscale feature relies on the alias configuration in SCF and on the traffic logic set up in the proxy that sits in front of SCF.

For Serverless Platform, suppose versions 5 and 6 of the Engine code exist on the Engine SCF. After a new release of Serverless Platform, there will be a new version 7 of the Engine code on the Engine SCF. SCF automatically takes the newly released version 7 as the canary version and version 6 as the stable version. Based on the percentage of traffic to the canary version set through the bot, a portion of users' requests is passed to the canary version.

  • If we roll back version 7 and there is no newer release, only the stable version 6 remains and all requests go to it.
  • If we roll back version 7 and later release a new version 8, the canary version will be version 8 and the stable version will still be version 6.
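The canary and stable selection rules above can be sketched in a few lines. This is only an illustrative model, not how SCF implements it: the `FunctionVersions` class and its method names are hypothetical, and it assumes the canary is simply the newest released version above stable that has not been rolled back.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FunctionVersions:
    """Hypothetical model of the code versions on one SCF function."""
    stable: int                                          # version serving stable traffic
    released: List[int] = field(default_factory=list)    # versions released after stable
    rolled_back: Set[int] = field(default_factory=set)   # versions withdrawn from traffic

    def release(self, version: int) -> None:
        """Record a newly released version; it becomes the canary candidate."""
        self.released.append(version)

    def rollback(self, version: int) -> None:
        """Withdraw a version so that it no longer receives canary traffic."""
        self.rolled_back.add(version)

    def canary(self) -> Optional[int]:
        """Return the newest live version above stable, or None if there is none."""
        live = [v for v in self.released
                if v > self.stable and v not in self.rolled_back]
        return max(live) if live else None


engine = FunctionVersions(stable=6)
engine.release(7)       # version 7 becomes the canary
engine.rollback(7)      # only stable version 6 remains
engine.release(8)       # version 8 is the new canary, stable is still 6
```

In this model a rollback never changes the stable version, which matches the two cases listed above.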

Proposed Solution

The solution is based on the Tencent SCF grayscale capability. We need to version each Serverless Platform release and map that version info to the SCF internal versions (1, 2, 3).

  • We don't need to create a new environment to do gray-scale tests.
  • We will do "rollback" by changing the traffic to a specific version in SCF.
  • Monitoring and Grafana reports can distinguish between versions.
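As a rough sketch of the version mapping, the lookup below pairs SCF internal versions with platform release tags. The two entries are taken from the sample bot output in the Slack-Bot Commands section; the dictionary and function names are hypothetical, and where the real mapping is stored is not specified here.

```python
from typing import Dict, Optional

# SCF internal version -> Serverless Platform release tag (sample values only).
RELEASE_BY_SCF_VERSION: Dict[int, str] = {
    7: "v2.1.0",
    8: "v2.1.1",
}


def scf_version_for(release_tag: str,
                    mapping: Dict[int, str] = RELEASE_BY_SCF_VERSION) -> Optional[int]:
    """Return the SCF internal version that carries the given release tag."""
    for scf_version, tag in mapping.items():
        if tag == release_tag:
            return scf_version
    return None  # unknown release tag
```

With a mapping like this, a "rollback" to a given platform release reduces to looking up its SCF version and pointing traffic at it.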


Selection of test users

  • The proxy on Tencent performs a modulo operation on each user's appId (taken from the accessToken) and selects a percentage of users as grayscale test users.
  • For unauthenticated users, the proxy randomly selects a percentage of them as grayscale test users.
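A minimal sketch of this selection logic follows. It is not the proxy's actual code. It assumes a modulus of 100 and an inclusive bucket range, which is inferred from the sample bot output in this document (appid range [20 - 55] alongside a non-auth weight of 36%); the function and constant names are hypothetical.

```python
import random
from typing import Optional, Tuple

BUCKETS = 100  # assumption: appId is reduced modulo 100


def route(app_id: Optional[int],
          canary_range: Tuple[int, int],
          rng: random.Random) -> str:
    """Return "canary" or "stable" for one request.

    canary_range is an inclusive (low, high) range of appId buckets.
    app_id is None for unauthenticated requests.
    """
    low, high = canary_range
    if app_id is None:
        # No appId available: pick at random with the same overall weight.
        weight = (high - low + 1) / BUCKETS
        return "canary" if rng.random() < weight else "stable"
    # Authenticated: the same appId always lands in the same bucket.
    return "canary" if low <= app_id % BUCKETS <= high else "stable"


rng = random.Random()
print(route(1250000020, (20, 55), rng))  # bucket 20 falls inside the range
print(route(None, (20, 55), rng))        # random, canary about 36% of the time
```

Because the bucket depends only on appId, an authenticated user stays on the same version for the whole test, while unauthenticated requests are routed independently each time.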

Release Process

  1. Complete feature development, submit a PR to the dev branch, and pass the integration tests in the dev environment.
  2. Code review by the committee and other team members (optional but recommended).
  3. Notify the team and the Tencent team in the serverless-tencent channel 24 hours in advance, and seek product feedback.
  4. Make a PR to the master branch and update the changelog.
  5. Tag the master branch with the version info and trigger the production promotion during the promotion window (2 PM every Wednesday).
  6. Run the integration tests on the newly released version in production. Confirm that all tests pass and notify the serverless-tencent Slack channel.
  7. Enable grayscale testing and check the monitoring status.
     a. If there are no exceptions, increase the share of users on the canary version until all traffic goes to the newest version.
     b. If there are exceptions, roll back to the stable version and start fixing the issue. After the fix, restart the release process.
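Step 7 can be pictured as a simple ramp-up loop. The sketch below is only an illustration: the step percentages are made up, and `set_percent` and `healthy` are hypothetical callbacks standing in for the bot's traffic command and the monitoring check.

```python
from typing import Callable, Sequence


def ramp_canary(set_percent: Callable[[int], None],
                healthy: Callable[[], bool],
                steps: Sequence[int] = (10, 30, 60, 100)) -> bool:
    """Raise canary traffic step by step and roll back on the first failure.

    Returns True if the canary reached the final step, False if it was rolled back.
    """
    for percent in steps:
        set_percent(percent)      # 7a: widen the canary user group
        if not healthy():         # check the monitoring status
            set_percent(0)        # 7b: send all traffic back to stable
            return False
    return True


# Usage with stand-in callbacks that record the requested percentages.
history = []
finished = ramp_canary(history.append, lambda: True)
```

In practice each step would wait long enough for monitoring data to accumulate before moving on; that wait is left out here.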

Slack-Bot Commands

Control release operations with @sls-release-bot in the serverless-tencent channel in Slack. Example commands follow.

  • Init traffic to canary version by default configuration: @sls-release-bot -c -m sls_gray_test
  • Change traffic to canary version by percentage: @sls-release-bot -g -m sls_gray_test -p 30
  • Change traffic to canary version by appId range: @sls-release-bot -g -m sls_gray_test -p 20:55
    Example bot output:
      Executor: @kaiyuzeng
      GrayScale Status: Done
      Stable Version 7(v 2.1.0) : other
      Canary Version 8(v 2.1.1) appid range: [20 - 55]
      Non-Auth GrayScale Weight: 36%
  • Complete gray test: @sls-release-bot -f -m sls_gray_test
  • Rollback all traffic to stable version: @sls-release-bot -r -m sls_gray_test
  • List Traffic: @sls-release-bot -l -m sls_gray_test
  • List Rules: @sls-release-bot -l -m sls_gray_test
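The `-p` option above accepts either a single percentage (`30`) or a range (`20:55`). A small parser for the two forms might look like the sketch below. It is not the bot's implementation: the function name and the validation bounds are assumptions, with the range limited to buckets 0 to 99 on the guess that appId is reduced modulo 100.

```python
from typing import Tuple, Union


def parse_traffic_value(value: str) -> Union[int, Tuple[int, int]]:
    """Parse a -p argument into a percentage or an inclusive (low, high) range."""
    if ":" in value:
        low_text, high_text = value.split(":", 1)
        low, high = int(low_text), int(high_text)
        if not 0 <= low <= high <= 99:
            raise ValueError(f"invalid appId range: {value!r}")
        return (low, high)
    percent = int(value)
    if not 0 <= percent <= 100:
        raise ValueError(f"invalid percentage: {value!r}")
    return percent


print(parse_traffic_value("30"))     # percentage form
print(parse_traffic_value("20:55"))  # range form
```

Rejecting malformed values before they reach the proxy keeps a typo in Slack from shifting production traffic unexpectedly.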

Success Criteria

  • After this feature is deployed, if an unexpected error makes the service unavailable, it affects only a small group of grayscale users. It should not affect users on the stable version (a different version of the code without the error).
  • After this feature is deployed, if an unexpected error makes the service unavailable, we can easily reduce the traffic to that version or route user requests to a previous stable version.
  • After this feature is deployed, when we check the Grafana dashboard or the system log, we can easily filter results and logs by version.

Requirements

  • Each Serverless Platform deployment should have unique version info as a reference. [Serverless Inc]
  • The log system should save logs with the matching version info. [Serverless Inc & Tencent]
  • The monitor should report issues together with version info. [Tencent]
  • The dashboard should show reports per version or combined. [Tencent]
  • It should be easy to change the rate of traffic to Serverless Platform. [Tencent]
  • It should be easy to check the traffic percentage rules for Serverless Platform. [Tencent]
  • Each release should randomly pick different users for grayscale tests. [Tencent]
  • All feature development should be backward compatible. [Serverless Inc]
    • Add integration tests for compatibility.
    • Use argument transformers if needed.

Phase 2 Improvement (TBD & TBC)

Phase 2 will mainly focus on the following problems left open for grayscale testing of Serverless Platform.

  • Grayscale testing of separately released Serverless Platform services (Engine, Registry, Events).
  • Grayscale testing of the Events service (Sockets protocol), switching to the Tencent service.
    • 10.15 Tencent solution.
    • The Events service is used for the Serverless debug feature and to maintain referential integrity between microservices (i.e., to handle situations where an org, app, or service is destroyed or updated).
  • Grayscale testing of the shared layer of Components templates.
  • Support a canary CLI and a stable CLI so that users can choose to join the grayscale tests.