# Uber Cadence as alternative to Camunda for AS team

## Motivation to look for an alternative

Overall development performance is terrible. Probably caused by:

1. it's difficult to develop with Camunda due to the lack of basic tooling
    * debugger
    * break points
    * variable state in the running task
    * syntax highlighting, syntax autocomplete
    * type checking
    * signals
    * input/output of tasks
    * easy reusability of code
2. built-in UI does not offer a history view
3. migrations are not easy
4. problems in runtime
    * database grows like crazy
5. can't put all the logic in Camunda as it's not performant enough
6. drawing instead of coding - devastating to the developer's psyche
7. emerging need for more and more tooling, which consumes energy (e.g. signals, a wrapper for the modelling DSL, etc.)

## How does Uber Cadence fulfill the expectations

1. - [x] workflows are created directly from code with very little abstraction over it - normal Java tooling is applicable
2. - [x] the UI can show the history
3. - [x] there are no migrations in Uber Cadence ;) Workflows must support all the needed versions with simple conditions: `if (workflow.getVersion() == 1) { doSth() } else { doSthElse() }`
4. - [ ] time will show. Cassandra, used as the default database, is known for its performance
5. - [x] the documentation states that Cadence is performant enough to implement *interactive applications*: https://cadenceworkflow.io/docs/02_use_cases/10_interactive
6. - [x] mostly coding again
7. - [x] initial experiments show limited need for additional tooling

## What is Uber Cadence

As per their description it is an "*(...) orchestration engine to execute asynchronous long-running business logic (...)*"

There are clients for Go and **Java**. The one for Python is not ready yet: https://github.com/firdaus/cadence-python

Creating a workflow in Uber Cadence is not so much different from normal coding:

1. You define activities - workflow steps that have side effects. They can be reused between workflows.
2. You define the flow by just calling the activities.

```kotlin=
class QuarantineWorkflowImpl : Quarantine.QuarantineWorkflow {

    private val activityOptions = ActivityOptions.Builder()
            .setTaskList("quarantine")
            .setScheduleToCloseTimeout(Duration.ofSeconds(10))
            .build()

    /**
     * An activity stub implements the activity interface and proxies calls to it to Cadence
     * activity invocations. Because activities are reentrant, only a single stub can be used
     * for multiple activity invocations.
     */
    private val activities = Workflow.newActivityStub(QuarantineActivities::class.java, activityOptions)
    private val activities1 = Workflow.newActivityStub(HttpCallActivity::class.java)

    override fun quarantineHost(jiraKey: String): String {
        val res1 = activities.startProgressOnIssue(jiraKey) // call first activity
        println("In between activities. Fine as long as there are no side effects")
        val res2 = activities1.someHttp("crowjira/v1/issues/" + jiraKey) // call second activity
        return res1 + res2
    }
}
```
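The snippet above only shows the workflow implementation. For reference, below is a sketch of what the corresponding interfaces could look like. The interface and method names are taken from the implementation; the annotation parameters (task list, timeout values) and the nesting of `QuarantineWorkflow` inside a `Quarantine` class are assumptions based on the cadence-java-client hello samples.

```kotlin=
import com.uber.cadence.activity.ActivityMethod
import com.uber.cadence.workflow.WorkflowMethod

class Quarantine {
    interface QuarantineWorkflow {
        // Workflow entry point; task list and timeout here are assumptions
        @WorkflowMethod(taskList = "quarantine", executionStartToCloseTimeoutSeconds = 60)
        fun quarantineHost(jiraKey: String): String
    }
}

// Activities are plain interfaces; their implementations can live in the same
// process as the workflow worker or in a completely different one
interface QuarantineActivities {
    @ActivityMethod(scheduleToCloseTimeoutSeconds = 10)
    fun startProgressOnIssue(jiraKey: String): String
}

interface HttpCallActivity {
    @ActivityMethod(scheduleToCloseTimeoutSeconds = 10)
    fun someHttp(path: String): String
}
```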
## How to run the samples

Uber Cadence comes with a non-problematic docker-compose setup:

> git clone https://github.com/uber/cadence.git
> cd cadence/docker
> docker-compose up

It will spin up all the needed dependencies and the web UI.

Access the web UI at http://localhost:8088/

___

Cadence comes with a nice set of samples in the Java client:

> git clone https://github.com/uber/cadence-java-client.git
> cd cadence-java-client
> ./gradlew -q execute -PmainClass=com.uber.cadence.samples.common.RegisterDomain

Open the project in IntelliJ (just "Open" if you have the Gradle plugin installed) and go to `src/main/java/com/uber/cadence/samples/hello`. Each class has a main method - run it and things should just work.

## Cezary's samples

https://bitbucket.sec.sony.com/projects/CRO/repos/cadence-samples/browse

The repo consists of two projects. One implements the activities (`cadence-sample-activities`) and the other implements the workflow (`kotlin-cadence`).

The workflow is really simple. It just calls 2 activities. One activity is hosted together with the workflow and the other is hosted by a different process. The first activity just returns a string. The other one fetches some information from Jira (it uses a proxy on localhost:4140).

1. Run the activities
   > cd cadence-sample-activities
   > ./gradlew -q execute -PmainClass=com.sony.crow.cadence.samples.quarantine.activities.QuarantineActivitiesImpl

   Note: Running via IntelliJ does not work (as of 12.11.2019) as the new Gradle version is not so well supported.
2. Run the workflow worker
   Open `kotlin-cadence` in IntelliJ, navigate to `src/main/kotlin/cadence/SampleFlow.kt` and run the main class.
3. Start the workflow
   `docker run --network host --rm ubercadence/cli:master --domain sample workflow run --tl quarantine1 --wt 'QuarantineWorkflow::quarantineHost' --et 6 -i '"host1"'`

   Cadence comes with a nice CLI. The above command is not so handy as it uses `docker run`; a standalone version is also available.

   The flow will fail if no tunnel is set up. To set up the tunnel: `ssh -f plladace1@43.194.55.86 -L 4140:localhost:4140 -N`

## Testing Uber Cadence

It is really simple: https://github.com/uber/cadence-java-samples/blob/master/src/test/java/com/uber/cadence/samples/hello/HelloActivityTest.java

There is an in-memory workflow engine and you can easily provide mocked activities. Activities are normal classes and can be tested easily.
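A sketch of what such a test for the quarantine workflow could look like, loosely following the linked HelloActivityTest. Assumptions: Mockito is used for the activity mocks, the input/return values are made up, and the exact `TestWorkflowEnvironment` method names may differ slightly between client versions.

```kotlin=
import com.uber.cadence.client.WorkflowClient
import com.uber.cadence.client.WorkflowOptions
import com.uber.cadence.testing.TestWorkflowEnvironment
import java.time.Duration
import org.junit.After
import org.junit.Assert.assertEquals
import org.junit.Before
import org.junit.Test
import org.mockito.Mockito.mock
import org.mockito.Mockito.`when`

class QuarantineWorkflowTest {

    private lateinit var testEnv: TestWorkflowEnvironment
    private lateinit var client: WorkflowClient

    @Before
    fun setUp() {
        // In-memory workflow engine - no Cadence server or Cassandra needed
        testEnv = TestWorkflowEnvironment.newInstance()
        client = testEnv.newWorkflowClient()
    }

    @After
    fun tearDown() = testEnv.close()

    @Test
    fun quarantineHostCallsBothActivities() {
        val worker = testEnv.newWorker("quarantine")
        worker.registerWorkflowImplementationTypes(QuarantineWorkflowImpl::class.java)

        // Mocked activities instead of the real Jira/HTTP implementations
        val jira = mock(QuarantineActivities::class.java)
        `when`(jira.startProgressOnIssue("CROW-1")).thenReturn("in-progress;")
        val http = mock(HttpCallActivity::class.java)
        `when`(http.someHttp("crowjira/v1/issues/CROW-1")).thenReturn("issue-data")
        worker.registerActivitiesImplementations(jira, http)
        testEnv.start()

        val options = WorkflowOptions.Builder()
                .setTaskList("quarantine")
                .setExecutionStartToCloseTimeout(Duration.ofSeconds(30))
                .build()
        val workflow = client.newWorkflowStub(Quarantine.QuarantineWorkflow::class.java, options)

        assertEquals("in-progress;issue-data", workflow.quarantineHost("CROW-1"))
    }
}
```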
## Performance tests

The tests will be performed locally using crowcompose. This removes the need to involve other teams. If the tests show unsatisfactory performance, they will be repeated on some environment with a more proper setup.

**The test measures the time of starting processes. It does not measure the time until they are completed.**

Glossary:
* ww - workflow worker
* aw - activity worker

### Synchronous process

`xargs -P 50 -I{} ./cadence --domain sample workflow start --tl perftest --wt 'PerformanceTestWorkflow::quarantineHost' --et 10000 -i '"host1"'`

| No of proc | No of ww | No of aw | time  |
| ---------- | -------- | -------- | ----- |
| 1000       | 1        | 1        | 15.7s |
| 1000       | 2        | 1        | 11.4s |
| 1000       | 2        | 2        | 10.9s |
| 10000      | 2        | 2        | 116s  |

### Asynchronous process

`xargs -P 50 -I{} ./cadence --domain sample workflow start --tl perftest --wt 'PerformanceTestWorkflow::quarantineHostAsync' --et 10000 -i '"host1"'`

| No of proc | No of ww | No of aw | time  |
| ---------- | -------- | -------- | ----- |
| 1000       | 1        | 1        | 11.6s |
| 1000       | 2        | 1        | 9.6s  |
| 1000       | 2        | 2        | 10.9s |

Note: Running the `cadence` binary 1000 times just to display the version already takes about 3s: `time seq 1 1000 | xargs -P 50 -I{} ./cadence -v` -> `3,573 total`

CPU usage on Cassandra is at 100%. Running one additional Cassandra node does not change this behaviour, but running 3 Cassandra nodes improved the performance. Cadence server CPU usage is also high.
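The asynchronous workflow used above (`quarantineHostAsync`) is not shown in the earlier code snippet; below is a hypothetical sketch of what it could look like as an additional method on `QuarantineWorkflowImpl` (it would also need to be declared on the workflow interface). It uses `Async`, `Functions` and `Promise` from `com.uber.cadence.workflow`; the method body itself is an assumption.

```kotlin=
// Hypothetical addition to QuarantineWorkflowImpl from the snippet earlier
override fun quarantineHostAsync(jiraKey: String): String {
    // Async.function schedules the activity and returns immediately with a Promise,
    // so both activities run in parallel instead of one after the other
    val res1: Promise<String> = Async.function(Functions.Func { activities.startProgressOnIssue(jiraKey) })
    val res2: Promise<String> = Async.function(Functions.Func { activities1.someHttp("crowjira/v1/issues/$jiraKey") })
    // Promise.get() blocks the workflow (deterministically) until the activity result is available
    return res1.get() + res2.get()
}
```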
## Open points

1. How to communicate with the Cadence server on DC/OS, as it is not HTTP but rather TCP? How to learn the IP address? `camunda.prd.crow.marathon.mesos`
2. How to secure communication (SSL) with the Cadence server? - not a blocker
   It is not supported: https://github.com/uber/cadence/issues/2018
   They suggest using an SSL/TLS proxy like https://github.com/square/ghostunnel
3. Securing the connection to Cassandra was merged to master on 2019.11.08 and therefore is not released yet
4. statsd and telegraf clarification - not a blocker

## Action plan

- [ ] deploy the Cadence server to stg using the provided docker images: https://github.com/uber/cadence/tree/master/docker#quickstart-for-production
- [ ] use telegraf to collect metrics from the Cadence server. Have som
- [ ] implement patch velocity using Cadence OR ?
- [ ] deploy a more serious version, as in the chart museum

## Resources

https://eng.uber.com/open-source-orchestration-tool-cadence-overview/
https://github.com/banzaicloud/banzai-charts/tree/master/cadence/
Polling which does not overwhelm history: https://stackoverflow.com/questions/57562772/polling-for-external-state-transitions-in-cadence-workflows

## Snippet from Uber Cadence Slack

Let's start with the cluster setup first. If you look at the Cadence service config, there are a few knobs you need to pay attention to before you can start a production Cadence server.

1) History Shards: The development config uses 4 history shards. For a production cluster this is usually the first config you need to tune, as it is the most critical knob controlling the scalability of the cluster. At Uber we run our production clusters with 16K shards. I recommend you provision the cluster with at least 1K shards for your use case. The number of shards for a given cluster is static; you cannot change this knob after the cluster is provisioned. https://github.com/uber/cadence/blob/master/config/development.yaml#L4

2) Membership: We use ringpop for membership. When you are deploying Cadence Server on multiple hosts you need to specify the host:port of at least one seed node so that all hosts can discover each other. You can either use a static ip:port of one of the frontend hosts or a DNS address. https://github.com/uber/cadence/blob/master/config/development.yaml#L20

3) Persistence: The persistence section of the config has a section to configure Cassandra-related options. You need to specify the Cassandra hosts to connect to and also provide the keyspaces. Cadence uses two keyspaces: 'cadence' for core workflow execution and 'cadence_visibility' for providing visibility into workflow executions. Make sure to use cassandra-tool to deploy the schema before deploying Cadence Server. https://github.com/uber/cadence/blob/master/config/development.yaml#L8

4) Services: The Cadence server consists of 4 roles (frontend, history, matching, worker). The services section of the config has configuration for each of the roles. Things to configure here are the port the service listens on, the statsd metric config and the pprof port. https://github.com/uber/cadence/blob/master/config/development.yaml#L23

5) Global Domains: This is a high-availability feature provided by Cadence where the state of workflow executions within a domain is replicated across multiple Cadence clusters. I suggest you leave it disabled as shown in the development config. https://github.com/uber/cadence/blob/master/config/development.yaml#L68

The rest of the configuration is about other Cadence features like Archival, ElasticSearch integration and the Kafka setup needed for Global Domains. You can ignore those for now.

Your workflows and activities are hosted outside of Cadence Server within your own worker. You can scale them according to the needs of your own use case. Your workers running your application will continuously poll Cadence Server for tasks. When an event happens, like startWorkflowExecution, signalWorkflowExecution, activityCompletion, etc., Cadence Server will dispatch a task to your worker to execute your application logic. All your workflow state is managed by Cadence Server and it will route the signal or any other event to the worker hosting that particular workflow execution.

Please watch Maxim's presentation to understand the model: https://www.youtube.com/watch?v=llmsBGKOuWI

Just a couple of clarifications. Stickiness is indeed about caching a workflow instance on a worker. When an instance is cached it receives only new events in a decision task instead of replaying the whole history on every task. As the word caching implies, the workflow can be pushed out of the cache at any time, or due to worker failures, and be cached on another worker by replaying the whole history. So stickiness is purely a performance optimization and it doesn't guarantee that the workflow is executed on a single worker.
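To make the worker/polling model described above concrete, a minimal worker host for the quarantine sample could look roughly like the sketch below. Assumptions: the `sample` domain and `quarantine` task list from the CLI commands earlier, hypothetical `QuarantineActivitiesImpl`/`HttpCallActivityImpl` implementation classes, and the `Worker.Factory` bootstrap as used in the 2019-era hello samples (newer client versions use a slightly different factory API).

```kotlin=
import com.uber.cadence.worker.Worker

fun main() {
    // Connects to the Cadence frontend (localhost:7933 by default) for the "sample" domain
    val factory = Worker.Factory("sample")

    // One worker long-polls a single task list for both decision (workflow) and activity tasks
    val worker = factory.newWorker("quarantine")
    worker.registerWorkflowImplementationTypes(QuarantineWorkflowImpl::class.java)
    worker.registerActivitiesImplementations(QuarantineActivitiesImpl(), HttpCallActivityImpl())

    // Start polling; from now on Cadence Server dispatches tasks to this process
    factory.start()
}
```

Scaling this process horizontally (more workers polling the same task list) is what the ww/aw columns in the performance tables above vary.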