
Uber Cadence as an alternative to Camunda for the AS team

Motivation to look for an alternative

Overall development speed is poor. Probable causes:

  1. it's difficult to develop with Camunda due to the lack of basic tooling
    • debugger
      • break points
      • variable state in the running task
    • syntax highlighting, syntax autocompletion
    • type checking
      • signals
      • input/output of tasks
    • easy reusability of code
  2. built-in UI does not offer history view
  3. migrations are not easy
  4. problems in runtime
    • database grows like crazy
  5. can't put all logic in Camunda as it's not performant enough
  6. drawing instead of coding - devastating to the developer's psyche
  7. emerging need for more and more custom tooling, which consumes energy (e.g. signals, a wrapper/DSL for modelling, etc.)

How Uber Cadence fulfills these expectations

    • workflows are created directly in code with very little abstraction on top - the normal Java tooling applies
    • the UI can show the history
    • there are no migrations in Uber Cadence ;) Workflows must support all the needed versions with simple conditions like if (version == 1) { doSth() } else { doSthElse() } - see the versioning sketch right after this list
    • time will tell; Cassandra, used as the default database, is known for its performance
    • mostly coding again
    • initial experiments show limited need for additional tooling
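
A rough sketch of how such versioning looks inside a workflow with the Java/Kotlin client's Workflow.getVersion API; the change id, branch bodies and activity calls below are made up for illustration:

// Inside a workflow method: Workflow.getVersion pins a version per execution, so runs
// started before the change keep replaying the old branch and new runs take the new one.
val version = Workflow.getVersion("add-http-check", Workflow.DEFAULT_VERSION, 1)
if (version == Workflow.DEFAULT_VERSION) {
    activities.startProgressOnIssue(jiraKey)                 // original behaviour
} else {
    activities.startProgressOnIssue(jiraKey)
    activities1.someHttp("crowjira/v1/issues/" + jiraKey)    // behaviour added in version 1
}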

What is Uber Cadence

As per their own description, it is an "(…) orchestration engine to execute asynchronous long-running business logic (…)".

There are clients for Go and Java. A Python client is not ready yet: https://github.com/firdaus/cadence-python

Creating a workflow in Uber Cadence is not much different from normal coding.

  1. You define activities - workflow steps that have side effects. They can be reused between workflows.
  2. You define the flow by simply calling the activities from the workflow code:
class QuarantineWorkflowImpl : Quarantine.QuarantineWorkflow {
    private val activityOptions = ActivityOptions.Builder()
        .setTaskList("quarantine")
        .setScheduleToCloseTimeout(Duration.ofSeconds(10))
        .build()

    /**
     * Activity stub implements activity interface and proxies calls to it to Cadence activity
     * invocations. Because activities are reentrant, only a single stub can be used for multiple
     * activity invocations.
     */
    private val activities = Workflow.newActivityStub(QuarantineActivities::class.java, activityOptions)
    private val activities1 = Workflow.newActivityStub(HttpCallActivity::class.java)

    override fun quarantineHost(jiraKey: String): String {
        val res1 = activities.startProgressOnIssue(jiraKey)               // call first activity
        println("In between activities. Fine as long as there are no side effects")
        val res2 = activities1.someHttp("crowjira/v1/issues/" + jiraKey)  // call second activity
        return res1 + res2
    }
}
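
For completeness, here is a rough sketch of what the activity side of this example might look like; the interface and method name come from the workflow above, while the annotation values and the implementation body are assumptions:

import com.uber.cadence.activity.ActivityMethod

// Activity interface: each method is one workflow step with side effects.
interface QuarantineActivities {
    @ActivityMethod(scheduleToCloseTimeoutSeconds = 10)
    fun startProgressOnIssue(jiraKey: String): String
}

// Plain class implementing the side effects; it is registered with a worker that polls the task list.
class QuarantineActivitiesImpl : QuarantineActivities {
    override fun startProgressOnIssue(jiraKey: String): String {
        // e.g. call Jira here; the return value is recorded in the workflow history
        return "started progress on $jiraKey"
    }
}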

How to run the samples?

Uber Cadence comes with an unproblematic Docker Compose setup:

git clone https://github.com/uber/cadence.git
cd cadence/docker
docker-compose up

It will spin up all the needed dependencies and the web UI.
Access the web UI at: http://localhost:8088/


Cadence comes with a nice set of samples for the Java client:

git clone https://github.com/uber/cadence-java-client.git
cd cadence-java-client
./gradlew -q execute -PmainClass=com.uber.cadence.samples.common.RegisterDomain

Open the project in IntelliJ (just open it if you have the Gradle plugin installed) and go to src/main/java/com/uber/cadence/samples/hello.
Each class has a main method - run it and things should just work.

Cezary's samples

https://bitbucket.sec.sony.com/projects/CRO/repos/cadence-samples/browse

The repo consists of two projects. One implements the activities (cadence-sample-activities) and the other implements the workflow (kotlin-cadence).

The workflow is really simple. It just calls 2 activities. One activity is hosted together with the workflow and the other one is hosted by a different process. The first activity just returns a string. The other one fetches some information from Jira (it uses a proxy on localhost:4140).

  1. Run activities

cd cadence-sample-activities
./gradlew -q execute -PmainClass=com.sony.crow.cadence.samples.quarantine.activities.QuarantineActivitiesImpl

Note: Running via IntelliJ does not work (as of 12.11.2019) because the new Gradle version is not well supported.
  2. Run the workflow worker
Open kotlin-cadence in IntelliJ, navigate to src/main/kotlin/cadence/SampleFlow.kt and run the main class.
  3. Start the workflow
docker run --network host --rm ubercadence/cli:master --domain sample workflow run --tl quarantine1 --wt 'QuarantineWorkflow::quarantineHost' --et 6 -i '"host1"'

Cadence comes with a nice CLI. The above command is not very handy because it goes through docker run; a standalone version of the CLI is also available.

The flow should fail if you have no tunnel set up. To set up the tunnel:

ssh -f plladace1@43.194.55.86 -L 4140:localhost:4140 -N
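
For reference, the workflow worker started in step 2 (SampleFlow.kt) boils down to something like the following; this is a rough sketch based on the Cadence Java samples, with the domain and task list taken from the CLI command above and the rest assumed:

import com.uber.cadence.worker.Worker

fun main() {
    // One worker factory per process, bound to the Cadence domain.
    val factory = Worker.Factory("sample")

    // The worker polls a single task list; it must match the --tl used when starting the workflow.
    val worker = factory.newWorker("quarantine1")
    worker.registerWorkflowImplementationTypes(QuarantineWorkflowImpl::class.java)
    worker.registerActivitiesImplementations(QuarantineActivitiesImpl())

    // Start polling; the process keeps running and executes tasks dispatched by the Cadence server.
    factory.start()
}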

Testing Uber Cadence

It is really simple

https://github.com/uber/cadence-java-samples/blob/master/src/test/java/com/uber/cadence/samples/hello/HelloActivityTest.java

There is an in-memory workflow engine, and you can easily provide mocked activities.

Activities are normal classes and can be tested easily.
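
A minimal sketch of such a test in Kotlin, loosely following the linked HelloActivityTest; the workflow and activity types are the ones from the earlier example, while the Mockito-based mocking and the assertion are assumptions:

import com.uber.cadence.client.WorkflowOptions
import com.uber.cadence.testing.TestWorkflowEnvironment
import java.time.Duration
import org.junit.Test
import org.mockito.Mockito.mock
import org.mockito.Mockito.`when`

class QuarantineWorkflowTest {
    @Test
    fun quarantinesHost() {
        // In-memory Cadence service; no server, Cassandra or docker-compose needed.
        val testEnv = TestWorkflowEnvironment.newInstance()
        val worker = testEnv.newWorker("quarantine")
        worker.registerWorkflowImplementationTypes(QuarantineWorkflowImpl::class.java)

        // Replace the real activities with mocks.
        val quarantineActivities = mock(QuarantineActivities::class.java)
        `when`(quarantineActivities.startProgressOnIssue("CROW-1")).thenReturn("started ")
        val httpActivity = mock(HttpCallActivity::class.java)
        `when`(httpActivity.someHttp("crowjira/v1/issues/CROW-1")).thenReturn("issue data")
        worker.registerActivitiesImplementations(quarantineActivities, httpActivity)

        testEnv.start()

        // Execute the workflow against the in-memory engine and check the combined result.
        val workflow = testEnv.newWorkflowClient().newWorkflowStub(
            Quarantine.QuarantineWorkflow::class.java,
            WorkflowOptions.Builder()
                .setTaskList("quarantine")
                .setExecutionStartToCloseTimeout(Duration.ofMinutes(1))
                .build()
        )
        assert(workflow.quarantineHost("CROW-1") == "started issue data")

        testEnv.close()
    }
}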

Performance tests

The tests will be performed locally using crowcompose. This removes the need to involve other teams. If the tests show unsatisfactory performance, they will be repeated in another environment with a more proper setup.

The test measures the time needed to start the processes; it does not measure how long they take to complete.

Glossary:
ww - workflow worker
aw - activity worker

Synchronous process

xargs -P 50 -I{} ./cadence --domain sample workflow start --tl perftest --wt 'PerformanceTestWorkflow::quarantineHost' --et 10000 -i '"host1"'

No of proc   No of ww   No of aw   Time
1000         1          1          15.7s
1000         2          1          11.4s
1000         2          2          10.9s
10000        2          2          116s

Asynchronous process

xargs -P 50 -I{} ./cadence --domain sample workflow start --tl perftest --wt 'PerformanceTestWorkflow::quarantineHostAsync' --et 10000 -i '"host1"'

No of proc   No of ww   No of aw   Time
1000         1          1          11.6s
1000         2          1          9.6s
1000         2          2          10.9s
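
For context, the *Async variant presumably starts its activities without blocking. With the Java/Kotlin client this is typically expressed with Async.function and Promise; a sketch under that assumption (the PerformanceTestWorkflow interface is a placeholder, the activity names are reused from the earlier example):

import com.uber.cadence.activity.ActivityOptions
import com.uber.cadence.workflow.Async
import com.uber.cadence.workflow.Promise
import com.uber.cadence.workflow.Workflow
import java.time.Duration

class PerformanceTestWorkflowImpl : PerformanceTestWorkflow {
    private val activities = Workflow.newActivityStub(
        QuarantineActivities::class.java,
        ActivityOptions.Builder().setScheduleToCloseTimeout(Duration.ofSeconds(10)).build()
    )

    override fun quarantineHostAsync(host: String): String {
        // Async.function schedules the activity and returns a Promise instead of blocking.
        val first: Promise<String> = Async.function(activities::startProgressOnIssue, host)
        val second: Promise<String> = Async.function(activities::startProgressOnIssue, host)
        // Both activities run in parallel; get() blocks the workflow until they complete.
        return first.get() + second.get()
    }
}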

Note:
Running the cadence binary 1000 times while only displaying its version takes about 3.5s: time seq 1 1000 | xargs -P 50 -I{} ./cadence -v -> 3,573 total
CPU usage on Cassandra is at 100%. Adding a second Cassandra node does not change this behaviour, but running 3 Cassandra nodes improved the performance.
Cadence server CPU usage is also high.

Open points

  1. How to communicate with the Cadence server on DC/OS, given it speaks plain TCP rather than HTTP? How to learn the IP address?
    camunda.prd.crow.marathon.mesos
  2. How to secure communication (SSL) with the Cadence server? - not a blocker
    It is not supported: https://github.com/uber/cadence/issues/2018
    They suggest using an SSL/TLS proxy such as: https://github.com/square/ghostunnel
  3. Support for securing the connection to Cassandra was merged to master on 2019.11.08 and is therefore not released yet
  4. statsd and telegraf clarification - not a blocker

Action plan

Resources

https://eng.uber.com/open-source-orchestration-tool-cadence-overview/
https://github.com/banzaicloud/banzai-charts/tree/master/cadence/

Polling which does not overwhelm history
https://stackoverflow.com/questions/57562772/polling-for-external-state-transitions-in-cadence-workflows
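
A common way to achieve this (see the linked thread for details) is to let the Cadence server retry a polling activity with a long retry interval, so failed poll attempts are not recorded as separate history events. A rough sketch; the PollingActivities interface and checkExternalState method are hypothetical:

import com.uber.cadence.activity.ActivityOptions
import com.uber.cadence.common.RetryOptions
import com.uber.cadence.workflow.Workflow
import java.time.Duration

// Inside a workflow implementation: the activity throws until the external state has
// transitioned; the server keeps retrying it without growing the workflow history.
private val pollingActivities = Workflow.newActivityStub(
    PollingActivities::class.java,
    ActivityOptions.Builder()
        .setScheduleToCloseTimeout(Duration.ofDays(1))       // how long to keep polling overall
        .setRetryOptions(
            RetryOptions.Builder()
                .setInitialInterval(Duration.ofSeconds(30))  // wait between poll attempts
                .setMaximumInterval(Duration.ofMinutes(5))
                .setExpiration(Duration.ofDays(1))
                .build()
        )
        .build()
)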

Snippet from uber cadence slack

Let’s start with the cluster setup first. If you look at the Cadence service config, there are few knobs you need to pay attention before you can start a production Cadence server.

  1. History Shards: Development config uses 4 history shards. For a production cluster this is usually the first config you need to tune, as it is the most critical knob controlling the scalability of the cluster. At Uber we run our production clusters with 16K shards. I recommend you provision the cluster with at least 1K shards for your use case. The number of shards for a given cluster is static. You cannot change this knob after the cluster is provisioned. https://github.com/uber/cadence/blob/master/config/development.yaml#L4
  2. Membership: We use ringpop for membership. When you are deploying Cadence Server on multiple hosts you need to specify the host:port of at least one seed node so all hosts can discover each other. You can either use a static ip:port of one of the frontend hosts or a DNS address. https://github.com/uber/cadence/blob/master/config/development.yaml#L20
  3. Persistence: Persistence section in the config has a section to configure Cassandra related options. You need to specify Cassandra hosts to connect to and also provide keyspaces. Cadence uses two keyspaces. It uses keyspace ‘cadence’ for core workflow execution and keyspace ‘cadence_visibility’ for providing visibility into workflow executions. Make sure to use cassandra-tool to deploy schema before deploying Cadence Server. https://github.com/uber/cadence/blob/master/config/development.yaml#L8
  4. Services: Cadence server consists of 4 roles (frontend, history, matching, worker). Services section of config has configuration for each of the role. Things to configure here are port service listens on, statsd metric config, pprof port. https://github.com/uber/cadence/blob/master/config/development.yaml#L23
  5. Global Domains: This is a high availability feature provided by Cadence where state of workflow execution within a domain is replicated across multiple Cadence clusters. I suggest you leave it disabled as shown in the development config. https://github.com/uber/cadence/blob/master/config/development.yaml#L68
    Rest of the configuration are about other Cadence features like Archival, ElasticSearch integration for Cadence, Kafka needed for Global Domains. You can ignore those for now.

Your workflow and activities are hosted outside of the Cadence Server, within your own worker. You can scale them according to the needs of your own use case. Your workers running your application will continuously poll the Cadence Server for tasks. When an event happens, like startWorkflowExecution, signalWorkflowExecution, activityCompletion, etc., the Cadence Server will dispatch a task to your worker to execute your application logic.
All your workflow state is managed by the Cadence Server, and it will route the signal or any other event to the worker hosting that particular workflow execution.
Please watch maxim’s presentation to understand the model: https://www.youtube.com/watch?v=llmsBGKOuWI

Just a couple of clarifications. Stickiness is indeed about caching a workflow instance on a worker. When an instance is cached, it receives only new events in a decision task instead of replaying the whole history on every task. As the word caching implies, the workflow can be pushed out of the cache at any time (or due to worker failures) and be cached on another worker by replaying the whole history. So stickiness is purely a performance optimization and it doesn't guarantee that the workflow is executed on a single worker.