<style>
.blue {
color: blue;
}
</style>
# 2nd Response to Shepherd
Dear Doug,
Thank you so much for your detailed response! We really appreciate your thoughtful comments and questions.
We've created a new CR revision based on your comments and are attaching it to this email. Changes are in green fonts. Also, our responses to specific comments are in-line in this email. Please let us know your thoughts and comments. Thank you again!
Best,
David et al.
I appreciate the effort that you put into revising the paper to address the reviewers’ comments. The paper is substantially improved in my opinion.
<span class="blue">Thank you! It's encouraging to hear this!</span>
However, I do not think that you’ve gone far enough to deal with the main concern that was expressed by the reviewers: accurately describing the benefits of your approach compared to using an orchestration service. The paper is a bit inconsistent in this regard. For example, the abstract states the top benefits as better performance and lower costs, but these are not even mentioned in the introduction until the very last paragraph. Instead, the intro focuses on avoiding server management (which I think is an uncompelling argument) and support for application-specific optimizations (which I think is the real benefit). I suggest that you put the application-specific optimizations front and center in your argument.
<span class="blue">We revised the Abstract and Introduction to further emphasize the benefits of application flexibility and cost reduction.</span>
<span class="blue">We address the three other benefits you mentioned—(1) better performance, (2) lower costs, and (3) avoiding server management—in more detail later in the email where they appear in your original comments.</span>
Here’s my take on the other potential benefits that you mention, and please indicate where you disagree with me:
strong execution guarantees, i.e. exactly-once execution: This is the same for both approaches. You make that clear. Nothing more needs to be said.
👍
weaker execution guarantees: This is mentioned in the intro as a limitation of orchestration services that only offer exactly-once execution but never revisited. So, I would leave it out unless you have a concrete example.
<span class="blue"> In the Introduction, we added examples of deterministic applications that can benefit from weaker execution guarantees.</span>
performance: It is misleading to claim that your approach “performs better” since it does reduce latency in some cases and increases latency in others (as shown in Figure 5). In Figure 5, the break-even point appears to be at a branching factor of around 20, which is a lot of branching for a workflow app.
<span class="blue">We revised statements that may imply that better performance is an important benefit or that Unum performs uniformly better than orchestrators.</span>
<span class="blue">To clarify, we do not argue application-level orchestration *fundamentally* performs better than standalone orchestrators, nor do we believe Unum's contribution relies on it always performing better than Step Functions. As we state in Introduction:</span>
><span class="blue">Moreover, while performance and cost are difficult to compare objectively with existing black-box production orchestrators—both are influenced by deployment and pricing decisions that may not reflect the underlying efficiency or cost of the system—Unum performs well in practice (§5).</span>
<span class="blue">We intended to simply state the fact that "Unum performs well" compared with Step Functions on AWS, as shown in Table 2 and Figure 4. Importantly, our results show that Unum is clearly performant enough to be practical because it performs comparably (better in some cases) than a major production system. It's maybe possible for that production system to be improved, but the point is that whatever performance it currently has is clearly sufficient for many applications.</span>
cost: You do show that your approach saves costs compared to using AWS Step Functions. That’s fair. But it is not a fundamental technical argument; it’s really about how AWS sets its pricing for services, which is not an exact science.
<span class="blue">We do believe that costs savings is an important benefit of Unum's design compared with the standalone orchestrator approach, and we revised the Introduction and Evaluation (Section 5.3) to further expand and clarify our argument. We are happy to discuss and expand the cost discussion even further if you think it would benefit the paper.</span>
<span class="blue">In particular, when we consider the 2 options that users have with standalone orchestrators, (1) deploy their own orchestrator or (2) use a provider-hosted orchestrator. Per-user orchestrator has the problem of risking under-utilization. Provider-hosted orchestrators can multiplex to amortize the cost, but still incur the cost of hosting, such as a 10-person engineering team on call to handle outages. This cost may be marginal for AWS but can be significant for smaller providers. </span>
<span class="blue">Moreover, Unum's costs are more fine-grained. Compute resources for executing orchestration logic are charged per-millisecond and storage resources for persisting states are charged per-read and write. Services that Unum builds on---FaaS schedulers and data stores---multiplex over a likely larger audience of applications for greater economy of scales. Also these services have gone through long periods of improvement to make them efficient.</span>
<span class="blue">In regard to pricing, we believe billed cost is a reasonable proxy to actual costs (hardware and staff costs). The comparison with Step Functions in particular is an 8x difference: Step Functions charges per state transition at $27.9 per 1 million state transitions. One state transitions in Unum involves (1) ~200ms extra Lambda runtime to execute orchestration library code, which costs ~$0.42 per 1 million transitions (2) 1 DynamoDB write to checkpoint, which costs ~$1.3942 per 1 million writes, (3) 1 DynamoDB read to check checkpoint existence, which costs ~$0.279 per 1 million reads, and (4) 1 DynamoDB write to garbage collect checkpoint, which costs ~$1.3942 per 1 million writes. In total, that is $27.9 vs $3.4874.</span>
avoiding the expense of building and maintaining a stand-alone orchestration service: Yes, it’s true that your approach could allow AWS to manage one less service. But this is not much of a savings in the grand scheme of things. And your approach also has engineering expenses in that someone needs to maintain your library, fix bugs, deploy new versions, etc.; those functions are actually easier in a service rather than client library. Moreover, it seems contradictory to me that you talk about the difficulty of building a scalable, fault-tolerant orchestration service when you depend on a scalable, fault-tolerant database service (DynamoDB) which is much, much more complex than an orchestration service.
<span class="blue">To clarify, it is less about building and maintaining the code, but more about the cost of *hosting* a service. And we revised Introduction to clarify this point.</span>
<span class="blue">Our comparison is really with the orchestrator _approach_ where the solution relies on adding yet-another service into the infrastructure. While it is likely not a big deal, technically and financially, for AWS to add another service, it is for many providers not at AWS' scale.</span>
<span class="blue">On the point that Unum "depend[s] on a scalable, fault-tolerant database service (DynamoDB) which is much, much more complex than an orchestration service", the difference is that Unum is not adding DynamoDB into the infrastructure; instead Unum simply uses what already exists. DynamoDB, or other data store services for this matter, is already an integral part of cloud computing and popular among serverless applications. Compared with orchestrators, Unum does not introduce additional work to host yet-another service and to make sure it scales to the needs of serverless applications.</span>
scalability: You still claim at the end of the intro that you “scale better” and then also mention scalability in the conclusion, even though many of the reviewers objected to this claim. There is nothing in the paper, in the evaluation or otherwise, to support any claims of better scalability. So, I suggest that you simply remove this.
<span class="blue">We removed "scale better" from the last paragraph in Introduction.</span>
<span class="blue">To clarify, we intended for "scale better" to mean what Figure 5 shows, namely that Unum's design leverages FaaS' performance and can easily support highly-parallel applications, whereas standalone orchestrators must repeat the work of building a service that can support highly-parallel applications well. Although it _may_ not be the case that standalone orchestrators fundamentally have to impose limits on the number of outstanding function executions, it is not trivial to eliminate this limit, as the example of Step Functions shows. As a library, Unum can support as much parallelism as FaaS schedulers and data stores permit. FaaS schedulers and data stores already support highly parallel applications well, and Unum's approach is free from the need to carefully design and build yet-another service that supports parallel applications well. Also, Unum's performance will improve automatically when these underlying services further improve.</span>
portability across cloud providers: You may have an advantage here in that your application is not dependent on the orchestration service API, but you still are dependent on the database and functions APIs. So, it probably doesn’t matter in practice. In any case, it is strange that this is only mentioned in the last paragraph of the conclusion.
<span class="blue">We removed "portability" from the last sentence in Conclusion.</span>
Given this, you should emphasize the flexibility that your approach provides and tone down the other claimed benefits. With that in mind, I am sure that readers would love to see more examples to illustrate the need for flexible workflow orchestration. It would be great if the intro could include concrete examples of patterns that application use that are not supported by Step Functions and examples of applications that could get performance benefits from weaker guarantees. The case study in section 5.4 helps, but comes late in the paper and is only one example. And regarding that example, it would help to provide more detail, such as showing the workflow graph with branching, so that the reader can better understand where you are able to get additional parallelism compared to Step Functions. For a balanced discussion, you should also say something about the developer costs and code complexity of introducing application-specific optimizations.
<span class="blue">We added a graph and text in the ExCamera section to describe and compare the workflow patterns between what's possible with Step Functions and with Unum. In the Introduction, we added examples of deterministic applications that can benefit from weaker execution guarantees and an example pattern not supported by Step Functions.</span>
<!-- <span class="blue">However, we'd like to emphasize that the important benefit of Unum's design is *not* that it can fundamentally support *more* patterns than provider-hosted orchestrators. Rather it is that Unum puts the control to add and implement orchestration in the hands of users. It is possible that Step Functions one day can support aggregating adjacent branches in its Map fan-out pattern (which is what ExCamera needed). However, users have no control over this. They can either wait for Step Functions to add this pattern, or use less efficient implementations with Map. Alternatively, they can choose the approach to build their own orchestrator that supports the pattern they need (which is what ExCamera did). But building and hosting a scalable, performant and fault-tolerant distributed system is a difficult task. Moreover, the approach of deploying an orchestrator per-user or per-application is expensive and risks under-utilization.</span> -->
<span class="blue">We also added a discussion on the tradeoff with code complexity from application-specific optimization and the need to decentralize orchestration in the Discussion section.</span>
Here are some minor things that could be fixed:
* There are still some places where you describe an orchestration service as “centralized” such as in Figure 1 and section 3.1. Saying “logically centralized” is fine, but saying “standalone” is better.
<span class="blue">We revised the paper to use "standalone" consistently when appropriate.</span>
* There are a few typos that I noticed: “and the aggregate” should be “and aggregate”, “checkes”, “costs benefits”, “cooridnator”, “mutliple”
<span class="blue">Apologies! We went through the paper again to correct all typos we found.</span>
* I was confused by the last sentence in the first paragraph of section 5: “The loss of flexibility and application control from centralization often enables better performance, more efficient resource usage, or better scalability.” I think that you are saying that orchestration services can have better performance and such. But they are inflexible. Right? I would try to word this more clearly.
<span class="blue">Yes, and we rephrased it to clarify its meaning.</span>
* Section 5.1 mentions a “serverless option” for DynamoDB. But DynamoDB is always serverless. So, I do not understand this.
<span class="blue">By "serverless option" we intend to mean the "on-demand capacity" mode of DynamoDB (Link) where users pay for the read/write throughput actually used by their applications, versus the traditional "provisioned capacity" mode where users pay for a fix dollar amount per hour for a fixed, maximum throughput. We updated the text to use "on-demand capacity mode" instead.</span>
* The fault-tolerance could be explained in a bit more detail. For instance, who writes the error handler and where does it run? If it runs in the application, then you cannot tolerate application failures (whereas an orchestration service would keep processing the workflow).
<span class="blue">We added more details on Unum's error handler in section 3.3.1. In short, the error handler is part of Unum's orchestration library and is not written by application developers. It runs in a separate FaaS function and we rely on FaaS schedulers to execute the error handler at least once. Similar to the assumption that standalone orchestrators are bug-free and reasonably reliable, the error handler is also assumed to be bug-free as it's part of Unum. And FaaS engines are assumed to be reasonably reliable that functions will run at least once and bug-free functions will not crash constantly due to system faults.</span>