# Detailed Responses to Reviewer Comments

## Major Issues

### Clarify and demonstrate fundamental differences and benefits of Unum

*From the PC Discussion (PC-Comment-1):*
> This paper was discussed in detail during the PC meeting. The reviewers appreciated the paper's interesting approach of relying on storage and atomic operations to support a fault-tolerant orchestration service. However, they were concerned about the lack of evidence that the approach Unum takes has advantages over a more traditional, logically centralized orchestration service. Ultimately, the PC decided to accept the paper, but they expect the authors to work closely with the shepherd to resolve the following issues:
>
> Scale back (or remove if suggested by the shepherd) any claims about the benefits of decentralization that are not strictly supported by evidence in the paper.

*B-Comment-P4:*
> Explain why embedded orchestration can inherently scale better. You never define what you mean by "scalability" in the context of serverless workflows. And you do not address scalability in the evaluation section. So, it is not clear what limitations current orchestration services have regarding scalability.

*B-Comment-P3:*
> Present a compelling, concrete example of an application that requires more flexibility than afforded by orchestration services. The intro mentions one example dealing with deterministic functions, but provides no details. Please explain the sort of application-specific optimizations that are both desirable and possible with your approach. As a later example, you state that a "fold" operator can be valuable for video encoding but is not supported in Step Functions. The paper does not explain why a Step Functions user cannot construct a workflow that does something similar.

*B-Comment-P7:*
> Demonstrate through your evaluation the fundamental benefits of decentralized (function-based) vs. centralized (server-based) orchestration. The presented results seem to show differences in the implementation of your approach compared to Step Functions but not whether these differences are inherent to either approach. For instance, you get benefits from increased parallelism in the execution of serverless functions. But why can't Step Functions achieve the same amount of parallelism? It would be great if you could explain the reasons why Step Functions enforces limits, since these same reasons might apply to your approach as well.

*D-Comment-P3:*
> There is more proof required before making a sweeping claim such as "Leveraging decentralized orchestration, rather than centralized services can help performance, resource usage, flexibility, and portability across cloud providers." Is there really a networking or computation bottleneck with a centralized orchestrator (or the application running in the cloud that is coordinating the serverless functions)? There is a theoretical bottleneck, yes, but where is the practical bottleneck?

*D-Comment-P4:*
> Related to the above comment, the evaluation of the paper can be improved further. While the evaluation provides some answers to the question of where the latency and cost improvements in Unum come from, a finer-granularity breakdown and a better presentation of the findings would help.

*D-Comment-P2:*
> Is this the first decentralized serverless workflow system? It seems so. Then the paper should make a bolder claim for it.
>
> On the other hand, the paper should also do an honest recount of why there has not been a pressing need for a decentralized workflow orchestration solution. While there is a cost to running a central orchestrator, that cost is often absorbed/compensated as the application using the serverless functions is serving as the orchestrator. Moreover, a centralized coordinator can allow dynamic workflow execution, whereas Unum is constrained to static compilation of workflows.

*E-Comment-P3:*
> On the other hand, I find your motivation for Unum is weak and vague. The motivation statement is:
>
> > centralized orchestrators also preclude users from making their own trade-offs between available interactions or execution guarantees and performance, resource overhead, scalability and expressiveness.
>
> However, throughout the paper, I didn't find strong qualitative or quantitative evidence that supports this statement. You made two arguments:
>
> > better for cloud providers as they need not develop and maintain yet another complex service
>
> This does not make much sense to me. So, who is going to develop the decentralized orchestrator? The users? Or, some third parties? Eventually someone needs to "develop and maintain" the orchestrator, and cloud providers have the best interests to do it. Note that the decentralized orchestrator is more complex than the centralized one. So you are essentially increasing the complexity, rather than reducing it.
>
> > It is better for developers as it gives applications more flexibility to use more performant, application-specific orchestration optimizations and makes porting applications between different cloud platforms easier
>
> I'm also not convinced by the more performant, optimization-friendly argument. I don't see from the paper how the decentralized orchestrator can achieve it. In terms of portability, it is orthogonal to whether the design is centralized or decentralized. If your argument is that Unum provides an independent IR and that's why it's more portable, then the same thing can apply to centralized designs too.

*E-Comment-P5:*
> I don't understand how (2) is improved from your experiments. In the evaluation, you mentioned that
>
> > Unum performs comparably or significantly better than Step Functions in most cases owing to higher parallelism and a more expressive orchestration language,
>
> What prevents a centralized orchestrator from achieving high parallelism? You mentioned that AWS Step Functions limits parallel invocation of concurrent branches; why? I checked Ref [27] and it says,
>
> > The default value is 0, which places no quota on parallelism and iterations are invoked as concurrently as possible.
>
> Intuitively, Unum needs to pay the overhead of checkpointing, which is not needed by a centralized design. How does it affect the performance?

*E-Comment-P6:*
> For (3) expressiveness, the ExCamera example in which you optimize the data dependency is interesting. OTOH, I don't find it supports your argument too strongly, because it can be achieved by a centralized orchestrator also. The key point you are making is that the existing orchestrator is hard to customize, which is orthogonal to whether the design is centralized or decentralized. Is it fair to say that the expressiveness can be provided by a centralized design also (by providing more fine-grained control)?

*E-Comment-P8:*
> For (4) scalability, what are you referring to? It's not evaluated or discussed.
**Our Response**: We will expand our discussion of the fundamental differences and benefits of Unum and scale back any claims about Unum's benefits that are not strictly supported by evidence in the paper, including application-specific optimizations, flexibility, and performance (e.g., scalability, parallelism, networking or computation bottlenecks). We will clarify that by "centralized" we mean that orchestrators are standalone services, and that the fundamental difference between Unum and existing orchestrators is that Unum is an application-level library built on top of existing serverless services (e.g., Lambda, DynamoDB) that runs in situ with functions. This design removes the need to build, maintain, and provision an additional standalone service, and it gives applications control over the interface and implementation of orchestration. We intend the evaluation to show that Unum's approach is performant enough, rather than to argue that a decentralized orchestrator is fundamentally more performant than a standalone one.

### Define and clarify "centralization"

*PC-Comment-2:*
> Clarify that existing orchestrators are not strictly centralized either (e.g., they can rely on sharding, replicated state machines, etc.), and emphasize actual differences in Unum's design.

*A-Comment-P3:*
> The first issue is that I find no evidence to back the important claim that current orchestration of cloud services is centralized. And it is hard to believe that a service with AWS scale has a centralized orchestrator. If this is real, please provide evidence. The intro cited [3,4,19,21] for this ("Cloud providers have deployed specialized centralized workflow orchestrators"). But I didn't find descriptions of the orchestrator being centralized in these citations. I also searched online but didn't find many descriptions of the architecture of AWS Step Functions. If I have to guess based on my experiences, I'd assume they implemented the service in a partitioned way, backed by fault-tolerant storage such as DynamoDB, which is a common way of building such a service. Note that being external is different from being centralized, and perhaps what confuses you is that these services are external to the serverless worker nodes. If so, it means that a very important claim in the abstract and intro needs to be withdrawn, which is unfortunately a killer.

**Our Response**: We will clarify that existing production orchestrators, such as Step Functions, are logically centralized but likely internally distributed, and that we intend "centralization" to mean that (1) orchestrators are logically centralized controllers that drive a workflow by invoking its functions, receiving function results, and hosting application state centrally, (2) orchestrators are standalone services that are separate from existing serverless services (e.g., FaaS and serverless data stores), and (3) the interfaces, implementations, and trade-offs of orchestrators are centrally determined and outside the control of individual applications. We will consider using a different phrase, such as "standalone", to describe these properties.

### Detail restart mechanism

*PC-Comment-3:*
> Clarify how other key functions like access control, rate limiting, and restarting failed work can be handled in Unum.

*B-Comment-P5:*
> Expand on your claims that you are able to run workflows with the same fault-tolerance.
>
> For example, a service-based orchestrator is able to restart a workflow when machines (and the associated function invocations) fail, but it is not clear how your approach can offer the same type of recovery. Your section on fault-tolerance mostly deals with checkpointing and glosses over restarts.

*B-Comment-P6:*
> Justify the importance of exactly-once execution. Your example of generating thumbnails from images in a photo library seems to contradict your insistence that workflow executors must guarantee exactly-once semantics. Generating a thumbnail more than once is not a problem, arguing that at-least-once semantics may be sufficient. Moreover, AWS Lambda does not even guarantee at-least-once execution, that is, it will perform a number of retries if a function fails but not indefinitely.

*C-Comment-P3:*
> In S3.3, I understood how the checkpointing mechanism prevents corruption from duplicate executions, but how are failed functions themselves restarted? Is there some timeout mechanism? How is this implemented in a decentralized way? My basic assumption is this relies on Lambda's retry system, but it's not so simple to use: https://docs.aws.amazon.com/lambda/latest/dg/invocation-retries.html

**Our Response**: We will detail how Unum can restart failed functions as long as the FaaS engine provides a way for applications to catch exceptions and regain control, such as Lambda failure destinations or automatic retries. We will include an argument for the necessity of such a mechanism in FaaS systems, as otherwise applications cannot handle errors. We will clarify that, as Reviewer B pointed out, Lambda performs a number of retries if a function fails but not indefinitely, and that Lambda's at-least-once execution does not guarantee that a buggy function completes at least once; rather, in the case of a buggy function, the "sign" of completion is the exception. We will explain how the Unum library invokes failed functions explicitly so that it does not require the FaaS engine to automatically retry failed functions indefinitely. We will add implementation details on Unum's retry mechanism for both AWS and Google Cloud.
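For concreteness, the following is a minimal sketch of the kind of restart path we have in mind on AWS, assuming boto3; the handler name, the `unum_retry` payload field, and the retry bound are hypothetical, and this is an illustration rather than Unum's actual implementation. A failure destination hands control back to application code once Lambda's built-in retries are exhausted, and the library then re-invokes the failed function explicitly.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

MAX_LIBRARY_RETRIES = 3  # hypothetical, application-chosen bound


def configure_failure_destination(function_name: str, handler_arn: str) -> None:
    """One-time setup: route failed asynchronous invocations of a workflow
    function to the library's failure handler instead of relying on the
    FaaS engine to retry indefinitely."""
    lambda_client.put_function_event_invoke_config(
        FunctionName=function_name,
        MaximumRetryAttempts=0,  # let the library own retries
        DestinationConfig={"OnFailure": {"Destination": handler_arn}},
    )


def on_failure_handler(event, context):
    """Hypothetical Lambda failure-destination target.

    Lambda delivers an invocation record describing the failed asynchronous
    invocation, so application code regains control and can decide whether
    to re-invoke the failed function.
    """
    failed_fn = event["requestContext"]["functionArn"]
    payload = event["requestPayload"]        # input of the failed invocation
    attempts = payload.get("unum_retry", 0)  # hypothetical retry counter

    if attempts >= MAX_LIBRARY_RETRIES:
        # Give up and surface the error (e.g., log it or dead-letter the input).
        return {"status": "abandoned", "function": failed_fn}

    payload["unum_retry"] = attempts + 1
    # Explicit re-invocation: the library, not the FaaS engine, drives the retry.
    lambda_client.invoke(
        FunctionName=failed_fn,
        InvocationType="Event",  # asynchronous
        Payload=json.dumps(payload).encode(),
    )
    return {"status": "retried", "attempt": attempts + 1}
```

The sketch only illustrates the control-flow handoff; the revision will describe the concrete mechanisms Unum uses on each platform.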
### Clarify how Unum handles other key orchestrator functionalities

*PC-Comment-3:*
> Clarify how other key functions like access control, rate limiting, and restarting failed work can be handled in Unum.

*E-Comment-P9:*
> I'd like to point out that orchestrators typically perform many other functionalities, such as admission control to prevent overload and permission checks. How does Unum achieve those? One question I have is how to prevent malicious or unauthorized users from invoking protected functions? How do you achieve security?

**Our Response**: We will add a discussion of how existing orchestrators, such as Step Functions, work with separate services, such as AWS IAM, to enforce access control. Because Unum runs merely as a library within functions, any feature that a platform provides for, e.g., access control on non-Unum functions applies equally to Unum functions, and policies that apply to workflow orchestrations can be applied equivalently to an Unum workflow's entry function. For example, Unum can implement similar access control by setting the permissions on the entry function of a workflow, and it can implement rate limiting by changing the concurrency limit on the entry function of a workflow.
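As an illustration (assuming boto3 and hypothetical names; this is not code from the paper), both controls reduce to standard Lambda configuration applied to the workflow's entry function:

```python
import boto3

lambda_client = boto3.client("lambda")

ENTRY_FUNCTION = "wf-entry"  # hypothetical name of the workflow's entry function

# Access control: grant invoke permission on the entry function only to an
# authorized principal; other callers are rejected by the platform.
lambda_client.add_permission(
    FunctionName=ENTRY_FUNCTION,
    StatementId="allow-frontend-invoke",
    Action="lambda:InvokeFunction",
    Principal="arn:aws:iam::123456789012:role/frontend",  # hypothetical caller
)

# Rate limiting / admission control: cap the entry function's concurrency,
# which bounds how many workflow instances can start concurrently.
lambda_client.put_function_concurrency(
    FunctionName=ENTRY_FUNCTION,
    ReservedConcurrentExecutions=50,  # hypothetical limit
)
```

Both knobs attach to the entry function only, which is where a workflow's admission and permission policies naturally apply.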
### Compare with gg, esp. on costs

*PC-Comment-4:*
> Discuss gg as an alternative option in relation to your claims about cost.

*A-Comment-P6:*
> Second, you should compare the cost of other orchestration techniques, such as gg. This is important as it provides evidence, and a clear positioning of the scientific contribution of Unum: it is the first that can lower the cost of orchestration by up to an order of magnitude. If gg can also reduce cost, then you will have to trim the claim a bit, that Unum is the first to reduce the cost of orchestration without deploying extra services. This unfortunately will undermine the contribution of Unum but has to be clear; being unclear hurts more.

**Our Response**: We will expand our discussion of gg to detail the architectural differences and compare Unum and gg in terms of monetary cost. The gg paper did not target lowering the cost of orchestration; its primary cost comparison is with long-running VMs. For example, it compares the task of compiling large software using Lambda vs. a long-running 384-core cluster of EC2 VMs. We will explain why, from Unum's perspective, gg is architecturally similar to standalone orchestrators such as Step Functions: gg relies on a standalone orchestrator (called the "coordinator" in gg's terminology) that invokes FaaS functions, receives function results, caches function outputs, and drives the execution of workflow DAGs. The cost of running the coordinator depends on the machine instance (e.g., EC2 VM type). Moreover, to achieve the same level of fault tolerance as Step Functions, the gg coordinator would need to be replicated and distributed because it caches function results and workflow state on disk, which would increase the cost of running the coordinator.

### Justify the importance of exactly-once execution

*B-Comment-P6:*
> Justify the importance of exactly-once execution. Your example of generating thumbnails from images in a photo library seems to contradict your insistence that workflow executors must guarantee exactly-once semantics. Generating a thumbnail more than once is not a problem, arguing that at-least-once semantics may be sufficient. Moreover, AWS Lambda does not even guarantee at-least-once execution, that is, it will perform a number of retries if a function fails but not indefinitely.

**Our Response**: We will include an argument for exactly-once execution guarantees.

### Demonstrate the necessity for more flexibility

*B-Comment-P3:*
> Present a compelling, concrete example of an application that requires more flexibility than afforded by orchestration services. The intro mentions one example dealing with deterministic functions, but provides no details. Please explain the sort of application-specific optimizations that are both desirable and possible with your approach. As a later example, you state that a "fold" operator can be valuable for video encoding but is not supported in Step Functions. The paper does not explain why a Step Functions user cannot construct a workflow that does something similar.

*E-Comment-P7:*
> Besides the ExCamera example, what are the other limitations of expressiveness of existing orchestrators?

**Our Response**: We will expand on how Unum's design affords more flexibility to applications and present concrete examples of applications that require the additional flexibility.

### GC Correctness

*A-Comment-P11:*
> At last I want to discuss the design parts. These are actually the problems that I believe are easier to fix. The database providing atomic operations simplifies Unum's design, but also undercuts the challenge Unum has to address and hence its novelty.
>
> The coordination goal of Unum is less challenging than other works such as Beldi, which targets a transactional feature. However, there are still corner cases that need a few revisions, for example, the garbage collection. The current design says "in non-fan-out cases, once a node check-points its result, it can delete the previous checkpoint" (S3.5.1). I think there are corner cases that break the design. Say you have a chain A-B-C; once B has checkpointed its result, it will delete A's result. However, consider there is a duplicated task of A, named A1, and A1 finishes after A's checkpoint is deleted. Then A1 will create a new checkpoint? There are two problems in this case. 1) How is A1's new checkpoint garbage collected? 2) What if A1 is non-deterministic and has a branch at its end, and instead of launching B1, A1 launches a totally different node D1? All these cases need to be discussed and covered.

**Our Response**: We will address Reviewer A's concern about GC correctness and provide more details on how duplicate checkpoints (including duplicate checkpoints from restarted executions) and branching behaviors are handled in GC. Furthermore, we will clarify that a duplicate execution that starts after the previous checkpoint has been garbage collected must be treated as a separate invocation. Any GC policy, no matter how conservative, has the potential to compromise execution guarantees if duplicates can happen after an unbounded amount of time; the same applies to standalone orchestrators as well.

## Minor Issues

### Clarify exactly-once guarantee in relation to side effects

*A-Comment-P12:*
> Another design choice that may need revision in discussion is the "exactly once" semantics. I think generally you need to clarify that the "exactly once" semantic cannot be guaranteed if there is an (external) side effect, because the design guarantees the semantic using checkpoints on results, not at launching/execution. I think you mentioned this very late, in related work?

**Our Response**: We will clearly state and explain in Section 3.3 that exactly-once execution guarantees do not extend to side effects.

### Clarify storage requirements

*C-Comment-P2:*
> It seems like a key insight is that many operations that normally would require a centralized coordination service can instead be replaced with atomic operations at the storage layer. One follow-up question I had about Unum is what does it assume about the storage layer. For example, would Amazon's S3, which has fairly weak consistency semantics, be sufficient to run Unum? S4 discusses some of these details, but it would have been nice to summarize this sooner in the paper. It seems like the short answer is that DynamoDB is required on the AWS platform. However, it was nice to see that Unum is general enough to support Google Cloud as well.

*D-Comment-P5:*
> The related work discussion on Beldi and Boki should be expanded to provide more context, and discuss how their approaches compare to and complement this approach. The distributed shared log could serve as a good API to build a decentralized orchestrator as well, no?

**Our Response**: We will add to the Design section to clarify Unum's requirements on the storage layer.
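To make the storage requirement concrete, the following is a minimal sketch (our illustration, with hypothetical table and key names, not the paper's code) of the kind of atomic create-if-absent checkpoint write the design relies on. A store that supports a strongly consistent conditional write of this form is sufficient; a store without such an atomic primitive is not.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

CHECKPOINT_TABLE = "unum-checkpoints"  # hypothetical table name


def checkpoint(instance_id: str, result: str) -> bool:
    """Atomically record the result of one function instance.

    Returns True if this execution created the checkpoint; False if a
    checkpoint for the same instance already exists, in which case a
    duplicate execution must discard its own result and reuse the stored one.
    """
    try:
        dynamodb.put_item(
            TableName=CHECKPOINT_TABLE,
            Item={
                "instance_id": {"S": instance_id},  # hypothetical partition key
                "result": {"S": result},
            },
            # Succeed only if no checkpoint with this key exists yet.
            ConditionExpression="attribute_not_exists(instance_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

DynamoDB provides this primitive on AWS (as Reviewer C notes); the Design section will state the requirement explicitly and how the Google Cloud datastore meets it.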
### Clarify Step Functions Costs

*A-Comment-P5:*
> First, what are the further cost details on the Step Functions? For example, what is the per-state-transition cost and how many state transitions are there? Figure 6 is amazing; it says that the major cost is actually in Step Functions, not the workloads themselves. It raises a question: should people use Step Functions at all, if they care about costs?

**Our Response**: We will include cost details for Step Functions in the Evaluation section and add a discussion of the fact that Step Functions transition costs dominate the total cost of running applications.

### Additional cost evaluation

*A-Comment-P8:*
> Why didn't the evaluation show the cost on Google Cloud?

*A-Comment-P9:*
> What is the cost of no orchestration (no information on this in Table 4)? Should people always use a handcrafted workflow if they want the lowest cost?

*A-Comment-P10:*
> What is the cost of using a "driver function"? This seems to be the most non-intrusive way of implementing a flexible in-place orchestration. The main drawback claimed by the paper (S2) is that it has the "double billing" issue. This should be backed by the evaluation showing that it is an actual issue. But it is doubtful as in Figure 6, the User Code Duration's cost is only a small part. The "driver function" I assume would belong to User Code Duration if implemented?

**Our Response**: We will add costs for running Unum on Google Cloud in the revision. We originally intended the GCF implementation to just demonstrate that Unum's design extends across different platforms, and decided not to include GCF cost because Google Cloud lacks a Step Functions equivalent for comparison. The main problems with driver functions are that their runtime limits prevent the implementation of longer-running applications (e.g., Lambda can only run for up to 15 minutes) and that they have difficulty handling faults. Therefore, we do not consider driver functions a complete solution compared with Unum or standalone orchestrators. We would like to clarify that Unum is our flavor of "no orchestration": Unum's functionality demonstrates that no orchestration can achieve the same features as a standalone orchestrator, and Unum's cost numbers show the cost advantage of no orchestration's direct use of lower-level APIs. Other developers may come up with a different application-level library for their application, perhaps with more tailored optimizations and fewer patterns and features, and achieve lower cost and/or better performance.

### Evaluate and Compare recovery with SF

*C-Comment-P4:*
> Overall, I thought that Unum had a sensible design. The evaluation was also decent, looking at both AWS and Google Cloud, and comparing to Amazon Step Functions, a strong commercial baseline. However, I was disappointed that GCP was only considered for cost, and that performance results were not included. I would also have liked to see how well Unum can handle failures in the evaluation, as this is the key area that contributes to complexity in Unum. Introducing some simulated failures, and comparing recovery to Step Functions, would improve the paper.

**Our Response**: As there is no standalone orchestrator in Unum, recovering from a workflow crash involves re-executing just the failed function. Standalone orchestrators, such as Step Functions, on the other hand, need to restart an instance of the failed orchestrator. We agree that it would be valuable to compare the performance of these two mechanisms. However, we did not include this experiment because there is no way to crash Step Functions programmatically: we can crash individual functions in the workflow, but we cannot crash the Step Functions orchestrator itself.
### Expand discussion with Beldi and Boki

*D-Comment-P5:*
> The related work discussion on Beldi and Boki should be expanded to provide more context, and discuss how their approaches compare to and complement this approach.

**Our Response**: We will expand our discussion of Beldi and Boki in the Discussion section.

### Other concerns and confusions

*C-Comment-P5:*
> - In Table 2, does map allow batching of multiple items? Is it efficient to pass one at a time to a function? Could FanIn be renamed to reduce?

*C-Comment-P6:*
> - In S3.2, is there anything special or different about these ops versus a traditional serverless orchestrator design? Do the ops help in any way to support a decentralized design?

*C-Comment-P7:*
> - In S5.2.3, could Unum support preloading in the future? How would this work?

*E-Comment-P10:*
> I find that there are many important details missing in the current draft. How is the frontend compiler implemented? Given that your IR is likely more expressive than existing offerings, how do you translate existing manifests into your IR? What are the evaluation applications? Are those representative serverless workflows? Who implemented them? How big are they (e.g., how many functions and what is the length)? What are the workloads you use? What exactly is your IR? What you described is less a complete instruction set and more a few APIs. Why do you think your IR is expressive enough?

**Our Response**: If space permits, we will clarify these concerns.

## Nits

> - In S2, "bust" -> "burst", "functio" -> "function"
> - In S3, "fucnations" -> "functions"
> - In S3.2, "less uncommon" -> "less common"

**Our Response**: We will correct all nits pointed out by our reviewers and go through the paper again to find and correct spelling and grammatical errors.

## No-ops

Other comments that do not require action.

*B-Comment-P1:*
> As the serverless paradigm is gaining traction among application developers, the use of orchestrators that coordinate the execution of graphs of serverless functions is also increasing. Thus, your work is timely. Your introduction does a fair job of assessing the current situation with cloud-based orchestration services. Showing that such orchestration can be done within serverless functions themselves is an interesting contribution. I especially appreciate that you can run arbitrary Step Function workflows with the same guarantees. Though, I have trouble deciding whether this is mainly of academic interest or an approach that will advance the practical use of serverless workflows.

*A-Comment-P1:*
> This paper targets a problem in practice: current serverless workflow orchestrators are insufficient. The claimed insufficiencies of the current orchestrators include being centralized (thus lacking performance, fault-tolerance, and scalability), and having a high monetary cost. The paper proposes a new drop-in replacement called Unum. Unum works as a library running on the same machines as the serverless workloads; it leverages the provided storage (and its atomic semantics) to coordinate between nodes. Evaluation on AWS and Google Cloud shows an improvement in performance and a (huge) reduction in cost.

*A-Comment-P2:*
> The paper is very easy to read, and gives the problem statement and the solution in a clear way. I enjoyed reading the paper. The strength and the weakness of the paper are both obvious to me. On the one hand, it targets a real-world problem and it achieves an excellent result.
>
> On the other hand, the solution is very simple and straightforward and lacks challenge, and the execution of the paper needs improvement. I would champion this paper if the execution were better (I don't mind simplicity that much). The paper currently lacks/mis-states a few key things that are foundations of the claimed scientific contribution. I believe an improved execution will greatly strengthen this paper.

*A-Comment-P4:*
> Despite the above misclaim on centralized services, I still believe the paper has a strong contribution in that it provides a huge benefit on cost (the latency benefit is still good, but on its own might not be enough for an NSDI paper). The evaluation should provide more details on this. In particular, two questions need to be addressed.

*A-Comment-P13:*
> In summary, I like the problem which the paper is targeting and the effectiveness of its solution. What I'd demand is a better description of its contribution, and a more detailed analysis of the cost advantage it has. I'd be happy to review it again in future submissions after these issues are fixed.

*D-Comment-P1:*
> Thank you for submitting to NSDI. I like the paper overall. It targets a timely practical problem and it has good technical merit. I really liked the idea of leveraging serverless consistent datastores already provided by FaaS providers for solving the consensus/coordination needs of decentralized serverless workflow orchestration. With improved presentation and with some refinements, this makes for an interesting contribution to the conference.

*C-Comment-P1:*
> Thank you for submitting your work to NSDI'23. There might not be a lot of ground left to cover with coordination services in general, but I do think there's a contribution to be made in decentralizing them, and I enjoyed the paper.

*E-Comment-P1:*
> Summary
> + Interesting alternative design of existing centralized serverless workflow orchestrator
> - The motivation is unclear. I do not see how Unum helps users with tradeoffs between execution guarantees, performance, resource overhead, scalability, and expressiveness.
> - The evaluation does not support the motivation either.
> - Other orchestrator functionalities are overlooked, such as admission control and permission checks.
> - Many important details are missing, e.g., the implementation of the frontend compiler, the representativeness of evaluated applications, etc.

*E-Comment-P2:*
> Thanks for submitting the work to NSDI! Unum is an interesting design for serverless workflow orchestrators. I'm not aware of a decentralized orchestrator, so the idea is novel. And I do believe that making the orchestrator decentralized can lead to benefits.

*E-Comment-P4:*
> Overall, you mentioned four things: (1) execution guarantee, (2) performance, (3) resource overhead, (4) scalability, and (3) expressive. (1) can be more effectively supported by a centralized design.

**Our Response**: No actions are intended for these comments.
