**Composable Concurrency:**
20/09/2023
Here are some scenarios to think about (not all of them are necessarily due to just the additional features):
* What is the behaviour if `resume` in `Suspend (fun resume -> ...)` is called multiple times with different values?
* What happens if `resume` is never called?
* What if `block` (passed as `Suspend block`) raises or performs effects?
* What if `block` never returns (an infinite loop)?
* What if `block` takes a very long time to return?
  * The rest of the fibers keep running as usual.
* How many times may `block` be called? At most once? Exactly once? Any number of times?
* In what context is `block` called? In other words, what assumptions may `block` make about the context?
* What happens if the fiber is cancelled during the execution of `block`?
* What happens if the fiber is cancelled before `perform (Suspend ...)`?
* What happens if the fiber is cancelled after `resume` is called, but before the fiber gets a chance to run?
* What happens if `resume` is called concurrently/in parallel multiple times (i.e. the calls overlap in time)?
* Is `resume` parallelism-safe?
  * Does it need to be parallelism-safe? Every fiber (or the lowest unit of concurrency) would instantiate its own resumer. In that case, a resumer is called by only one fiber.
* May `perform (Suspend ...)` raise an exception and, if it does, should that be handled somehow?
* May `resume` raise an exception? Might it perform effects? From which contexts is it safe to call `resume`?
* What happens if `resume` is called before `block` returns? (This could happen either because `block` calls it itself, or because `block` has added `resume` to some public data structure that another domain could access in parallel.)
* What if `resume` is called before `block` returns, and `block` then returns a different result (`Some v`)?
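Several of these questions (repeated calls with different values, overlapping calls from several domains, a `resume` that races with `block` returning) become well-defined if the handler, rather than every library, guards the resumer so that only the first call takes effect. A minimal sketch, assuming OCaml 5's `Atomic`; `at_most_once` is a hypothetical helper, not part of any library:

```ocaml
(* Hypothetical helper (not from any library): wrap [f] so that only the
   first call runs it. Later calls, including overlapping calls from
   other domains, do nothing and return [false]. *)
let at_most_once (f : 'a -> unit) : 'a -> bool =
  let fired = Atomic.make false in
  fun x ->
    (* Exactly one caller can flip [fired] from false to true. *)
    if Atomic.compare_and_set fired false true then (f x; true) else false

let () =
  let received = ref [] in
  let resume = at_most_once (fun v -> received := v :: !received) in
  let first = resume 1 in
  let second = resume 2 in
  Printf.printf "first=%b second=%b received=%d value(s)\n"
    first second (List.length !received)
```

In a `Suspend` handler, the `Some v` return path could go through the same guard, so if `resume` fired first the `Some v` is ignored. Letting the first caller win is one possible design choice, not something the questions above settle.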
---
**Vesa's Comments:**
**Q1:** Line 648: Should it be `then Some ()` instead of `then None`?
**Ans:** Yes, thank you for pointing this out!
**Q:**
I don't see a handler for Suspend using the final type of Suspend in the paper. Is there an example somewhere? Ideally more than an example. Something that is intended for actual "industrial" use.
**Ans:** We can modify our current prototype lazy implementation to use our final type for `Suspend`.
**Q2:** In the `Suspend` handler with cancellation, the handler calls `continue k v` without checking `handle.cancelled`. What is the intention there?
**Ans:** Good catch: we should check `handle.cancelled` before continuing.
* If the task is not cancelled, we continue it.
* If it is cancelled, the `Suspend` handler of our FIFO scheduler can directly call `discontinue k Exit`. (This exception must be handled at some point. In our code example it is handled in the `exnc` branch of `match_with`, after which the scheduler takes the next task from the queue.)
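A minimal sketch of the check described above, for a single-domain FIFO scheduler. Assumptions: the `Suspend`/`resumer` types follow the final type (the resumer returns `bool`), and `handle`, `spawn`, `run` and `run_q` are hypothetical names, not the paper's code. The resumer always enqueues the fiber (so a cancelled fiber is still discontinued) and returns `not h.cancelled`; the flag is checked again when the fiber is dequeued, which also covers a cancellation that arrives after the resumer has returned.

```ocaml
open Effect
open Effect.Deep

(* Final Suspend type: the resumer reports whether the wake-up was
   accepted; [block] may return [Some v] to continue immediately. *)
type 'a resumer = ('a, exn) result -> bool
type _ Effect.t += Suspend : ('a resumer -> 'a option) -> 'a Effect.t

(* Hypothetical per-fiber handle; single domain, so a mutable field
   is enough. *)
type handle = { mutable cancelled : bool }

(* FIFO run queue of ready thunks. *)
let run_q : (unit -> unit) Queue.t = Queue.create ()

let spawn (h : handle) (f : unit -> unit) =
  Queue.push (fun () ->
    match_with f ()
      { retc = (fun () -> ());
        (* A fiber discontinued with [Exit] ends here; the run loop then
           takes the next task. Other exceptions propagate. *)
        exnc = (function Exit -> () | e -> raise e);
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Suspend block ->
              Some (fun (k : (a, unit) continuation) ->
                let resumer (r : (a, exn) result) =
                  (* Always enqueue; [false] tells the waker that this
                     fiber was cancelled and it should try another. *)
                  Queue.push (fun () ->
                    (* Re-check at dequeue time: the fiber may have been
                       cancelled after the resumer returned. *)
                    if h.cancelled then discontinue k Exit
                    else match r with
                      | Ok v -> continue k v
                      | Error e -> discontinue k e) run_q;
                  not h.cancelled
                in
                match block resumer with
                | Some v when not h.cancelled -> continue k v
                | Some _ -> discontinue k Exit
                | None -> ())
          | _ -> None) }) run_q

let rec run () =
  match Queue.take_opt run_q with
  | Some thunk -> thunk (); run ()
  | None -> ()

let () =
  let waiters : int resumer Queue.t = Queue.create () in
  let log = ref [] in
  let wait name () =
    let v = perform (Suspend (fun r -> Queue.push r waiters; None)) in
    log := Printf.sprintf "%s got %d" name v :: !log
  in
  let ha = { cancelled = false } and hb = { cancelled = false }
  and hc = { cancelled = false } in
  spawn ha (wait "a"); spawn hb (wait "b"); spawn hc (wait "c");
  run ();                                  (* all three are now suspended *)
  hb.cancelled <- true;                    (* cancelled before the wake-up *)
  let ok_a = (Queue.pop waiters) (Ok 1) in
  let ok_b = (Queue.pop waiters) (Ok 2) in
  let ok_c = (Queue.pop waiters) (Ok 3) in
  hc.cancelled <- true;                    (* cancelled after the wake-up *)
  run ();
  Printf.printf "%b %b %b [%s]\n" ok_a ok_b ok_c (String.concat "; " !log)
```

Only fiber `a` runs to completion; `b` is rejected by its resumer and `c` is discontinued when the scheduler dequeues it.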
**Q3:** In the final type of Suspend
```
type 'a resumer = ('a, exn) Result.t -> bool
type _ Effect.t += Suspend: ('a resumer -> 'a option) -> 'a Effect.t
```
the `block` function has type `'a resumer -> 'a option`. This seems oddly asymmetric. What if the `block` function would like to raise an exception?
The need to raise an exception can come up, e.g., with an Eio-style mutex, where the mutex may become poisoned.
**Ans:** I am not sure why exactly the `block` function would need to raise an exception. In all the examples we implemented, `block` only pushes the resumer into a list or queue, and I don't see these structures raising an exception when an element is added. The only other operation is an atomic CAS, which does not raise either.
In Eio's `Eio_mutex` implementation as well, `Waiters.await` returns `Error ex`, for which the mutex then raises an exception. This result is the value passed to the resumer itself, which Eio defines as `type 'a enqueue = ('a, exn) result -> unit`.
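The point that failure can travel through the resumer as `Error ex`, instead of being raised by `block`, can be sketched with a toy single-domain scheduler. Assumptions: `Suspend` and `resumer` follow the final type (resumer returning `bool`), and `Poisoned`, `spawn`, `run` and `run_q` are hypothetical names, not Eio's API. The handler turns `Error e` into `discontinue k e`, so the waiting fiber sees the exception at its `perform`.

```ocaml
open Effect
open Effect.Deep

type 'a resumer = ('a, exn) result -> bool
type _ Effect.t += Suspend : ('a resumer -> 'a option) -> 'a Effect.t

(* Hypothetical error delivered to waiters when a lock is poisoned. *)
exception Poisoned

let run_q : (unit -> unit) Queue.t = Queue.create ()

let spawn (f : unit -> unit) =
  Queue.push (fun () ->
    try_with f ()
      { effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Suspend block ->
              Some (fun (k : (a, unit) continuation) ->
                let resumer (r : (a, exn) result) =
                  (* The waker's [Error e] is raised at the fiber's
                     [perform]; [block] itself never has to raise. *)
                  Queue.push (fun () ->
                    match r with
                    | Ok v -> continue k v
                    | Error e -> discontinue k e) run_q;
                  true
                in
                match block resumer with
                | Some v -> continue k v
                | None -> ())
          | _ -> None) }) run_q

let rec run () =
  match Queue.take_opt run_q with
  | Some thunk -> thunk (); run ()
  | None -> ()

let () =
  let waiters : unit resumer Queue.t = Queue.create () in
  let outcome = ref "pending" in
  spawn (fun () ->
    match perform (Suspend (fun r -> Queue.push r waiters; None)) with
    | () -> outcome := "acquired"
    | exception Poisoned -> outcome := "poisoned");
  run ();
  (* The lock holder failed: wake the waiter with an error. *)
  ignore ((Queue.pop waiters) (Error Poisoned));
  run ();
  print_endline !outcome
```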
**Q4:** In the `Suspend` handler with cancellation, the resumer returns `not handle.cancelled`. That is used in `Mutex.unlock` later. What happens if the handle is cancelled after the resumer returns, but before the scheduler continues running the fiber?
**Ans:** As I understand it, once the resumer has executed and returned successfully (i.e. before the cancellation), the fiber has been enqueued on the scheduler's run queue. From that point on, it is the scheduler's responsibility to handle cancellation of the task. In the FIFO scheduler here, once the resumer returns, the handle is already in the scheduler's run queue. If the task is cancelled now, its state changes to `cancelled = true`. When the scheduler dequeues it, it checks this state and discontinues the fiber.
The main point is that after the resumer returns, the task is managed entirely by the scheduler.
(That said, I need more details to understand the intention behind this scenario.)
**Q:** Lazy removal of resumers becomes problematic when there is a need to have the resumer in more than one place, such as in kcas, where it is possible to await changes to any number of locations. Overly lazy removal, with or without cancellation, could cause resumers to accumulate without bound.
**Ans:** Indeed, this is the drawback (a space leak) in the case of cancellation. For that case, we mentioned that the resources the resumer holds can be freed by registering a cancellation handler.
But in cases without cancellation, isn't that the expected behaviour even with the eager strategy?
**Talex5's Comments:**
**Q5:**
```
As a result, in OCaml today, one either needs to choose the Lwt or the Async ecosystem and can only use libraries from that ecosystem
```
My https://github.com/talex5/async-eio-lwt-chimera demo shows Lwt, Eio and Async all working cooperatively within a single domain. This does require a slight modification to async, however.
```
Alas, one cannot build such an application today that utilises both Eio and Domainslib.
```
It's actually quite easy. I've added a bit to the README showing how to do it here: https://github.com/ocaml-multicore/eio/pull/489
```
While we can define a point-wise synchronisation solution that works for the composition of Domainslib and Eio, such a pair-wise solution is unsatisfactory as it cannot accommodate other concurrency libraries
```
This seems to contradict the bit above. It's also not entirely clear why this would be a problem. If I want to call domainslib from eio, why not use a point-wise solution?
**Ans:**
We can always come up with simple point-wise solutions and use them whenever necessary. But the purpose of our paper is to create a generic abstraction for communication between different concurrency libraries. Eio implements promises, but other schedulers do not necessarily implement promises or any kind of channel. As more libraries appear in the future, it will be difficult to maintain many such connections, each with its own communication structures. So we want a unified solution that can be used to communicate between any such libraries (e.g. a FIFO scheduler and Domainslib as well).
OTOH, I am wondering how exactly the solution works in the example given in the [link](https://github.com/ocaml-multicore/eio/pull/489), because `Domainslib.Task.run` does not appear anywhere. When parallelising the Fibonacci computation, we need to use `Domainslib.Task.await`, and I am not sure how that can be used without `Domainslib.Task.run`. Maybe I am missing something?
As I understand it, if we use `Domainslib.Task.run` and call `Task.await` with a `fn` that uses Eio's promise, I would expect an exception there, because `await` would need to handle Eio's effect.
**Q6:**
It would be good if the Eio examples showed valid code. For example, Eio.run does not exist (and never has I believe).
**Ans:**
Thanks for pointing this out. We should have used `Eio_main.run`.
**Q7:**
```
In particular, if the cancelled task was blocked on a synchronisation structure, we require that a matching operation is done on the synchronisation structure that unblocks the task and pushes it into the scheduler queue.
```
How can the structure do that? It doesn't seem to receive any notification that it has been cancelled.
**Ans:**
What we meant is that when tasks are suspended on (say, pushed to) a synchronization structure, e.g. a queue, a particular pop operation needs to be done on that queue. This also means there may be scenarios where such a pop operation is never called.
Only when the corresponding matching operation is called will it pop the resumer. If it finds that the resumer is cancelled (using its return value), it simply moves on to the next resumer in the queue. If no such pop operation is called, the cancelled tasks remain in the queue.
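The matching operation described above can be sketched as a plain queue of resumers, independent of any scheduler. Assumptions: the `resumer` type follows the final `Suspend` type (a cancelled fiber's resumer returns `false`); `waiters`, `signal` and `fake_resumer` are hypothetical names used only for illustration.

```ocaml
type 'a resumer = ('a, exn) result -> bool

(* Hypothetical wait queue: suspended fibers leave their resumers here. *)
type 'a waiters = 'a resumer Queue.t

(* The "matching operation": pop resumers until one accepts the value.
   A resumer returning [false] belongs to a cancelled fiber and is simply
   dropped. Returns [true] if some live fiber was woken. *)
let rec signal (q : 'a waiters) (v : 'a) : bool =
  match Queue.take_opt q with
  | None -> false
  | Some resume -> resume (Ok v) || signal q v

(* Stand-in resumer for the demo: a cancelled one refuses the wake-up,
   a live one records the value it received. *)
let fake_resumer log name ~cancelled : int resumer = fun r ->
  if cancelled then false
  else begin
    (match r with Ok v -> log := (name, v) :: !log | Error _ -> ());
    true
  end

let () =
  let q : int waiters = Queue.create () in
  let log = ref [] in
  Queue.push (fake_resumer log "a" ~cancelled:true) q;
  Queue.push (fake_resumer log "b" ~cancelled:false) q;
  let woke = signal q 7 in
  Printf.printf "woke=%b woken=[%s] remaining=%d\n" woke
    (String.concat "; " (List.map fst !log)) (Queue.length q)
```

The sketch also shows the limit discussed in the answer: a cancelled entry leaves the queue only when some `signal` pops it, so without a matching operation it stays there indefinitely.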
**Q8:**
```
For example, Eio-Lwt bridge cannot take advantage of parallelism since Lwt is not parallelism-safe. Ideally, we would like to run Lwt on one domain and Eio on multiple domains to get the best performance.
```
This is already how lwt_eio works - one domain for Lwt and Eio together, plus any number of Eio-only domains.
**Ans:** I think we should change this statement in our paper. Instead, we should mention that maintaining multiple such connector libraries is difficult. For parallelism, we already mentioned that composing Lwt and Domainslib is useful. Creating yet another Lwt-Domainslib composition library would be fine, but it would again be a point-wise solution.
**Q9:**
`We use the bound of 1 in order to match the behaviour of an MVar.`
Note that Eio streams with capacity > 0 have not been optimised (https://github.com/ocaml-multicore/eio/pull/413). Would be good to mention the version of Eio used in the benchmarks.
**Ans:** For the benchmarks, we used Eio version 0.6.
**For the following questions I need to read a few things. I'll get back to them once I have read them.**
* Resumer to be called at most once
* Suspend with disabled cancellation
**Q:**
In the Suspend handler with cancellation, the resumer function seems to have been written with the assumption that the resumer is called at most once. IOW, it does not protect against the possibility that it might be called more than once. So, this means the responsibility is with the caller.
In my current domain local await interface I put the responsibility of ensuring that resume happens at most once on the domain local await implementation (rather than any libraries using it).
**Q:**
As I mentioned recently, it seems necessary to be able to `Suspend` with cancellation disabled. The need for that comes up when implementing condition variables. After the await for a condition is done, whether due to a signal or cancellation, the associated mutex needs to be re-acquired. The difficulty there is that the mutex might be held by some other owner, so it may be necessary to suspend.
In my blocking kcas-based implementation of `Condition` (and `Mutex`), `Condition.await` is currently implemented (and hasn't been tested yet, so beware) as follows:
```
let await cond mutex =
  let self = Waiters.enqueue cond in
  Mutex.unlock mutex;
  Fun.protect
    ~finally:(fun () -> Mutex.lock mutex ~await:`Protected)
    (fun () -> Waiters.await self ~on_cancel:signal ~the:cond)
```
The key there is that the call to `Mutex.lock` specifies ``~await:`Protected``, which ultimately tells the blocking mechanism that the fiber must not be cancelled and must be woken up after being suspended.
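The guarantee that the `Fun.protect ~finally` pattern in `Condition.await` relies on can be checked with a stdlib-only sketch. It swaps in OCaml 5's `Stdlib.Mutex` for the kcas-based `Mutex` and uses a hypothetical `Cancelled` exception to stand in for cancellation. It only shows that the mutex is re-acquired on both exit paths; it does not model the `` `Protected `` suspension itself, which is the actual difficulty raised above.

```ocaml
(* Hypothetical stand-in for a wait that is interrupted by cancellation. *)
exception Cancelled

(* Release [m] while waiting, then re-acquire it on every exit path:
   normal return, or an exception such as [Cancelled]. *)
let await_then_relock (m : Mutex.t) (wait : unit -> 'a) : 'a =
  Mutex.unlock m;
  Fun.protect ~finally:(fun () -> Mutex.lock m) wait

let () =
  let m = Mutex.create () in
  Mutex.lock m;
  let result =
    match await_then_relock m (fun () -> raise Cancelled) with
    | () -> "returned"
    | exception Cancelled -> "cancelled"
  in
  (* [try_lock] fails because this thread holds [m] again. *)
  Printf.printf "%s, relocked=%b\n" result (not (Mutex.try_lock m));
  Mutex.unlock m
```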
**Q10:**
I think it would also be worth contrasting this proposed Suspend effect with Eio's one:
```
type 'a resumer = ('a, exn) Result.t -> bool
type _ Effect.t += Suspend : ('a resumer -> 'a option) -> 'a Effect.t
```
vs
```
type 'a enqueue = ('a, exn) result -> unit
type _ Effect.t += Suspend : (Cancel.fiber_context -> 'a enqueue -> unit) -> 'a Effect.t
```
Returning an option type to allow continuing immediately (rather than enqueuing) looks like a useful improvement, avoiding the need for a separate effect (and with no extra allocation in the common None case).
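To illustrate the fast path, here is a sketch of a `block` for a hypothetical cooperative lock: it returns `Some ()` when the lock is free, so the handler can continue at once without enqueuing anything, and registers the resumer only otherwise. In a handler it would be used as `perform (Suspend (acquire_block l))`; here it is called directly with stand-in resumers. Assumptions: a single domain (nothing can run between the check and the push, so no wake-up is lost), and `lock`, `acquire_block` and `release` are illustrative names, not Eio's or kcas's API.

```ocaml
type 'a resumer = ('a, exn) result -> bool

(* Hypothetical cooperative lock for a single domain. Between the [held]
   check and the push nothing else runs, so no wake-up can be lost. *)
type lock = { mutable held : bool; waiters : unit resumer Queue.t }

let create () = { held = false; waiters = Queue.create () }

(* The [block] argument for [Suspend]: [Some ()] means the lock was free
   and is now ours, so the handler can continue immediately; [None]
   means the resumer was stored and the fiber stays suspended. *)
let acquire_block (l : lock) : unit resumer -> unit option =
  fun resume ->
    if not l.held then begin l.held <- true; Some () end
    else begin Queue.push resume l.waiters; None end

(* Hand the lock to the first waiter that accepts the wake-up; waiters
   whose resumer returns [false] (cancelled) are skipped. *)
let rec release (l : lock) =
  match Queue.take_opt l.waiters with
  | None -> l.held <- false
  | Some resume -> if not (resume (Ok ())) then release l

let () =
  let l = create () in
  let woken = ref false in
  (* Uncontended: fast path, nothing is enqueued. *)
  let fast = acquire_block l (fun _ -> true) in
  (* Contended: the stand-in resumer is stored for later. *)
  let slow = acquire_block l (fun _ -> woken := true; true) in
  release l;
  Printf.printf "fast=%b slow_suspended=%b woken=%b held=%b\n"
    (fast = Some ()) (slow = None) !woken l.held
```

On release the lock is handed directly to the woken waiter (`held` stays `true`), so a newcomer cannot take it between the wake-up and the waiter running.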
But I think cancellation needs some more work. It would be nice to have an API that doesn't depend on the full fiber_context type, though.
**Ans:** TODO