## Goal
The goal is to create a simple `MPI` program that represents nebra's distributed prover workload. For simplicity, the program only needs to run on a single machine rather than on a distributed cluster.
## Requirement
- must be implemented using Rust
- all inter-process communication needs to be implemented using `MPI` rather than shared memory
- allow the user to control how many processes are spawned: there will be 1 `coordinator` process and `n` `worker` processes. Your program should work for at least `n = 1` to `n = 64`.
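
One way to read the process layout above is that the MPI world has `n + 1` ranks in total. The sketch below assumes (this is not stated in the spec) that rank `0` acts as the `coordinator` and rank `i` acts as `worker-i`. The names `Role` and `role_for_rank` are hypothetical, and the rank and world size would normally come from the MPI runtime rather than from a loop.

```rust
/// Role of a process, derived from its rank in the MPI world.
#[derive(Debug, PartialEq, Eq)]
enum Role {
    Coordinator,
    /// Worker with its 1-based id (`worker-1` .. `worker-n`).
    Worker(u32),
}

/// Maps a rank to a role, assuming rank 0 is the coordinator.
/// Returns `None` if the world has no room for a worker or the rank is out of range.
fn role_for_rank(rank: u32, world_size: u32) -> Option<Role> {
    if world_size < 2 || rank >= world_size {
        return None;
    }
    Some(if rank == 0 { Role::Coordinator } else { Role::Worker(rank) })
}

fn main() {
    let n = 4; // number of workers requested by the user
    for rank in 0..=n {
        println!("rank {}: {:?}", rank, role_for_rank(rank, n + 1));
    }
}
```

With this mapping, asking for `n` workers means launching `n + 1` processes.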
## Specification
### Input
```javascript
{
input:
[[1, 2, 3, 4, 5, 6, 7, 8],
[3, 8, 5, 9, 1, 2, 0, 4],
...
]
}
```
The input to your program will be a 2-D array of `uint32` elements, as shown above. Each row has width `8`. Assume that the length (the number of rows) is greater than `64`.
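
As a rough illustration of reading this input without hardcoding it, the sketch below pulls every unsigned integer out of the file text and groups them into rows of `8`. It relies on two assumptions that the spec does not fix: the file contains no digits other than the array elements, and each `uint32` is serialized as 4 little-endian bytes before hashing. `parse_rows` and `row_to_bytes` are hypothetical helper names; a stricter implementation would likely use a real JSON parser.

```rust
/// Number of `u32` elements per input row, as stated in the spec.
const ROW_WIDTH: usize = 8;

/// Extracts every unsigned integer from `text` and groups them into rows of 8.
/// Assumption: the text contains no digits other than the array elements.
fn parse_rows(text: &str) -> Result<Vec<[u32; ROW_WIDTH]>, String> {
    let mut values = Vec::new();
    for token in text.split(|c: char| !c.is_ascii_digit()) {
        if token.is_empty() {
            continue;
        }
        // Values that do not fit in a u32 are reported as errors.
        let value = token
            .parse::<u32>()
            .map_err(|e| format!("bad element {:?}: {}", token, e))?;
        values.push(value);
    }
    if values.len() % ROW_WIDTH != 0 {
        return Err(format!("{} elements is not a multiple of {}", values.len(), ROW_WIDTH));
    }
    Ok(values
        .chunks_exact(ROW_WIDTH)
        .map(|chunk| {
            let mut row = [0u32; ROW_WIDTH];
            row.copy_from_slice(chunk);
            row
        })
        .collect())
}

/// Serializes one row to 32 bytes.
/// Assumption: little-endian byte order; the spec does not specify one.
fn row_to_bytes(row: &[u32; ROW_WIDTH]) -> Vec<u8> {
    row.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn main() {
    // Read the path from argv so the input is not hardcoded; fall back to a sample.
    let text = match std::env::args().nth(1) {
        Some(path) => std::fs::read_to_string(path).expect("cannot read input file"),
        None => "{ input: [[1, 2, 3, 4, 5, 6, 7, 8], [3, 8, 5, 9, 1, 2, 0, 4]] }".to_string(),
    };
    let rows = parse_rows(&text).expect("malformed input");
    println!("{} rows, first row as bytes: {:?}", rows.len(), rows.first().map(row_to_bytes));
}
```

Whatever byte order is chosen, it needs to be documented, because it changes every digest downstream.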
### Algorithm
There are two roles in the system: `worker` and `coordinator`. Suppose there are `n` `worker`s and 1 `coordinator`. Workers are designated by `worker-1`, ..., `worker-n`.
1. The `coordinator` sends `input[0]` to `worker-1`, `input[1]` to `worker-2`, etc.
2. Round 1
a. Each worker computes a `keccak256` digest of its input, and then repeatedly applies `keccak256` again to the result, i.e. `keccak(keccak(... keccak(input)))`. In total, `keccak256` should be applied `64` times.
b. Each worker sends its result to the `coordinator`.
c. `coordinator` concatenates the results of all workers, ordered by worker id, i.e.
`[worker-1-result, worker-2-result, ..., worker-n-result]` and applies `keccak256` to the resulting data.
3. Round 2
a. `coordinator` broadcasts the result of round 1 (a 256-bit digest) to each worker. Each worker appends its own worker id (i.e. from `1` to `n`) to the digest, then applies `keccak256` `64` times. The result is again returned to the `coordinator`.
b. `coordinator` concatenates the results of all workers, ordered by worker id, i.e.
`[worker-1-result, worker-2-result, ..., worker-n-result]` and applies `keccak256` to the resulting data.
4. `coordinator` outputs the final result
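
The steps above can be modeled in a single process to obtain a reference result for checking an MPI implementation. The sketch below is not an MPI program: the send, gather, and broadcast steps are replaced by plain function calls, and the hash is passed in as a parameter. In a real solution the hash would be `keccak256` from a crate such as `sha3` or `tiny-keccak`; `toy_hash` here is only a placeholder so that the example runs without dependencies. Further assumptions not fixed by the spec: only the first `n` rows are used (one per worker), elements are little-endian, and the worker id is appended as a 4-byte little-endian `u32`. All function names are hypothetical.

```rust
/// 32-byte digest, the output size of keccak256.
type Digest = [u8; 32];

/// Applies `hash` `times` times in a chain: hash(hash(... hash(data))).
fn iterate_hash(hash: &dyn Fn(&[u8]) -> Digest, data: &[u8], times: usize) -> Digest {
    assert!(times >= 1);
    let mut digest = hash(data);
    for _ in 1..times {
        digest = hash(&digest);
    }
    digest
}

/// Coordinator step: concatenate worker results in worker-id order, then hash once.
fn combine(hash: &dyn Fn(&[u8]) -> Digest, results: &[Digest]) -> Digest {
    hash(&results.concat())
}

/// Single-process model of both rounds; `rows[i]` is the input of worker-(i+1).
fn run_protocol(hash: &dyn Fn(&[u8]) -> Digest, rows: &[[u32; 8]]) -> Digest {
    // Round 1: each worker hashes its own row 64 times.
    let round1: Vec<Digest> = rows
        .iter()
        .map(|row| {
            // Assumed encoding: each u32 as 4 little-endian bytes.
            let bytes: Vec<u8> = row.iter().flat_map(|v| v.to_le_bytes()).collect();
            iterate_hash(hash, &bytes, 64)
        })
        .collect();
    let round1_digest = combine(hash, &round1);

    // Round 2: each worker appends its 1-based id to the broadcast digest.
    let round2: Vec<Digest> = (1..=rows.len() as u32)
        .map(|id| {
            let mut message = round1_digest.to_vec();
            message.extend_from_slice(&id.to_le_bytes()); // assumed 4-byte LE id
            iterate_hash(hash, &message, 64)
        })
        .collect();
    combine(hash, &round2)
}

/// Placeholder hash so the sketch runs without dependencies. This is NOT keccak256.
fn toy_hash(data: &[u8]) -> Digest {
    let mut out = [0u8; 32];
    for (i, byte) in data.iter().enumerate() {
        out[i % 32] = out[i % 32].wrapping_mul(31).wrapping_add(*byte).wrapping_add(i as u8);
    }
    out
}

fn main() {
    // Two workers, each receiving one row.
    let rows = [[1, 2, 3, 4, 5, 6, 7, 8], [3, 8, 5, 9, 1, 2, 0, 4]];
    let result = run_protocol(&toy_hash, &rows);
    let hex: String = result.iter().map(|b| format!("{:02x}", b)).collect();
    println!("{}", hex);
}
```

In an MPI version, each `map` over workers would run in a separate process, and the two `combine` calls would follow a gather of the workers' digests at the `coordinator`.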
## Outcome
1. the source code of the program
2. instructions to run the program (it should be testable on either macOS or Ubuntu)
3. the input should not be hardcoded so that we can test with our own input
4. (Bonus) profile the performance of this multi-process program and report any performance insights.