Our rebuttal is organized by topic, with topics of (arguably) broader interest to all reviewers appearing first. Originating commenters are denoted by R1 for Reviewer 1, etc.
# IMM versus other centrality-based methods? [R2]
The original work proposing IM establishes a _direct_ mathematical model of influence spread _as an optimal control problem_, where one seeks vertices that maximize diffusion, with algorithms having _provable guarantees_ on optimality [[1](https://dl.acm.org/doi/10.1145/956750.956769),[22](https://ieeexplore.ieee.org/document/8295265)]. That work compares directly against heuristics that select nodes via "classical" centrality metrics, showing IMM's efficacy, with potential applications in viral marketing, controlling the spread of ideas, and vaccination policy, among others. Nevertheless, IMM's true utility _at scale_ is an open question due to its high computational costs. Our method _enables_ IMM to run at larger scales, meaning network scientists can now do _new_ studies for further validation and use.
(Recall deep learning was "rarely used" until machines and data enabled it!)
# FA-BSP (+HClib) versus BSP? MPI+OpenMP and MPI-3? [R1,R4]
A BSP superstep includes a local computation followed by (a)synchronous communication and concludes with a global barrier. The FA-BSP model enables ***chainable*** asynchronous point-to-point communication during the local computation, which is suitable for writing distributed asynchronous graph algorithms like IMM and typically requires **fewer** supersteps than the BSP model. For example, a [new ISC'24 paper on FA-BSP](https://ieeexplore.ieee.org/document/10528922) shows that an FA-BSP version of Jaccard Similarity needs just two supersteps compared to a state-of-the-art BSP implementation requiring up to 50k supersteps, a significantly higher degree of global synchronization.
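To make the distinction concrete, here is a minimal, single-process C++ sketch of the FA-BSP superstep structure (it does **not** use HClib's actual API; the `Msg`/`mailbox` names and the draining loop are our illustrative stand-ins for the runtime's message handling and termination detection). The point is that a handler may itself send further messages ("chaining") while the superstep is still in flight, and the global barrier occurs only once per superstep.

```cpp
// Illustration only: chained asynchronous point-to-point messages inside a
// single FA-BSP superstep, simulated in one process with per-PE mailboxes.
#include <cstdio>
#include <queue>
#include <vector>

struct Msg { int hops_left; };  // toy payload: how many more hops to forward

int main() {
    const int num_pes = 4;
    std::vector<std::queue<Msg>> mailbox(num_pes);

    // "Local computation" seeds a few asynchronous messages.
    mailbox[1].push({3});
    mailbox[2].push({2});

    // One FA-BSP superstep: keep servicing messages until every mailbox is
    // empty; a handler may push new messages (the "chain") without waiting
    // for any intermediate barrier.
    bool active = true;
    while (active) {
        active = false;
        for (int pe = 0; pe < num_pes; ++pe) {
            while (!mailbox[pe].empty()) {
                Msg m = mailbox[pe].front();
                mailbox[pe].pop();
                std::printf("PE %d handled a message, %d hops left\n", pe, m.hops_left);
                if (m.hops_left > 0)  // chain: forward asynchronously to the next PE
                    mailbox[(pe + 1) % num_pes].push({m.hops_left - 1});
                active = true;
            }
        }
    }
    std::printf("global barrier: superstep complete\n");
    return 0;
}
```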
FA-BSP can be implemented using nonblocking MPI-3 one-sided primitives. However, _the original work_ on FA-BSP already showed that this strategy is inefficient due to MPI-3's handling of many small messages with scattered destinations [[12](https://www.sciencedirect.com/science/article/pii/S1877750323000741)]. HClib was shown to be better, so we adopted that approach.
# Separating improvements from the algorithm versus the runtime? [R1]
The two SelectSeeds kernels are built on the same HClib runtime, so the algorithmic improvement can be separated from the runtime. Figure 10 suggests that our 2D algorithm is 11-57x faster than the baseline 1D algorithm when the runtime (HClib) is "fixed."
GenerateRR is harder to analyze with our data. The algorithmic strategy _is_ FA-BSP and, therefore, requires an FA-BSP implementation. To do a runtime-to-runtime comparison, we would need to replicate our entire algorithm in other runtimes, which was not our focus given the previously established results [[12](https://www.sciencedirect.com/science/article/pii/S1877750323000741)].
# Other multi-GPU implementations? [R4]
[Göktürk and Kaya (2022)](https://ieeexplore.ieee.org/document/9820685) merits discussion but is hard to compare to directly.
First, GK22 **estimates** the output of IMM in a way that _invalidates_ IMM's approximation guarantees. So, GK22's solutions for `Friendster` may differ significantly from an "exact" IMM.
Second, GK22 is multi-GPU but _not_ fully distributed. The GPUs live in a single shared-memory server with 1TB RAM to hold the graph, and each GPU must sample the portions of the graph it needs from the server's RAM.
Third, GK22 uses a simpler diffusion model than what we and Ripples have considered (the LT or "linear threshold" model), which applies only to unweighted graphs.
A multi-GPU version of our work is a natural next step. However, our focus for this "first paper" is to validate a fully distributed-memory algorithm, so we left a GPU version as future work.
# Cache misses for MPI+X? [R1]
Our analysis of cache misses is intended to confirm a _specific hypothesis_ about _our_ implementation, where superlinear behavior occurred under strong scaling. We did _not_ intend to compare cache misses between our algorithms and the baseline, so we did not instrument the baseline accordingly. (It's certainly possible for the final paper.)
# Choices of parameters and results to present? [R1,R4,R5]
We performed many more experiments than shown in the paper. To save space, we show representative results (i.e., not cherry-picked to look good). All raw data will be made available as part of the public release. Below, we remark on specific choices raised by different reviewers.
## Different datasets for System A versus B? [R1]
We tried all input problems on both systems. For space reasons, we presented the larger problem instances on System B, which is newer and has more nodes and cores.
## Experiments with different epsilon values? [R1,R4]
We chose epsilon = 0.13 to mimic the parameters published with the Ripples baseline. We did try other values and observed similar speedup and scalability trends.
## Best configuration for MPI+X? [R5]
For the largest problem instances, the best results were achieved on System A with 24 threads per MPI rank, and on System B, 8 to 16 threads per MPI rank. However, we measured many other configurations and picked the best options to make the baseline results strong.
The time unit in Table IV and all figures is seconds.
# More details about FA-BSP GenerateRR? [R2]
**What is the level?** "Level" refers to the iteration number at which a given vertex is added to the random-walk frontier. Our proposed level-synchronous random-walk algorithm uses a global barrier to wait for all the independent random walks to finish their one-hop operations in the current iteration before proceeding to the next. The barrier is needed to check a termination condition and should **not** be confused with the BSP style, where no communication occurs within a superstep.
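As a minimal sketch of that loop structure only (a toy, single-process setting; the graph, the coin flip, and the `Hop` struct are illustrative assumptions, and the comment marks where the distributed code performs the barrier and termination check):

```cpp
// Level-synchronous random walks, sketched in one process: every active walk
// advances exactly one hop per level, and the frontier is checked between levels.
#include <cstdlib>
#include <vector>

struct Hop { int vertex; int walk_id; int level; };

int main() {
    std::srand(42);
    // Toy adjacency list standing in for one processor's graph partition.
    std::vector<std::vector<int>> adj = {{1, 2}, {2, 3}, {3}, {}};

    std::vector<Hop> frontier = {{0, /*walk*/ 0, /*level*/ 0}, {1, 1, 0}};
    while (!frontier.empty()) {          // termination test, checked after each level
        std::vector<Hop> next;
        for (const Hop& h : frontier) {  // one-hop operation for every active walk
            const auto& nbrs = adj[h.vertex];
            if (!nbrs.empty() && (std::rand() % 2))  // coin flip stands in for edge sampling
                next.push_back({nbrs[std::rand() % static_cast<int>(nbrs.size())],
                                h.walk_id, h.level + 1});
        }
        // <-- in the distributed code, the global barrier + reduction happen here
        frontier.swap(next);
    }
    return 0;
}
```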
**Elaborate on line 12 of Algorithm 3?** The `l` in `R_l` refers to the part of the implicit RRR set collection stored **locally** on that processor.
The output of `F.pop()` is a tuple `(U, i, j)`, which means vertex `U` of the input graph is part of the random walk / RRR set `i` at level / iteration `j`.
Denoting the popped tuple by `T`, `R_l[T[0]]` (i.e., `R_l[U]`) stores all the random walks / RRR sets that vertex `U` is part of.
In line 12, we store the information that vertex `U` is part of RRR set `T[1]` (i.e., `i`).
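A hedged sketch of this bookkeeping (the names `R_l`, `F`, `U`, `i`, `j` follow the text above; the container choices are illustrative, not the paper's):

```cpp
// Pop (U, i, j) from the local frontier and record that vertex U belongs to
// RRR set i, i.e., the update described for line 12 of Algorithm 3.
#include <queue>
#include <tuple>
#include <unordered_map>
#include <unordered_set>

using Vertex = int;
using RRRSetId = int;

int main() {
    // R_l: the locally stored slice of the implicit RRR-set collection.
    std::unordered_map<Vertex, std::unordered_set<RRRSetId>> R_l;
    // F: the local portion of the distributed frontier of (vertex, RRR set, level) tuples.
    std::queue<std::tuple<Vertex, RRRSetId, int>> F;
    F.push({/*U=*/7, /*i=*/3, /*j=*/2});

    while (!F.empty()) {
        auto T = F.front();  // T = (U, i, j)
        F.pop();
        R_l[std::get<0>(T)].insert(std::get<1>(T));  // line 12: add i to R_l[U]
    }
    return 0;
}
```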
**On line 17, why push a neighbor to level PE(V)?** The `AsyncPush(V, F, PE(V))` operation creates a similar tuple `(V, i, j+1)` and adds it to the distributed frontier `F` on processor PE(V).
The global RRR set collection is distributed among processors using the same distribution as the graph nodes. Hence, the `R_l` of processor PE(V) is responsible for storing all the RRR sets visited by vertex `V`.
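A small sketch of the intent of `AsyncPush(V, F, PE(V))` (this is **not** HClib's API; the modulo owner function and per-PE queues are illustrative assumptions standing in for the graph's actual distribution):

```cpp
// Build the tuple (V, i, j+1) and hand it to the processor that owns V, so
// that owner's R_l can later record that V joined RRR set i.
#include <queue>
#include <tuple>
#include <vector>

using Vertex = int;
using Tuple = std::tuple<Vertex, int /*RRR set i*/, int /*level j*/>;

int main() {
    const int num_pes = 4;
    std::vector<std::queue<Tuple>> F(num_pes);        // per-PE portion of the frontier
    auto PE = [&](Vertex v) { return v % num_pes; };  // assumed owner function

    // Line 17: while processing (U, i, j), push neighbor V at the next level.
    Vertex V = 10;
    int i = 3, j = 2;
    F[PE(V)].push({V, i, j + 1});  // AsyncPush: enqueue on V's owner, one level deeper
    return 0;
}
```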