---
tags: ellipticresearch, filecoin, PoRep
---

Possible non-interactive PoRep performance improvements
=======================================================

Non-interactive PoRep is like doing an interactive PoRep about 12 times in a row. The first phase is called synthesis, where the circuit is generated. Due to the implementation, it is a memory- and CPU-intensive operation. Then the actual proving happens. There we have the SupraSeal improvement, which is GPU heavy. The synthesis can be run in parallel, whereas the proving is done sequentially (one proof at a time on a GPU). Currently the synthesis and the proving of 10 proofs take about the same (wall) time.

The possible performance improvements are sorted from the ones I think would help the most to the ones that would help less.

Filecoin specific circuit synthesis
-----------------------------------

Currently `bellperson` is a general purpose library for proving. For Filecoin, though, we use specific circuits, which also cannot be changed without a major upgrade. Hence one possible optimization could be to optimize the generation of the Filecoin specific circuits. They depend on the input, but likely there are large parts that are independent of the input. Even if not, there's likely a way to synthesize those specific circuits in an optimized way that takes less RAM and/or CPU time.

Reducing the RAM would make it possible to run the synthesis of more proofs in parallel. If the CPU time could also be reduced, we might be able to do the synthesis for all proofs required for the NI-PoRep in a single run. That would almost halve the total run-time.

Reducing the memory usage during synthesis
------------------------------------------

The result of the synthesis is an array of field elements. Most of the field elements are either zero, one or two. Instead of storing the full element (256-bit) in memory, just store a placeholder.
The result would then be an array of an enum that looks like this:

```rust
enum Element {
    Zero,
    One,
    Two,
    Other,
}
```

For `Other`, a separate array would store the actual values.

This intermediate, optimized representation would have the advantage that more circuits could be synthesized in parallel, as the RAM usage is currently the limiting factor. Right before the proving (which is sequential), it would then be expanded to the representation the GPU needs, which is having all elements within a contiguous memory range.

A possible further optimization could be to make SupraSeal understand that intermediate representation, and then maybe the GPU kernel as well. This might make the proving even faster (though also maybe slower, I'm unsure).

Interleaving the synthesis and the proving
------------------------------------------

At the moment a batch of circuits is synthesized, then proven, and then comes the next batch. Instead one could synthesize the first batch and, while that batch is being proven, already start to synthesize the second batch. This could work as the synthesis is CPU and memory heavy, while the proving is mostly GPU heavy. The catch here is the "mostly". We still have some CPU/memory spikes during proving, so proper coordination would be needed, so that neither the synthesis nor the proving gets stalled.
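
The interleaving could be sketched roughly as below, using only the Rust standard library. Everything here is a hypothetical stand-in and not the `bellperson` or SupraSeal API: `SynthesizedBatch`, `synthesize`, `prove` and `run_pipeline` are made-up names, and the "proof" is just a sum. The point is only the coordination: a rendezvous channel lets the synthesis thread work on the next batch while the current one is being proven, and keeps at most two batches in memory at a time. The CPU/memory spikes during proving mentioned above are not modelled.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a batch of synthesized circuits.
pub struct SynthesizedBatch {
    pub index: usize,
    pub assignments: Vec<u64>,
}

/// Placeholder for the CPU- and memory-heavy synthesis step (assumption).
fn synthesize(index: usize, circuits_per_batch: usize) -> SynthesizedBatch {
    let assignments = (0..circuits_per_batch as u64)
        .map(|i| index as u64 * 1000 + i)
        .collect();
    SynthesizedBatch { index, assignments }
}

/// Placeholder for the GPU-heavy proving step; returns a fake "proof".
fn prove(batch: &SynthesizedBatch) -> u64 {
    batch.assignments.iter().sum()
}

/// Synthesizes batch `n + 1` on a separate thread while batch `n` is proven.
pub fn run_pipeline(num_batches: usize, circuits_per_batch: usize) -> Vec<(usize, u64)> {
    // Capacity 0 makes this a rendezvous channel: `send` blocks until the
    // prover is ready to take the batch, so synthesis can never run more
    // than one batch ahead of proving.
    let (tx, rx) = sync_channel::<SynthesizedBatch>(0);

    let producer = thread::spawn(move || {
        for index in 0..num_batches {
            let batch = synthesize(index, circuits_per_batch);
            if tx.send(batch).is_err() {
                break; // The proving side went away; stop synthesizing.
            }
        }
        // `tx` is dropped here, which ends the receiving loop below.
    });

    // Proving stays sequential, one batch at a time, as on a single GPU.
    let mut proofs = Vec::with_capacity(num_batches);
    for batch in rx {
        proofs.push((batch.index, prove(&batch)));
    }

    producer.join().expect("synthesis thread panicked");
    proofs
}
```

Calling `run_pipeline(3, 4)` proves three batches of four placeholder circuits each and returns the results in batch order. A larger channel capacity would let synthesis run further ahead at the cost of more RAM, which is the trade-off that would need tuning in practice.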