A non-interactive PoRep is like doing an interactive PoRep about 12 times in a row. The first phase is called synthesis, where the circuit is generated. Due to the implementation, it is a memory- and CPU-intensive operation. The second phase is the actual proving. That is where the SupraSeal improvement applies, and it is GPU-heavy.
Synthesis can run in parallel, while proving is done sequentially (one proof at a time on a GPU). Currently, the synthesis of 10 proofs and the proving of 10 proofs take about the same wall-clock time.
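The shape of this two-phase flow can be sketched as follows. This is only an illustration of the scheduling described above, not the bellperson or SupraSeal API: `Circuit`, `Proof`, `synthesize`, `prove_on_gpu` and `prove_batch` are hypothetical stand-ins, and the real work is replaced by trivial placeholders.

```rust
use std::thread;

// Hypothetical stand-ins for illustration; these are not bellperson types.
struct Circuit {
    id: usize,
}

struct Proof {
    circuit_id: usize,
}

// Placeholder for the memory- and CPU-intensive synthesis step.
fn synthesize(id: usize) -> Circuit {
    Circuit { id }
}

// Placeholder for the GPU-heavy proving step.
fn prove_on_gpu(circuit: &Circuit) -> Proof {
    Proof { circuit_id: circuit.id }
}

// Synthesizes all circuits of a batch in parallel, then proves them one by one.
fn prove_batch(num_proofs: usize) -> Vec<Proof> {
    // Phase 1: one thread per circuit, all running at the same time.
    let circuits: Vec<Circuit> = thread::scope(|s| {
        let handles: Vec<_> = (0..num_proofs)
            .map(|id| s.spawn(move || synthesize(id)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("synthesis thread panicked"))
            .collect()
    });

    // Phase 2: sequential proving, one proof at a time.
    circuits.iter().map(prove_on_gpu).collect()
}
```

Calling `prove_batch(10)` would model one batch of 10 proofs. In this sketch every circuit of a batch is held in memory before proving starts, which is why the RAM needed per synthesis limits how many can run in parallel.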
The possible performance improvements are ordered from the ones I think would help the most to the ones I think would help the least.
Filecoin-specific circuit synthesis
Currently, bellperson is a general-purpose library for proving. Filecoin, however, uses specific circuits, which cannot be changed without a major upgrade. Hence one possible optimization is to optimize the generation of those Filecoin-specific circuits. They depend on the input, but it is likely that large parts are independent of the input. Even if not, there is likely a way to synthesize those specific circuits in an optimized way that takes less RAM and/or CPU time.
Reducing the RAM usage would make it possible to run the synthesis of more proofs in parallel. If the CPU time could also be reduced, we might be able to do the synthesis for all proofs required for the NI-PoRep in a single run. That would almost halve the total run time.
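The "almost halve" estimate can be checked with a small back-of-the-envelope model. It is a sketch under stated assumptions, not a measurement: the function names are made up for illustration, times are in arbitrary units, batches are assumed to run strictly one after another today, and the single synthesis run is optimistically assumed to take as long as one batch's synthesis does now.

```rust
// Wall time when each batch is fully synthesized and then proven
// before the next batch starts (assumed model of the current flow).
fn batched_wall_time(batches: u32, synth_per_batch: f64, prove_per_batch: f64) -> f64 {
    f64::from(batches) * (synth_per_batch + prove_per_batch)
}

// Wall time when all circuits are synthesized in one parallel run,
// followed by sequential proving of every batch.
fn single_synthesis_wall_time(batches: u32, synth_once: f64, prove_per_batch: f64) -> f64 {
    synth_once + f64::from(batches) * prove_per_batch
}
```

With 12 batches and synthesis and proving each taking 1 unit per batch, `batched_wall_time(12, 1.0, 1.0)` gives 24 units and `single_synthesis_wall_time(12, 1.0, 1.0)` gives 13 units, a little over half. If the single synthesis run takes longer than one batch's synthesis does today, the saving shrinks accordingly.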