# mopro 22 milestone 3 report
## Related PR
- [GPU-wise experiments PR](https://github.com/zkmopro/mopro/pull/188)
- [CPU-wise experiments PR](https://github.com/zkmopro/mopro/pull/187)
## Improvement Strategies
* [Naive Metal MSM](https://github.com/zkmopro/mopro/pull/155) | [Notes](https://hackmd.io/@moven0831/Sykrqk2UC)
* [Bucket-wise MSM](https://github.com/zkmopro/mopro/pull/188)
* [Precomputation](https://github.com/ingonyama-zk/icicle/blob/8e62bde16d538802ecc288a9a8a7e505bff5c23f/docs/docs/icicle/rust-bindings/msm-pre-computation.md#L4)
### Bucket-wise MSM
To extract as much parallelism as possible, we changed the accumulation phase from a window-wise to a bucket-wise fashion. This requires extra preparation steps before the accumulation.
#### 1. Prepare bucket indices
First, we have to prepare the indices for the buckets and the points to be added to them.
Pseudocode of our implementation in MSL:
```
// instance-wise parallel: one thread per scalar
[[kernel]] prepare_buckets_indices {
    u256 this_scalar = k_buff[thread_id];
    // loop through each window-size-bit fragment of the scalar
    for each scalar fragment in this_scalar:
        // record the corresponding bucket index and point index
        buckets_indices[thread_id * num_windows + frag_idx] = (bucket_idx, point_idx)
}
```
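For reference, here is a minimal host-side Rust sketch of the same computation. The function name, the little-endian limb layout, and the convention that window $w$'s buckets occupy the index range $[w \cdot (2^c - 1), (w + 1) \cdot (2^c - 1))$ are our illustrative assumptions; zero fragments are skipped since they touch no bucket.
```rust
/// Hypothetical host-side reference of the kernel's work: slice each scalar
/// (little-endian 64-bit limbs) into c-bit fragments and emit one
/// (bucket_idx, point_idx) pair per non-zero fragment.
fn prepare_bucket_indices(scalars: &[[u64; 4]], c: usize, num_windows: usize) -> Vec<(u32, u32)> {
    let mut pairs = Vec::with_capacity(scalars.len() * num_windows);
    for (point_idx, limbs) in scalars.iter().enumerate() {
        for w in 0..num_windows {
            let bit = w * c;
            let (limb, offset) = (bit / 64, bit % 64);
            // read c bits starting at `bit` (the fragment may straddle a limb boundary)
            let mut frag = limbs[limb] >> offset;
            if offset + c > 64 && limb + 1 < limbs.len() {
                frag |= limbs[limb + 1] << (64 - offset);
            }
            frag &= (1u64 << c) - 1;
            if frag != 0 {
                // window w owns the bucket range [w * (2^c - 1), (w + 1) * (2^c - 1))
                let bucket_idx = w as u64 * ((1u64 << c) - 1) + (frag - 1);
                pairs.push((bucket_idx as u32, point_idx as u32));
            }
        }
    }
    pairs
}
```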
#### 2. Sort bucket indices
Then, we sort these (bucket index, point index) pairs in ascending order of bucket index. As described in [this blocker](https://hackmd.io/v8WPAG8-RsCANrbv_seSXw?both#Unable-to-use-built-in-sort-function-in-metal), we use a parallel sorting library in Rust as an alternative for now.
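A minimal sketch of that CPU fallback, using Rayon's parallel sort as an illustrative choice of library:
```rust
use rayon::slice::ParallelSliceMut;

/// Sort the (bucket_idx, point_idx) pairs read back from the GPU so that each
/// bucket's entries form one contiguous range.
fn sort_bucket_indices(pairs: &mut [(u32, u32)]) {
    pairs.par_sort_unstable_by_key(|&(bucket_idx, _)| bucket_idx);
}
```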
#### 3. Perform accumulation in a bucket-wise fashion
Finally, we launch one thread per bucket to perform the accumulation.
Pseudocode of our implementation in MSL:
```
// bucket-wise parallel, i.e. one thread per bucket
[[kernel]] bucket_wise_accumulation {
    // get this bucket's accumulation range from the sorted bucket_indices
    let (start_idx, end_idx) = get_accumulation_range(bucket_indices)
    // accumulate every point that falls into this bucket
    for each idx from start_idx to end_idx:
        buckets[thread_id] += points[idx]
}
```
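On the host side, the lookup that `get_accumulation_range` stands for reduces to a binary search over the sorted pairs; a sketch with an illustrative helper name:
```rust
/// Hypothetical CPU analogue of get_accumulation_range: with the pairs sorted by
/// bucket index, locate the contiguous slice of entries belonging to `bucket_idx`.
fn accumulation_range(sorted_pairs: &[(u32, u32)], bucket_idx: u32) -> (usize, usize) {
    let start = sorted_pairs.partition_point(|&(b, _)| b < bucket_idx);
    let end = sorted_pairs.partition_point(|&(b, _)| b <= bucket_idx);
    (start, end)
}
```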
In this way, the number of threads equals the number of buckets, $\text{num\_window} \times (2^{\text{wsize}} - 1)$, which is significantly more than the number of threads in the window-wise fashion (only $\text{num\_window}$). Thus, we extract more parallelism, which is better suited to GPU computation.
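For a concrete sense of scale, assume a 13-bit window over 254-bit BN254 scalars, which is consistent with the 163,820 accumulation threads reported in the $2^{16}$ log below:
$$
\text{num\_window} = \left\lceil \tfrac{254}{13} \right\rceil = 20, \qquad 20 \times (2^{13} - 1) = 163{,}820 \ \text{bucket threads},
$$
versus only 20 threads in the window-wise fashion.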
### Precomputation
The bucket method aims to compute the following, with
* $k_i$: the $i$-th scalar
* $P_i$: the $i$-th point
* $k_{i,j}$: the $j$-th $c$-bit fragment of $k_i$ (sliced scalar)
* $c$: window size
* $n$: MSM size

$$
MSM = \sum_{i=1}^{n} k_i P_i = \sum_{i} \sum_{j} k_{i,j} 2^{cj} P_i = \sum_j 2^{cj} \left(\sum_i k_{i,j} P_i\right) = \sum_j 2^{cj} B_j
$$
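The $\sum_j 2^{cj} B_j$ reduction is where the repeated doublings live; a minimal arkworks-style sketch of it, assuming the bucket sums $B_j$ are already computed:
```rust
use ark_bn254::G1Projective;
use ark_ec::Group;
use ark_std::Zero;

/// Combine the per-window bucket sums B_j into sum_j 2^{cj} * B_j (Horner form).
/// The c doublings per window are the `double_in_place` cost that precomputed
/// points make unnecessary.
fn reduce_windows(window_sums: &[G1Projective], c: usize) -> G1Projective {
    let mut acc = G1Projective::zero();
    // walk the windows from most significant to least significant
    for b_j in window_sums.iter().rev() {
        for _ in 0..c {
            acc.double_in_place();
        }
        acc += b_j;
    }
    acc
}
```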
With
* $s$: scalar bit size
* $w$: total window number $= (s + c - 1) / c$ (integer division, i.e. $\lceil s / c \rceil$)

the precomputed points are
$$
2^{cj} P_i
$$
for `j` in `0..w` and `i` in `1..=n`.
By leveraging these precomputed points, we can avoid the finalization stage, since the `double_in_place` operations it would require are already folded into the precomputed points.
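A minimal arkworks-style sketch of building that precomputation table (the function name and table layout are our illustrative assumptions):
```rust
use ark_bn254::G1Projective;
use ark_ec::Group;

/// For every base point P_i, store the w shifted copies 2^{cj} * P_i (j = 0..w),
/// so that every window can accumulate with the pre-shifted bases directly.
fn precompute_shifted_bases(bases: &[G1Projective], c: usize, w: usize) -> Vec<G1Projective> {
    let mut table = Vec::with_capacity(bases.len() * w);
    for &p in bases {
        let mut shifted = p;
        for _ in 0..w {
            table.push(shifted);
            // advance to the next window: multiply by 2^c via c doublings
            for _ in 0..c {
                shifted.double_in_place();
            }
        }
    }
    table
}
```
In this layout the table grows by a factor of $w$, which is the kind of I/O overhead highlighted in the benchmarks below.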
### Leverage the curve isogeny for add/mixed-add ops
> Following the optimization strategy used in TrapdoorTech's MSM implementation on BLS12-377, we studied the feasibility of leveraging the curve isogeny to accelerate MSM.
By transforming all the points on the Weierstrass curve to its isomorphic Twisted Edwards curve, running MSM on that curve, and then transforming the results back, we can benefit from the more efficient unified and mixed addition operations on the Twisted Edwards curve. This makes it a reasonable strategy for MSM acceleration.
However, when we tried to map SW points to TE points following the strategy in [this Arkworks PR](https://github.com/arkworks-rs/algebra/pull/804), we found that the BN254 curve, which Mopro mainly uses, has a prime-order group. Since its order is an odd prime, it is not divisible by four, so BN254 does not satisfy the requirement for equivalence between the SW curve and the Montgomery (and hence Twisted Edwards) curve (more details in the references).
We've also received confirmation from Prof. Pratyush Mishra, the main maintainer of Arkworks, that BN254 is not compatible with this strategy.

Therefore, we are backlogging this strategy until Mopro supports curves that are compatible with the SW-to-TE curve isogeny.
## Benchmarks
Running on M3 chips:
| MSM time (ms) / instance size | 2^10 | 2^16 | 2^18 | 2^20 | 2^22 | 2^24 |
|:--------------------------------------------:|:-----:|:-------:|:--------:|:--------:|:---------:|:---------:|
| Arkworks MSM | 2.68 | 67.96 | 245.38 | 912.32 | 3641.81 | 14254.36 |
| Bucket-wise MSM <br> (on CPU) | 14.25 | 529.60 | 1844.78 | 7125.69 | 26255.07 | 103696.06 |
| Window-wise MSM with precomputation <br> (on CPU) | 2.13 | 65.83 | 234.32 | 915.91 | x | x |
| Window-wise Metal MSM <br> (on GPU) | 99.31 | 3845.10 | x | x | x | x |
| Bucket-wise Metal MSM <br> (on GPU) | 43.10 | 1730.26 | 10277.70 | 41019.22 | 555877.91 | x |
We want to highlight two things:
1. Window-wise MSM with precomputation gives only an insignificant speedup while adding substantial I/O overhead.
   The relationship among speedup, precomputed-point size, and precomputation time is as follows:
   - 2^10: precompute_factor = 20 (i.e. the precomputed points take 20 times the space of the original bases)
   - 2^16: precompute_factor = 20
   - 2^18: precompute_factor = 19
   - 2^20: precompute_factor = 17 (the precomputed points take around 570 MB)

   Therefore, we conclude that the precomputation technique might not be suitable for window-wise MSM, which is the approach widely adopted for CPU-based MSM.
2. The main bottleneck of the bucket-wise Metal MSM on GPU is still the accumulation phase.
   We believe the main reason is that most threads perform only a single point addition within their bucket, and the cost of launching a thread is much higher than performing several point additions sequentially within one thread. We also noticed that ICICLE's MSM handles this situation by identifying large buckets and treating them differently from small ones.
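As an illustration of that large-bucket idea, here is a hypothetical host-side pre-pass; the function name and threshold are ours, not ICICLE's or our current implementation's:
```rust
/// Hypothetical pre-pass: after sorting, measure each bucket's run length and
/// route "large" buckets to a separate, more parallel path instead of letting a
/// single thread loop over all of their points.
fn split_buckets_by_size(sorted_pairs: &[(u32, u32)], large_threshold: usize) -> (Vec<u32>, Vec<u32>) {
    let (mut small, mut large) = (Vec::new(), Vec::new());
    let mut i = 0;
    while i < sorted_pairs.len() {
        let bucket = sorted_pairs[i].0;
        // length of the contiguous run belonging to this bucket
        let run_end = i + sorted_pairs[i..].partition_point(|&(b, _)| b == bucket);
        if run_end - i >= large_threshold {
            large.push(bucket);
        } else {
            small.push(bucket);
        }
        i = run_end;
    }
    (small, large)
}
```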
Log for $2^{16}$ using bucket-wise Metal MSM:
```
Encoding instance to GPU memory...
Done encoding data in 132.450084ms
Init buckets time: 2.26325ms
Prepare buckets indices time: 1.430042ms
Sort buckets indices time: 14.22575ms
Bucket wise accumulation time (using 163820 threads): 991.4465ms
Reduction time: 718.595667ms
Final accumulation time: 8.650833ms
Average time to execute MSM with 65536 points and 65536 scalars in 1 iterations is: 1.737110459s
```
Log for $2^{22}$ using bucket-wise Metal MSM:
```
Encoding instance to GPU memory...
Done encoding data in 10.720544416s
Init buckets time: 17.544958ms
Prepare buckets indices time: 28.032541ms
Sort buckets indices time: 839.781541ms
Bucket wise accumulation time (using 1966065 threads): 544.148180208s
Reduction time: 10.799348791s
Final accumulation time: 8.911584ms
Average time to execute MSM with 4194304 points and 4194304 scalars in 1 iterations is: 555.877919833s
```
## Future Works
- We've noticed an approved [EF grant proposal](https://hackmd.io/@ZZhqXtsPTHawMbCDb4ZkaA/ByV4Uahap) that is working on migrating ICICLE MSM from CUDA to Metal with C++ host code. Once it is completed, we can adapt it to Rust host code and tune its parameters for the BN254 curve.
- Research existing CPU solutions for faster MSM implementation and integrate them into Mopro, using Arkworks as a benchmark.
## Blockers
### GPU Hang Error on mobile devices
> While executing our Metal MSM on an iPhone, we encountered the following error:
```shell
Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (00000003:kIOGPUCommandBufferCallbackErrorHang)
```
According to [this forum thread](https://forums.developer.apple.com/forums/thread/70089), the error may be caused by the task we are executing being too large.
Here are two recommendations to address this:
1. Split the task across multiple sub-grids (see the sketch below).
2. The job might be too small, **though we don't think this applies to our case**. For example, if the `threadgroupsPerGrid` `MTLSize` passed to `dispatchThreadgroups` has zero in any of its dimensions while the image has a non-zero extent in that dimension, the GPU will hang and report this error.

***This error occurred, and we could not execute metal-rs on the iOS system.***
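A minimal metal-rs sketch of recommendation 1: split one oversized dispatch into several smaller command buffers so that no single submission is large enough to hang. The chunking scheme and the extra `chunk_offset` constant the kernel would need are our assumptions, not the current implementation:
```rust
use metal::{Buffer, CommandQueue, ComputePipelineState, MTLSize};

/// Hypothetical sketch: dispatch the grid in fixed-size chunks, one command
/// buffer per chunk, waiting for each submission before issuing the next.
fn dispatch_in_chunks(
    queue: &CommandQueue,
    pipeline: &ComputePipelineState,
    buffers: &[&Buffer],
    total_threadgroups: u64,
    chunk_threadgroups: u64,
    threads_per_group: u64,
) {
    let mut chunk_offset: u64 = 0;
    while chunk_offset < total_threadgroups {
        let this_chunk = chunk_threadgroups.min(total_threadgroups - chunk_offset);

        let command_buffer = queue.new_command_buffer();
        let encoder = command_buffer.new_compute_command_encoder();
        encoder.set_compute_pipeline_state(pipeline);
        for (i, &buffer) in buffers.iter().enumerate() {
            encoder.set_buffer(i as u64, Some(buffer), 0);
        }
        // the kernel would need a matching `chunk_offset` constant to know which
        // slice of the grid this submission covers
        encoder.set_bytes(
            buffers.len() as u64,
            std::mem::size_of::<u64>() as u64,
            &chunk_offset as *const u64 as *const std::ffi::c_void,
        );
        encoder.dispatch_thread_groups(
            MTLSize::new(this_chunk, 1, 1),
            MTLSize::new(threads_per_group, 1, 1),
        );
        encoder.end_encoding();
        command_buffer.commit();
        command_buffer.wait_until_completed();

        chunk_offset += this_chunk;
    }
}
```
Waiting on each chunk bounds the amount of work per submission, at the cost of some extra synchronization overhead.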
### Unable to use built-in sort function in Metal
1. The `parallel radix sort` MSL kernel hits an assertion error when creating a pipeline:
```
validateWithDevice:1316: failed assertion `Compute Pipeline Descriptor Validation
function make_histogram_uint32 cannot be used to build a pipeline state. Use newFunctionWithName:constantValues:... to get the specialized function
validateWithDevice:1316: failed assertion `Compute Pipeline Descriptor Validation
function reorder_uint32 cannot be used to build a pipeline state. Use newFunctionWithName:constantValues:... to get the specialized function
```
`newFunctionWithName` does not help: the device can successfully get the function but cannot create a compute pipeline for it.
```rust
// mopro-core/src/middleware/gpu_explorations/metal/abstraction/state.rs#L36-L54

/// Creates a pipeline based on a compute function `kernel` which needs to exist in the state's
/// library. A pipeline is used for issuing commands to the GPU through command buffers,
/// executing the `kernel` function.
pub fn setup_pipeline(
    &self,
    kernel_name: &str,
) -> Result<metal::ComputePipelineState, MetalError> {
    // Success: the library fetches the function by name without error
    let kernel = self
        .library
        .get_function(kernel_name, None)
        .map_err(MetalError::FunctionError)?;

    // Fail: building a compute pipeline state for it triggers the assertion above
    let pipeline = self
        .device
        .new_compute_pipeline_state_with_function(&kernel)
        .map_err(MetalError::PipelineError)?;

    Ok(pipeline)
}
```
2. Metal has a [sort](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/sort(_:axis:descending:name:)) function in MPS (Metal Performance Shaders). However, metal-rs does not fully support MPS, so we cannot leverage the built-in sort from the Metal library.
3. (Alternative) Sort on the CPU and send the result back to the GPU.
## Backlog (Already done in arkworks backend)
1. [Backlog] wNAF, [GLV decomposition](https://hackmd.io/@drouyang/glv)
2. [Backlog] Signed-index endomorphism (reducing the number of buckets)
## References
- [Grant proposal for migrating ICICLE from CUDA to Metal](https://hackmd.io/@ZZhqXtsPTHawMbCDb4ZkaA/ByV4Uahap)
- [TrapdoorTech MSM for zprize 2022](https://github.com/Trapdoor-Tech/TrapdoorTech-zprize-mobile-harness/tree/c86b415cbc305f2a255aac2bfb0afd841cad2804)
- [Point addition cost on Short Weierstrass Curve](https://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-add-2007-bl)
- [Mixed Point addition cost on Short Weierstrass Curve](https://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-madd-2007-bl)
- [Point addition cost on Twisted Edwards Curve](https://www.hyperelliptic.org/EFD/g1p/auto-twisted-extended-1.html#addition-add-2008-hwcd-3)
- [Mixed Point addition cost on Twisted Edwards Curve](https://www.hyperelliptic.org/EFD/g1p/auto-twisted-extended-1.html#addition-madd-2008-hwcd-3)
- [Equivalence with Weierstrass curves](https://en.wikipedia.org/wiki/Montgomery_curve#Equivalence_with_Weierstrass_curves)