# mopro 22 milestone 1 report [TOC] ## Links - [link to milestone 2 report](https://hackmd.io/YtTVJUArRmWrdvaTza2LGg) ## Things we have done 1. Researched zprize works and papers related to MSM acceleration. 2. Integrate `zprize2022/TrapdoorTech` works on msm into current mopro and get the benchmark of it. 3. Adapt the benchmarking method arkworks pippenger msm into zprize convention, which using 10 instances of $2^{16}$ points and scalars on BLS12_377 curve as benchmark data. ## Benchmark Result We set up with the following parameters: - Instance size: $2^{16} = 65536$ - Number of instance: 10 Overall, Arkworks Pippenger MSM algorithm (in arkworks 0.4 version) runs faster than TrapdoorTech Zprize work in 2022. Specifically, arkworks runs almost **300% faster on iOS simulator and 130% faster on iPhone 14 Pro**. A takeaway is that arkworks MSM runs nearly 50% slower on mobile device. We suspect that some optimizations by arkworks pippenger implementation might work less efficiently on the mobile device. Since the other zprize work, [`nistath`](https://github.com/nistath/zprize-mobile-msm/tree/main), has the same benchmark results according to the [zprize blog](https://www.zprize.io/blog/zprize-i-spotlight-open-division-prize-7-msm-on-mobile), we decided to skip the integration of this work and proceed to the next milestone. In conclusion, we will **focus on optimization of arkworks pippenger msm** as the next step. To reproduce the following benchmarks on laptop with CPU, use the commands on root path i.e. `mopro/`. Replace $ALGO_NAME with one of {halo2curve_msm, arkworks_pippenger}. - For $2^{16}$ size of instances ``` cargo test --release --features=gpu-benchmarks --package mopro-core --lib -- middleware::gpu_explorations::$ALGO_NAME::tests::test_run_benchmark --exact --nocapture ``` - For benchmark results for all different instance sizes ``` cargo test --release --features=gpu-benchmarks --package mopro-core --lib -- middleware::gpu_explorations::$ALGO_NAME::tests::test_run_multi_benchmarks --exact --nocapture ``` 1. Result running of iOS simulator: - logging: ```bash Average time to execute MSM with 65536 points and scalars in 1 iterations is: 135.93175ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 122.234958ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 130.396541ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 123.629666ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 145.457667ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 159.236792ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 134.852375ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 128.190292ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 138.075042ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 132.762041ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 128.630458ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 126.423125ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 128.428083ms Done running benchmark. Check the result at: "../mopro-core/benchmarks/gpu_explorations" Result of Arkwork (Baseline): 133.40375307692307 ms (diff: 0.0 %) Running MSM in algorithm: TrapdoorTech Zprize... Vectors already generated Average time to execute MSM with 65536 points and scalars in 1 iterations is: 559.812209ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 533.640042ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 525.178417ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 527.08525ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 525.088292ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 518.739ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 514.366458ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 519.871917ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 520.039125ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 528.623416ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 524.508458ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 525.372292ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 515.997166ms Done running benchmark. Check the result at: mopro-core/"../mopro-core/benchmarks/gpu_explorations" Result of TrapdoorTech Zprize: 526.0247724615385 ms (diff: -294.3103251062381 %) ``` - result screenshot <img src="https://hackmd.io/_uploads/B1aNfCYlR.png" alt="MSM result on simulator 15 pro Max" height="500"> 2. Result running on FoodChain's Iphone 14 pro - logging ```bash Average time to execute MSM with 65536 points and scalars in 1 iterations is: 195.635083ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 206.639791ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 205.1675ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 197.742167ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 207.147166ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 199.729459ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 203.080416ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 198.15875ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 201.636916ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 210.273792ms Done running benchmark. Check the result at: "../mopro-core/benchmarks/gpu_explorations" Result of Arkwork (Baseline): 202.52110399999998 ms (diff: 0.0 %) Running MSM in algorithm: TrapdoorTech Zprize... Vectors already generated Average time to execute MSM with 65536 points and scalars in 1 iterations is: 470.619458ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 470.25775ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 471.028792ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 470.972375ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 471.092292ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 470.932375ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 471.304875ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 471.343375ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 470.9955ms Average time to execute MSM with 65536 points and scalars in 1 iterations is: 470.738208ms Done running benchmark. Check the result at: mopro-core/"../mopro-core/benchmarks/gpu_explorations" Result of TrapdoorTech Zprize: 470.9284999999999 ms (diff: -132.53304998771878 %) ``` - result screenshot <img src="https://hackmd.io/_uploads/BkdTW0FeC.png" alt="MSM result on iphone 14 pro" height="500"> 3. Benchmark comparison on different size of points and scalars on Rust/Laptop **<u>[BLS12_377] Arkworks Pippenger vs TrapdoorTech</u>** - The benchmarking data is the same for all algorithms. - Each scalar has 32 Bytes, which can represent 0 to 2^256 - 1 unsigned integer. - The average msm time is calculated by averaging the msm time of 10 different random instance with the same size | msm_size | Arkworks Pippenger <br> arkworks 0.4 (ms) | TrapdoorTech <br> zprize2022 (ms) | |:--------:| ---------------------------------------------------------------- | -------------------------------------------------------- | | 2^16 | 134.915549 | 473.977274 | | 2^18 | 491.054058 | 1675.766808 | | 2^20 | 1922.294874 | 5787.312958 | | 2^22 | 6661.680833 | 20958.7608 | | 2^24 | 27812.06526 | 86147.87218 | **<u>[BN254/Apple M2 Chips] Arkworks Pippenger vs Halo2curve</u>** - We have tested halo2curve's msm implementation of both single-core version (i.e. `multiexp_serial()`) and multi-core one (i.e. `best_multiexp()`) | msm_size | Arkworks Pippenger <br> arkworks 0.4 (ms) | Halo2curve <br> multi-core (ms) | Halo2curve <br> single-core (ms) | |:--------:| ----------------------------------------- | ------------------------------- | -------------------------------- | | 2^16 | 82.19 | 116.79 | 464.35 | | 2^18 | 307.24 | 410.98 | 1649.23 | | 2^20 | 1140.88 | 1544.34 | 6154.86 | | 2^22 | 4160.43 | 5254.70 | 20962.72 | **<u>[BN254/Intel i7-1185G7] Arkworks Pippenger vs Halo2curve</u>** - The "asm" feature is enabled in x86_64 architecture | msm_size | Arkworks Pippenger <br> arkworks 0.4 (ms) | Halo2curve <br> multi-core (ms) | |:--------:| ----------------------------------------- | ------------------------------- | | 2^16 | 140.90 | 154.78 | | 2^18 | 579.82 | 569.52 | | 2^20 | 2974.86 | 2547.60 | | 2^22 | 10732.08 | 9866.36 | ## Optimization methods comparison | MSM <br> algorithms | Tuned window size | Operations on Twisted Edwards | Batch Addition | Signed Bucket Indexes | GLV Decomposition | |:-----------------------------------------------------------------------------------------------------------------:| ----------------- | ----------------------------- | -------------- | --------------------- | ----------------- | | [Arkworks 0.4](https://github.com/arkworks-rs/algebra/blob/master/ec/src/scalar_mul/variable_base/mod.rs) | ✅ | Optional | | ✅ | ✅ | | [TrapdoorTech <br> zprizez 2022](https://github.com/Trapdoor-Tech/TrapdoorTech-zprize-mobile-harness/tree/master) | ✅ | ✅ | | | | | [Halo2curve <br> multi-core](https://github.com/privacy-scaling-explorations/halo2curves/blob/main/src/msm.rs) | ✅ | | ✅ | ✅ | ✅ | | [Arkmsm <br> by Snarkify](https://github.com/snarkify/arkmsm/tree/main) | ✅ | | ✅ | ✅ | ✅ | ## Working logs 1. Get benchmark of zprize works - [x] create utils function to generate benchmarking points on BLS12_377 (**Moven**) - [X] deserialize the points serialized in arkworks 0.3 version to arkworks 0.4 version (**Moven**) - [X] add `arkworks-pippenger` msm to mopro-ffi's uniffi-binding (**Moven**) - [X] benchmarking arkworks-pippenger msm in Swift. - [X] add `Trapdoor` msm to mopro-ffi's uniffi-binding (**FoodChain**) - [X] adding test in `mopro-ffi/src/lib/trapdoortech_zprize_msm` - [x] adding test in `mopro-ffi/test/bindings` (yes, but not passed yet) - [X] changing the result and error type to zprize for consistency. `benchmark_msm` function format: ```rust // the function running msm pub fn benchmark_msm<I>( benchmark_dir: &str, instances: I, iterations: u32, ) -> Result<Vec<Duration>, preprocess::HarnessError> where I: Iterator<Item = preprocess::Instance>; // the function running on IOS pub fn run_benchmark( instance_size: u32, num_instance: u32, utils_dir: &str, benchmark_dir: &str, ) -> Result<BenchmarkResult, preprocess::MoproError> ``` - [X] Set up Download url for points and scalars in mobile device (**FoodChain**) - [X] leveraging AWS S3 to store the files and provide the download url - [X] benchmarking Trapdoor's msm in IOS. (**FoodChain**) 2. Select msm we want to use, supported by benchmarking result - [X] (if zprize works is better) upgrade overall arkworks from 0.3 to 0.4 version to fit mopro. (`arkworks_pippenger` currently uses MSM in 0.4) 3. Backlog (the implementation had the same improvement of Trapdoor Tech's work and we have issues running their code, so it's not worthy to integrate) - [ ] ~~add `nistath` msm to mopro-ffi's uniffi-binding~~ - [ ] ~~benchmarking `nistath` msm in Swift~~ Next Step (milestone 3): - Metal API: For now we might use `metal-rs` to make the algorithm more compatible with `arkworks`. - Optimizing ideas: please see this [page](https://hackmd.io/@moven0831/BeH22) ## Blockers ### ✅ Issue on Arkworks version Both of the zprize works in Rust, [nistath](https://github.com/nistath/zprize-mobile-msm/tree/main) and [TrapdoorTech](https://github.com/Trapdoor-Tech/TrapdoorTech-zprize-mobile-harness/tree/master), are based on Arkworks version "0.3". Therefore, some important components in `ark-ff` and `ark-ec` are deprecated on Arkworks "0.4", which is used in mopro. For instance, - Trait `AffineCurve` is not in ark_ec anymore. - Method `serialize_unchecked()` is not in ark-serialize anymore. - Method `serialized_size()` in not in `ark_ff::field` so that the `ark_bls12_377::fields::fr` cannot implement this. - `ToBytes` and `FromBytes` in ark-ff have been removed by [ark-algebra#417](https://github.com/arkworks-rs/algebra/pull/417) #### 0319 - [ChangeLog](https://github.com/arkworks-rs/algebra/blob/master/CHANGELOG.md) about Arkworks "0.3" to "0.4" - [Commit](https://github.com/arkworks-rs/circom-compat/commit/b892c62597687c23341cda1e8e89d58bb6428f36) on "Update ark-circom for arkworks 0.4.0" #### (Closed with [commit](https://github.com/oskarth/mopro/commit/48de8c170c1b1d6d1aab0cd72a8ed645371c2825)) 0324 - Solved by allowing multiple version with patch feature in Rust ```toml # GPU explorations ark-bls12-377 = { version = "0.3", optional = true } ark-bls12-381 = { version = "0.3", optional = true } ark-ed-on-bls12-377 = { version = "0.3", optional = true } ark-ed-on-bls12-381 = { version = "0.3", optional = true } ark-poly = { version = "0.3", optional = true } ark-poly-commit = { version = "0.3", optional = true } ark-sponge = { version = "0.3", optional = true } duration-string = { version = "0.0.6", optional = true } rand = { version = "0.8.0", optional = true } rand_chacha = { version = "0.3.1", optional = true } lazy_static = { version = "1.4.0", optional = true } # GPU explorations from mopro/Cargo.toml patch ark-ec-3 = { git = 'https://github.com/arkworks-rs/algebra.git', package = 'ark-ec', tag = 'v0.3.0', features = ["parallel"], optional = true} ark-ff-3 = { git = 'https://github.com/arkworks-rs/algebra.git', package = 'ark-ff', tag = 'v0.3.0', features = ["parallel"], optional = true } ark-serialize-3 = { git = 'https://github.com/arkworks-rs/algebra.git', package = 'ark-serialize', tag = 'v0.3.0', optional = true } ark-std-3 = { git = 'https://github.com/arkworks-rs/std.git', package = 'ark-std', tag = 'v0.3.0', optional = true } ``` - try the benchmarking on `mopro/mopro-core` ```bash cargo test --release --features=gpu-benchmarks --package mopro-core --lib -- middleware::gpu_explorations::trapdoortech_zprize_msm::tests::test_benchmark_msm --exact --nocapture ``` **<u>Result:</u>** ``` running 1 test Running benchmark for baseline result Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 461.931291ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 463.537083ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 462.997125ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 464.71475ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 462.676667ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 462.52125ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 464.768708ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 463.627459ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 464.488875ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 463.715ms Done running benchmark: Ok([461.931291ms, 463.537083ms, 462.997125ms, 464.71475ms, 462.676667ms, 462.52125ms, 464.768708ms, 463.627459ms, 464.488875ms, 463.715ms]) test middleware::gpu_explorations::trapdoortech_zprize_msm::tests::test_benchmark_msm ... ok ``` ### ✅ Issue on benchmarking method #### 0330 - Benchmarking points and scalar incompatible: zprize works are benchmarks in BLS12_377; current arkworks baseline are in BN254 - proposal: move the benchmarking point and scalar generation into `gpu_explorations/utils` and make current arkworks baseline in bls12_377 #### 0402 - Found that `ark_bls12_377::G1Affine` type is different from version 0.3 and 0.4, specifically one is using `short_weierstrass_jacobian` and another only `short_weierstrass` **<u>Error msg:</u>** ``` error[E0117]: only traits defined in the current crate can be implemented for types defined outside of the crate --> mopro-core/src/middleware/gpu_explorations/arkworks_pippenger.rs:23:1 | 23 | impl From<ark_bls12_377_3::G1Affine> for G1Affine { | ^^^^^-------------------------------^^^^^-------- | | | | | | | `ark_ec::short_weierstrass::Affine` is not defined in the current crate | | `ark_ec::short_weierstrass_jacobian::GroupAffine` is not defined in the current crate | impl doesn't use only types from inside the current crate | = note: define and implement a trait or new type instead ``` - Solution: try to impl from() to convert the types - find source code [`ark_ec::short_weierstrass::Affine`](https://docs.rs/ark-ec/0.4.0/src/ark_ec/models/short_weierstrass/affine.rs.html#68) and [`ark_ec::short_weierstrass_jacobian::GroupAffine`](https://docs.rs/ark-ec/0.3.0/src/ark_ec/models/short_weierstrass_jacobian.rs.html#74-150) then implement the transform method #### 0409 - Found critical problem that the point repr method can be different in arkworks 0.3 and 0.4. Specifically, one is using Jacobian form to repr point and others use SW form. - Failed code ```rust for instance in instances { let points = &instance.0; for p in points { let mut bytes = Vec::new(); let k = BigInt::new(p.x.0.0); k.serialize_compressed(&mut bytes).unwrap(); println!("k: {:?}", k); println!("len: {:?} bytes: {:?}", bytes.len(), bytes); let new_p = G1Affine::deserialize_compressed(&*bytes).unwrap(); println!("new_p: {:?}", new_p); } } ``` ``` running 1 test dir: "src/middleware/gpu_explorations/utils/vectors/10x16" k: BigInt([9731552286689537486, 12127634018691867890, 9776079508585867596, 14025949321989632482, 3173994719657650652, 1061268229925581]) len: 48 bytes: [206, 45, 69, 169, 62, 107, 13, 135, 242, 100, 18, 146, 226, 2, 78, 168, 76, 105, 60, 101, 132, 156, 171, 135, 226, 177, 210, 126, 152, 46, 166, 194, 220, 233, 40, 136, 97, 75, 12, 44, 205, 42, 219, 195, 55, 197, 3, 0] thread 'middleware::gpu_explorations::arkworks_pippenger::tests::test_run_benchmark' panicked at mopro-core/src/middleware/gpu_explorations/arkworks_pippenger.rs:54:67: called `Result::unwrap()` on an `Err` value: InvalidData note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace test middleware::gpu_explorations::arkworks_pippenger::tests::test_run_benchmark ... FAILED ``` - Success code (sample bls points in 0.4 version) ```rust for instance in instances { let points = &instance.0; for p in points { let mut bytes = Vec::new(); let p_a = G1Affine::rand(&mut rng); p_a.x.serialize_compressed(&mut bytes).unwrap(); println!("p: {:?}", p_a.x); println!("len: {:?} bytes: {:?}", bytes.len(), bytes); let new_p = G1Affine::deserialize_compressed(&*bytes).unwrap(); println!("new_p: {:?}", new_p); } } ``` ``` p: BigInt([5512034307592596933, 99440614254125635, 9312191808998893853, 13158212786987652246, 9267201699218040409, 55789045323959188]) len: 48 bytes: [197, 173, 210, 173, 255, 173, 126, 76, 67, 34, 221, 56, 182, 72, 97, 1, 29, 105, 174, 183, 28, 141, 59, 129, 150, 212, 102, 135, 208, 92, 155, 182, 89, 166, 201, 127, 216, 182, 155, 128, 148, 139, 228, 84, 215, 51, 198, 0] new_p: (119164677558380895836223082391237862732875551641716767440243712604200065663798289487670039744221328193177263451589, 122683500655082031675443015824881521317135573936647759675201525980281219854269715687554971804151299033112627198155) ``` #### (Closed with [commit](https://github.com/oskarth/mopro/commit/e025f08d7f1a9b96dbfe4ed2179d56aa2160b8c0?diff=unified&w=0)) 0411 - Found a way to successfully parsed points and scalars. Then, get the benchmark result of arkworks pippenger ```rust= let points = &instance.0; let scalars = &instance.1; let mut parsed_points = Vec::<G1Affine>::new(); let mut parsed_scalars = Vec::<ScalarField>::new(); // parse points and scalars from arkworks 0.3 compatible format to 0.4 compatible for p in points { let new_p = G1Affine::new_unchecked(BigInt::new(p.x.0 .0).into(), BigInt::new(p.y.0 .0).into()); parsed_points.push(new_p); } for s in scalars { let new_s = ScalarField::new(BigInt::new(s.0)); parsed_scalars.push(new_s); } ``` ``` running 1 test dir: "src/middleware/gpu_explorations/utils/vectors/10x16" Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 127.120917ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 131.391917ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 129.130125ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 128.530083ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 140.565792ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 129.133667ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 128.416542ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 133.244625ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 129.896042ms Average time to execute MSM with 65536 points and 65536 scalars and 1 iterations is: 128.161ms Done running benchmark: Ok([127.120917ms, 131.391917ms, 129.130125ms, 128.530083ms, 140.565792ms, 129.133667ms, 128.416542ms, 133.244625ms, 129.896042ms, 128.161ms]) test middleware::gpu_explorations::arkworks_pippenger::tests::test_run_benchmark ... ok test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 12 filtered out; finished in 3.00s ``` ### ✅ File not detected on swift > The precalculated files were not readable by the target due to the configuration problem. #### 0414 We uses `AWS S3` to store the pre-calculated points and scalars. With `Example/MoproKit/FileDownloader`, we are able to download the files from `S3` and store inside our target (whether simulator or mobile device). Current points and scalar 10 instances and each of them has size of 2^16: - points: `https://mopro-msm.s3.eu-north-1.amazonaws.com/vectors/16x10/points` - scalars: `https://mopro-msm.s3.eu-north-1.amazonaws.com/vectors/16x10/scalars` ## References - [zprize2022-mobile-nistath](https://github.com/nistath/zprize-mobile-msm/tree/main) - [zprize2022-mobile-TrapdoorTech](https://github.com/Trapdoor-Tech/TrapdoorTech-zprize-mobile-harness/tree/master) - [zprize2022-mobile-gnark](https://github.com/gbotrel/zprize-mobile-harness/tree/main) - [zprize2022-gpu-yrrid](https://github.com/z-prize/2022-entries/tree/main/open-division/prize1-msm/prize1a-msm-gpu/yrrid) - [Optimization strategies](https://hackmd.io/@drouyang/msm)