# Quantus External Miner — Roadmap (Updated)

This roadmap updates the original optimization and GPU offload plan to reflect:

- New CPU microarchitecture-specific backends (Montgomery CIOS + BMI2/ADX + aarch64 UMULH).
- Backend-aware metrics for apples-to-apples comparisons.
- A staged GPU CUDA plan with explicit SHA3 bring-up phases.

Status: actively developed and deployed in production-like environments.

---

## 0) Context & Goals

- Problem: For a given header (32 bytes) and nonce (U512), compute the QPoW distance:
  - y = m^(h + nonce) mod n
  - distance = target XOR SHA3-512(y_be64)
- For consecutive nonces, replace the exponentiation with a single modular multiply per step:
  - y_{k+1} = (y_k * m) mod n

Goals:

- Maximize hashrate without changing external APIs or protocol semantics.
- Keep correctness parity and apples-to-apples metrics across engines.
- Introduce a high-performance GPU backend with staged, verifiable milestones.

---

## 1) Current Architecture & Engine Selection

Engines:

- cpu-baseline: Reference implementation; per-nonce exponentiation.
- cpu-fast: Incremental path (init y0 once, then y ← y·m mod n).
- cpu-montgomery: Optimized modular multiply using Montgomery arithmetic (512-bit, 8×64 limbs).
  - Portable CIOS (u128 accumulators).
  - x86_64 BMI2-only: MULX-based CIOS (single carry chain).
  - x86_64 BMI2+ADX: MULX + ADCX/ADOX dual carry chains for higher ILP.
  - aarch64 UMULH/ADCS: UMULH for high halves, optimized accumulation.
  - Runtime backend selection:
    - Auto-detects: x86_64-bmi2-adx, x86_64-bmi2, aarch64-umulh, or portable.
    - Override: MINER_MONT_BACKEND=portable|bmi2|bmi2-adx|umulh
  - Per-job precompute cache: n0_inv, R^2 mod n, m_hat.
  - Direct SHA3 from normalized big-endian bytes (no intermediate big-int allocations).
  - Identical MinerEngine semantics and metrics to cpu-fast for apples-to-apples comparisons.
- gpu-cuda (scaffold): Feature-gated.
  Selection requires building with CUDA support and a working runtime; otherwise the service fails fast (no CPU fallback when gpu-cuda is selected).

CLI:

- --engine cpu-fast | cpu-montgomery | gpu-cuda | cpu-baseline | cpu-chain-manipulator
- --progress-chunk-ms <ms> controls metrics/cancellation cadence (5s is a good production default).

---

## 2) Metrics & Observability

We preserve identical metrics emission semantics across engines and add a backend label for cpu-montgomery.

Core series (examples):

- miner_hash_rate (gauge): global nonces/sec estimate.
- miner_job_hash_rate{engine,job_id} (gauge)
- miner_thread_hash_rate{engine,job_id,thread_id} (gauge)
- miner_job_hashes_total{engine,job_id} (counter)
- miner_thread_hashes_total{engine,job_id,thread_id} (counter)
- miner_jobs_total{status} (counter)
- miner_effective_cpus (gauge)

Backend info:

- miner_engine_backend{engine="cpu-montgomery", backend="<name>"} = 1

Recommended comparisons:

- Plot per-engine/per-instance hashrate lines.
- Filter or group by backend to compare cpu-fast vs cpu-montgomery vs the individual cpu-montgomery backends.
- Keep chunking equal across engines for a fair comparison.

---

## 3) CPU Roadmap (Updated)

Phase 1 — Incremental O(1) y update (DONE)

- Init y0 = m^(h + start_nonce) mod n once per worker.
- Per nonce: y ← y·m mod n.
- Hash: SHA3-512 of y (64 bytes BE); distance = target XOR hash.

Phase 2 — Montgomery multiplication (DONE)

- Portable CIOS path with 8×64 limbs and u128 accumulators.
- Precompute (n0_inv, R^2 mod n) per job; keep y, m in the Montgomery domain.
- Before hashing: from_mont(y_hat) → big-endian bytes; reuse a single hasher with finalize_reset.
- Micro-arch backends:
  - x86_64 BMI2 (MULX).
  - x86_64 BMI2+ADX (MULX + ADCX/ADOX).
  - aarch64 UMULH.
- Runtime detection, env overrides, and the backend metric added.
- Gains: measurable uplift vs cpu-fast; largest on modern x86_64 with ADX, with a good uplift on Apple Silicon.
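The Montgomery mechanics behind Phase 2 can be sketched at toy scale. The snippet below is a single-limb (64-bit) Montgomery REDC with R = 2^64, not the production 8×64 CIOS path; `neg_inv`, `mont_mul`, and the constants are illustrative names chosen for this sketch, not the miner's API. It checks that a Montgomery-domain multiply agrees with plain modular multiplication, mirroring the portable-vs-reference parity tests:

```rust
// Toy single-limb Montgomery multiply (R = 2^64); the real engine uses an
// 8x64-limb CIOS routine. Requires an odd modulus n < 2^63 so that the
// intermediate sum t + m*n fits in u128.

/// n0_inv = -n^{-1} mod 2^64, via Newton's iteration (valid for odd n).
fn neg_inv(n: u64) -> u64 {
    let mut inv = n; // correct to 3 low bits: n*n == 1 (mod 8) for odd n
    for _ in 0..5 {
        // Each step doubles the number of correct low bits: 3 -> 6 -> ... -> 96.
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// REDC: returns a*b*R^{-1} mod n.
fn mont_mul(a: u64, b: u64, n: u64, n0_inv: u64) -> u64 {
    let t = a as u128 * b as u128;
    let m = (t as u64).wrapping_mul(n0_inv); // makes t + m*n divisible by 2^64
    let u = ((t + m as u128 * n as u128) >> 64) as u64;
    if u >= n { u - n } else { u }
}

fn main() {
    let n = 1_000_003u64; // toy odd modulus, not a real 512-bit job modulus
    let n0_inv = neg_inv(n);
    // R^2 mod n = 2^128 mod n, computed without overflow.
    let r2 = (((u128::MAX % n as u128) + 1) % n as u128) as u64;

    let (a, b) = (12_345u64, 67_890u64);
    let a_hat = mont_mul(a, r2, n, n0_inv);        // to_mont: a*R mod n
    let b_hat = mont_mul(b, r2, n, n0_inv);
    let p_hat = mont_mul(a_hat, b_hat, n, n0_inv); // (a*b)*R mod n
    let p = mont_mul(p_hat, 1, n, n0_inv);         // from_mont: a*b mod n

    assert_eq!(p, ((a as u128 * b as u128) % n as u128) as u64);
    println!("montgomery parity ok: {}", p);
}
```

Keeping y and m in the Montgomery domain is what makes the per-nonce cost a single `mont_mul`; `from_mont` is only paid once per nonce, immediately before hashing.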
Phase 3 — Micro-optimizations (PARTIAL / BACKLOG)

- Optional CPU SIMD SHA3 path (deprioritized if GPU is the primary focus).
- Loop unrolling/batching; further allocator avoidance.
- Thread affinity / NUMA tuning for large nodes.
- Micro-benchmark harness (ns/op per backend) reported at startup (feature-gated).

---

## 4) GPU CUDA Roadmap (Staged)

Design: Offload the per-nonce loop to the GPU. The CPU prepares per-job constants and orchestrates nonce ranges.

What stays on CPU (per job):

- get_random_rsa(header), Miller–Rabin for n.
- Precompute target, threshold, and Montgomery parameters.
- Range partitioning, host polling, HTTP/metrics.

What runs on GPU (per thread/block):

- Either compute y0 on device (windowed exponentiation) or upload per-thread starts.
- Per nonce:
  - y_hat ← mont_mul(y_hat, m_hat)
  - y ← from_mont(y_hat)
  - hash ← SHA3-512(y_be64)
  - distance = target XOR hash; compare to threshold
- Early exit via an atomic flag and ordered result writes.

Stages:

- G1: Montgomery on device; SHA3 on host (bring-up).
  - Implement Montgomery (8×64) on the GPU using 64×64→128 via __umul64hi.
  - For a test nonce window, return y to the host → SHA3-512 → parity checks vs CPU.
- G2: SHA3 on device + early exit + threshold checks.
  - Implement Keccak-f[1600] (24 rounds) optimized for a 64-byte absorb per step.
  - Use constant memory for n, m_hat, n0_inv, R^2, target, threshold.
  - Grid-stride loop; each thread processes K nonces; tune K.
  - Atomics for the global “found” flag; per-thread batched hash counts.
- G3: Tuning & scale-out.
  - Unroll Montgomery limbs; control register pressure to hit occupancy targets.
  - Unroll Keccak rounds; minimize divergence; consider warp specialization if beneficial.
  - Multi-GPU orchestration (deferred until the single-GPU path stabilizes).
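The per-nonce loop and early-exit flag described above can be mimicked on the host side. The Rust sketch below uses a toy 64-bit state and the standard library's `DefaultHasher` as a stand-in for the 512-bit Montgomery state and SHA3-512; every name and constant (`step_mul`, `toy_hash`, `k_batch`, the moduli) is illustrative, not the kernel's actual interface. Each worker processes K nonces between checks of a shared “found” flag, mirroring the batched grid-stride design:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// One incremental step: y <- y*m mod n (the real engine does this in the
/// Montgomery domain on 512-bit values).
fn step_mul(y: u64, m: u64, n: u64) -> u64 {
    (y as u128 * m as u128 % n as u128) as u64
}

/// Deterministic stand-in for SHA3-512(y_be64); NOT a real hash function.
fn toy_hash(y: u64) -> u64 {
    let mut h = DefaultHasher::new();
    y.hash(&mut h);
    h.finish()
}

fn main() {
    let (m, n) = (7u64, 1_000_003u64); // toy parameters, not real QPoW values
    let target = 0xDEAD_BEEF_CAFE_F00Du64;
    let threshold = u64::MAX >> 8; // toy difficulty: top 8 hash bits must match
    let k_batch = 256u64;          // "K": nonces processed between flag checks

    let found = Arc::new(AtomicBool::new(false));
    let result: Arc<Mutex<Option<(u64, u64)>>> = Arc::new(Mutex::new(None));

    let mut handles = Vec::new();
    for t in 0..4u64 {
        let (found, result) = (Arc::clone(&found), Arc::clone(&result));
        handles.push(thread::spawn(move || {
            // Each worker owns a disjoint nonce range; in the real miner y0
            // would be m^(h + start) mod n, seeded here with a toy value.
            let start = t * 1_000_000;
            let mut y = (start + 1) % n;
            let mut nonce = start;
            while !found.load(Ordering::Relaxed) {
                // Batched inner loop: consult the shared flag only every K nonces.
                for _ in 0..k_batch {
                    y = step_mul(y, m, n);
                    let distance = target ^ toy_hash(y);
                    if distance <= threshold {
                        *result.lock().unwrap() = Some((nonce, distance));
                        found.store(true, Ordering::Relaxed);
                        return;
                    }
                    nonce += 1;
                }
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("candidate: {:?}", result.lock().unwrap());
}
```

On the GPU the flag becomes a device-global atomic polled by the host; batching K nonces per flag check trades a little early-exit latency for far less contention on the shared flag.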
Selection & build:

- The binary must be built with the CUDA feature to select gpu-cuda:
  - Build: cargo build -p miner-cli --features cuda
  - Runtime: --engine gpu-cuda (errors & exits if CUDA is unavailable)
- Until the full kernel is implemented, selection acts as a guardrail.

---

## 5) Testing & Validation

CPU:

- Property tests:
  - Portable Montgomery vs pow-core step_mul across many steps.
  - Backend parity: bmi2 / bmi2-adx / umulh vs portable on supported hosts.
- End-to-end:
  - cpu-montgomery vs cpu-fast parity for small ranges (distance, winning nonce, hash_count).
- In production:
  - Hold chunking equal across engines for fair metrics.

GPU:

- G1: Device Montgomery → host SHA3 parity on test vectors (small ranges).
- G2: Full parity against cpu-fast on small ranges with on-device SHA3.
- Randomized sampling (optional shadow mode): compute device results for a small fraction and cross-validate against the CPU.

---

## 6) Rollout Strategy

- Keep the public API stable; identical HTTP/service semantics.
- Maintain apples-to-apples metrics (same hash_count increment semantics).
- Enable cpu-montgomery by default; auto-select the best backend.
- Gate gpu-cuda behind the feature flag + runtime validation; fail fast when not supported.
- Use dashboards tagged by engine and backend to monitor regressions and wins.

---

## 7) Risks & Mitigations

- Big-int correctness bugs:
  - Extensive property tests; parity against the pow-core reference; backend parity checks.
- SHA3 mismatch:
  - Use test vectors; ensure distance parity vs CPU.
- Early-exit race conditions:
  - Device-global atomic, ordered writes; host-side polling cadence tuned.
- Performance portability:
  - Provide a portable CPU path, architecture-specialized CPU backends, and staged GPU tuning.

---

## 8) Concise Engineering Checklist

CPU (backlog / polish)

- [ ] Optional: CPU SIMD SHA3 path (feature-gated); retain parity tests.
- [ ] Micro-bench harness to print ns/op for mont_mul backends at startup.
- [ ] Affinity/NUMA tuning hooks (optional CLI/env) for large servers.

GPU — Stage G1 (bring-up)

- [ ] Add build.rs to compile kernel.cu to PTX (feature: cuda); detect nvcc; emit good errors on failure.
- [ ] Device constants: upload n, m, n0_inv, R^2, m_hat, target, threshold.
- [ ] Kernel v1:
  - [ ] Montgomery (8×64 CIOS) using 64×64→128 via __umul64hi.
  - [ ] Grid-stride loop; each thread processes K nonces (configurable).
  - [ ] Return y (normal domain) to the host for SHA3; parity tests vs CPU on small ranges.
- [ ] Host launcher: buffer management, range partitioning, polling.

GPU — Stage G2 (functional end-to-end)

- [ ] Device SHA3-512 (Keccak-f[1600], 24 rounds).
- [ ] Early-exit global atomic; ordered result writes; host polling.
- [ ] Threshold check on device; write candidate with nonce + distance.
- [ ] Parity vs cpu-fast for small ranges; apples-to-apples hash_count.

GPU — Stage G3 (tuning)

- [ ] Tune K (nonces/thread), block sizes, occupancy; measure.
- [ ] Unroll Montgomery inner loops; confirm register pressure/occupancy.
- [ ] Unroll Keccak rounds; evaluate instruction-level parallelism vs register use.
- [ ] Optional: multi-GPU orchestration.

Observability & Ops

- [ ] Confirm the backend label metric is scraped and displayed in dashboards.
- [ ] Document recommended Grafana panels for engine/backend comparisons.
- [ ] Keep --progress-chunk-ms consistent across engines when comparing.

Release Readiness

- [ ] CI jobs for feature matrices: default, +montgomery, +cuda (build-only), aarch64 cross-checks.
- [ ] Benchmarks and profiles captured in docs for representative machines (x86_64 ADX, Apple Silicon).

---

## 9) Notes for Contributors

- Keep unsafe code minimal, private, and well-commented (x86_64 ADX, aarch64 intrinsics, CUDA).
- Always add property tests for new backends and end-to-end parity checks.
- Maintain apples-to-apples metrics; do not alter hash_count semantics or chunk cadence without documenting the change.
- Prefer staged changes; correctness before tuning; quantify each improvement with dashboards.