# Quantus External Miner — Roadmap (Updated)
This roadmap updates the original optimization and GPU offload plan to reflect:
- New CPU microarchitecture-specific backends (Montgomery CIOS + BMI2/ADX + aarch64 UMULH).
- Backend-aware metrics for apples-to-apples comparisons.
- A staged GPU CUDA plan with explicit SHA3 bring-up phases.
Status: actively developed and deployed in production-like environments.
---
## 0) Context & Goals
- Problem: For a given header (32 bytes) and nonce (U512), compute the QPoW distance:
  - y = m^(h + nonce) mod n
  - distance = target XOR SHA3-512(y_be64)
- For consecutive nonces, replace the exponentiation with a single modular multiply per step (see the sketch below):
  - y_{k+1} = (y_k * m) mod n
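A minimal sketch of one incremental step, assuming the num-bigint and sha3 crates; `step` and its signature are illustrative, not the miner's API:

```rust
use num_bigint::BigUint;
use sha3::{Digest, Sha3_512};

/// One incremental step: y <- y*m mod n, then distance = target XOR SHA3-512(y_be64).
fn step(y: &mut BigUint, m: &BigUint, n: &BigUint, target: &[u8; 64]) -> [u8; 64] {
    *y = (&*y * m) % n; // y_{k+1} = y_k * m mod n

    // Normalize y to exactly 64 big-endian bytes (left-padded with zeros).
    let mut y_be = [0u8; 64];
    let bytes = y.to_bytes_be();
    y_be[64 - bytes.len()..].copy_from_slice(&bytes);

    // One-shot digest keeps the sketch short; the production path reuses a
    // single hasher via finalize_reset (see Phase 2 below).
    let hash = Sha3_512::digest(y_be);

    let mut distance = [0u8; 64];
    for i in 0..64 {
        distance[i] = target[i] ^ hash[i];
    }
    distance
}
```

Per nonce, the cost is one 512-bit modular multiply plus one SHA3-512 call, versus a full modular exponentiation in the baseline.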
Goals:
- Maximize hashrate without changing external APIs or protocol semantics.
- Keep correctness parity and apples-to-apples metrics across engines.
- Introduce a high-performance GPU backend with staged, verifiable milestones.
---
## 1) Current Architecture & Engine Selection
Engines:
- cpu-baseline: Reference implementation; per-nonce exponentiation.
- cpu-fast: Incremental path (init y0 once, then y ← y·m mod n).
- cpu-montgomery: Optimized modular multiply using Montgomery arithmetic (512-bit, 8×64 limbs).
  - Portable CIOS (u128 accumulators).
  - x86_64 BMI2-only: MULX-based CIOS (single carry chain).
  - x86_64 BMI2+ADX: MULX + ADCX/ADOX dual carry chains for higher ILP.
  - aarch64 UMULH/ADCS: UMULH for high halves, optimized accumulation.
  - Runtime backend selection (see the sketch after this list):
    - Auto-detects: x86_64-bmi2-adx, x86_64-bmi2, aarch64-umulh, or portable.
    - Override: MINER_MONT_BACKEND=portable|bmi2|bmi2-adx|umulh
  - Per-job precompute cache: n0_inv, R^2 mod n, m_hat.
  - Direct SHA3 from normalized big-endian bytes (no intermediate big-int allocations).
  - Identical MinerEngine semantics and metrics to cpu-fast for apples-to-apples comparisons.
- gpu-cuda (scaffold): Feature-gated. Selection requires building with CUDA support and a working runtime; otherwise the service fails fast (no CPU fallback when gpu-cuda is selected).
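The selection logic can be pictured as follows; a sketch assuming the documented backend names and the MINER_MONT_BACKEND override (`select_backend` and `detect` are illustrative names):

```rust
/// Backend selection sketch: honor the MINER_MONT_BACKEND override first,
/// then fall back to runtime CPU-feature detection.
fn select_backend() -> &'static str {
    if let Ok(forced) = std::env::var("MINER_MONT_BACKEND") {
        return match forced.as_str() {
            "bmi2" => "x86_64-bmi2",
            "bmi2-adx" => "x86_64-bmi2-adx",
            "umulh" => "aarch64-umulh",
            _ => "portable", // includes the explicit "portable" override
        };
    }
    detect()
}

#[cfg(target_arch = "x86_64")]
fn detect() -> &'static str {
    if is_x86_feature_detected!("bmi2") && is_x86_feature_detected!("adx") {
        "x86_64-bmi2-adx"
    } else if is_x86_feature_detected!("bmi2") {
        "x86_64-bmi2"
    } else {
        "portable"
    }
}

#[cfg(target_arch = "aarch64")]
fn detect() -> &'static str {
    "aarch64-umulh" // UMULH/ADCS are baseline on aarch64, so no detection needed
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn detect() -> &'static str {
    "portable"
}
```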
CLI:
- --engine cpu-fast | cpu-montgomery | gpu-cuda | cpu-baseline | cpu-chain-manipulator
- --progress-chunk-ms <ms> controls metrics/cancellation cadence (5s is a good production default).
---
## 2) Metrics & Observability
We preserve identical metrics emission semantics across engines and add a backend label for cpu-montgomery.
Core series (examples):
- miner_hash_rate (gauge): global nonces/sec estimate.
- miner_job_hash_rate{engine,job_id} (gauge)
- miner_thread_hash_rate{engine,job_id,thread_id} (gauge)
- miner_job_hashes_total{engine,job_id} (counter)
- miner_thread_hashes_total{engine,job_id,thread_id} (counter)
- miner_jobs_total{status} (counter)
- miner_effective_cpus (gauge)
Backend info:
- miner_engine_backend{engine="cpu-montgomery", backend="<name>"} = 1
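A minimal emission sketch, assuming the prometheus crate; `register_backend_info` is an illustrative helper. The value is a constant 1 and the labels carry the information:

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

/// Register the info-style backend gauge and set it once at startup.
fn register_backend_info(registry: &Registry, backend: &str) -> prometheus::Result<()> {
    let gauge = IntGaugeVec::new(
        Opts::new("miner_engine_backend", "Selected Montgomery backend"),
        &["engine", "backend"],
    )?;
    registry.register(Box::new(gauge.clone()))?;
    gauge.with_label_values(&["cpu-montgomery", backend]).set(1);
    Ok(())
}
```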
Recommended comparisons:
- Plot per-engine/per-instance hashrate lines.
- Filter or group by backend to compare cpu-fast against cpu-montgomery, and the cpu-montgomery backends against each other.
- Keep chunking equal across engines for fair comparison.
---
## 3) CPU Roadmap (Updated)
Phase 1 — Incremental O(1) y update (DONE)
- Init y0 = m^(h + start_nonce) mod n once per worker.
- Per nonce: y ← y·m mod n.
- Hash: SHA3-512 of y (64 bytes BE); distance = target XOR hash.
Phase 2 — Montgomery multiplication (DONE)
- Portable CIOS path with 8×64 limbs and u128 accumulators (a reference sketch follows this phase).
- Precompute (n0_inv, R^2 mod n) per job; keep y, m in Montgomery domain.
- Before hashing: from_mont(y_hat) → big-endian bytes; reuse a single hasher with finalize_reset.
- Micro-arch backends:
- x86_64 BMI2 (MULX).
- x86_64 BMI2+ADX (MULX + ADCX/ADOX).
- aarch64 UMULH.
- Runtime detection, env overrides, and backend metric added.
- Gains: measurable uplift over cpu-fast; largest on modern x86_64 with ADX, with a solid uplift on Apple Silicon as well.
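For reference, a portable CIOS sketch matching the shape above (8×64 little-endian limbs, u128 accumulators, n0_inv = -n^-1 mod 2^64); this is illustrative code, not the miner's exact implementation:

```rust
const LIMBS: usize = 8; // 512 bits as 8 x 64-bit limbs, little-endian

/// Portable CIOS Montgomery multiply: returns a*b*R^-1 mod n for R = 2^512.
/// `n0_inv` is -n^-1 mod 2^64, precomputed once per job.
fn mont_mul(a: &[u64; LIMBS], b: &[u64; LIMBS], n: &[u64; LIMBS], n0_inv: u64) -> [u64; LIMBS] {
    let mut t = [0u64; LIMBS + 2];
    for i in 0..LIMBS {
        // t += a * b[i], using u128 for the 64x64 -> 128 products.
        let mut carry = 0u128;
        for j in 0..LIMBS {
            let acc = t[j] as u128 + a[j] as u128 * b[i] as u128 + carry;
            t[j] = acc as u64;
            carry = acc >> 64;
        }
        let acc = t[LIMBS] as u128 + carry;
        t[LIMBS] = acc as u64;
        t[LIMBS + 1] = (acc >> 64) as u64;

        // Reduction: pick m so the low limb cancels, add m*n, shift one limb.
        let m = t[0].wrapping_mul(n0_inv);
        let mut carry = (t[0] as u128 + m as u128 * n[0] as u128) >> 64;
        for j in 1..LIMBS {
            let acc = t[j] as u128 + m as u128 * n[j] as u128 + carry;
            t[j - 1] = acc as u64;
            carry = acc >> 64;
        }
        let acc = t[LIMBS] as u128 + carry;
        t[LIMBS - 1] = acc as u64;
        t[LIMBS] = t[LIMBS + 1] + (acc >> 64) as u64;
        t[LIMBS + 1] = 0;
    }

    // Final conditional subtraction so the result is fully reduced mod n.
    let mut r = [0u64; LIMBS];
    r.copy_from_slice(&t[..LIMBS]);
    let ge_n = t[LIMBS] != 0
        || (0..LIMBS).rev().find(|&j| r[j] != n[j]).map_or(true, |j| r[j] > n[j]);
    if ge_n {
        let mut borrow = 0u64;
        for j in 0..LIMBS {
            let (d1, b1) = r[j].overflowing_sub(n[j]);
            let (d2, b2) = d1.overflowing_sub(borrow);
            r[j] = d2;
            borrow = (b1 | b2) as u64;
        }
    }
    r
}
```

The BMI2 and ADX backends keep this structure but replace the u128 arithmetic with MULX and ADCX/ADOX carry chains.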
Phase 3 — Micro-optimizations (PARTIAL / BACKLOG)
- Optional CPU SIMD SHA3 path (deprioritized if GPU is primary focus).
- Loop unrolling/batching; further allocator avoidance.
- Thread affinity / NUMA tuning for large nodes.
- Micro-benchmark harness (ns/op per backend) to report at startup (feature-gated).
---
## 4) GPU CUDA Roadmap (Staged)
Design: offload the per-nonce loop to the GPU; the CPU prepares per-job constants and orchestrates nonce ranges.
What stays on CPU (per job):
- get_random_rsa(header), Miller–Rabin for n.
- Precompute target, threshold, and Montgomery parameters.
- Range partitioning, host polling, HTTP/metrics.
What runs on GPU (per thread/block):
- Either compute y0 on device (windowed exponentiation) or upload per-thread starting points.
- Per nonce:
  - y_hat ← mont_mul(y_hat, m_hat)
  - y ← from_mont(y_hat)
  - hash ← SHA3-512(y_be64)
  - distance = target XOR hash; compare to threshold
- Early exit via an atomic flag and ordered result writes (a host-side sketch follows this list).
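A host-side sketch of the partitioning and early-exit pattern (std only; `check_nonce` stands in for the real mont_mul + SHA3 + threshold check):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Split [start, start + count) into `workers` contiguous sub-ranges
/// (assumes workers >= 1).
fn partition(start: u64, count: u64, workers: u64) -> Vec<(u64, u64)> {
    let (per, rem) = (count / workers, count % workers);
    let mut ranges = Vec::with_capacity(workers as usize);
    let mut cursor = start;
    for w in 0..workers {
        let len = per + u64::from(w < rem);
        ranges.push((cursor, len));
        cursor += len;
    }
    ranges
}

/// Worker loop: scan a sub-range, checking the shared "found" flag at a
/// coarse cadence so the early-exit check stays cheap.
fn scan(found: &AtomicBool, start: u64, len: u64) {
    for nonce in start..start + len {
        if nonce % 1024 == 0 && found.load(Ordering::Relaxed) {
            return; // another worker already found a solution
        }
        if check_nonce(nonce) {
            found.store(true, Ordering::Relaxed);
            return;
        }
    }
}

fn check_nonce(_nonce: u64) -> bool {
    false // placeholder for the real distance check
}
```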
Stages:
- G1: Montgomery on device; SHA3 on host (bring-up).
  - Implement Montgomery (8×64) on the GPU using 64×64→128 multiplies via __umul64hi.
  - For a test nonce window, return y to the host → SHA3-512 → parity checks vs CPU.
- G2: SHA3 on device + early-exit + threshold checks.
  - Implement Keccak-f[1600] (24 rounds), optimized for the 64-byte absorb per step (see the padding sketch after this list).
  - Use constant memory for n, m_hat, n0_inv, R^2, target, threshold.
  - Grid-stride loop; each thread processes K nonces; tune K.
  - Atomics for a global “found” flag; per-thread batched hash counts.
- G3: Tuning & scale-out.
  - Unroll Montgomery limbs; control register pressure to hit occupancy targets.
  - Unroll Keccak rounds; minimize divergence; consider warp specialization if beneficial.
  - Multi-GPU orchestration (deferred until the single-GPU path stabilizes).
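One detail worth noting for G2: SHA3-512 has a 72-byte rate, so the 64-byte y always fits in a single absorb. A standalone sketch of that padded block (pad10*1 plus the SHA-3 domain bits):

```rust
/// Build the single 72-byte rate block for SHA3-512 over a 64-byte message:
/// message || 0x06 (domain separator + first pad bit) || zeros || 0x80 (last pad bit).
fn sha3_512_single_block(y_be64: &[u8; 64]) -> [u8; 72] {
    let mut block = [0u8; 72];
    block[..64].copy_from_slice(y_be64);
    block[64] = 0x06;
    block[71] |= 0x80;
    block
}
```

Each hash therefore costs exactly one Keccak-f[1600] permutation, which is why the single-block absorb path is worth optimizing.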
Selection & build:
- The binary must be built with the CUDA feature to select gpu-cuda:
  - Build: cargo build -p miner-cli --features cuda
  - Runtime: --engine gpu-cuda (errors and exits if CUDA is unavailable)
- Until the full kernel is implemented, selection acts as a guardrail (see the sketch below).
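A sketch of that guardrail using a cfg-gated constructor (the enum and function names are illustrative):

```rust
/// Illustrative engine constructor: gpu-cuda is constructible only when the
/// binary was built with the `cuda` feature; otherwise selection is a hard error.
enum Engine {
    CpuFast,
    CpuMontgomery,
    #[cfg(feature = "cuda")]
    GpuCuda,
}

fn make_engine(name: &str) -> Result<Engine, String> {
    match name {
        "cpu-fast" => Ok(Engine::CpuFast),
        "cpu-montgomery" => Ok(Engine::CpuMontgomery),
        #[cfg(feature = "cuda")]
        "gpu-cuda" => Ok(Engine::GpuCuda), // the real path also validates the CUDA runtime
        #[cfg(not(feature = "cuda"))]
        "gpu-cuda" => Err("gpu-cuda requires a build with --features cuda".to_string()),
        other => Err(format!("unknown engine: {other}")),
    }
}
```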
---
## 5) Testing & Validation
CPU:
- Property tests:
  - Portable Montgomery vs pow-core step_mul across many steps.
  - Backend parity: bmi2 / bmi2-adx / umulh vs portable on supported hosts.
- End-to-end:
  - cpu-montgomery vs cpu-fast parity for small ranges (distance, winning nonce, hash_count); a parity-test sketch follows this list.
- In production:
  - Hold chunking equal across engines for fair metrics.
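A parity-test sketch for the end-to-end check, assuming the proptest and num-bigint crates (the real suite runs against pow-core and the actual engines):

```rust
use num_bigint::BigUint;
use proptest::prelude::*;

proptest! {
    #[test]
    fn incremental_matches_exponentiation(
        m_bytes in any::<[u8; 8]>(),
        n_bytes in any::<[u8; 64]>(),
        h in 0u64..1_000_000,
        steps in 1usize..32,
    ) {
        let n = BigUint::from_bytes_be(&n_bytes);
        prop_assume!(n > BigUint::from(1u8));
        let m = BigUint::from_bytes_be(&m_bytes) % &n;
        prop_assume!(m > BigUint::from(0u8));

        // cpu-fast style: one exponentiation, then one multiply per nonce.
        let mut y = m.modpow(&BigUint::from(h), &n);
        for k in 1..=steps as u64 {
            y = (&y * &m) % &n;
            // cpu-baseline style: exponentiate from scratch for this nonce.
            let expected = m.modpow(&BigUint::from(h + k), &n);
            prop_assert_eq!(&y, &expected);
        }
    }
}
```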
GPU:
- G1: Device Montgomery → host SHA3 parity on test vectors (small ranges).
- G2: Full parity against cpu-fast on small ranges with on-device SHA3.
- Randomized sampling (optional shadow mode): compute device results for a small fraction of nonces and cross-validate against the CPU.
---
## 6) Rollout Strategy
- Keep public API stable; identical HTTP/service semantics.
- Maintain apples-to-apples metrics (same hash_count increment semantics).
- Enable cpu-montgomery by default; auto-select the best backend.
- Gate gpu-cuda behind feature + runtime validation; fail fast when not supported.
- Use dashboards tagged by engine and backend to monitor regressions and wins.
---
## 7) Risks & Mitigations
- Big-int correctness bugs:
  - Extensive property tests; parity against pow-core reference; backend parity checks.
- SHA3 mismatch:
  - Use test vectors; ensure distance parity vs CPU.
- Early-exit race conditions:
  - Device-global atomic, ordered writes; host-side polling cadence tuned.
- Performance portability:
  - Provide portable CPU path; architecture-specialized CPU backends; staged GPU tuning.
---
## 8) Concise Engineering Checklist
CPU (backlog / polish)
- [ ] Optional: CPU SIMD SHA3 path (feature-gated), retain parity tests.
- [ ] Micro-bench harness to print ns/op for mont_mul backends at startup.
- [ ] Affinity/NUMA tuning hooks (optional CLI/env) for large servers.
GPU — Stage G1 (bring-up)
- [ ] Add build.rs to compile kernel.cu to PTX (feature: cuda); detect nvcc and emit clear errors on failure.
- [ ] Device constants: upload n, m, n0_inv, R^2, m_hat, target, threshold.
- [ ] Kernel v1:
  - [ ] Montgomery (8×64 CIOS) using 64×64→128 via __umul64hi.
  - [ ] Grid-stride loop; each thread processes K nonces (configurable).
  - [ ] Return y (normal domain) to host for SHA3; parity tests vs CPU on small ranges.
- [ ] Host launcher: buffer management, range partitioning, polling.
GPU — Stage G2 (functional end-to-end)
- [ ] Device SHA3-512 (Keccak-f[1600], 24 rounds).
- [ ] Early-exit global atomic; ordered result writes; host polling.
- [ ] Threshold check on device; write candidate with nonce + distance.
- [ ] Parity vs cpu-fast for small ranges; apples-to-apples hash_count.
GPU — Stage G3 (tuning)
- [ ] Tune K (nonces/thread), block sizes, occupancy; measure.
- [ ] Unroll Montgomery inner loops; confirm register pressure/occupancy.
- [ ] Unroll Keccak rounds; evaluate instruction-level parallelism vs register use.
- [ ] Optional: multi-GPU orchestration.
Observability & Ops
- [ ] Confirm backend label metric is scraped and displayed in dashboards.
- [ ] Document recommended Grafana panels for engine/backend comparisons.
- [ ] Keep --progress-chunk-ms consistent across engines when comparing.
Release Readiness
- [ ] CI jobs for feature matrices: default, +montgomery, +cuda (build-only), aarch64 cross-checks.
- [ ] Benchmarks and profiles captured in docs for representative machines (x86_64 ADX, Apple Silicon).
---
## 9) Notes for Contributors
- Keep unsafe code minimal, private, and well-commented (x86_64 ADX, aarch64 intrinsics, CUDA).
- Always add property tests for new backends and end-to-end parity checks.
- Maintain apples-to-apples metrics; do not alter hash_count semantics or chunk cadence without documenting the change.
- Prefer staged changes; correctness before tuning; quantify each improvement with dashboards.