# Miden GPU acceleration

The goal of the project is to enable acceleration of Miden VM proof generation on GPUs, specifically for Metal and CUDA.

We already have support for Metal acceleration on Apple silicon. This is done via the [ministark-gpu](https://crates.io/crates/ministark-gpu) crate by implementing a GPU-accelerated prover [here](https://github.com/0xPolygonMiden/miden-vm/blob/next/prover/src/gpu.rs).

Currently, we accelerate two steps of STARK proof generation:

1. Computing trace commitment (see [here](https://github.com/0xPolygonMiden/miden-vm/blob/next/prover/src/gpu.rs#L381)).
2. Computing constraint commitment (see [here](https://github.com/0xPolygonMiden/miden-vm/blob/next/prover/src/gpu.rs#L104)).

These two steps are by far the most expensive, and for this project, we are not looking to accelerate anything else.

This project is broken down into two phases described below.

## Metal acceleration improvements

The goals of this phase are:

1. Move the relevant GPU code into a new crate under the [0xPolygonMiden](https://github.com/0xPolygonMiden) org. Probably something like `miden-gpu`.
2. Refactor the code to be more in line with the component-based prover design. For example, instead of implementing a separate `MetalRpoExecutionProver`, it would be ideal to use a generic prover where the [TraceLde](https://github.com/0xPolygonMiden/miden-vm/blob/next/prover/src/lib.rs#L178) associated type is swapped out for `MetalTraceLde`. However, we need to evaluate the pros and cons of this approach.
3. Implement support for the [RPX256](https://github.com/0xPolygonMiden/crypto/blob/next/src/hash/rescue/rpx/mod.rs#L58) hash function (currently, Metal acceleration works only with the RPO hash function).

## CUDA acceleration

The goal of this phase is to enable CUDA acceleration support. We already have CUDA code in a stand-alone repo which is sufficient to offload trace and constraint commitment computations to the GPUs (the CUDA code we have can actually use multiple GPUs in parallel).

As a part of this phase, we'd need to:

1. Move the CUDA code into the `miden-gpu` crate.
   - This may require some refactoring/updates. For example, I believe that the current Rust code accompanying the CUDA code can only be compiled with Rust nightly, but we need it to work with stable.
2. Implement support for the [RPX256](https://github.com/0xPolygonMiden/crypto/blob/next/src/hash/rescue/rpx/mod.rs#L58) hash function (currently, CUDA acceleration works only with the RPO and Poseidon hash functions).
3. Integrate this code into the Miden prover, ideally in a similar manner to the Metal code.

In the VM, CUDA acceleration would be enabled similarly to how Metal acceleration is enabled now, i.e., via something like a `cuda` feature flag. Ideally, the deployment would be as simple as the one for Metal (i.e., no extra setup beyond specifying `miden-gpu` as a dependency), though I'm not sure if this is possible.

Finally, CUDA acceleration should be tested in a variety of settings (e.g., single-GPU, 2-GPU, and 4-GPU deployments).
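
To make the component-based idea from both phases more concrete, here is a minimal sketch of a generic prover whose trace LDE component is chosen by type, with the default picked by a feature flag. Everything in it is an assumption for illustration: `TraceLdeBackend`, `ExecutionProver`, `CpuTraceLde`, `DefaultTraceLde`, `default_prover` and `toy_commit` are hypothetical names, the trace is modeled as plain `u64` columns, and the "commitment" is a toy fold, not RPO or RPX. The real prover traits are considerably richer.

```rust
use std::marker::PhantomData;

// Hypothetical sketch: these traits are simplified stand-ins, not the real
// Winterfell or Miden prover APIs.

/// A backend that commits to an execution trace (columns of field elements,
/// modeled here as plain u64 values).
trait TraceLdeBackend {
    const NAME: &'static str;
    fn commit(trace: &[Vec<u64>]) -> u64;
}

/// Toy commitment: a wrapping polynomial fold over all cells. NOT a real hash.
fn toy_commit(trace: &[Vec<u64>]) -> u64 {
    trace
        .iter()
        .flatten()
        .fold(0u64, |acc, &x| acc.wrapping_mul(31).wrapping_add(x))
}

struct CpuTraceLde;
impl TraceLdeBackend for CpuTraceLde {
    const NAME: &'static str = "cpu";
    fn commit(trace: &[Vec<u64>]) -> u64 {
        toy_commit(trace)
    }
}

/// A real implementation would dispatch to Metal kernels here; this stand-in
/// reuses the CPU path so that the example runs anywhere.
struct MetalTraceLde;
impl TraceLdeBackend for MetalTraceLde {
    const NAME: &'static str = "metal";
    fn commit(trace: &[Vec<u64>]) -> u64 {
        toy_commit(trace)
    }
}

/// Prover with a swappable trace LDE component.
trait Prover {
    type TraceLde: TraceLdeBackend;

    fn backend(&self) -> &'static str {
        <Self::TraceLde as TraceLdeBackend>::NAME
    }

    fn commit_trace(&self, trace: &[Vec<u64>]) -> u64 {
        <Self::TraceLde as TraceLdeBackend>::commit(trace)
    }
}

/// One generic prover instead of a dedicated struct per backend.
struct ExecutionProver<L: TraceLdeBackend> {
    _backend: PhantomData<L>,
}

impl<L: TraceLdeBackend> ExecutionProver<L> {
    fn new() -> Self {
        Self { _backend: PhantomData }
    }
}

impl<L: TraceLdeBackend> Prover for ExecutionProver<L> {
    type TraceLde = L;
}

// Backend selection through a Cargo feature (the `metal` feature name is
// assumed here; a `cuda` feature could be handled the same way).
#[cfg(feature = "metal")]
type DefaultTraceLde = MetalTraceLde;
#[cfg(not(feature = "metal"))]
type DefaultTraceLde = CpuTraceLde;

/// Builds a prover for whichever backend the enabled features select.
fn default_prover() -> ExecutionProver<DefaultTraceLde> {
    ExecutionProver::new()
}
```

With this shape, `ExecutionProver::<MetalTraceLde>::new().commit_trace(&trace)` and `default_prover().commit_trace(&trace)` share all prover logic, and a hypothetical `CudaTraceLde` would only need its own `TraceLdeBackend` impl. Whether the real prover can be factored this cleanly is exactly the open question noted in the Metal phase.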