# filcollins storage provider setup

## table of contents

* [overview](#overview)
* [ssh config](#ssh-config)
* [firewall](#firewall)
* [tailscale](#tailscale)
* [hardware](#hardware)
* [required sp processes](#required-sp-processes)
* [long-term storage](#long-term-storage)
* [known issues](#known-issues)

## overview

The storage provider (SP) is running on Protocol Labs managed infrastructure.

https://filfox.info/en/address/f01953925

public ip: 209.94.92.6

public peer id: `12D3KooWNSRG5wTShNu6EXCPTkoH7dWsphKAPrbvQchHa5arfsDC`

---

## ssh config

```
Host worker-gpu-9
  User nonsense
  HostName 209.94.92.6
  ForwardAgent yes
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel QUIET

Host worker-gpu-10
Host worker-cpu-1-1
Host worker-cpu-1-4
```

---

## firewall

ufw is enabled on `worker-gpu-9`, which is our public instance.

Ports 22, 24001 and 2345 are open.
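For reference, a minimal sketch of the matching ufw rules. Assumptions (not confirmed above): default-deny incoming, 24001 being the libp2p listen port and 2345 the lotus-miner API port.

```
# sketch only -- verify against `sudo ufw status verbose` on worker-gpu-9
sudo ufw default deny incoming
sudo ufw allow 22/tcp      # ssh
sudo ufw allow 24001/tcp   # libp2p listen port (assumed)
sudo ufw allow 2345/tcp    # lotus-miner API port (assumed)
sudo ufw enable
```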
---

## tailscale

tailscale / wireguard is installed on all hosts, and a VPN is established with `sofiaminer` so that we can test multi-boost / single LID flows between sofiaminer and filcollins.

If the VPN needs to be restarted (e.g. it doesn't come back up when a machine is rebooted), use:

```
$ sudo tailscale logout
$ sudo tailscale login
```

Login credentials: boostteam55@gmail.com (the password is in OnePassword)

---

## hardware

4 instances: 2 cpu instances and 2 gpu instances.

### cpu instance type

#### cpu

model name: AMD EPYC 7F32 8-Core Processor

#### storage

5 x 1.8TB HDDs running as a striped array, mounted at /mnt/hddvol, designated as scratch area

2 x NVMe drives running as a striped array, mounted at /mnt/nvmevol, designated as scratch area

```
nonsense@worker-cpu-1-1:~$ sudo vgs
  VG        #PV #LV #SN Attr   VSize  VFree
  hddgroup    5   1   0 wz--n-  8.73t  33.65g
  nvmegroup   2   1   0 wz--n- <2.62t <71.54g
```

#### memory

1 TB RAM

### gpu instance type

#### cpu

model name: Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz

#### memory

502 GB RAM

#### gpus

4 x NVIDIA Quadro RTX 6000 (24 GB GDDR6)

### gpu instance - worker-gpu-10

#### storage

500 TB Ceph volume mounted at /mnt/ceph -- designated for long-term storage of sealed/unsealed sectors

3 x 900 GB HDDs running as a striped array, mounted at /storage/d1, designated as scratch area

### gpu instance - worker-gpu-9

#### storage

500 TB Ceph volume mounted at /mnt/ceph -- designated for long-term storage of sealed/unsealed sectors

3 x 900 GB HDDs running as a striped array, mounted at /storage/d1

---

## required sp processes

### lotus daemon

* running in a tmux pane on worker-gpu-9
* uses the `/` filesystem for chain storage

```
lotus daemon &>> lotus-daemon.log
```

### lotus-miner process

* running in a tmux pane on worker-gpu-9
* make sure to export the variable below before starting the miner process

```
export CUDA_VISIBLE_DEVICES="0,1"

lotus-miner run &>> lotus-miner.log
```

### lotus-worker processes - sealing

1. running on worker-cpu-1-1 and worker-cpu-1-4

```
lotus-worker run --no-default --addpiece --precommit1 --data-cid
```

2. running on worker-gpu-10 (tmux GPU); sealing disk = /storage/d1/scratch

```
export CUDA_VISIBLE_DEVICES="0,1,2,3"

LOTUS_WORKER_PATH=/home/nonsense/.lotusworker lotus-worker run --no-default --precommit2 --commit --replica-update --prove-replica-update2 --regen-sector-key --name worker-gpu-10-worker0
```

3. running on worker-gpu-10 (tmux Storage); sealing disk = /storage/d1/scratch1

```
# GPU indices on this host are 0-3, so device 4 does not exist --
# presumably set this way to keep the storage worker off the GPUs
export CUDA_VISIBLE_DEVICES="4"

LOTUS_WORKER_PATH=/home/nonsense/.lotusworker1 lotus-worker run --no-default --no-local-storage --listen 0.0.0.0:3457 --name worker-gpu-10-STORAGE
```

### lotus-worker process - windowpost

* running on worker-gpu-9

```
export CUDA_VISIBLE_DEVICES="2"

LOTUS_WORKER_PATH=/home/nonsense/.lotusworkerpost lotus-worker run --no-default --windowpost --no-local-storage --listen 0.0.0.0:3457 --name worker-gpu-9-wdPost
```

### lotus-worker process - winningpost

* running on worker-gpu-9

```
export CUDA_VISIBLE_DEVICES="3"

LOTUS_WORKER_PATH=/home/nonsense/.lotuswinningpost lotus-worker run --no-default --winningpost --no-local-storage --listen 0.0.0.0:3458 --name worker-gpu-9-wnPost
```

### YugabyteDB docker container

* running on worker-gpu-10

```
sudo docker run -d --name yugabyte \
  -p7000:7000 -p9000:9000 -p15433:15433 -p5433:5433 -p9042:9042 \
  -v /home/nonsense/yb-home:/home/yugabyte \
  yugabytedb/yugabyte:latest \
  bin/yugabyted start --base_dir=/home/yugabyte/yb_data --daemon=false
```

### boostd-data service

* running in tmux on worker-gpu-10, started after the YugabyteDB container

```
boostd-data run yugabyte --hosts 127.0.0.1 --connect-string="postgresql://postgres:postgres@127.0.0.1:5433?sslmode=disable" --addr 0.0.0.0:8044
```

### boostd process

* running on worker-gpu-9 as a systemd service - `/etc/systemd/system/boostd.service`
* the incoming staging area is symlinked to Ceph

### additional configurations

* `FinalizeEarly` is set to `true`
* new sectors for deals are disabled

---

## long-term storage

We use Ceph for long-term storage of sealed/unsealed sectors.

* 2 x 500TiB volumes, formatted with ext4
* worker-gpu-10 has access to one Ceph device, mounted at `/mnt/ceph`
* worker-gpu-9 has access to the other Ceph device, mounted at `/mnt/ceph`
* each device is attached to a single machine only, because they are currently RBD volumes (RWO)

---

## known issues

### pruning of chain data

At the moment we must manually prune the chain data on worker-gpu-9, as we are still not using splitstore with the discard store:

```
cd /mnt/ceph/tmp/ && aria2c -x5 https://snapshots.mainnet.filops.net/minimal/latest

lotus daemon stop

rm -rf /home/nonsense/.lotus/datastore/chain/
rm -rf /home/nonsense/.lotus/datastore/splitstore/

lotus daemon --import-snapshot /mnt/ceph/tmp/<snapshot-name>.car
```

### sealing pipeline gets blocked

The pipeline gets blocked when trying to AddPiece if there are no extended Available sectors for snap deals.

workaround: periodically run the `./new-extend-sectors.sh` script to extend sectors

TODO: run it on a daily cron

### lotus-miner almost never shuts down gracefully

When we want to restart `lotus-miner`, we have to `kill -9` it:

```
ps -ef | grep "lotus-miner run" | head -1 | awk '{print $2}' | xargs kill -9
```

### replace username with something more generic

Replace the `nonsense` user with `filadmin` or something more generic.

### no backups for repos

We should add periodic backups for all important repos -- lotus-miner, for a start.
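As a starting point, a sketch of a daily cron entry using `lotus-miner backup`; the binary path, the target directory on Ceph and the schedule are assumptions, not the current setup:

```
# hypothetical daily backup of the lotus-miner metadata repo to Ceph
# (add via `crontab -e`; note that `%` must be escaped in crontab entries)
0 3 * * * /usr/local/bin/lotus-miner backup /mnt/ceph/backups/lotus-miner-$(date +\%Y\%m\%d).bak
```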