# HP Cluster Deployment

## Introduction

> The HP cluster hosts the services that serve functionality over Vectors, Adjacency and TokenCollections data.

The services are as follows:

| Service | Data | Functionality |
| - | - | - |
| Gateway | - | Only point of contact to the cluster; serves requests to the external world |
| Vspace | VectorSpace's word and index data | Vector computations |
| VspaceShard | VectorSpace's vectors as shards (chunks) | Vector computations; serves Vspace |
| Subspace | VectorSpace for the TokenCollections | Vector computations for TokenCollections |
| Adjacency | Adjacency data | Adjacency/co-occurrence computations |
| TokenCollections | - | TokenCollections add/update/delete operations |
| Semantics | - | Semantic operations over Subspace and Adjacency data |
| Temporal | - | Temporal operations over a corpus |
| Misc | - | Miscellaneous operations over any data in the cluster |

## Server estimation based on the 2018 Datagen

> Note: The following data sizes are approximate values.

| Service(s) | Instances per server | Servers | Data size |
| ------- | ---------- | ------ | ----- |
| Gateway | 1 | 2 | - |
| Temporal | 2 | Gateway server | - |
| TokenCollections | 2 | Gateway server | - |
| Semantics | 4 | Gateway server | - |
| Redis | 3 | 1 | < 1 GB (variable) |
| Adjacency | 12 | 2 | 1.4 TB |
| Vspace | 12 | 2 | 0.8 TB |
| VspaceShards | 12 | 7 | 4 TB |
| Subspace | 12 | 10 | 4.8 TB |
| Total | | 23 | 11 TB |

## Server estimation based on the 2019 Datagen

> Note: The following data sizes are approximate values.

| Service(s) | Instances per server | Servers | Data size |
| ------- | ---------- | ------ | ----- |
| Gateway | 1 | 2 | - |
| Temporal | 2 | Gateway server | - |
| TokenCollections | 2 | Gateway server | - |
| Semantics | 4 | Gateway server | - |
| Redis | 3 | 1 | < 1 GB (variable) |
| Adjacency | 12 | 3 | ~2 TB |
| Vspace | 12 | 2 | ~0.8 TB |
| VspaceShards | 12 | 14 | ~5 TB |
| Subspace | 12 | 13 | ~6 TB |
| Total | | 31 | ~14 TB |

## Server config

| Service Provider | Model | Disk | Config | RAM | Processor | Cores |
| -- | -- | -- | -- | -- | -- | -- |
| Hetzner | EX-61 | NVMe (1 TB) | RAID 0 | 64 GB | Intel® Core™ i7-8700 | 12 |
| OVH | SP-64 | NVMe (1 TB) | RAID 0 | 64 GB | Xeon E3-1270v6 | 8 |

## Deployment

> Every service in the HP cluster runs on Docker. We use a Docker swarm to integrate all the services into a cluster. Docker swarm makes deployment, security and internal service communication easier.

1. **Docker installation**
    ```
    $ sudo apt-get remove docker docker-engine docker.io docker-ce
    $ sudo apt-get update
    $ sudo apt-get install \
        linux-image-extra-$(uname -r) \
        linux-image-extra-virtual
    $ sudo apt-get update
    $ sudo apt-get install \
        apt-transport-https \
        ca-certificates \
        curl \
        software-properties-common
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    $ sudo apt-key fingerprint 0EBFCD88
    $ sudo add-apt-repository \
        "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) \
        stable"
    $ sudo apt-get update
    $ sudo apt-get install docker-ce=18.06.0~ce-0~ubuntu
    ```
2. **Docker login** (credentials will be shared)
    ```
    $ sudo docker login
    ```
3. **Disable UFW on all the machines**
    - Will update this step with instructions for allowing specific ports.

## Create Docker swarm

1. **Initiate a docker swarm on one of the machines (usually Gateway) - Leader node**
    - Initiate a docker swarm:
        ```
        $ sudo docker swarm init
        ```
    - The output of the above command looks like the following:
        ```
        docker swarm join --token <auto_generated_token>
        ```
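    - Optionally, verify the swarm state and recover the join command if the output above is lost. These are standard Docker CLI commands, shown here only as a convenience:
        ```
        # Should print "active" on the leader after a successful init
        $ sudo docker info --format '{{ .Swarm.LocalNodeState }}'
        # Re-print the full join command (worker or manager token) at any time
        $ sudo docker swarm join-token worker
        $ sudo docker swarm join-token manager
        ```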
2. **Add the other machines into the swarm as workers and managers**
    - Add workers to the swarm
        - To join a new machine to the swarm, run the join command obtained above on the new worker machine.
        - By default, all machines are added as workers, and every machine is called a node.
        - To list all the nodes in the swarm, run the following command on the leader node:
            ```
            $ sudo docker node ls
            ```
        - The status of a node is displayed as:
            - Leader - `leader`
            - Managers - `reachable`
            - Workers - empty
    - Add managers to the swarm
        - Any node in the swarm can be promoted to a manager. This step is not mandatory for creating the cluster.
            ```
            $ sudo docker node promote <node_host_name>
            ```
    - Remove a manager
        - To remove a manager, demote it to a worker:
            ```
            $ sudo docker node demote <node_host_name>
            ```
    - Remove a node from the swarm
        - Run the following command on the node machine that is to leave the swarm:
            ```
            $ sudo docker swarm leave
            $ sudo docker swarm leave --force   // If the node is a manager
            ```
3. **Create a swarm network**
    - Create a local network for the swarm, so that all the Docker containers run on the same network.
    - Run the following command on the leader node machine:
        ```
        $ sudo docker network create --attachable --driver=overlay --subnet=172.28.0.0/16 <network_name>
        ```
        > Currently we use `<network_name>` = `backend`
    - Check that the network has been created properly:
        ```
        $ sudo docker network ls
        ```
4. **Add labels to the service nodes**
    - In order to identify the services, each node is labelled with the name of its service, which helps in deploying services across multiple nodes.
    - Assign the labels to machines/nodes as per the number of machines allocated to the respective service.
    - Run the following command on the leader node machine:
        ```
        $ sudo docker node update --label-add type=<service> <server>
        ```
        > `<service>` - vspace, vspaceshard, subspace, adj
        > `<server>` - node_host_name
    - Check whether the labels are added properly:
        ```
        $ sudo docker node ls -q | xargs sudo docker node inspect -f '{{ .ID }} [{{ .Description.Hostname }}]: {{ .Spec.Labels }}'
        ```

## Redis Installation

> Create Redis on an independent machine that doesn't host any other service.

1. **Clone the GitHub repo**
    ```
    $ git clone https://github.com/lumenbiomics/nferx_tokencollections
    $ cd nferx_tokencollections/scripts
    ```
    - `redis_setup.sh` is an automation script for installing and uninstalling multiple instances of Redis.
2. **Installation and configuration**
    - We install 3 instances of Redis, one per port:
        > TokenCollections - 6379
        > Cache - 6380
        > ClusterState - 6381
    - We use a default password, which is shown at the end of the installation.
    - Do the following steps for each of the above-mentioned instances.
        - Install Redis
            ```
            $ sudo ./redis_setup.sh install <port>
            ```
        - Start Redis
            ```
            $ sudo systemctl start redis_<port>
            ```
        - Stop Redis
            ```
            $ sudo systemctl stop redis_<port>
            ```
        - Restart Redis
            ```
            $ sudo systemctl restart redis_<port>
            ```
        - Status of Redis
            ```
            $ sudo systemctl status redis_<port>
            ```
        - Enable Redis auto-restart when the machine reboots
            ```
            $ sudo systemctl enable redis_<port>
            ```
        - Uninstall a Redis instance (not necessary)
            ```
            $ sudo ./redis_setup.sh uninstall <port>
            ```
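    - Optionally, sanity-check that all three instances are reachable. This assumes `redis-cli` is available on the machine; `<password>` is the password printed at the end of the installation. Each command should reply with `PONG`:
        ```
        $ redis-cli -p 6379 -a <password> ping    # TokenCollections
        $ redis-cli -p 6380 -a <password> ping    # Cache
        $ redis-cli -p 6381 -a <password> ping    # ClusterState
        ```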
## Create Data paths and log files

> Data paths can be created using `sshfs` or `s3fs` (read-only). Currently, the existing servers are set up with the `sshfs` mount using an `fstab` config. Ideally, for security, we will move to `s3fs` or another read-only mount option.

> The following steps are to be done on all the machines that are labelled as vspace, vspaceshard, subspace and adj, accordingly.

- **Create a mount point using s3fs**
    1. Install the required packages
        ```
        $ sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev pkg-config libssl-dev libfuse-dev
        ```
    2. Install FUSE
        ```
        $ cd /usr/src/
        $ sudo wget https://github.com/libfuse/libfuse/releases/download/fuse-3.0.0/fuse-3.0.0.tar.gz
        $ sudo tar xzf fuse-3.0.0.tar.gz
        $ cd fuse-3.0.0
        $ sudo ./configure --prefix=/usr/local
        $ sudo make && sudo make install
        $ export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
        $ sudo ldconfig
        $ sudo modprobe fuse
        ```
    3. Install s3fs
        ```
        $ sudo apt-get -y install automake
        $ git clone https://github.com/s3fs-fuse/s3fs-fuse.git
        $ cd s3fs-fuse
        $ sudo ./autogen.sh
        $ sudo ./configure
        $ sudo make
        $ sudo make install
        ```
    4. Check the installation
        ```
        $ s3fs --help
        ```
    5. Make a credentials file
        ```
        $ vi ~/.passwd-s3fs
        $ echo <aws-key>:<aws-secret> > ~/.passwd-s3fs
        $ chmod 600 ~/.passwd-s3fs
        ```
    6. Create the mount folder
        ```
        $ sudo mkdir <s3mount_dir> && sudo chown <user>:<user> <s3mount_dir>
        ```
        > Example: `$ sudo mkdir /nferx_data/ && sudo chown deepcompute:deepcompute /nferx_data/`
        > `user` - the user that runs the services, i.e. the same user should be configured in the `docker.yml` files in the later steps
    7. Mount the s3 bucket to the created folder
        ```
        $ s3fs <s3_bucket_name> <s3mount_dir>
        ```
        > Example: `s3fs nfer-server-data /nferx_data/`
        > Currently data is not available in s3, will update this step once it is done (TODO)
    8. Add an fstab entry for remounting s3 after a server reboot
        ```
        $ vi /etc/fstab
        ```
        Add the following entry to the fstab file:
        ```
        s3fs#<bucket_name> <mount_dir> fuse _netdev,allow_other,passwd_file=/home/nferx/.passwd-s3fs,umask=227,uid=33,gid=33,use_cache=/root/cache 0 0
        ```
- **Create a mount point using sshfs**
    1. Create SSH public and private keys on all the node machines labelled vspace, vspaceshard, adj and subspace
        - Command to create keys on the node machine:
            ```
            $ ssh-keygen -t rsa -b 4096 -C <service-name> -f mykey
            ```
        - After this, the two files `mykey` and `mykey.pub` are generated in the current directory.
        - Move these two files to the `~/.ssh` folder (create one if you don't already have one).
        - Rename `mykey` to `id_rsa`.
        - Rename `mykey.pub` to `id_rsa.pub`.
        - The file with the `.pub` extension is the public key.
    2. Add the SSH public key of each node machine to the `authorized_keys` file of the data machine (`hp.dataserver.servers.nferx.com`)
        ```
        $ vi ~/.ssh/authorized_keys
        ```
        > Copy the node machine's public key into this file.
    3. Create an sshfs mount point
        - If there is already an sshfs mount point with `sshfsmnt` as the dir, unmount it and delete the dir:
            ```
            $ fusermount -u <mount_dir>
            $ sudo rm -r <mount_dir>
            ```
            > Example: `$ fusermount -u /sshfsmnt && sudo rm -r /sshfsmnt`
        - Create a new mount path with dir `sshfsmnt`:
            ```
            $ echo <password> | sudo -S mkdir <mount_dir> && sudo chown <user>:<user> <mount_dir> && sshfs -o allow_other <central_mount_path> /<mount_dir>/
            ```
    4. fstab configuration
        - Add the following entry to `/etc/fstab` on all data-holding machines so the mount is restored automatically whenever the node machine restarts:
            ```
            <Central_data_location> /<mount_dir>/ fuse.sshfs _netdev,user,idmap=user,transform_symlinks,identityfile=/home/deepcompute/.ssh/id_rsa,allow_other,default_permissions,uid=1000,gid=1000 0 0
            ```
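- **Verify the mount (optional)**
    - Whichever mount type is used, it is worth confirming that it is active and readable before creating the data paths. A minimal check with standard Linux tools, using the `<mount_dir>` from the steps above:
        ```
        # The mount should be listed as fuse.sshfs or fuse.s3fs
        $ mount | grep -i fuse
        # Confirm the mount point reports the remote filesystem and is readable
        $ df -h /<mount_dir>
        $ ls /<mount_dir> | head
        ```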
- **Create the required data paths and log files**
    - We usually run 12 processes per machine, which has a 12-core processor. So, we create data directories and log files from `<service>0` to `<service>11`. Do the following for the vspace, vspaceshard, subspace and adj labelled machines.
    1. Data paths
        ```
        $ for i in {0..11}; do sudo mkdir -p <data_path>/<service>$i; done
        $ sudo chown -R <user>:<user> <data_path>
        ```
        > Currently we use the following pattern:
        ```
        $ for i in {0..11}; do sudo mkdir -p /data/nferx_cluster/cache/vspace/2M/vspace$i; done
        $ sudo chown -R deepcompute:deepcompute /data
        ```
    2. Log files
        ```
        $ for i in {0..11}; do sudo touch <log_path>/<service>$i.log; done
        ```
        > Currently we use the following pattern:
        ```
        $ for i in {0..11}; do sudo touch /var/log/nferx_cluster/vspace$i.log; done
        ```

## Bring up services

> As all the services run on Docker, we bring up the Docker containers using the swarm from the leader machine to run the HP services on the individual nodes.

- **Docker yml files**
    > Currently the yml files are available at the following link. They will be checked into GitHub and the links will be updated here (TODO).
    1. Download the tar file from the link below, which contains generic docker yml files for the HP services:
        - https://drive.google.com/drive/folders/187GM2yuMc1MO78ehmQc4s8VzTc9R_znW?usp=sharing
    2. Untar the downloaded file:
        ```
        $ tar -xf yml_generic.tar
        ```
        List of yml files with respect to their services:
        > gateway - nferx_gateway.yml
        > vspace - nferx_vspace.yml
        > vspace_shard - nferx_vspace_shard.yml
        > adjacency - nferx_adj.yml
        > subspace - nferx_subspace.yml
        > tokencollections - nferx_tokencollections.yml
        > temporal - nferx_temporal.yml
        > semantics - nferx_semantics.yml
        > misc - nferx_misc.yml
    3. Configure the following variables in the docker yml files accordingly:
        - `<version>` - replace all instances with the image version
        - `<data-path>` - replace all instances with your created data path
        - `<redis-host>` - replace all instances with the Redis node machine
        - `<password>` - replace all instances with the given equivalent password
        - `<hostname>` - replace all instances with the host name of the machine
- **Run services**
    1. Run the services from each folder using the following commands:
        - Deploy a service stack
            ```
            $ sudo docker stack deploy --with-registry-auth -c <service-yml-file> <service>
            ```
            > sudo docker stack deploy --with-registry-auth -c nferx_vspace.yml vspace
        - Remove a service stack
            ```
            $ sudo docker stack rm <service>
            ```
            > sudo docker stack rm vspace
        - Check the services
            ```
            $ sudo docker service ls
            ```
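    2. If a stack does not come up cleanly, the standard swarm commands below help locate the failing tasks. This is an optional troubleshooting sketch; the inner service names follow Docker's `<stack>_<service>` convention, so the exact names depend on the yml files:
        ```
        # Show where the tasks of a stack were scheduled and their current state
        $ sudo docker stack ps <service>
        # Tail the logs of one service in the stack
        $ sudo docker service logs --tail 100 <service>_<service_name>
        ```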
## Host Data - Loader script

1. The following link contains `dataset_loading_script.py`:
    - https://drive.google.com/drive/folders/187GM2yuMc1MO78ehmQc4s8VzTc9R_znW?usp=sharing
2. Download the above python file and make the necessary changes as below:
    - `xxx` - replace with the gateway machine name
    - `yyy` - replace with the redis machine name
3. Command to run the above python script:
    ```
    $ ls /path/to/dataset | python3 dataset_loading_script.py
    ```
    > $ ls -d vspace:* | python3 dataset_loading_script.py

#### Note:
> Skip the steps below (creation of the ES cluster) if you have already brought up an ES cluster.

## HP Cluster ES

- Use the following steps to create an Elasticsearch cluster.
- On a new machine, install Elasticsearch using the following commands:
    - Elasticsearch requires at least Java 7, so install Java if it is not present:
        ```
        $ sudo apt-get update
        $ sudo apt-get install default-jre
        $ sudo apt-get install default-jdk
        ```
    - Check the installation:
        ```
        $ java -version
        ```
    - Follow the steps below to install Elasticsearch:
        ```
        $ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
        $ sudo apt-get install apt-transport-https
        $ echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
        $ sudo apt-get update && sudo apt-get install elasticsearch
        ```
    - Using a sudo editor, uncomment the following lines in `/etc/default/elasticsearch`:
        ```
        ES_HOME=/usr/share/elasticsearch
        ES_PATH_CONF=/etc/elasticsearch
        ```
    - In `/etc/elasticsearch/elasticsearch.yml`, find and change the following line:
        ```
        network.host: 0.0.0.0
        ```
    - To start the service:
        ```
        $ sudo /bin/systemctl daemon-reload
        $ sudo /bin/systemctl enable elasticsearch.service
        $ sudo systemctl start elasticsearch.service
        ```
    - Check the status using: `sudo systemctl status elasticsearch`
    - If any error occurs, check the logs using the instructions below:
        - Check `/var/log/elasticsearch` for logs.
        - Use `sudo journalctl -xe` to see errors.
- After Elasticsearch has started, load the HP datasets as follows:
    - Clone the repository:
        - https://github.com/lumenbiomics/nferx_tokencollections.git
    - Navigate to `scripts/coll_search_scripts/load_collections/`
    - Run the `load_to_es.py` file using the command below:
        ```
        $ python load_to_es.py run -i <index_name> --host <elasticsearch_server_node_name> -w /home/<dump_name>.json
        ```
        > python load_to_es.py run -i collection_search_dev --host 'http://localhost:9200' -w /home/_collection_data.json
    - Navigate to `scripts/coll_search_scripts/load_tokens/`
    - Run the `load_to_es.py` file using the command below:
        ```
        $ python load_to_es.py run -i <index_name> --host <elasticsearch_server_node_name> -w /home/<tokens_dump_name>.json
        ```
        > python load_to_es.py run -i token_collection_search_dev --host http://localhost:9200/ -w /home/token_collection_search_biogen.json
- After setting up the Elasticsearch cluster, replace the tokencollections run command in the yml as follows:
    ```
    "--es-link","<elasticsearch_node_name>"
    "--coll-index","<collection_index_name>"
    "--token-index","<tokens_index_name>"
    ```
    > "--es-link","http://ovh-search1.nferx.com:9200" "--coll-index","collection_search" "--token-index","token_collection_search"

## Dataset Key update

- Clone the repository:
    - https://github.com/lumenbiomics/nferx_gateway.git
- Navigate to `nferx_gateway/scripts/`
- Run the following command on `update_datasets.py`:
    ```
    $ python update_datasets.py run --hpgw <gateway_host> --redis-info host=<redis_host>:port=6381:db=1:pass=<password> --copy-to-file <path_to_create_dump>
    ```
    > python update_datasets.py run --hpgw hz-devgw.nferx.com --redis-info host=hz-devredis.nferx.com:port=6381:db=1:pass=passwd --copy-to-file
- Visit the following URL to test the healthcheck of the cluster:
    ```
    <gateway_url>/healthcheck
    ```
    > http://ovh-hpgw.nferx.com/healthcheck
- If the response is 200, then your cluster is up with all HP configurations properly set.
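- The same check can be scripted; a minimal sketch using `curl` (shown against the example gateway above, adjust the URL to your own):
    ```
    # Prints only the HTTP status code; expect 200 when the cluster is healthy
    $ curl -s -o /dev/null -w "%{http_code}\n" http://ovh-hpgw.nferx.com/healthcheck
    ```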
## Manifest info update

- The `get_manifest_info` API is used to get the meta information of all corpora.
- The `get_manifest_info` API builds its response from the `manifest_info` key in the Redis DB, so we need a script to create the `manifest_info` key in Redis.
- Reference API call: http://hz-devgw.nferx.com/gateway/v1/gateway/get_manifest_info
- The response format of the API is as follows:
    ```json
    {
        "result": [
            {
                "name": "Core Corpus",
                "id": "corpus",
                "manifest_info": {
                    "all": {
                        "multigram_threshold": 39,
                        "unigram_threshold": 15
                    },
                    "time_slices": {
                        "multigram_threshold": 39,
                        "unigram_threshold": 15
                    }
                }
            },
            {
                "name": "PubMed",
                "id": "pubmed",
                "manifest_info": {
                    "all": {
                        "multigram_threshold": 4,
                        "unigram_threshold": 4
                    },
                    "time_slices": {
                        "multigram_threshold": 4,
                        "unigram_threshold": 4
                    }
                }
            }
        ]
    }
    ```
- Clone the repository:
    - https://github.com/lumenbiomics/nferx_gateway.git
- Change the directory: `cd nferx_gateway/scripts/manifest_info`
- Run the following command to see the full usage: `python prepare_manifest_info.py run -h`
    ```bash
    deepcompute@hp$ python prepare_manifest_info.py run -h
    usage: prepare_manifest_info.py run [-h] --hpgw HPGW --redis-info REDIS_INFO
                                        [--copy-to-file]

    optional arguments:
      -h, --help            show this help message and exit
      --hpgw HPGW           Please provide HP cluster gateway URL
                            Ex: hz-devgw.nferx.com
      --redis-info REDIS_INFO
                            Please provide the redis information
                            Ex: host=hz-devredis.nferx.com:port=6381:db=1:pass=passwd
      --copy-to-file        Pass this as a command line argument to store datasets
                            information to a tmp folder filename=manifest_info.json
    ```
- Run command:
    ```bash
    python prepare_manifest_info.py run --hpgw hz-devgw.nferx.com --redis-info host=hz-devredis.nferx.com:port=6381:db=1:pass=passwd --copy-to-file
    ```

###### tags: `HP`