###### tags: `metabolomic` `config` `spark`
# Metabolomics Semantic Data Lake
## Install Party
### Session of Sept. 7, 2021, morning & afternoon
https://sparkbyexamples.com/hadoop/apache-hadoop-installation/
Name node: snapshot via the ProxMox hypervisor
Data nodes: Timeshift
#### Installing a backup/snapshot system
- timeshift: https://linuxconfig.org/ubuntu-20-04-system-backup-and-restore
Deployed via the project's Ansible playbook.
Commands:
```shell=
fgiacomoni@ara-unh-saroumane:~$ sudo timeshift --create --backup-device /dev/ubuntu-vg/lv-home
RSYNC Snapshot saved successfully (100s)
```
```shell=
cduperier@ara-unh-angmar:~$ sudo timeshift --create --backup-device /dev/ubuntu-vg/lv-home
```
```shell=
ofilangi@ara-unh-herumor:~$ sudo timeshift --create --backup-device /dev/ubuntu-vg/lv-home
```
```shell=
mboudet@ara-unh-khamul:~$ sudo timeshift --create --backup-device /dev/ubuntu-vg/lv-home
```
#### Apache Hadoop installation
- [X] Deployment of the Hadoop part via Ansible (role already exists)
- Preparation of the environment on the PFEM GitLab CI/CD server
- Installation of the andrewrothstein.hadoop role (with its dependencies) via ansible-galaxy
```shell=
mkdir roles
ansible-galaxy install andrewrothstein.hadoop -p ./roles
```
- Creation of a playbook: [hadoop playbook](https://services.pfem.clermont.inrae.fr/gitlab/metabosemdatalake/ansible-datalake/-/blob/master/playbook_hadoop.yml) (hedged sketch below)
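A minimal sketch of what the playbook might look like, assuming it simply applies the Galaxy role to the cluster hosts; the real file is in the GitLab repository linked above:
```shell=
cat playbook_hadoop.yml
- hosts: all
  become: yes
  roles:
    - andrewrothstein.hadoop
```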
- Java version pinned to JDK 8.x for the Hadoop build
- lint step commented out (to be restored for production)
- CI/CD pipelines: [results](https://services.pfem.clermont.inrae.fr/gitlab/metabosemdatalake/ansible-datalake/-/pipelines)
- on a data node (example: ara-unh-angmar)
- *ISSUE ENCOUNTERED*: **files owned by UID 1001 (fgiacomoni on 4 machines, ofilangi on 1 machine). Apparently caused by the way the archive is untarred; David added an Ansible rule to fix it.**
```shell=
fgiacomoni@ara-unh-angmar:~$ hadoop version
Hadoop 3.3.0
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
Compiled by brahma on 2020-07-06T18:44Z
Compiled with protoc 3.7.1
From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
This command was run using /usr/local/hadoop-3.3.0/share/hadoop/common/hadoop-common-3.3.0.jar
fgiacomoni@ara-unh-angmar:~$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.242-b08, mixed mode)
```
- [X] @mateo - SSH config - communication between the name node and the data nodes (sketch below)
- Creation of an ubuntu user on all nodes (same Ansible password): Saroumane, Angmar, Gothmog, Herumor, Khamul
- Creation of a key for this user on Saroumane: `ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa`
- ssh-copy-id to every node to push the key: `ssh-copy-id ubuntu@147.100.175.222`
- [X] @mateo - Aliases added to /etc/hosts on Saroumane (and on all the other machines)
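A minimal sketch of the key distribution, assuming the node hostnames resolve through the /etc/hosts aliases above (only Saroumane's address, 147.100.175.223, is confirmed elsewhere in these notes):
```shell=
# run as the ubuntu user on Saroumane, after ssh-keygen
for node in ara-unh-angmar ara-unh-gothmog ara-unh-herumor ara-unh-khamul; do
    ssh-copy-id "ubuntu@${node}"
done
# quick check that the aliases are in place
grep ara-unh /etc/hosts
```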
- [X] OpenJDK - to be added to the playbook
- [X] Hadoop - set up environment variables.
This step may not be needed, since the Hadoop ecosystem executables are already reachable via the PATH.
```shell=
root@ara-unh-angmar:~# cat /etc/profile.d/hadoop.sh
export PATH=$PATH:/usr/local/hadoop/bin
root@ara-unh-angmar:~# cat /etc/profile.d/openjdk.sh
export PATH=$PATH:/usr/local/openjdk-jre/bin
export JAVA_HOME=/usr/local/openjdk-jre
```
- Ext4 formatting of the data nodes
- [X] herumor (Olivier)
- [X] gothmog (Mateo)
- [X] khamul (Christophe)
- [X] angmar (David)
- [X] saroumane (Christophe)
```shell=
# as root: build an LVM volume from the two data disks and mount it on /data
pvcreate /dev/sdb
pvcreate /dev/sdc
pvdisplay
vgcreate data /dev/sdb /dev/sdc
vgs
lvcreate -n lv-data -L 1.5t data
lvs
mkfs -t ext4 /dev/data/lv-data
mkdir /data
mount /dev/data/lv-data /data
df -h
# add the following line to /etc/fstab so the mount survives reboots:
nano /etc/fstab
+ /dev/data/lv-data /data ext4 defaults 0 0
chmod 700 /data
chown -R ubuntu:ubuntu /data
```
Data directory for the name node:
```shell=
# as root: carve an 80 GB logical volume for /data on the name node
lvcreate -n lv-data -L 80g ubuntu-vg
mkfs -t ext4 /dev/ubuntu-vg/lv-data
mkdir /data
mount /dev/ubuntu-vg/lv-data /data
# add the following line to /etc/fstab:
nano /etc/fstab
+ /dev/mapper/ubuntu--vg-lv--data /data ext4 defaults 0 0
```
- [X] Hadoop cluster env
- [X] Config files
Creation of Ansible roles for the following files (hedged sketch below):
+ core-site.xml (David's example)
+ hdfs-site.xml
+ yarn-site.xml
+ mapred-site.xml
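A minimal sketch of what the rendered core-site.xml and hdfs-site.xml could contain: the fs.defaultFS address matches the one reported by `hdfs dfs -df` later in these notes, while the /data paths and the replication factor are assumptions; the actual templates live in the Ansible roles.
```shell=
# hedged sketch, not the real rendered files
cat /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://147.100.175.223:9000</value>
  </property>
</configuration>
cat /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```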
- [X] Creation of the data part (folder)
See the "Ext4 formatting of the data nodes" section above.
- Add hadoop-env.sh setting `JAVA_HOME` via Ansible:
```shell=
export JAVA_HOME=/usr/local/openjdk-jre/
```
- [X] Create master and workers files
- Create master file - via ansible
- Create workers file - via ansible
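A hedged sketch of the expected contents, assuming Saroumane is the sole master, the four data nodes are the workers, and the files live in the default Hadoop config directory (the actual files are generated by the Ansible role):
```shell=
cat /usr/local/hadoop/etc/hadoop/master
ara-unh-saroumane
cat /usr/local/hadoop/etc/hadoop/workers
ara-unh-angmar
ara-unh-gothmog
ara-unh-herumor
ara-unh-khamul
```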
- [X] Format HDFS and start the Hadoop cluster - NAME NODE ONLY
```shell=
hdfs namenode -format
```
- [X] Start HDFS cluster
Run the pipeline as the ubuntu user!
```shell=
cd /usr/local/hadoop/sbin
ubuntu@namenode:~$ start-dfs.sh
Starting namenodes on [namenode.socal.rr.com]
Starting datanodes
Starting secondary namenodes [namenode]
# check the running Hadoop daemons with the Java Virtual Machine Process Status Tool
jps
# or
ps -aux | grep java | awk '{print $12}'
```
- [X] Upload a file to HDFS
```shell=
hdfs dfs -mkdir -p /user/ubuntu/books
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
hdfs dfs -put alice.txt books/alice.txt
hdfs dfs -ls
hdfs dfs -ls books
hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://147.100.175.223:9000 5.9 T 810.7 K 5.6 T 0%
```
- [X] Stopping the HDFS cluster
Snapshot of the data nodes via Timeshift
Snapshot of the name node via ProxMox
### Session of Sept. 8, 2021, morning
Snapshot of the machines (name/data nodes) with Timeshift + ProxMox
https://sparkbyexamples.com/hadoop/yarn-setup-and-run-map-reduce-program/
- [X] Configure yarn-site.xml
Integrated into the Ansible role (hedged sketch below)
- [X] Configure mapred-site.xml file
Integrated into the Ansible role
- [X] Configure data nodes
Deployment to all nodes via Ansible
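A minimal sketch of the two YARN-related files as the role might render them; the ResourceManager host and the shuffle service entry are assumptions, the `yarn` framework name is the standard setting:
```shell=
# hedged sketch, not the real rendered files
cat /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>ara-unh-saroumane</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
cat /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```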
- [X] Start YARN from Name Node/node-master
```shell=
start-yarn.sh
ps -aux | grep java | awk '{print $2"\t"$12}'
10740 SecondaryNameNode
10442 NameNode
10972 ResourceManager
stop-yarn.sh
start-yarn.sh
```
- [ ] Run MapReduce example
```shell=
yarn --loglevel 4 jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount "books/*" output --verbose
```
You should see an entry with an application ID similar to "application_1547102810368_0001" and a "FINISHED" status.
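The finished applications can also be listed from the CLI (hedged example; application IDs will differ):
```shell=
yarn application -list -appStates FINISHED
```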
```shell=
yarn logs -applicationId application_1631087847701_0001
```
!!!!! ISSUE HERE !!!!!
In yarn-env.sh, you should see YARN_LOG_DIR.
### Install Spark
https://sparkbyexamples.com/spark/spark-setup-on-hadoop-yarn/
- [ ] Spark install and setup
```shell=
ansible-galaxy install andrewrothstein.spark -p ./roles
```
```shell=
- downloading role 'spark', owned by andrewrothstein
- downloading role from https://github.com/andrewrothstein/ansible-spark/archive/v3.0.9.tar.gz
- extracting andrewrothstein.spark to /media/olivier/hdd-local/workspace/INRAE/ansible-datalake/roles/andrewrothstein.spark
- andrewrothstein.spark (v3.0.9) was installed successfully
[WARNING]: - dependency andrewrothstein.unarchive-deps (v1.2.0) from role andrewrothstein.spark differs from already installed version (v1.1.5), skipping
[WARNING]: - dependency andrewrothstein.openjdk (v2.0.15) from role andrewrothstein.spark differs from already installed version (v2.0.14), skipping
```
Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn, via Ansible (sketch below).
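A minimal sketch of the resulting file, assuming only the master setting is changed at this stage:
```shell=
cat $SPARK_HOME/conf/spark-defaults.conf
spark.master    yarn
```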
```shell=
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.2.jar 10
...
Pi is roughly 3.1360556802784014
```
- [ ] Spark history server
- [ ] Configure the history server via Ansible (hedged sketch after the block below)
- [ ] Run the history server
```shell=
$SPARK_HOME/sbin/start-history-server.sh
# problem when running spark-shell -> fixed by creating the event-log directory:
hdfs dfs -mkdir -p /spark-logs
# spark-shell startup output (master = yarn):
spark-shell
2021-09-08 09:59:58,202 WARN util.Utils: Your hostname, ara-unh-saroumane resolves to a loopback address: 127.0.1.1; using 147.100.175.223 instead (on interface ens18)
2021-09-08 09:59:58,203 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-09-08 10:00:04,657 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://147.100.175.223:4040
Spark context available as 'sc' (master = yarn, app id = application_1631087847701_0012).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val textFile = spark.read.textFile("books/alice.txt")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> text
text textFile
scala> textFile.count()
res0: Long = 3761
scala> textFile.first()
res1: String = The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll
```
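For the history server configuration itself, the properties below are the standard Spark ones; the hdfs:///spark-logs location matches the directory created above, but the exact values used in the Ansible role are assumptions:
```shell=
# hedged sketch of the history-server part of spark-defaults.conf
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```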
### Workspace
As the ubuntu user:
```shell=
hdfs dfs -mkdir /user/ofilangi
hdfs dfs -chown ofilangi /user/ofilangi
```
#### Test
As user ofilangi:
```
hdfs dfs -ls
```
### Session of Sept. 8, 2021, afternoon
- [x] Deployment of SANSA: http://sansa-stack.net/
http://sansa-stack.net/downloads-usage/
Provides a semantic-web layer on top of a Spark system: operations on data tables
Automatic conversion of the Spark format to RDF...
SBT install
RDF2ML v0.8.1 (SBT at 0.7...): build failed
```shell=
## SBT
git clone https://github.com/SANSA-Stack/SANSA-Template-SBT-Spark.git
cd SANSA-Template-SBT-Spark
sbt clean package
```
Maven install
```shell=
## Maven
git clone https://github.com/SANSA-Stack/SANSA-Template-Maven-Spark.git
cd SANSA-Template-Maven-Spark
mvn clean package
```
https://github.com/SANSA-Stack/SANSA-Stack/tree/v0.8.0-RC1#release-version
To import the full SANSA Stack, add the following Maven dependency to your project's POM file:
```
<!-- SANSA Stack -->
<dependency>
<groupId>net.sansa-stack</groupId>
<artifactId>sansa-stack-spark_2.12</artifactId>
<version>$LATEST_RELEASE_VERSION$</version>
</dependency>
```
### Maven/SANSA 0.8.0-RC1 error
```sh
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 08:01 min
[INFO] Finished at: 2021-09-08T12:42:09Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project SANSA-Template-Maven-Spark: Could not resolve dependencies for project net.sansa-stack:SANSA-Template-Maven-Spark:jar:0.8.0-RC1: The following artifacts could not be resolved: net.sansa-stack:sansa-stack-spark_2.12:jar:0.8.0-RC1, net.sansa-stack:sansa-rdf-spark_2.11:jar:0.8.0-RC1, net.sansa-stack:sansa-owl-spark_2.11:jar:0.8.0-RC1, net.sansa-stack:sansa-inference-spark_2.11:jar:0.8.0-RC1, net.sansa-stack:sansa-query-spark_2.11:jar:0.8.0-RC1, net.sansa-stack:sansa-ml-spark_2.11:jar:0.8.0-RC1: Could not find artifact net.sansa-stack:sansa-stack-spark_2.12:jar:0.8.0-RC1 in oss-sonatype (https://oss.sonatype.org/content/repositories/snapshots/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
```
#### pom
```
```
### Local installation of SANSA 0.8.0-RC2-SNAPSHOT
```
git clone https://github.com/SANSA-Stack/SANSA-Stack.git
cd SANSA-Stack
./dev/mvn_install_stack_spark.sh
```
- [ ] Scalable RDF Analytics with SANSA - Tutorial
Half-Day Tutorial at The 19th International Semantic Web Conference (ISWC2020)
https://docs.google.com/document/d/e/2PACX-1vQlTF5y_Y8SIIzzNJhBiAe77iqA93Dqcuky-pJe3Yi5JiloX379qx2vORgsCYavEj9A7JhSZT00phnK/pub
- NEXT project:
RDF data repository (chemistry databases, ...)
- QUESTIONS ??
- Java JDK vs JRE... (version... + add-ons)
- How to check whether a file is present on all the nodes? (hedged sketch below)
- Software to monitor node load?
- Web interface? (access opening to be requested)
- Name dir: is 80 GB enough??
- YARN configuration of the allocated resources (memory, ...)
- Error "2021-09-08 08:40:43,464 INFO mapreduce.Job: Task Id : attempt_1631087847701_0005_m_000000_0, Status : FAILED" when running the "wordcount" example. Why do some tasks fail?