---
tags: Spark
title: Basic Setting
---

# Ubuntu 20.04.1 Hadoop 3.2.1 Spark 3.0.1 -- Single Node (Standalone)

---
[TOC]
---

## Ubuntu 20.04.1 LTS

[Website of Ubuntu Desktop](https://ubuntu.com/download/desktop)
[Download Link](https://ubuntu.com/download/desktop/thank-you?version=20.04.1&architecture=amd64)

## update

```
sudo apt-get update
sudo apt-get upgrade
sudo apt-get autoremove
```

## Hadoop 3.2.1

[Hadoop 3.2.1](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz)
[Download Link](https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz)

`wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz`
`sudo tar -xvf hadoop-3.2.1.tar.gz`

## Spark 3.0.1

[Spark 3.0.1](https://spark.apache.org/downloads.html)
[Download Link w/ Hadoop 3.2 prebuilt](https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz)

`wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz`
`sudo tar -xvf spark-3.0.1-bin-hadoop3.2.tgz`

## java

Check whether Java is already installed:
`java --version`

Install OpenJDK 8:
`sudo apt install openjdk-8-jre-headless`
`sudo apt install openjdk-8-jdk-headless`

Locate the Java installation:
`update-alternatives --display java`

## ssh, pdsh

Install:
`sudo apt install ssh`
`sudo apt install pdsh`

Generate a passphrase-less key and authorize it:
```
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

Try ssh:
`ssh localhost`

## folder setting

`cd /home/master/cluster`
(the prompt changes from `pcdm@master:~$` to `pcdm@master:~/cluster$`)

`pcdm@master:~/cluster$ sudo mv hadoop-3.2.1 hadoop`
`pcdm@master:~/cluster$ sudo mv spark-3.0.1-bin-hadoop3.2 spark`

Careful here: if the account was created with `adduser`, use the username given to `adduser`.

`pcdm@master:~/cluster$ sudo chown -R pcdm:master hadoop`
`pcdm@master:~/cluster$ sudo chown -R pcdm:master spark`

## .bashrc setting

`sudo gedit ~/.bashrc`

```
export PDSH_RCMD_TYPE=ssh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=$JAVA_HOME/lib

export HADOOP_INSTALL=/home/master/cluster/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CLASSPATH=$(hadoop classpath)

export SPARK_HOME=/home/master/cluster/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
```

python:
`sudo ln -s /usr/bin/python3 /usr/bin/python`

Don't forget:
`source ~/.bashrc`
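Before configuring Hadoop, it can be worth confirming that `$SPARK_HOME`, Java, and the `python` symlink are wired up by running Spark in local mode. A minimal sketch (the file name `smoke_local.py` and the app name are placeholders, not part of this guide), submitted with `spark-submit smoke_local.py`:

```
# smoke_local.py -- hypothetical file name, just a local-mode smoke test
from pyspark.sql import SparkSession

# local[*] runs Spark inside a single local JVM, so this works
# before HDFS/YARN are configured or started.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("LocalSmokeTest")
         .getOrCreate())

# Trivial job: parallelize 1..100 and sum it.
total = spark.sparkContext.parallelize(range(1, 101)).sum()
print("sum of 1..100 =", total)  # expect 5050

spark.stop()
```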
## Hadoop Setting (with Yarn)

core-site.xml:
`cd ~/cluster/hadoop/etc/hadoop/`
`sudo gedit core-site.xml`

```
<!-- location of the namenode -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

hdfs-site.xml:

Create the data directories first:
```
mkdir ~/cluster/hdfs
mkdir ~/cluster/hdfs/datanode
mkdir ~/cluster/hdfs/namenode
```

Then edit the config:
`cd ~/cluster/hadoop/etc/hadoop/`
`sudo gedit hdfs-site.xml`

```
<configuration>
    <!-- number of HDFS block replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- namenode storage location -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/master/cluster/hdfs/namenode</value>
    </property>
    <!-- datanode storage location -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/master/cluster/hdfs/datanode</value>
    </property>
</configuration>
```

mapred-site.xml:

```
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
```

yarn-site.xml:

```
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```

## Execution_hadoop

Format the filesystem:
`hdfs namenode -format`

Start dfs and yarn:
`start-dfs.sh`
`start-yarn.sh`
or
`start-all.sh`

Stop dfs and yarn:
`stop-dfs.sh`
`stop-yarn.sh`
or
`stop-all.sh`

## Execution_spark

Master:
`start-master.sh`

Slaves:
`start-slaves.sh`

## how-to-turn-off-info-logging-in-spark

In `$SPARK_HOME`, first:
`cp conf/log4j.properties.template conf/log4j.properties`

then change
`log4j.rootCategory=INFO, console`
to
`log4j.rootCategory=WARN, console`
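## quick test (optional)

Once HDFS, YARN, and the Spark standalone master/slaves are running, a short PySpark job that round-trips a DataFrame through HDFS gives an end-to-end check. This is a minimal sketch, assuming `fs.defaultFS` is `hdfs://localhost:9000` as configured above and that the standalone master is listening on the default port 7077 (check the master web UI on port 8080 for the exact `spark://...` URL); the file name `verify_cluster.py` and the `/tmp/smoke_test` path are placeholders.

```
# verify_cluster.py -- hypothetical file name and HDFS path
# Submit with, e.g.:
#   spark-submit --master spark://master:7077 verify_cluster.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ClusterSmokeTest")
         .getOrCreate())

# Write a tiny DataFrame into HDFS via the fs.defaultFS set in core-site.xml,
# then read it back and count the rows.
df = spark.createDataFrame([(i, i * i) for i in range(10)], ["n", "n_squared"])
df.write.mode("overwrite").parquet("hdfs://localhost:9000/tmp/smoke_test")

readback = spark.read.parquet("hdfs://localhost:9000/tmp/smoke_test")
print("rows written and read back:", readback.count())  # expect 10

spark.stop()
```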