Big Data - Data Engineering Fundamentals (Complete)
======================================
Date: December 2020

## Information

Short URL: https://bit.ly/3l7NQCQ
Cheat Sheets: [Here](https://teams.microsoft.com/_#/files/Allgemein?groupId=309bc066-c881-483b-8623-4758a03139ab&threadId=19%3A0c4235dbf1f4482dafbc122ac83ea581%40thread.tacv2&ctx=channel&context=General&rootfolder=%252Fsites%252FG-BigDataEngineeringSchulungPolizeiakademie%252FFreigegebene%2520Dokumente%252FGeneral)
Slides: tbd
Exercises: [Here](https://teams.microsoft.com/_#/files/Allgemein?groupId=309bc066-c881-483b-8623-4758a03139ab&threadId=19%3A0c4235dbf1f4482dafbc122ac83ea581%40thread.tacv2&ctx=channel&context=General&rootfolder=%252Fsites%252FG-BigDataEngineeringSchulungPolizeiakademie%252FFreigegebene%2520Dokumente%252FGeneral)
HDFS Puzzle: tbd
Putty Download: https://the.earth.li/~sgtatham/putty/latest/w32/putty.exe
MapReduce Example (Word Count): https://cwiki.apache.org/confluence/display/HADOOP2/WordCount
YARN Example: https://github.com/hortonworks/simple-yarn-app/blob/master/src/main/java/com/hortonworks/simpleyarnapp/ApplicationMaster.java
Distributed Consensus (Zookeeper): https://www.confluent.io/blog/distributed-consensus-reloaded-apache-zookeeper-and-replication-in-kafka/

## Communication

Microsoft Teams: https://bit.ly/2J5sU2t

### Miro

Introduction: https://miro.com/app/board/o9J_ldtEX_g=/
HDFS Puzzle: https://miro.com/app/board/o9J_ldu4sJs=/

## Server List

vmniedersachsen-polizei-de001.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de002.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de003.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de004.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de005.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de006.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de007.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de008.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de009.westeurope.cloudapp.azure.com
vmniedersachsen-polizei-de010.westeurope.cloudapp.azure.com

### SSH Credentials

User: trainadm
Password: Train@Woodmark
Port: 22

## Ambari

[http://vmniedersachsen-polizei-de[001-010].westeurope.cloudapp.azure.com:8080](http://vmniedersachsen-polizei-de[001-010].westeurope.cloudapp.azure.com:8080)
User: admin
Password: admin

## Zeppelin

[http://vmniedersachsen-polizei-de[001-010].westeurope.cloudapp.azure.com:9995](http://vmniedersachsen-polizei-de[001-010].westeurope.cloudapp.azure.com:9995)
User: admin
Password: admin

## Spark Streaming

Zeppelin: `GIP_Aufgaben/Data Engineering/Twitter_SparkStreaming_Kafka`
Twitter credentials: https://bit.ly/3lf6XuO

## Nifi

[http://vmniedersachsen-polizei-de[001-010].westeurope.cloudapp.azure.com:9090/nifi](http://vmniedersachsen-polizei-de[001-010].westeurope.cloudapp.azure.com:9090/nifi)

## Sqoop

```
sqoop export \
  --connect jdbc:postgresql://localhost:5432/twitter \
  --username twitter \
  --password bigdata \
  --table tweets_per_day \
  --export-dir /apps/hive/warehouse/twitter.db/tweets_per_day \
  --fields-terminated-by '|'
```

```
SELECT word, count(*) AS wcount
FROM solution_twitter.tweets
LATERAL VIEW explode(split(tweets.hashtag, ', ')) tweets AS word
GROUP BY word
ORDER BY wcount DESC
LIMIT 10;
```

## Kafka

```
cd /usr/hdp/current/kafka-broker/bin

## Produce

# Create a topic
./kafka-topics.sh --create --zookeeper hdp-sandbox.train.woodmark.de:2181 --replication-factor 1 --partitions 4 --topic testtopic

# List the topics that exist
./kafka-topics.sh --list --zookeeper hdp-sandbox.train.woodmark.de:2181

# Show the details of a specific topic
./kafka-topics.sh --describe --zookeeper hdp-sandbox.train.woodmark.de:2181 --topic testtopic

# Write a single message
echo "test line" | ./kafka-console-producer.sh --broker-list hdp-sandbox.train.woodmark.de:6667 --topic testtopic

# Write messages continuously (one message per line)
./kafka-console-producer.sh --broker-list hdp-sandbox.train.woodmark.de:6667 \
  --topic testtopic

## Consume

# Show the contents of a topic (everything, from the beginning)
./kafka-console-consumer.sh --zookeeper hdp-sandbox.train.woodmark.de:2181 --topic twitter_topic --from-beginning

# Leave the Kafka console: Ctrl + C
```

## HDFS

```
# Act as the hdfs superuser
export HADOOP_USER_NAME=hdfs

# Create a local test file and copy it into HDFS
echo "Hallo Welt" > a.txt
hdfs dfs -put a.txt /tmp/
hdfs dfs -ls /tmp/a.txt

# Create a directory tree and move the file there
hdfs dfs -mkdir -p /tmp/nifi/test
hdfs dfs -mv /tmp/a.txt /tmp/nifi/test
hdfs dfs -ls /tmp/nifi/test

# Print the file contents
hdfs dfs -text /tmp/nifi/test/a.txt

# Make the directory world-writable
hdfs dfs -chmod 777 /tmp/nifi
```

## Hive

```
beeline
!connect jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: <press Enter>
Enter password for jdbc:hive2://localhost:10000/default: <press Enter>
```
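Looking back at the Sqoop export: `--fields-terminated-by '|'` tells the export job how to split each row of the Hive warehouse files into columns before writing them to Postgres. A minimal sketch of that split, using a made-up two-column sample file (the schema `day|count` is an assumption for illustration, not taken from the real table):

```
# Hypothetical sample of a pipe-delimited Hive warehouse file
printf '2020-12-01|1523\n2020-12-02|1740\n' > tweets_per_day_sample.txt

# Split on '|', as Sqoop does with --fields-terminated-by '|'
awk -F'|' '{ print "day=" $1, "count=" $2 }' tweets_per_day_sample.txt
# -> day=2020-12-01 count=1523
# -> day=2020-12-02 count=1740
```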
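The interactive beeline session above can also be scripted. A sketch of a small wrapper (`run_hql.sh` is a hypothetical helper, assuming the same unauthenticated HiveServer2 on localhost:10000) that runs a single statement non-interactively via beeline's `-u` and `-e` flags:

```
# run_hql.sh -- hypothetical wrapper; assumes HiveServer2 on localhost:10000
# with no authentication, as in the interactive session above
cat > run_hql.sh <<'EOF'
#!/bin/sh
# Usage: ./run_hql.sh "SHOW DATABASES;"
beeline -u jdbc:hive2://localhost:10000/default -e "$1"
EOF
chmod +x run_hql.sh
```

This is handy for the exercises: queries such as the hashtag word count from the Sqoop section can be replayed without re-entering the `!connect` dialog each time.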