---
title: DE-L2 Udemy Troubleshooting Notes
tags: wedata
description: DE-L2 Udemy troubleshooting and workarounds
---
# Table of Contents
[TOC]
# DE-L2 Udemy Course Issues
## section2
1. 2-16 Install pip and mrjob
`pip install google-api-python-client==1.6.4`
fails here.

Solutions
1. You can simply skip it; it does not affect the MRJob exercises later, and the Docker image already ships mrjob 0.7.
2. Upgrade pip first, then update setuptools (provided by Jeffrey):
```bash=
pip install --upgrade pip
yum reinstall -y python2-pip.noarch python27-python-pip.noarch
pip install setuptools==33.1.1
pip install google-api-python-client==1.6.4
```
3. Install Anaconda and run a proper Python/pip through it:
```
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
mkdir tools
bash Anaconda3-2021.05-Linux-x86_64.sh
yum install vim
source ~/.bashrc
conda update --all
conda update conda
conda update anaconda
conda install google-api-python-client
pip install google-api-python-client
pip install google-api-python-client==1.6.4
```
Once the above succeeds it will run, but MRJob works even without it.

2-18 Code up the ratings histogram MapReduce job and run it
Inside TopMovies.py, add the following to enable sorting.
Reference: https://www.cnpython.com/qa/346173
```python=
MRJob.SORT_VALUES = True
```
Modified version:
```python=
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    MRJob.SORT_VALUES = True

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
            MRStep(mapper=self.map, reducer=self.reducer_sorted_output)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield int(movieID), 1

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

    def map(self, data, line):
        yield None, ("%04d" % int(data), line)

    def reducer_sorted_output(self, n, movies):
        for movie in movies:
            yield movie[1], movie[0]

if __name__ == '__main__':
    RatingsBreakdown.run()
```
## section4
1. [Activity] Find the movie with the lowest average rating - with RDD's
The sandbox has no Spark 1, only Spark 2, and you need admin privileges to change the settings.

33. [Activity] Movie recommendations with MLLib
``` bash=
sudo pip install numpy==1.16
export SPARK_MAJOR_VERSION=2
spark-submit MovieRecommendationsALS.py
```

Open MovieRecommendationsALS.py and add this line right after the spark object is created:
```python=
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```
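For orientation, here is a minimal sketch of where that line sits, assuming the script builds its own SparkSession (illustrative only, not the exact contents of the course file):
```python=
from pyspark.sql import SparkSession

# Create the session the way the script would, then relax the cross-join
# restriction that Spark 2.x enables by default, which otherwise breaks
# the recommendation join.
spark = SparkSession.builder.appName("MovieRecommendationsALS").getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```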


## section5
43. [Activity] Use Sqoop to import data from MySQL to HDFS/Hive
In the video it is done like this:
```bash=
GRANT ALL PRIVILEGES ON movielens.* to ''@'localhost';
sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table movies -m 1
```
If you hit this problem:
(ERROR manager.SqlManager: Error executing statement: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure)
https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963452#questions/6479200
```bash=
GRANT ALL PRIVILEGES ON movielens.* to 'root'@'localhost' IDENTIFIED BY 'hadoop';
sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --username root -P --table movies -m 1
whereis sqoop
cd /usr/lib/
mkdir sqoop
cd sqoop/
mkdir lib
cd lib
wget http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-8.0.23.tar.gz
tar -zxf mysql-connector-java-8.0.23.tar.gz
cd mysql-connector-java-8.0.23
mv mysql-connector-java-8.0.23.jar ..
cd $SQOOP_HOME/bin
sqoop-version
cd ~
```
The root cause is that the Docker image does not ship the MySQL connector.
Reference: http://tw.gitbook.net/sqoop/sqoop_installation.html

-----------------------
During the process the data did get imported into Hive, but the following error message appeared:

```
Failed with exception org.apache.hadoop.security.AccessControlException: User null does not belong to hadoop
at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setOwner(FSDirAttrOp.java:89)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setOwner(FSNamesystem.java:1877)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setOwner(NameNodeRpcServer.java:828)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setOwner(ClientNamenodePr
otocolServerSideTranslatorPB.java:476)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlocking
Method(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
```
Cause: the user has not been added to the required groups.
```
usermod -G hadoop root
usermod -a -G hdfs root
```
## section6
51. [Activity] Installing Cassandra
If you already have Python 2.7 you can skip this (the video has 2.6.6; check with python -V):
```bash=
yum install scl-utils
yum install centos-release-scl-sh
scl enable python27 bash
```
Workaround if the installation fails.
The steps include adding the repo and installing Cassandra.
https://gist.github.com/jpollard3/af376dceb24ed6d06a0eaed60fd33b6b
```bash=
# Take snapshot of current state in virtualbox as a backup
# step 1: install cassandra (datastax version)
cat >/etc/yum.repos.d/datastax.repo <<EOL
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
EOL
yum -y install dsc30
yum -y install jna
cassandra
nodetool status
# step 2: install python 2.7 if you want to run cqlsh: https://github.com/h2oai/h2o-2/wiki/installing-python-2.7-on-centos-6.3.-follow-this-sequence-exactly-for-centos-machine-only
yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel
cd /opt
wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
./configure --prefix=/usr/local
make && make altinstall
ls -ltr /usr/bin/python*
ls -ltr /usr/local/bin/python*
echo $PATH
which python
mv /usr/lib/python2.7/site-packages/* /usr/local/lib/python2.7/site-packages/
rmdir /usr/lib/python2.7/site-packages
ln -s /usr/local/lib/python2.7/site-packages /usr/lib/python2.7/site-packages
# finished. Launch cqlsh
cqlsh
```
54. [Activity] Install MongoDB, and integrate Spark with MongoDB
```bash=
sudo service ambari restart
# If that fails, use this instead:
sudo service ambari-server restart
# If you also need to (re)set the Ambari admin (root) password:
ambari-admin-password-reset
# In the video:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 MongoSpark.py
# But we need this instead (our sandbox version is newer):
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0 MongoSpark.py
```
## section8
74. [Activity] Set up a simple Oozie workflow

If you run into problems here,
the fix is as follows:
https://community.cloudera.com/t5/Support-Questions/Oozie-web-console-is-disabled/td-p/243675
https://stackoverflow.com/questions/49276756/ext-js-library-not-installed-correctly-in-oozie
```bash=
wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
sudo cp ext-2.2.zip /usr/hdp/current/oozie-client/libext/
sudo chown oozie:hadoop /usr/hdp/current/oozie-client/libext/ext-2.2.zip
sudo -u oozie /usr/hdp/current/oozie-server/bin/oozie-setup.sh prepare-war
```

oozie job -oozie http://localhost:11000/oozie -config /home/maria_dev/job.properties -run

If this command fails, fix it as follows.

First, check in Ambari that the HDFS path /user/maria_dev/movies contains the two files u.data and u.item (check this every time you create the job).

Next, check that MySQL has the root/hadoop account and that it has been granted privileges on the database (if not, grant them with the GRANT statement above).

Of the three downloaded files, the contents of workflow.xml need to be modified:
```
import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --username root --password hadoop --table movies -m 1
```
Also check in Ambari that Hive, Oozie, and Sqoop are all started.

Intermediate version

Successful version



## section9
81. [Activity] Setting up Kafka, and publishing some data
Issue:
sandbox.hortonworks.com in the video
must be changed to sandbox-hdp.hortonworks.com
```
cd /usr/hdp/current/kafka-broker/bin
# Create a topic
./kafka-topics.sh --create --zookeeper sandbox-hdp.hortonworks.com:2181 --replication-factor 1 --partitions 1 --topic fred
# List topics
./kafka-topics.sh --list --zookeeper sandbox-hdp.hortonworks.com:2181
# Start a producer
./kafka-console-producer.sh --broker-list sandbox-hdp.hortonworks.com:6667 --topic fred
```
Consumer window (change the hostname and drop the --zookeeper flag):
```
./kafka-console-consumer.sh --bootstrap-server sandbox-hdp.hortonworks.com:6667 --topic fred --from-beginning
```
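If you prefer to sanity-check the topic from Python instead of the console scripts, a minimal sketch with the kafka-python package works too (pip install kafka-python; this package is an extra assumption, not part of the course):
```python=
from kafka import KafkaProducer, KafkaConsumer

BROKER = "sandbox-hdp.hortonworks.com:6667"

# Publish one message to the 'fred' topic created above.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send("fred", b"hello from python")
producer.flush()

# Read the topic from the beginning, like the console consumer above,
# and stop after 5 seconds of silence.
consumer = KafkaConsumer("fred", bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```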
## section10
# DE-L2 Udemy Course Code and Syntax
## section2
2-17 Installing Python, MRJob, and nano
Download the data and the recommended Python code:
``` bash=
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
wget http://media.sundog-soft.com/hadoop/TopMovies.py
```
2-18 Code up the ratings histogram MapReduce job and run it
Scripts to run (run them one line at a time and compare the results):
``` bash=
python RatingsBreakdown.py u.data
python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data
```
<font color="red"> According to the instructor, MostPopularMovie.py is only a demo.<br> Look at RatingsBreakdown.py <br> or use TopMovies.py instead. </font>
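For reference, a minimal sketch of what the ratings-histogram job amounts to (reconstructed from the lecture's description, so treat it as an approximation of RatingsBreakdown.py, not the official file):
```python=
from mrjob.job import MRJob

class RatingsBreakdown(MRJob):
    # u.data lines are: userID, movieID, rating, timestamp (tab separated)
    def mapper(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    # Sum the 1s emitted for each rating value to build the histogram.
    def reducer(self, rating, counts):
        yield rating, sum(counts)

if __name__ == '__main__':
    RatingsBreakdown.run()
```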

```bash=
sudo yum-config-manager --save --setopt=HDP-SOLR-2.6-100.skip_if_unavailable=true
```

## section4
1. [Activity] Find the movie with the lowest average rating - with RDD's
```bash=
mkdir ml-100k
cd ml-100k
wget http://media.sundog-soft.com/hadoop/ml-100k/u.item
cd ..
wget http://media.sundog-soft.com/hadoop/Spark.zip
unzip Spark.zip
spark-submit LowestRatedMovieSpark.py
```
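As a reminder of what the RDD version does, here is a stripped-down sketch along the same lines (my approximation of LowestRatedMovieSpark.py, not the exact file; the HDFS path is an assumption):
```python=
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WorstMovies")
sc = SparkContext(conf=conf)

# u.data lines are: userID  movieID  rating  timestamp (tab separated)
lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")

# (movieID, (rating, 1.0)) so sums and counts can be reduced together
movie_ratings = lines.map(lambda l: (int(l.split()[1]),
                                     (float(l.split()[2]), 1.0)))
totals = movie_ratings.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = totals.mapValues(lambda t: t[0] / t[1])

# Lowest average ratings first
for movie_id, avg in averages.sortBy(lambda kv: kv[1]).take(10):
    print(movie_id, avg)
```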
33. [Activity] Movie recommendations with MLLib
``` bash=
sudo pip install numpy==1.16
export SPARK_MAJOR_VERSION=2
spark-submit MovieRecommendationsALS.py
```
35. [Activity] Check your results against mine!
```bash =
spark-submit LowestRatedPopularMovieSpark.py
spark-submit LowestRatedPopularMovieDataFrame.py
```
## section5
42. [Activity] Install MySQL and import our movie data
```bash =
systemctl stop mysqld
systemctl set-environment MYSQLD_OPTS="--skip-grant-tables --skip-networking"
systemctl start mysqld
wget http://media.sundog-soft.com/hadoop/movielens.sql
mysql -uroot
```
In MySQL:
```sql=
FLUSH PRIVILEGES;
ALTER USER 'root'@'localhost' IDENTIFIED BY 'hadoop' ;
FLUSH PRIVILEGES;
create database movielens;
show databases;
SET NAMES 'utf8';
SET CHARACTER SET utf8;
use movielens
source movielens.sql;
show tables;
select * from movies limit 10;
describe ratings;
select movies.title ,count(ratings.movie_id) as ratingCount From movies inner join ratings on movies.id=ratings.movie_id group by movies.title
order by ratingCount ;
```
44. [Activity] Use Sqoop to export data from Hadoop to MySQL
```bash=
sqoop export --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --username root -P --table exported_movies --export-dir /apps/hive/warehouse/movies --input-fields-terminated-by '\0001'
mysql -u root -p
```
In MySQL:
```sql=
use movielens;
select * from exported_movies limit 10;
```
## section6
47. [Activity] Import movie ratings into HBase
1. Install Python on your own machine
https://www.python.org/downloads/
2. Install Jupyter Notebook
pip install jupyter notebook

3. After installing, launch Jupyter Notebook.

Once you are in,

first start the HBase REST service on port 8000 from inside the sandbox:
```bash =
/usr/hdp/current/hbase-master/bin/hbase-daemon.sh start rest -p 8000 --infoport 8001
```
Program contents:
Python version:
https://drive.google.com/file/d/1rmKTLXfTEdYIldZW-ycQ2c8VdgjidBle/view?usp=sharing
Jupyter notebook version:
https://drive.google.com/file/d/1aNIOIUm7Kbs0brzaTU2ruI828XEKwN2t/view?usp=sharing
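The linked notebooks talk to the HBase REST endpoint started above. As a rough idea of the pattern, here is a minimal sketch using the starbase package (pip install starbase; the table and column-family names here are illustrative and may differ from the linked files):
```python=
from starbase import Connection

# Connect to the HBase REST server exposed on port 8000 above.
connection = Connection("127.0.0.1", "8000")

ratings = connection.table("ratings")
if ratings.exists():
    ratings.drop()
ratings.create("rating")          # one column family named 'rating'

# Load u.data: userID, movieID, rating, timestamp (tab separated)
batch = ratings.batch()
with open("u.data", "r") as f:
    for line in f:
        user_id, movie_id, rating, _ = line.split()
        batch.update(user_id, {"rating": {movie_id: rating}})
batch.commit(finalize=True)

print(ratings.fetch("33"))        # all ratings stored for user 33
```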
48. [Activity] Use HBase with Pig to import data at scale.
```bash =
hbase shell
wget http://media.sundog-soft.com/hadoop/hbase.pig
pig hbase.pig
hbase shell
```
50. If you have trouble installing Cassandra...
```bash =
cd ~
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm
rpm2cpio http://mirror.centos.org/centos/6/os/x86_64/Packages/nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm | cpio -idmv
cp ./lib64/libfreeblpriv3.* /lib64
```
51. [Activity] Installing Cassandra
If you already have Python 2.7 you can skip this (the video has 2.6.6; check with python -V):
```bash=
yum install scl-utils
yum install centos-release-scl-sh
scl enable python27 bash
```
```bash=
cd /etc/yum.repos.d/
vi datastax.repo
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
yum install dsc30 -y
yum install cassandra30-tools-3.0.9-1 -y
python -m pip install cqlsh
service cassandra start
```
```bash=
CREATE KEYSPACE movielens WITH replication = {'class': 'SimpleStrategy','replication_factor':'1'}AND durable_writes=true;
use movielens
CREATE TABLE users (user_id int,age int , gender text , occupation text ,zip text, PRIMARY KEY (user_id));
select * from users;
```
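If you want to hit the same keyspace from Python instead of cqlsh, here is a minimal sketch with the DataStax cassandra-driver package (pip install cassandra-driver; this is an extra assumption, the course itself only uses cqlsh and Spark):
```python=
from cassandra.cluster import Cluster

# Connect to the single-node Cassandra started above.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("movielens")

# Insert one illustrative row into the users table created above.
session.execute(
    "INSERT INTO users (user_id, age, gender, occupation, zip) "
    "VALUES (%s, %s, %s, %s, %s)",
    (1, 33, "M", "engineer", "12345"),
)

for row in session.execute("SELECT * FROM users LIMIT 10"):
    print(row.user_id, row.age, row.occupation)

cluster.shutdown()
```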


52. [Activity] Write Spark output into Cassandra
```bash=
wget http://media.sundog-soft.com/hadoop/CassandraSpark.py
export SPARK_MAJOR_VERSION=2
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.1 CassandraSpark.py
```
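Roughly speaking, that job moves MovieLens data in and out of the movielens keyspace through the connector passed via --packages. A minimal read-side sketch of the pattern (an approximation, not the course's CassandraSpark.py):
```python=
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CassandraIntegration")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Read the users table from the movielens keyspace created earlier.
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(table="users", keyspace="movielens")
         .load())

users.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE age < 20").show()

spark.stop()
```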


54. [Activity] Install MongoDB, and integrate Spark with MongoDB

```bash=
cd /var/lib/ambari-server/resources/stacks/
cd HDP/
cd 2.6
cd services/
pwd
git clone https://github.com/nikunjness/mongo-ambari.git
sudo service ambari restart
# If that fails, use this instead:
sudo service ambari-server restart
```


```bash=
pip install pymongo
cd ~
wget http://media.sundog-soft.com/hadoop/MongoSpark.py
export SPARK_MAJOR_VERSION=2
# In the video:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 MongoSpark.py
# But we need this instead (our sandbox version is newer):
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0 MongoSpark.py
```
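For orientation, the connector lets Spark read and write MongoDB collections directly; a minimal sketch of reading the users collection back out (an approximation with an illustrative URI, not the exact MongoSpark.py):
```python=
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MongoDBIntegration").getOrCreate()

# Read the users collection through the connector supplied via --packages.
users = (spark.read
         .format("com.mongodb.spark.sql.DefaultSource")
         .option("uri", "mongodb://127.0.0.1/movielens.users")
         .load())

users.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE age < 20").show()

spark.stop()
```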
55. [Activity] Using the MongoDB shell
```bash=
mongo
use movielens
db.users.find({user_id:100})
```
MongoDB does not create this index automatically:
```bash=
db.users.explain().find({user_id:100})
db.users.createIndex({user_id:1})
db.users.aggregate([{$group:{_id:{occupation:"$occupation"},avgAge:{$avg:"$age"}}}])
```
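The same checks can be done from Python with pymongo (pip install pymongo; a minimal sketch, assuming MongoDB is listening on the default port):
```python=
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017")
db = client.movielens

# Same lookup as db.users.find({user_id:100}) in the mongo shell.
print(db.users.find_one({"user_id": 100}))

# Build the index by hand, as above, then look at the query plan.
db.users.create_index("user_id")
print(db.users.find({"user_id": 100}).explain())

# Average age per occupation, mirroring the aggregate() call above.
pipeline = [{"$group": {"_id": "$occupation", "avgAge": {"$avg": "$age"}}}]
for doc in db.users.aggregate(pipeline):
    print(doc)
```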

## section8
69. [Activity] Use Hive on Tez and measure the performance benefit
Done through the UI:
```
DROP VIEW IF EXISTS topMovieIDs;
CREATE VIEW topMovieIDs AS
SELECT movie_id,count(movie_id) as ratingCount
From movielens.ratings
GROUP BY movie_id
ORDER BY ratingCount DESC;
Select n.name ,ratingCount
FROM topMovieIDs t JOIN movielens.names n ON t.movie_id=n.movie_id;
```
Tez: 21 seconds
MR: 65 seconds

70. Mesos explained
Mesos explanation (ITHelp Ironman series): https://ithelp.ithome.com.tw/articles/10184643
72. [Activity] Simulating a failing master with ZooKeeper
```bash=
cd /usr/hdp/current/zookeeper-client/
cd bin
./zkCli.sh
```

```bash=
create -e /testmaster "127.0.0.1:2223"
get /testmaster
```

74. [Activity] Set up a simple Oozie workflow
This workflow loads the data via a sqoop import.
```bash=
wget http://media.sundog-soft.com/hadoop/movielens.sql
systemctl stop mysqld
systemctl set-environment MYSQLD_OPTS="--skip-grant-tables --skip-networking"
systemctl start mysqld
mysql -uroot
create database movielens;
use movielens
source movielens.sql;
show tables;
-- If root has not been granted privileges, grant them with the following:
flush privileges;
GRANT ALL PRIVILEGES ON movielens.* to 'root'@'localhost' IDENTIFIED BY 'hadoop';
flush privileges;
```
```bash=
# As the maria_dev user, work in that user's home directory (this affects the oozie job command below)
wget http://media.sundog-soft.com/hadoop/oldmovies.sql
wget http://media.sundog-soft.com/hadoop/workflow.xml
wget http://media.sundog-soft.com/hadoop/job.properties
hadoop fs -put workflow.xml /user/maria_dev
hadoop fs -put oldmovies.sql /user/maria_dev
hadoop fs -put /usr/share/java/mysql-connector-java.jar /user/oozie/share/lib/lib_20180618160835/sqoop
oozie job -oozie http://localhost:11000/oozie -config /home/maria_dev/job.properties -run
```
76. [Activity] Use Zeppelin to analyze movie ratings, part 1
For 76 and 77 you can simply download the instructor's ready-made notebook JSON:
wget http://media.sundog-soft.com/hadoop/MovieLens.json
then upload the JSON file via Import note.

Change the order as shown in the video.

Spark code:
```bash=
sc.version
```
bash
```bash=
%sh
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data -O /tmp/u.data
wget http://media.sundog-soft.com/hadoop/ml-100k/u.item -O /tmp/u.item
%sh
hadoop fs -rm -r -f /tmp/ml-100k
hadoop fs -mkdir /tmp/ml-100k
hadoop fs -put /tmp/u.item /tmp/ml-100k/
hadoop fs -put /tmp/u.data /tmp/ml-100k/
```
Scala (define the case class first, then use Spark map/reduce operations to split the data):
```scala=
final case class Rating (movieID: Int, rating: Int)
val lines = sc.textFile("hdfs:///tmp/ml-100k/u.data")
.map(x=>{val fields =x.split("\t");
Rating(fields(1).toInt,fields(2).toInt)
})
import sqlContext.implicits._
val ratingsDF=lines.toDF()
ratingsDF.printSchema()
ratingsDF.registerTempTable("ratings")
val topMovieIDs= ratingsDF.groupBy("movieID").count().orderBy(desc("count")).cache()
topMovieIDs.show()
```
77. [Activity] Use Zeppelin to analyze movie ratings, part 2
Problem:
```
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
set zeppelin.spark.sql.stacktrace = true to see full stacktrace
This means Hive in your environment has died. If restarting it does not help, it is faster to spin up a fresh VM instance just for this section.
```

```
%sql
select * from ratings Limit 10
SELECT rating,count(*) as count from ratings group by rating
final case class Rating (movieID: Int, title: String)
val lines = sc.textFile("hdfs:///tmp/ml-100k/u.item")
.map(x=>{val fields =x.split('|');
Rating(fields(0).toInt,fields(1))
})
import sqlContext.implicits._
val moviesDF=lines.toDF()
moviesDF.printSchema()
moviesDF.registerTempTable("titles")
%sql
select t.title,count(*) cnt from ratings r join titles t on
r.movieID=t.movieID Group by t.title order by cnt DESC
```
## section9
82. [Activity] Publishing web logs with Kafka
At the end of this section you need three terminal windows:
1. one as the listener (consumer)
2. one as the sender (the connect worker)
3. one as the writer (appending to the source file)
```bash=
cd /usr/hdp/current/kafka-broker/conf
cp connect-file-sink.properties ~
cp connect-file-source.properties ~
cp connect-standalone.properties ~
cd ~
```
vi connect-standalone.properties
```bash=
bootstrap.servers=sandbox-hdp.hortonworks.com:6667
```

vi connect-file-sink.properties
```bash=
file=/home/maria_dev/logout.txt
topics=log-test
```

vi connect-file-source.properties
```bash=
file=/home/maria_dev/access_log_small.txt
topic=log-test
```

```bash=
wget http://media.sundog-soft.com/hadoop/access_log_small.txt
```
```bash=
./connect-standalone.sh ~/connect-standalone.properties ~/connect-file-source.properties ~/connect-file-sink.properties
```
In another window:
```bash=
./kafka-console-consumer.sh --bootstrap-server sandbox-hdp.hortonworks.com:6667 --topic log-test
```
In a third window:
```bash=
echo -e "this is a new line to kafka" >> access_log_small.txt
```

84. [Activity] Set up Flume and publish logs with it.
This section uses three windows (the video only uses two):
a server and a sender, plus a logger (because our version differs from the one in the video).
https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963868#questions/10549450
It seems that version 2.6.5 has issues logging to console.
Window 1 (sender):
```bash=
wget http://media.sundog-soft.com/hadoop/example.conf
```
(original version: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html)
Window 2 (server):
```bash=
cd /usr/hdp/current/flume-server/
sudo bin/flume-ng agent --conf conf --conf-file ~/example.conf --name a1 -Dflume.root.logger=INFO,console
```

Window 1 (sender):
```bash=
telnet localhost 44444
```

Window 3 (logger):
```bash=
tail -f /var/log/flume/flume.log
```

85. [Activity] Set up Flume to monitor a directory and store its data in HDFS
As above, open three windows (if you want to watch the log output):
a server and a sender, plus a logger (because our version differs from the video's).
Step 1, window 1 (sender):
```bash=
wget http://media.sundog-soft.com/hadoop/flumelogs.conf
```
Step 2, window 2 (server):
```bash=
sudo bin/flume-ng agent --conf conf --conf-file /home/maria_dev/flumelogs.conf --name a1 -Dflume.root.logger=INFO,console
```
Step 3: in between, create the flume directory in HDFS via Ambari.
Step 4, window 1 (the sender again):
```bash=
mkdir spool
cp access_log_small.txt spool/fred.txt
```

## section10
87. [Activity] Analyze web logs published with Flume using Spark Streaming
This section needs three windows:
one as the server,
one running the Spark counting job,
and one as the data producer.
Start the Spark counting job:
```bash=
wget http://media.sundog-soft.com/hadoop/sparkstreamingflume.conf
wget http://media.sundog-soft.com/hadoop/SparkFlume.py
export SPARK_MAJOR_VERSION=2
spark-submit --packages org.apache.spark:spark-streaming-flume_2.11:2.3.0 SparkFlume.py
```
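For orientation, a stripped-down sketch of what the Spark side does (a simplification of SparkFlume.py, not the exact file; the listening host/port must match the Avro sink in sparkstreamingflume.conf, and 9092 here is an assumption):
```python=
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="StreamingFlumeLogCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Flume pushes events to this Avro listener; each event is (headers, body).
events = FlumeUtils.createStream(ssc, "localhost", 9092)
lines = events.map(lambda event: event[1])

# Print how many log lines arrive in each batch.
lines.count().pprint()

ssc.start()
ssc.awaitTermination()
```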

Start the server:
```bash=
cd /usr/hdp/current/flume-server/
sudo bin/flume-ng agent --conf conf --conf-file ~/sparkstreamingflume.conf --name a1
```
Start the data producer:
```bash=
wget http://media.sundog-soft.com/hadoop/access_log.txt
cp access_log.txt spool/log22.txt
# check the output before copying the next one
cp access_log.txt spool/log23.txt
```


91. [Activity] Count words with Storm
Open Ambari
and start the Storm and Kafka services.
```bash=
cd /usr/hdp/current/storm-client/
cd contrib/storm-starter/src/jvm/org/apache/storm/starter/
vi WordCountTopology.java
storm jar /usr/hdp/current/storm-client/contrib/storm-starter/storm-starter-topologies-*.jar org.apache.storm.starter.WordCountTopology wordcount
cd /usr/hdp/current/storm-client/logs/workers-artifacts/
tail -f worker.log
```
If you run it too many times you end up with a pile of logs.


If you accidentally created a topology
and need to remove it:
storm kill wordcount -w 5 (use your topology name; the one above is wordcount)
(https://stackoverflow.com/questions/48700762/storm-killed-topology-not-getting-removed-from-topology-list)
93. [Activity] Counting words with Flink
This section again uses three windows:
a server,
a listener,
and a data viewer.
Server:
```bash=
wget https://downloads.apache.org/flink/flink-1.13.1/flink-1.13.1-bin-scala_2.11.tgz
tar -xvf flink-1.13.1-bin-scala_2.11.tgz
cd flink-1.13.1/conf/
vi flink-conf.yaml
# change the web UI port to 8082
cd ..
./bin/start-cluster.sh
nc -l 9000
```
The config file contents differ slightly from the video; the port is the part to change.


Another window (the listener):
```bash=
cd flink-1.13.1
./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
```
Third window (data viewer):
```bash=
cd flink-1.13.1
cd log/
ls -ltr
cat flink-maria_dev-taskexecutor-0-sandbox-hdp.hortonworks.com.out
```
Download page:
https://flink.apache.org/downloads.html#apache-flink-1131
Download:
https://downloads.apache.org/flink/flink-1.13.1/flink-1.13.1-bin-scala_2.11.tgz
Source code:
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/socket/SocketWindowWordCount.scala
# DE-L2 Udemy Supplementary Material
## section2
1. [How MapReduce works](http://tw.gitbook.net/hadoop/intro-mapreduce.html)
2. [Hadoop introduction, basics](https://www.slideshare.net/cloudma/hadoop-6189763?qid=686666c2-8823-4bf3-aaa0-b4f8164dbb08&v=&b=&from_search=1)
3. [Big Data Tornado - classic enterprise big data use cases from Taiwan, 2015
(how they framed the application scenarios)](https://www.slideshare.net/etusolution/k4-big-datatornadobdt2015?next_slideshow=1)
4. [Hadoop Big Data success stories
(how they framed the application scenarios)](https://www.slideshare.net/etusolution/hadoop-big-data-40024681?qid=03d1404d-9e42-42c6-b776-e9ea7d8fc7b0&v=&b=&from_search=7)
5. Running MapReduce jobs with Hadoop streaming (
https://cloudxlab.com/assessment/displayslide/279/run-mapreduce-jobs-using-hadoop-streaming)
6. mrjob source code (https://github.com/Yelp/mrjob/tree/091572e87bc24cc64be40278dd0f5c3617c98d4b)
7. ZooKeeper overview
(https://codingnote.cc/zh-tw/p/204077/)
8. MapReduce introduction
http://debussy.im.nuu.edu.tw/sjchen/BigData/MapReduce.pdf
## section4
1. Spark follow-up resources
(Spark's three deployment modes: https://waltyou.github.io/Spark-Cluster-Manager/
Spark's three cluster deploy options: https://kknews.cc/zh-tw/tech/4q5mov.html
Spark RDD: https://kknews.cc/zh-tw/code/2k83ene.html
Spark RDD concepts and the map operation: https://ithelp.ithome.com.tw/articles/10186282
Spark memory management: https://iter01.com/554916.html)
## section5
1. Hive follow-up resources
(Comparison of six open-source database engines: https://www.finereport.com/tw/company/sql-2.html)
( Hive overview: https://zhuanlan.zhihu.com/p/81189211
Hive indexes: https://blog.csdn.net/zzq900503/article/details/79391071)
2. Sqoop follow-up resources
(Databases supported by Sqoop: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_supported_databases
How Sqoop works: https://www.jianshu.com/p/ec9003d8918c
Sqoop homepage: https://sqoop.apache.org/
)
## section6
1. Comparison of three NoSQL databases: MongoDB, Cassandra and HBase - https://kknews.cc/zh-tw/code/amrqj5g.html
2. A quick look at four mainstream NoSQL databases - https://www.ithome.com.tw/news/92507
3. Cassandra vs. MongoDB vs. Hbase: A Comparison of NoSQL Databases - https://logz.io/blog/nosql-database-comparison/
4. Five concepts you must know to understand NoSQL - https://www.ithome.com.tw/news/92506
5. How to learn and choose a NoSQL database for big data - https://kknews.cc/zh-tw/tech/m3695y6.html
6. The history of the CAP theorem
https://zhuanlan.zhihu.com/p/66487328
7. Understanding CAP through blockchain
https://blockcast.it/2020/06/05/bsos-understand-cap-theorem-first-then-blockchain/
8. The CAP section of a blockchain technology handbook
https://poweichen.gitbook.io/blockchain-guide-zh/distribute_system/cap
9. A first look at distributed databases and the NoSQL CAP theorem
https://oldmo860617.medium.com/%E5%88%9D%E6%AD%A5%E8%AA%8D%E8%AD%98%E5%88%86%E6%95%A3%E5%BC%8F%E8%B3%87%E6%96%99%E5%BA%AB%E8%88%87-nosql-cap-%E7%90%86%E8%AB%96-a02d377938d1
10. Why Cassandra is highly available
https://kknews.cc/zh-tw/code/4pjrlog.html
11. CAP is not necessarily CAP (make sure you know what it really means)
https://blog.the-pans.com/cap/
12. The C (consistency) in CAP
https://tech.youzan.com/cap-coherence-protocol-and-application-analysis/
13. CAP systems from an architect's point of view
https://iter01.com/517265.html
## section8
1. Mesos explanation (ITHelp Ironman series): https://ithelp.ithome.com.tw/articles/10184643
2. Mesos vs Yarn
https://data-flair.training/blogs/comparison-between-apache-mesos-vs-hadoop-yarn/
3. Comparison of the Tez execution engine vs MapReduce
https://stackoverflow.com/questions/41630987/tez-execution-engine-vs-mapreduce-execution-engine-in-hive
4. What a DAG is
https://blog.csdn.net/weixin_37536020/article/details/106815387
http://web.ntnu.edu.tw/~algo/DirectedAcyclicGraph.html
5. Differences between Spark and Tez
https://www.xplenty.com/blog/apache-spark-vs-tez-comparison/
http://www.zdingke.com/2016/12/05/spark%E4%B8%8Etez%E6%AF%94%E8%BE%83/
6. How the YARN managers work
https://kknews.cc/zh-tw/code/gya93m9.html
* Official sites
Hue https://gethue.com/
Zeppelin https://zeppelin.apache.org/docs/latest/quickstart/install.html
Oozie https://oozie.apache.org/
Tez https://www.cloudera.com/products/open-source/apache-hadoop/apache-tez.html
Zookeeper https://zookeeper.apache.org/
## section9
1. Differences and relationship between the log-collection systems Flume and Kafka: https://www.itread01.com/content/1544497746.html
1. One example of integrating Flume with Kafka
https://www.cnblogs.com/smartloli/p/9984140.html
1. Comparison of Flume, Kafka and NiFi
https://kknews.cc/zh-tw/tech/4654b2q.html
1. How Kafka works
https://zhuanlan.zhihu.com/p/68052232
1. Comparison of RabbitMQ and Kafka
https://kknews.cc/zh-tw/code/abjxo8x.html
* Official sites
Kafka https://kafka.apache.org/24/documentation/streams/core-concepts
Flume https://flume.apache.org/
Nifi (not covered in the course) https://nifi.apache.org/
## section10
1. Apache Storm vs Spark Streaming
https://www.ericsson.com/en/blog/2015/7/apache-storm-vs-spark-streaming
1. Apache Storm vs Spark Streaming – Feature wise Comparison
https://data-flair.training/blogs/apache-storm-vs-spark-streaming/
1. Comparison of the mainstream streaming frameworks Flink, Storm and Spark: https://www.itread01.com/content/1550530998.html
1. Three frameworks for stream computing: Storm, Spark and Flink
https://codertw.com/程式語言/666820/
1. Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework
https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
1. Choosing a stream-processing framework in the big data ecosystem (Storm vs Kafka Streams vs Spark Streaming vs Flink vs Samza)
https://blog.csdn.net/WeiJonathan/article/details/83864244
* storm
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/developing-storm-applications/content/understanding_sliding_and_tumbling_windows.html
* flink
https://archive.apache.org/dist/flink/flink-1.2.0/flink-1.2.0-bin-hadoop27-scala_2.10.tgz
Flink introduction
https://ithelp.ithome.com.tw/articles/10199506
* stream-processing
https://www.upsolver.com/blog/popular-stream-processing-frameworks-compared
https://www.datanami.com/2019/05/30/understanding-your-options-for-stream-processing-frameworks/
* Official sites
storm https://storm.apache.org/2020/06/30/storm220-released.html
Flink https://flink.apache.org/
Stream-processing https://spark.apache.org/streaming/
## section11
* Streaming series
1. Distributed system architecture: a review of common 2020 interview topics (Redis, Kafka, system design bottlenecks)
https://iter01.com/577275.html
1. Kafka use cases: https://www.gushiciku.cn/pl/gJwW/zh-tw
2. Scenarios that give rise to distributed transactions and the corresponding solutions
https://codingnote.cc/zh-tw/p/239802/
3. Applications of caching
https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/11747/
* System evaluation series (Andrew's articles)
1. The inevitable path of migrating to microservices: refactoring systems and databases
https://www.facebook.com/123910281317592/videos/545139382528011
1. Preparations before and after adopting an MQ
https://www.facebook.com/123910281317592/videos/399472807454255
1. SLO
https://columns.chicken-house.net/2021/06/04/slo/#3-%E5%85%B6%E4%BB%96%E8%B7%9F-slo-%E7%9B%B8%E9%97%9C%E7%9A%84%E9%80%B2%E9%9A%8E%E8%AD%B0%E9%A1%8C
* Search series
1. solr vs elasticsearch
https://logz.io/blog/solr-vs-elasticsearch/
1. lucene
https://lucene.apache.org/
1. Lucene algorithms
http://www.blogjava.net/kingpub/archive/2012/07/16/64174.html
1. Full-text search (in SQL Server)
https://docs.microsoft.com/zh-tw/sql/relational-databases/search/full-text-search?view=sql-server-ver15
1. Setting up full-text search in SQL, step by step
https://dotblogs.com.tw/Tm731531/2020/08/23/012252
1. Inverted indexes (the key to full-text search)
https://www.elastic.co/guide/cn/elasticsearch/guide/current/inverted-index.html
* Official sites
IMPALA
https://impala.apache.org/
Accumulo
https://accumulo.apache.org/
REDIS
https://redis.io/
IGNITE
https://ignite.apache.org/
Elasticsearch
https://www.elastic.co/elasticsearch/
Kinesis (AWS ecosystem) (AWS kafka)
https://aws.amazon.com/tw/kinesis/
Nifi
https://Nifi.apache.org/
Falcon
https://falcon.apache.org/
apache slider
https://www.cloudera.com/products/open-source/apache-hadoop/apache-slider.html
# (Optional) Additional Information
1. Dataset site (https://grouplens.org/datasets/movielens/ )
https://files.grouplens.org/datasets/movielens/ml-100k.zip
2. Official course website
(https://sundog-education.com/)
3. The instructor's YouTube channel
(https://www.youtube.com/c/SundogEducation/playlists)
4. Collection of Q&A questions from the course

5. Basic Linux command reference
(http://linux.vbird.org/linux_basic/redhat6.1/linux_06command.php)
6. Linux applications (VBird's site)
(http://linux.vbird.org/)
7. The CRISP-DM model (a development process for data applications)
(https://adamsanalysis.com/data-science/crisp-dm-introduction)
# (Optional) The Ins and Outs of Big Data Projects
* Architecture
1. How do traditional industries build a big data team?
https://www.finereport.com/tw/knowledge/acquire/bigdatateam.html
1. Four power outages in northern Taiwan in June: a look at big data management for power utilities
https://www.finereport.com/tw/knowledge/acquire/electricbigdata.html
* Case studies
1. Analysis examples that various kinds of data can support
https://www.finereport.com/tw/tag/productionreport
1. Analysis: 49 classic cases of big data companies mining value from data (worth bookmarking)
https://kknews.cc/zh-tw/tech/ekx9v54.html
1. What does "big data" have to do with me? Five charts on the business applications of big data analytics
https://www.stockfeel.com.tw/%E5%A4%A7%E6%95%B8%E6%93%9A%E5%88%B0%E5%BA%95%E8%88%87%E6%88%91%E6%9C%89%E4%BB%80%E9%BA%BC%E9%97%9C%E8%81%AF%EF%BC%9F5-%E5%BC%B5%E5%9C%96%E5%BF%AB%E9%80%9F%E4%BA%86%E8%A7%A3%E5%A4%A7/
* Data visualization (something the course barely covers). Row after row of raw data does not really register with people; to let the data speak, and to turn it into value or revenue, this step is needed.
* Tools (examples, not an exhaustive list)
1. Tableau
1. PowerBI
1. D3JS
1. python bokeh
1. Excel charts
* What it can do
1. All kinds of charts and reports
1. Word clouds
1. Association analysis
1. Value estimation
* Goals (examples, not exhaustive)
1. Turn a "view" into evidence, e.g. last year's average temperature was lower and this year's is higher, so the stock of popsicles in logistics should be raised; to show this you visualize last year's and this year's temperatures together.
1. Make an "estimate" concrete, e.g. use the past six months of financials to project this year's revenue, convert that into full-year EPS, and work back to a fair price for the stock.
1. Turn a "feeling" into data, e.g. "almost every release the bugs come from unit XX"; with API tracing logs you can pull the related figures and reports to help prioritize what to fix first.
# (Optional) Docker Environment Issues
## 1. Resetting the Ambari admin password
In the container shell, run:
```bash=
ambari-admin-password-reset
```
## 2.Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wnNrDp/click/
Explanation: the Python inside the Docker image is 2.7.5 with the matching pip 8, which is barely supported any more. Be careful: changing the default Python will cause some package incompatibilities later, so only change this setup if you truly cannot proceed otherwise (https://discuss.python.org/t/on-the-old-python-2-7-5-version-the-pip-install-package-does-not-install-due-to-the-ssl-error/5811)

Solutions
1. Upgrade pip first, then update setuptools (provided by Jeffrey):
```bash=
pip install --upgrade pip
yum reinstall -y python2-pip.noarch python27-python-pip.noarch
pip install setuptools==33.1.1
```
2.
The approach that worked for me:
pull Python up to 3.7.5.
```bash=
yum groupinstall "Development Tools" -y
wget https://www.python.org/ftp/python/3.7.5/Python-3.7.5.tgz
tar -zxvf Python-3.7.5.tgz
cd Python-3.7.5
./configure
make all
make install
make clean
make distclean
```
That completes the installation.
The following replaces the default Python version:
```bash=
/usr/local/bin/python3.7 -V
mv /usr/bin/python /usr/bin/python2.7.5
ln -s /usr/local/bin/python3.7 /usr/bin/python
sed -i 's/python/python2.7/g' /usr/bin/yum
sed -i 's/python/python2.7/g' /usr/libexec/urlgrabber-ext-down
python -V
```
After that, invoke pip as
python -m pip
