---
title: DE-L2 Udemy Troubleshooting Notes
tags: wedata
description: DE-L2 Udemy troubleshooting and workarounds
---
# Table of Contents
[TOC]
# DE-L2 Udemy Course Issues
## section2
1. 2-16 Install pip and mrjob
`pip install google-api-python-client==1.6.4`
fails here.

Solutions
1. You can simply skip it; it does not affect the MRJob exercises later, and the Docker image already ships mrjob 0.7.
2. Upgrade pip first, then update setuptools (provided by Jeffrey):
```bash=
pip install --upgrade pip
yum reinstall -y python2-pip.noarch python27-python-pip.noarch
pip install setuptools==33.1.1
pip install google-api-python-client==1.6.4
```
3. Install Anaconda and run a proper Python/pip through it:
```
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
mkdir tools
bash Anaconda3-2021.05-Linux-x86_64.sh
yum install vim
source ~/.bashrc
conda update --all
conda update conda
conda update anaconda
conda install google-api-python-client
pip install google-api-python-client
pip install google-api-python-client==1.6.4
```
Once the above succeeds it will run, but MRJob works even without it.

2-18 Code up the ratings histogram MapReduce job and run it
Inside TopMovies.py, add the following to enable sorting.
Reference: https://www.cnpython.com/qa/346173
```python=
MRJob.SORT_VALUES = True
```
Modified version:
```python=
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    MRJob.SORT_VALUES = True

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
            MRStep(mapper=self.map, reducer=self.reducer_sorted_output)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield int(movieID), 1

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

    def map(self, data, line):
        yield None, ("%04d" % int(data), line)

    def reducer_sorted_output(self, n, movies):
        for movie in movies:
            yield movie[1], movie[0]

if __name__ == '__main__':
    RatingsBreakdown.run()
```
## section4
1. [Activity] Find the movie with the lowest average rating - with RDD's
The sandbox has no Spark 1, only Spark 2, and you need admin privileges to change the settings.

33. [Activity] Movie recommendations with MLLib
``` bash=
sudo pip install numpy==1.16
export SPARK_MAJOR_VERSION=2
spark-submit MovieRecommendationsALS.py
```

Open MovieRecommendationsALS.py and add this line right after the spark object is created:
```python=
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```
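For orientation, here is a minimal sketch of where that line sits, assuming the script builds its own SparkSession (illustrative only, not the exact contents of the course file):
```python=
from pyspark.sql import SparkSession

# Create the session the way the script would, then relax the cross-join
# restriction that Spark 2.x enables by default, which otherwise breaks
# the recommendation join.
spark = SparkSession.builder.appName("MovieRecommendationsALS").getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```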


## section5
43. [Activity] Use Sqoop to import data from MySQL to HDFS/Hive
In the video it is done like this:
```bash=
GRANT ALL PRIVILEGES ON movielens.* to ''@'localhost';
sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table movies -m 1
```
If you hit this problem:
(ERROR manager.SqlManager: Error executing statement: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure)
https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963452#questions/6479200
```bash=
GRANT ALL PRIVILEGES ON movielens.* to 'root'@'localhost' IDENTIFIED BY 'hadoop';
sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --username root -P --table movies -m 1
whereis sqoop
cd /usr/lib/
mkdir sqoop
cd sqoop/
mkdir lib
cd lib
wget http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-8.0.23.tar.gz
tar -zxf mysql-connector-java-8.0.23.tar.gz
cd mysql-connector-java-8.0.23
mv mysql-connector-java-8.0.23.jar ..
cd $SQOOP_HOME/bin
sqoop-version
cd ~
```
The root cause is that the Docker image does not ship the MySQL connector.
Reference: http://tw.gitbook.net/sqoop/sqoop_installation.html

-----------------------
During the process the data did get imported into Hive, but the following error message appeared:

```
Failed with exception org.apache.hadoop.security.AccessControlException: User null does not belong to hadoop
at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setOwner(FSDirAttrOp.java:89)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setOwner(FSNamesystem.java:1877)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setOwner(NameNodeRpcServer.java:828)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setOwner(ClientNamenodePr
otocolServerSideTranslatorPB.java:476)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlocking
Method(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
```
Cause: the user has not been added to the required groups.
```
usermod -G hadoop root
usermod -a -G hdfs root
```
## section6
51. [Activity] Installing Cassandra
If you already have Python 2.7 you can skip this (the video has 2.6.6; check with python -V):
```bash=
yum install scl-utils
yum install centos-release-scl-sh
scl enable python27 bash
```
Workaround if the installation fails.
The steps include adding the repo and installing Cassandra.
https://gist.github.com/jpollard3/af376dceb24ed6d06a0eaed60fd33b6b
```bash=
# Take snapshot of current state in virtualbox as a backup
# step 1: install cassandra (datastax version)
cat >/etc/yum.repos.d/datastax.repo <<EOL
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
EOL
yum -y install dsc30
yum -y install jna
cassandra
nodetool status
# step 2: install python 2.7 if you want to run cqlsh: https://github.com/h2oai/h2o-2/wiki/installing-python-2.7-on-centos-6.3.-follow-this-sequence-exactly-for-centos-machine-only
yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel
cd /opt
wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
./configure --prefix=/usr/local
make && make altinstall
ls -ltr /usr/bin/python*
ls -ltr /usr/local/bin/python*
echo $PATH
which python
mv /usr/lib/python2.7/site-packages/* /usr/local/lib/python2.7/site-packages/
rmdir /usr/lib/python2.7/site-packages
ln -s /usr/local/lib/python2.7/site-packages /usr/lib/python2.7/site-packages
# finished. Launch cqlsh
cqlsh
```
54. [Activity] Install MongoDB, and integrate Spark with MongoDB
```bash=
sudo service ambari restart
# If that fails, use this instead:
sudo service ambari-server restart
# If you also need to (re)set the Ambari admin (root) password:
ambari-admin-password-reset
# In the video:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 MongoSpark.py
# But we need this instead (our sandbox version is newer):
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0 MongoSpark.py
```
## section8
74. [Activity] Set up a simple Oozie workflow

If you run into problems here,
the fix is as follows:
https://community.cloudera.com/t5/Support-Questions/Oozie-web-console-is-disabled/td-p/243675
https://stackoverflow.com/questions/49276756/ext-js-library-not-installed-correctly-in-oozie
```bash=
wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
sudo cp ext-2.2.zip /usr/hdp/current/oozie-client/libext/
sudo chown oozie:hadoop /usr/hdp/current/oozie-client/libext/ext-2.2.zip
sudo -u oozie /usr/hdp/current/oozie-server/bin/oozie-setup.sh prepare-war
```

oozie job -oozie http://localhost:11000/oozie -config /home/maria_dev/job.properties -run

If this command fails, fix it as follows.

First, check in Ambari that the HDFS path /user/maria_dev/movies contains the two files u.data and u.item (check this every time you create the job).

Next, check that MySQL has the root/hadoop account and that it has been granted privileges on the database (if not, grant them with the GRANT statement above).

Of the three downloaded files, the contents of workflow.xml need to be modified:
```
import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --username root --password hadoop --table movies -m 1
```
Also check in Ambari that Hive, Oozie, and Sqoop are all started.

Intermediate version

Successful version



## section9
81. [Activity] Setting up Kafka, and publishing some data
Issue:
sandbox.hortonworks.com in the video
must be changed to sandbox-hdp.hortonworks.com
```
cd /usr/hdp/current/kafka-broker/bin
# Create a topic
./kafka-topics.sh --create --zookeeper sandbox-hdp.hortonworks.com:2181 --replication-factor 1 --partitions 1 --topic fred
# List topics
./kafka-topics.sh --list --zookeeper sandbox-hdp.hortonworks.com:2181
# Start a producer
./kafka-console-producer.sh --broker-list sandbox-hdp.hortonworks.com:6667 --topic fred
```
Consumer window (change the hostname and drop the --zookeeper flag):
```
./kafka-console-consumer.sh --bootstrap-server sandbox-hdp.hortonworks.com:6667 --topic fred --from-beginning
```
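If you prefer to sanity-check the topic from Python instead of the console scripts, a minimal sketch with the kafka-python package works too (pip install kafka-python; this package is an extra assumption, not part of the course):
```python=
from kafka import KafkaProducer, KafkaConsumer

BROKER = "sandbox-hdp.hortonworks.com:6667"

# Publish one message to the 'fred' topic created above.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send("fred", b"hello from python")
producer.flush()

# Read the topic from the beginning, like the console consumer above,
# and stop after 5 seconds of silence.
consumer = KafkaConsumer("fred", bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```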
## section10
# DE-L2 Udemy Course Code and Syntax
## section2
2-17 Installing Python, MRJob, and nano
Download the data and the recommended Python code:
``` bash=
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
wget http://media.sundog-soft.com/hadoop/TopMovies.py
```
2-18 Code up the ratings histogram MapReduce job and run it
Scripts to run (run them one line at a time and compare the results):
``` bash=
python RatingsBreakdown.py u.data
python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data
```
<font color="red"> According to the instructor, MostPopularMovie.py is only a demo.<br> Look at RatingsBreakdown.py <br> or use TopMovies.py instead. </font>
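For reference, a minimal sketch of what the ratings-histogram job amounts to (reconstructed from the lecture's description, so treat it as an approximation of RatingsBreakdown.py, not the official file):
```python=
from mrjob.job import MRJob

class RatingsBreakdown(MRJob):
    # u.data lines are: userID, movieID, rating, timestamp (tab separated)
    def mapper(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    # Sum the 1s emitted for each rating value to build the histogram.
    def reducer(self, rating, counts):
        yield rating, sum(counts)

if __name__ == '__main__':
    RatingsBreakdown.run()
```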

```bash=
sudo yum-config-manager --save --setopt=HDP-SOLR-2.6-100.skip_if_unavailable=true
```

## section4
1. [Activity] Find the movie with the lowest average rating - with RDD's
```bash=
mkdir ml-100k
cd ml-100k
wget http://media.sundog-soft.com/hadoop/ml-100k/u.item
cd ..
wget http://media.sundog-soft.com/hadoop/Spark.zip
unzip Spark.zip
spark-submit LowestRatedMovieSpark.py
```
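As a reminder of what the RDD version does, here is a stripped-down sketch along the same lines (my approximation of LowestRatedMovieSpark.py, not the exact file; the HDFS path is an assumption):
```python=
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WorstMovies")
sc = SparkContext(conf=conf)

# u.data lines are: userID  movieID  rating  timestamp (tab separated)
lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")

# (movieID, (rating, 1.0)) so sums and counts can be reduced together
movie_ratings = lines.map(lambda l: (int(l.split()[1]),
                                     (float(l.split()[2]), 1.0)))
totals = movie_ratings.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = totals.mapValues(lambda t: t[0] / t[1])

# Lowest average ratings first
for movie_id, avg in averages.sortBy(lambda kv: kv[1]).take(10):
    print(movie_id, avg)
```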
33. [Activity] Movie recommendations with MLLib
``` bash=
sudo pip install numpy==1.16
export SPARK_MAJOR_VERSION=2
spark-submit MovieRecommendationsALS.py
```
35. [Activity] Check your results against mine!
```bash =
spark-submit LowestRatedPopularMovieSpark.py
spark-submit LowestRatedPopularMovieDataFrame.py
```
## section5
42. [Activity] Install MySQL and import our movie data
```bash =
systemctl stop mysqld
systemctl set-environment MYSQLD_OPTS="--skip-grant-tables --skip-networking"
systemctl start mysqld
wget http://media.sundog-soft.com/hadoop/movielens.sql
mysql -uroot
```
In MySQL:
```sql=
FLUSH PRIVILEGES;
ALTER USER 'root'@'localhost' IDENTIFIED BY 'hadoop' ;
FLUSH PRIVILEGES;
create database movielens;
show databases;
SET NAMES 'utf8';
SET CHARACTER SET utf8;
use movielens
source movielens.sql;
show tables;
select * from movies limit 10;
describe ratings;
select movies.title ,count(ratings.movie_id) as ratingCount From movies inner join ratings on movies.id=ratings.movie_id group by movies.title
order by ratingCount ;
```
44. [Activity] Use Sqoop to export data from Hadoop to MySQL
```bash=
sqoop export --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --username root -P --table exported_movies --export-dir /apps/hive/warehouse/movies --input-fields-terminated-by '\0001'
mysql -u root -p
```
In MySQL:
```sql=
use movielens;
select * from exported_movies limit 10;
```
## section6
47. [Activity] Import movie ratings into HBase
1. Install Python on your own machine
https://www.python.org/downloads/
2. Install Jupyter Notebook
pip install jupyter notebook

3. After installing, launch Jupyter Notebook.

Once you are in,

first start the HBase REST service on port 8000 from inside the sandbox:
```bash =
/usr/hdp/current/hbase-master/bin/hbase-daemon.sh start rest -p 8000 --infoport 8001
```
Program contents:
Python version:
https://drive.google.com/file/d/1rmKTLXfTEdYIldZW-ycQ2c8VdgjidBle/view?usp=sharing
Jupyter notebook version:
https://drive.google.com/file/d/1aNIOIUm7Kbs0brzaTU2ruI828XEKwN2t/view?usp=sharing
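The linked notebooks talk to the HBase REST endpoint started above. As a rough idea of the pattern, here is a minimal sketch using the starbase package (pip install starbase; the table and column-family names here are illustrative and may differ from the linked files):
```python=
from starbase import Connection

# Connect to the HBase REST server exposed on port 8000 above.
connection = Connection("127.0.0.1", "8000")

ratings = connection.table("ratings")
if ratings.exists():
    ratings.drop()
ratings.create("rating")          # one column family named 'rating'

# Load u.data: userID, movieID, rating, timestamp (tab separated)
batch = ratings.batch()
with open("u.data", "r") as f:
    for line in f:
        user_id, movie_id, rating, _ = line.split()
        batch.update(user_id, {"rating": {movie_id: rating}})
batch.commit(finalize=True)

print(ratings.fetch("33"))        # all ratings stored for user 33
```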
48. [Activity] Use HBase with Pig to import data at scale.
```bash =
hbase shell
wget http://media.sundog-soft.com/hadoop/hbase.pig
pig hbase.pig
hbase shell
```
50. If you have trouble installing Cassandra...
```bash =
cd ~
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm
rpm2cpio http://mirror.centos.org/centos/6/os/x86_64/Packages/nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm | cpio -idmv
cp ./lib64/libfreeblpriv3.* /lib64
```
51. [Activity] Installing Cassandra
If you already have Python 2.7 you can skip this (the video has 2.6.6; check with python -V):
```bash=
yum install scl-utils
yum install centos-release-scl-sh
scl enable python27 bash
```
```bash=
cd /etc/yum.repos.d/
vi datastax.repo
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
yum install dsc30 -y
yum install cassandra30-tools-3.0.9-1 -y
python -m pip install cqlsh
service cassandra start
```
```bash=
CREATE KEYSPACE movielens WITH replication = {'class': 'SimpleStrategy','replication_factor':'1'}AND durable_writes=true;
use movielens
CREATE TABLE users (user_id int,age int , gender text , occupation text ,zip text, PRIMARY KEY (user_id));
select * from users;
```
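If you want to hit the same keyspace from Python instead of cqlsh, here is a minimal sketch with the DataStax cassandra-driver package (pip install cassandra-driver; this is an extra assumption, the course itself only uses cqlsh and Spark):
```python=
from cassandra.cluster import Cluster

# Connect to the single-node Cassandra started above.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("movielens")

# Insert one illustrative row into the users table created above.
session.execute(
    "INSERT INTO users (user_id, age, gender, occupation, zip) "
    "VALUES (%s, %s, %s, %s, %s)",
    (1, 33, "M", "engineer", "12345"),
)

for row in session.execute("SELECT * FROM users LIMIT 10"):
    print(row.user_id, row.age, row.occupation)

cluster.shutdown()
```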


52. [Activity] Write Spark output into Cassandra
```bash=
wget http://media.sundog-soft.com/hadoop/CassandraSpark.py
export SPARK_MAJOR_VERSION=2
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.1 CassandraSpark.py
```
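Roughly speaking, that job moves MovieLens data in and out of the movielens keyspace through the connector passed via --packages. A minimal read-side sketch of the pattern (an approximation, not the course's CassandraSpark.py):
```python=
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CassandraIntegration")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Read the users table from the movielens keyspace created earlier.
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(table="users", keyspace="movielens")
         .load())

users.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE age < 20").show()

spark.stop()
```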


54. [Activity] Install MongoDB, and integrate Spark with MongoDB

```bash=
cd /var/lib/ambari-server/resources/stacks/
cd HDP/
cd 2.6
cd services/
pwd
git clone https://github.com/nikunjness/mongo-ambari.git
sudo service ambari restart
# If that fails, use this instead:
sudo service ambari-server restart
```


```bash=
pip install pymongo
cd ~
wget http://media.sundog-soft.com/hadoop/MongoSpark.py
export SPARK_MAJOR_VERSION=2
# In the video:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 MongoSpark.py
# But we need this instead (our sandbox version is newer):
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0 MongoSpark.py
```
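For orientation, the connector lets Spark read and write MongoDB collections directly; a minimal sketch of reading the users collection back out (an approximation with an illustrative URI, not the exact MongoSpark.py):
```python=
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MongoDBIntegration").getOrCreate()

# Read the users collection through the connector supplied via --packages.
users = (spark.read
         .format("com.mongodb.spark.sql.DefaultSource")
         .option("uri", "mongodb://127.0.0.1/movielens.users")
         .load())

users.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE age < 20").show()

spark.stop()
```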
55. [Activity] Using the MongoDB shell
```bash=
mongo
use movielens
db.users.find({user_id:100})
```
MongoDB does not create this index automatically:
```bash=
db.users.explain().find({user_id:100})
db.users.createIndex({user_id:1})
db.users.aggregate([{$group:{_id:{occupation:"$occupation"},avgAge:{$avg:"$age"}}}])
```
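The same checks can be done from Python with pymongo (pip install pymongo; a minimal sketch, assuming MongoDB is listening on the default port):
```python=
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017")
db = client.movielens

# Same lookup as db.users.find({user_id:100}) in the mongo shell.
print(db.users.find_one({"user_id": 100}))

# Build the index by hand, as above, then look at the query plan.
db.users.create_index("user_id")
print(db.users.find({"user_id": 100}).explain())

# Average age per occupation, mirroring the aggregate() call above.
pipeline = [{"$group": {"_id": "$occupation", "avgAge": {"$avg": "$age"}}}]
for doc in db.users.aggregate(pipeline):
    print(doc)
```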

## section8
69. [Activity] Use Hive on Tez and measure the performance benefit
Done through the UI:
```
DROP VIEW IF EXISTS topMovieIDs;
CREATE VIEW topMovieIDs AS
SELECT movie_id,count(movie_id) as ratingCount
From movielens.ratings
GROUP BY movie_id
ORDER BY ratingCount DESC;
Select n.name ,ratingCount
FROM topMovieIDs t JOIN movielens.names n ON t.movie_id=n.movie_id;
```
Tez: 21 seconds
MR: 65 seconds

70. Mesos explained
Mesos explanation (ITHelp Ironman series): https://ithelp.ithome.com.tw/articles/10184643
72. [Activity] Simulating a failing master with ZooKeeper
```bash=
cd /usr/hdp/current/zookeeper-client/
cd bin
./zkCli.sh
```

```bash=
create -e /testmaster "127.0.0.1:2223"
get /testmaster
```

74. [Activity] Set up a simple Oozie workflow
This workflow loads the data via a sqoop import.
```bash=
wget http://media.sundog-soft.com/hadoop/movielens.sql
systemctl stop mysqld
systemctl set-environment MYSQLD_OPTS="--skip-grant-tables --skip-networking"
systemctl start mysqld
mysql -uroot
create database movielens;
use movielens
source movielens.sql;
show tables;
-- If root has not been granted privileges, grant them with the following:
flush privileges;
GRANT ALL PRIVILEGES ON movielens.* to 'root'@'localhost' IDENTIFIED BY 'hadoop';
flush privileges;
```
```bash=
# As the maria_dev user, work in that user's home directory (this affects the oozie job command below)
wget http://media.sundog-soft.com/hadoop/oldmovies.sql
wget http://media.sundog-soft.com/hadoop/workflow.xml
wget http://media.sundog-soft.com/hadoop/job.properties
hadoop fs -put workflow.xml /user/maria_dev
hadoop fs -put oldmovies.sql /user/maria_dev
hadoop fs -put /usr/share/java/mysql-connector-java.jar /user/oozie/share/lib/lib_20180618160835/sqoop
oozie job -oozie http://localhost:11000/oozie -config /home/maria_dev/job.properties -run
```
76. [Activity] Use Zeppelin to analyze movie ratings, part 1
For 76 and 77 you can simply download the instructor's ready-made notebook JSON:
wget http://media.sundog-soft.com/hadoop/MovieLens.json
then upload the JSON file via Import note.

Change the order as shown in the video.

Spark code:
```bash=
sc.version
```
bash
```bash=
%sh
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data -O /tmp/u.data
wget http://media.sundog-soft.com/hadoop/ml-100k/u.item -O /tmp/u.item
%sh
hadoop fs -rm -r -f /tmp/ml-100k
hadoop fs -mkdir /tmp/ml-100k
hadoop fs -put /tmp/u.item /tmp/ml-100k/
hadoop fs -put /tmp/u.data /tmp/ml-100k/
```
Scala (define the case class first, then use Spark map/reduce operations to split the data):
```scala=
final case class Rating (movieID: Int, rating: Int)
val lines = sc.textFile("hdfs:///tmp/ml-100k/u.data")
.map(x=>{val fields =x.split("\t");
Rating(fields(1).toInt,fields(2).toInt)
})
import sqlContext.implicits._
val ratingsDF=lines.toDF()
ratingsDF.printSchema()
ratingsDF.registerTempTable("ratings")
val topMovieIDs= ratingsDF.groupBy("movieID").count().orderBy(desc("count")).cache()
topMovieIDs.show()
```
77. [Activity] Use Zeppelin to analyze movie ratings, part 2
Problem:
```
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
set zeppelin.spark.sql.stacktrace = true to see full stacktrace
This means Hive in your environment has died. If restarting it does not help, it is faster to spin up a fresh VM instance just for this section.
```

```
%sql
select * from ratings Limit 10
SELECT rating,count(*) as count from ratings group by rating
final case class Rating (movieID: Int, title: String)
val lines = sc.textFile("hdfs:///tmp/ml-100k/u.item")
.map(x=>{val fields =x.split('|');
Rating(fields(0).toInt,fields(1))
})
import sqlContext.implicits._
val moviesDF=lines.toDF()
moviesDF.printSchema()
moviesDF.registerTempTable("titles")
%sql
select t.title,count(*) cnt from ratings r join titles t on
r.movieID=t.movieID Group by t.title order by cnt DESC
```
## section9
82. [Activity] Publishing web logs with Kafka
At the end of this section you need three terminal windows:
1. one as the listener (consumer)
2. one as the sender (the connect worker)
3. one as the writer (appending to the source file)
```bash=
cd /usr/hdp/current/kafka-broker/conf
cp connect-file-sink.properties ~
cp connect-file-source.properties ~
cp connect-standalone.properties ~
cd ~
```
vi connect-standalone.properties
```bash=
bootstrap.servers=sandbox-hdp.hortonworks.com:6667
```

vi connect-file-sink.properties
```bash=
file=/home/maria_dev/logout.txt
topics=log-test
```

vi connect-file-source.properties
```bash=
file=/home/maria_dev/access_log_small.txt
topic=log-test
```

```bash=
wget http://media.sundog-soft.com/hadoop/access_log_small.txt
```
```bash=
./connect-standalone.sh ~/connect-standalone.properties ~/connect-file-source.properties ~/connect-file-sink.properties
```
In another window:
```bash=
./kafka-console-consumer.sh --bootstrap-server sandbox-hdp.hortonworks.com:6667 --topic log-test
```
In a third window:
```bash=
echo -e "this is a new line to kafka" >> access_log_small.txt
```

84. [Activity] Set up Flume and publish logs with it.
This section uses three windows (the video only uses two):
a server and a sender, plus a logger (because our version differs from the one in the video).
https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963868#questions/10549450
It seems that version 2.6.5 has issues logging to console.
Window 1 (sender):
```bash=
wget http://media.sundog-soft.com/hadoop/example.conf
```
(original version: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html)
Window 2 (server):
```bash=
cd /usr/hdp/current/flume-server/
sudo bin/flume-ng agent --conf conf --conf-file ~/example.conf --name a1 -Dflume.root.logger=INFO,console
```

Window 1 (sender):
```bash=
telnet localhost 44444
```

Window 3 (logger):
```bash=
tail -f /var/log/flume/flume.log
```

85. [Activity] Set up Flume to monitor a directory and store its data in HDFS
As above, open three windows (if you want to watch the log output):
a server and a sender, plus a logger (because our version differs from the video's).
Step 1, window 1 (sender):
```bash=
wget http://media.sundog-soft.com/hadoop/flumelogs.conf
```
Step 2, window 2 (server):
```bash=
sudo bin/flume-ng agent --conf conf --conf-file /home/maria_dev/flumelogs.conf --name a1 -Dflume.root.logger=INFO,console
```
Step 3: in between, create the flume directory in HDFS via Ambari.
Step 4, window 1 (the sender again):
```bash=
mkdir spool
cp access_log_small.txt spool/fred.txt
```

## section10
87. [Activity] Analyze web logs published with Flume using Spark Streaming
This section needs three windows:
one as the server,
one running the Spark counting job,
and one as the data producer.
Start the Spark counting job:
```bash=
wget http://media.sundog-soft.com/hadoop/sparkstreamingflume.conf
wget http://media.sundog-soft.com/hadoop/SparkFlume.py
export SPARK_MAJOR_VERSION=2
spark-submit --packages org.apache.spark:spark-streaming-flume_2.11:2.3.0 SparkFlume.py
```
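For orientation, a stripped-down sketch of what the Spark side does (a simplification of SparkFlume.py, not the exact file; the listening host/port must match the Avro sink in sparkstreamingflume.conf, and 9092 here is an assumption):
```python=
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="StreamingFlumeLogCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Flume pushes events to this Avro listener; each event is (headers, body).
events = FlumeUtils.createStream(ssc, "localhost", 9092)
lines = events.map(lambda event: event[1])

# Print how many log lines arrive in each batch.
lines.count().pprint()

ssc.start()
ssc.awaitTermination()
```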

Start the server:
```bash=
cd /usr/hdp/current/flume-server/
sudo bin/flume-ng agent --conf conf --conf-file ~/sparkstreamingflume.conf --name a1
```
Start the data producer:
```bash=
wget http://media.sundog-soft.com/hadoop/access_log.txt
cp access_log.txt spool/log22.txt
# check the output before copying the next one
cp access_log.txt spool/log23.txt
```


91. [Activity] Count words with Storm
Open Ambari
and start the Storm and Kafka services.
```bash=
cd /usr/hdp/current/storm-client/
cd contrib/storm-starter/src/jvm/org/apache/storm/starter/
vi WordCountTopology.java
storm jar /usr/hdp/current/storm-client/contrib/storm-starter/storm-starter-topologies-*.jar org.apache.storm.starter.WordCountTopology wordcount
cd /usr/hdp/current/storm-client/logs/workers-artifacts/
tail -f worker.log
```
If you run it too many times you end up with a pile of logs.


If you accidentally created a topology
and need to remove it:
storm kill wordcount -w 5 (use your topology name; the one above is wordcount)
(https://stackoverflow.com/questions/48700762/storm-killed-topology-not-getting-removed-from-topology-list)
93. [Activity] Counting words with Flink
This section again uses three windows:
a server,
a listener,
and a data viewer.
Server:
```bash=
wget https://downloads.apache.org/flink/flink-1.13.1/flink-1.13.1-bin-scala_2.11.tgz
tar -xvf flink-1.13.1-bin-scala_2.11.tgz
cd flink-1.13.1/conf/
vi flink-conf.yaml
# change the web UI port to 8082
cd ..
./bin/start-cluster.sh
nc -l 9000
```
The config file contents differ slightly from the video; the port is the part to change.


Another window (the listener):
```bash=
cd flink-1.13.1
./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
```
Third window (data viewer):
```bash=
cd flink-1.13.1
cd log/
ls -ltr
cat flink-maria_dev-taskexecutor-0-sandbox-hdp.hortonworks.com.out
```
Download page:
https://flink.apache.org/downloads.html#apache-flink-1131
Download:
https://downloads.apache.org/flink/flink-1.13.1/flink-1.13.1-bin-scala_2.11.tgz
Source code:
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/scala/org/apache/flink/streaming/scala/examples/socket/SocketWindowWordCount.scala
# DE-L2 Udemy Supplementary Material
## section2
1. [How MapReduce works](http://tw.gitbook.net/hadoop/intro-mapreduce.html)
2. [Hadoop introduction, basics](https://www.slideshare.net/cloudma/hadoop-6189763?qid=686666c2-8823-4bf3-aaa0-b4f8164dbb08&v=&b=&from_search=1)
3. [Big Data Tornado - classic enterprise big data use cases from Taiwan, 2015
(how they framed the application scenarios)](https://www.slideshare.net/etusolution/k4-big-datatornadobdt2015?next_slideshow=1)
4. [Hadoop Big Data success stories
(how they framed the application scenarios)](https://www.slideshare.net/etusolution/hadoop-big-data-40024681?qid=03d1404d-9e42-42c6-b776-e9ea7d8fc7b0&v=&b=&from_search=7)
5. Running MapReduce jobs with Hadoop streaming (
https://cloudxlab.com/assessment/displayslide/279/run-mapreduce-jobs-using-hadoop-streaming)
6. mrjob source code (https://github.com/Yelp/mrjob/tree/091572e87bc24cc64be40278dd0f5c3617c98d4b)
7. ZooKeeper overview
(https://codingnote.cc/zh-tw/p/204077/)
8. MapReduce introduction
http://debussy.im.nuu.edu.tw/sjchen/BigData/MapReduce.pdf
## section4
1. Spark follow-up resources
(Spark's three deployment modes: https://waltyou.github.io/Spark-Cluster-Manager/
Spark's three cluster deploy options: https://kknews.cc/zh-tw/tech/4q5mov.html
Spark RDD: https://kknews.cc/zh-tw/code/2k83ene.html
Spark RDD concepts and the map operation: https://ithelp.ithome.com.tw/articles/10186282
Spark memory management: https://iter01.com/554916.html)
## section5
1. Hive follow-up resources
(Comparison of six open-source database engines: https://www.finereport.com/tw/company/sql-2.html)
( Hive overview: https://zhuanlan.zhihu.com/p/81189211
Hive indexes: https://blog.csdn.net/zzq900503/article/details/79391071)
2. Sqoop follow-up resources
(Databases supported by Sqoop: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_supported_databases
How Sqoop works: https://www.jianshu.com/p/ec9003d8918c
Sqoop homepage: https://sqoop.apache.org/
)
## section6
1. Comparison of three NoSQL databases: MongoDB, Cassandra and HBase - https://kknews.cc/zh-tw/code/amrqj5g.html
2. A quick look at four mainstream NoSQL databases - https://www.ithome.com.tw/news/92507
3. Cassandra vs. MongoDB vs. Hbase: A Comparison of NoSQL Databases - https://logz.io/blog/nosql-database-comparison/
4. Five concepts you must know to understand NoSQL - https://www.ithome.com.tw/news/92506
5. How to learn and choose a NoSQL database for big data - https://kknews.cc/zh-tw/tech/m3695y6.html
6. The history of the CAP theorem
https://zhuanlan.zhihu.com/p/66487328
7. Understanding CAP through blockchain
https://blockcast.it/2020/06/05/bsos-understand-cap-theorem-first-then-blockchain/
8. The CAP section of a blockchain technology handbook
https://poweichen.gitbook.io/blockchain-guide-zh/distribute_system/cap
9. A first look at distributed databases and the NoSQL CAP theorem
https://oldmo860617.medium.com/%E5%88%9D%E6%AD%A5%E8%AA%8D%E8%AD%98%E5%88%86%E6%95%A3%E5%BC%8F%E8%B3%87%E6%96%99%E5%BA%AB%E8%88%87-nosql-cap-%E7%90%86%E8%AB%96-a02d377938d1
10. Why Cassandra is highly available
https://kknews.cc/zh-tw/code/4pjrlog.html
11. CAP is not necessarily CAP (make sure you know what it really means)
https://blog.the-pans.com/cap/
12. The C (consistency) in CAP
https://tech.youzan.com/cap-coherence-protocol-and-application-analysis/
13. CAP systems from an architect's point of view
https://iter01.com/517265.html
## section8
1. Mesos explanation (ITHelp Ironman series): https://ithelp.ithome.com.tw/articles/10184643
2. Mesos vs Yarn
https://data-flair.training/blogs/comparison-between-apache-mesos-vs-hadoop-yarn/
3. Comparison of the Tez execution engine vs MapReduce
https://stackoverflow.com/questions/41630987/tez-execution-engine-vs-mapreduce-execution-engine-in-hive
4. What a DAG is
https://blog.csdn.net/weixin_37536020/article/details/106815387
http://web.ntnu.edu.tw/~algo/DirectedAcyclicGraph.html
5. Differences between Spark and Tez
https://www.xplenty.com/blog/apache-spark-vs-tez-comparison/
http://www.zdingke.com/2016/12/05/spark%E4%B8%8Etez%E6%AF%94%E8%BE%83/
6. How the YARN managers work
https://kknews.cc/zh-tw/code/gya93m9.html
* Official sites
Hue https://gethue.com/
Zeppelin https://zeppelin.apache.org/docs/latest/quickstart/install.html
Oozie https://oozie.apache.org/
Tez https://www.cloudera.com/products/open-source/apache-hadoop/apache-tez.html
Zookeeper https://zookeeper.apache.org/
## section9
1. Differences and relationship between the log-collection systems Flume and Kafka: https://www.itread01.com/content/1544497746.html
1. One example of integrating Flume with Kafka
https://www.cnblogs.com/smartloli/p/9984140.html
1. Comparison of Flume, Kafka and NiFi
https://kknews.cc/zh-tw/tech/4654b2q.html
1. How Kafka works
https://zhuanlan.zhihu.com/p/68052232
1. Comparison of RabbitMQ and Kafka
https://kknews.cc/zh-tw/code/abjxo8x.html
* Official sites
Kafka https://kafka.apache.org/24/documentation/streams/core-concepts
Flume https://flume.apache.org/
Nifi (not covered in the course) https://nifi.apache.org/
## section10
1. Apache Storm vs Spark Streaming
https://www.ericsson.com/en/blog/2015/7/apache-storm-vs-spark-streaming
1. Apache Storm vs Spark Streaming – Feature wise Comparison
https://data-flair.training/blogs/apache-storm-vs-spark-streaming/
1. Comparison of the mainstream streaming frameworks Flink, Storm and Spark: https://www.itread01.com/content/1550530998.html
1. Three frameworks for stream computing: Storm, Spark and Flink
https://codertw.com/程式語言/666820/
1. Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework
https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
1. Choosing a stream-processing framework in the big data ecosystem (Storm vs Kafka Streams vs Spark Streaming vs Flink vs Samza)
https://blog.csdn.net/WeiJonathan/article/details/83864244
* storm
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/developing-storm-applications/content/understanding_sliding_and_tumbling_windows.html
* flink
https://archive.apache.org/dist/flink/flink-1.2.0/flink-1.2.0-bin-hadoop27-scala_2.10.tgz
Flink introduction
https://ithelp.ithome.com.tw/articles/10199506
* stream-processing
https://www.upsolver.com/blog/popular-stream-processing-frameworks-compared
https://www.datanami.com/2019/05/30/understanding-your-options-for-stream-processing-frameworks/
* Official sites
storm https://storm.apache.org/2020/06/30/storm220-released.html
Flink https://flink.apache.org/
Stream-processing https://spark.apache.org/streaming/
## section11
* Streaming series
1. Distributed system architecture: a review of common 2020 interview topics (Redis, Kafka, system design bottlenecks)
https://iter01.com/577275.html
1. Kafka use cases: https://www.gushiciku.cn/pl/gJwW/zh-tw
2. Scenarios that give rise to distributed transactions and the corresponding solutions
https://codingnote.cc/zh-tw/p/239802/
3. Applications of caching
https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/11747/
* System evaluation series (Andrew's articles)
1. The inevitable path of migrating to microservices: refactoring systems and databases
https://www.facebook.com/123910281317592/videos/545139382528011
1. Preparations before and after adopting an MQ
https://www.facebook.com/123910281317592/videos/399472807454255
1. SLO
https://columns.chicken-house.net/2021/06/04/slo/#3-%E5%85%B6%E4%BB%96%E8%B7%9F-slo-%E7%9B%B8%E9%97%9C%E7%9A%84%E9%80%B2%E9%9A%8E%E8%AD%B0%E9%A1%8C
* Search series
1. solr vs elasticsearch
https://logz.io/blog/solr-vs-elasticsearch/
1. lucene
https://lucene.apache.org/
1. Lucene algorithms
http://www.blogjava.net/kingpub/archive/2012/07/16/64174.html
1. Full-text search (in SQL Server)
https://docs.microsoft.com/zh-tw/sql/relational-databases/search/full-text-search?view=sql-server-ver15
1. Setting up full-text search in SQL, step by step
https://dotblogs.com.tw/Tm731531/2020/08/23/012252
1. Inverted indexes (the key to full-text search)
https://www.elastic.co/guide/cn/elasticsearch/guide/current/inverted-index.html
* Official sites
IMPALA
https://impala.apache.org/
Accumulo
https://accumulo.apache.org/
REDIS
https://redis.io/
IGNITE
https://ignite.apache.org/
Elasticsearch
https://www.elastic.co/elasticsearch/
Kinesis (AWS ecosystem) (AWS kafka)
https://aws.amazon.com/tw/kinesis/
Nifi
https://Nifi.apache.org/
Falcon
https://falcon.apache.org/
apache slider
https://www.cloudera.com/products/open-source/apache-hadoop/apache-slider.html
# (Optional) Additional Information
1. Dataset site (https://grouplens.org/datasets/movielens/ )
https://files.grouplens.org/datasets/movielens/ml-100k.zip
2. Official course website
(https://sundog-education.com/)
3. The instructor's YouTube channel
(https://www.youtube.com/c/SundogEducation/playlists)
4. Collection of Q&A questions from the course

5. Basic Linux command reference
(http://linux.vbird.org/linux_basic/redhat6.1/linux_06command.php)
6. Linux applications (VBird's site)
(http://linux.vbird.org/)
7. The CRISP-DM model (a development process for data applications)
(https://adamsanalysis.com/data-science/crisp-dm-introduction)
# (Optional) The Ins and Outs of Big Data Projects
* Architecture
1. How do traditional industries build a big data team?
https://www.finereport.com/tw/knowledge/acquire/bigdatateam.html
1. Four power outages in northern Taiwan in June: a look at big data management for power utilities
https://www.finereport.com/tw/knowledge/acquire/electricbigdata.html
* Case studies
1. Analysis examples that various kinds of data can support
https://www.finereport.com/tw/tag/productionreport
1. Analysis: 49 classic cases of big data companies mining value from data (worth bookmarking)
https://kknews.cc/zh-tw/tech/ekx9v54.html
1. What does "big data" have to do with me? Five charts on the business applications of big data analytics
https://www.stockfeel.com.tw/%E5%A4%A7%E6%95%B8%E6%93%9A%E5%88%B0%E5%BA%95%E8%88%87%E6%88%91%E6%9C%89%E4%BB%80%E9%BA%BC%E9%97%9C%E8%81%AF%EF%BC%9F5-%E5%BC%B5%E5%9C%96%E5%BF%AB%E9%80%9F%E4%BA%86%E8%A7%A3%E5%A4%A7/
* Data visualization (something the course barely covers). Row after row of raw data does not really register with people; to let the data speak, and to turn it into value or revenue, this step is needed.
* Tools (examples, not an exhaustive list)
1. Tableau
1. PowerBI
1. D3JS
1. python bokeh
1. Excel charts
* What it can do
1. All kinds of charts and reports
1. Word clouds
1. Association analysis
1. Value estimation
* Goals (examples, not exhaustive)
1. Turn a "view" into evidence, e.g. last year's average temperature was lower and this year's is higher, so the stock of popsicles in logistics should be raised; to show this you visualize last year's and this year's temperatures together.
1. Make an "estimate" concrete, e.g. use the past six months of financials to project this year's revenue, convert that into full-year EPS, and work back to a fair price for the stock.
1. Turn a "feeling" into data, e.g. "almost every release the bugs come from unit XX"; with API tracing logs you can pull the related figures and reports to help prioritize what to fix first.
# (Optional) Docker Environment Issues
## 1. Resetting the Ambari admin password
In the container shell, run:
```bash=
ambari-admin-password-reset
```
## 2.Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wnNrDp/click/
Explanation: the Python inside the Docker image is 2.7.5 with the matching pip 8, which is barely supported any more. Be careful: changing the default Python will cause some package incompatibilities later, so only change this setup if you truly cannot proceed otherwise (https://discuss.python.org/t/on-the-old-python-2-7-5-version-the-pip-install-package-does-not-install-due-to-the-ssl-error/5811)

Solutions
1. Upgrade pip first, then update setuptools (provided by Jeffrey):
```bash=
pip install --upgrade pip
yum reinstall -y python2-pip.noarch python27-python-pip.noarch
pip install setuptools==33.1.1
```
2.
The approach that worked for me:
pull Python up to 3.7.5.
```bash=
yum groupinstall "Development Tools" -y
wget https://www.python.org/ftp/python/3.7.5/Python-3.7.5.tgz
tar -zxvf Python-3.7.5.tgz
cd Python-3.7.5
./configure
make all
make install
make clean
make distclean
```
That completes the installation.
The following replaces the default Python version:
```bash=
/usr/local/bin/python3.7 -V
mv /usr/bin/python /usr/bin/python2.7.5
ln -s /usr/local/bin/python3.7 /usr/bin/python
sed -i 's/python/python2.7/g' /usr/bin/yum
sed -i 's/python/python2.7/g' /usr/libexec/urlgrabber-ext-down
python -V
```
After that, invoke pip as
python -m pip
