# DSCI - 551 Lecture Notes - 1
## Week 1
- What is the difference between generator and list in python?
- For example, double each element in l = [1, 2, 3]
- sol 1: l1 = [2*x for x in l]
- sol 2: g = (2*x for x in l) then list(g)
- sol 3: map is also a generator
- def f(x):
- ... return 2*x
- list(map(f, [1, 2, 3]))
- sol 4:
- list(map(lambda x: 2 * x, [1, 2, 3]))
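The difference between the four solutions comes down to laziness: a list is materialized immediately, while a generator (or map object) is lazy and can be consumed only once. A quick runnable sketch:

```python
l = [1, 2, 3]

l1 = [2 * x for x in l]   # list comprehension: all elements computed now
g = (2 * x for x in l)    # generator expression: nothing computed yet

first = list(g)           # consuming the generator materializes it
second = list(g)          # a generator is single-use: now exhausted

print(l1, first, second)  # [2, 4, 6] [2, 4, 6] []
```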
- Example: Get sum of a list
- sol 1:
- def ourSum(l):
- U = l[0]
- for x in l[1:]:
- U = U + x
- return U
- sol 2:
- import functools as fc
- fc.reduce(lambda U, x: U + x, [1, 2, 3])
- sol 3:
- def add(U, x): return U + x
- U = l[0]
- U = add(U, l[1])
- U = add(U, l[2])
- edge case:
- fc.reduce(lambda U, x: U + x, [1])
- It returns: 1
- import functools as fc
- fc.reduce(lambda U, x: U + x, [], 0) -> 0
- fc.reduce(lambda U, x: U + x, [1], 0) -> 1
- fc.reduce(lambda U, x: U + x, [0, 1]) -> 1
- fc.reduce(lambda U, x: U - x, [0, 1]) -> -1
- fc.reduce(lambda U, x: U - x, [1, 0]) -> 1
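The edge cases above come down to how reduce seeds its accumulator U; a small runnable recap:

```python
import functools as fc

# Without an initializer, U starts as the first element; with one,
# U starts as the initializer (which also makes an empty list legal).
assert fc.reduce(lambda U, x: U + x, [1, 2, 3]) == 6
assert fc.reduce(lambda U, x: U + x, [1]) == 1       # single element returned as-is
assert fc.reduce(lambda U, x: U + x, [], 0) == 0     # initializer handles the empty list
assert fc.reduce(lambda U, x: U - x, [0, 1]) == -1   # order matters: 0 - 1
assert fc.reduce(lambda U, x: U - x, [1, 0]) == 1    # 1 - 0

# An empty list with no initializer raises TypeError.
try:
    fc.reduce(lambda U, x: U + x, [])
except TypeError:
    print("reduce() of an empty sequence needs an initializer")
```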
- linux commands
- ls
- cd ..
- cd dsci551
- mkdir
- man ls
- nano hello.txt
- ls
- cat hello.txt
- man cat
- cp hello.txt hello1.txt # make a copy
- rm hello1.txt
- ls dsci551.pem -l
- chmod 400 dsci551.pem
- ls -l
- https://us-west-1.console.aws.amazon.com/ec2/v2/home?region=us-west-1#ConnectToInstance:instanceId=i-0888f691b6ee07f5e
- ssh -i "dsci-551.pem" ec2-user@ec2-54-219-83-49.us-west-1.compute.amazonaws.com
- sftp -i "dsci-551.pem" ec2-user@ec2-54-219-83-49.us-west-1.compute.amazonaws.com
- pwd
- lls lax.json
- put lax.json
- exit
- rmdir abc
- PySpark
- l = [1, 2, 3]
- data = sc.parallelize([1,2,3], 2)
- data
- data.getNumPartitions()
- def printf(p):
- print(list(p))
- data.foreachPartition(printf)
- data.foreachPartition(printf)
- data.map(lambda x: 2 * x)
- data.map(lambda x: 2 * x).collect()
- data1 = data.map(lambda x: 2 * x)
- data1.collect()
- data1.foreachPartition(printf)
- sum(l)
- sum(data) ✗ error: the built-in sum() does not work on an RDD
- data.sum() ✓
- data.reduce(lambda U, x: U + x) ✓
- import functools as fc
- fc.reduce(lambda U, x: U + x, [1])
- fc.reduce(lambda U, x: U + x, [2, 3])
- fc.reduce(lambda U, x: U + x, [1, 5])
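The two reduce calls at the end hint at what Spark does internally: each partition is reduced locally, then the partial results are reduced once more. A plain-Python sketch of that idea, no Spark required (the partition split matches sc.parallelize([1, 2, 3], 2)):

```python
import functools as fc

# sc.parallelize([1, 2, 3], 2) splits the data into two partitions,
# e.g. [1] and [2, 3]. data.reduce(f) applies f within each partition,
# then once more across the partial results -- so f must be associative.
partitions = [[1], [2, 3]]
partials = [fc.reduce(lambda U, x: U + x, p) for p in partitions]
total = fc.reduce(lambda U, x: U + x, partials)
print(partials, total)  # [1, 5] 6
```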
## Week 2
- Inverted index: bmw -> doc 11, 12, 13, ..
- relevancy
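The bmw -> doc 11, 12, 13 mapping above is an inverted index: for each term, the list of documents containing it. A minimal sketch with made-up documents (a real engine would tokenize and normalize far more carefully):

```python
from collections import defaultdict

# Hypothetical document collection, keyed by document id.
docs = {
    11: "used bmw for sale",
    12: "bmw dealer reviews",
    13: "bmw parts online",
    14: "toyota parts online",
}

index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

print(index["bmw"])    # [11, 12, 13]
print(index["parts"])  # [13, 14]
```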
- Linux
- echo $PATH
- jsonlint.com
- python 3
- import json
- d = json.loads('{"name": "john", "age": 25}')
- d['name']
- d = json.loads('{"name": "john", "age": 25, "graduate": false}')
- d = json.loads('{"name": "john", "age": 25, "graduate": null}')
- json.dumps(d)
- json.loads -> deserialization (JSON text to Python object)
- json.dumps -> serialization (Python object to JSON text)
- Firebase does not store empty (null) values.
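A round trip showing the direction of each call, plus how JSON false/null come back as Python False/None:

```python
import json

# json.loads: JSON text -> Python object (deserialization)
d = json.loads('{"name": "john", "age": 25, "graduate": false, "score": null}')
print(d["graduate"], d["score"])  # False None

# json.dumps: Python object -> JSON text (serialization)
s = json.dumps(d)
print(s)

# Deserializing the serialized text gives back an equal object.
assert json.loads(s) == d
```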
- aws linux
- curl 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json'
- curl 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json?print=pretty'
- print is a parameter, its value = pretty
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json'
- curl -X PUT 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json' -d '[1,2,3,4,5]'
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="age"'
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="$key"'
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="$key"&equalTo="200"'
- key is string, it needs to be quoted.
- return key-value pairs (key="200")
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/200.json'
- return only objects
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="$key"&endAt="200"'
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="gender"&limitToLast=1'
- curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/scores.json?orderBy="$value"&limitToFirst=1'
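Because orderBy/equalTo values are JSON, string arguments such as "$key" must carry their own double quotes, which the URL then percent-encodes. A sketch building such a query string with the stdlib (the database URL is the one from the lecture; treat it as a placeholder):

```python
from urllib.parse import urlencode

base = 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json'

# The quotes around "$key" and "200" are part of the parameter value:
# they come out percent-encoded as %22 (and $ as %24).
params = urlencode({'orderBy': '"$key"', 'equalTo': '"200"'})
print(params)              # orderBy=%22%24key%22&equalTo=%22200%22
print(base + '?' + params)
```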
- In SQL
- null values sort first, so they show at the top of the table (in ascending order)
- jupyter
- import requests
- r = requests.get('https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/scores.json?orderBy="age"&limitToFirst=1')
- r.text
- d = r.json()
- aws linux
- curl -X PATCH 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100.json' -d '{"age": 26}'
- curl -X PATCH 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '27'
- -> error: PATCH expects a JSON object, not a scalar value
- curl -X PUT 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '27'
- curl -X POST 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '28'
- creates a new key-value pair under age, with an auto-generated key and value 28
- curl -X DELETE 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json'
- volatile vs persistent
- EC2 linux cmd
- **top** cmd
- **df** cmd
- Lecture note
- 2^10 Byte = 1 KB
- 2^20 Byte = 1 MB
- 2^30 Byte = 1 GB
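These units are powers of 2, so they can be checked with a few lines:

```python
# Binary byte units: each step up is a factor of 2**10 = 1024.
KB = 2**10
MB = 2**20
GB = 2**30

assert MB == 1024 * KB and GB == 1024 * MB
# HDFS's default block size, seen later in the fsimage notes:
assert 128 * MB == 134217728
```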
## Week 3
- AWS EC2 hadoop cmd
- start-dfs.sh
- jps
- hdfs dfs -ls /
- hdfs dfs -ls /user
- hdfs dfs -mkdir /user/john
- hdfs dfs -ls /user
- hdfs dfs -mkdir /user/john/a/b
- ✗ fails: the parent directory /user/john/a does not exist yet (mkdir is not recursive without -p)
- hdfs dfs -mkdir /user/john/a
- hdfs dfs -mkdir /user/john/a/b
- hdfs dfs -put WordCount.java /user/john
- hdfs dfs -ls /user/john
- hdfs dfs -cat /user/john/WordCount.java
- hdfs dfs -rmdir /user/john/a/b
- AWS EC2 sftp
- sftp -i "dsci-551.pem" ec2-user@ec2-54-153-80-187.us-west-1.compute.amazonaws.com
- ls
- cd dsci
- ls
- get Wo
- get WordCount.java
- downloading
- put WordCount.java
- uploading WordCount-sp22.java
- ls
- remote
- lls *.java
- local
- pwd
- AWS EC2: where the HDFS data is located
- jps
- cd /tmp
- cd hadoop-ec2-user/
- ls
- cd dfs
- cd name
- namenode
- ls
- cd ..
- ls
- cd data
- datanode
- cd ..
- ls
- cd name
- ls
- cd current/
- ls
- Notes
- Both directory and file have inodes.
- Under <inode>
- <name>
- <name/> means root directory
- <name>xxx</name> means not root directory
- <mtime>
- last modified time
- <atime>
- last access time
- <permission>
- 7 -> 111 readable, writable, excutable
- 5 -> 101 readable, not writable, excutable
- <blocks>
- how many blocks does this file occupy?
- Under <INodeDirectorySection>
- <directory>
- <parent>16305</parent>
- <child>16306</child> # the inode number of the child
- </directory>
- 1 child
- <directory>
- <parent>16387</parent>
- <child>16390</child>
- <child>16412</child>
- <child>16401</child>
- <child>16391</child>
- <child>16388</child>
- </directory>
- 5 children
- metadata & data
- metadata: file name, file size
- data: file content
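Each octal permission digit above decodes to three bits (read/write/execute); a tiny hypothetical helper makes the 7 -> rwx and 5 -> r-x mapping explicit:

```python
def perm(digit):
    """Decode one octal permission digit (0-7) into an rwx string."""
    bits = [(digit >> 2) & 1, (digit >> 1) & 1, digit & 1]
    return ''.join(ch if b else '-' for b, ch in zip(bits, 'rwx'))

print(perm(7))  # rwx -> readable, writable, executable
print(perm(5))  # r-x -> readable, not writable, executable
print(perm(4))  # r-- -> read-only
```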
- EC2 cmds
- nano hello.txt
- append string: hello world
- ls -l
- cat hello.txt
- chmod 400 hello.txt
- ls -l hello.txt
- man stat
- stat hello.txt
- Inode -> Index number
- Blocks -> Number of blocks allocated for this file
- IO Block -> Block size
- 4096 -> 4 kilobytes
- regular file -> File type
- Links -> number of hard links
- Uid: user id
- Gid: group id
- Notes
- Q: Could it be possible that you only change the metadata of file without changing its content?
- A: Yes
- Ex:
- chmod 644 hello.txt
- mv hello.txt hellow.txt (the same inode, the same file with new name)
- mv hello1.txt hello.txt
- cp hello.txt hello1.txt (different inode, new file)
- If we change the content, all three times change.
- atime: access time
- mtime: modify time
- ctime: change time
- If we open the file with `vi hello1.txt` without making any change, none of the times change.
- In particular, atime does not change even though we read the file
- because the filesystem is mounted with noatime
- cd /etc
- cat fstab
- shows "UUID=xxxx / xfs defaults,noatime 1 1"
- noatime: turns off atime updates
- mount
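The mv-keeps-the-inode / cp-makes-a-new-inode claim, and chmod being a metadata-only change, can be verified from Python with os.stat (a sketch using a temp directory):

```python
import os
import shutil
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, 'hello.txt')
with open(src, 'w') as f:
    f.write('hello world\n')
ino = os.stat(src).st_ino
mtime = os.stat(src).st_mtime

# chmod is a metadata-only change: the content (and mtime) are untouched
os.chmod(src, 0o644)
assert os.stat(src).st_mtime == mtime

# mv: same inode, same file, new name
dst = os.path.join(d, 'hellow.txt')
os.rename(src, dst)
assert os.stat(dst).st_ino == ino

# cp: different inode, a brand-new file
cpy = os.path.join(d, 'hello1.txt')
shutil.copy(dst, cpy)
assert os.stat(cpy).st_ino != ino

shutil.rmtree(d)
```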
## Week 4
- hadoop related cmd
- cd hadoop-3.3.4/etc/hadoop
- nano core-site.xml
- nano hdfs-site.xml
- pwd
- Displays the path name of the working directory.
- hdfs namenode -format
- start-dfs.sh
- jps
- --------------------------
- stop-dfs.sh
- cd /tmp/hadoop-ec2-user/
- go to this directory under the root directory
- ls
- cd dfs
- pwd
- /tmp/hadoop-ec2-user/dfs
- cd name
- ls
- cd current
- --------------------------
- cd sbin
- cat start-dfs.sh|less
- cat start-dfs.cmd|less
- --------------------------
- hdfs oiv --help
- oiv: offline image viewer
- Notes
- SecondaryNameNode
- the NameNode's metadata can be recovered from this directory
- hadoop related cmd
- hdfs oiv --help
- oiv: offline image viewer
- hdfs oiv -i fsimage
- hdfs oiv -i fsimage_0000000000000000107 -o fsimage107.xml -p XML
- ls *.xml
- cat fsimage107.xml
- Notes
- fsimage xml file
- <inode>
- inode: index node
- <id>
- i number
- <type>
- directory or file
- <name>
- empty name represents root directory
- <mtime>
- modification time
- <preferredBlockSize>
- 134217728 = 128 * 2**20 = 128MB
- Why does this XML have blocks?
- It is the XML dump of an fsimage containing file inodes, and a file's metadata lists its blocks
- <numBytes>
- the size of the file, i.e. the length of its content (data) -> metadata
- <INodeDirectorySection>
- <directory>
- <parent>16385</parent>
- <child>16386</child>
- </directory>
- <directory>
- <parent>16386</parent>
- <child>16387</child>
- <child>16391</child>
- </directory>
- <directory>
- <parent>16387</parent>
- <child>16390</child>
- <child>16388</child>
- </directory>
- </INodeDirectorySection>
- tree structure
- the root id 16385
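The parent/child pairs above can be loaded into a dict to recover the tree (inode ids taken from the notes; a sketch with the stdlib parser):

```python
import xml.etree.ElementTree as ET

# Trimmed-down INodeDirectorySection from the notes: each <directory>
# lists one parent inode and its child inodes.
xml = """
<INodeDirectorySection>
  <directory><parent>16385</parent><child>16386</child></directory>
  <directory><parent>16386</parent><child>16387</child><child>16391</child></directory>
  <directory><parent>16387</parent><child>16390</child><child>16388</child></directory>
</INodeDirectorySection>
"""

children = {}
for d in ET.fromstring(xml).findall('directory'):
    parent = int(d.find('parent').text)
    children[parent] = [int(c.text) for c in d.findall('child')]

print(children)  # {16385: [16386], 16386: [16387, 16391], 16387: [16390, 16388]}
```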
- Content in DataNode
- NameNode maintains metadata
- Why would HDFS take a snapshot (fsimage) of its metadata?
- Why would we need metadata?
- hadoop related cmd
- hdfs oiv --help
- oiv: offline image viewer
- hdfs oiv -i fsimage_0000000000000000107 -o fsimage107.xml -p Delimited
- ls
- ls *.tsv
- nano fsimage107.tsv
- ls
- ls -l fsimage*
- Notes
- Hadoop HDFS
- ClientNamenodeProtocol.proto
- main API
- rpc: remote procedure call
- getBlockLocations() for reading
- addBlock() for writing
- datatransfer.proto
- message OpReadBlockProto
- ClientProtocol.java
- getBlockLocations(String src, long offset, long length) # method
- input
- src: filepath
- offset: start index
- length: the number of bytes to read
- If foo has 200 bytes and you just want the first 100, you call ```getBlockLocations("home/user/data/foo", 0, 100)```. For the second 100, you call ```getBlockLocations("home/user/data/foo", 100, 100)```.
- ```select * from t limit 10 offset 10```: the SQL analogue, fetching the second 10 rows
- output
- LocatedBlocks
- LocatedBlocks.java
- attribute: List<LocatedBlock> blocks
- LocatedBlock.java
- attributes
- ExtendedBlock b
- long offset
- DatanodeInfoWithStorage[] locs //cached storage ID for each replica
- String[] storageIDs
- ..
- DatanodeInfo[] cachedLocs
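What getBlockLocations(src, offset, length) fundamentally has to answer is which blocks an (offset, length) read touches. A sketch of that arithmetic (blocks_touched is our own hypothetical helper, using the 128 MB default block size):

```python
MB = 2**20

def blocks_touched(offset, length, block_size=128 * MB):
    """Return the indices of the blocks covered by reading
    `length` bytes starting at `offset`."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

print(blocks_touched(0, 100))              # [0]    first 100 bytes
print(blocks_touched(100, 100))            # [0]    second 100 bytes
print(blocks_touched(100 * MB, 100 * MB))  # [0, 1] read crosses a block boundary
```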
## Week 5
- cmd related to file format
- cd hadoop-3.3.4/
- ls
- cd etc
- cd hadoop/
- vi core-site.xml
- UTF: Unicode transformation format
- UTF-8
- Each code unit is 8 bits / 1 byte (a character may need 1-4 code units)
- UTF-16
- Each code unit is 16 bits / 2 bytes
- Lecture notes
- UTF-8
- python3
- \>>> '\u0041'
- \> 'A'
- '\u0041' -> a code point
- hex
- 0, 1, .. 9, A, B, C, D, E, F
- A: 1010
- B: 1011
- C: 1100
- D: 1101
- E: 1110
- F: 1111
- UTF-8
- python 3
- \>>> a = '\u20ac'
- \# ```0010 0000 1010 1100```
- \>>> a
- \> '€'
- \>>> a.encode('utf-8')
- \> b'\xe2\x82\xac'
- \# ```1110 0010 1000 0010 1010 1100```
- 3 bytes = 24 bits
- How does ```0010 0000 1010 1100``` become ```1110 0010 1000 0010 1010 1100```?
- The leading byte
- = ```1110``` + first 4 bits
- = ```1110``` + ```0010```
- = ```1110 0010```
- The 2nd byte / Continuation byte
- = ```10``` + the middle 6 bits
- = ```10``` + ```0000 10```
- = ```1000 0010```
- The last byte / Continuation byte
- = ```10``` + last 6 bits
- = ```10``` + ```10 1100```
- = ```1010 1100```
- A continuation byte always starts with ```10```
- The three ```1```s in the leading byte mean this character occupies 3 bytes.
- The leading byte starts with at least two ```1```s # different from a continuation byte, which starts with ```10```
- The bit right after the leading ```1```s in the leading byte is always ```0```.
- ==WHY?==
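The recipe above can be written out for any 3-byte code point and checked against Python's encoder (encode3 is our own sketch, not a library function):

```python
def encode3(cp):
    """UTF-8 encode a code point in the 3-byte range U+0800..U+FFFF."""
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0b11100000 | (cp >> 12)          # leading byte: 1110 + top 4 bits
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)  # continuation: 10 + middle 6 bits
    b3 = 0b10000000 | (cp & 0x3F)         # continuation: 10 + last 6 bits
    return bytes([b1, b2, b3])

# U+20AC (the euro sign) matches the hand-derived bytes:
assert encode3(0x20AC) == '\u20ac'.encode('utf-8') == b'\xe2\x82\xac'
```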
- \>>> a = '\u20ac'
- \>>> e = a.encode('utf-8')
- \>>> e.decode('utf-8')
- \> '€'
- 128 = 2^7 = 1000 0000 = 80 hex
- ASCII = [0, 7F]
- ```000 0000``` - ```111 1111``` (= 2^7 - 1)
- \>>> a = '\u20ac'
- \>>> a
- \> '€'
- \>>> a.encode('utf-8')
- \> b'\xe2\x82\xac'
- \>>> a.encode('utf-8').decode('utf-8')
- \> '€'
- Lecture notes regarding XML
- take fsimage564.xml for example
- root has an empty name
- <name />
- <name></name>
- ec2 cmd
- pip3 install lxml
- make sure helper.py is on ec2 as well
- python3 lxml on ec2
- example XML file
- from helper import printf
- from lxml import etree
- etree: element tree
- tree = etree.parse(open('bibs.xml'))
- tree
- printf(tree)
- printf(tree.xpath('/bib/book[1]/year'))
- <year>1995</year>
- printf(tree.xpath('/bib/book[1]/year/text()'))
- 1995
- printf(tree.xpath('/bib/book[1]/@price'))
- 35
- printf(tree.xpath('/bib/book[1]/price/text()'))
- 38.8
- printf(tree.xpath('/bib/book[1]/node()'))
- return all text nodes and element nodes under book[1]
- printf(tree.xpath('/bib/book[1]/*'))
- return all element nodes under book[1]
- printf(tree.xpath('/bib/book[year]'))
- first two books will be returned
- printf(tree.xpath('/bib/book[year > 1995 and year <= 2000]'))
- only 2nd book is returned
- This query finds books within a specific year range
- printf(tree.xpath('/bib/book[@price]'))
- return first two books
- printf(tree.xpath('/bib/book[not(year > 1995 and year <= 2000)]'))
- the last one book is returned
- printf(tree.xpath('/bib/book[not (year > 1995)]'))
- means the book's year <= 1995, or it does not have a year element at all
- printf(tree.xpath('/bib/book[not(year > 1995) and year]'))
- must have a year, and the year is <= 1995
- printf(tree.xpath('bib/book[contains(author, "Ullman")]'))
- return a book element
- printf(tree.xpath('bib/book[contains(author/text(), "Ullman")]'))
- the same with the previous one
- printf(tree.xpath('bib/book/author[contains(., "Ullman")]'))
- return a author element
- printf(tree.xpath('bib/book/author[. = "Ullman"]'))
- returns nothing
- printf(tree.xpath('bib/book/author[. = "Jeffrey D. Ullman"]'))
- returns the author
- contains is a substring match, while = is an exact match
- printf(tree.xpath('bib/book/author[./text() = "Jeffrey D. Ullman"]'))
- returns the author as well
- printf(tree.xpath('bib/book[@price > 35]'))
- printf(tree.xpath('bib/book[@price > "35"]'))
- printf(tree.xpath('/*'))
- printf(tree.xpath('/bib'))
- printf(tree.xpath('/bib/book/*'))
- printf(tree.xpath('//book'))
- return all book elements
- printf(tree.xpath('/bib/book[year]'))
- printf(tree.xpath('/bib/book[author/first-name]'))
- printf(tree.xpath('/bib/book[//year]'))
- returns all books: //year in a predicate is absolute (evaluated from the document root), so it is true for every book as long as any year element exists anywhere
- even a book without its own year element is returned
- printf(tree.xpath('/bib/book[author//first-name]'))
- ??
- printf(tree.xpath('//book[author/text()]'))
- ??
- printf(tree.xpath('//book[author[2]/text()]'))
- return nothing
- printf(tree.xpath('//book[author[1]/text()]'))
- return the first two books
- printf(tree.xpath('/bib/book|/bib/cd'))
- | -> or
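Many of the simpler queries above can be replayed without lxml using the stdlib's limited XPath subset (contains() and other functions still need lxml). The document below is a made-up fragment in the spirit of bibs.xml, which is not reproduced in the notes:

```python
import xml.etree.ElementTree as ET

xml = """
<bib>
  <book price="35"><year>1995</year><author>Jeffrey D. Ullman</author></book>
  <book price="40"><year>1999</year><author>Jennifer Widom</author></book>
  <book><author>Serge Abiteboul</author></book>
</bib>
"""
root = ET.fromstring(xml)  # root is the <bib> element

print(root.find('book[1]/year').text)     # year of the first book: 1995
print(len(root.findall('book[@price]')))  # books with a price attribute: 2
print(len(root.findall('book[year]')))    # books with a year child: 2
print(len(root.findall('.//book')))       # all book elements at any depth: 3
```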
- lxml with an fsimage file
- from helper import printf
- from lxml import etree
- tree = etree.parse(open('fsimage70.xml'))
- printf(tree.xpath('/fsimage/INodeSection/inode[name[not(node())]]'))
- returns the root inode
- not(node()) means there is nothing under the name -> empty
- printf(tree.xpath('/fsimage/INodeSection/inode[name[not(node())]]/id'))