# DSCI - 551 Lecture Notes - 1 ## Week 1 - What is the difference between generator and list in python? - For example double each element in l = [1, 2, 3] - sol 1: l1 = [2*x for x in l] - sol 2: g = (2*x for x in l) then list(g) - sol 3: map is also a generator - def f(x): - ... return 2*x - list(map(f, [1, 2, 3])) - sol 4: - list(map(lambda x: 2 * x, [1, 2, 3])) - Example: Get sum of a list - sol 1: - def ourSum(l): - U = l[0] - for x in l[1:]: - U = U + x - return U - sol 2: - import functools as fc - fc.reduce(lambda U, x: U + x, [1, 2, 3]) - sol 3: - def add(U, x): return U + x - U = l[0] - U = add(U, l[1]) - U = add(U, l[2]) - edge case: - fc.reduce(lambda U, x: U + x, [1]) - It returns: 1 - import functools as fc - fc.reduce(lambda U, x: U + x, [], 0) -> 0 - fc.reduce(lambda U, x: U + x, [1], 0) -> 1 - fc.reduce(lambda U, x: U + x, [0, 1]) -> 1 - fc.reduce(lambda U, x: U - x, [0, 1]) -> -1 - fc.reduce(lambda U, x: U - x, [1, 0]) -> 1 - linux commands - ls - cd .. - cd dsci551 - mkdir - man ls - nano hello.txt - ls - cat hello.txt - man cat - cp hello.txt hello1.txt # make a copy - rm hello1.txt - ls dsci551.pem -l - chmod 400 dsci551.pem - ls -l - https://us-west-1.console.aws.amazon.com/ec2/v2/home?region=us-west-1#ConnectToInstance:instanceId=i-0888f691b6ee07f5e - ssh -i "dsci-551.pem" ec2-user@ec2-54-219-83-49.us-west-1.compute.amazonaws.com - sftp -i "dsci-551.pem" ec2-user@ec2-54-219-83-49.us-west-1.compute.amazonaws.com - pwd - lls lax.json - put lax.json - exit - rmdir abc - PySpark - l = [1, 2, 3] - data = sc.parallelize([1,2,3], 2) - data - data.getNumPartitions() - def printf(p): - print(list(p)) - data.foreachPartition(printf) - data.foreachPartition(printf) - data.map(lambda x: 2 * x) - data.map(lambda x: 2 * x).collect() - data1 = data.map(lambda x: 2 * x) - data1.collect() - data1.foreachPartition(printf) - sum(l) - sum(data) x - data.sum() v - data.reduce(lambda U, x: U + x) v - import functools as fc - fc.reduce(lambda U, x: U + x, [1]) - 
fc.reduce(lambda U, x: U + x, [2, 3]) - fc.reduce(lambda U, x: U + x, [1, 5]) ## Week 2 - Inverted index: bmw -> doc 11, 12, 13, .. - relevancy - Linux - echo $PATH - jsonlint.com - python 3 - import json - d = json.loads('{"name": "john", "age": 25}') - d['name'] - d = json.loads('{"name": "john", "age": 25, "graduate": false}') - d = json.loads('{"name": "john", "age": 25, "graduate": null}') - json.dumps(d) - json.loads -> deserialization - json.dumps -> serialization - Firebase ignores empty value. - aws linux - curl 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json' - curl 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json?print=pretty' - print is a parameter, its value = pretty - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json' - curl -X PUT 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/dataset1.json' -d '[1,2,3,4,5]' - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="age"' - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="$key"' - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="$key"&equalTo="200"' - key is string, it needs to be quoted. 
- return key-value pairs (key="200") - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/200.json' - return only objects - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="$key"&endAt="200"' - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students.json?orderBy="gender"&limitToLast=1' - curl -X GET 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/scores.json?orderBy="$value"&limitToFirst=1' - In SQL - null shows on the top of the table - jupyter - import requests - r = requests.get('https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/scores.json?orderBy="age"&limitToFirst=1') - r.text - d = r.json() - aws linux - curl -X PATCH 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100.json' -d '{"age": 26}' - curl -X PATCH 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '27' - -> error - curl -X PUT 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '27' - curl -X POST 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '28' - create a new key-value pair under age and its value = 28 - curl -X DELETE 'https://dsci-551-6be8e-default-rtdb.firebaseio.com/students/100/age.json' -d '28' - volatile vs persistent - EC2 linux cmd - **top** cmd - **df** cmd - Lecture note - 2^10 Byte = 1 KB - 2^20 Byte = 1 MB - 2^30 Byte = 1 GB ## Week 3 - AWS EC2 hadoop cmd - start-dfs.sh - - jps - hdfs dfs -ls / - hdfs dfs -ls /user - hdfs dfs -mkdir /user/john - hdfs dfs -ls /user - hdfs dfs -mkdir /user/john/a/b - x - hdfs dfs -mkdir /user/john/a - hdfs dfs -mkdir /user/john/a/b - hdfs dfs -put WordCount.java /user/john - hdfs dfs -ls /user/john - hdfs dfs -cat /user/john/WordCount.java - hdfs dfs -rmdir /user/john/a/b - AWS EC2 sftp - sftp -i "dsci-551.pem" ec2-user@ec2-54-153-80-187.us-west-1.compute.amazonaws.com - ls - cd dsci - ls - get Wo - get WordCount.java - downloading - put 
WordCount.java - uploading WordCount-sp22.java - ls - remote - lls *.java - local - pwd - AWS EC2 where hdfs located - jps - cd /tmp - cd hadoop-ec2-user/ - ls - cd dfs - cd name - namenode - ls - cd .. - ls - cd data - datanode - cd .. - ls - cd name - ls - cd current/ - ls - Notes - Both directory and file have inodes. - Under <inode> - <name> - <name/> means root directory - <name>xxx</name> means not root directory - <mtime> - last modified time - <atime> - last access time - <permission> - 7 -> 111 readable, writable, executable - 5 -> 101 readable, not writable, executable - <blocks> - how many blocks this file occupied? - Under <INodeDirectorySection> - <directory> - <parent>16305</parent> - <child>16306</child> # inumber inside it - </directory> - 1 child - <directory> - <parent>16387</parent> - <child>16390</child> - <child>16412</child> - <child>16401</child> - <child>16391</child> - <child>16388</child> - </directory> - 5 children - metadata & data - metadata: file name, file size - data: file content - EC2 cmds - nano hello.txt - append string: hello world - ls -l - cat hello.txt - chmod 400 hello.txt - ls -l hello.txt - man stat - stat hello.txt - Inode -> Index number - Blocks -> Number of blocks allocated for this file - IO Block -> Block size - 4096 -> 4 kilobytes - regular file -> File type - Links -> Hard link - Uid: user id - Gid: group id - Notes - Q: Could it be possible that you only change the metadata of file without changing its content? - A: Yes - Ex: - chmod 644 hello.txt - mv hello.txt hellow.txt (the same inode, the same file with new name) - mv hellow.txt hello.txt - cp hello.txt hello1.txt (different inode, new file) - If we change the content, all three timestamps change. - atime: access time - mtime: modify time - ctime: change time - If we use $vi hello1.txt$ without making any change, none of the timestamps will change. 
- atime will not change - Because it has noatime - cd /etc - cat fstab - will show "UUID=xxxx / xfs defaults,noatime 1 1 " - noatime: turn off the atime - mount ## Week 4 - hadoop related cmd - cd hadoop-3.3.4/etc/hadoop - nano core-site.xml - nano hdfs-site.xml - pwd - Displays the path name of the working directory. - hdfs namenode -format - start-dfs.sh - jps - -------------------------- - stop-dfs.sh - cd /tmp/hadoop-ec2-user/ - go to this directory under the root directory - ls - cd dfs - pwd - /tmp/hadoop-ec2-user/dfs - cd name - ls - cd current - -------------------------- - cd sbin - cat start-dfs.sh|less - cat start-dfs.cmd|less - -------------------------- - hdfs oiv --help - oiv: offline image viewer - Notes - SecondaryNameNode - Recover from this directory - hadoop related cmd - hdfs oiv --help - oiv: offline image viewer - hdfs oiv -i fsimage - hdfs oiv -i fsimage_0000000000000000107 -o fsimage107.xml -p XML - ls *.xml - cat fsimage107.xml - Notes - fsimage xml file - <inode> - inode: index node - <id> - i number - <type> - directory or file - <name> - empty name represents root directory - <mtime> - modification time - <preferredBlockSize> - 134217728 = 128 * 2**20 = 128MB - Why this xml has block? - It's the xml of a - <numBytes> - the size of the file -> metadata - the size of the content(data) -> metadata - <INodeDirectorySection> - <directory> - <parent>16385</parent> - <child>16386</child> - </directory> - <directory> - <parent>16386</parent> - <child>16387</child> - <child>16391</child> - </directory> - <directory> - <parent>16387</parent> - <child>16390</child> - <child>16388</child> - </directory> - </INodeDirectorySection> - tree structure - the root id 16385 - Content in DataNode - NameNode maintains metadata - Why would hdfs take a screenshot of metadata? 
- Why would we need metadata - hadoop related cmd - hdfs oiv --help - oiv: offline image viewer - hdfs oiv -i fsimage_0000000000000000107 -o fsimage107.xml -p Delimited - ls - ls *.tsv - nano fsimage107.tsv - ls - ls -l fsimage* - Notes - Hadoop HDFS - ClientNamenodeProtocol.proto - mapi - rpc: remote procedure call - getBlockLocations() for reading - addBlock() for writing - datatransfer.proto - message OpReadBlockProto - ClientProtocol.java - getBlockLocations(String src, long offset, long length) # method - input - src: filepath - offset: start index - length: number of bytes needed - If foo has 200 bytes, and you just want first 100, you will call ```getBlockLocations("home/user/data/foo", 0, 100);```. If you want second 100, will call ```getBlockLocations("home/user/data/foo", 100, 100);```. - ```select * from t limit 10 offset 10```: get second 10 - output - LocatedBlocks - LocatedBlocks.java - attribute: List<LocatedBlock> blocks - LocatedBlock.java - attributes - ExtendedBlock b - long offset - DatanodeInfoWithStorage[] locs //cached storage ID for each replica - String[] storageIDs - .. - DatanodeInfo[] cachedLocs ## Week 5 - cmd related to file format - cd hadoop-3.3.4/ - ls - cd etc - cd hadoop/ - vi core-site.xml - UTF: Unicode transformation format - UTF-8 - Every code has 8 bits / 1 byte - UTF-16 - Every code has 16 bits / 2 bytes - Lecture notes - UTF-8 - python3 - \>>> '\u0041' - \> A - '\u0041' -> a code point - hex - 0, 1, .. 9, A, B, C, D, E, F - A: 1010 - B: 1011 - C: 1100 - D: 1101 - E: 1110 - F: 1111 - UTF-8 - python 3 - \>>> a = '\u20ac' - \# ```0010 0000 1010 1100``` - \>>> a - \> '€' - \>>> a.encode('utf-8') - \> b'\xe2\x82\xac' - \# ```1110 0010 1000 0010 1010 1100``` - 3 bytes = 24 bits - How do ```0010 0000 1010 1100``` become ```1110 0010 1000 0010 1010 1100```? 
- The leading byte - = ```1110``` + first 4 bits - = ```1110``` + ```0010``` - = ```1110 0010``` - The 2nd byte / Continuation byte - = ```10``` + second last 6 bits - = ```10``` + ```0000 10``` - = ```1000 0010``` - The last byte / Continuation byte - = ```10``` + last 6 bits - = ```10``` + ```10 1100``` - = ```1010 1100``` - The continuation byte always starts with ```10``` - The three ```1``` in the leading byte means there are 3 bytes for this number. - The leading byte starts with at least two ```1``` # different with continuation byte starting with ```10``` - The last bit in the leading byte's prefix is always ```0```. - ==WHY?== - \>>> a = '\u20ac' - \>>> e = a.encode('utf-8') - \>>> e.decode('utf-8') - \> '€' - 128 = 2^7 = 1000 0000 = 80 hex - ASCII = [0, 7F] - ```000 0000``` - ```111 1111``` (= 2^7 - 1) - \>>> a = '\u20ac' - \>>> a - \> '€' - \>>> a.encode('utf-8') - \> b'\xe2\x82\xac' - \>>> a.encode('utf-8').decode('utf-8') - \> '€' - Lecture notes regarding XML - take fsimage564.xml for example - root has an empty name - <name /> - <name></name> - ec2 cmd - pip3 install lxml - make sure helper.py is on ec2 as well - ![](https://i.imgur.com/8YfE7AW.png) - python3 lxml on ec2 - example XML file - ![](https://i.imgur.com/kTtwtO6.png) - from helper import printf - from lxml import etree - etree: element tree - tree = etree.parse(open('bibs.xml')) - tree - printf(tree) - printf(tree.xpath('/bib/book[1]/year')) - <year>1995</year> - printf(tree.xpath('/bib/book[1]/year/text()')) - 1995 - printf(tree.xpath('/bib/book[1]/@price')) - 35 - printf(tree.xpath('/bib/book[1]/price/text()')) - 38.8 - printf(tree.xpath('/bib/book[1]/node()')) - return all text nodes and element nodes under book[1] - printf(tree.xpath('/bib/book[1]/*')) - return all element nodes under book[1] - printf(tree.xpath('/bib/book[year]')) - first two books will be returned - printf(tree.xpath('/bib/book[year > 1995 and year <= 2000]')) - only 2nd book is returned - This query is finding books with 
specific year range - printf(tree.xpath('/bib/book[@price]')) - return first two books - printf(tree.xpath('/bib/book[not (year > 1995 and year <= 2000)]')) - the last book is returned - printf(tree.xpath('/bib/book[not (year > 1995)]')) - means the book year <= 1995 or it does not have the year element at all - printf(tree.xpath('/bib/book[not (year > 1995) and year]')) - must have a year and the year is <= 1995 - printf(tree.xpath('bib/book[contains(author, "Ullman")]')) - return a book element - printf(tree.xpath('bib/book[contains(author/text(), "Ullman")]')) - the same as the previous one - printf(tree.xpath('bib/book/author[contains(., "Ullman")]')) - return an author element - printf(tree.xpath('bib/book/author[. = "Ullman"]')) - return nothing - printf(tree.xpath('bib/book/author[. = "Jeffrey D. Ullman"]')) - return the author - contains is substring match while = is exact match - printf(tree.xpath('bib/book/author[./text() = "Jeffrey D. Ullman"]')) - return the author as well - printf(tree.xpath('bib/book[@price > 35]')) - printf(tree.xpath('bib/book[@price > "35"]')) - printf(tree.xpath('/*')) - printf(tree.xpath('/bib')) - printf(tree.xpath('/bib/book/*')) - printf(tree.xpath('//book')) - return all book elements - printf(tree.xpath('/bib/book[year]')) - printf(tree.xpath('/bib/book[author/first-name]')) - printf(tree.xpath('/bib/book[//year]')) - return the 1st and 3rd books - the 3rd book does not have a year element - printf(tree.xpath('/bib/book[author//first-name]')) - ?? - printf(tree.xpath('//book[author/text()]')) - ?? 
- printf(tree.xpath('//book[author[2]/text()]')) - return nothing - printf(tree.xpath('//book[author[1]/text()]')) - return the first two books - printf(tree.xpath('/bib/book|/bib/cd')) - | -> or - lxml with fsimage file - from helper import printf - from lxml import etree - tree = etree.parse(open('fsimage70.xml')) - printf(tree.xpath('/fsimage/INodeSection/inode[name[not(node())]]')) - return root inode - not(node()) means there is nothing under the name -> empty - printf(tree.xpath('/fsimage/INodeSection/inode[name[not(node())]]/id'))