tvare-toc
===

# Specs

I am running Fedora 30 on a DigitalOcean droplet: 8GB RAM / 4 vCPUs / 25GB SSD.

```
[hadoop@fedoracid1 ~]$ uname -a
Linux fedoracid1 5.1.5-300.fc30.x86_64 #1 SMP Sat May 25 18:00:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[hadoop@fedoracid1 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           4
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:            4
CPU MHz:             2294.608
BogoMIPS:            4589.21
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
...
```

# Hadoop setup

Hadoop 3.1.2 in pseudo-distributed mode, with YARN as the resource manager.

# Jobs:

### Scraper

Scrape FB profiles and download as many images as possible (optional).

Ultimate FB scraper does a good job, but it is unreliable: it freezes or fails the job randomly.

### Face Recognition

* Use a pre-trained DNN to scan images for faces.
* Save the face location for each face.
* Save a 128-dimensional face encoding vector.
* Save the path to each file and, with it, the profile name.

input: as many files as possible

output:

```
"hdfs:///user/hadoop/input/photos/martinxluptak/picture.jpg" [[442, 1801, 597, 1646], [...vector-128...]]
```

### Cluster faces

* DBSCAN or Chinese Whispers: find the number of clusters
* not easy to parallelize
* Using DBSCAN, although Chinese Whispers has O(n) complexity, so it might be better for big data sets

![](https://s14-eu5.startpage.com/wikioimage/8d5ccb96d8adbcaf30e181652c71dec3.png)

### Work with faces

* aggregate data, e.g. how many pictures was each person in?
* compare new, unknown faces to known faces (a real-time processing job)
* obtain the Facebook profile ID based on a picture
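
A rough sketch of how the "work with faces" steps could look once every face has a cluster label. Nothing here is the actual job code: the record layout `(path, location, encoding)` mirrors the face recognition output above, but the function names, the `other.jpg` / `someprofile` paths, the 3-dimensional toy encodings (real ones have 128 dimensions) and the `0.6` distance threshold are all assumptions for illustration. The threshold in particular would need tuning on real data. Label `-1` is treated as noise, which is the convention DBSCAN implementations such as scikit-learn use.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+
from posixpath import basename, dirname


def profile_from_path(path):
    """Return the parent directory name, assumed here to be the profile name."""
    return basename(dirname(path))


def pictures_per_person(records, labels):
    """Count distinct pictures per cluster label, skipping noise (-1)."""
    pictures = defaultdict(set)
    for (path, _location, _encoding), label in zip(records, labels):
        if label != -1:
            pictures[label].add(path)
    return {label: len(paths) for label, paths in pictures.items()}


def cluster_centroids(records, labels):
    """Average the encodings of each cluster, skipping noise (-1)."""
    sums, counts = {}, defaultdict(int)
    for (_path, _location, encoding), label in zip(records, labels):
        if label == -1:
            continue
        if label in sums:
            sums[label] = [a + b for a, b in zip(sums[label], encoding)]
        else:
            sums[label] = list(encoding)
        counts[label] += 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}


def match_face(encoding, centroids, threshold=0.6):
    """Return the label of the nearest centroid within threshold, else None."""
    best_label, best_distance = None, threshold
    for label, centre in centroids.items():
        distance = dist(encoding, centre)
        if distance <= best_distance:
            best_label, best_distance = label, distance
    return best_label


# Toy data: hypothetical paths and 3-dim encodings instead of 128-dim ones.
records = [
    ("hdfs:///user/hadoop/input/photos/martinxluptak/picture.jpg",
     [442, 1801, 597, 1646], [0.0, 0.0, 0.0]),
    ("hdfs:///user/hadoop/input/photos/martinxluptak/other.jpg",
     [10, 60, 60, 10], [0.2, 0.0, 0.0]),
    ("hdfs:///user/hadoop/input/photos/someprofile/group.jpg",
     [5, 50, 50, 5], [1.0, 1.0, 1.0]),
    ("hdfs:///user/hadoop/input/photos/someprofile/group.jpg",
     [100, 150, 150, 100], [0.1, 0.0, 0.0]),
]
labels = [0, 0, 1, 0]  # e.g. the output of a clustering job

print(pictures_per_person(records, labels))
centroids = cluster_centroids(records, labels)
print(match_face([0.1, 0.1, 0.0], centroids))
print(profile_from_path(records[0][0]))
```

Matching against one centroid per person keeps the real-time lookup cheap (one distance per known person rather than one per stored face), at the cost of some accuracy for people whose pictures vary a lot. Mapping a matched cluster back to a profile ID would still need a rule on top of `profile_from_path`, for example picking the profile that contributes most pictures to the cluster.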