tvare-toc
===
# Specs
I am running Fedora 30 on a DigitalOcean droplet with 8 GB RAM, 4 vCPUs, and a 25 GB SSD:
```
[hadoop@fedoracid1 ~]$ uname -a
Linux fedoracid1 5.1.5-300.fc30.x86_64 #1 SMP Sat May 25 18:00:11 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[hadoop@fedoracid1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 2294.608
BogoMIPS: 4589.21
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-3
...
```
# Hadoop setup
* Hadoop 3.1.2, pseudo-distributed mode
* YARN as the resource manager
# Jobs
### Scraper
Scrape FB profiles and download as many images as possible (optional).

Ultimate FB scraper does a good job when it runs through, but it is unreliable: it randomly freezes or fails the job.
### Face Recognition
* Use a pre-trained DNN to scan images for faces.
* Save the location of each face.
* Save a 128-dimensional face encoding vector for each face.
* Save the path to each file and, with it, the profile name.
Input: as many files as possible.

Output:
```
"hdfs:///user/hadoop/input/photos/martinxluptak/picture.jpg" [[442, 1801, 597, 1646], [...vector-128...]]
```
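A minimal sketch of how one such output record could be written and read back. It assumes one record per face, with a JSON-encoded path as the key, a tab as the separator, and a JSON-encoded `[location, encoding]` pair as the value; the separator and the function names `format_face_record`, `parse_face_record`, and `profile_from_path` are assumptions for illustration, not part of the original job. It also assumes the profile name is the directory that contains the image, as the sample path suggests.

```python
import json


def format_face_record(path, location, encoding):
    """Serialise one detected face as: JSON key, tab, JSON value."""
    # location: four integers describing the face box
    # encoding: the 128 floats produced by the face recognition model
    value = [list(location), list(encoding)]
    return json.dumps(path) + "\t" + json.dumps(value)


def parse_face_record(line):
    """Inverse of format_face_record: returns (path, location, encoding)."""
    key, value = line.rstrip("\n").split("\t", 1)
    location, encoding = json.loads(value)
    return json.loads(key), tuple(location), encoding


def profile_from_path(path):
    """Assumption: the parent directory of the image is the profile name."""
    return path.rstrip("/").split("/")[-2]


if __name__ == "__main__":
    path = "hdfs:///user/hadoop/input/photos/martinxluptak/picture.jpg"
    # A zero vector stands in for a real 128-dimensional encoding.
    line = format_face_record(path, (442, 1801, 597, 1646), [0.0] * 128)
    parsed_path, location, encoding = parse_face_record(line)
    print(profile_from_path(parsed_path), location, len(encoding))
```

Keeping the path as the key means the profile name can always be recovered from the record without storing it separately.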
### Cluster faces
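* DBSCAN or Chinese Whispers: both find the number of clusters on their own, so it does not have to be known in advance.
* Neither is easy to parallelize.
* Using DBSCAN for now, although Chinese Whispers runs in roughly linear time (linear in the number of graph edges), so it might be better for big data sets.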
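
To make the clustering step concrete, here is a minimal pure-Python sketch of DBSCAN. It is not the implementation used in this project (in practice a library such as scikit-learn's `sklearn.cluster.DBSCAN` would be the obvious choice). Toy 2-D points stand in for the 128-dimensional encodings, and the `eps` and `min_samples` values are made up for the example. It needs Python 3.8+ for `math.dist`.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force region query: every point within eps of point i
        # (including i itself). This makes the whole sketch O(n^2).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep growing
        cluster += 1
    return labels


if __name__ == "__main__":
    faces = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(faces, eps=1.5, min_samples=2))
```

The number of clusters falls out of the result (the distinct non-negative labels). The cluster expansion depends on previously assigned labels, which is one reason the algorithm does not split naturally into independent map tasks.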

### Work with faces
* Aggregate data, e.g. how many pictures was each person in?
* Compare new, unknown faces against the known ones (a real-time processing job).
* Obtain the Facebook profile ID based on a picture.
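
A rough sketch of the first two items, assuming the clustering step yields one integer label per face with `-1` for noise. The helper names (`pictures_per_person`, `centroids`, `match_face`), the use of cluster centroids, and the `max_distance` threshold are all assumptions for illustration; comparing against every stored encoding would be an equally valid design. It needs Python 3.8+ for `math.dist`.

```python
import math
from collections import defaultdict

NOISE = -1  # assumed label for faces that belong to no cluster


def pictures_per_person(records):
    """records: iterable of (path, label). Count distinct pictures per label."""
    seen = defaultdict(set)
    for path, label in records:
        if label != NOISE:
            seen[label].add(path)
    return {label: len(paths) for label, paths in seen.items()}


def centroids(encodings, labels):
    """Average the encodings of each cluster into one reference vector."""
    sums, counts = {}, defaultdict(int)
    for encoding, label in zip(encodings, labels):
        if label == NOISE:
            continue
        if label not in sums:
            sums[label] = [0.0] * len(encoding)
        sums[label] = [s + x for s, x in zip(sums[label], encoding)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def match_face(encoding, known, max_distance):
    """Return the label of the nearest centroid, or None if none is close."""
    best = min(known.items(),
               key=lambda item: math.dist(encoding, item[1]),
               default=None)
    if best is None or math.dist(encoding, best[1]) > max_distance:
        return None
    return best[0]


if __name__ == "__main__":
    records = [("a.jpg", 0), ("a.jpg", 1), ("b.jpg", 0), ("c.jpg", NOISE)]
    print(pictures_per_person(records))
    known = centroids([(0, 0), (2, 0), (10, 10)], [0, 0, 1])
    print(match_face((1.2, 0.1), known, max_distance=1.0))
```

Matching against a handful of centroids keeps the per-face lookup cheap, which matters if the comparison is meant to run in real time.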