# Summary of On-Device, Real-Time Hand Tracking with MediaPipe
URL: https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html
## Why this article
* Targets mobile devices, so it needs relatively little computational power
* Includes a mechanism similar to Lattice’s Smart Focus
* Source code is available
## Web Demo
https://codepen.io/mediapipe/pen/RwGWYJw
## Hand perception pipeline overview
The pipeline chains three models, covered in the sections below: a palm detector (BlazePalm), a hand landmark model, and a gesture recognizer.
## Palm detector (BlazePalm)
### Why Palm?
* Estimating bounding boxes of rigid objects like palms and fists is significantly simpler than detecting hands with articulated fingers
The palm varies far less in appearance than the whole hand.
* Palms are smaller objects, so the non-maximum suppression algorithm works well even for two-hand self-occlusion cases, like handshakes.
Palms are smaller than whole hands, so their bounding boxes overlap less and NMS is less likely to suppress a valid detection (?)
* Palms can be modelled using square bounding boxes (anchors in ML terminology) ignoring other aspect ratios, and therefore reducing the number of anchors by a factor of 3-5.
The palm is close to square, so a single square anchor per location suffices and rectangular anchors at other aspect ratios are unnecessary. Typical detectors place 3-5 anchors of different aspect ratios at each feature-map location, so dropping the extra ratios cuts the anchor count, and hence the compute, by roughly that factor; see the first sketch after this list.
* An encoder-decoder feature extractor is used for bigger scene context awareness even for small objects (similar to the RetinaNet approach).
Guess: the encoder-decoder feature extractor captures the core features at multiple scales, combining broad scene context with fine detail, which helps localize small objects like palms.
* We minimize the focal loss during training to support a large amount of anchors resulting from the high scale variance.
Likely because palms are small, the vast majority of anchors in training are negative (background) examples, which creates a foreground/background imbalance; focal loss down-weights the easy negatives so they do not dominate training. See the second sketch after this list.
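To make the anchor-count argument concrete, here is a back-of-the-envelope sketch; the grid size and scale/ratio sets are illustrative assumptions, not values from the post:

```python
# Hypothetical illustration: an SSD-style detector places one anchor per
# (scale, aspect ratio) combination at every feature-map cell, so fixing
# the aspect ratio to 1:1 for square palms removes the ratio axis entirely.
def num_anchors(grid_w, grid_h, num_scales, aspect_ratios):
    return grid_w * grid_h * num_scales * len(aspect_ratios)

generic = num_anchors(16, 16, 3, aspect_ratios=[0.5, 1.0, 2.0])  # rectangles too
palm = num_anchors(16, 16, 3, aspect_ratios=[1.0])               # squares only
print(generic // palm)  # -> 3, in line with the 3-5x reduction the post cites
```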
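And a minimal focal-loss sketch in the RetinaNet formulation (the post only says focal loss is minimized, so the alpha/gamma defaults here are assumptions):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted foreground probability per anchor; y: 1 for palm anchors,
    0 for background. The (1 - p_t)**gamma factor shrinks the loss of easy,
    well-classified anchors, so the flood of background anchors does not
    swamp the few positives.
    """
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-7, 1.0))
```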
## Hand Landmark Model

* To obtain ground truth data, we have manually annotated ~30K real-world images with 21 3D coordinates, as shown below (we take Z-value from image depth map, if it exists per corresponding coordinate).
If the source image comes with a depth map (per-pixel object-to-camera distance), the Z value of each landmark is read from it at the corresponding coordinate.
* To better cover the possible hand poses and provide additional supervision on the nature of hand geometry, we also render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates.
In the blog's figure, the second row shows the 3D-rendered synthetic hands.
Per the experiments in the post, training on a mix of real and synthetic hand images improves performance. A sketch of reading the 21 landmarks follows.
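As a quick way to inspect the 21-landmark output, here is a minimal sketch using the later MediaPipe Hands Python solution (assumes `pip install mediapipe opencv-python` and a hypothetical hand.jpg; the blog predates this API, so this illustrates the output format rather than the exact models described):

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# static_image_mode=True runs palm detection on every image instead of
# the video tracking path; max_num_hands caps the number of detections.
with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    image = cv2.imread("hand.jpg")  # hypothetical input image
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    for hand in results.multi_hand_landmarks or []:
        for i, lm in enumerate(hand.landmark):
            # 21 landmarks per hand; x, y are normalized to [0, 1],
            # z is relative depth (smaller means closer to the camera)
            print(i, lm.x, lm.y, lm.z)
```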

## Gesture Recognition

In the blog's MediaPipe graph figure, three yellow lines come out of the camera:
1. The leftmost line feeds the first frame to the palm detector to check whether a palm is present.
* If so, the palm location is passed to ImageCropping, which crops out the palm region.
* HandLandmark then labels the hand landmarks on that crop.
* The REJECT_HAND_FLAG exists because subsequent frames do not pass through the palm detector (the dashed part): the second camera line feeds the image straight to ImageCropping, which reuses the previous palm location (i.e., it assumes the palm has not moved far). This saves computational resources.
2. The middle line, as described above, handles every frame after a hand has been detected: the image is cropped at the palm location found in the first frame and fed to HandLandmark again.
3. Because the leftmost and middle lines both work on crops, the full frame is still needed for the final image shown to the user, so the third line carries it to AnnotationRenderer.
LandmarksToRectangle at the bottom left presumably checks whether the palm landmarks in the current crop are incomplete or have drifted; if so, it feeds back to ImageCropping to adjust the crop region. A sketch of this detect-once, then-track loop follows.
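A minimal sketch of that control flow, with detect_palm, predict_landmarks, and landmarks_to_rect as hypothetical stand-ins for the real calculators:

```python
# Hypothetical sketch of the detect-once, then-track loop: run the palm
# detector only when there is no crop to reuse, and let the landmark
# model's hand-presence output (REJECT_HAND_FLAG in the graph) decide
# when to fall back to detection.
def track_hands(frames, detect_palm, predict_landmarks, landmarks_to_rect):
    crop_rect = None
    for frame in frames:
        if crop_rect is None:
            crop_rect = detect_palm(frame)        # palm detector, rare frames only
            if crop_rect is None:
                continue                          # no palm found in this frame
        landmarks, hand_present = predict_landmarks(frame, crop_rect)
        if not hand_present:                      # crop no longer contains a hand
            crop_rect = None                      # re-run the detector next frame
            continue
        crop_rect = landmarks_to_rect(landmarks)  # refined crop for the next frame
        yield frame, landmarks                    # full frame kept for rendering
```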