# Summary of On-Device, Real-Time Hand Tracking with MediaPipe

URL: https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html

## Why this article

* Targets mobile devices, so it needs fewer computation resources
* Also has a mechanism similar to Lattice's Smart Focus
* Source code is available (a short usage sketch is included at the end of these notes)

## Web Demo

https://codepen.io/mediapipe/pen/RwGWYJw

## Hand perception pipeline overview

![](https://i.imgur.com/JvoMYfE.png)

## Palm detector (BlazePalm)

### Why palms?

* "Estimating bounding boxes of rigid objects like palms and fists is significantly simpler than detecting hands with articulated fingers." The palm varies far less than the whole hand.
* "Palms are smaller objects, the non-maximum suppression algorithm works well even for two-hand self-occlusion cases, like handshakes." Palms are smaller than hands, so NMS works better (?)
* "Palms can be modelled using square bounding boxes (anchors in ML terminology) ignoring other aspect ratios, and therefore reducing the number of anchors by a factor of 3-5." Palms are close to square, so square anchors suffice and no rectangular anchors of other aspect ratios are needed, which lowers the computation cost. Anchors are usually laid out per feature-map cell at several aspect ratios, typically 3-5 per location, so restricting them to squares cuts the anchor count by roughly that factor.
* "An encoder-decoder feature extractor is used for bigger scene context awareness even for small objects (similar to the RetinaNet approach)." Guess: the encoder-decoder feature extractor captures the core features better, which should give better generality.
* "We minimize the focal loss during training to support a large amount of anchors resulting from the high scale variance." Presumably because palms are small, most anchors in the training data are negative examples, so there is a foreground/background imbalance problem that focal loss addresses (a small sketch of focal loss is included at the end of these notes).

## Hand Landmark Model

![](https://i.imgur.com/8TGm62i.png)

* "To obtain ground truth data, we have manually annotated ~30K real-world images with 21 3D coordinates, as shown below (we take Z-value from image depth map, if it exists per corresponding coordinate)." If the image carries a Z value (presumably the distance from the object to the camera), they try to keep the distances consistent so the hand sizes do not vary too much.
* "To better cover the possible hand poses and provide additional supervision on the nature of hand geometry, we also render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates." The second row in the figure shows the synthetically rendered hands. According to the experiments in the article, mixing real and synthetic hand images improves performance.

![](https://i.imgur.com/0x5ofDV.png)

## Gesture Recognition

![](https://i.imgur.com/YQxYNW4.png)

In the figure above, three yellow edges leave the camera node:

1. The leftmost edge lets the palm detector check whether a palm is present in the first frame.
   * If there is one, the detection is handed to ImageCropping, which crops out the palm region.
   * HandLandmark then marks the hand landmarks.
   * REJECT_HAND_FLAG exists because subsequent frames do not go through the palm detector (the dashed part): they are fed from the camera's second edge directly into ImageCropping, which reuses the previous palm location (i.e., it assumes the palm does not move too far between frames). This saves computation resources.
2. The middle edge, as described above, is used for the frames after a hand has been detected: the palm location found in the first frame is reused to crop the same region and feed it to HandLandmark. (A sketch of this detect-once / track-afterwards control flow is included at the end of these notes.)
3. Because the leftmost and middle edges both work on cropped images, the full frame is still needed when rendering output for the user, so the third edge sends the full image to AnnotationRenderer.

LandmarksToRectangle at the bottom left presumably checks whether the palm landmarks in the current cropped image are incomplete or have drifted; if so, it feeds back to ImageCropping so the cropping region can be adjusted.
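## Appendix: illustrative sketches

The focal-loss bullet in the BlazePalm section attributes the loss choice to the foreground/background imbalance over many anchors. Below is a minimal sketch of binary focal loss in plain Python/NumPy; the `alpha` and `gamma` values are the commonly used RetinaNet defaults, not values stated in the article.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single anchor.

    p: predicted probability that the anchor contains a palm.
    y: ground-truth label (1 = palm, 0 = background).
    The (1 - p_t) ** gamma factor down-weights easy examples, which matters
    when the vast majority of anchors are easy background.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-9)

# An easy, confidently rejected background anchor contributes almost nothing,
# while a misclassified palm anchor still produces a large loss.
print(focal_loss(0.01, 0))  # easy negative -> ~7.5e-07
print(focal_loss(0.01, 1))  # hard positive -> ~1.13
```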
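The gesture-recognition graph runs the expensive palm detector only until a hand is found, then keeps re-cropping around the previous hand position until the landmark model signals that the hand is gone. The following is a minimal sketch of that control flow, not MediaPipe's actual calculator code; `detect_palm`, `predict_landmarks`, and `crop` are hypothetical callables standing in for the PalmDetector, HandLandmark, and ImageCropping nodes.

```python
def bounding_box_from(landmarks, margin=0.2):
    """Hypothetical stand-in for LandmarksToRectangle: enclose the landmarks
    in a box and pad it so the next frame's crop tolerates some hand motion."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - margin * w, min(ys) - margin * h,
            max(xs) + margin * w, max(ys) + margin * h)

def run_pipeline(frames, detect_palm, predict_landmarks, crop, min_confidence=0.5):
    """Detect-once / track-afterwards loop.

    detect_palm(frame)       -> palm bounding box, or None if no palm is found
    crop(frame, box)         -> image patch for that box
    predict_landmarks(patch) -> (list of (x, y, z) landmarks, hand-presence confidence)
    """
    palm_box = None
    for frame in frames:
        if palm_box is None:
            # "Leftmost edge": run the palm detector only when tracking is lost.
            palm_box = detect_palm(frame)
            if palm_box is None:
                yield frame, None
                continue
        # "Middle edge": reuse the previous hand location instead of re-detecting.
        landmarks, confidence = predict_landmarks(crop(frame, palm_box))
        if confidence < min_confidence:
            # Equivalent to REJECT_HAND_FLAG: drop tracking, re-detect next frame.
            palm_box = None
            yield frame, None
        else:
            # LandmarksToRectangle-style feedback: derive the next crop
            # from the current landmarks.
            palm_box = bounding_box_from(landmarks)
            yield frame, landmarks
```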
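Since the source code is released through MediaPipe, the whole pipeline can also be driven from a script. Below is a short usage sketch, assuming the MediaPipe Python package (`pip install mediapipe opencv-python`) and its Hands solution; the article itself points to the C++/JavaScript release, so this particular API is an assumption rather than something described in the post.

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=False,       # video mode: detect a palm once, then track it
    max_num_hands=2,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,   # below this, the palm detector runs again
)

capture = cv2.VideoCapture(0)
while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV delivers BGR frames.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp.solutions.drawing_utils.draw_landmarks(
                frame, hand_landmarks, mp.solutions.hands.HAND_CONNECTIONS)
    cv2.imshow("hands", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
        break

capture.release()
cv2.destroyAllWindows()
hands.close()
```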