F-formation - Handle overlapping boxes

![image|12x12](https://www.researchgate.net/profile/Naglaa-Megahed/publication/334297706/figure/fig5/AS:873130784456715@1585182009560/Potential-scenarios-for-spatial-patterns-different-F-formation-configurations-a.jpg)

**Handle overlapping cases when working on object detection**. Use a person-to-group-center distance metric based on body pose and head pose
* *Idea 1*. Compute the distance between a person and an F-formation as $d(\theta_h,\theta_b,\theta_c)$, where $\theta_h,\theta_b$ are the head pose and body pose of the person, and $\theta_c$ is the angle of the vector from that person to the box center
    * *Observations*.
        * $\cos(\theta_h,\theta_b) \geq 0$ reflects the certainty of focus of a person, i.e. how certain we are about their focus point
    * *Simple formulation*. $d(\theta_h,\theta_b,\theta_c) = - \frac{1}{2} [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)]$
    * *Formulation with focus uncertainty*. $d(\theta_h,\theta_b,\theta_c) = - \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)]$
* *Idea 2*. Compute the distance between a person and an F-formation as $d(\theta_h,\theta_b,\theta_c,d_c)$, where $d_c$ is the physical distance from the person to the F-formation center
    * *Simple formulation*. $d(\theta_h,\theta_b,\theta_c,d_c)=d_c\cdot \frac{- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu}{\sigma}$
    * *Explain*.
        * $s=\frac{- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu}{\sigma} \in [(-2-\mu)/\sigma,(2-\mu)/\sigma]$ defines the scaling factor imposed by the angular distance
        * Suppose that two people $i$ and $j$ have the same total distance, $s_i d_i = s_j d_j$, i.e. $\frac{d_i}{d_j} = \frac{s_j}{s_i}$, and suppose that $s_i=(-2-\mu)/\sigma$ and $s_j=(2-\mu)/\sigma$, then

          $$\frac{d_i}{d_j}=\frac{2-\mu}{-2-\mu}$$

          We further require that $-2-\mu\geq 0$, i.e. $\mu\leq -2$, so that the scaling factor is never negative. We then have

          $$\lim_{\mu\to-\infty}\frac{d_i}{d_j}=1,\quad \lim_{\mu\to -2^-}\frac{d_i}{d_j}=\infty$$
        * We have that $\mu$ decides the confusion ratio $s_j/s_i$, and $\sigma$ plays no role here
            * As $\mu\to-\infty$, the angular distance has no contribution to the total distance
            * As $\mu\to-2$, the angular distance has a more significant contribution to the total distance
        * Therefore, what we want to choose is the ratio $s_j/s_i$, then deduce $\mu$ from that
    * *Revised formulation*. $d(\theta_h,\theta_b,\theta_c,d_c)=d_c\cdot (- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu)$

      $\to$ We should search for the optimal value of $\mu$
    * *More complex formulation*. $d(\theta_h,\theta_b,\theta_c,d_c)=f(d_c)\cdot (- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu)$, where $f(\cdot)$ is some function rectifying the scale of $d_c$
        * Experiments show that $f(d_c)=d_c^2$ gives much better clustering results
* *Idea 3*. Threshold the distances $d_c$ and $d(\theta_h,\theta_b,\theta_c)$ before choosing the closest F-formation
* *Idea 4*. We can extend this idea to K-Means clustering, since we know the point-to-centroid distance metric
    * *Simple implementation*.
        * *Label update*. Assign each person to the cluster with the closest centroid
        * *Centroid update*. Update using gradient descent on $(x,y)$ of the centroid
    * *Implementation plan*.
        * *Phase 1*. Assume that we know the number of clusters, then carry out the other steps (DONE)
            * *Conclusion*.
                * The algorithm does work; however, we need to
                    * Automatically find $K$, since this number is crucial
                    * Reduce the runtime of the algorithm
                    * Handle the case where some people do not belong to any group
                * The performance of this algorithm gives several insights
                    * It is crucial to get information about person-to-group-center relationships, based on people's pose and location
                    * Angular information and location information are approximately equally important, i.e. $d_c^2$ and $\mu=2.5$ is a good choice
        * *Phase 2*. Replace gradient descent with a more efficient step
            * *Option 1*. Find a closed form of $\nabla_{x_c,y_c}L$, then solve the normal equation $\nabla_{x_c,y_c}L=0$
            * *Option 2*. Find some more efficient iterative method for optimization
        * *Phase 3*. Relax the assumption of phase 1
            * *Option 1*. Use the elbow method for finding $K$ (very time consuming)
            * *Option 2*. Use some prior to find $K$ (what prior?)
                * Use object detection as prior
        * *Phase 4*. Hyperparameter searching
    * *Observations*.
        * The objective for the centroid location $(x_c,y_c)$ is given below, where $\theta_c^{(i)}$ is the angle of the vector from person $i$ to the centroid

          $$\begin{aligned}
          L&=\sum_i d(o_i,c_i)\\
          &=\sum_{i}d_c^{(i)}\cdot \big(- \cos(\theta_b^{(i)},\theta_h^{(i)}) [\cos(\theta_b^{(i)}, \theta_c^{(i)}) + \cos(\theta_h^{(i)},\theta_c^{(i)})] - \mu\big)\\
          d^{(i)}_c&=\sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2}\\
          \cos(\theta_b^{(i)},\theta_h^{(i)})&=\cos(|\theta_b^{(i)}-\theta_h^{(i)}|)\\
          \cos(\theta_b^{(i)},\theta_c^{(i)})&=\frac{(x_c-x^{(i)}+\epsilon) \cos\theta_b^{(i)}+(y_c-y^{(i)}+\epsilon) \sin\theta_b^{(i)}}{\sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2} + \epsilon},\quad \epsilon \to 0^+\\
          \cos(\theta_h^{(i)},\theta_c^{(i)})&=\frac{(x_c-x^{(i)}+\epsilon) \cos\theta_h^{(i)}+(y_c-y^{(i)}+\epsilon) \sin\theta_h^{(i)}}{\sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2} + \epsilon},\quad \epsilon \to 0^+
          \end{aligned}$$
        * We have that, as $\epsilon \to 0^+$

          $$\begin{aligned}
          L&=\sum_i \bigg(- d^{(i)}_c \cos(\theta_b^{(i)},\theta_h^{(i)}) [\cos(\theta_b^{(i)}, \theta_c^{(i)}) + \cos(\theta_h^{(i)},\theta_c^{(i)})] - d^{(i)}_c \mu\bigg)\\
          &=\sum_i\bigg( -\cos(\theta_b^{(i)},\theta_h^{(i)}) [(x_c-x^{(i)}) \cos\theta_b^{(i)}+(y_c-y^{(i)}) \sin\theta_b^{(i)} + (x_c-x^{(i)}) \cos\theta_h^{(i)}+(y_c-y^{(i)}) \sin\theta_h^{(i)}] - d^{(i)}_c \mu \bigg)\\
          &=\sum_i\bigg( -\cos(\theta_b^{(i)},\theta_h^{(i)}) [(x_c-x^{(i)}) (\cos\theta_b^{(i)}+\cos\theta_h^{(i)})+(y_c-y^{(i)}) (\sin\theta_b^{(i)} + \sin\theta_h^{(i)})] - d^{(i)}_c \mu \bigg)\\
          &=\sum_i\bigg( -\cos(\theta_b^{(i)},\theta_h^{(i)}) [(x_c-x^{(i)}) (\cos\theta_b^{(i)}+\cos\theta_h^{(i)})+(y_c-y^{(i)}) (\sin\theta_b^{(i)} + \sin\theta_h^{(i)})] - \mu \sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2} \bigg)
          \end{aligned}$$
        * We have that $L$ is a convex function of $(x_c,y_c)$ when $\mu\leq 0$, which holds under the requirement $\mu\leq-2$: the first term is linear in $(x_c,y_c)$, and $-\mu\, d_c^{(i)}$ is a nonnegative multiple of a Euclidean norm. Hence gradient descent will converge to a global optimum
            * Note that in case the number of objects is $1$, this loss will diverge
            * Note that in case some centroid falls exactly on the location of a particular person, this loss will diverge as well
        * It turns out that, for the case where

          $$d(\theta_h,\theta_b,\theta_c,d_c)=d_c^2\cdot (- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu)$$

          $L$ is convex as well when $\mu\leq-2$. A product of convex functions is not convex in general, so this needs a direct check: $d_c^2\,[\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)]$ equals $d_c$ times a linear function of $(x_c,y_c)$, the Hessian of the resulting angular term has eigenvalues no smaller than $-4$, and the Hessian of $-\mu d_c^2$ is $-2\mu I \succeq 4I$

**Groups of two people only**.
* *People representation*. Each pair of people is represented by
    * Physical distance
    * Angular distance from body pose to the line connecting the two people
    * Angular distance from head pose to the line connecting the two people
    * Angular distance between body pose and head pose of each person

      $\to$ This expresses our uncertainty about their field of focus
        * *Formulation*. $\text{score}_\text{focus} = \cos(\theta_b,\theta_h) \geq 0$
* *Drawback*. This approach only involves pairwise information, while we want to involve crowd information

**Groups of several people using supervised learning**.
* *Brainstorm*.
    * We need to represent the crowd as objects in a 2D grid
    * We want people within the same F-formation to be close in the latent space
* *People representation*. Each group of people is represented by
    * Physical distance between them, as a function of people's locations

---

* Synthesize the O-space using a perspective transform
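
The person-to-centroid distance with $f(d_c)=d_c^2$ and the label-update step of Idea 4 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. The function names, the array layout, and the use of NumPy are assumptions, angles are taken to be in radians, and the default `mu=-2.5` is chosen only so that the requirement $\mu\leq-2$ holds.

```python
import numpy as np


def angular_term(theta_h, theta_b, theta_c):
    """-cos(theta_b, theta_h) * [cos(theta_b, theta_c) + cos(theta_h, theta_c)].

    cos(a, b) is taken as the cosine of the angle difference a - b.
    The notes assume the focus factor cos(theta_b, theta_h) is >= 0;
    this sketch does not enforce that.
    """
    focus = np.cos(theta_b - theta_h)
    return -focus * (np.cos(theta_b - theta_c) + np.cos(theta_h - theta_c))


def person_to_centroid_distance(xy, theta_h, theta_b, centroids, mu=-2.5):
    """Return an (N, K) matrix of d_c^2 * (angular_term - mu).

    xy:        (N, 2) person locations
    theta_h:   (N,) head poses, theta_b: (N,) body poses
    centroids: (K, 2) candidate F-formation centers
    mu:        offset from the notes; -2.5 is an illustrative value
    """
    xy = np.asarray(xy, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    theta_h = np.asarray(theta_h, dtype=float)
    theta_b = np.asarray(theta_b, dtype=float)

    delta = centroids[None, :, :] - xy[:, None, :]      # (N, K, 2)
    d_c = np.linalg.norm(delta, axis=-1)                # (N, K)
    theta_c = np.arctan2(delta[..., 1], delta[..., 0])  # (N, K)

    ang = angular_term(theta_h[:, None], theta_b[:, None], theta_c)
    return d_c ** 2 * (ang - mu)


def assign_labels(distances):
    """Label update: each person goes to the centroid with the smallest distance."""
    return np.argmin(distances, axis=1)
```

For a person at the origin whose head and body both face the positive $x$ direction, a centroid at $(1, 0)$ gets distance $0.5$ and a centroid at $(-1, 0)$ gets $4.5$, so the person is assigned to the centroid they are facing even though both are equally far away.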