---
title: F-formation - Handle overlapping boxes
---

**Handle overlapping cases when working on object detection**. Use a person-to-group-center distance metric based on body pose and head pose
* *Idea 1*. Compute the distance between a person and an F-formation as $d(\theta_h,\theta_b,\theta_c)$, where $\theta_h,\theta_b$ are the head pose and body pose of the person, and $\theta_c$ is the angle of the vector from that person to the box center
* *Observations*.
* $\cos(\theta_h,\theta_b) \geq 0$ reflects the certainty of focus of a person, i.e. how certain we are about their point of focus
* *Simple formulation*. $d(\theta_h,\theta_b,\theta_c) = - \frac{1}{2} [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)]$
* *Formulation with focus uncertainty*. $d(\theta_h,\theta_b,\theta_c) = - \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)]$
* *Idea 2*. Compute the distance between a person and an F-formation as $d(\theta_h,\theta_b,\theta_c,d_c)$, where $d_c$ is the physical distance from the person to the F-formation center
* *Simple formulation*. $d(\theta_h,\theta_b,\theta_c,d_c)=d_c\cdot \frac{- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu}{\sigma}$
* *Explanation*.
* $s=\frac{- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu}{\sigma} \in [(-2-\mu)/\sigma,(2-\mu)/\sigma]$ defines the scaling factor imposed by the angular distance
* Suppose that we have $s_i d_i = s_j d_j$, i.e. $\frac{d_i}{d_j} = \frac{s_j}{s_i}$ and suppose that $s_i=(-2-\mu)/\sigma$ and $s_j=(2-\mu)/\sigma$, then
$$\frac{d_i}{d_j}=\frac{2-\mu}{-2-\mu}$$
If we further require that $-2-\mu > 0$, i.e. $\mu < -2$, so that this ratio is positive and finite, then
$$\lim_{\mu\to-\infty}\frac{d_i}{d_j}=1,\quad \lim_{\mu\to -2^-}\frac{d_i}{d_j}=\infty$$
* We have that $\mu$ decides the confusion ratio $s_j/s_i$, and $\sigma$ plays no role here
* As $\mu\to-\infty$, the angular distance has no contribution to the total distance
* As $\mu\to-2$, the angular distance contributes more and more to the total distance
* Therefore, what we want to choose is the ratio $s_j/s_i$, then deduce $\mu$ from that
* *Revised formulation*. $d(\theta_h,\theta_b,\theta_c,d_c)=d_c\cdot (- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu)$
$\to$ We should search for the optimal value of $\mu$
* *More complex formulation*. $d(\theta_h,\theta_b,\theta_c,d_c)=f(d_c)\cdot (- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu)$ where $f(\cdot)$ is some function rectifying the scale of $d$
* Experiments show that $f(d_c)=d_c^2$ gives much better clustering results
* *Idea 3*. Threshold the distances $d_c$ and $d(\theta_h,\theta_b,\theta_c)$ before choosing the closest F-formation
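
As a minimal sketch of Ideas 1 and 2, the distances above can be written as follows. It assumes angles in radians and reads $\cos(\alpha,\beta)$ as $\cos(\alpha-\beta)$. The function names, the default `mu=-4.0` (an arbitrary value satisfying $\mu<-2$), and the helper `mu_from_ratio` (which solves $\frac{2-\mu}{-2-\mu}=r$ for $\mu$) are illustrative assumptions, not part of the original notes.

```python
import math


def angular_distance(theta_h, theta_b, theta_c, weight_by_focus=True):
    """Angular part of the distance (Idea 1).

    weight_by_focus=False gives the simple formulation (factor 1/2);
    weight_by_focus=True weights by the focus certainty cos(theta_b - theta_h).
    """
    weight = math.cos(theta_b - theta_h) if weight_by_focus else 0.5
    return -weight * (math.cos(theta_b - theta_c) + math.cos(theta_h - theta_c))


def person_to_center_distance(person_xy, theta_h, theta_b, center_xy,
                              mu=-4.0, f=None):
    """Revised formulation of Idea 2: f(d_c) * (angular_distance - mu).

    f defaults to the identity; pass f=lambda d: d * d for f(d_c) = d_c^2.
    """
    dx = center_xy[0] - person_xy[0]
    dy = center_xy[1] - person_xy[1]
    d_c = math.hypot(dx, dy)          # physical distance to the center
    theta_c = math.atan2(dy, dx)      # direction from the person to the center
    scale = d_c if f is None else f(d_c)
    return scale * (angular_distance(theta_h, theta_b, theta_c) - mu)


def mu_from_ratio(r):
    """Solve (2 - mu) / (-2 - mu) = r for mu (hypothetical helper), r > 1."""
    if r <= 1.0:
        raise ValueError("the ratio must be greater than 1")
    return -2.0 * (r + 1.0) / (r - 1.0)
```

With these definitions, a person one unit away and facing the center scores lower than a person at the same distance facing away, and the ratio between the two scores is the chosen $r$ when `mu = mu_from_ratio(r)`.
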
* *Idea 4*. We can extend this idea to K-Means clustering since we know the point-to-centroid distance metric
* *Simple implementation*.
* *Label update*. Update to the cluster with closest centroid
* *Centroid update*. Update using gradient descent on $(x,y)$ of the centroid
* *Implementation plan*.
* *Phase 1*. Assume that we know the number of clusters, then carry out other steps (DONE)
* *Conclusion*.
* The algorithm works; however, we still need to
* Automatically find $K$, this number is crucial
* Reduce the runtime of the algorithm
* Solve for the case where some people do not belong to any group
* The performance of this algorithm shows several insights
* It is crucial to get information about person-to-group-center relationships, based on people's pose and location
* Angular information and location information are approximately equally important, e.g. $f(d_c)=d_c^2$ with $\mu=2.5$ is a good choice
* *Phase 2*. Replace gradient descent with a more efficient update step
* *Option 1*. Find a closed form of $\nabla_{x_c,y_c}L$, then solve $\nabla_{x_c,y_c}L=0$ directly (as with the normal equation in least squares)
* *Option 2*. Find some more efficient iterative method for optimization
* *Phase 3*. Relax the assumption at phase 1
* *Option 1*. Use elbow method for finding $K$ (very time consuming)
* *Option 2*. Use some prior to find $K$ (what prior?)
* Use object detection as prior
* *Phase 4*. Hyperparameter searching
* *Observations*.
* The objective for centroid location $(x,y)$ is given as
$$\begin{aligned}
L&=\sum_i d(o_i,c_i)\\
&=\sum_{i}d_c^{(i)}\cdot (- \cos(\theta_b^{(i)},\theta_h^{(i)}) [\cos(\theta_b^{(i)}, \theta_{c_i}) + \cos(\theta_h^{(i)},\theta_{c_i})] - \mu)\\
d^{(i)}_c&=\sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2}\\
\cos(\theta_b,\theta_h)&=\cos(|\theta_b-\theta_h|)\\
\cos(\theta_b^{(i)},\theta_c)&=\frac{(x_c-x^{(i)}+\epsilon) \cos\theta_b+(y_c-y^{(i)}+\epsilon) \sin\theta_b}{\sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2} + \epsilon},\quad \epsilon \to 0^+\\
\cos(\theta_h^{(i)},\theta_c)&=\frac{(x_c-x^{(i)}+\epsilon) \cos\theta_h+(y_c-y^{(i)}+\epsilon) \sin\theta_h}{\sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2} + \epsilon},\quad \epsilon \to 0^+
\end{aligned}$$
* We have that, as $\epsilon \to 0^+$
$$\begin{aligned}
L&=\sum_i \bigg(- d^{(i)}_c \cos(\theta_b,\theta_h) [\cos(\theta_b^{(i)}, \theta_c) + \cos(\theta_h^{(i)},\theta_c)] - d^{(i)}_c \mu\bigg)\\
&=\sum_i\bigg( -\cos(\theta_b,\theta_h) [(x_c-x^{(i)}) \cos\theta_b+(y_c-y^{(i)}) \sin\theta_b + (x_c-x^{(i)}) \cos\theta_h+(y_c-y^{(i)}) \sin\theta_h] - d^{(i)}_c \mu \bigg)\\
&=\sum_i\bigg( -\cos(\theta_b,\theta_h) [(x_c-x^{(i)}) (\cos\theta_b+\cos\theta_h)+(y_c-y^{(i)}) (\sin\theta_b + \sin\theta_h)] - d^{(i)}_c \mu \bigg)\\
&=\sum_i\bigg( -\cos(\theta_b,\theta_h) [(x_c-x^{(i)}) (\cos\theta_b+\cos\theta_h)+(y_c-y^{(i)}) (\sin\theta_b + \sin\theta_h)] - \sqrt{(x_c - x^{(i)})^2 + (y_c - y^{(i)})^2} \mu \bigg)
\end{aligned}$$
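* Each summand is an affine function of $(x_c,y_c)$ plus $-\mu$ times a Euclidean norm, so $L$ is a convex function whenever $\mu\leq 0$; gradient descent with a suitable step size will then converge to a global optimum when one exists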
* Note that when the number of objects is $1$, this loss will diverge
* Note that when some centroid falls exactly on the location of a particular person, this loss will diverge as well
* It turns out that, for that case where
$$d(\theta_h,\theta_b,\theta_c,d_c)=d_c^2\cdot (- \cos(\theta_b,\theta_h) [\cos(\theta_b, \theta_c) + \cos(\theta_h,\theta_c)] - \mu)$$
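$L$ is expected to be convex as well. This does not follow from the convexity of $d_c$ alone, since a product of convex functions is not convex in general, so it should be verified directly, e.g. by checking the Hessian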
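
The centroid update can be sketched as plain gradient descent on the simplified objective (the $\epsilon\to 0^+$ form with $f(d_c)=d_c$). This is a minimal illustration under stated assumptions, not the implementation used in the experiments: each person is a tuple `(x, y, theta_b, theta_h)`, the names `centroid_loss`, `centroid_grad` and `fit_centroid` are hypothetical, and the step size and iteration count are arbitrary.

```python
import math


def centroid_loss(center, people, mu):
    """L = sum_i ( -w_i * <c - p_i, u_b + u_h> - mu * ||c - p_i|| )."""
    xc, yc = center
    total = 0.0
    for x, y, theta_b, theta_h in people:
        w = math.cos(theta_b - theta_h)                 # focus certainty
        dx, dy = xc - x, yc - y
        proj = (dx * (math.cos(theta_b) + math.cos(theta_h))
                + dy * (math.sin(theta_b) + math.sin(theta_h)))
        total += -w * proj - mu * math.hypot(dx, dy)
    return total


def centroid_grad(center, people, mu, eps=1e-12):
    """Gradient of centroid_loss with respect to (x_c, y_c)."""
    xc, yc = center
    gx = gy = 0.0
    for x, y, theta_b, theta_h in people:
        w = math.cos(theta_b - theta_h)
        dx, dy = xc - x, yc - y
        d = math.hypot(dx, dy) + eps                    # avoid division by zero
        gx += -w * (math.cos(theta_b) + math.cos(theta_h)) - mu * dx / d
        gy += -w * (math.sin(theta_b) + math.sin(theta_h)) - mu * dy / d
    return gx, gy


def fit_centroid(init, people, mu, lr=0.05, steps=500):
    """Fixed-step gradient descent on the centroid location."""
    xc, yc = init
    for _ in range(steps):
        gx, gy = centroid_grad((xc, yc), people, mu)
        xc -= lr * gx
        yc -= lr * gy
    return xc, yc
```

For two people facing each other along the $x$-axis with $\mu=-4$, the loss is constant on the segment between them, so the descent is expected to settle somewhere on that segment rather than at a unique point.
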
**Groups of two people only**.
* *People representation*. Each pair of people is represented by
* Physical distance
* Angular distance from body pose to the line connecting two people
* Angular distance from head pose to the line connecting two people
* Angular distance between body pose and head pose of each person
$\to$ This expresses our uncertainty about their field of focus
* *Formulation*. $\text{score}_\text{focus} = \cos(\theta_b,\theta_h) \geq 0$
* *Drawback*. This approach only involves pairwise information, while we want to involve crowd information
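
The pairwise representation above can be sketched as a small feature extractor. The dictionary layout (`x`, `y`, `theta_b`, `theta_h`, angles in radians) and the names `angle_diff` and `pair_features` are assumptions made for illustration only.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = (a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def pair_features(p, q):
    """Pairwise features; each tuple holds the value for p, then for q."""
    # Direction of the connecting line as seen from each person
    line_pq = math.atan2(q["y"] - p["y"], q["x"] - p["x"])
    line_qp = math.atan2(p["y"] - q["y"], p["x"] - q["x"])
    return {
        "distance": math.hypot(q["x"] - p["x"], q["y"] - p["y"]),
        "body_to_line": (angle_diff(p["theta_b"], line_pq),
                         angle_diff(q["theta_b"], line_qp)),
        "head_to_line": (angle_diff(p["theta_h"], line_pq),
                         angle_diff(q["theta_h"], line_qp)),
        # Focus score cos(theta_b, theta_h) of each person
        "focus": (math.cos(p["theta_b"] - p["theta_h"]),
                  math.cos(q["theta_b"] - q["theta_h"])),
    }
```
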
**Groups of several people using supervised learning**.
* *Brainstorm*.
* We need to represent the crowd as objects in a 2D grid
* We want people within the same F-formation to be close in the latent space
* *People representation*. Each group of people is represented by
* Physical distance between them, as a function of people's locations
---
* Synthesize the O-space by perspective transform