---
tags: Digital Image Processing
disqus: hackmd
---
# Part 4
## Basic Transforms
Consider translation of an image point $P(x,y)$ to $Q(x',y')$ by an amount $(x_0,y_0)$. It is clear that $x' = x + x_0$ and $y' = y + y_0$. This can also be visualized in matrix notation as,
\begin{equation}
\begin{bmatrix}
x' \\ y'
\end{bmatrix}=
\begin{bmatrix}
1 & 0 \\ 0 & 1
\end{bmatrix}
\begin{bmatrix}
x \\ y
\end{bmatrix}+
\begin{bmatrix}
x_0 \\ y_0
\end{bmatrix}
\end{equation}
Compacting it further,
\begin{equation}
\begin{bmatrix}
x' \\ y'
\end{bmatrix}=
\begin{bmatrix}
1 & 0 & x_0 \\ 0 & 1 & y_0
\end{bmatrix}
\begin{bmatrix}
x \\ y \\ 1
\end{bmatrix}
\end{equation}
To make the equation symmetric,
\begin{equation}
\begin{bmatrix}
x' \\ y' \\ 1
\end{bmatrix}=
\begin{bmatrix}
1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
x \\ y \\ 1
\end{bmatrix}
\end{equation}
This is known as the unified expression.
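As a quick sketch of the unified expression in NumPy (the point and offsets below are illustrative values):

```python
import numpy as np

# Translation of P(x, y) by (x0, y0) using the unified 3x3 matrix.
x0, y0 = 4.0, -2.0
T = np.array([[1.0, 0.0, x0],
              [0.0, 1.0, y0],
              [0.0, 0.0, 1.0]])

P = np.array([1.0, 2.0, 1.0])   # point (1, 2) in homogeneous form
Q = T @ P                        # Q = (x + x0, y + y0, 1)
print(Q[:2])                     # [5. 0.]
```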
Before moving to rotation, it is necessary to point out that a Cartesian point can be written in polar form as $x = r\cos \alpha$ and $y = r \sin \alpha$, where $r$ is the distance of the point from the origin and $\alpha$ is the angle from the $x$-axis. Now if the point is rotated by $\theta$ in the clockwise direction, we get $x' = r\cos (\alpha - \theta)$ and $y' = r \sin (\alpha - \theta)$. After expanding and substituting, this can be expressed as $x' = x\cos \theta + y \sin \theta$ and $y' = y \cos \theta - x \sin \theta$. In matrix notation,
\begin{equation}
\begin{bmatrix}
x' \\ y'
\end{bmatrix}=
\begin{bmatrix}
\cos \theta & \sin \theta \\ -\sin \theta & \cos \theta
\end{bmatrix}
\begin{bmatrix}
x \\ y
\end{bmatrix}
\end{equation}
Suppose there are scaling factors present in the transformations, namely $S_x$ and $S_y$. In that case,
\begin{equation}
\begin{bmatrix}
x' \\ y'
\end{bmatrix}=
\begin{bmatrix}
S_x & 0 \\ 0 & S_y
\end{bmatrix}
\begin{bmatrix}
x \\ y
\end{bmatrix}
\end{equation}
Note that these operations can be performed on a point either sequentially or in a concatenated fashion. Consider a point $P$ that has to be translated; the result is $T_r P$, where $T_r$ is the translation matrix. If this result then has to be rotated by an angle $\theta$, the point after rotation can be written as $R_\theta T_r P$.
Now, consider that the rotation has to be performed about a point other than the origin. The centre of rotation is first translated to the origin, the rotation is applied, and the result is translated back, giving $T_{-r}(R_\theta T_r P)$.
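The composition $T_{-r} R_\theta T_r$ can be sketched numerically (the centre and point are illustrative; the clockwise rotation convention of the text is used):

```python
import numpy as np

# Rotation about an arbitrary centre, composed as T_{-r} R_theta T_r:
# translate the centre to the origin, rotate, translate back.
def translate(tx, ty):
    return np.array([[1, 0, tx],
                     [0, 1, ty],
                     [0, 0, 1.0]])

def rotate_cw(theta):
    # Clockwise rotation, matching the convention used in the text.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s, 0],
                     [-s, c, 0],
                     [0, 0, 1.0]])

centre = np.array([2.0, 1.0])
A = translate(*centre) @ rotate_cw(np.pi / 2) @ translate(*-centre)

P = np.array([3.0, 1.0, 1.0])   # point (3, 1), one unit right of the centre
Q = A @ P                        # rotates 90 degrees clockwise about (2, 1)
print(Q[:2])                     # [2. 0.]
```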
In the three-dimensional case, the translation can be written as,
\begin{equation}
\begin{bmatrix}
x' \\ y' \\ z'
\end{bmatrix}=
\begin{bmatrix}
1 & 0 & 0 &x_0 \\ 0 & 1 & 0 & y_0 \\ 0 & 0 & 1 & z_0
\end{bmatrix}
\begin{bmatrix}
X \\ Y \\ Z \\ 1
\end{bmatrix}
\end{equation}
Its unified representation can be written as,
\begin{equation}
\begin{bmatrix}
x' \\ y' \\ z' \\ 1
\end{bmatrix}=
\begin{bmatrix}
1 & 0 & 0 & x_0 \\ 0 & 1 & 0 & y_0 \\ 0 & 0 & 1 & z_0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
X \\ Y \\ Z \\ 1
\end{bmatrix}
\end{equation}
The square matrix here is referred to as the translation matrix and is represented by $T$. For scaling along each of the three axes by factors $S_x$, $S_y$ and $S_z$, the matrix is
\begin{equation}
S =
\begin{bmatrix}
S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
For rotation in three dimensions, there is no single rotation matrix; a rotation can be performed about any of the three axes. Following the same clockwise convention as the two-dimensional case (and matching the pan and tilt matrices used later), the unified rotation matrices about the $z$, $x$ and $y$ axes are
\begin{equation}
R_z =
\begin{bmatrix}
\cos \theta & \sin \theta & 0 & 0 \\ -\sin \theta & \cos \theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix},\quad
R_x =
\begin{bmatrix}
1 & 0 & 0 & 0 \\ 0 & \cos \theta & \sin \theta & 0 \\ 0 & -\sin \theta & \cos \theta & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix},\quad
R_y =
\begin{bmatrix}
\cos \theta & 0 & -\sin \theta & 0 \\ 0 & 1 & 0 & 0 \\ \sin \theta & 0 & \cos \theta & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
Several transformations can also be combined into a single matrix that performs all the required operations on the point. Instead of applying rotation, translation and scaling separately as $v' = R_\theta T S v$, we can write $v' = Av$, where $A = R_\theta T S$. Note that these operations are not commutative, and therefore the order is important.
These operations are essential, especially in the projection of the three-dimensional scene onto a two-dimensional plane.
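A quick numeric illustration of the non-commutativity (the matrices and point below are illustrative):

```python
import numpy as np

# Concatenating transforms into one matrix and showing that order matters.
S = np.diag([2.0, 3.0, 1.0])                       # scale by (2, 3)
T = np.array([[1, 0, 5], [0, 1, 0], [0, 0, 1.0]])  # translate by (5, 0)

v = np.array([1.0, 1.0, 1.0])
print((T @ S @ v)[:2])   # scale first, then translate -> [ 7.  3.]
print((S @ T @ v)[:2])   # translate first, then scale -> [12.  3.]
```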
## Image Formation
This is a very important part, showing how the theory covered above can be used by a camera to perceive a three-dimensional scene, by projecting the scene onto a $2D$ plane. Consider the following setup.

The world coordinate system is denoted by $(X, Y, Z)$, and is mapped onto the image plane $(x,y)$ through the lens centre at $(0,0,\lambda)$, where $\lambda$ is the focal length of the camera. Using the properties of similar triangles, the relation between them can be calculated as,
\begin{equation}
x = \frac{\lambda X}{\lambda - Z}
\end{equation}
\begin{equation}
y = \frac{\lambda Y}{\lambda - Z}
\end{equation}
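A tiny numeric check of these projection equations (all values illustrative):

```python
# Projecting a world point (X, Y, Z) onto the image plane, with the lens
# centre at (0, 0, lambda). Values are illustrative assumptions.
lam = 50.0                    # focal length
X, Y, Z = 10.0, 20.0, -450.0  # world point (with this geometry, Z < lam)

x = lam * X / (lam - Z)
y = lam * Y / (lam - Z)
print(x, y)   # 1.0 2.0
```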
### Homogeneous Coordinates
Instead of writing the projection in matrix form directly, it is useful to exploit the properties of projective geometry, which appends another dimension to the existing vector. For example, the vector $(X,Y,Z)$ can be represented as $(kX,kY,kZ,k)$, where $k$ is an arbitrary nonzero constant; this representation is denoted by $w_h$.
Now, define a perspective matrix $P$ as
\begin{equation}
P =
\begin{bmatrix}
1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & \frac{-1}{\lambda} & 1
\end{bmatrix}
\end{equation}
Therefore, $c_h = Pw_h$, giving
\begin{equation}
c_h =
\begin{bmatrix}
kX \\ kY \\ kZ \\ -k\frac{Z}{\lambda} + k
\end{bmatrix}
\end{equation}
which are the homogeneous camera coordinates.
To convert $c_h$ to $c$, simply divide the coordinates by the last component of the vector, yielding
\begin{equation}
c =
\begin{bmatrix}
\frac{\lambda X}{\lambda - Z} \\ \frac{\lambda Y}{\lambda - Z} \\ \frac{\lambda Z}{\lambda - Z}
\end{bmatrix}
\end{equation}
Considering the derivation done before, the first and second components are the ones of interest, since they define the coordinates on the image plane (the third component is ignored). This is known as the perspective transformation.
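The same projection via the homogeneous matrix $P$ can be sketched as follows (values are illustrative):

```python
import numpy as np

# Perspective transformation via the homogeneous matrix P: project w_h,
# then divide by the last component.
lam = 50.0
P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, -1.0 / lam, 1]])

w_h = np.array([10.0, 20.0, -450.0, 1.0])   # world point, k = 1
c_h = P @ w_h
c = c_h[:3] / c_h[3]                         # divide by the last term
print(c[:2])   # matches lam*X/(lam - Z), lam*Y/(lam - Z) -> [1. 2.]
```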
Similarly, an inverse perspective transformation can be attempted to obtain the $3D$ coordinates as $w_h = P^{-1}c_h$, where
\begin{equation}
P^{-1} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & \frac{1}{\lambda} & 1
\end{bmatrix}
\end{equation}
Consider the image point $c = (x_0,y_0,0)$. Then $c_h = (kx_0, ky_0, 0, k) \implies w_h = (kx_0, ky_0, 0, k)$. This gives $(X,Y,Z) = (x_0, y_0, 0)$, i.e. the reconstructed point simply lies on the image plane, which is absurd.
Note that, from the camera geometry drawn earlier, the mapping from the world onto the image plane is not one-one but many-one: every world point lying on the line through the image point and the optical centre maps to the same image point. Hence, once a point is mapped to the image plane, its inverse mapping is not a function.
So, consider the equation of the straight line from the image point to the optical centre, which works out to be
\begin{equation}
X = \frac{x_0}{\lambda}(\lambda - Z)
\end{equation}
and,
\begin{equation}
Y = \frac{y_0}{\lambda}(\lambda - Z)
\end{equation}
Instead, take the homogeneous camera coordinates as $c_h = (kx_0, ky_0, kz, k)$, where $z$ is a free variable. Applying $P^{-1}$ gives $w_h = (kx_0, ky_0, kz, \frac{kz}{\lambda} + k)$. So,
\begin{equation}
w =
\begin{bmatrix}
X \\ Y \\ Z
\end{bmatrix} =
\begin{bmatrix}
\frac{\lambda x_0}{\lambda + z} \\ \frac{\lambda y_0}{\lambda + z} \\ \frac{\lambda z}{\lambda + z}
\end{bmatrix}
\end{equation}
On solving for $X$ and $Y$, we get the equations calculated just before. So, if the value of $z$ is known, it is possible to find $w$!
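A sketch of this inverse transformation for an assumed value of the free parameter $z$ (all values illustrative):

```python
import numpy as np

# Inverse perspective transformation: with the free parameter z appended
# to the image point, P^{-1} recovers a point on the projecting ray.
lam = 50.0
P_inv = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 1.0 / lam, 1]])

x0, y0, z = 1.0, 2.0, -45.0        # image point and assumed depth parameter
c_h = np.array([x0, y0, z, 1.0])   # k = 1
w_h = P_inv @ c_h
w = w_h[:3] / w_h[3]               # divide by the last component
print(w)                           # [  10.   20. -450.]
```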
## Image Geometry
Before this, the ideal case was considered, when the imaging plane, optical centre and the world coordinate under consideration were aligned perfectly. To satisfy a more general case,

The setup represented is known as a gimbal. The camera is mounted on the gimbal, which can be rotated about the $z$ axis (pan, $\theta$) or about the $x$ axis (tilt, $\alpha$). The gimbal is displaced from the world reference coordinate by $(X_0, Y_0, Z_0)$, and the displacement of the image plane centre from the gimbal centre is denoted by $r$. Given the world point $w$, the camera coordinates $c$ now have to be calculated.
Step 1: Displace from the world coordinate centre by $(X_0, Y_0, Z_0)$. The matrix obtained is,
\begin{equation}
G =
\begin{bmatrix}
1 & 0 & 0 & -X_0 \\ 0 & 1 & 0 & -Y_0 \\ 0 & 0 & 1 & -Z_0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
Step 2: Pan the camera by $\theta$. (Rotation about $Z$ axis)
\begin{equation}
R_\theta =
\begin{bmatrix}
\cos \theta & \sin \theta & 0 & 0 \\ -\sin \theta & \cos \theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
Step 3: Tilt the camera by $\alpha$. (Rotation about $X$ axis).
\begin{equation}
R_\alpha =
\begin{bmatrix}
1 & 0 & 0 & 0 \\ 0 & \cos \alpha & \sin \alpha & 0 \\ 0 & -\sin \alpha & \cos \alpha & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
Therefore, $R = R_\alpha R_\theta$ yields,
\begin{equation}
R =
\begin{bmatrix}
\cos \theta & \sin \theta & 0 & 0 \\ -\sin \theta \cos \alpha & \cos \theta \cos \alpha & \sin \alpha & 0 \\ \sin \theta \sin \alpha & -\cos \theta \sin \alpha & \cos \alpha & 0 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
Step 4: Camera center displacement by $r = (r_1,r_2,r_3)$.
\begin{equation}
T =
\begin{bmatrix}
1 & 0 & 0 & -r_1 \\ 0 & 1 & 0 & -r_2 \\ 0 & 0 & 1 & -r_3 \\ 0 & 0 & 0 & 1
\end{bmatrix}
\end{equation}
Combining all these transformations, along with perspective transformation matrix,
\begin{equation}
c_h = PTRGw_h
\end{equation}
The matrix expression can be boiled down to $c_h = Aw_h$, where $A = PTRG$; the image-plane coordinates then follow, as before, by dividing the first two components of $c_h$ by the fourth.
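The whole chain can be sketched numerically (every camera parameter below is an illustrative assumption, not a value from the text):

```python
import numpy as np

# Building the complete mapping c_h = PTRGw_h as one matrix A = PTRG.
lam = 50.0                                      # focal length
X0, Y0, Z0 = 1.0, 2.0, 3.0                      # gimbal offset from world origin
r1, r2, r3 = 0.1, 0.2, 0.3                      # image-plane centre offset r
theta, alpha = np.deg2rad(30), np.deg2rad(10)   # pan and tilt angles

G = np.eye(4); G[:3, 3] = [-X0, -Y0, -Z0]       # Step 1: gimbal displacement
T = np.eye(4); T[:3, 3] = [-r1, -r2, -r3]       # Step 4: camera-centre offset

c, s = np.cos(theta), np.sin(theta)             # Step 2: pan about the z axis
R_theta = np.array([[c, s, 0, 0], [-s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1.0]])
c, s = np.cos(alpha), np.sin(alpha)             # Step 3: tilt about the x axis
R_alpha = np.array([[1, 0, 0, 0], [0, c, s, 0], [0, -s, c, 0], [0, 0, 0, 1.0]])

P = np.eye(4); P[3, 2] = -1.0 / lam             # perspective matrix

A = P @ T @ R_alpha @ R_theta @ G               # R = R_alpha R_theta
w_h = np.array([10.0, 20.0, -430.0, 1.0])       # a world point, k = 1
c_h = A @ w_h
x, y = c_h[0] / c_h[3], c_h[1] / c_h[3]         # image-plane coordinates
print(x, y)
```

The single matrix $A$ gives the same result as applying $G$, $R_\theta$, $R_\alpha$, $T$ and $P$ one after the other.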
Now, in order to find the overall transformation matrix $A$, the camera can be calibrated: since $A$ depends only on the camera parameters and its mounting, it can be estimated once from points whose world coordinates are known. We know that in the homogeneous coordinate system, $w_h = (kX,kY,kZ,k)$. Without loss of generality, let $k = 1$. Expanding $c_h = Aw_h$, we get,
\begin{equation}
\begin{bmatrix}
c_{h1}\\c_{h2}\\c_{h3}\\c_{h4}
\end{bmatrix} =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}
\begin{bmatrix}
X \\ Y \\ Z \\ 1
\end{bmatrix}
\end{equation}
Here, $x = \frac{c_{h1}}{c_{h4}}$ and $y = \frac{c_{h2}}{c_{h4}}$. Therefore,
\begin{equation}
\begin{bmatrix}
xc_{h4}\\yc_{h4}\\c_{h3}\\c_{h4}
\end{bmatrix} =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}
\begin{bmatrix}
X \\ Y \\ Z \\ 1
\end{bmatrix}
\end{equation}
Ignoring the third equation, as explained before, we are left with,
\begin{equation}
\begin{bmatrix}
xc_{h4}\\yc_{h4}\\c_{h4}
\end{bmatrix} =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}
\begin{bmatrix}
X \\ Y \\ Z \\ 1
\end{bmatrix}
\end{equation}
Substituting the fourth component $c_{h4} = a_{41}X + a_{42}Y + a_{43}Z + a_{44}$ into the first two equations eliminates $c_{h4}$, leaving $12$ unknowns in total. Every known $3D$ point yields two such equations, so $6$ known points are enough to solve for the unknown parameters.
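A sketch of this calibration procedure on synthetic data (everything below is an illustrative assumption; one common way to handle the fact that the matrix is only determined up to scale is to fix $a_{44} = 1$, leaving 11 unknowns, which the 12 equations from 6 points determine via least squares):

```python
import numpy as np

# Camera-calibration sketch with synthetic data: fix a_44 = 1, then each
# known world point gives two linear equations in the remaining unknowns.
A_true = np.array([[1.0, 0.0, 0.0,  2.0],
                   [0.0, 1.0, 0.0,  3.0],
                   [0.0, 0.0, -0.02, 1.0]])   # rows a_1, a_2, a_4; a_44 = 1
pts = np.array([[1., 2., 3.], [4., 1., 2.], [2., 5., 1.],
                [3., 3., 4.], [5., 2., 2.], [1., 4., 5.]])  # 6 known points

M, b, obs = [], [], []
for X, Y, Z in pts:
    u = A_true @ np.array([X, Y, Z, 1.0])
    x, y = u[0] / u[2], u[1] / u[2]          # x = c_h1/c_h4, y = c_h2/c_h4
    obs.append((x, y))
    # a11 X + a12 Y + a13 Z + a14 - x (a41 X + a42 Y + a43 Z) = x, same for y
    M.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z]); b.append(x)
    M.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z]); b.append(y)

sol, *_ = np.linalg.lstsq(np.array(M), np.array(b), rcond=None)
A_est = np.append(sol, 1.0).reshape(3, 4)    # reinstate a_44 = 1

# Reprojection error of the estimate on the calibration points (~0 here,
# since the synthetic data are exact).
errs = []
for (X, Y, Z), (x, y) in zip(pts, obs):
    u = A_est @ np.array([X, Y, Z, 1.0])
    errs.append(max(abs(u[0] / u[2] - x), abs(u[1] / u[2] - y)))
print(max(errs))
```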
## Stereo Imaging
The problem of recovering a three-dimensional world point from its image has still not been addressed. For that, two cameras can be used in a setup as follows,

To analyse the situation, consider a slightly simplified case, where the two image planes lie in the same plane, that is,

It is assumed that both the cameras are identical.
The setup is shown as

Using this figure, a series of equations can be derived.
The image coordinates of the same world point in the two images are $(x_1,y_1)$ and $(x_2,y_2)$ respectively. The equation of the first projecting ray gives $X_1 = \frac{x_1}{\lambda}(\lambda - z)$ and, similarly, $X_2 = \frac{x_2}{\lambda}(\lambda - z)$, where each $X$ is measured in its own camera's coordinate system. Since the second camera is displaced by the baseline $B$, $X_1 = X_2 + B$. Therefore,
\begin{equation}
\frac{x_2}{\lambda}(\lambda - z) + B = \frac{x_1}{\lambda}(\lambda - z)
\end{equation}
From this, we get,
\begin{equation}
z = \lambda - \frac{\lambda B}{x_1 - x_2}
\end{equation}
Here the difference $x_1 - x_2$ is known as the disparity. Therefore, given the distance $B$ between the two cameras and their focal length $\lambda$, it is possible to find the depth of the world point from the disparity of the two image points. With $z$ known, all three coordinate values of the world point follow.
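A tiny numeric check of the depth formula (baseline, focal length and image coordinates are all illustrative):

```python
# Depth from a stereo pair: z = lam - lam*B / (x1 - x2).
lam = 50.0          # focal length
B = 10.0            # baseline between the two lens centres
x1, x2 = 3.0, 2.0   # x-coordinates of the same world point in both images

z = lam - lam * B / (x1 - x2)
print(z)   # -450.0
```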
## Stereo Correspondence Problem
Given the position of an image point in the left (or right) image, what is its corresponding point in the other image? In the idealised case above, the correspondence was given; to find it by brute-force matching, each of the $N^2$ pixels of one $N \times N$ image would have to be compared against the pixels of the other, requiring up to $N^4$ comparisons in the worst case.
Given the world point and the two camera coordinate systems $C_1:(X_1,Y_1,Z_1)$ and $C_2:(X_2,Y_2,Z_2)$, the image points are $(\frac{\lambda X_1}{\lambda - Z_1},\frac{\lambda Y_1}{\lambda - Z_1})$ in image 1 and $(\frac{\lambda X_2}{\lambda - Z_2},\frac{\lambda Y_2}{\lambda - Z_2})$ in image 2. In our setup, the cameras are displaced only along the $x$ direction, so $Y_1 = Y_2$ and $Z_1 = Z_2$. This makes $y_1 = y_2$: the corresponding point lies on the same horizontal line, reducing the search to one dimension.