---
tags: image processing
---

{%hackmd theme-dark %}

# Image Segmentation

## Table of Contents
[TOC]

## Fundamentals

### Definition of Segmentation

Let $R$ represent the entire spatial region occupied by an image. We may view ***image segmentation*** as a process that partitions $R$ into $n$ subregions $R_1, \ldots, R_n$ such that

(1) $\bigcup_{i=1}^n R_i = R$.
(2) $R_i$ is a connected set, for $i = 1, \ldots, n$.
(3) $R_i \cap R_j = \emptyset, \ \forall i \neq j$.
(4) $Q(R_i) = \text{TRUE}, \ \forall i = 1, \ldots, n$, where $Q(R_k)$ is a logical predicate (returns TRUE or FALSE) defined over the points in set $R_k$.
(5) $Q(R_i \cup R_j) = \text{FALSE}$ for any adjacent regions $R_i$ and $R_j$.

> Interpretation:
> (1) ~ (3): straightforward.
> (4), (5): to avoid trivial segmentations (e.g., making each pixel a distinct region).

In general, $Q$ can be a compound expression such as:

$$
\left( \text{average intensity of the pixels in } R_i < m_i \right) \wedge \left( \text{standard deviation of the intensity of the pixels in } R_i \le \sigma_i \right)
$$

## Thresholding

Suppose we have an image $f(x,y)$ and its corresponding intensity histogram. If the object and background pixels have intensity values grouped into two dominant modes, one obvious way to extract the objects from the background is to select a threshold, $T$, that separates these modes:

$$
g(x, y) =
\begin{cases}
1 \quad \text{if} \ f(x, y) > T \\
0 \quad \text{if} \ f(x, y) \le T
\end{cases}
$$

where $g(x,y)$ denotes the segmented image. When $T$ is a constant applicable over an entire image, the process given in this equation is referred to as ***global thresholding***. When the value of $T$ changes over an image, we use the term ***variable thresholding***.

Segmentation problems requiring more than two thresholds are difficult to solve, and better results usually are obtained using other methods, such as variable thresholding or region growing (see later).

### The Role of Noise, Illumination and Reflectance in Image Thresholding

Histograms of an 8-bit image that is noiseless versus corrupted by additive Gaussian noise with different standard deviations:

![](https://i.imgur.com/sdjk40K.png =700x)

Histograms of the noisy image, an intensity ramp, and their product:

![](https://i.imgur.com/wUzblbA.png =700x)

As we can see, illumination and reflectance play a role in the success of image segmentation using thresholding or other segmentation techniques. Thus, we should control these factors when possible. There are three basic approaches for when control over these factors is not possible:

1. Correct the shading pattern directly.
2. Correct the global shading pattern via processing using, for example, the top-hat transformation.
3. Use variable thresholding.

## Global Thresholding

### Otsu's Method For Global Thresholding

Otsu's method is an algorithm for assigning pixels to one of two groups. It is optimum in the sense that it maximizes the ***between-class variance***, a measure used in statistical discriminant analysis. Otsu's method has the important property that it is based entirely on computations performed on the histogram of an image (an easily obtainable 1-D array).

Some notation and definitions:

* $\{0, 1, \ldots, L-1\}$: the $L$ distinct integer intensity levels in a digital image of size $M \times N$ pixels.
* $n_i$: the number of pixels with intensity $i$. Thus, $MN = \sum_{i=0}^{L-1} n_i$.
* $p_i = n_i / MN$: the normalized histogram components, from which it follows that $\sum_{i=0}^{L-1} p_i = 1$.
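The normalized histogram is the only input Otsu's method needs. A minimal NumPy sketch of computing $p_i$ from an 8-bit grayscale image (the function name and the assumption that `img` is a 2-D `uint8` array are illustrative, not from the original notes):

```python
import numpy as np

def normalized_histogram(img: np.ndarray, L: int = 256) -> np.ndarray:
    """Return p, where p[i] = n_i / MN for an image with L intensity levels."""
    # bincount gives n_i for i = 0, ..., L-1; dividing by MN normalizes it.
    n = np.bincount(img.ravel(), minlength=L).astype(np.float64)
    return n / img.size
```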
Suppose we select a threshold $T(k) = k$, $0 < k < L-1$, to divide the input image into two classes, $c_1$ and $c_2$, where a pixel $x \in c_1$ if its intensity is in the range $[0, k]$ and $x \in c_2$ if its intensity is in the range $[k+1, L-1]$. Define the cumulative sums

$$
\begin{align}
P_1(k) = \sum_{i=0}^k p_i
\end{align}
$$

and

$$
\begin{align}
P_2(k) = \sum_{i=k+1}^{L-1} p_i = 1 - P_1(k)
\end{align}
$$

Let $m_1$ and $m_2$ denote the average intensities of $c_1$ and $c_2$, respectively. Also, let $m_G$ denote the average intensity of the entire image. Thus, we have $P_1 m_1 + P_2 m_2 = m_G$ and $P_1 + P_2 = 1$.

In order to evaluate the effectiveness of the threshold at level $k$, we use the measure

$$
\eta = \frac{\sigma_B^2}{\sigma_G^2}
$$

where $\sigma_G^2$ is the global variance:

$$
\sigma_G^2 = \sum_{i=0}^{L-1} (i - m_G)^2 p_i
$$

and $\sigma_B^2$ is the ***between-class variance***:

$$
\begin{align}
\sigma_B^2(k) &= P_1(k) (m_1(k) - m_G)^2 + P_2(k) (m_2(k) - m_G)^2 \\
&= P_1(k) P_2(k) \left(m_1(k) - m_2(k)\right)^2 \\
&= \frac{(m_G P_1(k) - m(k))^2}{P_1(k)(1 - P_1(k))}
\end{align}
$$

where $m(k) = \sum_{i=0}^{k} i\,p_i$ is the cumulative mean up to level $k$. The farther the two means $m_1$ and $m_2$ are from each other, the larger $\sigma_B^2$ will be, implying that the between-class variance is a measure of separability between classes. Because $\sigma_G^2$ does not depend on $k$, maximizing $\eta$ is equivalent to maximizing $\sigma_B^2(k)$.

The objective of Otsu's method is to determine the threshold $k^*$ that maximizes the between-class variance $\sigma_B^2$:

$$
k^* = \underset{0 \le k \le L-1}{\arg \max} \ \sigma^2_B (k)
$$

To find $k^*$ we simply evaluate this equation for all integer values of $k$ (subject to the condition $0 < P_1(k) < 1$). Once $k^*$ has been obtained, the input image $f(x, y)$ is segmented as

$$
\begin{align}
g(x, y) = \begin{cases} 1 & \text{if } f(x, y) > k^* \\ 0 & \text{if } f(x, y) \le k^* \end{cases}
\end{align}
$$
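A minimal sketch of Otsu's method operating purely on the normalized histogram, using the third form of $\sigma_B^2(k)$ above (`normalized_histogram` is the hypothetical helper sketched earlier):

```python
import numpy as np

def otsu_threshold(p: np.ndarray) -> int:
    """Return k* maximizing the between-class variance, given histogram p."""
    L = p.size
    i = np.arange(L)
    P1 = np.cumsum(p)       # cumulative sum P1(k)
    m = np.cumsum(i * p)    # cumulative mean m(k)
    mG = m[-1]              # global mean
    # sigma_B^2(k) = (mG*P1(k) - m(k))^2 / (P1(k)*(1 - P1(k))),
    # defined only where 0 < P1(k) < 1.
    valid = (P1 > 0) & (P1 < 1)
    sigma_B2 = np.zeros(L)
    sigma_B2[valid] = (mG * P1[valid] - m[valid]) ** 2 \
        / (P1[valid] * (1 - P1[valid]))
    return int(np.argmax(sigma_B2))

# Usage sketch:
# k = otsu_threshold(normalized_histogram(img))
# g = (img > k).astype(np.uint8)
```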
### Using Image Smoothing To Improve Global Thresholding

When noise cannot be reduced at the source, and thresholding is the preferred segmentation method, a technique that often enhances performance is to smooth the image prior to thresholding:

![](https://i.imgur.com/1miqKl1.png)

In the figure above, the first row shows the raw image, its histogram, and the result of Otsu's method. The second row is obtained by smoothing the noisy image with a $5 \times 5$ averaging kernel before thresholding. The slight distortion of the boundary between object and background in the segmented, smoothed image was caused by the blurring of the boundary.

However, this method can fail if the region of interest is so small that its contribution to the histogram is insignificant compared to the intensity spread caused by noise:

![](https://i.imgur.com/MUmCMjS.png)

### Using Edges to Improve Global Thresholding

One approach for improving the shape of histograms is to consider only those pixels that **lie on or near the edges** between objects and the background. $\Rightarrow$ Histograms should then be less dependent on the relative sizes of objects and background.

Since we do not know the exact location of the edges, an indication of whether a pixel is on an edge may be obtained by **computing its gradient or Laplacian**.

The preceding discussion is summarized in the following algorithm, where $f(x, y)$ is the input image:

1. Compute an edge image (the magnitude of the gradient, or the absolute value of the Laplacian) of $f(x, y)$.
2. Specify a threshold value, $T$. It is customary to specify the value of $T$ to correspond to a percentile, which typically is set high (e.g., in the high 90s) so that few pixels in the gradient/Laplacian image will be used in the computation.
3. Threshold the image from Step 1 using $T$ from Step 2 to produce a binary image, $g_T(x, y)$. This image is used as a mask in the following step to select pixels from $f(x, y)$ corresponding to *"strong" edge pixels* in the mask.
4. Compute a histogram using only the pixels in $f(x, y)$ that correspond to the locations of the 1-valued pixels in $g_T(x, y)$.
5. Use the histogram from Step 4 to segment $f(x, y)$ globally using, for example, Otsu's method.

### Multiple Thresholds

Otsu's method can be extended to an arbitrary number of thresholds. In the case of $K$ classes, $c_{1}, c_{2}, \ldots, c_{K}$, the between-class variance generalizes to the expression

$$
\sigma_{B}^{2}=\sum_{k=1}^{K} P_{k}\left(m_{k}-m_{G}\right)^{2}
$$

where

$$
P_{k}=\sum_{i \in c_{k}} p_{i}
$$

and

$$
m_{k}=\frac{1}{P_{k}} \sum_{i \in c_{k}} i\, p_{i}
$$

As before, $m_{G}$ is the global mean. The $K$ classes are separated by $K-1$ thresholds whose values, $k_{1}^{*}, k_{2}^{*}, \ldots, k_{K-1}^{*}$, are the values such that

$$
\sigma_{B}^{2}\left(k_{1}^{*}, k_{2}^{*}, \ldots, k_{K-1}^{*}\right)=\max _{0<k_{1}<k_{2}<\cdots<k_{K-1}<L-1} \sigma_{B}^{2}\left(k_{1}, k_{2}, \ldots, k_{K-1}\right)
$$

Although this result is applicable to an arbitrary number of classes, it begins to lose meaning as the number of classes increases, because we are dealing with only one variable (intensity).

## Variable Thresholding

### Variable Thresholding Based on Local Image Properties

A basic approach to variable thresholding is to compute a threshold at every point, $(x, y)$, in the image based on one or more specified properties in a neighborhood of $(x, y)$. For example, we can use the mean and standard deviation of the pixel values in a neighborhood of every point in an image. Let $m_{xy}$ and $\sigma_{xy}$ denote the mean and standard deviation of the set of pixel values in a neighborhood $S_{xy}$ centered at coordinates $(x, y)$. The following are two common forms of variable thresholds based on local image properties:

$$
T_{xy} = a \sigma_{xy} + b m_{xy}
$$

where $a$ and $b$ are nonnegative constants, and

$$
T_{xy} = a \sigma_{xy} + b m_{G}
$$

where $m_{G}$ is the global mean. The segmented image for input image $f(x, y)$ is computed as

$$
g(x, y) = \begin{cases} 1 & \text{ if } f(x, y) > T_{xy} \\ 0 & \text{ if } f(x, y) \le T_{xy} \end{cases}
$$

Both of these forms are special cases of the more general form:

$$
g(x, y) = \begin{cases} 1 & \text{ if } Q(\text{local parameters}) \text{ is TRUE} \\ 0 & \text{ if } Q(\text{local parameters}) \text{ is FALSE} \end{cases}
$$

where $Q$ is a predicate based on parameters computed using the pixels in neighborhood $S_{xy}$.
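A minimal sketch of the first form, $T_{xy} = a\sigma_{xy} + b m_{xy}$, computing the local statistics with box filters (the neighborhood size, constants, and the use of `scipy.ndimage.uniform_filter` are assumptions for illustration):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def variable_threshold(img: np.ndarray, size: int = 15,
                       a: float = 1.0, b: float = 1.0) -> np.ndarray:
    """Segment img using T_xy = a*sigma_xy + b*m_xy over size x size neighborhoods."""
    f = img.astype(np.float64)
    m = uniform_filter(f, size)  # local mean m_xy
    # Local variance = E[f^2] - (E[f])^2; clip tiny negatives from round-off.
    var = np.clip(uniform_filter(f * f, size) - m * m, 0, None)
    T = a * np.sqrt(var) + b * m  # local threshold T_xy
    return (f > T).astype(np.uint8)
```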
## Region Segmentation Using Superpixels

The idea behind superpixels is to replace the standard pixel grid by **grouping pixels into primitive regions that are more perceptually meaningful** than individual pixels. For example, in the following figure, the left image is the original $600 \times 480$ image. The right image is composed of $4000$ superpixels, whose boundaries are shown (in white) in the middle image:

![](https://i.imgur.com/UYGKPZf.png)

Whether the superpixel representation is "adequate" depends on the application. For instance, if the objective is to detect imperfections at pixel-level resolution, then the answer obviously is no.

### SLIC Superpixel Algorithm

Simple linear iterative clustering (SLIC) is a modification of the k-means algorithm. It typically uses (but is not limited to) 5-dimensional vectors containing three color components and two spatial coordinates. For example, if we are using the RGB color system, the 5-dimensional vector associated with an image pixel has the form

$$
\mathbf{z} = \begin{bmatrix} r \\ g \\ b \\ x \\ y \end{bmatrix}
$$

where $(r, g, b)$ are the three color components of a pixel, and $(x, y)$ are its two spatial coordinates.

Let $n_{sp}$ denote the desired number of superpixels and let $n_{tp}$ denote the total number of pixels in the image. The initial superpixel centers, $\mathbf{m}_i = [r_i \ g_i \ b_i \ x_i \ y_i]^T$, $i = 1, 2, \ldots, n_{sp}$, are obtained by sampling the image on a regular grid spaced $s$ units apart. To generate superpixels approximately equal in size (i.e., area), the grid spacing interval is selected as $s = [n_{tp} / n_{sp}]^{\frac{1}{2}}$. To prevent centering a superpixel on the edge of the image, and to reduce the chances of starting at a noisy point, the initial cluster centers are **moved to the lowest gradient position in the $3 \times 3$ neighborhood about each center**.

The SLIC superpixel algorithm consists of the following steps (a sketch of the initialization step follows the list). Keep in mind that superpixels are vectors in general; when we refer to a "pixel" in the algorithm, we are referring to the $(x, y)$ location of the superpixel vector relative to the image.

1. **Initialize the algorithm.** Compute the initial superpixel cluster centers,
$$
\mathbf{m}_{i}=\left[\begin{array}{lllll} r_{i} & g_{i} & b_{i} & x_{i} & y_{i} \end{array}\right]^{T}, \quad i=1,2, \ldots, n_{sp}
$$
by sampling the image at regular grid steps, $s$. **Move the cluster centers to the lowest gradient position in a $3 \times 3$ neighborhood.** For each pixel location, $p$, in the image, set a label $L(p) = -1$ and a distance $d(p) = \infty$.
2. **Assign samples to cluster centers.** For each cluster center $\mathbf{m}_{i}$, $i = 1, 2, \ldots, n_{sp}$, **compute the distance, $D_{i}(p)$, between $\mathbf{m}_{i}$ and each pixel $p$ in a $2s \times 2s$ neighborhood about $\mathbf{m}_{i}$.** Then, for each $p$ and $i = 1, 2, \ldots, n_{sp}$, if $D_{i} < d(p)$, let $d(p) = D_{i}$ and $L(p) = i$.
3. **Update the cluster centers.** Let $C_{i}$ denote the set of pixels in the image with label $L(p) = i$. Update $\mathbf{m}_{i}$:
$$
\mathbf{m}_{i}=\frac{1}{\left|C_{i}\right|} \sum_{\mathbf{z} \in C_{i}} \mathbf{z}, \quad i=1,2, \ldots, n_{sp}
$$
where $\left|C_{i}\right|$ is the number of pixels in set $C_{i}$.
4. **Test for convergence.** Compute the Euclidean norms of the differences between the mean vectors in the current and previous steps. Compute the residual error, $E$, as the sum of the $n_{sp}$ norms. If $E < T$, where $T$ is a specified nonnegative threshold, go to Step 5; else, go back to Step 2.
5. **Post-process the superpixel regions.** Replace all the pixels in each region, $C_{i}$, by their average value, $\mathbf{m}_{i}$.
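A minimal sketch of the initialization in Step 1: grid sampling of the centers, followed by the move to the lowest-gradient position in a $3 \times 3$ neighborhood (the gradient approximation via `np.gradient` and the function name are assumptions for illustration):

```python
import numpy as np

def init_centers(img: np.ndarray, n_sp: int) -> list:
    """Place initial SLIC centers on a grid, nudged to low-gradient positions."""
    H, W, _ = img.shape
    s = int(np.sqrt(H * W / n_sp))       # grid spacing s = (n_tp / n_sp)^(1/2)
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    grad = gx ** 2 + gy ** 2             # squared gradient magnitude
    centers = []
    for y in range(s // 2, H - 1, s):
        for x in range(s // 2, W - 1, s):
            # Search the 3x3 neighborhood for the lowest-gradient pixel.
            ys, xs = np.mgrid[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            k = np.argmin(grad[ys, xs])
            yy, xx = ys.ravel()[k], xs.ravel()[k]
            # Center vector m_i = [r, g, b, x, y].
            centers.append(np.array([*img[yy, xx], xx, yy], dtype=np.float64))
    return centers
```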
It would be senseless to use a single Euclidean distance over all five components, because the scales of the color axes and the spatial axes are different and unrelated. Thus, the authors of SLIC proposed that spatial and color distances **be treated separately**. $\Rightarrow$ This is accomplished by normalizing the two distances, then combining them into a single measure. Let $d_c$ (color Euclidean distance) and $d_s$ (spatial Euclidean distance) be defined as follows:

$$
d_{c}=\left[\left(r_{j}-r_{i}\right)^{2}+\left(g_{j}-g_{i}\right)^{2}+\left(b_{j}-b_{i}\right)^{2}\right]^{1 / 2}
$$

and

$$
d_{s}=\left[\left(x_{j}-x_{i}\right)^{2}+\left(y_{j}-y_{i}\right)^{2}\right]^{1 / 2}
$$

We then define $D$ as the composite distance

$$
D=\left[\left(\frac{d_{c}}{d_{c m}}\right)^{2}+\left(\frac{d_{s}}{d_{s m}}\right)^{2}\right]^{1 / 2}
$$

where $d_{cm}$ and $d_{sm}$ are the maximum expected values of $d_c$ and $d_s$, respectively.

Because no provision is made in the algorithm to enforce connectivity, it is possible for isolated pixels to remain after convergence. These are assigned the label of the nearest cluster using a connected-components algorithm.
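A minimal sketch of the composite distance $D$ between a cluster center and a pixel, both represented as 5-D vectors $[r, g, b, x, y]$ (the function name is an assumption; in practice $d_{sm}$ is commonly set to the grid spacing $s$, and $d_{cm}$ to a user-chosen constant controlling compactness):

```python
import numpy as np

def slic_distance(z_i: np.ndarray, z_j: np.ndarray,
                  d_cm: float, d_sm: float) -> float:
    """Composite distance D between two 5-D vectors [r, g, b, x, y]."""
    d_c = np.linalg.norm(z_j[:3] - z_i[:3])  # color Euclidean distance
    d_s = np.linalg.norm(z_j[3:] - z_i[3:])  # spatial Euclidean distance
    return np.hypot(d_c / d_cm, d_s / d_sm)  # sqrt of sum of normalized squares
```

For experimentation, a library implementation such as `skimage.segmentation.slic` (which exposes `n_segments` and a `compactness` parameter playing the role of the $d_{cm}$/$d_{sm}$ normalization) is a convenient alternative to writing the full iteration by hand.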