# Assignment 1. Eye tracking

**IAML 2021. MSc Group E.**

## Introduction

In this report we discuss our implementation of an eye tracker that estimates the gaze position on a screen given a picture of an eye and a calibration set of pictures. As proposed in the assignment description, the overall process is, first, to detect the location of the pupil and the glints in an image of an eye. We then combine this information with the calibration dataset to obtain a projective transformation and compute the coordinates of the gaze estimate. After that, we use different evaluation metrics, whose results are presented, to measure the detectors' performance. For the last part of the assignment, we decided to improve our model by normalizing the position of the pupil with respect to the glints, in order to take head motion into account. This relies on a glint detector, which we also describe.

## Pupil detection

In this section we go through our implementation of a pupil detector. We start by discussing the binarization, then the preprocessing, and finally the results of the detector.

### Binarization

As in previous weekly exercises, we first binarized the supplied image by applying a threshold. An important observation is that the pupil is very dark, so a method like Otsu's may not be as helpful as it is for other features. Instead, we used a simple threshold that classifies everything below 50 (dark values) as 0 and everything else as 1, since the pupil is very dark compared to every other part of the image. Below you can see an example of a result of our binarization.

![](https://i.imgur.com/q1EWI0S.jpg)

We can see that the dark parts of the pupil we are interested in are captured by this thresholding across the different datasets, but the result also includes some noise that we want to get rid of.

### Preprocessing

Once we had a satisfactory binary image, a preprocessing step was applied to get rid of unwanted features captured in the previous step, such as the dark noise to the left of the pupil. We applied the closing operation (a dilation followed by an erosion) on our binary image, using an ellipse-shaped kernel. Since the pupil and the noise are dark, the dilation of the white background removes the small dark noise blobs and shrinks the pupil: it is enough for one of the pixels under the structuring element to be white for the output pixel to become white. The erosion then grows the pupil back to approximately its original size.

### Candidates scoring

After the preprocessing, the next step was to identify the contours in the image and give each of them a score based on its circularity. Each contour was first replaced by its convex hull (the smallest convex shape containing it), which restores connectivity that the thresholding or the preprocessing may have removed. The score of each candidate was then calculated by computing the area of its minimum enclosing circle and dividing it by the actual area of the contour. Only contours with a radius in the range 10–60 pixels were included in the scoring, to remove as much noise as possible that might otherwise give wrong results. The candidate whose score was closest to 1, and therefore the most circular BLOB, was taken as the result.

### Results and evaluation

To evaluate the correctness of our pupil estimation, and given that we are supplied with the ground truth / expected results for the datasets, we used two different metrics, described below: a distance metric and an intersection-over-union metric.
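Before turning to the metrics, the following is a minimal sketch of the detection steps described above (fixed threshold, morphological closing, convex hull, circularity scoring), assuming OpenCV 4 and NumPy; the function name and parameter values shown (e.g. the kernel size) are illustrative rather than our exact code.

```python
import cv2
import numpy as np

def detect_pupil(gray):
    """Sketch of the pupil detector: fixed threshold, closing, circularity scoring."""
    # Binarize: the pupil is very dark, so everything below the threshold becomes 0.
    _, binary = cv2.threshold(gray, 50, 255, cv2.THRESH_BINARY)

    # Closing (dilation then erosion) with an elliptical kernel: the dilation of the
    # white background removes small dark noise and shrinks the dark pupil, and the
    # erosion grows the pupil back to roughly its original size.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # The pupil is a dark blob, so look for contours on the inverted image.
    contours, _ = cv2.findContours(255 - closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    best, best_score = None, np.inf
    for contour in contours:
        hull = cv2.convexHull(contour)                 # restore lost connectivity
        (x, y), radius = cv2.minEnclosingCircle(hull)
        if not (10 <= radius <= 60):                   # size filter
            continue
        area = cv2.contourArea(hull)
        if area == 0:
            continue
        score = (np.pi * radius ** 2) / area           # 1.0 for a perfect circle
        if abs(score - 1.0) < abs(best_score - 1.0):
            best, best_score = hull, score

    # Model the pupil as the ellipse fitted to the best candidate (needs >= 5 points).
    return cv2.fitEllipse(best) if best is not None and len(best) >= 5 else None
```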
#### Distance metric

For this metric, the Euclidean distance between the ground-truth pupil center and our estimated point was computed. We expect a good detection to have a value close to zero.

##### Results for dataset `pattern0`

The following is the histogram of the errors of each image for this dataset. On the same plot we can also see the mean and the median of the error values. In the appendix you can see a table with the distance between the estimated value and the ground truth for each image of the dataset.

![](https://i.imgur.com/xxBAaf7.png)

##### Results for all datasets

In order to have a more global comparison, and to help us decide which parts of our detection to change, we also generated the same histogram, this time taking into account all the datasets, together with the mean and median values.

![](https://i.imgur.com/M3998T8.png)

#### Intersection over union metric

For this metric, we want the area of the intersection of the predicted pupil and the ground-truth pupil divided by the area of their union. Since the pupil is modeled as an ellipse, calculating the areas of the intersection and the union analytically is not straightforward: we would have to take into account the different ways two ellipses may intersect and compute the area for each case, as further described [here](https://www.geometrictools.com/Documentation/AreaIntersectingEllipses.pdf). This seemed impractical, so we took a discrete approach instead. We rasterized both ellipses into a matrix of the same size as the image, filling each ellipse with `1`s and adding the two, so that the resulting matrix holds `0` where neither ellipse lies, `1` where exactly one of them lies, and `2` where both of them lie. Counting these entries gives the required intersection and union areas. For this metric, the detection whose value is closest to 1 is evaluated as best, since ideally the areas of the intersection and the union are the same.

##### Results for dataset `pattern0`

The following is the histogram of the errors of each image for this dataset. On the same plot we can also see the mean and the median of the error values.

![](https://i.imgur.com/VRYbyWw.png)

##### Results for all datasets

In order to have a more global comparison, the following histogram represents the errors taking into account all the datasets, together with the mean and median values.

![](https://i.imgur.com/BcaCeMd.png)

### Discussion of our results

Both metrics showed that our estimated values are close to the ground-truth values. The largest difference in the center positions was around 10 px, which we expect to be negligible for the later stages. An important improvement was the use of the convex hull (the smallest convex shape containing a given contour) in the candidates scoring: without it, some of the detected contours would lose connectivity through the thresholding or the preprocessing and yield smaller pupils than the real ones. Some of the outliers of the intersection-over-union metric turned out to be actual errors in the provided ground truth.

## Glints detection

We now describe the process we followed for glint detection, as well as our results and evaluation.

### Binarization and Preprocessing

In contrast to the pupil detector, here we are looking for high intensity values, so a higher threshold (190) was used. For the preprocessing stage we used a dilation operation.
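As a concrete illustration of these two steps, here is a minimal sketch assuming OpenCV 4 and NumPy; the kernel size and the helper name are illustrative rather than the exact values used.

```python
import cv2
import numpy as np

def glint_candidates(gray):
    """Sketch of the glint binarization and dilation steps."""
    # Keep only very bright pixels (the glints are reflections of the light sources).
    _, binary = cv2.threshold(gray, 190, 255, cv2.THRESH_BINARY)

    # Dilation enlarges the bright regions so that nearby bright fragments merge;
    # the area of the resulting contours can then be used as a filtering criterion.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    dilated = cv2.dilate(binary, kernel)

    # Each remaining bright blob is a glint candidate, represented by its center.
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = [c.reshape(-1, 2).mean(axis=0) for c in contours]
    return centers, contours
```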
A dilation operation on a binary image enlarges its white regions, which here are our glint candidates together with other bright parts of the input image. The idea behind using a dilation is that the bright parts of the image that are not glints tend to be close to each other but slightly separated, which makes them potential glint candidates; the dilation joins these parts so that the area of the detected contours can later be used as a deciding criterion.

### Filtering candidates

We experimented with different ways of filtering glint candidates. We used the center of the detected pupil from the previous section, since the glints should be somewhere near the pupil. This was the first and most relevant criterion. After having removed many candidates with this criterion, we also used the area of the contour, since glints have a small area. However, we determined empirically that this was not of much use, since the remaining false candidates had a small area as well.

Once several glint candidates had been removed by the above heuristics, we needed the best four (at most) glints. To decide which glints were the best, we initially used the distance to the pupil. However, upon further exploration in later stages we ended up using a different criterion: we decided that the best glints are the ones that lie relatively close to each other. To determine this, we scored each candidate by the sum of its distances to the other candidates and chose the ones with the smallest scores.

### Results and evaluation

We used two metrics to compare our estimations against the ground truth: a count metric and a distance metric.

The count metric is simply the difference between the number of detected glints and the number of ground-truth glints, without taking their positions into account. We expect values between 0 and 4, and ideally most of them should be zero. This is confirmed by the following histogram, which takes into account the errors for all datasets where glint information was supplied.

![](https://i.imgur.com/72CuCnv.png)

For the distance metric we had two variants. Given two sets of glints, we calculate the distance between them by summing, for each glint of one set, the shortest distance to the glints of the other set. When the two sets have different counts, this sum depends on which set we start from, i.e. the metric is not symmetric.

If we sum the errors as described above, starting from the estimated values, we also take into account the differences for the estimations that have a different number of detected glints. That is, the count metric and this metric are correlated, and it increases when false positive glints are detected. We denote this metric the **distance metric - estimated first**.

![](https://i.imgur.com/d7Tn3wO.png)

If we instead start from the ground truth and measure against our estimated values, the metric only increases for erroneously placed predictions and not for false positives, so we expect lower values here. We denote this metric the **distance metric - truth first**.

![](https://i.imgur.com/5uoDVMX.png)

We see that the difference between the two metrics is as expected. Furthermore, we give evidence of the correlation between the **distance - estimated first** and the **count** metric by showing a fragment of the compared values in the table below.
|Image |Count metric |Distance metric - estimated first|
|------|-------------|---------------------------------|
|0.jpg |0 |1.88316426277161 |
|1.jpg |0 |1.47491744912233 |
|2.jpg |1 |16.1906830997113 |
|3.jpg |0 |1.62030698106634 |
| ...  |...|...|
|8.jpg |1 |35.2174846375548 |
|9.jpg |0 |1.8151300274833 |
|10.jpg|0 |1.74764792835922 |
|11.jpg|0 |1.98363907154841 |
|12.jpg|0 |1.03532906021628 |
|13.jpg|1 |36.9189292304236 |
|14.jpg|1 |17.214714251956 |
|15.jpg|0 |2.12433386705972 |
| ...  |...|...|
|29.jpg|0 |2.17750729776007 |
|30.jpg|0 |1.84060511922833 |
|31.jpg|0 |0.717044498111303 |
|32.jpg|0 |0.543437015716397 |
|33.jpg|1 |16.6711243285558 |
|34.jpg|0 |1.04809332947741 |
|35.jpg|1 |18.1650744576515 |
|36.jpg|1 |19.9422053482441 |
|37.jpg|1 |37.1045007350381 |
| ...  |...|...|

### Discussion of our results

The glint detection has considerably larger errors and, as we experienced in later stages, a very good glint detection is crucial for a correct normalization.

Other ideas were explored to make our glint detector better, such as building an iris detector and using it as a mask for further glint filtering. This was proposed because we observed that some of the wrong glint matches lay outside the iris. However, due to time constraints we did not implement this filter.

## Gaze estimation

The process we used for our gaze estimation begins with calculating the position of the pupil's center in a given image, using the method for finding pupils described above. The pupil center is then written as a (homogeneous) column vector and multiplied by the homography matrix estimated between the pupil centers and the gaze positions of a set of calibration images for which the ground-truth gaze positions are known. After converting back to Euclidean coordinates, this gives our estimate of the gaze position for the given image.

### Evaluation

In order to evaluate the correctness of our model, we used the distance between the ground truth and our estimations as our metric.

#### Results for datasets without head motion

The following histogram takes into account all the datasets without head motion: `pattern0`, `pattern1`, `pattern2`, `pattern3`.

![](https://i.imgur.com/o6n4O1D.png)
![](https://i.imgur.com/tApYd95.png)

#### Results for datasets with head motion

This model does not take head motion into account: two images may have the pupil in the same position while the subject looks at very different points of the screen, since the head is allowed to move. Therefore we expect considerable errors in our predictions here.

This is the histogram of errors for `moving_medium`.

![](https://i.imgur.com/45YJWyr.png)
![](https://i.imgur.com/eeZ8fd7.png)

This is the histogram of errors for `moving_hard`.

![](https://i.imgur.com/IgBbEmZ.png)
![](https://i.imgur.com/vSK4DOJ.png)

Taking into account both datasets:

![](https://i.imgur.com/ECWKwSj.png)
![](https://i.imgur.com/VDepCqq.png)

#### Overall results

The following is the histogram of errors for all datasets:

![](https://i.imgur.com/ZSlT7vh.png)
![](https://i.imgur.com/InGcUlU.png)

### Analysis of our results

As mentioned, this simple model, which maps through a single homography, does not take head motion into account, and this shows in the results.

### Pupil detection errors and gaze errors correlation

Intuitively, pupil detection errors and gaze estimation errors should be correlated, since errors when locating the pupil should be reflected as errors in the gaze estimation as well. As we saw in the first section, the pupil detector performed very well (a mean distance error of 0.68 px across all datasets).
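To check this quantitatively, the correlation and covariance matrices reported below can be computed directly from the per-image errors. The following is a minimal sketch assuming NumPy; `gaze_errors` and `pupil_errors` are hypothetical arrays holding one distance error per image, and the values shown are placeholders, not our data.

```python
import numpy as np

# Hypothetical per-image distance errors, aligned by image index.
# The values below are placeholders for illustration only.
gaze_errors = np.array([12.4, 85.1, 10.2, 9.7, 250.3])
pupil_errors = np.array([0.29, 3.57, 0.09, 0.03, 0.38])

corr = np.corrcoef(gaze_errors, pupil_errors)  # normalized, 2x2
cov = np.cov(gaze_errors, pupil_errors)        # unnormalized, 2x2
print(corr)
print(cov)
```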
The near-absence of pupil detection errors will bias what we can conclude about whether or not the errors are correlated. The correlation coefficients of the pupil errors and the gaze estimation errors were as follows:

\begin{equation}
corr =
\begin{bmatrix}
1.0 & -0.0835028035311163\\
-0.08350280353111629 & 1.0
\end{bmatrix}
\end{equation}

At first sight this would imply that there is little correlation between these errors; however, due to the magnitudes of the gaze errors and the pupil detection having almost no errors, this is not conclusive evidence. To put this into perspective, the following is the covariance matrix, which is not normalized:

\begin{equation}
cov =
\begin{bmatrix}
634268.8680457947 & -84.77361310186348\\
-84.77361310186348 & 1.6249719848923891
\end{bmatrix}
\end{equation}

The variances of the gaze estimation errors and the pupil detection errors are of very different magnitudes. Therefore these observations are not enough: the few pupil detection errors make them look unrelated to the gaze errors and appear random instead. Further evidence is given by the scatter plot of the gaze errors.

![](https://i.imgur.com/zPyWihU.png)

To further support the claim that the lack of errors in the pupil detection is the cause of the low correlation, we also show the correlation matrix for a faulty pupil detector we initially had, which produced errors on several datasets.

\begin{equation}
corr =
\begin{bmatrix}
1.0 & 0.9642382099922387\\
0.9642382099922387 & 1.0
\end{bmatrix}
\end{equation}

This shows that for a pupil detector with wrong detections, we indeed get wrong gaze estimations.

![](https://i.imgur.com/1fKQXc3.png)

## Improving our model: Gaze estimation under head motion

In the previous gaze estimation we only used the position of the pupil to apply a homography (based on the 9 calibration images). This clearly does not take head motion into account: a pupil in the same position in the image may be looking at different parts of the screen if the relative position of the head changes. To take this into account, the glints detected in the previous section can be used, since they come from light sources at the corners of the screen. The position of the pupil with respect to the glints therefore carries this relative information and is a better input for estimating the gaze position. This implies that our glint detection should be robust, and several further improvements were made to ensure this when tackling this stage; all details on glint detection can be found in the previous section.

Once we have the positions of the glints from our detector (a list of at most four coordinates), the coordinates of the pupil with respect to these glints are calculated via a transformation that normalizes the positions of the glints. This transformation depends on the number of detected glints. It is important to handle the correspondence between the detected points and the corners assigned during normalization carefully, since the output of our glint detector does not come in any guaranteed order, so extra care was taken with this correspondence. The following image shows some of those cases.

![](https://i.imgur.com/BAw3azb.jpg)

To make sure that the desired transformation was obtained, we previewed the transformation with target values that did not collapse to a corner, and checked that our detected glints ended up at the normalized positions and that the orientation of the image did not change.
The following image shows the application of such a transformation, used for debugging our construction of the required transformations.

![](https://i.imgur.com/6kO9GAg.png)

In order to determine the correspondence of the points with the corners, we used the following methods:

- When detecting 4 glints: the points were sorted in clockwise order through a less-than relation based on the cross product of the vectors from their centroid to the two compared points (the sign of the cross product determines their relative orientation), as further explained [here](https://stackoverflow.com/a/6989383/10997228). After such a sorting we still could not guarantee which point came first, only that they were in clockwise order, so we cyclically shifted the sorted values and took as first the one with the smallest distance to the origin (presumably the top-left corner). A sketch of this case is given after the discussion below.
- When detecting 3 or 2 glints: we came up with an algorithm to classify the glints into the possible corners they may represent (top-left, top-right, bottom-left, bottom-right). To do that, we considered the enclosing rectangle of the given points and enumerated the possible assignments of those points to corners. Out of those, we chose the assignment that minimized the distances to the suggested corners. This is feasible because we have a small number of points and therefore of possible assignments (at most 4! = 24). This method should also work for 4 points when they form a convex polygon.

Once the correspondence was established, the transformation used to bring the pupil center into normalized coordinates was the one that can be fitted with that number of observations, that is:

- Homography for 4 points
- Affine transformation for 3 points
- Similarity (rigid motion + scaling) for 2 points
- Translation for 1 point

This transformation was then used both in the calibration process and in the estimation process of our model. In the **calibration process**, before computing the same homography from the 9 calibration images, the pupil centers that we previously used were replaced by the coordinates of the pupil relative to the glints. In the **estimation process**, the detected pupil was normalized in the same way before applying the model generated in the calibration.

### Results and comparison

#### Results for datasets without head motion

The following takes into account all the datasets without head motion: `pattern0`, `pattern1`, `pattern2`, `pattern3`.

![](https://i.imgur.com/jmQg76r.png)
![](https://i.imgur.com/sEVCGkJ.png)

#### Results for datasets with head motion

Thanks to the normalization of the pupil position with respect to the glints, we should not see significant differences with or without head motion. Histogram for the moving datasets:

![](https://i.imgur.com/MYZOz5u.png)
![](https://i.imgur.com/r5tqs8B.png)

#### Overall results

The following is the histogram of errors for all datasets:

![](https://i.imgur.com/VslJ9Uh.png)
![](https://i.imgur.com/7pxOUNN.png)

#### Discussion of our results

We can see that there are indeed no big differences between the estimations with and without head motion. However, there is a considerable error overhead, which is most probably due to the normalization when not exactly 4 or 3 glints are detected. When only 2 glints are detected, assigning them to the wrong corners may introduce an undesired rotation that moves the whole prediction; and when only 1 glint is detected, a translation gives too little information, all the more so if the other transformations were significantly rescaling the coordinates by collapsing them towards a corner.
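As referenced above, the following is a minimal sketch of the four-glint case (clockwise sorting around the centroid, then a homography onto the unit square), assuming OpenCV 4 and NumPy; the names and the choice of unit-square corners are illustrative, and the angle-based sort is used here as a simple stand-in for the cross-product comparison described in the text.

```python
import cv2
import numpy as np

def sort_glints_clockwise(glints):
    """Sort four glints clockwise around their centroid, starting from the
    point closest to the image origin (presumably the top-left corner)."""
    pts = np.asarray(glints, dtype=np.float32)
    centroid = pts.mean(axis=0)
    # In image coordinates (y pointing down) an increasing angle around the
    # centroid corresponds to a clockwise sweep.
    angles = np.arctan2(pts[:, 1] - centroid[1], pts[:, 0] - centroid[0])
    ordered = pts[np.argsort(angles)]
    start = int(np.argmin(np.linalg.norm(ordered, axis=1)))
    return np.roll(ordered, -start, axis=0)

def normalize_pupil(pupil_center, glints):
    """Express the pupil center relative to four glints by mapping the glints
    onto the unit square with a homography."""
    src = sort_glints_clockwise(glints)
    # Unit-square corners in the same clockwise, top-left-first order.
    dst = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float32)
    H, _ = cv2.findHomography(src, dst)
    p = np.array([[pupil_center]], dtype=np.float32)   # shape (1, 1, 2)
    return cv2.perspectiveTransform(p, H)[0, 0]
```

For 3, 2 or 1 detected glints, the analogous affine, similarity, or translation fit listed above would take the place of the homography.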
A proposal could be to apply this prediction only when there is enough glint information, and to use other heuristics, or the previous method, when there is not. This would result in two models that are applied as appropriate depending on the number of detected glints. Extra work could also go into a better glint detector, since this normalized gaze estimator depends strongly on it. As explained in the glint detection section, this was explored but discarded due to time constraints.

## Conclusion

The process of binarization, preprocessing and scoring is a useful guideline for extracting simple local features. However, extra knowledge about the expected data is also useful, as are other, more global considerations (as explored in our glint detection).

The simple homography gaze estimator works very well for datasets where the head does not move. However, as expected, the moving datasets pose a challenge, since the model does not have enough information to handle them. A perfect glint detector would solve the problem of head motion, since another homography would normalize the detected pupil center, which could then be used as before. However, the glint detection we implemented was far from perfect, and the errors from an incorrect glint detection propagate considerably to the normalized estimator. Ideas to further improve the glint detection were given, but a different approach to the glint problem could also be taken: since we expect the glints to form a (possibly rotated) rectangle, we could fit a rectangle to the observed data by minimizing some global cost function over the positions of its vertices. This would remove the problem of detecting a varying number of glints and would guarantee a more consistent input for the normalized detector. This, however, is out of the scope of this assignment.

We also analysed the errors under different metrics and saw that there is a strong correlation between pupil estimation errors and gaze estimation errors. This also showed the importance of having enough variance in the data one wants to test for correlation: initially, with little error in the pupils, there was not enough evidence of correlation in the samples.
## Appendix

### Distance error for dataset `pattern0`

|Image |Euclidean distance in pixels|
|------|----------------------------|
|0.jpg |0.2850407618022011 |
|1.jpg |3.5693453860386146 |
|2.jpg |0.09248085062175729 |
|3.jpg |0.025567393098090894|
|4.jpg |0.3840287518106894 |
|5.jpg |0.8655679423856159 |
|6.jpg |0.3557705151807884 |
|7.jpg |0.40184078821271374 |
|8.jpg |0.037669067560823874|
|9.jpg |0.14446877808011546 |
|10.jpg|0.23026619360678774 |
|11.jpg|0.3763791458638298 |
|12.jpg|0.37049327881158445 |
|13.jpg|0.3173996920865411 |
|14.jpg|0.055167925537050314|
|15.jpg|0.0650537870109312 |
|16.jpg|0.14619032990639755 |
|17.jpg|0.6445212804138862 |
|18.jpg|0.8048652092754282 |
|19.jpg|0.3911686919342633 |
|20.jpg|0.10670340706469153 |
|21.jpg|0.15252443138415794 |
|22.jpg|0.7283664276060756 |
|23.jpg|0.1338691236939231 |
|24.jpg|0.28518610301585806 |
|25.jpg|0.22471696101418026 |
|26.jpg|0.7324862911189884 |
|27.jpg|0.21243694505902427 |
|28.jpg|0.3930392312198671 |
|29.jpg|0.28216353747247747 |
|30.jpg|0.10304235398748228 |
|31.jpg|0.10954623038452266 |
|32.jpg|1.2920263798399654 |
|33.jpg|0.41021020284656634 |
|34.jpg|0.2532081291078828 |
|35.jpg|0.1992646759287364 |
|36.jpg|0.3221087059261082 |
|37.jpg|0.16546949779626996 |
|38.jpg|0.4881800737462021 |
|39.jpg|0.13201144598249645 |
|40.jpg|0.2770591525654215 |
|41.jpg|2.5631879264369664 |
|42.jpg|0.47805033658677815 |
|43.jpg|0.17546243697539354 |
|44.jpg|0.3069001882830804 |