---
title: 'Computer vision project'
disqus: hackmd
---

Group 14 - Eye-tracking on your laptop
===

![](https://hackmd.io/_uploads/HyouOrwDh.png)

## Students (Group 14)

Nanami Hashimoto, 4779495, N.hashimoto@student.tudelft.nl
Remco Huijsen, 5650844, R.Huijsen@student.tudelft.nl

## Table of Contents

[TOC]

## Sources

The Python scripts used for this project can be found in the following repository: https://github.com/TU-RHuijsen/CS4245-Eye-Tracker-project

For privacy reasons, no images are included in this repository, nor are we willing to release them to the public for the foreseeable future.

## Introduction

Eye-tracking technology is used in various domains, such as studying human attention, interacting with a computer, and enhancing entertainment such as gaming. Traditionally, eye tracking has been done with a geometry-based approach, where the input data is matched against an existing eyeball model to estimate the gaze. Although this established method can provide high accuracy when implemented properly, it depends on a high-quality input image. It is also possible to use an infrared camera, but this requires additional hardware.

Nowadays, due to the rapid development of deep learning, a new, appearance-based approach is becoming more popular. This approach estimates the eye gaze purely from appearance. It does not necessarily require an external device (most mobile devices nowadays have a built-in webcam) and often provides good enough accuracy for gaze estimation. However, most models are trained on very large datasets, which might not always be possible.

In this blog, we investigate the possibility of developing an eye-tracking model with a smaller dataset generated on an ordinary laptop. We train on this dataset using an existing model architecture from the "Eye Tracking for Everyone" [1] paper and compare our results to those reported in that paper. Beyond that, we also investigate the influence of having more people in the dataset and the effect of visually differing people.

## Background research

Before diving into our research, we introduce some background information.

### Traditional approach of eye-tracking

Traditionally, eye tracking has been done with a geometry-based approach, where the input data is matched against an existing eyeball model to estimate the gaze. The geometric approach treats gaze as a vector in space that is jointly defined by the position and orientation of the eye in space e and the eye's rotation about its horizontal and vertical axes, ϕ and θ, respectively [2]. To accomplish this, calibration is needed to derive the optimal parameters that correlate the head and eye positions to known positions on the screen. Whilst this approach can perform well, it is quite dependent on the quality of the input image.

More high-end eye-tracking devices, like the Tobii [3] for instance, make use of an infrared camera to establish where someone is looking. Infrared light is emitted and reflected off the eyes, and the reflections are captured by one or multiple cameras. Combining the position of the light on the eyeball with the position of the pupil allows the system to estimate the direction of the gaze. Whilst this does require an external device, it also achieves more accurate performance than, for instance, geometry-based eye tracking with a regular camera.
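To make the geometric formulation above a bit more concrete, the sketch below builds a gaze direction from the two eye rotation angles and intersects it with the screen plane. The angle conventions, coordinate frame and function names are our own illustrative assumptions and are not taken from [2] or from any specific tracker.

```python
import numpy as np

def gaze_vector(theta, phi):
    """Unit gaze direction from the horizontal rotation theta and the vertical
    rotation phi (radians). Assumed convention: theta = yaw around the vertical
    axis, phi = pitch around the horizontal axis."""
    return np.array([
        np.cos(phi) * np.sin(theta),   # x: left/right
        np.sin(phi),                   # y: up/down
        np.cos(phi) * np.cos(theta),   # z: towards the screen
    ])

def screen_intersection(eye_pos, theta, phi, screen_z=0.0):
    """Point where the gaze ray starting at the eye centre `eye_pos` hits the
    plane z = screen_z (an assumed screen placement in the same frame)."""
    d = gaze_vector(theta, phi)
    t = (screen_z - eye_pos[2]) / d[2]   # ray parameter needed to reach the plane
    return eye_pos + t * d               # (x, y, screen_z) gaze point

# Example: an eye 50 cm in front of the screen, looking slightly left and up.
print(screen_intersection(np.array([0.0, 0.0, -50.0]), theta=0.1, phi=0.05))
```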
### Similar projects using an appearance-based approach

Nowadays, due to the rapid development of deep learning, a new, appearance-based approach is becoming more popular. This approach estimates the eye gaze purely from appearance. For example, the "Eye Tracking for Everyone" [1] paper is one of the most famous eye-tracking projects using the appearance-based approach. In this paper, the researchers created a neural network model called "iTracker" that can estimate the eye gaze using iPhone and tablet cameras. Trained on 1,490,959 frames from 1474 subjects in total, the model can estimate the eye gaze with an error as low as 1.34 cm on a mobile phone and 2.12 cm on a tablet. The paper also showed that a model trained on this large dataset generalizes well to other datasets.

## Our approach

Next, we explain our approach for this project. This includes the model we use, how we collected our own data, how we extracted the desired inputs from this data and how we trained the model.

### Model architecture

The model we use for this project is the model from the "Eye Tracking for Everyone" [1] paper, which can be seen below. It uses separate eye and face paths, which connect to the overarching model. In this figure, the abbreviation "CONV" indicates a convolutional layer and "FC" indicates a fully-connected layer. The original implementation used the Caffe deep learning framework and Matlab files for part of the model definition [4]. Since we are not experienced with Caffe ourselves, we decided to opt for a Keras and TensorFlow implementation of the paper [5].

![](https://hackmd.io/_uploads/HyA45TuD2.png)

### Dataset

To collect data for our experiments, we set up a Python script. This script displays images similar to the one shown below. The participant presses the space bar whilst looking at a "crosshair" to take a picture, after which the "crosshair" moves to a new location and the background changes. One might wonder why we change the background for every screen position. We found a similar approach [6], which stated that this is a good way to combat "corneal reflections" of the screen in the eyes themselves. Additionally, we noted that the amount of light coming from the screen also affects the amount of light or shadow on the face, creating more "natural" artefacts.

The captured picture is sent to an MTCNN model to detect the face and the locations of the eyes, which will be explained shortly. Knowing the position of the face within the entire image, we can also create a "face grid" as a black square on a white background. The pictures of the face, the individual eyes and the face grid are then scaled and saved to be used for training. These four images are saved in a folder, where the name of the folder contains the (x, y) coordinate based on the screen resolution, as well as a timestamp. Relating the position to the screen resolution is not identical to how the "Eye Tracking for Everyone" paper does it, because they express the position in centimetres from the camera. They need this because they take multiple devices into account. However, all our data is from a single device, a Lenovo Legion 5 laptop [7], which makes our method usable for our situation. The sampling of the points is mostly random, but the script does take into account which coordinates have already been sampled, even across different capture sessions with some time in between. This way, we minimize duplicate screen positions and maximize the number of unique screen positions used during training.
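As an illustration of this storage scheme, the sketch below shows how the face grid and folder naming could look. The grid size, crop resolution, file names and exact folder-name format are illustrative assumptions and may differ from the actual script in the repository.

```python
import os
import time
import numpy as np
from PIL import Image

GRID_SIZE = 25  # assumed face-grid resolution (iTracker uses a coarse 25x25 grid)

def make_face_grid(frame_w, frame_h, face_box, grid_size=GRID_SIZE):
    """Face grid: a black square marking the face location on a white background,
    at a coarse grid resolution, as described in the Dataset section."""
    x, y, w, h = face_box  # face bounding box in pixels within the webcam frame
    grid = np.full((grid_size, grid_size), 255, dtype=np.uint8)  # white background
    # Scale the bounding box from frame coordinates to grid coordinates.
    x0 = int(x / frame_w * grid_size)
    y0 = int(y / frame_h * grid_size)
    x1 = int((x + w) / frame_w * grid_size)
    y1 = int((y + h) / frame_h * grid_size)
    grid[y0:y1, x0:x1] = 0  # black square where the face is
    return Image.fromarray(grid)

def save_sample(out_root, screen_xy, face_img, left_eye_img, right_eye_img, face_grid):
    """Store the four crops (PIL images) in a folder whose name encodes the screen
    position in pixels and a timestamp, mirroring the naming scheme described above."""
    x, y = screen_xy
    folder = os.path.join(out_root, f"{x}_{y}_{int(time.time())}")
    os.makedirs(folder, exist_ok=True)
    face_img.resize((224, 224)).save(os.path.join(folder, "face.png"))
    left_eye_img.resize((224, 224)).save(os.path.join(folder, "left_eye.png"))
    right_eye_img.resize((224, 224)).save(os.path.join(folder, "right_eye.png"))
    face_grid.save(os.path.join(folder, "face_grid.png"))
```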
![](https://hackmd.io/_uploads/HJ7_1K_w2.gif)

### Preprocessing

Since we capture a full image from the webcam, we need to extract the face and the eyes from this image before passing them to the model. The "Eye Tracking for Everyone" paper uses the real-time face detector built into iOS for this, which is not available on a Windows laptop, for instance. Hence, we decided to use the MTCNN model [8], based on the 2016 "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks" paper [9]. The structure of this model is visible in the image below. It allows us to find the face in the image, which also gives us the positions of both individual eyes. Beyond cropping the eye and face regions and rescaling them, no further preprocessing steps are implemented in the current approach.

![](https://hackmd.io/_uploads/Ska3s6BPn.png)

### Training procedure

Following the "Eye Tracking for Everyone" [1] paper, we train with a learning rate of 0.001, a momentum of 0.9 and a weight decay of 0.0005 throughout the training procedure. The paper does not explicitly state which optimizer was used, but the momentum and weight decay values were encapsulated in an SGD optimizer in the Keras + TensorFlow implementation [5], so we decided to use that. We run each training sequence for 200 epochs with a batch size of 64. Also taken from the "Eye Tracking for Everyone" paper is the loss function, which is the Euclidean loss on the x and y gaze position. We apply this to positions truncated to the range 0 to 1 for both x and y, instead of using the full-resolution pixel positions. Since the paper reports error margins in centimetres, we also convert the truncated screen error into a centimetre error based on the dimensions of the capture device. The Lenovo Legion 5 laptop [7] has a screen resolution of 1920x1080 pixels and a screen size of 383x215 millimetres.
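To make this setup concrete, here is a minimal sketch of the Euclidean loss on truncated coordinates, the centimetre conversion and the optimizer configuration. `build_itracker`, `left_eyes`, `right_eyes`, `faces`, `face_grids` and `gaze_xy` are placeholders for the model construction and data loading from the Keras + TensorFlow implementation [5] and the preprocessing above; they are not actual identifiers from our repository.

```python
import tensorflow as tf

SCREEN_W_CM, SCREEN_H_CM = 38.3, 21.5   # Lenovo Legion 5 screen dimensions [7]

def euclidean_loss(y_true, y_pred):
    """Euclidean distance between the true and predicted gaze point,
    with x and y both truncated to [0, 1]."""
    return tf.reduce_mean(tf.norm(y_true - y_pred, axis=-1))

def cm_error(y_true, y_pred):
    """The same distance, rescaled to centimetres using the physical screen size."""
    scale = tf.constant([SCREEN_W_CM, SCREEN_H_CM])
    return tf.reduce_mean(tf.norm((y_true - y_pred) * scale, axis=-1))

# `build_itracker()` stands in for the four-input model from [5]: any Keras model
# mapping (left eye, right eye, face, face grid) to a truncated (x, y) works here.
model = build_itracker()

# The weight decay of 0.0005 is omitted from the call below: depending on the
# TensorFlow version it is passed as `weight_decay=0.0005` or added via kernel
# regularizers, as done in the implementation we based ourselves on.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
              loss=euclidean_loss,
              metrics=[cm_error])

model.fit([left_eyes, right_eyes, faces, face_grids], gaze_xy,
          batch_size=64, epochs=200, validation_split=0.25)
```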
## Results

For this project, we investigated the performance of the "Eye Tracking for Everyone" model on a smaller dataset. Beyond that, we also investigated how much the performance is affected by a dataset consisting of multiple people and by comparing visually differing people.

### How does the size of the dataset affect the performance of the model?

Normally, collecting as much data as possible is encouraged. For this specific application, that means collecting huge amounts of pictures of people's faces. The "Eye Tracking for Everyone" [1] paper created GazeCapture, a dataset consisting of almost 2.5 million frames of over 1450 people. This is of course very nice and novel, but it might not always be feasible to collect such huge amounts of data in our current society. Think about the rise of privacy laws, for instance, where such data may only be stored for a limited amount of time, or where people are given the possibility to have their data deleted after the fact. This begs the question: do we actually need such a large dataset? Would a smaller dataset work as well? To investigate this, we collected data from 11 different people. A subset of the data can be seen below.

![](https://hackmd.io/_uploads/rk-56suPh.png)

Something to keep in mind whilst collecting data is whether the data is evenly distributed. If you have a lot of data on, for instance, the left side of the screen, this can have an impact on your performance on the right side of the screen. The plot below shows each screen position for all the images in the dataset, where it can be noted that the screen is generally well covered.

![](https://hackmd.io/_uploads/SJd6JVdPh.png)

To evaluate the effect of adding data from different persons, we set out to train 11 different models. We start by training on images of just a single person and evaluate how well the model performs. From there, we start again, but now we add a subset of images from another person to the entire training set. This is repeated until we have trained and evaluated 11 different models for different amounts of people and data (a sketch of this incremental procedure is given after the metric descriptions below). During training, we take 25 percent of the data as validation data, which we use to get an idea of the error in centimetres for "unseen" data. The "Eye Tracking for Everyone" [1] paper states the following average error margins of other state-of-the-art approaches (in centimetres):

![](https://hackmd.io/_uploads/B1uVI2uv2.png)

For this experiment, we use the following metrics:

#### Number of people

This is the current number of different people in the entire dataset. Note that the persons included for a specific number are not random: for each new training run, a new person is added to the previous collection. This means, for instance, that the images of the first person are used in all eleven runs.

#### Total number of images

This is the total number of images in the entire set, so across all people currently in the dataset.

#### Min validation loss (truncated/cm)

For this specific experiment, we decided to use the minimum validation loss within a training run. Our reasoning behind this is that this is the single best performance that we achieved on the validation set, so that we can compare this "best performance" with the best performance mentioned in the paper. On the more pessimistic side, if the "best performance" is nowhere near the performances mentioned in the paper, this shows that the corresponding amount of data is lacking as well. We give this loss in the truncated form (where both x and y are scaled between 0 and 1) and as a centimetre loss, to easily compare it with the errors mentioned in the paper.

#### Epoch at which the minimum loss is achieved

This is the epoch at which the minimum validation loss was achieved. This gives an idea of how well the performance on the training data translates to the performance on the validation data. If the validation loss was lowest at a very early epoch, we can presume that the training data does not generalize well to the validation data.

#### Corresponding/Final training loss (truncated)

The corresponding training loss is the training loss at the epoch where the validation loss was lowest. The final training loss is the training loss at epoch 200, which in most cases should be a lower value. Together with the specific epoch at which the validation loss was lowest, this gives an idea of how the training procedure influences the validation loss. Both of these are given in the truncated form.
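The table columns above can be read off from the Keras training history. Below is a minimal sketch of how the eleven incremental runs could be set up; `person_data` is a placeholder for our per-participant data loading, and `build_itracker`, `euclidean_loss` and `cm_error` are the placeholders/helpers from the training-procedure sketch above, so this does not correspond one-to-one to the scripts in the repository.

```python
import numpy as np
import tensorflow as tf

# `person_data`: a list of 11 entries, one per participant, where each entry is
# ([left_eyes, right_eyes, faces, face_grids], gaze_xy) loaded from the dataset folders.
results = []
for n_people in range(1, len(person_data) + 1):
    subset = person_data[:n_people]                     # cumulative: persons 1..n
    inputs = [np.concatenate(parts) for parts in zip(*(p[0] for p in subset))]
    targets = np.concatenate([p[1] for p in subset])

    model = build_itracker()                            # fresh model for every run
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
                  loss=euclidean_loss, metrics=[cm_error])
    history = model.fit(inputs, targets, batch_size=64, epochs=200,
                        validation_split=0.25, verbose=0)

    best = int(np.argmin(history.history["val_loss"]))  # epoch index of min val loss
    results.append({
        "people": n_people,
        "images": len(targets),
        "min_val_loss_truncated": history.history["val_loss"][best],
        "min_val_loss_cm": history.history["val_cm_error"][best],
        "epoch_min_loss": best + 1,
        "corresponding_train_loss": history.history["loss"][best],
        "final_train_loss": history.history["loss"][-1],
    })
```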
The following table shows the outcome of our eleven training sessions:

| Number of persons | Total number of images | Min validation loss (truncated) | Min validation loss (cm) | #epoch min loss achieved | Corresponding training loss (truncated) | Final training loss (truncated) |
|---|---|---|---|---|---|---|
| 1 | 99 | 0.2827 | 8.4471 | 200 out of 200 | 0.2764 | 0.2764 |
| 2 | 206 | 0.2537 | 6.6711 | 178 out of 200 | 0.2579 | 0.2029 |
| 3 | 265 | 0.3112 | 7.5372 | 116 out of 200 | 0.2772 | 0.1753 |
| 4 | 326 | 0.2019 | 5.8021 | 199 out of 200 | 0.1993 | 0.1936 |
| 5 | 424 | 0.3683 | 10.8720 | 3 out of 200 | 0.4088 | 0.1769 |
| 6 | 527 | 0.2790 | 8.0478 | 173 out of 200 | 0.1827 | 0.1695 |
| 7 | 634 | 0.2715 | 8.0131 | 134 out of 200 | 0.2131 | 0.1911 |
| 8 | 753 | 0.2756 | 8.1194 | 105 out of 200 | 0.2462 | 0.1934 |
| 9 | 845 | 0.2596 | 7.7957 | 106 out of 200 | 0.2309 | 0.1881 |
| 10 | 939 | 0.2496 | 6.9706 | 104 out of 200 | 0.2229 | 0.1820 |
| 11 | 1110 | 0.2845 | 7.8994 | 37 out of 200 | 0.2757 | 0.1558 |

For readers specifically interested in the corresponding graphs for these training sessions, please have a look at appendix A.

If we look specifically at the minimum validation losses in centimetres, we see that this error is considerably higher than any of the errors listed in the paper. The most favourable comment would be that we are somewhat close to the "centre method", but this method entails simply predicting the middle of the screen, which is not exactly the highest of praise. Therefore, for this experiment, we cannot say that our model trained on a smaller dataset can compete with any of the methods listed in the paper.

### Does generalizability improve for a small dataset if the dataset contains multiple subjects, compared to, for example, a model trained on a single participant?

So far, we saw that the performance of an appearance-based eye tracker was very much dependent on the amount of data it saw during training. In principle, one might think that it would also be better to diversify this dataset with images of different people, but that is what we want to test in this section: does that even matter for a smaller dataset? To test this, we train two different models: one model is fully trained on 513 images of a single person, whilst the other model is fully trained on 527 images of six people in total. Our hypothesis is that the model trained on six different people should generalize better to new faces than the model trained on just a single face. Since we have 11 people in our overall dataset, we use the remaining 4 people as a testing set (a sketch of how we compute the error on such a test set follows the metric descriptions below). The three image sets are distributed as follows:

![](https://hackmd.io/_uploads/Sy7uVuFw2.png)

This means all datasets should be representative enough of the entire screen. For this experiment, we use the following metrics:

#### Number of people

This is the number of different people in the training set, for this experiment either a single person or six different people.

#### Number of training images

This is the total number of images that are used to train the model.

#### Number of testing images

This is the total number of images that are used to test the model, which in both cases come from four different persons.

#### Lowest cm loss during training

This is the lowest loss (in centimetres) that was achieved during training.

#### Epoch at which the minimum loss is achieved

This is the epoch at which the minimum training loss was achieved.

#### Average cm error on the training images

This is the average centimetre error of the model tested on the images it was trained upon. This is more of a sanity check to see how close this is to the actual lowest loss during training.

#### Average cm error on the testing images

This is the average centimetre error of the model tested on the test set, so on images it has not seen before. This should give an idea of how generalizable each network is.
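For clarity, the "average cm error" columns in the tables that follow can be computed as sketched below; the model and test-set variable names in the usage example are hypothetical placeholders.

```python
import numpy as np

SCREEN_W_CM, SCREEN_H_CM = 38.3, 21.5   # Lenovo Legion 5 screen size [7]

def average_cm_error(model, inputs, gaze_xy):
    """Mean Euclidean distance in centimetres between predicted and true gaze
    points, where both are given as (x, y) truncated to [0, 1]."""
    preds = model.predict(inputs, verbose=0)
    diff_cm = (preds - gaze_xy) * np.array([SCREEN_W_CM, SCREEN_H_CM])
    return float(np.mean(np.linalg.norm(diff_cm, axis=1)))

# Hypothetical usage, comparing the single-person and six-person models on the
# images of the four held-out participants:
# print(average_cm_error(model_single_person, test_inputs, test_gaze_xy))
# print(average_cm_error(model_six_people, test_inputs, test_gaze_xy))
```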
The outcome of this experiment is shown in the table below:

| Number of persons | Number of training images | Number of testing images | Lowest cm loss during training | #epoch min loss achieved | Average cm error on the training images | Average cm error on the testing images |
|---|---|---|---|---|---|---|
| 1 | 513 | 412 | 2.7737 | 200 out of 200 | 3.7445 | 9.5869 |
| 6 | 527 | 412 | 4.6647 | 198 out of 200 | 4.8522 | 8.9616 |

From this table, we see that the model trained on images of six people does perform slightly better on the test set than the model trained on a single person, but it is a marginal improvement at best. Whilst this shows that some performance improvement can be obtained by increasing the number of people in your dataset, it also shows that the model trained on a single person performs similarly.

### How much does differing visual appearance impact the performance of the model?

One problem with an appearance-based eye tracker is that no two human beings look alike. This might be problematic for a small-dataset approach, since less data also means the network sees fewer different human beings. One can wonder, however, whether a small-dataset approach does work for two people who share at least some visual attributes. In this experiment, that is exactly what we want to investigate. We perform six training runs with the datasets of three different people, the two authors and their supervisor, as can be seen below:

![](https://hackmd.io/_uploads/Hkc8V0_D3.png)

This selection contains two Caucasian males of European descent, person 1 on the left and person 3 on the right, and one female of Asian descent, person 2 in the middle. Our initial hypothesis is that a model trained on one of the two Caucasian males should generalize better to the other Caucasian male than to the Asian woman. For each of the six training runs, we train a model on all available images of one person (so no validation split). Afterwards, we test the model on two datasets: the dataset it was trained upon, and an unseen dataset from one of the two remaining participants. For this experiment, we use the following metrics:

#### Trained on/Tested on

From our selection of three different individuals, this shows which image set the model is trained upon and which image set is used to test it.

#### Number of training images

This is the total number of images that are used to train the model.

#### Number of testing images

This is the total number of images that are used to test the model.

#### Lowest cm loss during training

This is the lowest loss (in centimetres) that was achieved during training.
#### Epoch at which the minimum loss is achieved

This is the epoch at which the minimum training loss was achieved.

#### Average cm error on the training images

This is the average centimetre error of the model tested on the images it was trained upon. This is more of a sanity check to see how close this is to the actual lowest loss during training.

#### Average cm error on the testing images

This is the average centimetre error of the model tested on the test set, so on images it has not seen before. This should give an idea of how generalizable each network is.

| Trained on | Number of training images | Tested on | Number of testing images | Lowest cm loss during training | #epoch min loss achieved | Average cm error on the training images | Average cm error on the testing images |
|---|---|---|---|---|---|---|---|
| Person 1 | 99 | Person 3 | 171 | 8.5509 | 200 out of 200 | 8.6079 | 11.0710 |
| Person 3 | 171 | Person 1 | 99 | 2.9983 | 197 out of 200 | 3.6693 | 9.9542 |
| Person 3 | 171 | Person 2 | 107 | 3.0141 | 188 out of 200 | 3.0226 | 15.3962 |
| Person 2 | 107 | Person 3 | 171 | 9.2201 | 200 out of 200 | 9.1802 | 10.1395 |
| Person 2 | 107 | Person 1 | 99 | 8.7418 | 200 out of 200 | 8.6957 | 13.5517 |
| Person 1 | 99 | Person 2 | 107 | 5.6192 | 200 out of 200 | 6.8205 | 14.8352 |

Can we conclude anything from this experiment to answer our hypothesis? If we look at training runs 2 and 3, for instance, we could argue that the average error for the model tested on images of person 2 is substantially higher than on the test set containing person 1. However, looking at training runs 1 and 4, using person 3 as the test set gives a lower average error for the model trained on person 2 than for the model trained on person 1. This makes it hard to draw a concrete conclusion regarding differing visual appearances, since in practice, minute differences in visual appearance can already have a severe impact.

### Future possibilities

From our three experiments, it would seem that a small-dataset approach does not achieve the results promised by the "Eye Tracking for Everyone" [1] paper. However, a possibility to remedy this could be data augmentation. If it is not possible to collect huge amounts of data, we could artificially create a larger dataset by, for instance, adding noise, altering lighting/colour or cropping images with a slight pixel offset. Perhaps it could even be possible to add more points by flipping images. For instance, if you have a picture of someone looking at the top-left corner of their screen, a flipped image could correspond to the top-right corner of the screen.
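As an illustration of the flipping idea, the sketch below mirrors a sample and reflects its label. It assumes the images are NumPy arrays of shape (H, W) or (H, W, C) and the gaze label is truncated to [0, 1]; we did not implement this augmentation, so this is only a sketch of the idea.

```python
import numpy as np

def flip_sample(face, left_eye, right_eye, face_grid, gaze_xy):
    """Horizontal-flip augmentation: mirror all image inputs, swap the two eye
    crops, mirror the face grid, and reflect the x coordinate of the label."""
    flip = lambda img: img[:, ::-1]                       # mirror along the width axis
    new_gaze = np.array([1.0 - gaze_xy[0], gaze_xy[1]])   # x in [0, 1] flips, y stays
    # After mirroring, the original right-eye crop plays the role of the left eye.
    return flip(face), flip(right_eye), flip(left_eye), flip(face_grid), new_gaze
```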
## Conclusion

In conclusion, it is very understandable why most papers use extensive datasets to train and test their models. We found that the "Eye Tracking for Everyone" [1] model trained on our dataset of eleven people performed quite a bit worse than the results listed in the paper. Beyond quantity, it is also important to have a high-quality dataset, which for this application would entail a dataset with many visually different looking people. We see signs that this helps to improve generalizability, since it appears that minute differences in people's appearances can already drastically impact the performance of this network. A possible way to remedy the performance of the model on a small dataset could be data augmentation, which we leave open as future work.

## References

[1]. K. Krafka et al., "Eye Tracking for Everyone," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2176-2184, doi: 10.1109/CVPR.2016.239.

[2]. B. Browatzki, H. H. Bülthoff and L. L. Chuang, "A comparison of geometric- and regression-based mobile gaze-tracking," Front Hum Neurosci. 2014 Apr 8;8:200, doi: 10.3389/fnhum.2014.00200. PMID: 24782737; PMCID: PMC3986557.

[3]. Tobii Dynavox Global, "What is Eye Tracking", https://www.tobiidynavox.com/pages/what-is-eye-tracking (accessed Jun. 16, 2023).

[4]. CSAILVision, "CSAILVision/GazeCapture: Eye Tracking for Everyone", https://github.com/CSAILVision/GazeCapture (accessed Jun. 16, 2023).

[5]. gdubrg, "gdubrg/Eye-Tracking-for-Everyone: A Keras + TensorFlow implementation of the CVPR 2016 paper 'Eye Tracking for Everyone'", https://github.com/gdubrg/Eye-Tracking-for-Everyone (accessed Jun. 16, 2023).

[6]. Simon Ho, "Webcam Eye Tracker: Data collection of screen coordinates", https://www.simonho.ca/machine-learning/webcam-eye-tracker-data-collection/ (accessed Jun. 16, 2023).

[7]. Lenovo, "Lenovo Legion 5 17ACH6H - Overview", https://psref.lenovo.com/syspool/Sys/PDF/Legion/Lenovo_Legion_5_17ACH6H/Lenovo_Legion_5_17ACH6H_Spec.pdf (accessed Jun. 16, 2023).

[8]. PyPI, "MTCNN", https://pypi.org/project/mtcnn/ (accessed Jun. 16, 2023).

[9]. K. Zhang, Z. Zhang, Z. Li and Y. Qiao, "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks," in IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, Oct. 2016, doi: 10.1109/LSP.2016.2603342.

## Appendix

### Appendix A: Training and validation losses for the "How does the size of the dataset affect the performance of the model?" training runs

![](https://hackmd.io/_uploads/ByM0ssOvh.png)