# How to read a paper in 10 minutes (or less): a case study in more efficient reading of 4 papers

## Motivation

If you're interested in starting research on anything related to artificial intelligence (AI), deep learning (DL), or its many applications, or, like I was, you're a graduate student without much experience reading papers, navigating the relevant literature can feel like a daunting task. Every week there are probably dozens, if not hundreds, of new research articles posted to journals, conferences, and repositories like arXiv. Keeping up with the literature is a challenge by itself, let alone getting started. This post aims to help new researchers, like me, familiarize themselves with the paper reading process.

### Reading a paper

Someone once mentioned to me that some researchers intentionally over-complicate their papers to inflate the perception of how much work has been put into them, to make them more difficult for others to (potentially) judge negatively, and to make themselves appear smarter. I won't comment on this, but reading scientific papers can definitely be challenging, stressful, and time-consuming, especially at first. However, with time and good technique, one gets used to the process. The purpose of this post is to share a series of tips and techniques for reading a paper more efficiently, based on my personal experience and many recommendations I've come across over time. To illustrate the reading workflow, I'll demonstrate the main concepts by applying them to four articles relevant to my particular field, computer science (CS), and more specifically, computer vision (CV) using deep learning (DL) for biomedical applications.

## Workflow and main takeaways

The key methodology to read a paper, or at least what works for me, is to go through the paper inspecting these points:

1. Title.
2. Structure.
3. First figure.
4. Abstract (optional).
5. Rest of the figures and tables.
6. Specific sections, or future work.

Now I'll describe each of these points in more detail:

1. Before reading a paper, it's important to assess whether the content is relevant to you; the title should convey that information. If at first glance it doesn't, there's a good chance the rest won't either.
2. If it is relevant, then get a quick overview of the structure of the paper, in terms of sections, figures, and tables. This is to assess whether any particular section seems especially relevant to your interests.
3. For most papers, the first figure conveys the largest amount of information and the main contribution, so take a good look.
4. Optional: if the first figure looks promising, take a look at the abstract. Most abstracts are structured in this manner: motivation, hypothesis or proposal, results. The reason I mark this as optional is that the abstract is often the most information-dense part of the paper, since it summarizes everything, and it's usually written using field-specific terminology. This makes it more difficult, and therefore time-consuming, for beginners to grasp the main idea. However, as you gain experience, points 3 and 4 become the most crucial for getting through lots of papers in a short time.
5. Then, take a look at the figures and tables in the paper. Most will describe methods or demonstrate results. If it's the latter, don't spend much time on it; the main takeaway is usually:
   > We outperformed our competition in terms of some metric, usually accuracy or computational resources, or both.
6. If there's any particular section you noted as important, now take a quick glance at it. Again, don't spend too much time. I make an exception for sections related to *Challenges*, *Shortcomings*, and *Future Work*, since this is where you want to capitalize.

That's it. If you follow these simple guidelines, reading a gargantuan paper becomes a much more feasible task. Another key takeaway regarding research papers in a particular area is that they have diminishing returns in terms of the knowledge you can obtain from them. The more familiar you are with a subject, the less information you learn from reading a new paper in that area, since many of the paragraphs paraphrase or repeat the same main concepts. Therefore, the aforementioned methodology becomes much more relevant as you become more versed in a particular field's literature.
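The first two checkpoints, title and abstract, can even be triaged in bulk before you open a single PDF. As a minimal sketch (not a polished tool), here's one way to pull recent titles and abstracts from arXiv's public Atom API using only Python's standard library; the query term below is just an example, so swap in your own field's keywords:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Example query term; replace with keywords from your own field.
QUERY = "all:photoplethysmography"
URL = ("http://export.arxiv.org/api/query?"
       f"search_query={QUERY}&start=0&max_results=5")

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace used by arXiv's Atom feed

with urllib.request.urlopen(URL) as response:
    feed = ET.fromstring(response.read())

for entry in feed.iter(f"{ATOM}entry"):
    title = " ".join(entry.find(f"{ATOM}title").text.split())
    abstract = " ".join(entry.find(f"{ATOM}summary").text.split())
    # Point 1 of the workflow: judge relevance by the title alone;
    # point 4: skim the abstract only if the title passes.
    print(title)
    print(abstract[:200] + "...\n")
```

This only automates the triage step; the actual reading still follows the points above.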
## Case studies

The first article is a survey of DL using convolutional neural networks (CNNs), one of, if not the, most common architectures for CV tasks during the last decade. The second is a much more specific review of remote photoplethysmography (RPPG) technologies. RPPG allows for remote monitoring of vital signs, such as heart rate and respiration rate, usually by analyzing color variations in a person's skin as recorded by a camera. The third is an average article in the RPPG area, where the authors apply 3D CNNs to increase the accuracy of heart rate acquisition from a video of a person. Finally, the fourth is a pretty recent and trendy paper, where the researchers adapt a model that has revolutionized the field of natural language processing (NLP), the transformer, to the task of image classification, and obtain state-of-the-art (SotA) results on a variety of benchmarks.

### Example 1: A survey of the recent architectures of deep convolutional neural networks

This paper is 70 pages long (57 without citations), enough to intimidate anyone into putting it on a *to-read* list. However, by applying the aforementioned method, we can go through it in a few minutes. First, the title tells us it's a survey or review article. These are usually directed at people new to a particular area, to ease the process and lower the entry barrier for new researchers. It also tells us the focus is on architectures of deep CNNs. For someone without any knowledge of DL these terms could be confusing, but any short introductory course on DL should cover the main ideas behind them. I plan to write a complete post on how to start doing CV using DL from scratch, but for now, if you are interested, this is a good starting point: [Deep Learning Specialization in Coursera](https://www.coursera.org/specializations/deep-learning). Second, the first figure actually conveys the structure of the article, so we kill two birds with one stone.

<figure>
<img src="https://i.imgur.com/C4NNTVS.png" alt="Paper structure">
<figcaption style="align: left; text-align:center;">Paper structure. Fig. 1 as taken from [1].</figcaption>
</figure>

Based on this, and by quickly reading through each of the sub-sub-sections, we can see that Section 2 covers the basic CNN components, Section 3 is concerned with the history of CNNs from a broader point of view, and Section 4 goes through the specific innovations mentioned in the previous section in detail. After that, Section 5 discusses applications of CNNs, Section 6 mentions challenges associated with each of the main types of CNNs, and Section 7 concludes by mentioning possible research directions. Based on your particular interests you may want to focus on a specific subsection, but for now I will stick to the original workflow and go through the figures and tables.

Fig. 2 illustrates a typical diagram of a machine learning system; it's nothing new. Similarly, Table 2 is just notation. Fig. 3, however, is (probably) the single biggest contribution of this paper: a single diagram showing the whole history of CNN architectures, in terms of taxonomy and improvements, since their inception.

<figure>
<img src="https://i.imgur.com/ENa3eGf.png" alt="CNNs architecture history">
<figcaption style="align: left; text-align:center;">CNNs architecture history. Fig. 3 as taken from [1].</figcaption>
</figure>

Similarly, Tables 3 and 4 summarize the contributions and performance of CNN models, and the online resources that have made the rise of DL and CNNs possible, respectively.

![](https://i.imgur.com/N1TAq3S.png)

![](https://i.imgur.com/4kkV8LE.png)

Fig. 4 provides a taxonomy for classifying CNN models into groups, but the information is similar to, though less detailed than, that of Fig. 3, so we skip it. Fig. 5, 6 and 7 illustrate the main architectures or blocks of some of the most important CNN models, including [AlexNet](https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html), [Inception](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43022.pdf), and [ResNet](https://arxiv.org/abs/1512.03385) (see the sketch at the end of this example for what such a block looks like in code).

![](https://i.imgur.com/ieD4Jxm.png)

![](https://i.imgur.com/v2J75D7.png)

![](https://i.imgur.com/x8a5qb1.png)

After this, the paper illustrates more architectural innovations in Fig. 8, 9, 10 and 11. Later, Tables 5a to 5g mention challenges in each of the main architecture taxonomies, in terms of strengths and gaps; these are the last figures and tables in the paper. Finally, we take a closer look at the *Future Work* section, to draw inspiration on recent trends and directions from more experienced researchers. They mention 9 points. By taking a short look at the first sentence of each, we can usually estimate how relevant or interesting these directions are to us. I briefly summarize them as:

1. Ensemble learning: combining multiple architectures for improved robustness and generalization.
2. CNNs' generative capabilities to boost the representational power of models.
3. Attention mechanisms to capture information from images, as inspired by the human visual system. *Note: this is explored in the paper of [example 4](#example-4-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale).*
4. Improvements in hardware to accelerate training and run-time applications.
5. Hyper-parameter (activation function, kernel size, number of layers, etc.) tuning and optimization.
6. Pipeline parallelism to scale up deep CNN training.
7. Cloud-based platforms for the development of computationally intensive CNN applications.
8. CNNs for sequential data: 1-D CNNs.
9. CNNs for high-energy physics.
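Before moving on, it helps to see what one of those "architectural blocks" actually looks like. Below is a minimal PyTorch sketch, my own illustration rather than code from the survey, of the skip-connection block popularized by ResNet: the input bypasses two convolutional layers and is added back to their output.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-layer residual block, in the spirit of ResNet (a sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                               # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))   # conv -> norm -> non-linearity
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)           # add the shortcut, then activate

x = torch.randn(1, 64, 56, 56)     # (batch, channels, height, width)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56]): shape preserved
```

Stacking many such blocks is what allows very deep CNNs to train without vanishing gradients, which is exactly the kind of innovation the survey's taxonomy tracks.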
### Example 2: Remote Monitoring of Vital Signs in Diverse Non-Clinical and Clinical Scenarios Using Computer Vision Systems: A Review

This paper is 35 pages long (28 without citations). Let's first look at the title. It's pretty self-descriptive; it's a review about a specific application of computer vision systems. The sections are as follows:

1. Introduction
2. Video Camera Imaging Based Method
3. Different Aspects of Colour-Based Method
4. Applications
5. Research Gaps and Challenges and Future Direction
6. Conclusion

Then, we take a look at the first figure. It describes the main contactless acquisition techniques used for obtaining vital signs. Based on this, and the outlined sections, we can deduce this review will focus on modality (c): video camera imaging.

<figure>
<img src="https://i.imgur.com/W4rdN8k.png" alt="Contactless measuring methods of vital signs">
<figcaption style="align: left; text-align:center;">Contactless measuring methods of vital signs. Fig. 1 as taken from [2].</figcaption>
</figure>

Continuing with the figures, Fig. 2 gives us a better description of the methods used in these systems. From this we can understand that these contactless vital-sign monitoring systems operate by obtaining a video feed and applying image and signal processing techniques to the frames, to obtain the desired vital sign, such as heart rate or respiration rate (a toy version of this pipeline is sketched at the end of this example). Fig. 3 describes the equipment (cameras) used for obtaining the video. Fig. 4 and 5 describe block diagrams for contactless monitoring systems; however, they are pretty much extended versions of the one in Fig. 2, so we can skip the details.

![](https://i.imgur.com/g2SVdGH.png)

Fig. 6 also describes an important process: the underlying physical mechanism by which we can obtain these physiological signals. We observe that the camera sensor basically captures reflected light that penetrates the skin and interacts with blood vessels.

![](https://i.imgur.com/4n9PRJy.png)

The rest of the figures and tables describe specific landmarks and algorithms used to process these signals, under a variety of conditions, and report results. By looking at the sub-section titles we can also infer that these algorithms have been applied under a variety of conditions, including stable environments and environments with illumination and motion artifacts, using a variety of sensors, on subjects of different ages, and to capture a variety of vital signs: heart rate, respiration rate, arterial oxygenation, blood pressure, etc.

Then, we put our focus on Sections 4 and 5. Section 4 covers clinical and non-clinical applications such as neonatal and critical patient monitoring, arrhythmia detection, home health care, fitness, sleep, and stress monitoring, among others. Section 5 describes possible future directions. The authors mention 9 points, summarized as:

1. Algorithms to deal with environments with both illumination and motion artifacts.
2. Different vital signs such as blood glucose and blood oxygen.
3. Multi-subject monitoring.
4. Long-distance monitoring.
5. Region of interest (ROI) selection.
6. The elderly and premature babies as subjects.
7. Lack of publicly available databases.
8. ECG to obtain ground-truth values of vital signs for validation, instead of a pulse oximeter.
9. Multi-camera and non-visible-light fusion video.
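To make that video-to-vital-sign pipeline concrete, here is a deliberately naive sketch of the classic green-channel approach, assuming you already have a cropped skin region of interest as a NumPy array. Real systems, including those surveyed, add face detection, ROI tracking, and far more robust signal separation; the helper name and the toy test signal below are mine.

```python
import numpy as np

def estimate_heart_rate(frames: np.ndarray, fps: float) -> float:
    """Toy RPPG estimate from `frames`, a (T, H, W, 3) RGB video of a skin ROI."""
    # 1. Spatially average the green channel per frame; green shows the
    #    strongest blood-volume pulsations.
    signal = frames[:, :, :, 1].mean(axis=(1, 2))
    # 2. Remove the mean so slow illumination offsets don't dominate the FFT.
    signal = signal - signal.mean()
    # 3. Find the dominant frequency within a plausible heart-rate band.
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)  # 42 to 240 beats per minute
    peak_hz = freqs[band][np.argmax(power[band])]
    return peak_hz * 60.0                   # convert Hz to beats per minute

# Sanity check on synthetic frames with a 1.2 Hz (72 BPM) pulse baked in.
fps, seconds = 30.0, 10
t = np.arange(int(fps * seconds)) / fps
frames = np.full((len(t), 8, 8, 3), 128.0)
frames[:, :, :, 1] += 2.0 * np.sin(2 * np.pi * 1.2 * t)[:, None, None]
print(estimate_heart_rate(frames, fps))     # ~72.0
```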
### Example 3: 3D Convolutional Neural Networks for Remote Pulse Rate Measurement and Mapping from Facial Video

This paper is 21 pages long (17 without citations). Let's first look at the title. Again, it's related to the aforementioned remote monitoring of vital signs. However, this is not a review, but a paper where the authors apply a DL technique, a 3D CNN, to measure pulse rate remotely from facial video. This paper is a better representative of the average paper in an application-specific area of computer vision systems, where the focus is less on innovating CV as a whole and more on improving performance for the particular application; the CNN is just a means to an end. The sections are as follows:

1. Introduction
2. Related Works
3. Materials and Methods
4. Results and Discussion
5. Conclusion

Again, by looking at the first figure, we can infer most of the contributions of this work. The authors compare a traditional RPPG system, on the top, to their proposed system, which automates the whole process, trained using only synthetic data, to predict pulse rates.

<figure>
<img src="https://i.imgur.com/w42P6Op.png" alt="Proposed CNN model for RPPG system">
<figcaption style="align: left; text-align:center;">Traditional vs proposed approach to contactless pulse rate prediction. Fig. 1 as taken from [3].</figcaption>
</figure>

Moving forward, Fig. 2 describes how they obtain the synthetic data used for training their 3D CNN. Fig. 3 to 5 describe experiments and visualizations of different steps of the synthetic data generation. Tables 1 and 2 also describe results related to the synthetic data generation.

![](https://i.imgur.com/2U3AA8H.png)

Fig. 6 describes the architecture of the 3D CNN model they used. Fig. 7 shows the learning curves for their CNN. Similarly, Fig. 11 demonstrates how the learning affects the model's results.

![](https://i.imgur.com/qd4N61n.png)

Fig. 8 serves as a demo of their results.

![](https://i.imgur.com/9ZTSDcf.png)

Fig. 9 and 10, and Table 3, compare their proposed methodology against other established methods, and show how it outperforms them in terms of standard metrics for the field. Finally, Fig. 12 describes a particular "failure" case for their model. Since this paper doesn't explicitly have a *Future Work* section like the former two, the future directions are instead stated somewhere in the discussion or the conclusion. In this case, the conclusion mentions that they didn't explore other architectural and hyper-parameter choices for their model, including deeper CNNs or recurrent networks.

### Example 4: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

This is my recent personal favorite. It has reignited my passion for CV and DL as a whole; we're back in 2012, when AlexNet was proposed, and a whole new paradigm of options to explore has opened up. It has also generated plenty of discussion since the moment it was published anonymously (spoiler: everyone knew it was probably from Google, or another big company, just based on the amount of compute used for the experiments). This paper is 21 pages long, 9 without citations and appendix. The appendix is actually an important part that adds 9 more pages, making it 18 pages without counting citations. To be honest, it's the one that has taken me the longest to grasp among these four papers, because I have studied it in depth. Let's first look at the title. It's catchy, but not exactly clear what it's about. We can, however, deduce that it's about a model called the *Transformer* applied to the landmark CV task of image recognition. This paper will (in my opinion) probably become representative of a landmark paper in DL, or at least CV: a paper that advances techniques and the field as a whole, and will accumulate a huge number of citations from all the derivative works based on the proposed method. The sections are as follows:

1. Introduction
2. Related Work
3. Method
4. Experiments
5. Conclusion

Jumping to the first figure, we get a much clearer diagram of what is being proposed. The authors divide the image into patches and feed them to a model that takes in the whole set of patches, allowing it to attend to the whole image at once, unlike traditional convolutions, whose receptive fields are limited and lack long-range interactions.

<figure>
<img src="https://i.imgur.com/Bomyhh3.png" alt="Vision Transformer model overview">
<figcaption style="align: left; text-align:center;">Vision transformer model overview. Fig. 1 as taken from [4].</figcaption>
</figure>
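Since "an image is worth 16x16 words" is the whole trick, here is a minimal PyTorch sketch of that patch-embedding step, using ViT-Base-like dimensions (16x16 patches, 768-dimensional embeddings). It's my own illustration of the idea, not the authors' code, and everything downstream of it is a standard transformer encoder from NLP.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens, ViT-style (a sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each 16x16 patch
        # and applying one shared linear projection to it.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                 # x: (B, 3, 224, 224)
        x = self.proj(x)                  # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)    # prepend the learnable [class] token
        return x + self.pos_embed         # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]): 196 patches + 1 class token
```

The resulting sequence is treated exactly like a sequence of word embeddings, which is what lets the model attend across the whole image from the very first layer.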
Table 1 presents some variants of this architecture, which they label ViT, in terms of hyper-parameter choices. Table 2 conveys the main set of results: ViT models outperform the current best CNN models on the most widely benchmarked dataset for image classification, ImageNet, and on other datasets, while taking less time to train.

![](https://i.imgur.com/WWSoN50.png)

Fig. 3, 4 and 5 convey more quantitative results. Fig. 4 shows the effect of the pre-training dataset on the performance of the ViT models. Fig. 5 shows results on computational resources. Fig. 6 and 7 display interesting properties of ViT. Fig. 6 shows the ability to easily visualize which parts of the image the model is attending to. Fig. 7 discusses the embeddings for the image patches, along with the mean distance between attended pixels inside the transformer encoder as the network depth increases.

![](https://i.imgur.com/mMuSJAO.png)

In the conclusion they include possible future work, such as extending ViT to object detection and segmentation models, along with more exploration of self-supervised pre-training. The appendix also includes relevant information regarding the training, and some of the many experiments conducted on this new architecture, which has been mostly ported over from the NLP domain.

## Conclusion

This post provides a simple and efficient methodology for new researchers to break research articles down into a format that's more easily digestible and quicker to read. If you like this post, or have any questions, feel free to leave a comment or contact me on any of my socials, found at the bottom of my [Github Pages](https://arkel23.github.io/).

## References

[1] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the recent architectures of deep convolutional neural networks,” Artif Intell Rev, vol. 53, no. 8, pp. 5455–5516, Dec. 2020, doi: 10.1007/s10462-020-09825-6.

[2] F.-T.-Z. Khanam, A. A. Al-Naji, and J. Chahl, “Remote Monitoring of Vital Signs in Diverse Non-Clinical and Clinical Scenarios Using Computer Vision Systems: A Review,” Applied Sciences, vol. 9, no. 20, p. 4474, Oct. 2019, doi: 10.3390/app9204474.

[3] F. Bousefsaf, A. Pruski, and C. Maaoui, “3D Convolutional Neural Networks for Remote Pulse Rate Measurement and Mapping from Facial Video,” Applied Sciences, vol. 9, no. 20, p. 4364, Jan. 2019, doi: 10.3390/app9204364.

[4] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929 [cs], Oct. 2020, Accessed: Dec. 14, 2020. [Online]. Available: http://arxiv.org/abs/2010.11929.