---
tags: uni, var
---

# Virtual and Augmented Reality

![](https://i.imgur.com/1yfAxWL.png)

* **Virtual reality (VR)** immerses users in a fully artificial digital environment.
    * Education (A flight simulator in use by the US Air Force)
    * Health Care (Drug Design)
    * Industry (Prototyping)
* **Augmented reality (AR)** overlays virtual objects on the real-world environment.
* **Mixed reality (MR)** does not just overlay but anchors virtual objects to the real world.

### Augmented Human

![](https://i.imgur.com/MNOs9gp.png)

An extension of Milgram's reality-virtuality continuum (Milgram and Kishino, 1994), where the y-axis is the level of augmentation. The augmented human merges several technologies and UI paradigms to serve humans through more direct and natural interfaces.

![](https://i.imgur.com/kZwKIM7.png =500x)

### VR Definitions

* Virtual Reality = computer-generated reality
* Sutherland's 1965 Vision:
    * Display as a window into a virtual world which sounds and feels real
    * Improve image generation until the picture looks real
    * Computer maintains the world model in real time
    * User directly manipulates virtual objects
    * Immersion in the virtual world via a head-mounted display
* S. M. LaValle:
    * "Inducing targeted behavior in an organism by using artificial sensory stimulation, while the organism has little or no awareness of the interference."
* F. P. Brooks:
    * "Virtual reality experience as any in which the user is effectively immersed in a responsive virtual world. This implies user dynamic control of viewpoint."

### VR Characteristics

* Immersion
* Presence
* Open vs. Closed Loop
* Telepresence/Teleoperation
* Augmented Reality

### VR and AI

The computer should generate content, video, audio, etc. on its own, without humans prescribing everything. AI in general, and machine/deep learning in particular, is needed in VR for:

* Recognition of gestures, movements, tactile inputs, etc.
* Defining game content, video, and audio
* Players' actions and behavior

There is a long and fruitful synergy between AI and games:

* The first AI algorithms were investigated on games: Checkers, Chess, TD-Gammon, etc.
* Games and VR are very suitable simulation environments for research on new AI algorithms
* AI and deep learning enable better, cheaper, and faster game and VR design

![](https://i.imgur.com/56wj7PC.png)

#### DeepMind's NN for Atari Games

![](https://i.imgur.com/FYfr1hF.png)

### VR and human

![](https://i.imgur.com/uVqlJI1.png)

![](https://i.imgur.com/CNbiAlx.png)

### VR Hardware

* Displays (output): Devices that stimulate sense organs
    * Caves
    * Headsets
    * Headphones
    * Haptic feedback devices
* Sensors (input): Devices that extract information from the real world
    * Inertial Measurement Unit (IMU):
        * Gyroscope
        * Accelerometer
        * Magnetometer
    * Cameras
        * Standard cameras
        * Depth (IR) cameras
    * Other sensors
        * Keyboard, controller, mouse, etc.
* Computers: Devices that process inputs and outputs
* VR software: Virtual World Generator (VWG)
    * The VWG receives inputs from low-level systems that indicate what the user is doing in the real world
    * A head tracker provides timely estimates of the user's head position and orientation
    * Keyboard, mouse, and game controller events arrive in a queue, ready to be processed
    * The key role of the VWG is to maintain enough of an internal "reality" so that renderers can extract the information they need to calculate outputs for their displays (a minimal loop sketch follows the figure below)

![](https://i.imgur.com/quaue3k.png =550x)
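As a rough illustration of the VWG's role, here is a minimal sketch of the loop it drives. All names (`tracker`, `input_queue`, `world`, `renderer` and their methods) are hypothetical placeholders for whatever engine components are actually used, not a real API.

```python
import time

def virtual_world_generator(tracker, input_queue, world, renderer, target_dt=1/90):
    """Minimal VWG loop sketch; all object interfaces are hypothetical."""
    last = time.perf_counter()
    while world.running:
        now = time.perf_counter()
        dt, last = now - last, now

        head_pose = tracker.read_head_pose()   # timely head position/orientation estimate
        for event in input_queue.drain():      # queued keyboard/mouse/controller events
            world.apply_input(event)           # e.g., move the matched zone

        world.step(dt)                         # maintain the internal "reality" (physics, logic)
        renderer.draw(world, head_pose)        # renderers extract what they need per display

        # Roughly pace the loop to the display refresh rate.
        time.sleep(max(0.0, target_dt - (time.perf_counter() - now)))
```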
### VR Software

#### Virtual vs. Synthetic

* The virtual world could be completely **synthetic**:
    * numerous triangles are defined in a 3D space
    * material properties indicate how they interact with light, sound, forces, and so on
    * computer graphics addresses computer-generated images from synthetic models
* The virtual world might be a **recorded physical world**:
    * captured using modern cameras, computer vision, and Simultaneous Localization and Mapping (SLAM) techniques
    * using both color and depth information from cameras, a 3D model of the world can be extracted automatically with SLAM techniques
* Many possibilities exist **between the extremes**:
    * Camera images may be taken of a real object and then mapped onto a synthetic object in the virtual world. This is called texture mapping, a common operation in computer graphics.

#### Matched Motion

* A matched zone is maintained between the user in the real world and their representation in the virtual world. The matched zone could be moved in the virtual world by using an interface, such as a game controller, while the user does not correspondingly move in the real world.

![](https://i.imgur.com/F0D28NE.png)

#### Physics

* The virtual world should behave like the real world:
    * The basic laws of mechanics
    * Collision detection algorithms
    * Light propagation and interaction
    * Sound propagation
    * Smell propagation

### Sensation and Perception - VR Human Physiology and Perception

![](https://i.imgur.com/SwktJ43.png =630x)

Perceptual psychology is the science of understanding how the brain converts sensory stimulation into perceived phenomena. It is necessary for making VR design decisions:

* How far away does that object appear to be?
* How much video resolution is needed to avoid seeing pixels?
* How many frames per second are enough to perceive motion as continuous?
* Is the user's head appearing at the proper height in the virtual world?
* Where is that virtual sound coming from?
* Why am I feeling nauseated?
* Why is one experience more tiring than another?
* What is presence?

#### Human Body Senses

![](https://i.imgur.com/71D8uuO.png)

### Perception: Signal Propagation

![](https://i.imgur.com/wHyQXGX.png)

#### Neuron Information Coding

![](https://i.imgur.com/32y5BZ3.png =500x)

Information is encoded in the frequency and pattern of firing of action potentials (not in their amplitude).

![](https://i.imgur.com/U1ZE2zI.png =500x)

#### Top-Down and Bottom-Up

Perception is generated by external sensation (bottom-up) and by internal knowledge, experience, expectations, etc. (top-down).

![](https://i.imgur.com/EcQ4wwZ.png =500x)

#### Perception Characteristics

* **Hierarchical Processing**
    * Upon leaving the sense-organ receptors, signals propagate among the neurons to eventually reach the cerebral cortex. Along the way, hierarchical processing is performed:
        1. Each receptor responds to a narrow range of stimuli, across time, space, frequency, and so on.
        2. After passing through several neurons, signals from numerous receptors are simultaneously taken into account.
        3. In the cerebral cortex, the signals from the sensors are combined with anything else from our life experiences that may become relevant for making an interpretation of the stimuli.
        4. Information or concepts that appear in the cerebral cortex tend to represent a global picture of the world around us.
![](https://i.imgur.com/myYSQkP.png =400x)

* **Proprioception**
    * Proprioception is the ability to sense the relative positions of parts of our bodies and the amount of muscular effort involved in moving them.
    * This information is so important to our brains that the motor cortex, which controls body motion, sends signals called efference copies to other parts of the brain to communicate what motions have been executed.
    * Proprioception is effectively another kind of sense.
    * In the case of robots, it corresponds to having encoders on joints or wheels to indicate how far they have moved.
* **Fusion of senses**
    * Signals from multiple senses and proprioception are processed and combined with our experiences by our neural structures throughout our lives.
    * In ordinary life, without VR or drugs, our brains interpret these combinations of inputs in coherent, consistent, and familiar ways.
    * Any attempt to interfere with these operations is likely to cause a mismatch among the data from our senses; we may become fatigued or develop a headache, dizziness, or nausea.
    * In other cases, the brain might react by making us so consciously aware of the conflict that we immediately understand that the experience is artificial.
    * To make an effective and comfortable VR experience, trials with human subjects are essential to understand how the brain reacts.
    * One of the most important examples of bad sensory conflict in the context of VR is vection, which is the illusion of self-motion.
* **Adaptation**
    * Adaptation means that the perceived effect of stimuli changes over time.
    * Over long periods of time, perceptual training can lead to adaptation.
    * Those who have spent many hours and days playing first-person shooter games apparently experience less vection when locomoting themselves in VR.
    * Adaptation therefore becomes a crucial factor for VR.
    * Through repeated exposure, developers may become comfortable with an experience that is nauseating to a newcomer.

## Human visual perception and optical illusions

* Motivation for the topic: Why do we need it for Virtual Reality?
* Proposed literature
* Basic facts explained
* Additional literature
* Thorough explanations provided
* Critical argumentation/own critical view
* Advantages and disadvantages of different solutions/methods
* Examples of the usage in different VR and/or gaming applications provided (see the attachment for some links to tools/code)
* State how you can use this for software development in general
* How does the human visual system work?
* Which functions are performed where?
* What are its limitations, and how can it be tricked by optical illusions?
* From an engineer's point of view:
    * Which modules are there?
    * What are the functions, inputs, and outputs of each module?
    * Why does the design look the way it does?

---

http://lavalle.pl/vr/ — LaValle, Ch. 4, 5, 6, 6.2

### Chapter 4 - Light and Optics

Knowing how light propagates in the physical world is crucial to understanding VR because of the **interface between visual displays and our eyes** and because of the **construction of virtual worlds**. It also helps to understand how cameras work, which provide another way to present a virtual world: through panoramic videos.

* Light is emitted from displays and arrives on our retinas in the same way that light arrives through normal vision in the physical world.
* In the current generation of VR headsets, a system of both engineered and natural lenses (parts of our eyes) guides the light.
* We need to model the physics of light propagation through virtual worlds.

### 4.1 Basic Behavior of Light - basic physical properties of light, including its interaction with materials and its spectral properties

Light can be described in three ways (these ways are compatible):

1. Photons: Tiny particles of energy moving through space at high speeds. (This interpretation is helpful when considering the amount of light received by a sensor or receptor.)
2. Waves: Ripples through space that are similar to waves propagating on the surface of water, but in 3D. The wavelength is the distance between peaks. (This interpretation is helpful when considering the spectrum of colors.)
3. Rays: A ray traces the motion of a single hypothetical photon. The direction is perpendicular to the wavefronts (see Figure 4.1). (This interpretation is helpful when explaining lenses and defining the concept of visibility.)

#### Spreading waves

Figure 4.1 shows how waves would propagate from a hypothetical point light source.

![](https://i.imgur.com/HSp7a10.png =450x)

* Density:
    * same in all directions (radial symmetry)
    * decreases as the light source becomes more distant
* The surface area of a sphere with radius r is 4πr^2. Consider centering a spherical screen around the light source.
    * The total number of photons per second hitting a screen of radius 1 should be the same as for a screen of radius 2;
    * the density (photons per second per area) should decrease by a factor of 1/4 because they are distributed over 4 times the area.
    * Thus, photon density decreases quadratically as a function of distance from a point light source.
* The curvature of the wavefronts also decreases as the point light source becomes further away.
* If the waves were to propagate infinitely far away, then they would completely flatten, as shown in Figure 4.2.
    * This results in the important case of parallel wavefronts.
    * Without the help of lenses or mirrors, it is impossible to actually obtain this case from a tiny light source in the physical world because it cannot be so far away;
    * it serves as both a useful approximation for distant light sources and as an ideal way to describe lenses mathematically.
* At any finite distance from a point light source, the rays of light always diverge; it is impossible to make them converge without the help of lenses or mirrors.

##### Figure 4.2: If the point light source were "infinitely far" away, then parallel wavefronts would be obtained. Other names for this setting are: collimated light, parallel rays, rays from infinity, rays to infinity, and zero vergence.

![](https://i.imgur.com/7Fs7o5h.png)

#### Interactions with materials

As light strikes the surface of a material, one of three behaviors might occur, as shown in Figure 4.3: transmission, absorption, and reflection.

![](https://i.imgur.com/XiMR5Eg.png =450x)

In the case of transmission, the energy travels through the material and exits the other side. For a transparent material, such as glass, the transmitted light rays are slowed down and bend according to Snell's law. For a translucent material that is not transparent, the rays scatter into various directions before exiting.

In the case of absorption, energy is absorbed by the material as the light becomes trapped.

The third case is reflection, in which the light is deflected from the surface. Along a perfectly smooth or polished surface, the rays reflect in the same way: the exit angle is equal to the entry angle. This case, shown in Figure 4.4, is called specular reflection, in contrast to diffuse reflection, in which the reflected rays scatter in arbitrary directions.
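To make the specular rule concrete ("exit angle equals entry angle"), the reflected ray direction can be computed from the incoming direction d and the unit surface normal n with the standard mirror-reflection formula r = d − 2(d·n)n used in computer graphics; a minimal sketch:

```python
def reflect(d, n):
    """Specular reflection: mirror direction d about the unit surface normal n.
    Implements r = d - 2 (d . n) n; the exit angle equals the entry angle."""
    dot = sum(di * ni for di, ni in zip(d, n))
    return tuple(di - 2 * dot * ni for di, ni in zip(d, n))

# A ray heading down-right onto a horizontal surface (normal pointing up)
# reflects into an up-right ray with the same angle to the normal.
print(reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0)))  # -> (1.0, 1.0, 0.0)
```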
##### Figure 4.4: Two extreme modes of reflection are shown. Specular reflection means that all rays reflect at the same angle at which they approached. Diffuse reflection means that the rays scatter in a way that could be independent of their approach angle. Specular reflection is common for a polished surface, such as a mirror, whereas diffuse reflection corresponds to a rough surface.

![](https://i.imgur.com/Du5wWYc.png)

Usually, all three cases of transmission, absorption, and reflection occur simultaneously. How the energy is divided between the cases depends on many factors, such as the angle of approach, the wavelength, and differences between the two adjacent materials or media.

#### Coherent versus jumbled light

* The first complication is that light sources usually do not emit coherent light, in which the wavefronts are perfectly aligned in time and space.
* Common light sources, such as light bulbs and the sun, instead emit a jumble of waves that have various wavelengths and do not have their peaks aligned.

#### Wavelengths and colors

* To make sense out of the jumble of waves, we will describe how they are distributed in terms of wavelengths.
* Figure 4.5 shows the range of wavelengths that are visible to humans.
* Each wavelength corresponds to a spectral color, which is what we would perceive with a coherent light source fixed at that wavelength alone.

![](https://i.imgur.com/2YiqEqr.png =500x)

* The visible light spectrum corresponds to the range of electromagnetic waves that have wavelengths between 400nm and 700nm.
* Wavelengths between 700 and 1000nm are called infrared; they are not visible to us, but some cameras can sense them.
* Wavelengths between 100 and 400nm are called ultraviolet; they are not part of our visible spectrum, but some birds, insects, and fish can perceive ultraviolet wavelengths over 300nm.

#### Spectral power

Figure 4.6 shows how the wavelengths are distributed for common light sources.

* An ideal light source would have all visible wavelengths represented with equal energy, leading to idealized white light.
* The opposite is total darkness, which is black.

We usually do not allow a light source to propagate light directly onto our retinas (don't stare at the sun!). Instead, we observe light that is reflected from objects all around us, causing us to perceive their color.

##### Figure 4.6: The spectral power distribution for some common light sources

![](https://i.imgur.com/RdplQTS.png =500x)

Each surface has its own distribution of wavelengths that it reflects. The fraction of light energy that is reflected back depends on the wavelength, leading to the plots shown in Figure 4.7.

##### Figure 4.7: The spectral reflection function of some common familiar materials

![](https://i.imgur.com/VGIkhrd.png =450x)

For us to perceive an object surface as red, the red wavelengths must be included in the light source and the surface must strongly reflect red wavelengths; other wavelengths must be suppressed.

#### Frequency

Often, it is useful to talk about frequency instead of wavelength. The frequency is the number of times per second that wave peaks pass through a fixed location. Using both the wavelength λ and the speed s, the frequency f is calculated as f = s/λ. The speed of light in a vacuum is a universal constant c with value approximately equal to 3 × 10^8 m/s; in this case, s = c. Light propagates roughly 0.03 percent faster in a vacuum than in air.
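A quick numerical check of f = s/λ, using the vacuum speed s = c ≈ 3 × 10^8 m/s and the 400–700 nm limits of the visible spectrum mentioned above:

```python
c = 3e8  # speed of light in a vacuum, m/s

def frequency(wavelength_m, speed=c):
    """f = s / lambda."""
    return speed / wavelength_m

# Edges of the visible spectrum (400 nm violet, 700 nm red):
print(frequency(400e-9))  # ~7.5e14 Hz
print(frequency(700e-9))  # ~4.3e14 Hz
```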
### 4.2 Lenses - idealized models of how lenses work

Lenses have been made for thousands of years, with the oldest known artifact shown in Figure 4.8(a). It was constructed before 700 BC in Assyrian Nimrud.

##### Figure 4.8: (a) The earliest known artificially constructed lens, which was made between 750 and 710 BC in ancient Assyrian Nimrud. It is not known whether this artifact was purely ornamental or used to produce focused images. Picture from the British Museum. (b) A painting by Conrad von Soest from 1403, which shows the use of reading glasses by an elderly male.

![](https://i.imgur.com/bOUkDkO.png)

Whether constructed from transparent materials or from polished surfaces that act as mirrors, lenses bend rays of light so that a focused image is formed.

* VR headsets are unlike classical optical devices, leading to many new challenges that are outside of standard patterns that have existed for centuries.
* Thus, the lens design patterns for VR are still being written.

#### Snell's Law

Lenses work because of Snell's law, which expresses how much rays of light bend when entering or exiting a transparent material.

##### Figure 4.9: Propagating wavefronts from a medium with low refractive index (such as air) to one with a higher index (such as glass). (a) The effect of slower propagation on the wavefronts is shown as they enter the lower medium. (b) This shows the resulting bending of a light ray, which is always perpendicular to the wavefronts. Snell's law relates the refractive indices and angles as n1 sin θ1 = n2 sin θ2.

![](https://i.imgur.com/RmdDWME.png =500x)

Recall that the speed of light in a medium is less than the speed c in a vacuum. For a given material, let its refractive index be defined as n = c/s, in which s is the speed of light in the medium. For example, n = 2 means that light takes twice as long to traverse the medium as it would a vacuum. Some common examples: n = 1.000293 for air, n = 1.33 for water, and n = 1.523 for crown glass. Snell's law relates the four quantities as n1 sin θ1 = n2 sin θ2.

![](https://i.imgur.com/t5duEVE.png)

If the condition above does not hold, then the light rays reflect from the surface. This situation occurs while under water and looking up at the surface. Rather than being able to see the world above, a swimmer might instead see a reflection, depending on the viewing angle.

#### Prisms

Imagine shining a laser beam through a prism, as shown in Figure 4.10.

##### Figure 4.10: The upper part shows how a simple prism bends ascending rays into descending rays, provided that the incoming ray slope is not too high. This was achieved by applying Snell's law at the incoming and outgoing boundaries. Placing the prism upside down causes descending rays to become ascending. Putting both of these together, we will see that a lens is like a stack of prisms that force diverging rays to converge through the power of refraction.

![](https://i.imgur.com/MW0q4cm.png)

Snell's law can be applied to calculate how the light ray bends after it enters and exits the prism. Note that for the upright prism, a ray pointing slightly upward becomes bent downward. Recall that a larger refractive index inside the prism would cause greater bending. By placing the prism upside down, rays pointing slightly downward are bent upward. Once the refractive index is fixed, the bending depends only on the angles at which the rays enter and exit the surface, rather than on the thickness of the prism.
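A small sketch of Snell's law using the refractive indices listed above; it also reports the case where no transmitted ray exists (total internal reflection), which is the underwater situation described in the text:

```python
import math

def refract_angle(theta1_deg, n1, n2):
    """Solve n1 sin(theta1) = n2 sin(theta2) for theta2 (degrees).
    Returns None when total internal reflection occurs (no transmitted ray)."""
    s = n1 * math.sin(math.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        return None  # the ray reflects from the surface instead
    return math.degrees(math.asin(s))

print(refract_angle(30.0, 1.000293, 1.523))  # air -> crown glass: bends toward the normal (~19.2 deg)
print(refract_angle(60.0, 1.33, 1.000293))   # water -> air at a steep angle: None (total internal reflection)
```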
To construct a lens, we will exploit this bending principle and construct a kind of curved version of Figure 4.10.

#### Simple convex lens

##### Figure 4.11: A simple convex lens causes parallel rays to converge at the focal point. The dashed line is the optical axis, which is perpendicular to the lens and pokes through its center.

![](https://i.imgur.com/pVYX6y6.png =500x)

Figure 4.11 shows a simple convex lens, which should remind you of the prisms in Figure 4.10. Instead of making a diamond shape, the lens surface is spherically curved so that incoming, parallel, horizontal rays of light converge to a point on the other side of the lens. This special place of convergence is called the focal point. Its distance from the lens center is called the focal depth or focal length.

The incoming rays in Figure 4.11 are special in two ways:

1. They are parallel, thereby corresponding to a source that is infinitely far away, and
2. they are perpendicular to the plane in which the lens is centered.

If the rays are parallel but not perpendicular to the lens plane, then the focal point shifts accordingly, as shown in Figure 4.12.

##### Figure 4.12: If the rays are not perpendicular to the lens, then the focal point is shifted away from the optical axis.

![](https://i.imgur.com/ABcvCbQ.png)

In this case, the focal point is not on the optical axis. There are two DOFs of incoming ray directions, leading to a focal plane that contains all of the focal points. Unfortunately, this planarity is just an approximation. In this idealized setting, a real image is formed in the image plane, as if it were a projection screen that is showing how the world looks in front of the lens (assuming everything in the world is very far away).

If the rays are not parallel, then it may still be possible to focus them into a real image, as shown in Figure 4.13. Suppose that a lens is given that has focal length f. If the light source is placed at distance s1 from the lens, then the rays from it will be in focus if and only if the following equation is satisfied (which is derived from Snell's law):

![](https://i.imgur.com/gMUqP9Q.png)

##### Figure 4.13: In the real world, an object is not infinitely far away. When placed at distance s1 from the lens, a real image forms in a focal plane at distance s2 > f behind the lens, as calculated using (4.6).

![](https://i.imgur.com/04fVbsX.png =550x)

Figure 4.11 corresponds to the idealized case in which s1 = ∞, for which solving (4.6) yields s2 = f.

What if the object being viewed is not completely flat and lying in a plane perpendicular to the lens? In this case, there does not exist a single plane behind the lens that would bring the entire object into focus. We must tolerate the fact that most of it will be only approximately in focus. Unfortunately, this is the situation almost always encountered in the real world, including the focus provided by our own eyes (see Section 4.4).

If the light source is placed too close to the lens, then the outgoing rays might be diverging so much that the lens cannot force them to converge. If s1 = f, then the outgoing rays would be parallel (s2 = ∞). If s1 < f, then (4.6) yields s2 < 0. In this case, a real image is not formed; however, something interesting happens: the phenomenon of magnification. A virtual image appears when looking into the lens, as shown in Figure 4.14. This is exactly what happens in the case of the View-Master and the VR headsets that were shown in Figure 2.11. The screen is placed so that it appears magnified. To the user looking through the lenses, it appears as if the screen is infinitely far away (and quite enormous!).
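Assuming (4.6) has the usual thin-lens form 1/s1 + 1/s2 = 1/f (the equation itself only appears as an image above), a short sketch reproduces the three cases discussed: a distant object, a real image, and the s1 < f magnification case used in VR headsets. The 50 mm focal length is an illustrative value.

```python
import math

def image_distance(s1, f):
    """Thin-lens relation 1/s1 + 1/s2 = 1/f, solved for s2 (same length units).
    s2 = inf -> outgoing rays are parallel (s1 == f)
    s2 < 0   -> no real image; a virtual image appears (magnification case)."""
    if math.isinf(s1):
        return f                      # parallel incoming rays focus at the focal point
    if s1 == f:
        return math.inf               # outgoing rays are parallel
    return 1.0 / (1.0 / f - 1.0 / s1)

print(image_distance(math.inf, 0.05))  # 0.05: a distant object focuses at f
print(image_distance(0.20, 0.05))      # ~0.067: real image behind the lens
print(image_distance(0.04, 0.05))      # -0.2: virtual image (VR headset / magnifier case)
```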
##### Figure 4.14: If the object is very close to the lens, then the lens cannot force its outgoing light rays to converge to a focal point. In this case, however, a virtual image appears and the lens works as a magnifying glass. This is the way lenses are commonly used for VR headsets.

![](https://i.imgur.com/d99fGxX.png =550x)

#### Concave lenses

For the sake of completeness, we include the case of a simple concave lens, shown in Figure 4.15. Parallel rays are forced to diverge, rather than converge; however, a meaningful notion of negative focal length exists by tracing the diverging rays backwards through the lens. The Lensmaker's Equation (4.7) can be slightly adapted to calculate negative f in this case [105].

##### Figure 4.15: In the case of a concave lens, parallel rays are forced to diverge. The rays can be extended backward through the lens to arrive at a focal point on the left side. The usual sign convention is that f < 0 for concave lenses.

![](https://i.imgur.com/st8M46Q.png)

#### Diopters

For optical systems used in VR, several lenses will be combined in succession. What is the effect of the combination? A convenient method to answer this question with simple arithmetic was invented by ophthalmologists. The idea is to define a diopter, which is D = 1/f. Thus, it is the reciprocal of the focal length. If a lens focuses parallel rays at a distance of 0.2 m behind the lens, then D = 5. A larger diopter D means greater converging power. Likewise, a concave lens yields D < 0, with a lower number implying greater divergence. To combine several nearby lenses in succession, we simply add their diopters to determine their equivalent power as a single, simple lens. Figure 4.16 shows a simple example.

![](https://i.imgur.com/YsvyjDb.png)

### 4.3 Optical Aberrations - lens behavior deviates from the ideal model, thereby degrading VR experiences

All of the aberrations of this section complicate the system or degrade the experience in a VR headset; therefore, substantial engineering effort is spent on mitigating these problems. If lenses in the real world behaved exactly as described in Section 4.2, then VR systems would be much simpler and more impressive than they are today. Unfortunately, numerous imperfections, called aberrations, degrade the images formed by lenses. Because these problems are perceptible in everyday uses, such as viewing content through VR headsets or images from cameras, they are important to understand so that some compensation for them can be designed into the VR system.

#### Chromatic aberration

![](https://i.imgur.com/sDqWdEg.png =600x)

Light energy is usually a jumble of waves with a spectrum of wavelengths. You have probably seen that the colors of the entire visible spectrum nicely separate when white light is shined through a prism. This is a beautiful phenomenon, but for lenses it is a terrible annoyance because it separates the focused image based on color. This problem is called chromatic aberration.

The problem is that the speed of light through a medium depends on the wavelength. We should therefore write a material's refractive index as n(λ) to indicate that it is a function of λ. Recall the spectral power distribution and reflection functions from Section 4.1. For common light sources and materials, the light passing through a lens results in a whole continuum of focal points.
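The continuum of focal points can be illustrated with the thin-lens lensmaker's relation 1/f = (n − 1)(1/R1 − 1/R2): because n depends on wavelength, so does f. The surface radii and wavelength-dependent indices below are illustrative crown-glass-like values, not data from the text.

```python
def focal_length(n, r1, r2):
    """Thin-lens lensmaker's relation: 1/f = (n - 1) * (1/r1 - 1/r2)."""
    return 1.0 / ((n - 1.0) * (1.0 / r1 - 1.0 / r2))

# Symmetric biconvex lens with |R| = 100 mm; illustrative crown-glass-like indices.
for color, n in [("blue (~486 nm)", 1.522), ("green (~587 nm)", 1.517), ("red (~656 nm)", 1.514)]:
    print(color, round(focal_length(n, 100.0, -100.0), 1), "mm")
# Blue focuses a couple of millimeters closer to the lens than red -> a continuum of focal points.
```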
Figure 4.18 shows an image with chromatic aberration artifacts. Chromatic aberration can be reduced at greater expense by combining convex and concave lenses of different materials so that the spreading rays are partly coerced into converging [304].

#### Spherical aberration

![](https://i.imgur.com/oN1JDJX.png)

Figure 4.19 shows spherical aberration, which is caused by rays further away from the lens center being refracted more than rays near the center. The result is similar to that of chromatic aberration, but this phenomenon is a monochromatic aberration because it is independent of the light wavelength. Incoming parallel rays are focused at varying depths, rather than being concentrated at a single point. The result is some blur that cannot be compensated for by moving the object, lens, or image plane. Alternatively, the image might instead focus onto a curved surface, called the Petzval surface, rather than the image plane. This aberration arises due to the spherical shape of the lens. An aspheric lens is more complex and has non-spherical surfaces that are designed to specifically eliminate the spherical aberration and reduce other aberrations.

#### Optical distortion

Even if the image itself projects onto the image plane, it might be distorted at the periphery. Assuming that the lens is radially symmetric, the distortion can be described as a stretching or compression of the image that becomes increasingly severe away from the optical axis. Figure 4.20 shows how this affects the image for two opposite cases: barrel distortion and pincushion distortion. For lenses that have a wide field of view, the distortion is stronger, especially in the extreme case of a fish-eye lens. Figure 4.21 shows an image that has strong barrel distortion. Correcting this distortion is crucial for current VR headsets that have a wide field of view; otherwise, the virtual world would appear to be warped.

![](https://i.imgur.com/nPS9S8V.png)

##### Figure 4.20: Common optical distortions. (a) Original images. (b) Barrel distortion. (c) Pincushion distortion. In the upper row, the grid becomes nonlinearly distorted. The lower row illustrates how circular symmetry is nevertheless maintained.

#### Astigmatism

![](https://i.imgur.com/lRxNf8U.png)

Figure 4.22 depicts astigmatism, which is a lens aberration that occurs for incoming rays that are not perpendicular to the lens. Up until now, our lens drawings have been 2D; however, a third dimension is needed to understand this new aberration. The rays can be off-axis in one dimension, but aligned in another. By moving the image plane along the optical axis, it becomes impossible to bring the image fully into focus; instead, separate horizontal and vertical focal depths appear, as shown in Figure 4.23.

![](https://i.imgur.com/crttRVL.png)

#### Coma and flare

Finally, coma is yet another aberration. In this case, the image magnification varies dramatically as the rays are far from perpendicular to the lens. The result is a "comet" pattern in the image plane. Another phenomenon is lens flare, in which rays from very bright light scatter through the lens and often show circular patterns. This is often seen in movies as the viewpoint passes by the sun or stars, and is sometimes added artificially.

### 4.4 Human eye - introduction of the human eye as an optical system of lenses

![](https://i.imgur.com/jwYQyvg.png)

Here the eye will be considered as part of an optical system of lenses and images. Figure 4.24 shows a cross section of the human eye facing left.
Parallel light rays are shown entering from the left; compare to Figure 4.11, which showed a similar situation for an engineered convex lens. Although the eye operation is similar to the engineered setting, several important differences arise at this stage. The focal plane is replaced by a spherically curved surface called the retina. The retina contains photoreceptors that convert the light into neural pulses. The interior of the eyeball is actually liquid, as opposed to air. The refractive indices of materials along the path from the outside air to the retina are shown in Figure 4.25.

##### Figure 4.25: A ray of light travels through five media before hitting the retina. The indices of refraction are indicated. Considering Snell's law, the greatest bending occurs due to the transition from air to the cornea. Note that once the ray enters the eye, it passes through only liquid or solid materials.

![](https://i.imgur.com/yaVHRvj.png)

![](https://i.imgur.com/2uSWyP0.png)

![](https://i.imgur.com/TvJCb0i.png)

#### The optical power of the eye

The outer diameter of the eyeball is roughly 24mm, which implies that a lens of at least 40D would be required to cause convergence of parallel rays onto the retina center inside of the eye (recall diopters from Section 4.2). There are effectively two convex lenses: the cornea and the lens. The cornea is the outermost part of the eye, where the light first enters, and has the greatest optical power, approximately 40D. The eye lens is less powerful and provides an additional 20D. By adding diopters, the combined power of the cornea and lens is 60D, which means that parallel rays are focused onto the retina at a distance of roughly 17mm from the outer cornea. Figure 4.26 shows how this system acts on parallel rays for a human with normal vision. Images of faraway objects are thereby focused onto the retina.

#### Accommodation

What happens when we want to focus on a nearby object, rather than one "infinitely far" away? Without any changes to the optical system, the image would be blurry on the retina, as shown in Figure 4.27. Fortunately, and miraculously, the lens changes its diopter to accommodate the closer distance. This process is appropriately called accommodation, as depicted in Figure 4.28.

#### Vision abnormalities

The situations presented so far represent normal vision throughout a person's lifetime. One problem could be that the optical system simply does not have enough optical power to converge parallel rays onto the retina. This condition is called hyperopia or farsightedness. Eyeglasses come to the rescue: the simple fix is to place a convex lens (positive diopter) in front of the eye, as in the case of reading glasses. In the opposite direction, some eyes have too much optical power. This case is called myopia or nearsightedness, and a concave lens (negative diopter) is placed in front of the eye to reduce the optical power appropriately.

#### A simple VR headset

Now suppose we are constructing a VR headset by placing a screen very close to the eyes. Young adults would already be unable to bring it into focus if it were closer than 10cm. We want to bring it close so that it fills the view of the user. Therefore, the optical power is increased by using a convex lens, functioning in the same way as reading glasses. See Figure 4.30. This is also the process of magnification. The lens is usually placed at the distance of its focal depth. The screen appears as an enormous virtual image that is infinitely far away. Note, however, that a real image is nevertheless projected onto the retina; we do not perceive the world around us unless real images are formed on our retinas.

To account for people with vision problems, a focusing knob may appear on the headset, which varies the distance between the lens and the screen. This adjusts the optical power so that the rays between the lens and the cornea are no longer parallel. They can be made to converge, which helps people with hyperopia; alternatively, they can be made to diverge, which helps people with myopia. Thus, users can focus sharply on the screen without placing their eyeglasses in front of the lens.

![](https://i.imgur.com/oZmAtY3.png)
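A sketch of what the focusing knob does, assuming the thin-lens relation: the vergence of the rays leaving the lens (in diopters) depends on how far the screen sits from the lens. The 40 mm focal length is an illustrative value only.

```python
def outgoing_vergence(screen_dist_m, f_m):
    """Vergence (in diopters) of rays leaving a headset lens when the screen
    sits screen_dist_m in front of a lens with focal length f_m.
    0  -> parallel rays, virtual image 'at infinity' (screen at the focal distance)
    <0 -> diverging rays (screen closer than f): helps myopia
    >0 -> converging rays (screen farther than f): helps hyperopia"""
    return 1.0 / f_m - 1.0 / screen_dist_m

f = 0.040  # 40 mm focal length lens (illustrative value)
print(outgoing_vergence(0.040, f))  #  0.0 D: screen appears infinitely far away
print(outgoing_vergence(0.038, f))  # ~-1.3 D: slightly diverging rays
print(outgoing_vergence(0.042, f))  # ~+1.2 D: slightly converging rays
```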
One important detail for a VR headset is that each lens should be centered perfectly in front of the cornea. If the distance between the two lenses is permanently fixed, then this is impossible to achieve for everyone who uses the headset. The interpupillary distance, or IPD, is the distance between human eye centers. The average among humans is around 64mm, but it varies greatly by race, gender, and age (in the case of children). To be able to center the lenses for everyone, the distance between lens centers should be adjustable from around 55 to 75mm. This is a common range for binoculars. Unfortunately, the situation is not even this simple because our eyes also rotate within their sockets, which changes the position and orientation of the cornea with respect to the lens.

Another important detail is the fidelity of our vision: What pixel density is needed for the screen that is placed in front of our eyes so that we do not notice the pixels? A similar question is how many dots per inch (DPI) are needed on a printed piece of paper so that we do not see the dots, even when viewed under a magnifying glass?

### Section 4.5 Cameras - which can be considered as engineered eyes

#### Shutters

Several practical issues arise when capturing digital images. The image is a 2D array of pixels, each of which has red (R), green (G), and blue (B) values that typically range from 0 to 255. Consider the total amount of light energy that hits the image plane. For a higher-resolution camera, there will generally be fewer photons per pixel because the pixels are smaller. Each sensing element (one per color per pixel) can be imagined as a bucket that collects photons, much like drops of rain. To control the number of photons, a shutter blocks all the light, opens for a fixed interval of time, and then closes again. For a long interval (low shutter speed), more light is collected; however, the drawbacks are that moving objects in the scene will become blurry and that the sensing elements could become saturated with too much light. Photographers must strike a balance when determining the shutter speed to account for the amount of light in the scene, the sensitivity of the sensing elements, and the motion of the camera and objects in the scene.

Also relating to shutters, CMOS sensors unfortunately work by sending out the image information sequentially, line by line. The sensor is therefore coupled with a rolling shutter, which allows light to enter for each line just before the information is sent. This means that the capture is not synchronized over the entire image, which leads to odd artifacts, such as the one shown in Figure 4.33. Image processing algorithms that work with rolling shutters and motion typically transform the image to correct for this problem.
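A toy calculation of the rolling-shutter artifact: because each row is exposed slightly later than the previous one, a horizontally moving object is skewed across the frame. The readout time and object speed below are illustrative numbers, not values from the text.

```python
def rolling_shutter_skew(rows, frame_readout_s, object_speed_px_s):
    """Horizontal shift (in pixels) of a vertically oriented moving edge between
    the first and last row of a rolling-shutter image: each row is captured
    frame_readout_s / rows seconds later than the previous one."""
    row_time = frame_readout_s / rows
    return object_speed_px_s * row_time * (rows - 1)

# Illustrative numbers: 1080 rows read out over 20 ms, object moving at 2000 px/s.
print(rolling_shutter_skew(1080, 0.020, 2000.0))  # ~40 px of skew across the frame
```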
CCD sensors grab and send the entire image at once, resulting in a global shutter. CCDs have historically been more expensive than CMOS sensors, which resulted in the widespread appearance of rolling shutter cameras in smartphones; however, the cost of global shutter cameras is rapidly decreasing.

![](https://i.imgur.com/KPGR8BE.png)

#### Aperture

The optical system also impacts the amount of light that arrives at the sensor. Using a pinhole, as shown in Figure 4.31, light would fall onto the image sensor, but it would not be bright enough for most purposes (other than viewing a solar eclipse). Therefore, a convex lens is used instead so that multiple rays are converged to the same point in the image plane; recall Figure 4.11. This generates more photons per sensing element. The main drawback is that the lens sharply focuses objects at a single depth, while blurring others; recall (4.6). In the pinhole case, all depths are essentially "in focus", but there might not be enough light. Photographers therefore want to tune the optical system to behave more like a pinhole or more like a full lens, depending on the desired outcome.

![](https://i.imgur.com/QjG9WRZ.png)

The result is a controllable aperture (Figure 4.34), which appears behind the lens and sets the size of the hole through which the light rays enter. A small radius mimics a pinhole by blocking all but the center of the lens. A large radius allows light to pass through the entire lens. Our eyes control the light levels in a similar manner by contracting or dilating our pupils. Finally, note that the larger the aperture, the more the aberrations interfere with the imaging process.

### 4.6 Displays - visual display technologies, which emit light that is intended for consumption by the human eyes

#### Cathode ray tubes - CRT

The most important technological leap was the cathode ray tube, or CRT, which gave birth to electronic displays, launched the era of television broadcasting, and helped shape many concepts and terms that persist in modern displays today. Figure 4.35 shows the basic principles.

![](https://i.imgur.com/33dWNzE.png)

The CRT enabled videos to be rendered to a screen, frame by frame. Each frame was scanned out line by line due to the physical limitations of the hardware. The scanning needed to repeat frequently, known as refreshing, because each lit phosphor element would persist for less than a millisecond. The scanout behavior and timing remain today for modern smartphone displays because of memory and computation architectures, but they are not ideal for VR usage.

The next major advance was to enable each picture element, or pixel, to be directly and persistently lit. Various technologies have been used to produce flat-panel displays, the output of which is illustrated in Figure 4.36.

![](https://i.imgur.com/1OBUB51.png)

Liquid crystal displays (LCDs) became widely available in calculators in the 1970s and progressed into larger, colorful screens by the 1990s. The liquid crystals themselves do not emit light; most commonly, a backlight shines from behind to illuminate the whole screen. Currently, the vast majority of flat-panel displays are based on either LCDs or light-emitting diodes (LEDs). In the case of LEDs, each pixel is able to be directly lit. The consumer market for flat-panel displays was first driven by the need for flat, big-screen televisions and computer monitors.
With the advancement of smartphones, miniaturized versions of these displays have become available at low cost, with low power consumption and extremely high resolution. This enabled low-cost VR headset solutions by putting a lens in front of a smartphone screen.

#### Toward custom VR displays

The first step toward thinking about displays for VR is to consider the distance from the eyes. If it is meant to be viewed from far away, then it is called a naked-eye display. For a person with normal vision (or while wearing prescription glasses), the display should appear sharp without any additional help. If it is close enough that lenses are needed to bring it into focus, then it is called a near-eye display. This is the common case in current VR headsets because the display needs to be placed very close to the eyes. It remains an active area of research to develop better near-eye display technologies, with a key challenge being whether the solutions are manufacturable on a large scale.

An important family of near-eye displays is based on a microdisplay and a waveguide. The microdisplay is typically based on liquid crystal on silicon (LCoS), which is a critical component in overhead projectors; microdisplays based on organic LEDs (OLEDs) are also gaining popularity. The size of the microdisplay is typically a few millimeters, and its emitted light is transported to the eyes through the use of a reflective structure called a waveguide; see Figure 4.37. The Microsoft HoloLens, Google Glass, and Magic Leap One are some well-known devices that were based on waveguides. The current engineering challenges are limited field of view, overall weight, difficult or costly manufacturing, and power loss and picture degradation as the waves travel through the waveguide.

A promising device for future VR display technologies is the virtual retinal display [340]. It works by a scanning beam principle similar to the CRT, but instead draws the image directly onto the human retina. A low-power laser can be pointed at a micromirror that can be rapidly rotated so that full images are quickly drawn onto the retina. Current engineering challenges are eye safety (do not shine an ordinary laser into your eyes!), mirror rotation frequency, and expanding the so-called eye box so that the images are drawn onto the retina regardless of where the eye is rotated.

To maximize human comfort, a display should ideally reproduce the conditions that occur from the propagation of light in a natural environment, which would allow the eyes to focus on objects at various distances in the usual way. The previously mentioned displays are known to cause vergence-accommodation mismatch, which causes discomfort to human viewers. For this reason, researchers are actively prototyping displays that overcome this limitation. Two categories of research are light-field displays [75, 164, 201] and varifocal displays [4, 50, 129, 190, 209].

### Chapter 5 - The Physiology of Human Vision

What you perceive about the world around you is "all in your head". The light around us forms images on our retinas that capture colors, motions, and spatial relationships in the physical world. For someone with normal vision, these captured images may appear to have perfect clarity, speed, accuracy, and resolution, while being distributed over a large field of view.
This apparent perfection of our vision is mostly an illusion because neural structures are filling in plausible details to generate a coherent picture in our heads that is consistent with our life experiences. When building VR technology that co-opts these processes, it is important to understand how they work. They were designed to do more with less, and fooling these processes with VR produces many unexpected side effects because the display technology is not a perfect replica of the surrounding world.

### Section 5.1 From the Cornea to Photoreceptors - adds the anatomy of the human eye to the optical system. Most of the section is on photoreceptors, which are the "input pixels" that get paired with the "output pixels" of a digital display for VR.

#### Parts of the eye

Figure 5.1 shows the physiology of a human eye.

![](https://i.imgur.com/9jVd6vn.png)

The shape is approximately spherical, with a diameter of around 24mm and only slight variation among people. The cornea is a hard, transparent surface through which light enters; it provides the greatest optical power. The rest of the outer surface of the eye is protected by a hard, white layer called the sclera. Most of the eye interior consists of vitreous humor, which is a transparent, gelatinous mass that allows light rays to penetrate with little distortion or attenuation.

As light rays cross the cornea, they pass through a small chamber containing aqueous humor, which is another transparent, gelatinous mass. After crossing this, rays enter the lens by passing through the pupil. The size of the pupil is controlled by a disc-shaped structure called the iris, which provides an aperture that regulates the amount of light that is allowed to pass. The optical power of the lens is altered by ciliary muscles. After passing through the lens, rays pass through the vitreous humor and strike the retina, which lines more than 180° of the inner eye boundary. Since Figure 5.1 shows a 2D cross section, the retina is shaped like an arc; however, keep in mind that it is a 2D surface. Imagine it as a curved counterpart to a visual display. To catch the light from the output pixels, it is lined with photoreceptors, which behave like "input pixels". The most important part of the retina is the fovea; the highest visual acuity, which is a measure of the sharpness or clarity of vision, is provided for rays that land on it. The optic disc is a small hole in the retina through which neural pulses are transmitted outside of the eye through the optic nerve. It is on the same side of the fovea as the nose.

#### Photoreceptors

The retina contains two kinds of photoreceptors for vision: 1) rods, which are triggered by very low levels of light, and 2) cones, which require more light and are designed to distinguish between colors. See Figure 5.2.

![](https://i.imgur.com/07ASE8L.png)

To understand the scale, the width of the smallest cones is around 1000nm. This is quite close to the wavelength of visible light, implying that photoreceptors need not be much smaller. Each human retina contains about 120 million rods and 6 million cones that are densely packed along the retina. Figure 5.3 shows the detection capabilities of each photoreceptor type. Rod sensitivity peaks at 498nm, between blue and green in the spectrum.

![](https://i.imgur.com/DKmjVq9.png)

Three categories of cones exist, based on whether they are designed to sense blue, green, or red light. Photoreceptors respond to light levels over a large dynamic range. Figure 5.4 shows several familiar examples.
![](https://i.imgur.com/CeK5VhN.png)

The luminance is measured in SI units of candelas per square meter, which corresponds directly to the amount of light power per area. The range spans seven orders of magnitude, from 1 photon hitting a photoreceptor every 100 seconds up to 100,000 photons per receptor per second.

At low light levels, only rods are triggered. Our inability to distinguish colors at night is caused by the inability of rods to distinguish colors. Our eyes may take up to 35 minutes to fully adapt to low light, resulting in a monochromatic mode called scotopic vision. By contrast, our cones become active in brighter light. Adaptation to this trichromatic mode, called photopic vision, may take up to ten minutes (you have undoubtedly noticed the adjustment period when someone unexpectedly turns on lights while you are lying in bed at night).

#### Photoreceptor density

The density of photoreceptors across the retina varies greatly, as plotted in Figure 5.5.

![](https://i.imgur.com/hnwV5ON.png)

The most interesting region is the fovea, which has the greatest concentration of photoreceptors. The innermost part of the fovea has a diameter of only 0.5mm, or an angular range of ±0.85 degrees, and contains almost entirely cones. This implies that the eye must be pointed straight at a target to perceive a sharp, colored image. The entire fovea has a diameter of 1.5mm (±2.6 degrees angular range), with the outer ring having a dominant concentration of rods.

Rays that enter the cornea from the sides land on parts of the retina with lower rod density and very low cone density. This corresponds to the case of peripheral vision. We are much better at detecting movement in our periphery, but cannot distinguish colors effectively. Peripheral movement detection may have helped keep our ancestors from being eaten by predators.

Finally, the most intriguing part of the plot is the blind spot, where there are no photoreceptors. This is due to our retinas being inside-out and having no other way to route the neural signals to the brain.

With 20/20 vision, we perceive the world as if our eyes are capturing a sharp, colorful image over a huge angular range. This seems impossible, however, because we can only sense sharp, colored images in a narrow range. Furthermore, the blind spot should place a black hole in our image. Surprisingly, our perceptual processes produce the illusion that a complete image is being captured. This is accomplished by filling in the missing details using contextual information and by frequent eye movements.

![](https://i.imgur.com/aBF4hs3.png)

### Section 5.2 From Photoreceptors to the Visual Cortex - introduces the neuroscience by explaining what is known about the visual information that hierarchically propagates from the photoreceptors up to the visual cortex.

Photoreceptors are transducers that convert the light-energy stimulus into an electrical signal called a neural impulse, thereby inserting information about the outside world into our neural structures. Signals are propagated upward in a hierarchical manner, from photoreceptors to the visual cortex (Figure 2.19). Think about the influence that each photoreceptor has on the network of neurons. Figure 5.7 shows a simplified model.

![](https://i.imgur.com/cxXoV0X.png)

As the levels increase, the number of influenced neurons grows rapidly. Figure 5.8 shows the same diagram, but highlighted in a different way by showing how the number of photoreceptors that influence a single neuron increases with level.
![](https://i.imgur.com/ATMvyNp.png)

Neurons at the lowest levels are able to make simple comparisons of signals from neighboring photoreceptors. As the levels increase, the neurons may respond to a larger patch of the retinal image. Eventually, when signals reach the highest levels (beyond these figures), information from the memory of a lifetime of experiences is fused with the information that propagated up from photoreceptors. As the brain performs significant processing, a perceptual phenomenon results, such as recognizing a face or judging the size of a tree. It takes the brain over 100ms to produce a result that enters our consciousness.

Now consider the first layers of neurons in more detail, as shown in Figure 5.9.

![](https://i.imgur.com/GfGrMpc.png)

The information is sent from right to left, passing from the rods and cones to the bipolar, amacrine, and horizontal cells. These three types of cells are in the inner nuclear layer. From there, the signals reach the ganglion cells, which form the ganglion cell layer. Note that the light appears to be entering from the wrong direction: it passes over these neural cells before reaching the photoreceptors. This is due to the fact that the human retina is inside-out, as shown in Figure 5.10.

![](https://i.imgur.com/0T3pKrq.png)

One consequence of an inside-out retina is that the axons of the ganglion cells cannot be directly connected to the optic nerve (item 3 in Figure 5.10), which sends the signals outside of the eye. Therefore, a hole has been punctured in our retinas so that the "cables" from the ganglion cells can be routed outside of the eye (item 4 in Figure 5.10). This causes the blind spot.

Upon studying Figure 5.9 closely, it becomes clear that the neural cells are not arranged in the ideal way of Figure 5.8. The bipolar cells transmit signals from the photoreceptors to the ganglion cells. Some bipolars connect only to cones, with between 1 and 10 cones per bipolar. Others connect only to rods, with about 30 to 50 rods per bipolar. There are two types of bipolar cells based on their function. An ON bipolar activates when the rate of photon absorption in its connected photoreceptors increases. An OFF bipolar activates for decreasing photon absorption. The bipolars connected to cones have both kinds; however, the bipolars for rods have only ON bipolars. The bipolar connections are considered to be vertical because they connect directly from photoreceptors to the ganglion cells.

This is in contrast to the remaining two cell types in the inner nuclear layer. The horizontal cells are connected by inputs (dendrites) to photoreceptors and bipolar cells within a radius of up to 1mm. Their output (axon) is fed into photoreceptors, causing lateral inhibition, which means that the activation of one photoreceptor tends to decrease the activation of its neighbors. Finally, amacrine cells connect horizontally between bipolar cells, other amacrine cells, and vertically to ganglion cells. There are dozens of types, and their function is not well understood. Thus, scientists do not have a complete understanding of human vision, even at the lowest layers. Nevertheless, the well-understood parts contribute greatly to our ability to design effective VR systems and predict other human responses to visual stimuli.

At the ganglion cell layer, several kinds of cells process portions of the retinal image. Each ganglion cell has a large receptive field, which corresponds to the photoreceptors that contribute to its activation, as shown in Figure 5.8. The three most common and well understood types of ganglion cells are called midget, parasol, and bistratified. They perform simple filtering operations over their receptive fields based on spatial, temporal, and spectral (color) variations in the stimulus across the photoreceptors. Figure 5.11 shows one example.

![](https://i.imgur.com/icXqAE0.png)

In this case, a ganglion cell is triggered when red is detected in the center but not green in the surrounding area. This condition is an example of spatial opponency, for which neural structures are designed to detect local image variations. Thus, consider ganglion cells as tiny image processing units that can pick out local changes in time, space, and/or color. They can detect and emphasize simple image features such as edges.
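As a toy caricature of the spatial opponency just described, the sketch below applies a 1-D "center minus surround" filter (similar in spirit to the difference-of-Gaussians models used in vision science) to a step edge: uniform regions produce little response, while the edge stands out. This is only an illustration, not a physiological model.

```python
def center_surround(signal, center_w=1.0, surround_w=0.25):
    """Toy 1-D 'ON-center' filter: response = center sample minus the average of
    the four nearest surround samples. Uniform regions give ~0; edges respond strongly."""
    out = []
    for i in range(2, len(signal) - 2):
        surround = (signal[i-2] + signal[i-1] + signal[i+1] + signal[i+2]) * surround_w
        out.append(center_w * signal[i] - surround)
    return out

step_edge = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # dark region next to a bright region
print(center_surround(step_edge))            # near zero except around the edge
```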
Once the ganglion axons leave the eye through the optic nerve, a significant amount of image processing has already been performed to aid in visual perception. The raw image based purely on photons hitting the photoreceptors never leaves the eye.

The optic nerve connects to a part of the thalamus called the lateral geniculate nucleus (LGN); see Figure 5.12.

![](https://i.imgur.com/hQhFSYu.png)

The LGN mainly serves as a router that sends signals from the senses to the brain, but it also performs some processing. The LGN sends image information to the primary visual cortex (V1), which is located at the back of the brain. The visual cortex, highlighted in Figure 5.13, contains several interconnected areas that each perform specialized functions.

![](https://i.imgur.com/ecgXryQ.png)

Figure 5.14 shows one well-studied operation performed by the visual cortex. Visual perception is the conscious result of processing in the visual cortex, based on neural circuitry, stimulation of the retinas, information from other senses, and expectations based on prior experiences.

![](https://i.imgur.com/C5qdYo7.png)

### Section 5.3 Eye Movements - how our eyes move, which serves a good purpose, but incessantly interferes with the images on our retinas.

Eye rotations are a complicated and integral part of human vision. They occur both voluntarily and involuntarily, and allow a person to fixate on features in the world, even as their head or the target features are moving.

Reasons for eye movement:

* To position the feature of interest on the fovea.
    * Only the fovea can sense dense, color images, and it unfortunately spans a very narrow field of view.
    * To gain a coherent, detailed view of a large object, the eyes rapidly scan over it while fixating on points of interest. Figure 5.15 shows an example.

![](https://i.imgur.com/1RxqHfQ.png)

* Our photoreceptors are slow to respond to stimuli due to their chemical nature (they take up to 10ms to fully respond to a stimulus and produce a response for up to 100ms).
    * Eye movements help keep the image fixed on the same set of photoreceptors so that they can fully charge.
    * This is similar to the image blurring problem that occurs in cameras at low light levels and slow shutter speeds.
* To maintain a stereoscopic view.
* To prevent adaptation to a constant stimulation.
    * It has been shown experimentally that when eye motions are completely suppressed, visual perception disappears completely [118].

As movements combine to build a coherent view, it is difficult for scientists to predict and explain how people interpret some stimuli.
For example, the optical illusion in Figure 5.16 appears to be moving when our eyes scan over it.

![](https://i.imgur.com/xLpRZEv.png)

#### Eye muscles

The rotation of each eye is controlled by six muscles that are each attached to the sclera (the outer eyeball surface) by a tendon. Figures 5.17 and 5.18 show their names and arrangement.

![](https://i.imgur.com/hfH5MGh.png)
![](https://i.imgur.com/bCK5ygG.png)

The tendons pull on the eye in opposing pairs:

* To perform a yaw (side-to-side) rotation, the tensions on the medial rectus and lateral rectus are varied while the other muscles are largely unaffected.
* To cause a pitch motion, four muscles per eye become involved.
* All six are involved to perform both a pitch and a yaw, for example when looking upward and to the right.

A small amount of roll can be generated; however, our eyes are generally not designed for much roll motion. Thus, it is reasonable in most cases to approximate eye rotations as a 2D set that includes only yaw and pitch, rather than the full 3 DOFs obtained for rigid-body rotations.

#### Types of movements

We now consider movements based on their purpose, resulting in six categories: 1) saccades, 2) smooth pursuit, 3) vestibulo-ocular reflex, 4) optokinetic reflex, 5) vergence, and 6) microsaccades. All of these motions cause both eyes to rotate approximately the same way, except for vergence, which causes the eyes to rotate in opposite directions. We will skip a seventh category of motion, rapid eye movements (REMs), because they only occur while we are sleeping and therefore do not contribute to a VR experience.

* Saccades:
    * A saccade lasts less than 45 ms, with rotations of about 900° per second.
    * The purpose is to quickly relocate the fovea so that important features in a scene are sensed with the highest visual acuity.
    * Figure 5.15 showed an example in which a face is scanned by fixating on various features in rapid succession. Each transition between features is accomplished by a saccade.
    * The result of saccades is that we obtain the illusion of high acuity over a large angular range.
    * Saccades frequently occur with little or no awareness on our part, but we can consciously control them as we choose features for fixation.
* Smooth pursuit:
    * The eye slowly rotates to track a moving target feature (a tennis ball, or a person walking by).
    * The rate of rotation is usually less than 30° per second.
    * The main function of smooth pursuit is to reduce motion blur on the retina; this is also known as image stabilization.
* Vestibulo-ocular reflex (VOR):
    * One of the most important motions to understand for VR.
    * Hold your finger at a comfortable distance in front of your face and fixate on it. Next, yaw your head back and forth (as if shaking your head “no”), turning about 20 or 30 degrees to the left and right each time. You may notice that your eyes effortlessly rotate to counteract the rotation of your head so that your finger remains in view.
    * The eye motion is involuntary. It is called a reflex because the motion control bypasses higher brain functions.
    * Figure 5.19 shows how this circuitry works: based on angular accelerations sensed by the vestibular organs, signals are sent to the eye muscles to provide the appropriate counter motion.
    * The main purpose of the VOR is to provide image stabilization, as in the case of smooth pursuit.

![](https://i.imgur.com/z85AcAp.png)

* Optokinetic reflex:
    * Occurs when a fast-moving object passes by.
    * This occurs, for example, when watching a fast-moving train while standing nearby on fixed ground.
    * The eyes rapidly and involuntarily choose features for tracking on the object, while alternating between smooth pursuit and saccade motions.
* Vergence:
    * Stereopsis refers to the case in which both eyes are fixated on the same object, resulting in a single perceived image.
    * Two kinds of vergence motions occur to align the eyes with an object; see Figure 5.20.
    * If the object is closer than a previous fixation, then a convergence motion occurs, meaning that the eyes rotate so that the pupils move closer together.
    * If the object is farther away, then a divergence motion occurs, which causes the pupils to move farther apart.
    * The eye orientations resulting from vergence motions provide important information about the distance of objects.
* Microsaccades:
    * Small, involuntary jerks of less than one degree that trace out an erratic path.
    * They are believed to augment many other processes, including control of fixations, reduction of perceptual fading due to adaptation, improvement of visual acuity, and resolving perceptual ambiguities [274].
    * Their behavior is extremely complex and not fully understood.

![](https://i.imgur.com/S69jD9Q.png)

#### Eye and head movements together

Most of the time the eyes and head are moving together. Figure 5.21 shows the angular range for yaw rotations of the head and eyes.

![](https://i.imgur.com/KvkB42Z.png)

Although eye yaw is symmetric, allowing 35° to the left or right, eye pitch is not: human eyes can pitch 20° upward and 25° downward, which suggests that it might be optimal to center a VR display slightly below the pupils when the eyes are looking directly forward. In the case of the VOR, eye rotation is controlled to counteract head motion. In the case of smooth pursuit, the head and eyes may move together to keep a moving target in the preferred viewing area.

### Section 5.4 Implications for VR - applying the knowledge gained about visual physiology to determine VR display requirements, such as screen resolution

Basic physiological properties, such as photoreceptor density or VOR circuitry, directly impact the engineering requirements for visual display hardware. The engineered systems must be good enough to adequately fool our senses, but they need not have levels of quality that are well beyond the limits of our receptors. Thus, the VR display should ideally be designed to perfectly match the performance of the sense it is trying to fool.

#### How good does the VR visual display need to be?

Three crucial factors for the display are:

1. **Spatial resolution: How many pixels per square area are needed?**
2. **Intensity resolution and range: How many intensity values can be produced, and what are the minimum and maximum intensity values?**
    * This could also be called color resolution and range, because the intensity values of each red, green, or blue subpixel produce points in the space of colors.
3. **Temporal resolution: How fast do displays need to change their pixels?**

Photoreceptors can span seven orders of magnitude of light intensity. However, displays offer only 256 intensity levels per color (about 2.4 orders of magnitude) to cover this range. Entering scotopic vision mode does not even seem possible with current display technology because of the high intensity resolution needed at extremely low light levels.

#### How much pixel density is enough?

Insights into the required spatial resolution are obtained from the photoreceptor densities.
We see individual lights when a display is highly magnified. As it is zoomed out, we may still perceive sharp diagonal lines as being jagged, as shown in Figure 5.22(a); this phenomenon is known as aliasing. Another artifact is the screen-door effect, shown in Figure 5.22(b); this is commonly noticed in an image produced by a digital LCD projector.

![](https://i.imgur.com/z4014Zw.png)

What does the display pixel density need to be so that we do not perceive individual pixels? Steve Jobs claimed that 326 pixels per linear inch (PPI) is enough, achieving what Apple called a retina display.

* Is this reasonable, and how does it relate to VR?

Assume that the fovea is pointed directly at the display to provide the best sensing possible. The first issue is that red, green, and blue cones are arranged in a mosaic, as shown in Figure 5.23.

![](https://i.imgur.com/HZLX8eH.png)

Vision scientists and neurobiologists have studied the effective or perceived input resolution through measures of visual acuity [142]. Subjects in a study are usually asked to indicate whether they can detect or recognize a particular target. In the case of detection, for example, scientists might like to know the smallest dot that can be perceived when printed onto a surface. In terms of displays, a similar question is:

* How small do pixels need to be so that a single white pixel against a black background is not detectable?

In the case of recognition, a familiar example is attempting to read an eye chart, which displays arbitrary letters of various sizes. In terms of displays, this could correspond to trying to read text rendered at various sizes, resolutions, and fonts. Many factors contribute to acuity tasks, such as brightness, contrast, eye movements, exposure time, and the part of the retina that is stimulated.

One of the most widely used concepts is cycles per degree, which roughly corresponds to the number of stripes (or sinusoidal peaks) that can be seen as separate along a viewing arc; see Figure 5.24. The Snellen eye chart, which is widely used by optometrists, is designed so that patients attempt to recognize printed letters from 20 feet away (or 6 meters). A person with “normal” 20/20 (or 6/6 in metric) vision is expected to barely make out the horizontal stripes in the letter “E” shown in Figure 5.24.

![](https://i.imgur.com/RmsMf3x.png)

This assumes the patient is looking directly at the letters, using the photoreceptors in the central fovea. The 20/20 line on the chart is designed so that the letter height corresponds to 30 cycles per degree when the eye is 20 feet away. The total height of the “E” is 1/12 of a degree; note that each stripe is half of a cycle. What happens if the subject stands only 10 feet away from the eye chart? The letters should appear roughly twice as large.

Using simple trigonometry, s = d tan θ (5.1), we can determine what the size s of a feature should be for a viewing angle θ at a distance d from the eye. For very small θ, tan θ ≈ θ (in radians). For the example of the eye chart, s could correspond to the height of a letter. Doubling the distance d and the size s keeps θ roughly fixed, which corresponds to keeping the size of the image on the retina fixed.

We now return to the retina display concept. Suppose that a person with 20/20 vision is viewing a large screen that is 20 feet (6.096 m) away. To generate 30 cycles per degree, the screen must have at least 60 pixels per degree. Using (5.1), the size of one degree would be s = 20 ∗ tan 1° = 0.349 ft, which is equivalent to 4.189 in.
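To make the trigonometry concrete, here is a small Python helper (an illustrative sketch, not part of the lecture notes) that applies s = d tan θ and reproduces the pixel-density numbers derived in this and the following paragraph, assuming 60 pixels per degree as the 20/20 target.

```python
import math

def feature_size(distance, degrees=1.0):
    """Size s = d * tan(theta) subtended by `degrees` of visual angle at distance d (same units as d)."""
    return distance * math.tan(math.radians(degrees))

def required_ppi(distance_in, pixels_per_degree=60):
    """Pixels per inch needed so that one degree of view spans `pixels_per_degree` pixels."""
    return pixels_per_degree / feature_size(distance_in)

print(f"{required_ppi(20 * 12):7.1f} PPI  # large screen 20 ft away")
print(f"{required_ppi(12):7.1f} PPI  # smartphone held 12 in from the eye")
print(f"{required_ppi(1.5):7.1f} PPI  # VR screen 1.5 in behind the lens")
```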
Thus, only 60/4.189 = 14.32 PPI would be sufficient for the distant screen. Now suppose that a smartphone screen is placed 12 inches from the user’s eye. In this case, s = 12 ∗ tan 1° = 0.209 in. This requires that the screen have at least 60/0.209 = 286.4 PPI, which was satisfied by the 326 PPI originally claimed by Apple.

In the case of VR, the user is not looking directly at the screen as in the case of smartphones. By inserting a lens for magnification, the display can be brought even closer to the eye. This is commonly done for VR headsets. Suppose that the lens is positioned at its focal distance away from the screen, which for the sake of example is only 1.5 in (comparable to current VR headsets). In this case, s = 1.5 ∗ tan 1° = 0.0261 in, and the display must have at least 2291.6 PPI to achieve 60 pixels per degree (30 cycles per degree)! One of the highest-density smartphone displays available today is in the Sony Xperia Z5 Premium. It has only 801 PPI, which means that the PPI needs to increase by roughly a factor of three to obtain retina display resolution for VR headsets.

This is not the complete story, because some people, particularly youths, have better than 20/20 vision. The limits of visual acuity have been established to be around 60 to 77 cycles per degree, based on photoreceptor density and neural processes [38, 52]; however, this is based on shining a laser directly onto the retina, which bypasses many optical aberration problems as the light passes through the eye. A small number of people (perhaps one percent) have acuity up to 60 cycles per degree. In this extreme case, the display density would need to be 4583 PPI. Thus, many factors are involved in determining a sufficient resolution for VR. It suffices to say that the resolutions in today’s consumer VR headsets are inadequate, and retina display resolution will not be achieved until the PPI is several times higher.

#### How much field of view is enough?

What if the screen is brought even closer to the eye to fill more of the field of view? Based on the photoreceptor density plot in Figure 5.5 and the limits of eye rotations shown in Figure 5.21, the maximum field of view seems to be around 270°, which is larger than what could be provided by a flat screen (less than 180°). Increasing the field of view by bringing the screen closer would require even higher pixel density, but lens aberrations (Section 4.3) at the periphery may limit the effective field of view. Furthermore, if the lens is too thick and too close to the eye, then the eyelashes may scrape it; Fresnel lenses provide a thin alternative, but introduce artifacts. Thus, the quest for a VR retina display may end with a balance between optical system quality and the limitations of the human eye. Curved screens may help alleviate some of these problems.

#### Foveated rendering

One frustration with this analysis is that we have not been able to exploit the fact that photoreceptor density decreases away from the fovea. We had to keep the pixel density high everywhere because we have no control over which part of the display the user will be looking at. If we could track where the eye is looking and have a tiny, movable display that is always positioned in front of the pupil, with zero delay, then far fewer pixels would be needed. This would greatly decrease the computational burden on graphical rendering systems. Instead of moving a tiny screen, the process can be simulated by keeping the display fixed but concentrating the graphical rendering in the spot where the eye is looking, as in the sketch below.
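The following toy sketch (an illustration under assumptions, not an actual renderer) assigns a coarser level of detail to pixels farther from a hypothetical gaze point; the panel size, gaze position, ring width, and detail levels are all made up for the example.

```python
import numpy as np

def foveation_levels(height, width, gaze_rc, fovea_px=64, levels=(1, 2, 4, 8)):
    """Assign a downsampling factor to every pixel based on distance from the gaze point.

    Illustrative only: real foveated renderers typically shade whole tiles at a
    reduced rate, but the idea is the same -- full resolution near the gaze point,
    progressively coarser rendering farther out.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    dist = np.hypot(rows - gaze_rc[0], cols - gaze_rc[1])
    # Each ring of width `fovea_px` gets the next (coarser) level.
    ring = np.minimum(dist // fovea_px, len(levels) - 1).astype(int)
    return np.take(levels, ring)

factors = foveation_levels(1080, 1200, gaze_rc=(540, 600))
full_res_fraction = np.mean(factors == 1)
print(f"pixels shaded at full resolution: {full_res_fraction:.1%}")
```

With these made-up parameters, only about one percent of the pixels end up at full resolution, which hints at the potential savings.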
This technique is called foveated rendering. It has been shown to work [106], but it is currently too costly, and there is still too much delay and too many other discrepancies between the eye movements and the display updates. In the near future, it may become an effective approach for the mass market.

#### VOR gain adaptation

The VOR gain is a ratio that compares the eye rotation rate (numerator), which counters head motion, to the rotation and translation rate of the head (denominator). Because head motion has six DOFs, it is appropriate to break the gain into six components. In the case of head pitch and yaw, the VOR gain is close to 1.0. For example, if you yaw your head to the left at 10° per second, then your eyes yaw at 10° per second in the opposite direction. The VOR roll gain is very small because the eyes have a tiny roll range, and the VOR translational gain depends on the distance to the features.

Adaptation is a universal feature of our sensory systems, and VOR gain is no exception. For those who wear eyeglasses, the VOR gain must adapt because of the optical transformations: lenses affect the field of view and the perceived size and distance of objects. The VOR comfortably adapts to this by changing its gain. Now suppose that you are wearing a VR headset that suffers from flaws such as an imperfect optical system, tracking latency, and incorrectly rendered objects on the screen. Adaptation may then occur as the brain tries to maintain its perception of stationarity in spite of the flaws: your visual system could convince your brain that the headset is functioning correctly, and your perception of stationarity in the real world would then be distorted until you readapt. For example, after a flawed VR experience, you might yaw your head in the real world and have the sensation that truly stationary objects are sliding back and forth!

#### Display scanout

Cameras have either a rolling or a global shutter, depending on whether the sensing elements are scanned line by line or in parallel. Displays work the same way, but whereas cameras are input devices, displays are their output counterpart. Most displays today have a rolling scanout (called raster scan) rather than a global scanout. This means that the pixels are updated line by line, as shown in Figure 5.25.

![](https://i.imgur.com/c9eUNnH.png)

This procedure is an artifact of old TV sets and monitors, which each had a cathode ray tube (CRT) with phosphor elements on the screen. An electron beam was bent by electromagnets so that it would repeatedly strike and refresh the glowing phosphors.

Due to the slow charge and response time of photoreceptors, we do not perceive the scanout pattern during normal use. However, when our eyes, features in the scene, or both are moving, side effects of the rolling scanout may become perceptible. Think about the operation of a line-by-line printer, as in the case of a receipt printer on a cash register. If we pull on the tape while it is printing, then the lines become stretched apart. If the printer is unable to print a single line at once, then the lines themselves become slanted. If we pull the tape to the side while it is printing, then the entire page becomes slanted. You can also achieve this effect by repeatedly drawing a horizontal line with a pencil while using the other hand to gently pull the paper in a particular direction. The paper in this analogy is the retina, and the pencil corresponds to light rays attempting to charge photoreceptors. A rough numerical sketch of this shearing effect follows.
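As a back-of-the-envelope illustration (not from the source), the sketch below estimates how far the image of a vertical edge shifts on the retina between the first and last scanned-out rows. The 90 Hz refresh rate, 1080-row panel, 100°-per-second eye rotation, and 15 pixels per degree are all assumed values chosen only to show the order of magnitude.

```python
# Toy model of rolling scanout: each display row is emitted slightly later than
# the previous one; if the eye is rotating (VOR) or tracking (smooth pursuit)
# during that time, rows land at shifted retinal positions, shearing vertical edges.
frame_time_ms = 1000 / 90            # assumed 90 Hz display
rows = 1080                          # assumed number of scanout rows
eye_speed_deg_per_s = 100            # assumed angular velocity of the eye
pixels_per_degree = 15               # assumed angular resolution of the headset

for row in (0, rows // 2, rows - 1):
    t_ms = frame_time_ms * row / rows          # time at which this row is scanned out
    shift_deg = eye_speed_deg_per_s * t_ms / 1000.0
    shift_px = shift_deg * pixels_per_degree
    print(f"row {row:4d}: drawn {t_ms:5.2f} ms into the frame, retinal shift ≈ {shift_px:.1f} px")
```

With these assumed numbers, the last row lands roughly 17 pixels away from where the first row did, which is exactly the slanting suggested by the printer analogy.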
Figure 5.26 shows how a rectangle would distort under smooth pursuit and VOR.

![](https://i.imgur.com/bujvE4K.png)

One possible fix is to render a pre-distorted image that is straightened out by the distortion of the line-by-line scanout; constructing such images requires precise calculations of the scanout timings. Yet another problem with displays is that the pixels can take so long to switch (up to 20 ms) that sharp edges appear blurred.

#### Retinal image slip

Recall that eye movements contribute both to maintaining a target at a fixed location on the retina (smooth pursuit, VOR) and to changing its location slightly to reduce perceptual fading (microsaccades). During ordinary activities (not VR), the eyes move and the image of a feature may move slightly on the retina due to motions and optical distortions. This is called retinal image slip. Once a VR headset is used, the motions of image features on the retina might not match what would happen in the real world, due to many factors already mentioned, such as optical distortions, tracking latency, and display scanout. Thus, the retinal image slip caused by VR artifacts does not match the retinal image slip encountered in the real world. The consequences of this have barely been identified, much less characterized scientifically. They are likely to contribute to fatigue, and possibly VR sickness. As an example of the problem, there is evidence that microsaccades are triggered by a lack of retinal image slip. This implies that differences in retinal image slip due to VR usage could interfere with microsaccade motions, which are themselves not fully understood.

#### Vergence-accommodation mismatch

Accommodation is the process of changing the optical power of the eye’s lens so that close objects can be brought into focus. It normally occurs with both eyes fixated on the same object, resulting in a stereoscopic view that is brought into focus. In the real world, the vergence motion of the eyes and the accommodation of the lens are tightly coupled. For example, if you place your finger 10 cm in front of your face, then your eyes will try to increase the lens power while strongly converging. If a lens is placed at a distance of its focal length from a screen, then with normal eyes the screen will always appear in focus while the eye is relaxed.

* What if an object is rendered to the screen so that it appears to be only 10 cm away?

In this case, the eyes strongly converge, but they do not need to change the optical power of the eye’s lens. The eyes may nevertheless try to accommodate, which would blur the perceived image. The result is called vergence-accommodation mismatch because the stimulus provided by VR is inconsistent with the real world (the numbers behind this example are sketched below). Even if the eyes become accustomed to the mismatch, the user may feel extra strain or fatigue after prolonged use. The eyes are essentially being trained to allow a new degree of freedom: separating vergence from accommodation rather than coupling them. New display technologies may provide some relief from this problem, but they are currently too costly and imprecise. For example, the mismatch can be greatly reduced by using eye tracking to estimate the amount of vergence and then altering the power of the optical system.
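To put rough numbers on the mismatch, here is a small sketch (not from the source) that computes the vergence angle and the accommodation demand for an object 10 cm away, assuming a 63 mm interpupillary distance. In a headset with the screen at the lens focal plane, the virtual image sits near optical infinity, so the accommodation demand stays near zero even though vergence still follows the rendered depth.

```python
import math

IPD_M = 0.063          # assumed interpupillary distance: 63 mm

def vergence_angle_deg(distance_m):
    """Angle between the two lines of sight when both eyes fixate a point at `distance_m`."""
    return math.degrees(2 * math.atan((IPD_M / 2) / distance_m))

def accommodation_diopters(distance_m):
    """Optical power (1/d) the eye lens must supply to focus at `distance_m`."""
    return 1.0 / distance_m

# Real world: a finger 10 cm in front of the face.
print(vergence_angle_deg(0.10), accommodation_diopters(0.10))   # ~35 degrees, 10 D

# VR headset with the screen at the lens focal plane: the virtual image is near
# optical infinity, so accommodation demand is ~0 D, yet the rendered object at
# 10 cm still drives ~35 degrees of convergence -- the mismatch.
print(vergence_angle_deg(0.10), 0.0)
```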