# A Two-stage Deep Network for High Dynamic Range Image Reconstruction

## Abstract

Mapping a **single-exposure** low dynamic range (LDR) image into a high dynamic range (HDR) image is considered among the most strenuous image-to-image translation tasks due to exposure-related missing information.

- Single-exposure (brackets): the process of taking the original **RAW** photo from the camera and using a RAW editor to create exposure brackets.
- RAW image: a camera RAW image is an unprocessed photograph captured with a digital camera.

This study tackles the challenges of single-shot LDR to HDR mapping by proposing a **novel two-stage deep network**. Notably, the proposed method aims to reconstruct an HDR image without knowing any hardware information, including the **camera response function (CRF)** and exposure settings.

- CRF: a camera measures the light intensity hitting the imaging sensor and outputs an image with pixel values that in some way correspond to that light intensity. This correspondence is modeled by the camera response function (CRF).

![image](https://roboalgorithms.com/crf/crf-animation.gif)

Therefore, we aim to perform image enhancement tasks like denoising, exposure correction, etc., in the first stage. Additionally, the second stage of our deep network learns tone mapping and bit-expansion from a **convex set** of data samples.

- Convex set: a convex set is a set of points in which the line segment between any two points lies entirely within the set.

![](https://i.imgur.com/Z5pnsP8.png)

The qualitative and quantitative comparisons demonstrate that the proposed method can outperform the existing LDR to HDR works. Apart from that, we collected an LDR image dataset incorporating different camera systems. The evaluation with our collected real-world LDR images illustrates that the proposed method can reconstruct plausible HDR images without presenting any visual artefacts.

## Introduction

Due to numerous hardware limitations, digital cameras are susceptible to capturing only a limited range of luminance. Such hardware deficiencies drive most standalone devices to capture over/under-exposed images with implausible perceptual quality. To counter these inevitable consequences, digital cameras typically leverage multiple LDR shots with different exposure settings. Regrettably, such multi-shot LDR to HDR recovery also falls short of expectations and can incorporate limitations, including producing ghost artefacts in dynamic scenes captured with hand-held cameras.

Contrarily, recovering an HDR image from a single shot is considered among the most prominent solutions to the shortcomings of its multi-shot counterparts. However, single-shot HDR recovery remains a challenging task, as it aims to recover significantly more pixel-wise information than a legacy LDR image (i.e., an 8-bit image) contains. Most notably, such LDR to HDR mapping has to incorporate dynamic bit-expansion, noise suppression, and estimation of the CRF without any additional information from neighbouring frames.

In the recent past, several methods have attempted to reconstruct HDR images from single-shot LDR inputs by leveraging convolutional neural networks (CNNs). Typically, these deep methods learn to hallucinate the CRF and perform bit-expansion from a convex set of data samples. Notably, hardware-related information, explicitly the CRF, is proprietary property of the original equipment manufacturers (OEMs) and mostly remains undisclosed. Therefore, addressing single-shot LDR to HDR mapping with a single-stage deep network and pre/post-processing operations can result in inaccurate CRF estimation along with quantization errors. Subsequently, such HDR mapping methods can end up with visual artefacts in real-world scenarios. A toy sketch of a CRF and its clipping behaviour follows.
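As a side note (not from the paper), the following toy Python sketch illustrates what a CRF does and why clipping plus 8-bit quantization makes single-shot HDR recovery ill-posed. The gamma-shaped curve and the function name are assumptions for illustration only; real CRFs are proprietary, nonlinear curves.

```python
import numpy as np

def apply_crf(radiance, gamma=1.0 / 2.2):
    """Toy CRF: map linear scene radiance to 8-bit pixel values.

    A gamma curve is only an illustrative stand-in; it is not the
    paper's (or any real camera's) response function.
    """
    response = np.clip(radiance, 0.0, 1.0) ** gamma     # nonlinear response + clipping
    return np.round(response * 255.0).astype(np.uint8)  # 8-bit quantization

# Every radiance value above 1.0 collapses to 255; this clipped highlight
# information is exactly what single-shot HDR reconstruction must hallucinate.
scene = np.array([0.01, 0.5, 1.0, 4.0, 16.0])
print(apply_crf(scene))  # -> [ 31 186 255 255 255]
```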
In this paper, we propose a two-stage learning-based deep method to tackle the challenging single-shot HDR reconstruction. The proposed method comprises a two-stage deep network and learns from a convex set of single-shot 8-bit LDR images to reconstruct 16-bit HDR images comprehensively (please see Fig. 1). Here, the first stage of the proposed method performs basic enhancement tasks like exposure correction, denoising, etc., and the second stage recovers the 16-bit HDR image, including the **tone mapping**. Notably, we encouraged our network to learn to reconstruct HDR images directly, without explicitly estimating hardware-related information like the CRF and bit-expansion. Hence, our method incorporates a significantly simpler training process and does not require any handcrafted processing. We studied our network with real-world LDR images to confirm its feasibility on unknown data samples.

- Tone mapping: a technique used in image processing and computer graphics to map one set of colors to another to approximate the appearance of high-dynamic-range images in a medium that has a more limited dynamic range.

![](https://i.imgur.com/v0w3Pi5.png)

Our contributions are as follows:

- A two-stage deep network to reconstruct 16-bit HDR images from 8-bit LDR inputs.
- Comparisons with state-of-the-art methods, which the proposed method outperforms in both objective and subjective measurements.
- Collection of an LDR image dataset and an extensive study of the proposed method's feasibility in real-world scenarios.

## Related Works

LDR to HDR image reconstruction has been largely investigated over the last couple of years. The following subsections discuss some of the previous work on this topic; for simplicity of presentation, we categorize these methods into learning-based and non-learning-based methods.

### Non-learning Based Methods

Inverse tone mapping, additionally known as expansion operators (EOs), is broadly used for LDR to HDR image reconstruction and has been studied for the last couple of decades. Nevertheless, this technique's difficulty persists, as it fails to reproduce details in the missing portions of the image. Here, concerning single-image HDR reconstruction, we discuss some existing EO techniques. EOs are commonly formulated mathematically as Le = f(Ld):

![](https://i.imgur.com/9qjNBcu.png)

Here, Le indicates the HDR content produced from the LDR input, which is denoted as Ld. f(·) indicates the expansion function, which takes the LDR content as input. A minimal sketch of such a global operator follows.
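For concreteness, here is a minimal Python sketch of a simple global EO in the spirit of these methods, assuming a plain gamma expansion. The function name and parameter values are illustrative only and are not taken from any of the cited papers.

```python
import numpy as np

def gamma_expansion(ldr, gamma=2.2, l_max=4.0):
    """Toy global EO: Le = f(Ld), with f(Ld) = Lmax * Ld**gamma.

    ldr: LDR image as floats normalized to [0, 1].
    Returns pseudo-linear luminance spanning [0, Lmax].
    """
    return l_max * np.power(ldr, gamma)

# Usage: expand an 8-bit image into a wider luminance range.
ldr = np.random.randint(0, 256, (64, 64, 3)).astype(np.float32) / 255.0
hdr = gamma_expansion(ldr)  # values now span [0, 4] instead of [0, 1]
```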
Inverse tone mapping with global operators was mainly used in the early days of solving the LDR to HDR conversion problem. Landis, one of the earliest to address this problem, applied a linear function to all of an image's pixels. A gamma function was used by Bist et al., where the gamma curve is defined with the help of the human visual system's characteristics. Masia et al. proposed a global method that expands the content based on image properties determined by an image key. All of the above methods are categorized as global methods.

In local methods, an analytical method coupled with an expand map is typically applied to expand LDR content to HDR. A median-cut method was used by Banterle et al. to find the areas with high luminance. Later, they generated an expand map using an inverse operator to extend the luminance range in the high-luminance areas. To maintain contrast, Rempel et al. further used an expand map calculated with a Gaussian filter and an edge-stopping function. Some other methods were proposed to tackle this issue, most of them adding user interaction. Didyk et al. used a semi-automatic classifier to detect the high-luminance and other saturated areas. Wang et al. proposed an inpainting-based method where textures are recovered by transferring details from a user-selected region. However, **these techniques solve the LDR to HDR conversion problem and produce satisfactory outcomes only when well-behaved inputs are provided.**

### Learning Based Methods

Learning-based image-to-image translation, such as image enhancement, has shown great promise in the past decade. Considering their success in different domains of image manipulation, recent LDR to HDR studies have incorporated deep learning in their respective solutions. In recent work, Endo et al. proposed an auto-encoder to generate HDR images from multi-exposure LDR images. Lee et al. sequentially bracketed LDR exposures and utilized a CNN to reconstruct an HDR image. Later, Lee et al. proposed a recursive conditional generative adversarial network (GAN) combined with an L1-norm to reconstruct the HDR images. Yu-Lun et al. intended to learn the reverse camera pipeline for HDR reconstruction from a single input. Notably, all of these deep methods incorporate complicated training manoeuvres and handcrafted pre/post-processing operations.

Apart from these approaches, a few novel methods propose to learn LDR to HDR directly through a single-stage deep network. For example, Eilertsen et al. proposed a U-Net architecture to estimate the overexposed region of an image and combine it with the underexposed pixels of the LDR input. In another direction, Marnerides et al. proposed a multi-branch CNN to extract features from the input LDR and fuse the output of each branch to expand the bit values of LDR images. Similarly, Zeeshan et al. proposed a recurrent neural network to learn single-shot LDR to HDR from training pairs. These existing straightforward deep networks learn the CRF and bit-expansion with a single-stage network, which can easily misguide the reconstruction network into producing visual artefacts.

Unlike the existing works, the proposed method does not include any additional pre/post-processing operation. Our proposed method directly learns an 8-bit LDR to 16-bit HDR mapping with a novel deep network.

## Method

The proposed method aims to recover 16-bit HDR images from single-shot LDR inputs. This section describes the network design, optimization, and implementation strategies in detail.

### Network Design

We consider single-shot LDR to HDR formation as an image-to-image translation task.

- Recover 16-bit HDR images as F : IL → IH.
- F learns to generate a 16-bit image (IH) from an 8-bit LDR image (IL) comprehensively, from a convex set of training samples.

![](https://i.imgur.com/xODqU7p.png)

As Fig. 2 depicts, the proposed method comprises a two-stage deep network to map an input LDR image to an HDR image; a minimal sketch of this pipeline follows below.
- Stage I: Learns basic operations like exposure correction, denoising, contrast correction, gamma correction, etc.
    - LDR images exhibit numerous shortcomings like over/under-exposure, over/de-saturation, sensor noise, etc.
    - This stage aims to perform such image enhancement tasks before reconstructing the HDR images.
    - The network maps the input LDR image (IL) as IH' ∈ [0, M]^(H×W×3), where H and W represent the height and width of IH'. The maximum value of M can be perceived as M = 255.
    - We normalized by dividing by 255 to accelerate the training process.
    - Stage-I is a stacked CNN comprising single convolutional input and output layers with multiple Residual Dense Attention Blocks (RDAB) in between; here n = 2 RDABs.
- Stage II: Learns tone mapping and bit-expansion, and recovers the 16-bit HDR image from the output of stage-I.
    - This stage aims to reconstruct the final 16-bit HDR images by learning tone mapping and bit-expansion.
    - It takes the output of stage-I, IH', as input and maps it as IH ∈ [0, K]^(H×W×3).
    - It is noteworthy that the output range of IH is stored in a 16-bit image format; the maximum value of K can be K = 65535.
    - This stage shares a similar network architecture with its predecessor.
    - To reduce the number of trainable parameters, we set the frequency of RDABs in stage-II to n = 1.
- Residual Dense Attention Block (RDAB):
    - To accelerate the learning process, we developed a novel block combining a residual dense block and a spatial attention module.
    - The spatial attention modules in the newly developed RDAB allow us to leverage spatial attention along with residual feature propagation to mitigate visual artefacts. For a given input X, an RDAB outputs the feature map X' as:

    ![](https://i.imgur.com/cmP2V5c.png)

    - R(·) and S(·) denote the residual dense block and the spatial attention block, respectively. We add the output of S(·) to that of R(·) to learn long-distance feature inter-dependencies while performing the HDR mapping.

    ![](https://i.imgur.com/lGBaJ5k.png)

    - Spatial attention module (a code sketch of such a block follows after this list):

    ![](https://i.imgur.com/4RKbEGa.png)
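A hedged PyTorch sketch of one plausible RDAB realization: a small residual dense block R(·) plus a pooling-based spatial attention path S(·), with their outputs added as described. The channel counts, growth rate, and attention design are assumptions; the paper's exact block is the one shown in the figures above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Simple spatial attention S(.): a per-pixel gate from pooled channel stats."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)  # channel-max map
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate                     # spatially re-weighted features

class RDAB(nn.Module):
    """Sketch of a Residual Dense Attention Block: X' = R(X) + S(X)."""
    def __init__(self, channels=64, growth=32, n_layers=3):
        super().__init__()
        convs, ch = [], channels
        for _ in range(n_layers):           # densely connected conv layers
            convs.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=1), nn.ReLU(inplace=True)))
            ch += growth
        self.dense = nn.ModuleList(convs)
        self.fuse = nn.Conv2d(ch, channels, 1)  # local feature fusion (1x1 conv)
        self.attn = SpatialAttention()

    def forward(self, x):
        feats = [x]
        for conv in self.dense:             # each layer sees all previous features
            feats.append(conv(torch.cat(feats, dim=1)))
        r = x + self.fuse(torch.cat(feats, dim=1))  # residual dense path R(X)
        return r + self.attn(x)                     # add spatial attention path S(X)
```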
### Optimization

The stages of the proposed method are optimized with dedicated loss functions. Based on their dedicated roles, we set the objective functions to maximize performance.

- Stage-I optimization:
    - This study utilizes an L1-norm as the base reconstruction loss, which can be derived as follows:

    ![](https://i.imgur.com/dzXCKbs.png)

    - IH' and IG8 denote the output obtained from stage-I and the reference 8-bit image, respectively.
    - L1-norm: the L1 norm tends to drive a model's unimportant parameters to zero:
        - fewer parameters to store, reducing model size and memory requirements;
        - fewer multiplications at inference time, increasing prediction speed;
        - particularly important when combining features.
    - L2-norm: the L2 norm tends to make parameter values smaller rather than exactly zero.

    ![](https://i.imgur.com/8VY21tU.png)
    ![](https://i.imgur.com/0ezCVpq.png)

    - SSIM: considers luminance, contrast, and structure, and thus accounts for human visual perception; in general, SSIM-based results preserve more detail than L1 or L2 results.
    - The SSIM loss serves as a structure loss and is derived as follows:

    ![](https://i.imgur.com/SOGNq7d.png)

    - GAN-based loss: improves the texture in the reconstructed images.

    ![](https://i.imgur.com/GLWO1kM.png)

    - The total loss of stage-I can be derived as follows (see the code sketch after this list):

    ![](https://i.imgur.com/abEvuY4.png)

- Stage-II optimization:
    - We develop another dedicated loss function to maximize the performance of stage-II.

    ![](https://i.imgur.com/E3u2Vbp.png)

    - IH and IG denote the generated 16-bit HDR image and the corresponding 16-bit reference image.
    - Perceptual colour loss (PCL): guides the network to avoid colour degradation while mapping the given 8-bit images into a 16-bit HDR image.

    ![](https://i.imgur.com/sv9s7LE.png)

    - ∆E represents the CIEDE2000 colour difference between the generated image and the reference image.
    - The total loss of stage-II can be summarized as follows:

    ![](https://i.imgur.com/ovp3VcK.png)
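As an illustration of how the stage-I objective combines these terms, here is a hedged PyTorch sketch. The loss weights are invented for the example, the SSIM implementation is passed in externally (e.g., from pytorch-msssim), and the paper's equations above define the actual combination.

```python
import torch
import torch.nn.functional as F

def stage1_loss(pred, target, ssim_fn, disc_fake_logits,
                w_rec=1.0, w_ssim=0.2, w_adv=0.01):
    """Hedged sketch of the stage-I objective: reconstruction + structure + GAN.

    ssim_fn is any differentiable SSIM implementation; disc_fake_logits are
    the discriminator's raw outputs for the generated images. The weights
    are illustrative assumptions, not the paper's values.
    """
    rec = F.l1_loss(pred, target)              # L1 reconstruction loss
    structure = 1.0 - ssim_fn(pred, target)    # SSIM-based structure loss
    adv = F.binary_cross_entropy_with_logits(  # generator-side adversarial loss
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return w_rec * rec + w_ssim * structure + w_adv * adv
```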
### Implementation Details

- The input layer of both stages maps an arbitrary image with a dimension of H × W × 3 into a feature map Z = H × W × 64, where H and W represent the height and width of the input image.
- Each network's output layer generates images as IR* = H × W × 3.
- The convolution operations of stage-I and stage-II use a kernel = 3 × 3, stride = 1, and padding = 1, and are activated by a ReLU activation.
    - Why use an activation function at all? Without one, the output is just a linear function of the input, and a deep neural network would be meaningless.
- A discriminator is utilized for estimating the adversarial loss.
    - cGAN (conditional generative adversarial network): adds a conditioning constraint to control the characteristics (class) of the data a GAN generates.

    ![](https://i.imgur.com/4E8JPgI.png)

    - The discriminator comprises eight consecutive convolutional layers with a kernel size of 3 × 3, activated with a swish function.
    - The feature depth of these convolutional layers starts at 64 channels. In every (2n − 1)-th layer, the architecture expands its feature depth and reduces the spatial dimension by a factor of 2.
    - The final output of the discriminator is obtained with another convolution operation comprising a kernel = 1 × 1, activated by a sigmoid function.

## Experiments and Results

We performed dense experiments to study the feasibility of the proposed method in different scenarios. This section details the results obtained from the experiments for LDR to HDR reconstruction.

### Setup

- Dataset: HdM HDR dataset
    - Comprises 1,289 scenes (i.e., long-, medium-, and short-exposure LDR images with 16-bit HDR ground truth).
    - We used 1,000 image sets for training and the rest for testing.
    - We extracted a total of 7,551 image patches and made image sets for supervised training. Each patch set comprises randomly extracted image patches of the LDR input and the 16-bit and 8-bit ground-truth images.
    - We obtained the 8-bit reference images by clipping and normalizing the 16-bit ground-truth images. Fig. 4 depicts sample image patches extracted from the HdM HDR dataset, which we used for training only.

    ![](https://i.imgur.com/kfltUPj.png)

- Implemented in PyTorch.
- The networks were optimized with an Adam optimizer, with hyperparameters tuned as β1 = 0.9, β2 = 0.99, and learning rate = 5e-4.
- We trained our model for 25 epochs with a constant batch size of 8.
- We conducted our experiments on a machine comprising an AMD Ryzen 3200G central processing unit (CPU) clocked at 3.6 GHz, 16 GB of random-access memory, and an Nvidia GeForce GTX 1060 (6 GB) graphical processing unit (GPU).

### Comparison with State-of-the-art Methods

- We compared our method with three different state-of-the-art single-shot LDR to HDR works: i) HDRCNN, ii) ExpandNet, and iii) FHDR. None of these methods has been specifically designed for generating 16-bit HDR images, as we aim to learn in this study.
- To keep the evaluation process as fair as possible, we studied each state-of-the-art model with the same dataset we used to investigate our proposed method.
- Each method was trained with its suggested hyperparameters until it converged on the given data samples.
- We evaluated each deep method with the same testing samples and summarized the performance with the peak signal-to-noise ratio (PSNR) and µ-PSNR metrics [28]. Here, we compute the µ-PSNR per the suggestions of [28] and employ a compression factor µ = 5000, a normalizing percentile of 99, and a tanh function for maintaining the [0, 1] range (a code sketch of this metric follows after this list).
- PSNR (Peak Signal-to-Noise Ratio):
    - Expresses the ratio between the maximum possible power of a signal and the power of the corrupting noise that limits the fidelity of its representation; for images, PSNR offers a comparatively objective (quantified) measure of distortion.

    ![](https://i.imgur.com/Nbe649n.png)

    - MAX_I is the maximum intensity of the signal; for an image with 8 bits per pixel, this is 255.

    ![](https://i.imgur.com/9L7I0gu.png)

    - MSE is the mean-square error from statistics.
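The following NumPy sketch assembles the µ-PSNR computation exactly as described above (99th-percentile normalization, tanh to hold the [0, 1] range, µ-law compression with µ = 5000); the details may differ slightly from the reference implementation of [28].

```python
import numpy as np

def mu_psnr(pred, target, mu=5000.0, percentile=99.0):
    """Sketch of mu-PSNR: tone-compress both images, then compute PSNR.

    pred, target: HDR images as float arrays with the same shape.
    """
    norm = np.percentile(target, percentile)  # normalize by the 99th percentile
    pred_n = np.tanh(pred / norm)             # squash both images into [0, 1)
    target_n = np.tanh(target / norm)

    def compress(x):                          # mu-law tone compression
        return np.log(1.0 + mu * x) / np.log(1.0 + mu)

    mse = np.mean((compress(pred_n) - compress(target_n)) ** 2)
    return 10.0 * np.log10(1.0 / mse)         # peak value is 1 after compression
```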
- Quantitative evaluation:
    - Table 1 illustrates the quantitative comparison between the deep methods.
    - It is worth noting that the HDRCNN model leverages a VGG16 backbone in its architecture. Typically, such pretrained VGG16 backbones aim to enhance details while performing image-to-image translation tasks. We found that the VGG16 backbone of HDRCNN amplifies the sensor noise of LDR inputs during detail enhancement. The 16-bit expansion then boosts this noise further in the final reconstruction, preventing HDRCNN from performing as satisfactorily as its counterparts.

    ![](https://i.imgur.com/OT5tsSw.png)

    - VGG is an abbreviation of the Visual Geometry Group at the University of Oxford; its main contribution was showing that using more hidden layers and training on large amounts of images can raise accuracy to 90%.
    - VGG16: 16 layers (13 convolutional layers and 3 fully connected layers).

    ![](https://i.imgur.com/lsXGIGX.png)

- Qualitative comparison:
    - We performed a qualitative evaluation for the subjective comparison of the different single-shot LDR to HDR reconstruction methods. Fig. 5 illustrates the reconstructed HDR images obtained through the different deep models. We normalized and clipped the 16-bit HDR outputs for better visualization. The visual comparison is consistent with the quantitative comparison.
    - Our two-stage deep method reconstructs cleaner HDR images with natural colour consistency, and it maintains details in complicated overexposed regions better than its counterparts. Overall, the proposed method can recover a plausible HDR image from an LDR input without producing any visually disturbing artefacts.

    ![](https://i.imgur.com/nRiu53w.png)

### Ablation Study

We studied the feasibility and contribution of our two-stage design with dedicated experiments. Specifically, we trained and evaluated the stages separately to verify the feasibility of a two-stage model for LDR to HDR reconstruction. Here, we used challenging single-shot LDR images from the HdM HDR dataset to perform the quantitative and qualitative evaluation.

- Quantitative evaluation:
    - Table 2 illustrates the performance of each stage of the proposed method on the HdM HDR dataset. Here, the PSNR and µ-PSNR are calculated over 289 image pairs. We arbitrarily selected an LDR image from the three exposure shots and paired it with the ground-truth image for the evaluation. The ablation study illustrates that each stage of the proposed method contributes to the final HDR reconstruction. The individual stages cannot reach the evaluation scores of the full two-stage variant. We observed a tendency toward underfitting in the one-stage variants due to their significantly smaller number of trainable parameters (please see sec. 4.5 for details).

    ![](https://i.imgur.com/wcOVpLe.png)

- Qualitative evaluation:
    - Fig. 6 illustrates the visual comparison between the different variants of the proposed method. Results are visualized by applying a normalizing factor to the 16-bit HDR images. It is visible that the proposed two-stage model reconstructs the visually cleanest and most plausible images among all the models. Despite sharing similar network configurations, the single-stage networks struggle to reach the level of their two-stage variant. In particular, estimating the CRF and bit-expansion together with image enhancement misguides them into producing visual artefacts.

    ![](https://i.imgur.com/XUjMWUS.png)

### Method Generalization

The key motivation of our proposed work is to obtain satisfactory results on diverse LDR images. Therefore, we studied the feasibility of our proposed method with a substantial number of LDR samples captured with different hardware. To this end, we collected an LDR dataset incorporating numerous camera hardware, including a DSLR (i.e., Canon Rebel T3i) and smartphone cameras (i.e., Samsung Galaxy Note 8, Xiaomi Mi A3, iPhone 6s, etc.). We collected a total of 52 LDR images using these devices. Depending on the hardware type (i.e., DSLR or smartphone), we captured images by applying device-appropriate strategies.

### Discussion

- The proposed method comprises 834,476 trainable parameters (555,655 for stage-I and 278,821 for stage-II).
- Despite being trained on image patches, our model can run inference on images of any dimension. Our model takes around 1.10 seconds to infer an image of dimension 1900 × 1060 × 3.
- As the proposed method doesn't require any pre/post-processing, the inference time should remain consistent on the same hardware. Subsequently, the simplicity of the proposed method makes the solution convenient for real-world deployment.

### Conclusion

- This study proposed a two-stage learning-based method for single-shot LDR to HDR mapping without explicitly calculating camera-hardware-related information.
- Stage-I of the proposed method learns to perform basic image manipulation tasks like exposure correction, denoising, and brightness correction comprehensively.
- Stage-II focuses on tone mapping and bit-expansion to output 16-bit HDR images.
- We evaluated and compared our proposed approach with the state-of-the-art single-shot HDR reconstruction methods.
- Both the qualitative and quantitative comparisons evidence that the proposed method can outperform the existing deep methods by a substantial margin.
- We also collected a set of LDR images captured with different camera hardware. The study with our newly collected dataset reveals that the proposed method can handle real-world LDR samples without producing any visual artefacts. We plan to extend the proposed method to multi-shot HDR reconstruction in a future study.