<h1> Removing Objects from 360° "Stationary" Videos </h1>

# Table of Contents
[toc]

# Intro to 360° Video

360° video has become a popular way to provide immersive experiences in which the viewer can actively explore and engage with the scene. Taiwan AILabs has been developing technologies for processing and streaming 360° videos. One representative application of our technology is [Taiwan Traveler](https://tter.cc/), an online tourism platform that lets visitors enjoy "virtual journeys" along various tourist routes in Taiwan.

One of the challenges we face when creating immersive experiences with 360° video is the presence of the cameraman in the footage, which disturbs the viewing experience. Since a 360° camera captures the scene from all possible viewing directions, the cameraman is unavoidably recorded in the raw footage. We therefore fix this "cameraman" issue in the processing stage of our 360° videos.

# Intro to Cameraman Removal

Cameraman removal (CMR) is an important stage of 360° video processing, in which we mask out the cameraman and inpaint the masked region with realistic content. The goal of cameraman removal is to mitigate the problem of the camera equipment or the cameraman disturbing the viewing experience.

See the following picture for example: given a video containing a camera tripod and the corresponding mask, our goal is to "remove" the tripod from the scene by cutting out the masked region and inpainting new synthetic content in its place.

> ![](https://i.imgur.com/ijta6u2.jpg)

In our previous blog[^previous_blog] about cameraman removal, we introduced how we removed the cameraman from 8K 360° videos with a processing pipeline based on FGVC[^FGVC]. However, that method[^previous_blog] is only suitable for videos shot with a moving camera. Hence, we developed another method to perform cameraman removal on **stationary** 360° videos.

:::warning
For most of the demo videos in this blog, we do not show the full 360° video. Instead, we rotate the camera down by 90° so that it faces the ground (the direction in which the cameraman usually appears) and crop a rectangular region from the camera view. A similar rotation step is also present in our CMR pipeline, where we rotate the camera and crop a small region as input to the actual CMR algorithms.
:::

# Stationary Video vs Dynamic Video

Dynamic videos are captured by a constantly moving camera. For example, the cameraman may be walking throughout the video.

> **Dynamic Video Example:** Hand-held video shot while walking. We want to remove the cameraman in the center.
> <iframe width="560" height="315" src="https://www.youtube.com/embed/4CscyVEhRc0" title="raw cropped sswharf short moving" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

In contrast to dynamic videos, stationary videos are filmed with little or no camera movement. For example, the video may be filmed by a cameraman holding the camera while staying within a relatively small area compared to the surroundings, or with the camera mounted on a tripod.

> **Stationary Video Example 1:** 360° video shot with a hand-held camera. We want to mask out the cameraman in red along with his shadow.
> <iframe width="560" height="315" src="https://www.youtube.com/embed/nfx8M0IGd-M" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <!-- > **Stationary Video Example 2:** Timelapse video shot with tripot. We want to mask out the tripot at the center. > <iframe width="560" height="315" src="https://www.youtube.com/embed/Pza8b0NNxrI" title="raw cropped chaowu timelapse" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --> > **Stationary Video Example 2:** Video shot with tripot, we want to mask out the camera equipment and the tripot in the center. > <iframe width="560" height="315" src="https://www.youtube.com/embed/9UqnV40o2xE" title="raw cropped ncnu sakura" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <!--In contrast to stationary videos, dynamic videos are captured by constantly moving camera. For example, the cameraman may be walking throughout the video. --> Although sometimes it is hard to classify a video into either stationary or dynamic. (e.g. video in which the cameraman walks and stands alternatively), we still find it very helpful to do so, since these two types of video have very different nature an thus require different approaches to remove cameraman. # Why We Treat Stationary Video and Dynamic Video Differently The motivation for categorizing videos into dynamic and stationary is the fact that previous cameraman remodval algorithm [^previous_blog] failed to fill masked area with realistic content when the camera has no movement. We analyzed the properties of stationary videos and developed a seperate CMR algorithm based on targeting the special setting. In the following sections, we will introduce the two challenges induced by the nature of stationary videos that we need to consider when developing high-quiality CMR algorithm. ## Challenge 1: **"Content Borrowing"** Strategy Fails on Stationary Videos #### What is **"Content Borrowing"** ? How does Previous CMR Method Work? ![](https://i.imgur.com/Dx55byD.jpg) The diagram shows the working principle of **flow-guided** video inpainting methods (e.g. FGVC [^FGVC] that is used by our previous CMR method[^previous_blog]), in which the frames of a dynamic video are drawn from bottom-left to top-right sequentially. To generate the content of masked region (red region), we detect the relative movement of each pixels between neibouring frames, and **borrow** (green arrows) the content of currently masked region from neibouring frames, in which the target region might be exposed ( not in masked region ) due to the movement of camera. In dynamic video, where the cameraman is walking, masked content changes along the video thus previously masked regions will almost certainly be exposed in future frames. That's why the strategy of "**Feature Borrowing**" works fine on dynamic video. 
> CMR result on a dynamic video with the **content borrowing** strategy (previous CMR method based on FGVC)
> <iframe width="885" height="498" src="https://www.youtube.com/embed/eWLk-CiFoyk" title="FGVC on sswarg_short_moving (crop=1200:1200:600:600)" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

#### **Content Borrowing** on Stationary Video

In the case of CMR on stationary video, if the mask is also stationary (which is often the case, because the cameraman tends to keep the same position relative to the camera), then **the content of the masked region might never be exposed throughout the entire video**. Due to this limitation, our stationary CMR algorithm has to **"hallucinate" or "create" realistic new content by itself** in order to fill in the masked region.

## Challenge 2: Human Eyes are More Sensitive to Flickering and Moving Artifacts in a Stationary Scene

Another challenge we faced when designing a CMR algorithm for stationary video is that human eyes are more sensitive to artifacts in stationary videos than in dynamic ones. Specifically, in a video with no camera movement, any flickering or distortion in the content catches the viewer's attention and disturbs the viewing experience.

- When we run the previous CMR method on stationary video, we see obvious artifacts caused by warping and flow estimation errors.

> **The result of the previous method (based on FGVC[^FGVC]) on stationary video.** Notice the texture mismatch around the mask boundary caused by warping errors, and the content generation failure in the bottom-left area.
<video width="560" height="560" autoplay loop muted playsinline src="https://i.imgur.com/0kc5MjN.mp4" type="video/mp4"> </video> [Video Link](https://i.imgur.com/0kc5MjN.mp4)

- Although we use the same mask generation method for both stationary and dynamic videos, tiny disturbances along the mask boundary are perceived as artifacts only in stationary video, while being silently ignored in dynamic video.

> In stationary video, **even a tiny inconsistency in the mask boundary becomes obvious to human eyes**. Although the mask boundary of the inpainted result looks natural in each individual frame, we still see disturbing flickering because the boundary changes across frames. The same phenomenon is present in dynamic video but is hardly noticeable in a dynamic context.
> <video autoplay loop muted playsinline src="https://imgur.com/aTQapVx.mp4" type="video/mp4"></video> [video link](https://imgur.com/aTQapVx.mp4)

Based on the above properties of stationary video, we propose 3 different candidate solutions; their concepts and experimental results are presented in the next section.
# 3 Different Stationary CMR Solutions

The inputs to the 3 solutions are preprocessed by the same pipeline: we first convert the 360° video into equirectangular format, then rotate the viewing direction downward so that the mask lies near the center, and finally crop a square region containing the mask and its surrounding area and resize it to a resolution suitable for each solution. After the CMR is finished, we upsample the result with the ESRGAN super-resolution model[^ESRGAN] and paste it back onto the cropped region.

A common component among these 3 solutions is the use of an **image inpainting model**. The reason is that image inpainting models hallucinate realistic content better than video inpainting models, which tend to rely heavily on the "content borrowing" strategy. In addition, video inpainting models learn their content generation prior only from video datasets, which contain less diverse content than image datasets. For all the following experiments, we choose LaMa[^LaMa] as our image inpainting model.

From a simplified perspective, our 3 solutions are essentially 3 different ways to extend the result of the LaMa image inpainting model (which works on a single image rather than a video) into a full sequence of video frames in a temporally consistent and visually plausible manner.

## Method 1: Video Inpainting + Guidance

In Method 1, we modified video inpainting methods (which are usually trained on dynamic videos) to make them work on stationary video. We tested various video inpainting methods and chose E2FGVI[^E2FGVI] for its robustness and high-quality output. Our experiments on video inpainting can be separated into 4 stages.

- #### 1. Naive Method

First, we simply run the video inpainting model directly on the stationary input video and get the poor inpainting result shown below. The model fails to generate realistic content because it is trained only on dynamic videos, and thus relies heavily on the **content borrowing** strategy described earlier.

<iframe width="560" height="560" src="https://www.youtube.com/embed/SdjRTrbduyw" title="E2FGVI_HQ without Guidance" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- #### 2. Add Image Inpainting Guidance

In order to leverage the power of the video inpainting model, we make the "content borrowing" strategy applicable to stationary video by designing a new usage for the model. Specifically, **we insert the image inpainting result into the first frame of the input sequence and remove the corresponding mask, so that the video inpainting model can propagate the image-inpainted content from this guiding frame to the later frames in a temporally consistent way** (see the sketch below).

> **Modified input** to the video inpainting model: we insert the image inpainting result at the beginning of the input sequence as guidance for the video inpainting model.
> ![](https://i.imgur.com/UuQxz7a.png)
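Conceptually, the guidance trick only changes how the input slice is assembled before it is handed to the video inpainting model. Below is a minimal sketch, assuming frames and masks are lists of NumPy arrays and `inpaint_image` wraps a single-image inpainting model such as LaMa; the function name is ours and not part of the E2FGVI[^E2FGVI] API.

```python
import numpy as np

def insert_guidance(frames, masks, inpaint_image):
    """Prepend an image-inpainted guiding frame to a slice before video inpainting.

    frames: list of HxWx3 uint8 frames of one slice
    masks:  list of HxW uint8 masks (nonzero = region to inpaint)
    inpaint_image: callable that inpaints a single image, e.g. a LaMa wrapper
    """
    guide = inpaint_image(frames[0], masks[0])       # hallucinate content once
    guided_frames = [guide] + list(frames)
    # The guiding frame gets an empty mask, so the video inpainting model
    # treats its content as known and propagates it into the later frames.
    guided_masks = [np.zeros_like(masks[0])] + list(masks)
    return guided_frames, guided_masks
```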
<iframe width="560" height="560" src="https://www.youtube.com/embed/hhNf6cXNNpk" title="E2FGVI_HQ + Guidance ( first 100 frames) (guidence period 50 )" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> However, notice that there is an inconsistent strange frame in the middle of the above video. The reason behind this is a little complex: 1. With VRAM on a single Nvidia RTX3090 GPU, **we can only process at most 100 frames in each individual run**, due to limitation of E2FGVI[^E2FGVI]. (our experiments on other deep-learning-based video inpainting implementations experience similar memory constrains). 2. Hence we are forced to divide our input video into multiple slices, and processs each slice seperately.(Each slice contains 100 frames in this case). 3. Since each individual run is independently from one another, **we have to insert image inpainting guidance in each slice** 4. The inserted guiding slice generated by LaMa[^LaMa] and the video inpainted slice generated by [^E2FGVI] have different resolution and texture quality, causing flickering artifact in the transition between pair of slices. We deal with this artifact in the next stage. - #### 3. Softly Chaining the Video Slices together. In order to deal with the inconsistency in the transition of consequetive slices as described in the end of previous stage, we first tried to use the last frame of previous slide as the guiding frame of the next slice. The result is shown below, this method sufferes from accumulated content degeneration. The content of the inpainted region becomes blurry after several iterations. <!-- <iframe width="560" height="315" src="https://www.youtube.com/embed/N6RlLZW-sWs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --> <iframe width="560" height="560" src="https://www.youtube.com/embed/PGtFDc78SLk" title="E2FGVI_HQ AutoRegressive Chaining (period=50)" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> This result shows that we cannot remove the guiding frames from the slices, with this knowledge in mind, We propose a **"soft chaining"** method that mitigate the flickering artifact while preserving the guiding frames in each slice. In particular, we modify the chaining mechanism, so that every pairs of neibouring slices have an overlapping period, where the video is smoothly crossfaded from the previous slice to the next slice. With this arangement, we can still insert guiding frame, but the inserted guiding frame will be hidden by the overlapping crossfade. Thus the flickering artifact is solve, and we can smoothly transit between each slice. > New Soft Chaingnig Method for smooth transition between slices. > ![](https://i.imgur.com/H5tV8u9.png) > The result of Soft Chaining is shown below, we can see that the gap between each slice is less obvious. 
><iframe width="560" height="560" src="https://www.youtube.com/embed/lxVR4N1nbCc" title="E2FGVI_HQ + Soft_Chaining (period=50)" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <!-- <iframe width="560" height="315" src="https://www.youtube.com/embed/4YOXP3st6T4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>--> - #### 4. Temporal Filtering Finally, we use temporal filtering method (similar to Method3) to eliminate vibrateing artifact of our model. Other details are omitted for simplicity. <iframe width="560" height="560" src="https://www.youtube.com/embed/Hzt7F_TFJ8c" title="E2FGVI_HQ + Temporal Filtering (period=100) ncnc_sakura_result" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> The main artifact of Method1 is the crossfade transition between neibouring slices, although the transition is visually smooth, it is still disturbing in stationary video. Another artifact is the unnatural blury blob in the center of large mask, we suspect that the artifact is caused by the limitation of architecture of E2FGVI_HQ, which makes the inpainted content too far from the mask boundary becomes blurry. ## Method2: Image Inpainting + Poisson Blending In method 2 we tried to use poisson blending to literally "copy" and "paste" the inpainted region of first frame to other frames in the scene. Poisson blending, first proposed [here](https://dl.acm.org/doi/10.1145/882262.882269), is a way to propagate the texture of the source image on to the target region on another image, while preserving the visual smoothness at the boundary of the inpainted region. For example, here is the seamless cloning effect shown in the original paper: ![](https://i.imgur.com/y5c5hvz.jpg) If we use poisson blending to clone the image inpainting result of LaMa to the rest of the video frames, because of the property of poisson blending, it automatically adjust the lighting of the copied region according to the color of surroundings on the target frame. > The method works on timelapse sunrise video becaus poisson blending adjust the color by enforcing smooth color transition in the mask boundary. > <iframe width="560" height="560" src="https://www.youtube.com/embed/l5IV6U44k7M" title="Lama_Poisson_Result_Chouwu_timelapse" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> >The advantage of poisson blending is that the inpainting region is very stable accross frames, compared to the other 2 methods. See the result below. ><iframe width="560" height="560" src="https://www.youtube.com/embed/K_D9taCko2U" title="Lama_Poisson_Result_ncnu_sakura" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> However, poisson blending only works for tripot video, and doesn't work on handheld stationary video because it assumes that the inpainting content doesn't move throught out the entire video. In the example below, we can see the inpainted texture doesn't move with the surrounding content of the video, causing an unnatural visual effect. 
However, Poisson blending only works for tripod videos and fails on hand-held stationary videos, because it assumes that the inpainted content does not move throughout the entire video. In the example below, the inpainted texture does not move with the surrounding content of the video, causing an unnatural visual effect.

> Result of Method 2 (Poisson blending) on a hand-held video: the inpainted texture does not move with the surrounding content of the video, causing an unnatural visual effect.
<video autoplay loop muted playsinline src="https://i.imgur.com/fwC3R85.mp4" type="video/mp4"> </video> [video link](https://i.imgur.com/fwC3R85.mp4)

> Even a tripod video may contain tiny camera pose movements that accumulate over time.
> <video width="560" height="560" autoplay loop muted playsinline src="https://i.imgur.com/x8e4lxG.mp4" type="video/mp4"> </video> [video link](https://i.imgur.com/x8e4lxG.mp4)

## Method 3: Image Inpainting + Temporal Filtering

In Method 3 we take a different strategy: instead of running image inpainting only on the first frame of the input sequence, we run image inpainting on every frame. The result is shown here.

> The result of running image inpainting on every frame separately.
> <iframe width="560" height="560" src="https://www.youtube.com/embed/K6MFATUDjIY" title="Lama frame-by-frame Result ncnu sakura" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

The result is very noisy because the content generated by LaMa has temporally inconsistent texture. To mitigate this noisy visual artifact, we apply a temporal low-pass filter to the inpainted result (sketched at the end of this section). The filter averages the values of pixels at the same position across a temporal sliding window, extracting the content that is consistent across the window and averaging out the inconsistent high-frequency noise. The result after filtering is shown below. See HyperCon[^HYPERCON] for a more thorough realization of this temporal aggregation idea; our solution is inspired by that work, but since the authors did not release code, we implemented this simplified version due to time constraints.

> The result of applying **temporal filtering** to the frame-by-frame inpainting result. There is black-and-white noise caused by the super-resolution model, which we discuss in the finetuning section.
> <iframe width="560" height="560" src="https://www.youtube.com/embed/1alHfva-2Ho" title="Temporal Filtering Upscale (window=40) crop=1200:1200:600:600 ncnu_sakura_Result" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

The result seems stable at first glance, but if we zoom in and observe carefully, the color of each pixel is slowly changing, creating a slowly vibrating texture. We also see white and black spots in the result, which we call the "salt and pepper artifact", as shown below. This is the most disturbing artifact of Method 3 and will be discussed later.

> Method 3 result zoomed in: we can see the salt and pepper artifact and tiny flickering caused by the super-resolution model.
<video width="560" height="560" autoplay loop muted playsinline src="https://i.imgur.com/DT9f3Fn.mp4" type="video/mp4"> </video> [video link](https://i.imgur.com/DT9f3Fn.mp4)
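As promised above, here is a minimal sketch of the temporal low-pass filter, assuming the per-frame LaMa outputs are stored as a list of NumPy arrays; in our real pipeline only the masked region is taken from the filtered frames, and the function name is ours.

```python
import numpy as np

def temporal_lowpass(inpainted_frames, window=40):
    """Average each pixel over a sliding temporal window.  Content that is
    consistent across frames survives; LaMa's frame-to-frame texture noise
    is averaged out."""
    frames = np.stack(inpainted_frames).astype(np.float32)   # (T, H, W, 3)
    half = window // 2
    out = np.empty_like(frames)
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        out[t] = frames[lo:hi].mean(axis=0)                   # centered moving average
    return out.astype(np.uint8)
```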
Now that we have seen the results of these 3 methods, the next step is to compare their properties and choose the one with the best potential to fit our needs.

# Comparing and Choosing from Experimental Solutions

From the results of the previous experiments, we could see that each method has its own strengths and weaknesses; there was no absolute winner that dominated every competitor in all situations. In order to decide on one final choice for further finetuning, we estimated, for each method and based on the experimental results, its potential for accomplishing our project goal. We converted the experimental results of each method into 8K 360° format and asked our photographer and QC team for feedback. We summarize the pros and cons of each method in the following table.

#### Pros and Cons of Each Method

| Method | Pros | Cons |
|----|----|----|
| 1: Video Inpainting + Guidance | Robust | Slightly disturbing crossfade transition between video slices (every 200 frames); blurry artifact in the center of large masks. |
| 2: Image Inpainting + Poisson Blending | Most temporally stable result | Fails on hand-held cameras; the tripod also needs to be absolutely stable across the whole video. |
| 3: Image Inpainting + Temporal Filtering | Robust | Motion blur under large camera movement; large salt and pepper artifact after super-resolution. |

#### Speed Comparison

We also tried to factor the running speed of each method into the decision. However, it is hard to tell the optimal speed of each method at the experimental stage, because we expect a large speedup once the whole pipeline is implemented on the GPU. The experimental pipeline is very slow because it saves and loads intermediate results to disk and unnecessarily transfers image frames back and forth between GPU and CPU. The rough estimated speed of each algorithm was similar, so computation cost did not affect our decision.

#### Conclusion

Based on the qualitative comparison and the feedback from our photographer and QC team, we made the following decisions:

- Method 2 was the first to be dropped from the list because of its low robustness (it only works on perfectly stable tripod videos).
- Method 1 and Method 3 performed equally well according to the feedback voting, so we compared their potential for further finetuning:
    - In Method 1, the blurry artifact in the mask center is hard to fix because it is related to the architecture of the video inpainting model.
    - The salt and pepper artifact of Method 3 (its most complained-about issue) may be fixed by adjusting the resizing rules of our pipeline.

Therefore, we chose Method 3, since it has more potential for improvement in further finetuning.

# Finetuning Method 3 (Image Inpainting + Temporal Filtering)

The finetuning of Method 3 can be divided into 3 parts:
1. Removing the salt and pepper artifact
2. Removing the flickering mask boundary
3. Speedup

We briefly describe each part below and show the flow chart of our CMR algorithm at the end of this chapter.

## Removing the Salt and Pepper Artifact

The most disturbing artifact reported in our testing feedback is the salt and pepper noise (white and black spots in the scene).

### The Source of the Salt and Pepper Artifact

> The pipeline diagram of stationary CMR using Method 3, in which we search for the cause of the artifact.
> <video width="560" height="315" autoplay loop muted playsinline src="https://i.imgur.com/FDhxcuE.mp4" type="video/mp4"> </video> [video link](https://i.imgur.com/FDhxcuE.mp4) We search for the origin of salt and pepper artifact in the CMR pipeline ,and found out that it is caused by the resizing process before the image inpainting. We follow the design of previous CMR method and resize the cropped image from 2400x2400 to 600x600, he problem is that in resolution 600x600, the image contains severe aliasing effect, which is replicated by LaMa (shown below), resulting in black and white noise pixels. The noise pixels are preserve and amplified by the super-resolution model. > The Output of LaMa image inpainting, we can see that the anialiasing noise in the context region is replicated by LaMa in the inpainted region. ![](https://i.imgur.com/Fac3B4h.png) > Result of Inpainted Region after Super-Resolution, We can see that the black and white noise pixel are preserver and amplified by the super-resolution model. ![](https://i.imgur.com/hXev6Id.png) ### Solution to Salt and Pepper Artifact The main idea is to reduce the scaling factor of the resize process, so that more detailed information can be preserve and less aliasing effect would occur. We made 2 modification to achieve this: 1. Increase LaMa input Dimension and 2. Track and Crop smaller region around the mask. - **Increase LaMa Input Dimension** We tried different input resolution and found out that LaMa is capable of handling input size larger than 600x600. However, it doesn't means that we can directly feed the original 2400x2400 into LaMa, the synthsized pattern has a limited scale, so when the input image is too large, we would get unnatural repetitive texture in the inpainting area. Epirically, we found out that the sweet spot on the trade off between the operating resolution and content quality is about 1000x1000, so we change resizing dimension before LaMa to 1000x1000. > LaMa Inpainting Result with Different Input Size: > 600x600: many salt and pepper artifact due to extreme resizing scale. > ![](https://i.imgur.com/YvaOYst.jpg) > 1000x1000: good inpainting result > ![](https://i.imgur.com/0oTTuB9.jpg) > 1400x1400: we can see unnatural repeatative texture inside the inpainted area. > ![](https://i.imgur.com/7KobcZI.jpg) - **Track and Crop Smaller Region Around the Mask** Another modification that also boosts the inpainting quality is to crop a smaller region from 8k equirectangular video. Smaller cropped image means we can use smaller resize scale to shrink our input into 1000x1000, thus preserving more detailed content in the result. The cropping area need to contains the whole masked region, and also contains the surrounding areas of the mask, from which LaMa inpainting model can synthesize plausible inpainting content. We hene modify the cropping mechanism so that it tracks position and shape of the mask, specifically, instead of always cropping the center 2400x2400 region of the rotated equirectangular frame, we first rotate the mask region to the center of the equirectangular frame, and then crop the bounding rect around the mask with a margin of 0.5x width and heigth of the bounding rect. The mask tracking process is illustrated below. > Rotate and Crop the Equirectangular Frame based on the Position and Shape of Mask Area: this strategy could reduce the size of the cropped image, thus reduce the resizing scale and increase the overall quality of the result. 
> Rotate and crop the equirectangular frame based on the position and shape of the mask area: this strategy reduces the size of the cropped image, thus reducing the resizing scale and increasing the overall quality of the result.
> Rotate (up: always rotate 90 degrees / down: rotate to align the mask to the center)
> ![](https://i.imgur.com/OjV3kLF.png)
> Find the bounding box (up: always rotate 90 degrees / down: rotate to align the mask to the center)
> ![](https://i.imgur.com/qtPvZkf.png)
> Cropped region around the bounding box
> ![](https://i.imgur.com/AXd7JSV.png)
> Resized input mask to the LaMa image inpainting model
> ![](https://i.imgur.com/vKqG1Jp.png)

After the above 2 modifications, **the required scaling factor is drastically reduced, so the super-resolution module is no longer necessary for upscaling**. Hence, we replaced the ESRGAN super-resolution model with normal image resizing, which removes even more of the salt and pepper artifacts.

## Removing the Flickering Mask Boundary

Look at the inpainted picture below: can you find the boundary of the inpainted mask?

![](https://i.imgur.com/NPm8O4z.jpg)

It should be hard to notice the inpainted region even after zooming in.

![](https://i.imgur.com/5wK3oF5.jpg)

However, if the mask boundary is inconsistent between frames, a flickering effect exposes the existence of the inpainting mask boundary.

<video width="560" autoplay loop muted playsinline src="https://i.imgur.com/huBHb01.mp4" type="video/mp4"> </video> [video link](https://imgur.com/huBHb01.mp4)

To remove this artifact, we apply temporal filtering to the binary mask before feeding it to the CMR pipeline, and we blur the mask continuously when pasting the inpainted region back onto the 360° video.

## Speed Up

In the finetuning stage, we also sped up the algorithm with the following modifications:
1. We moved the preprocessing of the LaMa model from the CPU to the GPU.
2. We chained all the steps of the CMR together and run them all on the GPU.
3. We chained the CMR algorithm and the video mask generation algorithm[^CFBVIO] together with a queue of generated mask frames implemented as a Python generator, so the 2 stages can run at the same time on a single GPU.

After this finetuning, the final CMR and mask generation models together run at **4.5 fps** on **5.6K** 360° video.

## Complete Flow Chart of the Finetuned CMR Algorithm

Here is the complete flow chart of the CMR and video mask generator.

![](https://i.imgur.com/cw3pm1h.png)

# Conclusions

Here is the final result of our finetuned stationary CMR algorithm in 360° format. We successfully removed the salt and pepper artifacts.

<iframe width="640" height="360" src="https://www.youtube.com/embed/lXvUSXWl_jA" title="BLOG EXPORT static runner result ncnu sakura 360" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

The inpainted area is both visually plausible and temporally smooth. The algorithm achieves the goal of CMR under the constraints of stationary video and greatly improves the viewing experience compared to the previous method.

We summarize this blog in the following 3 points:
1. We compared the different properties of stationary and dynamic videos, and analyzed the challenges of designing a CMR algorithm for stationary video.
2. Based on these observations, we proposed 3 different CMR solutions and compared their strengths and weaknesses.
3. We chose Solution 3 as our final algorithm and further finetuned both its quality and its speed. On stationary video, our algorithm handles both tripod and hand-held footage, achieving much better results while also running much faster than the previous method[^previous_blog].
# 360° Final Results

## Result of Method 1 (Video Inpainting + Guidance)

> <iframe width="640" height="360" src="https://www.youtube.com/embed/Km6mTLjFua4" title="E2FGVI_Hq_360 Soft_chaining + Temporal Filtering" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Result of Method 3 Before Finetuning (Image Inpainting + Temporal Filtering)

<iframe width="634" height="360" src="https://www.youtube.com/embed/2jxZv_GGebk" title="Temporal Filtering Before Finetuinig Sakura output 360" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<iframe width="634" height="360" src="https://www.youtube.com/embed/CPVArT0jcPs" title="Temporal Filteringf Before Finetune ( SML Sakura Grass ) 360" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Result of Finetuned Method 3 (Image Inpainting + Temporal Filtering)

Notice the qualitative improvements compared to Method 3 before finetuning:
1. The salt and pepper artifact is removed.
2. The flickering mask boundary is removed.

> <iframe width="640" height="360" src="https://www.youtube.com/embed/lXvUSXWl_jA" title="BLOG EXPORT static runner result ncnu sakura 360" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
> <iframe width="640" height="360" src="https://www.youtube.com/embed/Vfl5sbn33KY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
> <iframe width="640" height="360" src="https://www.youtube.com/embed/iRc-lXMRpcI" title="BLOG EXPORT static runner result chaowu timelapse 360" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
> <iframe width="640" height="360" src="https://www.youtube.com/embed/895dIzn_PUY" title="BLOG EXPORT static runner result asian bay 360" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

# References

[^previous_blog]: Previous blog: [Remove Cameraman from 8K 360° Videos](https://ailabs.tw/smart-city/the-magic-to-disappear-cameraman-removing-object-from-8k-360-videos/)
[^FGVC]: Paper: [Flow-edge Guided Video Completion](http://chengao.vision/FGVC/)
[^ESRGAN]: See the [GitHub repo of ESRGAN](https://github.com/xinntao/ESRGAN)
[^LaMa]: See [Resolution-robust Large Mask Inpainting with Fourier Convolutions](https://saic-mdal.github.io/lama-project/)
[^E2FGVI]: [*Towards An End-to-End Framework for Flow-Guided Video Inpainting*](https://github.com/MCG-NKU/E2FGVI)
[^HYPERCON]: [Paper link](https://openaccess.thecvf.com/content/WACV2021/papers/Szeto_HyperCon_Image-to-Video_Model_Transfer_for_Video-to-Video_Translation_Tasks_WACV_2021_paper.pdf) for *HyperCon: Image-to-Video Model Transfer for Video-to-Video Translation Tasks*
[^CFBVIO]: See [paper link](https://arxiv.org/abs/2010.06349) and [repo link](https://github.com/z-x-yang/CFBI) for *Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration*