# 04.10
- Finetune on the data augmented with bottle and chair instances

- The mAPs of bottle and chair are improved, while the other classes do not change much.

# 04.09
- Finetune on the data augmented with bottle instances

- The mAP of bottle is improved from 0.459 to 0.491
# 04.08
- Analyze the VOC 2007 dataset
- PLCC between Faster R-CNN mAP and annotation count: 0.060
- PLCC between Faster R-CNN mAP and area mean: 0.614
- PLCC between Faster R-CNN mAP and area std: 0.686 (PLCC computation sketched after this list)
- Synthesize more objects on VOC 2007 dataset
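As a reference, a minimal sketch of how these PLCC values can be computed, assuming the per-class mAP values and annotation statistics are already collected in aligned lists (the numbers below are placeholders):

```python
# PLCC sketch: Pearson correlation between per-class mAP and a per-class statistic.
# The lists are placeholders; in practice they come from the VOC 2007 analysis above.
from scipy.stats import pearsonr

per_class_map = [0.72, 0.45, 0.61, 0.58]                 # placeholder per-class mAP values
per_class_area_mean = [9200.0, 3100.0, 5400.0, 4800.0]   # placeholder mean bbox areas

plcc, p_value = pearsonr(per_class_map, per_class_area_mean)
print(f"PLCC: {plcc:.3f} (p={p_value:.3f})")
```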
# 04.05
- Synthesize more objects per image with various sizes
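A rough sketch of this synthesis step, assuming a diffusers inpainting pipeline and boxes taken from the existing annotations; the checkpoint, scale factors, and prompt template are assumptions, and the image is assumed to already be at the pipeline resolution (e.g. 512x512):

```python
# Sketch: inpaint several objects of varied size into one image.
# Checkpoint, prompt template, and scale factors are assumptions.
import random
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting")  # assumed checkpoint

def inpaint_objects(image: Image.Image, boxes, category: str, scales=(1, 2, 3, 4)):
    """Enlarge each box by a randomly sampled factor and inpaint the category into it."""
    for (x1, y1, x2, y2) in boxes:
        s = random.choice(scales)                        # size factor, as in the 04.03 entry
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        hw, hh = (x2 - x1) * s / 2, (y2 - y1) * s / 2
        box = (max(0, cx - hw), max(0, cy - hh),
               min(image.width, cx + hw), min(image.height, cy + hh))
        mask = Image.new("L", image.size, 0)
        ImageDraw.Draw(mask).rectangle(box, fill=255)    # white region gets inpainted
        image = pipe(prompt=f"a photo of a {category}",
                     image=image, mask_image=mask).images[0]
    return image
```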
# 04.04
- Use the new inpainting method BrushNet to synthesize data.
- The resulting mAP (0.281) is similar to ControlNet.
# 04.03
- Finish synthesizing hair drier at various sizes (randomly sampled from 1, 2, 3, 4).
- The mAP of hair drier is not improved.
# 04.02
- Visualize the mAP and standard deviation of area for each class. The PLCC is 0.3651.
- Visualize the mAP and std/mean of area for each class. The PLCC is -0.4756.
- Synthesize data of various sizes to improve diversity.
# 04.01
- Visualize the mAP and annotation number for each class. The PLCC is 0.0406.
- Visualize the mAP and average area for each class. The PLCC is 0.4012.
# 03.29
- Visualize the results on different classes

- Visualize the image labels of the classes with low accuracy
# 03.27 & 03.28
- Split the COCO labels into 60 and 20 classes (see the split sketch after this list)
Keep 5 shots for each of the 20 classes
Train on the 60 classes, then finetune on the 20 classes
Inpaint on the training dataset based on the class with a similar CLIP score
- Finetune on the 5-shot COCO dataset: 2.5%
Finetune on the synthesized dataset, then on the 5-shot COCO dataset: 3.3%
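A rough sketch of the 60/20 split and 5-shot sampling, using the pycocotools API; the annotation path, the concrete category split (sampled randomly here instead of a fixed list), and K=5 per class are assumptions taken from the notes above:

```python
# Sketch: split COCO categories into 60 base / 20 novel classes and keep 5 shots per novel class.
# The annotation path is assumed, and the random split stands in for the actual fixed split.
import random
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")   # assumed annotation file
cat_ids = sorted(coco.getCatIds())
novel_cat_ids = set(random.sample(cat_ids, 20))       # placeholder for the fixed 20-class split
base_cat_ids = [c for c in cat_ids if c not in novel_cat_ids]

k = 5
few_shot_ann_ids = []
for cat_id in novel_cat_ids:
    ann_ids = coco.getAnnIds(catIds=[cat_id], iscrowd=False)
    few_shot_ann_ids += random.sample(ann_ids, min(k, len(ann_ids)))

few_shot_anns = coco.loadAnns(few_shot_ann_ids)
print(f"{len(base_cat_ids)} base classes, {len(novel_cat_ids)} novel classes, "
      f"{len(few_shot_anns)} few-shot annotations")
```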
# 03.26
- The results of using CLIP score filtering:
Synthesized COCO and COCO dataset + data filtering: 0.285
I also tested the CLIP score on real COCO; the results are similar to those on the synthesized data.
# 03.25
- Use the CLIP score for data filtering (scoring sketch below)
- Split the COCO data into 60 classes and 20 classes
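A minimal sketch of the CLIP-score filtering, assuming each synthesized object crop is scored against a text prompt for its category with the Hugging Face CLIP model; the checkpoint, prompt template, and threshold are assumptions:

```python
# Sketch: score a synthesized object crop with CLIP and keep it above a threshold.
# Checkpoint, prompt template, and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(crop: Image.Image, category: str) -> float:
    """Cosine similarity between the crop and 'a photo of a {category}'."""
    inputs = processor(text=[f"a photo of a {category}"], images=crop,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_crop(crop: Image.Image, category: str, threshold: float = 0.25) -> bool:
    """Filter rule: keep the synthesized crop only if its CLIP score passes the cutoff."""
    return clip_score(crop, category) >= threshold  # cutoff is an assumed value
```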
# 03.22
- Synthesize data using COCO with OpenImages labels
- Read paper [Data Augmentation for Object Detection via Controllable Diffusion Models](https://openaccess.thecvf.com/content/WACV2024/papers/Fang_Data_Augmentation_for_Object_Detection_via_Controllable_Diffusion_Models_WACV_2024_paper.pdf)
# 03.21
- Add noise to the test dataset to test the robustness of training with the synthesized dataset. The noise is (0, 25) (see the sketch after this list)
mAP on real COCO: 0.178
mAP on synthesized dataset and COCO: 0.176
mAP on synthesized COCO and real COCO + data filtering (threshold: 0.3): 0.176
mAP on synthesized COCO and real COCO + data filtering (threshold: 0.7): 0.176
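A small sketch of the noise injection used for this robustness test, assuming "(0, 25)" means additive Gaussian noise with mean 0 and standard deviation 25 on uint8 test images:

```python
# Sketch: additive Gaussian noise (mean 0, std 25) on a uint8 test image.
import numpy as np

def add_gaussian_noise(image: np.ndarray, std: float = 25.0, seed=None) -> np.ndarray:
    """image: HxWx3 uint8 array; returns a noisy uint8 array of the same shape."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=std, size=image.shape)
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```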
# 03.20
Try to synthesize data using ddpm inversion and self-attention and cross-attention guidance diffusion
# 03.19
- The results are:
mAP on real COCO: 0.282
mAP on synthesized COCO and real COCO: 0.285
mAP on synthesized COCO and real COCO + data filtering (threshold: 0.3): 0.283
mAP on synthesized COCO and real COCO + data filtering (threshold: 0.7): 0.282
# 03.18
- Add data filtering on the synthesized dataset
- The data filtering uses a detector trained on the real COCO dataset. The detector is applied to the synthesized COCO images, and a synthesized box is kept only if a predicted box with a sufficient confidence score overlaps it with an IoU above 0.2 (sketched below).
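A rough sketch of this filtering rule, assuming the synthesized boxes and the detector predictions are available as xyxy tensors; the confidence cutoff is an assumption (the 0.3 / 0.7 values swept in the 03.19 and 03.21 entries are presumably this threshold):

```python
# Sketch: keep a synthesized box only if the COCO-trained detector fires on it
# with IoU > 0.2 and enough confidence. Tensor layout and cutoff are assumptions.
import torch
from torchvision.ops import box_iou

def filter_synth_boxes(synth_boxes, pred_boxes, pred_scores,
                       iou_thr=0.2, score_thr=0.3):
    """synth_boxes: (N, 4), pred_boxes: (M, 4) in xyxy format, pred_scores: (M,)."""
    confident = pred_boxes[pred_scores >= score_thr]
    if confident.numel() == 0:
        return torch.zeros(len(synth_boxes), dtype=torch.bool)
    ious = box_iou(synth_boxes, confident)        # (N, M') pairwise IoU
    return ious.max(dim=1).values > iou_thr       # keep mask over the synthesized boxes
```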
# 03.15
- Deal with the DomainNet dataset
- Train the model on the whole COCO dataset and the synthesized COCO dataset
# 03.14
- Generate more data on the whole COCO dataset
- Data filtering on the synthesized COCO dataset
# 03.08
- Faster R-CNN (ViT) result:
First synthesized dataset, then COCO dataset: mAP 0.4340
- Faster R-CNN result:
Synthesized dataset, training from scratch: mAP 0.268
# 03.07
- Faster R-CNN (ViT) result:
COCO dataset: mAP 0.5100
Only synthesized dataset training: mAP 0.4090
COCO dataset + synthesized dataset training (50%): mAP 0.4290
# 03.06
- Working on the ViT training on the synthesized dataset
# 03.05
- Faster R-CNN result:
COCO dataset: mAP 0.282
COCO dataset training + synthesized dataset training: mAP 0.271
COCO dataset training + synthesized and COCO dataset (50%) training: mAP 0.285
# 03.04
- Use the COCO dataset to train the Faster R-CNN model
- Use the synthesized dataset generated by inpainting to finetune the Faster R-CNN model (config sketch below)
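A minimal config sketch for this fine-tuning step, assuming mmdetection 2.x config conventions; the base config name, dataset paths, checkpoint path, and learning rate are all assumptions:

```python
# Sketch of an mmdetection-style fine-tuning config (2.x conventions assumed).
# All paths and the learning rate are assumptions.
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'

data = dict(
    train=dict(
        ann_file='data/coco_synth/annotations/instances_train.json',  # synthesized set (assumed path)
        img_prefix='data/coco_synth/images/'))

# start from the detector trained on real COCO (assumed checkpoint path)
load_from = 'work_dirs/faster_rcnn_r50_fpn_1x_coco/latest.pth'

# smaller learning rate for fine-tuning
optimizer = dict(type='SGD', lr=0.002, momentum=0.9, weight_decay=0.0001)
```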
# 03.01
- Synthesize the data on the COCO dataset and compute FID on a 2k dataset (FID computation sketched below).

The FID of Stable Diffusion + ControlNet: 41.6509
The FID of Stable Diffusion inpainting: 46.7153
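One way the FID numbers above could be computed; this is a sketch with torchmetrics (the exact tool is not recorded in the log), assuming the real and synthesized images are loaded as uint8 tensors:

```python
# Sketch: FID between real and synthesized image sets via torchmetrics (one possible tool).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# real_batch / fake_batch: uint8 tensors of shape (N, 3, H, W); random placeholders here
real_batch = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(float(fid.compute()))
```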
# 02.29
- Set up mmdetection for COCO data training
- Read paper [InstaGen](https://arxiv.org/pdf/2402.05937.pdf) and [Effective Data Augmentation](https://arxiv.org/pdf/2302.07944.pdf)
# 02.28
- Change the step at which the mask constraint is added. Now the best result is

# 02.27
- Replace the attention map with the mask in the ControlNet.
- Working on optimization based on the attention map and mask
# 02.26
- Read the code of [Instance Diffusion](https://github.com/frank-xwang/InstanceDiffusion)
- Still working on adding the mask constraint to the attention map
# 02.23
- Train ControlNet with the SD 2.1 model
- Working on adding attention replacement and mask-based optimization combined with the ControlNet model
# 02.22
- Use the mask to blend the noised background into the ControlNet method. The results are

The performance is better than that without the blending operation.
- Test the code of [Instance Diffusion](https://github.com/frank-xwang/InstanceDiffusion). Consider using it for image editing.
# 02.21
- Test different learning rates and batch sizes; the results are similar to yesterday's.
- Still working on adding the mask constraint in the ControlNet method.
# 02.20
- Increase the learning rate for ControlNet training from scratch. Now it starts to converge

- Working on adding the mask constraint in the ControlNet method
# 02.16
- Add the prompts from LLaVA to train the ControlNet inpainting. Load the pretrained weights from https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint. The results align with the mask well.

- Prepare the data for COCO mini-training
# 02.15
- The previous training cannot converge. Prepare data using the SAM dataset captioned by LLaVA, as in https://huggingface.co/datasets/PixArt-alpha/SAM-LLaVA-Captions10M/tree/main
- Read paper [MIGC](https://arxiv.org/pdf/2402.05408.pdf) and [ECLIPSE](https://arxiv.org/pdf/2402.05195.pdf)
# 02.14
- Test the code of [Shape-guided diffusion](https://arxiv.org/pdf/2212.00210.pdf). The results are good

- Train the inpainting ControlNet with the SAM dataset. The training progress is at https://wandb.ai/swen/controlnet-demo
# 02.13
- Train the inpainting ControlNet with the SAM dataset
- Read the code of [Boundary Attention](https://arxiv.org/pdf/2401.00935.pdf)
# 02.12
- Finish adding the mask constraint in the DDIM inversion. The results are

- Preparing the SAM dataset for inpainting ControlNet training
- Read paper [Fast Training of Diffusion Models](https://arxiv.org/pdf/2306.09305.pdf) and [PixArt](https://arxiv.org/pdf/2310.00426.pdf)
# 02.09
- Still working on adding the mask constraint in the DDIM inversion
- Read the code of the inpainting ControlNet and consider training it with the SAM dataset
# 02.08
- Work on adding the mask constraint in the DDIM inversion for better reconstruction
- Discuss accurate inpainting models with Vikash. Read the SAM paper and [MGIE](https://arxiv.org/pdf/2309.17102.pdf)
# 02.07
- Work on other masks from SAM. Results are shown below

- Finish combining prompt2prompt with DDIM inversion and blended latent diffusion. The result is the bottom left. The results are still distorted. Now I am working on adding the mask constraint in the DDIM inversion to make the learned attention map also follow the mask.
# 02.06
- Work on combining prompt2prompt with DDIM inversion and blended latent diffusion. The results are currently distorted.
- Read paper [InstanceDiffusion](https://arxiv.org/pdf/2402.03290.pdf), [DiffEditor](https://arxiv.org/pdf/2402.02583.pdf)
# 02.05
- Use prompt2prompt to replace the mask, generated using DDIM inversion

The source prompt is "black front background", while the target prompt is "cat front background". The result changes the black mask into a black cat. However, it still looks like a black mask, and it changes the surrounding content. The next step is to refine the attention using the mask.
- Read paper [Boundary Attention](https://arxiv.org/pdf/2401.00935.pdf) and [Spatial-Aware Latent Initialization](https://arxiv.org/pdf/2401.16157.pdf)
# 02.02
- Test replacing the attention map with various hyperparameters.
When replacing it with the mask, the results are

When replacing it with the mask generated using DDIM inversion, the results are

- Work on prompt2prompt to replace the mask generated using DDIM inversion
# 02.01
- Combine MultiDiffusion with blended diffusion

The result is similar to blended diffusion, which means it only guarantees that the content within the mask aligns with the prompt.
- Read paper [AdapEdit](https://arxiv.org/pdf/2312.08019.pdf), [ActAnywhere](https://arxiv.org/pdf/2401.10822.pdf)
- Work on replacing the attention map with various hyperparameters.
# 01.31
- Replace the attention map with the attention map generated from DDIM inversion. Results are not improved.
- Test the code of MultiDiffusion.
# 01.30
- Inject the attention map from DDIM inversion into the DDIM forward pass in blended latent diffusion

The optimization makes the content leave the image manifold
- Read paper [Diffusion-Based Image Editing with Instant Attention Masks](https://arxiv.org/pdf/2401.07709.pdf), [Shape-guided diffusion](https://arxiv.org/pdf/2212.00210.pdf), [TF-ICON](https://arxiv.org/pdf/2307.12493.pdf)
# 01.29
- Synthesize the masked image. Write the prompt to do DDIM inversion and visualize the attention maps.

- Utilize the attention maps above to replace the mask with the inpainted content.
- The result shows that the editing learns the shape of the mask but retains the content from the input. The next step is to inject the attention map into blended latent diffusion
# 01.26
- Experimenting with various hyperparameters yielded diverse results, as shown in the images below:

The attention maps generated during the optimization process are visualized as follows:

Although the generated content covers the entire mask, there are two issues:
-- The generated content lacks reasonable structure.
-- The image quality has degraded.
- Experimenting with BCE loss. Some results:

-- Although image quality has improved, the generated content is still less reasonable.
- One idea: the optimization may push the content off the image manifold. I will try to learn the layout with diffusion models and use pixel2pixel to replace the mask with the content.
# 01.25
- Finish the code combining blended latent diffusion with attention control by maximizing attention values inside the mask and minimizing attention values outside the mask. A test case is

-- The foreground does not occupy the entire mask.
-- Distorted object.
- Test different hyperparameters, such as optimization steps and blending steps for the background.
- (Doing) Write the experiment to use BCE for supervising attention maps and visualize the attention maps.
# 01.24
- Read paper [MultiDiffusion](https://arxiv.org/pdf/2302.08113.pdf) and [DirectDiffusion](https://arxiv.org/pdf/2302.13153.pdf)
- (Doing) Implement the code combining blended latent diffusion with optimization-based attention control, as planned yesterday. Deal with a version conflict between latent diffusion and diffusers while modifying the blended latent diffusion code.
- Set up AWS EC2 Access
# 01.23
- Read paper [Zero-shot spatial layout conditioning](https://arxiv.org/pdf/2306.13754.pdf)
-- Spatial self-guidance by BCE loss between mask and attention map
- (Doing) Implement the code combining blended latent diffusion with optimization-based attention control:
-- BCE loss between the mask and the attention map
-- Maximize attention values inside the mask and minimize attention values outside the mask
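A small sketch of the two attention losses listed above, assuming the cross-attention map for the target token has already been resized to the mask resolution; tensor shapes and normalization are assumptions:

```python
# Sketch: attention-control losses between a cross-attention map and the inpainting mask.
# `attn` and `mask` are assumed to be (H, W) tensors at the same resolution, mask in {0, 1}.
import torch
import torch.nn.functional as F

def bce_attention_loss(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """BCE between the (max-normalized) attention map and the binary mask."""
    attn = attn / attn.max().clamp_min(1e-8)       # scale attention values into [0, 1]
    return F.binary_cross_entropy(attn, mask.float())

def in_out_attention_loss(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Maximize attention inside the mask, minimize it outside (lower loss is better)."""
    inside = (attn * mask).sum() / mask.sum().clamp_min(1)
    outside = (attn * (1 - mask)).sum() / (1 - mask).sum().clamp_min(1)
    return outside - inside
```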
# 01.22
- Set up Sony accounts and the laptop
- Read paper and test the code: [Blended Latent Diffusion](https://arxiv.org/abs/2206.02779)
-- Blend the background and foreground in the latent space (blending step sketched at the end of this entry)
-- Inability to ensure the foreground object covers the entire mask.
- Read paper and test the code: [Masked-Attention Diffusion Guidance](https://arxiv.org/pdf/2308.06027.pdf)
-- Swaps cross-attention maps with constant maps.
-- Observed misalignment issues with the background.
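For reference, a minimal sketch of the latent blending step in Blended Latent Diffusion, assuming the source image's latents have already been noised to each timestep and the mask has been downsampled to the latent resolution; names are illustrative:

```python
# Sketch: per-step latent blending (Blended Latent Diffusion).
# `noised_source` is the source latent noised to the current timestep (assumed precomputed);
# `mask_latent` is the inpainting mask downsampled to the latent resolution, values in [0, 1].
import torch

def blend_step(latents: torch.Tensor, noised_source: torch.Tensor,
               mask_latent: torch.Tensor) -> torch.Tensor:
    """Keep the denoised foreground inside the mask; reuse the noised background outside."""
    return mask_latent * latents + (1 - mask_latent) * noised_source
```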