# StreetscapeAI: Enhancing Urban Vision with Fine-Tuned CLIP Models for Street View Imagery
## Introduction
The ability of machines to comprehend and interpret images has long been a cornerstone of technological advancement, as they progressively learn to understand the visual world in ways that parallel human perception. One prominent player in this field is CLIP (Contrastive Language-Image Pre-Training), a breakthrough in cross-modal understanding with remarkable capabilities across a wide range of tasks. While CLIP's prowess is evident, one specific context demands our attention: the intricate streetscapes that define our urban environments. In this paper, we set out to enhance the CLIP model's performance in recognizing street view images, opening avenues for innovative applications at the intersection of AI and urban life.
### About CLIP
- **Introduction:**
- CLIP: Contrastive Language–Image Pretraining.
- Bridges the gap between understanding images and text.
- Enables simultaneous processing of both visual and textual data.
- Performs tasks such as zero-shot image classification, image–text retrieval, and ranking images against text prompts by similarity.
- **Contrastive Learning Approach:**
- Pulls similar image-text pairs closer in the embedding space.
- Pushes unrelated pairs apart.
- Forms connections between visual content and textual descriptions.
- **Generalization and Few-Shot Learning:**
- Can generalize across various tasks without task-specific fine-tuning.
- Demonstrates zero-shot and few-shot learning capabilities.
- Applies learned understanding to new tasks with minimal examples or instructions.
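As a minimal illustration of this zero-shot behaviour, the sketch below scores a single street view photo against a few candidate captions using the Hugging Face `transformers` implementation of CLIP. The image path and candidate captions are placeholders chosen for illustration, not part of our dataset.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the publicly released CLIP checkpoint (ViT-B/32 backbone).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical street view photo and candidate captions.
image = Image.open("street_view.jpg")
captions = [
    "a street view photograph of a busy intersection",
    "a street view photograph of a quiet residential street",
    "a street view photograph of a highway overpass",
]

# Encode both modalities and rank captions by image-text similarity.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per caption
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{prob:.3f}  {caption}")
```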
### Pre-training
- **Pre-training:** A foundational concept in machine learning and natural language processing (NLP).
- **General Features:** Involves training a model on a large dataset to learn general features, patterns, and representations from the data.
- **Learning before Specialization:** Model gains broad knowledge before fine-tuning for specific tasks.
- **Analogous to Human Learning:** Resembles how humans acquire general knowledge before applying it to specific skills.
- **NLP Context:** Often applied to neural network models, like transformers, in NLP tasks.
- **Massive Text Corpora:** Model pre-trained on vast text datasets to understand language structure and semantics.
- **Predictive Learning:** Pre-training phase involves predicting missing words, understanding word relationships, and capturing context.
- **Fine-tuning:** After pre-training, model is adapted to specific tasks with narrower datasets.
- **Improved Performance:** Fine-tuning leverages pre-trained knowledge, leading to better task-specific performance.
- **Widely Successful:** Pre-training approach revolutionized NLP and other domains, enabling breakthroughs in AI applications.
### Fine-tuning
- **Brief:**
- **Definition:** Fine-tuning refers to the process of adjusting a pre-trained machine learning model to perform a specific task or set of tasks.
- **Purpose:** It allows models to adapt their learned representations to new tasks without starting training from scratch.
- **Key Aspects of Fine-Tuning:**
- **Pretrained Models:** Fine-tuning begins with a model that has been trained on a large dataset for a general task, like language understanding or image recognition.
- **Transfer Learning:** Fine-tuning leverages transfer learning, where knowledge learned from one task is applied to another related task.
- **Benefits and Use Cases:**
- **Efficiency:** Fine-tuning saves time and resources compared to training models from scratch.
- **Customization:** Models can be tailored to perform well on specific tasks or domains.
- **Domain Adaptation:** Fine-tuning helps models adapt to new data distributions or tasks.
- **Challenges and Considerations:**
- **Overfitting:** Fine-tuning can lead to overfitting if the task-specific dataset is too small or dissimilar to the original training data.
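One common way to mitigate this overfitting risk, sketched below as an assumption on our part rather than a fixed part of our methodology, is to freeze most of the pre-trained weights and update only CLIP's projection layers during fine-tuning.

```python
from transformers import CLIPModel

# Start from the pre-trained checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze every parameter, then re-enable gradients only for the small
# projection heads that map image and text features into the shared space.
for param in model.parameters():
    param.requires_grad = False
for param in model.visual_projection.parameters():
    param.requires_grad = True
for param in model.text_projection.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```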
### Goal
Our model should be able to generate descriptions of a given street view photo, covering both the overall environment and the individual components in the photograph.
## Related Works
* [**Fine tuning CLIP with Remote Sensing (Satellite) images and captions**](https://huggingface.co/blog/fine-tune-clip-rsicd)
This work fine-tuned the CLIP network with satellite images and captions from the RSICD dataset, a dataset that pairs satellite images with captions. The model can be used by applications to **search through large collections of satellite images** using textual queries, which may describe the image as a whole (beach, mountain, airport, etc.) or mention specific geographic or man-made features within these images.
* [**Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization**](https://arxiv.org/pdf/2302.00275.pdf)
This work fine-tuned CLIP in the domain of geolocalization, using street view photos and geographic captions. The result is a robust, publicly available foundation model that not only achieves state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also does so in a zero-shot setting, outperforming supervised models trained on more than 4 million images.
## Methodology
### Training Process
1. Types of Models
2. Dataset Collection
3. Model Training
4. Model Testing
5. Model Evaluation
### Data Collection
**Method A. Real-World Street View Images**
Dataset: Mapillary Vistas
In this method, we segment the photos into objects and then compose the detected objects into a descriptive sentence, as roughly illustrated below. However, this method is very time- and resource-consuming, and it is hard to determine which parts of the photo should be highlighted.
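As a rough illustration of that composition step (the label names and counts below are hypothetical, not taken from Mapillary Vistas), detected object classes could be turned into a caption with a simple template:

```python
from collections import Counter

# Hypothetical output of a segmentation model: one class label per detected object.
detected = ["car", "car", "pedestrian", "traffic light", "building", "building", "building"]

# Count each class and compose the counts into a descriptive sentence.
counts = Counter(detected)
parts = [f"{n} {label}{'s' if n > 1 else ''}" for label, n in counts.items()]
caption = "A street view photograph containing " + ", ".join(parts) + "."
print(caption)  # e.g. "A street view photograph containing 2 cars, 1 pedestrian, ..."
```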
**Method B. Using Midjourney to Generate Photos**
**About Midjourney**
Midjourney is a generative artificial intelligence program that creates images from natural language descriptions, called "prompts", similar to OpenAI's DALL-E and Stable Diffusion. Midjourney is widely regarded as producing particularly high-quality results.
In this method, we will
1. Create prompts with ChatGPT and feed them to Midjourney
2. Collect the image Midjourney returns for each prompt.
Example:

3. Build a dataset from the images Midjourney generates, paired with the prompts and descriptions we fed it (see the sketch after this list).
4. Use the dataset to fine-tune the model.
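A minimal sketch of step 3, assuming each generated image is saved under `images/` and its prompt under `prompts/` with matching file names (a layout we choose for illustration, not something Midjourney produces on its own):

```python
import csv
from pathlib import Path

# Pair each generated image with the prompt that produced it.
rows = []
for img_path in sorted(Path("images").glob("*.png")):
    prompt_path = Path("prompts") / f"{img_path.stem}.txt"
    rows.append({"image": str(img_path), "caption": prompt_path.read_text().strip()})

# Write an image-caption manifest that the fine-tuning script can load.
with open("streetscape_captions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```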
### Model Training
#### CLIP Model Fine-Tuning
1. Begin each epoch by initializing a tqdm progress bar to track training progress.
2. In each iteration, load a batch of images and their corresponding captions.
3. The data is passed through the model, generating predictions.
4. These predictions are compared with the ground truth to calculate the loss.
5. This loss is then back-propagated through the network to update the model’s parameters.
6. This fine-tuning process will continue for the number of epochs defined, gradually improving the model’s understanding of the relationship between our specific set of images and their corresponding captions.
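A minimal sketch of this loop using the Hugging Face `transformers` CLIP implementation is shown below. The dataset variable, batch size, learning rate, and epoch count are placeholders to be tuned on our data, not final choices.

```python
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Placeholder: train_pairs is a list of (PIL image, caption) pairs from our dataset.
def collate(batch):
    images, captions = zip(*batch)
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True, truncation=True)

loader = DataLoader(train_pairs, batch_size=32, shuffle=True, collate_fn=collate)
num_epochs = 3

model.train()
for epoch in range(num_epochs):
    progress = tqdm(loader, desc=f"epoch {epoch}")   # step 1: progress bar
    for batch in progress:                           # step 2: batch of images + captions
        outputs = model(**batch, return_loss=True)   # steps 3-4: forward pass + contrastive loss
        outputs.loss.backward()                      # step 5: back-propagate and update
        optimizer.step()
        optimizer.zero_grad()
        progress.set_postfix(loss=outputs.loss.item())
```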
### Model Evaluation
**A. Following the Approach of RSICD CLIP**
1. Sort images into categories.
2. Compare each image with a set of 30 caption sentences of the form "A street view photograph of {category}".
3. The model will produce a ranked list of the 30 captions, from most relevant to least relevant.
4. Categories corresponding to the captions with the top k scores (for k = 1, 3, 5, and 10) are compared with the category provided via the image file name.
5. The scores are averaged over the entire set of images used for evaluation and reported for each value of k.
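A sketch of this procedure follows; the category list is truncated for illustration (30 categories in practice) and the fine-tuned checkpoint path is a placeholder.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("path/to/fine-tuned-clip")   # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

categories = ["intersection", "residential street", "highway", "bridge"]  # 30 in practice
prompts = [f"A street view photograph of {c}" for c in categories]

@torch.no_grad()
def topk_hits(image, true_category, ks=(1, 3, 5, 10)):
    # Score the image against every caption and rank categories by similarity.
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image[0]
    ranked = [categories[i] for i in scores.argsort(descending=True).tolist()]
    # Averaging these booleans over the evaluation set gives top-k accuracy.
    return {k: true_category in ranked[:k] for k in ks}
```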
**B. Compare Performance with Supervised Model**
**C. Automated Metrics and Human Evaluation**
**Automated Metrics:**
1. **BLEU (Bilingual Evaluation Understudy):** Originally designed for machine translation, BLEU measures the overlap between generated descriptions and reference descriptions. It's a common metric used for evaluating the quality of generated text.
2. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):** Similar to BLEU, ROUGE evaluates the quality of generated text by comparing it to reference descriptions. It focuses on the overlap of n-grams (word sequences) between generated and reference text.
3. **METEOR (Metric for Evaluation of Translation with Explicit ORdering):** METEOR combines precision and recall using stemming and synonym matching. It aims to address some of the limitations of BLEU and ROUGE.
4. **CIDEr (Consensus-Based Image Description Evaluation):** CIDEr measures the similarity between generated descriptions and reference descriptions based on cosine similarity of TF-IDF features.
5. **SPICE (Semantic Propositional Image Caption Evaluation):** SPICE evaluates the semantic quality of generated descriptions by considering semantic structures and attribute agreements.
6. **Perplexity:** Perplexity measures how well a language model predicts a sample of text. Lower perplexity indicates that the model assigns higher probability to the text, i.e., predicts it better.
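For example, sentence-level BLEU can be computed with NLTK as in the sketch below; the reference and generated captions shown are made up for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference (human) caption and model-generated caption, tokenized.
reference = ["a busy urban street with pedestrians crossing at a traffic light".split()]
candidate = "a busy street with people crossing at a traffic light".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```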
**Human Evaluation:**
1. **Human Judgments:** Have human evaluators rate the quality of generated descriptions based on factors like relevance, fluency, and informativeness. This can provide nuanced insights into the quality of generated content.
2. **Human Comparisons:** Conduct preference tests where human evaluators compare multiple generated descriptions and rank them in terms of quality.
3. **User Studies:** Collect feedback from potential users who interact with the generated descriptions in real-world scenarios. This can provide insights into the practical utility of the generated content.
## Conclusion
## Future Works
### Applications
1. Generate descriptive content as a navigation aid for blind and visually impaired people traveling on the road.
2. A street view imagery (SVI) search engine for **(1)** objects on the road and **(2)** distinctive landscapes.
## References
### Papers
* [Jaywalking detection and localization in street scene videos using fine-tuned convolutional neural networks](https://link.springer.com/article/10.1007/s11042-023-14922-z)
* [CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study](https://vtechworks.lib.vt.edu/bitstream/handle/10919/110853/Djoufack_Basso_L_T_2022.pdf?sequence=1&isAllowed=y)
### Blogs
* [Fine tuning CLIP with Remote Sensing (Satellite) images and captions](https://huggingface.co/blog/fine-tune-clip-rsicd)
* [A LLaMa 2, Midjourney & Autodistill Computer Vision Pipeline](https://blog.roboflow.com/midjourney-computer-vision-data/)
* [How to Analyze and Classify Video with CLIP](https://blog.roboflow.com/how-to-analyze-and-classify-video-with-clip/#)