# 112-1 Intro to AI

Release date: 2023/09/14
Due date: 2023/09/26 21:00

## Homework 1

###### tags: `1121iai`

In this assignment, we will gain a deeper understanding of embedding spaces and representation learning by using the **ImageBind** toolkit.

TA: 蘇蓁葳 [b09611048@ntu.edu.tw](mailto:b09611048@ntu.edu.tw)
Original idea suggested by 沈兆軒

## Introduction

### Embedding space

![](https://hackmd.io/_uploads/r1ZDQWECh.png)

An embedding is a lower-dimensional space into which high-dimensional vectors can be mapped. This makes machine learning on large inputs, such as sparse word vectors, far more tractable. Ideally, an embedding captures the underlying semantics of the input by placing semantically related items close together in the embedding space. Embeddings can also be learned once and reused across different models, which adds to their versatility.

Watch these [videos](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) if you want to learn more about embedding spaces.

### ImageBind

(Original article source: https://imagebind.metademolab.com/)

ImageBind, a groundbreaking AI model from Meta AI's multimodal research lab, seamlessly connects data from different sources such as images, audio, and text. It learns the relationships between these data types, enhancing existing AI models and enabling new applications.

<figure>
<img src="https://hackmd.io/_uploads/HkUMH_XR3.png" alt="ImageBind overview" style="width:100%">
<figcaption align = "center">source: https://imagebind.metademolab.com/</figcaption>
</figure>

ImageBind's core idea is that images possess a unique property: they can link to many other modalities. For instance, an image of a dog can be connected to its bark, a textual description of its breed, depth data, thermal information, and motion data.
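The "semantically related elements in proximity" idea above can be illustrated with a toy sketch. The 4-dimensional vectors here are made up purely for illustration (real embeddings are learned by a model and have hundreds of dimensions), but the principle is the same: related concepts have a high cosine similarity, unrelated ones a low similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings" -- illustrative only, not real model output
dog   = np.array([0.9, 0.1, 0.8, 0.2])
puppy = np.array([0.8, 0.2, 0.9, 0.1])   # semantically close to "dog"
car   = np.array([0.1, 0.9, 0.0, 0.7])   # semantically distant from "dog"

print(cosine_similarity(dog, puppy))  # high: related items sit nearby
print(cosine_similarity(dog, car))    # low: unrelated items are far apart
```

You will compute the same kind of similarity between real ImageBind embeddings in the tasks below.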
ImageBind capitalizes on this property to create a single shared embedding space for six modalities: images/video (vision), audio, text, depth, thermal, and IMU (motion) data. It encodes each modality into this common space, where embeddings can be compared and combined through embedding arithmetic, such as adding or subtracting vectors. Paired with suitable decoders, these shared representations can also be used to generate content in other modalities while preserving the semantics of the original input.

ImageBind's utility lies in its ability to help machines analyze diverse information collectively. It outperforms specialized single-modality models and enables applications such as audio-based and cross-modal search, multimodal arithmetic, and cross-modal data generation. For example, it can transform audio into images, creating a visual representation of a rainforest or a bustling market from sound alone. Future prospects include more precise content recognition and moderation, streamlined media creation, and enhanced multimodal search.

For more details, please refer to [ImageBind](https://imagebind.metademolab.com/).

## ImageBind Practice

### Set up Google Colab

1. Make a copy of this [Colab notebook](https://colab.research.google.com/drive/1VEuWte-VF5NzgChauQFq8T_Zd1KEhFY-?usp=sharing).
2. Go to your Google Drive and create a directory named **AI_HW1**.
3. Inside **AI_HW1**, create three directories named **Car**, **Dog**, and **Bird**.
4. Choose **Runtime -> Change runtime type** (**執行階段 -> 變更執行階段類型**) and select GPU.
<img src="https://hackmd.io/_uploads/B1nOWz11p.png" alt="Image 1" height="300" />
<img src="https://hackmd.io/_uploads/ryE1zzkJp.png" alt="Image 2" height="200" />
5. Run the code up to `# Sample Code` and make sure it runs correctly.

### Prepare data

1. Please prepare 10 pictures for each category. (10%)

### Fill up the TODO blocks

1.
Choose one picture from each category, generate its embedding, and build three 3×3 matrices (inner product, softmax of the inner products, and cosine similarity to the text) for the three selected pictures. (15%)
![](https://hackmd.io/_uploads/SkpBNxxya.png)
2. Generate embeddings for all the pictures in each category's directory. (10%)
3. Visualize the embeddings with t-SNE (containing 30 points) ([sklearn package document](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)). (10%)
4. Implement PCA (containing 30 points). (25%)
5. Applications of ImageBind
    - Choose three pictures from the same category, describe each of them as precisely as you can, build the 3×3 matrix (softmax of the inner products), and see how well ImageBind generates the embeddings.
![](https://hackmd.io/_uploads/SkQKNex1T.png)

## Grading policy

1. **TODO** code blocks **65%**
2. Report **25%**; please answer the following questions in the report.
    - Provide the three 3×3 matrices obtained in Task 1 (as described in the "Fill up the TODO blocks" section) and perform a comparative analysis. Explore the differences within and between these matrices and describe your observations in detail.
    - Why do we need to apply softmax after calculating the inner product?
    - Include the figures and discuss the outputs of t-SNE when using different perplexity values.
    - Include the figures generated by t-SNE and PCA, compare the results, and discuss the differences.
    - Include the three pictures and the text you selected and described. Record the results.
    - Anything you find interesting about the embedding space.
3. Gather data **10%**

## Homework policy

1. Discussing with others is encouraged, but please **do not copy from others**; you have to **write your final answer alone**.
2. You can ask **ChatGPT** for help, but do not **copy** from it.
3. How to hand in:
    - Download your code from Colab (.ipynb).
    - Submit a zip file to **NTU COOL**, which contains two files.
    - Please include the Google Drive directory link in the report.

```
{student_id}_hw1.zip # In lowercase
├── {student_id}_hw1.ipynb
└── {student_id}_report.pdf

b09611048_hw1.zip
├── b09611048_hw1.ipynb
└── b09611048_report.pdf
```

## Acknowledgments

I would like to express my gratitude to my teacher and the AI TA team for discussing this assignment with me and helping me polish my English.
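As a starting point for the PCA task, here is a minimal from-scratch sketch using only NumPy. The random 30×1024 array merely stands in for your 30 ImageBind embeddings (3 categories × 10 images); the 1024-dimensional size is an assumption for illustration, so substitute the embeddings your notebook actually produces.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components.

    X: (n_samples, n_features) array of embeddings.
    Returns an (n_samples, n_components) array of projected coordinates.
    """
    X_centered = X - X.mean(axis=0)          # PCA requires mean-centered data
    # SVD of the centered data: rows of Vt are the principal directions,
    # already sorted by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # coordinates in the PC basis

# Stand-in for 30 ImageBind embeddings (dimensionality assumed, not from the HW)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 1024))
projected = pca(embeddings, n_components=2)
print(projected.shape)  # (30, 2)
```

The two columns of `projected` can then be scattered with matplotlib, coloring the points by category, to compare against your t-SNE figure.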