# Augmented Intelligence and Interaction (AII) Workshop Notes

###### tags: `AI Workshop`

## 2023/03/24

### From Seeing to Doing: Understanding and Interacting with the Real World (李飛飛)

- Vision is a cornerstone of intelligence
    - understanding
    - doing
- Seeing is for **understanding** the real world
    - early attempts: hand-designed models
    - early attempts: hand-designed features, learned models
    - Internet --> large amounts of data
    - Visual Genome: scene graph
        - visual relationship understanding using scene graphs
    - Action Genome: Actions as Spatio-Temporal Scene Graphs
    - Multi-Object Multi-Actor Activity Understanding
- Do
    - The original and fundamental function of the nervous system is to link perception with action.

### 3D Comprehension for Efficient Robotic Manipulations (徐宏民)

- Challenges for deploying robotics in manufacturing
    - Generalization
        - grasp detection (GDN) for unseen objects
    - Adaptability
        - on-site object training by interactive perception and few-shot learning
    - Flexibility
        - few-shot robotic imitation (Demo2Learn)
    - Efficiency
        - fast & 6-DOF Peg-In-Hole
    - HMI
        - cluttered-scene object grounding/referring

### On Breakthroughs of Generative AI towards Artificial General Intelligence (陳佩君)

- AI 1.0: Closed-set, single-skill
    - understanding and simulation
    - Big data, large deep models
- AI 1.5: Open-world, multi-skill
    - understanding and generation
    - Multi-modal pretraining, huge foundation models
    - Florence: A New Foundation Model for Computer Vision
- AI 2.0: Artificial General Intelligence (AGI)
    - adaptive reasoning and behavior
    - Multi-modality unification, unlimited skillsets, value-driven
    - Intelligence as creativity, not memorization (Sparks of Artificial General Intelligence: Early experiments with GPT-4)

### Feature Pyramid Diffusion for Complex Scene Image Synthesis (王鈺強)

- Generative AI
    - Generative models for visual synthesis: VAE, GAN, diffusion models
- From LLM to Multi-Modal LLM
    - Limitation of LLM
        - Only handles language/text
    - Vision + Language = ?
- Single vs. Multimodal GenAI
- Vision and Language
    - Novel Image Captioning
    - Text-to-Image manipulation
        - text-to-image
        - scene-graph-to-image
        - layout-to-image

### The Right Problem to Solve in Manufacturing (陳維超)

- Simplicity and Impact
    - simple algorithms tend to win
    - complexity does not equal impact
- Adaptive Goals
    - revisit to redefine
    - same technology, new purpose
- Data Lifecycle is the Core Problem
    - Your AI is only as good as your data
    - Data Governance and Standards
        - Treat how to generate and interpret data like a math problem
        - Let us be clear about data
    - Data-as-a-service platform
- Trustworthy AI
    - Enable data exchange for zero-trust scenarios
- Transferable & Zero-Shot
    - Flexible job scheduling
    - From unstructured video to structured data
- Conclusion: Data-Centric AI
    - from big data to good data

### AI for 3D Indoor Space (孫民)

### A Novel Approach to Solving Goal-Achieving Problems for Board Games (吳毅成)

- Straightforward Method
    - Solution Tree
    - Issue: tree size
- Solution: Threat-based solution
    - Lambda Search

### RGB-D Face Recognition with Feature Disentanglement and Depth Augmentation (賴尚宏)

- Face Recognition Challenges
    - High accuracy, high security, robustness, fairness, privacy
- Multi-Modal Face Recognition
    - RGB-D face recognition
        - more robust against many kinds of variation
        - Challenges: lack of a large RGB-D face dataset; how to effectively utilize RGB-D data to improve face recognition?
- Proposed Approach to RGB-D Face Recognition
    - DepthNet
    - Multi-Modality RGB-D Face Recognition

### Multiview Regenerative Morphing (陳煥宗)

- Image Morphing & Shape Interpolation

### Lessons Learned from Self-Driving Car Operations on Public Roads in Taiwan and Australia (王傑智)

### Environment Diversification with Multi-head Neural Network for Invariant Learning (林守德)

- Unsupervised out-of-distribution generalization
    - Solution: Invariant Learning
        - invariant features & variant features

### On the Long Way to Learning Depth and Dynamic Perception (邱維辰)

### Virtual-to-Real: Vision based Navigation (李濬屹)

- First run semantic segmentation on both the real and the virtual images, which converts both kinds of data into a third space. Then carry out the actions in that third space.

## 2023/03/25

### Foundation Model for Speech Processing (李宏毅)

- Pretrain ----foundation model---> downstream
- End-to-end Speech Question Answering (SQA)
    - Typical solution: speech recognition + text QA
    - New method: end-to-end SQA model without speech recognition
- BERT trained on text --> fine-tune on DNA sequences

### When Ads Meet Conversational Interfaces (陳緼儂)

- Scenarios of Product Advertisement
    - bad ad user experience
- Two types of dialogue systems
    - open-domain chatting
    - task-oriented
    - challenge: how to bridge the two types of dialogue

### Automatic Music Generation with the Transformers (楊奕軒)

- Text-to-music
    - "Lyrics"-to-Music: Jukebox (2020, OpenAI)
    - MusicLM

### Learning to Synthesize Image and Video Contents (楊明玄)

- Image Synthesis
    - image-to-image (I2I) translation
        - challenges: paired training data & multimodal (one-to-many) mapping
        - With paired data, learn a one-to-one mapping, then manipulate in latent space
        - Unpaired data with one-to-one mapping (CycleGAN, DiscoGAN)
        - Paired data with one-to-many mapping (BicycleGAN)
    - I2I using randomly sampled attributes
    - exemplar-based I2I
    - Training with Unpaired Data: content space is shared, attribute space is not (cross-cycle consistency)
        - --> Disentangled Representation for Image Translation
    - Improve Diversity: Mode Seeking GAN
- Issues with Image Synthesis
    - Learning GANs with Limited Data
        - LeCam Regularization: track discriminator predictions using exponential moving averages (EMAs), compute the regularization loss from the EMAs, and use it to regularize discriminator training
- Image Outpainting
    - Image-to-image based modeling
        - challenges: repetitive or simple extension
    - In & Out: Outpainting via Inversion
    - InfinityGAN
- Text-to-Image
    - DALL-E (auto-regressive), DALL-E 2 (diffusion model), Midjourney (latent diffusion), Imagen (diffusion), Parti, Stable Diffusion
    - VQGAN
- Super-Resolution
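
The LeCam Regularization note above (track discriminator predictions with EMAs, then compute a regularization loss from them) can be sketched roughly as follows. This is a minimal plain-Python sketch, not the speaker's implementation. The notes do not give the exact form of the penalty, so the squared-distance form used here (real predictions against the EMA of fake predictions, and fake predictions against the EMA of real ones) is an assumption, and `LeCamRegularizer`, `regularized_d_loss`, `decay`, and `weight` are hypothetical names and values chosen for illustration.

```python
class LeCamRegularizer:
    """Tracks EMAs of discriminator outputs and turns them into a penalty."""

    def __init__(self, decay=0.99):
        self.decay = decay      # EMA decay rate (hypothetical default)
        self.ema_real = None    # EMA of D's mean prediction on real images
        self.ema_fake = None    # EMA of D's mean prediction on generated images

    @staticmethod
    def _mean(values):
        return sum(values) / len(values)

    def update(self, d_real, d_fake):
        """Fold the current batch of discriminator predictions into the EMAs."""
        mean_real, mean_fake = self._mean(d_real), self._mean(d_fake)
        if self.ema_real is None:
            # First batch: initialise both averages directly.
            self.ema_real, self.ema_fake = mean_real, mean_fake
        else:
            self.ema_real = self.decay * self.ema_real + (1 - self.decay) * mean_real
            self.ema_fake = self.decay * self.ema_fake + (1 - self.decay) * mean_fake

    def penalty(self, d_real, d_fake):
        """Assumed penalty: real predictions vs. fake EMA, fake predictions vs. real EMA."""
        reg_real = self._mean([(d - self.ema_fake) ** 2 for d in d_real])
        reg_fake = self._mean([(d - self.ema_real) ** 2 for d in d_fake])
        return reg_real + reg_fake


def regularized_d_loss(base_loss, reg, d_real, d_fake, weight=0.01):
    """Add the weighted penalty to an already computed discriminator loss."""
    reg.update(d_real, d_fake)
    return base_loss + weight * reg.penalty(d_real, d_fake)
```

In a real training loop `d_real` and `d_fake` would be tensors of discriminator outputs for one batch, and `base_loss` the usual adversarial discriminator loss; plain lists of floats are used here only to keep the sketch dependency-free.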