# Augmented Intelligence and Interaction (AII) Workshop Notes
###### tags: `AI Workshop`
## 2023/03/24
### From Seeing to Doing: Understanding and Interacting with the Real World. (李飛飛)
- Vision is a cornerstone of intelligence
- understanding
- doing
- Seeing is for **understanding** the real world
- early attempts: hand-designed models
- early attempts: hand-designed features, learned models
- Internet --> large amounts of data
- Visual Genome: scene graph
- visual relationship understanding using scene graphs
- Action Genome: Actions as Spatio-Temporal Scene Graphs
- Multi-Object Multi-Actor Activity Understanding
- Do
- The original and fundamental function of the nervous system is to link perception with action.
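
The scene-graph bullets above can be sketched as a small data structure: objects are nodes and relationships are (subject, predicate, object) triples, and a video becomes one graph per frame. This is only a sketch. The class names, fields, and the example scene are illustrative assumptions, not the actual schema of any dataset mentioned in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One edge of the graph (hypothetical structure, not a dataset schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def add(self, subject, predicate, obj):
        # Register both endpoints as nodes, then store the edge.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(subject, predicate, obj))

    def relations_of(self, subject):
        # All (predicate, object) pairs in which `subject` is the actor.
        return [(r.predicate, r.obj) for r in self.relationships if r.subject == subject]


# Spatio-temporal view: a list of per-frame scene graphs (made-up example scene).
frame_0 = SceneGraph()
frame_0.add("person", "next_to", "table")
frame_0.add("cup", "on", "table")

frame_1 = SceneGraph()
frame_1.add("person", "holding", "cup")

video = [frame_0, frame_1]
```

Querying `frame_1.relations_of("person")` returns the relationships in which the person is the subject, which is the kind of lookup a relationship-understanding model would be trained to predict.
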
### 3D Comprehension for Efficient Robotic Manipulations (徐宏民)
- Challenges for deploying robotics in manufacturing
- Generalization
- grasp detection (GDN) for unseen objects
- Adaptability
- onsite object training by interactive perception and few-shot learning
- Flexibility
- few-shot robotic imitation (Demo2Learn)
- Efficiency
- fast & 6-DOF Peg-In-Hole
- HMI
- cluttered-scene object grounding/referring
### On Breakthroughs of Generative AI towards Artificial General Intelligence (陳佩君)
- AI 1.0: Closed-set, single-skill
- understanding and simulation
- Big data, large deep models
- AI 1.5: Open-world, multi-skill
- understanding and generation
- Multi-modal pretraining, huge foundation models
- Florence: A new Foundation Model for Computer Vision
- AI 2.0: Artificial General Intelligence (AGI)
- adaptive reasoning and behavior
- Multi-modality unification, unlimited skillsets, value driven
- Intelligence as creativity but not memorization (Sparks of Artificial General Intelligence: Early experiments with GPT-4)
### Feature Pyramid Diffusion for Complex Scene Image Synthesis (王鈺強)
- Generative AI
- Generative models for visual synthesis: VAE, GAN, Diffusion models
- From LLM to Multimodal LLM
- Limitations of LLMs
- Only handle language/text
- Vision + Language = ?
- Single vs. Multimodal GenAI
- Vision and Language
- Novel Image Captioning
- Text to Image manipulation
- text-to-image
- scene-graph-to-image
- layout-to-image
### The Right Problem to Solve in Manufacturing (陳維超)
- Simplicity and Impact
- simple algorithms tend to win
- complexity does not equal impact
- Adaptive Goals
- revisit to redefine
- same technology, new purpose
- Data Lifecycle is the Core Problem
- Your AI is only as good as your data
- Data Governance and Standards
- Treat how to generate and interpret data like a math problem
- Let us be clear about data
- Data-as-a-service platform
- Trustworthy AI
- Enable data exchange for zero-trust scenarios
- Transferrable & Zero-Shot
- Flexible job scheduling
- From unstructured video to structured data
- Conclusion: Data-Centric AI
- from big data to good data
### AI for 3D Indoor Space (孫民)
### A Novel Approach to Solving Goal-Achieving Problems for Board Games (吳毅成)
- Straightforward Method - Solution Tree
- Issue: tree size
- solution: Threat-based solution
- Lambda Search
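
The straightforward solution-tree method above can be sketched as an AND-OR search: at the attacker's turn one winning move is enough (OR node), while at the defender's turn every reply must still lead to the goal (AND node). The toy game below (take 1 to 3 stones, whoever takes the last stone wins) and the function name `solve` are assumptions chosen for illustration. The sketch does not implement threat-based or lambda search; it only shows the baseline whose tree size those methods aim to reduce.

```python
def solve(stones, attacker_to_move, counter):
    """Return True if the attacker can force taking the last stone.

    `counter` is a one-element list used to count visited tree nodes.
    """
    counter[0] += 1
    if stones == 0:
        # The player who just moved took the last stone and achieved the goal.
        return not attacker_to_move
    # Expand every child so the node count reflects the full solution tree.
    results = [
        solve(stones - take, not attacker_to_move, counter)
        for take in (1, 2, 3)
        if take <= stones
    ]
    # OR node for the attacker, AND node for the defender.
    return any(results) if attacker_to_move else all(results)


if __name__ == "__main__":
    for n in (4, 5, 12):
        counter = [0]
        print(n, solve(n, True, counter), counter[0])
```

Even in this tiny game the number of visited nodes grows quickly with the starting pile, which is the tree-size issue the notes mention.
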
### RGB-D Face Recognition with Feature Disentanglement and Depth Augmentation (賴尚宏)
- Face Recognition Challenges
- High-accuracy, high-security, robustness, fairness, privacy
- Multi-Modal Face Recognition
- RGB-D face recognition
- more robust against various variations
- Challenges: lack of a large RGB-D face dataset; how to effectively utilize RGB-D data to improve face recognition?
- Proposed Approach to RGB-D Face Recognition
- DepthNet
- Multi-Modality RGB-D Face Recognition
### Multiview Regenerative Morphing (陳煥宗)
- Image Morphing & Shape Interpolation
### Lessons Learned from Self-Driving Car Operations on Public Roads in Taiwan and Australia (王傑智)
### Environment Diversification with Multi-head Neural Network for Invariant Learning (林守德)
- Unsupervised out-of-distribution generalization
- solution: Invariant Learning
- invariant features & variant features
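
The distinction between invariant and variant features can be illustrated with synthetic data: an invariant feature keeps the same relationship to the label in every environment, while a variant (spurious) feature changes its relationship across environments. The sketch below is an assumption-laden toy, not the multi-head method from the talk; the data generator, noise level, and environment names are all made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_environment(rng, n, spurious_sign):
    """Hypothetical generator: labels are +1/-1, features are noisy copies."""
    labels = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    # Invariant feature: same dependence on the label in every environment.
    invariant = [y + rng.gauss(0.0, 0.5) for y in labels]
    # Variant feature: the sign of the dependence differs per environment.
    variant = [spurious_sign * y + rng.gauss(0.0, 0.5) for y in labels]
    return labels, invariant, variant


rng = random.Random(0)
correlations = {}
for name, sign in (("env_a", 1.0), ("env_b", -1.0)):
    labels, invariant, variant = make_environment(rng, 2000, sign)
    correlations[name] = {
        "invariant": pearson(invariant, labels),
        "variant": pearson(variant, labels),
    }

if __name__ == "__main__":
    for name, corr in correlations.items():
        print(name, corr)
```

A model that relies on the variant feature would look accurate in one environment and fail in the other, which is why invariant learning tries to keep only features whose relationship to the label is stable across environments.
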
### On the Long Way to Learning Depth and Dynamic Perception (邱維辰)
### Virtual-to-Real: Vision based Navigation (李濬屹)
- First run semantic segmentation on both real and virtual images, which converts the data from both domains into a shared third space; then perform the actions in that third space.
## 2023/03/25
### Foundation Model for Speech Processing (李宏毅)
- Pretrain ----foundation model---> downstream
- End-to-end Speech Question Answering (SQA)
- Typical solution: speech recognition + text QA
- New method: end-to-end SQA model without speech recognition
- BERT trained on text --> fine-tune on DNA sequences
### When Ads Meet Conversational Interfaces (陳緼儂)
- Scenarios of Product Advertisement
- bad ads user experience
- Two types of dialogue systems
- open-domain chatting
- task-oriented
- challenge: how to bridge the two types of dialogue
### Automatic Music Generation with the Transformers (楊奕軒)
- Text-to-music
- "Lyrics"-to-Music: Jukebox (2020, OpenAI)
- MusicLM
### Learning to Synthesize Image and Video Contents (楊明玄)
- Image Synthesis
- image-to-image (I2I) translation
- challenges: paired training data & multimodal (one-to-many) mapping
- With paired data, learn a one-to-one mapping, then manipulate in latent space
- Unpaired data with one-to-one mapping (CycleGAN, DiscoGAN)
- Paired data with one-to-many mapping (BicycleGAN)
- I2I using randomly sampled attributes
- exemplar-based I2I
- Training with Unpaired Data: content space is shared, attribute space is not (cross-cycle consistency)
- --> Disentangled Representation for Image Translation
- Improve Diversity: Mode Seeking GAN
- Issues with Image Synthesis
- Learning GANs with limited Data
- LeCam Regularization: track discriminator predictions with exponential moving averages (EMAs), compute a regularization loss from the EMAs, and use it to regularize discriminator training
- Image Outpainting
- Image-to-image based modeling
- challenges: repetitive or simple extension
- In & Out: Outpainting via Inversion
- InfinityGAN
- Text-to-Image
- DALL-E (auto-regressive), DALL-E2 (diffusion model), MidJourney (latent diffusion), Imagen (diffusion), Parti, Stable Diffusion
- VQGAN
- Super-Resolution
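
The LeCam regularization bullet above (EMA tracking of discriminator predictions, then a loss computed from the EMAs) can be sketched in plain Python. The notes only say the loss is computed "with the EMAs"; pairing real predictions with the fake EMA and fake predictions with the real EMA follows my recollection of the LeCam formulation and should be treated as an assumption. The decay value, the first-batch initialization, and the names `EMA` and `lecam_loss` are illustrative choices as well.

```python
class EMA:
    """Exponential moving average of mean discriminator outputs."""

    def __init__(self, decay=0.99):  # decay value is an assumed default
        self.decay = decay
        self.value = 0.0
        self.initialized = False

    def update(self, batch):
        mean = sum(batch) / len(batch)
        if not self.initialized:
            # Assumption: start the average at the first batch mean.
            self.value, self.initialized = mean, True
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * mean
        return self.value


def lecam_loss(d_real, d_fake, ema_real, ema_fake):
    """Squared distance of real outputs to the fake EMA, and of fake outputs to the real EMA."""
    real_term = sum((p - ema_fake.value) ** 2 for p in d_real) / len(d_real)
    fake_term = sum((p - ema_real.value) ** 2 for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Made-up discriminator outputs for one training step.
ema_real, ema_fake = EMA(), EMA()
d_real = [0.9, 1.1, 1.0]
d_fake = [-1.0, -0.8, -1.2]
ema_real.update(d_real)
ema_fake.update(d_fake)
reg = lecam_loss(d_real, d_fake, ema_real, ema_fake)
```

In a training loop the discriminator objective would then be the usual adversarial loss plus a weighted `reg` term, with the weight as a hyperparameter to tune.
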