- [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/) - Learning to Understand Our Multimodal World with Minimal Supervision
    - [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
        - Use CLIP and/or MS COCO to generate image-caption pairs
        - Use the text captions to generate question-answer pairs with LLMs
        - Merge image, question, and answer into triplets to train LMMs (large multimodal models)
- [Maurits Kaptein](https://www.mauritskaptein.com/) - Scaling Vision-Based Edge AI Solutions: From Prototype to Global Deployment
    - Problems with realizing a PoC
        - Hardware too expensive
        - Development too expensive
    - [OAAX](https://www.networkoptix.com/blog/2024/03/05/introducing-the-open-ai-accelerator-standard)
- [Heather Couture](https://www.hdcouture.com/) - Data-Efficient and Generalizable: The Domain-Specific Small Vision Model Revolution
    - Challenges for domain-specific computer vision
        - Unique image modality (histopathology, drone, microscopy, etc.) vs. general modality (ImageNet)
        - Limited data (a few hundred images)
        - Compute resource constraints: models too large
    - Solution: domain-specific foundation models
        - Pretrain the foundation model on a domain-specific image dataset
        - Self-supervised learning using unlabeled images
            - Pretext task instead of manual labels
            - Contrastive learning
            - MAE
        - Supervised learning using labeled images
            - Normal fine-tuning / supervised learning
    - Examples
        - Satellite: land cover classification, 27k images, 80/20 pre-train/test split
            - Same model (ResNet18) gives similar performance across different contrastive frameworks
            - Across model sizes: smaller models are sufficient for small training sets, but larger models benefit from more data
            - The pretraining dataset matters more than the model size
        - Histopathology: predict 9 tissue classes, 100k training images, 7,180 test images
            - Challenge: distribution shift
                - Color variations from different scanners
                - Staining procedures differ between labs
            - Solution: simulate color variations with image augmentation
            - A small model with carefully selected datasets gives good-enough performance
            - Large models do better but need more resources
    - Best practices
        - Start from another foundation model to shorten pre-training
        - Use a diverse dataset to capture variations within the domain
        - Simulate additional variations with augmentation
- [Guy Lavi](https://embeddedvisionsummit.com/2024/speaker/guy-lavi/) - Continual, On-the-Fly Learning through Sequential, Lightweight Optimization
    - Techniques for sequential, lightweight optimization without losing memory of earlier data
    - Strategy for retraining
    - Continual learning
        - SOTA: a simple naive strategy (a.k.a. fine-tuning) augmented with additional behavior to counteract catastrophic forgetting
    - Sequential optimization
- [Brian Geisel](https://geisel.software/) - Using Synthetic Data to Train Computer Vision Models
    - Why we need synthetic data
- [Vaibhav Ghadiok](https://www.linkedin.com/in/vghadiok) - 10 Commandments for Building a Vision AI Product
    - AI compute has increased 80x in efficiency over the past decade
    1. Focus on Solving a Real-World Customer Problem
        - Solve a specific problem first before solving it more generically
        - Separate marketing hype from actual problem solving
        - Seek the simplest solution (maybe not AI)
    2. Don't Steal or Kill for GPUs
        - Inference: how can you dramatically reduce your data/AI compute needs?
        - Training: maybe not just AI
    3. Respect the Technology Gap
        - Is there a fundamental scientific or engineering innovation needed to build the product?
        - Human-in-the-loop
        - Controlling the environment
        - Researcher vs. practitioner gap
    4. Don't Be a Hero
        - Don't make the problem artificially harder by limiting the sensors/actuators
        - Perception
            - Use a depth/ranging sensor
            - Real-time kinematic (RTK) positioning
            - Multispectral sensors
        - Actuation
            - Use a better gripper
            - Touch sensing with vision
            - Suction cups
        - Calibration: in-factory and in situ
    5. Use Priors
        - Get help from semi-structured environments
    6. Embrace Multimodality - Sensor Fusion Is Good
        - Classical Kalman filter
        - Nonlinear optimization
        - Multimodal LLMs
    7. Optimize for High Data Quality
        - Data acquisition quality is vital
            - Carefully choose sensors: image sensors, lenses, time synchronization between sensors
        - Quality of training data is critical
            - Hard negatives are good
        - AI thrives on good-quality data
    8. Choose the Right Metrics
        - Use more than one metric, and pick metrics that are not strongly correlated
    9. Don't Take the Name of MM-LLMs in Vain
        - No emergent capabilities
        - Poor at projective and Euclidean geometry
    10. Carefully Choose AI Inference Compute
        - Understand the total cost of development
        - AI TOPS is not everything; these also matter:
            - Heterogeneous compute
            - Memory bandwidth
            - Operator support
            - TOPS/W
            - ...
    11. Test, Continuously Learn and Adapt
        - Test in the target deployment environment
            - Unlike traditional testing, 100% coverage is infeasible
        - Continuously iterate and improve the system
            - Current AI is not adaptive
        - Design systems to be debuggable
    - Conclusion
        - In almost all successful deployments of AI:
            - Human in the loop
            - Constraints on the environment
        - Optimize end-to-end: multimodal sensors, AI, compute, hardware
        - The solution is not always more data
        - Don't judge AI capabilities by human analogies
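An aside on Heather Couture's contrastive pretraining notes above: frameworks like SimCLR reduce to an InfoNCE objective over paired augmented views of the same image. A minimal NumPy sketch (my illustration, not code from the talk; the temperature value is an arbitrary choice):

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    z1[i] and z2[i] are embeddings of two augmented views of image i;
    every other pairing in the batch acts as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    # Cross-entropy with the matching view on the diagonal as the target
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy check: matched views should score a lower loss than random pairings
rng = np.random.default_rng(0)
views = rng.normal(size=(8, 32))
aligned = info_nce_loss(views, views)
random_pairs = info_nce_loss(views, rng.normal(size=(8, 32)))
print(aligned < random_pairs)  # True
```

In real pretraining the embeddings come from the backbone being trained, and the loss is symmetrized over both view directions.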
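The histopathology fix noted above (simulating scanner and stain color variation with augmentation) can be sketched as per-channel color jitter; the gain and shift ranges here are illustrative assumptions, not values from the talk:

```python
import numpy as np

def jitter_colors(image, rng, gain_range=(0.7, 1.3), shift_range=(-0.1, 0.1)):
    """Simulate scanner/stain color variation with per-channel gain and shift.

    image: float array in [0, 1] of shape (H, W, 3).
    The gain and shift ranges are illustrative, not tuned values.
    """
    gains = rng.uniform(*gain_range, size=3)    # per-channel multiplicative gain
    shifts = rng.uniform(*shift_range, size=3)  # per-channel additive shift
    return np.clip(image * gains + shifts, 0.0, 1.0)

rng = np.random.default_rng(42)
tile = rng.uniform(size=(64, 64, 3))   # stand-in for a histopathology tile
augmented = jitter_colors(tile, rng)
print(augmented.shape)  # (64, 64, 3)
```

Applying a fresh random jitter each epoch exposes the model to more color conditions than the source scanners alone provide.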
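Guy Lavi's continual-learning note above (naive fine-tuning augmented to counteract catastrophic forgetting) leaves the mechanism open; experience replay is one common choice, sketched here with reservoir sampling (my illustration, not the talk's method):

```python
import random

class ReplayBuffer:
    """Reservoir-sampled buffer of past training examples.

    Mixing replayed examples into each new fine-tuning batch is one
    common way to counteract catastrophic forgetting.
    """
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            # Reservoir sampling: every example seen so far is retained
            # with equal probability capacity / seen
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

buf = ReplayBuffer(capacity=10)
for task in range(3):            # three sequential "tasks"
    for i in range(100):
        buf.add((task, i))
# A fine-tuning batch would mix fresh data with buf.sample(batch_size // 2)
print(len(buf.buffer), buf.seen)  # 10 300
```

The buffer stays lightweight (fixed capacity) no matter how long the data stream runs, which fits the talk's sequential, on-the-fly setting.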
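The classical Kalman filter from Ghadiok's sensor-fusion commandment is the workhorse baseline. A minimal scalar update, fusing a coarse camera range estimate with a more precise depth-sensor reading (the numbers are made up for illustration):

```python
def kalman_fuse(estimate, variance, measurement, meas_variance):
    """One scalar Kalman update: fuse a prior estimate with a new measurement."""
    gain = variance / (variance + meas_variance)        # Kalman gain
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance

# Coarse camera estimate: 10 m with variance 4; depth sensor: 12 m, variance 1
est, var = 10.0, 4.0
est, var = kalman_fuse(est, var, 12.0, 1.0)
print(round(est, 2), round(var, 2))  # 11.6 0.8
```

Note the fused variance (0.8) is lower than either sensor's alone, which is the whole point of fusion: the combined estimate is more certain than any single modality.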
- [Will Glaser](https://www.linkedin.com/in/willglaser) - Why Amazon Failed and the Future of CV in Retail (Interview)
    - What is Grabango?
        - Checkout-free shopping
    - Why did you found Grabango?
        - Evolution of computer vision: 5-10x faster processing per store
        - Profit shrinkage due to criminal activity
        - First to file patents in this area
        - Amazon doing it actually validated the idea
    - Why could you do it when Amazon can't?
        - User experience: marginally better than Amazon's
        - Amazon's system is expensive: 20x more so, due to sensor fusion
        - Amazon Go stores are limited in SKUs (not a pure-vision solution)
    - What's the secret sauce in a pure-vision solution?
        - People
        - Reduction of data
            - Cascaded series of algorithms for filtering; better algorithms come in later
        - Free ground truth: the cashier labels the data
            - One month of training
        - Cameras on par with the iPhone 5
    - Algorithm?
        - A matrix of algorithms, cascaded and parallel
    - Processing and compute
        - A lot of embedded compute
        - Some in-store servers and some cloud
- [Jacob Marks](https://www.linkedin.com/in/jacob-marks) - How Large Language Models are Impacting Computer Vision
    - Multimodal and language models are increasingly popular in the CV community
    - VLMs
        - Represent visual information in a semantically meaningful way
        - Train a multimodal model to ingest/interpret these representations
        - VLMs thrive on
            - Open-world knowledge
            - Rich semantic/visual content
        - Challenges
            - Hallucination
            - Edge deployment
            - Robust training
    - LLM-Aided Visual Reasoning
        - Not actually directly processing images
            - Run vision models on the visual data
            - Pass results to the LLM for analysis and conclusions
            - Preprocess the prompt before passing it to the LLM
        - Agentic information flow: meta-learning, planning to use tools
        - VisProg: Compositional Visual Reasoning - visual programming with modules
        - HuggingGPT - uses all the models on HuggingFace and decides which tools to use
        - Pros
            - Flexible & modular
            - Interpretable
            - Training-free (uses foundation models)
        - Cons
            - Latency
            - Reliance on prompt engineering
            - Robustness & reliability
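The LLM-aided visual reasoning flow above (run vision models on the data, then hand the results to an LLM as text) can be sketched end to end. The detector output, labels, and image path below are fabricated placeholders; a real system would call actual vision models and an LLM API:

```python
def detect_objects(image_path):
    """Stand-in for a real object detector; returns canned results here
    so the sketch is self-contained (hypothetical labels and boxes)."""
    return [
        {"label": "person", "confidence": 0.97, "box": [12, 30, 80, 200]},
        {"label": "bicycle", "confidence": 0.88, "box": [60, 90, 220, 210]},
    ]

def build_llm_prompt(question, detections):
    """Serialize vision-model output into text the LLM can reason over."""
    lines = [f"- {d['label']} (confidence {d['confidence']:.2f}) at {d['box']}"
             for d in detections]
    return ("Detected objects:\n" + "\n".join(lines) +
            f"\n\nQuestion: {question}\nAnswer using only the detections above.")

detections = detect_objects("street.jpg")   # hypothetical image path
prompt = build_llm_prompt("Is anyone riding a bicycle?", detections)
print(prompt)
# This prompt would then go to any LLM; the LLM reasons over the
# serialized detections without ever seeing the pixels.
```

This structure also makes the latency and prompt-engineering cons above concrete: every query pays for a vision-model pass plus an LLM call, and the answer quality depends on how the detections are phrased.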