- [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/) - Learning to Understand Our Multimodal World with Minimal Supervision
- [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
- Use CLIP and/or MS COCO to generate image-caption pairs
- Use the text captions to generate question-answer pairs with LLMs
- Merge image, question, and answer as triplets to train LMMs (large multimodal models)
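- The triplet-building step above can be sketched as follows (a minimal illustration; the class and helper names are hypothetical, and the LLM call is stubbed rather than a real API call):
  ```python
  from dataclasses import dataclass

  @dataclass
  class VisualInstructionSample:
      image_path: str  # image from e.g. MS COCO
      question: str    # generated by an LLM from the caption
      answer: str

  def generate_qa_from_caption(caption: str) -> tuple[str, str]:
      """Stub for the LLM step: in practice a text-only LLM turns the
      caption into one or more instruction-following Q/A pairs."""
      return ("What is shown in the image?", caption)

  def build_triplets(pairs: list[tuple[str, str]]) -> list[VisualInstructionSample]:
      """Merge (image, caption) pairs into (image, question, answer) triplets."""
      samples = []
      for image_path, caption in pairs:
          q, a = generate_qa_from_caption(caption)
          samples.append(VisualInstructionSample(image_path, q, a))
      return samples
  ```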
- [Maurits Kaptein](https://www.mauritskaptein.com/) - Scaling Vision-Based Edge AI Solutions: From Prototype to Global Deployment
- Problems with realizing PoC
- Hardware too expensive
- Development too expensive
- [OAAX](https://www.networkoptix.com/blog/2024/03/05/introducing-the-open-ai-accelerator-standard)
- [Heather Couture](https://www.hdcouture.com/) - Data-Efficient and Generalizable: The Domain-Specific Small Vision Model Revolution
- Challenges for domain specific computer vision
- Unique image modalities (histopathology, drone, microscopy, etc.) vs. general modality (ImageNet)
- Limited data amount (few hundred images)
- Compute resource constraints: models too large
- Solution: Domain-Specific Foundation Model
- Pretrain a domain-specific foundation model on a domain-specific image dataset
- Self-supervised learning using unlabeled images
- Pretext task instead of manual labels
- Contrastive learning
- MAE (masked autoencoders)
- Supervised learning using labeled images
- Normal finetuning / supervised learning
- Examples
- Satellite - land cover classification, 27k images, 80/20 pre-train/test
- Same model (ResNet-18): similar performance across different contrastive frameworks
- Across model sizes: smaller models are sufficient for small training sets, but larger models benefit from more data
- The pretraining dataset matters more than model size
- Histopathology - predict 9 tissue classes, 100k training images, 7,180 test images
- Challenge: distribution shift
- color variations from different scanners
- staining procedures from different labs
- Solution: simulate color variations with image augmentation
- Small model with carefully selected datasets gives good enough performance
- Large models are better but need more resources
- Best practices
- Start with another foundation model to shorten pre-training
- Use a diverse dataset to capture variations within domain
- Simulate additional variations with augmentation
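- The augmentation best practice above can be sketched as randomly rescaling RGB channels to mimic scanner and staining differences (a minimal illustration; the function name and `strength` knob are assumptions, not from the talk):
  ```python
  import numpy as np

  def simulate_color_variation(image: np.ndarray, rng: np.random.Generator,
                               strength: float = 0.15) -> np.ndarray:
      """Randomly rescale each RGB channel to mimic scanner/stain differences.

      image: float array in [0, 1] with shape (H, W, 3).
      strength: max relative per-channel gain/attenuation (hypothetical knob).
      """
      gains = rng.uniform(1.0 - strength, 1.0 + strength, size=3)
      return np.clip(image * gains, 0.0, 1.0)
  ```
  Applying this during pretraining exposes the model to color shifts it will meet at deployment time, without collecting data from every scanner or lab.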
- [Guy Lavi](https://embeddedvisionsummit.com/2024/speaker/guy-lavi/) - Continual, On-the-Fly Learning through Sequential, Lightweight Optimization
- Techniques for sequential optimization that stay lightweight without losing previously learned knowledge
- Strategy for retraining
- Continual learning - SOTA
- A simple, naive strategy (a.k.a. fine-tuning) augmented with additional behavior to counteract catastrophic forgetting
- Sequential optimization
- [Brian Geisel](https://geisel.software/) - Using Synthetic Data to Train Computer Vision Models
- Why we need synthetic data
- [Vaibhav Ghadiok](https://www.linkedin.com/in/vghadiok) - 10 Commandments for Building a Vision AI Product
- AI compute has increased 80x in efficiency over the past decade
- Focus on Solving a Real-World Customer Problem
logseq.order-list-type:: number
- Solve a specific problem first before solving it more generically
- Separate marketing hype from actual problem solving
- Seek the simplest solution (maybe not AI)
- Don't Steal or Kill for GPUs
logseq.order-list-type:: number
- Inference - how can you dramatically reduce your data/AI compute needs?
- Training - maybe not just AI
- Respect the technology gap
logseq.order-list-type:: number
- Is there a fundamental scientific or engineering innovation needed to build the product?
- Human-in-the-loop
- Controlling the environment
- Researcher vs. practitioner gap
- Don't be a hero
logseq.order-list-type:: number
- Don't make the problem artificially harder by limiting the sensors/actuators
- Perception
- Use a depth/ranging sensor
- Real-time kinematic positioning (RTK)
- Multispectral sensors
- Actuator
- Use a better gripper
- touch sensing with vision
- suction cup
- Calibration - in-factory and in situ
- Use priors
logseq.order-list-type:: number
- help from semi-structured environments
- Embrace multimodality - sensor fusion is good
logseq.order-list-type:: number
- Classical Kalman filter
- nonlinear optimization
- Multimodal LLMs
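- The classical Kalman-filter fusion mentioned above can be illustrated with a one-dimensional measurement update (a minimal sketch, not tied to any specific library; sensor values are made up):
  ```python
  def kalman_update(x: float, p: float, z: float, r: float) -> tuple[float, float]:
      """One measurement update of a 1-D Kalman filter.

      x, p: prior state estimate and its variance
      z, r: new measurement and its noise variance
      Returns the fused (posterior) estimate and variance.
      """
      k = p / (p + r)          # Kalman gain: weight given to the new measurement
      x_new = x + k * (z - x)  # blend prior and measurement
      p_new = (1.0 - k) * p    # fused variance is always smaller than the prior
      return x_new, p_new

  # Fusing two sensors measuring the same range:
  x, p = 10.0, 4.0                           # e.g. an uncertain camera-based depth estimate
  x, p = kalman_update(x, p, z=12.0, r=1.0)  # refine it with a more precise lidar reading
  ```
  The low-variance sensor dominates the fused estimate, which is the core of why sensor fusion beats any single modality.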
- Optimize for high data quality
logseq.order-list-type:: number
- Data acquisition quality is vital
- Carefully choose sensors - Image sensors, lens, time synchronization between sensors
- Quality of training data is critical
- Hard negatives are good
- AI thrives on good quality data
- Choose the right metrics
logseq.order-list-type:: number
- Use more than one metric, and pick metrics that are not strongly correlated with each other
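- A toy illustration (with made-up counts) of why a single metric misleads: precision and recall move in opposite directions as a detector's threshold changes, so reporting only one hides the trade-off:
  ```python
  def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
      """Compute precision and recall from true/false positives and false negatives."""
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      return precision, recall

  # Low threshold: finds more objects, but raises more false alarms.
  p_low, r_low = precision_recall(tp=90, fp=60, fn=10)
  # High threshold on the same detector: precision up, recall down.
  p_high, r_high = precision_recall(tp=60, fp=10, fn=40)
  ```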
- Don't take the name of MM-LLMs in Vain
logseq.order-list-type:: number
- No emergent capabilities
- Poor in projective and Euclidean geometry
- Carefully choose AI inference compute
logseq.order-list-type:: number
- Understand total cost of development
- AI TOPS is not everything, these matter:
- Heterogeneous compute
- Memory bandwidth
- Operator support
- TOPS/W
- ...
- Test, continuously learn and adapt
logseq.order-list-type:: number
- Test in the target deployment environment
- unlike traditional testing, 100% coverage is infeasible
- Continuously iterate and improve the system
- current AI is not adaptive
- Design systems to be debuggable
- Conclusion:
- In almost all successful deployments of AI
- Human in the loop
- Constraints on the environments
- Optimize end-to-end - multimodal sensors, AI, compute, hardware
- Solution is not always more data
- Don't judge AI capabilities by human analogies.
- [Will Glaser](https://www.linkedin.com/in/willglaser) - Why Amazon Failed and the Future of CV in Retail (Interview)
- What is Grabango?
- Checkout-free shopping
- Why did you found Grabango?
- Evolution of computer vision
- 5-10x faster processing per store
- Shrinkage: profit loss due to theft and other criminal activity
- First to file patents on this area
- Amazon doing it actually justified the idea
- Why could you do it when Amazon can't?
- User experience - marginally better than Amazon
- Expensive system - 20x more expensive due to sensor fusion
- Amazon Go stores limited in SKUs (not a pure-vision solution)
- What's the secret sauce in pure vision solution
- People
- Reduction of data
- Cascaded series of algorithms for filtering
- Better algorithms come in later
- Free ground truth
- Cashier labels the data
- One month training
- Cameras on par with the iPhone 5
- Algorithm?
- Matrix of algorithms, cascaded, parallel
- Processing and compute
- A lot of embedded compute
- Some in-store servers and some cloud
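- The cascaded-filtering idea above can be sketched as a chain of increasingly expensive predicates, so the heavy models only see frames that survive the cheap checks (stage names and frame fields are hypothetical):
  ```python
  from typing import Callable, Iterable

  def cascade(frames: Iterable[dict],
              stages: list[Callable[[dict], bool]]) -> list[dict]:
      """Run frames through a cascade of increasingly expensive filters.

      Each stage is a predicate; a frame reaches stage i+1 only if stage i
      keeps it, so the costly models (later stages) see far less data.
      """
      survivors = list(frames)
      for stage in stages:
          survivors = [f for f in survivors if stage(f)]
      return survivors

  # Hypothetical stages: a cheap motion check first, a heavier person detector later.
  frames = [{"motion": False}, {"motion": True, "person": False},
            {"motion": True, "person": True}]
  kept = cascade(frames, [lambda f: f["motion"], lambda f: f.get("person", False)])
  ```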
- [Jacob Marks](https://www.linkedin.com/in/jacob-marks) - How Large Language Models are Impacting Computer Vision
- Multimodal and language models are increasingly popular in the CV community
- VLMs
- Represent visual information in a semantically meaningful way
- Train a multimodal model to ingest/interpret these representations
- VLMs thrive on
- Open-world knowledge
- Rich semantic/visual content
- Challenges
- Hallucination
- Edge
- Robust training
- LLM-Aided Visual Reasoning
- Not actually directly processing images
- Run vision models on visual data
- Pass results to LLM for analysis and conclusions
- Preprocessing the prompt before passing it to LLM
- Agentic Information Flow - meta-learning, planning to use tools
- VisProg: Compositional Visual Reasoning
- Visual programming with modules
- HuggingGPT
- uses the models on Hugging Face and decides which tools to use
- Pros
- Flexible & Modular
- Interpretable
- Training-free (using foundation models)
- Cons
- Latency
- Reliance on Prompt Engineering
- Robustness & Reliability
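- The LLM-aided visual reasoning pattern above (vision models produce structured results; the LLM reasons over text, never pixels) can be sketched as follows. The detection dict format and function name are assumptions for illustration:
  ```python
  def detections_to_prompt(detections: list[dict], question: str) -> str:
      """Serialize vision-model output into a text prompt for an LLM.

      The LLM never sees the image: it reasons over the detector's results.
      Assumed detection format: {"label": str, "confidence": float, "box": [...]}.
      """
      lines = [f"- {d['label']} (confidence {d['confidence']:.2f}) at {d['box']}"
               for d in detections]
      return ("Objects detected in the image:\n" + "\n".join(lines)
              + f"\n\nQuestion: {question}\nAnswer based only on the detections above.")
  ```
  Latency and prompt-engineering brittleness (the cons listed above) show up exactly here: every query pays for both the vision models and the LLM call, and the answer quality depends on how the results are serialized.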