###### tags: `Tag(Tesla,Deep Learning,Neural Networks)`
# Tesla Vision - Andrej Karpathy
{%youtube 2blLi3T4EGw %}
* No HD maps and no lidar; recently, no radar either.
* **Why no radar?** Mostly based on AEB (Automatic Emergency Braking) false-trigger events.
{%youtube dslDwfvVWh8 %}
**Comments** :speech_balloon:
* Radar has insufficient performance **below 30 mph**
* Apart from the followed car, the environment consists of zero-speed objects and their reflections, which produces noise around zero speed when the car stops.
* **Manhole cover problem**: manhole covers create a stronger radar return than the followed car, falsely triggering AEB.
* **Bridge false slowdown problem**: radar reports overhead bridges as stationary in-path obstacles, and these false-positive triggers have to be handled by the NN. This was a very common problem with AP on the legacy stack. A toy illustration of the underlying ambiguity follows.
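A minimal sketch of why these stationary-object failures share one root cause (all speeds and names here are invented for illustration, not from the talk): once ego motion is compensated out, a stopped car, a manhole cover, and an overhead bridge all produce ~0 m/s returns, so Doppler alone cannot tell "brake" from "ignore".

```python
import math

def compensated_speed(range_rate_mps, ego_speed_mps, azimuth_rad):
    """Ground-frame radial speed of a radar detection after removing ego motion."""
    return range_rate_mps + ego_speed_mps * math.cos(azimuth_rad)

ego = 25.0  # m/s (~56 mph, assumed ego speed)
detections = [
    ("followed car (slightly slower)", -2.0),   # range closing at 2 m/s
    ("manhole cover", -25.0),                   # stationary -> closes at ego speed
    ("overhead bridge", -25.0),                 # stationary -> identical signature
]
for name, range_rate in detections:
    v = compensated_speed(range_rate, ego, azimuth_rad=0.0)
    print(f"{name:>30}: ground speed ~ {v:+.1f} m/s")
# The manhole cover and the bridge are indistinguishable at ~0 m/s:
# one must be ignored, the other braked for.
```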

* **Very Harsh Braking**
* Due to the large deceleration, radar drops and re-acquires the object's track (1st graph); see the toy tracker sketch below.
* CIPV : Closest In-Path Vehicle
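A deliberately naive sketch of the drop/re-acquire behavior (hypothetical tracker and numbers, not Tesla's): a constant-velocity tracker gates on its predicted range rate, so when the lead car brakes hard, every measurement falls outside the gate and the track keeps getting dropped and restarted with a new ID, losing the CIPV identity.

```python
DT = 0.25    # s, assumed radar update interval
GATE = 1.5   # m/s, allowed range-rate innovation before the track is dropped

track_id, predicted_rr = 0, 0.0   # lead car initially at ego speed
true_rr = 0.0
for t in range(8):
    if t >= 3:                    # lead car starts braking at 8 m/s^2
        true_rr -= 8.0 * DT
    if abs(true_rr - predicted_rr) > GATE:
        track_id += 1             # drop + re-acquire: CIPV identity lost
        print(f"t={t * DT:.2f}s  track dropped, new id={track_id}")
    predicted_rr = true_rr        # naive re-init on the latest measurement
```

A real filter would adapt its model, but the sketch shows why a sudden, large deceleration is exactly the regime where track continuity breaks.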

* **High-Speed Stationary Approach**: improvement
* The legacy stack (vision + radar) picks up a stationary object 110 m away, while vision + DL starts to slow down at 180 m (back-of-envelope comparison below).
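A quick back-of-envelope on why the extra 70 m matters (the approach speed is an assumption, not from the talk): constant deceleration to stop is a = v² / (2d).

```python
v = 33.5  # m/s (~75 mph, assumed approach speed)
for name, d in [("legacy vision+radar", 110.0), ("vision-only DL", 180.0)]:
    print(f"{name}: first reaction at {d:.0f} m -> "
          f"needs {v**2 / (2 * d):.1f} m/s^2 to stop")
# legacy: ~5.1 m/s^2 (harsh braking); vision-only: ~3.1 m/s^2 (comfortable)
```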
* **Vision setup on the vehicle**:
* 36 frames/sec, 1280 x 960 x 3 from 8 cameras (see the bandwidth arithmetic after this list)
* Depth understanding with neural networks
* Requires enormous datasets:
* Large (millions of videos), clean (labeled data: depth, velocity, acceleration)
* Diverse (contains a lot of edge cases)
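Rough input-rate arithmetic from the numbers above (raw, uncompressed 8-bit RGB; the real pipeline will differ):

```python
fps, w, h, c, cams = 36, 1280, 960, 3, 8
bytes_per_sec = fps * w * h * c * cams
print(f"{bytes_per_sec / 1e9:.2f} GB/s of raw pixels")   # ~1.06 GB/s
print(f"{fps * w * h * cams / 1e9:.2f} Gpixel/s")        # ~0.35 Gpixel/s
```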
* Solution: **Offline Data Labeling (Auto-Labeling)**
* Since labeling happens offline, expensive optimization and tracking can be applied, fusing radar, lidar, and camera. Using radar online vs. offline yields very different results because of the online compute-time budget. In addition, human intervention in the labeling process is possible.
* Offline auto video-labeling success: tracking an object before and after an occlusion (e.g., a dust cloud) lets the labeler continue through it, as in the sketch below.
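A minimal sketch of that offline advantage (illustrative only, not Tesla's auto-labeler): with the whole clip on disk, the labeler can look *ahead* of the occlusion and interpolate the track through the gap.

```python
import numpy as np

def fill_occlusion(positions):
    """positions: per-frame (x, y) tuples, or None while the object is occluded."""
    idx = [i for i, p in enumerate(positions) if p is not None]
    xs = np.interp(range(len(positions)), idx, [positions[i][0] for i in idx])
    ys = np.interp(range(len(positions)), idx, [positions[i][1] for i in idx])
    return list(zip(xs, ys))

track = [(0, 0), (1, 0), None, None, (4, 0), (5, 0)]  # frames 2-3 occluded by dust
print(fill_occlusion(track))  # occluded frames recovered as (2, 0) and (3, 0)
```

An online tracker sees only the past and must drop the object; the offline one keeps the label continuous.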

* Four months of focus on only this (20 people).
* After initial training, the neural net was deployed in customers' cars in shadow mode (not interfering with controls). Disagreements between this NN and the legacy method were recorded, and the NN was retrained on its inaccuracies (according to defined triggers). This loop was applied over 7 cycles: gather data in shadow mode & retrain (pseudocode below).
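Pseudocode of that data-engine loop as described in the talk; every function and object here is a hypothetical placeholder, not a Tesla API.

```python
def data_engine(model, fleet, cycles=7):
    for _ in range(cycles):
        fleet.deploy_shadow(model)        # NN runs but never controls the car
        # "triggers" = the defined inaccuracy conditions, e.g. the NN
        # disagreeing with the legacy stack's output on a clip
        clips = fleet.collect(trigger=lambda nn_out, legacy_out: nn_out != legacy_out)
        labels = auto_label_offline(clips)  # offline radar/lidar/camera/human fusion
        model = retrain(model, labels)
    return model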
* With millions of camera-equipped cars sold across the world, Tesla is in a great position to collect the data required to train the car vision deep learning model. The Tesla self-driving team accumulated **1.5 petabytes of data consisting of one million 10-second videos and 6 billion objects annotated with bounding boxes, depth, and velocity.**

* The deep learning model uses convolutional neural networks (ResNet) to extract features from the videos of the eight cameras installed around the car and fuses them together using transformer networks. It then fuses them across time (recurrent NN / LSTM), which is important for tasks such as trajectory prediction and for smoothing out inference inconsistencies.
* The hierarchical structure makes it possible to reuse components for different tasks and enables feature-sharing between the different inference pathways. A schematic sketch follows.
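A schematic PyTorch sketch of that hierarchy as described above: per-camera ResNet features, transformer fusion across the 8 cameras, LSTM fusion across time, and a shared trunk feeding multiple task heads. Layer sizes and head names are assumptions for illustration, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TeslaVisionSketch(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        backbone = resnet18()
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # per-camera CNN
        self.cam_fusion = nn.TransformerEncoder(                        # fuse across cameras
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=2)
        self.temporal = nn.LSTM(d, d, batch_first=True)                 # fuse across time
        self.heads = nn.ModuleDict({                                    # shared features, many tasks
            "depth": nn.Linear(d, 1),
            "velocity": nn.Linear(d, 3),
        })

    def forward(self, video):                  # video: (B, T, cams, 3, H, W)
        B, T, C = video.shape[:3]
        x = self.backbone(video.flatten(0, 2)).flatten(1)  # (B*T*C, 512)
        x = self.cam_fusion(x.view(B * T, C, -1)).mean(1)  # (B*T, 512)
        x, _ = self.temporal(x.view(B, T, -1))             # (B, T, 512)
        return {name: head(x) for name, head in self.heads.items()}

out = TeslaVisionSketch()(torch.randn(1, 2, 8, 3, 64, 64))
print({k: v.shape for k, v in out.items()})
```

The `ModuleDict` of heads is the feature-sharing point: adding a new task means adding a head, while the backbone, camera fusion, and temporal trunk are reused.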

* Hardware side: an in-house supercomputer, #5 in the world.

* Legacy stack crash rate: 1 in ~5M miles
* New neural stack crash rate: N/A yet (15M miles driven so far, 1.7M of them on AP)