If you wonder how a Tesla car can not only see but also navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's autopilot works, from the image acquisition through their eight cameras to the navigation process on the roads. This week, I cover Andrej Karpathy's talk at Tesla AI Day on how Tesla's autopilot works. Learn more in this short video.

Video Transcript

If you wonder how a Tesla car can not only see but also navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's autopilot works, from the image acquisition through their eight cameras to the navigation process on the roads.

Tesla's cars have eight cameras, like in this illustration, allowing the vehicle to see its surroundings and far ahead. Unfortunately, you cannot simply take all the information from these eight cameras and send it directly to an AI that will tell you what to do, as this would be way too much information to process at once, and our computers aren't this powerful yet. Just imagine trying to do this yourself, having to process everything all around you. Honestly, I find it difficult to turn left when there are no stop signs and you need to check both sides multiple times before making a decision. Well, it's the same for neural networks, or more precisely for computing devices like CPUs and GPUs.

To attack this issue, we have to compress the data while keeping the most relevant information, similar to what our brain does with the information coming from our eyes. To do this, Tesla transfers the data from these eight cameras into a much smaller space they call the vector space. This space is a three-dimensional space that looks just like this and contains all the relevant information in the world, like the road signs, cars, people, lines, etc. This new space is then used for the many different tasks the car will have to do, like object detection, traffic light detection, lane prediction, etc.

But how do they go from eight cameras, which means eight three-dimensional inputs composed of red, green, and blue images, to a single output in three dimensions? This is achieved in four steps and done in parallel for all eight cameras, making it super efficient. At first, the images are sent into a rectification model, which takes the images and calibrates them by translating them into a virtual representation. This step dramatically improves the autopilot's performance because it makes the images look more similar to each other when nothing is happening, allowing the network to compare the images more easily and focus on the essential components that aren't part of the typical background.
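To make this rectification step a bit more concrete, here is a minimal sketch of the general idea: warping one camera's raw image into a shared "virtual camera" with OpenCV. The function name and every calibration number below are made-up placeholders for illustration only, not Tesla's parameters or code, and Tesla's actual rectification model is certainly more involved.

```python
# Minimal sketch (assumptions, not Tesla's code): undistort a camera frame and
# reproject it as if it were taken by one canonical "virtual" camera, so the
# eight views look consistent before being fed to the network.
import cv2
import numpy as np

def rectify_to_virtual_camera(image, camera_matrix, dist_coeffs, virtual_matrix):
    """Undistort `image` and remap it into the shared virtual camera model."""
    h, w = image.shape[:2]
    # Build the pixel mapping from the real, distorted camera to the virtual one.
    map1, map2 = cv2.initUndistortRectifyMap(
        camera_matrix, dist_coeffs, np.eye(3), virtual_matrix, (w, h), cv2.CV_32FC1
    )
    return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)

# Hypothetical per-camera calibration values (each camera/car differs slightly).
K_real = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
dist = np.array([-0.30, 0.08, 0.0, 0.0, 0.0])  # toy radial/tangential distortion
K_virtual = np.array([[820.0, 0.0, 640.0], [0.0, 820.0, 360.0], [0.0, 0.0, 1.0]])

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for one camera frame
rectified = rectify_to_virtual_camera(frame, K_real, dist, K_virtual)
```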
Then, these new versions of the images are sent into a first network called RegNet. This RegNet is just an optimized version of the convolutional neural network (CNN) architecture. If you are not familiar with this kind of architecture, you should pause the video and quickly watch the simple explanation I made, appearing in the top right corner right now. Basically, it takes these newly made images and compresses the information iteratively, like a pyramid, where the start of the network is composed of a few neurons representing some variations of the images, focusing on specific objects and telling us where they are spatially. Then, the deeper we get, the smaller these images become, but they represent the overall image while still focusing on specific objects. At the end of this pyramid, you end up with many neurons, each telling you general information about the overall picture: whether it contains a car, a road sign, etc. In order to have the best of both worlds, we extract the information at multiple levels of this pyramid, which can also be seen as image representations at different scales focusing on specific features in the original image. We end up with local and general information, all of it together telling us what the images are composed of and where everything is.

Then, this information is sent into a model called BiFPN, which forces the information from the different scales to talk together and extracts the most valuable knowledge among the general and specific information it contains. The output of this network is the most interesting and useful information from all these different scales of the eight cameras' information, so it contains both the general information about the images, which is what they contain, and the specific information, such as where it is, its size, etc. For example, it will use the context coming from the general knowledge of deep features extracted at the top of the pyramid to understand that, since these two blurry lights are on the road between two lanes, they are probably attached to a specific object that was identified from one camera in the early layers of the network. Using both this context and knowing it is part of a single object, one could successfully guess that these blurry lights are attached to a car.

So now we have the most useful information coming from different scales for all eight cameras. We need to compress this information so we don't have eight different data inputs, and this is done using a transformer block. If you are not familiar with transformers, I will invite you to watch my video covering them in vision applications. In short, this block takes the condensed information we have from the eight different pictures and transfers it into the three-dimensional space we want: the vector space. It will take this general and spatial information, here called the keys, calculate queries, which have the dimensions of our vector space, and try to find what goes where. For example, one of these queries could be seen as a pixel of the resulting vector space looking for a specific part of the car in front of us. The values will then merge both of these accordingly, telling us what is where in this new vector space. This transformer can be seen as the bridge between the eight cameras and this new 3D space, understanding all the interrelations between the cameras.
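To give a rough feeling of how such a transformer block can map camera features into the vector space, here is a minimal cross-attention sketch in PyTorch. The class name, grid size, token counts, and dimensions are all assumptions chosen for illustration; Tesla's actual block also encodes camera calibration and positional information, so treat this only as the general key/query/value idea described above.

```python
# Minimal sketch (assumptions, not Tesla's code): learned queries, one per cell of
# the output "vector space" grid, attend over features coming from all 8 cameras.
import torch
import torch.nn as nn

class CameraToVectorSpaceAttention(nn.Module):
    def __init__(self, feat_dim=256, grid_h=50, grid_w=50, num_heads=8):
        super().__init__()
        # One learnable query per cell of the top-down output grid.
        self.queries = nn.Parameter(torch.randn(grid_h * grid_w, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.grid_h, self.grid_w = grid_h, grid_w

    def forward(self, cam_feats):
        # cam_feats: (batch, n_camera_tokens, feat_dim), i.e. the multi-scale
        # features of all eight cameras flattened into one sequence of keys/values.
        b = cam_feats.shape[0]
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.attn(query=queries, key=cam_feats, value=cam_feats)
        # Reshape the attended features back into a 2D top-down grid.
        return fused.reshape(b, self.grid_h, self.grid_w, -1)

# Toy usage: 8 cameras, 100 feature tokens each, 256 channels.
cam_feats = torch.randn(1, 8 * 100, 256)
vector_space = CameraToVectorSpaceAttention()(cam_feats)  # -> (1, 50, 50, 256)
```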
Now that we have finally condensed our data into a 3D representation, we can start the real work. This is the space where they annotate the data used for training their navigation network, as it is much less complex than eight camera views and easier to annotate.

OK, so we now have an efficient way of representing all our eight cameras, but we still have a problem: predictions made from a single frame are not enough. If a car on the opposite side is occluded by another car, we need the autopilot to know it is still there and that it hasn't disappeared just because another car went in front of it for a second. To fix this, we have to use time information, or in other words, use multiple frames. They chose to use a feature queue and a video module. The feature queue takes a few frames and saves them in the cache. Then, for every meter the car travels, or every 27 milliseconds, it sends the cached frames to the model. Here, they use both a time and a distance measure to cover the cases where the car is moving and where it is stopped. Then, these three-dimensional representations of the frames we just processed are merged with their corresponding positions and kinematic data containing the car's acceleration and velocity, informing us of how it is moving at each frame.

All this precious information is then sent into the video module. This video module uses it to understand the car itself and its environment in the present and the past few frames. This understanding process is done using a recurrent neural network that processes all the information iteratively over all frames to better understand the context and finally build the well-defined map you can see. If you are not familiar with recurrent neural networks, I will again point you to a video I made explaining them. Since it uses past frames, the network now has much more information to better understand what is happening, which is necessary for handling temporary occlusions.
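Here is a minimal sketch of how a feature queue with a time-or-distance trigger and a recurrent video module could fit together. The class names, the 16-frame cache, and the toy two-value kinematics vector are assumptions made for illustration, not Tesla's implementation; only the push-every-27-ms-or-every-meter logic and the "RNN over cached frames" idea come from the talk.

```python
# Minimal sketch (assumptions, not Tesla's code): cache features on a time OR
# distance trigger, then let a recurrent module read the cached frames together
# with the car's kinematics (velocity, acceleration).
from collections import deque
import torch
import torch.nn as nn

class FeatureQueue:
    """Caches vector-space features, pushing on a time or distance trigger."""
    def __init__(self, maxlen=16, time_step=0.027, dist_step=1.0):
        self.queue = deque(maxlen=maxlen)   # keeps only the latest frames
        self.time_step = time_step          # ~27 ms, covers the car being stopped
        self.dist_step = dist_step          # ~1 m, covers the car moving
        self.last_time = 0.0
        self.last_pos = 0.0

    def maybe_push(self, feat, kinematics, t, odometer):
        # Push the feature + kinematics when enough time has passed or enough
        # distance has been travelled since the last cached frame.
        if (t - self.last_time >= self.time_step
                or odometer - self.last_pos >= self.dist_step):
            self.queue.append(torch.cat([feat, kinematics], dim=-1))
            self.last_time, self.last_pos = t, odometer

class VideoModule(nn.Module):
    """Recurrent network fusing the cached frames into one temporal summary."""
    def __init__(self, in_dim=256 + 2, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)

    def forward(self, cached):    # cached: (batch, n_frames, in_dim)
        out, _ = self.rnn(cached)
        return out[:, -1]         # summary after reading every cached frame

# Toy usage, with the spatial grid flattened to a single 256-d feature for brevity.
fq = FeatureQueue()
for i in range(1, 6):
    fq.maybe_push(torch.randn(256), torch.randn(2), t=i * 0.03, odometer=i * 0.5)
frames = torch.stack(list(fq.queue)).unsqueeze(0)  # (1, n_frames, 258)
summary = VideoModule()(frames)                    # (1, 256)
```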
This is the final architecture of the vision process, with its output on the right. Below, you can see some of these outputs translated back into the images to show what the car sees in our representation of the world, or rather the eight cameras' representation of it. We finally have this video module output that we can send in parallel to all the car's tasks, such as object detection, lane prediction, traffic lights, etc.

If we summarize this architecture: the eight cameras first take pictures. Then, they are calibrated and sent into a CNN that condenses and extracts information from them efficiently and merges everything before sending it into a transformer architecture that fuses the information coming from all eight cameras into one 3D representation. Finally, this 3D representation is saved in the cache over a few frames and then sent into an RNN architecture that uses all these frames to better understand the context and output the final version of the 3D space to send to the tasks, which can finally be trained individually and all work in parallel to maximize performance and efficiency.

As you can see, the biggest challenge for such a task is an engineering challenge: making a car understand the world around us as efficiently as possible through cameras and speed sensors, so that it can all run in real time and with close-to-perfect accuracy on many complicated human tasks. Of course, this was just a simple explanation of how Tesla's autopilot sees our world. I strongly recommend watching the amazing video on Tesla's YouTube channel, linked in the description below, for more technical details about the models they use, the challenges they face, the data labeling and training process with their simulation tool, their custom software and hardware, and the navigation. It is definitely worth your time. Thank you for watching!

References

►Read the full article: https://www.louisbouchard.ai/tesla-autopilot-explained-tesla-ai-day/
►"Tesla AI Day", Tesla, August 19th, 2021, https://youtu.be/j0z4FweCy4M
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/