I recently came across this interesting paper by the NVIDIA autonomous driving team.
- Bojarski, Del Testa, Dworakowski, et al., “End to End Learning for Self-Driving Cars,” 2016.
I wrote a summary and a few comments about it on my Twitter account, and I thought I might repost them here, with some additional discussion, to rekindle this dormant blog. So here you are. As always, your comments are appreciated.
The NVIDIA group formulates the problem of learning how to drive as an imitation learning problem. Their system learns a mapping from the image input to the steering command by imitating how a human driver does it.
Their approach is essentially a modern (mid-2010s) version of ALVINN from the late 1980s: more data, deeper neural networks, and more computational power.
The function approximator is a convolutional neural network (a normalization layer, 5 convolutional layers, and 3 fully connected layers). They train the network on a lot of data collected from actual drivers’ behaviour (about 70 hours of real driving, which I believe corresponds to about 2.5M data samples, although this is not explicitly mentioned), together with some data augmentation. You can see the video of the self-driving car here. Cool, isn’t it?!
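As a quick sanity check of the 2.5M figure, here is a back-of-the-envelope calculation. The sampling rate of 10 frames per second is an assumption made for illustration (it is not stated in this post), and `num_samples` is a hypothetical helper name.

```python
SECONDS_PER_HOUR = 3600

def num_samples(hours_of_driving, frames_per_second):
    """Number of image/steering pairs obtained from the recorded driving."""
    return hours_of_driving * SECONDS_PER_HOUR * frames_per_second

# 70 hours is the figure quoted above; 10 fps is an assumed sampling rate.
print(num_samples(hours_of_driving=70, frames_per_second=10))  # 2520000
```

Under that assumption, 70 hours gives roughly 2.5M samples, which is consistent with the estimate above. A different sampling rate would scale the count proportionally.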
It is exciting to see an end-to-end neural network learn to perform relatively well. I congratulate them on this. But there are potential problems from a machine learning perspective: treating the imitation learning problem as a standard supervised learning problem may lead to lower performance than expected.
This is due to the distribution mismatch caused by the dynamical nature of the agent-environment interaction. Whenever an agent (e.g., a self-driving car) makes a mistake at a time step, the distribution of its future states changes slightly compared to the distribution induced by the expert agent (e.g., a human driver). This has a compounding effect, and the difference between the distributions can grow as the agent interacts more with the environment. In the self-driving car example, it means that a series of small mistakes moves the car to situations that are farther and farther away from the usual situations of a car driven by a human, e.g., the car gradually gets dangerously close to the shoulder.
As a result, as time passes, the agent is more likely to be in regions of the state space for which it doesn’t have much training data (generated by the expert agent). So the agent may start behaving unpredictably, even though it performs well on the training distribution. This difference between the two distributions is called the distribution mismatch (or covariate shift) problem in the machine learning/statistics literature.
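To get a feel for the compounding effect, here is a minimal sketch of a toy worst-case model. It is an illustration under stated assumptions, not an analysis of the NVIDIA system: at every step the agent makes a mistake with probability `eps`; in the pessimistic case the first mistake takes it to states it has never seen and it stays off the expert's distribution for the rest of the episode, while in the optimistic case it recovers immediately. The function names are hypothetical.

```python
def cost_without_recovery(eps, horizon):
    """Expected number of off-distribution steps if the first mistake is never undone.

    The agent is off-distribution at step t if it made at least one mistake
    in steps 1..t, which happens with probability 1 - (1 - eps)**t.
    """
    return sum(1 - (1 - eps) ** t for t in range(1, horizon + 1))

def cost_with_recovery(eps, horizon):
    """Expected number of bad steps if every mistake costs only a single step."""
    return eps * horizon

eps = 0.01  # assumed per-step mistake probability
for horizon in (10, 100, 1000):
    print(horizon,
          round(cost_with_recovery(eps, horizon), 2),
          round(cost_without_recovery(eps, horizon), 2))
```

Since 1 - (1 - eps)^t is at most t * eps, the cost without recovery is at most eps * T * (T + 1) / 2 for a horizon of T steps. So as long as eps * T is small, it grows roughly quadratically with the horizon, whereas the cost with recovery grows only linearly. The gap between the two is the price of never having seen how to recover.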
A solution to this problem is to use DAGGER-like algorithms:
- Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell, “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning,” AISTATS, 2011.
The basic idea behind DAGGER is that instead of letting the agent learn only from a fixed training set coming from an expert agent (a human driver in this case), we should let it learn on the distribution of states that the agent itself actually encounters. So if the agent goes to regions of the state space that are not usually encountered by the expert (and so are not in the initial training data set), that’s OK, because we can ask the expert to tell us what to do there, and hopefully the expert also knows how to deal with those situations. By continuing to train on data from this distribution, the agent can learn a policy that performs much better in the states it actually visits.
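The loop described above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the state is an integer lateral offset from the lane centre, the learner is a lookup table that goes straight in states it has never seen, and the learner's small mistakes are modelled as random slips of one unit (the expert makes none). It also simplifies the algorithm by using only the expert in the first iteration, only the learner afterwards, and returning the last policy.

```python
import random

def expert_action(x):
    """The expert steers one unit back toward the lane centre (x == 0)."""
    return (x < 0) - (x > 0)

def learner_action(policy, x):
    """Table lookup; in states that were never labelled, keep going straight."""
    return policy.get(x, 0)

def rollout(act, horizon, slip_prob, rng):
    """Run one episode from the lane centre and return the visited offsets."""
    x, visited = 0, []
    for _ in range(horizon):
        visited.append(x)
        # A slip models a small mistake that pushes the car sideways.
        slip = rng.choice((-1, 1)) if rng.random() < slip_prob else 0
        x = x + act(x) + slip
    return visited

def fit(dataset):
    """Stand-in for supervised learning: memorise the expert label per state."""
    return dict(dataset)

def dagger(iterations, horizon, slip_prob, rng):
    # First iteration: plain behaviour cloning on expert-only demonstrations.
    dataset = [(x, expert_action(x))
               for x in rollout(expert_action, horizon, 0.0, rng)]
    policy = fit(dataset)
    cloned_policy = dict(policy)
    for _ in range(iterations):
        # Let the current learner drive, ask the expert to label the states
        # it visited, aggregate with all earlier data, and retrain.
        states = rollout(lambda x: learner_action(policy, x),
                         horizon, slip_prob, rng)
        dataset += [(x, expert_action(x)) for x in states]
        policy = fit(dataset)
    return cloned_policy, policy

def mean_abs_offset(policy, episodes, horizon, slip_prob, rng):
    """Average distance from the lane centre when the learner drives."""
    total = 0
    for _ in range(episodes):
        total += sum(abs(x) for x in rollout(
            lambda x: learner_action(policy, x), horizon, slip_prob, rng))
    return total / (episodes * horizon)

rng = random.Random(0)
cloned, aggregated = dagger(iterations=5, horizon=200, slip_prob=0.2, rng=rng)
print("behaviour cloning:", mean_abs_offset(cloned, 50, 200, 0.2, rng))
print("after aggregation:", mean_abs_offset(aggregated, 50, 200, 0.2, rng))
```

In this toy setting the cloned policy has only ever seen the lane centre, so after its first slip it has no idea how to come back and simply drifts. The aggregated policy has expert labels for the off-centre states the learner itself reached, so it steers back after each slip.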
Of course, if the “expert” itself is not a real expert for certain situations, we cannot really hope to learn a useful agent even if we use DAGGER. For example, most drivers know how to steer a car that has drifted a bit over the line back to the centre of the lane, so they are experts in that situation and their expertise can be useful; but they may be of little use in showing how to deal with a car that is in a ditch. Not being able to do better than the expert is a limitation of imitation learning. There are some solutions for that, but maybe that should be the topic of another post.
Aside from the aforementioned work, which analyzes the phenomenon in the imitation learning setting, the analysis of how the agent’s state distribution changes has also been done in the context of reinforcement learning by several researchers, including myself. I refer to only three papers here. See their references for further information.
- Remi Munos, “Performance bounds in Lp norm for approximate value iteration,” SIAM Journal on Control and Optimization, 2007.
- Amir-massoud Farahmand, Remi Munos, and Csaba Szepesvari, “Error Propagation for Approximate Policy and Value Iteration,” NIPS, 2010.
- Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Matthieu Geist, “Approximate Modified Policy Iteration,” ICML, 2012.
Anyway, it is nice to see a self-driving car that is not based on a lot of manual design and engineering, but is heavily based on the principles of machine learning. Of course, there is a lot more to be done, and I am sure that the NVIDIA team will improve their system.