End to End Learning for Self-Driving Cars and the Distribution Mismatch Problem

I recently came across this interesting paper by the NVIDIA autonomous driving team.

I wrote a summary and a few comments about it on my Twitter account, and I thought I would repost them here, with some additional discussion, to rekindle this dormant blog. So here you are. As always, your comments are appreciated.

SUMMARY

The NVIDIA group formulates learning to drive as an imitation learning problem: the system learns a mapping from the camera image to the steering command by imitating how a human driver drives.

Their approach is essentially a modern (mid-2010s) version of ALVINN from the late 1980s: more data, deeper neural networks, and more computational power.
The function approximator is a convolutional neural network (a normalization layer, 5 convolutional layers, and 3 fully connected layers). They train the network on a large amount of data collected from actual drivers' behaviour (about 70 hours of real driving, which I believe corresponds to roughly 2.5M samples, though this is not explicitly mentioned), plus some data augmentation. You can see the video of the self-driving car here. Cool, isn't it?!
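To make the architecture concrete, here is a minimal sketch in PyTorch of a network with the same overall shape: a normalization layer, 5 convolutional layers, 3 fully connected layers, and a single steering output. The filter counts, kernel sizes, and the input resolution below are illustrative placeholders rather than the paper's exact values.

```python
# A minimal sketch in PyTorch, not the paper's exact architecture: a
# normalization layer, 5 convolutional layers, and 3 fully connected layers
# mapping a camera image to one steering command.  Filter counts, kernel
# sizes, and the 3x66x200 input resolution are illustrative placeholders.
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.BatchNorm2d(3)          # stand-in for the normalization layer
        self.conv = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),     # infers the flattened size
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),                  # the steering command
        )

    def forward(self, image):
        return self.fc(self.conv(self.norm(image)))

steering = SteeringNet()(torch.randn(1, 3, 66, 200))   # one dummy camera frame
```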

COMMENTS

It is exciting to see an end-to-end neural network learn to perform relatively well. I congratulate them on this. But there is a potential problem from a machine learning perspective: treating the imitation learning problem as a standard supervised learning problem may lead to lower performance than expected. This is due to the distribution mismatch caused by the dynamical nature of the agent-environment interaction. When the agent (e.g., the self-driving car) makes a small mistake at each time step, the distribution of future states changes slightly compared to the distribution induced by the expert (e.g., the human driver). This effect compounds, and the difference between the distributions can grow as the agent interacts more with the environment. In the self-driving car example, a series of small mistakes moves the car to situations that are farther and farther away from the usual situations of a car driven by a human, e.g., the car gradually gets dangerously close to the shoulder.

As a result, as time passes, the agent is more likely to find itself in regions of the state space for which it does not have much training data (generated by the expert). So the agent starts behaving in unpredictable ways, even though it might perform well on the training distribution. This difference between the two distributions is called the distribution mismatch (or covariate shift) problem in the machine learning/statistics literature.
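Here is a toy simulation of this effect (my own illustration, not from the paper): the expert steers a car's lateral offset back toward the lane centre, while the learned policy imitates it with a small constant error standing in for imitation error. Rolled out in closed loop, the small per-step mistakes accumulate until the learner spends its time at offsets that are rare under the expert's state distribution.

```python
# A toy illustration, not from the paper: the expert steers the lateral offset
# back toward the lane centre; the learned policy imitates it with a small
# constant bias standing in for imitation error.  Rolled out in closed loop,
# the learner ends up driving at offsets the expert's data rarely contains.
import numpy as np

rng = np.random.default_rng(0)

def expert(offset):
    return -0.5 * offset                       # steer back toward the centre

def learned(offset):
    return expert(offset) + 0.05               # expert plus a small per-step error

def rollout(policy, steps=200):
    offset, trace = 0.0, []
    for _ in range(steps):
        offset += policy(offset) + 0.01 * rng.standard_normal()
        trace.append(offset)
    return np.array(trace)

print("expert  mean |offset|:", np.abs(rollout(expert)).mean())
print("learned mean |offset|:", np.abs(rollout(learned)).mean())
```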

A solution to this problem is to use DAGGER-like algorithms:

The basic idea behind DAGGER is that instead of letting the agent learn only from a fixed training set coming from the expert (a human driver in this case), we should let it learn on the distribution that the agent itself actually encounters. So if the agent wanders into regions of the state space that the expert does not usually visit (and that are therefore not in the initial training set), that is OK, because we can ask the expert what to do there, and hopefully the expert also knows how to deal with those situations. By continuing to train on data from this distribution, the agent can learn a much better policy.
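Schematically, the loop looks like the sketch below, where `expert_policy`, `train`, and `run_policy` are placeholder callables rather than any particular implementation; this is not NVIDIA's procedure or the original paper's exact algorithm.

```python
# A schematic sketch of a DAGGER-style loop.  `expert_policy`, `train`, and
# `run_policy` are placeholder callables supplied by the user; this is not
# NVIDIA's procedure or the exact algorithm from the original paper.
def dagger(expert_policy, train, run_policy, n_iterations=10):
    # Start from demonstrations collected while the expert drives.
    states = run_policy(expert_policy)
    dataset = [(s, expert_policy(s)) for s in states]
    policy = train(dataset)

    for _ in range(n_iterations):
        # 1. Let the current learned policy drive, so we collect the states it
        #    actually visits, including the consequences of its own mistakes.
        visited = run_policy(policy)
        # 2. Ask the expert what it would have done in each of those states.
        dataset += [(s, expert_policy(s)) for s in visited]
        # 3. Retrain on the aggregated dataset and repeat.
        policy = train(dataset)
    return policy
```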

Of course, if the “expert” itself is not a real expert in certain situations, we cannot really hope to learn a useful agent even with DAGGER. For example, most drivers know how to handle a car that has drifted a bit over the centre line of the lane, so they are experts in that situation and their expertise is useful; but they may not be of much help in dealing with a car that is in a ditch. Not being able to do better than the expert is a limitation of imitation learning. There are some solutions for that, but maybe that should be the topic of another post.

Aside from the aforementioned work, which analyzes this phenomenon in the imitation learning setting, the question of how the agent's state distribution changes has also been analyzed in the context of reinforcement learning by several researchers, including myself. I only refer to three papers here; see their references for further information.

Anyway, it is nice to see a self-driving car that is not based on a lot of manual design and engineering, but is heavily based on the principles of machine learning. Of course, there is a lot more to be done, and I am sure the NVIDIA team will improve their system.

RLAI Tea-time Talks

Starting today, we will have RLAI Tea-time Talks almost every day. I do not know how effective this will be, but if it works out (that is, many good, relevant presentations), it can be quite enjoyable.
I heard from a friend that this idea of tea-time talks comes from Richard Hamming, who forced his students to present their research in a similar fashion. However, I doubt it (I suspect these kinds of gatherings have existed since ancient Greek times; imagine Aristotle and Plato talking with each other about the world's most recent findings (e.g., Archimedes, “Eureka!,” Hellenic Journal of Natural Sciences, 220 BC (*)), with Aristotle refusing to accept Plato's vague metaphysical viewpoints).

(*): Aristotle could not have talked about Archimedes' “Eureka!” paper, as the latter did not publish his paper until much later. In fact, Archimedes was born well after both of these philosophers.

Predictive State Representation and System Identification

I have started reading and thinking about this Predictive State Representation (PSR) concept recently. It is interesting in my opinion, but I am not so sure whether it is a “really” good thing or not. Anyway, I am investigating it.

I wrote this email to my friend Mohsen. I guess posting it here may be useful for me and for others:

Well … you may ask yourself why I have become interested in system identification (SI). The reason is Predictive State Representation, a concept newly proposed by a few reinforcement learning researchers. Let me briefly introduce the concept of PSR.

A central problem in reinforcement learning, control engineering, and related fields is predicting a sequence of observations. You want to know what the next observation will be based on what you have seen so far. After that, you can design a controller or whatever you want. In order to do so, you need a model of the dynamical system.
There are two dominant approaches to modeling dynamical systems in the RL community. One is the Partially Observable MDP (POMDP), and the other is history-based methods. I think you know what an MDP is: an MDP is like a dynamical system in which you have access to the state information. In a POMDP, on the other hand, you do not have access to the state; instead, you observe an output of the system. It is like a dynamical system in which you observe the state through a function of the state (y = h(x)).
Well … the bad news, which I believe is common to both RL and control theory, is that it is not reasonable to assume you have access to the internal model of the dynamics. You may have an approximation of it, but in general it is not known exactly. The only thing you know for sure is the output (observation) of the system. The whole job of SI is estimating that model. Right now, I want to know whether estimating a state-space model is an easy job or not.
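As a minimal illustration of this setting (with arbitrary placeholder matrices, not any real system), here is a linear state-space system where the identifier only ever sees the input-output record, never the hidden state:

```python
# A minimal sketch of the setting described above: a dynamical system with an
# internal state x that we never see, and an observation y = h(x) that we do.
# The matrices are arbitrary placeholders; estimating A, B, C from the (u, y)
# record is exactly the job of system identification.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])     # hidden state dynamics
B = np.array([[0.0], [1.0]])               # input (action) matrix
C = np.array([[1.0, 0.0]])                 # observation map h(x) = C x

x = np.zeros((2, 1))
record = []                                # all the identifier ever sees
for t in range(100):
    u = rng.standard_normal((1, 1))        # excitation input
    y = C @ x + 0.01 * rng.standard_normal((1, 1))
    record.append((u.item(), y.item()))
    x = A @ x + B @ u                      # hidden; not part of the record
```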

PSR is a new idea that aims to remedy this problem. How? It says that instead of modeling the internal dynamics of the system, let's work only with observations and base our predictions on a set of other predictions.
Suppose you want to predict, in a stochastic system, the probability of some sequence of future observations based on your previous observations. In the language of control theory, it is like predicting the output of a system (in the form of a probability distribution) based on its previous responses to input signals. The idea of PSR asserts that you can do so if you maintain a set of known predictions called core tests. For instance, if you know that P{o1o2} = 0.3 and P{o2o3} = 0.5, you can find P{o1o2o3} (this is simplified!), where oi is the observation (output) of the system at time instance i. (You know, I used 'y' instead of 'o' when I was young! (; )
I have some problems with this concept. RL people do not agree that a similar idea exists in other fields; they believe it is completely new. However, I am not so sure. I think the whole concept is not that new. It may use new names and metaphors, but the concept is like the identification of input-output systems (identification in the form of a transfer function). I think it is somewhat like knowing the impulse response of a linear system. If your system is not linear, the impulse response is not sufficient anymore, but the concept is similar.
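To make the bookkeeping concrete, here is a minimal sketch of a linear PSR update for an uncontrolled system. The numbers are made up and only illustrate the mechanics: the state is the vector of core-test predictions, any test is predicted by a linear function of that vector, and the vector is renormalized after each observation.

```python
# A minimal sketch of a *linear* PSR update (uncontrolled case), with made-up
# parameters; it only illustrates the bookkeeping, not a model of any real
# system.  The state is the vector p = P(core tests | history).  For any test
# t there is a weight vector m_t with P(t | history) = m_t . p, and after
# observing o the state is updated by Bayes' rule.
import numpy as np

p = np.array([0.3, 0.5])          # P(q1 | h), P(q2 | h) for two core tests
m_o = {                           # weight vectors for the one-step tests "o"
    "a": np.array([1.0, 0.5]),
    "b": np.array([0.0, 0.5]),
}
M_o = {                           # row i: weight vector for the test "o q_i"
    "a": np.array([[0.8, 0.1], [0.1, 0.4]]),
    "b": np.array([[0.1, 0.2], [0.0, 0.3]]),
}

def predict(obs, p):
    """P(next observation = obs | history)."""
    return m_o[obs] @ p

def update(obs, p):
    """New core-test predictions after actually observing obs."""
    return (M_o[obs] @ p) / (m_o[obs] @ p)

print(predict("a", p))            # e.g. P(o_{t+1} = "a" | h)
p = update("a", p)                # condition the state on having seen "a"
```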

Well … what do you think?! I would be happy if PSR people wrote to me about it.

SVM: Maximizing the Sum of Gaps

A few days ago, while sitting in the machine learning class, I was thinking about the following question about designing a classifier:
AFAIK, in the traditional SVM the goal is to maximize the minimum gap (margin). To cope with noise, we may add slack variables to the constrained optimization problem.
BUT why don't we just maximize the total gap (the sum of gaps) instead of the maximum of the minimum?! This way, outliers act as negative gaps. This problem can be formulated as a linear program, so we have efficient solutions for it.
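Here is one way (my sketch, not a standard formulation) to write this "maximize the sum of gaps" idea as a linear program with scipy. The box constraint on (w, b) is my addition: without some normalization of the weights the objective is unbounded.

```python
# A minimal sketch of "maximize the sum of gaps" as a linear program.  The box
# constraint |w_j| <= 1, |b| <= 1 is my addition to keep the LP bounded; it is
# not part of the idea described above.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])          # labels in {-1, +1}

# Variables z = (w_1, w_2, b).  The signed gap of point i is y_i (w.x_i + b).
# Maximizing sum_i y_i (w.x_i + b)  ==  minimizing  -sum_i y_i [x_i, 1] . z
A = np.hstack([X, np.ones((X.shape[0], 1))])        # rows are [x_i, 1]
c = -(y[:, None] * A).sum(axis=0)

res = linprog(c, bounds=[(-1, 1)] * 3)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "total gap =", -res.fun)
```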

I just wonder whether this approach has been used before, or whether it can be converted into some other commonly used objective function.

Active Learning and Reinforcement Learning

The problem of active learning can be considered a special case of reinforcement learning (Sanjoy Dasgupta noted this). We can view it as learning a policy (which selects new data points to label) that maximizes the increase in some measure of classification performance, e.g., the empirical risk, our estimate of the structural risk, or anything similar.
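A toy sketch of this framing is below: the "policy" picks the next point to label (here, plain uncertainty sampling as a fixed hand-designed policy) and the "reward" is the change in held-out accuracy. Nothing is learned by RL in this sketch; it only shows where the policy and the reward would sit.

```python
# A toy sketch of pool-based active learning seen as a decision problem: the
# "policy" picks which point to label next and the "reward" is the change in
# held-out accuracy.  The policy here is plain uncertainty sampling, i.e. a
# fixed hand-designed policy; nothing is learned by RL in this sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_pool, y_pool = X[:400], y[:400]              # pool we may query labels from
X_val, y_val = X[400:], y[400:]                # held-out set measuring "reward"

# Seed with a few labelled points from each class.
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
prev_acc = clf.score(X_val, y_val)

for _ in range(30):
    # Policy: query the pool point the current model is least certain about.
    proba = clf.predict_proba(X_pool)[:, 1]
    candidates = [i for i in range(len(X_pool)) if i not in labeled]
    pick = min(candidates, key=lambda i: abs(proba[i] - 0.5))
    labeled.append(pick)

    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    acc = clf.score(X_val, y_val)
    reward = acc - prev_acc                    # what a learned policy would maximize
    prev_acc = acc
    print("reward (change in accuracy):", round(reward, 3))
```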

Insight into the Processes of 'Positive' and 'Negative' Learners

… their detailed studies also found a difference in ERNs in “positive” and “negative” learners. The former are people who perform better at choosing the correct response than avoiding the wrong one, and the latter are those who learn better to avoid incorrect responses. The negative learners, they found, showed larger ERNs, suggesting that “these individuals are more affected by, and therefore learn more from, their errors. This notion makes the strong prediction that the feedback negativity should also be relatively larger in these participants to negative compared with positive feedback, which could potentially reflect the neural mechanism causing them to be more sensitive to their mistakes.”

Are there two different pathways for avoiding something and for selecting something? It seems that there are two different mechanisms for processing negative and positive experiences, and it seems that this is (at least partly) due to the reinforcement signal generator.

They explained that “In other words, positive reinforcement learners appear to have experienced greater conflict when choosing between two stimuli that were each previously associated with positive (compared with negative) feedback, whereas negative reinforcement learners may have experienced greater conflict when choosing among negative stimuli.”

(Maybe) people do some kind of ordinal evaluation instead of a mere value comparison for their RL-based decisions. And/or people have different sensitivities in different regions of the value scale, i.e., for some people negative values are stored in coarser detail than positive values, and vice versa for others.

On Performance Reporting in Reinforcement Learning

One common way to evaluate reinforcement learning methods is to compare learning curves, each usually an average over a few runs (trials) of the method applied to a single problem. If the final part of the learning curve is higher than that of the other methods, the author says that “our method handles uncertainty and partial observability much better than …”. If the problem is easy enough that reaching the optimal solution is not an astonishing feat (e.g., it is an MDP), then the comparison is made between the slopes of the learning curves, i.e., the learning speed. If the method reaches the optimum faster, it is said that it uses more information during learning, or exploits the intrinsic hierarchy of the problem better, or bootstraps better, and so on. Of course, some papers are not written this way, but this is often the case.

However, there are a few important points one must pay attention to:
1- What if the variance of the obtained reward is too high? Averaging a few runs is not too bad, but it is a lossy way of presenting results, as a few very good learning curves can pull the average curve up. Such high variance may not be due to the intrinsic nature of the problem, but to the fact that the method does not learn successfully in every run.

To remedy this problem, I usually plot the probability distribution of the performance instead of a mere average. This results in a three-dimensional diagram (time, performance, probability of that performance), which is not easy to show in a paper; thus, I sample at a few different stages of learning and provide a few two-dimensional diagrams. I do not know whether anyone has used this kind of representation before, but believe me, a probability distribution makes at least a very good-looking diagram! (; But remember to compute the cumulative distribution function instead of the probability density function: the latter is bumpy when you do not have many samples.
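Here is a small sketch (with synthetic per-run returns, not data from any real experiment) of what I mean: instead of a single averaged learning curve, plot the empirical CDF of the performance across runs at a few stages of learning.

```python
# A minimal sketch of reporting per-run performance as an empirical CDF at a
# few stages of learning, instead of a single averaged learning curve.  The
# returns below are synthetic: most runs improve over time, some never learn.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_runs, n_steps = 50, 200
learns = rng.random(n_runs) < 0.8              # hypothetical: 80% of runs learn
returns = np.where(
    learns[:, None],
    np.linspace(0, 1, n_steps)[None, :] + 0.1 * rng.standard_normal((n_runs, n_steps)),
    0.1 * rng.standard_normal((n_runs, n_steps)),
)

for step in (10, 100, 199):                    # a few stages of learning
    perf = np.sort(returns[:, step])
    cdf = np.arange(1, n_runs + 1) / n_runs    # empirical CDF over runs
    plt.step(perf, cdf, where="post", label=f"step {step}")
plt.xlabel("performance (return)")
plt.ylabel("fraction of runs with performance <= x")
plt.legend()
plt.show()
```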

2- To the best of my knowledge, there is little mathematical work on the learning speed of RL (I must confess that I have seen two or three papers on its sample complexity, and there might be a few more). Most researchers show the benefit of their method by comparing it on at most a few problems. Nevertheless, it is not usually mentioned that reaching this high performance requires a lot of parameter fine-tuning. So how should we compare two methods when they are not equally well tuned?! (Some people have used a GA or a similar method to find the best learning parameters for their method. I have not done such a thing yet.)

A Samba of Decision Makers

It is half-time of the Brazil-Argentina final match, and I find myself thinking about the decision-making process and reinforcement learning. Suppose one intends to write a program to play soccer. What kind of state representation should it use? The representation must carry enough information to discriminate between two states that require different actions (seeing the problem from the classification side: separability of classes and so on), or it must be informative enough to let the agent separate two state-action pairs with different values (as in the abstraction theorems of MAXQ and similar work).
For instance, the player's position, the ball's relative position, and a few nearby players with their tags (opponent/teammate) would be a reasonable choice. However, one may ask how many nearby players must be included. Two? Three? The more state information you use, the better the best achievable answer to this POMDP becomes. But note that the size of the state space grows exponentially with the number of state variables, at least in most common representations. What should we do?
I guess we must seek newer (richer) state representations. Here I am not talking about using general function approximators or hierarchical RL, which are useful in their own right. I am talking about a wise selection of the state representation: a dynamic and automatic generation of state variables is crucial. As an example, suppose you are a player in the middle of the field and the ball is with a very close opponent (a few meters away). The most important factors (read: state) for your decision making are your relative distance to the opponent and, if you are a good player, his limbs' movements. It is not that important to know the exact position of your teammate when he is 20 meters away. However, when you are close to the opponent's penalty area, not only is your position relative to your opponents important, but your teammates' positions might also be critical for good decision making, e.g., a pass to a teammate may lead to a goal.

I believe there must be a method for the automatic selection of important features in each state. Different states need not have the same kind of representation or dimension. In some situations the current sensory information might be sufficient; in others, predictions of another agent's sensory information might be necessary, and so on. An extended Markov property may apply here: given state variables S1(t), …, Sn(t), I guess it is often possible to reduce the state transition of the MDP environment to P(Si(t+1), …, Sk(t+1) | S1(t), …, Sn(t)) = P(Si(t+1), …, Sk(t+1) | Sp(t), …, Sr(t)) for some subset p, …, r of the variables, i.e., there is some conditional independence here.
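A toy check of this kind of conditional independence (my own illustration) is sketched below: with two binary state variables where S1(t+1) depends only on S1(t), the empirical estimate of P(S1'=1 | S1, S2) barely changes with S2, so S2 can be dropped from the representation when predicting S1'.

```python
# A toy check of the "extended Markov property" above: S1' depends only on S1,
# so the empirical estimate of P(S1' = 1 | S1, S2) should barely change with
# S2, meaning S2 can be dropped when predicting S1'.  Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 20000
s1 = rng.integers(0, 2, n)
s2 = rng.integers(0, 2, n)
s1_next = (rng.random(n) < np.where(s1 == 1, 0.8, 0.2)).astype(int)  # ignores s2

for v1 in (0, 1):
    for v2 in (0, 1):
        mask = (s1 == v1) & (s2 == v2)
        print(f"P(S1'=1 | S1={v1}, S2={v2}) ~= {s1_next[mask].mean():.3f}")
```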
As far as I know, the most similar well-known research in this direction is McCallum's work on Utile Suffix Memory and Nearest Sequence Memory. Nevertheless, those methods consider only a single kind of state representation, which is simpler than what I have in mind.
Well … Brazil won the game handily with those samba dancers! Congratulations to Marcelo!

Myths of Reinforcement Learning

RL is slow?
Large state spaces are hard for RL?!
RL does not work well with function approximation?!
Non-Markovianness invalidates standard RL methods?!
POMDPs are hard for RL to deal with?!

In the opinion of Satinder Singh, these are myths of RL. He responds to each of these myths and argues that they are not true. Moreover, there is a lot of other information about reinforcement learning (myths, successes, and more) on The University of Michigan Reinforcement Learning Group's page.