One common way to evaluate reinforcement learning methods is comparing the learning curve that is usually an average of a few runs (trials) of the method applied to a single problem. If the ultimate part of the learning curve is higher than the other methods, the author says that “our method handles uncertainty and partial observability much better than …â€Â. If the problem is easy enough that reaching the optimal solution is not an astonishing job (e.g. it is MDP), then comparison will be made between slopes of the learning curves, i.e. learning speed. If the method reaches faster to the optimum, it is said that it uses more information during learning/it uses intrinsic hierarchy of the problem better/it bootstraps better/… . Of course, some papers are not written this way, but it is the case.
However, there are a few important points one must pay attention:
1- What if variance of expected gained reward is too high? Averaging a few runs is not very bad, but it is a lossy method of showing results as some very good learning curves can heighten the average curve. This high amount of variance is not due to the intrinsic nature of the problem, but it is due to the fact that the method cannot learn in every run.
To remedy this problem, I am used to plot the probability distribution of the performance instead of mere averaging. This results in a three dimensional diagram (time – performance – probability of the performance) which is not an easy to show on a paper; thus, I do some sampling at different stages of learning and provide a few two dimensional diagram. I do not know whether anyone have used this kind of representation before – but believe me that providing a probability distribution is at least a very good looking diagram! (; But remember to compute “Cummulative Probability Distribution Function†instead of “Probability Density Functionâ€Â. The latter one is bumpy if you have not many samples.
2- To my best knowledge, there is few mathematical work on the learning speed of RL (I must confess that I have seen two or three papers on the sample complexity of it – but not anymore. There might be a few more). Most researches show the benefit of their method comparing on at most a few problems. Nevertheless, it is not usually said that getting this high performance needs many parameter fine tunings. So, how should we compare two methods when they are not equally optimized?! (Some people used GA or … to find the best learning parameters for their method. I have not done such a thing yet.)