Can a stochastic policy be better than a greedy one in terms of obtained reward, after learning has converged to a fixed policy? For instance, is there any situation in which something like Boltzmann action selection performs better than greedy selection? This is not the case in an MDP, but what about a POMDP?! I guess not! I am looking for a counterpart of game theory's mixed strategy in other fields. For some multi-player games, there exists a mixed-strategy Nash equilibrium but no equilibrium in pure strategies. Have you seen something similar in other fields, specifically in cases where performance is the comparison criterion? I wonder what the benefit of acting randomly can be.
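To make the game-theory case concrete, here is a minimal sketch. It assumes matching pennies as the example game (my choice for illustration), and the names `row_payoff` and `worst_case` are hypothetical. It compares what the matching player earns against a best-responding opponent when playing a pure strategy versus a 50/50 mix.

```python
def row_payoff(p_heads, q_heads):
    """Expected payoff to the matcher (row player) in matching pennies.

    p_heads: probability that the row player shows heads.
    q_heads: probability that the column player shows heads.
    The row player wins +1 on a match and loses 1 otherwise (assumed payoffs).
    """
    match = p_heads * q_heads + (1 - p_heads) * (1 - q_heads)
    return match * 1 + (1 - match) * (-1)


def worst_case(p_heads):
    """Row player's payoff when the opponent best-responds with a pure action."""
    return min(row_payoff(p_heads, 0.0), row_payoff(p_heads, 1.0))


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(f"P(heads)={p:.1f} -> worst-case payoff {worst_case(p):+.1f}")
```

In this sketch either pure strategy can be fully exploited, while the uniform mix cannot, which is the kind of benefit from randomness I am asking about.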
Interesting questions you got here.
Boltzmann and Gibbs distributions are used for exploration; without an exploration policy (e.g. epsilon-greedy or softmax) there are no convergence guarantees.
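For reference, a rough sketch of the selection rules mentioned above, assuming a plain list of action-value estimates. The function names, `temperature`, and `epsilon` are illustrative and not taken from any particular library.

```python
import math
import random


def greedy(q_values):
    """Return the index of the highest estimated value (no exploration)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


def boltzmann_probs(q_values, temperature):
    """Softmax (Boltzmann) probabilities; a lower temperature is closer to greedy."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution over q_values."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a high temperature the Boltzmann rule is close to uniform, and as the temperature goes to zero it approaches the greedy choice.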
Hope this is useful 😉
Thanks very much! (: Yes! Of course, a stochastic policy is needed to ensure convergence in problems that involve some kind of convergence. However, now I am curious about performance in the sense of expected reward: is there any problem in which a stochastic policy gains more reward than a deterministic one?! This is not the case for an MDP, but what about a POMDP or Markov games?!