A Question on Stochastic vs Deterministic Policies

Can a stochastic policy be better than a greedy (deterministic) one in terms of obtained reward, after learning has converged to a fixed policy? For instance, is there any situation in which something like Boltzmann action selection outperforms greedy selection? It cannot happen in an MDP, but what about a POMDP?! I guess not! I am looking for a counterpart of game theory's mixed strategies in other fields. For some multi-player games there exists a mixed-strategy Nash equilibrium even though no pure-strategy equilibrium exists. Have you seen something similar in other fields, especially in cases where performance is the comparison criterion? I wonder what the benefit of acting randomly can be.

This entry was posted in Reinforcement Learning.

2 Responses to A Question on Stochastic vs Deterministic Policies

  1. Interesting questions you got here.

Boltzmann and Gibbs distributions are used for exploration; without an exploration policy (e.g., epsilon-greedy or softmax) there are no convergence guarantees.

    Hope this is useful 😉

  2. SoloGen says:

Thanks very much! (: Yes, of course a stochastic policy is needed to ensure convergence in problems where some kind of convergence phenomenon exists. However, now I am curious about performance in the sense of expected received reward: is there any problem in which a stochastic policy gains more reward than a deterministic one?! It is not the case for an MDP, but what about a POMDP or Markov games?!
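    Actually, a toy sketch suggests the answer can be yes for POMDPs. The construction below is my own (hypothetical, in the spirit of the well-known aliased-state examples in the POMDP literature): two states emit the same observation, so a memoryless policy cannot tell them apart, and every deterministic memoryless policy gets stuck earning zero, while a 50/50 stochastic policy keeps cycling and earns about 0.5 per step:

    ```python
    import random

    # Hypothetical two-state POMDP: states "A" and "B" emit the SAME
    # observation, so a memoryless policy must act identically in both.
    # In A, action 1 yields +1 and moves to B; in B, action 0 yields +1
    # and moves to A; every other (state, action) pair yields 0 and stays.

    def step(state, action):
        """Return (reward, next_state) for the aliased two-state POMDP."""
        if state == "A":
            return (1, "B") if action == 1 else (0, "A")
        return (1, "A") if action == 0 else (0, "B")

    def average_reward(p_action1, n_steps=100_000, seed=0):
        """Long-run average reward of the memoryless policy P(action=1) = p_action1."""
        rng = random.Random(seed)
        state, total = "A", 0
        for _ in range(n_steps):
            action = 1 if rng.random() < p_action1 else 0
            reward, state = step(state, action)
            total += reward
        return total / n_steps

    # Both deterministic memoryless policies get stuck and earn ~0 per
    # step, while the stochastic policy with P(action=1) = p earns
    # 2p(1-p) per step in the long run, maximized at p = 1/2.
    print(average_reward(0.0))  # deterministic "always 0": ~0.0
    print(average_reward(1.0))  # deterministic "always 1": ~0.0
    print(average_reward(0.5))  # stochastic 50/50: ~0.5
    ```

    Of course, this only shows that randomness helps among *memoryless* policies; a policy with memory (or over belief states) could be deterministic and still optimal, which is exactly why the question is about fixed reactive policies under state aliasing.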
