It was during half time of the Brazil-Argentina final that I got to thinking about decision making and reinforcement learning. Suppose one intends to write a program that plays soccer. What kind of state representation should it use? The representation must carry enough information to discriminate between two states that call for different actions (seeing the problem from the classification side: separability of classes and …), or at least be informative enough for the agent to separate two state-action pairs with different values (like the abstraction theorem in MaxQ or …).
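To make that value-based criterion a bit more concrete, here is a minimal Python sketch; the Q-table and the tolerance eps are made-up illustrations, not part of MaxQ or any specific algorithm. Two states are safe to merge under one abstract representation when their action values (approximately) agree:

```python
import numpy as np

def can_aggregate(q, s1, s2, eps=1e-3):
    """States s1 and s2 are value-compatible if |Q(s1,a) - Q(s2,a)| <= eps for all a."""
    return bool(np.all(np.abs(q[s1] - q[s2]) <= eps))

# Toy Q-table: rows are states, columns are actions (values are made up).
q = np.array([[1.0, 0.2],   # state 0
              [1.0, 0.2],   # state 1: same values, safe to merge with state 0
              [0.1, 0.9]])  # state 2: a different action is preferred

print(can_aggregate(q, 0, 1))  # True
print(can_aggregate(q, 0, 2))  # False
```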
For instance, the player’s position, the ball’s relative position, and a few nearby players with their tags (opponent/teammate) would be a rational choice. However, one may ask how many nearby players should be selected. Two? Three? The more state information you use, the better the best achievable answer to this POMDP becomes. But note that the size of the state space grows exponentially with the number of state variables, at least in most common representations. What should we do?
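To see how fast this blows up, here is a back-of-the-envelope sketch; the grid resolution and tag count are assumed numbers, and only the exponential shape matters:

```python
POSITIONS = 400    # assumed field discretization: a 20 x 20 grid of cells
TAGS = 2           # teammate / opponent

def table_size(k):
    # own position * discretized ball position * k nearby players (position + tag)
    return POSITIONS * POSITIONS * (POSITIONS * TAGS) ** k

for k in (1, 2, 3):
    print(f"k={k}: {table_size(k):.2e} states")
# k=1: 1.28e+08, k=2: 1.02e+11, k=3: 8.19e+13 -- hopeless for a table
```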
I guess we must seek a newer, richer kind of state representation. Here, I am not talking about general function approximators or hierarchical RL, which are useful in their own right. I am talking about a wise selection of the state representation: a dynamic and automatic generation of states is crucial. As an example, suppose you are a player in the middle of the field and the ball is with a very close opponent (a few meters away). The most important factors (read them as the state) for your decision making are your relative distance to that opponent and, if you are a good player, the movement of his limbs. It is not "that" important to know the exact position of a teammate who is 20 meters away. However, when you are close to the opponent's penalty area, not only are your positions relative to the opponents important, but your teammates' positions might also be critical for a good decision, e.g. passing to a teammate may result in a goal.
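A minimal sketch of this kind of context-dependent feature selection, where all the field names and distance thresholds are hypothetical choices of mine, might look like this:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    my_pos: tuple            # (x, y) on the field
    ball_dist: float         # distance to the ball
    nearest_opp_dist: float  # distance to the closest opponent
    teammates: list          # [(x, y), ...]
    near_penalty_area: bool

def select_features(obs: Observation) -> dict:
    """Return only the features that matter in the current situation."""
    feats = {"nearest_opp_dist": obs.nearest_opp_dist}
    if obs.ball_dist < 3.0:        # duel for the ball: local detail matters
        feats["my_pos"] = obs.my_pos
    if obs.near_penalty_area:      # attacking: teammates become relevant
        feats["teammates"] = obs.teammates
    return feats

obs = Observation(my_pos=(50.0, 30.0), ball_dist=2.0,
                  nearest_opp_dist=1.5, teammates=[(60.0, 25.0)],
                  near_penalty_area=False)
print(select_features(obs))  # only the locally relevant features survive
```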
I believe there must be a method for automatically selecting the important features in each state. Different states need not share the same kind of representation, or even the same dimension. In some situations the current sensory information might be sufficient; in others, predictions of other agents' sensory information might be necessary, and … . An extended Markov property may apply here: given state variables S_1, …, S_n (an n-dimensional state), I guess it is possible to reduce the state transition of the MDP environment in this way:

P(S_i(t+1), …, S_k(t+1) | S_1(t), …, S_n(t)) = P(S_i(t+1), …, S_k(t+1) | S_p(t), …, S_r(t))

for some subset of indices p, …, r, i.e. there are some conditional independencies here.
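One concrete reading of this equation is a factored, DBN-style transition model, where each next-step variable declares the small set of parents it actually depends on. The variables and parent sets below are toy assumptions for illustration:

```python
import random

# Each next-step variable depends only on its declared parents in S(t),
# never on the full state vector. Variables and parent sets are toy choices.
PARENTS = {
    "ball_x":     ["ball_x", "kicker"],
    "my_stamina": ["my_stamina"],
}

def sample_conditional(var, ctx):
    # Stand-in for a learned conditional P(var(t+1) | parents); just noise here.
    return random.random()

def step(state):
    nxt = {}
    for var, parents in PARENTS.items():
        ctx = tuple(state[p] for p in parents)   # only S_p(t), ..., S_r(t) matter
        nxt[var] = sample_conditional(var, ctx)
    return nxt

print(step({"ball_x": 0.3, "kicker": 1, "my_stamina": 0.9}))
```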
As far as I know, the best-known research closest to this idea is McCallum's work on Utile Suffix Memory and Nearest Sequence Memory. Nevertheless, those methods consider only a single, fixed kind of state representation, which is simpler than what I have in mind.
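For reference, here is a heavily simplified sketch in the spirit of Nearest Sequence Memory; it paraphrases the core idea (rank stored moments by how long a history suffix they share with the present) rather than reproducing McCallum's actual algorithm:

```python
def match_length(history, i, j):
    """Length of the common suffix of history[:i] and history[:j]."""
    n = 0
    while i - n > 0 and j - n > 0 and history[i - n - 1] == history[j - n - 1]:
        n += 1
    return n

def nearest_moments(history, now, k=3):
    """The k stored moments whose preceding experience best matches the present."""
    scored = [(match_length(history, t, now), t)
              for t in range(len(history)) if t != now]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]

# History as (observation, action) pairs; rewards omitted for brevity.
h = [("a", 0), ("b", 1), ("a", 0), ("b", 1), ("a", 0)]
print(nearest_moments(h, now=4, k=2))  # moments preceded by the most similar suffix
```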
Well … Brazil won the game in tremendous style with those samba dancers! Congratulations to Marcelo!