Today I came to the Control Lab to write a technical report about approximate rewards in RL. I wrote some of it, but my efficiency was not great, e.g. you can get pulled into a long conversation and cannot escape! 😀 Anyway …
While writing, I noticed there might be a fallacy in agnostic learning: the policy would change once the agnostic reinforcement signal changes. I am not sure whether this result is correct.
If I could prove that the policy does not change the value function, everything would be fine! That is not true in general, but it may hold in some situations, e.g. if every state-action pair is guaranteed to be visited infinitely often (as in off-policy learning such as Q-learning), then V -> V* and the behaviour policy becomes irrelevant. Hmm … this needs more thought!
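Just to convince myself, here is a quick sketch I could try: a made-up two-state MDP (all the numbers below are hypothetical, not from the report) where tabular Q-learning behaves with a purely random policy. Since every state-action pair keeps being visited, the learned values should still approach Q* (and hence V*), which is the sense in which the behaviour policy is "irrelevant".

```python
# A minimal sketch, assuming a hypothetical 2-state / 2-action MDP:
# Q-learning driven by a uniformly random behaviour policy still
# converges toward Q*, because all state-action pairs keep being visited.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

# Hypothetical dynamics P[s, a, s'] and rewards R[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Ground-truth Q* via value iteration on the known model
Q_star = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = Q_star.max(axis=1)
    Q_star = R + gamma * P @ V

# Q-learning with a policy-agnostic (uniformly random) behaviour policy
Q = np.zeros((n_states, n_actions))
s = 0
for t in range(200_000):
    a = rng.integers(n_actions)                    # random exploration
    s_next = rng.choice(n_states, p=P[s, a])
    alpha = 1.0 / (1 + t) ** 0.6                   # decaying step size
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("max |Q - Q*| =", np.abs(Q - Q_star).max())  # should be small
```

Of course this only shows the off-policy case; whether the same "policy does not matter" argument survives a changing agnostic reinforcement signal is exactly what I still have to think about.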