**Michael Bowling and Manuela Veloso, âMultiagent Learning using a Variable Learning Rate,â Artificial Intelligence, 2002.**

This is the first paper I write about in my weblog in my series of multi-agent reinforcement learning papers. I had seen this paper about a year ago, but I did not read it as I thought that changing learning rate is not a real solution to the problem and I supposed that this paper is a kind of ad-hoc method. In that time, I was not that concerned about learning in multi-agent learning from the game theoretic perspectives, so I was not that aware of this paper. Now, it is apparent that I was not that correct. The results of this paper is interesting, Michael Bowling and Manuela Veloso tried to use as much mathematics as possible, and more importantly, his approach to the problem is really insightful.

I do not discuss the paper much, but I tries to write about my concerns about it. Before going on, it is mandatory to mention that I am new in this MAS learning and game theory; so concepts such as Nash equilibrium and its importance like that are not much clear for me.

1-In this paper, Michael (and Manuela â but I refer to it as Michaelâs work) proposed **rationality** and **convergence** as two key criteria for a good MAS learner. An agent is rational whenever all other agents perform stationary, it selects actions (from a pure or mixed strategy) optimally: select a best response action. An agent is convergent if it converges to a stationary policy. If these two happens for all agents, then the system converges to [Nash] equilibrium.

There is a very big question in my mind: Why Nash equilibrium?! It seems that they are **local** optimum points for the MASâs received reward. Am I right?

**Consider this scenario:** *The MAS does not converge to any point in its joint policy space, but it receives more reward than whether it would be in any Nash equilibrium in probability.* Is it possible?

**Another scenario:** I want to design an agent to work in a society. *My agent is selfish and it wants to increase its rewardâs expectation. Is the concept of rationality (with Michaelâs definition) rational anymore?!* There is no stationary policy of other agents at all. Moreover, the concept of convergence is not that useful as it is not apparent if other agents follows a âlearning mechanismâ that lead to a convergent results. In other words, Convergence is a fragile concept in my opinion.

Disregarding these basic concerns, lets see what I found in other parts of the paper:

2-I do not know how to find the Nash equilibrium for a general-sum stochastic game (or even game matrix). I read some parts of Michaels Ph.D. thesis and in its (2.3.3-Types of Matrix Game) and (3.1.4-Linear and Nonlinear Programming/Finding Equilibria) he stated that it was not that easy to do so and might lead to linear constrained quadratic programming.

3-I did not understand the reasons that Michael Littmanâs **Minimax-Q** is not a rational learning method. I understand the example, but I cannot find out the assumptions that make the Minimax-Q irrational. *It may be due to its assumption that other player âis in the Nash Equilibriumâ but the agent might be not.* Am I right? If it is the case, and if we change the definition of rationality to â¦ âIf the other players’ policies converge to stationary **in Nash Equilibrium** policies then the learning algorithm will converge to a policy that is a best response to the other

players’ policies.â then it would be rational. Right?

4-Opponent modeling is interesting. One may relax the **stationary policy** of other agents by adding a forgetting mechanism like the one which is common in stochastic estimation (and the one which is used in Q-Learning). I have not read the original papers of this Opponent Modelling yet, and I do not know whether there is any dynamical analysis like this paper or not, but it may be useful to see what would happen when the agent do not consider other agentsâ policy stationary. It may happen that figures like Fig. 3 and Fig. 4(a) do not happen then.

5-The idea of Gradient Ascent and variable learning rate are very interesting. Although using gradient information is very natural in many optimization problems, I did not anticipate to see it here. In addition, I have not seen similar usage of variable learning rate. People use it in their neural networks, in their Q-Learning, their RLS, and many other places. They even adapt it; however, not with this same idea explicitly. I wonder if it is possible to use it in other context. Lets imagine:

a) Apply WoLF for *a supervised training of a multi-layer feedforward neural network (or anything similar) comparing its current contribution to error with its average contribution to error*. If the current error is smaller, then that neuron is winner, and then the learning rate must be decayed as other neurons might be changing tremendously.

b) Applying WoLF to *multi-agent evolutionary algorithms: For instance, increase/decrease mutation rate according to WoLF?* What is your idea?

6-What if the system has more than a Nash equilibrium? To which one the system would converge?

7-I wonder what similar idea can be done!!! For instance, what would happen if we **set the learning rate on/off using a schedule**. The schedule might be as simple as random turn on/off or depends on the learning rate. In this regard, we let each agent learns its locally optimal solution (Best Response) while other agentsâ policy are stationary. This injects a randomness that may help the system to escape from a poor Nash equilibrium.

8-I used pen and paper to obtain the same results as the paper. However, the formalism is presented in two dimensional state-space. I wonder if it is possible to extend it. Yes! It was possible. But can I do the same analysis as Michaelâs? emmm â¦ not sure! Michaelâs analysis is not that pleasant for me as he used many lemmas to prove its results. He did so as there were many different cases that must be analyzed geometrically. I do not like this geometrical analysis as it is not scalable in my opinion (I cannot do so for n-action games). I asked myself if I can do better. Could I? I have not been successful yet. Let me state what I have done since then:

I wrote the expected payoff for two players multiple actions game and then obtained the generalized formalism for updating rules. I wanted to describe the dynamical behavior of the system. It was possible. However, the big obstacle in this problem is its constraints. I have not seen (or cannot remember) a similarly constrained dynamical system. So I tried to do something. I supposed that the dynamical system xdot = f(x) is on an impassable manifold g(x) = 0 (an impassable manifold is a manifold that a dynamical system cannot pass from a g(x)0 side. It defines the constraints of the problem.) and obtained a new dynamics for it. Now the analysis must be done on this manifold. However, we must know beforehand on which manifold the current state is. I may continue working on it. Is it useful?

9-What are relations of this method with other ones? For instance, can I interpret our (actually Nili Ahmadabadi, Ahad Harati, Masoud Asadpour, and Sahar Mastour.) **expertness-based cooperative multiagent learning** methodology in this framework or vice versa? And a very important issue: what is the relation of this and **COIN**? I have read a few papers on COIN (COllective INtelligence) of Wolpert as again it was very interesting for me.

10-That is enough!!! This writing takes a lot of time. I am not sure if it is possible to review each paper this way. But I know that doing so is very useful for me as if I do so, I understand the paper 3dB more! I would be happy if I get any feedback from anyone: the authors of the paper (Michael and Manuela) or any other one.