*Jochen Triesch, Christoph von der Malsburg,* **Self-Organized Integration of Adaptive Visual Cues for Face Tracking**

I have started reading a few papers on multi-cue and multi-modal hypotheses in visual object tracking. I had guessed the field was not that interesting, but it seems that it is! The observer design problem in these situations is much more complex and realistic than what I studied in my control courses, where nothing was said about multi-modality. Anyway, I will summarize a few papers on this subject and sometimes express my own ideas. Note that this weblog is semi-academic, so my synopses may not be entirely accurate or correct. I would be happy if the authors of the papers helped me understand their work better. (:

The paper proposes a method called *Democratic Integration*. It is a weighted, voting-like method that adapts the weights in a self-organizing manner, i.e. it does not use any external signal for those changes. Instead, it uses the difference between the *overall* result and the result of each cue. If they agree, that cue’s weight (called its reliability in the paper) is increased.

$$R(x,t)=\sum_i r_i(t)A_i(x,t)$$

$$\hat{x}(t) = \arg\max_x R(x,t)$$
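The two formulas above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: the function name `fuse_cues`, the array layout `(n_cues, H, W)`, and the toy numbers are my own assumptions.

```python
import numpy as np

def fuse_cues(saliency_maps, reliabilities):
    """Return the fused map R(x,t) and the estimate x_hat = argmax_x R(x,t).

    saliency_maps: array of shape (n_cues, H, W), one map A_i per cue (assumed layout).
    reliabilities: array of shape (n_cues,), the weights r_i.
    """
    A = np.asarray(saliency_maps, dtype=float)
    r = np.asarray(reliabilities, dtype=float)
    # R(x,t) = sum_i r_i(t) * A_i(x,t): contract the cue axis
    fused = np.tensordot(r, A, axes=1)
    # Position of the maximum, converted from a flat index to (row, col)
    x_hat = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, x_hat

# Toy example: two cues that vote for different positions
maps = np.zeros((2, 3, 3))
maps[0, 1, 1] = 1.0   # cue 0 votes for (1, 1)
maps[1, 0, 2] = 1.0   # cue 1 votes for (0, 2)
fused, x_hat = fuse_cues(maps, [0.7, 0.3])
```

In this toy case the more reliable cue wins the vote, so the estimate lands on the position it prefers.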

For adaptation, it defines a quality $\tilde{q}_i(t)$ as follows:

$$\tilde{q}_i(t) = R\big(A_i(\hat{x}(t),t) - E[A_i(x,t)]\big)$$

where $R(\cdot)$ with a single argument is the ramp function (zero for negative arguments, the identity otherwise), not to be confused with the fused map $R(x,t)$ above. In this formula, $A_i(\cdot)$ is the saliency map of each tracker. The paper states, and I have seen the same in a few other papers, that in the current literature this saliency map is interpreted as the probability of the object being at a specific place. The paper mentions that this choice of quality function is ad hoc and that other choices, such as the Kullback-Leibler distance, would give better results.
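As a small sketch of this quality measure: here I read $E[A_i(x,t)]$ as the average of the saliency map over all positions, which is my interpretation and not something stated above. The name `raw_quality` and the toy map are also made up for illustration.

```python
import numpy as np

def raw_quality(saliency_map, x_hat):
    """Unnormalized quality: ramp(A_i(x_hat) - mean over x of A_i(x)).

    Assumption: the expectation is taken as the spatial mean of the map.
    """
    A = np.asarray(saliency_map, dtype=float)
    diff = A[x_hat] - A.mean()
    # Ramp function: negative differences are clipped to zero
    return max(0.0, float(diff))

# Toy map with a single peak at (1, 1); its spatial mean is 0.9 / 9 = 0.1
A = np.zeros((3, 3))
A[1, 1] = 0.9
agree = raw_quality(A, (1, 1))     # cue agrees with the overall estimate
disagree = raw_quality(A, (0, 0))  # cue's response at the estimate is below its mean
```

A cue gets a positive quality only when its response at the agreed position is above its own average response; otherwise the ramp sets the quality to zero.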

After normalizing the qualities $\tilde{q}_i$ to obtain $q_i$, the change in reliability is

$$\tau\dot{r}_i(t) = q_i(t) - r_i(t)$$
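To see what this dynamics does, here is a rough sketch that integrates it with a forward Euler step. The assumptions are mine: the qualities are normalized so that they sum to one, the step size `dt` and the name `update_reliabilities` are hypothetical, and the uniform fallback when all qualities are zero is just a guess to keep the sketch well defined.

```python
import numpy as np

def update_reliabilities(r, q_raw, tau, dt=1.0):
    """One forward Euler step of tau * dr_i/dt = q_i - r_i."""
    r = np.asarray(r, dtype=float)
    q_raw = np.asarray(q_raw, dtype=float)
    total = q_raw.sum()
    if total > 0:
        q = q_raw / total                         # normalized qualities, sum to one
    else:
        q = np.full_like(q_raw, 1.0 / q_raw.size)  # assumption: uniform fallback
    # Each reliability relaxes toward its normalized quality with time constant tau
    return r + (dt / tau) * (q - r)

# Two cues start with equal reliability; cue 0 is consistently the better one
r = np.array([0.5, 0.5])
for _ in range(200):
    r = update_reliabilities(r, [0.8, 0.2], tau=10.0)
```

With constant qualities the reliabilities simply relax exponentially toward the normalized qualities, and if they start out summing to one they keep summing to one. A larger $\tau$ makes the adaptation slower and less sensitive to short disturbances.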

A similar formulation is given for the adaptation of the prototypes. The paper uses a few simple cues for face tracking. The results are not that exciting, but the performance is much better than in the case without adaptation. I am not aware of how other face tracking methods perform.

I wonder what would happen if I used context-based switching between different cues, i.e. storing the steady-state reliability weights and then trying them as initial guesses whenever an error happens. What is the measure of error?! I am not sure, but what about the number of conflicts between different trackers?! Or the time-averaged gradient of the reliability vector.
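Just to make the two candidate error measures concrete, here is one possible reading of each. None of this is from the paper: the function names, the pixel tolerance `tol`, and the finite-difference approximation of the gradient are all my own assumptions.

```python
import numpy as np

def count_conflicts(saliency_maps, x_hat, tol=1.0):
    """Number of cues whose own peak lies more than tol pixels from x_hat."""
    conflicts = 0
    for A in np.asarray(saliency_maps, dtype=float):
        peak = np.unravel_index(np.argmax(A), A.shape)
        if np.linalg.norm(np.subtract(peak, x_hat)) > tol:
            conflicts += 1
    return conflicts

def mean_reliability_drift(history, dt=1.0):
    """Time-averaged norm of the finite-difference derivative of r(t).

    history: array of shape (T, n_cues), one reliability vector per time step.
    """
    h = np.asarray(history, dtype=float)
    step_norms = np.linalg.norm(np.diff(h, axis=0), axis=1)
    return float(step_norms.mean() / dt)

# Toy data: cue 1 peaks away from the overall estimate at (1, 1)
maps = np.zeros((2, 3, 3))
maps[0, 1, 1] = 1.0
maps[1, 0, 2] = 1.0
n_conflicts = count_conflicts(maps, (1, 1))

# Reliabilities change once and then stay constant
drift = mean_reliability_drift([[0.5, 0.5], [0.6, 0.4], [0.6, 0.4]])
```

Either number could be compared against a threshold to decide when to try a stored set of weights, but choosing that threshold is an open question.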