Jochen Triesch, Christoph von der Malsburg, Self-Organized Integration of Adaptive Visual Cues for Face Tracking
I have started reading a few papers on multi-cue and multi-modal hypotheses in visual object tracking. I guessed the field was not that interesting, but it seems that it is! The observer design problem in these situations is much more complex and realistic than what I studied in my control courses, where nothing was said about multi-modality. Anyway, I will summarize a few papers on this subject and sometimes add my own ideas. Note that this weblog is semi-academic, so my synopses may not be entirely accurate or correct. I would be happy if the authors of the papers helped me understand their work better. (:
The paper proposes a method called Democratic Integration. It is a weighted voting scheme that adapts its weights in a self-organizing manner, i.e. it does not use any external signal for those changes. Instead, it uses the agreement between the overall result and the result of each cue: if they are alike, that cue's weight (or reliability, as the paper calls it) is increased. The overall result $R(x,t)$ and the estimated position $\hat{x}(t)$ are
$$R(x,t)=\sum_i r_i(t)A_i(x,t)$$
$$\hat{x}(t) = \arg\max_x R(x,t)$$
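To make the two formulas above concrete, here is a minimal NumPy sketch of the fusion step. It is only my own illustration, not code from the paper: the function name `fuse_cues`, the array layout (one 2-D saliency map per cue) and the toy numbers are all assumptions.

```python
import numpy as np

def fuse_cues(saliency_maps, reliabilities):
    """Weighted sum of the cue maps, then the position of its maximum.

    saliency_maps: array of shape (n_cues, H, W), one map A_i per cue (assumed layout).
    reliabilities: array of shape (n_cues,), the weights r_i.
    """
    A = np.asarray(saliency_maps, dtype=float)
    r = np.asarray(reliabilities, dtype=float)
    # R(x,t) = sum_i r_i(t) A_i(x,t): contract the cue axis of A with r.
    R = np.tensordot(r, A, axes=1)
    # x_hat(t) = argmax_x R(x,t), returned as (row, col).
    x_hat = tuple(int(i) for i in np.unravel_index(np.argmax(R), R.shape))
    return R, x_hat

# Toy example (made-up numbers): two cues that disagree about the position.
A0 = np.zeros((3, 3)); A0[0, 0] = 1.0   # cue 0 votes for (0, 0)
A1 = np.zeros((3, 3)); A1[2, 2] = 1.0   # cue 1 votes for (2, 2)
R, x_hat = fuse_cues([A0, A1], [0.3, 0.7])
```

With these weights the more reliable cue wins the vote, so `x_hat` ends up at cue 1's peak.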
For adaptation, it defines a quality $\tilde{q}_i(t)$ for each cue as follows:
$$\tilde{q}_i(t) = R\left(A_i(\hat{x}(t),t) - E[A_i(x,t)]\right)$$
where $R(\cdot)$ is the ramp function (not the overall result $R(x,t)$ above) and $A_i(\cdot)$ is the saliency map of each tracker. The paper states, and I have seen the same in a few other papers, that in the current literature this saliency map is taken to be the probability of the object being at a specific place. The paper also mentions that this choice of quality function is ad hoc and that other ideas, such as using the Kullback-Leibler distance, would give better results.
After normalizing the qualities $\tilde{q}_i$ to obtain $q_i$, the change in reliability is given by
$$\tau\dot{r}_i(t) = q_i(t) - r_i(t)$$
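The adaptation rule can be sketched in a few lines as well. Again this is just my reading of the formulas, under several assumptions: I take $E[A_i(x,t)]$ to be the spatial mean of the map, I normalize the qualities so that they sum to one, I leave the reliabilities unchanged when every quality is zero, and I discretize the differential equation with a simple Euler step. The names `update_reliabilities`, `tau` and `dt` are my own.

```python
import numpy as np

def update_reliabilities(saliency_maps, r, x_hat, tau=10.0, dt=1.0):
    """One Euler step of tau * dr_i/dt = q_i - r_i.

    saliency_maps: array of shape (n_cues, H, W) (assumed layout).
    r: current reliabilities, shape (n_cues,).
    x_hat: estimated position as (row, col).
    """
    A = np.asarray(saliency_maps, dtype=float)
    r = np.asarray(r, dtype=float)
    # A_i(x_hat, t): response of each cue at the agreed position.
    at_estimate = A[(slice(None),) + tuple(x_hat)]
    # E[A_i(x,t)]: taken here as the mean over all positions (assumption).
    mean_response = A.reshape(len(A), -1).mean(axis=1)
    # Ramp function: zero for negative arguments, identity otherwise.
    q_tilde = np.maximum(0.0, at_estimate - mean_response)
    total = q_tilde.sum()
    # Normalize to sum to one; if all qualities vanish, keep r as is (assumption).
    q = q_tilde / total if total > 0 else r.copy()
    r_new = r + (dt / tau) * (q - r)
    return r_new, q

# Toy example (made-up numbers): cue 0 is peaked at the estimate, cue 1 is flat.
A0 = np.zeros((3, 3)); A0[2, 2] = 1.0
A1 = np.full((3, 3), 0.5)
r_new, q = update_reliabilities([A0, A1], [0.5, 0.5], (2, 2))
```

The flat cue gets zero quality because its response at the estimate is no better than its average, so its reliability slowly decays while that of the peaked cue grows.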
A similar formulation is given for the adaptation of prototypes. The paper uses a few simple cues for face tracking. The results are not that exciting, but performance is much better than in the case without adaptation. I am not aware of how other face tracking methods perform.
I wonder what would happen if I used context-based switching between different cues, i.e. storing the steady-state reliability weights and then trying them as an initial guess whenever an error happens. What is the measure of error?! I am not sure, but what about the number of conflicts between different trackers?! Or the time-averaged gradient of the reliability vector over time.
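To pin down what I mean by "number of conflicts", here is one possible (and entirely hypothetical) definition: a cue is in conflict if the peak of its own saliency map lies farther than some distance from the overall estimate. The function name `count_conflicts` and the threshold `max_dist` are my inventions, not something from the paper.

```python
import numpy as np

def count_conflicts(saliency_maps, x_hat, max_dist=2.0):
    """Count cues whose own peak is farther than max_dist from x_hat.

    This is a guess at an error measure, not part of Democratic Integration.
    """
    A = np.asarray(saliency_maps, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    conflicts = 0
    for A_i in A:
        # Position where this cue alone would place the object.
        peak = np.array(np.unravel_index(np.argmax(A_i), A_i.shape), dtype=float)
        if np.linalg.norm(peak - x_hat) > max_dist:
            conflicts += 1
    return conflicts

# Toy example (made-up numbers): three cues, one of them far from the estimate.
maps = np.zeros((3, 5, 5))
maps[0, 0, 0] = 1.0   # far from (4, 4)
maps[1, 4, 4] = 1.0   # exactly at (4, 4)
maps[2, 4, 3] = 1.0   # one pixel away from (4, 4)
n_conflicts = count_conflicts(maps, (4, 4))
```

A rising conflict count could then serve as the trigger for trying a stored set of reliability weights.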