Generative Affine Localisation and Tracking

Reference:
J. Winn and A. Blake, Generative Affine Localisation and Tracking, Advances in Neural Information Processing Systems 16, 2004.

Using a generative layered vision model, we can localise objects undergoing affine motion in a video sequence whilst simultaneously learning their shape and appearance. In the following video, the shape and appearance of the moving hand are learned and the hand is accurately tracked. Note that the tracking is robust to local non-affine deformations, such as when the thumb bends across the hand.

Learned appearance and shape


<== Video showing learned pose    

AVI [481K]   MPEG [489K]


<== Original video (left) and video reconstructed from learned model parameters (right). The red halo shows the foreground segmentation.

AVI [279K]    MPEG [289K]

Note that the reconstructed hand appears flattened due to the assumption of a planar affine transform.

More examples

Learned appearance and shape


<== Video showing learned pose    

AVI [243K]   MPEG [240K]


How it works

A two-layer generative model is used, with one layer corresponding to the tracked foreground and the other to the stationary background. The Bayesian network below shows how the two layers, with appearances f and b, are composited to form the observed image x. The foreground mask m for a particular frame is governed by an overall mask prior p. Both the mask prior and the foreground appearance are transformed by an affine transform T.
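The compositing step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `affine_warp` and `composite_frame` are hypothetical, T is assumed to be parameterised as a 2x2 matrix `A` plus a translation `t` in (row, column) coordinates, the warp uses nearest-neighbour sampling, and observation noise is left out.

```python
import numpy as np

def affine_warp(image, A, t, fill=0.0):
    """Warp a 2-D array by the affine map y = A @ x + t (assumed form of T).

    Uses inverse mapping with nearest-neighbour sampling; output pixels whose
    source falls outside the image are set to `fill`.
    """
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    out_coords = np.stack([rows.ravel(), cols.ravel()]).astype(float)  # 2 x N
    # For each output pixel, find the source pixel it came from.
    src = np.linalg.inv(A) @ (out_coords - t[:, None])
    src = np.rint(src).astype(int)
    valid = (src[0] >= 0) & (src[0] < h) & (src[1] >= 0) & (src[1] < w)
    out = np.full(h * w, fill, dtype=float)
    out[valid] = image[src[0, valid], src[1, valid]]
    return out.reshape(h, w)

def composite_frame(f, b, p, A, t, rng):
    """Sample one frame x from foreground f, background b and mask prior p."""
    Tf = affine_warp(f, A, t)             # transformed foreground appearance
    Tp = affine_warp(p, A, t, fill=0.0)   # transformed mask prior
    m = rng.random(p.shape) < Tp          # per-pixel binary foreground mask
    x = np.where(m, Tf, b)                # foreground where m is set, else background
    return x, m

# Toy example: a 2x2 foreground patch shifted by 2 rows and 1 column.
f = np.ones((8, 8))                       # foreground appearance
b = np.zeros((8, 8))                      # stationary background
p = np.zeros((8, 8))
p[2:4, 2:4] = 1.0                         # mask prior: certain foreground in a 2x2 square
A = np.eye(2)                             # no rotation, scale or shear
t = np.array([2.0, 1.0])                  # translation (rows, cols)
x, m = composite_frame(f, b, p, A, t, np.random.default_rng(0))
```

Because the mask prior in this toy example is either 0 or 1, the sampled mask is deterministic and the patch simply appears at rows 4-5, columns 3-4 of `x`. With intermediate prior values, each frame would draw a different mask from the transformed prior.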

For a given image sequence, a posterior distribution over all variables is found by applying approximate Bayesian inference within the model.
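To give a feel for what a posterior over one of these variables looks like, the sketch below computes the per-pixel posterior over the mask m when everything else is held fixed. It does not reproduce the approximate inference scheme applied to the full model. It assumes the transformed foreground `Tf`, the background `b` and the transformed mask prior `Tp` are already known, and that each pixel is observed with independent Gaussian noise of standard deviation `sigma`. The function name `mask_posterior` and the noise model are assumptions made for illustration only.

```python
import numpy as np

def mask_posterior(x, Tf, b, Tp, sigma=0.1):
    """Posterior probability that each pixel of x belongs to the foreground.

    Applies Bayes' rule per pixel, assuming Gaussian observation noise:
      P(m=1 | x) is proportional to Tp * N(x; Tf, sigma^2)
      P(m=0 | x) is proportional to (1 - Tp) * N(x; b, sigma^2)
    """
    eps = 1e-12  # guards against log(0) when the prior is exactly 0 or 1
    log_fg = np.log(Tp + eps) - 0.5 * ((x - Tf) / sigma) ** 2
    log_bg = np.log(1.0 - Tp + eps) - 0.5 * ((x - b) / sigma) ** 2
    # Normalise stably: 1 / (1 + exp(log_bg - log_fg)).
    return np.exp(-np.logaddexp(0.0, log_bg - log_fg))

# Toy example: bright foreground (1.0) against a dark background (0.0).
x = np.array([0.9, 0.1, 0.5])     # observed pixel intensities
Tf = np.ones(3)                   # transformed foreground appearance
b = np.zeros(3)                   # background appearance
Tp = np.full(3, 0.5)              # uninformative transformed mask prior
post = mask_posterior(x, Tf, b, Tp)
```

Here the first pixel is close to the foreground intensity and receives a posterior near 1, the second is close to the background and receives a posterior near 0, and the third lies exactly halfway and stays at 0.5. In the full model the appearances, mask prior and transform are themselves uncertain, which is why approximate inference over all variables is needed.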

Affine tracking graphical model
The Bayesian network for the two-layer generative image model.