Reference:
J. Winn and A. Blake, Generative Affine Localisation and Tracking,
Advances in Neural Information Processing Systems 16, 2004.
Using a generative layered vision model, we are able to localise objects undergoing affine motion in a video sequence, whilst simultaneously learning their shape and appearance. In the following video, the shape and appearance of the moving hand are learned and the hand is accurately tracked. Note that the tracking is robust to local non-affine deformations, such as when the thumb bends across the hand.
Figure: Learned appearance and shape.
Video showing learned pose: AVI [481K], MPEG [489K].

Figure: Original video (left) and video reconstructed from learned model parameters (right). The red halo shows the foreground segmentation. Note that the reconstructed hand appears flattened due to the assumption of a planar affine transform.

Figure: Learned appearance and shape.
Video showing learned pose: AVI [243K], MPEG [240K].

A two-layer generative model is used, with one layer corresponding to the tracked foreground and the other to the stationary background. The Bayesian network below shows how the two layers, with appearances f and b, are composited to form the observed image x. The foreground mask m for a particular frame is governed by an overall mask prior p. Both the mask prior and the foreground appearance are transformed by an affine transform T.
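The compositing step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `affine_warp` and `composite_frame` are hypothetical, the warp uses nearest-neighbour sampling, the mask m is assumed to be drawn per pixel as a Bernoulli variable with probability T(p), and the observed image is formed without any noise term. The symbols f, b, p, m, x and T follow the text.

```python
import numpy as np


def affine_warp(image, A, t, fill=0.0):
    """Warp a 2-D array by the affine map y = A @ x + t (nearest neighbour).

    Each output pixel y is filled from the source pixel x = A^{-1} (y - t).
    Output pixels whose source falls outside the image are set to `fill`.
    Coordinates are in (column, row) order.
    """
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    out_xy = np.stack([cols.ravel(), rows.ravel()]).astype(float)  # shape (2, h*w)
    shift = np.asarray(t, dtype=float).reshape(2, 1)
    src_xy = np.linalg.inv(A) @ (out_xy - shift)
    sx = np.rint(src_xy[0]).astype(int)
    sy = np.rint(src_xy[1]).astype(int)
    inside = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.full(h * w, fill, dtype=float)
    out[inside] = image[sy[inside], sx[inside]]
    return out.reshape(h, w)


def composite_frame(f, b, p, A, t, rng):
    """Generate one frame x from foreground f, background b and mask prior p.

    Assumption (not from the source): m ~ Bernoulli(T(p)) per pixel, and the
    frame is the noise-free composite x = m * T(f) + (1 - m) * b.
    """
    Tf = affine_warp(f, A, t)            # transformed foreground appearance T(f)
    Tp = affine_warp(p, A, t, fill=0.0)  # transformed mask prior T(p)
    m = rng.random(Tp.shape) < Tp        # binary foreground mask for this frame
    x = np.where(m, Tf, b)               # foreground where m is set, else background
    return x, m


# Usage: a small square object translated by 3 pixels right and 1 pixel down.
rng = np.random.default_rng(0)
f = np.full((8, 8), 0.9)                 # foreground appearance
b = np.full((8, 8), 0.1)                 # stationary background appearance
p = np.zeros((8, 8))
p[2:4, 2:4] = 1.0                        # mask prior: object occupies a 2x2 block
A = np.eye(2)                            # linear part of the affine transform
t = (3.0, 1.0)                           # translation (columns, rows)
x, m = composite_frame(f, b, p, A, t, rng)
```

Inference runs in the opposite direction: given the observed frames x, the model recovers f, b, p and a transform T for each frame. The sketch only covers the generative direction.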
For a given image sequence, a posterior distribution over all variables is found by applying approximate Bayesian inference within the model.
The Bayesian network for the two-layer generative image model.