Background segmentation is a computer vision technique that is routinely used as part of automated video surveillance systems to try to distinguish targets to track within a scene.  It works only in the case of vision systems that incorporate a stationary camera observing a mostly stationary scene.  It works best when the scene is not regularly crowded with many moving targets.  The central concept is that targets are found by modeling not the targets themselves, but the rest of the scene.

Background segmentation relies on a background model of the scene, i.e., image features that do not change or change very slowly with time.  Then, on each frame of video, the background model is compared to what is actually in the image;  the parts of the image that do not match the background are therefore foreground and are worthy of further processing.  The background model can also be updated with every frame so that moving objects that stop can be incorporated into the background model (e.g. a vehicle stopping in a parking lot) or so that slow changes can be incorporated (e.g. the position of shadows as the location of the sun changes).

Variable Definitions

At every point of time $t$, we have an image frame $I^{t}$.  The image is broken up into a number of pixels, and we can identify an individual pixel as $I^t_{i, j}$.  The pixel is represented by a three-dimensional vector, containing a red, green, and blue value.

The background model that we build up will have several main components:  A three-dimensional mean vector $\mu_{i, j}$ and a $3 \times 3$ covariance matrix $\Sigma_{i,j}$ for each pixel in the video.  We also need to keep track of the learning rate $\alpha$ that represents how fast new information is incorporated into the background model (a good starting value is 0.01). Finally, we need a threshold value $\tau$ to determine how many standard deviations away from the mean we have to be to be considered foreground (a good starting value is between 1.0 and 3.0).

Initialization

The first step in the background segmentation process is initialization--defining an initial set of values for our model before we can start processing video frames as they come in.  For this, we will set the initial mean value for each pixel to that pixel's value in the first frame of the video sequence, e.g.,

$\mu_{i, j}= I^0_{i, j}.$

Similarly, we will initialize the covariance matrix for each pixel to the identity matrix, e.g.,

$\Sigma_{i,j}= \left[ \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right].$

Update

The model is updated on every frame $t$ using an exponential moving average.  This allows the model to incorporate changes into the background. We update the mean value of each pixel based on that pixel's current value in the current frame, e.g.,

$\mu_{i, j}= \alpha I^t_{i, j}+ ( 1- \alpha) \mu_{i, j}.$

Similarly, the covariance matrix for each pixel is updated based on the that pixel's current value in the current frame, e.g.,

$\sigma= [ \mu_{i, j}- I^t_{ i, j}] [ \mu_{i, j}- I^t_{i, j}]^T$

$\Sigma_{i,j}= \alpha \sigma+ ( 1- \alpha) \Sigma_{i, j}.$

A foreground mask is a black-and-white image where a pixel is white if that pixel in the current frame is foreground, and black otherwise.  This requires a way to determine if a given pixel is foreground.  For this, we use a distance measure known as the Mahalanobis distance, which ultimately just tells us how far a sample point is from the mean in terms of standard deviations as defined by the covariance matrix.  The Mahalanobis distance is really handy, because it gives us a distance measure that scales with the uncertainty we have in the system.  Simply, the distance $d_{i, j}$ is

$d_{i, j}= \sqrt{ (I^t_{i, j}- \mu_{i, j})^T \Sigma_{i, j}^{-1} (I^t_{i, j}- \mu_{i, j})}$

To create our foreground mask $M^t$, we set $M^t_{i, j}= 1$ if $d_{i, j}> \tau$ and set $M^t_{i, j}= 0$ otherwise.

Example

Here is a demonstration of the method presented above.  It shows me walking around a static background.  The current frame $I^t$ is shown on the top, the current background model is shown in the center, and the foreground mask $M^t$ is shown on the bottom.

Notice there are a couple of jumps where almost the whole image is detected as foreground when my camera decided to "helpfully" automatically adjust it's focus; this is expected behavior as the values detected for most individual pixels changed, and this was detected.  Also notice that I get learned into the background when I stop by the table.

Notes on Implementation

This algorithm is really straightforward to code up, but is not readily vectorizable, which means that it runs very slowly when coded up in Matlab.  It runs much faster than real time if you code it up in C++ and use OpenCV to handle the reading and writing of the video files and VNL to handle the linear algebra.

Also note that the method presented here is very bare-bones.  For professional systems, there are additional pieces that can be added on to improve results.