This post is the third, and final, in a series of posts on mathematical camera representation.  The following are links to the earlier two entries in this series:

This post builds upon the model built up in these previous two posts by adding two final concepts:  the ability to handle non-square pixels in an image and the ability to handle skewed images.

For the rest of this discussion, the form of the solution for finding the projection matrix will remain the same as in Part 2.  That is, the $3 \times 4$ projection matrix $\mathbf{P}$ can be found by incorporating the $3 \times 3$ camera rotation matrix $\mathbf{R}$, the 3-vector $\mathbf{t}$, and the $3 \times 3$ upper-triangular intrinsic camera parameter matrix $\mathbf{K}$ as

$\mathbf{P}= \mathbf{K} \left[ \mathbf{R} | \mathbf{t} \right]$.

The intrinsic camera parameter matrix $\mathbf{K}$ defined in the Part 2 will be updated to take into account non-square pixels and skew.  It hopefully makes sense that $\mathbf{K}$ is where these changes take place since pixel dimensions and image skew are intrinsic to the camera and do not relate to the camera's extrinsic location in the world.

## Non-Square Pixels

Most digital cameras have rectangular pixels.  Because the pixels are rectangular, the camera model must scale the image by different amounts along the x- and y-axes.  We now update the definition of the intrinsic camera parameter matrix $\mathbf{K}$ to be defined as:

$\mathbf{K}= \left[ \begin{array}{ccc} \alpha_x & 0 & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right]$.

Here, $\alpha_x= f m_x$ and $\alpha_y= f m_y$ where $m_x$ is the number of pixels per unit distance in x and $m_y$ is the number of pixels per unit distance in y.  The principal point $(x_0, y_0)$ is now measured in terms of pixels.

## Skew

The final parameter we will add to our model is the skew parameter s.  The skew parameter models how the x- and y-axes are aligned in the image plane.  In most cases, the axes are perpendicular and $s=0$.  If the x- and y-axes are not perpendicular, then $s \neq 0$.

Incorporating the skew parameter into the intrinsic camera parameter matrix, we get

$\mathbf{K}= \left[ \begin{array}{ccc} \alpha_x & s & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right]$.

## Final Note on Degrees of Freedom

The camera projection matrix $\mathbf{P}$ is a homogenous transform, which means that two projection matrices are equivalent if the only difference between them is a non-zero scaling coefficient.  That is, $\mathbf{P}_1= \mathbf{P}_2$ if $\mathbf{P}_2= c \mathbf{P}_1$ where c is a non-zero constant.  Practically, this means that a projection matrix has 11 degrees of freedom despite being a 12-item matrix.

Going into a bit more depth, we can expand out our projection matrix as

$\mathbf{P}= \mathbf{K} \left[ \mathbf{R} | \mathbf{t} \right]= \left[ \begin{array}{ccc} \alpha_x & s & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{cccc} r_{11} & r_{12} & r_{13} & t_x\\ r_{21} & r_{22} & r_{23} & t_y\\r_{31} & r_{32} & r_{33} & t_z \end{array} \right]$

We can now count our degrees of freedom:

• $\mathbf{K}$ has 5 degrees of freedom since it has 6 elements, but is homogenous and only defined up to scale.  That is, $\mathbf{K}$ only has five elements that are mutually exclusive.
• $\mathbf{R}$ defines a rotation matrix, and therefore only has 3 degrees of freedom (roll, pitch, and yaw).
• $\mathbf{t}$ has 3 degrees of freedom since it defines a translation in 3-dimensional space which links the camera position with the world origin.

Thus, by simple addition, the camera projection matrix $\mathbf{P}$ has 11 degrees of freedom.

And with that, we are finished with our discussion of the mathematical camera model.  I hope that you have found this useful!

This post is the second in a series of posts on representing cameras mathematically.  If you have not read it yet, or need a quick refresher, please read Part 1 here.

## Intrinsic vs. Extrinsic Camera Properties

To move the camera in the world and to move the image on the image plane, we must distinguish between properties that are intrinsic to the camera and those that are extrinsic to it.  Extrinsic properties describe the camera's position in the world, while intrinsic properties describe things like the location of the image plane origin and image scaling.

To separate out the intrinsic from the extrinsic parameters, we define the camera calibration matrix $\mathbf{K}$ which describes the camera's intrinsic parameters.  The camera calibration matrix for the simple pinhole camera described in Part 1 is

$\mathbf{K}= \left[ \begin{array}{ccc} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{array} \right]$.

This camera calibration matrix only takes into account the focal length f.  But, we now have a description of the intrinsic parameters that is separate from the camera's position in the world.  Let's now change the camera's position.

## Setting the Camera Location

The above diagram was introduced back in Part 1, but the projection matrix $\mathbf{P}$ was then calculated assuming that the camera center $\mathbf{C}_w$ was at the origin and the camera points along the z-axis.  We will now generalize and assume that $\mathbf{C}_w$ can be any location in the world, and that the camera can be rotated arbitrarily.

The rotation of the camera is described by a $3 \times 3$ rotation matrix $\mathbf{R}$.  Rotation matrices are a common way to mathematically describe an object's roll, pitch, and yaw in a 3 dimensional space.  Rotation matrices are used whenever a linear model of 3D location is needed--vision, robotics, and graphics are example sub-fields of computer science that use rotation matrices regularly.

To apply the rotation matrix $\mathbf{R}$ and the camera position $\mathbf{C}_w$, we must define a transformation that translates and rotates the camera in terms of the world frame.  That is, we need the rotation and translation of the camera from the origin of the world frame to its position and orientation in the world.  The rotation is very straight forward, as it is described by rotation matrix $\mathbf{R}$.  However, the translation is a bit trickier;  to find the translation to use in the projection matrix $\mathbf{P}$, we need to "correct" for the rotation.  Thus, the translation is described as

$\mathbf{t}= -\mathbf{RC}_w$,

where $\mathbf{t}$ is the resulting 3 dimensional vector.

Given all of this, we can solve for the projection matrix using the following equation:

$\mathbf{P}= \mathbf{K} \left[ \mathbf{R} | \mathbf{t} \right]$.

## Setting the Image Location

Now that we can move the camera around to any arbitrary location and orientation in the world, we will focus on moving the principal point of the image to an arbitrary point in the image plane.  The principal point is the point in the 2D image plane that corresponds to point $\mathbf{C}_i$ in the diagram above.  The reason why it is important to move is because the principal point is the origin, point (0, 0) in the image.  Most digital image formats put the origin in the corner of the image, but without moving the principal point, the origin will be in the center of the image.  This must be changed!

Image plane diagram. Shows the location of the principal point and associated axes in the camera image plane ($C_{cam}$) and the x,y axes of the actual image.

To move the principal point to the image origin, we need to add the $y_0$ offset for the y-axis and the $x_0$ offset for the x-axis.  This is a fairly straightforward modification of the camera calibration matrix $\mathbf{K}$ above.  Once we make this change, we get:

$\mathbf{K}= \left[ \begin{array}{ccc} f & 0 & x_0\\ 0 & f & y_0\\ 0 & 0 & 1 \end{array} \right]$.

It can clearly be seen that this addition simply adds a (scaled) offset to the image locations in the image plane.  To illustrate this with an example, let's solve for $\mathbf{K X}_{cam}$ where $\mathbf{X}_{cam}$ is a 3D homogenous vector containing a point in the camera's image plane:

$\mathbf{K X}_{cam}= \left[ \begin{array}{ccc} f & 0 & x_0\\ 0 & f & y_0\\ 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} x_{cam}\\ y_{cam}\\ 1 \end{array} \right]= \left[ \begin{array}{c} fx_{cam}+x_0\\ fy_{cam}+y_0\\ 1 \end{array} \right]$.

### Images with Origin in the Upper-Left-Hand Corner

One final thought to consider:  many digital image formats put the origin of the image in the upper left-hand corner of the image, with the y-axis pointed down.  If you are dealing with images like that, you will need to correct your camera calibration matrix as follows:

$\mathbf{K}'= \left[ \begin{array}{ccc}1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1 \end{array} \right] \mathbf{K}$.

This correction will flip the y-axis so that it will line up correctly with the image plane.

And that is where we will leave off for today.  Come back next time for Part 3 of this series where we will add in more intrinsic camera parameters to think about.

Edit 8/16/2013:  You can find Part 3 of this series here.

Representing a camera mathematically can be a bit tricky, especially if you want to represent many aspects of the camera.  In this post, I will begin a discussion of the linear pinhole camera model.  This is the first in a series of posts on camera representation;  at the end of this series, we will have completely walked through the derivation of a linear system that describes how a point in the 3D world projects to a point on the 2D image plane.

## Homogenous Coordinates

Before we can go any further, we need to discuss homogenous coordinates, which is basically a linear algebra trick to simplify the writing of our equations.  To convert a normal coordinate system to a homogenous coordinate system, an extra dimension must be added to every point in the system.  This extra coordinate is simply a scalar multiple ($s_w$ here), so an (originally 3D) world point would be $\mathbf{X}_w= (s_w x_w, s_w y_w, s_w z_w, s_w)^T= s_w ( x_w, y_w, z_w, 1)^T$ in homogenous coordinates.  Similarly, an (originally 2D) image point will then be $\mathbf{X}_i= (s_i x_i, s_i y_i, s_i)^T= s_i (x_i, y_i, 1)^T$ in homogenous coordinates.

It is important to note that in homogenous coordinates, the value of the scalar multiple ($s_w$ and $s_i$ above) does not matter, since it can simply be divided out of the point.  It just cannot be zero in most circumstances.  For example,

$\frac{1}{s_i}\mathbf{X}_i= \frac{1}{s_i} \left[ \begin{array}{c}s_i x_i \\ s_i y_i \\ s_i \end{array}\right]= \left[ \begin{array}{c} x_i \\ y_i \\ 1 \end{array}\right]$.

## The Simplest Pinhole Camera

Using homogenous coordinates, we will now build a mathematical description of a camera.  The mathematical description of the camera is a set of linear equations that translate a world point $\mathbf{X}_w$ into an image point $\mathbf{X}_i$.  Since the homogenous world point is 4 dimensional and the homogenous image point is 3 dimensional, the overall transformation can be described by the $3 \times 4$ projection matrix $\mathbf{P}$.  The projection from the world point to its corresponding image point can then be written as $\mathbf{X}_i= \mathbf{P} \mathbf{X}_w$.

Let's now dig in and look at an example camera:

Pinhole camera diagram.

In the above diagram of a simple pinhole camera, we have a number of key items listed:

• x, y, and z are the 3D world axes.
• z is the principal axis, which is simply the axis perpendicular to the image plane.  Think of this as the direction that the camera is pointing.  The z axis is often chosen as the principal axis because most vision scientists have historically chosen it to be the principal axis.  All of the equations we derive can be re-derived to use a different axis as the principal axis if you are so inclined.
• $\mathbf{C}_w$ is the world coordinate of the camera center.  This is a 4 dimensional homogenous point.
• $\mathbf{C}_i$ is the principal point, which is the point where the principal axis meets the image plane.  This is a 3 dimensional homogenous point because it is on the 2D image plane, not in the 3D world.
• f is the the focal length, which is just is the scalar distance from the camera center to the image plane.
• $\mathbf{X}_w$ is the world point being imaged.  This is a 4 dimensional homogenous point.
• Finally, $\mathbf{X}_i$ is the point on the image plane that the world point projects to.  This is a 3 dimensional homogenous point.

In the simplest case of the projection matrix, the camera center is at the origin of the world coordinate system, which means $\mathbf{C}_w= (0, 0, 0, 1)^T$.  Therefore, our very simple transformation of the world coordinate to the image coordinate ($\mathbf{X}_i= \mathbf{P} \mathbf{X}_w$) can be fully written out as

$\mathbf{X}_i= \left[ \begin{array}{c} s_i x_i\\ s_i y_i\\ s_i \end{array}\right]= \left[ \begin{array}{cccc} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{array} \right] \left[ \begin{array}{c} x_w\\ y_w\\ z_w\\ 1 \end{array}\right]$

where

$\mathbf{P}= \left[ \begin{array}{cccc} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{array} \right]$

and

$\mathbf{X}_w= \left[ \begin{array}{c} x_w\\ y_w\\ z_w\\ 1 \end{array}\right]$.

That is where we will stop for today.  Come back for the next post in this series, which will explore moving the camera center to a different point in the world and moving the principal point to a different point in the image plane.

Edit 8/16/2013:  You can find Part 2 of this series here, and Part 3 of this series here.

Background segmentation is a computer vision technique that is routinely used as part of automated video surveillance systems to try to distinguish targets to track within a scene.  It works only in the case of vision systems that incorporate a stationary camera observing a mostly stationary scene.  It works best when the scene is not regularly crowded with many moving targets.  The central concept is that targets are found by modeling not the targets themselves, but the rest of the scene.

Background segmentation relies on a background model of the scene, i.e., image features that do not change or change very slowly with time.  Then, on each frame of video, the background model is compared to what is actually in the image;  the parts of the image that do not match the background are therefore foreground and are worthy of further processing.  The background model can also be updated with every frame so that moving objects that stop can be incorporated into the background model (e.g. a vehicle stopping in a parking lot) or so that slow changes can be incorporated (e.g. the position of shadows as the location of the sun changes).

Variable Definitions

At every point of time $t$, we have an image frame $I^{t}$.  The image is broken up into a number of pixels, and we can identify an individual pixel as $I^t_{i, j}$.  The pixel is represented by a three-dimensional vector, containing a red, green, and blue value.

The background model that we build up will have several main components:  A three-dimensional mean vector $\mu_{i, j}$ and a $3 \times 3$ covariance matrix $\Sigma_{i,j}$ for each pixel in the video.  We also need to keep track of the learning rate $\alpha$ that represents how fast new information is incorporated into the background model (a good starting value is 0.01). Finally, we need a threshold value $\tau$ to determine how many standard deviations away from the mean we have to be to be considered foreground (a good starting value is between 1.0 and 3.0).

Initialization

The first step in the background segmentation process is initialization--defining an initial set of values for our model before we can start processing video frames as they come in.  For this, we will set the initial mean value for each pixel to that pixel's value in the first frame of the video sequence, e.g.,

$\mu_{i, j}= I^0_{i, j}.$

Similarly, we will initialize the covariance matrix for each pixel to the identity matrix, e.g.,

$\Sigma_{i,j}= \left[ \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right].$

Update

The model is updated on every frame $t$ using an exponential moving average.  This allows the model to incorporate changes into the background. We update the mean value of each pixel based on that pixel's current value in the current frame, e.g.,

$\mu_{i, j}= \alpha I^t_{i, j}+ ( 1- \alpha) \mu_{i, j}.$

Similarly, the covariance matrix for each pixel is updated based on the that pixel's current value in the current frame, e.g.,

$\sigma= [ \mu_{i, j}- I^t_{ i, j}] [ \mu_{i, j}- I^t_{i, j}]^T$

$\Sigma_{i,j}= \alpha \sigma+ ( 1- \alpha) \Sigma_{i, j}.$

A foreground mask is a black-and-white image where a pixel is white if that pixel in the current frame is foreground, and black otherwise.  This requires a way to determine if a given pixel is foreground.  For this, we use a distance measure known as the Mahalanobis distance, which ultimately just tells us how far a sample point is from the mean in terms of standard deviations as defined by the covariance matrix.  The Mahalanobis distance is really handy, because it gives us a distance measure that scales with the uncertainty we have in the system.  Simply, the distance $d_{i, j}$ is

$d_{i, j}= \sqrt{ (I^t_{i, j}- \mu_{i, j})^T \Sigma_{i, j}^{-1} (I^t_{i, j}- \mu_{i, j})}$

To create our foreground mask $M^t$, we set $M^t_{i, j}= 1$ if $d_{i, j}> \tau$ and set $M^t_{i, j}= 0$ otherwise.

Example

Here is a demonstration of the method presented above.  It shows me walking around a static background.  The current frame $I^t$ is shown on the top, the current background model is shown in the center, and the foreground mask $M^t$ is shown on the bottom.

Notice there are a couple of jumps where almost the whole image is detected as foreground when my camera decided to "helpfully" automatically adjust it's focus; this is expected behavior as the values detected for most individual pixels changed, and this was detected.  Also notice that I get learned into the background when I stop by the table.

Notes on Implementation

This algorithm is really straightforward to code up, but is not readily vectorizable, which means that it runs very slowly when coded up in Matlab.  It runs much faster than real time if you code it up in C++ and use OpenCV to handle the reading and writing of the video files and VNL to handle the linear algebra.

Also note that the method presented here is very bare-bones.  For professional systems, there are additional pieces that can be added on to improve results.

There are a number of useful tools out there for writing computer vision software.  The three packages that follow are the ones that I have gotten the most use out of over the years.  These are very good tools, which I highly recommend to anyone that needs to create computer vision or image processing software.

Matlab Image Processing Toolbox

The Matlab Image Processing Toolbox is a set of functions in Matlab that support image processing and manipulation.  The toolbox provides the basic functionality to read in and write out images;  in recent years, this has expanded to video files as well.  Matlab treats images as matrices of pixel values, making great use of the first-class-citizen support Matlab gives to vectors and matrices.  Working with images at a pixel level is just like working with any other matrix in Matlab.

Matlab's great strength is it's ease of use for the programmer.  I sometimes joke that Matlab code is "executable pseudocode," and that is precisely the feel it gives while using it.  I can code up and try out more ideas, faster, in Matlab than I can in any other language.  Creating output plots, figures, and images is very straight forward, and makes checking code behavior a joy.  Unfortunately, there are some drawbacks to using Matlab--it's code can only run within the Matlab interpreter; it's dynamic record datatype is nice to work with, but there is no object-oriented support; and code executes very slowly (especially if you use any loops in your code, which is likely with image processing).

I find that my typical development workflow when developing computer vision algorithms will often start with creating prototype code in Matlab.  Once I get all of the mathematical and structural kinks worked out, I will sometimes then move on to implementing the code in C++ based on my completed Matlab functions.  Reimplementation in C++ is only done if I need real-time performance or faster testing.  If performance is no big deal, I find simply using my Matlab code to be acceptable.

OpenCV

The OpenCV C++ library is probably the most widely used computer vision library.  There is a lot in here from the low level to the high level.  The low level functionality to read in and write out video files is exceptionally useful, and the first step for someone just beginning to write vision software in C++.  OpenCV contains a plethora of higher-level vision code, from edge detectors through face detectors through machine learning algorithms.

OpenCV is very usable, and installation is very easy.  There is not much that has to be linked against, and the most widely used functions in it work really well.  It is an open source project though, and components that do not get much use can exhibit some wonky code.  It is pretty big, needing about 4 GB of space.  Overall, OpenCV is a great library and a good place to start for someone beginning to write vision code in C++.

VXL

The VXL C++ library is an amazing library, with vast functionality applicable to computer vision projects.  I am especially fond of the VNL and VNL Algos sub-libraries.  Honestly, they are the closest thing I have found to having Matlab's language-level support for matrices and matrix operations in C++.  This isn't to say the VXL library is as intuitive to use as Matlab, but VXL provides a lot of the functionality without having to write all the low-level matrix manipulation functions yourself.  What I find especially remarkable about VXL is that all of the code that I have had to go through in it has been very well written, which is not the case with all open source projects.

The major downside of VXL, in my opinion, is the building and installation process.  It is a serious rite of passage.  If you intend to use VXL for a project of yours, make sure to set aside a day (maybe two) to get to the point where you can link against it.  VXL is a library where only the source is provided, and it is up to you to get it compiled and working.  There are installation instructions on the VXL site, but they are not very detailed.  Here are some detailed VXL installation instructions that I have found to be useful.  It is important to also note that VXL is quite large; make sure to have about 10 GB free before you start with it.