This post is the third and final entry in a series on mathematical camera representation, following Part 1 and Part 2.

It extends the model developed in those two posts with two final concepts:  the ability to handle non-square pixels in an image and the ability to handle skewed images.

For the rest of this discussion, the form of the projection matrix will remain the same as in Part 2.  That is, the $3 \times 4$ projection matrix $\mathbf{P}$ is built from the $3 \times 3$ camera rotation matrix $\mathbf{R}$, the translation 3-vector $\mathbf{t}$, and the $3 \times 3$ upper-triangular intrinsic camera parameter matrix $\mathbf{K}$ as

$\mathbf{P}= \mathbf{K} \left[ \mathbf{R} | \mathbf{t} \right]$.

The intrinsic camera parameter matrix $\mathbf{K}$ defined in Part 2 will be updated to take into account non-square pixels and skew.  These changes belong in $\mathbf{K}$ because pixel dimensions and image skew are intrinsic to the camera and do not depend on the camera's extrinsic position and orientation in the world.

## Non-Square Pixels

The pixels of a digital camera's sensor are not guaranteed to be square.  When the pixels are rectangular, the camera model must scale the image by different amounts along the x- and y-axes.  We now update the intrinsic camera parameter matrix $\mathbf{K}$ to:

$\mathbf{K}= \left[ \begin{array}{ccc} \alpha_x & 0 & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right]$.

Here, $\alpha_x= f m_x$ and $\alpha_y= f m_y$, where $m_x$ is the number of pixels per unit distance in x and $m_y$ is the number of pixels per unit distance in y.  In other words, $\alpha_x$ and $\alpha_y$ are the focal length expressed in pixels along each axis.  The principal point $(x_0, y_0)$ is now measured in pixels as well.
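As a small numeric sketch of these definitions, the Python snippet below builds $\mathbf{K}$ from a focal length and two pixel densities.  The function name `intrinsic_matrix` and all of the numbers (a focal length of 4 units, 250 and 200 pixels per unit, a principal point at (320, 240)) are made up for illustration and do not come from any particular camera.

```python
def intrinsic_matrix(f, m_x, m_y, x_0, y_0):
    """Build K for non-square pixels (no skew yet).

    f        : focal length, in world units (e.g. millimetres)
    m_x, m_y : pixels per world unit along x and y
    x_0, y_0 : principal point, in pixels
    """
    alpha_x = f * m_x  # focal length in pixels along x
    alpha_y = f * m_y  # focal length in pixels along y
    return [
        [alpha_x, 0.0, x_0],
        [0.0, alpha_y, y_0],
        [0.0, 0.0, 1.0],
    ]


# Hypothetical values, chosen only so the arithmetic is easy to follow.
K = intrinsic_matrix(f=4.0, m_x=250.0, m_y=200.0, x_0=320.0, y_0=240.0)
for row in K:
    print(row)
```

With these values the two diagonal entries differ ($\alpha_x= 1000$, $\alpha_y= 800$), which is exactly the unequal scaling that non-square pixels require.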

## Skew

The final parameter we will add to our model is the skew parameter $s$.  The skew parameter models the angle between the x- and y-axes in the image plane.  In most cases, the axes are perpendicular and $s=0$.  If the x- and y-axes are not perpendicular, then $s \neq 0$.

Incorporating the skew parameter into the intrinsic camera parameter matrix, we get

$\mathbf{K}= \left[ \begin{array}{ccc} \alpha_x & s & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right]$.
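To see what the skew term does, it can help to multiply $\mathbf{K}$ by a point $\mathbf{X}_{cam}= (x_{cam}, y_{cam}, 1)^T$, using the same notation as the $\mathbf{K X}_{cam}$ example in Part 2:

$\mathbf{K X}_{cam}= \left[ \begin{array}{ccc} \alpha_x & s & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} x_{cam}\\ y_{cam}\\ 1 \end{array} \right]= \left[ \begin{array}{c} \alpha_x x_{cam} + s y_{cam} + x_0\\ \alpha_y y_{cam} + y_0\\ 1 \end{array} \right]$.

The skew only touches the x-coordinate, shifting it in proportion to $y_{cam}$.  With $s= 0$, the result reduces to the non-square-pixel model from the previous section.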

## Final Note on Degrees of Freedom

The camera projection matrix $\mathbf{P}$ is a homogeneous transform, which means that two projection matrices are equivalent if they differ only by a non-zero scale factor.  That is, $\mathbf{P}_1$ and $\mathbf{P}_2$ describe the same camera if $\mathbf{P}_2= c \mathbf{P}_1$, where $c$ is a non-zero constant.  Practically, this means that a projection matrix has 11 degrees of freedom despite having 12 entries.
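The scale equivalence can be checked numerically.  The sketch below uses a made-up projection matrix, a made-up world point, and an arbitrary non-zero constant `c`; the helper name `project` is introduced here only for illustration.  It projects the same point with $\mathbf{P}_1$ and with $c \mathbf{P}_1$ and divides out the homogeneous scale.

```python
def project(P, X_w):
    """Apply a 3x4 matrix P to a homogeneous 4-vector, then divide out the scale."""
    x, y, w = (sum(p * c for p, c in zip(row, X_w)) for row in P)
    return (x / w, y / w)


# Hypothetical projection matrix and world point, chosen only for illustration.
P1 = [
    [800.0, 0.0, 320.0, 0.0],
    [0.0, 800.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
c = -2.5  # any non-zero constant
P2 = [[c * p for p in row] for row in P1]

X_w = [0.5, -0.25, 4.0, 1.0]

pixel_1 = project(P1, X_w)
pixel_2 = project(P2, X_w)
print(pixel_1, pixel_2)
```

Both calls give the same image point, because the factor `c` multiplies all three homogeneous coordinates and cancels in the division.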

Going into a bit more depth, we can expand out our projection matrix as

$\mathbf{P}= \mathbf{K} \left[ \mathbf{R} | \mathbf{t} \right]= \left[ \begin{array}{ccc} \alpha_x & s & x_0\\ 0 & \alpha_y & y_0\\ 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{cccc} r_{11} & r_{12} & r_{13} & t_x\\ r_{21} & r_{22} & r_{23} & t_y\\r_{31} & r_{32} & r_{33} & t_z \end{array} \right]$

We can now count our degrees of freedom:

• $\mathbf{K}$ has 5 degrees of freedom.  It has 6 non-zero elements, but it is homogeneous and only defined up to scale, which is why its bottom-right element is fixed to 1.  That leaves five independent parameters:  $\alpha_x$, $\alpha_y$, $s$, $x_0$, and $y_0$.
• $\mathbf{R}$ defines a rotation matrix, and therefore only has 3 degrees of freedom (roll, pitch, and yaw).
• $\mathbf{t}$ has 3 degrees of freedom since it defines a translation in 3-dimensional space which links the camera position with the world origin.

Thus, by simple addition ($5 + 3 + 3$), the camera projection matrix $\mathbf{P}$ has 11 degrees of freedom.
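One way to make this count concrete is to write a function that takes exactly 11 numbers and returns a $3 \times 4$ matrix.  The sketch below is only one possible parameterization:  the function names are hypothetical, and composing the rotation as $\mathbf{R}_z(\text{yaw}) \mathbf{R}_y(\text{pitch}) \mathbf{R}_x(\text{roll})$ is an assumed convention that this series does not specify.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def rotation(roll, pitch, yaw):
    """R = Rz(yaw) Ry(pitch) Rx(roll); this ordering is an assumed convention."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    Rx = [[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]]
    Ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    Rz = [[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]]
    return matmul(Rz, matmul(Ry, Rx))


def projection_matrix(alpha_x, alpha_y, s, x_0, y_0,  # 5 intrinsic parameters
                      roll, pitch, yaw,               # 3 rotation parameters
                      t_x, t_y, t_z):                 # 3 translation parameters
    """Assemble P = K [R | t] from 5 + 3 + 3 = 11 free parameters."""
    K = [[alpha_x, s, x_0], [0.0, alpha_y, y_0], [0.0, 0.0, 1.0]]
    R = rotation(roll, pitch, yaw)
    Rt = [row + [t_i] for row, t_i in zip(R, (t_x, t_y, t_z))]
    return matmul(K, Rt)


# Hypothetical parameter values; with zero angles, R is the identity.
P = projection_matrix(1000.0, 800.0, 0.0, 320.0, 240.0,
                      0.0, 0.0, 0.0,
                      0.1, -0.2, 2.0)
for row in P:
    print(row)
```

The function signature has 11 arguments and the result has 12 entries, which mirrors the count above.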

And with that, we are finished with our discussion of the mathematical camera model.  I hope that you have found this useful!

This post is the second in a series of posts on representing cameras mathematically.  If you have not read it yet, or need a quick refresher, please read Part 1 here.

## Intrinsic vs. Extrinsic Camera Properties

To move the camera in the world and to move the image on the image plane, we must distinguish between properties that are intrinsic to the camera and those that are extrinsic to it.  Extrinsic properties describe the camera's position in the world, while intrinsic properties describe things like the location of the image plane origin and image scaling.

To separate out the intrinsic from the extrinsic parameters, we define the camera calibration matrix $\mathbf{K}$ which describes the camera's intrinsic parameters.  The camera calibration matrix for the simple pinhole camera described in Part 1 is

$\mathbf{K}= \left[ \begin{array}{ccc} f & 0 & 0\\ 0 & f & 0\\ 0 & 0 & 1\end{array} \right]$.

This camera calibration matrix only takes into account the focal length $f$.  But we now have a description of the intrinsic parameters that is separate from the camera's position in the world.  Let's now change the camera's position.

## Setting the Camera Location

The above diagram was introduced back in Part 1, but the projection matrix $\mathbf{P}$ was then calculated assuming that the camera center $\mathbf{C}_w$ was at the origin and the camera points along the z-axis.  We will now generalize and assume that $\mathbf{C}_w$ can be any location in the world, and that the camera can be rotated arbitrarily.

The rotation of the camera is described by a $3 \times 3$ rotation matrix $\mathbf{R}$.  Rotation matrices are a common way to mathematically describe an object's roll, pitch, and yaw in 3-dimensional space.  They are used whenever a linear model of 3D orientation is needed;  vision, robotics, and graphics are example sub-fields of computer science that use rotation matrices regularly.

To apply the rotation matrix $\mathbf{R}$ and the camera position $\mathbf{C}_w$, we must define a transformation that takes points expressed in the world frame and expresses them relative to the camera.  That is, we need the rotation and translation that relate the origin of the world frame to the camera's position and orientation in the world.  The rotation is straightforward, as it is described by the rotation matrix $\mathbf{R}$.  The translation is a bit trickier:  to find the translation to use in the projection matrix $\mathbf{P}$, we need to "correct" for the rotation.  Thus, the translation is described as

$\mathbf{t}= -\mathbf{RC}_w$,

where $\mathbf{t}$ is the resulting 3 dimensional vector.
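The sign and the rotation in this formula follow from writing out the change of coordinates.  As a short derivation, assume that $\mathbf{R}$ rotates world-frame directions into the camera's frame, treat $\mathbf{X}_w$ and $\mathbf{C}_w$ as ordinary (non-homogeneous) 3-vectors for the moment, and let $\mathbf{X}_c$ (a symbol introduced only for this derivation) be the same point expressed relative to the camera:

$\mathbf{X}_c= \mathbf{R} \left( \mathbf{X}_w - \mathbf{C}_w \right)= \mathbf{R X}_w - \mathbf{RC}_w= \mathbf{R X}_w + \mathbf{t}$.

As a sanity check, the camera center itself maps to the origin of the camera's frame, since $\mathbf{RC}_w + \mathbf{t}= \mathbf{0}$.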

Given all of this, we can solve for the projection matrix using the following equation:

$\mathbf{P}= \mathbf{K} \left[ \mathbf{R} | \mathbf{t} \right]$.
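The following Python sketch assembles $\mathbf{P}$ from the pieces above and projects one point.  All values are hypothetical:  a focal length of 2, a quarter-turn about the z-axis for $\mathbf{R}$, and a camera center at $(1, 2, -5)$.  The world point is deliberately placed straight ahead of the camera along its principal axis, so it should land on the principal point.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical intrinsics: the simple calibration matrix with focal length f.
f = 2.0
K = [[f, 0.0, 0.0], [0.0, f, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical extrinsics: a quarter-turn about the z-axis and a camera center.
R = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
C_w = [1.0, 2.0, -5.0]

# t = -R C_w
t = [-sum(r * c for r, c in zip(row, C_w)) for row in R]

# P = K [R | t]
Rt = [row + [t_i] for row, t_i in zip(R, t)]
P = matmul(K, Rt)

# A world point 5 units in front of the camera center along the z-axis.
X_w = [1.0, 2.0, 0.0, 1.0]
x, y, w = (sum(p * c for p, c in zip(row, X_w)) for row in P)
pixel = (x / w, y / w)
print(t, pixel)
```

Because this $\mathbf{K}$ has no principal point offset yet, the point on the principal axis projects to the image origin.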

## Setting the Image Location

Now that we can move the camera around to any arbitrary location and orientation in the world, we will focus on moving the principal point of the image to an arbitrary point in the image plane.  The principal point is the point in the 2D image plane that corresponds to point $\mathbf{C}_i$ in the diagram above.  This matters because, so far, the principal point is the origin, point (0, 0), of the image.  Most digital image formats put the origin in a corner of the image, but unless we move the principal point, the origin will be in the center of the image.  This must be changed!

Image plane diagram. Shows the location of the principal point and associated axes in the camera image plane ($C_{cam}$) and the x,y axes of the actual image.

To shift the origin away from the principal point, we add an offset $x_0$ along the x-axis and an offset $y_0$ along the y-axis, so that the principal point lands at $(x_0, y_0)$ in the image.  This is a fairly straightforward modification of the camera calibration matrix $\mathbf{K}$ above.  Once we make this change, we get:

$\mathbf{K}= \left[ \begin{array}{ccc} f & 0 & x_0\\ 0 & f & y_0\\ 0 & 0 & 1 \end{array} \right]$.

It can clearly be seen that this change simply adds an offset to the scaled image locations in the image plane.  To illustrate this with an example, let's solve for $\mathbf{K X}_{cam}$ where $\mathbf{X}_{cam}$ is a homogeneous 3-vector containing a point in the camera's image plane:

$\mathbf{K X}_{cam}= \left[ \begin{array}{ccc} f & 0 & x_0\\ 0 & f & y_0\\ 0 & 0 & 1 \end{array} \right] \left[ \begin{array}{c} x_{cam}\\ y_{cam}\\ 1 \end{array} \right]= \left[ \begin{array}{c} fx_{cam}+x_0\\ fy_{cam}+y_0\\ 1 \end{array} \right]$.

### Images with Origin in the Upper-Left-Hand Corner

One final thought to consider:  many digital image formats put the origin of the image in the upper left-hand corner of the image, with the y-axis pointed down.  If you are dealing with images like that, you will need to correct your camera calibration matrix as follows:

$\mathbf{K}'= \left[ \begin{array}{ccc}1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1 \end{array} \right] \mathbf{K}$.

This correction will flip the y-axis so that it will line up correctly with the image plane.
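Multiplying this out for the $\mathbf{K}$ of the previous section makes the effect explicit:

$\mathbf{K}'= \left[ \begin{array}{ccc} f & 0 & x_0\\ 0 & -f & -y_0\\ 0 & 0 & 1 \end{array} \right]$, so that $\mathbf{K}' \mathbf{X}_{cam}= \left[ \begin{array}{c} fx_{cam}+x_0\\ -(fy_{cam}+y_0)\\ 1 \end{array} \right]$.

Only the sign of the y-coordinate changes;  the x-coordinate is the same as before.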

And that is where we will leave off for today.  Come back next time for Part 3 of this series where we will add in more intrinsic camera parameters to think about.

Edit 8/16/2013:  You can find Part 3 of this series here.

Representing a camera mathematically can be a bit tricky, especially if you want to represent many aspects of the camera.  In this post, I will begin a discussion of the linear pinhole camera model.  This is the first in a series of posts on camera representation;  at the end of this series, we will have completely walked through the derivation of a linear system that describes how a point in the 3D world projects to a point on the 2D image plane.

## Homogeneous Coordinates

Before we can go any further, we need to discuss homogeneous coordinates, which are basically a linear algebra trick to simplify the writing of our equations.  To convert a normal coordinate system to a homogeneous coordinate system, an extra dimension must be added to every point in the system.  This extra coordinate is simply a scalar multiple ($s_w$ here), so an (originally 3D) world point would be $\mathbf{X}_w= (s_w x_w, s_w y_w, s_w z_w, s_w)^T= s_w ( x_w, y_w, z_w, 1)^T$ in homogeneous coordinates.  Similarly, an (originally 2D) image point will then be $\mathbf{X}_i= (s_i x_i, s_i y_i, s_i)^T= s_i (x_i, y_i, 1)^T$ in homogeneous coordinates.

It is important to note that in homogeneous coordinates, the value of the scalar multiple ($s_w$ and $s_i$ above) does not matter, since it can simply be divided out of the point.  The only requirement, in most circumstances, is that it be non-zero.  For example,

$\frac{1}{s_i}\mathbf{X}_i= \frac{1}{s_i} \left[ \begin{array}{c}s_i x_i \\ s_i y_i \\ s_i \end{array}\right]= \left[ \begin{array}{c} x_i \\ y_i \\ 1 \end{array}\right]$.
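The conversion in both directions is short enough to sketch in Python.  The helper names `to_homogeneous` and `from_homogeneous` are made up for this example, and points are stored as plain lists.

```python
def to_homogeneous(point, scale=1.0):
    """Append the extra coordinate: (x, y) -> (s*x, s*y, s)."""
    if scale == 0:
        raise ValueError("scale must be non-zero")
    return [scale * c for c in point] + [scale]


def from_homogeneous(h_point):
    """Divide out the last coordinate: (s*x, s*y, s) -> (x, y)."""
    s = h_point[-1]
    if s == 0:
        raise ValueError("cannot divide out a zero scale")
    return [c / s for c in h_point[:-1]]


# The same 2D image point written with two different scalar multiples.
a = to_homogeneous([3.0, 4.0], scale=2.0)
b = to_homogeneous([3.0, 4.0], scale=-0.5)
print(a, b)
print(from_homogeneous(a), from_homogeneous(b))
```

The two homogeneous vectors look different, but dividing out the last coordinate returns the same point in both cases.  The same helpers work unchanged for 3D world points, which become 4-vectors.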

## The Simplest Pinhole Camera

Using homogeneous coordinates, we will now build a mathematical description of a camera.  This description is a set of linear equations that map a world point $\mathbf{X}_w$ to an image point $\mathbf{X}_i$.  Since the homogeneous world point is 4-dimensional and the homogeneous image point is 3-dimensional, the overall transformation can be described by the $3 \times 4$ projection matrix $\mathbf{P}$.  The projection from the world point to its corresponding image point can then be written as $\mathbf{X}_i= \mathbf{P} \mathbf{X}_w$.

Let's now dig in and look at an example camera:

Pinhole camera diagram.

In the above diagram of a simple pinhole camera, we have a number of key items listed:

• x, y, and z are the 3D world axes.
• z is the principal axis, which is simply the axis perpendicular to the image plane.  Think of this as the direction that the camera is pointing.  Choosing z as the principal axis is a long-standing convention among vision scientists rather than a requirement.  All of the equations we derive can be re-derived to use a different axis as the principal axis if you are so inclined.
• $\mathbf{C}_w$ is the world coordinate of the camera center.  This is a 4-dimensional homogeneous point.
• $\mathbf{C}_i$ is the principal point, which is the point where the principal axis meets the image plane.  This is a 3-dimensional homogeneous point because it is on the 2D image plane, not in the 3D world.
• $f$ is the focal length, which is just the scalar distance from the camera center to the image plane.
• $\mathbf{X}_w$ is the world point being imaged.  This is a 4-dimensional homogeneous point.
• Finally, $\mathbf{X}_i$ is the point on the image plane that the world point projects to.  This is a 3-dimensional homogeneous point.

In the simplest case of the projection matrix, the camera center is at the origin of the world coordinate system, which means $\mathbf{C}_w= (0, 0, 0, 1)^T$.  Therefore, our very simple transformation of the world coordinate to the image coordinate ($\mathbf{X}_i= \mathbf{P} \mathbf{X}_w$) can be fully written out as

$\mathbf{X}_i= \left[ \begin{array}{c} s_i x_i\\ s_i y_i\\ s_i \end{array}\right]= \left[ \begin{array}{cccc} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{array} \right] \left[ \begin{array}{c} x_w\\ y_w\\ z_w\\ 1 \end{array}\right]$

where

$\mathbf{P}= \left[ \begin{array}{cccc} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0\end{array} \right]$

and

$\mathbf{X}_w= \left[ \begin{array}{c} x_w\\ y_w\\ z_w\\ 1 \end{array}\right]$.
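Reading the rows of this product off one at a time gives $s_i x_i= f x_w$, $s_i y_i= f y_w$, and $s_i= z_w$.  Dividing out the scale $s_i$, which requires $z_w \neq 0$, gives the image coordinates directly:

$x_i= \frac{f x_w}{z_w}, \quad y_i= \frac{f y_w}{z_w}$.

So the matrix form is just a linear way of writing the division by depth:  points farther from the camera (larger $z_w$) project closer to the principal point.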

That is where we will stop for today.  Come back for the next post in this series, which will explore moving the camera center to a different point in the world and moving the principal point to a different point in the image plane.

Edit 8/16/2013:  You can find Part 2 of this series here, and Part 3 of this series here.