55:148 Digital Image Processing
55:247 Image Analysis and Understanding
Chapter 9, 3D Vision (Part I): Geometry for 3D vision
Chapter 9.2 Overview:
Basics of projective geometry
- How to use 2D image information for automated measurement of the 3D
world.
- Perspective projection (central projection) describes image formation
by a pinhole camera or a thin lens.
- Basic notation and definitions.
- Consider the (n+1)-dimensional space R^{n+1} without its origin, R^{n+1} \ {0}.
- An equivalence relation can then be defined: [x_1, ..., x_{n+1}]^T is
equivalent to [x'_1, ..., x'_{n+1}]^T iff x_i = alpha x'_i for some alpha != 0
and all i.
- The space of the resulting equivalence classes, P^n, is the projective space.
- Points in the projective space are expressed in homogeneous (also projective)
co-ordinates, which we will denote in bold with a tilde.
- Such points are often shown with the number one in the rightmost position,
[x'_1, ..., x'_n, 1]^T.
- This point is equivalent to any point that differs only by nonzero
scaling.
- We are more accustomed to n-dimensional Euclidean space R^n.
- The one-to-one mapping from R^n into P^n is given by
[x_1, ..., x_n]^T -> [x_1, ..., x_n, 1]^T.
- Only the points [x_1, ..., x_n, 0]^T do not have a Euclidean counterpart.
- It is easy to demonstrate that they represent points at infinity in
a particular direction.
- Consider [x_1, ..., x_n, 0]^T as a limiting case of [x_1, ..., x_n,
alpha]^T, which is projectively equivalent to [x_{1}/alpha, ..., x_{n}/alpha,
1]^T, and assume that alpha --> 0.
- This corresponds to a point in R^n going to infinity in the direction
of the radius vector [x_1, ..., x_n]^T.
- A collineation, or projective transformation, is any mapping
P^n -> P^n defined by a regular (n+1)x(n+1) matrix A: ~y = A ~x.
- Note that the matrix A is defined up to a scale factor.
- Collineations map hyperplanes to hyperplanes; a special case is the
mapping of lines to lines that is often used in computer vision.
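To make these definitions concrete, here is a minimal NumPy sketch of
homogeneous co-ordinates and a 2D collineation (the function names and the
matrix values are illustrative, not from the text):

    import numpy as np

    def to_homogeneous(x):
        # Map a Euclidean point in R^n to homogeneous co-ordinates in P^n.
        return np.append(x, 1.0)

    def to_euclidean(x_h):
        # Map back to R^n; undefined for points at infinity (last co-ordinate 0).
        return x_h[:-1] / x_h[-1]

    # A 2D collineation (homography): any regular 3x3 matrix, defined up to scale.
    A = np.array([[1.0,   0.2,  5.0],
                  [0.0,   1.1, -3.0],
                  [0.001, 0.0,  1.0]])

    u = to_homogeneous(np.array([10.0, 20.0]))
    v = A @ u                      # ~v = A ~u in P^2
    print(to_euclidean(v))         # same Euclidean point for any nonzero
    print(to_euclidean(2.5 * v))   # scaling of ~v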
The single perspective camera
- Consider the case of one camera with a thin lens (simplest approximation).
- The pinhole camera performs perspective projection.
- The geometry of the device is depicted in the figure above; the plane at
the bottom is the image plane pi to which the real world projects,
and the vertical dotted line is the optical axis.
- The lens is positioned perpendicularly to the optical axis at the focal
point C (also called the optical center).
- The focal length f (sometimes called the principal axis distance)
is a parameter of the lens.
- The projection is performed by an optical ray (a light ray) reflected
from a scene point X.
- The optical ray passes through the optical center C and hits the image
plane at the point U.
- Let's define four co-ordinate systems:
- The world Euclidean co-ordinate system (subscript _w) has origin
at the point O_w.
- Points X, U are expressed in the world co-ordinate system.
- The camera Euclidean co-ordinate system (subscript _c) has the
focal point C = O_c as its origin.
- The co-ordinate axis Z_c is aligned with the optical axis and points
away from the image plane.
- There is a unique relation between world and camera co-ordinate systems.
- We can align the world to camera co-ordinates by performing a Euclidean
transformation consisting of a translation t and a rotation R.
- The image Euclidean co-ordinate system (subscript _i) has axes
aligned with the camera co-ordinate system, with X_i, Y_i lying in the
image plane.
- The image affine co-ordinate system (subscript _a) has co-ordinate
axes u, v, w, and origin O_i coincident with the origin of the image Euclidean
co-ordinate system.
- The axes w, v are aligned with the axes Z_i, X_i, but the axis u may
have a different orientation to the axis Y_i.
- The reason for introducing the image affine co-ordinates is that in
general, the pixel axes need not be perpendicular and the axes may be scaled
differently.
- A camera performs a linear transformation from the 3D projective space
P^3 to the 2D projective space P^2.
- A scene point X is expressed in the world Euclidean co-ordinate system
as a 3x1 vector.
- To express the same point in the camera Euclidean co-ordinate system,
i.e. X_c, we have to rotate it as specified by the matrix R and translate
it by subtracting vector t.
- The point X_c is projected to the image plane pi as point U_c.
- The x and y co-ordinates of the projected point can be derived from
the similar triangles illustrated in Figure 9.4.
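Written out explicitly (a reconstruction assuming the image plane lies at
distance f behind the optical center C, consistent with the negative signs in
the substitutions introduced later in this section):

\[ u_c = \frac{-f\,x_c}{z_c}\,, \qquad v_c = \frac{-f\,y_c}{z_c}\,. \]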
- It remains to derive where the projected point U_c is positioned
in the image affine co-ordinate system, i.e. to determine the co-ordinates
which the real camera actually delivers.
- The image affine co-ordinate system, with origin at the top left corner
of the image, represents a shear and rescaling (often called the aspect
ratio) of the image Euclidean co-ordinate system.
- The principal point U_0 -- sometimes called the center of the image
in camera calibration procedures -- is the intersection of the optical axis
with the image plane pi.
- The principal point U_0 is expressed in the image affine co-ordinate
system as U_0a=[u_0,v_0,0]^T.
- The projected point can be represented in the 2D image plane pi in
homogeneous co-ordinates as ~u = [U,V,W]^T, and its 2D Euclidean counterpart
is u = [u,v]^T = [U/W,V/W]^T.
- Homogeneous co-ordinates allow us to express the affine transformation
as a multiplication by a single 3x3 matrix where unknowns a, b, c describe
the shear together with scaling along co-ordinate axes, and u_0 and v_0
give the affine co-ordinates of the principal point in the image.
- We aim to collect all constants in this matrix, sometimes called the
camera calibration matrix K.
- Since homogeneous co-ordinates are in use, the equation can be multiplied
by any nonzero constant; thus we multiply by z_c to remove this parameter.
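A plausible reconstruction of equation (9.6) from the definitions above (the
sign convention is inferred from the substitutions alpha_u = -fa, etc., given
below):

\[ z_c\,\tilde{\mathbf{u}} = \mathsf{K}\,\mathbf{X}_c
   = \mathsf{K}\,\mathsf{R}\,(\mathbf{X}_w - \mathbf{t}), \qquad
   \mathsf{K} = \begin{bmatrix} -fa & -fb & u_0 \\ 0 & -fc & v_0 \\ 0 & 0 & 1 \end{bmatrix}. \]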
- The extrinsic parameters of the camera depend on the orientation
of the camera Euclidean co-ordinates with respect to the world Euclidean
co-ordinate system (see Figure 9.3).
- This relation is given in equation (9.6) by matrices R and t.
- The rotation matrix R expresses three elementary rotations of the co-ordinate
axes -- rotations about the axes x, y, and z are termed pan,
tilt, and roll, respectively.
- The translation vector t gives three elements of the translation of
the origin of the world co-ordinate system with respect to the camera co-ordinate
system.
- Thus there are six extrinsic camera parameters; three rotations
and three translations.
- The camera calibration matrix K is upper triangular, as can be seen
from equation (9.6). The coefficients of this matrix are called the intrinsic
parameters of the camera, and describe the specific camera independently
of its position and orientation in space.
- If the intrinsic parameters are known, a metric measurement can
be performed from images.
- Assume momentarily the simple case in which the world co-ordinates
coincide with the camera co-ordinates, meaning that X_w = X_c.
- Then equation (9.6) simplifies to two separate equations for u and v:

\[ u = -fa\,\frac{x_c}{z_c} - fb\,\frac{y_c}{z_c} + u_0
     = \alpha_u\,\frac{x_c}{z_c} + \alpha_{shear}\,\frac{y_c}{z_c} + u_0\,, \qquad
   v = -fc\,\frac{y_c}{z_c} + v_0
     = \alpha_v\,\frac{y_c}{z_c} + v_0\,, \]

where we make the substitutions alpha_u = -fa, alpha_shear = -fb, and
alpha_v = -fc.
- Thus we have five intrinsic parameters, all given in pixels.
- The formulae also give the interpretation of the intrinsic parameters:
- alpha_u represents scaling along the u axis, measuring f in pixels along
the u axis;
- alpha_v similarly specifies f in pixels along the v axis;
- alpha_shear measures, in pixels along the v axis, how much the u axis is
slanted away from the Y_i axis (the shear of the pixel grid).
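With these substitutions, the calibration matrix takes the familiar upper
triangular form (presumably the form introduced in equation (9.8)):

\[ \mathsf{K} = \begin{bmatrix} \alpha_u & \alpha_{shear} & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}. \]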
- This completes the description of the extrinsic and intrinsic camera
parameters.
- Returning to the general case given by equation (9.6): if we express the
scene point in homogeneous co-ordinates ~X_w = [X_w, 1]^T, we can write the
perspective projection using a single 3x4 matrix,

\[ \tilde{\mathbf{u}} \simeq \mathsf{M}\,\tilde{\mathbf{X}}_w\,, \qquad
   \mathsf{M} = \bigl[\,\mathsf{K}\mathsf{R} \;\big|\; -\mathsf{K}\mathsf{R}\,\mathbf{t}\,\bigr]\,, \]

where ~X_w is the 3D scene point in homogeneous co-ordinates.
- The leftmost 3x3 submatrix describes a rotation and the rightmost column
a translation.
- The delimiter | denotes that the matrix is composed of two submatrices.
- The matrix M is called the projective matrix (also camera
matrix).
- It can be seen that the camera performs a linear projective transformation
from the 3D projective space P^3 to the 2D projective plane P^2.
- Introduction of projective space and homogeneous co-ordinates made
the expressions simpler.
- Instead of the nonlinear equation (9.4), we obtained the linear equation
(9.9).
- The 3x3 submatrix of the projective matrix M consisting of three leftmost
columns is regular, i.e. its determinant is non-zero.
- The scene point ~X_w is expressed up to scale in homogeneous co-ordinates
(recall that the projection takes place in projective space), and thus
all matrices alpha M are equivalent for alpha not equal to 0.
- Sometimes the simplest form of the projection matrix, M = [I | 0], is used.
- This special matrix corresponds to the normalized camera co-ordinate
system, in which the specific parameters of the camera can be ignored.
- This is useful when the properties of stereo and motion are to be explained
in a simple way and independently of the specific camera.
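Putting the single-camera model together, here is a minimal NumPy sketch of
the projection pipeline (the numeric values of K, R, and t are made up for
illustration):

    import numpy as np

    # Illustrative intrinsics and pose (not from the text).
    K = np.array([[800.0,   0.5, 320.0],
                  [  0.0, 780.0, 240.0],
                  [  0.0,   0.0,   1.0]])
    R = np.eye(3)                  # camera axes aligned with the world
    t = np.array([0.0, 0.0, -5.0])

    # Projection matrix M = [KR | -KRt].
    M = K @ R @ np.hstack([np.eye(3), -t[:, None]])

    X_w = np.array([0.2, -0.1, 3.0, 1.0])  # scene point, homogeneous
    u = M @ X_w                            # ~u = M ~X_w
    print(u[:2] / u[2])                    # Euclidean pixel co-ordinates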
An overview of single camera calibration
- The calibration of one camera is a procedure that determines the numeric
values of the camera calibration matrix K (equation 9.6) or the
projective matrix M (equation 9.9).
- I. Intrinsic camera parameters only
- If the camera is calibrated, and a point in the image is known, the
corresponding line (ray) in camera-centered space is uniquely determined.
- II. Intrinsic and extrinsic parameters.
- Basic approaches to the calibration of a single camera.
- A set of n non-degenerate (not co-planar) points lies in the 3D world,
and the corresponding 2D image points are known.
- Each correspondence between a 3D scene point and its 2D image point provides
one equation (9.11).
- The solution is obtained from an over-determined system of linear equations.
- The main disadvantage is that the scene must be known, for which special
calibration objects are often used.
- More views of the scene are needed to calibrate the camera.
- The intrinsic camera parameters will not change for different views,
and the correspondence between image points in different views must be
established.
- A. Known camera motion:
- Both rotation and translation known:
- This general case of arbitrary known motion from one view to another
has been solved.
- Pure rotation:
- If camera motion is restricted to pure rotation, the solution can be
found.
- Pure translation:
- The linear solution (pure translation) can be found.
- B. Unknown camera motion:
- No a priori knowledge about the motion is available; this is sometimes called camera self-calibration.
- At least three views are needed and the solution is nonlinear.
- Calibration from an unknown scene is still considered numerically hard,
and will not be considered here.
Calibration of one camera from a known scene
- Typically this is a two-stage process.
- 1. The projection matrix M is estimated from the co-ordinates of points
with known scene positions.
- 2. The extrinsic and intrinsic parameters are estimated from M.
- (The second step is not always needed -- the case of stereo vision
is an example.)
- To obtain M, observe that each known scene point X=[x,y,z]^T and its
corresponding 2D image point [u,v]^T give one equation (9.11) - we seek
the numerical values m_ij in the 3x4 projection matrix M.
- Expanding equation (9.11) yields

\[ u\,(m_{31}x + m_{32}y + m_{33}z + m_{34}) = m_{11}x + m_{12}y + m_{13}z + m_{14}\,, \]
\[ v\,(m_{31}x + m_{32}y + m_{33}z + m_{34}) = m_{21}x + m_{22}y + m_{23}z + m_{24}\,. \]

- Thus we obtain two linear equations, each in the 12 unknowns m_11, ...
, m_34, for each known corresponding scene and image point.
- If n such points are available, we can write equations (9.14) as
a 2n x 12 matrix equation (9.15).
- The matrix M actually has only 11 unknown parameters due to the unknown
scaling factor, since homogeneous co-ordinates were used.
- To generate a solution, at least six known corresponding scene and
image points are required.
- Typically, more points are used and the over-determined equation (9.15)
is solved using a robust least squares method to correct for noise in measurements.
- The result of the calculation is the projective matrix M.
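A minimal sketch of this estimation step, assuming NumPy (the function name
and plain SVD least squares are illustrative; a robust variant would reweight
or discard outlier correspondences):

    import numpy as np

    def estimate_projection_matrix(X_world, u_image):
        # X_world: n x 3 array of scene points; u_image: n x 2 array of
        # image points.  Each correspondence contributes two rows of the
        # 2n x 12 homogeneous system A m = 0 (equations 9.14-9.15).
        n = X_world.shape[0]
        A = np.zeros((2 * n, 12))
        for i in range(n):
            x, y, z = X_world[i]
            u, v = u_image[i]
            A[2 * i]     = [x, y, z, 1, 0, 0, 0, 0, -u*x, -u*y, -u*z, -u]
            A[2 * i + 1] = [0, 0, 0, 0, x, y, z, 1, -v*x, -v*y, -v*z, -v]
        # Least-squares solution up to scale: the right singular vector
        # belonging to the smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 4)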
- To separate the extrinsic parameters (the rotation R and translation
t) from the estimated projection matrix M, recall that the projection matrix
can be written as M = [A | b], where A = K R and b = -K R t.
- Determining the translation vector is easy; having substituted A = K R
in equation (9.16), we can write t = -A^-1 b.
- To determine R, note that the calibration matrix is upper triangular
and the rotation matrix is orthogonal.
- The matrix factorization called RQ decomposition (a close relative of QR
decomposition, with the triangular factor on the left) will decompose
A into a product of two such matrices, and hence recover K and R.
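A minimal sketch of this decomposition step, assuming SciPy's
scipy.linalg.rq (the sign fix-up and normalization conventions are common
practice, not from the text):

    import numpy as np
    from scipy.linalg import rq

    def decompose_projection_matrix(M):
        # Split M = [A | b] into intrinsics K, rotation R, translation t.
        A, b = M[:, :3], M[:, 3]
        K, R = rq(A)                 # A = K @ R, K upper triangular, R orthogonal
        # Flip signs so K has a positive diagonal (R stays orthogonal).
        S = np.diag(np.sign(np.diag(K)))
        K, R = K @ S, S @ R
        K = K / K[2, 2]              # remove the overall scale
        t = -np.linalg.solve(A, b)   # from b = -A t
        return K, R, t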
- So far, we have assumed that the lens performs ideal central projection,
as the pinhole camera does.
- This is not the case with real lenses.
- A typical lens introduces a distortion of several pixels.
- A human observer does not notice it when looking at a general scene.
- If an image is to be used for measurement, the distortion from
the idealized pinhole model should be compensated for.
- When calibrating a real camera, the more realistic model of
the lens includes two distortion components.
- First, the radial distortion bends the ray more or less than
in the ideal case.
- Second, the decentering displaces the principal point from the
optical axis.
- Recall that the five intrinsic camera parameters were introduced in
equation (9.8).
- Here, the focal length f of the lens is replaced by a parameter called
the camera constant.
- Ideally, the focal length and the camera constant should be the same.
- In reality, this is true when the lens is focused at infinity. Otherwise,
the camera constant is slightly less than the focal length.
- Similarly, the coordinates of the principal point can slightly change
from the ideal intersection of the optical axis with the image plane.
- The main trick of intrinsic parameter calibration is to observe
a known calibration image with some regular pattern, e.g., blobs or lines
covering the whole image.
- The observed distortion of the pattern allows the intrinsic
camera parameters to be estimated.
- Both the radial distortion and the decentering can be treated in most
cases as rotationally symmetric.
- They are often modeled as polynomials.
- Let u, v denote the correct image coordinates, and ~u, ~v denote the
measured, uncorrected image coordinates derived from the actual pixel
coordinates x, y and the estimate ^u_0, ^v_0 of the position of the principal
point.
- The correct image coordinates u, v are obtained if the compensations
for errors delta u, delta v are added to the measured uncorrected image
coordinates ~u, ~v.
- The compensations for errors are often modeled as polynomials in even
powers of r to secure the rotational symmetry property.
- Typically, terms up to degree six at most are considered (equation 9.20),
where u_p, v_p are the corrections to the position of the principal point.
- Here r^2 is the square of the radial distance from the center of the
image.
- Recall that ^u_0, ^v_0 were used in equation (9.18).
- The u_p, v_p are corrections to ^u_0, ^v_0 that can be applied after
calibration to get the proper position of the principal point.
- Let's visualize the typical radial distortion of a lens for the simple
second-order model, as a special case of equation (9.20): no decentering
is assumed and a second-order polynomial approximation is considered.
- The original image was a square pattern.
- The distorted images are shown in the figure.
- The left part of the figure shows the pillow-like distortion (minus
sign in equation (9.23)), whereas the right part depicts the barrel-like
distortion corresponding to the plus sign.
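A minimal sketch of the second-order model applied to a square grid, as in
the figure (the coefficient name k1 and the centering convention are
assumptions, not from the text):

    import numpy as np

    def radial_map(u, v, k1):
        # Displace each point radially by the factor (1 + k1 * r^2), with
        # r measured from the principal point.  Applied to a square grid,
        # one sign of k1 bows the edges inward (pillow-like), the other
        # bows them outward (barrel-like); which sign gives which depends
        # on whether the model is written in the distorting or the
        # correcting direction.
        r2 = u ** 2 + v ** 2
        return u * (1.0 + k1 * r2), v * (1.0 + k1 * r2)

    g = np.linspace(-1.0, 1.0, 9)
    uu, vv = np.meshgrid(g, g)          # square test pattern
    pu, pv = radial_map(uu, vv, +0.2)   # edges bow inward (pillow-like)
    bu, bv = radial_map(uu, vv, -0.2)   # edges bow outward (barrel-like)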
- There are more complicated lens models that cover tangential distortions,
modeling such effects as lens decentering.
Two cameras, stereopsis
- Calibration of one camera and knowledge of the co-ordinates of one
image point allows us to determine a ray in space uniquely.
- If two calibrated cameras observe the same scene point X, its 3D co-ordinates
can be computed as the intersection of two such rays.
- This is the basic principle of stereo vision that typically
consists of three steps:
- Camera calibration;
- Establishing point correspondences between pairs of points from the
left and the right images
- Reconstruction of 3D co-ordinates of the points in the scene.
- The geometry of a system with two cameras is shown in the figure.
- The line connecting optical centers C and C' is called the baseline.
- Any scene point X observed by the two cameras and the two corresponding
rays from optical centers C, C' define an epipolar plane.
- This plane intersects the image planes in the epipolar lines
l, l'.
- When the scene point X moves in space, all epipolar lines pass through
epipoles e, e' - the epipoles are the intersections of the baseline
with the respective image planes.
- Let u, u' be projections of the scene point X in the left and right
images respectively.
- The ray CX represents all possible positions of the scene point X
consistent with its projection u in the left image, and it projects into the
epipolar line l' in the right image.
- The point u' in the right image that corresponds to the projected point
u in the left image must thus lie on the epipolar line l' in the right
image.
- This geometry provides a strong epipolar constraint that reduces
the dimensionality of the search space for a correspondence between u and
u' in the right image from 2D to 1D.
- A special arrangement of the stereo cameras, called the canonical
configuration, is often used.
- The baseline is aligned to the horizontal co-ordinate axis, the optical
axes of the cameras are parallel, the epipoles move to infinity, and the
epipolar lines in the image planes are parallel.
- For this configuration, the computation is slightly simpler.
- It is easier to move along horizontal lines than along general lines.
- The geometric transformation that changes a general camera configuration
with nonparallel epipolar lines to the canonical one is called image rectification.
- There are practical problems with the canonical stereo configuration,
which adds unnecessary technical constraints to the vision hardware.
- If high precision of reconstruction is an issue, it is better to use
general stereo geometry since rectification induces resampling that causes
loss of resolution.
- Let's consider an easy canonical configuration and recover depth.
- The optical axes are parallel, which leads to the notion of disparity
that is often used in stereo literature.
- In the figure, we have a bird's eye view of two cameras with parallel
optical axes separated by a distance 2h.
- The figure shows the images the cameras provide, together with one scene
point P with co-ordinates (x, y, z) and this point's projections onto the
left (P_l) and right (P_r) images.
- The co-ordinates have the z axis representing distance from the cameras
(at which z=0) and the x axis representing horizontal distance (the y co-ordinate,
into the page, does not therefore appear).
- x=0 will be the position midway between the cameras; each image will
have a local co-ordinate system (x_l on the left, x_r on the right) which
for the sake of convenience we measure from the center of the respective
images; that is, a simple translation from the global x co-ordinate.
- P_l will be used simultaneously to represent the position of the projection
of P onto the left image, and its x_l co-ordinate - its distance from the
center of the left image (and similarly for P_r).
- It is clear that there is a disparity between x_l and x_r as a result
of the different camera positions (that is, | P_l - P_r | > 0); we can
use elementary geometry to deduce the z co-ordinate of P.
- The segments P_l C_l and C_l P are the hypotenuses of similar right-angled
triangles.
- Since h and f are (positive) numbers, z is a positive co-ordinate, and x,
P_l, P_r are co-ordinates that may be positive or negative, we can then write
(keeping in mind the pinhole inversion of the image)

\[ \frac{P_l}{f} = -\,\frac{x + h}{z}\,, \]

- and similarly from the right hand side of Figure 9.10

\[ \frac{P_r}{f} = -\,\frac{x - h}{z}\,. \]

- Eliminating x from these equations gives

\[ z = \frac{2hf}{P_r - P_l}\,. \]
- Notice in this equation that P_r - P_l is the detected disparity
in the observations of P.
- If P_r - P_l = 0, then z = infinity.
- Zero disparity indicates the point is (effectively) at an infinite
distance from the viewer.
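A minimal sketch of this depth recovery (the function name is illustrative):

    def depth_from_disparity(P_l, P_r, h, f):
        # Canonical stereo configuration: cameras 2h apart, focal length f.
        # Depth follows from z = 2hf / (P_r - P_l).
        disparity = P_r - P_l
        if disparity == 0:
            return float("inf")   # zero disparity: point at infinity
        return 2.0 * h * f / disparity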
The geometry of two cameras. The fundamental matrix
Relative motion of the camera; the essential matrix
Estimation of a fundamental matrix from image point correspondences
Applications of the epipolar geometry in vision
Three and more cameras
Stereo correspondence algorithms
Active acquisition of range images
Last Modified: April 21, 1997