55:148 Digital Image Processing
55:247 Image Analysis and Understanding
Chapter 9
3D Vision: 3D vision tasks
Chapter 9.1 Overview:
- 3D vision using intensity images as input is difficult, for four main reasons:
- I:
- The imaging system of a camera and the human eye performs perspective projection, which
leads to considerable loss of information.
- All points along a line pointing from the optical center towards a scene point are
projected to a single image point.
- We are interested in the inverse task that aims to derive 3D co-ordinates from image
measurements - this task is underconstrained, and some additional information must be
added to solve it unambiguously.
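The loss of depth under perspective projection can be illustrated with a minimal sketch; the focal length and point coordinates below are made-up values chosen only for illustration:

```python
# Pinhole (perspective) projection: (X, Y, Z) -> (f*X/Z, f*Y/Z).
# All scene points on a ray through the optical centre map to the
# same image point, so depth cannot be recovered from one image.

def project(point, f=1.0):
    """Perspective projection of a 3D point (X, Y, Z) with Z > 0."""
    x, y, z = point
    return (f * x / z, f * y / z)

near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)   # twice as far along the same ray

print(project(near))    # (0.25, 0.5)
print(project(far))     # (0.25, 0.5) -- the same image point
```

Any point lambda * (1, 2, 4) with lambda > 0 projects to that same pixel, which is why additional information (a second view, known geometry, shading, ...) is needed to invert the projection unambiguously.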
- II:
- The relationship between image intensity and the 3D geometry of the corresponding scene
point is very complicated.
- The pixel intensity depends on surface reflectivity parameters, surface orientation,
type and position of illuminants, and the position of the viewer.
- Attempting to learn about 3D geometry, surface orientation and depth represents another
ill-conditioned task.
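This ambiguity appears even in the simplest reflectance model. The sketch below uses the Lambertian model, I = albedo * max(0, n . l); the particular albedos, normals and light direction are made-up values for illustration:

```python
import math

def lambertian(albedo, normal, light):
    """Pixel intensity under the Lambertian model: I = albedo * max(0, n . l)."""
    n_dot_l = sum(a * b for a, b in zip(normal, light))
    return albedo * max(0.0, n_dot_l)

light = (0.0, 0.0, 1.0)                  # light along the viewing direction
s = 1.0 / math.sqrt(2.0)

# Two very different surface patches produce the same pixel intensity:
i1 = lambertian(0.6, (0.0, 0.0, 1.0), light)               # darker patch facing the light
i2 = lambertian(0.6 * math.sqrt(2.0), (0.0, s, s), light)  # brighter patch tilted 45 degrees

print(i1, i2)   # both ~0.6 -- indistinguishable from a single intensity value
```

A single pixel value therefore cannot separate reflectivity from surface orientation without further constraints.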
- III:
- The mutual occlusion of objects in the scene, and even self-occlusion of one object,
further complicates the vision task.
- IV:
- The field of 3D vision is young and still developing.
- No unified theory is available.
3D vision tasks
- Marr defines 3D vision as
- From an image (or a series of images) of a scene, derive an accurate three-dimensional
geometric description of the scene and quantitatively determine the properties of the
object in the scene
- 3D vision is formulated as a 3D object reconstruction task, i.e. description of the 3D
shape in a co-ordinate system independent of the viewer.
- One rigid object, whose separation from the background is straightforward, is assumed,
and the control of the process is strictly bottom-up from an intensity image through
intermediate representations.
- Wechsler stresses the control principle of the process:
- The visual system casts most visual tasks as minimization problems and solves them using
distributed computation and enforcing nonaccidental, natural constraints.
- Computer vision is seen as a parallel distributed representation, plus parallel
distributed processing, plus active perception.
- Understanding is carried out in the perception - control - action cycle.
- Aloimonos asks what principles might enable us to:
- (i) understand vision of living organisms,
- (ii) equip machines with visual capabilities.
- Aloimonos and Shulman see the central problem of computer vision as:
- ... from one or a sequence of images of a moving or stationary object or scene taken
by a monocular or polynocular moving or stationary observer, to understand the
object or the scene and its three-dimensional properties.
- System theory provides a general framework that allows us to treat understanding of
complex phenomena using the machinery of mathematics.
- The objects and their properties need to be characterized, and a formal mathematical
model is typically used for this abstraction.
- The model is specified by a relatively small number of parameters, which are typically
estimated from the (image) data (e.g. algebraic or differential equations).
- Some authors propose object recognition systems in which 3D models are avoided.
- The priming-based (geons) approach is based on the idea that 3D shapes can be inferred
directly from 2D drawings - the qualitative features are called geons.
- Constituents of a single object (geons) and their spatial arrangement are pointers to a
human memory and are used in the recognition process.
- The alignment of 2D views is another option - lines or points in 2D views can be
used for aligning different 2D views.
- The correspondence of points, lines or other features must be solved first. A linear
combination of views has been used for recognition. A related option is image-based
scene representation, in which a collection of images with established correspondences
is stored instead of a 3D model.
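The linear-combination-of-views idea can be sketched for the simple case of orthographic projection and rotation about a single axis; the point coordinates and rotation angles below are made up for illustration:

```python
import math

# Under orthographic projection, the x-coordinates of points in a novel
# view are a linear combination of coordinates measured in stored model
# views. Rotations about the y-axis keep the algebra short here.

def view_x(points, theta):
    """x-coordinates after rotating the object by theta about the y-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x + s * z for (x, y, z) in points]

points = [(1.0, 0.5, 2.0), (-0.3, 1.2, 0.7), (2.1, -0.4, 1.5)]

theta2, theta3 = 0.4, 0.9        # model view 2 and the novel view 3
x1 = view_x(points, 0.0)         # model view 1 (unrotated)
x2 = view_x(points, theta2)      # model view 2
x3 = view_x(points, theta3)      # novel view to be predicted

# Coefficients follow from the rotation angles alone:
a2 = math.sin(theta3) / math.sin(theta2)
a1 = math.cos(theta3) - a2 * math.cos(theta2)

predicted = [a1 * u + a2 * v for u, v in zip(x1, x2)]
# predicted matches x3 up to floating-point error
```

The point is that the novel view is predicted purely from stored 2D measurements, with no explicit 3D model.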
Marr's theory
- Marr was critical of earlier work that, while successful in limited domains or image
classes, was either empirical or unduly restrictive of the images with which it could
deal.
- Marr proposed a more abstract and theoretical approach that permitted work to be put
into a larger context.
- Marr's work was restricted to 3D interpretation of single, static scenes.
- Marr proposed three levels:
- Computational theory:
- describes what the device is supposed to do; what information it provides from other
information provided as input. It should also describe the logic of the strategy that
performs this task.
- Representation and algorithm:
- address precisely how the computation may be carried out; including information
representations and algorithms to manipulate them.
- Implementation:
- includes physical realization of the algorithm; programs and hardware.
- Example ... an after-image (induced by staring at a light bulb) is a physical effect,
while the mental confusion provoked, e.g., by the well-known Necker cube is at a different
level.
- The primal sketch
- The primal sketch aims to capture, in as general a way as possible, the significant
intensity changes in an image.
- Such changes have been referred to as `edges' but Marr makes the observation that this
word implies a physical meaning that cannot at this stage be inferred.
- The first stage is to locate these changes at a range of scales, after which zero
crossings of the second derivative are determined for each scale of blur.
- Zero crossing evidence in the same locality at many scales provides strong evidence of a
genuine physical feature in the scene.
- Zero crossings are grouped, according to their location and orientations, to provide
information about tokens in the image (edges, bars and blobs) that may help provide later
information about (3D) orientation of scene surfaces.
- The grouping phase, paying attention to the evidence from various scales, extracts
tokens that are likely to represent surfaces in the real world.
- (There is strong evidence for the existence of the various components used to build the
primal sketch in the human visual system - we too engage in detection of features at
various scales, the location of sharp intensity changes and their subsequent grouping into
tokens.)
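A 1D sketch of this idea: smooth an ideal step edge with Gaussians of increasing width and locate zero crossings of the discrete second derivative; the crossing stays at the edge position across scales. The signal and scale values are made up for illustration:

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel of standard deviation sigma."""
    r = int(3 * sigma) + 1
    k = [math.exp(-i * i / (2 * sigma * sigma)) for i in range(-r, r + 1)]
    total = sum(k)
    return [v / total for v in k]

def convolve(signal, kernel):
    """Convolution with border values clamped (replicated)."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def zero_crossings(signal):
    """Indices where the discrete second derivative changes sign."""
    lap = [signal[i - 1] - 2 * signal[i] + signal[i + 1]
           for i in range(1, len(signal) - 1)]
    return {i for i in range(1, len(lap)) if lap[i - 1] * lap[i] < 0}

step = [0.0] * 30 + [1.0] * 30      # ideal intensity edge between indices 29 and 30
crossings = [zero_crossings(convolve(step, gaussian_kernel(s)))
             for s in (1.0, 2.0, 3.0)]
# At every scale a zero crossing is found near the true edge position.
```

Agreement of zero crossings across scales is exactly the evidence the primal sketch uses to argue for a genuine physical edge rather than noise.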
- The 2.5D sketch
- The 2.5D sketch reconstructs the relative distances from the viewer of surfaces detected
in the scene, and may be called a depth map.
- The output of this phase uses as input the features detected in the preceding phase.
- It does not provide a 3D reconstruction.
- It is midway between 2D and 3D representations.
- 3D representation
- The Marr paradigm overlaps here with top-down, model-based approaches.
- This step represents a transition to an object centered co-ordinate system, allowing
object descriptions to be viewer independent.
- This is the most difficult phase, and successful implementation remains remote, especially
compared to the success seen with the derivation of the primal and 2.5D sketches.
- The Marr paradigm advocates a set of relatively independent modules; the low-level
modules aim to recover a meaningful description of the input intensity image, the
middle-level modules use different cues such as intensity changes, contours, texture,
motion to recover shape or location in space.
- The Marr paradigm is a nice theoretic framework, but unfortunately does not lead to
successful vision applications performing, e.g., recognition and navigation tasks.
- It was shown later that most low-level and middle-level tasks are ill-posed, with no
unique solution.
- One popular way, developed in the eighties, to make such tasks well-posed is
regularization: a constraint requiring continuity and smoothness of the solution is added.
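A minimal sketch of regularization: Tikhonov-style smoothing of a noisy 1D signal, where the added smoothness term makes an otherwise noise-dominated reconstruction stable. The signal, noise pattern and smoothness weight below are made-up illustrative values:

```python
# Recover u from noisy data d by minimizing
#   sum_i (u_i - d_i)^2  +  lam * sum_i (u_{i+1} - u_i)^2
# The second (smoothness) term is the regularizing constraint.

true = [i / 19 for i in range(20)]                        # smooth underlying signal
noise = [0.3 if i % 2 == 0 else -0.3 for i in range(20)]  # alternating perturbation
data = [t + n for t, n in zip(true, noise)]               # noisy observations

lam = 2.0
u = list(data)                       # endpoints stay fixed at the data values
for _ in range(500):                 # Jacobi iterations on the normal equations:
    new = u[:]                       # u_i (1 + 2 lam) = d_i + lam (u_{i-1} + u_{i+1})
    for i in range(1, len(u) - 1):
        new[i] = (data[i] + lam * (u[i - 1] + u[i + 1])) / (1 + 2 * lam)
    u = new

def sq_err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(sq_err(data, true), sq_err(u, true))   # the regularized estimate is much closer
```

Holding the endpoints at the data values is a simplification to keep the sketch short; the essential point is that the smoothness penalty pulls the solution toward the underlying signal instead of reproducing the noise.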
Other vision paradigms: Active and purposive vision
- When consistent geometric information has to be explicitly modeled (as for manipulation
of the object), an object-centered co-ordinate system seems to be appropriate.
- Two schools are trying to explain the vision mechanism:
- The first and older one tries to use explicit metric information in the early stages of
the visual task (lines, curvatures, normals, etc.).
- Geometry is typically extracted in a bottom-up fashion without any information about the
purpose of this representation.
- The output is a geometric model.
- The second and younger school does not extract metric (geometric) information from
visual data until needed for a specific task.
- Data are collected in a systematic way to ensure all the object's features are present
in the data, but may remain uninterpreted until a specific task is involved.
- A database or collection of intrinsic images (or views) is the model.
- Many traditional computer vision systems and theories capture data with cameras with
fixed characteristics, whereas active perception and purposive vision adapt the data
acquisition to the task at hand.
- Active vision system ... characteristics of the data acquisition are dynamically
controlled by the scene interpretation.
- Many visual tasks tend to be simpler if the observer is active and controls its visual
sensors.
- The controlled eye (or camera) movement is an example.
- If there is not enough data to interpret the scene, the camera can look at it from
another viewpoint.
- Active vision is intelligent data acquisition, controlled by the measured, partially
interpreted scene parameters and their errors.
- The active approach can make most ill-posed vision tasks tractable.
- There is no established theory that provides a mathematical (computational) model
explaining the understanding aspects of human vision.
- Two recent developments towards new vision theory are:
- Qualitative vision
- that looks for a qualitative description of objects or scenes.
- The motivation is not to represent geometry that is not needed for qualitative
(non-geometric) tasks or decisions.
- Qualitative information is more invariant to various unwanted transformations (e.g.
slightly differing viewpoints) or to noise than quantitative information.
- Qualitativeness (or invariance) enables interpretation of observed events at several
levels of complexity.
- Purposive paradigm
- The key question is to identify the goal of the task, the motivation being to ease the
task by making explicit just that piece of information that is needed.
- Collision avoidance for autonomous vehicle navigation is an example where precise shape
description is not needed.
- The approach may be heterogeneous and a qualitative answer may be sufficient in some
cases.
- The paradigm does not yet have a solid theoretical basis, but the study of biological
vision is a rich source of inspiration.
Last Modified: April 20, 1997