Video compression is composed of ten basic processes, six of which result in data reduction. The first four, filtering, color conversion, digitizing and scaling are essentially preprocessing steps. Though they are not considered compression, they can have a dramatic effect on data reduction. The processes that directly result in compression are data transforms, quantization, compaction encoding, interframe compression using motion compensation and predictive coding.
In MPEG, compression results from intraframe compression in which redundancies among pixels in the same video frame are exploited to reduce the amount of data and interframe compression in which similarities between pixels in adjacent frames are exploited to compress the data. In MPEG, most of the compression is due to interframe compression.
All points along the black-to-white diagonal of the cube are shades of gray
or luminance, called Y. The eye is very sensitive to changes in luminance but
not so sensitive to changes in color. Thus, instead of storing video information
in terms of their red, green and blue components, two chrominance values called
U and V and the luminance value, Y, are used instead (this is called the YUV
representation). The YUV representation is simply a different set of coordinates
which are used to represent the same point in color space. The U vector
points approximately in the red direction while the V vector points
approximately in the blue direction. Any color can be converted from its RGB
representation to its YUV values using a simple matrix transformation. The
advantage to this approach is that the Y value can be digitized at full
resolution while the U and V can be digitized at lower resolutions without
causing the eye to notice any reduction in information or detail.
This is a good point to define the three types of video used in PC video.
When video is divided into its RGB components, it's called RGB video. When Y,U
and V get separate cables, its called component video and when U and V are
combined but Y is separate (two cables) its called S-video.
The DCT works on 8 X 8 blocks of pixels, which results in 64 frequency
coefficients that are arranged in an 8 X 8 square in frequency space as shown in
the figure. The lowest frequency term (which is essentially the average value of
the block) is at the upper left. The higher-frequency horizontal coefficients
increase from left-to-right while the higher frequency vertical coefficients
increase from top to bottom. For natural images, most the the energy in the
block lies in the low frequency coefficients in the upper left-hand corner.
The Wavelet Transform is another type of transform used in video compression.
The Wavelet Transform converts an image to multiple spatial images, one of which
contains the high frequency horizontal information (horizontal edges), one of
which contains the high frequency vertical information (vertical edges), one of
which contains the high frequency diagonal information (diagonal edges) and
other images which contain successively lower frequency information in the
horizontal, vertical and diagonal directions. The Wavelet Transform keeps image
information more localized, enabling more selective data reduction at the
quantization step.
In MPEG and JPEG, the low frequency DCT coefficients are quantized using
approximately four bits meaning they have 16 different levels. The higher
frequency coefficients are quantized with fewer bits, usually, 2,1, or zero
bits. This means that many of the high frequency coefficients are converted to
zero which causes them to be eliminated. MPEG and JPEG enable one to define the
quantization table (Q table), i.e. how many bits are used for each of the
coefficients in the 8X8 block. If a codec chooses its own Q table based on an
anlysis of the image, its called adaptive compression.
MPEG looks at a frame that has already been compressed using intraframe
techniques (called an I-frame) and predicts a frame or picture, called a
P-frame, which can be the next frame, two or four frames later. Interpolated
frames, called B-frames, fill in the skipped frames.
MPEG breaks the I-picture into 16 X 16 pixel blocks and assumes the block is
the same in the predicted P-frame, say four frames later. If the pixels in the
video are moving, then the block will not be the same, and MPEG subtracts the
actual picture block from the prediction to produce a new picture that is
comprised of the set of differences called an error signal. This "differences"
picture is zero when the picture doesn't change. MPEG then applies intraframe
JPEG compression to the differences picture in 8 X 8 pixel blocks. If the
picture has changed considerably, MPEG invokes more powerful motion estimation
techniques to see if a macroblock has moved just slightly. If motion
compensation fails, MPEG starts over and recodes the macroblock with interframe
coding. If a P-picture is completely different from the preceeding I-picture (so
that it's unpredictable), it becomes an entirely new I-picture. Normally,
P-pictures are used to predict other P-pictures several times before MPEG
encodes another I-picture.
When pixels change, its often due to predictable camera or subject motion
which causes the macroblocks appear in the new image in a slightly different
position. MPEG uses a complex trial-and-error procedure to search for
macroblocks that have moved. MPEG's motion estimation procedure calculates a
motion vector (vertical and horizontal pixel shift) which is used to obtain a
better prediction for that macroblock in the next frame.
Bidirectional prediction in MPEG interpolates pictures by predicting them not
only from earlier I or P frames but also from later I or P frames. The later
frames must be transmitted out of order to make this trick possible. The easiest
and fastest way to interpolate a picture is to simply average earlier and later
frames. However, to achieve greater compresion, MPEG uses bidirectional
interpolation. A B-picture contains macroblocks which have been successfully
predicted from a preceeding I or P picture, a following P or I picture, both a
preceeding and following (by averaging) or by neither, in which case that
macroblock is intraframe coding. When a prediction is successful, a small error
signal is the result which is itself coded using intraframe coding.
The MPEG standard allows as many as three B-pictures in a row, the number of
I-pictures is typically two per second which means that P-pictures are used to
forward predict two-to-five following P-pictures before another I picture is
coded.
Filtering
Images which lack sharp edges are easier to compress than
images which have sharp edges. Thus a preprocessing step in MPEG is to filter
the image to reduce the sharp edges (i.e. high frequencies) by filtering.
Filtering works by averaging neighboring pixels to produce a lower data rate
(frequency). The number of pixels used in the average is called the number of
taps. MPEG recommends using a seven tap filter.
Color-Space Conversion
Human eyes detect less detail in color changes
than in black and white. Compression schemes take advantage of this lack of
sensitivity to save significant space. Human eyes detect three primary colors:
red, green and blue. Different colors in any image are combinations of
differening amounts of red, green and blue. This is illustrated by forming a
cube where each of the three colors lies along one axis of the color cube. For
example, yellow is made up of red and green with no blue. The corner of the cube
where R=G=B=0 is black and the diagonally opposite corner is white.
Digitization
When video is stored in YUV values, not as much detail
is needed for the U and V chrominance values as for the Y luminance value. There
are various different sampling schemes used for different compression methods
and video formats. In studio quality video, the U and V components are sampled
twice for every four Y samples (4:2:2 YUV). Therefore there are an average of 16
bits per pixel. In MPEG (and DVI),
only one U and V sample is taken for each 2 by 2 square of luminance pixels
(4:1:1 YUV). Thus in MPEG there are an average of 12 bits per pixel. In Indeo,
only one U and V sample is taken for each 4-by-4 array of luminance pixels.
Intel calls this YUV9 because there are an average of 9 bits per pixel. For
MPEG, this 4:1:1 YUV sampling scheme results in 2::1 compression.
Scaling
The largest reduction in data occurs by scaling
(subsampling) the image when it is not displayed in a full 640 X 480 pixel
window or when a reduction in spatial resolution is tolerable. High-quality
digitizers average, or filter, neighboring pixels and lines with multitap
filters. The inverse of scaling is zooming the picture on playback. When crudely
done, this simply doubles or quadraples pixels and lines producing strong
pixelization and jaggies. In contrast, smart zooming interpolates the missing
pixels by averaging their surroundings.
Transforms
The first step in pure compression is a data transform
which converts the ordinary two-dimensional spatial representation of a picture
(or video frame) into another set of dimensions (such as frequency) in which
some of the data can be selectively eliminated without severely degrading the
quality of the picture. In the discrete cosine transform (DCT) used in JPEG
and MPEG, a Fourier-type transform converts the image (or video frame) to
frequency space. Because most natural images contain very little high frequency
information, most of the picture information is concentrated in the low
frequency coefficients of the transformed data. Image transforms by themselves
don't lead to compression because they are inherently lossless, that is, the
inverse transform restores the original image. The value in using a transform in
compression is obtained in the next step, quantization,
which throws away most of the high frequency information, leaving a slightly
"softer" picture. The objective is to throw away just enough information so that
the eye doesn't notice the loss of the higher frequencies which contain the
sharper details of the image.
Quantization
The heart of lossy compression is the quantization step
in which the transformed coefficients (which can be considered as continuous
floating point numbers) are converted into representations using smaller numbers
of bits and therefore are represented with larger "quantum" steps.
Compaction Encoding
The final step in intraframe compression is a
lossless process of data compaction similar to that used to compress data files.
Codecs use three types of compaction schemes. The first is run-length encoding
which replaces consecutive identical digits (usually zeros) with the number of
consecutive digits and their value. The second is Huffman coding where values
are replaced by a code based on their frequency of occurrance. Frequently
occurring values are given shorter codes which results in compression. The third
type of compaction is arithmetic coding which operates on somewhat the same
principle as Huffman Coding but is more efficient.
Interframe Compression
The greatest amount of data redundancy in
video is not the neighboring pixels in a single frame, it's the vast number of
similar pixels in successive frames that do not change or change very slowly. By
predicting that a pixel will remain the same, and then encoding the difference,
you end up with a string of zeroes or very small numbers suitable for efficient
run-length encoding.