In MPEG, compression results from intraframe compression
in which redundancies among pixels in the same video frame are exploited to
reduce the amount of data and interframe compression in which similarities
between pixels in adjacent frames are exploited to compress the data. In MPEG,
most of the compression is due to interframe compression.
Filtering
Images which lack sharp edges are easier to compress than images which
have sharp edges. Thus a preprocessing step in MPEG is to filter the image to
reduce the sharp edges (i.e. high frequencies) by filtering. Filtering works by
averaging neighboring pixels to produce a lower data rate (frequency). The
number of pixels used in the average is called the number of taps. MPEG
recommends using a seven tap filter.
Color-Space Conversion
Human
eyes detect less detail in color changes than in black and white. Compression
schemes take advantage of this lack of sensitivity to save significant space.
Human eyes detect three primary colors: red, green and blue. Different colors in
any image are combinations of differening amounts of red, green and blue. This
is illustrated by forming a cube where each of the three colors lies along one
axis of the color cube. For example, yellow is made up of red and green with no
blue. The corner of the cube where R=G=B=0 is black and the diagonally opposite
corner is white.
All points along the black-to-white diagonal of the
cubA single frame of broadcast-quality video comprises 640 by 480 pixels, which,
when multiplied by each pixel's 24 bits of color information, brings the total
to 921 KB. A single-speed CD-ROM can only read data at a rate of 150 KB per
second. That means that if you want to achieve 30 frames per second, you need to
squeeze each 921 KB frame down to 5 KB (a compression ratio of almost 200:1).
Since you also need room for other data such as text, voice and MIDI, the video
data usually needs to be compressed by more than 200:1.
Video compression
is composed of ten basic processes, six of which result in data reduction. The
first four, filtering, color conversion, digitizing and scaling are essentially
preprocessing steps. Though they are not considered compression, they can have a
dramatic effect on data reduction. The processes that directly result in
compression are data transforms, quantization, compaction encoding, interframe
compression using motion compensation and predictive coding.
e are
shades of gray or luminance, called Y. The eye is very sensitive to changes in
luminance but not so sensitive to changes in color. Thus, instead of storing
video information in terms of their red, green and blue components, two
chrominance values called U and V and the luminance value, Y, are used instead
(this is called the YUV representation). The YUV representation is simply a
different set of coordinates which are used to represent the same point in color
space. The U vector points approximately in the red direction while the V vector
points approximately in the blue direction. Any color can be converted from its
RGB representation to its YUV values using a simple matrix transformation. The
advantage to this approach is that the Y value can be digitized at full
resolution while the U and V can be digitized at lower resolutions without
causing the eye to notice any reduction in information or detail.
This is
a good point to define the three types of video used in PC video. When video is
divided into its RGB components, it's called RGB video. When Y,U and V get
separate cables, its called component video and when U and V are combined but Y
is separate (two cables) its called S-video.
Digitization
When
video is stored in YUV values, not as much detail is needed for the U and V
chrominance values as for the Y luminance value. There are various different
sampling schemes used for different compression methods and video formats. In
studio quality video, the U and V components are sampled twice for every four Y
samples (4:2:2 YUV). Therefore there are an average of 16 bits per pixel. In
MPEG (and >DVI), only one U and V sample is taken for each 2 by 2 square of
luminance pixels (4:1:1 YUV). Thus in MPEG there are an average of 12 bits per
pixel. In Indeo, only one U and V sample is taken for each 4-by-4 array of
luminance pixels. Intel calls this YUV9 because there are an average of 9 bits
per pixel. For MPEG, this 4:1:1 YUV sampling scheme results in 2::1
compression.
Scaling
The largest reduction in data occurs by
scaling (subsampling) the image when it is not displayed in a full 640 X 480
pixel window or when a reduction in spatial resolution is tolerable.
High-quality digitizers average, or filter, neighboring pixels and lines with
multitap filters. The inverse of scaling is zooming the picture on playback.
When crudely done, this simply doubles or quadraples pixels and lines producing
strong pixelization and jaggies. In contrast, smart zooming interpolates the
missing pixels by averaging their surroundings.
Transforms
The
first step in pure compression is a data transform which converts the ordinary
two-dimensional spatial representation of a picture (or video frame) into
another set of dimensions (such as frequency) in which some of the data can be
selectively eliminated without severely degrading the quality of the picture. In
the discrete cosine transform (DCT) used in JPEG and MPEG, a Fourier-type
transform converts the image (or video frame) to frequency space. Because most
natural images contain very little high frequency information, most of the
picture information is concentrated in the low frequency coefficients of the
transformed data. Image transforms by themselves don't lead to compression
because they are inherently lossless, that is, the inverse transform restores
the original image. The value in using a transform in compression is obtained in
the next step, quantization, which throws away most of the high frequency
information, leaving a slightly "softer" picture. The objective is to throw away
just enough information so that the eye doesn't notice the loss of the higher
frequencies which contain the sharper details of the image.
The DCT works
on 8 X 8 blocks of pixels, which results in 64 frequency coefficients that are
arranged in an 8 X 8 square in frequency space as shown in the figure. The
lowest frequency term (which is essentially the average value of the block) is
at the upper left. The higher-frequency horizontal coefficients increase from
left-to-right while the higher frequency vertical coefficients increase from top
to bottom. For natural images, most the the energy in the block lies in the low
frequency coefficients in the upper left-hand corner.
The Wavelet
Transform is another type of transform used in video compression. The Wavelet
Transform converts an image to multiple spatial images, one of which contains
the high frequency horizontal information (horizontal edges), one of which
contains the high frequency vertical information (vertical edges), one of which
contains the high frequency diagonal information (diagonal edges) and other
images which contain successively lower frequency information in the horizontal,
vertical and diagonal directions. The Wavelet Transform keeps image information
more localized, enabling more selective data reduction at the quantization
step.
Quantization
The heart of lossy compression is the
quantization step in which the transformed coefficients (which can be considered
as continuous floating point numbers) are converted into representations using
smaller numbers of bits and therefore are represented with larger "quantum"
steps.
In MPEG and JPEG, the low frequency DCT coefficients are quantized
using approximately four bits meaning they have 16 different levels. The higher
frequency coefficients are quantized with fewer bits, usually, 2,1, or zero
bits. This means that many of the high frequency coefficients are converted to
zero which causes them to be eliminated. MPEG and JPEG enable one to define the
quantization table (Q table), i.e. how many bits are used for each of the
coefficients in the 8X8 block. If a codec chooses its own Q table based on an
anlysis of the image, its called adaptive compression.
Compaction
Encoding
The final step in intraframe compression is a lossless process
of data compaction similar to that used to compress data files. Codecs use three
types of compaction schemes. The first is run-length encoding which replaces
consecutive identical digits (usually zeros) with the number of consecutive
digits and their value. The second is Huffman coding where values are replaced
by a code based on their frequency of occurrance. Frequently occurring values
are given shorter codes which results in compression. The third type of
compaction is arithmetic coding which operates on somewhat the same principle as
Huffman Coding but is more efficient.
Interframe Compression
The
greatest amount of data redundancy in video is not the neighboring pixels in a
single frame, it's the vast number of similar pixels in successive frames that
do not change or change very slowly. By predicting that a pixel will remain the
same, and then encoding the difference, you end up with a string of zeroes or
very small numbers suitable for efficient run-length encoding.
MPEG looks
at a frame that has already been compressed using intraframe techniques (called
an I-frame) and predicts a frame or picture, called a P-frame, which can be the
next frame, two or four frames later. Interpolated frames, called B-frames, fill
in the skipped frames.
MPEG breaks the I-picture into 16 X 16 pixel
blocks and assumes the block is the same in the predicted P-frame, say four
frames later. If the pixels in the video are moving, then the block will not be
the same, and MPEG subtracts the actual picture block from the prediction to
produce a new picture that is comprised of the set of differences called an
error signal. This "differences" picture is zero when the picture doesn't
change. MPEG then applies intraframe JPEG compression to the differences picture
in 8 X 8 pixel blocks. If the picture has changed considerably, MPEG invokes
more powerful motion estimation techniques to see if a macroblock has moved just
slightly. If motion compensation fails, MPEG starts over and recodes the
macroblock with interframe coding. If a P-picture is completely different from
the preceeding I-picture (so that it's unpredictable), it becomes an entirely
new I-picture. Normally, P-pictures are used to predict other P-pictures several
times before MPEG encodes another I-picture.
When pixels change, its
often due to predictable camera or subject motion which causes the macroblocks
appear in the new image in a slightly different position. MPEG uses a complex
trial-and-error procedure to search for macroblocks that have moved. MPEG's
motion estimation procedure calculates a motion vector (vertical and horizontal
pixel shift) which is used to obtain a better prediction for that macroblock in
the next frame.
Bidirectional prediction in MPEG interpolates pictures by
predicting them not only from earlier I or P frames but also from later I or P
frames. The later frames must be transmitted out of order to make this trick
possible. The easiest and fastest way to interpolate a picture is to simply
average earlier and later frames. However, to achieve greater compresion, MPEG
uses bidirectional interpolation. A B-picture contains macroblocks which have
been successfully predicted from a preceeding I or P picture, a following P or I
picture, both a preceeding and following (by averaging) or by neither, in which
case that macroblock is intraframe coding. When a prediction is successful, a
small error signal is the result which is itself coded using intraframe
coding.
The MPEG standard allows as many as three B-pictures in a row,
the number of I-pictures is typically two per second which means that P-pictures
are used to forward predict two-to-five following P-pictures before another I
picture is coded.
Cool Video Sites & Who's
using Digital Video today
3D and VR QuickTime Links
Surf
the Web Great Southern California Links
Online Mag's & User
Groups
Troubleshooting Digital Video
Leeman's Main page