13. MPEG Audio

MPEG Standards

Organizations from all over the world are involved in developing MPEG standards. Fraunhofer-Gesellschaft of Germany and Thomson Multimedia of France provided key technology related to MPEG Audio Layer-III (MP3). Dolby Labs was heavily involved in the development of MPEG AAC. Each of these organizations holds multiple patents related to the technologies they contributed.

The MPEG committee works in phases and meets several times a year. To date, MPEG has released three families of standards: MPEG-1, MPEG-2 and MPEG-4. (If you are wondering about MPEG-3, it was merged into MPEG-2.) All MPEG phases include standards for both audio and video. This chapter is concerned only with MPEG Audio.

It typically takes several years from when a standard is released to when consumer products that support it reach the market. MPEG-1, which includes MP3, was released in 1992. However, it took more than four years for software players, such as Winamp, to appear, and almost six years for the first portable MP3 players to become available.

MPEG standards for digital audio cover encoding of audio, either by itself or as the audio component of a multimedia file or stream. MPEG Audio is based on perceptual encoding techniques, which take advantage of the characteristics of human hearing and remove sounds that most people can’t hear.

MPEG-1

MPEG-1 (which includes MP3) was approved in November 1992. It works with bit-rates up to 1.5 Mbps (million bits per second) and supports both mono and stereo audio, but not multi-channel surround sound. MPEG-1 supports sampling rates of 32, 44.1 and 48 kHz.

MPEG-2

MPEG-2 adds support for surround sound, lower sampling rates of 16, 22.05 and 24 kHz and bit-rates as low as 8 kbps. MPEG-2 can have up to five channels for surround sound and one low frequency enhancement channel for subwoofers. A multilingual extension adds support for up to seven more channels.

MPEG-4

MPEG-4 is intended to be an all-purpose encoding standard for multimedia systems of the future. It’s designed to handle applications ranging from simple voice systems that require very low bandwidth to high quality “audiophile” and professional sound systems. MPEG-4 can integrate synthetic and natural audio, including MIDI and text-to-speech systems. A large part of MPEG-4 is based on Apple’s QuickTime multimedia format.

MPEG-4 is made extensible by a language called MSDL (MPEG Syntax Description Language). The support for interactivity allows manipulation of the presentation of audio and visual data. MPEG-4 supports a wide range of storage and transmission media and will work over networks and wireless mobile connections.

MPEG-7

MPEG-7 is also referred to as Multimedia Content Description Interface. It defines a structure that supports searching, filtering and management of multimedia data. MPEG-7 is expected to be released in 2001.

Table 14 - MPEG Phases

MPEG-1 (approved Nov. 1992)

Single (mono) and dual (stereo) channel encoding of audio at 32, 44.1 and 48 kHz sampling rates and bit-rates from 32 to 448 kbps.

MPEG-2 (approved Nov. 1994)

A backward-compatible extension to MPEG-1 with up to five channels, plus one low frequency enhancement channel. Adds support for 16, 22.05 and 24 kHz sampling rates, with bit-rates from 32 to 256 kbps for Layer-I, up to 384 kbps for Layer-II, and from 8 to 320 kbps for Layer-III.

MPEG-2 AAC

Supports a wider range of sampling rates (from 8 kHz to 96 kHz) and up to 48 audio channels, plus up to 15 auxiliary low-frequency enhancement channels and up to 15 embedded data streams. AAC works at bit-rates from 8 kbps for mono speech and in excess of 320 kbps for very-high-quality audio.

MPEG-4 Version 1 (approved Oct. 1998)

All-purpose encoding standard for multimedia systems of the future. Supports coding and composition of both natural and synthetic audio at a wide range of bit-rates.

MPEG-4 Version 2 (scheduled to be approved Dec. 1999)

Builds on previous standards for digital television, interactive graphics applications and interactive multimedia.

MPEG-7 (scheduled to be approved July 2001)

Also called Multimedia Content Description Interface. Provides information search, filtering and management for multimedia data.

 

Licensing

Once a standard is released, it is up to private industry to develop products and technologies to take advantage of it. Often, these companies are required to pay licensing fees to companies that hold patents on technologies related to the standard. The only requirement from MPEG is that any licensing fees be fair and equitable.

Many people are surprised to learn that licensing fees are required to develop products based on an open standard. These licensing fees help compensate companies that contribute technology and other resources towards developing MPEG standards. If these companies had no way to recoup their investment, there would be little incentive for them to spend money developing technologies that their competitors could then use free of charge.

MPEG 2.5

A non-ISO extension called “MPEG 2.5” was created by the Fraunhofer Institute to improve performance at lower bit-rates. At lower bit-rates, this extension allows sampling rates of 8, 11.025 and 12 kHz. A high sampling rate at a very low bit-rate requires a trade-off in reduced resolution. Lowering the sampling rate reduces the frequency response but allows the frequency resolution to be increased, so the result is a file with significantly better quality.

 

MPEG Layers

Several related audio encoding schemes fall under the MPEG umbrella. These are referred to as Layers I, II and III, which exist under both MPEG-1 and MPEG-2. (Another audio encoding scheme that’s part of MPEG-2 is MPEG AAC, which is not compatible with Layers I - III.)

Each layer uses the same basic structure and includes the features of the layers below it. Higher layers offer progressively better sound quality at comparable bit-rates and require increasingly complex encoding software. This, in turn, requires more processing power for encoding and decoding the audio.

Layer-I

Layer-I was originally designed for the Digital Compact Cassette (DCC) and is not widely used.

Layer-II

Layer-II  (also referred to as MP2) is widely used within the broadcasting industry. It was designed as a trade-off between complexity and performance and offers very high quality sound at higher bit-rates. It also has lower encoding delays than MP3, which is important for live broadcasting.

Layer-III

Layer-III  (MP3) was designed for better quality at lower bit-rates. The high level of compression achieved by MP3 is very important because of the limited bandwidth of the Internet and the limited space of hard disks. This compression also makes MP3 well suited for portable players that use expensive solid-state memory cards.

AAC

AAC (Advanced Audio Coding) is not an MPEG layer, although it is based on a psycho-acoustic model. Sometimes referred to as MP4, AAC provides significantly better quality at lower bit-rates than MP3. AAC was developed under MPEG-2 and also exists under MPEG-4.

AAC supports a wider range of sampling rates (from 8 kHz to 96 kHz) and up to 48 audio channels, plus up to 15 auxiliary low frequency enhancement channels and up to 15 embedded data streams. AAC works at bit rates from 8 kbps for mono speech and up to in excess of 320 kbps for high-quality audio. Three profiles of AAC provide varying levels of complexity and scalability.

AAC software is much more expensive to license than MP3 because the companies that hold related patents decided to keep a tighter rein on it. Most AAC software is geared towards professional applications and secure music distribution systems, so it may be a while before you see AAC in consumer-oriented products.

AT&T’s a2b music, Global Music’s MP4 and Liquid Audio are systems for music delivery that are based on AAC. They all include schemes for copyright identification, encryption and royalty tracking. It’s important to remember that these systems are proprietary, even though they are based on an open standard.

Even though AAC is a better format for digital audio, it’s not clear whether or not it will eclipse MP3 in consumer products. MP3 can sound just as good as AAC at the expense of using more disk space, and disk space is getting cheaper all the time.

Compatibility

The various flavors of MPEG Audio are compatible with each other to some degree. Layers I, II and III are backward compatible. For example, a Layer-III decoder should also be able to decode a Layer-I or II stream, and a Layer-II decoder should be able to decode a Layer-I stream. AAC is not backward compatible with any of the MPEG layers and is sometimes referred to as “NBC,” or “not backward compatible.”

MPEG-1 layers, and the same layers under MPEG-2, are compatible with each other to a limited degree. MPEG-2 decoders must be able to decode MPEG-1 files, and MPEG-1 decoders should be able to play the left and right channels of an MPEG-2 signal.

Most MP3 players are compatible with both MPEG-1 and MPEG-2 files, and most mainstream MP3 encoders and players are compatible with each other (though there have been compatibility issues reported with a few of the freeware encoders and some players).

Compatibility between proprietary formats based on MPEG is another story. Most of the proprietary formats based on MPEG Audio, such as AT&T’s a2b music and Liquid Audio, are not compatible with each other or with software that supports only pure MPEG formats.

Some features added to MPEG Audio (such as watermarking) should not affect compatibility, but many proprietary formats use encryption. And any form of encryption is likely to make these formats incompatible with each other and with products that support only pure MPEG Audio.

MPEG Encoding

MPEG Audio uses what’s referred to as perceptual encoding (a type of “lossy” compression). To compress audio, MPEG encoders first apply a psycho-acoustic model to identify parts of the signal that most people can’t hear. The encoder removes these sounds from the signal and then applies standard lossless data compression techniques.

This technique does not work perfectly because the sensitivity of each person’s hearing is different. But the sensitivity of human hearing does fall within a finite range, and researchers can determine a range that applies to the vast majority of people.

Sub-bands

The encoder first divides the signal into multiple sub-bands, so the encoded signal can be better optimized to the response of the human ear. For example, most of the stereo information below 100 Hz can be discarded because the ear cannot determine the direction of very low frequency sounds; at higher frequencies the ear is more sensitive to the direction of sounds, so more stereo information needs to be retained.

Minimum Audible Threshold

The level below which all sounds are inaudible to the human ear is called the threshold of hearing, or minimum audible threshold. This threshold varies according to frequency because the human ear does not have a linear response.

Sounds below this threshold can be removed by the encoder, and most listeners will not detect any difference between the encoded signal and the original. The ear is most sensitive to frequencies between 2 kHz and 4 kHz, so less information can be removed from this range without affecting the quality of the sound.


Figure 28 shows the Fletcher-Munson curve, which illustrates how the threshold of human hearing varies according to frequency.
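The way the threshold varies with frequency can also be explored numerically. The sketch below is an illustration only: it uses an approximation formula commonly attributed to Terhardt, which is not taken from this chapter, and the function name threshold_in_quiet_db is made up for this example. Real MPEG encoders use their own, more detailed psycho-acoustic models.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate threshold of hearing in dB SPL for a pure tone.

    Uses Terhardt's approximation (an assumption for this sketch, not
    something defined by the MPEG standards).
    """
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

if __name__ == "__main__":
    # Print the approximate threshold at a few sample frequencies.
    for freq in (100, 500, 1000, 3300, 8000, 15000):
        print(f"{freq:>6} Hz: {threshold_in_quiet_db(freq):6.1f} dB SPL")
```

With this approximation the lowest threshold falls a little above 3 kHz, which agrees with the statement that the ear is most sensitive between 2 kHz and 4 kHz.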

Masking Effect

Quiet sounds are “masked” by louder sounds that are close to them in frequency and time. Since you can’t hear these sounds, they can be removed from the signal without affecting the perceived quality. An example is the hiss and other background noise you hear when a song is paused or blank tape is playing. When the music plays above a certain level you can no longer hear this background noise, but it is still there in the signal.


Reservoir of Bits


Certain musical passages need to be encoded at higher bit-rates to maintain fidelity, so MP3 creates a reservoir by setting aside bits from less complex passages. These extra bits can then be applied to more complex passages, where they are needed more. This is different from variable bit-rate encoding, because a fixed number of bits is allocated; they are simply shifted to where they are needed most.
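The reservoir idea can be illustrated with a toy model. The sketch below is not how any real MP3 encoder is implemented; the function name, the per-frame "demand" figures and the reservoir limit are all hypothetical. It only shows the bookkeeping: every frame receives the same budget, unused bits are saved, and a later frame may spend its budget plus whatever has been saved.

```python
def allocate_with_reservoir(demands, frame_budget, reservoir_max):
    """Toy bit-reservoir model (illustrative only).

    demands       -- bits each frame would like to use
    frame_budget  -- fixed number of bits added per frame
    reservoir_max -- most unused bits that can be carried forward
    Returns the number of bits actually granted to each frame.
    """
    reservoir = 0
    granted = []
    for need in demands:
        available = frame_budget + reservoir
        used = min(need, available)
        granted.append(used)
        # Leftover bits are carried forward, up to the reservoir limit.
        reservoir = min(available - used, reservoir_max)
    return granted

if __name__ == "__main__":
    # Two simple frames save bits that the third, complex frame then spends.
    print(allocate_with_reservoir([50, 50, 200, 100], 100, 150))
```

In this model the total number of bits granted can never exceed the frame budget multiplied by the number of frames, which is what separates the reservoir from variable bit-rate encoding.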

Stereo Modes

Stereo audio normally requires twice the bandwidth of mono because it uses two separate channels. Much of the information is identical on both channels. For example, any sounds positioned at the center of the stereo image will be carried by both channels. This wastes a lot of space because the information is identical. MPEG Audio has several ways of handling stereo information. Each method varies in the amount of compression and the fidelity to the stereo image.

Simple Stereo (mode 0) is the closest to a normal stereo signal. It uses independent channels; therefore, any duplicate information will be retained, and some bandwidth will be wasted. The MPEG encoder can vary the allocation of bits between channels according to the complexity of the signal. The overall bit-rate remains constant, but the split between the channels varies according to the dynamic range of each channel.

Joint Stereo (mode 1) uses MS (middle/side) Stereo, where one channel carries the information that is identical on both channels and the other carries the difference. Joint Stereo retains all the original stereo information and uses bandwidth very efficiently.

Intensity Stereo encodes only the stereo information that is perceived as important to the stereo image. Intensity Stereo provides the highest level of compression, but the stereo image will suffer at lower bit-rates.

Although Simple Stereo is the closest to a normal stereo signal, it is not the best option to use with MPEG Audio. In most cases, Joint Stereo will produce higher quality sound because the bits can be allocated more efficiently.
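The middle/side idea behind Joint Stereo can be sketched in a few lines. This is a simplified illustration, not the exact arithmetic of any MPEG encoder: the function names are made up, and dividing by two is just one convenient scaling. The middle channel carries what the left and right channels have in common, and the side channel carries the difference.

```python
def ms_encode(left, right):
    """Convert left/right samples into middle/side samples."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover the left/right samples from middle/side samples."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

if __name__ == "__main__":
    left = [0.5, 0.25, -1.0]
    right = [0.5, -0.25, -1.0]
    mid, side = ms_encode(left, right)
    print(mid, side)
    print(ms_decode(mid, side))
```

Wherever the two channels are identical, the side value is zero, so very few bits are needed to describe it. That is where the bandwidth saving comes from.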

Huffman Encoding

In any musical composition, certain sound patterns are repeated—some more often than others. These patterns can be coded with symbols to save space, then decoded into the original pattern when played. Huffman encoding increases compression by using shorter codes for more common sound patterns. It’s similar to replacing every word in a document with a number and using the smaller numbers for the most common words.
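The principle of giving shorter codes to more common symbols can be shown with a generic Huffman code builder. This is a minimal sketch of the textbook algorithm, applied here to characters in a string as a stand-in for sound patterns. MPEG encoders use code tables defined by the standard rather than building a tree like this, and the function name huffman_codes is made up for this example.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Return a dict that maps each symbol to its Huffman bit string."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups into one.
        w1, _, group0 = heapq.heappop(heap)
        w2, _, group1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

if __name__ == "__main__":
    message = "aaaaaaaabbbc"
    codes = huffman_codes(message)
    encoded = "".join(codes[ch] for ch in message)
    print(codes)
    print(len(encoded), "bits instead of", 8 * len(message))
```

The most frequent symbol receives the shortest code, and no code is the beginning of another, so the bit stream can be decoded without separators.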

Bit-rates

MPEG Audio supports constant and variable bit-rates ranging from 8 kbps to 1.5 Mbps. Just as with uncompressed audio, the bit-rate of MPEG Audio has a direct relationship to sound quality and file size.

Constant bit-rate (CBR) encoding is not very efficient because it uses the same number of bits, regardless of how complex or simple the passage is. Variable bit-rate (VBR) encoding varies the number of bits depending on the complexity of the music and is more efficient than CBR. For example, a simple passage with just a vocalist and acoustic guitar needs fewer bits than a passage with a full symphony.

Text Box: Resolution and MPEG Audio

MPEG encoders rely on the resolution used in the uncompressed audio file to set the range of resolution that will be used for the encoded file. The resolution of the encoded file is varied according to the complexity of the signal to achieve compression. Many encoders are optimized to work with 16-bit resolution input, and some will only accept 44.1 kHz, 16-bit WAV files as input.

Table 15 shows the file sizes and relative amounts of compression for different bit-rates. As the bit-rate increases, so does the sound quality, along with the file size. This table also shows how many hours of audio, or how many four-minute songs, a 1GB hard disk will hold at each rate.

Table 15 - File Size vs. Bit-rate

Bit-rate                File Size       MB per   Compression   Hours    4-min. Songs
                        (4-min. song)   Minute   Ratio         per GB   per GB

1,411 kbps (CD Audio)   41.3MB          10.3     None           1.7      25
80 kbps                  2.3MB           0.6     17.6:1        29.1     437
128 kbps                 3.8MB           0.9     11.0:1        18.2     273
160 kbps                 4.7MB           1.2      8.8:1        14.6     218
192 kbps                 5.6MB           1.4      7.3:1        12.1     182
256 kbps                 7.5MB           1.9      5.5:1         9.1     137
320 kbps                 9.4MB           2.3      4.4:1         7.3     109
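The figures in Table 15 follow directly from the bit-rate. The sketch below shows the arithmetic. The unit convention (1 kbps counted as 1,024 bits per second, 1 MB as 1,024 x 1,024 bytes and 1 GB as 1,024 MB) is an assumption inferred from the numbers in the table, not something the text states, and the function names are made up for this example.

```python
CD_KBPS = 1411.2  # 44,100 samples/s x 16 bits x 2 channels

def mb_per_minute(kbps):
    """Megabytes needed for one minute of audio at the given bit-rate."""
    bytes_per_second = kbps * 1024 / 8   # assumed: 1 kbps = 1,024 bits/s
    return bytes_per_second * 60 / (1024 * 1024)

def song_size_mb(kbps, minutes=4):
    """File size in MB for a song of the given length."""
    return mb_per_minute(kbps) * minutes

def hours_per_gb(kbps):
    """Hours of audio that fit in 1 GB (assumed to be 1,024 MB)."""
    return 1024 / mb_per_minute(kbps) / 60

def songs_per_gb(kbps, minutes=4):
    """Number of songs of the given length that fit in 1 GB."""
    return 1024 / song_size_mb(kbps, minutes)

def compression_ratio(kbps):
    """Compression relative to CD audio."""
    return CD_KBPS / kbps

if __name__ == "__main__":
    for rate in (80, 128, 160, 192, 256, 320):
        print(rate, "kbps:",
              f"{song_size_mb(rate):.2f} MB per song,",
              f"{compression_ratio(rate):.1f}:1,",
              f"{hours_per_gb(rate):.1f} hours per GB,",
              f"{int(songs_per_gb(rate))} songs per GB")
```

Small differences from the table in the last digit come from rounding.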

 

Signal Delays

The process of encoding and decoding audio introduces a slight delay into the signal. This is not a problem for home use, but it is a factor for applications where a short delay is critical, such as two-way voice conversations, where a delay of more than 10 ms (milliseconds) can be disturbing. Delays for MPEG Audio typically range from 19 ms for Layer-I to more than 60 ms for Layer-III and AAC. The actual delay depends on the hardware and software used.

Embedded Data (ID3 Tags)

MPEG Audio is frame-based, which allows it to support the insertion of additional program information in the form of text, graphics and other data. The standard is flexible enough that software developers can include almost any type of data, such as copyright information, lyrics, album artwork and even links to artists’ Web sites.

ID3 Tags

An informal standard called ID3 tagging has emerged that specifies a format for storing non-audio data inside MP3 files. The ID3 information can be displayed and edited by MP3 players such as Winamp. The ID3 tag is placed at the very end of the MP3 file, which makes it unsuitable for streaming audio.

ID3 Version 1 is limited to 128 bytes of data and 30 characters per field and contains fixed-length fields for title, artist, album, year, comments and genre. Audio CDs do not contain this information, so it needs to be entered manually or obtained from a database, such as the CDDB (see Chapter 9, Organizing and Playing Music). The identification field must contain the characters “TAG” to indicate ID3 version 1 compliance.

ID3 Version 1.1 takes the last two bytes of the comments field and reuses them: the first is set to zero, and the second holds the number of the CD track that the song originated from.

Table 16 - ID3 Tag Version 1.1 Fields

Position    Length (Bytes)    Field

0-2         3                 Identification
3-32        30                Title
33-62       30                Artist
63-92       30                Album
93-96       4                 Year
97-124      28                Comments
125         1                 0 (zero)
126         1                 Track Number
127         1                 Genre
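The fixed layout makes an ID3 version 1 tag easy to read in code. The sketch below is one possible way to do it, based on the field sizes described above (128 bytes in total, with the zero byte and track number at positions 125 and 126). The function names are made up for this example, and decoding the text as Latin-1 is an assumption.

```python
import struct

# Identification, title, artist, album, year, comments (including the
# version 1.1 zero byte and track number) and genre: 128 bytes in total.
ID3V1_FORMAT = "<3s30s30s30s4s30sB"

def _text(raw):
    """Cut at the first NUL byte, decode, and drop trailing padding."""
    return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()

def parse_id3v1(tag):
    """Parse a 128-byte ID3v1 or v1.1 tag; return None if it is not one."""
    if len(tag) != 128 or tag[:3] != b"TAG":
        return None
    _, title, artist, album, year, comment, genre = struct.unpack(
        ID3V1_FORMAT, tag)
    track = None
    # Version 1.1: position 125 is zero and position 126 holds the track.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,  # numeric genre code
    }
```

Because the tag sits at the very end of the file, a program would normally pass this function the last 128 bytes of the MP3 file.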

 

ID3v2

ID3 Version 2 is designed to be more flexible and expandable than version 1.1. Each tag contains smaller chunks of data, called frames. Each frame can contain any type of data, such as lyrics, album cover graphics and links to a band’s Web site. The ID3v2 tag is placed at the beginning of the file, which makes it useful for streaming applications. A unique feature called the Popularimeter can be used to keep track of how often you listen to each song, and this information could be used to automatically construct playlists based on your personal tastes.

Key Features of ID3v2

·        Uses a container format.

·        Tag data is at beginning of the file, which makes it suitable for streaming.

·        Has an “unsynchronization” feature to prevent ID3v2 incompatible players from attempting to play the tag.

·        Maximum tag size is 256MB; maximum frame size is 16MB.

·        Supports Unicode and the capability to compress data.

·        Has several new text fields such as composer, conductor, media type, BPM, copyright message, etc.

·        Able to contain both plain and synchronized lyrics (for karaoke).

·        Can contain volume, balance and equalizer settings.

·        Supports encrypted information, images and hyperlinks.
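The 256MB limit on tag size comes from the way the size is stored. In the ID3v2 header, the size occupies four bytes, but only the lower seven bits of each byte are used, which leaves 28 bits. The sketch below decodes such a value. It assumes the 10-byte header layout from the ID3v2 specification ("ID3", two version bytes, one flags byte, four size bytes), and the function names are made up for this example.

```python
def synchsafe_to_int(data):
    """Decode a size stored with 7 useful bits per byte."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value

def id3v2_tag_size(header):
    """Return the tag size from a 10-byte ID3v2 header, or None."""
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    return synchsafe_to_int(header[6:10])

if __name__ == "__main__":
    # Largest value that four 7-bit bytes can hold: one byte under 256MB.
    print(synchsafe_to_int(bytes([0x7F] * 4)))
```

Keeping the top bit of every size byte clear is part of the same scheme that stops older players from mistaking tag data for audio.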

Measuring Sound Quality

Sound quality is subjective, so traditional measures like total harmonic distortion (THD) and signal-to-noise ratio are not useful for rating perceptual encoding schemes. The perceived quality of the sound is more important than any characteristic that can be measured with test equipment. Controlled tests with trained listeners are the best way of measuring the performance of perceptual encoding schemes.

During the MPEG-1 development process, three international listening tests were performed using the CCIR (International Radio Consultative Committee) impairment scale shown in Table 17. At 128 kbps, MP3 scored between 3.6 and 3.8. This indicates that listeners detected a difference between the MP3 and the original but the difference was not annoying. At 240 kbps and above, MP3 scored at the high end of the scale, and most listeners found it difficult to distinguish between the MP3 and the original version.


Table 17 - CCIR Impairment Scale

5.0       -      Imperceptible (indistinguishable from the original)

4.0       -      Perceptible (perceptible difference, but not annoying)

3.0       -      Slightly annoying

2.0       -      Annoying

1.0       -      Very annoying

 

Variables That Affect Sound Quality

The major variables that affect the sound quality of encoded audio are the type of encoder, the bit-rate, the type of music and the sensitivity of the listener’s hearing. The quality of commercially available encoders is generally very good, and most people would find it difficult to tell the difference between two MP3 files encoded from the same song by different encoders. Assuming you’ve already decided on using MP3, the bit-rate is the biggest factor that you can control.

In general, music that is more complex will require higher bit-rates. A good example is classical (or symphonic) music. Classical music is generally more complex, because there are more instruments and a wider dynamic range compared to most other types of music, such as blues and rock. Variable bit-rate encoding is a good choice for all types of music because it provides significantly better quality than constant bit-rate encoding at a similar rate. This is because the bits are allocated where they are needed most, which also helps maintain a more constant signal-to-noise ratio.

Table 18 shows the bit-rates for various digital audio formats that will produce high quality sound for most types of music.

Table 18 - Bit-rates for High Quality Sound

Format                   Bit-rate           Compression

Red Book (CD)            1.4Mbps            None
MPEG Layer-I             384 kbps           3.6:1
MPEG Layer-II            256 kbps           5.5:1
MPEG Layer-III (MP3)     192 kbps           7.3:1
MPEG Layer-III (MP3)     VBR Normal/High    7:1 to 10:1
MPEG AAC                 128 kbps           11:1

 

Table 19 - ID3 Tag Genre Codes

 0  Blues           20  Alternative     40  Alternative Rock     60  Top 40
 1  Classic Rock    21  Ska             41  Bass                 61  Christian Rap
 2  Country         22  Death Metal     42  Soul                 62  Pop/Funk
 3  Dance           23  Pranks          43  Punk                 63  Jungle
 4  Disco           24  Soundtrack      44  Space                64  Native American
 5  Funk            25  Euro-Techno     45  Meditative           65  Cabaret
 6  Grunge          26  Ambient         46  Instrumental Pop     66  New Wave
 7  Hip-Hop         27  Trip-Hop        47  Instrumental Rock    67  Psychedelic
 8  Jazz            28  Vocal           48  Ethnic               68  Rave
 9  Metal           29  Jazz+Funk       49  Gothic               69  Showtunes
10  New Age         30  Fusion          50  Darkwave             70  Trailer
11  Oldies          31  Trance          51  Techno-Industrial    71  Lo-Fi
12  Other           32  Classical       52  Electronic           72  Tribal
13  Pop             33  Instrumental    53  Pop-Folk             73  Acid Punk
14  R&B             34  Acid            54  Eurodance            74  Acid Jazz
15  Rap             35  House           55  Dream                75  Polka
16  Reggae          36  Game            56  Southern Rock        76  Retro
17  Rock            37  Sound Clip      57  Comedy               77  Musical
18  Techno          38  Gospel          58  Cult                 78  Rock & Roll
19  Industrial      39  Noise           59  Gangsta              79  Hard Rock

Source: www.dv.co.yu/mpgscript/mpeghdr.htm

Additional Resources

Organization                                            Web Site

American National Standards Institute (ANSI)            www.ansi.org
Centre for Communication Interface Research (CCIR)      www.ccir.org
Fraunhofer Gesellschaft                                 www.iis.fhg.de/amm/techinf
ID3 Tag Specification                                   www.id3.org
International Organization for Standardization (ISO)    www.iso.ch
Moving Picture Experts Group (MPEG)                     http://drogo.cselt.stet.it/mpeg