One way to compare CODEC quality

I’m often asked for my opinion on sound quality versus compression formats, sampling rates, bit depths, psychoacoustic CODECs, and other things like that…

Of course, there are lots of ways to decide on such an opinion, depending on what parameters you use to define “sound quality” and therefore what it is you’re asking specifically…

One way to think of this is to consider that the original sound file is the “reference” (regardless of how “good” or “bad” it is…), and when you encode it somehow (say, by changing sampling rates, or making it an MP3 file, for example), AND that encoding makes it different, then the resulting difference from the original can be considered an error.

So, I took a compilation of tracks that I often use for listening to loudspeakers. This is about 13 minutes long and is made of excerpts of many different recordings and recording styles, ranging from anechoic female speech, through a cappella choral, orchestral music, jazz, hard rock, heavy metal, and hip hop. The original tracks were all taken from 44.1 kHz / 16-bit CDs, and the compilation is a 44.1 kHz / 16-bit result. This is what we’ll call the “reference”.

I then used LAME to encode the compilation as MP3 at different bitrates. I encoded it at 320, 256, and 128 kbps CBR (Constant Bit Rate). I also used the “--preset” option to make encodings in the “insane”, “extreme”, “standard”, and “medium” settings (I’ve included the details of this at the bottom in the “Appendix”). Three of these four presets are VBR – the “Insane” setting is a CBR 320 kbps with some tweaked parameters.
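As a rough sketch of how the seven encodings could be scripted, the Python below only builds the LAME command lines as argument lists; it does not run them. The encode options (`-b`, `-q 0`, `--preset`, `--verbose`) and file names are copied from the Appendix. The decode helper is an assumption on my part: it uses LAME's `--decode` option, and the post does not say which decoder was actually used. The function names are hypothetical.

```python
def lame_encode_command(wav_path, mp3_path, bitrate_kbps=None, preset=None):
    """Build (but do not run) a LAME encode command as an argument list."""
    if (bitrate_kbps is None) == (preset is None):
        raise ValueError("give exactly one of bitrate_kbps or preset")
    if bitrate_kbps is not None:
        # CBR at the requested bitrate with -q 0, as in the Appendix.
        options = ["-b", str(bitrate_kbps), "-q", "0"]
    else:
        # One of LAME's named presets, as in the Appendix.
        options = ["--preset", preset]
    return ["lame", *options, "--verbose", wav_path, mp3_path]


def lame_decode_command(mp3_path, wav_path):
    """Build a decode command (assumption: decoding with LAME itself)."""
    return ["lame", "--decode", mp3_path, wav_path]


source = "compilation_original.wav"
commands = [
    lame_encode_command(source, f"lame_{rate}.mp3", bitrate_kbps=rate)
    for rate in (320, 256, 128)
]
commands += [
    lame_encode_command(source, f"lame_{name}.mp3", preset=name)
    for name in ("medium", "standard", "extreme", "insane")
]
```

Each list could then be handed to `subprocess.run(command, check=True)` on a machine where `lame` is installed.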

I decoded those MP3 files back to PCM, and compared them to the original, of course making sure that everything was time- and gain-aligned. (There are some small differences in the overall level of the original file and the MP3 output – which is different for different bitrates. If I did not do this, then I would be exaggerating the differences between the original and the encoded versions – so this gain difference was calculated and compensated for, before subtracting the original from the MP3.)
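The post does not say how the time and gain alignment was done, so the following is only one plausible way to do it, not the method used here. It finds the delay by brute-force cross-correlation over a small range of lags, then picks the single gain that minimises the squared difference from the original (a least-squares fit). It is written in plain Python for clarity; for a 13-minute file you would want an array library. The names `best_lag`, `align`, and `max_lag` are hypothetical.

```python
def best_lag(reference, decoded, max_lag):
    """Lag (in samples) at which `decoded` correlates best with `reference`."""
    def correlation(lag):
        # Sum of products over the region where the two signals overlap.
        return sum(
            reference[n] * decoded[n + lag]
            for n in range(len(reference))
            if 0 <= n + lag < len(decoded)
        )
    return max(range(-max_lag, max_lag + 1), key=correlation)


def align(reference, decoded, max_lag=2000):
    """Return (aligned, lag, gain) so that `aligned` lines up with `reference`."""
    lag = best_lag(reference, decoded, max_lag)
    # Shift the decoded signal so that decoded[n + lag] sits at index n.
    shifted = [
        decoded[n + lag] if 0 <= n + lag < len(decoded) else 0.0
        for n in range(len(reference))
    ]
    # Least-squares gain: the g that minimises sum((g * shifted - reference)^2).
    numerator = sum(r * d for r, d in zip(reference, shifted))
    denominator = sum(d * d for d in shifted)
    gain = numerator / denominator if denominator else 1.0
    return [gain * d for d in shifted], lag, gain
```

With the decoded signal aligned like this, the sample-by-sample subtraction only contains what the encoder changed, rather than a level or timing offset.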

Let’s take a look at a plot of the sample values in the left channel of the beginning of the track.

Figure 1. The original (in black) and the decoded 128 kbps MP3 file.

The plot above shows the first 44100 samples in the track (the first second of sound). The red plot is the decoded 128 kbps MP3. The black plot (which is difficult to see because it is overlapped by the red plot – except in the signal peaks) is the original file. For example, if I zoom into the area around the beginning of the sound (say, starting around sample number 15800) then we see this:

Figure 2. A close-up of a portion of Figure 1.

So, as you can see in the two plots above, the decoded 128 kbps MP3 and the original 44.1/16 file are different. But, the difference is small relative to the levels of the signals themselves. The question is, how small is the difference, exactly?

We can find this out by subtracting the original signal from the decoded MP3 output, sample by sample. The result of this is shown in the plot below.

Figure 3. The difference between the two plots in Figure 2.

Notice that the vertical scale of the plot in Figure 3 is small. This is because it shows the difference between the two lines in Figure 2, which is also quite small.

Let’s think for a minute about how I arrived at the signal in Figure 3. I subtracted the Original signal from the MP3 output. In other words:

MP3 output – Original = Difference

If we consider that the difference between the MP3 output and the Original can be thought of as an “error”, and if I move the terms in the equation above, I get the following:

MP3 output – Original = Error

Original + Error = MP3 output

So, the question is: how loud is that error relative to the signal we’re listening to? The idea here is that, the louder the error, the easier it will be to detect.

Figure 4, below, shows this level difference over time. The black curve is a running RMS level of the decoded 128 kbps MP3 file. As you can see there, it ranges from about -30 dB FS to about -10 dB FS. You may think that it’s strange that it “only” goes to -10 dB FS – but this is because the time window I’m using to calculate the RMS value of the signal is 500 ms long. The peaks of the track reach full scale, but since my time window is long, this tends to pull down the apparent level (because the peaks are short). (NB: If you want to argue about the choice of a 500 ms time window, please wait until I’ve followed up this posting with another one that divides things up by frequency band…)
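To put a number on how a long window pulls the level down, here is a small made-up example (the signal is hypothetical, not taken from the test track): a 500 ms window at 44.1 kHz in which only the first 50 ms sits at full scale and the rest is silent. It assumes that an RMS value of 1.0 corresponds to 0 dB FS.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to an RMS of 1.0 (assumed to be 0 dB FS)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


sample_rate = 44100
window = sample_rate // 2   # 500 ms = 22050 samples
burst = window // 10        # hypothetical 50 ms at full scale

samples = [1.0] * burst + [0.0] * (window - burst)

peak_db = 20.0 * math.log10(max(abs(x) for x in samples))  # peak reaches full scale
level_db = rms_db(samples)                                  # roughly -10 dB FS
```

The peak is at full scale, but because the signal is only "on" for a tenth of the window, the mean square is 0.1 and the RMS level reads about 10 dB lower.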

The red curve in Figure 4 is a running RMS value of the Error signal – the difference between the MP3 file and the original. As you can see there, that error signal ranges from about -50 dB FS to about -30 dB FS, give or take…

Figure 4. Running measures of the level of the decoded 128 kbps MP3 file (in black) and the error signal (in red).
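A running RMS like the ones plotted in Figure 4 could be computed along the following lines. This is a minimal sketch, not the code behind the figure: it assumes samples are floats scaled so that full scale is ±1.0, that an RMS of 1.0 is 0 dB FS, and a plain rectangular window. The `hop` and `floor_db` parameters are my own additions.

```python
import math


def running_rms_db(samples, window, hop=1, floor_db=-120.0):
    """Sliding-window RMS level in dB (re. an RMS of 1.0), one value per hop."""
    # Prefix sums of squared samples make each window a single subtraction.
    prefix = [0.0]
    for x in samples:
        prefix.append(prefix[-1] + x * x)

    levels = []
    for start in range(0, len(samples) - window + 1, hop):
        mean_square = (prefix[start + window] - prefix[start]) / window
        if mean_square > 0.0:
            levels.append(max(10.0 * math.log10(mean_square), floor_db))
        else:
            # Digital silence: report the floor instead of taking log10(0).
            levels.append(floor_db)
    return levels
```

For the 500 ms window used here at 44.1 kHz, `window` would be 22050 samples. The same function would be applied once to the decoded MP3 and once to the error signal to get the two curves.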

We can find the running value of the difference between the level of the MP3 file and the level of the Error it contains by subtracting the red curve from the black curve. The result of this is shown in Figure 5, below.

Figure 5: The difference in level between the error signal and the decoded 128 kbps MP3 file.

Figure 5, therefore, shows the measure of how loud the signal is relative to the error that makes it different from the original. If this error signal were just harmonic distortion, then we could call this a measure of THD in dB. If it were just good-old-fashioned noise, like on a magnetic tape, then we could call it a signal-to-noise ratio. However, this is neither distortion nor noise in the traditional sense – or, maybe more accurately, it’s both…

So, let’s call the plot in Figure 5 a “signal-to-error ratio”. What we can see there is that, for this particular track, for the settings that I used to make the 128 kbps MP3 file, the error (the MP3 artefacts) is only 20 to 25 dB below the signal most of the time. Now, don’t jump to conclusions here. This does not mean that the artefacts would be as audible as white noise that is only 25 dB below the signal. This is because part of the “magic” of the MP3 encoder is that it tries to ensure that the error can “hide” under the signal by placing the error signal in the same frequency band(s) as the signal. White noise, by contrast, is spread across all frequency bands, so much of it falls outside the bands occupied by the signal, where it’s not masked and is therefore easier to hear. So, be very careful about interpreting this plot. This is a measurable signal-to-error ratio, but it cannot be directly compared to a signal-to-noise ratio.
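Putting the steps together, the signal-to-error ratio is the level of the decoded MP3 minus the level of the error, both in dB. The sketch below assumes the two signals are already time- and gain-aligned, and for brevity it uses non-overlapping blocks rather than a true running window. The function names are hypothetical.

```python
import math


def block_levels_db(samples, window):
    """RMS level in dB (re. an RMS of 1.0) of each non-overlapping block."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        block = samples[start:start + window]
        mean_square = sum(x * x for x in block) / window
        # A silent block has no defined level; use -inf rather than log10(0).
        levels.append(10.0 * math.log10(mean_square) if mean_square > 0 else -math.inf)
    return levels


def signal_to_error_db(reference, decoded, window):
    """Level of the decoded signal minus level of the error, per block, in dB."""
    # Error = MP3 output - Original, sample by sample.
    error = [d - r for d, r in zip(decoded, reference)]
    signal_db = block_levels_db(decoded, window)
    error_db = block_levels_db(error, window)
    # A block with no error at all gives +inf; a block that is silent in
    # both signals gives nan.
    return [s - e for s, e in zip(signal_db, error_db)]
```

Like the plots, this is a purely level-based number. It says nothing about where in the spectrum the error sits, which is why it cannot be read as an audibility measure.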

Let’s now increase the bitrate of the MP3 encoding, allowing the encoder to increase the quality.

Figure 6. A running RMS of a decoded 256 kbps MP3 file (black) and the difference between that signal and the original (red).

Figure 7: The Signal-to-Error ratio of a 256 kbps MP3 file.

Figures 6 and 7 show the same information as before, but for a 256 kbps encoding of the same track. As you can see there, by doubling the bitrate of the MP3, we have increased our signal-to-error ratio by about 10 to 15 dB or so – to about 35 or 40 dB.

Figure 8: A running RMS of a decoded 320 kbps MP3 file (black) and the difference between that signal and the original (red).

Figure 9: The Signal-to-Error ratio of a 320 kbps MP3 file.

As you can see in Figures 8 and 9 above, increasing the MP3 bitrate to 320 kbps can improve the Signal-to-Error ratio from about 25 dB (for 128 kbps) to about 40 dB or so.

Now, if you’re looking carefully, you might notice that, at some points in the track that I used for testing, the signal-to-error ratio is actually worse for the 320 kbps file than it is for the 256 kbps file – all other things being equal in the LAME converter parameters. This is a bit misleading, since what you cannot see there is the frequency spectrum of the error signal. I’ll deal with that in a future posting – with some more analysis and explanation to go with it.

For now, let’s play with the VBR presets in LAME. I’ll just show the signal-to-error plots for the 4 settings.

Figure 10: The Signal-to-Error ratio of an MP3 file converted using LAME’s “medium” quality preset.

Figure 11: The Signal-to-Error ratio of an MP3 file converted using LAME’s “standard” quality preset.

Figure 12: The Signal-to-Error ratio of an MP3 file converted using LAME’s “extreme” quality preset.

Figure 13: The Signal-to-Error ratio of an MP3 file converted using LAME’s “insane” quality preset.

So, as you can see in Figures 10 through 13, the signal-to-error ratio can be improved with the VBR presets, reaching a peak of over 60 dB for the “Insane” setting, for this track…

As I said a couple of times above:

You have to be careful about interpreting these graphs from a background of “knowing” what a SNR is… This error is not normal “distortion” or “noise” – at least from a perceptual point of view…
I’ll go further with this, including some frequency-dependent information in a future posting.

Appendix – LAME parameters and verbose output

For the geeks…

MAC60090:mp3_demos ggm$ lame -b 320 -q 0 --verbose compilation_original.wav lame_320.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

Using polyphase lowpass filter, transition band: 20094 Hz – 20627 Hz

Encoding compilation_original.wav to lame_320.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=0

misc:

scaling: 1

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=0

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: off

constant bitrate – CBR

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: -10 dB

adjust masking short: -11 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 1

^ amplification: 2

^ stopping: 1

ATH: using

^ type: 4

^ shape: 0 (only for type 4)

^ level adjustement: -12 dB

^ adjust type: 3

^ adjust sensitivity power: 1.000000

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB

using temporal masking effect: yes

interchannel masking ratio: 0

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 2:07/ 2:07| 2:08/ 2:08| 7.5929x| 0:00

————————————————————————————————–

kbps LR MS % long switch short %

320.0 73.7 26.3 93.4 3.4 3.1

Writing LAME Tag…done

ReplayGain: -2.6dB

MAC60090:mp3_demos ggm$ lame -b 256 -q 0 --verbose compilation_original.wav lame_256.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

Using polyphase lowpass filter, transition band: 19383 Hz – 19916 Hz

Encoding compilation_original.wav to lame_256.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (5.5x) 256 kbps qval=0

misc:

scaling: 1

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=0

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: off

constant bitrate – CBR

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: -8 dB

adjust masking short: -8.8 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 1

^ amplification: 2

^ stopping: 1

ATH: using

^ type: 4

^ shape: 1 (only for type 4)

^ level adjustement: -10 dB

^ adjust type: 3

^ adjust sensitivity power: 1.000000

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB

using temporal masking effect: yes

interchannel masking ratio: 0

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 1:50/ 1:50| 1:51/ 1:51| 8.7235x| 0:00

————————————————————————————————–

kbps LR MS % long switch short %

256.0 71.6 28.4 93.4 3.4 3.1

Writing LAME Tag…done

ReplayGain: -2.6dB

MAC60090:mp3_demos ggm$ lame -b 128 -q 0 --verbose compilation_original.wav lame_128.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

Using polyphase lowpass filter, transition band: 16538 Hz – 17071 Hz

Encoding compilation_original.wav to lame_128.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (11x) 128 kbps qval=0

misc:

scaling: 0.95

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=0

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: off

constant bitrate – CBR

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: 0 dB

adjust masking short: 0 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 2

^ amplification: 2

^ stopping: 1

ATH: using

^ type: 4

^ shape: 4 (only for type 4)

^ level adjustement: -3 dB

^ adjust type: 3

^ adjust sensitivity power: 1.000000

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB

using temporal masking effect: yes

interchannel masking ratio: 0.0002

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 1:33/ 1:33| 1:34/ 1:34| 10.305x| 0:00

————————————————————————————————–

kbps LR MS % long switch short %

128.0 25.2 74.8 95.2 2.6 2.2

Writing LAME Tag…done

ReplayGain: -2.2dB

MAC60090:mp3_demos ggm$ lame --preset medium --verbose compilation_original.wav lame_medium.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

Using polyphase lowpass filter, transition band: 17249 Hz – 17782 Hz

Encoding compilation_original.wav to lame_medium.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=4)

misc:

scaling: 1

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=1

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: all

variable bitrate – VBR mtrh (default)

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: 0 dB

adjust masking short: 0 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 1

^ amplification: 2

^ stopping: 1

ATH: using

^ type: 5

^ shape: 2 (only for type 4)

^ level adjustement: -0 dB

^ adjust type: 3

^ adjust sensitivity power: 6.309574

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=3.5 dB

using temporal masking effect: no

interchannel masking ratio: 0

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 0:18/ 0:18| 0:19/ 0:19| 53.116x| 0:00

32 [ 37] %

40 [ 4] *

48 [ 14] %

56 [ 8] %

64 [ 105] %

80 [ 423] %*

96 [ 831] %***

112 [ 2596] %%%********

128 [17134] %%%%%%%%%%%%%%%%%%%%***********************************************

160 [12811] %%%%%%%%%%%%%%%%%%%%%%%%***************************

192 [ 1330] %%****

224 [ 836] %%**

256 [ 683] %**

320 [ 216] %

——————————————————————————-

kbps LR MS % long switch short %

144.3 35.5 64.5 90.7 4.6 4.7

Writing LAME Tag…done

ReplayGain: -2.6dB

MAC60090:mp3_demos ggm$ lame --preset standard --verbose compilation_original.wav lame_standard.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

Using polyphase lowpass filter, transition band: 18671 Hz – 19205 Hz

Encoding compilation_original.wav to lame_standard.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=2)

misc:

scaling: 1

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=0

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: all

variable bitrate – VBR mtrh (default)

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: -2.6 dB

adjust masking short: -2.6 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 1

^ amplification: 2

^ stopping: 1

ATH: using

^ type: 5

^ shape: 2 (only for type 4)

^ level adjustement: -3.7 dB

^ adjust type: 3

^ adjust sensitivity power: 1.995262

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=6.25 dB

using temporal masking effect: no

interchannel masking ratio: 0

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 0:19/ 0:19| 0:20/ 0:20| 48.732x| 0:00

32 [ 0]

40 [ 0]

48 [ 1] %

56 [ 0]

64 [ 15] %

80 [ 26] %

96 [ 17] %

112 [ 135] %

128 [ 1673] %*******

160 [15048] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*****************************

192 [15688] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*****************

224 [ 1986] %%%%%****

256 [ 1602] %%%%***

320 [ 837] %%**

——————————————————————————-

kbps LR MS % long switch short %

183.0 60.0 40.0 90.7 4.6 4.7

Writing LAME Tag…done

ReplayGain: -2.6dB

MAC60090:mp3_demos ggm$ lame --preset extreme --verbose compilation_original.wav lame_extreme.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

polyphase lowpass filter disabled

Encoding compilation_original.wav to lame_extreme.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=0)

misc:

scaling: 1

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=0

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: all

variable bitrate – VBR mtrh (default)

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: -6.8 dB

adjust masking short: -6.8 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 1

^ amplification: 2

^ stopping: 1

ATH: using

^ type: 5

^ shape: 1 (only for type 4)

^ level adjustement: -7.1 dB

^ adjust type: 3

^ adjust sensitivity power: 1.000000

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=8.25 dB

using temporal masking effect: no

interchannel masking ratio: 0

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 0:21/ 0:21| 0:22/ 0:22| 44.584x| 0:00

32 [ 0]

40 [ 0]

48 [ 0]

56 [ 0]

64 [ 0]

80 [ 0]

96 [ 0]

112 [ 1] %

128 [ 0]

160 [ 408] %*

192 [ 1961] %%******

224 [16481] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%***************

256 [13387] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*************

320 [ 4790] %%%%%%%%%%%%%*******

——————————————————————————-

kbps LR MS % long switch short %

245.6 70.9 29.1 90.7 4.6 4.7

Writing LAME Tag…done

ReplayGain: -2.6dB

MAC60090:mp3_demos ggm$ lame --preset insane --verbose compilation_original.wav lame_insane.mp3

LAME 3.99.5 64bits (http://lame.sf.net)

Using polyphase lowpass filter, transition band: 20094 Hz – 20627 Hz

Encoding compilation_original.wav to lame_insane.mp3

Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=3

misc:

scaling: 1

ch0 (left) scaling: 1

ch1 (right) scaling: 1

huffman search: best (outside loop)

experimental Y=0

…

stream format:

MPEG-1 Layer 3

2 channel – joint stereo

padding: off

constant bitrate – CBR

using LAME Tag

…

psychoacoustic:

using short blocks: channel coupled

subblock gain: 1

adjust masking: -10 dB

adjust masking short: -11 dB

quantization comparison: 9

^ comparison short blocks: 9

noise shaping: 1

^ amplification: 1

^ stopping: 1

ATH: using

^ type: 4

^ shape: 0 (only for type 4)

^ level adjustement: -12 dB

^ adjust type: 3

^ adjust sensitivity power: 1.000000

experimental psy tunings by Naoki Shibata

adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB

using temporal masking effect: yes

interchannel masking ratio: 0

…

Frame | CPU time/estim | REAL time/estim | play/CPU | ETA

37028/37028 (100%)| 0:28/ 0:28| 0:28/ 0:28| 33.937x| 0:00

——————————————————————————-

kbps LR MS % long switch short %

320.0 73.7 26.3 93.4 3.4 3.1

Writing LAME Tag…done

ReplayGain: -2.6dB

david moran says:

2017/11/21 at 10:02 PM

This analysis assumes the coder preserves the waveform, a blunt assumption, and would fall apart with a coder which preserves the perceived sound but not the waveform (e.g. SE-AAC).
An expert points to some sophisticated analysis programs (PEAQ) doing a moderately good job of predicting audible changes.
It remains area of investigation.

earfluff and eyecandy

mostly audio, but with some other stuff occasionally
