One way to compare CODEC quality

I’m often asked about my opinion regarding sound quality vs. compression formats or sampling rates or bit depths or psychoacoustic CODEC’s or other things like that…

Of course, there are lots of ways to decide on such an opinion, depending on what parameters you use to define “sound quality” and therefore what it is you’re asking specifically…

One way to think of this is to consider that the original sound file is the “reference” (regardless of how “good” or “bad” it is…), and when you encode it somehow (say, by changing sampling rates, or making it an MP3 file, for example), AND that encoding makes it different, then the resulting difference from the original can be considered an error.

So, I took a compilation of tracks that I often use for listening to loudspeakers. This is about 13 minutes long and is made of excerpts of many different recordings and recording styles, ranging from anechoic female speech, through a cappella choral, orchestral music, jazz, hard rock, heavy metal, and hip hop. The original tracks were all taken from 44.1 kHz / 16-bit CD’s, and the compilation is a 44.1 kHz / 16 bit result. This is what we’ll call the “reference”.

I then used LAME to encode the compilation in different bitrates of MP3. I re-encoded as 320, 256, and 128 CBR (Constant Bit Rate). I also used the “–preset” option to make encodings in the “insane”, “extreme”, “standard”, and “medium” settings (I’ve included the details of this at the bottom in the “Appendix”). Three of these four presets are VBR – the “Insane” setting is a CBR 320 kbps with some tweaked parameters.

 

I decoded those MP3 files back to PCM, and compared them to the original, of course making sure that everything was time- and gain-aligned. (There are some small differences in the overall level of the original file and the MP3 output – which is different for different bitrates. If I did not do this, then I would be exaggerating the differences between the original and the encoded versions – so this gain difference was calculated and compensated for, before subtracting the original from the MP3.)
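The text does not say how the gain difference was calculated, so the following is only a minimal sketch of one possible approach: a least-squares fit of a single gain factor. The function names and sample values are hypothetical, and the signals are assumed to be already time-aligned.

```python
def best_fit_gain(reference, decoded):
    """Least-squares gain g that minimises sum((g * decoded - reference) ** 2)."""
    num = sum(r * d for r, d in zip(reference, decoded))
    den = sum(d * d for d in decoded)
    return num / den if den else 1.0


def gain_align(reference, decoded):
    """Scale the decoded signal so its overall level matches the reference."""
    g = best_fit_gain(reference, decoded)
    return [g * d for d in decoded]


# Hypothetical example: the decoded file is just a quieter copy of the reference.
reference = [0.0, 0.5, -0.5, 0.25]
decoded = [0.95 * x for x in reference]
aligned = gain_align(reference, decoded)
```

With a pure level change like this one, the aligned signal matches the reference again, so the level offset alone does not show up as error.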

 

Let’s take a look at a plot of the sample values in the left channel of the beginning of the track.

Figure 1. The original (in black) and the decoded 128 kbps MP3 file (in red).

The plot above shows the first 44100 samples in the track (the first second of sound). The red plot is the decoded 128 kbps MP3. The black plot (which is difficult to see because it is overlapped by the red plot – except in the signal peaks) is the original file. For example, if I zoom into the area around the beginning of the sound (say, starting around sample number 15800), then we see this:

Figure 2. A close-up of a portion of Figure 1.

So, as you can see in the two plots above, the decoded 128 kbps MP3 and the original 44.1/16 file are different. But, the difference is small relative to the levels of the signals themselves. The question is, how small is the difference, exactly?

We can find this out by subtracting the original signal from the decoded MP3 output, sample by sample. The result of this is shown in the plot below.

Figure 3. The difference between the two plots in Figure 2.

Notice that the vertical scale of the plot in Figure 3 is small. This is because it shows the difference between the two lines in Figure 2, which is also quite small.

Let’s think for a minute about how I arrived at the signal in Figure 3. I subtracted the Original signal from the MP3 output. In other words:

MP3 output – Original = Difference

If we consider that the difference between the MP3 output and the Original can be thought of as an “error”, and if I move the terms in the equation above, I get the following:

MP3 output – Original = Error

Original + Error = MP3 output
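As a small illustration of these two equations, here is a sketch of the sample-by-sample subtraction. The function name and the sample values are made up for the example.

```python
def error_signal(mp3_output, original):
    """Sample-by-sample difference: MP3 output - Original = Error."""
    if len(mp3_output) != len(original):
        raise ValueError("signals must be time-aligned and equal in length")
    return [m - o for m, o in zip(mp3_output, original)]


# Made-up sample values, only to show the arithmetic.
original = [0.10, -0.20, 0.30, 0.00]
mp3_output = [0.11, -0.18, 0.29, 0.01]

error = error_signal(mp3_output, original)

# Original + Error = MP3 output
reconstructed = [o + e for o, e in zip(original, error)]
```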

So, the question is: how loud is that error relative to the signal we’re listening to? The idea here is that, the louder the error, the easier it will be to detect.

Figure 4, below, shows this level difference over time. The black curve is a running RMS level of the decoded 128 kbps MP3 file. As you can see there, it ranges from about -30 dB FS to about -10 dB FS. You may think that it’s strange that it “only” goes to -10 dB FS – but this is because the time window I’m using to calculate the RMS value of the signal is 500 ms long. The peaks of the track reach full scale, but since my time window is long, this tends to pull down the apparent level (because the peaks are short). (NB: If you want to argue about the choice of a 500 ms time window, please wait until I’ve followed up this posting with another one that divides things up by frequency band…)

The red curve in Figure 4 is a running RMS value of the Error signal – the difference between the MP3 file and the original. As you can see there, that error signal ranges from about -50 dB FS to about -30 dB FS, give or take…
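A running RMS level with a 500 ms window could be sketched as follows. This is an illustration rather than the code used for the plots. It assumes samples normalised so that 1.0 is full scale and a convention where an RMS value of 1.0 reads 0 dB FS (another common convention puts a full-scale sine at 0 dB FS). The function name and the test tone are hypothetical.

```python
import math


def running_rms_db(samples, window):
    """RMS level in dB FS for every full window of `window` samples."""
    levels = []
    acc = sum(x * x for x in samples[:window])  # sum of squares in the window
    for start in range(len(samples) - window + 1):
        mean_square = max(acc / window, 1e-20)  # floor avoids log10(0)
        levels.append(10.0 * math.log10(mean_square))
        if start + window < len(samples):
            # Slide the window: add the next sample, drop the oldest one.
            acc += samples[start + window] ** 2 - samples[start] ** 2
    return levels


fs = 44100
window = fs // 2  # 500 ms
# One second of a 1 kHz sine whose peaks reach (almost exactly) full scale.
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]
levels = running_rms_db(tone, window)
```

Even for this steady tone, the RMS level sits about 3 dB below the peak level. For music with short peaks, the gap between peak and windowed RMS is larger still.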

Figure 4. Running measures of the level of the decoded 128 kbps MP3 file (in black) and the error signal (in red).

We can find the running value of the difference between the level of the MP3 file and the level of the Error it contains by subtracting the red curve from the black curve. The result of this is shown in Figure 5, below.

Figure 5: The difference in level between the error signal and the decoded 128 kbps MP3 file.

Figure 5, therefore, shows the measure of how loud the signal is relative to the error that makes it different from the original. If this error signal were just harmonic distortion, then we could call this a measure of THD in dB. If it were just good-old-fashioned noise, like on a magnetic tape, then we could call it a signal-to-noise ratio. However, this is neither distortion nor noise in the traditional sense – or, maybe more accurately, it’s both…

So, let’s call the plot in Figure 5 a “signal-to-error ratio”. What we can see there is that, for this particular track, for the settings that I used to make the 128 kbps MP3 file, the error – the MP3 artefacts – is only 20 to 25 dB below the signal most of the time. Now, don’t jump to conclusions here. This does not mean that they would be as audible as white noise that is only 25 dB below the signal. This is because part of the “magic” of the MP3 encoder is that it tries to ensure that the error can “hide” under the signal by placing the error signal in the same frequency band(s) as the signal. Typically, white noise is in a different band than the signal, so it’s easier to hear because it’s not masked. So, be very careful about interpreting this plot. This is a measurable signal-to-error ratio, but it cannot be directly compared to a signal-to-noise ratio.
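A signal-to-error ratio of this kind can be sketched as the signal level minus the error level, both in dB, so that a positive number means the error is quieter than the signal. For brevity, the sketch below uses consecutive (non-overlapping) windows rather than a true running window. The function names are hypothetical, and the "error" is just a scaled copy of the signal, a stand-in that is not a realistic MP3 artefact.

```python
import math


def rms_db(samples):
    """RMS level of a block of samples in dB (relative to an RMS of 1.0)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-20))


def signal_to_error_db(decoded, error, window):
    """Signal level minus error level, in dB, for consecutive windows."""
    ratios = []
    for start in range(0, len(decoded) - window + 1, window):
        sig = rms_db(decoded[start:start + window])
        err = rms_db(error[start:start + window])
        ratios.append(sig - err)
    return ratios


fs = 44100
decoded = [0.5 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]
error = [0.05 * x for x in decoded]  # stand-in error, 1/20 of the signal

ratios = signal_to_error_db(decoded, error, fs // 2)
```

An error at 1/20 of the signal amplitude gives a ratio of about 26 dB in each window, which is in the same range as the 128 kbps plot. This single number says nothing about where the error sits in frequency, which is why it cannot be read as a signal-to-noise ratio.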

Let’s now increase the bitrate of the MP3 encoding, allowing the encoder to increase the quality.

Figure 6. A running RMS of a decoded 256 kbps MP3 file (black) and the difference between that signal and the original (red).

 

Figure 7: The Signal-to-Error ratio of a 256 kbps MP3 file.

 

Figures 6 and 7 show the same information as before, but for a 256 kbps encoding of the same track. As you can see there, by doubling the bitrate of the MP3, we have increased our signal-to-error ratio by about 10 to 15 dB or so – to about 35 or 40 dB.

Figure 8: A running RMS of a decoded 320 kbps MP3 file (black) and the difference between that signal and the original (red).
Figure 9: The Signal-to-Error ratio of a 320 kbps MP3 file.

As you can see in Figures 8 and 9 above, increasing the MP3 bitrate to 320 kbps can improve the Signal-to-Error ratio from about 25 dB (for 128 kbps) to about 40 dB or so.

Now, if you’re looking carefully, you might notice that, at some points in the track that I used for testing, the signal-to-error ratio is actually worse for the 320 kbps file than it is for the 256 kbps file – all other things being equal in the LAME converter parameters. This is a bit misleading, since what you cannot see there is the frequency spectrum of the error signal. I’ll deal with that in a future posting – with some more analysis and explanation to go with it.

For now, let’s play with the presets in LAME (three of which are VBR). I’ll just show the signal-to-error plots for the 4 settings.

 

Figure 10: The Signal-to-Error ratio of an MP3 file converted using LAME’s “medium” quality preset.
Figure 11: The Signal-to-Error ratio of an MP3 file converted using LAME’s “standard” quality preset.
Figure 12: The Signal-to-Error ratio of an MP3 file converted using LAME’s “extreme” quality preset.
Figure 13: The Signal-to-Error ratio of an MP3 file converted using LAME’s “insane” quality preset.

So, as you can see in Figures 10 through 13, the signal-to-error ratio can be improved with the presets, reaching a peak of over 60 dB for the “Insane” setting (the CBR 320 kbps one), for this track…

 

 

As I said a couple of times above:

  • You have to be careful about interpreting these graphs from a background of “knowing” what a SNR is… This error is not normal “distortion” or “noise” – at least from a perceptual point of view…
  • I’ll go further with this, including some frequency-dependent information in a future posting.

 

 

Appendix – LAME parameters and verbose output

For the geeks…

 

MAC60090:mp3_demos ggm$ lame -b 320 -q 0 --verbose  compilation_original.wav lame_320.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 20094 Hz – 20627 Hz
Encoding compilation_original.wav to lame_320.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=0
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -10 dB
adjust masking short: -11 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 4
^ shape: 0 (only for type 4)
^ level adjustement: -12 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    2:07/    2:07|    2:08/    2:08|   7.5929x|    0:00
————————————————————————————————–
   kbps        LR    MS  %     long switch short %
  320.0       73.7  26.3        93.4   3.4   3.1
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame -b 256 -q 0 --verbose  compilation_original.wav lame_256.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 19383 Hz – 19916 Hz
Encoding compilation_original.wav to lame_256.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (5.5x) 256 kbps qval=0
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -8 dB
adjust masking short: -8.8 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 4
^ shape: 1 (only for type 4)
^ level adjustement: -10 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    1:50/    1:50|    1:51/    1:51|   8.7235x|    0:00
————————————————————————————————–
   kbps        LR    MS  %     long switch short %
  256.0       71.6  28.4        93.4   3.4   3.1
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame -b 128 -q 0 --verbose  compilation_original.wav lame_128.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 16538 Hz – 17071 Hz
Encoding compilation_original.wav to lame_128.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (11x) 128 kbps qval=0
misc:
scaling: 0.95
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: 0 dB
adjust masking short: 0 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 2
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 4
^ shape: 4 (only for type 4)
^ level adjustement: -3 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0.0002
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    1:33/    1:33|    1:34/    1:34|   10.305x|    0:00
————————————————————————————————–
   kbps        LR    MS  %     long switch short %
  128.0       25.2  74.8        95.2   2.6   2.2
Writing LAME Tag…done
ReplayGain: -2.2dB
MAC60090:mp3_demos ggm$ lame --preset medium --verbose  compilation_original.wav lame_medium.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 17249 Hz – 17782 Hz
Encoding compilation_original.wav to lame_medium.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=4)
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=1
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: all
variable bitrate – VBR mtrh (default)
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: 0 dB
adjust masking short: 0 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 5
^ shape: 2 (only for type 4)
^ level adjustement: -0 dB
^ adjust type: 3
^ adjust sensitivity power: 6.309574
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=3.5 dB
using temporal masking effect: no
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:18/    0:18|    0:19/    0:19|   53.116x|    0:00
 32 [   37] %
 40 [    4] *
 48 [   14] %
 56 [    8] %
 64 [  105] %
 80 [  423] %*
 96 [  831] %***
112 [ 2596] %%%********
128 [17134] %%%%%%%%%%%%%%%%%%%%***********************************************
160 [12811] %%%%%%%%%%%%%%%%%%%%%%%%***************************
192 [ 1330] %%****
224 [  836] %%**
256 [  683] %**
320 [  216] %
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  144.3       35.5  64.5        90.7   4.6   4.7
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame --preset standard --verbose  compilation_original.wav lame_standard.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 18671 Hz – 19205 Hz
Encoding compilation_original.wav to lame_standard.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=2)
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: all
variable bitrate – VBR mtrh (default)
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -2.6 dB
adjust masking short: -2.6 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 5
^ shape: 2 (only for type 4)
^ level adjustement: -3.7 dB
^ adjust type: 3
^ adjust sensitivity power: 1.995262
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=6.25 dB
using temporal masking effect: no
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:19/    0:19|    0:20/    0:20|   48.732x|    0:00
 32 [    0]
 40 [    0]
 48 [    1] %
 56 [    0]
 64 [   15] %
 80 [   26] %
 96 [   17] %
112 [  135] %
128 [ 1673] %*******
160 [15048] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*****************************
192 [15688] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*****************
224 [ 1986] %%%%%****
256 [ 1602] %%%%***
320 [  837] %%**
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  183.0       60.0  40.0        90.7   4.6   4.7
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame --preset extreme --verbose  compilation_original.wav lame_extreme.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
polyphase lowpass filter disabled
Encoding compilation_original.wav to lame_extreme.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=0)
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: all
variable bitrate – VBR mtrh (default)
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -6.8 dB
adjust masking short: -6.8 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 5
^ shape: 1 (only for type 4)
^ level adjustement: -7.1 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=8.25 dB
using temporal masking effect: no
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:21/    0:21|    0:22/    0:22|   44.584x|    0:00
 32 [    0]
 40 [    0]
 48 [    0]
 56 [    0]
 64 [    0]
 80 [    0]
 96 [    0]
112 [    1] %
128 [    0]
160 [  408] %*
192 [ 1961] %%******
224 [16481] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%***************
256 [13387] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*************
320 [ 4790] %%%%%%%%%%%%%*******
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  245.6       70.9  29.1        90.7   4.6   4.7
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame --preset insane --verbose  compilation_original.wav lame_insane.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 20094 Hz – 20627 Hz
Encoding compilation_original.wav to lame_insane.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=3
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -10 dB
adjust masking short: -11 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 1
^ stopping: 1
ATH: using
^ type: 4
^ shape: 0 (only for type 4)
^ level adjustement: -12 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:28/    0:28|    0:28/    0:28|   33.937x|    0:00
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  320.0       73.7  26.3        93.4   3.4   3.1
Writing LAME Tag…done
ReplayGain: -2.6dB

 

B&O Tech: “Auto” loudness

#76 in a series of articles about the technology behind Bang & Olufsen loudspeakers

If you look at the comments section to a posting I wrote about ABL, you’ll see a short conversation there between me and a happy Beomaster 8000 customer who said that I had made an error in making sweeping generalisations about the function of a “loudness” filter in older gear. I said that, in older gear, a loudness filter boosted the bass (and maybe the treble) with a fixed gain, regardless of listening level (also known as “the position of the volume knob”).  Henning said that this was incorrect, and that, in his Beomaster 8000, the amount of boost applied by the loudness filter was, indeed, varied with volume.

So, I dusted off one of our Beomaster 8000’s (made in the early 1980’s) to find out if he was correct.

 

The Beomaster 8000 under test. I lied when I said that I dusted it off… (Keen-eyed viewers may recognise the insides of a Beolab 90, screwed to the white board in the upper left corner of the photo. That’s used for measurement-based tests that don’t require listening… The way you can tell it’s a Beolab 90 is the circular PCB at the bottom of the board. That circle is the 72 LED’s that normally sit at the top of the loudspeaker.)

 

I sent an MLS signal to the Tape 1 input (left channel) of the Beomaster 8000, and connected a differential probe to the speaker output. (The reason for the probe was to bring the signal back down to something like a line level to keep my sound card happy…)

I set the volume to 0.1, switched the loudness filter off, and measured the magnitude response.

Then I turned the loudness filter on, and measured again.

I repeated this for volume steps 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, and 5.5. I didn’t do volume step 6.0 because this overloaded the input of my sound card and created the weird artefacts that occur when you clip an MLS signal. No matter…

Then I plotted the results, which are shown below.

 

Remember that these are NOT the absolute magnitude response curves of the Beomaster 8000. These are the DIFFERENCE between the Loudness ON and Loudness OFF at different volume settings.

At the top, you see a green line which is very, very flat. This means that, at the highest volume setting I tested (vol = 5.5) there was no difference between loudness on and off.

As you start coming down, you can see that the bass is boosted more and more, starting even at volume step 5.0 (the purple line, second from the top). At the bottom volume step (0.1), there is a nearly 35 dB boost at 20 Hz when the loudness filter is on.

You may also notice two other things in these plots. The first is the ripple in the lower curves. The second is the apparent treble boost at the bottom setting. Neither of these is actually in the signal. They are artefacts of the measurements that I did. So, you should ignore them, since they’re not there in “real life”.

 

So, Henning, I was wrong and you are correct – the Beomaster 8000 does indeed have a loudness filter that varies with volume. I stand corrected. Thanks for the info – and a fun afternoon!

B&O Tech: Distance Tweaking

#75 in a series of articles about the technology behind Bang & Olufsen loudspeakers

So, you’ve just installed a pair of loudspeakers, or a multichannel surround system. If you’re a normal person then you have not set up your system following the recommendations stated in the International Telecommunication Union’s document “Rec. ITU-R BS.775-1: MULTICHANNEL STEREOPHONIC SOUND SYSTEM WITH AND WITHOUT ACCOMPANYING PICTURE”. That document states that, in a best case, you should use a loudspeaker placement as is shown below in Figure 1.

 

Fig 1. The “ITU 775” recommendation for a 5-channel loudspeaker configuration. All loudspeakers should be matched, and be the same distance from the listening position at the angles shown in the figure.

 

In a typical configuration, the loudspeakers are NOT the same distance from the listening position – and this is a BIG problem if you’re worried about the accuracy of phantom image placement. Why is this? Well, let’s back up a little…

Localisation in the Real World

Let’s say that you and I were standing out in the middle of a snow-covered frozen pond on a quiet winter day. I stand some distance away from you and we have a conversation. When I’m doing the talking, the sound of my voice leaves my mouth and moves towards you.

If I’m directly in front of you, then the sound (in theory) arrives at both of your ears simultaneously (resulting in an Interaural Time Difference or ITD of 0 ms) and at exactly the same level (resulting in an Interaural Amplitude Difference or IAD of 0 dB). Your brain  detects that the ITD is 0 ms and the IAD is 0 dB, and decides that I must be directly in front of you (or directly behind you, or above you – at least I must be somewhere on your sagittal plane…)

If I move slightly to your left, then two things happen, generally speaking. Firstly, the sound of my voice arrives at your left ear before your right ear because it’s closer to me. Secondly, the sound of my voice is generally louder in your left ear than in your right ear, not only because it’s closer, but (mostly) because your head shadows your right ear from the sound of my voice. So, your brain detects that my voice is earlier and louder in your left ear, so I must be somewhere on your left.

Of course, there are many other, smaller cues that tell you where the sound is coming from exactly – but we don’t need to get into those details today.

There are two important things to note here. The first is that these two principal cues – the ITD and the IAD – are not equally important. If they got in a fight, the ITD would win. If a sound arrived at your left ear earlier, but was louder in your right ear, it would have to be a LOT louder in the right ear to convince you that you should ignore the ITD information…

The second thing is that the time differences we’re talking about are very very small. If I were directly to one side of you, looking directly at your left ear, say… then the sound would arrive at your right ear approximately only 700 µs – that’s 700 millionths of a second or 0.0007 seconds later than at your left ear.
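To get a feeling for where a number of this size comes from, here is a rough sketch using Woodworth’s classic spherical-head approximation for a distant source. This model is not part of the discussion above. The head radius of 8.75 cm is an assumed typical value, the speed of sound of 344 m/s is also an assumption, and the function name is hypothetical. The formula is only meant for azimuths between 0º and 90º.

```python
import math


def woodworth_itd_us(azimuth_deg, head_radius_m=0.0875, c=344.0):
    """Approximate ITD in microseconds for a distant source (0..90 degrees).

    Woodworth's spherical-head model: ITD = (r / c) * (theta + sin(theta)).
    """
    theta = math.radians(azimuth_deg)
    return head_radius_m / c * (theta + math.sin(theta)) * 1e6


itd_front = woodworth_itd_us(0.0)   # source straight ahead
itd_side = woodworth_itd_us(90.0)   # source directly to one side
```

With these assumed values, a source directly to one side gives roughly 650 µs, which is in the same ballpark as the “approximately 700 µs” above. A real head is not a sphere, so the exact number varies from person to person.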

So, the moral of this story so far is that we are very sensitive to differences in the time of arrival of a sound at our two ears.

Localisation in a reproduced world

Now go back to the same snow-covered frozen lake with a pair of loudspeakers instead of bringing me along, and set them up in a standard stereo configuration, where the listening position and the two loudspeakers form an equilateral triangle. This means that when you sit and listen to the signals coming out of the loudspeakers:

  • the two loudspeakers are the same distance from the listening position, and
  • the left loudspeaker is 30º to the left of front-centre, and the right loudspeaker is 30º to the right of front-centre.

Have a seat and we’ll play some sound. To start, we’ll play the same sound in both loudspeakers at exactly the same time, and at exactly the same level. Initially, the sound from the left loudspeaker will reach your left ear, and the sound from the right loudspeaker reaches your right ear. A very short time later the sound from the left loudspeaker reaches your right ear and the sound from the right loudspeaker reaches your left ear (this effect is called Interaural Crosstalk – but that’s not important). After this, nothing happens, because you are sitting in the middle of a frozen lake covered in snow – so there are no reflections from anything.

Since the sounds in the two loudspeakers are identical, then the sounds in your ears are also identical to each other. And, just as is the case in real-life, if the sounds in your two ears are identical, you’ll localise the sound source as coming from somewhere on your sagittal plane. Due to some other details in the localisation cues that we’re not talking about here, chances are that you’ll hear the sound as originating from a position directly in front of you – between the two loudspeakers.

Because the apparent location of that sound is a position where there is no loudspeaker, it’s like a ghost – so it’s called a “phantom centre” image.

That’s the centre image, but how do we move the image slightly to one side or the other? It’s actually really easy – we just need to remember the effects of ITD and IAD, and do something similar.

So, if I play a sound out of both loudspeakers at exactly the same time, but I make one loudspeaker slightly louder than the other, then the phantom image will appear to come from a position that is closer to the louder loudspeaker. So, if the right channel is louder than the left channel, then the image appears to come from somewhere on the right. Eventually, if the right loudspeaker is louder enough (about 15 dB, give or take), then the image will appear to be in that loudspeaker.

Similarly, if I were to keep the levels of the two loudspeakers identical, but I were to play the sound out of the right loudspeaker a little earlier instead, then the phantom image will also move towards the earlier loudspeaker.

There have been many studies done to find out exactly what apparent phantom image position results from  exactly what level or delay difference between the two loudspeakers (or a combination of the two). One of the first ones was done by Gert Simonsen in 1983, in which he found the following results.

 

Image Position    Amplitude difference    Time difference
0º                0.0 dB                  0.0 ms
10º               2.5 dB                  0.2 ms
20º               5.5 dB                  0.44 ms
30º               15.0 dB                 1.12 ms

 

Note that this test was done with loudspeakers at ±30º – so the bottom line of the table means “in one of the loudspeakers”. Also, I have to be clear that the values in this table are NOT to be used concurrently. So, this shows the values that are needed to produce the desired phantom image location using EITHER amplitude differences OR time differences.

Again, the same two important points apply.

Firstly, the time differences are a more “powerful” cue than the amplitude differences. In other words, if the left loudspeaker is earlier, but the right loudspeaker is louder, you’ll hear the phantom image location towards the left, unless the right loudspeaker is a LOT louder.

Secondly, you are VERY sensitive to time differences. The left loudspeaker only needs to be 1.12 ms earlier than the right loudspeaker in order for the phantom image to move all the way into that loudspeaker. That’s equivalent to the left loudspeaker being about 38.5 cm closer than the right loudspeaker (because the speed of sound is about 344 m/s (depending on the temperature) and 0.00112 * 344 = 0.385 m).
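The conversion from a time difference to an equivalent difference in distance is simple enough to sketch in a few lines. The function name is hypothetical. The speed of sound is the 344 m/s used above, and the delay values are the ones from the table.

```python
SPEED_OF_SOUND_M_PER_S = 344.0  # depends on temperature


def delay_to_distance_m(delay_ms, c=SPEED_OF_SOUND_M_PER_S):
    """Distance that sound travels in `delay_ms` milliseconds."""
    return delay_ms / 1000.0 * c


# Time differences from the table, converted to distance differences.
for angle_deg, delay_ms in [(0, 0.0), (10, 0.2), (20, 0.44), (30, 1.12)]:
    distance_cm = delay_to_distance_m(delay_ms) * 100.0
    print(f"{angle_deg:2d} deg: {delay_ms:.2f} ms = {distance_cm:.1f} cm")
```

The last line of the output corresponds to the 38.5 cm mentioned above. The smaller time differences in the table correspond to proportionally smaller distance errors.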

Those last two paragraphs were the “punch line” – if the distances to the loudspeakers are NOT the same, then, unless you do something about it, you’ll wind up hearing your phantom images pulling towards the closer loudspeaker. And it doesn’t take much of an error in distance to produce a big effect.

 

Whaddya gonna do about it?

Almost every surround processor and Audio Video Receiver in the world gives you the option of entering the Speaker Distances in a menu somewhere. There are two possible reasons for this.

The first is not so important – it’s to align the sound at the listening position with the video. If you’re sitting 3 m from the loudspeakers and the TV, then the sound arrives 8.7 ms after you see the picture (the same is true if you are listening to a person speaking 3 m away from you). To eliminate this delay, the loudspeakers could produce the sound 8.7 ms too early, and the sound would reach you at the same time as you see the video. As I said, however, this is not a problem to lose much sleep over, unless you sit VERY far away from your television.

The second reason is very important, as we’ve already seen. If, as we established at the start of this posting, you’re a normal person, then your loudspeakers are not all the same distance from the listening position. This means that you should apply a delay to the closer loudspeaker(s) to get them to “wait” for the sound as it travels towards you from the further loudspeakers. That way, if you have the same sound in all channels at the same time, then the loudspeakers do NOT produce it at the same time, but it arrives at the listening position simultaneously, as it should.

Problem solved! Right?

Wrong.

Corrections that need correcting

Let’s make a configuration of a pair of loudspeakers and a listening position that is obviously wrong.

Fig 2. A stereo pair of loudspeakers and a listening position that is nowhere near the correct location. Notice that the right loudspeaker is much closer than the left.

Figure 2 shows the example of a very bad loudspeaker configuration for stereo listening. (I’m keeping things restricted to two channels to keep things simple – but multichannel is the same…) The right loudspeaker is much closer than the left loudspeaker, so all phantom images will appear to “bunch together” into the right loudspeaker.

Fig 4. Measuring the distance to the furthest loudspeaker from the listening position

So, to do the correction, you measure the distances to the two loudspeakers from the listening position and enter those two values into the surround processor. It then subtracts the smaller distance from the larger distance, converts that to a delay time, and delays the closer loudspeaker by that amount to compensate for the difference.
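The subtraction described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code running in any particular processor: the function name and the example distances are invented here, and the speed of sound is the approximate 344 m/s used earlier in this posting.

```python
SPEED_OF_SOUND_M_S = 344.0  # approximate; depends on temperature


def alignment_delays_ms(distances_m, speed=SPEED_OF_SOUND_M_S):
    """Delay (in ms) to add to each loudspeaker so that every signal
    arrives together with the one from the furthest loudspeaker."""
    furthest = max(distances_m)
    return [1000.0 * (furthest - d) / speed for d in distances_m]


# Hypothetical distances: left loudspeaker 3.0 m away, right one 2.0 m away.
left_delay_ms, right_delay_ms = alignment_delays_ms([3.0, 2.0])
```

Here the furthest (left) loudspeaker gets no delay, and the closer (right) one is held back by a little under 3 ms.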

Fig 5. The gray circle shows the apparent position of the right loudspeaker, after a delay has been applied to it, assuming that there are no other cues (such as level or reflections in the room) to tell you where it is.

So, after the delay is applied to the closer loudspeaker, in theory, you have a stereo pair of loudspeakers that are equidistant from the listening position. This means that, instead of hearing (for example) the phantom centre image in the closer loudspeaker, you’ll hear it as being positioned at the centre point between the distant loudspeaker (the left one, in this example) and the “virtual” one (the right one in this example). This is shown below.

Fig 6. The small grey dot shows the theoretical position of the resulting phantom centre after the two loudspeakers have been time-aligned using delays based on distance to the listening position.

As you can see in Figure 6, the resulting phantom image is at the centre point between the two resulting loudspeakers. But, if you look not-too-carefully-at-all, then you can see that the angle from the listening position to that centre point is not the same angle as the centre point between the two REAL loudspeakers (the black dot).

Fig 7. Notice that the corrected phantom image location (indicated by the arrow) is not the same as the desired phantom centre. (which might be, for example, the centre of a television…)

So, this means that, if you use distances ONLY to time-align two (or more) loudspeakers, then your correction will not be perfect. And, the more incorrect your actual loudspeaker configuration, the more incorrect the correction will be.
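To put some numbers on this, here is a small Python sketch of the geometry in Figures 5 to 7. The coordinates are made up for illustration (the listener is at the origin, facing along the y-axis), and it assumes, as the figures do, that the delayed loudspeaker appears to sit further back along the same line of sight.

```python
import math


def angle_deg(point):
    """Angle of a point seen from the listener at (0, 0): 0 is straight
    ahead (+y), positive is to the right."""
    return math.degrees(math.atan2(point[0], point[1]))


def midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


# Hypothetical positions in metres; the right loudspeaker is much closer.
left = (-1.5, 2.5)
right = (1.0, 1.0)

# After the delay, the right loudspeaker appears to sit at the same
# distance as the left one, but in the same direction as before.
scale = math.hypot(*left) / math.hypot(*right)
virtual_right = (right[0] * scale, right[1] * scale)

# Centre point between the two real loudspeakers (the desired image).
desired_angle = angle_deg(midpoint(left, right))
# Centre point between the left and the "virtual" right loudspeaker.
corrected_angle = angle_deg(midpoint(left, virtual_right))
```

With these made-up positions, the corrected image lands to the right of the desired centre point, in other words on the side of the closer loudspeaker.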

How do I fix it?

Notice that, after “correction”, the phantom image is still pulling towards the closer loudspeaker.

As we saw above, in order to push a phantom centre image towards a loudspeaker, you have to make the sound in that loudspeaker earlier.

So, what we need to do, after the distance-based time alignment is done, is to force the more distant loudspeaker to be a little earlier than the closer one. That will pull the phantom image towards it.

In order to use a distance compensation to make a loudspeaker produce the sound earlier, we have to tell the processor that it’s further away than it actually is. This makes the processor “think” that it needs to send the sound out early to compensate for the extra propagation delay caused by the distance.

So, to make the further loudspeaker a little early relative to the other loudspeaker, we either have to tell the processor that it’s further away from the listening position than it really is, or we reduce the reported distance to the closer loudspeaker to delay it a little more.

This means that, in the example shown in Figure 7, above, we should add a little to the distance to the left loudspeaker before entering the value in the menus, or subtract a little from the distance to the right loudspeaker instead.

How much is enough?

You might, at this point, be asking yourself “Why can’t this be done automatically? It’s just a little trigonometry, after all…”

If things were as simple as I’ve described here, then you’d be right – the math that is converting distance compensation to audio delays could include this offset, and everything would be fine.

The problem is that I’ve over-simplified a little on the way through. For example, not everyone hears exactly a 10º shift in phantom image with a 2.5 dB inter-channel amplitude difference. Those numbers are the average of a listening test with a number of subjects. Also, when other researchers have done the same test, they get slightly different results. (see this page for information).

Also, the directivity of the loudspeaker will have an influence (that is likely going to be frequency-dependent). So, if you’ve “toed in” your loudspeakers, then (in the example above) the further one will be “aimed” at you better than the closer one, which will have an influence on the perceived location of the phantom centre.

So, the only way to really do the final “tweaking” or “fine tuning” of the distance-compensation delays is to do it by listening.

Normally, I start by entering the distances correctly. Then, while sitting in the listening position, I use a monophonic track (Suzanne Vega singing “Tom’s Diner” works well) and I increase the distance in the surround processor’s menu of the loudspeaker that I want to pull the image towards. In other words, if the phantom centre appears to be located too far to the left, I “lie” to the surround processor and tell it that the right loudspeaker is further by 10 cm. I keep adding distance until the image is moved to the correct location.
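As a rough guide to how big each step is, the following Python sketch converts a distance “lie” into the timing change it causes. It assumes the processor simply converts distance differences to delays using 344 m/s; a real processor may round or quantise its delays differently.

```python
SPEED_OF_SOUND_M_S = 344.0  # approximate; depends on temperature


def timing_shift_ms(extra_distance_m, speed=SPEED_OF_SOUND_M_S):
    """How much earlier (in ms) a loudspeaker plays, relative to the
    others, when its entered distance is increased by extra_distance_m."""
    return 1000.0 * extra_distance_m / speed


shift_for_10_cm = timing_shift_ms(0.10)
```

Under that assumption, each 10 cm added to a loudspeaker’s distance makes it roughly 0.29 ms earlier, which is about a quarter of the 1.12 ms that moves the image all the way into one loudspeaker.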

B&O Tech: Directivity and Reflections

#73 in a series of articles about the technology behind Bang & Olufsen loudspeakers

The setup

Let’s start by inventing a loudspeaker. It has a perfectly flat on-axis response in a free field. This means that if you send a signal into it, then it doesn’t cause any particular frequency to sound louder or quieter than the others when you measure it in an infinite space that is free of reflections.

We’ll also say that it has a perfectly omnidirectional directivity. This means that the loudspeaker has the same behaviour in all directions – there is no “front” or “back” – sound goes everywhere identically.

Let’s then put that loudspeaker in a strange room that has only two walls – the left wall and the front wall – and these extend to infinity. We’ll put the loudspeaker, say 1 m from the left wall and 70 cm from the front wall. These are completely arbitrary values, but they’re not weird… Finally, we’ll sit 3 m away from the loudspeaker, as if we were set up to listen to it as the left front loudspeaker in a stereo pair.

A floorplan of that setup is shown below in Figure 1.

Fig 01: The centre of the red circle shows the location of the loudspeaker, and the circle itself represents the fact that the loudspeaker is omnidirectional. The blue lines are the walls, and the blue asterisk is the listening position.

If the two walls were completely absorptive, then there would be no energy reflected from them. If we were to replace the loudspeaker with a light bulb, then the equivalent would be to paint the walls flat black so no light would be reflected. In this theoretically perfect case, then the impulse response and the magnitude response of the loudspeaker at the listening position would be the same as in a free field, since there are no reflections. These would look like the plots in Figure 2.

Fig 2. The impulse response and the magnitude response of the arrangement shown in Figure 1. Note that it takes approximately 9 ms for the sound to reach the listening position, and that there are no reflections after that. The magnitude response is completely flat, but has an overall gain of almost -10 dB since it is measured relative to a reference distance of 1 m.
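The two numbers in that caption follow from the 3 m distance alone. Here is a quick check in Python, assuming a speed of sound of 344 m/s and an ideal point source, for which sound pressure is inversely proportional to distance:

```python
import math

SPEED_OF_SOUND_M_S = 344.0  # assumed value; depends on temperature


def arrival_time_ms(distance_m, speed=SPEED_OF_SOUND_M_S):
    """Time for the sound to travel distance_m, in milliseconds."""
    return 1000.0 * distance_m / speed


def level_re_1m_db(distance_m):
    """Level relative to the level at 1 m for an ideal point source
    in a free field (pressure proportional to 1 / distance)."""
    return 20.0 * math.log10(1.0 / distance_m)


direct_time_ms = arrival_time_ms(3.0)   # about 8.7 ms
direct_level_db = level_re_1m_db(3.0)   # about -9.5 dB
```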

 

Through the looking glass

Imagine that you’re standing outdoors on a moonless night, and the only things you have with you are a lightbulb (that is magically lit) and a mirror. You’ll be able to see two light bulbs – the real one, and the one that is reflected by the mirror. If there is really no other light and no other objects, then you won’t even know that it’s a mirror, and you’ll just see two light bulbs (unless, of course, you can see yourself as well…)

In 1929, an acoustical physicist working at Bell Laboratories named Carl F. Eyring presented a new idea to the Acoustical Society of America. He was trying to calculate the reverberation time in “dead” rooms by considering that the walls were perfect mirrors, and that instead of thinking of sound sources and reflections, you could just pretend that the walls didn’t exist, and that the reflections were actually just images of other sound sources on the other side of the wall (just like that second light bulb in the example above…)

Fig 3: Two conceptual diagrams showing (in a perfect world) identical systems. On the left is the direct sound (the black arrow) and the reflected sound (the red arrow) reaching the microphone. On the right, we see the direct sound coming from the “real” loudspeaker, and the reflection coming from an image of that loudspeaker on the other side of the wall – as if the wall were not there.

This method of simulating and predicting acoustical behaviour in rooms, now called the “image model”, has been used by many people over the decades. Eyring published a paper describing it in 1930, and it has since become a standard method, both for prediction and for acoustical simulation (first proposed by Allen and Berkley in 1979).
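In code, building an image source is just a reflection of the loudspeaker’s coordinates in the wall. The Python sketch below does this for the left wall of the room in Figure 1. The loudspeaker position comes from the text, but the listening position is an assumption made for this sketch (3 m away, with the loudspeaker 30° to the left of straight ahead, as in a standard stereo setup), so the numbers will not necessarily match the plots exactly. The speed of sound of 344 m/s is also assumed.

```python
import math

SPEED_OF_SOUND_M_S = 344.0  # assumed value; depends on temperature

# Coordinates are (distance from left wall, distance from front wall) in metres.
loudspeaker = (1.0, 0.7)  # from the text

# ASSUMPTION: the listener is 3 m away, with the loudspeaker 30 degrees to
# the left of straight ahead. The text does not give this position.
angle = math.radians(30.0)
listener = (loudspeaker[0] + 3.0 * math.sin(angle),
            loudspeaker[1] + 3.0 * math.cos(angle))


def mirror_in_left_wall(point):
    """Image of a point in the left wall (the line x = 0)."""
    return (-point[0], point[1])


def path_length(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


image = mirror_in_left_wall(loudspeaker)
direct_m = path_length(loudspeaker, listener)
reflected_m = path_length(image, listener)

# How much later the reflection arrives than the direct sound.
extra_delay_ms = 1000.0 * (reflected_m - direct_m) / SPEED_OF_SOUND_M_S
# Level of the reflection relative to the direct sound (1/distance law).
reflection_level_db = 20.0 * math.log10(direct_m / reflected_m)
```

Mirroring in the front wall works the same way, with the sign of the second coordinate flipped instead.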

 

The effects of one sidewall reflection

Let’s use the image model to do a very basic prediction of what will happen to our impulse and magnitude responses if we have a single reflection from the left-hand wall.

Fig 4: An image model representation of the same loudspeaker shown in Figure 1, with a perfectly reflective sidewall on the left.
Fig. 5: The impulse response and magnitude response of the resulting signal at the listening position. Notice that the direct sound is identical to that shown in Figure 2, but now there is an additional reflection that arrives about 5 ms later, and quieter, since the path the reflection takes is longer. The resulting magnitude response is a classic “comb filter”, so-called because it looks like a hair comb if you plot it on a linear frequency scale.

As can be seen in Figure 5, the resulting magnitude response of an omnidirectional loudspeaker with a single, perfect reflection certainly has some noticeable artefacts. If the listening position were closer to the loudspeaker, the artefacts would be smaller, since the reflected signal would be quieter relative to the direct sound. The further away you get, the more similar the two path lengths become, and therefore the bigger the effect on the summed signal.
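The comb filter itself is easy to reproduce: add a delayed, scaled copy of a signal to the original and look at the magnitude of the sum. In the Python sketch below, the 5 ms delay is taken from Figure 5, while the reflection gain of 0.7 is only a placeholder, since the exact value isn’t stated.

```python
import cmath
import math


def comb_magnitude_db(freq_hz, delay_s, reflection_gain):
    """Magnitude (dB) of direct sound plus one reflection that arrives
    delay_s later at reflection_gain times the direct level.
    (A gain of exactly 1.0 gives infinitely deep dips, which would make
    the logarithm fail, so keep it below 1.)"""
    total = 1.0 + reflection_gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return 20.0 * math.log10(abs(total))


def dip_frequencies_hz(delay_s, count):
    """The dips sit at odd multiples of 1 / (2 * delay)."""
    return [(2 * k + 1) / (2.0 * delay_s) for k in range(count)]


delay_s = 0.005          # "about 5 ms", from Figure 5
reflection_gain = 0.7    # placeholder value, not from the text

dips = dip_frequencies_hz(delay_s, 3)  # roughly 100, 300 and 500 Hz
dip_db = comb_magnitude_db(dips[0], delay_s, reflection_gain)
peak_db = comb_magnitude_db(2 * dips[0], delay_s, reflection_gain)
```

A smaller reflection_gain leaves the dips at the same frequencies but makes both the dips and the peaks smaller.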

Of course, this is an unrealistic simulation, since everything is “perfect” – perfect reflection, perfectly omnidirectional loudspeaker with a perfectly flat magnitude response, and so on… However, for the purposes of this posting, that’s good enough.

Let’s now change the directivity of the loudspeaker to alter the balance of level between the direct and the reflected sounds. We’ll make the loudspeaker’s beam width narrower, giving it the same behaviour as a cardioid microphone (which is called a cardioid because a polar plot of its directivity pattern looks like a heart – cardiovascular and cardioid have the same root).

Fig. 6: The same arrangement as shown in Figure 4, but with more directional loudspeakers.
Fig. 7: The impulse and magnitude responses of the arrangement shown in Figure 6.

If you look at Figure 7, you’ll see that the times of arrival of the two signals have not changed, but that the effect of the artefact in the frequency domain is reduced (the peaks and dips are smaller). The frequencies of the peaks and dips are the same as in Figure 5 because those are determined by the delay difference between the two spikes in the impulse response. The peaks and dips are smaller because the reflected sound is quieter (because the image loudspeaker, which is the source of the reflected signal, is beaming in a different direction).

Let’s try a different directivity pattern – a dipole, which has a polar pattern that looks like a figure “8”.

Fig 8: A dipole loudspeaker and its image, in the same positions as in Figure 4.
Fig 9: The impulse and magnitude responses of the arrangement shown in Figure 8.

Notice now that, because the listening position is almost perfectly in line with the “null” – the “dead zone” of the reflected loudspeaker, there is almost nothing to reflect. Consequently, there is very little effect on the on-axis magnitude response of the loudspeaker, as can be seen in the magnitude response in Figure 9.

So, the moral of the story so far is that without moving the loudspeaker or the listening position, or changing the wall’s characteristics, the time response and magnitude response of the loudspeaker at the listening position is heavily dependent on the loudspeaker’s directivity.
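To see roughly where those differences come from, the Python sketch below evaluates the three textbook directivity patterns for the sidewall image. It rests on several assumptions that are not in the text: the loudspeaker is aimed straight at the listening position, the listening position is 3 m away with the loudspeaker 30° to the left of straight ahead, and the cardioid and dipole are the ideal first-order patterns 0.5 (1 + cos θ) and cos θ.

```python
import math

# Coordinates are (distance from left wall, distance from front wall) in metres.
loudspeaker = (1.0, 0.7)  # from the text

# ASSUMPTION: listener 3 m away, loudspeaker 30 degrees left of straight ahead.
angle = math.radians(30.0)
listener = (loudspeaker[0] + 3.0 * math.sin(angle),
            loudspeaker[1] + 3.0 * math.cos(angle))

# ASSUMPTION: the loudspeaker is aimed directly at the listener.
aim = (listener[0] - loudspeaker[0], listener[1] - loudspeaker[1])

# Mirror both the position and the aim in the left wall (x = 0).
image = (-loudspeaker[0], loudspeaker[1])
image_aim = (-aim[0], aim[1])


def cos_off_axis(source, aim_vector, target):
    """Cosine of the angle between a source's aim and the direction to target."""
    dx, dy = target[0] - source[0], target[1] - source[1]
    dot = aim_vector[0] * dx + aim_vector[1] * dy
    return dot / (math.hypot(*aim_vector) * math.hypot(dx, dy))


cos_theta = cos_off_axis(image, image_aim, listener)

# Ideal directivity gains (linear) of the image towards the listener,
# not including the extra loss from the longer path.
directivity_gain = {
    "omnidirectional": 1.0,
    "cardioid": 0.5 * (1.0 + cos_theta),
    "dipole": cos_theta,
}
off_axis_deg = math.degrees(math.acos(cos_theta))
```

With these assumptions, the listening position is a little over 83° off the image’s axis. That is close to the dipole’s null at 90°, which is consistent with how little of the sidewall reflection survives in Figure 9.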

 

Two reflections

Let’s continue the experiment, making the front wall reflective as well.

Fig 10: An image model representation of two reflections and an omnidirectional loudspeaker.

 

Fig 11. The impulse and magnitude responses at the listening position with two reflections and an omnidirectional loudspeaker, as shown in Figure 10. Note that the second impulse (the first reflection after the direct sound) is the one from the front wall, since that image is closer to the listening position.

 

Fig 12. An image model representation of two reflections and a loudspeaker with a cardioid directivity.

 

Fig 13. The artefacts on the magnitude response are considerably less for a cardioid loudspeaker than with an omnidirectional, since, as can be seen, the reflected signals are considerably quieter due to the angle of the listening position with respect to the rotation of the image loudspeakers.

 

Fig 14: An image model representation of two reflections and a loudspeaker with a dipole directivity.

 

Fig 15: The artefacts on the magnitude response are different when the loudspeaker is a dipole. Notice that the sidewall reflection is much quieter than the front wall reflection, as we saw already in Figure 9.

In Figure 15, one additional effect can be seen. Since the reflection off the front wall is negative (in other words, it “pulls” when the direct sound “pushes”) due to the behaviour of the dipole, there is a cancellation in the low frequencies, causing a drop in level in the low end. If we were to push the loudspeaker closer to the front wall, this effect would become more and more obvious.
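A rough way to see this is to look at the low-frequency limit, where the delay between the direct sound and the reflection is negligible compared to the period, so the two gains simply add (and a negative reflection subtracts). The Python sketch below does this for an ideal dipole and the front wall, under assumptions that are not in the text: the loudspeaker is aimed at the listener, who is 3 m away with the loudspeaker 30° to the left of straight ahead, and the wall is a perfect reflector.

```python
import math


def dipole_low_frequency_level_db(front_wall_distance_m,
                                  listener_distance_m=3.0,
                                  listener_angle_deg=30.0):
    """Level (dB re. direct sound alone) of an ideal dipole plus its
    front-wall reflection as frequency approaches 0 Hz."""
    angle = math.radians(listener_angle_deg)
    # Vector from the loudspeaker to the listener (also the aim direction).
    dx = listener_distance_m * math.sin(angle)
    dy = listener_distance_m * math.cos(angle)
    # Vector from the image to the listener: the image sits
    # 2 * front_wall_distance_m further away, behind the front wall.
    ix, iy = dx, dy + 2.0 * front_wall_distance_m
    image_distance_m = math.hypot(ix, iy)
    # The image's aim is mirrored in the wall: (dx, -dy).
    cos_theta = (dx * ix - dy * iy) / (listener_distance_m * image_distance_m)
    # Dipole gain (cos theta, negative here) times the 1/distance loss.
    reflection_gain = cos_theta * listener_distance_m / image_distance_m
    return 20.0 * math.log10(abs(1.0 + reflection_gain))


level_at_70_cm = dipole_low_frequency_level_db(0.7)
level_at_20_cm = dipole_low_frequency_level_db(0.2)
```

With these assumptions, the low-frequency level drops by roughly 5 dB with the loudspeaker 70 cm from the front wall, and by a little more when it is moved to 20 cm.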

 

The moral of the story is…

Of course, this is all very theoretical. However, it should give you an idea of three things.

The first is a simple method of thinking about reflections. You can use the Image Model method to imagine that your walls are mirrors, and you can “see” the other loudspeakers on the other sides of those mirrors. Those images are where your reflections are coming from.

The second is the obvious point – that the summed magnitude response of a loudspeaker’s direct sound and its reflections depends on many things, the directivity being one of them.

The third is possibly the most important. All three of the loudspeaker models I’ve used here have razor-flat on-axis responses in a free field. So, if you were trying to decide which of these three loudspeakers to buy, you’d look at their “frequency response” plots or data and see that all of them are flat to within 0.0001 dB from 1 Hz to infinity Hz, and you’d think that they’d all sound “the same” under the same conditions. However, nothing could be further from the truth. These three loudspeakers with identical on-axis responses will sound completely different. This does not mean that an on-axis magnitude response is useless. It only means that it’s useless in the absence of other information such as the loudspeaker’s power response or its frequency-dependent directivity.

To keep things simple, I have not included frequency-dependent directivity effects. I may do that some day – but beware that it is not enough to say “the loudspeaker beams at higher frequencies so I don’t have to worry about it up there” because that’s not necessarily true – it’s different from loudspeaker to loudspeaker.

This also means that none of the plots I’ve shown here can be used to conclude anything about the real world. All it’s good for is to get a conceptual, intuitive idea of what’s going on when you put a loudspeaker near a wall.

One final comment: the microphone that I’m simulating here has an omnidirectional characteristic. This means that it is as sensitive to the reflected sound as it is to the direct sound, since the angle of incidence of the sound is irrelevant. The way we humans perceive sound is different. We do not perceive the comb filter that the microphone sees when the reflection is coming in from our side, since this is information that is recognised by the brain as being reflected – and it’s used to determine the distance to the sound source. However, if you plug one ear, you may notice that things sound more like you see in the plots, since you lose part of your ability to localise the direction of the signals in the horizontal plane.

 

For more reading…

Allen, J.B., & Berkley, D.A. (1979) “Image Method for Efficiently Simulating Small-Room Acoustics,” Journal of the Acoustical Society of America, 65(4): 943-950, April.

Eyring, C.F. (1930) “Reverberation Time in ‘Dead’ Rooms,” Journal of the Acoustical Society of America, 1: 217-241.

Gibbs, B.M., & Jones, D.K., (1972) “A Simple Image Method for Calculating the Distribution of Sound Pressure Levels Within an Enclosure,” Acustica, 26: 24-32.