Typical Errors in Digital Audio: Part 1

Introduction

Once upon a time, when I was a young whipper snapper, studying how to be a recording engineer (which is half of being a tonmeister) I had a textbook on sound recording. There were chapters in there on musical instruments, acoustics, microphones, mixing consoles, magnetic tape, and so on.. There was also a section on something called “digital audio” – but it was a portion of the chapter titled “Noise Reduction”.

Fast-forward a couple of years to 1983 and a new technology hit the market called “Compact Disc” (Here’s a fun fact for impressing people at your next dinner party: The “c” at the end of “disc” means it’s an optical medium. If it were magnetic, it would be a “disk”. So: Compact Disc, but Hard Disk.) Back then, the magazine advertisement read “Perfect Sound. Forever.” Then it hit the real world and the complaints started rolling in from people who believed that they knew things about audio. Some of these complaints were valid, and some were less so… Many of the ones that were valid no longer are, but it’s difficult to un-do a first impression.

Nowadays, it is very likely that almost-all-to-all of the music you listen to has been digital at some point in its life. Even if you’re listening to vinyl, it should not surprise you to know that the master version of the recording you’re hearing was probably stored on a hard disk or passed through a digital mixing console – or at least some of the tracks included some kind of digital processing (say, a guitar pedal or a reverb unit, for example). (I know, I know… There are exceptions. However, if you want to send me anti-digital hate mail you may not do it using a digital communication format such as e-mail. Use an analogue pen to write out your words on a piece of paper and send it to me by post. I look forward to receiving your analogue letters.)

Nowadays, a big part of my “day job” is to test (digital) audio systems to find out what’s wrong with them. So, I thought it would be interesting to do a series of postings that describe the typical kinds of errors that I look for (and find) when I’m digging down into the details.

In order to do this, I’m going to start by being a little redundant and describe the basics of how audio is converted from an analogue signal to a digital one – and hopefully address some of the misconceptions that are associated with this conversion process.

A quick introduction to sound

At the simplest level, sound can be described as a small change in air pressure (or barometric pressure) over short periods of time. If you’d like to have a better and more edu-tain-y version of this statement with animations and pretty colours, you could take 10 minutes to watch this video, for example.

That change in pressure can be “captured” by using a microphone, that is (at the simplest level) a device that has a change in air pressure at its input and a change in electrical voltage at its output. Ignoring a lot of details, we could say that if you were to plot a measurement of the air pressure (at the input of the microphone) over time, and you were to compare it to a plot of the measurement of the voltage (at the output of the microphone) over time, you would see the same curve on the two graphs. This means that the change in voltage is analogous to the change in air pressure.

 

Fig 1. Notice that (in theory, and ignoring a lot of things…) the change in air pressure over time at the input of the microphone is identical to the change in voltage over time at its output. Of course, this is not true in real life – microphones lie like a cheap rug…

At this point in the conversation, I’ll make a point to say that, in theory, we could “zoom in” on either of those two curves shown in Figure 1 and see more and more details. This is like looking at a map of Canada – it has lots of crinkly, jagged lines. If you zoom in and look at  the map of Newfoundland and Labrador, you’ll see that it has finer, crinkly, jagged lines. If you zoom in further, and stand where the water meets the shore in Trepassey and take a photo of your feet, you could copy it to draw a map of the line of where the water comes in around the rocks – and your toes – and you would wind up with even finer, crinkly, jagged lines… You could take this even further and get down to a microscopic or molecular level – but you get the idea… The point is that, in theory, both of the plots in Figure 1 have infinite resolution, both in time and in air pressure or voltage.

Now, let’s say that you wanted to take that microphone’s output and transmit it through a bunch of devices and wires that, in theory, all do nothing to the signal. Let’s say, for example, that you take the mic’s output, send it through a wire to a box that makes the signal twice as loud. Then take the output of that box and send it through a wire to another box that makes it half as loud. You take the output of that box and send it through a wire to a measuring device. What will you see? Unfortunately, none of the wires or boxes in the chain can be perfect, so you’ll probably see the signal plus something else which we’ll call the “error” in the system’s output. We can call it the error because, if we measure the input voltage and the output voltage at any one instant, we’ll probably see that they’re not identical. Since they should be identical, then the system must be making a mistake in transmitting the signal – so it makes errors…

Fig 2. If you send an audio signal through some wires and devices that (in theory) do nothing to the signal, you’ll find out that they add some extra stuff that you don’t want.

Pedantic Sidebar: Some people will call that error that the system adds to the signal “noise” – but I’m not going to call it that. This is because “noise” is a specific thing – noise is random – so if it’s not random, it’s not noise. Also, although the signal has been distorted (in that the output of the system is not identical to the input) I won’t call it “distortion” either, since distortion is a name that’s given to something that happens to the signal because the signal is there. (We would probably get at least some of the error out of our system even if we didn’t send any audio into it.) So, we could be slightly geeky and adequately vague and call the extra stuff “Distortion plus noise” but not “THD+N” – which stands for “Total Harmonic Distortion Plus Noise” – because not all kinds of distortion will produce a harmonic of the signal… but I’m getting ahead of myself…

So, we want to transmit (or store) the audio signal – but we want to reduce the noise caused by the transmission (or storage) system. One way to do this is to spend more money on your system. Use wires with better shielding, amplifiers with lower noise floors, bigger power supplies so that you don’t come close to their limits, run your magnetic tape twice as fast, and so on and so on. Or, you could convert the analogue signal (remember that it’s analogous to the change in air pressure over time) to one that is represented (and therefore transmitted or stored) digitally instead.

What does this mean?

Conversion from analogue to digital and back
(but skipping important details)

IMPORTANT: If you read this section, then please read the following postings as well. This is because, in order to keep things simple to start, I’m about to leave out some important details that I’ll add afterwards. However, if you don’t add the details, you could (understandably) jump to some incorrect conclusions (that many others before you have concluded…) So, if you don’t have time to read both sections, please don’t read either of them.

In the example above, we made a varying voltage that was analogous to the varying air pressure. If we wanted to store this, we could do it by varying the amount of magnetism on a wire or a coating on a tape, for example. Or we could cut a wiggly groove in a bit of vinyl that has a similar shape to the curve in the plots in Figure 1. Or, we could do something else: we could get a metronome (or a clock) and make a measurement of the voltage every time the metronome clicks, and write down the measurements.

For example, let’s zoom in on the first little bit of the signal in the plots in Figure 1

Fig. 3 The same curve as was shown in Figure 1 – but zoomed in to the very beginning.

We’ll then put on a metronome and make a measurement of the voltage every time we hear the metronome click…

Fig 4. The same curve (in red) measured at regular intervals (in black)

We can then keep the measurements (remembering how often we made them…) and write them down like this:

0.3000
0.4950
0.5089
0.3351
0.1116
0.0043
0.0678
0.2081
0.2754
0.2042
0.0730
0.0345
0.1775

We can store this series of numbers on a computer’s hard disk, for example. We can then come back tomorrow, and convert the measurements to voltages. First we read the measurements, and create the appropriate voltage…

Fig. 5. The voltages that we stored as measurements

We then make a “staircase” waveform by “holding” those voltages until the next value comes in.

Fig 6. We make a “staircase” curve using the voltages.

All we need to do then is to use a low-pass filter to smooth out the hard edges of the staircase.

Fig 7. When we smooth out the staircase, we get back the original signal (in red).

 

So, in this example, we’ve gone from an analogue signal (the red curve in Figure 3) to a digital signal (the series of numbers), and back to an analogue signal (the red curve in Figure 7).

In some ways, this is a bit like the way a movie works. When you watch a movie, you see a series of still photographs, probably taken at a rate of 24 pictures (or frames) per second. If you play those photos back at the same rate (24 fps or frames per second), you think you see movement. However, this is because your eyes and brain aren’t fast enough to see 24 individual photos per second – so you are fooled into thinking that things on the screen are moving.

However, digital audio is slightly different from film in two ways:

  • The sound (equivalent to the movement in the film) is actually happening. It’s not a trick that relies on your ears and brain being too slow.
  • If, when you were filming the movie, something were to happen between frames (say, the flash of a gunshot, for example) then it would never be caught on film. This is because the photos are discrete moments in time – and what happens between them is lost. However, if something were to make a very, very short sound between two samples (two measurements) in the digital audio signal – it would not be lost. This is because of something that happens at the beginning of the chain that I haven’t described… yet…

However, there are some “artefacts” (a fancy term for “weird errors”) that are present both in film and in digital audio that we should talk about.

The first is an error that happens when you mess around with the rate at which you take the measurements (called the “sampling rate”) or the photos (called the “frame rate”) – and, more importantly, when you need to worry about this. Let’s say that you make a film at 24 fps. If you play this back at a higher frame rate, then things will move very quickly (like old-fashioned baseball movies…). If you play them back at a lower frame rate, then things move in slow motion. So, for things to look “normal” you have to play the movie at the same rate that it was filmed. However, as longs no one is looking, you can transfer the movie as fast as you like. For example, if you wanted to copy the film, you could set up a movie camera so it was pointing at a movie screen and film the film. As long as the movie on the screen is running in sync with the camera, you can do this at any frame rate you like. But you’ll have to watch the copy at the same frame rate as the original film…

The second is an easy artefact to recognise. If you see a car accelerating from 0 to something fast on film, you’ll see the wheels of the car start to get faster and faster, then, as the car gets faster, the wheels slow down, stop, and then start going backwards… This does not happen in real life (unless you’re in a place lit with flashing lights like fluorescent bulbs or LED’s). I’ll do a posting explaining why this happens – but the thing to remember here is that the speed of the wheel rotation that you see on the film (the one that’s actually captured by the filming…) is not the real rotational speed of the wheel. However, those two rotational speeds are related to each other (and to the frame rate of the film). If you change the real rotational rate or the frame rate, you’ll change the rotational rate in the film. So, we call this effect “aliasing” because it’s a false version (an alias) of the real thing – but it’s always the same alias (assuming you repeat the conditions…) Digital audio can also suffer from aliasing, but in this case, you put in one frequency (which is actually the same as a rotational speed) and you get out another one. This is not the same as harmonic distortion, since the frequency that you get out is due to a relationship between the original frequency and the sampling rate, so the result is almost never a multiple of the input frequency.

 

Some details that I left out…

One of the things I said above was something like “we measure the voltage and store the results” and the example I gave was a nice series of numbers that only had 4 digits after the decimal point. This statement has some implications that we need to discuss.

Let’s say that I have a thing that I need to measure. For example, Figure 8 shows a piece of metal, and I want to measure its width.

Fig 8. A piece of metal with a width of “approximately 57 mm”.

Using my ruler, I can see that this piece of metal is about 57 mm wide. However, if I were geeky (and I am) I would say that this is not precise enough – and therefore it’s not accurate. The problem is that my ruler is only graduated in millimetres. So, if I try to measure anything that is not exactly an integer number of mm long, I’ll either have to guess (and be wrong) or round the measurement to the nearest millimetre (and be wrong).

So, if I wanted you to make a piece of metal the same width as my piece of metal, and I used the ruler in Figure 8, we would probably wind up with metal pieces of two different widths. In order to make this better, we need a better ruler – like the one in Figure 9.

Fig 9. The same piece of metal being measured with a vernier caliper. This gives us additional precision (down to 0.05 mm) so we can make a more accurate measurement.

Figure 9 shows a vernier caliper (a fancy type of ruler) being used to measure the same piece of metal. The caliper has a resolution of 0.05 mm instead of the 1 mm available on the ruler in Figure 8. So, we can make a much more accurate measurement of the metal because we have a measuring device with a higher precision.

The conversion of a digital audio signal is the same. As I said above, we measure the voltage of the electrical signal, and transmit (or store) the measurement. The question is: how accurate and precise is your measurement? As we saw above, this is (partly) determined by how many digits are in the number that you use when you “write down” the measurement.

Since the voltage measurements in digital audio are recorded in binary rather than decimal (we use 0 and 1 to write down the number instead of 0 up to 9) then we use Binary digITS – or “bits” instead of decimal digits (which are not called “dits”). The number of bits we have in the number that we write down (partly) determines the precision of the measurement of the voltage – and therefore (possibly), our accuracy…

Just like the example of the ruler in Figure 8, above, we have a limited resolution in our measurement. For example, if we had only 4 bits to work with then the waveform in 4 – the one we have to measure – would be measured with the “ruler” shown on the left side of Figure 10, below.

Fig 10: The waveform from Figure 4 as a voltage (notice the Y-axis on the right). We have to measure these values using the ruler with the resolution shown on the Y-axis on the left.

When we do this, we have to round off the value to the nearest “tick” on our ruler, as shown in Figure 11.

Fig 11: The values from figure 10 (shown as the circles) rounded off to the nearest value on our 4-bit ruler (the red staircase).

Using this “ruler” which gives a write-down-able “quantity” to the measurement, we get the following values for the red staircase:

0010
0100
0100
0011
0001
0000
0001
0010
0010
0010
0001
0000
0001

When we “play these back” we get the staircase again, shown in Figure 12.

Fig 12: The output of the measurements. Notice that all values sit exactly on one of the values for the “ruler” on the left Y-axis of the plot.

Of course, this means that, by rounding off the values, we have introduced an error in the system (just like the measurement in Figure 8 has a bigger error than the one in Figure 9). We can calculate this error if we just subtract the original signal from the output signal (in other words, Figure 12 minus Figure 10) to get Figure 13.

Fig 13: The error that we produced due to the rounding off of the signal when we did the measurements. Notice that the error is always less than 0.5 of a “tick” of the ruler on the left Y-axis.

 

In order to improve our accuracy of the measurement, we have to increase the precision of the values. We can do this by adding an extra digit (or bit) to the number that we use to record the value.

If we were using decimal numbers (0-9) then adding an extra digit to the number would give us 10 times as many possibilities. (For example, if we were using 4 digits after the decimal in the example at the start of this posting, we have a total of 10,000 possible values – 0.0000 to 0.9999. If we add one more digit, we increase the resolution to 100,000 possible values – 0.00000 to 0.99999 ).

In binary, adding one extra digit gives us twice as many “ticks” on the ruler. So, using 4 bits gives us 16 possible values. Increasing to 5 bits gives us 32 possible values.

If you’re listening to a CD, then the individual measurements of each voltage – the “sample values” – are stored with 16 bits, which means that we have 65,536 possible values to pick from.

Remember that this means that we have more “ticks” on our ruler – but we don’t necessarily increase its range. So, for example, we’re still measuring a voltage from -1 V to 1 V – we just have more and more resolution to do that measurement with.

Error #1

Finally! We get to the beginning of the point of the posting in the first place. My whole reason for starting this series of postings was to talk about errors in digital audio.

So, the first one to talk about is whether we have “bit matching” in a system where we expect to do so. For example, if you look at the S/P-DIF output of a good-old-fashioned CD player, do the sample values that are transmitted on that wire identical to the ones on the disc?

This is a fairly easy test to make (in theory). All you have to do is to record the digital signal on the S/P-DIF output of your CD player, subtract the original signal that’s on the disc (making sure that you have done your time alignment correctly). If you have anything other than nothing left over, then something went wrong somewhere.

If the result of this test is that you do NOT get nothing remaining, you cannot jump in head first and say that your S/P-DIF output is not working properly. For example, some sound cards have a sampling rate converter at their digital input. So, if you are capturing the CD player’s output using such a sound card on your computer, then perhaps the errors that you see are being produced by your sound card – and not your player.

 

A little associated story

This was a method that I used to do the final testing of Wireless Power Link for B&O. I created a little software application that made a signal and sent it out digitally to a Wireless Power Link transmitter (which was running with a resolution of 24 bits – giving us 16,777,216 possible values). I then connected a Wireless Power Link receiver’s output to the same computer. The computer knew how much time it took the signal to get from its output, through the wireless transmission system, back to its input (about 5 ms). So, I took the “output” signal, delayed it by that amount, and then subtracted it from the “input” signal. I then made a detector that counted every bit (instead of every sample) that was incorrect.

The reason I was counting bit errors instead of sample errors was that we wanted to be able to diagnose problems if we found them. If you find out that “this sample is wrong” – you don’t necessarily know whether it was one or more bit errors that caused the problem. By counting bit errors, you have a little more information that can help you diagnose the source of problems when you find them.

Sidebar: since this test was running at 48 kHz and 24 bits with a 2-channel system, that means that there were 2,304,000 bits per second being checked every second

This test ran 24-hours a day continuously for over 11 days. In that time, we found 0 bit errors. That means that we got 0 errors in more than 2,189,721,600,000 bits, which was good.

Now, just before anyone gets excited: that test was run to find out whether the WPL system was able to deliver a bit-perfect output in the absence of any external disturbances. So, the transmitter and the receiver were not moved at any time during the test, and nothing was moved between them – and the result was that the system behaved perfectly.

 

One way to compare CODEC quality

I’m often asked about my opinion regarding sound quality vs. compression formats or sampling rates or bit depths or psychoacoustic CODEC’s or other things like that…

Of course, there are lots of ways to decide on such an opinion, depending on what parameters you use to define “sound quality” and therefore what it is you’re asking specifically…

One way to think of this is to consider that the original sound file is the “reference” (regardless of how “good” or “bad” it is…), and when you encode it somehow (say, by changing sampling rates, or making it an MP3 file, for example), AND that encoding makes it different, then the resulting difference from the original can be considered an error.

So, I took a compilation of tracks that I often use for listening to loudspeakers. This is about 13 minutes long and is made of excerpts of many different recordings and recording styles, ranging from anechoic female speech, through a cappella choral, orchestral music, jazz, hard rock, heavy metal, and hip hop. The original tracks were all taken from 44.1 kHz / 16-bit CD’s, and the compilation is a 44.1 kHz / 16 bit result. This is what we’ll call the “reference”.

I then used LAME to encode the compilation in different bitrates of MP3. I re-encoded as 320, 256, and 128 CBR (Constant Bit Rate). I also used the “–preset” option to make encodings in the “insane”, “extreme”, “standard”, and “medium” settings (I’ve included the details of this at the bottom in the “Appendix”). Three of these four presets are VBR – the “Insane” setting is a CBR 320 kbps with some tweaked parameters.

 

I decoded those MP3 files back to PCM, and compared them to the original, of course making sure that everything was time- and gain-aligned. (There are some small differences in the overall level of the original file and the MP3 output – which is different for different bitrates. If I did not do this, then I would be exaggerating the differences between the original and the encoded versions – so this gain difference was calculated and compensated for, before subtracting the original from the MP3.)

 

Let’s take a look at a plot of the sample values in the left channel of the beginning of the track.

Figure 1. The original (in black) and the decoded 128 kbps MP3 file.

The plot above shows the first 44100 samples in the track (the first second of sound). The red plot is the decoded 128 kbps MP3. The black plot (which is difficult to see because it is overlapped by the red plot – except in the signal peaks) is the original file. For example, if I zoom into the area around the beginning of the sound (say, starting around sample number 15800) then we see this

Figure 2. A close-up of a portion of Figure 1.

So, as you can see in the two plots above, the decoded 128 kbps MP3 and the original 44.1/16 file are different. But, the difference is small relative to the levels of the signals themselves. The question is, how small is the difference, exactly?

We can find this out by subtracting the original signal from the decoded MP3 output, sample by sample. The result of this is shown in the plot below.

Figure 3. The difference between the two plots in Figure 2.

Notice that the vertical scale of the plot in Figure 3 is small. This is because it shows the difference between the two lines in Figure 2, which is also quite small.

Let’s think for a minute about how I arrived at the signal in Figure 3. I subtracted the Original signal from the MP3 output. In other words:

MP3 output – Original = Difference

If we consider that the difference between the MP3 output and the Original can be thought of as an “error”, and if I move the terms in the equation above, I get the following:

MP3 output – Original = Error

Original + Error = MP3 output

So, the question is: how loud is that error relative to the signal we’re listening to? The idea here is that, the louder the error, the easier it will be to detect.

Figure 4, below, shows this level difference over time. The black curve is a running RMS level of the decoded 128 kbps MP3 file. As you can see there, it ranges from about -30 dB FS to about +10 dB FS. You may think that it’s strange that it “only” goes to -10 dB FS – but this is because the time window I’m using to calculate the RMS value of the signal is 500 ms long. The peaks of the track reach full scale, but since my time window is long, this tends to pull down the apparent level (because the peaks are short). (NB: If you want to argue about the choice of a 500 ms time window, please wait until I’ve followed up this posting with another one that divides things up by frequency band…)

The res curve in Figure 4 is a running RMS value of the Error signal – the difference between the MP3 file and the original. As you can see there, that error signal ranges from about -50 dB FS to about -30 dB FS, give or take…

Figure 4. Running measures of the level of the decoded 128 kpbs MP3 file (in black) and the error signal (in red).

We can find the running value of the difference between the level of the MP3 file and the level of the Error it contains by subtracting the black curve from the red curve. The result of this is shown in Figure 5, below.

Figure 5: The difference in level between the error signal and the decoded 128 kbps MP3 file.

So, Figure 5, therefore, shows the measure of how loud the signal is relative to the error that makes it different from the original. If this error signal were just harmonic distortion, then we could call this a measure of THD in dB. If it were just good-old-fashioned noise, like on a magnetic tape, then we could call it a signal-to-noise ratio. However, this is neither distortion or noise in the traditional sense – or, maybe more accurately, it’s both…

So, let’s call the plot in Figure 5 a “signal-to-error ratio”. What we can see there is that, for this particular track, for the settings that I used to make the 128 kbps MP3 file, the error – the MP3 artefacts – are only 20 to 25 dB below the signal most of the time. Now, don’t jump to conclusions here. This does not mean that they would be as audible as white noise that is only 25 dB below the signal. This is because part of the “magic” of the MP3 encoder is that it tries to ensure that the error can “hide” under the signal by placing the error signal in the same frequency band(s) as the signal. Typically, white noise is in a different band than the signal, so it’s easier to hear because it’s not masked. So, be very careful about interpreting this plot. This is a measurable signal-to-error ratio, but it cannot be directly compared to a signal-to-noise ratio.

Let’s now increase the bitrate of the MP3 encoding, allowing the encoder to increase the quality.

Figure 6. A running RMS of a decoded 256 kbps MP3 file (black) and the difference between that signal and the original (red).

 

Figure 7: The Signal-to-Error ratio of a 256 kbps MP3 file.

 

Figure 6 and 7 show the same information as before, but for a 256 kpbs encoding of the same track. As you can see there, by doubling the bitrate of the MP3, we have increased our signal-to-error ratio by about 10 to 15 dB or so – to about 35 or 40 dB.

Figure 8: A running RMS of a decoded 320 kbps MP3 file (black) and the difference between that signal and the original (red).
Figure 9: The Signal-to-Error ratio of a 320 kbps MP3 file.

As you can see in Figures 8 and 9 above, increasing the MP3 bitrate to 320 kbps can improve the Signal-to-Error ratio from about 25 dB (for 128 kbps) to about 40 dB or so.

Now, if you’re looking carefully, you might notice that, some times in the track that I used for testing, the signal-to-error ratio is actually worse for the 320 kbps file than it is for the 256 kbps file – all other things being equal in the LAME converter parameters. This is a bit misleading, since what you cannot see there is the frequency spectrum of the error signal. I’ll deal with that in a future posting – with some more analysis and explanation to go with it.

For now, let’s play with the VBR presets in LAME. I’ll just show the signal-to-error plots for the 4 settings.

 

Figure 10: The Signal-to-Error ratio of an MP3 file converted using LAME’s “medium” quality preset.
Figure 11: The Signal-to-Error ratio of an MP3 file converted using LAME’s “standard” quality preset.
Figure 12: The Signal-to-Error ratio of an MP3 file converted using LAME’s “extreme” quality preset.
Figure 13: The Signal-to-Error ratio of an MP3 file converted using LAME’s “insane” quality preset.

So, as you can see in Figures 10 through 13, the signal-to-error ratio can be improved with the VBR presets, reaching a peak of over 60 dB for the “Insane” setting, for this track…

 

 

As I said a couple of times above:

  • You have to be careful about interpreting these graphs from a background of “knowing” what a SNR is… This error is not normal “distortion” or “noise” – at least from a perceptual point of view…
  • I’ll go further with this, including some frequency-dependent information in a future posting.

 

 

Appendix – LAME parameters and verbose output

For the geeks…

 

MAC60090:mp3_demos ggm$ lame -b 320 -q 0 –verbose  compilation_original.wav lame_320.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 20094 Hz – 20627 Hz
Encoding compilation_original.wav to lame_320.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=0
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -10 dB
adjust masking short: -11 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 4
^ shape: 0 (only for type 4)
^ level adjustement: -12 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    2:07/    2:07|    2:08/    2:08|   7.5929x|    0:00
————————————————————————————————–
   kbps        LR    MS  %     long switch short %
  320.0       73.7  26.3        93.4   3.4   3.1
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame -b 256 -q 0 –verbose  compilation_original.wav lame_256.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 19383 Hz – 19916 Hz
Encoding compilation_original.wav to lame_256.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (5.5x) 256 kbps qval=0
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -8 dB
adjust masking short: -8.8 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 4
^ shape: 1 (only for type 4)
^ level adjustement: -10 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    1:50/    1:50|    1:51/    1:51|   8.7235x|    0:00
————————————————————————————————–
   kbps        LR    MS  %     long switch short %
  256.0       71.6  28.4        93.4   3.4   3.1
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame -b 128 -q 0 –verbose  compilation_original.wav lame_128.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 16538 Hz – 17071 Hz
Encoding compilation_original.wav to lame_128.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (11x) 128 kbps qval=0
misc:
scaling: 0.95
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: 0 dB
adjust masking short: 0 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 2
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 4
^ shape: 4 (only for type 4)
^ level adjustement: -3 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0.0002
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    1:33/    1:33|    1:34/    1:34|   10.305x|    0:00
————————————————————————————————–
   kbps        LR    MS  %     long switch short %
  128.0       25.2  74.8        95.2   2.6   2.2
Writing LAME Tag…done
ReplayGain: -2.2dB
MAC60090:mp3_demos ggm$ lame –preset medium –verbose  compilation_original.wav lame_medium.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 17249 Hz – 17782 Hz
Encoding compilation_original.wav to lame_medium.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=4)
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=1
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: all
variable bitrate – VBR mtrh (default)
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: 0 dB
adjust masking short: 0 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 5
^ shape: 2 (only for type 4)
^ level adjustement: -0 dB
^ adjust type: 3
^ adjust sensitivity power: 6.309574
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=3.5 dB
using temporal masking effect: no
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:18/    0:18|    0:19/    0:19|   53.116x|    0:00
 32 [   37] %
 40 [    4] *
 48 [   14] %
 56 [    8] %
 64 [  105] %
 80 [  423] %*
 96 [  831] %***
112 [ 2596] %%%********
128 [17134] %%%%%%%%%%%%%%%%%%%%***********************************************
160 [12811] %%%%%%%%%%%%%%%%%%%%%%%%***************************
192 [ 1330] %%****
224 [  836] %%**
256 [  683] %**
320 [  216] %
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  144.3       35.5  64.5        90.7   4.6   4.7
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame –preset standard –verbose  compilation_original.wav lame_standard.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 18671 Hz – 19205 Hz
Encoding compilation_original.wav to lame_standard.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=2)
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: all
variable bitrate – VBR mtrh (default)
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -2.6 dB
adjust masking short: -2.6 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 5
^ shape: 2 (only for type 4)
^ level adjustement: -3.7 dB
^ adjust type: 3
^ adjust sensitivity power: 1.995262
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=6.25 dB
using temporal masking effect: no
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:19/    0:19|    0:20/    0:20|   48.732x|    0:00
 32 [    0]
 40 [    0]
 48 [    1] %
 56 [    0]
 64 [   15] %
 80 [   26] %
 96 [   17] %
112 [  135] %
128 [ 1673] %*******
160 [15048] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*****************************
192 [15688] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*****************
224 [ 1986] %%%%%****
256 [ 1602] %%%%***
320 [  837] %%**
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  183.0       60.0  40.0        90.7   4.6   4.7
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame –preset extreme –verbose  compilation_original.wav lame_extreme.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
polyphase lowpass filter disabled
Encoding compilation_original.wav to lame_extreme.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=0)
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: all
variable bitrate – VBR mtrh (default)
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -6.8 dB
adjust masking short: -6.8 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 2
^ stopping: 1
ATH: using
^ type: 5
^ shape: 1 (only for type 4)
^ level adjustement: -7.1 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=8.25 dB
using temporal masking effect: no
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:21/    0:21|    0:22/    0:22|   44.584x|    0:00
 32 [    0]
 40 [    0]
 48 [    0]
 56 [    0]
 64 [    0]
 80 [    0]
 96 [    0]
112 [    1] %
128 [    0]
160 [  408] %*
192 [ 1961] %%******
224 [16481] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%***************
256 [13387] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%*************
320 [ 4790] %%%%%%%%%%%%%*******
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  245.6       70.9  29.1        90.7   4.6   4.7
Writing LAME Tag…done
ReplayGain: -2.6dB
MAC60090:mp3_demos ggm$ lame –preset insane –verbose  compilation_original.wav lame_insane.mp3
LAME 3.99.5 64bits (http://lame.sf.net)
Using polyphase lowpass filter, transition band: 20094 Hz – 20627 Hz
Encoding compilation_original.wav to lame_insane.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (4.4x) 320 kbps qval=3
misc:
scaling: 1
ch0 (left) scaling: 1
ch1 (right) scaling: 1
huffman search: best (outside loop)
experimental Y=0
stream format:
MPEG-1 Layer 3
2 channel – joint stereo
padding: off
constant bitrate – CBR
using LAME Tag
psychoacoustic:
using short blocks: channel coupled
subblock gain: 1
adjust masking: -10 dB
adjust masking short: -11 dB
quantization comparison: 9
^ comparison short blocks: 9
noise shaping: 1
^ amplification: 1
^ stopping: 1
ATH: using
^ type: 4
^ shape: 0 (only for type 4)
^ level adjustement: -12 dB
^ adjust type: 3
^ adjust sensitivity power: 1.000000
experimental psy tunings by Naoki Shibata
  adjust masking bass=-0.5 dB, alto=-0.25 dB, treble=-0.025 dB, sfb21=0.5 dB
using temporal masking effect: yes
interchannel masking ratio: 0
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
 37028/37028 (100%)|    0:28/    0:28|    0:28/    0:28|   33.937x|    0:00
——————————————————————————-
   kbps        LR    MS  %     long switch short %
  320.0       73.7  26.3        93.4   3.4   3.1
Writing LAME Tag…done
ReplayGain: -2.6dB

 

B&O Tech: Distance Tweaking

#75 in a series of articles about the technology behind Bang & Olufsen loudspeakers

So, you’ve just installed a pair of loudspeakers, or a multichannel surround system. If you’re a normal person then you have not set up your system following the recommendations stated in the International Telecommunications Union’s document “Rec. ITU-R BS.775-1: MULTICHANNEL STEREOPHONIC SOUND SYSTEM WITH AND WITHOUT ACCOMPANYING PICTURE”. That document states that, in a best case, you should use a loudspeaker placement as is shown below in Figure 1.

 

Fig 1. The “ITU 775” recommendation for a 5-channel loudspeaker configuration. All loudspeaker should be matched, and be the same distance from the listening position at the angles shown in the figure.

 

In a typical configuration, the loudspeakers are NOT the same distance from the listening position – and this is a BIG problem if you’re worried about the accuracy of phantom image placement. Why is this? Well, let’s back up a little…

Localisation in the Real World

Let’s say that you and I were standing out in the middle of a snow-covered frozen pond on a quiet winter day. I stand some distance away from you and we have a conversation. When I’m doing the talking, the sound of my voice leaves my mouth and moves towards you.

If I’m directly in front of you, then the sound (in theory) arrives at both of your ears simultaneously (resulting in an Interaural Time Difference or ITD of 0 ms) and at exactly the same level (resulting in an Interaural Amplitude Difference or IAD of 0 dB). Your brain  detects that the ITD is 0 ms and the IAD is 0 dB, and decides that I must be directly in front of you (or directly behind you, or above you – at least I must be somewhere on your sagittal plane…)

If I move slightly to your left, then two things happen, generally speaking. Firstly, the sound of my voice arrives at your left ear before your right ear because it’s closer to me. Secondly, the sound of my voice is generally louder in your left ear than in your right ear, not only because it’s closer, but (mostly) because your head shadows your right ear from the sound of my voice. So, you brain detects that my voice is earlier and louder in your left ear, so I must be somewhere on your left.

Of course, there are many other, smaller cues that tell you where the sound is coming from exactly – but we don’t need to get into those details today.

There are two important thing to note here. The first is that these two principal cues – the ITD and the IAD – are not equally important. If they got in a fight, the ITD would win. If a sound arrived at your left ear earlier, but was louder in your right ear, it would have to be a LOT louder in the right ear to convince you that you should ignore the ITD information…

The second thing is that the time differences we’re talking about are very very small. If I were directly to one side of you, looking directly at your left ear, say… then the sound would arrive at your right ear approximately only 700 µs – that’s 700 millionths of a second or 0.0007 seconds later than at your left ear.

So, the moral of this story so far is that we are very sensitive to differences in the time of arrival of a sound at our two ears.

Localisation in a reproduced world

Now go back to the same snow-covered frozen lake with a pair of loudspeakers instead of bringing me along, and set them up in a standard stereo configuration, where the listening position and the two loudspeakers form an equilateral triangle. This means that when you sit and listen to the signals coming out of the loudspeakers

  • the two loudspeakers are the same distance from the listening position, and
  • the left loudspeaker is 30º to the left of front-centre, and the right loudspeaker is 30º to the right of front-centre.

Have a seat and we’ll play some sound. To start, we’ll play the same sound in both loudspeakers at exactly the same time, and at exactly the same level. Initially, the sound from the left loudspeaker will reach your left ear, and the sound from the right loudspeaker reaches your right ear. A very short time later the sound from the left loudspeaker reaches your right ear and the sound from the right loudspeaker reaches your left ear (this effect is called Interaural Crosstalk – but that’s not important). After this, nothing happens, because you are sitting in the middle of a frozen lake covered in snow – so there are no reflections from anything.

Since the sounds in the two loudspeakers are identical, then the sounds in your ears are also identical to each other. And, just as is the case in real-life, if the sounds in your two ears are identical, you’ll localise the sound source as coming from somewhere on your sagittal plane. Due to some other details in the localisation cues that we’re not talking about here, chances are that you’ll hear the sound as originating from a position directly in front of you – between the two loudspeakers.

Because the apparent location of that sound is a position where there is no loudspeaker, it’e like a ghost – so it’s called a “phantom centre” image.

That’s the centre image, but how do we move the image slightly to one side or the other? It’s actually really easy – we just need to remember the effects of ITD and IAD, and do something similar.

So, if I play a sound out of both loudspeakers at exactly the same time, but I make one loudspeaker slightly louder than the other, then the phantom image will appear to come from a position that is closer to the louder loudspeaker. So, if the right channel is louder than the left channel, then the image appears to come from somewhere on the right. Eventually, if the right loudspeaker is louder enough (about 15 dB, give or take), then the image will appear to be in that loudspeaker.

Similarly, if I were to keep the levels of the two loudspeakers identical, but I were to play the sound out of the right loudspeaker a little earlier instead, then the phantom image will also move towards the earlier loudspeaker.

There have been many studies done to find out exactly what apparent phantom image position results from  exactly what level or delay difference between the two loudspeakers (or a combination of the two). One of the first ones was done by Gert Simonsen in 1983, in which he found the following results.

 

[table]

Image Position, Amplitude difference, Time difference

0º, 0.0 dB, 0.0 ms

10º, 2.5 dB, 0.2 ms

20º, 5.5 dB, 0.44 ms

30º, 15.0 dB, 1.12 ms

[/table]

 

Note that this test was done with loudspeakers at ±30º – so the bottom line of the table means “in one of the loudspeakers”. Also, I have to be clear that the values in this table are NOT to be used concurrently. So, this shows the values that are needed to produce the desired phantom image location using EITHER amplitude differences OR time differences.

Again, the same two important points apply.

Firstly, the time differences are a more “powerful” cue than the amplitude differences. In other words, if the left loudspeaker is earlier, but the right loudspeaker is louder, you’ll hear the phantom image location towards the left, unless the right loudspeaker is a LOT louder.

Secondly, you are VERY sensitive to time differences. The left loudspeaker only needs to be 1.12 ms earlier than the right loudspeaker in order for the phantom image to move all the way into that loudspeaker. That’s equivalent to the left loudspeaker being about 38.5 cm closer than the right loudspeaker (because the speed of sound is about 344 m/s (depending on the temperature) and 0.00112 * 344 = 0.385 m).

Those last two paragraphs were the “punch line” – if the distances to the loudspeakers are NOT the same, then, unless you do something about it, you’ll wind up hearing your phantom images pulling towards the closer loudspeaker. And it doesn’t take much of an error in distance to produce a big effect.

 

Whaddya gonna do about it?

Almost every surround processor and Audio Video Receiver in the world gives you the option of entering the Speaker Distances in a menu somewhere. There are two possible reasons for this.

The first is not so important – it’s to align the sound at the listening position with the video. If you’re sitting 3 m from the loudspeakers and the TV, then the sound arrives 8.7 ms after you see the picture (the same is true if you are listening to a person speaking 3 m away from you). To eliminate this delay, the loudspeakers could produce the sound 8.7 ms too early, and the sound would reach you at the same time as you see the video. As I said, however, this is not a problem to lose much sleep over, unless you sit VERY far away from your television.

The second reason is very important, as we’ve already seen. If, as we established at the start of this posting, you’re a normal person, then your loudspeakers are not all the same distance from the listening position. This means that you should apply a delay to the closer loudspeaker(s) to get them to “wait” for the sound as it travels towards you from the further loudspeakers. That way, if you have the same sound in all channels at the same time, then the loudspeaker do NOT produce it at the same time, but it arrives at the listening position simultaneously, as it should.

Problem solved! Right?

Wrong.

Corrections that need correcting

Let’s make a configuration of a pair of loudspeakers and a listening position that is obviously wrong.

Fig 2. A stereo pair of loudspeakers and a listening position that is no where near the correct location. Notice that the right loudspeaker is much closer than the left.

Figure 2 shows the example of a very bad loudspeaker configuration for stereo listening. (I’m keeping things restricted to two channels to keep things simple – but multichannel is the same…) The right loudspeaker is much closer than the left loudspeaker, so all phantom images will appear to “bunch together” into the right loudspeaker.

Fig 4. Measuring the distance to the furthest loudspeaker from the listening position

So, to do the correction, you measure the distances to the two loudspeakers from the listening position and enter those two values into the surround processor. It then subtracts the smaller distance from the larger distance, converts that to a delay time, and delays the closer loudspeaker by that amount to compensate for the difference.

Fig 5. The gray circle shows the apparent position of the right loudspeaker, after a delay has been applied to it, assuming that there are no other cues (such as level or reflections in the room) to tell you where it is.

So, after the delay is applied to the closer loudspeaker, in theory, you have a stereo pair of loudspeakers that are equidistant from the listening position. This means that, instead of hearing  (for example) the phantom centre images in the closer loudspeaker, you’ll hear it as being positioned at the centre point between the distant loudspeaker (the left one, in this example) and the “virtual” one (the right one in this example). This is shown below.

Fig 6. The small grey dot shows the theoretical position of the resulting phantom centre after the two loudspeakers have been time-aligned using delays based on distance to the listening position.

As you can see in Figure 6, the resulting phantom image is at the centre point between the two resulting loudspeakers. But, if you look not-too-carefully-at-all, then you can see that the angle from the listening position to that centre point is not the same angle as the centre point between the two REAL loudspeakers (the black dot).

Fig 7. Notice that the corrected phantom image location (indicated by the arrow) is not the same as the desired phantom centre. (which might be, for example, the centre of a television…)

So, this means that, if you use distances ONLY to time-align two (or more) loudspeakers, then your correction till not be perfect. And, the more incorrect your actual loudspeaker configuration, the more incorrect the correction will be.

How do I fix it?

Notice that, after “correction”, the phantom image is still pulling towards the closer loudspeaker.

As we saw above, in order to push a phantom centre image towards a loudspeaker, you have to make the sound in that loudspeaker earlier.

So, what we need to do, after the distance-based time alignment is done, is to force the more distant loudspeaker to be a little earlier than the closer one. That will pull the phantom image towards it.

In order to use a distance compensation to make a loudspeaker produce the sound earlier, we have to tell the processor that it’s further away than it actually is. This makes the processor “think” that it needs to send the sound out early to compensate for the extra propagation delay caused by the distance.

So, to make the further loudspeaker a little early relative to the other loudspeaker, we either have to tell the processor that it’s further away from the listening position than it really is, or we reduce the reported distance to the closer loudspeaker to delay it a little more.

This means that, in the example shown in Figure 7, above, we should add a little to the distance to the left loudspeaker before entering the value in the menus, or subtract a little from the distance to the right loudspeaker instead.

How much is enough?

You might, at this point, be asking yourself “Why can’t this be done automatically? It’s just a little trigonometry, after all…”

If things were as simple as I’ve described here, then you’d be right – the math that is converting distance compensation to audio delays could include this offset, and everything would be fine.

The problem is that I’ve over-simplified a little on the way through. For example, not everyone hears exactly a 10º shift in phantom image with a 2.5 dB inter-channel amplitude difference. Those numbers are the average of a listening test with a number of subjects. Also, when other researchers have done the same test, they get slightly different results. (see this page for information).

Also, the directivity of the loudspeaker will have an influence (that is likely going to be frequency-dependent). So, if you’ve “toed in” your loudspeakers, then (in the example above) the further one will be “aimed” at you better than the closer one, which will have an influence on the perceived location of the phantom centre.

So, the only way to really do the final “tweaking” or “fine tuning” of the distance-compensation delays is to do it by listening.

Normally, I start by entering the distances correctly. Then, while sitting in the listening position, I use a monophonic track (Suzanne Vega singing “Tom’s Diner” works well) and I increase the distance in the surround processor’s menu of the loudspeaker that I want to pull the image towards. In other words, if the phantom centre appears to be located too far to the left, I “lie” to the surround processor and tell it that the right loudspeaker is further by 10 cm. I keep adding distance until the image is moved to the correct location.