“High-Res” Audio: Part 11: How high can you go?

Part 1
Part 2
Part 3
Part 4
Part 5
Part 6
Part 7
Part 8a
Part 8b
Part 9
Part 10

If you you get an audiometry test done, you’ll be shown into a small room, about the size of a public bathroom stall. Someone will put a pair of headphones on you, and pass you a small handle with a button. Your instructions are to press the button if you hear a tone. Then the audiometrist will leave the room, closing the door, and you’ll suddenly realise that if there’s any noise in this room, it’s because you’re making it.

Then you hear a beep in your left ear. You press the button. You hear a quieter beep. Press. Quieter beep. Press…. …. …. Beep, press… …. …. …. Beep, press…. New frequency beep, loud again. Press… and so on.

What’s happening here is that you’re presented with a sine tone at some frequency, probably loud enough for you to hear. You press. The tone gets quieter, and you press again. Eventually, the tone is so quiet that you cannot hear it (this is normal) so you don’t press. So, the tone gets louder, and you press. Then it gets quieter again, until you can’t hear it again.

By crossing over that threshold of “can hear” and “can’t hear” a couple of times, the audiometrist finds out whether or not you got lucky… If you bottom out at the same level a couple of times in a row, then that’s your threshold of hearing at that frequency in that ear.

The frequency changes (usually by 1 octave, but sometimes less), and the whole process is repeated.

If you get a full test done, then this is probably done at 9 frequencies (250, 500, 1k, 1.5k, 2k, 3k, 4k, 6k, and 8kHz) in both ears individually – 18 tests in all.

You’ll then be given a sheet of paper, or at least shown a plot of your hearing threshold. Typically, if you have “normal” hearing (whatever that means) your thresholds will all be sitting on a horizontal line marked 0 dB. If you’re “better than normal” then you get a negative score, if you’re “worse than normal” you get a positive score.

What does this mean?

Let’s start over.

If a lot of people do this test, and we only test at 1 kHz, we’ll find out that, after the results are averaged, the group can hear the 1 kHz sine tone when the change in air pressure at the ear entrance is 20 µPa. We’re not going to talk about what this means other than to say that “sound is a change in air pressure over time, and that pressure is measured in pascals, abbreviated Pa”. Needless to say, 20 µPa is pretty quiet, since it’s the quietest sound a group of people can hear at 1 kHz when you take their average.

If you did that test at a much lower frequency, you would find out that people aren’t as good at hearing quiet sounds. In other words, at 100 Hz, the sine tone has to be louder than 20 µPa for people to hear it.

The same is true if you repeated the test at a much higher frequency – say, 10,000 Hz.

If you did this test at a lot of frequencies, then you’d find out that, on average, the threshold of hearing for a human follows the bottom red line of the plot in Figure 1, borrowed from Wikipedia.

Figure 1: The bottom red curve is the average threshold of hearing for a human being.

That bottom plot shows the threshold of hearing for different frequencies, plotted in dB SPL. Notice that, at 1 kHz, the line is at 0 dB SPL. This is because 0 dB SPL is defined to be the average threshold of hearing of a human at 1 kHz, which is 20 µPa. So, it’s not an accident…

Looking at that plot, you can see that, in order to hear a sine tone at 20 Hz, the tone has got to be more than 70 dB louder (that’s a LOT louder). So, a microphone “sees” a 73 dB SPL, 20 Hz sine tone as being louder than a 0 dB SPL, 1 kHz sine tone – but as far as you’re concerned, they’re both “the quietest sound you can hear” – therefore, they’re the same level.

If we take that threshold of hearing curve, and we play tones at those levels for those frequencies, then you should “just be able to” hear them. So, we’ll call those levels “0 dB” – since it’s the same as what is expected of you.

In other words, the piece of paper you got from the audiometrist tells you how much above or below that red threshold of hearing YOU sit.

Now, let’s back up a bit.

  1. I said that, in your test, you only went up to 8 kHz. This is because, above that (and possibly even before that) the headphones might not be trust-worthy, and even a tiny movement (say a couple of millimetres) in the position of the headphones will have a (relatively) big effect on the level at your eardrum. So, rather than get people worried about losing their hearing at 20,000 Hz (when, in fact, they were actually just wearing the headphones 1 mm too far forward), you won’t get tested.
  2. Notice how variable that threshold of hearing line is. There are big changes in level over the “audible” frequency range.
  3. Remember that the threshold of hearing curve is an AVERAGE of a lot of people. Just like no one has 2.6 children, no one has this exact response. And, if you are some freak of nature and you DO have exactly that response, you don’t for long… we all get old…
  4. Notice how that threshold of hearing curve only goes up to about 16 kHz, and above that it says “estimated”. See point #1.

Now, you should know that your ability to hear a sine tone at some frequency is defined as how your ability compares to an expectation based on an average, within a relatively small frequency band: 250 to 8 kHz.

Then you look at a textbook or you read a website that says “humans can hear from 20 Hz to 20 kHz”, which is not enough information to be either true or false… It’s like saying “humans are usually between 0 and 10 m tall” which is also sort of true, but also adequately vague to be potentially worse-than-useless information.

The truth is, unfortunately, much more complicated… However, it’s fair to say that, in order for you to just hear a sine tone at 20 kHz, it would have to be much, much louder than one at 1 kHz. In fact, if I played a 20 kHz sine tone loud enough for you to hear, measured that level, and then played a 1 kHz sine tone for you at the same level, you’d probably punch me – after you had passed out due to the pain, woken up, hunted me down, and found me… (I’d already have run away by then….)

So what?

We humans like nice, tidy, answers. “It will rain tomorrow” is preferable to “there is a 70 – 80% chance of scattered showers in the afternoon tomorrow”. We even get mad when the information is correct, but we interpret it tidily… For example, we’ll complain about getting rained on in the middle of our hike, when there was only a 10% chance of rain. On the other hand, if there was a 10% chance of winning 1 Million dollars in the lottery, we’d all buy a ticket.

Anyways, once-upon-a-time, when the committee for inventing the compact disc was holding meetings, they said “what should the sampling rate be?” and someone said “at least 40 kHz, because we can hear up to 20 kHz”. (The reason it’s 44100 is related to the fact that the bits were stored as black and white stripes on video tape, and NTSC and PAL come close to meeting each other close to that number, when you look at the numbers of lines per field and frames per second.)

Of course, like any first-generation thing, digital recording equipment wasn’t very good at the start (back around 1980 or so) – so the first DDD recordings that were released on CD sounded… well…. weird. There was quantisation distortion because they hadn’t figured out dither yet, only 12 or 13 of the bit values were working properly on the ADC’s, the anti-aliasing filters were implemented as analogue circuits, so they let some stuff through that aliased, and they rang (“sang along”) with the signal at a high frequency… All of that added up to “weird” – possibly even “bad”. Then, people who had good equipment (high-end turntables or, even better, 1/4″ tape running at 30 ips) listened to this new format, decided it was bad, and that was that.

Some of them asked “why is is bad?” and one answer they came up with was the band limiting… If the system can’t capture or store or play materials above 20 kHz, then it’s useless… Right? Maybe…

Then, instruments were put in front of measurement microphones and spectra were measured – and the proof was in. Trumpets with harmon (wah-wah) mutes, when pointing directly at the microphone, contain harmonics as high as 50 kHz! This must explain why CDs sound bad! Right? Maybe…

Then Rupert Neve did a demo at an AES (Audio Engineering Society) convention where he played people two tones. Both were at 7 kHz, but one was a sine wave and the other was a square wave (at some level). The question was: have a listen and tell me which is which. The results were the same as if everyone was just guessing. (Remember that, in order to make a square wave, you need to add odd harmonics – so the lowest-frequency content difference between a 7 kHz sine wave and a 7 kHz square wave is at 21 kHz.) Proof that we don’t need to go above 20 kHz, right? Maybe…

Some years ago, I took some “high resolution” audio files and measured their spectral content. One particularly interesting result is shown in Figures 2, below.

Figure 2: The spectral content of a 96/24 “high resolution” audio file I bought.

Look at that spike in the top end – around 20 kHz. What musical instrument makes that sound? The answer is “no musical instrument makes that sound – at least none of the baroque instruments in that recording make that sound. As I wrote back in 2014:

 If you’re wondering what it might be, I asked a bunch of smart friends, and the best explanation we can come up with is that it’s noise from a switched-mode power supply that is somehow bleeding into the recording. HOW it’s bleeding into the recording is a potentially interesting question for recording engineers. One possibility is that one of the musicians was charging up a phone in the room where the microphones were – and the mic’s just picked up the noise. Another possibility is that the power supply noise is bleeding electrically into the recording chain – maybe it’s a computer power supply or the sound card and the manufacturer hasn’t thought about isolating this high frequency noise from the audio path. Or, maybe it’s something else.

Interestingly, this is a conflict of two engineers. The designer of the power supply (assuming that’s what it is…) said “I’ll put the switching frequency above 20 kHz so that no one will hear it” and the recording engineer said “I’ll record this at 96 kHz so that people can get the content they’re missing…” The problem is that the content you’re missing is something you don’t want…

Similarly, if you listen to Eric Clapton’s “Unplugged” album with headphones or loudspeakers that have a low-enough low-frequency range, you’ll hear a loud thump, thump, thump going along with the music. This is the sound of someone tapping their foot on a temporary stage floor, shaking a vocal microphone. In my not-very-humble opinion, that should never have made it out to the public release. However, my guess is that the speakers it was mastered on didn’t go low enough… (OR, it was an artistic decision, and I would have done it differently.) Assuming that I’m right, then this is a second example where a “better” system sounds “worse”.

Of course, through all of this, I have assumed that your loudspeakers or headphones can produce the signals that we’re talking about in the direction that you’re sitting in, and that those signals are not being masked by other sounds in the room (like phone chargers singing…) However, to complicate things with reality would just be too far to go today…

Conclusions?

I don’t have any, but I have some questions and (as usual) some opinions…

  • Does a harmon mute on a trumpet produce energy at 50 kHz, if you’re sitting right in front of it?
    Yes.
  • Do you want to sit right in front of a trumpet with a harmon mute?
    Debatable.
  • Can a high-res audio recording include the sound of a phone charger?
    Yes.
  • Do you want to have an expensive recording of a baroque ensemble with obligato phone charger?
    Probably not – the charger is not in Buxtehude’s original score as far as I can see.
  • Can you hear the difference between a 7 kHz sine and a 7 kHz square wave?
    Depends on the speaker / headphone, the listening position, the background noise level, and whether or not you were out clubbing last night. Heads or tails?
  • Will you feel better by knowing that your file contains “audio” content above 20 kHz? Probably.
    Placebos have been known to work bigger miracles than this. (But don’t forget the stuff I said about sampling rate converters earlier…)

“High-Res” Audio: Part 4 – Know your limits

Part 1
Part 2
Part 3

If you’ve read the three introductory parts of this series, linked above; and if you’re still awake, then we are ready to start putting things together and jumping to incorrect conclusions…

Let’s say that you’ve been hired to specify a digital audio system for some reason (we’ll assume that it’s an LPCM system – nothing exotic). Using the information I’ve told you so far, you can make two decisions in your specification:

You select a bit depth to be enough to give you the dynamic range you desire. In this case, “dynamic range” means the “distance” in level between the loudest sound you can record / store / transmit (I isn’t say what the “digital audio system” was going to be used for) and the inherent noise floor of the system. If you’re recording the background noise on an airplane while it’s in flight, you don’t need a big dynamic range, because it’s always loud, and never changes. However, if you’re recording a Japanese Taiko Drummer group, you’ll need a huge usable dynamic range because the loud parts of the performance are a LOT louder than the quietest parts.

As we saw in Part 3, an LPCM digital audio system cannot record any audio that has a frequency higher than 1/2 the sampling rate. So, you select a sampling rate that is at least 2x the highest frequency you’re interested in. For example, if you believe the books that say you can hear from 20 Hz to 20,000 Hz, then you might decide that your sampling rate has to be at least 40,000 Hz. On the other hand, if you’re making a subwoofer that you know will never be fed a signal above 120 Hz, then you don’t need a sampling rate higher than 240 Hz.

Don’t get angry yet. I’m just keeping these numbers simple to make the math easy. Later on, I’ll explain why what I just said might not be correct.

Mistake #1

I just jumped to at least three conclusions (probably more) that are going to haunt me.

The first was that my “digital audio system” was something like the following:

Figure 1

As you can see there, I took an analogue audio signal, converted it to digital, and then converted it back to analogue. Maybe I transmitted it or stored it in the part that says “digital audio”.

However, the important, and very probably incorrect assumption here is that I did nothing to the signal. No volume control, no bass and treble adjustments… nothing.

Mistake #2

We assumed above that we can define the system’s dynamic range based on the dynamic range of the audio signal itself. However, this makes the assumption that the noise floor of the digital system and the noise floor of your audio signal are identical, which is probably not true. As we saw in Part 2, the noise generated by TPDF dither is white – it has the same probability of having a given amount of energy per Hertz. Since we hear sound logarithmically (meaning that, to us, octaves are equal widths. Equal spacings in Hz are not.) This means that the noise sound “bright” to us – because there’s just as much energy in the top octave (say, 10 kHz to 20 kHz, if you believe the books) as there is in all other frequencies combined from 0 Hz up to 10 kHz.

If, however, the noise floor in your concert hall where the taiko drummers are playing is caused by the air conditioning system, then this noise will be a lot louder in the low frequencies than the the highs – which is not the same.

Therefore it’s too simplistic to say “the noise floor of the digital system” and the “noise floor of the signal” – since these two noise floors are different. (As Steven Wright said: “It doesn’t matter what temperature the room is, it’s always room temperature.”)

Mistake #3

As we’ll see later, if you’re going to do anything to the signal while it’s in the “digital domain”, then you need to take that into consideration when you’re deciding on your sampling rate. It’s not enough to say “useful audio bandwidth times 2” because there are some side effects that need to be remembered…

However, counter-intuitively, it could be that, in order to improve your system, you’ll want to make the sampling rate LOWER instead of HIGHER – so this is not a simple case of “more is better”.

We’ll get to that topic later. For now, I’ll leave you in suspense.

Some details

One thing we saw in Part 3 was that, if we have an audio signal with energy at a frequency higher than 1/2 the sampling rate, and if that signal gets into the analogue-to-digital converter (ADC), then the output of the ADC will contain an error. We’ll get out energy at frequencies that were not in the original, due to the effect called “aliasing“.

Once that’s in the digital audio signal, there’s no removing it, so we need to make sure that the too-high-frequency signals don’t get into the ADC’s input in the first place. This is done using a low-pass filter that (in theory) removes all energy in the signal above the Nyquist frequency (which is equal to 1/2 the sampling rate). Since that low-pass filter prevents aliasing, we call it an anti-aliasing filter. Normally, these days, that antialiasing filter is built into the ADC itself.

As we also saw in Part 3, the digital-to-analogue converter (DAC) has to smooth out the digital signal to convert it from a “staircase” wave to a smoother one. That’s also done with a low-pass filter that eliminates all the harmonics that would be required to make the staircase have sharp corners. Since this is done to re-construct the analogue signal, it’s called a “reconstruction filter“.

This means that, if we pull apart some of the components in the signal chain I showed in Figure 1, it really looks more like this:

Figure 2.

On to Part 5.

“High-Res” Audio: Part 3 – Frequency Limits

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.

Part 1
Part 2

In the first posting in this series, I talked about digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time… In the second posting in the series, we looked at a “trick” for dealing with the issue of quantisation (the fact that we have a limited resolution for measuring the amplitude of the audio signal). This trick is to add dither (a fancy word for “noise”) to the signal before we quantise it in order to randomise the error and turn it into noise instead of distortion.

In this posting, we’ll look at some of the problems incurred by the way we carve up time into discrete moments when we grab those samples.

Let’s make a wheel that has one spoke. We’ll rotate it at some speed, and make a film of it turning. We can define the rotational speed in RPM – rotations per minute, but this is not very useful. In this case, what’s more useful is to measure the wheel rotation speed in degrees per frame of the film.

Fig 1. The position of a clockwise-rotating wheel (with only one spoke) for 9 frames of a film. Each column shows a different rotational speed of the wheel. The far left column is the slowest rate of rotation. The far right column is the fastest rate of rotation. Red wheels show the frame in which the sequence starts repeating.

Take a look at the left-most column in Figure 1. This shows the wheel rotating 45º each frame. If we play back these frames, the wheel will look like it’s rotating 45º per frame. So, the playback of the wheel rotating looks the same as it does in real life.

This is more or less the same for the next two columns, showing rotational speeds of 90º and 135º per frame.

However, things change dramatically when we look at the next column – the wheel rotating at 180º per frame. Think about what this would look like if we played this movie (assuming that the frame rate is pretty fast – fast enough that we don’t see things blinking…) Instead of seeing a rotating wheel with only one spoke, we would see a wheel that’s not rotating – and with two spokes.

This is important, so let’s think about this some more. This means that, because we are cutting time into discrete moments (each frame is a “slice” of time) and at a regular rate (I’m assuming here that the frame rate of the film does not vary), then the movement of the wheel is recorded (since our 1 spoke turns into 2) but the direction of movement does not. (We don’t know whether the wheel is rotating clockwise or counter-clockwise. Both directions of rotation would result in the same film…)

Now, let’s move over one more column – where the wheel is rotating at 225º per frame. In this case, if we look at the film, it appears that the wheel is back to having only one spoke again – but it will appear to be rotating backwards at a rate of 135º per frame. So, although the wheel is rotating clockwise, the film shows it rotating counter-clockwise at a different (slower) speed. This is an effect that you’ve probably seen many times in films and on TV. What may come as a surprise is that this never happens in “real life” unless you’re in a place where the lights are flickering at a constant rate (as in the case of fluorescent or some LED lights, for example).

Again, we have to consider the fact that if the wheel actually were rotating counter-clockwise at 135º per frame, we would get exactly the same thing on the frames of the film as when the wheel if rotating clockwise at 225º per frame. These two events in real life will result in identical photos in the film. This is important – so if it didn’t make sense, read it again.

This means that, if all you know is what’s on the film, you cannot determine whether the wheel was going clockwise at 225º per frame, or counter-clockwise at 135º per frame. Both of these conclusions are valid interpretations of the “data” (the film). (Of course, there are more – the wheel could have rotated clockwise by 360º+225º = 585º or counter-clockwise by 360º+135º = 495º, for example…)

Since these two interpretations of reality are equally valid, we call the one we know is wrong an alias of the correct answer. If I say “The Big Apple”, most people will know that this is the same as saying “New York City” – it’s an alias that can be interpreted to mean the same thing.

Wheels and Slinkies

We people in audio commit many sins. One of them is that, every time we draw a plot of anything called “audio” we start out by drawing a sine wave. (A similar sin is committed by musicians who, at the first opportunity to play a grand piano, will play a middle-C, as if there were no other notes in the world.) The question is: what, exactly, is a sine wave?

Get a Slinky – or if you don’t want to spend money on a brand name, get a spring. Look at it from one end, and you’ll see that it’s a circle, as can be (sort of) seen in Figure 2.

Fig 2. A Slinky, seen from one end. If I had really lined things up, this would just look like a shiny circle.

Since this is a circle, we can put marks on the Slinky at various amounts of rotation, as in Figure 3.

Fig 3. The same Slinky, marked in increasing angles of 45º.

Of course, I could have put the 0º mark anywhere. I could have also rotated counter-clockwise instead of clockwise. But since both of these are arbitrary choices, I’m not going to debate either one.

Now, let’s rotate the Slinky so that we’re looking at from the side. We’ll stretch it out a little too…

Fig 4. The same Slinky, stretched a little, and viewed from the side.

Let’s do that some more…

Fig 5. The same Slinky, stretched more, and viewed from the “side” (in a direction perpendicular to the axis of the rotation).

When you do this, and you look at the Slinky directly from one side, you are able to see the vertical change of the spring from the centre as a result of the change in rotation. For example, we can see in Figure 6 that, if you mark the 45º rotation point in this view, the distance from the centre of the spring is 71% of the maximum height of the spring (at 90º).

Fig 6. The same markings shown in Figure 3, when looking at the Slinky from the side. Note that, if we didn’t have the advantage of a little perspective (and a spring made of flat metal), we would not know whether the 0º point was closer or further away from us than the 180º point. In other words, we wouldn’t know if the Slinky was rotating clockwise or counter-clockwise.

So what? Well, basically, the “punch line” here is that a sine wave is actually a “side view” of a rotation. So, Figure 7, shows a measurement – a capture – of the amplitude of the signal every 45º.

Fig 7. Each measurement (a black “lollipop”) is a measurement of the vertical change of the signal as a result of rotating 45º.

Since we can now think of a sine wave as a rotation of a circle viewed from the side, it should be just a small leap to see that Figure 7 and the left-most column of Figure 1 are basically identical.

Let’s make audio equivalents of the different columns in Figure 1.

Fig 8. A sampled cosine wave where the frequency of the signal is equivalent to 90º per sample period. This is identical to the “90º per frame” column in Figure 1.
Fig 9. A sampled cosine wave where the frequency of the signal is equivalent to 135º per sample period. This is identical to the “135º per frame” column in Figure 1.
Fig 10. A sampled cosine wave where the frequency of the signal is equivalent to 180º per sample period. This is identical to the “180º per frame” column in Figure 1.

Figure 10 is an important one. Notice that we have a case here where there are exactly 2 samples per period of the cosine wave. This means that our sampling frequency (the number of samples we make per second) is exactly one-half of the frequency of the signal. If the signal gets any higher in frequency than this, then we will be making fewer than 2 samples per period. And, as we saw in Figure 1, this is where things start to go haywire.

Fig 11. A sampled cosine wave where the frequency of the signal is equivalent to 225º per sample period. This is identical to the “225º per frame” column in Figure 1.

Figure 11 shows the equivalent audio case to the “225º per frame” column in Figure 1. When we were talking about rotating wheels, we saw that this resulted in a film that looked like the wheel was rotating backwards at the wrong speed. The audio equivalent of this “wrong speed” is “a different frequency” – the alias of the actual frequency. However, we have to remember that both the correct frequency and the alias are valid answers – so, in fact, both frequencies (or, more accurately, all of the frequencies) exist in the signal.

So, we could take Fig 11, look at the samples (the black lollipops) and figure out what other frequency fits these. That’s shown in Figure 12.

Fig 12. The red signal and the black samples of it are the same as was shown in Figure 11. However, another frequency (the blue signal) also fits those samples. So, both the red signal and the blue signal exist in our system.

Moving up in frequency one more step, we get to the right-hand column in Figure 1, whose equivalent, including the aliased signal, are shown in Figure 13.

Fig 13. A signal (the red curve) that has a frequency equivalent to 280º of rotation per sample, its samples (the black lollipops) and the aliased additional signal that results (the blue curve).

Do I need to worry yet?

Hopefully, now, you can see that an LPCM system has a limit with respect to the maximum frequency that it can deal with appropriately. Specifically, the signal that you are trying to capture CANNOT exceed one-half of the sampling rate. So, if you are recording a CD, which has a sampling rate of 44,100 samples per second (or 44.1 kHz) then you CANNOT have any audio signals in that system that are higher than 22,050 Hz.

That limit is commonly known as the “Nyquist frequency“, named after Harry Nyquist – one of the persons who figured out that this limit exists.

In theory, this is always true. So, when someone did the recording destined for the CD, they made sure that the signal went through a low-pass filter that eliminated all signals above the Nyquist frequency.

In practice, however, there are many cases where aliasing occurs in digital audio systems because someone wasn’t paying enough attention to what was happening “under the hood” in the signal processing of an audio device. This will come up later.

Two more details to remember…

There’s an easy way to predict the output of a system that’s suffering from aliasing if your input is sinusoidal (and therefore contains only one frequency). The frequency of the output signal will be the same distance from the Nyquist frequency as the frequency if the input signal. In other words, the Nyquist frequency is like a “mirror” that “reflects” the frequency of the input signal to another frequency below Nyquist.

This can be easily seen in the upper plot of Figure 14. The distance from the Input signal and the Nyquist is the same as the distance between the output signal and the Nyquist.

Also, since that Nyquist frequency acts as a mirror, then the Input and output signal’s frequencies will move in opposite directions (this point will help later).

Fig 14. Two plots showing the same information about an Input Signal above the Nyquist frequency and the output alias signal. Notice that, in the linear plot on top, it’s easier to see that the Nyquist frequency is the mirror point at the centre of the frequencies of the Input and Output signals.

Usually, frequency-domain plots are done on a logarithmic scale, because this is more intuitive for we humans who hear logarithmically. (For example, we hear two consecutive octaves on a piano as having the same “interval” or “width”. We don’t hear the width of the upper octave as being twice as wide, like a measurement system does. that’s why music notation does not get wider on the top, with a really tall treble clef.) This means that it’s not as obvious that the Nyquist frequency is in the centre of the frequencies of the input signal and its alias below Nyquist.

On to Part 4

“High-Res” Audio: Part 2 – Resolution

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.

Back to Part 1

In the last posting, I talked about digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time…

For now, we’ll say that each measurement is rounded off to the nearest possible “tick” on the ruler that we’re using to measure the voltage. That rounding results in an error. However, (assuming that everything is working correctly) that error can never be bigger than 1/2 of a “step”. Therefore, in order to reduce the amount of error, we need to increase the number of ticks on the ruler.

Now we have to introduce a new word. If we really had a ruler, we could talk about whether the ticks are 1 mm apart – or 1/16″ – or whatever. We talk about the resolution of the ruler in terms of distance between ticks. However, if we are going to be more general, we can talk about the distance between two ticks being one “quantum” – a fancy word for the smallest step size on the ruler.

So, when you’re “rounding off to the nearest value” you are “quantising” the measurement (or “quantizing” it, if you live in Noah Webster’s country and therefore you harbor the belief that wordz should be spelled like they sound – and therefore the world needz more zees). This also means that the amount of error that you get as a result of that “rounding off” is called “quantisation error“.

In some explanations of this problem, you may read that this error is called “quantisation noise”. However, this isn’t always correct. This is because if something is “noise” then is is random, and therefore impossible to predict. However, that’s not strictly the case for quantisation error. If you know the signal, and you know the quantisation values, then you’ll be able to predict exactly what the error will be. So, although that error might sound like noise, technically speaking, it’s not. This can easily be seen in Figures 1 through 3 which demonstrate that the quantisation error causes a periodic, predictable error (and therefore harmonic distortion), not a random error (and therefore noise).

Sidebar: The reason people call it quantisation noise is that, if the signal is complicated (unlike a sine wave) and high in level relative to the quantisation levels – say a recording of Britney Spears, for example – then the distortion that is generated sounds “random-ish”, which causes people to jump to the conclusion that it’s noise.

Fig 1: The first cycle of a periodic signal (in this case, a sinusoidal waveform) that we are going to quantise using a 4-bit system (notice the 4 bits in the scale on the left).
Fig 2: The same waveform shown in Figure 1 after quantisation (rounding off) in a 4-bit world.
Fig 3: The difference between Figure 2 and Figure 1. I made this by subtracting the original signal from the quantised version. This is the error in the quantised waveform – the quantisation error. Notice that it is not noise… it’s completely predictable and it will repeat with repetitions of the signal. Therefore the result of this is distortion, not noise…

Now, let’s talk about perception for a while… We humans are really good at detecting patterns – signals – in an otherwise noisy world. This is just as true with hearing as it is with vision. So, if you have a sound that exists in a truly random background noise, then you can focus on listening to the sound and ignore the noise. For example, if you (like me) are old enough to have used cassette tapes, then you can remember listening to songs with a high background noise (the “tape hiss”) – but it wasn’t too annoying because the hiss was independent of the music, and constant. However, if you, like me, have listened to Bob Marley’s live version of “No Woman No Cry” from the “Legend” album, then you, like me, would miss the the feedback in the PA system at that point in the song when the FoH engineer wasn’t paying enough attention… That noise (the howl of the feedback) is not noise – it’s a signal… Which makes it just as important as the song itself. (I could get into a long boring talk about John Cage at this point, but I’ll try to not get too distracted…)

The problem with the signal in Figure 2 is that the error (shown in Figure 3) is periodic – it’s a signal that demands attention. If the signal that I was sending into the quantisation system (in Figure 1) was a little more complicated than a sine wave – say a sine wave with an amplitude modulation – then the error would be easily “trackable” by anyone who was listening.

So, what we want to do is to quantise the signal (because we’re assuming that we can’t make a better “ruler”) but to make the error random – so it is changed from distortion to noise. We do this by adding noise to the signal before we quantise it. The result of this is that the error will be randomised, and will become independent of the original signal… So, instead of a modulating signal with modulated distortion, we get a modulated signal with constant noise – which is easier for us to ignore. (It has the added benefit of spreading the frequency content of the error over a wide frequency band, rather than being stuck on the harmonics of the original signal… but let’s not talk about that…)

For example…

Let’s take a look at an example of this from an equivalent world – digital photography.

The photo in Figure 4 is a black and white photo – which actually means that it’s comprised of shades of gray ranging from black all the way to white. The photo has 272,640 individual pixels (because it’s 640 pixels wide and 426 pixels high). Each of those pixels is some shade of gray, but that shading does not have an infinite resolution. There are “only” 256 possible shades of gray available for each pixel.

So, each pixel has a number that can range from 0 (black) up to 255 (white).

Fig 4: A photo of a building in Paris. Each pixel in this photo has one of 256 possible levels of gray – from white (255) down to black (0).

If we were to zoom in to the top left corner of the photo and look at the values of the 64 pixels there (an 8×8 pixel square), you’d see that they are:

86 86 90 88 87 87 90 91
86 88 90 90 89 87 90 91
88 89 91 90 89 89 90 94
88 90 91 93 90 90 93 94
89 93 94 94 91 93 94 96
90 93 94 95 94 91 95 96
93 94 97 95 94 95 96 97
93 94 97 97 96 94 97 97

What if we were to reduce the available resolution so that there were fewer shades of gray between white and black? We can take the photo in Figure 1 and round the value in each pixel to the new value. For example, Figure 5 shows an example of the same photo reduced to only 6 levels of gray.

Fig 5: The same photo of the same building. Each pixel in this photo has one of 6 possible levels of gray. Notice that some details are lost – like the smooth transitions in the clouds, or the stripes in the marble in the pillars.

Now, if we look at those same pixels in the upper left corner, we’d see that their values are

102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102

They’ve all been quantised to the nearest available level, which is 102. (Our possible values are restricted to 0, 51, 102, 154, 205, and 255).

So, we can see that, by quantising the gray levels from 256 possible values down to only 6, we lose details in the photo. This should not be a surprise… That loss of detail means that, for example, the gentle transition from lighter to darker gray in the sky in the original is “flattened” to a light spot in a darker background, with a jagged edge at the transition between the two. Also, the details of the wall pillars between the windows are lost.

If we take our original photo and add noise to it – so were adding a random value to the value of each pixel in the original photo (I won’t talk about the range of those random values…) it will look like Figure 6. This photo has all 256 possible values of gray – the same as in Figure 1.

Fig 6: A photo of noise with the same width and height as the original photo, with random values (ranging from 0 to 255) in each pixel.

If we then quantise Figure 6 using our 6 possible values of gray, we get Figure 7. Notice that, although we do not have more grays than in Figure 5, we can see things like the gradual shading in the sky and some details in the walls between the tall windows.

Fig 7: The same photo of the same building in Figure 4. Each pixel in this photo ALSO only has one of 6 possible levels of gray – just like in Figure 5. However, this version is the result of quantising the original photo with the noise added before quantisation. The result is admittedly noisy – but we are able to see pattens in the noise that preserve some of the details that we lost in Figure 5.

That noise that we add to the original signal is called dither – because it is forcing the quantiser to be indecisive about which level to quantise to choose.

I should be clear here and say that dither does not eliminate quantisation error. The purpose of dither is to randomise the error, turning the quantisation error into noise instead of distortion. This makes it (among other things) independent of the signal that you’re listening to, so it’s easier for your brain to separate it from the music, and ignore it.

Addendum: Binary basics and SNR

We normally write down our numbers using a “base 10” notation. So, when I write down 9374 – I mean
9 x 1000 + 3 x 100 + 7 x 10 + 4 x 1
or
9 x 103 + 3 x 102 + 7 x 101 + 4 x 100

We use base 10 notation – a system based on 10 digits (0 through 9) because we have 10 fingers.

If we only had 2 fingers, we would do things differently… We would only have 2 digits (0 and 1) and we would write down numbers like this:
11101

which would be the same as saying
1 x 16 + 1 x 8 + 1 x 4 + 0 x 2 + 1 x 1
or
1 x 24 + 1 x 23 + 1 x 22 + 0 x 21 + 1 x 20

The details of this are not important – but one small point is. If we’re using a base-10 system and we increase the number by one more digit – say, going from a 3-digit number to a 4-digit number, then we increase the possible number of values we can represent by a factor of 10. (in other words, there are 10 times as many possible values in the number XXXX than in XXX.)

If we’re using a base-2 system and we increase by one extra digit, we increase the number of possible values by a factor of 2. So XXXX has 2 times as many possible values as XXX.

Now, remember that the error that we generate when we quantise is no bigger than 1/2 of a quantisation step, regardless of the number of steps. So, if we double the number of steps (by adding an extra binary digit or bit to the value that we’re storing), then the signal can be twice as “far away” from the quantisation error.

This means that, by adding an extra bit to the stored value, we increase the potential signal-to-error ratio of our LPCM system by a factor of 2 – or 6.02 dB.

So, if we have a 16-bit LPCM signal, then a sine wave at the maximum level that it can be without clipping is about 6 dB/bit * 16 bits – 3 dB = 93 dB louder than the error. The reason we subtract the 3 dB from the value is that the error is +/- 0.5 of a quantisation step (normally called an “LSB” or “Least Significant Bit”).

Note as well that this calculation is just a rule of thumb. It is neither precise nor accurate, since the details of exactly what kind of error we have will have a minor effect on the actual number. However, it will be close enough.

On to Part 3.

“High-Res” Audio: Part 1

I’ve been debating writing a series of postings about “high resolution” audio for a long time – years. Lately, (probably because of some hype generated by some recent press releases) I’ve been getting lots of question (no, that’s not a typo) about it, so it appears the time has come…

To start: the question that I get (a lot) is “If I can’t hear above 20 kHz, then what’s the use of high-res?” As I’ll explain as we go through, this is only one, rather small aspect to consider in this topic. In fact, it might be the least important issue to consider.

However, before I write too much, I’ll say that I’m not going to argue for or against higher resolutions in digital audio systems. I’m only going to go through a bunch of issues that can be used to argue either for or against them. So, there’s not going to be a big reveal at the end of this series telling you that high-res is either better, worse, or no different than whatever you’re using now. It’s merely going to be a discussion of a number of issues that need to be weighed. The problem is that this entire topic is complicated – and there’s no single “right” answer, as I’ll argue as we go along.

To start, let’s get down to basics and look (once again, from the perspectives of this website) at what sound is, and how it’s converted from an analogue electrical signal into a digital representation. The good thing is that I’ve written this introduction before in a different series of postings. So, I’m going to be extremely lazy and just copy-and-paste that information here. I’m not just referring you to another page because I’m intentionally leaving some things out because we’re headed into having a different discussion this time.

A quick introduction to sound

At the simplest level, sound can be described as a small change in air pressure (or barometric pressure) over short periods of time. If you’d like to have a better and more edu-tain-y version of this statement with animations and pretty colours, you could take 10 minutes to watch this video, for example.

That change in pressure can be “captured” by using a microphone, that is (at the simplest level) a device that has a change in air pressure at its input and a change in electrical voltage at its output. Ignoring a lot of details, we could say that if you were to plot a measurement of the air pressure (at the input of the microphone) over time, and you were to compare it to a plot of the measurement of the voltage (at the output of the microphone) over time, you would see the same curve on the two graphs. This means that the change in voltage is analogous to the change in air pressure.

Fig 1. Notice that (in theory, and ignoring a lot of things…) the change in air pressure over time at the input of the microphone is identical to the change in voltage over time at its output. Of course, this is not true in real life – microphones lie like a cheap rug…

At this point in the conversation, I’ll make a point to say that, in theory, we could “zoom in” on either of those two curves shown in Figure 1 and see more and more details. This is like looking at a map of Canada – it has lots of crinkly, jagged lines. If you zoom in and look at  the map of Newfoundland and Labrador, you’ll see that it has finer, crinkly, jagged lines. If you zoom in further, and stand where the water meets the shore in Trepassey and take a photo of your feet, you could copy it to draw a map of the line of where the water comes in around the rocks – and your toes – and you would wind up with even finer, crinkly, jagged lines… You could take this even further and get down to a microscopic or molecular level – but you get the idea… The point is that, in theory, both of the plots in Figure 1 have infinite resolution, both in time and in air pressure or voltage.

Now, let’s say that you wanted to take that microphone’s output and transmit it through a bunch of devices and wires that, in theory, all do nothing to the signal. Let’s say, for example, that you take the mic’s output, send it through a wire to a box that makes the signal twice as loud. Then take the output of that box and send it through a wire to another box that makes it half as loud. You take the output of that box and send it through a wire to a measuring device. What will you see? Unfortunately, none of the wires or boxes in the chain can be perfect, so you’ll probably see the signal plus something else which we’ll call the “error” in the system’s output. We can call it the error because, if we measure the input voltage and the output voltage at any one instant, we’ll probably see that they’re not identical. Since they should be identical, then the system must be making a mistake in transmitting the signal – so it makes errors…

Fig 2. If you send an audio signal through some wires and devices that (in theory) do nothing to the signal, you’ll find out that they add some extra stuff that you don’t want.

Pedantic Sidebar: Some people will call that error that the system adds to the signal “noise” – but I’m not going to call it that. This is because “noise” is a specific thing – noise is random – so if it’s not random, it’s not noise. Also, although the signal has been distorted (in that the output of the system is not identical to the input) I won’t call it “distortion” either, since distortion is a name that’s given to something that happens to the signal because the signal is there. (We would probably get at least some of the error out of our system even if we didn’t send any audio into it.) So, we could be slightly geeky and adequately vague and call the extra stuff “Distortion plus noise” but not “THD+N” – which stands for “Total Harmonic Distortion Plus Noise” – because not all kinds of distortion will produce a harmonic of the signal… but I’m getting ahead of myself…

So, we want to transmit (or store) the audio signal – but we want to reduce the noise caused by the transmission (or storage) system. One way to do this is to spend more money on your system. Use wires with better shielding, amplifiers with lower noise floors, bigger power supplies so that you don’t come close to their limits, run your magnetic tape twice as fast, and so on and so on. Or, you could convert the analogue signal (remember that it’s analogous to the change in air pressure over time) to one that is represented (and therefore transmitted or stored) digitally instead.

What does this mean?

Conversion from analogue to digital and back
(but skipping important details)

IMPORTANT: If you read this section, then please read the following postings as well. This is because, in order to keep things simple to start, I’m about to leave out some important details that I’ll add afterwards. However, if you don’t add the details, you could (understandably) jump to some incorrect conclusions (that many others before you have concluded…) So, if you don’t have time to read both sections, please don’t read either of them.

In the example above, we made a varying voltage that was analogous to the varying air pressure. If we wanted to store this, we could do it by varying the amount of magnetism on a wire or a coating on a tape, for example. Or we could cut a wiggly groove in a bit of vinyl that has a similar shape to the curve in the plots in Figure 1. Or, we could do something else: we could get a metronome (or a clock) and make a measurement of the voltage every time the metronome clicks, and write down the measurements.

For example, let’s zoom in on the first little bit of the signal in the plots in Figure 1

Fig. 3 The same curve as was shown in Figure 1 – but zoomed in to the very beginning.

We’ll then put on a metronome and make a measurement of the voltage every time we hear the metronome click…

Fig 4. The same curve (in red) measured at regular intervals (in black)

We can then keep the measurements (remembering how often we made them…) and write them down like this:

0.3000
0.4950
0.5089
0.3351
0.1116
0.0043
0.0678
0.2081
0.2754
0.2042
0.0730
0.0345
0.1775

We can store this series of numbers on a computer’s hard disk, for example. We can then come back tomorrow, and convert the measurements to voltages. First we read the measurements, and create the appropriate voltage…

Fig. 5. The voltages that we stored as measurements

We then make a “staircase” waveform by “holding” those voltages until the next value comes in.

Fig 6. We make a “staircase” curve using the voltages.

All we need to do then is to use a low-pass filter to smooth out the hard edges of the staircase.

Fig 7. When we smooth out the staircase, we get back the original signal (in red).

So, in this example, we’ve gone from an analogue signal (the red curve in Figure 3) to a digital signal (the series of numbers), and back to an analogue signal (the red curve in Figure 7).

In some ways, this is a bit like the way a movie works. When you watch a movie, you see a series of still photographs, probably taken at a rate of 24 pictures (or frames) per second. If you play those photos back at the same rate (24 fps or frames per second), you think you see movement. However, this is because your eyes and brain aren’t fast enough to see 24 individual photos per second – so you are fooled into thinking that things on the screen are moving.

However, digital audio is slightly different from film in two ways:

  • The sound (equivalent to the movement in the film) is actually happening. It’s not a trick that relies on your ears and brain being too slow.
  • If, when you were filming the movie, something were to happen between frames (say, the flash of a gunshot, for example) then it would never be caught on film. This is because the photos are discrete moments in time – and what happens between them is lost. However, if something were to make a very, very short sound between two samples (two measurements) in the digital audio signal – it would not be lost. This is because of something that happens at the beginning of the chain that I haven’t described… yet…

However, there are some “artefacts” (a fancy term for “weird errors”) that are present both in film and in digital audio that we should talk about.

The first is an error that happens when you mess around with the rate at which you take the measurements (called the “sampling rate”) or the photos (called the “frame rate”) – and, more importantly, when you need to worry about this. Let’s say that you make a film at 24 fps. If you play this back at a higher frame rate, then things will move very quickly (like old-fashioned baseball movies…). If you play them back at a lower frame rate, then things move in slow motion. So, for things to look “normal” you have to play the movie at the same rate that it was filmed. However, as long as no one is looking, you can transfer the movie as fast as you like. For example, if you wanted to copy the film, you could set up a movie camera so it was pointing at a movie screen and film the film. As long as the movie on the screen is running in sync with the camera, you can do this at any frame rate you like. But you’ll have to watch the copy at the same frame rate as the original film… (Note that this issue is not something that will come up in this series of postings about high resolution audio)

The second is an easy artefact to recognise. If you see a car accelerating from 0 to something fast on film, you’ll see the wheels of the car start to get faster and faster, then, as the car gets faster, the wheels slow down, stop, and then start going backwards… This does not happen in real life (unless you’re in a place lit with flashing lights like fluorescent bulbs or LED’s). I’ll do a posting explaining why this happens – but the thing to remember here is that the speed of the wheel rotation that you see on the film (the one that’s actually captured by the filming…) is not the real rotational speed of the wheel. However, those two rotational speeds are related to each other (and to the frame rate of the film). If you change the real rotational rate or the frame rate, you’ll change the rotational rate in the film. So, we call this effect “aliasing” because it’s a false version (an alias) of the real thing – but it’s always the same alias (assuming you repeat the conditions…) Digital audio can also suffer from aliasing, but in this case, you put in one frequency (which is actually the same as a rotational speed) and you get out another one. This is not the same as harmonic distortion, since the frequency that you get out is due to a relationship between the original frequency and the sampling rate, so the result is almost never a multiple of the input frequency. (We’re going to dig into this a lot deeper through this series of postings about high resolution audio, so if it doesn’t immediately make sense, don’t worry…)

Some important details that I left out…

One of the things I said above was something like “we measure the voltage and store the results” and the example I gave was a nice series of numbers that only had 4 digits after the decimal point. This statement has some implications that we need to discuss.

Let’s say that I have a thing that I need to measure. For example, Figure 8 shows a piece of metal, and I want to measure its width.

Fig 8. A piece of metal with a width of “approximately 57 mm”.

Using my ruler, I can see that this piece of metal is about 57 mm wide. However, if I were geeky (and I am) I would say that this is not precise enough – and therefore it’s not accurate. The problem is that my ruler is only graduated in millimetres. So, if I try to measure anything that is not exactly an integer number of mm long, I’ll either have to guess (and be wrong) or round the measurement to the nearest millimetre (and be wrong).

So, if I wanted you to make a piece of metal the same width as my piece of metal, and I used the ruler in Figure 8, we would probably wind up with metal pieces of two different widths. In order to make this better, we need a better ruler – like the one in Figure 9.

Fig 9. The same piece of metal being measured with a vernier caliper. This gives us additional precision (down to 0.05 mm) so we can make a more accurate measurement.

Figure 9 shows a vernier caliper (a fancy type of ruler) being used to measure the same piece of metal. The caliper has a resolution of 0.05 mm instead of the 1 mm available on the ruler in Figure 8. So, we can make a much more accurate measurement of the metal because we have a measuring device with a higher precision.

The conversion of a digital audio signal is the same. As I said above, we measure the voltage of the electrical signal, and transmit (or store) the measurement. The question is: how accurate and precise is your measurement? As we saw above, this is (partly) determined by how many digits are in the number that you use when you “write down” the measurement.

Since the voltage measurements in digital audio are recorded in binary rather than decimal (we use 0 and 1 to write down the number instead of 0 up to 9) then we use Binary digITS – or “bits” instead of decimal digits (which are not called “dits”). The number of bits we have in the number that we write down (partly) determines the precision of the measurement of the voltage – and therefore (possibly), our accuracy…

Just like the example of the ruler in Figure 8, above, we have a limited resolution in our measurement. For example, if we had only 4 bits to work with then the waveform in 4 – the one we have to measure – would be measured with the “ruler” shown on the left side of Figure 10, below.

Fig 10: The waveform from Figure 4 as a voltage (notice the Y-axis on the right). We have to measure these values using the ruler with the resolution shown on the Y-axis on the left.

When we do this, we have to round off the value to the nearest “tick” on our ruler, as shown in Figure 11.

Fig 11: The values from figure 10 (shown as the circles) rounded off to the nearest value on our 4-bit ruler (the red staircase).

Using this “ruler” which gives a write-down-able “quantity” to the measurement, we get the following values for the red staircase:

0010
0100
0100
0011
0001
0000
0001
0010
0010
0010
0001
0000
0001

When we “play these back” we get the staircase again, shown in Figure 12.

Fig 12: The output of the measurements. Notice that all values sit exactly on one of the values for the “ruler” on the left Y-axis of the plot.

Of course, this means that, by rounding off the values, we have introduced an error in the system (just like the measurement in Figure 8 has a bigger error than the one in Figure 9). We can calculate this error if we just subtract the original signal from the output signal (in other words, Figure 12 minus Figure 10) to get Figure 13.

Fig 13: The error that we produced due to the rounding off of the signal when we did the measurements. Notice that the error is always less than 0.5 of a “tick” of the ruler on the left Y-axis.

In order to improve our accuracy of the measurement, we have to increase the precision of the values. We can do this by adding an extra digit (or bit) to the number that we use to record the value.

If we were using decimal numbers (0-9) then adding an extra digit to the number would give us 10 times as many possibilities. (For example, if we were using 4 digits after the decimal in the example at the start of this posting, we have a total of 10,000 possible values – 0.0000 to 0.9999. If we add one more digit, we increase the resolution to 100,000 possible values – 0.00000 to 0.99999 ).

In binary, adding one extra digit gives us twice as many “ticks” on the ruler. So, using 4 bits gives us 16 possible values. Increasing to 5 bits gives us 32 possible values.

If you’re listening to a CD, then the individual measurements of each voltage – the “sample values” – are stored with 16 bits, which means that we have 65,536 possible values to pick from.

Remember that this means that we have more “ticks” on our ruler – but we don’t necessarily increase its range. So, for example, we’re still measuring a voltage from -1 V to 1 V – we just have more and more resolution with which we can do that measurement.

On to Part 2…

Turn it down half-way…

#81 in a series of articles about the technology behind Bang & Olufsen loudspeakers

Bertrand Russell once said, “In all affairs it’s a healthy thing now and then to hang a question mark on the things you have long taken for granted.”

This article is a discussion, both philosophical and technical about what a volume control is, and what can be expected of it. This seems like a rather banal topic, but I find it surprising how often I’m required to explain it.

Why am I writing this?

I often get questions from colleagues and customers that sound something like the following:

  • Why does my Beovision television’s volume control only go to 90%? Why can’t I go to 100%?
  • I set the volume on my TV to 40, so why is it so quiet (or loud)?

The first question comes from people who think that the number on the screen is in percent – but it’s not. The speedometer in your car displays your speed in kilometres per hour (km/h), the tachometer is in revolutions of the engine per minute (RPM) the temperature on your thermostat is in degrees Celsius (ºC), and the display on your Beovision television is on a scale based on decibels (dB). None of these things are in percent (imagine if the speed limit on the highway was 80% of your car’s maximum speed… we’d have more accidents…)

The short reason we use decibels instead of percent is that it means that we can use subtraction instead of division – which is harder to do. The shortcut rule-of-thumb to remember is that, every time you drop by 6 dB on the volume control, you drop by 50% of the output. So, for example, going from Volume step 90 to Volume step 84 is going from 100% to 50%. If I keep going down, then the table of equivalents looks like this:

I’ve used two colours there to illustrate two things:

  • Every time you drop by 6 volume steps, you cut the percentage in half. For example, 60 is five drops of 6 steps, which is 1/2 of 1/2 of 1/2 of 1/2 of 1/2 of 100%, or 3.2% (notice the five halves there…)
  • Every time you drop by 20, you cut the percentage to 1/10. So, Volume Step 50 is 1% of Volume Step 90 because it’s two drops of 20 on the volume control.

If I graph this, showing the percentage equivalent of all 91 volume steps (from 0 to 90) then it looks like this:

Of course, the problem this plot is that everything from about Volume Step 40 and lower looks like 0% because the plot doesn’t have enough detail. But I can fix that by changing the way the vertical axis is displayed, as shown below.

That plot shows exactly the same information. The only difference is that the vertical scale is no longer linearly counting from 0% to 100% in equal steps.

Why do we (and every other audio company) do it this way? The simple reason is that we want to make a volume slider (or knob) where an equal distance (or rotation) corresponds to an equal change in output level. We humans don’t perceive things like change in level in percent – so it doesn’t make sense to use a percent scale.

For the longer explanation, read on…

Basic concepts

We need to start at the very beginning, so here goes:

Volume control and gain

  1. An audio signal is (at least in a digital audio world) just a long list of numbers for each audio channel.
  2. The level of the audio signal can be changed by multiplying it by a number (called the gain).
    1. If you multiply by a value larger than 1, the audio signal gets louder.
    2. If you multiply by a number between 0 and 1, the audio signal gets quieter.
    3. If you multiply by zero, you mute the audio signal.
  3. Therefore, at its simplest core, a volume control implemented in a digital audio system is a multiplication by a gain. You turn up the volume, the gain value increases, and the audio is multiplied by a bigger number producing a bigger result.

That’s the first thing. Now we move on to how we perceive things…

Perception of Level

Speaking very generally, our senses (that we use to perceive the world around us) scale logarithmically instead of linearly. What does this mean? Let’s take an example:

Let’s say that you have $100 in your bank account. If I then told you that you’d won $100, you’d probably be pretty excited about it.

However, if you have $1,000,000 in your bank account, and I told you that you’re won $100, you probably wouldn’t even bother to collect your prize.

This can be seen as strange; the second $100 prize is not less money than the first $100 prize. However, it’s perceived to be very different.

If, instead of being $100, the cash prize were “equal to whatever you have in your bank account” – so the person with $100 gets $100 and the person with $1,000,000 gets $1,000,000, then they would both be equally excited.

The way we perceive audio signals is similar. Let’s say that you are listening to a song by Metallica at some level, and I ask you to turn it down, and you do. Then I ask you to turn it down by the same amount again, and you do. Then I ask you to turn it down by the same amount again, and you do… If I were to measure what just happened to the gain value, what would I find?

Well, let’s say that, the first time, you dropped the gain to 70% of the original level, so (for example) you went from multiplying the audio signal by 1 to multiplying the audio signal by 0.7 (a reduction of 0.3, if we were subtracting, which we’re not). The second time, you would drop by the same amount – which is 70% of that – so from 0.7 to 0.49 (notice that you did not subtract 0.3 to get to 0.4). The third time, you would drop from 0.49 to 0.343. (not subtracting 0.3 from 0.4 to get to 0.1).

In other words, each time you change the volume level by the “same amount”, you’re doing a multiplication in your head (although you don’t know it) – in this example, by 0.7. The important thing to note here is that you are NOT subtracting 0.3 from the gain in each of the above steps – you’re multiplying by 0.7 each time.

What happens if I were to express the above as percentages? Then our volume steps (and some additional ones) would look like this:

100%
70%
49%
34%
24%
17%
12%
8%

Notice that there is a different “distance” between each of those steps if we’re looking at it linearly (if we’re just subtracting adjacent values to find the difference between them). However, each of those steps is a reduction to 70% of the previous value.

This is a case where the numbers (as I’ve expressed them there) don’t match our experience. We hear each reduction in level as the same as the other steps, but they don’t look like they’re the same step size when we write them all down the way I’ve done above. (In other words, the numerical “distance” between 100 and 70 is not the same as the numerical “distance” between 49 and 34, but these steps would sound like the same difference in audio level.)

SIDEBAR: This is very similar / identical to the way we hear and express frequency changes. For example, the figure below shows a musical staff. The red brackets on the left show 3 spacings of one octave each; the distance between each of the marked frequencies sound the same to us. However, as you can see by the frequency indications, each of those octaves has a very different “width” in terms of frequency. Seen another way, the distance in Hertz in the octave from 440 Hz to 880 Hz is equal to the distance from 440 Hz all the way down to 0 Hz (both have a width of 440 Hz). However, to us, these sound like very different intervals.

SIDEBAR to the SIDEBAR: This also means that the distance in Hertz covered by the top octave on a piano is larger than the the distance covered by all of the other keys.

SIDEBAR to the SIDEBAR to the SIDEBAR: This also means that changing your sampling rate from 48 kHz to 96 kHz doubles your bandwidth, but only gives you an extra octave. However, this is not an argument against high-resolution audio, since the frequency range of the output is a small part of the list of pro’s and con’s.)

This is why people who deal with audio don’t use percent – ever. Instead, we use an extra bit of math that uses an evil concept called a logarithm to help to make things make more sense.

What is a logarithm?

If I say the following, you should not raise your eyebrows:

2*3 = 6, therefore 6/2 = 3 and 6/3 = 2

In other words, division is just multiplication done backwards. This can be generalised to the following:

if a*b=c, then c/a=b and c/b=a

Logarithms are similar; they’re just exponents done backwards. For example:

102 = 100, therefore Log10(100) = 2

and generally:

AB=C, therefore LogA(C) = B

Why use a logarithm?

The nice thing about logarithms is that they are a convenient way for a mathematician to do addition instead of multiplication.

For example, if I have the following sequence of numbers:

2, 4, 8, 16, 32, 64, and so on…

It’s easy to see that I’m just multiplying by 2 to get the next number.

What if I were to express the number sequence above as a series of exponents? Then it would look like this:

21, 22, 23, 24, 25, 26

Not useful yet…

What if I asked you to multiply two numbers in that sequence? Say, for example, 1024 * 8192. This would take some work (or at least some scrambling, looking for the calculator app on your phone…). However, it helps to know that this is the same as asking you to multiply 210 * 213 – to which the answer is 223. Notice that 23 is merely 10+13. So, I’ve used exponents to convert the problem from multiplication (1024*8192) to addition (210 * 213 = 2(10+13)).

How did I find out that 8192 = 213? By using a logarithm : Log2(8192) = 13.

In the old days, you would have been given a book of logarithmic tables in school, which was a way of looking up the logarithm of 8192. (Actually, those books were in base 10 and not base 2, so you would have found out that Log10(8192) = 3.9013, which would have made this discussion more confusing…) Nowadays, you can use an antique device called a “calculator” – a simulacrum of which is probably on a device you call a “phone” but is rarely used as such.

I will leave it to the reader to figure out why this quality of logarithms (that they convert multiplication into addition) is why slide rules work.

So what?

Let’s go back to the problem: We want to make a volume slider (or knob) where an equal distance (or rotation) corresponds to an equal change in level. Let’s do a simple one that has 10 steps. Coming down from “maximum” (which we’ll say is a gain of 1 or 100%), it could look like these:

The gain values for four different versions of a 10-step volume control.

The plot above shows four different options for our volume controller. Starting at the maximum (volume step 10) and working downwards to the left, each one drops by the same perceived amount per step. The Black plot shows a drop of 90% per step, the red plot shows a drop of 70% per step (which matches the list of values I put above), Blue is 50% per step, and green is 30% per step.

As you can see, these lines are curved. As you can also see, as you get lower and lower, they get to the point where it gets harder to see the value (for example, the green curve looks like it has the same gain value for Volume steps 1 through 4).

However, we can view this a different way. If we change the scale of our graph’s Y-Axis to a logarithmic one instead of a linear one, the exact same information will look like this:

The same data plotted using a different scale for the Y-Axis.

Notice now that the Y-axis has an equal distance upwards every time the gain multiplies by 10 (the same way the music staff had the same distance every time we multiplied the frequency by 2). By doing this, we now see our gain curves as straight lines instead of curved ones. This makes it easy to read the values both when they’re really small and when they’re (comparatively) big (those first 4 steps on the green curve don’t look the same on that plot).

So, one way to view the values for our Volume controller is to calculate the gains, and then plot them on a logarithmic graph. The other way is to build the logarithm into the gain itself, which is what we do. Instead of reading out gain values in percent, we use Bels (named after Alexander Graham Bell). However, since a Bel is a big step, we we use tenths of a Bel or “decibels” instead. (… In the same way that I tell people that my house is 4,000 km, and not 4,000,000 m from my Mom’s house because a metre is too small a division for a big distance. I also buy 0.5 mm pencil leads – not 0.0005 m pencil leads. There are many times when the basic unit of measurement is not the right scale for the thing you’re talking about.)

In order to convert our gain value (say, of 0.7) to decibels, we do the following equation:

20 * Log10(gain) = Gain in dB

So, we would say

20 * Log10(0.7) = -3.01 dB

I won’t explain why we say 20 * the logarithm, since this is (only a little) complicated.

I will explain why it’s small-d and capital-B when you write “dB”. The small-d is the convention for “deci-“, so 1 decimetre is 1 dm. The capital-B is there because the Bel is named after Alexander Graham Bell. This is similar to the reason we capitalize Hz, V, A, and so on…

So, if you know the linear gain value, you can calculate the equivalent in decibels. If I do this for all of the values in the plots above, it will look like this:

Notice that, on first glance, this looks exactly like the plot in the previous figure (with the logarithmic Y-Axis), however, the Y-Axis on this plot is linear (counting from -100 to 0 in equal distances per step) because the logarithmic scaling is already “built into” the values that we’re plotting.

For example, if we re-do the list of gains above (with a little rounding), it becomes

100% = 0 dB
70% = -3 dB
49% = -6 dB
34% = -9 dB
24% = -12 dB
17% = -15 dB
12% = -18 dB
8% = -21 dB

Notice coming down that list that each time we multiplied the linear gain by 0.7, we just subtracted 3 from the decibel value, because, as we see in the equation above, these mean the same thing.

This means that we can make a volume control – whether it’s a slider or a rotating knob – where the amount that you move or turn it corresponds to the change in level. In other words, if you move the slider by 1 cm or rotate the knob by 10º – NO MATTER WHERE YOU ARE WITHIN THE RANGE – the change is level will be the same as if you made the same movement somewhere else.

This is why Bang & Olufsen devices made since about 1990 (give or take) have a volume control in decibels. In older models, there were 79 steps (0 to 78) or 73 steps (0 to 72), which was expanded to 91 steps (0 to 90) around the year 2000, and expanded again recently to 101 steps (0 to 100). Each step on the volume control corresponds to a 1 dB change in the gain. So, if you change the volume from step 30 to step 40, the change in level will appear to be the same as changing from step 50 to step 60.

Volume Step ≠ Output Level

Up to now, all I’ve said can be condensed into two bullet points:

  • Volume control is a change in the gain value that is multiplied by the incoming signal
  • We express that gain value in decibels to better match the way we hear changes in level

Notice that I didn’t say anything in those two statements about how loud things actually are… This is because the volume setting has almost nothing to do with the level of the output, which, admittedly, is a very strange thing to say…

For example, get a DVD or Blu-ray player, connect it to a television, set the volume of the TV to something and don’t touch it for the rest of this experiment. Now, put in a DVD copy of any movie that has ONLY dialogue, and listen to how loud it sounds. Then, take out the DVD and put in a CD of Metallica’s Death Magnetic, press play. This will be much, much louder. In fact, if you own a B&O TV, the difference in level between those two things is the same as turning up the volume by 31 steps, which corresponds to 31 dB. Why?

When re-recording engineers mix a movie, they aim to make the dialogue sit around a level of 31 dB below maximum (better known as -31 dB FS or “31 decibels below Full Scale”). This gives them enough “headroom” to get much louder for explosions and gunshots to be exciting.

When a mixing engineer and a mastering engineer work on a pop or rock album, it’s not uncommon for them to make it as loud as possible, aiming for maximum (better known as 0 dB FS).

This means that a movie’s dialogue is much quieter than Metallica or Billie Eilish or whomever is popular when you’re reading this, because Metallica is as loud as the explosions in the movie.

The volume setting is just a value that changes that input level… So, If I listen to music at volume step 42 on a Beovision television, and you watch a movie at volume step 73 on the same Beovision television, it’s possible that we’re both hearing the same sound pressure level in our living rooms, because the music is 31 dB louder than the movie, which is the same amount that I’ve turned down my TV relative to yours (73-42 = 31).

In other words, the Volume Setting is not a predictor of how loud it is. A Volume Setting is a little like the accelerator pedal in your car. You can use the pedal to go faster or slower, but there’s no way of knowing how fast you’re driving if you only know how hard you’re pushing on the pedal.

What about other brands and devices?

This is where things get really random:

  • take any device (or computer or audio software)
  • play a sine wave (because thats easy to measure)
  • measure the change in output level as you change the volume setting
  • graph the result
  • Repeat everything above for different devices

You’ll see something like this:

The gain vs. Volume step behaviours of 8 different devices / software players

there are two important things to note in the above plot.

  1. These are the measurements of 8 different devices (or software players or “apps”) and you get 8 different results (although some of them overlap, but this is because those are just different version numbers of the same apps).
    • Notice as well that there’s a big difference here. At a volume setting of “50%” there’s a 20 dB difference between the blue dashed line and the black one with the asterisk markings. 20 dB is a LOT.
  2. None of them look like the straight lines seen in the previous plot, despite the fact that the Y-axis is in decibels. In ALL of these cases, the biggest jumps in level happen at the beginning of the volume control (some are worse than others). This is not only because they’re coming up from a MUTE state – but because they’re designed that way to fool you. How?

Think about using any of these controllers: you turn it 25% of the way up, and it’s already THIS loud! Cool! This speaker has LOTS of power! I’m only at 25%! I’ll definitely buy it! But the truth is, when the slider / knob is at 25% of the way up, you’re already pushing close to the maximum it can deliver.

These are all the equivalent of a car that has high acceleration when starting from 0 km/h, but if you’re doing 100 km/h on the highway, and you push on the accelerator, nothing happens.

First impressions are important…

On the other hand (in support of thee engineers who designed these curves), all of these devices are “one-offs” (meaning that they’re all stand-alone devices) made by companies who make (or expect to be connected to) small loudspeakers. This is part of the reason why the curves look the way they do.

If B&O used those style of gain curves for a Beovision television connected to a pair of Beolab 90s, you’d either

  • be listening at very loud levels, even at low volume settings;
  • or you wouldn’t be able to turn it up enough for music with high dynamic range.

Some quick conclusions

Hopefully, if you’ve read this far and you’re still awake:

  • you will never again use “percent” to describe your volume level
  • you will never again expect that the output level can be predicted by the volume setting
  • you will never expect two devices of two different brands to output the same level when set to the same volume setting
  • you understand why B&O devices have so many volume steps over such a large range.