“High-Res” Audio: Part 2 – Resolution

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.

Back to Part 1

In the last posting, I talked about how digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time…

For now, we’ll say that each measurement is rounded off to the nearest possible “tick” on the ruler that we’re using to measure the voltage. That rounding results in an error. However, (assuming that everything is working correctly) that error can never be bigger than 1/2 of a “step”. Therefore, in order to reduce the amount of error, we need to increase the number of ticks on the ruler.
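As a minimal sketch of that rounding, the Python below snaps a few measurements to the nearest tick and checks that the error stays within half a step. The step size and the measurement values are made up for illustration.

```python
def quantise(value, step):
    """Round a measurement to the nearest "tick", i.e. the nearest multiple of step."""
    return round(value / step) * step

step = 0.125  # hypothetical distance between two ticks on the ruler
measurements = [0.3000, 0.4950, 0.0678, -0.7310]  # hypothetical voltages

# The error is the rounded value minus the value we actually measured.
errors = [quantise(v, step) - v for v in measurements]
worst = max(abs(e) for e in errors)

print(worst <= step / 2)
```

Making the step smaller (more ticks on the ruler) shrinks the largest possible error by the same proportion.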

Now we have to introduce a new word. If we really had a ruler, we could talk about whether the ticks are 1 mm apart – or 1/16″ – or whatever. We talk about the resolution of the ruler in terms of distance between ticks. However, if we are going to be more general, we can talk about the distance between two ticks being one “quantum” – a fancy word for the smallest step size on the ruler.

So, when you’re “rounding off to the nearest value” you are “quantising” the measurement (or “quantizing” it, if you live in Noah Webster’s country and therefore you harbor the belief that wordz should be spelled like they sound – and therefore the world needz more zees). This also means that the amount of error that you get as a result of that “rounding off” is called “quantisation error”.

In some explanations of this problem, you may read that this error is called “quantisation noise”. However, this isn’t always correct. This is because if something is “noise” then it is random, and therefore impossible to predict. However, that’s not strictly the case for quantisation error. If you know the signal, and you know the quantisation values, then you’ll be able to predict exactly what the error will be. So, although that error might sound like noise, technically speaking, it’s not. This can easily be seen in Figures 1 through 3 which demonstrate that the quantisation error causes a periodic, predictable error (and therefore harmonic distortion), not a random error (and therefore noise).

Sidebar: The reason people call it quantisation noise is that, if the signal is complicated (unlike a sine wave) and high in level relative to the quantisation levels – say a recording of Britney Spears, for example – then the distortion that is generated sounds “random-ish”, which causes people to jump to the conclusion that it’s noise.

Fig 1: The first cycle of a periodic signal (in this case, a sinusoidal waveform) that we are going to quantise using a 4-bit system (notice the 4 bits in the scale on the left).
Fig 2: The same waveform shown in Figure 1 after quantisation (rounding off) in a 4-bit world.
Fig 3: The difference between Figure 2 and Figure 1. I made this by subtracting the original signal from the quantised version. This is the error in the quantised waveform – the quantisation error. Notice that it is not noise… it’s completely predictable and it will repeat with repetitions of the signal. Therefore the result of this is distortion, not noise…
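The repeating nature of that error can be checked numerically. The sketch below quantises two consecutive cycles of a sine wave and compares the errors. The number of samples per cycle, the amplitude of 0.8 and the 4-bit ruler spanning -1 to +1 are assumptions chosen for illustration, and clipping is not modelled.

```python
import math

SAMPLES_PER_CYCLE = 48   # hypothetical number of measurements per cycle
STEP = 2 / 2**4          # assumes a 4-bit "ruler" spanning -1 to +1
AMPLITUDE = 0.8          # hypothetical, kept below full scale

def quantise(value, step=STEP):
    """Round to the nearest multiple of step."""
    return round(value / step) * step

def quantisation_error(sample_index):
    """Quantised value minus original value for one sample of the sine wave."""
    x = AMPLITUDE * math.sin(2 * math.pi * sample_index / SAMPLES_PER_CYCLE)
    return quantise(x) - x

first_cycle = [quantisation_error(n) for n in range(SAMPLES_PER_CYCLE)]
second_cycle = [quantisation_error(n + SAMPLES_PER_CYCLE)
                for n in range(SAMPLES_PER_CYCLE)]

# If the error were noise, the two lists would be unrelated.
repeats = all(math.isclose(a, b, abs_tol=1e-9)
              for a, b in zip(first_cycle, second_cycle))
print(repeats)
```

The error in the second cycle matches the error in the first, sample for sample, which is what makes it distortion rather than noise.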

Now, let’s talk about perception for a while… We humans are really good at detecting patterns – signals – in an otherwise noisy world. This is just as true with hearing as it is with vision. So, if you have a sound that exists in a truly random background noise, then you can focus on listening to the sound and ignore the noise. For example, if you (like me) are old enough to have used cassette tapes, then you can remember listening to songs with a high background noise (the “tape hiss”) – but it wasn’t too annoying because the hiss was independent of the music, and constant. However, if you, like me, have listened to Bob Marley’s live version of “No Woman No Cry” from the “Legend” album, then you, like me, would miss the feedback in the PA system at that point in the song when the FoH engineer wasn’t paying enough attention… That noise (the howl of the feedback) is not noise – it’s a signal… Which makes it just as important as the song itself. (I could get into a long boring talk about John Cage at this point, but I’ll try to not get too distracted…)

The problem with the signal in Figure 2 is that the error (shown in Figure 3) is periodic – it’s a signal that demands attention. If the signal that I was sending into the quantisation system (in Figure 1) was a little more complicated than a sine wave – say a sine wave with an amplitude modulation – then the error would be easily “trackable” by anyone who was listening.

So, what we want to do is to quantise the signal (because we’re assuming that we can’t make a better “ruler”) but to make the error random – so it is changed from distortion to noise. We do this by adding noise to the signal before we quantise it. The result of this is that the error will be randomised, and will become independent of the original signal… So, instead of a modulating signal with modulated distortion, we get a modulated signal with constant noise – which is easier for us to ignore. (It has the added benefit of spreading the frequency content of the error over a wide frequency band, rather than being stuck on the harmonics of the original signal… but let’s not talk about that…)
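A rough sketch of the idea, under stated assumptions: the noise here is uniformly distributed over plus or minus half a step, which is only one possible choice (the text does not specify the range), and the step size, the input value and the seed are hypothetical. A constant input that sits between two ticks always rounds to the same tick without dither. With dither, it lands on one tick or the other at random, and the average of many results drifts back towards the true value.

```python
import random

STEP = 0.125  # hypothetical distance between ticks

def quantise(value, step=STEP):
    """Round to the nearest multiple of step."""
    return round(value / step) * step

def quantise_with_dither(value, rng, step=STEP):
    """Add random noise before rounding.

    Assumption: uniform noise spanning one step (+/- half a step).
    """
    noise = rng.uniform(-step / 2, step / 2)
    return quantise(value + noise, step)

rng = random.Random(1)   # fixed seed so the run is repeatable
true_value = 0.3         # sits between the ticks at 0.25 and 0.375
count = 20000

plain = [quantise(true_value) for _ in range(count)]
dithered = [quantise_with_dither(true_value, rng) for _ in range(count)]

plain_mean = sum(plain) / count        # stuck at 0.25
dithered_mean = sum(dithered) / count  # close to 0.3

print(plain_mean, dithered_mean)
```

Each individual dithered value is still wrong by up to one step, so the error has not gone away. It has just stopped depending on the signal.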

For example…

Let’s take a look at an example of this from an equivalent world – digital photography.

The photo in Figure 4 is a black and white photo – which actually means that it’s comprised of shades of gray ranging from black all the way to white. The photo has 272,640 individual pixels (because it’s 640 pixels wide and 426 pixels high). Each of those pixels is some shade of gray, but that shading does not have an infinite resolution. There are “only” 256 possible shades of gray available for each pixel.

So, each pixel has a number that can range from 0 (black) up to 255 (white).

Fig 4: A photo of a building in Paris. Each pixel in this photo has one of 256 possible levels of gray – from white (255) down to black (0).

If we were to zoom in to the top left corner of the photo and look at the values of the 64 pixels there (an 8×8 pixel square), you’d see that they are:

86 86 90 88 87 87 90 91
86 88 90 90 89 87 90 91
88 89 91 90 89 89 90 94
88 90 91 93 90 90 93 94
89 93 94 94 91 93 94 96
90 93 94 95 94 91 95 96
93 94 97 95 94 95 96 97
93 94 97 97 96 94 97 97

What if we were to reduce the available resolution so that there were fewer shades of gray between white and black? We can take the photo in Figure 4 and round the value in each pixel to the new value. For example, Figure 5 shows an example of the same photo reduced to only 6 levels of gray.

Fig 5: The same photo of the same building. Each pixel in this photo has one of 6 possible levels of gray. Notice that some details are lost – like the smooth transitions in the clouds, or the stripes in the marble in the pillars.

Now, if we look at those same pixels in the upper left corner, we’d see that their values are

102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102

They’ve all been quantised to the nearest available level, which is 102. (Our possible values are restricted to 0, 51, 102, 154, 205, and 255).
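This can be reproduced with a few lines of Python. The sketch below takes the 8×8 block of pixel values and the six available levels listed in this posting, and moves each pixel to whichever level is closest. The function name is just an illustrative choice.

```python
# The six available gray levels, as listed in the text.
LEVELS = [0, 51, 102, 154, 205, 255]

def nearest_level(pixel, levels=LEVELS):
    """Return the available gray level closest to the pixel's value."""
    return min(levels, key=lambda level: abs(level - pixel))

# The 8x8 block from the top left corner of the original photo.
original = [
    [86, 86, 90, 88, 87, 87, 90, 91],
    [86, 88, 90, 90, 89, 87, 90, 91],
    [88, 89, 91, 90, 89, 89, 90, 94],
    [88, 90, 91, 93, 90, 90, 93, 94],
    [89, 93, 94, 94, 91, 93, 94, 96],
    [90, 93, 94, 95, 94, 91, 95, 96],
    [93, 94, 97, 95, 94, 95, 96, 97],
    [93, 94, 97, 97, 96, 94, 97, 97],
]

quantised = [[nearest_level(p) for p in row] for row in original]

for row in quantised:
    print(row)
```

Every value from 86 to 97 is closer to 102 than to 51, so the whole block collapses to a single shade.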

So, we can see that, by quantising the gray levels from 256 possible values down to only 6, we lose details in the photo. This should not be a surprise… That loss of detail means that, for example, the gentle transition from lighter to darker gray in the sky in the original is “flattened” to a light spot in a darker background, with a jagged edge at the transition between the two. Also, the details of the wall pillars between the windows are lost.

If we take our original photo and add noise to it – so we’re adding a random value to the value of each pixel in the original photo (I won’t talk about the range of those random values…) – it will look like Figure 6. This photo has all 256 possible values of gray – the same as in Figure 4.

Fig 6: A photo of noise with the same width and height as the original photo, with random values (ranging from 0 to 255) in each pixel.

If we then quantise Figure 6 using our 6 possible values of gray, we get Figure 7. Notice that, although we do not have more grays than in Figure 5, we can see things like the gradual shading in the sky and some details in the walls between the tall windows.

Fig 7: The same photo of the same building in Figure 4. Each pixel in this photo ALSO only has one of 6 possible levels of gray – just like in Figure 5. However, this version is the result of quantising the original photo with the noise added before quantisation. The result is admittedly noisy – but we are able to see patterns in the noise that preserve some of the details that we lost in Figure 5.

That noise that we add to the original signal is called dither – because it is forcing the quantiser to be indecisive about which level to choose.

I should be clear here and say that dither does not eliminate quantisation error. The purpose of dither is to randomise the error, turning the quantisation error into noise instead of distortion. This makes it (among other things) independent of the signal that you’re listening to, so it’s easier for your brain to separate it from the music, and ignore it.

Addendum: Binary basics and SNR

We normally write down our numbers using a “base 10” notation. So, when I write down 9374 – I mean
9 x 1000 + 3 x 100 + 7 x 10 + 4 x 1
or
9 x 10^3 + 3 x 10^2 + 7 x 10^1 + 4 x 10^0

We use base 10 notation – a system based on 10 digits (0 through 9) because we have 10 fingers.

If we only had 2 fingers, we would do things differently… We would only have 2 digits (0 and 1) and we would write down numbers like this:
11101

which would be the same as saying
1 x 16 + 1 x 8 + 1 x 4 + 0 x 2 + 1 x 1
or
1 x 2^4 + 1 x 2^3 + 1 x 2^2 + 0 x 2^1 + 1 x 2^0
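Both expansions follow the same rule: each digit is multiplied by the base raised to the power of its position. A small sketch (the function name is an illustrative choice) that works for either base:

```python
def positional_value(digits, base):
    """Work out the value of a string of digits written in the given base."""
    total = 0
    for digit in digits:
        # Shift everything so far up by one position, then add the new digit.
        total = total * base + int(digit)
    return total

decimal_example = positional_value("9374", 10)
binary_example = positional_value("11101", 2)

print(decimal_example)  # 9374
print(binary_example)   # 29, which is 16 + 8 + 4 + 0 + 1
```

Python's built-in `int("11101", 2)` does the same conversion.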

The details of this are not important – but one small point is. If we’re using a base-10 system and we increase the number by one more digit – say, going from a 3-digit number to a 4-digit number, then we increase the possible number of values we can represent by a factor of 10. (In other words, there are 10 times as many possible values in the number XXXX than in XXX.)

If we’re using a base-2 system and we increase by one extra digit, we increase the number of possible values by a factor of 2. So XXXX has 2 times as many possible values as XXX.

Now, remember that the error that we generate when we quantise is no bigger than 1/2 of a quantisation step, regardless of the number of steps. So, if we double the number of steps (by adding an extra binary digit or bit to the value that we’re storing), then the signal can be twice as “far away” from the quantisation error.

This means that, by adding an extra bit to the stored value, we increase the potential signal-to-error ratio of our LPCM system by a factor of 2 – or 6.02 dB.

So, if we have a 16-bit LPCM signal, then a sine wave at the maximum level that it can be without clipping is about 6 dB/bit * 16 bits – 3 dB = 93 dB louder than the error. The reason we subtract the 3 dB from the value is that the error is +/- 0.5 of a quantisation step (normally called an “LSB” or “Least Significant Bit”).

Note as well that this calculation is just a rule of thumb. It is neither precise nor accurate, since the details of exactly what kind of error we have will have a minor effect on the actual number. However, it will be close enough.
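For reference, the arithmetic can be sketched in a few lines. The first function converts a ratio of levels to decibels, which is where the 6.02 dB for a factor of 2 comes from. The second simply restates the rule of thumb used in this posting (about 6 dB per bit, minus 3 dB). It is not a precise formula, and the function names are illustrative.

```python
import math

def ratio_to_db(ratio):
    """Convert a ratio of two levels (e.g. voltages) to decibels."""
    return 20 * math.log10(ratio)

def rule_of_thumb_db(bits):
    """The approximation used in the text: about 6 dB per bit, minus 3 dB."""
    return 6 * bits - 3

db_per_bit = ratio_to_db(2)

print(round(db_per_bit, 2))   # 6.02
print(rule_of_thumb_db(16))   # 93
```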

On to Part 3.

“High-Res” Audio: Part 1

I’ve been debating writing a series of postings about “high resolution” audio for a long time – years. Lately, (probably because of some hype generated by some recent press releases) I’ve been getting lots of question (no, that’s not a typo) about it, so it appears the time has come…

To start: the question that I get (a lot) is “If I can’t hear above 20 kHz, then what’s the use of high-res?” As I’ll explain as we go through, this is only one, rather small aspect to consider in this topic. In fact, it might be the least important issue to consider.

However, before I write too much, I’ll say that I’m not going to argue for or against higher resolutions in digital audio systems. I’m only going to go through a bunch of issues that can be used to argue either for or against them. So, there’s not going to be a big reveal at the end of this series telling you that high-res is either better, worse, or no different than whatever you’re using now. It’s merely going to be a discussion of a number of issues that need to be weighed. The problem is that this entire topic is complicated – and there’s no single “right” answer, as I’ll argue as we go along.

To start, let’s get down to basics and look (once again, from the perspectives of this website) at what sound is, and how it’s converted from an analogue electrical signal into a digital representation. The good thing is that I’ve written this introduction before in a different series of postings. So, I’m going to be extremely lazy and just copy-and-paste that information here. I’m not just referring you to another page because I’m intentionally leaving some things out because we’re headed into having a different discussion this time.

A quick introduction to sound

At the simplest level, sound can be described as a small change in air pressure (or barometric pressure) over short periods of time. If you’d like to have a better and more edu-tain-y version of this statement with animations and pretty colours, you could take 10 minutes to watch this video, for example.

That change in pressure can be “captured” by using a microphone, that is (at the simplest level) a device that has a change in air pressure at its input and a change in electrical voltage at its output. Ignoring a lot of details, we could say that if you were to plot a measurement of the air pressure (at the input of the microphone) over time, and you were to compare it to a plot of the measurement of the voltage (at the output of the microphone) over time, you would see the same curve on the two graphs. This means that the change in voltage is analogous to the change in air pressure.

Fig 1. Notice that (in theory, and ignoring a lot of things…) the change in air pressure over time at the input of the microphone is identical to the change in voltage over time at its output. Of course, this is not true in real life – microphones lie like a cheap rug…

At this point in the conversation, I’ll make a point to say that, in theory, we could “zoom in” on either of those two curves shown in Figure 1 and see more and more details. This is like looking at a map of Canada – it has lots of crinkly, jagged lines. If you zoom in and look at the map of Newfoundland and Labrador, you’ll see that it has finer, crinkly, jagged lines. If you zoom in further, and stand where the water meets the shore in Trepassey and take a photo of your feet, you could copy it to draw a map of the line of where the water comes in around the rocks – and your toes – and you would wind up with even finer, crinkly, jagged lines… You could take this even further and get down to a microscopic or molecular level – but you get the idea… The point is that, in theory, both of the plots in Figure 1 have infinite resolution, both in time and in air pressure or voltage.

Now, let’s say that you wanted to take that microphone’s output and transmit it through a bunch of devices and wires that, in theory, all do nothing to the signal. Let’s say, for example, that you take the mic’s output, send it through a wire to a box that makes the signal twice as loud. Then take the output of that box and send it through a wire to another box that makes it half as loud. You take the output of that box and send it through a wire to a measuring device. What will you see? Unfortunately, none of the wires or boxes in the chain can be perfect, so you’ll probably see the signal plus something else which we’ll call the “error” in the system’s output. We can call it the error because, if we measure the input voltage and the output voltage at any one instant, we’ll probably see that they’re not identical. Since they should be identical, then the system must be making a mistake in transmitting the signal – so it makes errors…

Fig 2. If you send an audio signal through some wires and devices that (in theory) do nothing to the signal, you’ll find out that they add some extra stuff that you don’t want.

Pedantic Sidebar: Some people will call that error that the system adds to the signal “noise” – but I’m not going to call it that. This is because “noise” is a specific thing – noise is random – so if it’s not random, it’s not noise. Also, although the signal has been distorted (in that the output of the system is not identical to the input) I won’t call it “distortion” either, since distortion is a name that’s given to something that happens to the signal because the signal is there. (We would probably get at least some of the error out of our system even if we didn’t send any audio into it.) So, we could be slightly geeky and adequately vague and call the extra stuff “Distortion plus noise” but not “THD+N” – which stands for “Total Harmonic Distortion Plus Noise” – because not all kinds of distortion will produce a harmonic of the signal… but I’m getting ahead of myself…

So, we want to transmit (or store) the audio signal – but we want to reduce the noise caused by the transmission (or storage) system. One way to do this is to spend more money on your system. Use wires with better shielding, amplifiers with lower noise floors, bigger power supplies so that you don’t come close to their limits, run your magnetic tape twice as fast, and so on and so on. Or, you could convert the analogue signal (remember that it’s analogous to the change in air pressure over time) to one that is represented (and therefore transmitted or stored) digitally instead.

What does this mean?

Conversion from analogue to digital and back
(but skipping important details)

IMPORTANT: If you read this section, then please read the following postings as well. This is because, in order to keep things simple to start, I’m about to leave out some important details that I’ll add afterwards. However, if you don’t add the details, you could (understandably) jump to some incorrect conclusions (that many others before you have concluded…) So, if you don’t have time to read both sections, please don’t read either of them.

In the example above, we made a varying voltage that was analogous to the varying air pressure. If we wanted to store this, we could do it by varying the amount of magnetism on a wire or a coating on a tape, for example. Or we could cut a wiggly groove in a bit of vinyl that has a similar shape to the curve in the plots in Figure 1. Or, we could do something else: we could get a metronome (or a clock) and make a measurement of the voltage every time the metronome clicks, and write down the measurements.

For example, let’s zoom in on the first little bit of the signal in the plots in Figure 1

Fig 3. The same curve as was shown in Figure 1 – but zoomed in to the very beginning.

We’ll then put on a metronome and make a measurement of the voltage every time we hear the metronome click…

Fig 4. The same curve (in red) measured at regular intervals (in black)

We can then keep the measurements (remembering how often we made them…) and write them down like this:

0.3000
0.4950
0.5089
0.3351
0.1116
0.0043
0.0678
0.2081
0.2754
0.2042
0.0730
0.0345
0.1775

We can store this series of numbers on a computer’s hard disk, for example. We can then come back tomorrow, and convert the measurements to voltages. First we read the measurements, and create the appropriate voltage…

Fig. 5. The voltages that we stored as measurements

We then make a “staircase” waveform by “holding” those voltages until the next value comes in.

Fig 6. We make a “staircase” curve using the voltages.

All we need to do then is to use a low-pass filter to smooth out the hard edges of the staircase.

Fig 7. When we smooth out the staircase, we get back the original signal (in red).

So, in this example, we’ve gone from an analogue signal (the red curve in Figure 3) to a digital signal (the series of numbers), and back to an analogue signal (the red curve in Figure 7).
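The first two steps of that round trip (measuring at regular intervals, then holding each value to make the staircase) can be sketched as follows. The signal, the sampling rate and the number of points per step are all hypothetical, and the final low-pass filter that smooths the staircase is deliberately left out.

```python
import math

def sample(signal, sample_rate, count):
    """Measure the signal every time the "metronome" clicks."""
    return [signal(n / sample_rate) for n in range(count)]

def hold(samples, points_per_step):
    """Build the staircase: repeat each stored value until the next one arrives."""
    return [value for value in samples for _ in range(points_per_step)]

def voltage(t):
    """A made-up signal: a 100 Hz sine wave sitting on a 0.3 V offset."""
    return 0.3 + 0.2 * math.sin(2 * math.pi * 100 * t)

samples = sample(voltage, sample_rate=1000, count=13)  # 1000 clicks per second
staircase = hold(samples, points_per_step=4)

print([round(v, 4) for v in samples])
```

The list of samples is what would be written to disk. The staircase is what comes out of the playback side before smoothing.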

In some ways, this is a bit like the way a movie works. When you watch a movie, you see a series of still photographs, probably taken at a rate of 24 pictures (or frames) per second. If you play those photos back at the same rate (24 fps or frames per second), you think you see movement. However, this is because your eyes and brain aren’t fast enough to see 24 individual photos per second – so you are fooled into thinking that things on the screen are moving.

However, digital audio is slightly different from film in two ways:

  • The sound (equivalent to the movement in the film) is actually happening. It’s not a trick that relies on your ears and brain being too slow.
  • If, when you were filming the movie, something were to happen between frames (say, the flash of a gunshot, for example) then it would never be caught on film. This is because the photos are discrete moments in time – and what happens between them is lost. However, if something were to make a very, very short sound between two samples (two measurements) in the digital audio signal – it would not be lost. This is because of something that happens at the beginning of the chain that I haven’t described… yet…

However, there are some “artefacts” (a fancy term for “weird errors”) that are present both in film and in digital audio that we should talk about.

The first is an error that happens when you mess around with the rate at which you take the measurements (called the “sampling rate”) or the photos (called the “frame rate”) – and, more importantly, when you need to worry about this. Let’s say that you make a film at 24 fps. If you play this back at a higher frame rate, then things will move very quickly (like old-fashioned baseball movies…). If you play them back at a lower frame rate, then things move in slow motion. So, for things to look “normal” you have to play the movie at the same rate that it was filmed. However, as long as no one is looking, you can transfer the movie as fast as you like. For example, if you wanted to copy the film, you could set up a movie camera so it was pointing at a movie screen and film the film. As long as the movie on the screen is running in sync with the camera, you can do this at any frame rate you like. But you’ll have to watch the copy at the same frame rate as the original film… (Note that this issue is not something that will come up in this series of postings about high resolution audio)

The second is an easy artefact to recognise. If you see a car accelerating from 0 to something fast on film, you’ll see the wheels of the car start to get faster and faster, then, as the car gets faster, the wheels slow down, stop, and then start going backwards… This does not happen in real life (unless you’re in a place lit with flashing lights like fluorescent bulbs or LEDs). I’ll do a posting explaining why this happens – but the thing to remember here is that the speed of the wheel rotation that you see on the film (the one that’s actually captured by the filming…) is not the real rotational speed of the wheel. However, those two rotational speeds are related to each other (and to the frame rate of the film). If you change the real rotational rate or the frame rate, you’ll change the rotational rate in the film. So, we call this effect “aliasing” because it’s a false version (an alias) of the real thing – but it’s always the same alias (assuming you repeat the conditions…) Digital audio can also suffer from aliasing, but in this case, you put in one frequency (which is actually the same as a rotational speed) and you get out another one. This is not the same as harmonic distortion, since the frequency that you get out is due to a relationship between the original frequency and the sampling rate, so the result is almost never a multiple of the input frequency. (We’re going to dig into this a lot deeper through this series of postings about high resolution audio, so if it doesn’t immediately make sense, don’t worry…)
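The relationship between the real rate, the sampling (or frame) rate and the alias can be written down in one line. The sketch below assumes nothing is done to prevent aliasing, and the function name and example numbers are illustrative. The result is signed, so a negative value means the wheel appears to turn backwards.

```python
def apparent_rate(real_rate, sampling_rate):
    """The rate that seems to be present after sampling, with no protection
    against aliasing. Negative means "appears to run backwards"."""
    nearest_multiple = round(real_rate / sampling_rate)
    return real_rate - sampling_rate * nearest_multiple

# A wheel filmed at 24 frames per second (rates in revolutions per second).
for wheel in (1, 23, 24, 25):
    print(wheel, apparent_rate(wheel, 24))

# The same thing in audio: 30 kHz sampled at 48 kHz comes out at 18 kHz.
alias = abs(apparent_rate(30000, 48000))
print(alias)
```

A wheel turning 23 times per second looks like it is turning backwards once per second, and one turning 24 times per second looks like it has stopped. The 18 kHz alias in the audio example is not a multiple of the 30 kHz input, which is why this is not harmonic distortion.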

Some important details that I left out…

One of the things I said above was something like “we measure the voltage and store the results” and the example I gave was a nice series of numbers that only had 4 digits after the decimal point. This statement has some implications that we need to discuss.

Let’s say that I have a thing that I need to measure. For example, Figure 8 shows a piece of metal, and I want to measure its width.

Fig 8. A piece of metal with a width of “approximately 57 mm”.

Using my ruler, I can see that this piece of metal is about 57 mm wide. However, if I were geeky (and I am) I would say that this is not precise enough – and therefore it’s not accurate. The problem is that my ruler is only graduated in millimetres. So, if I try to measure anything that is not exactly an integer number of mm long, I’ll either have to guess (and be wrong) or round the measurement to the nearest millimetre (and be wrong).

So, if I wanted you to make a piece of metal the same width as my piece of metal, and I used the ruler in Figure 8, we would probably wind up with metal pieces of two different widths. In order to make this better, we need a better ruler – like the one in Figure 9.

Fig 9. The same piece of metal being measured with a vernier caliper. This gives us additional precision (down to 0.05 mm) so we can make a more accurate measurement.

Figure 9 shows a vernier caliper (a fancy type of ruler) being used to measure the same piece of metal. The caliper has a resolution of 0.05 mm instead of the 1 mm available on the ruler in Figure 8. So, we can make a much more accurate measurement of the metal because we have a measuring device with a higher precision.

The conversion of a digital audio signal is the same. As I said above, we measure the voltage of the electrical signal, and transmit (or store) the measurement. The question is: how accurate and precise is your measurement? As we saw above, this is (partly) determined by how many digits are in the number that you use when you “write down” the measurement.

Since the voltage measurements in digital audio are recorded in binary rather than decimal (we use 0 and 1 to write down the number instead of 0 up to 9) then we use Binary digITS – or “bits” instead of decimal digits (which are not called “dits”). The number of bits we have in the number that we write down (partly) determines the precision of the measurement of the voltage – and therefore (possibly), our accuracy…

Just like the example of the ruler in Figure 8, above, we have a limited resolution in our measurement. For example, if we had only 4 bits to work with then the waveform in Figure 4 – the one we have to measure – would be measured with the “ruler” shown on the left side of Figure 10, below.

Fig 10: The waveform from Figure 4 as a voltage (notice the Y-axis on the right). We have to measure these values using the ruler with the resolution shown on the Y-axis on the left.

When we do this, we have to round off the value to the nearest “tick” on our ruler, as shown in Figure 11.

Fig 11: The values from Figure 10 (shown as the circles) rounded off to the nearest value on our 4-bit ruler (the red staircase).

Using this “ruler” which gives a write-down-able “quantity” to the measurement, we get the following values for the red staircase:

0010
0100
0100
0011
0001
0000
0001
0010
0010
0010
0001
0000
0001
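As a sketch of how those codes could be produced from the measurements listed earlier in this posting: the step size of 0.125 V is an assumption (a range of 2 V divided into 16 steps) that is not stated at this point in the text, although it does reproduce the list above. The code only handles the non-negative values in this example and does not deal with how negative voltages would be encoded.

```python
# The measurements written down earlier in this posting.
measurements = [0.3000, 0.4950, 0.5089, 0.3351, 0.1116, 0.0043, 0.0678,
                0.2081, 0.2754, 0.2042, 0.0730, 0.0345, 0.1775]

STEP = 0.125  # assumed: a 2 V range divided into 2**4 = 16 steps

def to_4bit_code(voltage, step=STEP):
    """Round to the nearest tick and write the tick number with 4 binary digits."""
    tick = round(voltage / step)
    return format(tick, "04b")

codes = [to_4bit_code(v) for v in measurements]

for voltage, code in zip(measurements, codes):
    print(f"{voltage:.4f} -> {code}")
```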

When we “play these back” we get the staircase again, shown in Figure 12.

Fig 12: The output of the measurements. Notice that all values sit exactly on one of the values for the “ruler” on the left Y-axis of the plot.

Of course, this means that, by rounding off the values, we have introduced an error in the system (just like the measurement in Figure 8 has a bigger error than the one in Figure 9). We can calculate this error if we just subtract the original signal from the output signal (in other words, Figure 12 minus Figure 10) to get Figure 13.

Fig 13: The error that we produced due to the rounding off of the signal when we did the measurements. Notice that the error is always less than 0.5 of a “tick” of the ruler on the left Y-axis.

In order to improve our accuracy of the measurement, we have to increase the precision of the values. We can do this by adding an extra digit (or bit) to the number that we use to record the value.

If we were using decimal numbers (0-9) then adding an extra digit to the number would give us 10 times as many possibilities. (For example, if we were using 4 digits after the decimal in the example at the start of this posting, we have a total of 10,000 possible values – 0.0000 to 0.9999. If we add one more digit, we increase the resolution to 100,000 possible values – 0.00000 to 0.99999 ).

In binary, adding one extra digit gives us twice as many “ticks” on the ruler. So, using 4 bits gives us 16 possible values. Increasing to 5 bits gives us 32 possible values.

If you’re listening to a CD, then the individual measurements of each voltage – the “sample values” – are stored with 16 bits, which means that we have 65,536 possible values to pick from.

Remember that this means that we have more “ticks” on our ruler – but we don’t necessarily increase its range. So, for example, we’re still measuring a voltage from -1 V to 1 V – we just have more and more resolution with which we can do that measurement.
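Put as arithmetic: the number of ticks doubles with each bit, while the range stays where it is, so the distance between ticks halves. A short sketch using the -1 V to 1 V range from the example (the function names are illustrative):

```python
def number_of_values(bits):
    """How many different values can be written down with this many bits."""
    return 2 ** bits

def step_size(bits, lowest=-1.0, highest=1.0):
    """Distance between two neighbouring ticks when the range is fixed."""
    return (highest - lowest) / number_of_values(bits)

for bits in (4, 5, 16):
    print(bits, number_of_values(bits), step_size(bits))
```

Going from 4 bits to 16 bits does not change the largest voltage that can be measured. It only makes the ticks 4096 times closer together.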

On to Part 2…

Turn it down half-way…

#81 in a series of articles about the technology behind Bang & Olufsen loudspeakers

Bertrand Russell once said, “In all affairs it’s a healthy thing now and then to hang a question mark on the things you have long taken for granted.”

This article is a discussion, both philosophical and technical about what a volume control is, and what can be expected of it. This seems like a rather banal topic, but I find it surprising how often I’m required to explain it.

Why am I writing this?

I often get questions from colleagues and customers that sound something like the following:

  • Why does my Beovision television’s volume control only go to 90%? Why can’t I go to 100%?
  • I set the volume on my TV to 40, so why is it so quiet (or loud)?

The first question comes from people who think that the number on the screen is in percent – but it’s not. The speedometer in your car displays your speed in kilometres per hour (km/h), the tachometer is in revolutions of the engine per minute (RPM), the temperature on your thermostat is in degrees Celsius (ºC), and the display on your Beovision television is on a scale based on decibels (dB). None of these things are in percent (imagine if the speed limit on the highway was 80% of your car’s maximum speed… we’d have more accidents…)

The short reason we use decibels instead of percent is that it means that we can use subtraction instead of division (which is harder to do). The shortcut rule-of-thumb to remember is that, every time you drop by 6 dB on the volume control, you drop by 50% of the output. So, for example, going from Volume step 90 to Volume step 84 is going from 100% to 50%. If I keep going down, then the table of equivalents looks like this:

I’ve used two colours there to illustrate two things:

  • Every time you drop by 6 volume steps, you cut the percentage in half. For example, 60 is five drops of 6 steps, which is 1/2 of 1/2 of 1/2 of 1/2 of 1/2 of 100%, or 3.2% (notice the five halves there…)
  • Every time you drop by 20, you cut the percentage to 1/10. So, Volume Step 50 is 1% of Volume Step 90 because it’s two drops of 20 on the volume control.
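As a rough sketch of the arithmetic behind those two rules: the snippet below assumes 1 dB per volume step and treats step 90 as the 100% reference, as described above. The function name `step_to_percent` is just something I made up for illustration.

```python
def step_to_percent(step, max_step=90):
    """Convert a volume step to a percentage of the output at max_step.

    Assumes 1 dB per step, so the gain in dB is simply (step - max_step).
    """
    gain_db = step - max_step
    return 100.0 * 10.0 ** (gain_db / 20.0)


# Print the percentage for a few volume steps, rounded to one decimal place.
for step in (90, 84, 70, 60, 50):
    print(step, round(step_to_percent(step), 1))
```

Strictly speaking, a 6 dB drop lands at about 50.1% rather than exactly 50%, which is why the "6 dB is half" rule is only a rule of thumb. The "20 dB is one tenth" rule is exact.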

If I graph this, showing the percentage equivalent of all 91 volume steps (from 0 to 90) then it looks like this:

Of course, the problem with this plot is that everything from about Volume Step 40 and lower looks like 0% because the plot doesn’t have enough detail. But I can fix that by changing the way the vertical axis is displayed, as shown below.

That plot shows exactly the same information. The only difference is that the vertical scale is no longer linearly counting from 0% to 100% in equal steps.

Why do we (and every other audio company) do it this way? The simple reason is that we want to make a volume slider (or knob) where an equal distance (or rotation) corresponds to an equal change in output level. We humans don’t perceive things like change in level in percent – so it doesn’t make sense to use a percent scale.

For the longer explanation, read on…

Basic concepts

We need to start at the very beginning, so here goes:

Volume control and gain

  1. An audio signal is (at least in a digital audio world) just a long list of numbers for each audio channel.
  2. The level of the audio signal can be changed by multiplying it by a number (called the gain).
    1. If you multiply by a value larger than 1, the audio signal gets louder.
    2. If you multiply by a number between 0 and 1, the audio signal gets quieter.
    3. If you multiply by zero, you mute the audio signal.
  3. Therefore, at its simplest core, a volume control implemented in a digital audio system is a multiplication by a gain. You turn up the volume, the gain value increases, and the audio is multiplied by a bigger number producing a bigger result.
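A minimal sketch of that idea, assuming the audio is simply a Python list of sample values between -1 and 1 (the name `apply_gain` is hypothetical, and a real system would do this per channel on a stream rather than on a small list):

```python
def apply_gain(samples, gain):
    """Scale every sample in a list by a linear gain value."""
    return [sample * gain for sample in samples]


signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]

quieter = apply_gain(signal, 0.5)   # gain between 0 and 1: quieter
muted = apply_gain(signal, 0.0)     # gain of 0: mute
```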

That’s the first thing. Now we move on to how we perceive things…

Perception of Level

Speaking very generally, our senses (that we use to perceive the world around us) scale logarithmically instead of linearly. What does this mean? Let’s take an example:

Let’s say that you have $100 in your bank account. If I then told you that you’d won $100, you’d probably be pretty excited about it.

However, if you have $1,000,000 in your bank account, and I told you that you’d won $100, you probably wouldn’t even bother to collect your prize.

This can be seen as strange; the second $100 prize is not less money than the first $100 prize. However, it’s perceived to be very different.

If, instead of being $100, the cash prize were “equal to whatever you have in your bank account” – so the person with $100 gets $100 and the person with $1,000,000 gets $1,000,000, then they would both be equally excited.

The way we perceive audio signals is similar. Let’s say that you are listening to a song by Metallica at some level, and I ask you to turn it down, and you do. Then I ask you to turn it down by the same amount again, and you do. Then I ask you to turn it down by the same amount again, and you do… If I were to measure what just happened to the gain value, what would I find?

Well, let’s say that, the first time, you dropped the gain to 70% of the original level, so (for example) you went from multiplying the audio signal by 1 to multiplying the audio signal by 0.7 (a reduction of 0.3, if we were subtracting, which we’re not). The second time, you would drop by the same amount – which is 70% of that – so from 0.7 to 0.49 (notice that you did not subtract 0.3 to get to 0.4). The third time, you would drop from 0.49 to 0.343. (not subtracting 0.3 from 0.4 to get to 0.1).

In other words, each time you change the volume level by the “same amount”, you’re doing a multiplication in your head (although you don’t know it) – in this example, by 0.7. The important thing to note here is that you are NOT subtracting 0.3 from the gain in each of the above steps – you’re multiplying by 0.7 each time.

What happens if I were to express the above as percentages? Then our volume steps (and some additional ones) would look like this:

100%
70%
49%
34%
24%
17%
12%
8%

Notice that there is a different “distance” between each of those steps if we’re looking at it linearly (if we’re just subtracting adjacent values to find the difference between them). However, each of those steps is a reduction to 70% of the previous value.
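To see the same thing in numbers, here is a small sketch that builds the list by repeated multiplication and then looks at the linear differences between neighbours. The function name is mine, and the ratio of 0.7 is just the example value used above.

```python
def repeated_reduction(ratio, count):
    """Return `count` gain values, each `ratio` times the previous, starting at 1."""
    gains = [1.0]
    for _ in range(count - 1):
        gains.append(gains[-1] * ratio)
    return gains


gains = repeated_reduction(0.7, 8)

# The same values as whole-number percentages.
percents = [round(100 * g) for g in gains]

# Linear "distance" between neighbouring steps: it shrinks every time,
# even though the ratio between neighbours is always the same.
differences = [a - b for a, b in zip(gains, gains[1:])]
```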

This is a case where the numbers (as I’ve expressed them there) don’t match our experience. We hear each reduction in level as the same as the other steps, but they don’t look like they’re the same step size when we write them all down the way I’ve done above. (In other words, the numerical “distance” between 100 and 70 is not the same as the numerical “distance” between 49 and 34, but these steps would sound like the same difference in audio level.)

SIDEBAR: This is very similar / identical to the way we hear and express frequency changes. For example, the figure below shows a musical staff. The red brackets on the left show 3 spacings of one octave each; the distance between each of the marked frequencies sound the same to us. However, as you can see by the frequency indications, each of those octaves has a very different “width” in terms of frequency. Seen another way, the distance in Hertz in the octave from 440 Hz to 880 Hz is equal to the distance from 440 Hz all the way down to 0 Hz (both have a width of 440 Hz). However, to us, these sound like very different intervals.

SIDEBAR to the SIDEBAR: This also means that the distance in Hertz covered by the top octave on a piano is larger than the distance covered by all of the other keys.

SIDEBAR to the SIDEBAR to the SIDEBAR: This also means that changing your sampling rate from 48 kHz to 96 kHz doubles your bandwidth, but only gives you an extra octave. However, this is not an argument against high-resolution audio, since the frequency range of the output is a small part of the list of pros and cons.

This is why people who deal with audio don’t use percent – ever. Instead, we use an extra bit of math that uses an evil concept called a logarithm to help to make things make more sense.

What is a logarithm?

If I say the following, you should not raise your eyebrows:

2*3 = 6, therefore 6/2 = 3 and 6/3 = 2

In other words, division is just multiplication done backwards. This can be generalised to the following:

if a*b=c, then c/a=b and c/b=a

Logarithms are similar; they’re just exponents done backwards. For example:

10^2 = 100, therefore Log10(100) = 2

and generally:

A^B = C, therefore LogA(C) = B

Why use a logarithm?

The nice thing about logarithms is that they are a convenient way for a mathematician to do addition instead of multiplication.

For example, if I have the following sequence of numbers:

2, 4, 8, 16, 32, 64, and so on…

It’s easy to see that I’m just multiplying by 2 to get the next number.

What if I were to express the number sequence above as a series of exponents? Then it would look like this:

2^1, 2^2, 2^3, 2^4, 2^5, 2^6

Not useful yet…

What if I asked you to multiply two numbers in that sequence? Say, for example, 1024 * 8192. This would take some work (or at least some scrambling, looking for the calculator app on your phone…). However, it helps to know that this is the same as asking you to multiply 2^10 * 2^13 – to which the answer is 2^23. Notice that 23 is merely 10+13. So, I’ve used exponents to convert the problem from multiplication (1024*8192) to addition (2^10 * 2^13 = 2^(10+13)).

How did I find out that 8192 = 2^13? By using a logarithm : Log2(8192) = 13.
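If you’d rather let a computer do the looking-up, the multiplication-to-addition trick can be sketched with Python’s standard `math.log2` function (the variable names are just for illustration):

```python
import math

a, b = 1024, 8192

# Find the exponents: which power of 2 gives each number?
exp_a = math.log2(a)
exp_b = math.log2(b)

# Multiplying the numbers is the same as adding their exponents.
product = 2 ** (exp_a + exp_b)

print(exp_a, exp_b, product)
```

This only comes out exactly because both numbers are whole powers of 2; for arbitrary numbers, the logarithms are not integers and the result is subject to floating-point rounding.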

In the old days, you would have been given a book of logarithmic tables in school, which was a way of looking up the logarithm of 8192. (Actually, those books were in base 10 and not base 2, so you would have found out that Log10(8192) = 3.9134, which would have made this discussion more confusing…) Nowadays, you can use an antique device called a “calculator” – a simulacrum of which is probably on a device you call a “phone” but is rarely used as such.

I will leave it to the reader to figure out why this quality of logarithms (that they convert multiplication into addition) is why slide rules work.

So what?

Let’s go back to the problem: We want to make a volume slider (or knob) where an equal distance (or rotation) corresponds to an equal change in level. Let’s do a simple one that has 10 steps. Coming down from “maximum” (which we’ll say is a gain of 1 or 100%), it could look like these:

The gain values for four different versions of a 10-step volume control.

The plot above shows four different options for our volume controller. Starting at the maximum (volume step 10) and working downwards to the left, each one drops by the same perceived amount per step. The black plot shows a drop to 90% of the previous value per step, the red plot shows a drop to 70% per step (which matches the list of values I put above), blue is a drop to 50% per step, and green is a drop to 30% per step.

As you can see, these lines are curved. As you can also see, as you get lower and lower, they get to the point where it gets harder to see the value (for example, the green curve looks like it has the same gain value for Volume steps 1 through 4).

However, we can view this a different way. If we change the scale of our graph’s Y-Axis to a logarithmic one instead of a linear one, the exact same information will look like this:

The same data plotted using a different scale for the Y-Axis.

Notice now that the Y-axis has an equal distance upwards every time the gain multiplies by 10 (the same way the music staff had the same distance every time we multiplied the frequency by 2). By doing this, we now see our gain curves as straight lines instead of curved ones. This makes it easy to read the values both when they’re really small and when they’re (comparatively) big (those first 4 steps on the green curve don’t look the same on that plot).

So, one way to view the values for our Volume controller is to calculate the gains, and then plot them on a logarithmic graph. The other way is to build the logarithm into the gain itself, which is what we do. Instead of reading out gain values in percent, we use Bels (named after Alexander Graham Bell). However, since a Bel is a big step, we use tenths of a Bel or “decibels” instead. (… In the same way that I tell people that my house is 4,000 km, and not 4,000,000 m from my Mom’s house because a metre is too small a division for a big distance. I also buy 0.5 mm pencil leads – not 0.0005 m pencil leads. There are many times when the basic unit of measurement is not the right scale for the thing you’re talking about.)

In order to convert our gain value (say, of 0.7) to decibels, we do the following equation:

20 * Log10(gain) = Gain in dB

So, we would say

20 * Log10(0.7) = -3.10 dB

I won’t explain why we say 20 * the logarithm, since this is (only a little) complicated.

I will explain why it’s small-d and capital-B when you write “dB”. The small-d is the convention for “deci-“, so 1 decimetre is 1 dm. The capital-B is there because the Bel is named after Alexander Graham Bell. This is similar to the reason we capitalize Hz, V, A, and so on…

So, if you know the linear gain value, you can calculate the equivalent in decibels. If I do this for all of the values in the plots above, it will look like this:

Notice that, on first glance, this looks exactly like the plot in the previous figure (with the logarithmic Y-Axis), however, the Y-Axis on this plot is linear (counting from -100 to 0 in equal distances per step) because the logarithmic scaling is already “built into” the values that we’re plotting.

For example, if we re-do the list of gains above (with a little rounding), it becomes

100% = 0 dB
70% = -3 dB
49% = -6 dB
34% = -9 dB
24% = -12 dB
17% = -15 dB
12% = -18 dB
8% = -21 dB

Notice coming down that list that each time we multiplied the linear gain by 0.7, we just subtracted 3 from the decibel value, because, as we see in the equation above, these mean the same thing.
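The conversion in both directions can be sketched in a few lines. The function names `gain_to_db` and `db_to_gain` are my own, and the ratio of 0.7 is the same example value as before.

```python
import math


def gain_to_db(gain):
    """Convert a linear gain (which must be greater than 0) to decibels."""
    return 20.0 * math.log10(gain)


def db_to_gain(gain_db):
    """Convert a gain in decibels back to a linear gain."""
    return 10.0 ** (gain_db / 20.0)


# Multiplying the linear gain by 0.7 each time subtracts the same
# number of decibels each time.
for n in range(8):
    gain = 0.7 ** n
    print(f"{100 * gain:5.1f}% = {gain_to_db(gain):6.1f} dB")
```

Without the rounding, each multiplication by 0.7 is a step of about -3.1 dB, so the unrounded values drift a little away from the tidy multiples of 3 in the list above as you go down.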

This means that we can make a volume control – whether it’s a slider or a rotating knob – where the amount that you move or turn it corresponds to the change in level. In other words, if you move the slider by 1 cm or rotate the knob by 10º – NO MATTER WHERE YOU ARE WITHIN THE RANGE – the change in level will be the same as if you made the same movement somewhere else.

This is why Bang & Olufsen devices made since about 1990 (give or take) have a volume control in decibels. In older models, there were 79 steps (0 to 78) or 73 steps (0 to 72), which was expanded to 91 steps (0 to 90) around the year 2000, and expanded again recently to 101 steps (0 to 100). Each step on the volume control corresponds to a 1 dB change in the gain. So, if you change the volume from step 30 to step 40, the change in level will appear to be the same as changing from step 50 to step 60.

Volume Step ≠ Output Level

Up to now, all I’ve said can be condensed into two bullet points:

  • Volume control is a change in the gain value that is multiplied by the incoming signal
  • We express that gain value in decibels to better match the way we hear changes in level

Notice that I didn’t say anything in those two statements about how loud things actually are… This is because the volume setting has almost nothing to do with the level of the output, which, admittedly, is a very strange thing to say…

For example, get a DVD or Blu-ray player, connect it to a television, set the volume of the TV to something and don’t touch it for the rest of this experiment. Now, put in a DVD copy of any movie that has ONLY dialogue, and listen to how loud it sounds. Then, take out the DVD and put in a CD of Metallica’s Death Magnetic, press play. This will be much, much louder. In fact, if you own a B&O TV, the difference in level between those two things is the same as turning up the volume by 31 steps, which corresponds to 31 dB. Why?

When re-recording engineers mix a movie, they aim to make the dialogue sit around a level of 31 dB below maximum (better known as -31 dB FS or “31 decibels below Full Scale”). This gives them enough “headroom” to get much louder for explosions and gunshots to be exciting.

When a mixing engineer and a mastering engineer work on a pop or rock album, it’s not uncommon for them to make it as loud as possible, aiming for maximum (better known as 0 dB FS).

This means that a movie’s dialogue is much quieter than Metallica or Billie Eilish or whoever is popular when you’re reading this, because Metallica is as loud as the explosions in the movie.

The volume setting is just a value that changes that input level… So, if I listen to music at volume step 42 on a Beovision television, and you watch a movie at volume step 73 on the same Beovision television, it’s possible that we’re both hearing the same sound pressure level in our living rooms, because the music is 31 dB louder than the movie, which is the same amount that I’ve turned down my TV relative to yours (73-42 = 31).
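The bookkeeping in that example is just addition in decibels. Here is a sketch, assuming 1 dB per volume step and a 91-step (0 to 90) volume control. The function name is hypothetical, and the result is only a relative level in dB, not an actual sound pressure level (which also depends on the loudspeakers and the room).

```python
def output_level_db(content_level_dbfs, volume_step, max_step=90):
    """Relative output level in dB, assuming 1 dB per volume step.

    The result is relative to full-scale content played at max_step.
    """
    return content_level_dbfs + (volume_step - max_step)


music = output_level_db(0, 42)     # album mastered near 0 dB FS, volume step 42
movie = output_level_db(-31, 73)   # dialogue near -31 dB FS, volume step 73

print(music, movie)
```

Both come out at the same relative level, even though the two volume settings are 31 steps apart.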

In other words, the Volume Setting is not a predictor of how loud it is. A Volume Setting is a little like the accelerator pedal in your car. You can use the pedal to go faster or slower, but there’s no way of knowing how fast you’re driving if you only know how hard you’re pushing on the pedal.

What about other brands and devices?

This is where things get really random:

  • take any device (or computer or audio software)
  • play a sine wave (because that’s easy to measure)
  • measure the change in output level as you change the volume setting
  • graph the result
  • Repeat everything above for different devices

You’ll see something like this:

The gain vs. Volume step behaviours of 8 different devices / software players

There are two important things to note in the above plot.

  1. These are the measurements of 8 different devices (or software players or “apps”) and you get 8 different results (although some of them overlap, but this is because those are just different version numbers of the same apps).
    • Notice as well that there’s a big difference here. At a volume setting of “50%” there’s a 20 dB difference between the blue dashed line and the black one with the asterisk markings. 20 dB is a LOT.
  2. None of them look like the straight lines seen in the previous plot, despite the fact that the Y-axis is in decibels. In ALL of these cases, the biggest jumps in level happen at the beginning of the volume control (some are worse than others). This is not only because they’re coming up from a MUTE state – but because they’re designed that way to fool you. How?

Think about using any of these controllers: you turn it 25% of the way up, and it’s already THIS loud! Cool! This speaker has LOTS of power! I’m only at 25%! I’ll definitely buy it! But the truth is, when the slider / knob is at 25% of the way up, you’re already pushing close to the maximum it can deliver.

These are all the equivalent of a car that has high acceleration when starting from 0 km/h, but if you’re doing 100 km/h on the highway, and you push on the accelerator, nothing happens.

First impressions are important…

On the other hand (in support of the engineers who designed these curves), all of these devices are “one-offs” (meaning that they’re all stand-alone devices) made by companies who make (or expect to be connected to) small loudspeakers. This is part of the reason why the curves look the way they do.

If B&O used that style of gain curve for a Beovision television connected to a pair of Beolab 90s, you’d either

  • be listening at very loud levels, even at low volume settings;
  • or you wouldn’t be able to turn it up enough for music with high dynamic range.

Some quick conclusions

Hopefully, if you’ve read this far and you’re still awake:

  • you will never again use “percent” to describe your volume level
  • you will never again expect that the output level can be predicted by the volume setting
  • you will never expect two devices of two different brands to output the same level when set to the same volume setting
  • you understand why B&O devices have so many volume steps over such a large range.

That’s a wrap…

I spent some time this week helping to track down the source of an error in a digital audio signal flow chain, and we wound up having a discussion that I thought might be worth repeating here.

Let’s start at the very beginning.

Let’s take an analogue audio signal and convert it to a Linear Pulse Code Modulation (LPCM) representation in the dumbest possible way.

Fig 1. A simple analogue signal that we’ll use for the purposes of this discussion.

In order to save this signal as a string of numerical values, we have to first accept the fact that we don’t have an infinite number of numbers to use. So, we have to round off the signal to the nearest usable value or “quantisation value”. This process of rounding the value is called “quantisation”.

Let’s say for now that our available quantisation values are the ones shown on the grid. If we then take our original sine wave and round it to those values, we get the result shown below.

Fig 2. The original signal is shown as the blue line. The quantised version of it is shown as the red line.

Of course, I’m leaving out a lot of important details here like anti-aliasing filtering and dither (I said that we were going to be dumb…) but those things don’t matter for this discussion.

So far so good. However, we have to be a bit more specific: an LPCM system encodes the values using binary representations of the values. So, a quantisation value of “0.25”, as shown above isn’t helpful. So, let’s make a “baby” LPCM system with only 3 bits (meaning that we have three Binary digITs available to represent our values).

To start, let’s count using a 3-bit system:

Binary Value       4s place     2s place     1s place       Decimal Value
000            =   0 x 4    +   0 x 2    +   0 x 1      =   0
001            =   0 x 4    +   0 x 2    +   1 x 1      =   1
010            =   0 x 4    +   1 x 2    +   0 x 1      =   2
011            =   0 x 4    +   1 x 2    +   1 x 1      =   3
100            =   1 x 4    +   0 x 2    +   0 x 1      =   4
101            =   1 x 4    +   0 x 2    +   1 x 1      =   5
110            =   1 x 4    +   1 x 2    +   0 x 1      =   6
111            =   1 x 4    +   1 x 2    +   1 x 1      =   7
Table 1: The 8 numbers that can be represented using a 3-bit binary representation

and that’s as far as we can go before needing 4 bits. However, for now, that’s enough.

Take a look at our signal. It ranges from -1 to 1 and 0 is in the middle. So, if we say that the “0” in our original signal is encoded as “000” in our 3-bit system, then we just count upwards from there as follows:

Fig 3. Starting at 000 for the “0” value, and counting upwards into the positive values.

Now what? Well, let’s look at this a little differently. If we were to divide a circle into the same number of quantisation values, make the “12:00” position = 000, and count clockwise, it would look like this:

Fig 4. Counting from 000 to 111 around a circle

The question now is “how do we number the negative values?” but the answer is already in the circle shown above… If I make it a little more obvious, then the answer is shown below.

Fig 5. Relating the values on the circle to the values we’ll need to represent the audio signal…

If we use the convention shown above, and represent that on the graph of our audio signal, then it looks like this:

Fig. 6

One nice thing about this way of doing things is that you just need to look at the first digit in the binary word to know whether the value is positive or negative. A 0 means it’s positive, and a 1 means it’s negative.
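As a sketch of how that convention can be decoded (it is the one named in the P.S. at the end of this posting), the snippet below lists all eight 3-bit words and the values they represent. The function name is hypothetical, and the binary words are handled as text strings purely to keep the example readable.

```python
def twos_complement_value(bits):
    """Interpret a string of 0s and 1s as a signed value."""
    unsigned = int(bits, 2)
    if bits[0] == "1":
        # First digit is 1: the value is negative, so step back
        # one full trip around the circle (2 ** number_of_bits).
        return unsigned - (1 << len(bits))
    return unsigned


for code in range(8):
    bits = format(code, "03b")
    print(bits, twos_complement_value(bits))
```

Notice that the list runs from -4 to +3: there is a word for -4, but none for +4.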

However, there are two issues here that we need to sort out… The first is that, since we have an even number of values, but an odd number of quantisation steps (4 above zero, 4 below zero, and zero = 9 steps) then we had to do something asymmetrical. As you can see in the plot above, there are no numbers assigned to the top quantisation value, which actually means that it doesn’t exist.

So, if we’re still being dumb, then the result of our quantisation will either look like this:

Fig 7. The dumb way to deal with the asymmetrical quantisation problem. Notice that the result is asymmetrically clipped on the positive side, but not the negative.

or this:

Fig 8. A smarter way to deal with the asymmetrical quantisation problem. Notice that we’ll never use the bottom value.

Wrapping up…

But what happens when you make two mistakes simultaneously? Let’s go back and look at an earlier plot.

Fig 9.

Let’s say that you’re writing some DSP code, and you forget about the asymmetry problem, so you scale things so they’ll TRY to look like the plot above.

However, as we already know, that top quantisation value doesn’t exist – but the code will try to put something there. If you’ve forgotten about this, then the system will THINK that you want this:

Fig 10. Notice that top quantisation value. I’ve labeled it as (100) because that’s the binary number after 011 – but 100 is ACTUALLY already used for the bottom-most negative value… Bad things are about to happen.

As you can see there, your code (because you’ve forgotten to write an IF-THEN statement) will think that the top-most positive quantisation value is just the number after 011, which is 100. However, that value means something totally different… So, the result coming out will ACTUALLY look like this:

Fig 11. The actual output resulting from the mistakes described above.

As you can see there, the signal is very different from what we think it should be.

This error is called a “wrapping” error, because the signal is “wrapped” too far around the circle shown above in Figure 5. It sounds very bad – much worse than “normal” clipping (as shown in Figure 7) because of that huge nearly-instantaneous transition from maximum positive to maximum negative and back.

Of course, the wrapping can also happen in the opposite direction; a negatively-clipped signal can wrap around and show up at the top of the positive values. The reason is the same: the values are trying to go around the same circle.

As I said: this is actually the result of two problems that both have to occur in the same system:

  • The signal has to be trying to get to a level that is beyond the limits of the quantisation values
  • Someone forgot to write a line of code that makes sure that, when that happens, the signal is “just” clipped and not wrapped.

So, if the second of these issues is sitting there, unresolved, but the signal never exceeds the limits, then you’ll never have a problem. However, I will never need the airbags in my car, unless I have an accident. So, it’s best to remember to look after that second issue… just in case.
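To make the difference concrete, here is a sketch of the 3-bit “baby” system with and without that missing line of code. It assumes the signal has already been scaled to integer quantisation steps (so +4 is the value that doesn’t exist); the function names are mine and this is not taken from any real DSP code.

```python
BITS = 3
MIN_CODE = -(1 << (BITS - 1))      # -4, the lowest value that exists
MAX_CODE = (1 << (BITS - 1)) - 1   # +3, the highest value that exists


def quantise_wrapped(level):
    """Keep only the lowest BITS bits, with no range check (the mistake)."""
    masked = level & ((1 << BITS) - 1)
    # Words with the first bit set are read back as negative values.
    return masked - (1 << BITS) if masked > MAX_CODE else masked


def quantise_clipped(level):
    """Limit the level to the values that exist before storing it."""
    return max(MIN_CODE, min(MAX_CODE, level))


levels = [0, 2, 4, 2, 0, -2, -4, -2]   # one cycle that tries to reach +4

wrapped = [quantise_wrapped(v) for v in levels]
clipped = [quantise_clipped(v) for v in levels]

print(wrapped)
print(clipped)
```

In the wrapped version, the sample that should have been +4 comes out as -4. In the clipped version, it is held at +3 instead, which is wrong by one step but nowhere near as audible.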

P.S.

This method of encoding the quantisation values is called the “Two’s Complement” method. If you want to know more about it, read this.

Translating Q to Q

As I’ve talked about in a previous posting, when a reciprocal peak/dip filter says “Q”, there’s no knowing what it might mean, because there are at least 7 different definitions of Q (3 for boosts and 4 for dips).

For many people, this doesn’t really matter. If you’re just playing with an EQ to make things sound better right now, then the values on the display really don’t matter: it’s the sound that counts.

If, like me, you need to be able to navigate between different pieces of software and hardware and get the same EQ response from them, then you’ll also need to know firstly that you can’t trust the display, and secondly, how to “translate” from device to device when necessary.

For example, take a look at Figure 1.

Figure 1: The magnitude response of two peaking filters, both with Fc=1 kHz, Gain = +12 dB, Q = 2

This shows two magnitude responses. However, these are the measurements of two equalisers with identical settings:
Fc = 1 kHz, Gain = +12 dB, Q = 2.

The black curve shows the response of an equaliser that uses the -3 dB points to define the bandwidth of the filter, and therefore the Q is based on 1/(2 zeta). The red curve shows the response of an equaliser that uses the mid-point (in this case, +6 dB because the Gain is +12 dB) to define the bandwidth of the filter.

The difference between these two plots is shown below in Figure 2.

Figure 2: The difference between the two curves in Figure 1.

We’d have a similar problem if we were cutting instead of boosting, as shown in Figure 3.

Figure 3: The magnitude response of two peaking filters, both with Fc=1 kHz, Gain = -12 dB, Q = 2

You have to think upside down in this case, because the 1/(2 zeta) filter is actually using the 3 dB UP points to measure bandwidth; but we’ll ignore that and move on.

If you need to translate between the two systems shown above, there’s a pretty easy way to do it.

I’ll assume that you are implementing your filter using the mid-point definition of the bandwidth, so you need to convert into that system rather than out of it. (I’m making this assumption because it’s the one that Robert Bristow-Johnson used in his Audio Cookbook, which was freely copy-and-pasteable, which means that you find it everywhere these days.) Get the parameters from the filter you want to copy.

We’ll call these parameters Fc (for centre frequency, in Hz), G_{dB} (Gain in dB), and Q_{z}. I’m calling it Q_{z} because it’s a Q based on 1/(2 zeta) and we’ll need to keep it separate from our other Q, which I’ll call Q_{rbj} (for Robert Bristow-Johnson).

Convert the gain into linear.

    \[G_{lin} = 10^\frac{G_{dB}}{20}\]

Then do the following:

IF G_{dB} > 0

    \[Q_{rbj} = \frac {Q_{z}} {\sqrt{ G_{lin}}}\]

ELSEIF G_{dB} < 0

    \[Q_{rbj} = Q_{z} * \sqrt{ G_{lin}}\]

ELSE
your filter isn’t doing anything because G_{dB} = 0

END
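The steps above can be sketched as a small Python function. The name `q_z_to_q_rbj` and its argument names are just my own labels for the symbols used here, and returning the Q unchanged at 0 dB is an arbitrary choice, since the filter does nothing in that case anyway.

```python
import math


def q_z_to_q_rbj(q_z, gain_db):
    """Translate a 1/(2 zeta)-based Q into a mid-point (RBJ-style) Q."""
    gain_lin = 10.0 ** (gain_db / 20.0)   # convert the gain into linear
    if gain_db > 0:
        return q_z / math.sqrt(gain_lin)
    if gain_db < 0:
        return q_z * math.sqrt(gain_lin)
    # 0 dB: the filter isn't doing anything, so the Q doesn't matter.
    return q_z


print(q_z_to_q_rbj(2.0, 12.0))
print(q_z_to_q_rbj(2.0, -9.0))
```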

Example 1

If you have a -3 dB-based filter with the following parameters:
Fc = 1.0 kHz
G_{dB} = +12 dB
Q_{z} = 2

and you want to implement that using the Bristow-Johnson equations, then you’ll have to use the following parameters:
Fc = 1.0 kHz
G_{dB} = +12 dB

    \[Q_{rbj} = \frac {2} {\sqrt{ 3.9811}} = 1.0024\]


Example 2

If you have a -3 dB-based filter with the following parameters:
Fc = 2.0 kHz
G_{dB} = -9 dB
Q_{z} = 2

and you want to implement that using the Bristow-Johnson equations, then you’ll have to use the following parameters:
Fc = 2.0 kHz
G_{dB} = -9 dB

    \[Q_{rbj} = 2 * \sqrt{ 0.3548} = 1.1913\]


Two Extra Things…

If the filter that you’re translating FROM is based on Andy Moorer’s design (which is based on the gain mid-point if the gain is within the ±6 dB range, but based on the 3 dB points if it’s outside that), then you’ll have to write your own IF/THEN statements.

If you’re implementing a filter that was specified for RBJ’s equations in a system that’s based on 1/(2 zeta), then you’re probably smart enough to figure out how to do the above in reverse.

One additional addendum

IF
you don’t like IF/THEN statements for some reason or another (code optimisation, for example)

THEN
you could do it this way instead:

    \[Q_{rbj} = \frac{Q_{z} }{ \sqrt{10^\frac{\lvert G_{dB} \rvert }{20}}}\]

What I’ve done there is to fold the decibel-to-linear conversion into the equation. I’ve also converted the gain in dB to an absolute value before converting to linear. That way, it’s always positive, so you always divide.
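In code, that single-equation version might look like the sketch below (again, the function name is hypothetical). Because of the absolute value, a boost and a cut of the same size give the same translated Q.

```python
import math


def q_z_to_q_rbj_no_branch(q_z, gain_db):
    """Translate a 1/(2 zeta)-based Q into a mid-point Q without IF/THEN."""
    return q_z / math.sqrt(10.0 ** (abs(gain_db) / 20.0))


print(q_z_to_q_rbj_no_branch(2.0, 12.0))
print(q_z_to_q_rbj_no_branch(2.0, -12.0))
```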

Q stands for…

These days, I’m spending a lot of time wrapping my head around the relationship between the frequency and the time responses of filters. In doing so, I’m digging into the concept of “Q”, of course. As a result, I’m reading my old books and some Internet sites, and I’m frequently presented with something like the following:

That, of course, is from the Wikipedia entry on “Q”.

However, in the Bell Telephone System Technical Publication – Monograph 2491, called “The Story of Q” by Estill I. Green (published in the American Scientist, Vol 43, pp 584-594, in October 1955), it states:

“For a time, Johnson* designated the ratio of reactance to effective resistance of a coil by the symbol K. It was in 1920, while working the practical application of the wave filter which G. A. Campbell had invented some years before, that he for the first time employed the symbol Q for his parameter. His reason for choosing Q was quite simple. He says that it did not stand for ‘quality factor’ or anything else, but since the other letters of the alphabet had already been pre-empted for other purposes, Q was all he had left.”

So, if we’re going to be pedantic (which I love to be) there are two errors on that Wikipedia page. Firstly, Q does not stand for Quality. Secondly, it’s not the “Q factor”, it’s just the “Q”.

As an aside, that monograph is not only informative, it’s fun to read (depending, of course, on your definition of “fun”). For example, near the end of the paper, Green applies Q to rotating bodies (which is not a surprise, since an audio-wave oscillation is just a rotation represented in two dimensions). In that section, he points out that the rotation of the earth is slowing down due, in part, to tidal friction. Consequently, the length of a day is increasing at a rate of 0.00164 second per century, which would make the Q of the rotation of the earth equal to about 10,000,000,000,000 (10^13).

* K.S. Johnson worked in the Western Electric Company’s Engineering Department, which became Bell Telephone Laboratories in 1925.

DFT’s Part 6: Windowing artefacts

Links to:
DFT’s Part 1: Some introductory basics
DFT’s Part 2: It’s a little complex…
DFT’s Part 3: The Math
DFT’s Part 4: The Artefacts
DFT’s Part 5: Windowing

In Part 5, we talked about the idea of using a windowing function to “clean up” a DFT of a signal, and the cost of doing so. We talked about how the magnitude response that is given by the DFT is rarely “the Truth” – and that the amount that it’s not True is dependent on the interaction between the frequency content of the signal, the signal envelope, the windowing function, the size of the FFT, and the sampling rate. The only real solution to this problem is to know what-not-to-believe when you look at a DFT output.

However, we “only” looked at the artefacts on the magnitude response in the previous posting. In this last posting, we’ll dig a little deeper and NOT throw away the phase information. The problem is that, when you’re windowing, you’re not just looking at a screwed up version of the magnitude response, you’re looking at a screwed up phase response as well.

We saw in Part 1 and Part 2 how the phase of a sinusoidal waveform can be converted to the sum of a real and an imaginary component. (In other words, if you add a cosine and a sine of the same frequency with very specific separate gains applied to them, the result will be a sinusoidal waveform with any amplitude and phase that you want.) For this posting, we’ll be looking at the artefacts of the same windowing functions that we’ve been working on – but keeping the real and imaginary components separate.
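As a minimal sketch of that parenthetical remark, the Python below builds a sinusoid with a chosen amplitude and phase out of nothing more than a cosine and a sine with fixed gains. The function name, the example values and the 48 kHz sampling rate are assumptions made for illustration; they don't come from the earlier postings.

```python
import math

def cos_sin_gains(amplitude, phase_deg):
    """Gains for a cosine and a sine that sum to amplitude * cos(wt + phase)."""
    phase = math.radians(phase_deg)
    # Identity used: cos(wt + p) = cos(p) * cos(wt) - sin(p) * sin(wt)
    return amplitude * math.cos(phase), -amplitude * math.sin(phase)

amplitude, phase_deg, freq_hz = 0.5, 30.0, 1000.0   # arbitrary example values
gain_cos, gain_sin = cos_sin_gains(amplitude, phase_deg)

max_error = 0.0
for n in range(100):
    wt = 2 * math.pi * freq_hz * n / 48000.0         # assumed 48 kHz sampling rate
    target = amplitude * math.cos(wt + math.radians(phase_deg))
    built = gain_cos * math.cos(wt) + gain_sin * math.sin(wt)
    max_error = max(max_error, abs(target - built))

print(gain_cos, gain_sin, max_error)
```

The two gains carry the same information as the amplitude and the phase, which is why keeping them separate (instead of collapsing them into a magnitude) lets us see what the window does to the phase.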

Rectangular windowing

We’ll start by looking at a plot from the previous post, which I’ve duplicated below.

Figure 1: The magnitude responses calculated by a DFT for 6 different frequencies. Note that the bin centre frequency is 1000.0 Hz.

The way I did the plot in Figure 1 was to create a sine wave with a given frequency, do a DFT of that, and plot the magnitude of the result. I did that for 6 different frequencies, ranging from 1000 Hz (exactly on a bin centre frequency) to 999.5 Hz (halfway to the adjacent bin centre frequency).

There’s a different way to plot this, which is to show the result of the DFT output, bin by bin, for a sinusoidal waveform with a frequency relative to the bin centre frequency. This is shown below in Figure 2.

Figure 2: Rectangular window:
The relationship between the frequency of the signal, the frequency centres of the DFT bin, and the resulting magnitude in dB. Note that the X-axis is frequency, measured in distance between bin frequencies or “bin widths”.

Now we have to talk about how to read that plot… This tells me the following (as examples):

  • If the bin centre frequency EXACTLY matches the frequency of the signal (therefore, the ∆ Freq. = 0) then the magnitude of that bin will be 0 dB (in other words, it will give me the correct answer).
  • If the bin centre frequency is EXACTLY an integer number of bin widths away from the frequency of the signal (therefore, the ∆ Freq. = … -10, -9, -8, … -3, -2, -1, 1, 2, 3, … 8, 9, 10, …) then the magnitude of that bin will be -∞ dB (in other words, it will have no output).
  • These first two points are why the light blue curve is so good in Figure 1.
  • If the frequency of the signal is half-way between two bins (therefore, the ∆ Freq. = -0.5 or +0.5), then you get an output of about -4 dB (which is what we also saw in the blue curve in Figure 25 in Part 5).
  • If the frequency of the signal is an integer number away from half-way between two bins (for example, ∆ Freq. = -2.5, -1.5, 1.5, or 2.5, etc… ) then the output of that bin will be the value shown at the tops of those bumps in the plots… (For example, if you mark a dot at each place where ∆ Freq. = ±x.5 on that curve above, and you join the dots, you’ll get the same curve as the curve for 999.5 Hz in Figure 1.)

So, Figure 2 shows us that, unless the signal frequency is exactly the same as the bin centre frequency, then the DFT’s magnitude will be too low, and there will be an output from all bins.
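The readings in the list above can be roughly reproduced with a few lines of Python. This is only a sketch under some assumptions: the DFT length (64) and the bin being watched (8) are arbitrary, and the test signal is a complex sinusoid rather than a real sine wave, which avoids a small contribution from the mirrored negative frequency.

```python
import cmath
import math

N = 64            # DFT length (an arbitrary choice for this sketch)
K0 = 8            # the bin we are watching (also arbitrary)

def bin_level_db(delta_bins):
    """Level in bin K0 for a complex sinusoid delta_bins away from its centre."""
    f = K0 + delta_bins                        # signal frequency, in bin widths
    acc = 0j
    for n in range(N):
        x = cmath.exp(2j * math.pi * f * n / N)           # the signal
        acc += x * cmath.exp(-2j * math.pi * K0 * n / N)  # DFT sum for bin K0
    mag = abs(acc) / N                         # scaled so "on the bin" gives 1.0
    return 20 * math.log10(max(mag, 1e-15))   # floor avoids log10(0)

for delta in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(delta, round(bin_level_db(delta), 2))
```

With these settings, a ∆ Freq. of 0 gives 0 dB, a ∆ Freq. of 0.5 gives a little under -4 dB, and the integer offsets land on the floor value that stands in for -∞ dB.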

Figure 3: Rectangular window:
The relationship between the frequency of the signal, the frequency centres of the DFT bin, and the resulting phase in degrees.

Figure 3 shows us the same kind of analysis, but for the phase information instead. The important thing when reading this plot is to keep the magnitude response plot in mind as well. For example:

  • When the bin frequency matches the signal frequency (∆ Freq. = 0) then the phase error is 0º.
  • When the signal frequency is an integer number of bin widths away from the bin frequency, then it appears that the phase error is either 0º or ±180º, but neither of these is true, since the output is -∞ dB – there is no output (remember the magnitude response plot).
  • There is a gradually increasing error from 0º to ±180º (depending on whether you’re going up or down in frequency) as the signal frequency moves from one bin towards the next.
  • When your signal frequency crosses the bin frequency, you get a polarity flip (the vertical lines in the sawtooth shape in the plot).
Figure 4. Rectangular window:
The top two plots show the same relationship as in Figures 2 and 3, but divided into the various components, as explained below.

Figure 4, above, shows the same information, plotted differently.

  • The bottom right plot shows the magnitude response (exactly the same as shown in Figure 2) on a linear scale instead of in dB.
  • The top two plots show the Real and Imaginary components, which, combined, were used to generate the Magnitude and Phase plots. (Remember from Parts 1 and 2 that the Real component is like looking at the response from above, and the Imaginary component is like looking at the response from the side.)
  • The Nyquist plot is difficult, if not impossible to understand if you’ve never seen one before. But looking at the entire length of the animation in Figure 5, below, should help. I won’t bother explaining it more than to say that it (like the Real vs. Freq. and the Imaginary vs. Freq. plots) is just showing two dimensions of a three-dimensional plot – which is why it makes no sense on its own without some prior knowledge.
Figure 5. Rectangular window:
The Real, Imaginary, and Nyquist plots from Figure 4, viewed from different angles.

Hopefully, I’ve said enough about the plots above that you are now equipped to look at the same analyses of the other windowing functions and draw your own conclusions. I’ll just make the occasional comment here and there to highlight something…

Hann Window

Figure 6: Hann window: magnitude response error

Generally, the things to note with the Hann window are the wider centre lobe, but the lower side lobes (as compared to the rectangular windowing function).
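A rough numerical check of that comparison is sketched below. It assumes a "periodic" Hann definition (one full cosine cycle across N samples) and a window length of 64, neither of which is stated in this posting, and it reports each level relative to the level shown when the signal sits exactly on the bin.

```python
import cmath
import math

N = 64   # window length (arbitrary for this sketch)

rect = [1.0] * N
# Periodic Hann: one full cycle of an upside-down cosine across N samples.
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def level_db(window, delta_bins):
    """Bin level, relative to the on-bin level, for a signal delta_bins away."""
    def mag(d):
        return abs(sum(w * cmath.exp(2j * math.pi * d * n / N)
                       for n, w in enumerate(window)))
    ratio = mag(delta_bins) / mag(0.0)
    return 20 * math.log10(max(ratio, 1e-15))   # floor avoids log10(0)

for delta in (1.0, 2.5):
    print(delta, round(level_db(rect, delta), 1), round(level_db(hann, delta), 1))
```

One bin width away, the rectangular window has already dropped to nothing while the Hann window is only about 6 dB down (the wider centre lobe). At 2.5 bin widths, on top of a side lobe, the Hann result is well below the rectangular one.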

Figure 7: Hann window: Phase response error
Figure 8: Hann window: Real, Imaginary, and Nyquist plots
Figure 9: Hann window: Real, Imaginary, and Nyquist plots in all three dimensions.

Hamming window

Figure 10: Hamming window: Magnitude response error.

The interesting thing about the Hamming window is that the lobes adjacent to the main lobe in the middle are lower. This might be useful if you’re trying to ignore some frequency content next to your signal’s frequency.

Figure 11: Hamming window: Phase response error.
Figure 12: Hamming window: Real, Imaginary, and Nyquist plots
Figure 13: Hamming window: Real, Imaginary, and Nyquist plots in all three dimensions

Blackman Window

Figure 14: Blackman window: Magnitude response error.

The Blackman window has a wider centre lobe, but the side lobes are lower in level.

Figure 15: Blackman window: Phase response error.
Figure 16: Blackman window: Real, Imaginary, and Nyquist plots
Figure 17: Blackman window: Real, Imaginary, and Nyquist plots in all three dimensions.

Blackman Harris window

Figure 18: Blackman-Harris window: Magnitude response error.

Although the Blackman-Harris window results in a wider centre lobe, as you can see in Figure 18, the side lobes are all at least 90 dB down from that…

Figure 19: Blackman-Harris window: Phase response error.
Figure 21: Blackman-Harris window: Real, Imaginary, and Nyquist plots
Figure 22: Blackman-Harris window: Real, Imaginary, and Nyquist plots in all three dimensions.

Wrapping up

I know that there’s lots left out of this series on DFT’s. There are other windowing functions that I didn’t talk about. I didn’t look at the math that is used to generate the functions… and I just glossed over lots of things. However, my intention here was not to do a complete analysis – it was just an introductory discussion to help instil a lack of trust – or a healthy suspicion about the results of a DFT (or FFT – depending on how fast you do the math….).

Also, a reason I did this series was as a set-up, so when I write about some other topics in the future (like the actual resolution of 16-bit LPCM audio in a fixed point world, or the implications of making a volume control in the digital domain as just two examples…), I can refer back to this, pointing out what you can and cannot believe in the plots that I haven’t even made yet…

DFT’s Part 5: Windowing

Links to:
DFT’s Part 1: Some introductory basics
DFT’s Part 2: It’s a little complex…
DFT’s Part 3: The Math
DFT’s Part 4: The Artefacts

The previous posting in this series showed that, if we just take a slice of audio and run it through the DFT math, we get a distorted view of the truth. We’ll see the frequencies that are in the audio signal, but we’ll also see that there’s energy at frequencies that don’t really exist in the original signal. These are artefacts of slicing a “window” of time out of the original signal.

Let’s say that I were a musician, making samples (in this sentence, the word “sample” means what it means to musicians – a slice of a recording of a sound that I will play using a sampler) to put into my latest track in my new hip hop album. (Okay, I use the word “musician” loosely here… but never mind…) I would take a sample – say, of the bell that I recorded, which looks like this:

Figure 1: My original bell recording – or 2048 samples of it, at least…

We’ve already seen that the first sample (now I’m back to using the technical definition of the word “sample” – an instantaneous measurement of the amplitude of the signal) and the last sample aren’t on the 0 line. So, if we just play this recording, it will start and end with a “click”.

We get rid of the click by applying a “fade” on the start and the end of the recording, resulting in something like the following:

Figure 2: The same recording with a fade in and a fade out applied to it. Now it will sound more like a bell because it has a fast attack (a short fade in) and a longer decay (fade out).

So, the moral of the story so far is that, in order to get rid of an audible “click”, we need to fade in and fade out so that we start and end on the 0 line.

We can do the same thing to our slice of audio (from now on, I’m going to call it a “window”) to help the computer that’s doing the DFT math so that it doesn’t get all the extra frequency content caused by the clicks.

In Figure 2, above, I was being artistic with my fade in and fade out. I made the fade in fast (100 samples long) and I made the fade out slow (2000 samples long) so that the end result would look and sound more like a bell. A computer doesn’t care if we’re artistic or not – we’re just trying to get rid of those clicks. So, let’s do it.

Option 1 is to do nothing – which is what we’ve done so far. We take all of the individual samples in our window, and we multiply them by 1. (All other samples (the ones that we’re not using because they’re outside the window) are multiplied by 0.) If you think about what this looks like if I graph it in time, you’ll imagine a rectangle with a height of 1 and a length equal to the length of the window in samples.

If I do that to my original bell recording, I get the result shown in Figure 3.

Figure 3. The original bell sound, where each sample was multiplied by 1.

You may notice that Figure 3 is almost identical to Figure 1. The only difference is that I put the word “Rectangular” at the top. The reason for this will become clear later.

As we’ve already seen, this rectangular windowing of our recording is what gives us the problems in the first place. So, if I do a DFT of that window, I’ll get the following magnitude response.

Figure 4. The magnitude response of the bell sound with a rectangular window, as shown in Figure 3.

What we know already is that we want to fade in and fade out to get rid of those clicks. So, let’s do that.

Figure 5. The original bell sound, with a fade in and fade out applied to it.

Figure 5, above, shows the result. The ramp in and the ramp out are not straight lines – in fact, they look very much like the shape of an upside-down cosine wave (which is exactly what they are…). If we do a DFT of that window, we get the result shown in Figure 6.
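For anyone who wants to try that upside-down cosine, here is a small sketch. The stand-in signal is a made-up sine wave (not the bell recording), and dividing by N - 1 so that the last sample lands exactly on 0 is an assumption; some definitions divide by N instead.

```python
import math

N = 2048   # same length as the slice of the bell recording

# One cycle of a cosine, flipped upside down and shifted so it runs 0 -> 1 -> 0.
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(samples, window):
    """Multiply each sample by the gain of the window at the same position."""
    return [s * w for s, w in zip(samples, window)]

# Stand-in signal (NOT the bell recording): a sine that doesn't start or end on 0.
signal = [math.sin(2 * math.pi * 3.3 * n / N + 1.0) for n in range(N)]
windowed = apply_window(signal, hann)

print(signal[0], signal[-1])      # both non-zero: a "click" when looped
print(windowed[0], windowed[-1])  # both (essentially) zero after the fade
```

The gain in the middle of the window is very close to 1, so the middle of the signal is left almost untouched; only the ends are pulled down to the 0 line.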

Figure 6. The magnitude response of the bell sound with a Hann window, as shown in Figure 5.

There are now some things to talk about.

The first is to say that the shape of this “envelope” or “windowing function” – the gain over time that we apply to the audio signal – is named after the Austrian meteorologist Julius von Hann. He didn’t invent this particular curve – but he did come up with the idea of smoothing data (but in his case, he was smoothing meteorological data over geographical regions – not audio signals over time). Because he came up with the general idea, they named this curve after him.

Important sidebar: Some people will call this a “Hanning” window. This is not strictly correct – it’s a Hann function. However, you can use the excuse that, if you apply a Hann window to the signal, then you are hanning it… which is the kind of obfuscated back-pedalling and revisionist history used to cover up a mistake that is typically only within the purview of government officials.

The second thing to notice is that the overall response at frequencies that are not in the original signal has dropped significantly. Where, in Figure 4, we see energy around -60 dB FS at all frequencies (give or take….) the plot in Figure 6 drops down below -100 dB FS – off the plot. This is good. It’s the result of getting rid of those non-zero values at the start and stop of the window… No clicks means no energy spread all over the frequency spectrum.

The third thing to discuss is the levels of the peaks in the plots. Take a look at the highest peak in Figure 4. It’s about -17 dB FS or so. The peak at the same frequency in Figure 6 is about -21 dB or so… 4 dB lower… This is because, if you look at the entire time window, there is indeed less energy in there. We made portions of the signal quieter, so, on average, the whole thing is quieter. We’ll look at how much quieter later.

This is where things get a little interesting, because some people think that the way that we faded in and faded out of the window (specifically, using a Hann function) can be improved in some way or another… So, let’s try a different way.

Figure 7. The original bell sound, with a different kind of fade in and fade out applied to it.

Figure 7, above, shows the same bell sound again, this time processed using a Hamming function instead. It looks very similar to the Hann function – but it’s not identical. For starters, you may notice that the start and stop values are not 0 – although they’re considerably quieter than they would have been if we had used a rectangular windowing function (or, in other words “done nothing”). The result of a DFT on this signal is shown in Figure 8.
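To put a number on those non-zero start and stop values, the sketch below compares the first gain value of the two functions. It assumes the classic Hamming coefficients (0.54 and 0.46), which this posting doesn't state, and the helper name is made up for the example.

```python
import math

N = 2048

def raised_cosine(n, a0):
    """Gain at sample n for a window of the form a0 - (1 - a0) * cos(...)."""
    return a0 - (1.0 - a0) * math.cos(2 * math.pi * n / (N - 1))

hann_start = raised_cosine(0, 0.5)       # Hann: a0 = 0.5
hamming_start = raised_cosine(0, 0.54)   # Hamming: a0 = 0.54 (assumed classic value)

hamming_start_db = 20 * math.log10(hamming_start)
print(hann_start, hamming_start, round(hamming_start_db, 1))
```

Under that assumption, the Hamming function starts (and stops) at a gain of 0.08, or roughly 22 dB below the gain in the middle of the window, where the Hann function goes all the way to 0.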

Figure 8: The magnitude response of the bell sound with a Hamming window, as shown in Figure 7.

There are two things about Figure 8 that are different from Figure 6. The first is that the overall apparent level of the wide-band artefacts is higher (although not as high as that in Figure 4…). This is because we have a “click” caused by the fact that we don’t start and stop at 0. However, the advantage of this function is that the peaks are narrower – so we get a better idea of the actual signal – we just need to learn to ignore the bottom part of the plot.

Figure 9 shows yet another function, called a Blackman function.

Figure 9. The original bell sound, with a different kind of fade in and fade out applied to it.

You can see there that it takes longer for the signal to ramp in from 0 (and to ramp out again at the end), so we can expect that the peaks will be even lower than those for the Hann window. This can be seen in Figure 10.

Figure 10: The magnitude response of the bell sound with a Blackman window, as shown in Figure 9.

Indeed, the peaks are lower….

Another function is called the Blackman-Harris function, shown in Figures 11 and 12.

Figure 11.
Figure 12.

There are other windowing functions. And there are some where you can change some variables to play with width and things. Or you can make up your own. I won’t talk about them all here… This is just a brief introduction…

The purpose of this is to show some basic issues with windowing. You can play with the windowing function, but there will be subsequent effects in the DFT result like:

  • the apparent magnitude of the actual signal (the peaks in the plots above)
  • the apparent magnitude in frequency bands that aren’t in the signal
  • the apparent width of the frequency band of the actual signal

Also, you have to remember that a DFT shows you the complete frequency content of the slice of time that you fed it. So, if the frequency content changes over time (the sound of a sitar string being plucked, or the “pew pew” sound of Han Solo’s laser, for example) then this change over time will not be shown…

Some more details…

Let’s dig a little into the differences in the peaks in the DFT plots above. As we saw in Part 4, if the frequency of the signal you’re analysing is not exactly the same as the frequency of the DFT bin, then the energy will “bleed” into adjacent bins. The example I showed in that posting compared the levels shown by the DFT when the frequency of the signal is either exactly the same as a frequency bin, or half-way between two of them – a reminder of this is shown below in Figure 13.

Figure 13. The results of a DFT analysis of two signals. The blue plot shows the result when the signal frequency is 1000.0 Hz (exactly the same as the DFT bin frequency). The red plot shows the result when the signal frequency is 1000.5 Hz (half-way between two bins).

As you can see in that plot, the energy in the 1000.5 Hz signal bleeds into the two adjacent bins. In fact, it’s more accurate to say that there is energy in all of the DFT bins, due to the discontinuity of the signal when the beginning is wrapped around to meet its end.

So, let’s analyse this a little further. I’ll create a signal that is on a frequency bin (therefore it’s a sine wave with a carefully-chosen frequency), and do the DFT. Then, I’ll make the frequency a little lower, and do the DFT again. I’ll repeat this until I get to a signal frequency that is half-way between two bins. I’ll stop there, because once I pass the half-way point, I’ll just start seeing the same behaviour. The result of this is shown for a rectangular window in Figure 14.

Figure 14. The results of doing a DFT analysis on a sine wave with frequencies ranging from exactly on a DFT bin (1000 Hz) to half-way between two bins (999.5 Hz).

As you can see there, there is a LOT of energy bleeding into all frequency bins when the signal is not exactly on a bin. Remember that this does not mean that those frequencies are in the signal – but they are in the signal that the DFT is being asked to analyse.

Let’s do this again for the other windowing functions.

Figure 15. The same results, with a Hann window applied to the signal. Notice that the 1000 Hz result is now not as precise – but all of the other frequencies are “cleaner”.
Figure 16. The same results with a Hamming window. The 1000 Hz is narrower than with the Hann window – but all other frequencies are “noisier” due to the fact that the start and stop gains of the Hamming window are not 0.
Figure 17. The same analysis for the Blackman windowing function.
Figure 18. The same analysis for the Blackman Harris function.

What you may notice when you look at Figures 14 to 18 is that there is a relationship between the narrowness of the plot when the signal is on a bin frequency and the amount of energy that’s spread everywhere else when it’s not. Generally, you have to trade accuracy and precision at the frequency where there’s energy against truth at all other frequencies.

However, if you look carefully at those plots around the 1000 Hz area, you can see that it’s a little more complicated. Let’s zoom into that area and have a look…

Figure 19. A zoom-in of the plot in Figure 14.
Figure 20. A zoom-in of the plot in Figure 15.
Figure 21. A zoom-in of the plot in Figure 16.
Figure 22. A zoom-in of the plot in Figure 17.
Figure 22. A zoom-in of the plot in Figure 18.

The thing to compare in Figures 19 to 22 is how similar the plots within each figure are to each other. For example, in Figure 19, the six plots are very different from each other. In Figure 22, the six plots are almost identical from 995 Hz to 1005 Hz.

Depending on what kind of analysis you’re doing, you have to decide which of these behaviours is most useful to you. In other words, each type of windowing function screws up the result of the DFT. So you have to choose which one screws it up the least for the type of signal and the type of analysis you’re doing.

Alternatively, you can choose a favourite windowing function, and always use that one, and just get used to looking at the way your results are screwed up.

Some final details

So far, I have not actually defined the details of any of the windowing functions we’ve looked at here. I’ve just said that they fade in and fade out differently. I won’t give you the mathematical equations for creating the actual curves of the functions. You can get that somewhere else. Just look them up on the Internet. However, we can compare the shapes of the gain functions by looking at them on the same plot, which I’ve put in Figure 23.

Figure 23. The gain vs. time for 4 of the 5 windowing functions I’ve talked about.

You may notice that I left out the rectangular window. If I had plotted it, it would just be a straight line of 1’s, which is not a very interesting shape.

What may surprise you is how similar these curves look, especially since they have such different results on the DFT behaviour.

Another way to look at these curves (which is almost never shown) is to see them in decibels instead, which I’ve done in Figure 24.

Figure 24. The same plots that were shown in Figure 23, on a decibel scale.

The reason I’ve plotted them in dB in Figure 24 is to show that, although they all look basically the same in Figure 23, you can see that they’re actually pretty different… For example, notice that, at about 10% of the way into the time of the window, there is a 40 dB difference between the Blackman Harris and the Hann functions… This is a lot.

One thing that I’ve only briefly mentioned is the fact that the windowing functions have an effect on the level that is shown in the DFT result, even when the signal frequency is exactly the same as the DFT bin frequency. As I said earlier, this is because there is, in fact, less energy in the time window overall, because we made the signal quieter at the beginning and end. The question is: “exactly how much quieter?” This is shown in Figure 25.

Figure 25. The relationship between the signal frequency, the maximum level shown in the DFT results, and the windowing function used.

So, as you can see there, a DFT of a rectangular windowed signal can show the actual level of the signal if the frequency of the signal is exactly the same as the DFT bin centre. All of the other windowing functions will show you a lower level.

HOWEVER, all of the other windowing functions have less variation in that error when the signal frequency moves away from the DFT bin. In other words, (for example) if you use a Blackman Harris window for your DFT, the level that’s displayed will be more wrong than if you used a rectangular window, but it will be more consistent. (Notice that the rectangular window ranges from almost -4 dB to 0 dB, whereas the Blackman Harris window only ranges from about -10 to -9 dB.)
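The "more wrong, but more consistent" behaviour can be sketched numerically. Everything specific here is an assumption made for illustration: the window length, the use of a complex sinusoid as the test signal, and the window coefficients, which are commonly published values and not taken from this posting.

```python
import cmath
import math

N = 1024   # window length (arbitrary for this sketch)

def cosine_sum_window(coeffs):
    """Window built as a0 - a1*cos(1x) + a2*cos(2x) - a3*cos(3x) ..."""
    return [sum(((-1) ** k) * a * math.cos(2 * math.pi * k * n / N)
                for k, a in enumerate(coeffs))
            for n in range(N)]

# Commonly published coefficients; assumed, not taken from the posting.
windows = {
    "Rectangular": cosine_sum_window([1.0]),
    "Hann": cosine_sum_window([0.5, 0.5]),
    "Hamming": cosine_sum_window([0.54, 0.46]),
    "Blackman": cosine_sum_window([0.42, 0.5, 0.08]),
    "Blackman-Harris": cosine_sum_window([0.35875, 0.48829, 0.14128, 0.01168]),
}

def shown_level_db(window, delta_bins):
    """Level the nearest bin shows for a 0 dB signal delta_bins off its centre."""
    acc = sum(w * cmath.exp(2j * math.pi * delta_bins * n / N)
              for n, w in enumerate(window))
    return 20 * math.log10(abs(acc) / N)

levels = {name: (shown_level_db(w, 0.0), shown_level_db(w, 0.5))
          for name, w in windows.items()}
for name, (on_bin, half_way) in levels.items():
    print(f"{name:16s} on bin {on_bin:6.2f} dB, half-way {half_way:6.2f} dB")
```

With these coefficients, the rectangular window swings by almost 4 dB between the two cases, while the Blackman-Harris window reads about 9 dB low in both and moves by less than 1 dB.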

We’ll dig into some more details in the next and final posting in this series… with some exciting animated 3D plots to keep things edu-taining.

DFT’s Part 4: The Artefacts

Links to:
DFT’s Part 1: Some introductory basics
DFT’s Part 2: It’s a little complex…
DFT’s Part 3: The Math

The previous post ended with the following:

And, you should be left with a question… Why does that plot in Figure 12 look like it’s got lots of energy at a bunch of frequencies – not just two clean spikes? We’ll get into that in the next posting.

Let’s begin by taking a nice, clean example…

If my sampling rate is 65,536 Hz (2^16) and I take one second of audio (therefore 65,536 samples) and I do a DFT, then I’ll get 65,536 values coming out, one for each frequency with an integer value (nothing after the decimal point). The frequencies range from 0 to 65,535 Hz, on integer values (so, 1 Hz, 2 Hz, 3 Hz, etc…). (And we’ll remember to throw away the top half of those values due to mirroring which we talked about in the last post.)

I then make a sine wave with an amplitude of 0 dB FS and a frequency of 1,000 Hz for 1 second, and I do a DFT of it, and then convert the output to show me just the magnitude (the level) of the signal (so I’m ignoring phase). The result would look like the plot below.

Figure 1. The magnitude response of a 1000 Hz sine tone, sampled at 65,536 Hz, calculated using a 65,536-point DFT.

The plot above looks very nice. I put in a 1,000 Hz sine wave at 0 dB FS, and the plot tells me that I have a signal at 1,000 Hz and 0 dB FS and nothing at any other frequency (at least with a dynamic range of 200 dB). However, what happens if my signal is 1000.5 Hz instead? Let’s try that:

Figure 2. The magnitude response of a 1000.5 Hz sine tone, sampled at 65,536 Hz, calculated using a 65,536-point DFT.

Now things don’t look so pretty. I can see that there’s signal around 1000 Hz, but it’s lower in level than the actual signal and there seems to be lots of stuff at other frequencies… Why is this?

In order to understand why the level in Figure 2 is lower than that in Figure 1, we have to zoom in at 1000 Hz and see the individual points on the plot.

Figure 1 (Zoom)

As you can see in Figure 1 (Zoom), above, there is one DFT frequency “bin” at 1000 Hz, exactly where the sine wave is centred.

Figure 2 (Zoom)

Figure 2 (Zoom) shows that, when the sine wave is at 1000.5 Hz, then the energy in that signal is distributed between two DFT frequency bins – at 1000 Hz and 1001 Hz. Since the energy is shared between two bins, then each of their level values is lower than the actual signal.
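The two zoomed plots can be approximated with the sketch below. It evaluates only the four bins around 1000 Hz directly from the DFT sum (an FFT library would normally be used to get all of them at once), and the -200 dB floor is an assumption chosen to match the dynamic range mentioned earlier.

```python
import cmath
import math

FS = 65536          # sampling rate in Hz, as in the text
N = 65536           # one second of audio, so the bins are 1 Hz apart

def bin_level_db(freq_hz, bin_hz):
    """Level of one DFT bin for a 0 dB FS sine wave at freq_hz."""
    acc = 0j
    for n in range(N):
        sample = math.sin(2 * math.pi * freq_hz * n / FS)
        acc += sample * cmath.exp(-2j * math.pi * bin_hz * n / N)
    mag = abs(acc) / (N / 2)     # scale so a full-scale sine on a bin reads 1.0
    return 20 * math.log10(max(mag, 1e-10))   # floor at -200 dB

results = {}
for freq in (1000.0, 1000.5):
    results[freq] = {b: bin_level_db(freq, b) for b in (999, 1000, 1001, 1002)}
    print(freq, {b: round(v, 2) for b, v in results[freq].items()})
```

For the 1000 Hz sine wave, only the 1000 Hz bin shows anything above the floor. For the 1000.5 Hz sine wave, the 1000 Hz and 1001 Hz bins both read a little under -4 dB, and their neighbours are no longer empty.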

The reason for the “lots of stuff at other frequencies” problem is that the math in a DFT has a limited number of samples at its input, so it assumes that it is given a slice of time that repeats itself exactly.

For example…

Let’s look at a portion of a plot like the one below:

Figure 3. A portion of a plot. The gray rectangles hide things…

If I asked you to continue this plot to the left and right (in other words, guess what’s under the gray rectangles), would you draw a curve like the one below?

Figure 3. An obvious extrapolation of the curve in Figure 3.

This would be a good guess. However, the figure below is also a good guess.

Figure 4. An obvious extrapolation of the curve in Figure 3.

Of course, we could guess something else. Perhaps Figure 3 is mostly correct, but we should add a drawing of Calvin and Hobbes on a toboggan, sliding down the hill to certain death as well. You never know what was originally behind those grey rectangles…

This is exactly the problem the math behind a DFT has – you feed it a “slice” of a recording, some number of samples long, and the math (let’s call it “a computer”, since it’s probably doing the math) has to assume that this slice is a portion of time that is repeated forever – it started at the beginning of time, and it will continue repeating until the end of time. In essence, it has to make an “extrapolation” like the one shown in Figure 4 because it doesn’t have enough information to make assumptions that result in the plot in Figure 3.
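One way to see that "repeated forever" assumption in action is to take the DFT of a short slice and then ask the inverse DFT for samples outside the slice. The sketch below does that with eight made-up sample values; the helper name is an invention for this example.

```python
import cmath
import math

N = 8
slice_ = [0.3, 0.9, 0.5, -0.2, -0.8, -0.4, 0.1, 0.6]   # made-up samples

# Forward DFT of the slice.
spectrum = [sum(x * cmath.exp(-2j * math.pi * k * n / N)
                for n, x in enumerate(slice_))
            for k in range(N)]

def resynthesise(n):
    """Evaluate the inverse DFT at any sample index, inside the slice or not."""
    total = sum(X * cmath.exp(2j * math.pi * k * n / N)
                for k, X in enumerate(spectrum))
    return (total / N).real

extended = [resynthesise(n) for n in range(-N, 2 * N)]   # three slices' worth
print([round(v, 3) for v in extended])
```

The printed list is the same eight values three times in a row: as far as the math is concerned, whatever was really before and after the slice has been replaced by copies of the slice itself.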

For example: Part 2

Let’s go back to the bell recording that we’ve been looking at in the previous posts. We have a portion of a recording, 2048 samples long. If I plot that signal, it looks like the curve in Figure 5.

Figure 5. The bell recording we saw in previous postings, hiding the information that comes before and after.

When the computer does the DFT math, the assumption is that this is a slice that is repeated forever. So, the computer’s assumption is that the original signal looks like the one below, in Figure 6.

Figure 6. The signal, as assumed by the computer when it’s doing the DFT math.

I’ve put rectangles around the beginning (at sample 1) and end (at sample 2048) of the slice to highlight what the signal looks like, according to the computer… The signal in the left half of the left rectangle (ending at sample 0) is the end of the slice of the recording, right before it repeats. The signal starting at 2049 is the beginning again – a repeat of sample 1.

If we zoom in on the signal in the left rectangle, it looks like Figure 7.

Figure 7. The signal inside the left rectangle in Figure 6.

Notice that vertical line at sample 1 (actually going from sample 0 to sample 1, to be accurate). Of course, our original bell recording didn’t have that “instantaneous” drop in there – but the computer assumes it does because it doesn’t have enough information to assume anything else.

If we wanted to actually make that “instantaneous” vertical change in the signal (with a theoretical slope of infinity – although it’s not really that steep….), we would have to add other frequencies to our original signal. Generally, you can assume that, the higher the slope of an audio signal, either 1) the louder the signal or 2) the more high frequency content in the signal. Let’s look at the second one of those.

Let’s look at portions of sine waves at three different frequencies. These are shown below, in Figure 8. The top plot shows a sine wave with some frequency, showing how it looks as it passes phase = 0º (which we’ll call “time = 0” (on the X-axis)). At that moment, the sine wave has a value of 0 (on the Y-axis) and the slope is positive (it’s going upwards). The middle plot shows a sine wave with 3 times the frequency (notice that there are 6 negative-and-positive bumps in there instead of just 2). Everything I said about the top plot is still true. The level is 0 at time=0, and the slope is positive. The bottom plot is 5 times the frequency (10 bumps instead of 2). And, again, at time=0, everything is the same.

Figure 8. Three sinusoidal waves at related frequencies. We’re looking at the curves as they cross time=0 (on the X-axis).

Let’s look a little more carefully at the slope of the signal as it crosses time=0. I’ve added blue lines in Figure 9 to highlight those.

Figure 9.

Notice that, as the frequency increases, the slope of the signal when it crosses the 0 line also increases (assuming that the maximum amplitude stays the same – all three sine waves go from -1 to 1 on the Y-axis).

One take-away from that is the idea that I’ve already mentioned: the only way to get a steep slope in an audio signal is to add high frequency content. Or, to say it another way: if your audio signal has a steep slope at some time, it must contain energy at high frequencies.
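The relationship between frequency and slope is easy to check numerically. In the sketch below, the 100 Hz base frequency and the tiny time step used to estimate the slope are arbitrary choices made for the example.

```python
import math

def slope_at_zero(freq_hz, dt=1e-9):
    """Numerical slope of sin(2*pi*f*t) as it crosses time = 0."""
    return (math.sin(2 * math.pi * freq_hz * dt)
            - math.sin(-2 * math.pi * freq_hz * dt)) / (2 * dt)

base = 100.0    # arbitrary base frequency in Hz
slopes = [slope_at_zero(base * m) for m in (1, 3, 5)]
ratios = [s / slopes[0] for s in slopes]
print(ratios)   # roughly 1, 3 and 5
```

For sine waves with the same amplitude, the slope at the zero crossing is proportional to the frequency: 3 times the frequency gives 3 times the slope, and so on.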

Although I won’t explain here, the truth is just a little more complicated. This is because what we’re really looking for is a sharp change in the slope of the signal – the “corners” in the plot around Sample 0 in Figure 7. I’ve put little red circles around those corners to highlight them, shown below in Figure 10. When audio geeks see a sharp corner like that in an audio signal, they say that the waveform is discontinuous – meaning that the level jumps suddenly to something unexpected – which means that its slope does as well.

Basically, if you see a discontinuity in an audio signal that is otherwise smooth, you’re probably going to hear a “click”. The audibility of the click depends on how big a jump there is in the signal relative to the remaining signal. (For example, if you put a discontinuity in a nice, smooth, sine wave, you’ll hear it. If you put a discontinuity in a white noise signal – which is made up of nothing but discontinuities (because it’s random) then you won’t hear it…)

Figure 10. The red circles show the discontinuities in the slope of the signal when it is assumed that it repeats.

Circling back…

Think back to the examples I started with at the beginning of this post. When I do a 65,536-point DFT of a 1000 Hz sine wave sampled at 65,536 Hz, the result is a nice clean-looking magnitude response (Figure 1). However, when I do a 65,536-point DFT of a 1000.5 Hz sine wave sampled at 65,536 Hz, the result is not nearly as nice. Why?

Think about how the ends of the two sine waves join up with their beginnings. When you do a 65,536-point DFT on a signal that has a sampling rate of 65,536 Hz, then the slice of time that you’re analysing is exactly 1 second long. A 1000 Hz sine wave repeats itself exactly after 1 second, so the 65,537th sample is identical to the first. If you join the last 30 samples of the slice to the first 30 samples, it will look like the red curve on the top plot in Figure 11, below.

However, if the sinusoid has a frequency of 1000.5 Hz, then it is only half-way through the waveform when you get to the end of the second. This will look like the lower black curve in Figure 11.

Figure 11. The top plot shows a 1000 Hz sine wave at the end of exactly 1 second, joined to the beginning of the same sine wave. The bottom plot shows the same for a 1000.5 Hz sine wave.

Notice that the lower plot has a discontinuity in the slope of the waveform. This means that there is energy in frequencies other than 1000.5 Hz in it. And, in fact, if you measure how much energy there is in that weird waveform that sounds like a sine wave most of the time, but has a little click every second, you’ll find that the result is already plotted in Figure 2.
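Here is a scaled-down sketch of that comparison in Python. Doing a 65,536-point DFT with a plain loop would take a long time, so I am assuming a 64-point DFT at a sampling rate of 64 Hz (still a window of exactly 1 second), and comparing 4 Hz with 4.5 Hz instead of 1000 Hz with 1000.5 Hz. The function names and the threshold are hypothetical choices for the illustration.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Plain O(N^2) DFT; returns the magnitude of each frequency bin."""
    n_samples = len(signal)
    mags = []
    for k in range(n_samples):
        total = sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n, x in enumerate(signal))
        mags.append(abs(total))
    return mags

fs = 64          # assumed sampling rate in Hz (scaled down from 65,536)
n_points = 64    # so the window is exactly 1 second long

def sine(freq_hz):
    return [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(n_points)]

def count_active_bins(freq_hz, threshold=1e-6):
    """Count bins up to fs/2 that hold a non-negligible amount of energy."""
    mags = dft_magnitudes(sine(freq_hz))
    return sum(1 for m in mags[: n_points // 2 + 1] if m > threshold)

clean = count_active_bins(4.0)   # whole number of cycles in the window
leaky = count_active_bins(4.5)   # window ends half-way through a cycle
print(clean, leaky)
```

The 4 Hz sine should put its energy in a single bin, while the 4.5 Hz sine, which is cut off half-way through a cycle, spreads energy across every bin up to half the sampling rate.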

The conclusion

The important thing to remember from this posting is that a DFT tells you what the relative frequency content of the signal is – but only for the signal that you give it. And, in most cases, the signal that you give it (a slice of time that is looped for infinity) is not the same as the total signal that you took the slice from.

So, most of the time, a DFT (or FFT – you choose what you call it) is NOT showing you what is in your signal in real life. It’s just giving you a reasonably good idea of what’s in there – and you have to understand how to interpret the plot that you’re looking at.

In other words, Figure 2 does not show me how a 1000.5 Hz sine tone sounds – but Figure 1 shows me how a 1000 Hz sine tone sounds. However, Figures 1 and 2 show me exactly how the computer “hears” those signals – or at least the portion of audio that I gave it to listen to.

There is a general term applied to the problem that we’re talking about. It’s called “windowing effects” because the DFT is looking at a “window” of time. (Up to now, I’ve been calling it a “slice” of the audio signal. I’m going to change to using the term “time window” or just “window” from now on.)

In the next posting, DFT’s Part 5: Windowing, we’ll look at some sneaky ways to minimise these windowing effects so that they’re less distracting when you’re looking at magnitude response plots.

DFT’s Part 3: The Math

Links to:
DFT’s Part 1: Some introductory basics
DFT’s Part 2: It’s a little complex…

If you have an audio signal or the impulse response measurement of an audio device (which is just the audio output of a device when the input signal is a very short “click” – how the device responds to an impulse), one way to find out its spectral content is to use a Fourier Transform. Normally, we live in a digital audio world, with discrete divisions of time, so we use a DFT or a Discrete Fourier Transform (although most people call it an FFT – a Fast Fourier Transform).

If you do a DFT of a signal (say, a sinusoidal waveform), then you take a slice of time, usually with a length (measured in samples) that is a nice power of 2 – for example 2, or 4 (2^2), or 2^12 (4096 samples) or 2^13 (8192 samples). When you convert this signal in time through the DFT math, you get out the same number of numbers (so, 2048 samples in, 2048 numbers out). Each of those numbers can be used to find out the magnitude (the level) and the phase for a frequency.

Those frequencies (say, 2048 of them) are linearly spaced from 0 Hz up to just below the sampling rate (the sampling rate would be the 2049th frequency in this case… we’ll see why, below…)

So, generally speaking: if I have an audio signal (a measurement of level over time) and I do a DFT (which is just a series of mathematical equations), then I can see the relative amount of energy by frequency for that “slice” of time.

So, how does the math work? In essence, it’s just a matter of doing a lot of multiplication, and then adding the results that you get (and then maybe doing a little division, if you’re in the mood…). We’ve already seen in Parts 1 and 2 of this series that

  • a sinusoidal waveform is just 2 dimensions (dimension #1 is movement in space, the other dimension is time) of a three-dimensional rotation (dimensions #1 and #2 are space and #3 is time)
  • if we want to know the frequency, the amplitude, and the direction of rotation of the “wheel”, we will need to see the real component (the cosine) and the imaginary component (the negative sine)
  • the imaginary component is a negative sine wave instead of a positive sine wave because the wheel is rotating clockwise

A real-world example

I took a bell and I hit it, so it rang the way bells ring. While I was doing that, I recorded it with a microphone connected to my computer. The sampling rate was 48 kHz and I recorded with enough bits to not worry about that. The result of that recording is shown in Figure 1.

Figure 1. A 7-second long recording of a bell

Seven seconds is a lot of samples at 48,000 samples per second. (In fact, it’s 7 * 48000 samples – which is a lot…) So, let’s take a slice somewhere out of the middle of that recording. This portion (a “zoomed-in” view of Figure 1) is shown below in Figure 2.

Figure 2. A portion of the signal shown in Figure 1. The gray part is 2048 samples long.

So, for the remainder of this posting, we’ll only be looking at that little slice of time, 2048 samples long. Since our sampling rate is 48 kHz, this means that the total length of that slice is 2048 * 1/48000 = 0.0427 seconds, or approximately 42.7 ms.
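That arithmetic is easy to check with a couple of lines of Python. The variable names are my own; the numbers are the ones from the text.

```python
fs = 48_000          # sampling rate from the text, in samples per second
slice_len = 2048     # length of the slice, in samples

total_samples = 7 * fs              # samples in the whole 7-second recording
slice_seconds = slice_len / fs      # duration of the 2048-sample slice

print(total_samples)
print(f"{slice_seconds * 1000:.1f} ms")
```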

Let’s start by calculating the amount of energy there is at 0 Hz or “DC” in this section. We do this by taking the value of each individual sample in the section, and adding all those values together. Some of the values are positive (they’re above the 0 line in Figure 2) and some are negative (they’re below 0). So, if we add them all up we should be somewhere close to 0… Let’s try….

Figure 3.

Figure 3 has three separate plots. The top plot in blue is the section of the recording that we’re using, 2048 samples long. You’ll see that I put a red circle around two samples, sample number 47 and sample number 1000. These were chosen at random, just so we have something near the beginning and something near the middle of the recording to use as examples…

So, to find the total energy at 0 Hz, we have to add the individual values of each of the 2048 samples. So, for example, sample #47 has a value of 0.2054 and sample #1000 has a value of -0.2235. We add those two values and the other 2046 sample values together and we get a total value of 2.9057. Let’s just leave that number sitting there for now. We’ll come back to it later.
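As a rough sketch of that first step in Python: the bell recording itself isn’t reproduced here, so the `signal` below is a made-up stand-in (a decaying 586 Hz tone with a small offset, both of which are assumptions). The point is only that the 0 Hz value is the plain sum of the samples, which is the same as multiplying by a cosine with 0 periods (a string of ones) before adding.

```python
import math

fs = 48_000        # sampling rate from the text
n_samples = 2048   # slice length from the text

# Made-up stand-in for the bell slice (assumption): decaying 586 Hz tone plus a small offset
signal = [0.2 * math.exp(-n / 4000) * math.sin(2 * math.pi * 586 * n / fs) + 0.001
          for n in range(n_samples)]

# 0 Hz ("DC"): just add up every sample value
dc_total = sum(signal)

# The same thing written as "multiply by a cosine with 0 periods, then add"
zero_period_cosine = [math.cos(2 * math.pi * 0 * n / n_samples) for n in range(n_samples)]
dc_via_cosine = sum(x * c for x, c in zip(signal, zero_period_cosine))

print(f"sum of samples: {dc_total:.4f}")
```

The total will not match the 2.9057 quoted above, because the stand-in signal is not the bell.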

For now, we’ll ignore the middle and bottom plots in Figure 3. This is because they’ll be easier to understand after Figure 4 is explained…

Now we want to move up to frequencies above 0 Hz. The way we do this is similar to what we did, with an extra step in the process.

Figure 4.

The top blue plot in Figure 4 shows the same thing that it showed in Figure 3 – it’s the 2048 samples in the recording, with sample numbers 47 and 1000 highlighted with red circles.

Take a look at the middle plot. The red curve in that plot is a cosine wave with a period (the amount of time it takes to complete 1 cycle) of 2048 samples. On that plot, I’ve put two * signs (“asterisks”, if you prefer…) – one on sample number 47 and the other at sample 1000.

One small, but important note here: although it’s impossible to see in that plot, the last value of the cosine wave is not the same as the first – it’s just a little lower in level. This is because the cosine wave would start to repeat itself on the next sample. So, the 2049th sample is equal to the 1st. This makes the period of the cosine wave 2048 samples.

The black curve in this plot is the result when you multiply the original recording (in blue) by the cosine curve (in red). So, for example, sample #47 on the blue curve (a value of 0.2054) multiplied by sample #47 on the red cosine curve (0.9901) equals 0.2033, which is indicated by a red circle on the black curve in the middle plot.

If you look at sample 1000, the value on the blue curve is negative (-0.2235), and when it’s multiplied by the negative value on the cosine curve, the result is a positive value on the black curve.

You’ll also notice that, when the cosine wave is 0, the result of the multiplication in the black curve is also 0.

So, we take each of the 2048 samples in the original recording of the bell, and multiply each of those values, one by one, by their corresponding samples in the cosine curve. This gives us 2048 sample values shown in the black curve, which we add all together, and that gives us a total of 1.5891.

We then do exactly the same thing again, but instead of using a cosine wave, we use a negative sine wave, shown as the red curve in the bottom plot. The blue curve multiplied by the negative sine wave, sample-by-sample results in the black curve in the bottom plot. We add all those sample values together and we get -2.5203.
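The multiply-and-add procedure for one frequency can be sketched in Python as follows. The helper name `bin_components` and the test signal are assumptions for the illustration (the signal is a cosine plus half a sine, each completing exactly 1 period in the slice, not the bell recording). Note that the text counts samples from 1, so “sample #47” is index 46 in the code.

```python
import math

def bin_components(signal, periods):
    """Sum of signal*cosine (real) and signal*(-sine) (imaginary) for a
    wave that completes `periods` cycles in the length of the slice."""
    n_samples = len(signal)
    real = 0.0
    imag = 0.0
    for n, x in enumerate(signal):
        angle = 2 * math.pi * periods * n / n_samples
        real += x * math.cos(angle)
        imag += x * -math.sin(angle)
    return real, imag

n_samples = 2048

# Stand-in signal (assumption): a cosine plus half a sine, 1 period per slice
signal = [math.cos(2 * math.pi * n / n_samples) + 0.5 * math.sin(2 * math.pi * n / n_samples)
          for n in range(n_samples)]

real_1, imag_1 = bin_components(signal, 1)   # 1 period, as in Figure 4
real_2, imag_2 = bin_components(signal, 2)   # 2 periods, as in Figure 5

# Value of the 1-period cosine at the text's sample #47 (index 46)
cosine_at_47 = math.cos(2 * math.pi * 46 / n_samples)
print(real_1, imag_1, real_2, imag_2, cosine_at_47)
```

For this stand-in signal, all of the energy should land in the 1-period result, with a negative imaginary part because the sine in the signal is being multiplied by a negative sine.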

Now, we do it all again at the next frequency.

Figure 5.

Now, the period of the cosine and the negative sine waves is 1024 samples, so they’re at two times the frequency of those shown in Figure 4. However, apart from that change, the procedure is identical. We multiply the signal by the cosine wave (sample-by-sample), add up all the results, and we get 1.3547. We multiply the signal by the negative sine wave and we get -1.0251.

This procedure is repeated, increasing the frequency of the cosine (the real) and the negative sine (the imaginary) waves each time. So far we have seen 0 periods (Figure 3), 1 period (Figure 4), and 2 periods (Figure 5) – we just keep going with 3 periods, 4 periods, and so on.

Eventually we get to 1024 periods. If I were to plot that, it would not look like a cosine wave, since the values would be 1, -1, 1, -1…. for 2048 samples. (But, due to the nature of digital audio and smoothing filters that we’re not going to talk about, it would, in fact, be a cosine wave at a frequency of one half of the sampling rate…)

At that frequency, the values for the negative sine wave would be a string of 2048 zeros – exactly as it is in Figure 3.

If we keep going up, we get to 2048 periods – one period of the cosine wave for each sample. This means that, at each sample, the cosine starts, so the result is a string of 2048 ones. Similarly, the negative sine wave will be a string of 2048 zeros. Note that both of these are identical to what we saw in Figure 3 when we were looking at 0 Hz…
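Those two special cases can be checked with a short Python sketch. The helper names are hypothetical, and the comparison uses a small tolerance because the computer’s cosine and sine values are only accurate to a limited number of decimal places.

```python
import math

n_samples = 2048

def cosine_wave(periods):
    return [math.cos(2 * math.pi * periods * n / n_samples) for n in range(n_samples)]

def neg_sine_wave(periods):
    return [-math.sin(2 * math.pi * periods * n / n_samples) for n in range(n_samples)]

def close(values, expected, tol=1e-9):
    return all(abs(v - e) < tol for v, e in zip(values, expected))

# 1024 periods: the cosine alternates 1, -1, 1, -1... and the negative sine is all zeros
alternating = [1.0 if n % 2 == 0 else -1.0 for n in range(n_samples)]
nyquist_cos_ok = close(cosine_wave(1024), alternating)
nyquist_sin_ok = close(neg_sine_wave(1024), [0.0] * n_samples)

# 2048 periods: both waves are the same as they were at 0 periods (0 Hz)
full_cos_ok = close(cosine_wave(2048), cosine_wave(0))
full_sin_ok = close(neg_sine_wave(2048), neg_sine_wave(0))

print(nyquist_cos_ok, nyquist_sin_ok, full_cos_ok, full_sin_ok)
```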

Since we’ve already seen in the previous posting that, at a given frequency, the cosine component (the total sum of the results of multiplying the original signal by a cosine wave) is the real component and the negative sine is the imaginary component, then we can write all of the results as follows:

frequency “x”: Real + Imaginary contributions
f1: 2.9057 + 0.0000 j
f2: 1.5891 – 2.5203 j
f3: 1.3547 – 1.0251 j

f2047: 1.3547 + 1.0251 j
f2048: 1.5891 + 2.5203 j
f2049: 2.9057 + 0.0000 j

… and, as we saw in Figure 1 in the last post, for any one frequency, the real and imaginary contributions can be converted into a magnitude (a level) by using a little Pythagoras:

magnitude = sqrt(real^2 + imag^2)

So, we get the following magnitudes

frequency “x”: magnitude
f1: 2.9057
f2: 2.9794
f3: 1.6988

f2047: 1.6988
f2048: 2.9794
f2049: 2.9057
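The Pythagoras step can be checked in Python with the real and imaginary sums quoted above. Because those sums are rounded to four decimal places, the last digit of a result may differ slightly from the list. The dictionary names are my own.

```python
import math

# (real, imaginary) sums quoted in the text for the first three frequency numbers
components = {
    "f1": (2.9057, 0.0),
    "f2": (1.5891, -2.5203),
    "f3": (1.3547, -1.0251),
}

# magnitude = sqrt(real^2 + imag^2)
magnitudes = {name: math.hypot(re, im) for name, (re, im) in components.items()}
for name, mag in magnitudes.items():
    print(f"{name}: {mag:.3f}")
```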

Let’s plot the first 10 values – f1 up to f10. (Remember that these are not in Hertz – they’re frequency numbers. We’ll find out what the actual frequencies are later…)

Figure 6.

So, Figure 6 shows the beginning of the results of our calculations – the first 10 values of the 2048 values that we’re going to get. Not much interesting here yet, so let’s plot all 2048 values.

Figure 7.

Figure 7 shows two interesting things. The first is that at least one of those numbers gets very big – almost up to 160 – whatever that means. The other is that we have some symmetry going on here. In fact, you might have already noticed this… If you go back and look at the lists of numbers I gave earlier, you’ll see that the magnitudes for f1 and f2049 are identical. Similarly, the magnitudes of f2 and f2048 are identical, as are those of f3 and f2047. (Strictly speaking, a mirrored pair are complex conjugates of each other: the real components match and the imaginary components have opposite signs, so f2048 is 1.5891 + 2.5203 j. That sign disappears when we calculate the magnitude.) If I had put in all of the values, you would have seen that the symmetry started at f1024 which has the same magnitude as f1026. (See this posting for a discussion about aliasing, which may help to understand why this happens….)

So, since the values are repeated, we only need to look at the first 1025 values that we calculated – we know that f1026 to f2048 are the same in reverse order… So, let’s plot the bottom half of Figure 7.
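Here is a small Python sketch of that symmetry, using a short, arbitrary real-valued signal (an assumption, not the bell recording) and a plain DFT loop. The code counts bins from 0, so bin k mirrors bin N - k; in the 1-based numbering used in this posting with N = 2048, that is f2 mirroring f2048. For a real-valued signal the mirrored bins are complex conjugates, which is why their magnitudes match.

```python
import cmath
import math

def dft(signal):
    """Plain O(N^2) DFT returning one complex value per frequency bin."""
    n_samples = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n, x in enumerate(signal))
            for k in range(n_samples)]

# Arbitrary real-valued test signal (assumption)
n_samples = 16
signal = [math.sin(0.7 * n) + 0.5 * math.cos(2.1 * n) + 0.1 * n for n in range(n_samples)]
bins = dft(signal)

# Magnitudes of mirrored bins match...
mirrored = all(abs(abs(bins[k]) - abs(bins[n_samples - k])) < 1e-9
               for k in range(1, n_samples // 2))
# ...because each mirrored pair is a complex-conjugate pair
conjugates = all(abs(bins[k] - bins[n_samples - k].conjugate()) < 1e-9
                 for k in range(1, n_samples // 2))
print(mirrored, conjugates)
```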

Figure 8.

Figure 8 shows us the same information as Figure 7 – just without the symmetrical repetition. However, it’s still a little hard to read. This is because our frequency divisions are linear. Remember that we multiplied our original signal by 1 period, 2 periods, 3 periods, etc… This means that we were going up in linear frequency steps – adding equal frequencies on each step. The problem is that humans hear frequency steps logarithmically – semitones (1.06 times the frequency) and octaves (2 times the frequency) are examples – we multiply (not add) in equal steps. So, let’s plot Figure 8 again, but change the X-axis to a logarithmic scale.

Figure 9.

Figure 9 and Figure 8 show exactly the same information – I’ve just changed the way the x-axis is scaled so that it looks more like the way we hear distribution of frequency.

But what frequency is it?

There are two remaining problems with Figure 9 – the scaling of the two axes. Let’s tackle the X-axis first.

We know that, to get the value for f1, we found the sum of all of the values in the recording. This told us the magnitude of the 0 Hz component of the signal.

Then things got a little complicated. To find the magnitude at f2, we multiplied the signal by a cosine (and a negative sine) with a period of 2048 samples. What is the frequency of that cosine wave in real life? Well, we know that the original recording was done with a sampling rate of 48 kHz or 48,000 samples per second, and our 2048-sample long slice of time equalled 42.66666666… milliseconds. If we divide the sampling rate by the period of the cosine wave, we’ll find its frequency, since we’ll find out how many times per second (per 48,000 samples) the wave will occur.

f2 = 48,000 / 2048 = 23.4375 Hz

The next frequency value will be the sampling rate divided by the period of the next cosine wave – half the length of the first, or:

f3 = 48,000 / (2048 / 2) = 46.875 Hz

You might notice that f3 = 2 * f2… this helps the math.

f4 = 48,000 / (2048 / 3) = 70.3125 Hz

or f4 = 3 * f2

So, I can now keep going up to find all of my frequencies, and then change the labels on my X-axis so that they make sense to humans.
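That relabelling can be sketched in a few lines of Python. The function name `frequency_of` is hypothetical, and it uses the 1-based “f” numbering from this posting, where f1 is 0 periods, f2 is 1 period, and so on.

```python
fs = 48_000       # sampling rate in Hz, from the text
n_samples = 2048  # slice length in samples, from the text

def frequency_of(freq_number):
    """Frequency in Hz of f1, f2, f3, ... using the text's 1-based numbering."""
    periods = freq_number - 1          # f1 is 0 periods, f2 is 1 period, ...
    return periods * fs / n_samples

for number in (1, 2, 3, 4, 1025):
    print(f"f{number} = {frequency_of(number)} Hz")
```

The last entry, f1025, is the cosine with 1024 periods, which lands on half of the sampling rate.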

Figure 10.

That’s one problem solved. We now know that the bell’s loudest frequency is just under 600 Hz (the peak with a magnitude of about 160) and there’s another frequency at about 1500 Hz as well – with a magnitude of about 30 or so.

But how loud is it?

So, let’s tackle the second problem – what does a magnitude of 160 mean in real life?

Not only do humans hear changes in frequency logarithmically, we hear changes in level logarithmically as well. We say something like “a trumpet is twice as loud as a dog barking” instead of “the loudness of a trumpet is the loudness of a dog barking plus 2”. In fact, that second one just sounds silly when you say it…

As a result, we use logarithms to convert linear levels (like the ones shown on the Y-axis of Figure 10) to something that makes more sense. Instead of having values like 1, 10, 100, and 1000 (I multiplied by 10 each time), we take the log of those values, and tell people that…

log10 (1) = 0
log10 (10) = 1
log10 (100) = 2
log10 (1000) = 3

Now we can use the numbers on the right of those equations, which are small-ish instead of the other ones, which are big-ish…

We use this logarithmic conversion in the calculation of a decibel – which we will not get into here – but it would make the topic of another posting in the future. For now, you’ll just have to hang on…

What we’ll do is to take the magnitude values plotted in Figure 10 and find their logarithms, multiply those by 20, and we get their values in decibels. Cool.

The only problem is that if I were to do that, the numbers would look unusually big. This is because I left out one step way up at the top. Back when we were multiplying and adding all those samples and cosine (and negative sine) waves, we should have done one more thing. We should have found the average value instead of the total sum. This means that we should have divided by the total number of samples. However, since we’re only looking at half of the data (the lower 1025 frequency bins – and not all 2048) we divide by half of the number of samples in our slice of time.

So, we take each sample in the recording, multiply each of those by a value in the cosine (or negative sine) wave, add all of the results together, and divide that total by half of the number of samples. You then use those averaged real and imaginary values to find the magnitude, take its logarithm (base 10), and multiply by 20.

If you do that for each value, you get the result shown below in Figure 11.

Figure 11.

If we connect the dots, then we get Figure 12.

Figure 12.

And there are the peaks we saw earlier. One just under 600 Hz at about -16 dB FS, and the other at about 1500 Hz with a level of about -31 dB FS.
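As a quick check on those numbers, here is a minimal Python sketch of the scaling and decibel conversion. The magnitudes of 160 and 30 are the approximate values read off Figure 10, and the function name is my own.

```python
import math

n_samples = 2048  # slice length in samples, from the text

def magnitude_to_db(summed_magnitude):
    """Divide a summed magnitude by half the slice length, then convert to decibels."""
    averaged = summed_magnitude / (n_samples / 2)
    return 20 * math.log10(averaged)

# Approximate peak magnitudes read off Figure 10
for label, magnitude in (("just under 600 Hz", 160), ("about 1500 Hz", 30)):
    print(f"{label}: {magnitude_to_db(magnitude):.1f} dB FS")
```

With this scaling, a summed magnitude of 1024 (half of the number of samples) corresponds to 0 dB FS.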

The important stuff to remember for now…

There are two important things to remember from this posting.

  1. The frequencies that are calculated using a DFT (or FFT) are linearly spaced. That means that (on a human, logarithmic scale) we have a poor resolution in the low frequencies and a very fine resolution in the high frequencies. (For example, in this case, the first three frequencies are 0 Hz, 23.4 Hz, and 46.9 Hz. The last three frequencies are 23,953.1 Hz, 23,976.6 Hz, and 24,000 Hz.)
  2. If you want better resolution in the low frequencies, you’ll need to calculate with more samples – a longer slice of time, which means more might have happened in that time (although there are some tricks we can play, as we’ll see later).
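Both points can be illustrated with a short Python sketch. The function names and the choice of octaves (20 Hz to 40 Hz, and 10 kHz to 20 kHz) are my own assumptions; the sampling rate and the 2048-sample slice are the ones used in this posting.

```python
fs = 48_000  # sampling rate in Hz, from the text

def bin_spacing_hz(n_samples):
    """Distance in Hz between neighbouring DFT frequencies."""
    return fs / n_samples

def bins_in_range(low_hz, high_hz, n_samples):
    """Count DFT frequencies (up to fs/2) that fall in low_hz <= f < high_hz."""
    return sum(1 for k in range(n_samples // 2 + 1)
               if low_hz <= k * fs / n_samples < high_hz)

low_octave = bins_in_range(20, 40, 2048)            # one octave at the bottom
high_octave = bins_in_range(10_000, 20_000, 2048)   # one octave near the top
print(low_octave, high_octave)

# A longer slice of time gives smaller frequency steps
for n_samples in (2048, 8192, 65_536):
    print(f"{n_samples} samples: {bin_spacing_hz(n_samples):.4f} Hz steps, "
          f"{1000 * n_samples / fs:.1f} ms slice")
```

With 2048 samples, the bottom octave contains a single calculated frequency while the top octave contains hundreds, and each doubling of the slice length halves the step size at the cost of a longer window of time.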

And, you should be left with a question… Why does that plot in Figure 12 look like it’s got lots of energy at a bunch of frequencies – not just two clean spikes? We’ll get into that in the next posting: DFT’s Part 4: The Artefacts.