If you’ve read the three introductory parts of this series, linked above, and if you’re still awake, then we are ready to start putting things together and jumping to incorrect conclusions…
Let’s say that you’ve been hired to specify a digital audio system for some reason (we’ll assume that it’s an LPCM system – nothing exotic). Using the information I’ve told you so far, you can make two decisions in your specification:
You select a bit depth that is enough to give you the dynamic range you desire. In this case, “dynamic range” means the “distance” in level between the loudest sound you can record / store / transmit (I didn’t say what the “digital audio system” was going to be used for) and the inherent noise floor of the system. If you’re recording the background noise on an airplane while it’s in flight, you don’t need a big dynamic range, because it’s always loud, and never changes. However, if you’re recording a Japanese Taiko Drummer group, you’ll need a huge usable dynamic range because the loud parts of the performance are a LOT louder than the quietest parts.
As we saw in Part 3, an LPCM digital audio system cannot record any audio that has a frequency higher than 1/2 the sampling rate. So, you select a sampling rate that is at least 2x the highest frequency you’re interested in. For example, if you believe the books that say you can hear from 20 Hz to 20,000 Hz, then you might decide that your sampling rate has to be at least 40,000 Hz. On the other hand, if you’re making a subwoofer that you know will never be fed a signal above 120 Hz, then you don’t need a sampling rate higher than 240 Hz.
Don’t get angry yet. I’m just keeping these numbers simple to make the math easy. Later on, I’ll explain why what I just said might not be correct.
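If you’d like to play with those two decisions, here’s a minimal sketch of the (deliberately over-simplified) arithmetic. It assumes the “2x the highest frequency” rule from above and the rough “about 6 dB per bit” rule of thumb from Part 2. The function names and the 90 dB target are hypothetical, made up for illustration.

```python
import math

def min_sampling_rate(highest_frequency_hz):
    """Simplified rule from the text: at least 2x the highest frequency."""
    return 2 * highest_frequency_hz

def min_bit_depth(dynamic_range_db, db_per_bit=6.0):
    """Rough estimate only: about 6 dB of dynamic range per bit."""
    return math.ceil(dynamic_range_db / db_per_bit)

print(min_sampling_rate(20_000))  # full-range example from the text
print(min_sampling_rate(120))     # subwoofer example from the text
print(min_bit_depth(90))          # hypothetical 90 dB target
```

Remember that these numbers are only a starting point, and they inherit every simplification made above.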
I just jumped to at least three conclusions (probably more) that are going to haunt me.
The first was that my “digital audio system” was something like the following:
As you can see there, I took an analogue audio signal, converted it to digital, and then converted it back to analogue. Maybe I transmitted it or stored it in the part that says “digital audio”.
However, the important, and very probably incorrect assumption here is that I did nothing to the signal. No volume control, no bass and treble adjustments… nothing.
We assumed above that we can define the system’s dynamic range based on the dynamic range of the audio signal itself. However, this makes the assumption that the noise floor of the digital system and the noise floor of your audio signal are identical, which is probably not true. As we saw in Part 2, the noise generated by TPDF dither is white – it has the same probability of having a given amount of energy per Hertz. However, we hear sound logarithmically (meaning that, to us, octaves are equal widths; equal spacings in Hz are not). This means that the noise sounds “bright” to us – because there’s just as much energy in the top octave (say, 10 kHz to 20 kHz, if you believe the books) as there is in all other frequencies combined from 0 Hz up to 10 kHz.
If, however, the noise floor in your concert hall where the taiko drummers are playing is caused by the air conditioning system, then this noise will be a lot louder in the low frequencies than the highs – which is not the same.
Therefore it’s too simplistic to say “the noise floor of the digital system” and the “noise floor of the signal” – since these two noise floors are different. (As Steven Wright said: “It doesn’t matter what temperature the room is, it’s always room temperature.”)
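The claim about white noise and the top octave is easy to check with a little arithmetic. This is a sketch that assumes perfectly flat (ideal) white noise, so the energy in a band is just the energy per Hertz multiplied by the width of the band. The function name and the choice of octave edges are mine, for illustration only.

```python
def band_energy(low_hz, high_hz, energy_per_hz=1.0):
    """Energy of ideal white noise in a band: flat density times bandwidth."""
    return energy_per_hz * (high_hz - low_hz)

top_octave = band_energy(10_000, 20_000)
everything_below = band_energy(0, 10_000)
print(top_octave == everything_below)

# Octave by octave, each band holds twice the energy of the one below it.
edges = [20_000 / 2**k for k in range(5, -1, -1)]  # 625 Hz up to 20 kHz
octaves = [band_energy(lo, hi) for lo, hi in zip(edges, edges[1:])]
print(octaves)
```

Since each octave contains twice the energy of the octave below it, the noise is weighted towards the top of the range as far as our ears are concerned.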
As we’ll see later, if you’re going to do anything to the signal while it’s in the “digital domain”, then you need to take that into consideration when you’re deciding on your sampling rate. It’s not enough to say “useful audio bandwidth times 2” because there are some side effects that need to be remembered…
However, counter-intuitively, it could be that, in order to improve your system, you’ll want to make the sampling rate LOWER instead of HIGHER – so this is not a simple case of “more is better”.
We’ll get to that topic later. For now, I’ll leave you in suspense.
One thing we saw in Part 3 was that, if we have an audio signal with energy at a frequency higher than 1/2 the sampling rate, and if that signal gets into the analogue-to-digital converter (ADC), then the output of the ADC will contain an error. We’ll get out energy at frequencies that were not in the original, due to the effect called “aliasing“.
Once that’s in the digital audio signal, there’s no removing it, so we need to make sure that the too-high-frequency signals don’t get into the ADC’s input in the first place. This is done using a low-pass filter that (in theory) removes all energy in the signal above the Nyquist frequency (which is equal to 1/2 the sampling rate). Since that low-pass filter prevents aliasing, we call it an anti-aliasing filter. Normally, these days, that antialiasing filter is built into the ADC itself.
As we also saw in Part 3, the digital-to-analogue converter (DAC) has to smooth out the digital signal to convert it from a “staircase” wave to a smoother one. That’s also done with a low-pass filter that eliminates all the harmonics that would be required to make the staircase have sharp corners. Since this is done to re-construct the analogue signal, it’s called a “reconstruction filter“.
This means that, if we pull apart some of the components in the signal chain I showed in Figure 1, it really looks more like this:
Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…
Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.
In the first posting in this series, I talked about how digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time… In the second posting in the series, we looked at a “trick” for dealing with the issue of quantisation (the fact that we have a limited resolution for measuring the amplitude of the audio signal). This trick is to add dither (a fancy word for “noise”) to the signal before we quantise it in order to randomise the error and turn it into noise instead of distortion.
In this posting, we’ll look at some of the problems incurred by the way we carve up time into discrete moments when we grab those samples.
Let’s make a wheel that has one spoke. We’ll rotate it at some speed, and make a film of it turning. We can define the rotational speed in RPM – rotations per minute, but this is not very useful. In this case, what’s more useful is to measure the wheel rotation speed in degrees per frame of the film.
Take a look at the left-most column in Figure 1. This shows the wheel rotating 45º each frame. If we play back these frames, the wheel will look like it’s rotating 45º per frame. So, the playback of the wheel rotating looks the same as it does in real life.
This is more or less the same for the next two columns, showing rotational speeds of 90º and 135º per frame.
However, things change dramatically when we look at the next column – the wheel rotating at 180º per frame. Think about what this would look like if we played this movie (assuming that the frame rate is pretty fast – fast enough that we don’t see things blinking…) Instead of seeing a rotating wheel with only one spoke, we would see a wheel that’s not rotating – and with two spokes.
This is important, so let’s think about this some more. This means that, because we are cutting time into discrete moments (each frame is a “slice” of time) and at a regular rate (I’m assuming here that the frame rate of the film does not vary), then the movement of the wheel is recorded (since our 1 spoke turns into 2) but the direction of movement does not. (We don’t know whether the wheel is rotating clockwise or counter-clockwise. Both directions of rotation would result in the same film…)
Now, let’s move over one more column – where the wheel is rotating at 225º per frame. In this case, if we look at the film, it appears that the wheel is back to having only one spoke again – but it will appear to be rotating backwards at a rate of 135º per frame. So, although the wheel is rotating clockwise, the film shows it rotating counter-clockwise at a different (slower) speed. This is an effect that you’ve probably seen many times in films and on TV. What may come as a surprise is that this never happens in “real life” unless you’re in a place where the lights are flickering at a constant rate (as in the case of fluorescent or some LED lights, for example).
Again, we have to consider the fact that if the wheel actually were rotating counter-clockwise at 135º per frame, we would get exactly the same thing on the frames of the film as when the wheel is rotating clockwise at 225º per frame. These two events in real life will result in identical photos in the film. This is important – so if it didn’t make sense, read it again.
This means that, if all you know is what’s on the film, you cannot determine whether the wheel was going clockwise at 225º per frame, or counter-clockwise at 135º per frame. Both of these conclusions are valid interpretations of the “data” (the film). (Of course, there are more – the wheel could have rotated clockwise by 360º+225º = 585º or counter-clockwise by 360º+135º = 495º, for example…)
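The wheel example can be boiled down to one small calculation: wrap the real rotation per frame into the range from -180º to +180º, and you get the slowest rotation that produces the same film. Here’s a minimal sketch. The function name and the sign convention (positive for clockwise, negative for counter-clockwise) are my own assumptions.

```python
def apparent_rotation(degrees_per_frame):
    """Smallest rotation (in degrees per frame) that gives the same film.

    Positive means clockwise, negative means counter-clockwise.
    The result is wrapped into the range (-180, 180].
    """
    wrapped = degrees_per_frame % 360
    if wrapped > 180:
        wrapped -= 360
    return wrapped

for speed in (45, 90, 135, 180, 225, 585, -135, -495):
    print(speed, "->", apparent_rotation(speed))
```

Notice that 225, 585, -135 and -495 all land on the same answer, and that 180 sits right on the edge, where clockwise and counter-clockwise cannot be told apart.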
Since these two interpretations of reality are equally valid, we call the one we know is wrong an alias of the correct answer. If I say “The Big Apple”, most people will know that this is the same as saying “New York City” – it’s an alias that can be interpreted to mean the same thing.
Wheels and Slinkies
We people in audio commit many sins. One of them is that, every time we draw a plot of anything called “audio” we start out by drawing a sine wave. (A similar sin is committed by musicians who, at the first opportunity to play a grand piano, will play a middle-C, as if there were no other notes in the world.) The question is: what, exactly, is a sine wave?
Get a Slinky – or if you don’t want to spend money on a brand name, get a spring. Look at it from one end, and you’ll see that it’s a circle, as can be (sort of) seen in Figure 2.
Since this is a circle, we can put marks on the Slinky at various amounts of rotation, as in Figure 3.
Of course, I could have put the 0º mark anywhere. I could have also rotated counter-clockwise instead of clockwise. But since both of these are arbitrary choices, I’m not going to debate either one.
Now, let’s rotate the Slinky so that we’re looking at it from the side. We’ll stretch it out a little too…
Let’s do that some more…
When you do this, and you look at the Slinky directly from one side, you are able to see the vertical change of the spring from the centre as a result of the change in rotation. For example, we can see in Figure 6 that, if you mark the 45º rotation point in this view, the distance from the centre of the spring is 71% of the maximum height of the spring (at 90º).
So what? Well, basically, the “punch line” here is that a sine wave is actually a “side view” of a rotation. So, Figure 7 shows a measurement – a capture – of the amplitude of the signal every 45º.
Since we can now think of a sine wave as a rotation of a circle viewed from the side, it should be just a small leap to see that Figure 7 and the left-most column of Figure 1 are basically identical.
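If you don’t have a Slinky handy, the “side view of a rotation” idea can be sketched in a few lines. This assumes a circle with a radius of 1, so the height is simply the sine of the rotation angle. The function name is hypothetical.

```python
import math

def side_view_height(rotation_degrees):
    """Height above the centre of a unit circle, seen from the side."""
    return math.sin(math.radians(rotation_degrees))

# One capture every 45 degrees, over one full rotation.
for angle in range(0, 361, 45):
    print(f"{angle:3d} deg -> {side_view_height(angle):+.2f}")
```

The value at 45º comes out at about 0.71 of the maximum, which matches the 71% mentioned above.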
Let’s make audio equivalents of the different columns in Figure 1.
Figure 10 is an important one. Notice that we have a case here where there are exactly 2 samples per period of the cosine wave. This means that the frequency of the signal is exactly one-half of our sampling frequency (the number of samples we make per second). If the signal gets any higher in frequency than this, then we will be making fewer than 2 samples per period. And, as we saw in Figure 1, this is where things start to go haywire.
Figure 11 shows the equivalent audio case to the “225º per frame” column in Figure 1. When we were talking about rotating wheels, we saw that this resulted in a film that looked like the wheel was rotating backwards at the wrong speed. The audio equivalent of this “wrong speed” is “a different frequency” – the alias of the actual frequency. However, we have to remember that both the correct frequency and the alias are valid answers – so, in fact, both frequencies (or, more accurately, all of the frequencies) exist in the signal.
So, we could take Fig 11, look at the samples (the black lollipops) and figure out what other frequency fits these. That’s shown in Figure 12.
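To convince yourself that two different frequencies really can fit the same samples, here’s a small sketch that samples a cosine wave advancing 225º per sample, and another advancing only 135º per sample. The function name and the number of samples are assumptions for illustration.

```python
import math

def sample_cosine(degrees_per_sample, count=8):
    """Sample a unit cosine wave that advances a fixed angle per sample."""
    return [math.cos(math.radians(degrees_per_sample * n)) for n in range(count)]

actual = sample_cosine(225)   # fewer than 2 samples per period
alias = sample_cosine(135)    # the slower wave that fits the same samples

for a, b in zip(actual, alias):
    print(f"{a:+.3f}  {b:+.3f}")
```

The two columns are the same (apart from rounding in the last decimal places inside the computer), so the samples alone can’t tell you which of the two waves was the original.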
Moving up in frequency one more step, we get to the right-hand column in Figure 1, whose equivalent, including the aliased signal, is shown in Figure 13.
Do I need to worry yet?
Hopefully, now, you can see that an LPCM system has a limit with respect to the maximum frequency that it can deal with appropriately. Specifically, the signal that you are trying to capture CANNOT exceed one-half of the sampling rate. So, if you are recording a CD, which has a sampling rate of 44,100 samples per second (or 44.1 kHz) then you CANNOT have any audio signals in that system that are higher than 22,050 Hz.
That limit is commonly known as the “Nyquist frequency“, named after Harry Nyquist – one of the persons who figured out that this limit exists.
In theory, this is always true. So, when someone did the recording destined for the CD, they made sure that the signal went through a low-pass filter that eliminated all signals above the Nyquist frequency.
In practice, however, there are many cases where aliasing occurs in digital audio systems because someone wasn’t paying enough attention to what was happening “under the hood” in the signal processing of an audio device. This will come up later.
Two more details to remember…
There’s an easy way to predict the output of a system that’s suffering from aliasing if your input is sinusoidal (and therefore contains only one frequency). The frequency of the output signal will be the same distance from the Nyquist frequency as the frequency of the input signal. In other words, the Nyquist frequency is like a “mirror” that “reflects” the frequency of the input signal to another frequency below Nyquist.
This can be easily seen in the upper plot of Figure 14. The distance between the input signal and the Nyquist is the same as the distance between the output signal and the Nyquist.
Also, since that Nyquist frequency acts as a mirror, then the input and output signals’ frequencies will move in opposite directions (this point will help later).
Usually, frequency-domain plots are done on a logarithmic scale, because this is more intuitive for us humans who hear logarithmically. (For example, we hear two consecutive octaves on a piano as having the same “interval” or “width”. We don’t hear the width of the upper octave as being twice as wide, like a measurement system does. That’s why music notation does not get wider on the top, with a really tall treble clef.) This means that it’s not as obvious that the Nyquist frequency is in the centre of the frequencies of the input signal and its alias below Nyquist.
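The “mirror” rule and the logarithmic-scale point can both be sketched with a few numbers. This is a simplified example that only handles an input between the Nyquist frequency and the sampling rate. The 48 kHz sampling rate and the 30 kHz input are hypothetical values chosen to keep the math easy, and the function name is my own.

```python
import math

def alias_frequency(input_hz, sampling_rate_hz):
    """Mirror an input between Nyquist and the sampling rate back below Nyquist."""
    nyquist = sampling_rate_hz / 2
    if not nyquist <= input_hz <= sampling_rate_hz:
        raise ValueError("this sketch only handles Nyquist <= input <= sampling rate")
    return nyquist - (input_hz - nyquist)

fs = 48_000           # hypothetical sampling rate
f_in = 30_000         # hypothetical input, above Nyquist
f_out = alias_frequency(f_in, fs)
nyquist = fs / 2

print(f_out)                                                   # alias frequency in Hz
print(f_in - nyquist, nyquist - f_out)                         # equal distances in Hz
print(math.log2(f_in / nyquist), math.log2(nyquist / f_out))   # unequal in octaves
```

If you raise the input to 31 kHz, the alias drops to 17 kHz, which is the “opposite directions” behaviour. The last line shows why the mirror is harder to spot on a logarithmic plot: the two distances are equal in Hz, but not in octaves.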
Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…
Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.
In the last posting, I talked about how digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time…
For now, we’ll say that each measurement is rounded off to the nearest possible “tick” on the ruler that we’re using to measure the voltage. That rounding results in an error. However, (assuming that everything is working correctly) that error can never be bigger than 1/2 of a “step”. Therefore, in order to reduce the amount of error, we need to increase the number of ticks on the ruler.
Now we have to introduce a new word. If we really had a ruler, we could talk about whether the ticks are 1 mm apart – or 1/16″ – or whatever. We talk about the resolution of the ruler in terms of distance between ticks. However, if we are going to be more general, we can talk about the distance between two ticks being one “quantum” – a fancy word for the smallest step size on the ruler.
So, when you’re “rounding off to the nearest value” you are “quantising” the measurement (or “quantizing” it, if you live in Noah Webster’s country and therefore you harbor the belief that wordz should be spelled like they sound – and therefore the world needz more zees). This also means that the amount of error that you get as a result of that “rounding off” is called “quantisation error“.
In some explanations of this problem, you may read that this error is called “quantisation noise”. However, this isn’t always correct. This is because if something is “noise” then it is random, and therefore impossible to predict. However, that’s not strictly the case for quantisation error. If you know the signal, and you know the quantisation values, then you’ll be able to predict exactly what the error will be. So, although that error might sound like noise, technically speaking, it’s not. This can easily be seen in Figures 1 through 3 which demonstrate that the quantisation error causes a periodic, predictable error (and therefore harmonic distortion), not a random error (and therefore noise).
Sidebar: The reason people call it quantisation noise is that, if the signal is complicated (unlike a sine wave) and high in level relative to the quantisation levels – say a recording of Britney Spears, for example – then the distortion that is generated sounds “random-ish”, which causes people to jump to the conclusion that it’s noise.
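Both of the points above (the error is never bigger than half a step, and for a simple signal it is predictable rather than random) can be checked with a short sketch. The sine wave’s level and the number of samples per period are hypothetical, and the quantiser here is a bare-bones “round to the nearest tick” with no dither.

```python
import math

def quantise(value, step=1.0):
    """Round to the nearest 'tick' (multiple of step), with halves rounding up."""
    return step * math.floor(value / step + 0.5)

samples_per_period = 32   # hypothetical; any whole number works
amplitude = 3.3           # hypothetical level, in quantisation steps

# Two full periods of a sine wave, and the error made by quantising it.
signal = [amplitude * math.sin(2 * math.pi * n / samples_per_period)
          for n in range(2 * samples_per_period)]
error = [quantise(x) - x for x in signal]

print(max(abs(e) for e in error))  # never more than half a step
first, second = error[:samples_per_period], error[samples_per_period:]
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(first, second)))
```

The error in the second period is a copy of the error in the first, which is why it behaves like distortion that follows the signal instead of like noise.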
Now, let’s talk about perception for a while… We humans are really good at detecting patterns – signals – in an otherwise noisy world. This is just as true with hearing as it is with vision. So, if you have a sound that exists in a truly random background noise, then you can focus on listening to the sound and ignore the noise. For example, if you (like me) are old enough to have used cassette tapes, then you can remember listening to songs with a high background noise (the “tape hiss”) – but it wasn’t too annoying because the hiss was independent of the music, and constant. However, if you, like me, have listened to Bob Marley’s live version of “No Woman No Cry” from the “Legend” album, then you, like me, would miss the feedback in the PA system at that point in the song when the FoH engineer wasn’t paying enough attention… That noise (the howl of the feedback) is not noise – it’s a signal… Which makes it just as important as the song itself. (I could get into a long boring talk about John Cage at this point, but I’ll try to not get too distracted…)
The problem with the signal in Figure 2 is that the error (shown in Figure 3) is periodic – it’s a signal that demands attention. If the signal that I was sending into the quantisation system (in Figure 1) was a little more complicated than a sine wave – say a sine wave with an amplitude modulation – then the error would be easily “trackable” by anyone who was listening.
So, what we want to do is to quantise the signal (because we’re assuming that we can’t make a better “ruler”) but to make the error random – so it is changed from distortion to noise. We do this by adding noise to the signal before we quantise it. The result of this is that the error will be randomised, and will become independent of the original signal… So, instead of a modulating signal with modulated distortion, we get a modulated signal with constant noise – which is easier for us to ignore. (It has the added benefit of spreading the frequency content of the error over a wide frequency band, rather than being stuck on the harmonics of the original signal… but let’s not talk about that…)
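Here’s a rough sketch of what adding noise before quantising buys you. It assumes TPDF dither made by adding two uniform random values together (peaking at plus or minus one quantisation step), and a hypothetical sine wave that is so quiet (0.4 of a step) that plain rounding loses it completely. All of the names and numbers are made up for illustration.

```python
import math
import random

def quantise(value):
    """Round to the nearest whole quantisation step (halves round up)."""
    return math.floor(value + 0.5)

def tpdf_dither(rng):
    """Triangular (TPDF) noise spanning +/- 1 step: the sum of two uniform values."""
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)

rng = random.Random(0)    # fixed seed so the sketch is repeatable
count = 48_000            # hypothetical number of samples
period = 48               # hypothetical samples per period
reference = [math.sin(2 * math.pi * n / period) for n in range(count)]
signal = [0.4 * s for s in reference]   # peaks at 0.4 of a step

plain = [quantise(x) for x in signal]
dithered = [quantise(x + tpdf_dither(rng)) for x in signal]

def correlation(output):
    """Average product with the reference sine: how much of the sine survives."""
    return sum(o * r for o, r in zip(output, reference)) / count

print(set(plain))             # {0}: without dither the signal has vanished
print(correlation(plain))     # no trace of the sine
print(correlation(dithered))  # roughly 0.4 * 0.5 = 0.2: the sine is still in there
```

The dithered version is noisier sample by sample, but the sine wave is still present underneath the noise, whereas the undithered version has thrown it away.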
Let’s take a look at an example of this from an equivalent world – digital photography.
The photo in Figure 4 is a black and white photo – which actually means that it’s comprised of shades of gray ranging from black all the way to white. The photo has 272,640 individual pixels (because it’s 640 pixels wide and 426 pixels high). Each of those pixels is some shade of gray, but that shading does not have an infinite resolution. There are “only” 256 possible shades of gray available for each pixel.
So, each pixel has a number that can range from 0 (black) up to 255 (white).
If we were to zoom in to the top left corner of the photo and look at the values of the 64 pixels there (an 8×8 pixel square), you’d see that they are:
What if we were to reduce the available resolution so that there were fewer shades of gray between white and black? We can take the photo in Figure 4 and round the value in each pixel to the new value. For example, Figure 5 shows an example of the same photo reduced to only 6 levels of gray.
Now, if we look at those same pixels in the upper left corner, we’d see that their values are
They’ve all been quantised to the nearest available level, which is 102. (Our possible values are restricted to 0, 51, 102, 154, 205, and 255).
So, we can see that, by quantising the gray levels from 256 possible values down to only 6, we lose details in the photo. This should not be a surprise… That loss of detail means that, for example, the gentle transition from lighter to darker gray in the sky in the original is “flattened” to a light spot in a darker background, with a jagged edge at the transition between the two. Also, the details of the wall pillars between the windows are lost.
If we take our original photo and add noise to it – so we’re adding a random value to the value of each pixel in the original photo (I won’t talk about the range of those random values…) it will look like Figure 6. This photo has all 256 possible values of gray – the same as in Figure 4.
If we then quantise Figure 6 using our 6 possible values of gray, we get Figure 7. Notice that, although we do not have more grays than in Figure 5, we can see things like the gradual shading in the sky and some details in the walls between the tall windows.
That noise that we add to the original signal is called dither – because it is forcing the quantiser to be indecisive about which level to choose.
I should be clear here and say that dither does not eliminate quantisation error. The purpose of dither is to randomise the error, turning the quantisation error into noise instead of distortion. This makes it (among other things) independent of the signal that you’re listening to, so it’s easier for your brain to separate it from the music, and ignore it.
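The photo example can be imitated with a handful of numbers. This sketch uses the six gray levels listed above, a hypothetical gentle gradient (like the sky), and uniform random noise with an assumed range of plus or minus 26 (about half the gap between levels). The text doesn’t say what range was used for the photos, so that value is only my guess for the sake of the demonstration.

```python
import random

LEVELS = (0, 51, 102, 154, 205, 255)   # the six gray values listed above

def nearest_level(value):
    """Quantise a gray value to the closest of the six available levels."""
    return min(LEVELS, key=lambda level: abs(level - value))

rng = random.Random(1)                 # fixed seed so the sketch is repeatable
noise_range = 26                       # assumed: about half the gap between levels
columns = 1000                         # hypothetical number of pixels per gray value

def dithered_mean(gray):
    """Average of many noisy-then-quantised copies of one gray value."""
    total = sum(nearest_level(gray + rng.uniform(-noise_range, noise_range))
                for _ in range(columns))
    return total / columns

gradient = range(90, 121, 10)          # hypothetical gentle shading: 90, 100, 110, 120
print([nearest_level(g) for g in gradient])            # every pixel lands on 102
print([round(dithered_mean(g)) for g in gradient])     # the shading survives on average
```

Without the noise, the whole gradient is flattened to a single gray. With the noise, each pixel is still one of only six values, but the proportion of darker and lighter pixels changes along the gradient, so your eye sees the shading.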
Addendum: Binary basics and SNR
We normally write down our numbers using a “base 10” notation. So, when I write down 9374 – I mean 9 x 1000 + 3 x 100 + 7 x 10 + 4 x 1 or 9 x 10³ + 3 x 10² + 7 x 10¹ + 4 x 10⁰
We use base 10 notation – a system based on 10 digits (0 through 9) because we have 10 fingers.
If we only had 2 fingers, we would do things differently… We would only have 2 digits (0 and 1) and we would write down numbers like this: 11101
which would be the same as saying 1 x 16 + 1 x 8 + 1 x 4 + 0 x 2 + 1 x 1 or 1 x 2⁴ + 1 x 2³ + 1 x 2² + 0 x 2¹ + 1 x 2⁰
The details of this are not important – but one small point is. If we’re using a base-10 system and we increase the number by one more digit – say, going from a 3-digit number to a 4-digit number, then we increase the possible number of values we can represent by a factor of 10. (In other words, there are 10 times as many possible values in the number XXXX as in XXX.)
If we’re using a base-2 system and we increase by one extra digit, we increase the number of possible values by a factor of 2. So XXXX has 2 times as many possible values as XXX.
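If you want to check the digit-counting for yourself, here’s a tiny sketch. The function name is hypothetical. Python’s built-in int() can read a string of binary digits when you tell it that the base is 2.

```python
def possible_values(digits, base):
    """How many different numbers fit in a given number of digits."""
    return base ** digits

print(int("11101", 2))   # the binary example above, read as a base-10 number
print(possible_values(4, 10) // possible_values(3, 10))   # one more decimal digit
print(possible_values(4, 2) // possible_values(3, 2))     # one more binary digit
print(possible_values(16, 2))                             # a 16-bit value
```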
Now, remember that the error that we generate when we quantise is no bigger than 1/2 of a quantisation step, regardless of the number of steps. So, if we double the number of steps (by adding an extra binary digit or bit to the value that we’re storing), then the signal can be twice as “far away” from the quantisation error.
This means that, by adding an extra bit to the stored value, we increase the potential signal-to-error ratio of our LPCM system by a factor of 2 – or 6.02 dB.
So, if we have a 16-bit LPCM signal, then a sine wave at the maximum level that it can be without clipping is about 6 dB/bit * 16 bits – 3 dB = 93 dB louder than the error. The reason we subtract the 3 dB from the value is that the error is +/- 0.5 of a quantisation step (normally called an “LSB” or “Least Significant Bit”).
Note as well that this calculation is just a rule of thumb. It is neither precise nor accurate, since the details of exactly what kind of error we have will have a minor effect on the actual number. However, it will be close enough.
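The rule of thumb above is simple enough to write down as a sketch. The first line shows where the 6.02 dB comes from (a factor of 2 expressed in decibels), and the function reproduces the rough “6 dB per bit, minus 3 dB” estimate. The function name is hypothetical, and the result is only as accurate as the rule of thumb itself.

```python
import math

db_per_bit = 20 * math.log10(2)    # doubling the number of steps, in dB

def rule_of_thumb_db(bits):
    """The estimate from the text: about 6 dB per bit, minus 3 dB."""
    return 6 * bits - 3

print(round(db_per_bit, 2))
for bits in (16, 24):
    print(bits, "bits ->", rule_of_thumb_db(bits), "dB")
```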
I’ve been debating writing a series of postings about “high resolution” audio for a long time – years. Lately, (probably because of some hype generated by some recent press releases) I’ve been getting lots of question (no, that’s not a typo) about it, so it appears the time has come…
To start: the question that I get (a lot) is “If I can’t hear above 20 kHz, then what’s the use of high-res?” As I’ll explain as we go through, this is only one, rather small aspect to consider in this topic. In fact, it might be the least important issue to consider.
However, before I write too much, I’ll say that I’m not going to argue for or against higher resolutions in digital audio systems. I’m only going to go through a bunch of issues that can be used to argue either for or against them. So, there’s not going to be a big reveal at the end of this series telling you that high-res is either better, worse, or no different than whatever you’re using now. It’s merely going to be a discussion of a number of issues that need to be weighed. The problem is that this entire topic is complicated – and there’s no single “right” answer, as I’ll argue as we go along.
To start, let’s get down to basics and look (once again, from the perspectives of this website) at what sound is, and how it’s converted from an analogue electrical signal into a digital representation. The good thing is that I’ve written this introduction before in a different series of postings. So, I’m going to be extremely lazy and just copy-and-paste that information here. I’m not just referring you to another page because I’m intentionally leaving some things out because we’re headed into having a different discussion this time.
A quick introduction to sound
At the simplest level, sound can be described as a small change in air pressure (or barometric pressure) over short periods of time. If you’d like to have a better and more edu-tain-y version of this statement with animations and pretty colours, you could take 10 minutes to watch this video, for example.
That change in pressure can be “captured” by using a microphone, which is (at the simplest level) a device that has a change in air pressure at its input and a change in electrical voltage at its output. Ignoring a lot of details, we could say that if you were to plot a measurement of the air pressure (at the input of the microphone) over time, and you were to compare it to a plot of the measurement of the voltage (at the output of the microphone) over time, you would see the same curve on the two graphs. This means that the change in voltage is analogous to the change in air pressure.
At this point in the conversation, I’ll make a point to say that, in theory, we could “zoom in” on either of those two curves shown in Figure 1 and see more and more details. This is like looking at a map of Canada – it has lots of crinkly, jagged lines. If you zoom in and look at the map of Newfoundland and Labrador, you’ll see that it has finer, crinkly, jagged lines. If you zoom in further, and stand where the water meets the shore in Trepassey and take a photo of your feet, you could copy it to draw a map of the line of where the water comes in around the rocks – and your toes – and you would wind up with even finer, crinkly, jagged lines… You could take this even further and get down to a microscopic or molecular level – but you get the idea… The point is that, in theory, both of the plots in Figure 1 have infinite resolution, both in time and in air pressure or voltage.
Now, let’s say that you wanted to take that microphone’s output and transmit it through a bunch of devices and wires that, in theory, all do nothing to the signal. Let’s say, for example, that you take the mic’s output, send it through a wire to a box that makes the signal twice as loud. Then take the output of that box and send it through a wire to another box that makes it half as loud. You take the output of that box and send it through a wire to a measuring device. What will you see? Unfortunately, none of the wires or boxes in the chain can be perfect, so you’ll probably see the signal plus something else which we’ll call the “error” in the system’s output. We can call it the error because, if we measure the input voltage and the output voltage at any one instant, we’ll probably see that they’re not identical. Since they should be identical, then the system must be making a mistake in transmitting the signal – so it makes errors…
Pedantic Sidebar: Some people will call that error that the system adds to the signal “noise” – but I’m not going to call it that. This is because “noise” is a specific thing – noise is random – so if it’s not random, it’s not noise. Also, although the signal has been distorted (in that the output of the system is not identical to the input) I won’t call it “distortion” either, since distortion is a name that’s given to something that happens to the signal because the signal is there. (We would probably get at least some of the error out of our system even if we didn’t send any audio into it.) So, we could be slightly geeky and adequately vague and call the extra stuff “Distortion plus noise” but not “THD+N” – which stands for “Total Harmonic Distortion Plus Noise” – because not all kinds of distortion will produce a harmonic of the signal… but I’m getting ahead of myself…
So, we want to transmit (or store) the audio signal – but we want to reduce the noise caused by the transmission (or storage) system. One way to do this is to spend more money on your system. Use wires with better shielding, amplifiers with lower noise floors, bigger power supplies so that you don’t come close to their limits, run your magnetic tape twice as fast, and so on and so on. Or, you could convert the analogue signal (remember that it’s analogous to the change in air pressure over time) to one that is represented (and therefore transmitted or stored) digitally instead.
What does this mean?
Conversion from analogue to digital and back (but skipping important details)
IMPORTANT: If you read this section, then please read the following postings as well. This is because, in order to keep things simple to start, I’m about to leave out some important details that I’ll add afterwards. However, if you don’t add the details, you could (understandably) jump to some incorrect conclusions (that many others before you have concluded…) So, if you don’t have time to read both sections, please don’t read either of them.
In the example above, we made a varying voltage that was analogous to the varying air pressure. If we wanted to store this, we could do it by varying the amount of magnetism on a wire or a coating on a tape, for example. Or we could cut a wiggly groove in a bit of vinyl that has a similar shape to the curve in the plots in Figure 1. Or, we could do something else: we could get a metronome (or a clock) and make a measurement of the voltage every time the metronome clicks, and write down the measurements.
For example, let’s zoom in on the first little bit of the signal in the plots in Figure 1.
We’ll then put on a metronome and make a measurement of the voltage every time we hear the metronome click…
We can then keep the measurements (remembering how often we made them…) and write them down like this:
We can store this series of numbers on a computer’s hard disk, for example. We can then come back tomorrow, and convert the measurements to voltages. First we read the measurements, and create the appropriate voltage…
We then make a “staircase” waveform by “holding” those voltages until the next value comes in.
All we need to do then is to use a low-pass filter to smooth out the hard edges of the staircase.
So, in this example, we’ve gone from an analogue signal (the red curve in Figure 3) to a digital signal (the series of numbers), and back to an analogue signal (the red curve in Figure 7).
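The measure-and-hold steps described above can be sketched in a few lines of Python. This is only an illustration: the 1 kHz sine wave standing in for the voltage and the metronome clicking 8000 times per second are hypothetical numbers (they are not taken from the figures), and the final low-pass filter is deliberately left out.

```python
import math

SAMPLE_RATE = 8000.0   # hypothetical "metronome" rate, in clicks per second
FREQUENCY = 1000.0     # hypothetical test signal frequency, in Hz

def voltage(t):
    """The 'analogue' signal: a sine wave that exists at any time t (seconds)."""
    return math.sin(2.0 * math.pi * FREQUENCY * t)

def take_measurements(num_samples):
    """Measure the voltage on each click and write it down with 4 decimals."""
    return [round(voltage(k / SAMPLE_RATE), 4) for k in range(num_samples)]

def staircase(measurements, t):
    """Hold each measurement until the next one arrives (no smoothing yet)."""
    index = min(int(t * SAMPLE_RATE), len(measurements) - 1)
    return measurements[index]

measurements = take_measurements(8)
```

Asking `staircase` for a time halfway between the third and fourth clicks returns the third measurement, which is the “holding” behaviour that produces the hard edges the low-pass filter later smooths out.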
In some ways, this is a bit like the way a movie works. When you watch a movie, you see a series of still photographs, probably taken at a rate of 24 pictures (or frames) per second. If you play those photos back at the same rate (24 fps or frames per second), you think you see movement. However, this is because your eyes and brain aren’t fast enough to see 24 individual photos per second – so you are fooled into thinking that things on the screen are moving.
However, digital audio is slightly different from film in two ways:
The sound (equivalent to the movement in the film) is actually happening. It’s not a trick that relies on your ears and brain being too slow.
If, when you were filming the movie, something were to happen between frames (say, the flash of a gunshot, for example) then it would never be caught on film. This is because the photos are discrete moments in time – and what happens between them is lost. However, if something were to make a very, very short sound between two samples (two measurements) in the digital audio signal – it would not be lost. This is because of something that happens at the beginning of the chain that I haven’t described… yet…
However, there are some “artefacts” (a fancy term for “weird errors”) that are present both in film and in digital audio that we should talk about.
The first is an error that happens when you mess around with the rate at which you take the measurements (called the “sampling rate”) or the photos (called the “frame rate”) – and, more importantly, when you need to worry about this. Let’s say that you make a film at 24 fps. If you play this back at a higher frame rate, then things will move very quickly (like old-fashioned baseball movies…). If you play them back at a lower frame rate, then things move in slow motion. So, for things to look “normal” you have to play the movie at the same rate that it was filmed. However, as long as no one is looking, you can transfer the movie as fast as you like. For example, if you wanted to copy the film, you could set up a movie camera so it was pointing at a movie screen and film the film. As long as the movie on the screen is running in sync with the camera, you can do this at any frame rate you like. But you’ll have to watch the copy at the same frame rate as the original film… (Note that this issue is not something that will come up in this series of postings about high resolution audio)
The second is an easy artefact to recognise. If you see a car accelerating from 0 to something fast on film, you’ll see the wheels of the car start to get faster and faster, then, as the car gets faster, the wheels slow down, stop, and then start going backwards… This does not happen in real life (unless you’re in a place lit with flashing lights like fluorescent bulbs or LEDs). I’ll do a posting explaining why this happens – but the thing to remember here is that the speed of the wheel rotation that you see on the film (the one that’s actually captured by the filming…) is not the real rotational speed of the wheel. However, those two rotational speeds are related to each other (and to the frame rate of the film). If you change the real rotational rate or the frame rate, you’ll change the rotational rate in the film. So, we call this effect “aliasing” because it’s a false version (an alias) of the real thing – but it’s always the same alias (assuming you repeat the conditions…) Digital audio can also suffer from aliasing, but in this case, you put in one frequency (which is actually the same as a rotational speed) and you get out another one. This is not the same as harmonic distortion, since the frequency that you get out is due to a relationship between the original frequency and the sampling rate, so the result is almost never a multiple of the input frequency. (We’re going to dig into this a lot deeper through this series of postings about high resolution audio, so if it doesn’t immediately make sense, don’t worry…)
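One common way to predict which alias comes out is to “fold” the input frequency around the nearest multiple of the sampling rate. The sketch below assumes an idealised sampler with nothing in front of it to stop the aliasing; the function name and the example numbers are made up for illustration.

```python
def alias_frequency(input_hz, sample_rate_hz):
    """Frequency (in Hz) that shows up after sampling input_hz at sample_rate_hz."""
    # Find the multiple of the sampling rate that is closest to the input...
    nearest_multiple = round(input_hz / sample_rate_hz) * sample_rate_hz
    # ...and the alias is however far the input sits from that multiple.
    return abs(input_hz - nearest_multiple)

below_half = alias_frequency(1000, 48000)    # stays at 1000 Hz
above_half = alias_frequency(30000, 48000)   # comes out at 18000 Hz
```

Note that 18,000 Hz is not a multiple of 30,000 Hz, which is one way to see that this is not harmonic distortion.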
Some important details that I left out…
One of the things I said above was something like “we measure the voltage and store the results” and the example I gave was a nice series of numbers that only had 4 digits after the decimal point. This statement has some implications that we need to discuss.
Let’s say that I have a thing that I need to measure. For example, Figure 8 shows a piece of metal, and I want to measure its width.
Using my ruler, I can see that this piece of metal is about 57 mm wide. However, if I were geeky (and I am) I would say that this is not precise enough – and therefore it’s not accurate. The problem is that my ruler is only graduated in millimetres. So, if I try to measure anything that is not exactly an integer number of mm long, I’ll either have to guess (and be wrong) or round the measurement to the nearest millimetre (and be wrong).
So, if I wanted you to make a piece of metal the same width as my piece of metal, and I used the ruler in Figure 8, we would probably wind up with metal pieces of two different widths. In order to make this better, we need a better ruler – like the one in Figure 9.
Figure 9 shows a vernier caliper (a fancy type of ruler) being used to measure the same piece of metal. The caliper has a resolution of 0.05 mm instead of the 1 mm available on the ruler in Figure 8. So, we can make a much more accurate measurement of the metal because we have a measuring device with a higher precision.
The conversion of a digital audio signal is the same. As I said above, we measure the voltage of the electrical signal, and transmit (or store) the measurement. The question is: how accurate and precise is your measurement? As we saw above, this is (partly) determined by how many digits are in the number that you use when you “write down” the measurement.
Since the voltage measurements in digital audio are recorded in binary rather than decimal (we use 0 and 1 to write down the number instead of 0 up to 9) then we use Binary digITS – or “bits” instead of decimal digits (which are not called “dits”). The number of bits we have in the number that we write down (partly) determines the precision of the measurement of the voltage – and therefore (possibly), our accuracy…
Just like the example of the ruler in Figure 8, above, we have a limited resolution in our measurement. For example, if we had only 4 bits to work with then the waveform in Figure 4 – the one we have to measure – would be measured with the “ruler” shown on the left side of Figure 10, below.
When we do this, we have to round off the value to the nearest “tick” on our ruler, as shown in Figure 11.
Using this “ruler” which gives a write-down-able “quantity” to the measurement, we get the following values for the red staircase:
When we “play these back” we get the staircase again, shown in Figure 12.
Of course, this means that, by rounding off the values, we have introduced an error in the system (just like the measurement in Figure 8 has a bigger error than the one in Figure 9). We can calculate this error if we just subtract the original signal from the output signal (in other words, Figure 12 minus Figure 10) to get Figure 13.
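A minimal sketch of that rounding and subtraction, assuming a “ruler” whose 2^bits ticks are spread evenly over -1 to 1 (the exact tick positions in Figure 10 may differ) and a hypothetical sine wave with a peak of 0.8 as the signal:

```python
import math

def quantise(value, bits):
    """Round value (nominally between -1 and 1) to the nearest tick on the ruler."""
    step = 2.0 / (2 ** bits)                  # distance between neighbouring ticks
    tick = round(value / step)                # number of the nearest tick
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    tick = max(lowest, min(highest, tick))    # never leave the ruler
    return tick * step

# Hypothetical input: one cycle of a sine wave, 32 measurements, peak of 0.8.
original = [0.8 * math.sin(2.0 * math.pi * k / 32) for k in range(32)]
quantised = [quantise(v, 4) for v in original]

# The error is the output minus the original, one value per measurement.
error = [q - v for q, v in zip(quantised, original)]
```

As long as the signal stays on the ruler, no value in `error` can be bigger than half of one step (0.0625 for 4 bits over this range).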
In order to improve our accuracy of the measurement, we have to increase the precision of the values. We can do this by adding an extra digit (or bit) to the number that we use to record the value.
If we were using decimal numbers (0-9) then adding an extra digit to the number would give us 10 times as many possibilities. (For example, if we were using 4 digits after the decimal in the example at the start of this posting, we have a total of 10,000 possible values – 0.0000 to 0.9999. If we add one more digit, we increase the resolution to 100,000 possible values – 0.00000 to 0.99999 ).
In binary, adding one extra digit gives us twice as many “ticks” on the ruler. So, using 4 bits gives us 16 possible values. Increasing to 5 bits gives us 32 possible values.
If you’re listening to a CD, then the individual measurements of each voltage – the “sample values” – are stored with 16 bits, which means that we have 65,536 possible values to pick from.
Remember that this means that we have more “ticks” on our ruler – but we don’t necessarily increase its range. So, for example, we’re still measuring a voltage from -1 V to 1 V – we just have more and more resolution with which we can do that measurement.
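The arithmetic in the last few paragraphs is small enough to check directly. In this sketch the function names are invented for the example, and the -1 V to 1 V range is simply the one used in the text:

```python
def tick_count(bits):
    """Number of values that can be written down with the given number of bits."""
    return 2 ** bits

def tick_spacing(bits, lowest=-1.0, highest=1.0):
    """Distance between neighbouring ticks when the range itself stays fixed."""
    return (highest - lowest) / tick_count(bits)

counts = {bits: tick_count(bits) for bits in (4, 5, 16)}
```

Each extra bit doubles `tick_count` and halves `tick_spacing`, while the range being measured does not change.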
Occasionally, a question that comes into the customer communications department at Bang & Olufsen from a dealer or a customer eventually finds its way into my inbox.
This week, the question was about nomenclature. Why is it that, on some loudspeakers, for example, we say there is a tweeter, mid-range, and woofer, whereas on other loudspeakers we say that we’re using a “full range” driver instead? What’s the difference? (Folded into the same question was another about amplifier power, but I’ll take that one in another posting.)
So, what IS the difference? There are three different ways to answer this question.
Answer #1: It’s how you use it.
My Honda Civic, the motorcycle that passed me on the highway this morning, and an F1 car all have a gear in the gearbox that’s labelled “3”. However, the gear ratios of those three examples of “third gear” are all different. In other words, if you showed a mechanic the gear ratio of one of those gearbox settings without knowing anything else, they wouldn’t be able to tell you “ah! that’s third gear…”
So, in this example, “third gear” is called “third” only because it’s the one between “second” and “fourth”. There is nothing physical about it that makes it “third”. If that were the case then my car wouldn’t have a first gear, because some farm tractor out there in the world would have a gear with a lower ratio – and an F1 car would start at gear 100 or so… And that wouldn’t make sense.
Similarly, we use the words “tweeter”, “midrange”, “woofer”, “subwoofer”, and “full range” to indicate the frequency range that that particular driver is looking after in this particular device. My laptop has a 1″ “woofer” – which only means that it’s the driver that’s taking care of the low frequencies that come out of my laptop.
So, using this explanation, the Beolab 90 webpage says that it has midranges and tweeters and no “full range” drivers because the midrange drivers look after the midrange frequencies, and the tweeters look after the high frequencies. However, the Beolab 28’s webpage says that it has a tweeter and full range drivers, but no midranges. This is because the drivers that play the midrange frequencies in the Beolab 28 also play some of the high-frequency content as part of the Beam Width control. Since they’re doing “double duty”, they get a different name.
Answer #2: Excruciating minutiae
The description I gave above isn’t really an adequate answer. For example, I said that my laptop has a 1″ “woofer”. Beolab 90 has a 1″ “tweeter” – but these two drivers are not designed the same way. Beolab 90’s tweeter is specifically designed to be used to produce high frequencies. One consequence of this is that the total mass of the moving parts (the diaphragm and voice coil, amongst other things) is as low as possible, so that it’s easy to move. This means that it can produce high frequency signals without having to use a lot of electrical power to push it back and forth.
However, the 1″ “woofer” in my laptop is designed differently. It probably has a much higher total mass for the moving parts. This means that its resonant frequency (the frequency that it would “ring” at if you hit it like a drum head) is much lower. Therefore it “wants” to move easily at a lower frequency than a tweeter would.
For example, if you put a child on a swing and you give them a push, they’ll swing back and forth at some frequency. If the child wanted to swing SLOWER (at a lower frequency), you could
move to a swing with longer ropes so this happens naturally, or
you can hold on to the ropes and use your muscles to control the rate of swinging instead.
The smarter thing to do is the first choice, that way you can keep sipping your coffee instead of getting a workout.
So, a 1″ woofer and a 1″ tweeter are not really the same thing.
Answer #3: Compromise
We live in a world that has been convinced by advertisers that “compromise” is a bad thing – but it’s not. Someone who never accepts compromise is destined to live a very lonely life. When designing a loudspeaker, one of the things to consider is what, exactly, each component will be asked to do, and choose the appropriate components accordingly.
If we’re going to be really pedantic – there’s really no such thing as a tweeter, woofer, or anything else with those kinds of names. Any loudspeaker driver can produce sound at any frequency. The only difference between them is the relative ease with which the driver plays a signal at a given frequency. You can get 20 Hz to come out of a “tweeter” – it will just be naturally a LOT quieter than the signals at around 5 kHz. Similarly, a woofer can play signals at 20 kHz, but it will be a lot quieter and/or take a lot more power than signals at 50 Hz.
What this means is that, when you make an active loudspeaker, the response (the relative levels of signals at different frequencies) is really a result of the filters in the digital signal processing and the control from the amplifier (ignoring the realities of heat and time…). If we want more or less level at 2 kHz from a loudspeaker driver, we “just” change the filter in the signal processing and use the amplifier to do the work (the same as the example above where you were using your muscle power to control the frequency of the child on the swing).
However, there are examples where we know that a driver will be primarily used for one frequency band, but actually be extending into another. The side-facing drivers on Beolab 28 are a good example of this. They’re primarily being used to control the beam width in the midrange, but they’re also helping to control the beam width in the high frequencies. Since they’re doing double-duty in two frequency ranges, they can’t really be called “midranges” or “tweeters” – they’d be more accurately called “midranges that also play as quiet tweeters”. (They don’t have to play high frequencies loudly, since this is “only” to control the beam width of the front tweeter.) However, “midranges that also play as quiet tweeters” is just too much information for a simple datasheet – so “full range” will do as a compromise.
I’ve got some extra things to add here…
Firstly, it has become common over the past couple of years to call “woofers” “subwoofers” instead. I don’t know why this happened – but I suspect that it’s merely the result of people who write advertising copy using a word they’ve heard before without really knowing what it means. Personally, I think that it’s funny to see a laptop specified to have a “1″ subwoofer”. Maybe we should make the word “subtweeter” popular instead.
Secondly, personally, I believe that a “subwoofer” is a thing that looks after the frequency range below a “woofer”. I remember a conversation I had at an AES convention once (I think it was with Günther Theile and Tomlinson Holman) where we all agreed that a “subwoofer” should look after the frequency range up to 40 Hz, which is where a decent woofer should take over.
Lastly, if you find an audio magazine from the 1970s, you’ll see that a three-way loudspeaker had a “tweeter”, “squawker”, and “woofer”. Sometime between then and now, “squawker” was replaced with “midrange” – but I wonder why the other two didn’t change to “highrange” and “lowrange” (although neither of these would be correct, since all three drivers in a three-way system have limited frequency ranges).
I spent some time this week helping to track down the source of an error in a digital audio signal flow chain, and we wound up having a discussion that I thought might be worth repeating here.
Let’s start at the very beginning.
Let’s take an analogue audio signal and convert it to a Linear Pulse Code Modulation (LPCM) representation in the dumbest possible way.
In order to save this signal as a string of numerical values, we have to first accept the fact that we don’t have an infinite number of numbers to use. So, we have to round off the signal to the nearest usable value or “quantisation value”. This process of rounding the value is called “quantisation”.
Let’s say for now that our available quantisation values are the ones shown on the grid. If we then take our original sine wave and round it to those values, we get the result shown below.
Of course, I’m leaving out a lot of important details here like anti-aliasing filtering and dither (I said that we were going to be dumb…) but those things don’t matter for this discussion.
So far so good. However, we have to be a bit more specific: an LPCM system encodes the values using binary representations of the values. So, a quantisation value of “0.25”, as shown above isn’t helpful. So, let’s make a “baby” LPCM system with only 3 bits (meaning that we have three Binary digITs available to represent our values).
To start, let’s count using a 3-bit system:
Binary   Place values             Decimal
000      0 x 4 + 0 x 2 + 0 x 1    0
001      0 x 4 + 0 x 2 + 1 x 1    1
010      0 x 4 + 1 x 2 + 0 x 1    2
011      0 x 4 + 1 x 2 + 1 x 1    3
100      1 x 4 + 0 x 2 + 0 x 1    4
101      1 x 4 + 0 x 2 + 1 x 1    5
110      1 x 4 + 1 x 2 + 0 x 1    6
111      1 x 4 + 1 x 2 + 1 x 1    7
Table 1: The 8 numbers that can be represented using a 3-bit binary representation
and that’s as far as we can go before needing 4 bits. However, for now, that’s enough.
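The counting in Table 1 can be reproduced with a short loop. This is just a sketch; the function name is invented for the example.

```python
def three_bit_rows():
    """Return (binary word, place-value sum, decimal value) for all 3-bit numbers."""
    rows = []
    for value in range(8):                       # 2**3 = 8 possible words
        word = format(value, "03b")              # e.g. 5 becomes "101"
        fours, twos, ones = (int(bit) for bit in word)
        place_values = f"{fours} x 4 + {twos} x 2 + {ones} x 1"
        rows.append((word, place_values, fours * 4 + twos * 2 + ones))
    return rows

rows = three_bit_rows()
```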
Take a look at our signal. It ranges from -1 to 1 and 0 is in the middle. So, if we say that the “0” in our original signal is encoded as “000” in our 3-bit system, then we just count upwards from there as follows:
Now what? Well, let’s look at this a little differently. If we were to divide a circle into the same number of quantisation values, make the “12:00” position = 000, and count clockwise, it would look like this:
The question now is “how do we number the negative values?” but the answer is already in the circle shown above… If I make it a little more obvious, then the answer is shown below.
If we use the convention shown above, and represent that on the graph of our audio signal, then it looks like this:
One nice thing about this way of doing things is that you just need to look at the first digit in the binary word to know whether the value is positive or negative. A 0 means it’s positive, and a 1 means it’s negative.
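Going around the circle in code looks something like the sketch below, which converts between ordinary signed integers and their 3-bit words. The function names are made up for the example, and the quantisation values are treated as the integers -4 to 3 rather than as fractions of full scale.

```python
def to_word(value, bits=3):
    """Encode a signed integer as a binary word by counting around the circle."""
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {bits} bits")
    # Negative values land on the upper half of the unsigned count.
    return format(value % (2 ** bits), f"0{bits}b")

def from_word(word):
    """Decode a binary word; a leading 1 means the value is negative."""
    unsigned = int(word, 2)
    return unsigned - 2 ** len(word) if word[0] == "1" else unsigned

words = [to_word(v) for v in range(-4, 4)]
```

Here `words` runs from 100 (the most negative value) up through 111, 000, and on to 011 (the most positive value), and every word that starts with a 1 decodes to a negative number.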
However, there are two issues here that we need to sort out… The first is that, since we have an even number of values, but an odd number of quantisation steps (4 above zero, 4 below zero, and zero = 9 steps) then we had to do something asymmetrical. As you can see in the plot above, there are no numbers assigned to the top quantisation value, which actually means that it doesn’t exist.
So, if we’re still being dumb, then the result of our quantisation will either look like this:
But what happens when you make two mistakes simultaneously? Let’s go back and look at an earlier plot.
Let’s say that you’re writing some DSP code, and you forget about the asymmetry problem, so you scale things so they’ll TRY to look like the plot above.
However, as we already know, that top quantisation value doesn’t exist – but the code will try to put something there. If you’ve forgotten about this, then the system will THINK that you want this:
As you can see there, your code (because you’ve forgotten to write an IF-THEN statement) will think that the top-most positive quantisation value is just the number after 011, which is 100. However, that value means something totally different… So, the result coming out will ACTUALLY look like this:
As you can see there, the signal is very different from what we think it should be.
This error is called a “wrapping” error, because the signal is “wrapped” too far around the circle shown in Figure 5, shown above. It sounds very bad – much worse than “normal” clipping (as shown in Figure 7) because of that huge nearly-instantaneous transition from maximum positive to maximum negative and back.
Of course, the wrapping can also happen in the opposite direction; a negatively-clipped signal can wrap around and show up at the top of the positive values. The reason is the same because the values are trying to go around the same circle.
As I said: this is actually the result of two problems that both have to occur in the same system:
The signal has to be trying to get to a level that is beyond the limits of the quantisation values
Someone forgot to write a line of code that makes sure that, when that happens, the signal is “just” clipped and not wrapped.
So, if the second of these issues is sitting there, unresolved, but the signal never exceeds the limits, then you’ll never have a problem. However, I will never need the airbags in my car, unless I have an accident. So, it’s best to remember to look after that second issue… just in case.
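The difference between the two behaviours is only a line or two of code. The sketch below again treats the quantisation values as the integers -4 to 3 in a 3-bit system; `wrap` and `clip` are hypothetical names, not functions from any particular DSP library.

```python
def wrap(value, bits=3):
    """What you get if nobody checks the limits: the value goes around the circle."""
    span = 2 ** bits
    return (value + span // 2) % span - span // 2

def clip(value, bits=3):
    """The missing IF-THEN: hold the value at the nearest limit instead."""
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lowest, min(highest, value))

too_high = 4     # one step above the highest value that exists (3)
wrapped = wrap(too_high)   # -4: maximum negative
clipped = clip(too_high)   # 3: stays at maximum positive
```

Values that are already within the limits come out of both functions unchanged, which is why the bug can sit unnoticed until the signal gets too loud.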
This method of encoding the quantisation values is called the “Two’s Complement” method. If you want to know more about it, read this.
As I’ve talked about in a previous posting, when a reciprocal peak/dip filter says “Q”, there’s no knowing what it might mean, because there are at least 7 different definitions of Q (3 for boosts and 4 for dips).
For many people, this doesn’t really matter. If you’re just playing with an EQ to make things sound better right now, then the values on the display really don’t matter: it’s the sound that counts.
If, like me, you need to be able to navigate between different pieces of software and hardware and get the same EQ response from them, then you’ll also need to know firstly that you can’t trust the display, and secondly, how to “translate” from device to device when necessary.
For example, take a look at Figure 1
This shows two different magnitude responses. However, these are the measurements of two equalisers with identical settings: Fc = 1 kHz, Gain = +12 dB, Q = 2.
The black curve shows the response of an equaliser that uses the -3 dB points to define the bandwidth of the filter, and therefore the Q is based on 1/(2 zeta). The red curve shows the response of an equaliser that uses the mid-point (in this case, +6 dB because the Gain is +12 dB) to define the bandwidth of the filter.
The difference between these two plots is shown below in Figure 2.
We’d have a similar problem if we were cutting instead of boosting, as shown in Figure 3.
You have to think upside down in this case, because the 1/(2 zeta) filter is actually using the 3 dB UP points to measure bandwidth; but we’ll ignore that and move on.
If you need to translate between the two systems shown above, there’s a pretty easy way to do it.
I’ll assume that you are implementing your filter using the mid-point definition of the bandwidth, so you need to convert into that system rather than out of it. (I’m making this assumption because it’s the one that Robert Bristow-Johnson used in his Audio Cookbook, which was freely copy-and-pasteable, which means that you find it everywhere these days.) Get the parameters from the filter you want to copy.
We’ll call these parameters Fc (for centre frequency, in Hz), (Gain in dB), and . I’m calling it because it’s a Q based on 1/(2 zeta) and we’ll need to keep it separate from our other Q, which I’ll call (for Robert Bristow-Johnson).
Convert the gain into linear.
Then do the following:
ELSE your filter isn’t doing anything because
If you have a -3 dB-based filter with the following parameters:
and you want to implement that using the Bristow-Johnson equations, then you’ll have to use the following parameters:
If you have a -3 dB-based filter with the following parameters:
and you want to implement that using the Bristow-Johnson equations, then you’ll have to use the following parameters:
Two Extra Things…
If the filter that you’re translating FROM is based on Andy Moorer’s design (which is based on the gain mid-point if the gain is within the ±6 dB range, but based on the 3 dB points if it’s outside that), then you’ll have to write your own IF/THEN statements.
If you’re implementing a filter that was specified for RBJ’s equations in a system that’s based on 1/(2 zeta), then you’re probably smart enough to figure out how to do the above in reverse.
One additional addendum
IF you don’t like IF/THEN statements for some reason or another (code optimisation, for example)
THEN you could do it this way instead:
What I’ve done there is to fold the decibel-to-linear conversion into the equation. I’ve also converted the gain in dB to an absolute value before converting to linear. That way, it’s always positive, so you always divide.
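A minimal sketch of the single-equation version follows. It is a reconstruction, not a quotation of the equations in this posting. It assumes that the boost of the 1/(2 zeta) equaliser has the denominator s² + s/Q + 1 and that its cut is the exact reciprocal of its boost, and it uses the peaking-filter prototype from Bristow-Johnson’s cookbook, H(s) = (s² + s(A/Q) + 1) / (s² + s/(AQ) + 1) with A = 10^(gain in dB / 40). The names `q_z` and `q_rbj` are hypothetical labels for the two kinds of Q.

```python
def q_for_rbj(q_z, gain_db):
    """Translate a 1/(2 zeta)-style Q into the Q to give the cookbook equations.

    Assumption: matching the two prototypes gives q_rbj = q_z / A for a boost
    and q_z * A for a cut, where A = 10 ** (gain_db / 40). Taking the absolute
    value of the gain first turns both cases into one division.
    """
    return q_z / (10.0 ** (abs(gain_db) / 40.0))

boost_q = q_for_rbj(2.0, 12.0)    # settings from Figure 1: about 1.0024
cut_q = q_for_rbj(2.0, -12.0)     # same result for the matching cut
```

With a gain of 0 dB the function returns the Q unchanged, which does no harm because the filter isn’t doing anything in that case.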
When it comes to audio, the “signal” is an easy thing to define. It’s what you want to listen to – a song, the dialogue in the movie – whatever it was that you wanted to hear that made you turn on the loudspeaker in the first place.
Let’s say that, normally, we listen to music – so that’s the signal. And, although “music” means different things to different people, most of the time, “music” will contain energy at more than one frequency, and its level will change over time. For example, compare the two plots in Figure 1.
Looking at Figure 1, it seems obvious that the level of “Bird on a Wire” changes over time, but the level of a sine wave doesn’t. However, that’s not as obvious when we zoom into that plot, as is shown in Figure 2, below.
From Figure 2, we can easily establish the obvious fact that “Bird on a Wire” and a sine wave are different. However, now it’s not as obvious that the sine wave has a constant level – it repeats itself periodically – which is why we call it “periodic” – but what is its level?
The simplest way to determine the level of a signal is similar to the way yesterday’s share prices are shown in the financial section of the newspaper. In that case, you are told the highest price and the lowest price for the day. In audio, we sometimes talk about the “peak-to-peak” amplitude of a signal. This is the difference between the highest and the lowest peak (more accurately called a “trough”) of the signal in whatever amount of time you’ve been measuring. For example, take a look at Figure 3.
In Figure 3, I’ve drawn two signals. The top one is a 100 Hz sine wave with a peak-to-peak amplitude of 2 (because the difference between the highest peak (+1) and the lowest peak (-1) is 2). The bottom signal is a 100 Hz sine wave with a peak-to-peak amplitude of 0.1 – but with two clicks – one hitting +1 and the other hitting -1. So, if I just look at the peaks of that second signal, it also has a peak-to-peak amplitude of 2.
So, although it was easy to find the peak-to-peak amplitudes of those two signals, it should be obvious that this does not give a fair indication of how loud they appear to be.
However, if you’re building a piece of audio equipment (like an amplifier or an EQ, for example), this measurement does give you an idea of the “worst case” limits of the signal that might come through the system. So it’s not a useless measurement.
An additional problem with a peak-to-peak measurement of a signal is that it doesn’t tell you anything about asymmetry across the 0 line. (In an analogue world, we’d call that a “DC offset” because there would be a DC voltage that is added to the AC waveform.) For example, both of the signals in Figure 4 have a peak-to-peak amplitude of 1, but they are different…
If you’re lazy, you can do half of a peak-to-peak measurement. This is where you just check the maximum value of either the peak or the trough. We call this a “peak” amplitude measurement.
This has its problems, though. For example, take a look at Figure 5.
Here, we see two signals. The top one is a sine wave. The bottom one was a sine wave until I squished its negative-going half with a cheap compressor. As you can probably see, the top waveform is symmetrical – the negative half of the signal is the same as the positive half of the signal, just upside-down. It is also obvious that the second signal on the lower plot is not symmetrical. Its positive peak is higher than its negative peak.
However, both of these signals have a maximum positive peak of 1 – therefore their peak amplitudes are both 1 (but their peak-to-peak amplitudes and their apparent loudnesses are different).
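Both measurements are one-liners, which is part of their appeal. In the sketch below, the “cheap compressor” is imitated by simply halving the negative half of a sine wave; that factor of 0.5 and the 480 samples per cycle are arbitrary choices for the example, not values read from Figure 5.

```python
import math

def peak_to_peak(samples):
    """Difference between the highest peak and the lowest trough."""
    return max(samples) - min(samples)

def positive_peak(samples):
    """The 'lazy' measurement: only the highest positive value."""
    return max(samples)

sine = [math.sin(2.0 * math.pi * k / 480) for k in range(480)]
# Hypothetical stand-in for the squished signal: negative half scaled by 0.5.
squished = [s if s >= 0.0 else 0.5 * s for s in sine]
```

Both signals report a positive peak of 1, but their peak-to-peak amplitudes are 2 and 1.5, so the peak measurement on its own cannot tell them apart.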
You might think that an easy way around this problem is to look at the absolute value of the signals and find the peaks that way. However, as you can see in Figure 6, in the case of asymmetrical signals, this does not change anything.
Another way to look at the signal is to take an average of the level over time. However, if the signal is symmetrical (like a sine wave, for example) this would not work, since the average will probably be 0. This is because, if the signal is symmetrical, then the average of all of the negative values in the signal (over time) is equal in size, but opposite in sign, to the average of all of the positive values, so the two cancel. So we can’t just use the average of the signal directly… However, with a little extra math, we can do something useful.
I’m going to skip quickly over some old-fashioned math here in order to jump to the punchline which is: “the power in an AC signal (like a sine wave) is proportional to the square of the signal.”
The reason for this can be explained by combining Ohm’s Law and Watt’s Law as follows:
V = IR
where V is electromotive force (or voltage) in volts, I is current in amperes, and R is the resistance in ohms.
P = VI
where P is the power in watts, and V and I are the same as above.
If we fiddle with Ohm’s Law like this:
V = IR
I = V/R
Then we can replace the “I” in Watt’s Law like this
P = VI
P = V * V/R
P = V² / R
So, with that last equation, we can see that the Power (in watts) is proportional to the square of the Voltage (in volts). So, if you double the voltage, you get 4 times the power (because 2² = 4).
We could do the same thing for current, as follows:
P = VI
P = IR * I
P = I² R
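The two results can be checked against each other with a few lines of Python. The 8 ohm resistance is a hypothetical value chosen only to have something to plug in.

```python
def power_from_voltage(volts, ohms):
    """P = V squared divided by R."""
    return volts ** 2 / ohms

def power_from_current(amps, ohms):
    """P = I squared times R."""
    return amps ** 2 * ohms

RESISTANCE = 8.0                                   # hypothetical load, in ohms
one_volt = power_from_voltage(1.0, RESISTANCE)     # 0.125 W
two_volts = power_from_voltage(2.0, RESISTANCE)    # 0.5 W, four times as much
# Ohm's Law gives the current for the 2 V case, and the power agrees.
same_power = power_from_current(2.0 / RESISTANCE, RESISTANCE)
```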
So what? Well, one thing this tells us is that power rises much faster than amplitude. If you double the amplitude of the signal feeding a given resistance (its voltage or, equivalently, its current) then you get 4 times the power (for example, from a loudspeaker’s output or the heat from a hair dryer). Turned around, if you want double the power, you only need about 1.41 times (the square root of 2) the amplitude.
Now, let’s come back to the problem at hand… What’s the level of the signal? Well, we start by taking our signal and find its equivalent power (by squaring its instantaneous amplitude value over time – so, for example, if it’s a digital signal, we take the value of each sample and multiply it by itself). Part of the effect of this squaring of the signal is that it removes the negative portion of the signal (because a negative number multiplied by a negative number is a positive number).
We then take a slice of time, and average all of the values that we just created by squaring the original values. Now we have the average (or “mean”) power in the signal.
However, we’re not interested in the power of the signal, we’re interested in its “average” amplitude (say, its voltage). So, to get back from power, we take the square root of the average that we just calculated.
By doing all of this, we are finding the Root of the Mean of the Square of the voltage – the RMS level.
If we apply this math to a sine wave, the result will be something like what’s shown in Figure 6.
In Figure 6, the black curve is the original sine wave with a frequency of 100 Hz and a peak amplitude of 1.0 (and no DC offset). The red curve shows the result of squaring all the values in the sine wave (which is why it’s called a “sine squared” wave or sin^2 wave). If we find the average of all of the values in the red curve, the result would be 0.5. The square root of 0.5 is approximately 0.707 – which is shown as the blue line in the plot.
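The square, mean and root steps can be sketched in a few lines of Python. This is only an illustration: the 48 kHz sampling rate and the function name rms are assumptions, and the sine wave is the 100 Hz, peak-amplitude-1.0 example described above, measured over exactly one period.

```python
import math


def rms(samples):
    """Root of the Mean of the Squares of the sample values."""
    squared = [x * x for x in samples]          # "S": square each sample
    mean_square = sum(squared) / len(squared)   # "M": average the squares
    return math.sqrt(mean_square)               # "R": square root of the mean


fs = 48000                    # assumed sampling rate in Hz
freq = 100.0                  # sine wave frequency in Hz
one_period = int(fs / freq)   # 480 samples

sine = [math.sin(2 * math.pi * freq * i / fs) for i in range(one_period)]

mean_square = sum(x * x for x in sine) / len(sine)
sine_rms = rms(sine)

print(mean_square, sine_rms)
```

The mean of the squared values lands on 0.5 and the RMS value on roughly 0.707, matching the red curve's average and the blue line.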
So, the RMS value of a sine wave with a peak value of 1 is 0.707. What does this mean? The easiest way to think of this is that if you had an old-fashioned incandescent light bulb and you powered it with a 1 Vp (1 Volt Peak) AC voltage sine wave, it would be exactly the same brightness as if you connected it to a 0.707 V DC battery instead. If you wanted to use a battery to power your toaster, and you wanted it to make toast just as quickly as it normally does, then the battery will have to have a voltage that is 0.707 * the peak value of the AC voltage that normally feeds it. (Note that, if you live in North America, then the electrical signal feeding your toaster is nominally 120 V RMS – so you’ll need a 120 V battery. If you live in Europe, then your toaster is fed with nominally 230 V RMS – so you’ll need a 230 V battery. If you live somewhere else, you might need something else… Note that the electrical company has already done the RMS calculation for you…)
So, an RMS measurement of an AC signal tells us what DC value would result in the same power consumption.
There is just one problem: part of the RMS calculation is the “M” part – we are finding the mean of the values over some period of time. The length of time that we’re going to use is easy to choose if it’s a sine wave – we just make sure that the length of time (we call it a “time constant”) is at least as long as one period of the sine wave itself. If it’s smaller, then the RMS value will bob up and down as the sine wave goes up and down.
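To see the “bobbing” for yourself, here is a rough Python sketch that measures the RMS level of a 100 Hz sine wave in consecutive blocks. Using non-overlapping blocks as the “time constant” is an assumption made to keep the code short (a real RMS detector may well use a sliding window or some other kind of averaging), as are the 48 kHz sampling rate and the function name.

```python
import math


def block_rms(samples, window):
    """RMS level of each consecutive, non-overlapping block of `window` samples."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        block = samples[start:start + window]
        levels.append(math.sqrt(sum(x * x for x in block) / window))
    return levels


fs = 48000                # assumed sampling rate in Hz
freq = 100.0              # sine wave frequency in Hz
period = int(fs / freq)   # 480 samples in one period

# 100 ms of a sine wave with a peak amplitude of 1.0
sine = [math.sin(2 * math.pi * freq * i / fs) for i in range(fs // 10)]

short_levels = block_rms(sine, period // 8)   # window of 1/8 of a period
long_levels = block_rms(sine, period)         # window of exactly one period

short_spread = max(short_levels) - min(short_levels)
long_spread = max(long_levels) - min(long_levels)

print(short_spread, long_spread)
```

With the short window, the measured level swings widely from block to block. With a window of one full period, every block gives the same answer.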
However, if we’re going to try to use the RMS method to find the level of a music signal, we’re going to have to make some tough choices… For example, let’s find the RMS value of the “Bird on a Wire” sample, using different time constants, shown below.
If we convert the plot in Figure 7 to a decibel representation by taking 20*log10 of each sample value, we get the plots in Figure 8. (Note that this is not the same as dB FS, since we are not comparing the result to the RMS value of a full-scale sine wave… but that’s a topic for another posting.)
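The conversion itself is a one-liner. In the sketch below, the function name and the idea of returning a fixed floor value for a level of zero (where the logarithm is undefined) are assumptions, not something taken from the plots.

```python
import math


def to_db(level, floor_db=-120.0):
    """Convert a linear level to decibels using 20 * log10(level).

    A level of zero (or below) has no logarithm, so it is reported as
    floor_db instead. That floor value is an arbitrary assumption.
    """
    if level <= 0.0:
        return floor_db
    return 20.0 * math.log10(level)


for level in (1.0, 0.707, 0.5, 0.1):
    print(level, round(to_db(level), 2))
```

A level of 1.0 gives 0 dB, halving the level drops the result by about 6 dB, and a level of 0.1 gives -20 dB.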
There are some things that are evident in Figure 8. The most obvious one is that there is a link between the RMS time constant and the variability of the RMS level when the signal that you’re analysing is not periodic. Looking at this short 200 ms-long example from Bird on a Wire, with the four time constants that I used, the range of results is as shown below in Figure 9.
[Figure 9: table listing the range of RMS levels for each RMS time constant]
Of course, it’s important to remember that if I had picked a different signal or different RMS time constants, I would have gotten different results.
The question to ask here is:
“If I want to know the level of that 200 ms slice of Bird on a Wire, which RMS time constant should I use?”
Or, put another way: “Which of those four plots tells me the signal’s level?”
The answer is that none of these is correct – or all of them are, even though they show different things. The problem is that music has such a wide frequency range – from 20 Hz to 20,000 Hz. Therefore, if you choose a time constant that is long enough to give you a stable measurement at 20 Hz (which will be at least 50 ms – 1/20th of a second or one period of a 20 Hz wave), then it will be the length of 1000 periods of the 20 kHz portion of the signal.
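The arithmetic behind that mismatch is small enough to check directly. The function name here is just an illustrative assumption.

```python
def periods_in_window(window_seconds, frequency_hz):
    """Number of periods of a tone at frequency_hz that fit in the window."""
    return window_seconds * frequency_hz


window = 1.0 / 20.0   # 50 ms: one period of a 20 Hz wave

low = periods_in_window(window, 20.0)       # periods of the 20 Hz component
high = periods_in_window(window, 20000.0)   # periods of the 20 kHz component

print(low, high)
```

So a window that only just covers one cycle at the bottom of the audio band averages over a thousand cycles at the top.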
Of course, you could argue that you care more about the 20 Hz part than the 20,000 Hz part – but that’s dependent on what you’re doing. If you’re measuring the signal that’s being sent to a tweeter, then you’re probably not interested in what’s going on at 20 Hz at all…
We’re heading (in a future posting) towards talking about measuring a system’s (or a device’s) ability to deliver a wide range of signal levels. We’re going to talk about its “signal to noise ratio” which is a measurement of how much louder the signal (the music) is than the noise that the system itself generates. The idea in the design of all audio systems is that you want to make that ratio as big as possible so that you cannot hear the noise because it’s so much quieter than the signal.
The problem is that we’re going to have to measure how loud the signal can be – and compare that to how loud the signal actually is at any given moment. In order to understand that discussion, it’s necessary to understand the concepts that I introduced above, namely the following:
the relationship between RMS Time constant and the RMS level
This is the start of what will be a series of posts that are an attempt to answer a question about the pros and cons of implementing a volume control in the digital domain. When I first thought about how to answer this question, I thought I could do it in a couple of sentences – but the more I thought about it, the more I realised that the answer is complicated…
There’s no doubt in my mind that I’m making this answer more complicated than necessary, but, as Carl Sagan once said, “If you wish to make an apple pie from scratch, you must first invent the universe.”
So, to begin, we have to define what “noise” is from the point of view of audio engineering.
On the one hand, we can define it simply. “Noise” is a random signal. We can be more accurate and say that this means that the amplitude of a noise signal cannot be predicted using a knowledge of what has come before in time.
If I flip a coin, it will be either heads or tails. I can’t predict this. It will be random. If I flip it 100 times, and, by some strange coincidence, I get 100 “tails”, there is still a 50% chance of getting a tails on the 101st flip. What has happened before can, in no way, be used to predict what is about to happen.
Of course, what is about to happen on the 101st flip has a limited number of possible outcomes. I cannot flip the coin and get “dog” as a result… (this sounds silly, but it will come in handy later…) Just like I cannot roll two dice and get a 13…
In LPCM digital audio, a noise signal is one where each individual sample in the signal has a random value that is in no way related to any of the previous samples. Its range (the set of possible values from which we can pick our random number) may be limited (depending on the specific characteristics of the noise signal and what may have come before), but it will be random.
Typically, when you are talking to someone in audio about noise, they describe it using a colour as the first descriptor. So, you’ll hear of “white noise” and “pink noise”, as the two most popular examples. For the purposes of this series of postings, we’ll only be talking about white noise. So, what is this?
One definition that you’ll see thrown around a lot says something like “white noise is a random signal that has equal energy per linear bandwidth” or “… equal energy per hertz” or “…equal intensity at different frequencies” or something like this. These descriptions are sort of true if you don’t want to get into temporal details, which, unfortunately, is exactly where we’re headed…
The good thing about those definitions is that they describe a general characteristic of white noise. If you take a white noise signal, and you measure the intensity of (or the energy in) the signal for a given bandwidth (say, a bandwidth of 100 Hz ranging from 200 Hz to 300 Hz) then it will be the same in another frequency range with the same bandwidth (say, a bandwidth of 100 Hz ranging from 1,000 Hz to 1,100 Hz). Note that these two bandwidths are the same in hertz – not in a multiplier like octaves or semitones or decades. So, if you have white noise that has a total bandwidth of 0 Hz to 20,000 Hz, then you will have the same amount of energy in the 0 – to – 10,000 Hz band as you will in the 10,000 – to – 20,000 Hz band. In other words (to us humans), there is as much energy in the top octave of the signal as in the rest of the bandwidth combined.
This is why white noise sounds “bright” and “hissy” (similar to the “ss” sound in “hissy”) and not “darker” like the “sh” sound in “ash” (as they incorrectly claim here…). Since white is a “bright” colour, we use the word “white” to describe the frequency-dependent energy distribution of “white” noise.
However, this is not really true. The truth is that a white noise signal has an equal probability per bandwidth of having the same energy level. This little detail is usually left out, partly because it’s complicated, and partly because it doesn’t matter in most cases in the real world. However, in our case, it does.
Let’s look at an example. I made a white noise signal in Matlab using the statement rand(SignalLength, 1) - rand(SignalLength, 1) where SignalLength is the length of the noise signal in samples, and the 1 means that I’m doing this for 1 audio channel… mono is so retro…
You may be wondering why I did a rand() - rand() instead of just a rand(). The simple answer for now is that I wanted to make the signal “balanced” on either side of the zero line and the rand() function in Matlab has a range of 0 to 1.
(I know… I could have done this by saying 2 * (rand(SignalLength, 1) - 0.5) but there is another reason that we’ll get into later…)
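For readers without Matlab, a rough Python equivalent of that statement might look like the sketch below. It uses random.random() from Python's standard library in place of Matlab's rand(). The function name, the seed and the signal length are assumptions for illustration.

```python
import random


def white_noise(length, seed=None):
    """One channel of noise: each sample is the difference of two
    independent uniform random numbers in the range 0 to 1."""
    rng = random.Random(seed)
    return [rng.random() - rng.random() for _ in range(length)]


signal_length = 65536                      # samples (an assumption)
noise = white_noise(signal_length, seed=1)

peak = max(abs(x) for x in noise)          # always below 1.0
mean = sum(noise) / len(noise)             # close to 0: "balanced" around zero

print(peak, mean)
```

The values stay between -1 and 1, and their average sits very close to the zero line.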
I then used a DFT to find the magnitude response of this signal. The result – both the signal and its magnitude response – are shown below in Figure 1.
Some additional information that is really not important: The sampling rate of this signal is 2^16 (65,536 Hz), and I did a 2^16 point DFT, so I have one frequency bin per hertz. (If this last bit of information is confusing, but interesting, please start reading this…)
You may notice that the magnitude is “flat” – meaning that it generally doesn’t slope upwards or downwards. However, you will also notice that it is certainly not “flat” – meaning that it is not a perfectly straight line. In fact, if we zoom in on both plots, we can see Figure 2.
Notice that we do NOT have an equal amount of energy per hertz… if we did, then the bottom plot would be a flat line.
If I do all of that again – make a new noise sample the same way (with a new set of random numbers) and plot the result, and a zoomed in version, I get Figures 3 and 4.
Compare Figures 1 and 3 or Figure 2 and 4. You’ll notice that they have similar characteristics overall – but not only are they NOT identical, they are completely different (on a sample-by-sample or a DFT bin-by-bin comparison).
Let’s say that I run this code and generate a white noise signal 1 second long, and I then calculate the magnitude response of that noise signal and store it. Then, I’ll repeat this, and average the new magnitude response with the first one. Then, I’ll do it again, each time including the new magnitude response in the average of all of the magnitude responses that I’ve calculated so far…
For each 1-second slice of time, the noise signal does not have equal energy per bandwidth – however, it is certainly white noise.
This is because, each time I do this, the average magnitude response will get flatter and flatter… and eventually, after doing this an infinite number of times, it will be a flat line.
This means that white noise will have an equal amount of energy per bandwidth only if I wait long enough. The question is “how long is long enough?” The answer to that question depends on what you’re doing with it.
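The averaging experiment described above can be tried on a small scale in plain Python. This is a sketch under several assumptions: the noise bursts are only 64 samples long and the DFT is a naive (slow) one so that no extra libraries are needed, the 0 Hz and Nyquist bins are left out because a real-valued signal behaves a little differently there, and “flatness” is measured with a made-up helper called relative_spread.

```python
import cmath
import random


def dft_magnitude(samples):
    """Magnitudes of the DFT bins strictly between 0 Hz and Nyquist."""
    n = len(samples)
    mags = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def relative_spread(values):
    """(max - min) / mean. A perfectly flat line would give 0."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 64                   # samples per noise burst (an assumption)
runs = 200               # number of magnitude responses to average

first = None
average = [0.0] * (n // 2 - 1)
for _ in range(runs):
    noise = [rng.random() - rng.random() for _ in range(n)]
    mags = dft_magnitude(noise)
    if first is None:
        first = mags     # keep one single-burst response for comparison
    average = [a + m / runs for a, m in zip(average, mags)]

single_spread = relative_spread(first)
average_spread = relative_spread(average)

print(single_spread, average_spread)
```

The averaged magnitude response has a much smaller spread than any single one, and the spread keeps shrinking as the number of runs goes up.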
Another way to look at this…
In each of the examples above, I made 1 second-long white noise signals and used the entire signal – all 65,536 samples – to calculate the magnitude response.
What happens if I have a one-second long signal, but only a portion of it is a burst of white noise, and the rest is silence? For example, look at Figure 5.
Figure 5’s magnitude response looks similar to the ones we’ve seen before (apart from being a little lower overall than the plots in Figures 1 and 3 – because there’s less energy overall in 0.5 sec of noise than there is in 1 second of noise). I’ll keep going to show what happens if we take this to an extreme.
The magnitude response shown in Figure 7 looks very different from the ones we’ve seen before. It’s much smoother… We’ll keep going…
Figure 8 is very different again… The total magnitude response, even when not “zoomed in” is smooth. It’s important here to note that the actual response that we see there will be different every time I run the random generator again. For example, look at Figure 9, which is also a 16-sample long white noise signal.
If we keep getting shorter and shorter, eventually we’ll get down to a single sample with a random value. However, since it’s a single sample (that is very probably non-zero) in a long string of zeros, then its magnitude response will be completely flat. It will not be noise – it will be an impulse with a random level. And it won’t sound like noise – it will sound like a click.
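This extreme case is easy to verify numerically. The sketch below uses a naive DFT written out in plain Python. The signal length, the position of the sample and its value of -0.37 are arbitrary assumptions.

```python
import cmath


def dft_magnitude(samples):
    """Magnitude of every DFT bin, using a naive O(N^2) DFT."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(samples)))
            for k in range(n)]


n = 32
signal = [0.0] * n
signal[5] = -0.37    # one "random" non-zero sample in a string of zeros

mags = dft_magnitude(signal)

print(min(mags), max(mags))
```

Every bin has the same magnitude, equal to the absolute value of that one sample. Moving the sample to a different position changes the phase of each bin, but not its magnitude.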
There are two basic important things to know at this point.
White noise has the frequency content you expect only if you average over time.
The shorter the time the noise is present, the less energy you will have, overall.
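The second point can be checked by comparing the total energy (here taken as the sum of the squared sample values, which is an assumption about what “energy” means in this context) of a full second of noise against the same signal with its second half replaced by silence. The seed and the 65,536-sample length are also just illustrative choices.

```python
import random


def energy(samples):
    """Sum of the squared sample values."""
    return sum(x * x for x in samples)


rng = random.Random(2)   # fixed seed so the run is repeatable
n = 65536                # one second at a 65,536 Hz sampling rate

full = [rng.random() - rng.random() for _ in range(n)]
half = full[: n // 2] + [0.0] * (n // 2)   # 0.5 s of noise, 0.5 s of silence

ratio = energy(half) / energy(full)

print(ratio)
```

The ratio comes out very close to 0.5: half the duration of noise carries about half the energy.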
Thanks to David for emailing and pointing out that it’s “Hz” and “hertz” but not “Hertz”. I’ve corrected the text above… Being reminded of this reminds me of a Steven Wright joke – “I’m having amnesia and déjà vu at the same time. I think I’ve forgotten this before…”