# “High-Res” Audio: Part 6: Noise, noise, noise

In Part 2 of this series, I talked about the relationship between the noise floor of an LPCM signal and the number of bits used to encode it.

Assuming that the signal is correctly dithered using TPDF dither with a peak-to-peak amplitude of ±1 LSB, then this means that you can easily calculate the dynamic range of your system with a very simple equation:

Dynamic Range in dB = 6.02 * NumberOfBits – 3

(Note that the sampling rate is not part of this equation… That will be useful information later.)

Normally, we’re lazy and we say 6 times the number of bits -3 for the dither – but if you’re really lazy, you leave out the -3 as well.

So, this means that, in a 16-bit system, the noise floor is 93 dB below a sine wave at full scale (6 * 16 -3 = 93) and for a 24-bit system, the noise floor is 141 dB below a sine wave at full scale (you do the math as practice).

Also, we can generalise and say that “adding 1 bit halves the level of the noise floor” (because -6 dB is the same as multiplying by 0.5). However, this is only part of the story.

The noise that’s generated by dither has a “white” characteristic. This means that there is an equal probability of getting some energy per bandwidth (or some say “per Hertz”) over a period of time. This sounds a little complicated, so I’ll explain.

Noise is random. This means that you may or may not get energy at, say 1 kHz, in a given short measurement. However, if you measure white noise for long enough, you’ll eventually see that you got something in every frequency band. Also, you’ll see that, if you look back over the entire length of your measurement of white noise, you got the same amount of energy in the band from 100 Hz to 200 Hz as you did in the band from 1000 Hz to 1200 Hz and the band from 10,000 Hz to 10,200 Hz. (Each of those bandwidths is 200 Hz wide).

There are now two things to discuss:

1. This distribution of energy is not like the way we hear things. We don’t hear the distance between 100 Hz and 200 Hz as the same distance as going from 1,000 Hz to 1,200 Hz. We hear logarithmically, which means that we hear in multiples of frequency, not additions of bandwidth. So, to use 100 Hz – 200 Hz sounds like the same “distance” as 1,000 Hz to 2,000 Hz. This is why white noise sounds like it is “bright” – or it has emphasis on the high frequencies. If you have a system that has a flat response from 0 to 20,000 Hz, and you play white noise through it, you have the same amount of energy in the top octave (10 kHz to 20 kHz) as you do in all of the octaves below – which is why we hear this as “top-heavy”.
2. If you had two bands of white noise with equal levels, and let’s say that one ranges from 100 Hz to 200 Hz, and the other is 1000 Hz to 1200 Hz, then the output level of the two of them together will be 3 dB louder than the output level of either of them alone. This is because their powers add together instead of their amplitudes (because the two signals are unrelated to each other).

Let’s put all this (and one or two other things) together:

1. We know from a previous part in this series that an LPCM digital audio system cannot have signals higher than the Nyquist frequency – 1/2 the sampling rate.
2. TPDF dither is white noise at a total level that is (6.02 * NumberOfBits – 3) dB below full-scale.
3. If you add white noise signals with equal levels but different bandwidths, you get a 3 dB increase over the level of just one of them

This means that,

• if I have a 16-bit, TPDF dithered LPCM audio signal with a sampling rate of 48 kHz, it has a noise floor that is 93 dB below full scale, and that noise has a white characteristic with a bandwidth of 24 kHz (the Nyquist frequency). There will be no noise above that frequency coming out of the system.
• if I have a 16-bit, TPDF dithered LPCM audio signalwith a sampling rate of 192 kHz, it has a noise floor that is 93 dB below full scale, and that noise has a white characteristic with a bandwidth of 96 kHz (the Nyquist frequency). There will be no noise above that frequency coming out of the system.

So, the two systems have the same noise floor level overall, but with very different bandwidths… What does this mean?

Well, let’s start by looking at the level of the noise floor in the 48 kHz system (so the noise “only” extends to 24 kHz).

If I double the sampling rate (to 96 kHz), I double the bandwidth of the noise without changing its level, so this means that the portion of the noise that “lives” in the 0 Hz – 24 kHz region drops by 3 dB (because I’m ignoring the top half of the signal ranging from 24 kHz to 48 kHz in the 96 kHz system.

If I had multiplied the original sampling rate by 4 (to 192 kHz) I multiply the bandwidth of the noise by 4 as well (to 96 kHz). This means that the overall level of the noise from 0 to 24 kHz is now 6 dB down from the original version.

In other words: if I multiply the sampling rate by two, but I don’t increase the bandwidth of the noise floor that I’m interested in (say I only care about 20 Hz – 20 kHz), then its level drops by 3 dB.

## So what?

Well, you could jump to the conclusion that this proves that higher sampling rates are better. However, that would be a bit (ha hah) premature. Consider that, if you want to drop the (band-limited) noise floor by 6 dB, you have to quadruple the sampling rate – and therefore quadrupling the data rate (and therefore the disc storage, the bandwidth of the transmission system, the error rate, and so on…) A 400% increase in the data is not insignificant.

OR, you could just add one more bit – going from 16 bits to 17 bits will give you the same result with a data increase of only 6.25% – a much smarter decision, no?

## The Real World

This little analysis above makes a basic (and possibly incorrect) assumption. The assumption is that, by quadrupling the sampling rate, all other components in the system will remain predictably identical. This may not be true. For example, many DACs (especially older ones) exhibit an increase in their own noise floor when you use them at a higher sampling rate. So, it could be that the benefit you get theoretically is negated by the detriment that you actually get. This is just one example of a flaw in the theory – but it’s a very typical one – especially if you’re building a product instead of just using one.

## P.S.

You may have looked at Figures 1 or 2 and are wondering why, if the noise floor is at -93 dB FS in a 16-bit system, I plotted it around -120 dB FS (give or take). The reason is related to the explanation I just gave above. I said in the captions that it’s from a 96 kHz system. This means that the noise extends to the Nyquist frequency at 48 kHz, and that total level is at -93 dB FS. We also know that, if I keep the noise the same, but half the bandwidth that I’m looking at, the level drops by 3 dB. Therefore I can either do math or I can make the following table:

If you look carefully at the figures, you’ll see that there’s a point every 100 Hz. (It’s most easily visible in the low-frequency range of Figure 2.) So, the level of the noise that I see on a magnitude response plot like this is not only dependent on the noise level itself, but the bandwidths of the divisions that I’ve used to slice it up. In my case, the bandwidth per “slice” is about 100 Hz, so the noise level of each of those little contributors is at about -120 dB FS. If I had used slices only 50 Hz wide, it would show up at -123 instead…

# “High-Res” Audio: Part 5 – Mirrors are bad

Let’s go back to something I said in the last post:

## Mistake #1

I just jumped to at least three conclusions (probably more) that are going to haunt me.

The first was that my “digital audio system” was something like the following:

As you can see there, I took an analogue audio signal, converted it to digital, and then converted it back to analogue. Maybe I transmitted it or stored it in the part that says “digital audio”.

However, the important, and very probably incorrect assumption here is that I did nothing to the signal. No volume control, no bass and treble adjustments… nothing.

If you consider that signal flow from the position of an end-consumer playing a digital recording, this was pretty easy to accomplish in the “old days” when we were all playing CDs. That’s because (in a theoretical, oversimplified world…)

• the output of the mixing/mastering console was analogue
• that analogue signal was converted to digital in the mastering studio
• the resulting bits were put on a disc
• you put that disc in your player which contained a DAC that converted the signal directly to analogue
• you then sent the signal to your “processing” (a.k.a. “volume control”, and maybe some bass and treble adjustment.).

So, that flowchart in Figure 1 was quite often true in 1985.

These days, things are probably VERY different… These days, the signal path probably looks something more like this (note that I’ve highlighted “alterations” or changes in the bits in the audio signal in red):

• The signal was converted from analogue to digital in the studio
(yes, I know… studios often work with digital mixers these days, but at least some of the signals within the mix were analogue to start – unless you are listening to music made exclusively with digital synthesizers)
• The resulting bits were saved on a file
• Depending on the record label, the audio signal was modified to include a “watermark” that can identify it later – in court, when you’ve been accused of theft.
• The file was transferred to a storage device (let’s say “hard drive”) in a large server farm renting out space to your streaming service
• The streaming service encodes the file
• If the streaming service does not offer an lossless option, then the file is converted to a lossy format like MP3, Ogg Vorbis, AAC, or something else.
• If the streaming service offers a lossless option, then the file is compressed using a format like FLAC or ALAC (This is not an alteration, since, with a lossless compression system, you don’t lose anything)
(it might look like an audio player – but that means it’s just a computer that you can’t use to check your social media profile)
• You press play, and the signal is decoded (either from the lossy CODEC or the compression format) back to LPCM. (Still not an alteration. If it’s a lossy CODEC, then the alteration has already happened.)
• That LPCM signal might be sample-rate converted
• The streaming service’s player might do some processing like dynamic range compression or gain changes if you’ve asked it to make all the songs have the same level.
• All of the user-controlled “processing” like volume controls, bass, and treble, are done to the digital signal.
• The signal is sent to the loudspeaker or headphones
• If you’re sending the signal wirelessly to a loudspeaker or headphones, then the signal is probably re-encoded as a lossy CODEC like AAC, aptX, or SBC.
(Yes, there are exceptions with wireless loudspeakers, but they are exceptions.)
• If you’re sending the signal as a digital signal over a wire (like S/PDIF or USB), the you get a bit-for-bit copy at the input of the loudspeaker or headphones.
• The loudspeakers or headphones might sample-rate convert the signal
• The sound is (finally) converted to analogue – either one stream per channel (e.g. “left”) or one stream per loudspeaker driver (e.g. “tweeter”) depending on the product.

So, as you can see in that rather long and complicated list (it looks complicated, but I’ve actually simplified it a little, believe it or not), there’s not much relation to the system you had in 1985.

Let’s take just one of those blocks and see what happens if things go horribly wrong. I’ll take the “volume control” block and add some distortion to see the result with two LPCM systems that have two different sampling rates, one running at 48 kHz and the other at 194 kHz – four times the rate. Both systems are running at 24 bits, with TPDF dither (I won’t explain what that means here). I’ll start by making a 10 kHz tone, and sending it through the system without any intentional distortion. If we look at those two signals in the time domain, they’ll look like this:

The sine tone in the 48 kHz system may look less like a sine tone than the one in the 192 kHz system, however, in this case, appearances are deceiving. The reconstruction filter in the DAC will filter out all the high frequencies that are necessary to reproduce those corners that you see here, so the resulting output will be a sine wave. Trust me.

If we look at the magnitude responses of these two signals, they look like Figure 2, below.

You may be wondering about the “skirts” on either side of the 10 kHz spikes. These are not really in the signal, they’re a side-effect (ha ha) of the windowing process used in the DFT (aka FFT). I will not explain this here – but I did a long series of articles on windowing effects with DFTs, so you can search for it if you’re interested in learning more about this.

If you’re attentive, you’ll notice that both plots extend up to 96 kHz. That’s because the 192 kHz system on the bottom has a Nyquist frequency of 96 kHz, and I want both plots to be on the same scale for reasons that will be obvious soon.

Now I want to make some distortion. In order to make things obvious, I’m going to make a LOT of distortion. I’ve made the sine wave try to have an amplitude that is 10 times higher than my two systems will allow. In other words, my amplitude should be +/- 10, but the signal clips at +/- 1, resulting in something looking very much like a square wave, as shown in Figure 3.

You may already know that if you want to make a square wave by building it up using its constituent harmonics, you need to have the fundamental (which we’ll call Fc. In our case, Fc = 10 kHz) with an amplitude that we’ll say is “A”, you then add the

• 3rd harmonic (3 times Fc, so 30 kHz in our case) with an amplitude of A/3.
• 5th harmonic (5 Fc = 50 kHz) with an amplitude of A/5
• 7 Fc at A/7
• and so on up to infinity

Let’s look at the magnitude responses of the two signals above to see if that’s true.

If we look at the bottom plot first (running at 192 kHz and with a Nyquist limit of 96 kHz) the 10 kHz tone is still there. We can also see the harmonics at 30 kHz, 50 kHz, 70 kHz, and 90 kHz in amongst the mess of other spikes we’ll get to those soon…)

Looking at the top plot (running at 48 kHz and with a Nyquist limit of 24 kHz), we see the 10 kHz tone, but the 30 kHz harmonic is not there – because it can’t be. Signals can’t exist in our system above the Nyquist limit. So, what happens? Think back to the images of the rotating wheel in Part 3. When the wheel was turning more than 1/2 a turn per frame of the movie, it appears to be going backwards at a different speed that can be calculated by subtracting the actual rotation from 180º (half-a-turn).

The same is true when, inside a digital audio signal flow, we try to make a signal that’s higher than Nyquist. The energy exists in there – it just “folds” to another frequency – its “alias”.

We can look at this generally using Figure 6.

Looking at Figure 6: If we make a sine tone that sweeps upward from 0 Hz to the Nyquist frequency at Fs/2 (half the sampling rate or sampling frequency) then the output is the same as the input. However, when the intended frequency goes above Fs/2, the actual frequency that comes out is Fs/2 minus the intended frequency. This creates a “mirror” effect.

If the intended frequency keeps going up above Fs, then the mirroring happens again, and again, and again… This is illustrated in Figure 7.

This plot is shown with linear scales for both the X- and Y-axes to make it easy to understand. If the axes in Figure 7 were scaled to a logarithmic scaling instead (which is how “Frequency Response” are normally shown, since this corresponds to how we hear frequency differences), then it would look like Figure 8.

Coming back to our missing 30 kHz harmonic in the 48 kHz LPCM system: Since 30 kHz is above the Nyquist limit of 24 kHz in that system, it mirrors down to 24 kHz – (30 kHz – 24 kHz) = 18 kHz. The 50 kHz harmonic shows up as an alias at 2 kHz. (follow the red line in Figure 7: A harmonic on the black line at 48 kHz would actually be at 0 Hz on the red line. Then, going 2000 Hz up to 50 kHz would bring the red line up to 2 kHz.)

Similarly, the 110 kHz harmonic in the 192 kHz system will produce an alias at 96 kHz – (110 kHz – 96 kHz) = 82 kHz.

If I then label the first set of aliases in the two systems, we get Figure 9.

Now we have to stop for a while and think about what’s happened.

We had a digital signal that was originally “valid” – meaning that it did not contain any information above the Nyquist frequency, so nothing was aliasing. We then did something to the signal that distorted it inside the digital audio path. This produced harmonics in both cases, however, some of the harmonics that were produced are harmonically related to the original signal (just as they ought to be) and others are not (because they’re aliases of frequency content that cannot be reproduced by the system.

What we have to remember is that, once this happens, that frequency content is all there, in the signal, below the Nyquist frequency. This means that, when we finally send the signal out of the DAC, the low-pass filtering performed by the reconstruction filter will not take care of this. It’s all part of the signal.

So, the question is: which of these two systems will “sound better” (whatever that means)? (I know, I know, I’m asking “which of these two distortions would you prefer?” which is a bit of a weird question…)

This can be answered in two ways that are inter-related.

The first is to ask “how much of the artefact that we’ve generated is harmonically related to the signal (the original sine tone)?” As we can see in Figure 5, the higher the sampling rate, the more artefacts (harmonics) will be preserved at their original intended frequencies. There’s no question that harmonics that are harmonically related to the fundamental will sound “better” than tones that appear to have no frequency relationship to the fundamental. (If I were using a siren instead of a constant sine tone, then aliased harmonics are equally likely to be going down or up when the fundamental frequency goes up… This sounds weird.)

The second is to look at the levels of the enharmonic artefacts (the ones that are not harmonically related to the fundamental). For example, both the 48 kHz and the 192 kHz system have an aliased artefact at 2 kHz, however, its level in the 48 kHz system is 15 dB below the fundamental whereas, in the 192 kHz system, it’s more than 26 dB below. This is because the 6 kHz artefact in the 48 kHz system is an alias of the 30 kHz harmonic, whereas, in the 192 kHz system, it’s an alias of the 190 kHz harmonic, which is much lower in level.

As I said, these two points are inter-related (you might even consider them to be the same point) however, they can be generalised as follows:

The higher the sampling rate, the more the artefacts caused by distortion generated within the system are harmonically related to the signal.

In other words, it gives a manufacturer more “space” to screw things up before they sound bad. The title of this posting is “Mirrors are bad” but maybe it should be “Mirrors are better when they’re further away” instead.

Of course, the distortion that’s actually generated by processing inside a digital audio system (hopefully) won’t be anything like the clipping that I did to the signal. On the other hand, I’ve measured some systems that exhibit exactly this kind of behaviour. I talked about this in another series about Typical Problems in Digital Audio: Aliasing where I showed this measurement of a real device:

However, I’m not here to talk about what you can or can’t hear – that is dependent on too many variables to make it worth even starting to talk about. The point of this series is not to prove that something is better or worse than something else. It’s only to show the advantages and disadvantages of the options so that you can make an informed choice that best suits your requirements.

On to Part 6

# “High-Res” Audio: Part 4 – Know your limits

If you’ve read the three introductory parts of this series, linked above; and if you’re still awake, then we are ready to start putting things together and jumping to incorrect conclusions…

Let’s say that you’ve been hired to specify a digital audio system for some reason (we’ll assume that it’s an LPCM system – nothing exotic). Using the information I’ve told you so far, you can make two decisions in your specification:

You select a bit depth to be enough to give you the dynamic range you desire. In this case, “dynamic range” means the “distance” in level between the loudest sound you can record / store / transmit (I isn’t say what the “digital audio system” was going to be used for) and the inherent noise floor of the system. If you’re recording the background noise on an airplane while it’s in flight, you don’t need a big dynamic range, because it’s always loud, and never changes. However, if you’re recording a Japanese Taiko Drummer group, you’ll need a huge usable dynamic range because the loud parts of the performance are a LOT louder than the quietest parts.

As we saw in Part 3, an LPCM digital audio system cannot record any audio that has a frequency higher than 1/2 the sampling rate. So, you select a sampling rate that is at least 2x the highest frequency you’re interested in. For example, if you believe the books that say you can hear from 20 Hz to 20,000 Hz, then you might decide that your sampling rate has to be at least 40,000 Hz. On the other hand, if you’re making a subwoofer that you know will never be fed a signal above 120 Hz, then you don’t need a sampling rate higher than 240 Hz.

Don’t get angry yet. I’m just keeping these numbers simple to make the math easy. Later on, I’ll explain why what I just said might not be correct.

## Mistake #1

I just jumped to at least three conclusions (probably more) that are going to haunt me.

The first was that my “digital audio system” was something like the following:

As you can see there, I took an analogue audio signal, converted it to digital, and then converted it back to analogue. Maybe I transmitted it or stored it in the part that says “digital audio”.

However, the important, and very probably incorrect assumption here is that I did nothing to the signal. No volume control, no bass and treble adjustments… nothing.

## Mistake #2

We assumed above that we can define the system’s dynamic range based on the dynamic range of the audio signal itself. However, this makes the assumption that the noise floor of the digital system and the noise floor of your audio signal are identical, which is probably not true. As we saw in Part 2, the noise generated by TPDF dither is white – it has the same probability of having a given amount of energy per Hertz. Since we hear sound logarithmically (meaning that, to us, octaves are equal widths. Equal spacings in Hz are not.) This means that the noise sound “bright” to us – because there’s just as much energy in the top octave (say, 10 kHz to 20 kHz, if you believe the books) as there is in all other frequencies combined from 0 Hz up to 10 kHz.

If, however, the noise floor in your concert hall where the taiko drummers are playing is caused by the air conditioning system, then this noise will be a lot louder in the low frequencies than the the highs – which is not the same.

Therefore it’s too simplistic to say “the noise floor of the digital system” and the “noise floor of the signal” – since these two noise floors are different. (As Steven Wright said: “It doesn’t matter what temperature the room is, it’s always room temperature.”)

## Mistake #3

As we’ll see later, if you’re going to do anything to the signal while it’s in the “digital domain”, then you need to take that into consideration when you’re deciding on your sampling rate. It’s not enough to say “useful audio bandwidth times 2” because there are some side effects that need to be remembered…

However, counter-intuitively, it could be that, in order to improve your system, you’ll want to make the sampling rate LOWER instead of HIGHER – so this is not a simple case of “more is better”.

We’ll get to that topic later. For now, I’ll leave you in suspense.

## Some details

One thing we saw in Part 3 was that, if we have an audio signal with energy at a frequency higher than 1/2 the sampling rate, and if that signal gets into the analogue-to-digital converter (ADC), then the output of the ADC will contain an error. We’ll get out energy at frequencies that were not in the original, due to the effect called “aliasing“.

Once that’s in the digital audio signal, there’s no removing it, so we need to make sure that the too-high-frequency signals don’t get into the ADC’s input in the first place. This is done using a low-pass filter that (in theory) removes all energy in the signal above the Nyquist frequency (which is equal to 1/2 the sampling rate). Since that low-pass filter prevents aliasing, we call it an anti-aliasing filter. Normally, these days, that antialiasing filter is built into the ADC itself.

As we also saw in Part 3, the digital-to-analogue converter (DAC) has to smooth out the digital signal to convert it from a “staircase” wave to a smoother one. That’s also done with a low-pass filter that eliminates all the harmonics that would be required to make the staircase have sharp corners. Since this is done to re-construct the analogue signal, it’s called a “reconstruction filter“.

This means that, if we pull apart some of the components in the signal chain I showed in Figure 1, it really looks more like this:

On to Part 5.

# “High-Res” Audio: Part 3 – Frequency Limits

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.

In the first posting in this series, I talked about digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time… In the second posting in the series, we looked at a “trick” for dealing with the issue of quantisation (the fact that we have a limited resolution for measuring the amplitude of the audio signal). This trick is to add dither (a fancy word for “noise”) to the signal before we quantise it in order to randomise the error and turn it into noise instead of distortion.

In this posting, we’ll look at some of the problems incurred by the way we carve up time into discrete moments when we grab those samples.

Let’s make a wheel that has one spoke. We’ll rotate it at some speed, and make a film of it turning. We can define the rotational speed in RPM – rotations per minute, but this is not very useful. In this case, what’s more useful is to measure the wheel rotation speed in degrees per frame of the film.

Take a look at the left-most column in Figure 1. This shows the wheel rotating 45º each frame. If we play back these frames, the wheel will look like it’s rotating 45º per frame. So, the playback of the wheel rotating looks the same as it does in real life.

This is more or less the same for the next two columns, showing rotational speeds of 90º and 135º per frame.

However, things change dramatically when we look at the next column – the wheel rotating at 180º per frame. Think about what this would look like if we played this movie (assuming that the frame rate is pretty fast – fast enough that we don’t see things blinking…) Instead of seeing a rotating wheel with only one spoke, we would see a wheel that’s not rotating – and with two spokes.

This is important, so let’s think about this some more. This means that, because we are cutting time into discrete moments (each frame is a “slice” of time) and at a regular rate (I’m assuming here that the frame rate of the film does not vary), then the movement of the wheel is recorded (since our 1 spoke turns into 2) but the direction of movement does not. (We don’t know whether the wheel is rotating clockwise or counter-clockwise. Both directions of rotation would result in the same film…)

Now, let’s move over one more column – where the wheel is rotating at 225º per frame. In this case, if we look at the film, it appears that the wheel is back to having only one spoke again – but it will appear to be rotating backwards at a rate of 135º per frame. So, although the wheel is rotating clockwise, the film shows it rotating counter-clockwise at a different (slower) speed. This is an effect that you’ve probably seen many times in films and on TV. What may come as a surprise is that this never happens in “real life” unless you’re in a place where the lights are flickering at a constant rate (as in the case of fluorescent or some LED lights, for example).

Again, we have to consider the fact that if the wheel actually were rotating counter-clockwise at 135º per frame, we would get exactly the same thing on the frames of the film as when the wheel if rotating clockwise at 225º per frame. These two events in real life will result in identical photos in the film. This is important – so if it didn’t make sense, read it again.

This means that, if all you know is what’s on the film, you cannot determine whether the wheel was going clockwise at 225º per frame, or counter-clockwise at 135º per frame. Both of these conclusions are valid interpretations of the “data” (the film). (Of course, there are more – the wheel could have rotated clockwise by 360º+225º = 585º or counter-clockwise by 360º+135º = 495º, for example…)

Since these two interpretations of reality are equally valid, we call the one we know is wrong an alias of the correct answer. If I say “The Big Apple”, most people will know that this is the same as saying “New York City” – it’s an alias that can be interpreted to mean the same thing.

We people in audio commit many sins. One of them is that, every time we draw a plot of anything called “audio” we start out by drawing a sine wave. (A similar sin is committed by musicians who, at the first opportunity to play a grand piano, will play a middle-C, as if there were no other notes in the world.) The question is: what, exactly, is a sine wave?

Get a Slinky – or if you don’t want to spend money on a brand name, get a spring. Look at it from one end, and you’ll see that it’s a circle, as can be (sort of) seen in Figure 2.

Since this is a circle, we can put marks on the Slinky at various amounts of rotation, as in Figure 3.

Of course, I could have put the 0º mark anywhere. I could have also rotated counter-clockwise instead of clockwise. But since both of these are arbitrary choices, I’m not going to debate either one.

Now, let’s rotate the Slinky so that we’re looking at from the side. We’ll stretch it out a little too…

Let’s do that some more…

When you do this, and you look at the Slinky directly from one side, you are able to see the vertical change of the spring from the centre as a result of the change in rotation. For example, we can see in Figure 6 that, if you mark the 45º rotation point in this view, the distance from the centre of the spring is 71% of the maximum height of the spring (at 90º).

So what? Well, basically, the “punch line” here is that a sine wave is actually a “side view” of a rotation. So, Figure 7, shows a measurement – a capture – of the amplitude of the signal every 45º.

Since we can now think of a sine wave as a rotation of a circle viewed from the side, it should be just a small leap to see that Figure 7 and the left-most column of Figure 1 are basically identical.

Let’s make audio equivalents of the different columns in Figure 1.

Figure 10 is an important one. Notice that we have a case here where there are exactly 2 samples per period of the cosine wave. This means that our sampling frequency (the number of samples we make per second) is exactly one-half of the frequency of the signal. If the signal gets any higher in frequency than this, then we will be making fewer than 2 samples per period. And, as we saw in Figure 1, this is where things start to go haywire.

Figure 11 shows the equivalent audio case to the “225º per frame” column in Figure 1. When we were talking about rotating wheels, we saw that this resulted in a film that looked like the wheel was rotating backwards at the wrong speed. The audio equivalent of this “wrong speed” is “a different frequency” – the alias of the actual frequency. However, we have to remember that both the correct frequency and the alias are valid answers – so, in fact, both frequencies (or, more accurately, all of the frequencies) exist in the signal.

So, we could take Fig 11, look at the samples (the black lollipops) and figure out what other frequency fits these. That’s shown in Figure 12.

Moving up in frequency one more step, we get to the right-hand column in Figure 1, whose equivalent, including the aliased signal, are shown in Figure 13.

## Do I need to worry yet?

Hopefully, now, you can see that an LPCM system has a limit with respect to the maximum frequency that it can deal with appropriately. Specifically, the signal that you are trying to capture CANNOT exceed one-half of the sampling rate. So, if you are recording a CD, which has a sampling rate of 44,100 samples per second (or 44.1 kHz) then you CANNOT have any audio signals in that system that are higher than 22,050 Hz.

That limit is commonly known as the “Nyquist frequency“, named after Harry Nyquist – one of the persons who figured out that this limit exists.

In theory, this is always true. So, when someone did the recording destined for the CD, they made sure that the signal went through a low-pass filter that eliminated all signals above the Nyquist frequency.

In practice, however, there are many cases where aliasing occurs in digital audio systems because someone wasn’t paying enough attention to what was happening “under the hood” in the signal processing of an audio device. This will come up later.

## Two more details to remember…

There’s an easy way to predict the output of a system that’s suffering from aliasing if your input is sinusoidal (and therefore contains only one frequency). The frequency of the output signal will be the same distance from the Nyquist frequency as the frequency if the input signal. In other words, the Nyquist frequency is like a “mirror” that “reflects” the frequency of the input signal to another frequency below Nyquist.

This can be easily seen in the upper plot of Figure 14. The distance from the Input signal and the Nyquist is the same as the distance between the output signal and the Nyquist.

Also, since that Nyquist frequency acts as a mirror, then the Input and output signal’s frequencies will move in opposite directions (this point will help later).

Usually, frequency-domain plots are done on a logarithmic scale, because this is more intuitive for we humans who hear logarithmically. (For example, we hear two consecutive octaves on a piano as having the same “interval” or “width”. We don’t hear the width of the upper octave as being twice as wide, like a measurement system does. that’s why music notation does not get wider on the top, with a really tall treble clef.) This means that it’s not as obvious that the Nyquist frequency is in the centre of the frequencies of the input signal and its alias below Nyquist.

On to Part 4

# “High-Res” Audio: Part 2 – Resolution

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.

Back to Part 1

In the last posting, I talked about digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time…

For now, we’ll say that each measurement is rounded off to the nearest possible “tick” on the ruler that we’re using to measure the voltage. That rounding results in an error. However, (assuming that everything is working correctly) that error can never be bigger than 1/2 of a “step”. Therefore, in order to reduce the amount of error, we need to increase the number of ticks on the ruler.

Now we have to introduce a new word. If we really had a ruler, we could talk about whether the ticks are 1 mm apart – or 1/16″ – or whatever. We talk about the resolution of the ruler in terms of distance between ticks. However, if we are going to be more general, we can talk about the distance between two ticks being one “quantum” – a fancy word for the smallest step size on the ruler.

So, when you’re “rounding off to the nearest value” you are “quantising” the measurement (or “quantizing” it, if you live in Noah Webster’s country and therefore you harbor the belief that wordz should be spelled like they sound – and therefore the world needz more zees). This also means that the amount of error that you get as a result of that “rounding off” is called “quantisation error“.

In some explanations of this problem, you may read that this error is called “quantisation noise”. However, this isn’t always correct. This is because if something is “noise” then is is random, and therefore impossible to predict. However, that’s not strictly the case for quantisation error. If you know the signal, and you know the quantisation values, then you’ll be able to predict exactly what the error will be. So, although that error might sound like noise, technically speaking, it’s not. This can easily be seen in Figures 1 through 3 which demonstrate that the quantisation error causes a periodic, predictable error (and therefore harmonic distortion), not a random error (and therefore noise).

Sidebar: The reason people call it quantisation noise is that, if the signal is complicated (unlike a sine wave) and high in level relative to the quantisation levels – say a recording of Britney Spears, for example – then the distortion that is generated sounds “random-ish”, which causes people to jump to the conclusion that it’s noise.

Now, let’s talk about perception for a while… We humans are really good at detecting patterns – signals – in an otherwise noisy world. This is just as true with hearing as it is with vision. So, if you have a sound that exists in a truly random background noise, then you can focus on listening to the sound and ignore the noise. For example, if you (like me) are old enough to have used cassette tapes, then you can remember listening to songs with a high background noise (the “tape hiss”) – but it wasn’t too annoying because the hiss was independent of the music, and constant. However, if you, like me, have listened to Bob Marley’s live version of “No Woman No Cry” from the “Legend” album, then you, like me, would miss the the feedback in the PA system at that point in the song when the FoH engineer wasn’t paying enough attention… That noise (the howl of the feedback) is not noise – it’s a signal… Which makes it just as important as the song itself. (I could get into a long boring talk about John Cage at this point, but I’ll try to not get too distracted…)

The problem with the signal in Figure 2 is that the error (shown in Figure 3) is periodic – it’s a signal that demands attention. If the signal that I was sending into the quantisation system (in Figure 1) was a little more complicated than a sine wave – say a sine wave with an amplitude modulation – then the error would be easily “trackable” by anyone who was listening.

So, what we want to do is to quantise the signal (because we’re assuming that we can’t make a better “ruler”) but to make the error random – so it is changed from distortion to noise. We do this by adding noise to the signal before we quantise it. The result of this is that the error will be randomised, and will become independent of the original signal… So, instead of a modulating signal with modulated distortion, we get a modulated signal with constant noise – which is easier for us to ignore. (It has the added benefit of spreading the frequency content of the error over a wide frequency band, rather than being stuck on the harmonics of the original signal… but let’s not talk about that…)

## For example…

Let’s take a look at an example of this from an equivalent world – digital photography.

The photo in Figure 4 is a black and white photo – which actually means that it’s comprised of shades of gray ranging from black all the way to white. The photo has 272,640 individual pixels (because it’s 640 pixels wide and 426 pixels high). Each of those pixels is some shade of gray, but that shading does not have an infinite resolution. There are “only” 256 possible shades of gray available for each pixel.

So, each pixel has a number that can range from 0 (black) up to 255 (white).

If we were to zoom in to the top left corner of the photo and look at the values of the 64 pixels there (an 8×8 pixel square), you’d see that they are:

86 86 90 88 87 87 90 91
86 88 90 90 89 87 90 91
88 89 91 90 89 89 90 94
88 90 91 93 90 90 93 94
89 93 94 94 91 93 94 96
90 93 94 95 94 91 95 96
93 94 97 95 94 95 96 97
93 94 97 97 96 94 97 97

What if we were to reduce the available resolution so that there were fewer shades of gray between white and black? We can take the photo in Figure 1 and round the value in each pixel to the new value. For example, Figure 5 shows an example of the same photo reduced to only 6 levels of gray.

Now, if we look at those same pixels in the upper left corner, we’d see that their values are

102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102

They’ve all been quantised to the nearest available level, which is 102. (Our possible values are restricted to 0, 51, 102, 154, 205, and 255).

So, we can see that, by quantising the gray levels from 256 possible values down to only 6, we lose details in the photo. This should not be a surprise… That loss of detail means that, for example, the gentle transition from lighter to darker gray in the sky in the original is “flattened” to a light spot in a darker background, with a jagged edge at the transition between the two. Also, the details of the wall pillars between the windows are lost.

If we take our original photo and add noise to it – so were adding a random value to the value of each pixel in the original photo (I won’t talk about the range of those random values…) it will look like Figure 6. This photo has all 256 possible values of gray – the same as in Figure 1.

If we then quantise Figure 6 using our 6 possible values of gray, we get Figure 7. Notice that, although we do not have more grays than in Figure 5, we can see things like the gradual shading in the sky and some details in the walls between the tall windows.

That noise that we add to the original signal is called dither – because it is forcing the quantiser to be indecisive about which level to quantise to choose.

I should be clear here and say that dither does not eliminate quantisation error. The purpose of dither is to randomise the error, turning the quantisation error into noise instead of distortion. This makes it (among other things) independent of the signal that you’re listening to, so it’s easier for your brain to separate it from the music, and ignore it.

## Addendum: Binary basics and SNR

We normally write down our numbers using a “base 10” notation. So, when I write down 9374 – I mean
9 x 1000 + 3 x 100 + 7 x 10 + 4 x 1
or
9 x 103 + 3 x 102 + 7 x 101 + 4 x 100

We use base 10 notation – a system based on 10 digits (0 through 9) because we have 10 fingers.

If we only had 2 fingers, we would do things differently… We would only have 2 digits (0 and 1) and we would write down numbers like this:
11101

which would be the same as saying
1 x 16 + 1 x 8 + 1 x 4 + 0 x 2 + 1 x 1
or
1 x 24 + 1 x 23 + 1 x 22 + 0 x 21 + 1 x 20

The details of this are not important – but one small point is. If we’re using a base-10 system and we increase the number by one more digit – say, going from a 3-digit number to a 4-digit number, then we increase the possible number of values we can represent by a factor of 10. (in other words, there are 10 times as many possible values in the number XXXX than in XXX.)

If we’re using a base-2 system and we increase by one extra digit, we increase the number of possible values by a factor of 2. So XXXX has 2 times as many possible values as XXX.

Now, remember that the error that we generate when we quantise is no bigger than 1/2 of a quantisation step, regardless of the number of steps. So, if we double the number of steps (by adding an extra binary digit or bit to the value that we’re storing), then the signal can be twice as “far away” from the quantisation error.

This means that, by adding an extra bit to the stored value, we increase the potential signal-to-error ratio of our LPCM system by a factor of 2 – or 6.02 dB.

So, if we have a 16-bit LPCM signal, then a sine wave at the maximum level that it can be without clipping is about 6 dB/bit * 16 bits – 3 dB = 93 dB louder than the error. The reason we subtract the 3 dB from the value is that the error is +/- 0.5 of a quantisation step (normally called an “LSB” or “Least Significant Bit”).

Note as well that this calculation is just a rule of thumb. It is neither precise nor accurate, since the details of exactly what kind of error we have will have a minor effect on the actual number. However, it will be close enough.

On to Part 3.

# “High-Res” Audio: Part 1

I’ve been debating writing a series of postings about “high resolution” audio for a long time – years. Lately, (probably because of some hype generated by some recent press releases) I’ve been getting lots of question (no, that’s not a typo) about it, so it appears the time has come…

To start: the question that I get (a lot) is “If I can’t hear above 20 kHz, then what’s the use of high-res?” As I’ll explain as we go through, this is only one, rather small aspect to consider in this topic. In fact, it might be the least important issue to consider.

However, before I write too much, I’ll say that I’m not going to argue for or against higher resolutions in digital audio systems. I’m only going to go through a bunch of issues that can be used to argue either for or against them. So, there’s not going to be a big reveal at the end of this series telling you that high-res is either better, worse, or no different than whatever you’re using now. It’s merely going to be a discussion of a number of issues that need to be weighed. The problem is that this entire topic is complicated – and there’s no single “right” answer, as I’ll argue as we go along.

To start, let’s get down to basics and look (once again, from the perspectives of this website) at what sound is, and how it’s converted from an analogue electrical signal into a digital representation. The good thing is that I’ve written this introduction before in a different series of postings. So, I’m going to be extremely lazy and just copy-and-paste that information here. I’m not just referring you to another page because I’m intentionally leaving some things out because we’re headed into having a different discussion this time.

## A quick introduction to sound

At the simplest level, sound can be described as a small change in air pressure (or barometric pressure) over short periods of time. If you’d like to have a better and more edu-tain-y version of this statement with animations and pretty colours, you could take 10 minutes to watch this video, for example.

That change in pressure can be “captured” by using a microphone, that is (at the simplest level) a device that has a change in air pressure at its input and a change in electrical voltage at its output. Ignoring a lot of details, we could say that if you were to plot a measurement of the air pressure (at the input of the microphone) over time, and you were to compare it to a plot of the measurement of the voltage (at the output of the microphone) over time, you would see the same curve on the two graphs. This means that the change in voltage is analogous to the change in air pressure.

At this point in the conversation, I’ll make a point to say that, in theory, we could “zoom in” on either of those two curves shown in Figure 1 and see more and more details. This is like looking at a map of Canada – it has lots of crinkly, jagged lines. If you zoom in and look at  the map of Newfoundland and Labrador, you’ll see that it has finer, crinkly, jagged lines. If you zoom in further, and stand where the water meets the shore in Trepassey and take a photo of your feet, you could copy it to draw a map of the line of where the water comes in around the rocks – and your toes – and you would wind up with even finer, crinkly, jagged lines… You could take this even further and get down to a microscopic or molecular level – but you get the idea… The point is that, in theory, both of the plots in Figure 1 have infinite resolution, both in time and in air pressure or voltage.

Now, let’s say that you wanted to take that microphone’s output and transmit it through a bunch of devices and wires that, in theory, all do nothing to the signal. Let’s say, for example, that you take the mic’s output, send it through a wire to a box that makes the signal twice as loud. Then take the output of that box and send it through a wire to another box that makes it half as loud. You take the output of that box and send it through a wire to a measuring device. What will you see? Unfortunately, none of the wires or boxes in the chain can be perfect, so you’ll probably see the signal plus something else which we’ll call the “error” in the system’s output. We can call it the error because, if we measure the input voltage and the output voltage at any one instant, we’ll probably see that they’re not identical. Since they should be identical, then the system must be making a mistake in transmitting the signal – so it makes errors…

Pedantic Sidebar: Some people will call that error that the system adds to the signal “noise” – but I’m not going to call it that. This is because “noise” is a specific thing – noise is random – so if it’s not random, it’s not noise. Also, although the signal has been distorted (in that the output of the system is not identical to the input) I won’t call it “distortion” either, since distortion is a name that’s given to something that happens to the signal because the signal is there. (We would probably get at least some of the error out of our system even if we didn’t send any audio into it.) So, we could be slightly geeky and adequately vague and call the extra stuff “Distortion plus noise” but not “THD+N” – which stands for “Total Harmonic Distortion Plus Noise” – because not all kinds of distortion will produce a harmonic of the signal… but I’m getting ahead of myself…

So, we want to transmit (or store) the audio signal – but we want to reduce the noise caused by the transmission (or storage) system. One way to do this is to spend more money on your system. Use wires with better shielding, amplifiers with lower noise floors, bigger power supplies so that you don’t come close to their limits, run your magnetic tape twice as fast, and so on and so on. Or, you could convert the analogue signal (remember that it’s analogous to the change in air pressure over time) to one that is represented (and therefore transmitted or stored) digitally instead.

What does this mean?

## Conversion from analogue to digital and back(but skipping important details)

IMPORTANT: If you read this section, then please read the following postings as well. This is because, in order to keep things simple to start, I’m about to leave out some important details that I’ll add afterwards. However, if you don’t add the details, you could (understandably) jump to some incorrect conclusions (that many others before you have concluded…) So, if you don’t have time to read both sections, please don’t read either of them.

In the example above, we made a varying voltage that was analogous to the varying air pressure. If we wanted to store this, we could do it by varying the amount of magnetism on a wire or a coating on a tape, for example. Or we could cut a wiggly groove in a bit of vinyl that has a similar shape to the curve in the plots in Figure 1. Or, we could do something else: we could get a metronome (or a clock) and make a measurement of the voltage every time the metronome clicks, and write down the measurements.

For example, let’s zoom in on the first little bit of the signal in the plots in Figure 1

We’ll then put on a metronome and make a measurement of the voltage every time we hear the metronome click…

We can then keep the measurements (remembering how often we made them…) and write them down like this:

0.3000
0.4950
0.5089
0.3351
0.1116
0.0043
0.0678
0.2081
0.2754
0.2042
0.0730
0.0345
0.1775

We can store this series of numbers on a computer’s hard disk, for example. We can then come back tomorrow, and convert the measurements to voltages. First we read the measurements, and create the appropriate voltage…

We then make a “staircase” waveform by “holding” those voltages until the next value comes in.

All we need to do then is to use a low-pass filter to smooth out the hard edges of the staircase.

So, in this example, we’ve gone from an analogue signal (the red curve in Figure 3) to a digital signal (the series of numbers), and back to an analogue signal (the red curve in Figure 7).

In some ways, this is a bit like the way a movie works. When you watch a movie, you see a series of still photographs, probably taken at a rate of 24 pictures (or frames) per second. If you play those photos back at the same rate (24 fps or frames per second), you think you see movement. However, this is because your eyes and brain aren’t fast enough to see 24 individual photos per second – so you are fooled into thinking that things on the screen are moving.

However, digital audio is slightly different from film in two ways:

• The sound (equivalent to the movement in the film) is actually happening. It’s not a trick that relies on your ears and brain being too slow.
• If, when you were filming the movie, something were to happen between frames (say, the flash of a gunshot, for example) then it would never be caught on film. This is because the photos are discrete moments in time – and what happens between them is lost. However, if something were to make a very, very short sound between two samples (two measurements) in the digital audio signal – it would not be lost. This is because of something that happens at the beginning of the chain that I haven’t described… yet…

However, there are some “artefacts” (a fancy term for “weird errors”) that are present both in film and in digital audio that we should talk about.

The first is an error that happens when you mess around with the rate at which you take the measurements (called the “sampling rate”) or the photos (called the “frame rate”) – and, more importantly, when you need to worry about this. Let’s say that you make a film at 24 fps. If you play this back at a higher frame rate, then things will move very quickly (like old-fashioned baseball movies…). If you play them back at a lower frame rate, then things move in slow motion. So, for things to look “normal” you have to play the movie at the same rate that it was filmed. However, as long as no one is looking, you can transfer the movie as fast as you like. For example, if you wanted to copy the film, you could set up a movie camera so it was pointing at a movie screen and film the film. As long as the movie on the screen is running in sync with the camera, you can do this at any frame rate you like. But you’ll have to watch the copy at the same frame rate as the original film… (Note that this issue is not something that will come up in this series of postings about high resolution audio)

The second is an easy artefact to recognise. If you see a car accelerating from 0 to something fast on film, you’ll see the wheels of the car start to get faster and faster, then, as the car gets faster, the wheels slow down, stop, and then start going backwards… This does not happen in real life (unless you’re in a place lit with flashing lights like fluorescent bulbs or LED’s). I’ll do a posting explaining why this happens – but the thing to remember here is that the speed of the wheel rotation that you see on the film (the one that’s actually captured by the filming…) is not the real rotational speed of the wheel. However, those two rotational speeds are related to each other (and to the frame rate of the film). If you change the real rotational rate or the frame rate, you’ll change the rotational rate in the film. So, we call this effect “aliasing” because it’s a false version (an alias) of the real thing – but it’s always the same alias (assuming you repeat the conditions…) Digital audio can also suffer from aliasing, but in this case, you put in one frequency (which is actually the same as a rotational speed) and you get out another one. This is not the same as harmonic distortion, since the frequency that you get out is due to a relationship between the original frequency and the sampling rate, so the result is almost never a multiple of the input frequency. (We’re going to dig into this a lot deeper through this series of postings about high resolution audio, so if it doesn’t immediately make sense, don’t worry…)

## Some important details that I left out…

One of the things I said above was something like “we measure the voltage and store the results” and the example I gave was a nice series of numbers that only had 4 digits after the decimal point. This statement has some implications that we need to discuss.

Let’s say that I have a thing that I need to measure. For example, Figure 8 shows a piece of metal, and I want to measure its width.

Using my ruler, I can see that this piece of metal is about 57 mm wide. However, if I were geeky (and I am) I would say that this is not precise enough – and therefore it’s not accurate. The problem is that my ruler is only graduated in millimetres. So, if I try to measure anything that is not exactly an integer number of mm long, I’ll either have to guess (and be wrong) or round the measurement to the nearest millimetre (and be wrong).

So, if I wanted you to make a piece of metal the same width as my piece of metal, and I used the ruler in Figure 8, we would probably wind up with metal pieces of two different widths. In order to make this better, we need a better ruler – like the one in Figure 9.

Figure 9 shows a vernier caliper (a fancy type of ruler) being used to measure the same piece of metal. The caliper has a resolution of 0.05 mm instead of the 1 mm available on the ruler in Figure 8. So, we can make a much more accurate measurement of the metal because we have a measuring device with a higher precision.

The conversion of a digital audio signal is the same. As I said above, we measure the voltage of the electrical signal, and transmit (or store) the measurement. The question is: how accurate and precise is your measurement? As we saw above, this is (partly) determined by how many digits are in the number that you use when you “write down” the measurement.

Since the voltage measurements in digital audio are recorded in binary rather than decimal (we use 0 and 1 to write down the number instead of 0 up to 9) then we use Binary digITS – or “bits” instead of decimal digits (which are not called “dits”). The number of bits we have in the number that we write down (partly) determines the precision of the measurement of the voltage – and therefore (possibly), our accuracy…

Just like the example of the ruler in Figure 8, above, we have a limited resolution in our measurement. For example, if we had only 4 bits to work with then the waveform in 4 – the one we have to measure – would be measured with the “ruler” shown on the left side of Figure 10, below.

When we do this, we have to round off the value to the nearest “tick” on our ruler, as shown in Figure 11.

Using this “ruler” which gives a write-down-able “quantity” to the measurement, we get the following values for the red staircase:

0010
0100
0100
0011
0001
0000
0001
0010
0010
0010
0001
0000
0001

When we “play these back” we get the staircase again, shown in Figure 12.

Of course, this means that, by rounding off the values, we have introduced an error in the system (just like the measurement in Figure 8 has a bigger error than the one in Figure 9). We can calculate this error if we just subtract the original signal from the output signal (in other words, Figure 12 minus Figure 10) to get Figure 13.

In order to improve our accuracy of the measurement, we have to increase the precision of the values. We can do this by adding an extra digit (or bit) to the number that we use to record the value.

If we were using decimal numbers (0-9) then adding an extra digit to the number would give us 10 times as many possibilities. (For example, if we were using 4 digits after the decimal in the example at the start of this posting, we have a total of 10,000 possible values – 0.0000 to 0.9999. If we add one more digit, we increase the resolution to 100,000 possible values – 0.00000 to 0.99999 ).

In binary, adding one extra digit gives us twice as many “ticks” on the ruler. So, using 4 bits gives us 16 possible values. Increasing to 5 bits gives us 32 possible values.

If you’re listening to a CD, then the individual measurements of each voltage – the “sample values” – are stored with 16 bits, which means that we have 65,536 possible values to pick from.

Remember that this means that we have more “ticks” on our ruler – but we don’t necessarily increase its range. So, for example, we’re still measuring a voltage from -1 V to 1 V – we just have more and more resolution with which we can do that measurement.

On to Part 2…

# What’s a woofer?

#82 in a series of articles about the technology behind Bang & Olufsen loudspeakers

Occasionally, a question that comes into the customer communications department to Bang & Olufsen from a dealer or a customer eventually finds its way into my inbox.

This week, the question was about nomenclature. Why is it that, on some loudspeakers, for example, we say there is a tweeter, mid-range, and woofer, whereas on other loudspeakers we say that we’re using a “full range” driver instead? What’s the difference? (Folded into the same question was another about amplifier power, but I’ll take that one in another posting.)

So, what IS the difference? There are three different ways to answer this question.

## Answer #1: It’s how you use it.

My Honda Civic, the motorcycle that passed me on the highway this morning, and an F1 car all have a gear in the gearbox that’s labelled “3”. However, the gear ratio of those three examples of “third gear” are all different. In other words, if you showed a mechanic the gear ratio of one of those gearbox settings without knowing anything else, they wouldn’t be able to tell you “ah! that’s third gear…”

So, in this example, “third gear” is called “third” only because it’s the one between “second” and “fourth”. There is nothing physical about it that makes it “third”. If that were the case then my car wouldn’t have a first gear, because some farm tractor out there in the world would have a gear with a lower ratio – and an F1 car would start at gear 100 or so… And that wouldn’t make sense.

Similarly, we use the words “tweeter”, “midrange”, “woofer”, “subwoofer”, and “full range” to indicate the frequency range that that particular driver is looking after in this particular device. My laptop has a 1″ “woofer” – which only means that it’s the driver that’s taking care of the low frequencies that come out of my laptop.

So, using this explanation, the Beolab 90 webpage says that it has midranges and tweeters and no “full range” drivers because the midrange drivers look after the midrange frequencies, and the tweeters look after the high frequencies. However, the Beolab 28’s webpage says that it has a tweeter and full range drivers, but no midranges. This is because the drivers that play the midrange frequencies in the Beolab 28 also play some of the high-frequency content as part of the Beam Width control. Since they’re doing “double duty”, they get a different name.

The description I gave above isn’t really an adequate answer. For example, I said that my laptop has a 1″ “woofer”. Beolab 90 has a 1″ “tweeter” – but these two drivers are not designed the same way. Beolab 90’s tweeter is specifically designed to be used to produce high frequencies. One consequence of this is that the total mass of the moving parts (the diaphragm and voice coil, amongst other things) is as low as possible, so that it’s easy to move. This means that it can produce high frequency signals without having to use a lot of electrical power to push it back and forth.

However, the 1″ “woofer” in my laptop is designed differently. It probably has a much higher total mass for the moving parts. This means that its resonant frequency (the frequency that it would “ring” at if you hit it like a drum head) is much lower. Therefore it “wants” to move easily at a lower frequency than a tweeter would.

For example, if you put a child on a swing and you give them a push, they’ll swing back and forth at some frequency. If the child wanted to swing SLOWER (at a lower frequency), you could

• move to a swing with longer ropes so this happens naturally, or
• you can hold on to the ropes and use your muscles to control the rate of swinging instead.

The smarter thing to do is the first choice, that way you can keep sipping your coffee instead of getting a workout.

So, a 1″ woofer and a 1″ tweeter are not really the same thing.

We live in a world that has been convinced by advertisers that “compromise” is a bad thing – but it’s not. Someone who does never accepts to compromise is destined to live a very lonely life. When designing a loudspeaker, one of the things to consider is what, exactly, each component will be asked to do, and choose the appropriate components accordingly.

If we’re going to be really pedantic – there’s really no such thing as a tweeter, woofer, or anything else with those kinds of names. Any loudspeaker driver can produce sound at any frequency. The only difference between them is the relative ease with which the driver plays a signal at a given frequency. You can get 20 Hz to come out of a “tweeter” – it will just be naturally a LOT quieter than the signals at around 5 kHz. Similarly, a woofer can play signals at 20 kHz, but it will be a lot quieter and/or take a lot more power than signals at 50 Hz.

What this means is that, when you make an active loudspeaker, the response (the relative levels of signals at different frequencies) is really a result of the filters in the digital signal processing and the control from the amplifier (ignoring the realities of heat and time…). If we want more or less level at 2 kHz from a loudspeaker driver, we “just” change the filter in the signal processing and use the amplifier to do the work (the same as the example above where you were using your muscle power to control the frequency of the child on the swing).

However, there are examples where we know that a driver will be primarily used for one frequency band, but actually be extending into another. The side-facing drivers on Beolab 28 are a good example of this. They’re primarily being used to control the beam width in the midrange, but they’re also helping to control the beam width in the high frequencies. Since, they’re doing double-duty in two frequency ranges, they can’t really be called “midranges” or “tweeters” – they’d be more accurately called “midranges that also play as quiet tweeters”. (They don’t have to play high frequencies loudly, since this is “only” to control the beam width of the front tweeter.) However, “midranges that also play as quiet tweeters” is just too much information for a simple datasheet – so “full range” will do as a compromise.

## P.S.

I’ve got some extra things to add here…

Firstly, it has become common over the past couple of years to call “woofers” “subwoofers” instead. I don’t know why this happened – but I suspect that it’s merely the result of people who write advertising copy using a word they’ve heard before without really knowing what it means. Personally, I think that it’s funny to see a laptop specified to have a “1” subwoofer”. Maybe we should make the word “subtweeter” popular instead.

Secondly, personally, I believe that a “subwoofer” is a thing that looks after the frequency range below a “woofer”. I remember a conversation I had at an AES convention once (I think it was with Günther Theile and Tomlinson Holman) where we all agreed that a “subwoofer” should look after the frequency range up to 40 Hz, which is where a decent woofer should take over.

Lastly, if you find an audio magazine from the 1970s, you’ll see that a three-way loudspeaker had a “tweeter”, “squawker”, and “woofer”. Sometime between then and now, “squawker” was replaced with “midrange” – but I wonder why the other two didn’t change to “highrange” and “lowrange” (although neither of these would be correct, since all three drivers in a three-way system have limited frequency ranges).

# That’s a wrap…

I spent some time this week helping to track down the source of an error in a digital audio signal flow chain, and we wound up having a discussion that I thought might be worth repeating here.

Let’s start at the very beginning.

Let’s take an analogue audio signal and convert it to a Linear Pulse Code Modulation (LPCM) representation in the dumbest possible way.

In order to save this signal as a string of numerical values, we have to first accept the fact that we don’t have an infinite number of numbers to use. So, we have to round off the signal to the nearest usable value or “quantisation value”. This process of rounding the value is called “quantisation”.

Let’s say for now that our available quantisation values are the ones shown on the grid. If we then take our original sine wave and round it to those values, we get the result shown below.

Of course, I’m leaving out a lot of important details here like anti-aliasing filtering and dither (I said that we were going to be dumb…) but those things don’t matter for this discussion.

So far so good. However, we have to be a bit more specific: an LPCM system encodes the values using binary representations of the values. So, a quantisation value of “0.25”, as shown above isn’t helpful. So, let’s make a “baby” LPCM system with only 3 bits (meaning that we have three Binary digITs available to represent our values).

To start, let’s count using a 3-bit system:

and that’s as far as we can go before needing 4 bits. However, for now, that’s enough.

Take a look at our signal. It ranges from -1 to 1 and 0 is in the middle. So, if we say that the “0” in our original signal is encoded as “000” in our 3-bit system, then we just count upwards from there as follows:

Now what? Well, let’s look at this a little differently. If we were to divide a circle into the same number of quantisation values, make the “12:00” position = 000, and count clockwise, it would look like this:

The question now is “how do we number the negative values?” but the answer is already in the circle shown above… If I make it a little more obvious, then the answer is shown below.

If we use the convention shown above, and represent that on the graph of our audio signal, then it looks like this:

One nice thing about this way of doing things is that you just need to look at the first digit in the binary word to know whether the value is positive or negative. A 0 means it’s positive, and a 1 means it’s negative.

However, there are two issues here that we need to sort out… The first is that, since we have an even number of values, but an odd number of quantisation steps (4 above zero, 4 below zero, and zero = 9 steps) then we had to do something asymmetrical. As you can see in the plot above, there are no numbers assigned to the top quantisation value, which actually means that it doesn’t exist.

So, if we’re still being dumb, then the result of our quantisation will either look like this:

or this:

## Wrapping up…

But what happens when you make two mistakes simultaneously? Let’s go back and look at an earlier plot.

Let’s say that you’re writing some DSP code, and you forget about the asymmetry problem, so you scale things so they’ll TRY to look like the plot above.

However, as we already know, that top quantisation value doesn’t exist – but the code will try to put something there. If you’ve forgotten about this, then the system will THINK that you want this:

As you can see there, your code (because you’ve forgotten to write an IF-THEN statement) will think that the top-most positive quantisation value is just the number after 011, which is 100. However, that value means something totally different… So, the result coming out will ACTUALLY look like this:

As you can see there, the signal is very different from what we think it should be.

This error is called a “wrapping” error, because the signal is “wrapped” too far around the circle shown in Figure 5, shown above. It sounds very bad – much worse than “normal” clipping (as shown in Figure 7) because of that huge nearly-instantaneous transition from maximum positive to maximum negative and back.

Of course, the wrapping can also happen in the opposite direction; a negatively-clipped signal can wrap around and show up at the top of the positive values. The reason is the same because the values are trying to go around the same circle.

As I said: this is actually the result of two problems that both have to occur in the same system:

• The signal has to be trying to get to a level that is beyond the limits of the quantisation values
• Someone forgot to write a line of code that makes sure that, when that happens, the signal is “just” clipped and not wrapped.

So, if the second of these issues is sitting there, unresolved, but the signal never exceeds the limits, then you’ll never have a problem. However, I will never need the airbags in my car, unless I have an accident. So, it’s best to remember to look after that second issue… just in case.

## P.S.

This method of encoding the quantisation values is called the “Two’s Complement” method. If you want to know more about it, read this.

# Translating Q to Q

As I’ve talked about in a previous posting, when a reciprocal peak/dip filter says “Q”, there’s no knowing what it might mean, because there are at least 7 different definitions of Q (3 for boosts and 4 for dips).

For many people, this doesn’t really matter. If you’re just playing with an EQ to make things sound better right now, then the values on the display really don’t matter: it’s the sound that counts.

If you’re like me, you need to be able to navigate between different pieces of software and hardware, and to get the same EQ response from them, then you’ll also need to know firstly that you can’t trust the display, and secondly, how to “translate” from device to device when necessary.

For example, take a look at Figure 1

This shows two magnitude responses, however, these are the measurements of two equalisers with identical settings:
Fc = 1 kHz, Gain = +12 dB, Q = 2.

The black curve shows the response of an equaliser that uses the -3 dB points to define the bandwidth of the filter, and therefore the Q is based on 1/(2 zeta). The red curve shows the response of an equaliser that uses the mid-point (in this case, +6 dB because the Gain is +12 dB) to define the bandwidth of the filter.

The difference between these two plots is shown below in Figure 2.

We’d have a similar problem if we were cutting instead of boosting, as shown in Figure 3.

You have to think upside down in this case, because the 1/(2 zeta) filter is actually using the 3 dB UP points to measure bandwidth; but we’ll ignore that and move on.

If you need to translate between the two systems shown above, there’s a pretty easy way to do it.

I’ll assume that you are implementing your filter using the mid-point definition of the bandwidth, so you need to convert into that system rather than out of it. (I’m making this assumption because it’s the one that Robert Bristow-Johnson used in his Audio Cookbook, which was freely copy-and-pasteable, which means that you find it everywhere these days.) Get the parameters from the filter you want to copy.

We’ll call these parameters Fc (for centre frequency, in Hz), (Gain in dB), and . I’m calling it because it’s a Q based on 1/(2 zeta) and we’ll need to keep it separate from our other Q, which I’ll call (for Robert Bristow-Johnson).

Convert the gain into linear.

Then do the following:

IF

ELSEIF

ELSE
your filter isn’t doing anything because

END

## Example 1

If you have a -3 dB-based filter with the following parameters:

and you want to implement that using the Bristow-Johnson equations, then you’ll have to use the following parameters:

## Example 2

If you have a -3 dB-based filter with the following parameters:

and you want to implement that using the Bristow-Johnson equations, then you’ll have to use the following parameters:

## Two Extra Things…

If the filter that you’re translating FROM is based on Andy Moorer’s design (which is based on the gain mid-point if the gain is within the ±6 dB range, but based on the 3 dB points if it’s outside that), then you’ll have to write your own IF/THEN statements.

If you’re implementing a filter that was specified for RBJ’s equations in a system that’s based on 1/(2 zeta), then you’re probably smart enough to figure out how to do the above in reverse.