I was working on the sound design of a loudspeaker last week with some new people and software – so we had to get some definitions straight before we messed things up by thinking that we were using the same words to mean the same thing. I’ve made a similar mistake to this before, as I’ve written about here – and I don’t like being reminded of my own stupidity repeatedly… (Or, as Steven Wright once said “I’m having amnesia and déjà vu at the same time – I think I’ve forgotten this before…”)
So, in this case, we were talking about the lowly 2nd-order low-pass filter, based on a single biquad.
If you read about how to find the cutoff frequency of a low-pass filter, you’ll probably learn that it’s the frequency where the power gain is one half of that in the passband of the filter’s response. Since 10*log10(0.5) = -3.01 dB, this is also called the “3 dB down point” of the filter.
In my case, when I’m implementing a filter, I use the math provided by Robert Bristow-Johnson to calculate my biquad coefficients. You input a cutoff frequency (Fc), and a Q value, and (for a given sampling rate) you get your biquad coefficients.
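For reference, here’s a minimal Python sketch of the cookbook’s low-pass case (the function name and structure are mine; the formulas are straight from the cookbook):

```python
import math

def rbj_lowpass(fc, q, fs):
    """2nd-order low-pass biquad coefficients, per the RBJ Audio EQ Cookbook."""
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    b0 = (1 - cos_w0) / 2
    b1 = 1 - cos_w0
    b2 = (1 - cos_w0) / 2
    a0 = 1 + alpha
    a1 = -2 * cos_w0
    a2 = 1 - alpha
    # Normalise by a0 so the coefficients can be dropped straight into
    # y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    return (b0 / a0, b1 / a0, b2 / a0), (a1 / a0, a2 / a0)

b, a = rbj_lowpass(fc=1000, q=1 / math.sqrt(2), fs=96000)
```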
The question then, is: is the desired cutoff frequency the actual measurable cutoff frequency of the system? (Let’s assume for the purposes of this discussion that there are no other components in the system that affect the magnitude response – just to keep it simple.)
The simple answer is: No.
For example, if I make a 2nd-order low pass filter with a desired cutoff frequency of 1 kHz (using a high enough sampling rate to not introduce any errors due to the bilinear transform) and I vary the Q from something very small (in this example, 0.1) to something pretty big (in this example, 20) I get magnitude response curves that look like the figure below.
It is probably already evident from the 25 filter responses plotted above that they do not all cross each other at the 1 kHz line. In addition, you may notice that only one of those curves is at -3.01 dB at 1 kHz: the one where Q = 1/sqrt(2), or about 0.707.
This raises the question: what is the gain of each of those filters at the desired value of Fc (in this case, 1 kHz)? This is plotted as the red line in the figure below.
This plot also shows the maximum gain of the filters for different values of Q. Notice that, for low Q values, the maximum gain is 0 dB, since those low-pass filters only roll off. However, for Q values higher than 1/sqrt(2), there is an overshoot in the response, resulting in a boost at some frequency. As the Q increases, the frequency at which the gain of the filter is highest approaches the desired cutoff frequency. (As can be seen in the plot above, by the time you get to a Q of 20, the gain at Fc and the maximum gain of the filter are the same.)
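For those who prefer equations to plots: for the analogue prototype of this filter (and, to a very good approximation, the digital version when the sampling rate is much higher than Fc), the gain at Fc is exactly Q, and the peak for Q > 1/sqrt(2) has a closed form. A small sketch under those assumptions:

```python
import math

def gain_at_fc_db(q):
    # |H| at f = Fc is exactly Q for the analogue 2nd-order low-pass prototype
    return 20 * math.log10(q)

def peak_gain_db(q):
    # For Q <= 1/sqrt(2) the response only rolls off, so the maximum is 0 dB.
    if q <= 1 / math.sqrt(2):
        return 0.0
    # Otherwise the peak sits at Fc * sqrt(1 - 1/(2Q^2)), with this gain:
    return 20 * math.log10(q / math.sqrt(1 - 1 / (4 * q * q)))

print(gain_at_fc_db(1 / math.sqrt(2)))  # ~ -3.01 dB
print(peak_gain_db(20))                 # ~ +26 dB, essentially equal to gain_at_fc_db(20)
```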
It may be intuitively interesting (or interestingly intuitive) to note that, when Q goes to infinity, the gain at Fc also goes to infinity, and (relatively speaking) all other frequencies are infinitely attenuated – so you have a sine wave generator.
So, we know that, for all but one value of Q, the gain at the stated Fc is not -3 dB. What, then, is the -3 dB point, if we state a desired Fc of 1 kHz and we vary the Q? This is shown in the figure below.
So, varying the Q from 0.1 to 20 varies the actual Fc (or, at least, the -3 dB point) from about 104 Hz to about 1554 Hz.
Or, if we plot the same information as a function (or just a multiple) of the desired Fc, you get the plot below.
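The -3 dB frequency can also be found in closed form for the analogue prototype, by solving for the frequency where the squared magnitude drops to one half. A sketch, under the same assumptions as above (the answers land within a few Hertz of the measured values, the remainder coming from the bilinear transform in the digital implementation):

```python
import math

def minus_3db_frequency(fc, q):
    # Solve |H|^2 = 1/2 for the analogue 2nd-order low-pass prototype.
    # With u = (f / fc)^2, this becomes u^2 + u*(1/Q^2 - 2) - 1 = 0.
    c = 2 - 1 / (q * q)
    u = (c + math.sqrt(c * c + 4)) / 2
    return fc * math.sqrt(u)

print(minus_3db_frequency(1000, 0.1))               # ~ 101 Hz
print(minus_3db_frequency(1000, 1 / math.sqrt(2)))  # exactly 1000 Hz
print(minus_3db_frequency(1000, 20))                # ~ 1553 Hz
```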
So, if you’re sitting in a meeting, and the person in front of you is looking at a measurement of a loudspeaker magnitude response, and they say “could you please put in a low pass filter with a cutoff frequency of 1 kHz and a Q of 0.5”, you should start asking questions about what, exactly, they mean by “cutoff frequency”… If not, you might just wind up with nice-looking numbers but strange-sounding loudspeakers.
Although I am guessing, I don’t think that it is crazy to say that the majority of digital audio systems today employ some kind of sampling rate conversion somewhere in the signal flow.
A sampling rate converter is a physical device or a processing block in some software that takes an audio signal that has been sampled at one rate (say, 44.1 kHz) and converts it to an audio signal at another rate (say, 48 kHz).
There are many reasons why you might want to do this. For example, if you have a device that has equalisation (filtering), then if you change the sampling rate, you will have to load new coefficients into the filters. If you have a LOT of filters, then it might take so much time to load them into the system that you’ll miss the first second or two of a song if it’s at a different sampling rate than the previous song. So, instead of doing this, you keep your processing at one constant (or ‘fixed’) sampling rate, and convert the input to that rate. This might even be true in the case where the incoming sampling rate is the same as the internal sampling rate. For example, you might be “sample rate converting” from 48 kHz to 48 kHz – just to keep the design of the system clocking constant.
Looking very broadly, there are two options for sampling rate conversion.
Synchronous Sampling Rate Conversion
Let’s say that you have to convert from 48 kHz to 96 kHz – a multiplication of 2. In this simple case, you could take the incoming samples, and insert a new, extra one mid-way between each of them. The value of the new sample depends on how you are doing the math to calculate it. We will not discuss this here. The important thing about this concept is that the timing of the output is “locked” to the input. In this example, every second sample of the output happens at exactly the same time as every sample at the input. This can also be true if the ratio of the sampling rates is not “nicely” related, like a 2:1 ratio. For example, if you have an input at 44.1 kHz and an output at 48 kHz, you could take the incoming 44.1 kHz signal, insert 47999 “virtual” samples between each of the original samples (making the new sampling rate 2116800000 Hz) and then pull an output sample from that stream every 44100 samples.
In other words:
(44100 * 48000) / 44100 = 48000
Of course, this is not a smart way to do this (because it will be a huge waste of processing power and memory – and imagine how big the numbers would be if you’re converting 176.4 kHz to 192 kHz… bigger!), but it would work, as long as the “virtual” samples you create at the very high “virtual” sampling rate have the correct values.
This type of sampling rate conversion, where the output is numerically “locked” to the input in time (meaning that, at some regular interval of time, the input and the output samples will happen simultaneously – or at least with a constant delay) is called synchronous sampling rate conversion. It’s called that because the input and the output are synchronised with each other… A bit like gears meshing together.
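In practice, nobody materialises that gigantic virtual stream. The ratio is reduced to its smallest terms (160:147 for 44.1 kHz to 48 kHz) and a polyphase filter computes only the output samples that are actually needed. A sketch of the same conversion using SciPy’s polyphase resampler:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

fs_in, fs_out = 44100, 48000
g = gcd(fs_in, fs_out)              # 300
up, down = fs_out // g, fs_in // g  # 160 and 147

t = np.arange(fs_in) / fs_in        # one second of input
x = np.sin(2 * np.pi * 1000 * t)    # 1 kHz tone, sampled at 44.1 kHz

# Conceptually: upsample by 160, low-pass filter, take every 147th sample.
# resample_poly does this without ever storing the high-rate stream.
y = resample_poly(x, up, down)      # 48000 output samples
```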
Asynchronous Sampling Rate Conversion
There is another way to do this, where we do not lock the output clock to the input clock. Let’s say that you want to build a device that has a constant sampling rate at its output, but you don’t really know what the sampling rate of the input is. In this case you will use an asynchronous sampling rate converter – so-called because there is no fixed lock between the input and output clocks.
In this case, the incoming signal is analysed and its sampling rate is measured. The way this is done is a little similar to the method shown above. You take the clock running at the rate of the output’s signal and multiply that by some value (say 512, for example) to create an internal “virtual” clock running at a higher sampling rate. You then “grab” the value of an incoming sample and apply its value to the “virtual” sample that is closest in time. This allows the incoming samples to drift in time relative to the output samples.
In both cases, there is the open question of how you generate the signal at the higher internal sampling rate. This can be done using a kind of low pass filter that is effectively similar to the reconstruction filter in a DAC. I will not talk about this any more than that – other than to say that the response characteristics of that filter are VERY important… So, if you’re planning on building your own sampling rate converter, read a lot more stuff on the subject than what I’ve written here – because what I’ve written here is most certainly not enough information.
There’s one strange effect that pops up here. Since, in an ASRC (Asynchronous Sampling Rate Converter), the incoming signal is sampled at discrete times that are numerically related to the output sampling rate, any potential jitter in the system is also quantised in time. So, for example, if your output sampling rate is 48000 samples per second, and you’re creating the internal sampling rate by multiplying that by 512, then any jitter in the ASRC cannot have a value less than 1/(48000*512) second = 4.069*10^-8, or 40.69 nanoseconds. In other words, in such a system, the error caused by jitter will be 0, ±40.69 nanoseconds, ±81.38 nanoseconds, and so on. It can’t be something in between… (assuming that the output clock is perfect. If it’s drifting due to jitter, then those values will also drift…)
The good news is that, if the clock that is used for the ASRC’s output sampling rate is very accurate and stable, and if the filtering that is applied to the incoming signal is well-done, then an ASRC can behave very, very well – and there are lots of examples of this. (Sadly, there are many more examples where an ASRC is implemented poorly. This is why many people think that sampling rate converters are bad – because most sampling rate converters are bad.) In fact, a correctly-made sampling rate converter can be used to reduce jitter in a system (so you might even want to use it in cases where the incoming and outgoing sampling rates are the same). This is why some DACs include an ASRC at the input – to reduce jitter originating at the signal source.
Wrapping up Part 8: The take-home messages for these three parts in Section 8 are:
Sampling Jitter results in some kind of distortion of the signal that can be related to the signal itself
Sampling Jitter can occur in the ADC, the DAC, or an ASRC
If implemented correctly, an ASRC can be used to attenuate jitter in a system
Once introduced to the signal, jitter cannot be attenuated. So, if you have a recording that was made using an ADC with a lot of jitter, the artefacts caused by that jitter are in the recorded signal forever. If you have a DAC that has absolutely no jitter whatsoever (this is not possible), then this will not eliminate the jitter that is already in the signal. Of course, it won’t make the situation worse… but it won’t make it better.
Addendum. If you want to dig further into the world of Sampling Jitter and the advantages of using ASRC’s to attenuate jitter, I highly recommend the following as a good starting point:
Julian Dunn, “Jitter Theory” – Technical Note TN-23, Audio Precision. This is a chapter in his book “Measurement Techniques for Digital Audio”, published by Audio Precision. See this link for more info.
Robert W. Adams, “Clock Jitter, D/A Converters, and Sample-Rate Conversion”, The Audio Critic, Issue No. 21
Steven Harris, “The Effects of Sampling Clock Jitter on Nyquist Sampling Analog-to-Digital Converters and on Oversampling Delta Sigma ADCs”, AES Preprint #2844 (87th AES Convention, October 1989)
Robert Adams, “Jitter Analysis of Asynchronous Sample-rate Conversion”, AES Preprint #3712 (95th AES Convention, October 1993)
In the previous post we looked at the effect of an incoming analogue signal that is sampled at the wrong times. In that description, I implied that the playback of the samples would happen at exactly the correct times. So, the jitter was entirely at the ADC (analogue-to-digital converter) and nowhere else.
In this posting, we’ll look at a very similar issue – jitter in the DAC (digital-to-analogue converter).
Jitter in the Digital to Analogue conversion
Let’s assume that we have a signal (in our case, a sinusoidal waveform, since that’s easy to plot) that was sampled by an ADC with no jitter. So, our original signal looks like Figure 1.
That signal is sampled by the ADC at exactly the correct times, since it has no jitter. The result of this is shown below in Figure 2.
When the time comes to play this signal, we send those samples to the DAC in the correct order and hope that it converts each of them to an analogue voltage at exactly the correct times. If the sampling rate of the system is 96 kHz, then we hope that the DAC converts a sample every 1/96000th of a second, at exactly the right time each time.
The time that the DAC spits out the sample is dictated by a clock somewhere in the system. It might be an internal clock, or it might come from an external device, depending on your system and how it’s being used. However, if that clock is inaccurate for some reason, or if there is some kind of noise infecting the connection between the clock and the DAC, then the DAC can be triggered to convert a sample at the incorrect time. This is sampling jitter in the digital to analogue conversion process. I’ve tried to illustrate this in Figure 3.
It may not be immediately obvious, but the sample values in Figure 3 are identical to those in Figure 2. What I’ve done is to move them in time, so that you’re getting exactly the right level output at the wrong time each time. Of course, I have heavily exaggerated this plot to make it obvious that the times between consecutive samples are not equal. Some are much shorter than the sampling period (e.g. between samples 3 and 4) and some are much longer (e.g. between samples 9 and 10).
Just like the case of ADC jitter, we can analyse this simply as an amplitude error. In other words, as a result of the timing errors, the red circles are not sitting directly on the original gray signal. And, just like we saw in the case of the ADC jitter, the amount of amplitude error is proportional to the slope of the signal.
Addendum: It’s important to remember that the descriptions and the plots that I’m showing here are only to help show what jitter is – and those plots are heavily exaggerated. I’m not showing what the final result will be. The actual jitter in a system is much, much lower than anything I’ve shown here. Also, I’ve completely omitted the effects of the anti-aliasing filter and the reconstruction filter – just to keep things simple.
Ignoring most of the details, converting an analogue audio signal into a digital one is much like filming a movie. The signal (a continuous change in voltage) is measured (or sampled) at a regular rate (the sampling rate), and those measurements are stored for future use. This is called Analogue-to-Digital Conversion.
In the future, you take those samples, and you convert them back to voltages at the same sampling rate (in the same way that you play a film at the same frame rate that you used to record it). This is called Digital-to-Analogue Conversion.
However, we’re not here to talk about conversion – we’re here to talk about jitter in the conversion process.
As we’ve already seen, jitter (and wander) is an error in the timing of a clock event. So, let’s look at this effect as part of the sampling process. To start: jitter in the analogue to digital conversion.
Jitter in the Analogue to Digital conversion
Let’s say that we want to convert an analogue sinusoidal wave into a PCM digital version.
Note that I’m going to skip a bunch of steps in the following explanation – concentrating only on the parts that are important for our discussion of jitter.
We start with a wave that has theoretically infinite resolution in amplitude and time, and we divide time into discrete moments, represented by the numbered vertical lines in the plot below.
Every time the clock “ticks” (in other words, on each of those vertical lines), we measure the voltage of the signal. These discrete measurements are represented in Figure 2 as the circles, sitting on the original waveform (in gray).
Part of this system relies on the accuracy of the clock that’s used to tell the sampling system when to do the measurements. In a perfect world, a system with a sampling rate of 44.1 kHz would make a measurement of the incoming analogue wave exactly every 1/44100th of a second. The time between samples would never vary.
This, of course, is impossible. The clock that ticks at the sampling rate will have some error in time – albeit a very, very small error.
Let’s heavily exaggerate this error so that we can see the resulting effect. Figure 3 shows the same original analogue sinusoidal waveform, sampled (measured) at incorrect times. In other words, sometimes the measurement (represented by the red circles) is made slightly too early (to the left of the gray vertical line – as is the case for Sample #9), sometimes, it’s made too late (to the right of the line – as in Sample #2).
For example, look at the sample that should occur at clock tick #2. I’ve zoomed in to the plot so that this can be seen more clearly in Figure 4.
Notice that, because the measurement was made at the wrong time (in the case of sample #2, somewhat late), the result is an error in the measurement of the waveform’s amplitude. So, an error in time produces an error in level.
Let’s assume that the measurements we made in Figure 3 are stored and then replayed at exactly the correct times – what will the result be? This is shown in Figure 5. As you can see there, by comparing the measurements we made in Figure 3 to the original waveform, the result is a distortion of the waveform.
The time-based errors in the measurements in Figure 3 result (in this example) in a system that contains amplitude-based errors at the output. This results in some kind of distortion of the signal, as can be seen here.
As you can see in Figure 5, the result is a signal that is not a sine wave. Even after this digital signal has been low-pass filtered by the reconstruction filter in the Digital-to-Analogue Converter (the DAC), it will not be a clean sine wave. But let’s think about exactly what can go wrong here, more carefully.
For starters, an error that is ONLY caused by timing errors in the sampling process cannot produce levels that are outside the amplitude range of the original signal. In other words, if our original signal was 1 V Peak and symmetrical, then the sampled waveform will not exceed this. This is because the samples are all real measurements of the signal – merely performed at the incorrect times.
Secondly, if the amount of jitter is kept constant, then the amount of amplitude error will modulate (or vary) with the slope of the signal. This is illustrated in Figure 6, below.
Another way to consider this is that, given a constant amount of jitter, the amplitude error (and therefore the distortion that is generated) modulates with the signal, since it is proportional to the slope of the signal. Since the maximum slope of the signal increases with amplitude and with frequency, jitter artefacts will also increase as a result of an increase in the signal level or its frequency.
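To put a number on that: for a sine wave with amplitude A and frequency f, the maximum slope is 2*pi*f*A, so the worst-case amplitude error for a timing error of delta-t is approximately 2*pi*f*A*delta-t. For example (my numbers, purely for illustration), 1 ns of jitter on a full-scale 10 kHz tone gives an error of roughly -84 dB FS:

```python
import math

A, f, dt = 1.0, 10e3, 1e-9              # amplitude, frequency (Hz), jitter (s)
worst_error = 2 * math.pi * f * A * dt  # slope * timing error ~= 6.3e-5
print(20 * math.log10(worst_error))     # ~ -84 dB FS
```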
Thirdly, (and this one may be obvious): in an LPCM system, there are no jitter artefacts if there is no signal. If the input signal is constantly 0, then it doesn’t matter when you measure it… (Note that I said “in an LPCM system” in that sentence – if it’s a Delta-Sigma (1-bit) converter, then this is not true.)
There is one more thing to consider – although, given the level of jitter in real-life systems these days, this one is more of a thought experiment than anything else. Take a look back at Figure 3 – specifically, the samples that should have been taken at times 11 and 12. In a 44.1 kHz system, those two samples would have been 1/44100th of a second apart. However, as you can see there, the time between those two samples is less than 1/44100th of a second. If the sampling period is reduced, then the momentary sampling rate must be higher than 44.1 kHz. This means that, ignoring everything else, the Nyquist frequency of the system is momentarily raised, allowing content above the intended Nyquist into the captured signal… However, as I said, this is merely an interesting thing to think about. Find something else to feed your free-floating anxiety that keeps you up at night – this issue is not worth a wink’s worth of lost sleep…
One extra thing to note here: if you look at Figure 3, you see a signal that has artefacts caused by jitter. Simply stated, this means that there are errors in the recorded signal. The way I’ve plotted this in Figure 3, those can be considered to be amplitude errors when played through a system without jitter. In other words, if you have a signal with jitter artefacts, you cannot remove them by using a system that has no jitter. The best you can do is to not add more jitter…
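Before moving on: here’s a toy simulation of everything above – a sine wave measured at jittered times, with the samples then treated as if they had been taken at the correct times. The amount of jitter is absurdly exaggerated (up to ±10% of the sampling period) just to make the error obvious:

```python
import numpy as np

fs = 96000                      # sampling rate
f0 = 1000.0                     # test tone frequency, Hz
n = np.arange(fs)               # one second of sample indices
rng = np.random.default_rng(0)

# Wildly exaggerated jitter: each sampling instant is off by up to
# +/- 10 % of the sampling period (real converters are vastly better).
t_jittered = n / fs + rng.uniform(-0.1, 0.1, n.size) / fs

# The ADC measures the true analogue signal at the wrong times...
captured = np.sin(2 * np.pi * f0 * t_jittered)

# ...but playback assumes the samples were taken at exactly the right
# times, so the timing error shows up as an amplitude error:
ideal = np.sin(2 * np.pi * f0 * n / fs)
print("worst-case error:", np.max(np.abs(captured - ideal)))
```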
Addendum: This description of jitter artefacts as an amplitude distortion is only one way to look at the problem – using what is called the “Time-Domain Model”. Instead, you could use the “Frequency-Domain Model”, which I will not discuss here. If you’d like to dive into this further, Julian Dunn’s paper called “Jitter Theory” – Technical Note TN-23 from Audio Precision is the best place to start. This is a chapter in his book called “Measurement Techniques for Digital Audio”, published by Audio Precision. See this link for more info.
This “series” of postings was intended to describe some of the errors that I commonly see when I measure and evaluate digital audio systems. All of the examples I’ve shown are taken from measurements of commercially-available hardware and software – they’re not “beta” versions that are in development.
There are some reasons why I wrote this series that I’d like to make reasonably explicit.
Many of the errors that I’ve described here are significant – but will, in some cases, not be detected by “typical” audio measurements such as frequency response or SNR measurements.
For example, the small clicks caused by skip/insert artefacts will not show up in an SNR or a THD+N measurement due to the fact that the artefacts are so small with respect to the signal. This does not mean that they are not audible. Play a midrange sine tone (say, in the 2-3 kHz region… nothing too annoying) and listen for clicks.
As another example, the drifting time clock problems described here are not evident as jitter or sampling rate errors at the digital output of the device. These are caused by clocking problems inside the signal path. So, a simple measurement of the digital output carrier will not, in any way, reveal the significance of the problem inside the system.
Aliasing artefacts (described here) may not show up in a THD measurement (since aliasing artefacts are not harmonic). They will show up as part of the Noise in a THD+N measurement, but they certainly do not sound like noise, since they are weirdly correlated with the signal. Therefore you cannot sweep them under the rug as “noise”…
Some of the problems with some systems only exist with some combinations of file format / sampling rate / bit depth, as I showed here. So, for example, if you read a test of a streaming system that says “I checked the device/system using a 44.1 kHz, 16-bit WAV file, and found that its output is bit-perfect”, then this is probably true. However, there is no guarantee whatsoever that this “bit-perfect-ness” will hold for all other sampling rates, bit depths, and file formats.
Sometimes, if you test a system, it will behave for a while, and then not behave. As we saw in Figure 10 of this posting, the first skip-insert error happened exactly 10 seconds after the file started playing. So, if you do a quick sweep that only lasts for 9.5 seconds you’ll think that this system is “bit-perfect” – which is true most of the time – but not all of the time…
Unfortunately, the only thing that I have concluded after having done lots of measurements of lots of systems is that, unless you do a full set of measurements on a given system, you don’t really know how it behaves. And, it might not behave the same tomorrow because something in the chain might have had a software update overnight.
However, there are two more things that I’d like to point out (which I’ve already mentioned in one of the postings).
Firstly, just because a system has a digital input (or source, say, a file) and a digital output does not guarantee that it’s perfect. These days the weakest links in a digital audio signal path are typically in the signal processing software or the clocking of the devices in the audio chain.
Secondly, if you do have a digital audio system or device, and something sounds weird, there’s probably no need to look for the most complicated solution to the problem. Typically, the problem is in a poor implementation of an algorithm somewhere in the system. In other words, there’s no point in arguing over whether your DAC has a 120 dB or a 123 dB SNR if you have a sampling rate converter upstream that is generating aliasing at -60 dB… Don’t spend money “upgrading” your mains cables if your real problem is that audio samples are being left out every half second because your source and your receiver can’t agree on how fast their clocks should run.
So, the bad news is that trying to keep track of all of this is complicated at best. More likely impossible.
On the other hand, if you do have a system that you’re happy with, it’s best to not read anything I wrote and just keep listening to your music…
As a setup for this posting, I have to start with some background information…
Back when I was doing my Bachelor of Music degree, I used to make some pocket money playing background music for things like wedding receptions. One of the good things about playing such a gig was that, for the most part, no one was listening to you… You’re just filling in as part of the background noise. So, as the evening went on, and I grew more and more tired, I would change to simpler and simpler arrangements of the tunes. Leaving some notes out meant I didn’t have to think as quickly, and, since no one was really listening, I could get away with it.
If you watch the short video above, you’ll hear the same composition played 3 times (the 4th is just a copy of the first, for comparison). The first arrangement contains a total of 71 notes, as shown below.
The second arrangement uses only 38 notes, as you can see in Figure 2, below.
The third arrangement uses even fewer notes – a total of only 27 notes, shown in Figure 3, below.
The point of this story is that, in all three arrangements, the piece of music is easily recognisable. And, if it’s late in the night and you’ve had too much to drink at the wedding reception, I’d probably get away with not playing the full arrangement without you even noticing the difference…
A psychoacoustic CODEC (COmpression-DECompression) algorithm works in a very similar way. I’ll explain…
If you do an “audiometry test”, you’ll be put in a very, very quiet room and given a pair of headphones and a button. In an adjacent room is a person who sees a light when you press the button and controls a tone generator. You’ll be told that you’ll hear a tone in one ear from the headphones, and when you do, you should push the button. When you do this, the tone will get quieter, and you’ll push the button again. This will happen over and over until you can’t hear the tone. This is repeated in your two ears at different frequencies (and, of course, the whole thing is randomised so that you can’t predict a response…)
If you do this test, and if you have textbook-quality hearing, then you’ll find out that your threshold of hearing is different at different frequencies. In fact, a plot of the quietest tones you can hear at different frequencies will look something like that shown in Figure 4.
This red curve shows a typical curve for a threshold of hearing. Any frequency that was played at a level that would be below this red curve would not be audible. Note that the threshold is very different at different frequencies.
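If you’d like to play with this curve yourself: a commonly-used analytic approximation of the threshold in quiet is Terhardt’s formula, which is the same one used in MPEG-style psychoacoustic models. A sketch:

```python
import numpy as np

def threshold_in_quiet_db_spl(f_hz):
    """Terhardt's approximation of the threshold of hearing, in dB SPL."""
    f = np.asarray(f_hz, dtype=float) / 1000.0   # work in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Lowest around 3-4 kHz, rising steeply at both ends of the audio band:
print(threshold_in_quiet_db_spl([100, 1000, 4000, 15000]))
```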
Interestingly, if you do play a loud tone like the 1 kHz tone shown in Figure 5, then your threshold of hearing will change, as is shown in Figure 6.
If you were not playing that loud 1 kHz tone, and, instead, you played a quieter tone just below 2 kHz, it would also be audible, since it’s also above the threshold of hearing (shown in Figure 7).
However, if you play those two tones simultaneously, what happens?
This effect is called “psychoacoustic masking” – the quieter tone is masked by the louder tone if the two are reasonably close together in frequency. This is essentially the same reason that you can’t hear someone whispering to you at an AC/DC concert… Normal people call it being “drowned out” by the guitar solo. Scientists will call it “psychoacoustic masking”.
Let’s pull these two stories together… The reason I started leaving notes out when I was playing background music was that my processing power was getting limited (because I was getting tired) and the people listening weren’t able to tell the difference. This is how I got away with it. Of course, if you were listening, you would have noticed – but that’s just a chance I had to take.
If you want to record, store, or transmit an audio signal and you don’t have enough processing power, storage area, or bandwidth, you need to leave stuff out. There are lots of strategies for doing this – but one of them is to get a computer to analyse the frequency content of the signal and try to predict what components of the signal will be psychoacoustically masked and leave those components out. So, essentially, just like I was trying to predict which notes you wouldn’t miss, a computer is trying to predict what you won’t be able to hear…
This process is a general description of what is done in all the psychoacoustic CODECs like MP3, Ogg Vorbis, AC-3, AAC, SBC, and so on and so on. These are all called “lossy” CODECs because some components of the audio signal are lost in the encoding process. Of course, these CODECs have different perceived qualities because they all have different prediction algorithms, and some are better at predicting what you can’t hear than others. Also, depending on what bitrate is available, the algorithms may be more or less aggressive in making their decisions about your abilities.
There’s just one small problem… If you remove some components of the audio signal, then you create an error, and the creation of that error generates noise. However, the algorithm has a trick up its sleeve. It knows the error it has created, and it knows the frequency content of the signal that it’s keeping (and therefore it knows the resulting elevated masking threshold). So it uses that “knowledge” to shape the frequency spectrum of the error to sit under the resulting threshold of hearing, as shown by the gray area in Figure 9.
Let’s assume that this system works. (In fact, some of the algorithms work very well, if you consider how much data is being eliminated… There’s no need to be snobbish…)
Now to the real part of the story…
Okay – everything above was just the “setup” for this posting.
For this test, I put two .wav files on a NAS drive. Both files had a sampling rate of 48 kHz, one file was a 16-bit file and the other was a 24-bit file.
On the NAS drive, I have two different applications that act as audio servers. These two applications come from two different companies, and each one has an associated “player” app that I’ve put on my phone. However, the app on the phone is really just acting as a remote control in this case.
The two audio server applications on the NAS drive are able to stream via my 2.4 GHz WiFi to an audio device acting as a receiver. I captured the output from that receiver playing the two files using the two server applications. (Therefore, 4 tests were run.)
The content of the signal in the two .wav files was a swept sine tone, going from 20 Hz to 90% of Nyquist, at 0 dB FS. I captured the output of the audio device in Figure 10 and ran a spectrogram of the result, analysing the signal down to 100 dB below the signal’s level. The results are shown below.
So, Figures 11 and 13 show the same file (the 16-bit version) played to the same output device over the same network, using two different audio server applications on my NAS drive.
Figures 12 and 14 also show the same file (the 24-bit version). As is immediately obvious, the “Audio Server SW 2” is not nearly as happy about playing the 24-bit file. There is harmonic distortion (the diagonal lines parallel with the signal), probably caused by clipping. This also generates aliasing, as we saw in a previous posting.
However, there is also a lot of visible noise around the signal – the “fuzzy blobs” that surround the signal. This has the same appearance as what you would see from the output of a psychoacoustic CODEC – it’s the noise that the encoder tries to “fit under” the signal, as shown in Figure 9… One give-away that this is probably the case is that the vertical width (the frequency spread) of that noise appears to be much wider when the signal is at a low frequency. This is because this plot has a logarithmic frequency scale, but a CODEC encoder “thinks” on a linear frequency scale. So, frequency bands of equal widths on a linear scale will appear to be wider in the low end on a log scale. (Another way to think of this is that there are as many “Hertz’s” from 0 Hz to 10 kHz as there are from 10 kHz to 20 kHz. The width of both of these bands is 10000 Hz. However, those of us who are still young enough to hear up there will only hear the second of these as the top octave – and there are lots of octaves in the first one. (I know, if we go all the way to 0 Hz, then there are an infinite number of octaves, but I don’t want to discuss Zeno today…))
So, it appears that “Audio Server SW 2” on my NAS drive doesn’t like sending 24 bits directly to my audio device. Instead, it probably decodes the wav file, and transcodes the lossless LPCM format into a lossy CODEC (clipping the signal in the process) and sends that instead. So, by playing a “high resolution” audio file using that application, I get poorer quality at the output.
As always, I’m not going to discuss whether this effect is audible or not. That’s irrelevant, since it’s dependent on too many other factors.
And, as always, I’m not going to put brand or model names on any of the software or hardware tested here. If, for no other reason, this is because this problem may have already been corrected in a firmware update that has come out since I tested it.
The take-home messages here are:
an entire audio signal path can be brought down by one piece of software in the audio chain
you can’t test a system with one audio file and assume that it will work for all other sampling rates, bit depths and formats
Normally, when I run this test, I do it for all combinations of 6 sampling rates, 2 bit depths, and 2 formats (WAV and FLAC), with at least 2 different signal levels – meaning 48 tests per DUT (see the sketch after this list)
What I often see is that a system that is “bit perfect” in one format / sampling rate / bit depth is not necessarily behaving in another, as I showed above…
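To make the size of that test matrix concrete, here is the enumeration from the third point above (a sketch – the exact levels and formats are whatever the test plan calls for):

```python
from itertools import product

rates   = [44100, 48000, 88200, 96000, 176400, 192000]  # Hz
depths  = [16, 24]                                      # bits
formats = ["wav", "flac"]
levels  = [0, -1]                                       # dB FS

test_matrix = list(product(rates, depths, formats, levels))
print(len(test_matrix))  # 6 * 2 * 2 * 2 = 48 tests per DUT
```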
So, if you read a test involving a particular NAS drive, or a particular Audio Server application, or a particular audio device using a file format with a sampling rate and a bit depth, and the reviewer says “This system worked perfectly”, you cannot assume that your system will also work perfectly unless all aspects of your system are identical to the tested system. Changing one link in the chain (even upgrading the software version) can wreck everything…
This makes life confusing, unfortunately. However, it does mean that, if something sounds wrong to you with your own system, there’s no need to chase down excruciating minutiae like how many nanoseconds of jitter you have at your DAC’s input, or whether the cat sleeping on your amplifier is absorbing enough cosmic rays. It could be because your high-res file is getting clipped, aliased, and converted to MP3 before being sent to your speakers…
Just in case you’re wondering, I tested these two systems above with all 6 standard sampling rates (44.1, 48, 88.2, 96, 176.4, and 192 kHz) and 2 bit depths (16 & 24). I also did two formats (WAV and FLAC) and three signal levels (0, -1, and -60 dB FS) – although that doesn’t matter for this last comment.
“Audio Server SW 2” had the same behaviour at all sampling rates – 16-bit files played without artefacts within 100 dB of the 0 dB FS signal, whereas 24-bit files at all sampling rates exhibited the same errors as are shown in Figure 14.
We’ve seen in a previous posting that timing errors can occur in wireless audio systems. As we saw there, the wrong way to deal with this is to simply drop or repeat samples when the receiver realises it’s out of synchronisation with the transmitter. A better way to do it is to smoothly drift the sampling rate to either catch up or slow down – although this causes the modern-day equivalent of “wow and flutter”, since variations in the sampling rate will cause pitch shifts at the output. The trick here is to make changes slowly so as to get away with it…
However, what I didn’t address in that posting was how bad the problem can be – I only talked about how not to correct the problem when you know you have one.
So, let’s do a different (but related) test. I made a signal that consists of “digital black” – a long string of zeros – and therefore silence. Then, I made a single-sample spike every second (for example, every 44100 samples in a 44.1 kHz sampling rate system). In order to not make anything unhappy, I gave the clicks a value of 0.5 – so nothing is close to overloading…
Then, I transmitted that signal to an audio device wirelessly and recorded its output.
Figure 1, below, shows the original signal on top, and the recorded output of the device under test (the “DUT”) on the bottom.
You may notice that there is a little noise in the bottom plot. This is because this particular DUT has an acoustical output, and the noise you see there (partly) is acoustical noise in the room and measurement system.
Note that this plot shows only the first 5 seconds of a test that actually ran for 10 minutes.
Then, I wrote a little Matlab script that finds the spikes in each signal, and counts the number of samples between spikes. So, in a system running at 44.1 kHz I would expect that there is 1 spike every 44100 samples – both at the input to the system (the original signal) and its output. In other words, I’m finding out how far apart those spikes are with a resolution of 1 sample.
So, I find the duration between clicks at the output of the DUT, convert from samples to milliseconds, and plot the error over the full 600 seconds (10 minutes) of the test. In theory, there is no error – and each duration is exactly 1 second ±0 ms. In practice, however, this is not true.
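My script was written in Matlab, but the idea is simple enough to sketch in Python (the spike threshold of 0.25 is my assumption – anything between the noise floor and the 0.5 click level would do):

```python
import numpy as np

def click_intervals_ms(signal, fs, threshold=0.25):
    """Return the durations between single-sample clicks, in milliseconds."""
    idx = np.flatnonzero(np.abs(signal) > threshold)
    # Collapse runs of adjacent above-threshold samples into one event per click
    idx = idx[np.diff(idx, prepend=-2) > 1]
    return np.diff(idx) / fs * 1000.0

# Error relative to the nominal 1000 ms spacing, second by second:
# error_ms = click_intervals_ms(recorded_output, 44100) - 1000.0
```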
For this posting, I tested two commercially-available devices, transmitting from the same device.
Figure 2 shows the results for that first device. As you can see there, one second at the device’s input does not correspond to 1 second at its output. It drifts from a little under 999.7 ms to a little over 1000.2 ms. Note that, for this test, I don’t know from the measurement how that change takes place – whether it’s shifting slowly or using a skip/insert strategy. I just know one version of how bad the problem is over time, on a second-by-second basis.
Figure 3, below, shows the same analysis for another device. Notice that there are three colours in this plot, corresponding to three separate tests of the same device…
As you can see there, this device seems to be behaving most of the time, but occasionally gets a little lost and jumps by up to about ±70 ms in the worst case. This means that, for this test, we can see that “1 second” can last anything between about 930 ms and 1070 ms. Note that this analysis doesn’t show what happens at the moment (or during the time) that the jump occurs – we only know that it has happened sometime between clicks at the output.
You may be wondering why the plot in Figure 2 is more “jagged” than the one in Figure 3. This is mostly because the scale of the two plots is so different. If we were to zoom in to the plot in Figure 3, we would see that it is roughly as busy, as is shown below in Figure 4.
One significant difference between these two devices is that the first has an acoustical output and the second has an electrical output. This may cause you to wonder whether the acoustical noise in the first measurement contributes to the error. This may be possible. However, a 0.2 ms (or 200 µs) error is roughly equivalent to 9 samples at 44.1 kHz (or a 6.9 cm shift in distance between the DUT and the microphone). This is well outside the range of the error generated by acoustical noise – so that cannot be held responsible as being the only contributor to the error measurement.
I should say that the wireless audio protocol used for these two tests was the same… So, this is not a comparison of two different transmission systems. Also, as I mentioned above, the transmitter was the same for both DUTs. So, the difference in results here is attributable to the skill and care with which the manufacturers of the two receiving devices executed their implementations.
As always, don’t bother asking which devices these DUT’s are. I’m not telling – primarily because it doesn’t matter. I’m just using these two devices as examples of errors I often see when I measure audio equipment…
One additional thing that might be of interest to geeks like me: that second DUT has a digital audio output, which is what I used to capture its signal. Interestingly, when I measure the sampling rate of that output with a digital audio signal analyser, the sampling rate is typically within 2 ppm of the correct frequency. So, ignoring the big spikes in Figure 3 (which are probably the result of buffer over- or under-runs), if the timing errors we see in Figure 4 were solely caused by a clock error that was visible on the digital audio output, then we should not see deviations of more than approximately 2 microseconds per second. Instead, we see changes on the order of 1 to 2 milliseconds per second, which indicates a sample rate drift of 1000 to 2000 ppm… So, this means that, although the sampling rate of my transmitter and the output sampling rate of my receiver (the DUT) are nominally the same, AND there is very low jitter / error on the DUT’s output sampling rate, something else in the audio signal path is causing this error. In other words, a simple measurement of the digital output’s sampling rate is not adequate to verify that the DUT’s clock is behaving.
In a previous posting, I tried to explain the concept of aliasing. The easiest way to illustrate this is to try to sample an audio signal that has a frequency that is higher than the Nyquist frequency – one half of the sampling rate. If you do this, then the signal that will come out of your digital audio system will have a different frequency than the original signal. In fact, it will be the Nyquist frequency minus the difference between the original signal and the Nyquist frequency.
For example, if we have an LPCM audio system that has a sampling rate of 48 kHz, then its Nyquist frequency is 24 kHz. If you allow any audio signal to be sampled by that system, and you record a sine wave with a frequency of 30 kHz, then the signal that will be played back by the system will be a sine wave with a frequency of 18 kHz, since 24000 - (30000 - 24000) = 18000 Hz.
The example I gave above is only part of the story. It’s the part of the story that’s told because it’s easy to tell, and relatively easy to grasp. However, let’s look into this a little more…
If I ask you “what is the square root of 4?” you’ll probably say that the answer is “2”. However, this is also only part of the story. The square root of 4 is also -2, since -2 * -2 = 4. So, there are two correct answers to the question – in other words, both answers exist and are equally valid.
Aliasing is somewhat similar. If we manage to get a 30 kHz sine wave into an LPCM recording system with a sampling rate of 48 kHz, we will appear to have recorded an 18 kHz sine wave. However, the samples that we have captured are also equally valid for the original 30 kHz sine wave. In fact, both the 18 kHz and the 30 kHz tones can be thought of as being equally valid answers to the set of samples we recorded.
This means that, if I record an 18 kHz sine tone in the 48 kHz system, we can consider the 30 kHz sine tone to also exist simultaneously, inside the digital domain.
Oddly, this is also true at other frequencies. So, you do not only get a mirror effect around the Nyquist frequency, but you also get it at 1.5 times the sampling rate (or the sampling rate plus the Nyquist frequency).
I won’t go into this any deeper for now – but if you want to continue, the section on “Folding” at the Wikipedia page on Aliasing is a good place to start.
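That said, the general rule fits in one line: the alias of any frequency f is its distance to the nearest multiple of the sampling rate. A sketch:

```python
def alias_frequency(f, fs):
    """Frequency at which a tone at f appears after sampling at rate fs."""
    return abs(f - fs * round(f / fs))

print(alias_frequency(30000, 48000))  # 18000 (the example above)
print(alias_frequency(18000, 48000))  # 18000 (already below Nyquist)
print(alias_frequency(66000, 48000))  # 18000 (fs + 18 kHz folds down too)
```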
Normally, we try to prevent audio signals with frequency content higher than the Nyquist frequency from getting into an LPCM system. This is done by low-pass filtering the audio signal to eliminate any content that might cause aliasing. That’s why the low-pass filter at the input of an analogue-to-digital converter is called an anti-aliasing filter. (At least, that’s the theory. In reality, the anti-aliasing filter in many ADCs allows a little signal through above the Nyquist frequency…)
However, what happens if you create signals with a frequency above the Nyquist within the digital domain? Is this possible? Can it happen accidentally?
The short answer to this question is “yes”.
For example, let’s take a sine wave with a frequency of 2212 Hz (this is an arbitrary number… it could have been something else…) and record it with an LPCM system running at a sampling rate of 48 kHz. Then, after the signal is in the digital domain, we clip it at 85% of the peak value, so that it looks like the waveform shown in Figure 1.
By clipping the sine wave symmetrically (meaning that we have made the same change in the wave’s shape on the top and the bottom), we create odd-order harmonics. This means that, when we look at the spectrum of the signal’s frequency content, we will see energy at the fundamental frequency (the original sine wave’s frequency) and also peaks at 3x, 5x, 7x, 9x that frequency – and so on.
(If I had clipped only the top or only the bottom, and therefore created asymmetrical distortion, we would also see energy in the even-order harmonics at 2x, 4x, 6x, 8x the fundamental frequency – and so on.)
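Here’s a sketch of how the signal in Figure 1 can be generated and analysed (the window is only there to keep the spectrum clean for plotting):

```python
import numpy as np

fs, f0 = 48000, 2212.0
t = np.arange(fs) / fs             # one second
x = np.sin(2 * np.pi * f0 * t)
clipped = np.clip(x, -0.85, 0.85)  # symmetrical clipping at 85 %

# Magnitude spectrum in dB relative to the largest component
spec = np.fft.rfft(clipped * np.hanning(len(clipped)))
mag_db = 20 * np.log10(np.abs(spec) / np.abs(spec).max())
freqs = np.fft.rfftfreq(len(clipped), 1 / fs)

# Peaks appear at 2212 Hz and its odd harmonics; any harmonic that would
# land above 24 kHz shows up aliased below the Nyquist frequency instead.
```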
So, let’s look at the frequency content of the clipped signal shown in Figure 1. This is shown in Figure 2, below.
As you can see in Figure 2, we are expecting to see harmonics that extend (at least in this plot) up to 37604 Hz (or 17 x 2212 Hz). Of course, there are harmonics that go higher than this – but they aren’t visible in this plot because I’m only plotting signals with a level down to -60 dB FS.
You may notice that the width of the plot at 2212 Hz increases at the bottom. This is just an artefact of the math being done to find the frequency components in the signal. That spread in the frequency domain isn’t actually in the signal itself, so it can be ignored.
As I said above, the signal was clipped in the digital domain, in an LPCM system running at 48 kHz. So, just for reference, I’ve put in blue lines in Figure 2 that show the sampling rate and the Nyquist frequency – one half the sampling rate.
So: we can see that some of the artefacts created by clipping the signal are sitting at frequencies above the Nyquist frequency in this system. This means that this content will be “mirrored” or “folded down” or – more correctly – aliased to other frequencies below the Nyquist frequency. For example, the harmonic at 24332 Hz will be mirrored to 23668 Hz, according to the following math:
24000 - (24332 - 24000) = 23668 Hz
So, looking at the top 60 dB of the signal content (shown in Figure 3): the resulting actual output of the LPCM signal will contain:
the original fundamental frequency at 2212 Hz
four harmonics of that frequency (shown as the other red numbers in Figure 3), and
four more frequencies that are not harmonically related to the fundamental (the blue numbers)
As you may already know, an LPCM system has a low-pass filter at its output stage – part of the system that is used to convert the signal back to an analogue output. That low-pass filter typically has a cutoff frequency around the Nyquist frequency of the system. However, the artefacts that we have created here have aliased down to frequencies below the Nyquist within the digital domain – so, by the time the signal reaches the low-pass filter at the output (known as a “reconstruction filter”), they’re already in the audio band, and therefore they’re not filtered out.
So, as we can see in this rather simple example: it is easily possible that a digital audio system that has some processing (specifically “non-linear” processing) can create harmonics that are higher than the Nyquist frequency and will have “aliases” below the Nyquist frequency, and therefore will not be removed by an anti-aliasing filter.
Since the aliased artefacts are not harmonically related to the fundamental frequency, they are more easily audible than “normal” distortion artefacts that generate harmonically-related artefacts. There are a couple of reasons for this, but the most obvious one can be demonstrated by sweeping the frequency of the fundamental. If the artefacts are harmonically related, then as the fundamental frequency of the signal goes up, so do the artefacts. However, if the artefacts are the result of aliasing, then as the fundamental frequency of the signal goes up, some of the artefacts go down in frequency, which sounds quite strange…
The example I gave above (of clipping) is just one way to create distortion that generates harmonically-related artefacts that alias in the system. Lots of different processes can create those artefacts. One of the usual suspects is a poorly-made sampling rate converter.
Many systems use sampling rate converters for different reasons. For example, if you have a loudspeaker or processor that has a lot of filtering in its processing chain, the best architecture is to run the digital signal processing (the DSP) at a constant (or “fixed”) sampling rate, regardless of the sampling rate of the incoming signal. This is because, if you were to change sampling rates in the DSP to match the incoming signal, you would have to load an entirely new set of coefficients (a fancy word that basically means “multiplication values inside the digital filters”) into the processor. This takes some time, and you don’t want to miss the first part of the song every time the sampling rate changes while you’re waiting to load a bunch of new coefficients into your filters… So, instead, the smart thing to do is to keep the DSP running at a constant rate, and sample rate convert all incoming signals to the internal sampling rate. This way, there’s no dropout at the start of the song.
However, you have to be careful if you do this, since a poorly-made sampling rate converter will certainly create aliasing artefacts.
In part 5 of this series of postings, I described one kind of test that can be made on an audio system. This test consists of sending a sine wave with a swept frequency into the system and recording its output. You then do a spectrogram of the output, looking for signals at frequencies other than the one you sent in.
To get an idea of what aliasing will look like in this plot, I made a DSP algorithm that creates the same kinds of artefacts. The resulting plot is shown in Figure 4, below. (Remember that this is a measurement of a system that I made to intentionally generate similar artefacts to aliasing – this isn’t actually the output of a system that is aliasing).
Now that you know what to look for in the plot, let’s look at the measurements of some commercially-available systems. Figure 5, below is a measurement of a system that has two problems. One can be seen as the vertical lines – these are “skip/insert” artefacts that I described in an earlier posting. The aliasing artefacts can also be seen in this plot. Note that, in this case, the input and output of the system are both digital connections to my measurement equipment.
If I send a signal at a different sampling rate into the same system, I get a different behaviour. This is not unusual in systems with sampling rate converters. In this plot, you can see the skip/insert artefacts (the vertical stripes), the aliasing artefacts, and the obvious band-limiting of the system. Notice that nothing above about 24 kHz comes out of the system, which would mean that, internally, it is probably running at a sampling rate of 48 kHz. (The input signal in this measurement was at 192 kHz and my analysis system was running at 96 kHz.)
Let’s look at another system. In this case, I put a 48 kHz, 16-bit .flac file on a hard drive, and played it through another digital audio system, again capturing its digital output. The result of this is shown in Figure 6.
As you can see in Figure 6, this system is behaving very well in this particular test. I see the nice, clean signal with only one frequency at only one time. No artefacts down to 100 dB below the signal level. This is good.
Now let’s test exactly the same system, at exactly the same sampling rate, again with a .flac file – but this time with a 24-bit word length in the file. The result of this is shown in Figure 7.
So, by going from a 16-bit file to a 24-bit file, this system obviously behaves very, very differently. It now has harmonic distortion (the straight diagonal lines running parallel to the fundamental frequency), aliasing of those harmonics when they go beyond 24 kHz, and strange noises as well (the large area of blue blobs in the lower left corner, and surrounding the fundamental frequency all the way up).
Those “strange noises” – the blobs – are probably artefacts caused by a lossy codec similar to MP3. Typically, systems like this are built to reduce the data rate of the audio signal by trying to predict what you can’t hear in the signal – and leaving that out. In doing so, they create errors that produce noise, so the encoder tries to shape that noise so that it “hides” under the signal that it keeps. The end result looks something like the blobs shown in Figure 7… For a more thorough discussion of this, see this posting.
So, based only on the information from this test, we can guess that the system might be decoding the 24-bit file, “transcoding” it to a lossy format, and transmitting that through the system. However, this is just a guess based on one test… So it could easily be wrong.
One thing we can conclude, however, is that the 48 kHz / 16-bit file behaves MUCH better than a 48 kHz / 24-bit file in this system… So, in this particular case, a higher resolution is not necessarily better…
I should also point out that the digital output of that system was capable of outputting 24 bits. The reason I’m pointing this out is that many people think that if a system or device has a digital output, then it is good. This is too simple a conclusion to make because, as I’m trying to illustrate with this series of postings, the “weak link” in the chain is very likely NOT the physical output of the system. It’s more likely some part of the processing in the DSP chain (for example, a poorly-made sampling rate converter that aliases) or a poorly-implemented clocking system (for example, a skip/insert strategy).
For more aliasing fun…
If you’re intrigued by this, and you’d like to compare the aliasing caused by other sampling rate converters, I’d recommend checking out the page at http://src.infinitewave.ca. They plot the signals with a linear frequency scale instead of a logarithmic one. Consequently, the sweep of the fundamental looks like a curve (instead of the straight lines in my plots) but the harmonic distortion and aliasing artefacts are easier to see as being related to the fundamental.
Addendum – 2018/05/14
Last week, I ran a quick test on another commercially-available device – this time, a stand-alone audio file player with a digital output. I was running the test using a 44.1 kHz, 16-bit FLAC file, but the device had a 48 kHz output. The interesting thing about this one was that the artefacts that showed up were almost exclusively aliasing errors. So, I thought it would be interesting to show the plots here.