Jitter – Part 8.3 – Sampling Rate Conversion

#8 in a series of articles about wander and jitter

Although I am guessing, I don’t think that it is crazy to say that the majority of digital audio systems today employ some kind of sampling rate conversion somewhere in the signal flow.

A sampling rate converter is a physical device or a processing block in some software that takes an audio signal that has been sampled at one rate (say, 44.1 kHz) and converts it to an audio signal at another rate (say, 48 kHz).

There are many reasons why you might want to do this. For example, if you have a device that has equalisation (filtering), then if you change the sampling rate, you will have to load new coefficients into the filters. If you have a LOT of filters, then it might take so much time to load them into the system that you’ll miss the first second or two of a song if it’s a different sampling rate than the previous song. So, instead of doing this, you keep your processing at one constant (or ‘fixed’) sampling rate, and convert the input to that rate. This might even be true in the case where the incoming sampling rate is the same as the internal sampling rate. For example, you might be “sample rate converting” from 48 kHz to 48 kHz – just to keep the design of the system clocking constant.

Looking very broadly, there are two options for sampling rate conversion.

Synchronous Sampling Rate Conversion

Let’s say that you have to convert from 48 kHz to 96 kHz – a multiplication of 2. In this simple case, you could take the incoming samples, and insert a new, extra one mid-way between each of them. The value of the new sample depends on how you are doing the math to calculate it. We will not discuss this here. The important thing about this concept is that the timing of the output is “locked” to the input. In this example, every second sample of the output happens at exactly the same time as a sample at the input. This can also be true if the ratio of the sampling rates is not “nicely” related like a 2:1 ratio. For example, if you have an input at 44.1 kHz and an output at 48 kHz, you could take the incoming 44.1 kHz signal, insert 47999 “virtual” samples between each of the original samples (making the new sampling rate 2116800000 Hz) and then pull an output sample from that stream every 44100 samples.

In other words:

(44100 * 48000) / 44100 = 48000

Of course, this is not a smart way to do this (because it will be a huge waste of processing power and memory – and imagine how big the numbers would be if you’re converting 176.4 kHz to 192 kHz… bigger!), but it would work, as long as the “virtual” samples you create at the very high “virtual” sampling rate have the correct values.
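One common way to shrink those numbers (this is my addition, not something described above) is to divide both sampling rates by their greatest common divisor before choosing the “virtual” rate. The sketch below is only the arithmetic; the function name resampling_ratio is made up for illustration, and the code says nothing about how the values of the virtual samples are calculated.

```python
from math import gcd

def resampling_ratio(rate_in_hz, rate_out_hz):
    """Return (up, down) in lowest terms so that rate_in * up / down == rate_out."""
    common = gcd(rate_in_hz, rate_out_hz)
    return rate_out_hz // common, rate_in_hz // common

up, down = resampling_ratio(44100, 48000)

# The intermediate "virtual" stream runs at the input rate multiplied by `up`,
# and an output sample is pulled from it every `down` virtual samples.
virtual_rate_hz = 44100 * up

print(up, down)                 # 160 147
print(virtual_rate_hz)          # 7056000
print(virtual_rate_hz // down)  # 48000
```

So, instead of 47999 virtual samples between each pair of input samples, 159 would be enough for this pair of rates, and the same 160:147 ratio applies to 176.4 kHz and 192 kHz.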

This type of sampling rate conversion, where the output is numerically “locked” to the input in time (meaning that, at some regular interval of time, the input and the output samples will happen simultaneously – or at least with a constant delay) is called synchronous sampling rate conversion. It’s called that because the input and the output are synchronised with each other… A bit like gears meshing together.

Asynchronous Sampling Rate Conversion

There is another way to do this, where we do not lock the output clock to the input clock. Let’s say that you want to build a device that has a constant sampling rate at its output, but you don’t really know what the sampling rate of the input is. In this case you will use an asynchronous sampling rate converter – so-called because there is no fixed lock between the input and output clocks.

In this case, the incoming signal is analysed and its sampling rate is measured. The way this is done is a little similar to the method shown above. You take the clock running at the rate of the output’s signal and multiply that by some value (say 512, for example) to create an internal “virtual” clock running at a higher sampling rate. You then “grab” the value of an incoming sample and apply its value to the “virtual” sample that is closest in time. This allows the incoming samples to drift in time relative to the output samples.

In both cases, there is the open question of how you generate the signal at the higher internal sampling rate. This can be done using a kind of low pass filter that is effectively similar to the reconstruction filter in a DAC. I will not talk about this any more than that – other than to say that the response characteristics of that filter are VERY important… So, if you’re planning on building your own sampling rate converter, read a lot more stuff on the subject than what I’ve written here – because what I’ve written here is most certainly not enough information.

There’s one strange effect that pops up here. Since, in an ASRC (Asynchronous Sampling Rate Converter) the incoming signal is sampled at discrete times that are numerically related to the output sampling rate, then any potential jitter in the system is also quantised in time. So, for example, if your output sampling rate is 48000 samples per second, and you’re creating the internal sampling rate by multiplying that by 512, then any jitter in the ASRC cannot have a value less than 1/(48000*512) second = 4.069*10^-8 or 40.69 nanoseconds. In other words, in such a system, the error caused by jitter will be 0, ±40.69 nanoseconds, ±81.38 nanoseconds, and so on. It can’t be something in between… (assuming that the output clock is perfect. If it’s drifting due to jitter, then those values will also drift…)
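The arithmetic in that example can be checked with a few lines of Python. This is just a sketch of the calculation; the function name jitter_step_seconds is made up, and the 48 kHz rate and the multiplier of 512 are the example values used above rather than properties of any particular ASRC.

```python
def jitter_step_seconds(output_rate_hz, multiplier):
    """Smallest timing step of the internal "virtual" clock, in seconds."""
    return 1.0 / (output_rate_hz * multiplier)

step = jitter_step_seconds(48000, 512)
print(f"{step * 1e9:.2f} ns")  # 40.69 ns

# The possible timing errors are whole multiples of that step.
for k in range(-2, 3):
    print(f"{k * step * 1e9:.2f} ns")
```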

The good news is that, if the clock that is used for ASRC’s output sampling rate is very accurate and stable, and if the filtering that is applied to the incoming signal is well-done, then an ASRC can behave very, very well – and there are lots of examples of this. (Sadly, there are many more examples where an ASRC is implemented poorly. This is why many people think that sampling rate converters are bad – because most sampling rate converters are bad.) In fact, a correctly-made sampling rate converter can be used to reduce jitter in a system (so you would even want to use it in cases where the incoming sampling rate and outgoing sampling rates are the same). This is why some DAC’s include an ASRC at the input – to reduce jitter originating at the signal source.

Wrapping up Part 8: The take-home messages for these three parts in Section 8 are:

  • Sampling Jitter results in some kind of distortion of the signal that can be related to the signal itself
  • Sampling Jitter can occur in the ADC, the DAC, or an ASRC
  • If implemented correctly, an ASRC can be used to attenuate jitter in a system
  • Once jitter artefacts have been introduced to the signal, they cannot be removed. So, if you have a recording that was made using an ADC with a lot of jitter, the artefacts caused by that jitter are in the recorded signal forever. If you have a DAC that has absolutely no jitter whatsoever (this is not possible) then this will not eliminate the jitter artefacts that are already in the signal. Of course, it won’t make the situation worse… but it won’t make it better.

Addendum. If you want to dig further into the world of Sampling Jitter and the advantages of using ASRC’s to attenuate jitter, I highly recommend the following as a good starting point:

  • Julian Dunn’s paper called “Jitter Theory” – Technical Note TN-23 from Audio Precision. This is a chapter in his book called “Measurement Techniques for Digital Audio”, published by Audio Precision. See this link for more info.
  • Clock Jitter, D/A Converters, and Sample-Rate Conversion
    By Robert W. Adams, Published in The Audio Critic, Issue No. 21
  • The Effects of Sampling Clock Jitter on Nyquist Sampling Analog-to-Digital Converters and on Oversampling Delta Sigma ADCs, Steven Harris. AES Preprint #2844 (87th International Convention of the AES, October 1989)
  • Jitter Analysis of Asynchronous Sample-rate Conversion, Robert Adams. AES Preprint #3712 (95th International Convention of the AES, October 1993)

Jitter – Part 8.2 – Sampling Jitter

#8 in a series of articles about wander and jitter

In the previous post we looked at the effect of an incoming analogue signal that is sampled at the wrong times. In that description, I implied that the playback of the samples would happen at exactly the correct times. So, the jitter was entirely at the ADC (analogue-to-digital converter) and nowhere else.

In this posting, we’ll look at a very similar issue – jitter in the DAC (digital-to-analogue converter).

Jitter in the Digital to Analogue conversion

Let’s assume that we have a signal (in our case, a sinusoidal waveform, since that’s easy to plot) that was sampled by an ADC with no jitter. So, our original signal looks like Figure 1.

That signal is sampled by the ADC at exactly the correct times, since it has no jitter. The result of this is shown below in Figure 2.

When the time comes to play this signal, we send those samples to the DAC in the correct order and hope that it converts each of them to an analogue voltage at exactly the correct times. If the sampling rate of the system is 96 kHz, then we hope that the DAC converts a sample every 1/96000th of a second, at exactly the right time each time.

The time that the DAC spits out the sample is dictated by a clock somewhere in the system. It might be an internal clock, or it might come from an external device, depending on your system and how it’s being used. However, if that clock is inaccurate for some reason, or if there is some kind of noise infecting the connection between the clock and the DAC, then the DAC can be triggered to convert a sample at the incorrect time. This is sampling jitter in the digital to analogue conversion process. I’ve tried to illustrate this in Figure 3.

It may not be immediately obvious, but the sample values in Figure 3 are identical to those in Figure 2. What I’ve done is to move them in time, so that you’re getting exactly the right level output at the wrong time each time. Of course, I have heavily exaggerated this plot to make it obvious that the times between consecutive samples are not equal. Some are much shorter than the sampling period (e.g. between samples 3 and 4) and some are much longer (e.g. between samples 9 and 10).

Just like the case of ADC jitter, we can analyse this simply as an amplitude error. In other words, as a result of the timing errors, the red circles are not sitting directly on the original gray signal. And, just like we saw in the case of the ADC jitter, the amount of amplitude error is proportional to the slope of the signal.

Addendum: It’s important to remember that the descriptions and the plots that I’m showing here are to help show what jitter is – and the amounts of jitter in those plots are heavily exaggerated. I’m not showing what the final result will be. The actual jitter in a system is much, much lower than anything I’ve shown here. Also, I’ve completely omitted the effects of the anti-aliasing filter and the reconstruction filter – just to keep things simple.

Jitter: Part 8.1 – Sampling Jitter

#8 in a series of articles about wander and jitter

Ignoring most of the details, converting an analogue audio signal into a digital one is much like filming a movie. The signal (a continuous change in voltage) is measured (or sampled) at a regular rate (the sampling rate), and those measurements are stored for future use. This is called Analogue-to-Digital Conversion.

In the future, you take those samples, and you convert them back to voltages at the same sampling rate (in the same way that you play a film at the same frame rate that you used to record it). This is called Digital-to-Analogue Conversion.

However, we’re not here to talk about conversion – we’re here to talk about jitter in the conversion process.

As we’ve already seen, jitter (and wander) is an error in the timing of a clock event. So, let’s look at this effect as part of the sampling process. To start: jitter in the analogue to digital conversion.

Jitter in the Analogue to Digital conversion

Let’s say that we want to convert an analogue sinusoidal wave into a PCM digital version.

Note that I’m going to skip a bunch of steps in the following explanation – concentrating only on the parts that are important for our discussion of jitter.

We start with a wave that has theoretically infinite resolution in amplitude and time, and we divide time into discrete moments, represented by the numbered vertical lines in the plot below.

Fig 1. An analogue sinusoidal wave, about to be sampled in the first step to conversion into an LPCM audio signal.

Every time the clock “ticks” (in other words, on each of those vertical lines), we measure the voltage of the signal. These discrete measurements are represented in Figure 2 as the circles, sitting on the original waveform (in gray).

Fig 2. The instantaneous amplitude of the original waveform (in gray) is measured at each discrete moment in time.

Part of this system relies on the accuracy of the clock that’s used to tell the sampling system when to do the measurements. In a perfect world, a system with a sampling rate of 44.1 kHz would make a measurement of the incoming analogue wave exactly every 1/44100th of a second. The time between samples would never vary.

This, of course, is impossible. The clock that ticks at the sampling rate will have some error in time – albeit a very, very small error.

Let’s heavily exaggerate this error so that we can see the resulting effect. Figure 3 shows the same original analogue sinusoidal waveform, sampled (measured) at incorrect times. In other words, sometimes the measurement (represented by the red circles) is made slightly too early (to the left of the gray vertical line – as is the case for Sample #9), sometimes, it’s made too late (to the right of the line – as in Sample #2).

Fig 3. The same analogue sinusoidal waveform, sampled at the wrong times. This error in timing is different each time.

For example, look at the sample that should occur at clock tick #2. I’ve zoomed in to the plot so that this can be seen more clearly in Figure 4.

Fig 4. A portion of the plot in Figure 3, zoomed in for clarity’s sake.

Notice that, because the measurement was made at the wrong time (in the case of sample #2, somewhat late), the result is an error in the measurement of the waveform’s amplitude. So, an error in time produces an error in level.

Let’s assume that the measurements we made in Figure 3 are stored and then replayed at exactly the correct times – what will the result be? This is shown in Figure 5. As you can see there, by comparing the measurements we made in Figure 3 to the original waveform, the result is a distortion of the waveform.

Fig 5. The samples, measured at the incorrect times (as shown in Figure 4) re-aligned as though they were played back at the correct times.

The time-based errors in the measurements in Figure 3 result (in this example) in a system that contains amplitude-based errors at the output. This results in some kind of distortion of the signal, as can be seen here.

As you can see in Figure 5, the result is a signal that is not a sine wave. Even after this digital signal has been low-pass filtered by the reconstruction filter in the Digital-to-Analogue Converter (the DAC), it will not be a clean sine wave. But let’s think about exactly what can go wrong here, more carefully.

For starters, an error that is ONLY caused by timing errors in the sampling process cannot produce levels that are outside the amplitude range of the original signal. In other words, if our original signal was 1 V Peak and symmetrical, then the sampled waveform will not exceed this. This is because the samples are all real measurements of the signal – merely performed at the incorrect times.

Secondly, if the amount of jitter is kept constant, then the amount of amplitude error will modulate (or vary) with the slope of the signal. This is illustrated in Figure 6, below.

Fig 6a. A sinusoidal waveform that has been sampled.
Fig 6b. The range of the amplitude error if the range of jitter is ±0.5 sample is small when the slope of the signal is low.
Fig 6c. The range of the amplitude error if the range of jitter is ±0.5 sample is much higher when the slope of the signal is high.

Another way to consider this is that, given a constant amount of jitter, the amplitude error (and therefore the distortion that is generated) modulates with the signal, and is proportional to the slope of the signal. Since the maximum slope of the signal increases with amplitude and with frequency, then jitter artefacts will also increase as a result of an increase in the signal level or its frequency.
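To put rough numbers on this, here is a small sketch that samples a sine wave at jittered times and compares the worst amplitude error against the maximum slope of the sine (2·π·f·A) multiplied by the largest timing error. Everything in it is an assumption made for illustration: the function names, the 48 kHz sampling rate, the 1 ns jitter range, and the uniform jitter distribution are mine, not values from the plots above.

```python
import math
import random

def max_jitter_error(freq_hz, amplitude, jitter_s,
                     sample_rate_hz=48000, n=4800, seed=1):
    """Largest amplitude error when every sample instant is shifted by a
    random offset drawn uniformly from [-jitter_s, +jitter_s]."""
    rng = random.Random(seed)
    worst = 0.0
    for i in range(n):
        t = i / sample_rate_hz
        dt = rng.uniform(-jitter_s, jitter_s)
        ideal = amplitude * math.sin(2 * math.pi * freq_hz * t)
        actual = amplitude * math.sin(2 * math.pi * freq_hz * (t + dt))
        worst = max(worst, abs(actual - ideal))
    return worst

def slope_bound(freq_hz, amplitude, jitter_s):
    """Maximum slope of A*sin(2*pi*f*t) multiplied by the largest timing error."""
    return 2 * math.pi * freq_hz * amplitude * jitter_s

for f in (1000, 10000):
    err = max_jitter_error(f, 1.0, 1e-9)
    print(f, err, slope_bound(f, 1.0, 1e-9))
```

With the same amount of jitter, the 10 kHz tone should show a worst-case error about ten times larger than the 1 kHz tone, and neither should exceed the slope-based bound. Doubling the amplitude doubles the bound in the same way.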

Fig 6. The blue curve is a sine wave to which I have applied excessive amounts of jitter with a Gaussian distribution. The red curve is the sample-by-sample error (the original signal subtracted from the jittered signal) plotted on a magnified scale. As can be seen, the level of the instantaneous error is proportional to the slope of the signal. So, the end result is that the noise generated by the jitter is modulated by the signal. (If you look carefully at the blue curve, you can see the result of the jitter – it’s vertically narrower when the slope is low – at the tops and bottoms of the curve.)

Thirdly, (and this one may be obvious): in an LPCM system, there are no jitter artefacts if there is no signal. If the input signal is constantly 0, then it doesn’t matter when you measure it… (Note that I said “in an LPCM system” in that sentence – if it’s a Delta-Sigma (1-bit) converter, then this is not true.)

There is one more thing to consider – although, given the level of jitter in real-life systems these days, this one is more of a thought experiment than anything else. Take a look back at Figure 3 – specifically, the samples that should have been taken at times 11 and 12. In a 44.1 kHz system, those two samples would have been taken 1/44100th of a second apart. However, as you can see there, the time between those two samples is less than 1/44100th of a second. If the sampling period is reduced, then the sampling rate must be higher than 44.1 kHz. This means that, ignoring everything else, the Nyquist frequency of the system is momentarily raised, allowing content above the intended Nyquist into the captured signal… However, as I said, this is merely an interesting thing to think about. Find something else to feed your free-floating anxiety that keeps you up at night – this issue is not worth a wink’s worth of lost sleep…

One extra thing to note here: If you look at Figure 3, you see a signal that has artefacts caused by jitter. Simply stated, this means that there are errors in the recorded signal. The way I’ve plotted this in Figure 3, those can be considered to be amplitude errors when played through a system without jitter. In other words, if you have a signal with jitter artefacts, you cannot remove them by using a system that has no jitter. The best you can do is to not add more jitter…

Addendum: This description of jitter artefacts as an amplitude distortion is only one way to look at the problem – using what is called the “Time-Domain Model”. Instead, you could use the “Frequency-Domain Model”, which I will not discuss here. If you’d like to dive into this further, Julian Dunn’s paper called “Jitter Theory” – Technical Note TN-23 from Audio Precision is the best place to start. This is a chapter in his book called “Measurement Techniques for Digital Audio”, published by Audio Precision. See this link for more info.

Typical Errors in Digital Audio: Wrapping up

This “series” of postings was intended to describe some of the errors that I commonly see when I measure and evaluate digital audio systems. All of the examples I’ve shown are taken from measurements of commercially-available hardware and software – they’re not “beta” versions that are in development.

There are some reasons why I wrote this series that I’d like to make reasonably explicit.

  1. Many of the errors that I’ve described here are significant – but will, in some cases, not be detected by “typical” audio measurements such as frequency response or SNR measurements.
    1. For example, the small clicks caused by skip/insert artefacts will not show up in a SNR or a THD+N measurement due to the fact that the artefacts are so small with respect to the signal. This does not mean that they are not audible. Play a midrange sine tone (say, in the 2 to 3 kHz region… nothing too annoying) and listen for clicks.
    2. As another example, the drifting time clock problems described here are not evident as jitter or sampling rate errors at the digital output of the device. These are caused by clocking problems inside the signal path. So, a simple measurement of the digital output carrier will not, in any way, reveal the significance of the problem inside the system.
    3. Aliasing artefacts (described here) may not show up in a THD measurement (since aliasing artefacts are not Harmonic). They will show up as part of the Noise in a THD+N measurement, but they certainly do not sound like noise, since they are weirdly correlated with the signal. Therefore you cannot sweep them under the rug as “noise”…
  2. Some of the problems with some systems only exist with some combinations of file format / sampling rate / bit depth, as I showed here. So, for example, if you read a test of a streaming system that says “I checked the device/system using a 44.1 kHz, 16-bit WAV file, and found that its output is bit-perfect”, then this is probably true. However, there is no guarantee whatsoever that this “bit-perfect-ness” will hold for all other sampling rates, bit depths, and file formats.
  3. Sometimes, if you test a system, it will behave for a while, and then not behave. As we saw in Figure 10 of this posting, the first skip-insert error happened exactly 10 seconds after the file started playing. So, if you do a quick sweep that only lasts for 9.5 seconds you’ll think that this system is “bit-perfect” – which is true most of the time – but not all of the time…
  4. Sometimes, you just don’t get what you’ve paid for – although that’s not necessarily the fault of the company you’re paying…

Unfortunately, the only thing that I have concluded after having done lots of measurements of lots of systems is that, unless you do a full set of measurements on a given system, you don’t really know how it behaves. And, it might not behave the same tomorrow because something in the chain might have had a software update overnight.

However, there are two more things that I’d like to point out (which I’ve already mentioned in one of the postings).

Firstly, just because a system has a digital input (or source, say, a file) and a digital output does not guarantee that it’s perfect. These days the weakest links in a digital audio signal path are typically in the signal processing software or the clocking of the devices in the audio chain.

Secondly, if you do have a digital audio system or device, and something sounds weird, there’s probably no need to look for the most complicated solution to the problem. Typically, the problem is in a poor implementation of an algorithm somewhere in the system. In other words, there’s no point in arguing over whether your DAC has a 120 dB or a 123 dB SNR if you have a sampling rate converter upstream that is generating aliasing at -60 dB… Don’t spend money “upgrading” your mains cables if your real problem is that audio samples are being left out every half second because your source and your receiver can’t agree on how fast their clocks should run.
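As a rough illustration of how small a clock disagreement it takes to lose a sample that often, here is a sketch of the arithmetic. The function names and the 48 kHz sampling rate are assumptions for the example, and it treats the problem as a simple constant offset between two clocks, which real systems may not follow exactly.

```python
def seconds_between_slips(sample_rate_hz, offset_ppm):
    """Time for two clocks that differ by offset_ppm (parts per million)
    to drift apart by one whole sample period."""
    samples_per_slip = 1e6 / offset_ppm
    return samples_per_slip / sample_rate_hz

def offset_ppm_for_slip_interval(sample_rate_hz, interval_s):
    """Clock offset, in ppm, that produces one sample slip every interval_s seconds."""
    return 1e6 / (sample_rate_hz * interval_s)

# One dropped (or repeated) sample every half second at 48 kHz:
print(f"{offset_ppm_for_slip_interval(48000, 0.5):.2f} ppm")  # 41.67 ppm
```

In other words, under these assumptions, two nominally 48 kHz clocks only need to disagree by about 2 Hz for a sample to be skipped or inserted twice a second.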

 

So, the bad news is that trying to keep track of all of this is complicated at best. More likely impossible.

 

On the other hand, if you do have a system that you’re happy with, it’s best to not read anything I wrote and just keep listening to your music…

Typical Errors in Digital Audio: Part 8 – The Weakest Link

As a setup for this posting, I have to start with some background information…

Back when I was doing my bachelor’s of music degree, I used to make some pocket money playing background music for things like wedding receptions. One of the good things about playing such a gig was that, for the most part, no one is listening to you… You’re just filling in as part of the background noise. So, as the evening went on, and I grew more and more tired, I would change to simpler and simpler arrangements of the tunes. Leaving some notes out meant I didn’t have to think as quickly, and, since no one was really listening, I could get away with it.

If you watch the short video above, you’ll hear the same composition played 3 times (the 4th is just a copy of the first, for comparison). The first arrangement contains a total of 71 notes, as shown below.

Fig 1. Arrangement #1 – a total of 71 notes.

The second arrangement uses only 38 notes, as you can see in Figure 2, below.

Fig 2. Arrangement #2 – a total of 38 notes. A reduction in “data” of 46%.

The third arrangement uses even fewer notes – a total of only 27 notes, shown in Figure 3, below.

Fig 3. Arrangement #3 – a total of 27 notes. A reduction in “data” of 62% compared to the original.

 

The point of this story is that, in all three arrangements, the piece of music is easily recognisable. And, if it’s late in the night and you’ve had too much to drink at the wedding reception, I’d probably get away with not playing the full arrangement without you even noticing the difference…

A psychoacoustic CODEC (Compression DECompression) algorithm works in a very similar way. I’ll explain…

If you do an “audiometry test”, you’ll be put in a very, very quiet room and given a pair of headphones and a button. In an adjacent room is a person who sees a light when you press the button and controls a tone generator. You’ll be told that you’ll hear a tone in one ear from the headphones, and when you do, you should push the button. When you do this, the tone will get quieter, and you’ll push the button again. This will happen over and over until you can’t hear the tone. This is repeated in your two ears at different frequencies (and, of course, the whole thing is randomised so that you can’t predict a response…)

If you do this test, and if you have textbook-quality hearing, then you’ll find out that your threshold of hearing is different at different frequencies. In fact, a plot of the quietest tones you can hear at different frequencies will look something like that shown in Figure 4.

 

Fig 4. A hand-drawn representation of a typical threshold of hearing curve.

 

This red curve shows a typical curve for a threshold of hearing. Any frequency that was played at a level that would be below this red curve would not be audible. Note that the threshold is very different at different frequencies.

Fig 5. A 1 kHz tone played at 70 dB SPL will obviously be audible, since it’s above the red line.

Interestingly, if you do play this tone shown in Figure 5, then your threshold of hearing will change, as is shown in Figure 6.

Fig 6. The threshold of hearing changes when an audible tone is played.

If you were not playing that loud 1 kHz tone, and, instead, you played a quieter tone just below 2 kHz, it would also be audible, since it’s also above the threshold of hearing (shown in Figure 7).

Fig 7. A quieter tone at a higher frequency is also audible.

However, if you play those two tones simultaneously, what happens?

Fig 8. The higher frequency quieter tone is not audible in the presence of the louder, lower-frequency tone.

 

This effect is called “psychoacoustic masking” – the quieter tone is masked by the louder tone if the two are reasonably close together in frequency. This is essentially the same reason that you can’t hear someone whispering to you at an AC/DC concert… Normal people call it being “drowned out” by the guitar solo. Scientists will call it “psychoacoustic masking”.

 

Let’s pull these two stories together… The reason I started leaving notes out when I was playing background music was that my processing power was getting limited (because I was getting tired) and the people listening weren’t able to tell the difference. This is how I got away with it. Of course, if you were listening, you would have noticed – but that’s just a chance I had to take.

If you want to record, store, or transmit an audio signal and you don’t have enough processing power, storage area, or bandwidth, you need to leave stuff out. There are lots of strategies for doing this – but one of them is to get a computer to analyse the frequency content of the signal and try to predict what components of the signal will be psychoacoustically masked and leave those components out. So, essentially, just like I was trying to predict which notes you wouldn’t miss, a computer is trying to predict what you won’t be able to hear…

This process is a general description of what is done in all the psychoacoustic CODECs like MP3, Ogg Vorbis, AC-3, AAC, SBC, and so on and so on. These are all called “lossy” CODECs because some components of the audio signal are lost in the encoding process. Of course, these CODECs have different perceived qualities because they all have different prediction algorithms, and some are better at predicting what you can’t hear than others. Also, depending on what bitrate is available, the algorithms may be more or less aggressive in making their decisions about your abilities.

 

There’s just one small problem… If you remove some components of the audio signal, then you create an error, and the creation of that error generates noise. However, the algorithm has a trick up its sleeve. It knows the error it has created, it knows the frequency content of the signal that it’s keeping (and therefore it knows the resulting elevated masking threshold). So it uses that “knowledge” to shape the frequency spectrum of the error to sit under the resulting threshold of hearing, as shown by the gray area in Figure 9.

Fig 9. The black vertical line is the content that is kept by the encoder. The red line is the resulting elevated threshold of hearing. The gray area is the noise-shaped error caused by the omission of some frequency components in the original signal.

 

Let’s assume that this system works. (In fact, some of the algorithms work very well, if you consider how much data is being eliminated… There’s no need to be snobbish…)

 

Now to the real part of the story…

Okay – everything above was just the “setup” for this posting.

For this test, I put two .wav files on a NAS drive. Both files had a sampling rate of 48 kHz, one file was a 16-bit file and the other was a 24-bit file.

On the NAS drive, I have two different applications that act as audio servers. These two applications come from two different companies, and each one has an associated “player” app that I’ve put on my phone. However, the app on the phone is really just acting as a remote control in this case.

The two audio server applications on the NAS drive are able to stream via my 2.4 GHz WiFi to an audio device acting as a receiver. I captured the output from that receiver playing the two files using the two server applications. (Therefore, four tests were run.)

Fig 10. A block diagram of the system under test.

The content of the signal in the two .wav files was a swept sine tone, going from 20 Hz to 90% of Nyquist, at 0 dB FS. I captured the output of the audio device in Figure 10 and ran a spectrogram of the result, analysing the signal down to 100 dB below the signal’s level. The results are shown below.
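If you want to try a similar test, a sweep like this can be generated in a few lines. This is only a sketch and not the code used to make the files described above: the function name, the 10-second default duration and the exponential (logarithmic) shape of the sweep are all my assumptions, since none of them is specified here. Only the 20 Hz start, the 90% of Nyquist end point and the full-scale peak level come from the description.

```python
import math

def exponential_sweep(sample_rate_hz=48000, duration_s=10.0,
                      f_start_hz=20.0, nyquist_fraction=0.9):
    """Sine sweep from f_start_hz up to nyquist_fraction * (sample_rate / 2).

    Samples are floats with a peak of 1.0, which stands in for 0 dB FS.
    The frequency rises exponentially with time (an assumed sweep shape).
    """
    f_end_hz = nyquist_fraction * sample_rate_hz / 2
    ratio = f_end_hz / f_start_hz
    # Phase of an exponential sweep: k * (ratio ** (t / T) - 1)
    k = 2 * math.pi * f_start_hz * duration_s / math.log(ratio)
    n = int(round(sample_rate_hz * duration_s))
    return [math.sin(k * (ratio ** (i / n) - 1.0)) for i in range(n)]

sweep = exponential_sweep(duration_s=1.0)
print(len(sweep))  # 48000
```

Writing the result to 16-bit and 24-bit WAV files would also need quantisation (and, ideally, dither), which is left out here.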

Fig 11. A spectrogram of the output signal from the audio device using “Audio Server SW 1” playing the 48 kHz, 16-bit WAV file. This is good, since it shows only the signal, and no extraneous artefacts within 100 dB.

 

Fig 12. A spectrogram of the output signal from the audio device using “Audio Server SW 1” playing the 48 kHz, 24-bit WAV file. This is also good, since it shows only the signal, and no extraneous artefacts within 100 dB.

 

Fig 13. A spectrogram of the output signal from the audio device using “Audio Server SW 2” playing the 48 kHz, 16-bit WAV file. This is also good, since it shows only the signal, and no extraneous artefacts within 100 dB.

Fig 14. A spectrogram of the output signal from the audio device using “Audio Server SW 2” playing the 48 kHz, 24-bit WAV file. This is obviously not good…

So, Figures 11 and 13 show the same file (the 16-bit version) played to the same output device over the same network, using two different audio server applications on my NAS drive.

Figures 12 and 14 also show the same file (the 24-bit version). As is immediately obvious, the “Audio Server SW 2” is not nearly as happy about playing the 24-bit file. There is harmonic distortion (the diagonal lines parallel with the signal), probably caused by clipping. This also generates aliasing, as we saw in a previous posting.

However, there is also a lot of visible noise around the signal – the “fuzzy blobs” that surround the signal. This has the same appearance as what you would see from the output of a psychoacoustic CODEC – it’s the noise that the encoder tries to “fit under” the signal, as shown in Figure 9… One give-away that this is probably the case is that the vertical width (the frequency spread) of that noise appears to be much wider when the signal is at a low frequency. This is because this plot has a logarithmic frequency scale, but a CODEC encoder “thinks” on a linear frequency scale. So, frequency bands of equal widths on a linear scale will appear to be wider in the low end on a log scale. (Another way to think of this is that there are as many Hertz from 0 Hz to 10 kHz as there are from 10 kHz to 20 kHz. The width of both of these bands is 10000 Hz. However, those of us who are still young enough to hear up there will only hear the second of these as the top octave – and there are lots of octaves in the first one. (I know, if we go all the way to 0 Hz, then there are an infinite number of octaves, but I don’t want to discuss Zeno today…))
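To put numbers on the linear-versus-logarithmic point, the short Python sketch below counts the octaves in two bands that are each roughly 10 kHz wide. The 20 Hz lower limit is an assumption made for this example (to avoid the 0 Hz problem mentioned above).

```python
import math

def octaves_between(f_low, f_high):
    """Number of octaves (doublings of frequency) between two frequencies."""
    return math.log2(f_high / f_low)

lower_band = octaves_between(20.0, 10000.0)     # about 10 kHz wide
upper_band = octaves_between(10000.0, 20000.0)  # exactly 10 kHz wide

print(f"20 Hz to 10 kHz:  {lower_band:.2f} octaves")
print(f"10 kHz to 20 kHz: {upper_band:.2f} octaves")
```

So two bands of almost the same width in Hertz occupy very different heights on a logarithmic plot: nearly nine octaves for the first, and one octave for the second.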

 

Conclusion

So, it appears that “Audio Server SW 2” on my NAS drive doesn’t like sending 24 bits directly to my audio device. Instead, it probably decodes the wav file, and transcodes the lossless LPCM format into a lossy CODEC (clipping the signal in the process) and sends that instead. So, by playing a “high resolution” audio file using that application, I get poorer quality at the output.

As always, I’m not going to discuss whether this effect is audible or not. That’s irrelevant, since it’s dependent on too many other factors.

And, as always, I’m not going to put brand or model names on any of the software or hardware tested here. If, for no other reason, this is because this problem may have already been corrected in a firmware update that has come out since I tested it.

The take-home messages here are:

  • an entire audio signal path can be brought down by one piece of software in the audio chain
  • you can’t test a system with one audio file and assume that it will work for all other sampling rates, bit depths and formats
    • Normally, when I run this test, I do it for all combinations of 6 sampling rates, 2 bit depths, and 2 formats (WAV and FLAC), at at least 2 different signal levels – meaning 48 tests per DUT
    • What I often see is that a system that is “bit perfect” in one format / sampling rate / bit depth is not necessarily behaving in another, as I showed above.
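The size of that test matrix can be sketched in a few lines of Python. The sampling rates, bit depths and formats are the ones listed in this posting; the two signal levels chosen here are just an assumption for the example, since “at least 2” levels could be any pair.

```python
from itertools import product

sampling_rates_khz = [44.1, 48, 88.2, 96, 176.4, 192]
bit_depths = [16, 24]
formats = ["WAV", "FLAC"]
levels_db_fs = [0, -60]  # assumption: any two signal levels would do

# Every combination of rate, bit depth, format and level is one test
test_matrix = list(product(sampling_rates_khz, bit_depths, formats, levels_db_fs))

print(len(test_matrix), "tests per DUT")
for rate, bits, fmt, level in test_matrix[:3]:
    print(f"{rate} kHz, {bits}-bit, {fmt}, {level} dB FS")
```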

So, if you read a test involving a particular NAS drive, or a particular Audio Server application, or a particular audio device using a file format with a sampling rate and a bit depth, and the reviewer says “This system worked perfectly”, you cannot assume that your system will also work perfectly unless all aspects of your system are identical to the tested system. Changing one link in the chain (even upgrading the software version) can wreck everything…

This makes life confusing, unfortunately. However, it does mean that, if something sounds wrong to you with your own system, there’s no need to chase down excruciating minutiae like how many nanoseconds of jitter you have at your DAC’s input, or whether the cat sleeping on your amplifier is absorbing enough cosmic rays. It could be because your high-res file is getting clipped, aliased, and converted to MP3 before being sent to your speakers…

Addendum

Just in case you’re wondering, I tested these two systems above with all 6 standard sampling rates (44.1, 48, 88.2, 96, 176.4, and 192 kHz), 2 bit depths (16 & 24). I also did two formats (WAV and FLAC) and three signal levels (0, -1, and -60 dB FS) – although that doesn’t matter for this last comment.

“Audio Server SW 2” had the same behaviour in the case of all sampling rates – 16-bit files played without artefacts within 100 dB of the 0 dB FS signal, whereas 24-bit files at all sampling rates exhibited the same errors as are shown in Figure 14.

 

Typical Errors in Digital Audio: Part 7 – Just a sec…

We’ve seen in a previous posting that timing errors can occur in wireless audio systems. As we saw there, the wrong way to deal with this is to simply drop or repeat samples when the receiver realises it’s out of synchronisation with the transmitter. A better way to do it is to smoothly drift the sampling rate to either catch up or slow down – although this causes the modern-day equivalent of “wow and flutter”, since variations in the sampling rate will cause pitch shifts at the output. The trick here is to make changes slowly so as to get away with it…

However, what I didn’t address in that posting was how bad the problem can be – I only talked about how not to correct the problem when you know you have one.

So, let’s do a different (but related) test. I made a signal that consists of “digital black” – a long string of zeros – and therefore silence. Then, I made a single-sample spike every second (for example, every 44100 samples in a 44.1 kHz sampling rate system). In order to not make anything unhappy, I gave the clicks a value of 0.5 – so nothing is close to overloading…

Then, I transmitted that signal to an audio device wirelessly and recorded its output.

Figure 1, below, shows the original signal on top, and the recorded output of the device under test (the “DUT”) on the bottom.

 

Fig 1. The top plot shows the original signal sent to the DUT using a wireless audio connection. The bottom plot shows the output of the DUT.

 

You may notice that there is a little noise in the bottom plot. This is because this particular DUT has an acoustical output, and the noise you see there (partly) is acoustical noise in the room and measurement system.

Note that this plot shows only the first 5 seconds of a test that actually ran for 10 minutes.

Then, I wrote a little Matlab script that finds the spikes in each signal, and counts the number of samples between spikes. So, in a system running at 44.1 kHz I would expect that there is 1 spike every 44100 samples – both at the input to the system (the original signal) and its output. In other words, I’m finding out how far apart those spikes are with a resolution of 1 sample.

So, I find the duration between clicks at the output of the DUT, convert from samples to milliseconds, and plot the error over the full 600 seconds (10 minutes) of the test. In theory, there is no error – and each duration is exactly 1 second ±0 ms. In practice, however, this is not true.
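The analysis described above can be sketched roughly as follows. This is a Python approximation of the idea, not the Matlab script that was actually used; the function names, the detection threshold and the simulated “DUT” (which simply loses 9 samples during the third second) are all assumptions made for this example.

```python
def make_click_track(fs, seconds, click_value=0.5):
    """Digital black with a single-sample click at the start of every second."""
    signal = [0.0] * (fs * seconds)
    for start in range(0, len(signal), fs):
        signal[start] = click_value
    return signal

def find_spikes(signal, threshold):
    """Indices where the signal first rises to or above the threshold."""
    spikes = []
    above = False
    for index, sample in enumerate(signal):
        now_above = abs(sample) >= threshold
        if now_above and not above:
            spikes.append(index)
        above = now_above
    return spikes

def interval_errors_ms(spike_indices, fs, expected_s=1.0):
    """Deviation of each click-to-click interval from expected_s, in ms."""
    errors = []
    for first, second in zip(spike_indices, spike_indices[1:]):
        errors.append(((second - first) / fs - expected_s) * 1000.0)
    return errors

fs = 44100
reference = make_click_track(fs, seconds=5)

# Hypothetical DUT: 9 samples go missing somewhere in the third second
dut_output = reference[:2 * fs + 100] + reference[2 * fs + 109:]

errors = interval_errors_ms(find_spikes(dut_output, threshold=0.25), fs)
print([round(e, 3) for e in errors])
```

In this simulated case, only the interval that contains the missing samples deviates from 1 second, and it is short by about 0.2 ms. As with the real measurement, the result says nothing about where inside that second the samples were lost.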

For this posting, I tested two commercially-available devices, transmitting from the same device.

Figure 2 shows the results for the first device. As you can see there, one second at the device’s input does not correspond to 1 second at its output. It drifts from a little under 999.7 ms to a little over 1000.2 ms. Note that, for this test, I don’t know from the measurement how that change takes place – whether it’s shifting slowly or using a skip/insert strategy. I just know one version of how bad the problem is over time on a second-by-second basis.

Fig 2. The deviation (in milliseconds) from the expected 1-second interval between spikes in the audio signal at the output of the DUT.

 

Figure 3, below, shows the same analysis for another device. Notice that there are three colours in this plot, corresponding to three separate tests of the same device…

Fig 3. Three tests of a second device, showing the deviation from a 1-second interval between clicks at the output.

As you can see there, this device seems to be behaving most of the time, but occasionally gets a little lost and jumps by up to about ±70 ms in a worst case. This means that, for this test, we can see that “1 second” can last anything between about 930 ms and 1070 ms. Note that this analysis doesn’t show what happens at the moment (or during the time) that jump occurs – we only know that it has happened sometime between clicks at the output.

You may be wondering why the plot in Figure 2 is more “jagged” than the one in Figure 3. This is mostly because the scale of the two plots is so different. If we were to zoom in to the plot in Figure 3, we would see that it is roughly as busy, as is shown below in Figure 4.

Fig 4. The same information shown in Figure 3, zoomed in on the vertical scale.

 

One significant difference between these two devices is that the first has an acoustical output and the second has an electrical output. This may cause you to wonder whether the acoustical noise in the first measurement contributes to the error. This may be possible. However, a 0.2 ms (or 200 µs) error is roughly equivalent to 9 samples at 44.1 kHz (or a 6.9 cm shift in distance between the DUT and the microphone). This is well outside the range of the error generated by acoustical noise – so that cannot be held responsible as being the only contributor to the error measurement.
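The arithmetic behind those two equivalents is simple enough to check in a few lines of Python. The speed of sound used here (344 m/s) is an assumption for this example; it changes with temperature.

```python
error_s = 0.2e-3            # a 0.2 ms timing error
fs = 44100                  # sampling rate in Hz
speed_of_sound_m_s = 344.0  # assumption: roughly room temperature

error_in_samples = error_s * fs
error_in_cm = error_s * speed_of_sound_m_s * 100.0

print(f"{error_in_samples:.2f} samples")  # a little under 9 samples
print(f"{error_in_cm:.2f} cm")            # a little under 7 cm
```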

I should say that the wireless audio protocol that was used for these two tests was the same… So, this is not a comparison of two different transmission systems. Also, as I mentioned above, the transmitter was the same for both DUT’s. So, the differences in the results here are attributable to the skill and attention to the execution of the manufacturers of the two receiving devices.

As always, don’t bother asking which devices these DUT’s are. I’m not telling – primarily because it doesn’t matter. I’m just using these two devices as examples of errors I often see when I measure audio equipment…

 

One additional thing that might be of interest to geeks like me. That second DUT has a digital audio output, which is what I used to capture its signal. Interestingly, when I measure the sampling rate of that output with a digital audio signal analyser, the sampling rate is typically within 2 ppm of the correct frequency. So, ignoring the big spikes in Figure 3 (which are probably the result of buffer over- or under-runs), if the timing errors we see in Figure 4 were solely caused by a clock error that was visible on the digital audio output, then we should see deviations of no more than approximately 2 microseconds per second. Instead, we see changes on the order of 1 to 2 milliseconds per second, which indicates a sample rate drift of 1000 to 2000 ppm… So, this means that, although the sampling rate of my transmitter and the output sampling rate of my receiver (the DUT) are nominally the same, AND there is very low jitter / error on the DUT’s output sampling rate, something else in the audio signal path is causing this error. In other words, a simple measurement of the digital output’s sampling rate is not adequate to verify that the DUT’s clock is behaving.
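The conversion between “parts per million” and timing error per second is worth spelling out, since it is the basis of the comparison above. This is a minimal Python sketch; the function names are just for illustration.

```python
def drift_seconds(ppm, interval_s=1.0):
    """Timing error accumulated over interval_s by a clock that is off by ppm."""
    return ppm * 1e-6 * interval_s

def ppm_from_drift(drift_s, interval_s=1.0):
    """Clock error in ppm implied by drift_s of error over interval_s."""
    return drift_s / interval_s * 1e6

print(drift_seconds(2) * 1e6, "microseconds per second for a 2 ppm error")
print(ppm_from_drift(1e-3), "ppm for 1 ms per second")
print(ppm_from_drift(2e-3), "ppm for 2 ms per second")
```

In other words, the measured deviations are about 500 to 1000 times larger than the error visible on the digital output’s sampling rate.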

 

Typical Errors in Digital Audio: Part 6 – Aliasing

In a previous posting, I tried to explain the concept of aliasing. The easiest way to illustrate this is to try to sample an audio signal that has a frequency that is higher than the Nyquist frequency – one half of the sampling rate. If you do this, then the signal that will come out of your digital audio system will have a different frequency than the original signal. In fact, it will be the Nyquist frequency minus the difference between the original signal and the Nyquist frequency.

For example, if we have an LPCM audio system that has a sampling rate of 48 kHz, then its Nyquist frequency is 24 kHz. If you allow any audio signal to be sampled by that system, and you record a sine wave with a frequency of 30 kHz, then the signal that will be played back by the system will be

Nyquist – (signal freq – Nyquist)
24 kHz – (30 kHz – 24 kHz)
24 kHz – 6 kHz
18 kHz

Digging a little deeper

The example I gave above is only part of the story. It’s the part of the story that’s told because it’s easy to tell, and relatively easy to grasp. However, let’s look into this a little more…

If I ask you “what is the square root of 4?” you’ll probably say that the answer is “2”. However, this is also only part of the story. The square root of 4 is also -2, since -2 * -2 = 4. So, there are two correct answers to the question – in other words, both answers exist and are equally valid.

Aliasing is somewhat similar. If we manage to get a 30 kHz sine wave into an LPCM recording system with a sampling rate of 48 kHz, we will appear to have recorded an 18 kHz sine wave. However, the samples that we have captured are also equally valid for the original 30 kHz sine wave. In fact, both the 18 kHz and the 30 kHz tones can be thought of as being equally valid answers to the set of samples we recorded.

This means that, if I record an 18 kHz sine tone in the 48 kHz system, we can consider the 30 kHz sine tone to also exist simultaneously, inside the digital domain.
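This “equally valid” idea can be checked numerically. The Python sketch below samples an 18 kHz tone and a 30 kHz tone at 48 kHz and compares the resulting sample values. It uses cosine waves because they produce exactly the same samples; with sine waves, the two sets of samples have the same magnitudes but opposite polarity.

```python
import math

fs = 48000

def sampled_cosine(freq_hz, n_samples):
    """Sample values of a cosine wave at freq_hz, sampled at fs."""
    return [math.cos(2.0 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

tone_18k = sampled_cosine(18000, 16)
tone_30k = sampled_cosine(30000, 16)

max_difference = max(abs(a - b) for a, b in zip(tone_18k, tone_30k))
print("largest difference between the two sets of samples:", max_difference)
```

The only differences are rounding errors in the floating-point math, so once the tones have been sampled, there is no way to tell which of the two was the original.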

Oddly, this is also true at other frequencies. So, you do not only get a mirror effect around the Nyquist, but you also get it at 1.5 times the sampling rate (or the sampling rate + Nyquist).

I won’t go into this any deeper for now – but if you want to continue, the section on “Folding” at the Wikipedia page on Aliasing is a good place to start.
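For anyone who would rather see the folding as arithmetic, here is a small Python sketch of a general “which frequency will this appear as?” calculation. The function name is made up for this example.

```python
def alias_frequency(freq_hz, fs_hz):
    """Frequency between 0 and Nyquist at which freq_hz appears when sampled at fs_hz."""
    folded = freq_hz % fs_hz       # the spectrum repeats every fs_hz
    if folded > fs_hz / 2:         # the upper half mirrors around Nyquist
        folded = fs_hz - folded
    return folded

fs = 48000
for freq in (18000, 30000, 66000, 78000):
    print(freq, "Hz appears as", alias_frequency(freq, fs), "Hz")
```

Here, 66 kHz and 78 kHz sit 6 kHz below and 6 kHz above 72 kHz (1.5 times the sampling rate), and both of them land on 18 kHz, just as 30 kHz does.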

 

Normally, we try to prevent audio signals with frequency content higher than the Nyquist frequency from getting into an LPCM system. This is done by low-pass filtering the audio signal to eliminate any content that might cause aliasing. That’s why the low-pass filter at the input of an analogue-to-digital converter is called an anti-aliasing filter. (At least, that’s the theory. In reality, the anti-aliasing filter of many ADC’s allows a little signal to get through above Nyquist…)

However, what happens if you create signals with a frequency above the Nyquist within the digital domain? Is this possible? Can it happen accidentally?

The short answer to this question is “yes”.

For example, let’s take a sine wave with a frequency of 2212 Hz (this is an arbitrary number… it could have been something else…), record it with an LPCM system with a sampling rate of 48 kHz. Then, after the signal is in the digital domain, I clip it at 85% of the peak value, so it looks like the waveform shown in Figure 1.

Fig 1. A sine wave that has been symmetrically clipped at 85% of its peak value.

By clipping the sine wave symmetrically (meaning that we have made the same change in the wave’s shape on the top and the bottom), we create odd-order harmonics. This means that, when we look at the spectrum of the signal’s frequency content, we will see energy at the fundamental frequency (the original sine wave’s frequency) and also peaks at 3x, 5x, 7x, 9x, that frequency – and so on.

(If I had clipped only on the top or the bottom, and therefore made asymmetrical distortion, we would see energy in the even-order harmonics at 2x, 4x,  6x, 8x, the fundamental frequency – and so on.)
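The odd-versus-even behaviour can be verified with a short Python sketch. To keep the math exact, this example uses a 1 kHz sine wave (48 samples per cycle at 48 kHz) instead of the 2212 Hz tone, and measures individual harmonics with a single-bin DFT; both choices, and the function names, are assumptions made for this illustration.

```python
import math

def clip(samples, ceiling, floor):
    """Hard-clip every sample to the range floor..ceiling."""
    return [min(max(x, floor), ceiling) for x in samples]

def harmonic_magnitude(samples, cycles_in_buffer, harmonic):
    """Amplitude of one harmonic, using a single-bin DFT.

    The buffer must contain a whole number of cycles of the fundamental.
    """
    n_total = len(samples)
    k = cycles_in_buffer * harmonic
    re = sum(x * math.cos(2.0 * math.pi * k * n / n_total) for n, x in enumerate(samples))
    im = sum(x * math.sin(2.0 * math.pi * k * n / n_total) for n, x in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n_total

samples_per_cycle = 48  # 1 kHz at a 48 kHz sampling rate
cycles = 10
sine = [math.sin(2.0 * math.pi * n / samples_per_cycle)
        for n in range(samples_per_cycle * cycles)]

symmetric = clip(sine, ceiling=0.85, floor=-0.85)  # clipped top and bottom
asymmetric = clip(sine, ceiling=0.85, floor=-1.0)  # clipped on the top only

for name, signal in (("symmetric", symmetric), ("asymmetric", asymmetric)):
    second = harmonic_magnitude(signal, cycles, 2)
    third = harmonic_magnitude(signal, cycles, 3)
    print(f"{name}: 2nd harmonic = {second:.4f}, 3rd harmonic = {third:.4f}")
```

The symmetrically-clipped signal has a 2nd harmonic that is zero (apart from rounding errors), whereas clipping only the top of the waveform produces a clearly measurable 2nd harmonic.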

So, let’s look at the frequency content of the clipped signal shown in Figure 1. This is shown in Figure 2, below.

Fig 2. The frequency content of the signal shown in Figure 1. Notice that the harmonics are all at frequencies that are the fundamental frequency (2212 Hz) multiplied by an odd number, as is explained above.

As you can see in Figure 2, we are expecting to see harmonics that extend (at least in this plot) up to 37604 Hz (or 17 x 2212 Hz). Of course, there are harmonics that go higher than this – but they aren’t visible in this plot because I’m only plotting signals with a level down to -60 dB FS.

You may notice that the width of the plot at 2212 Hz increases at the bottom. This is just an artefact of the math being done to find the frequency components in the signal. That spread in the frequency domain isn’t actually in the signal itself, so it can be ignored.

As I said above, the signal was clipped in the digital domain, in an LPCM system running at 48 kHz. So, just for reference, I’ve put in blue lines in Figure 2 that show the sampling rate and the Nyquist frequency – one half the sampling rate.

So: we can see that some of the artefacts created by clipping the signal are sitting at frequencies above the Nyquist frequency in this system. This means that this content will be “mirrored” or “folded down” or – more correctly – aliased to other frequencies below the Nyquist frequency. For example, the harmonic at 24332 Hz will be mirrored to 23668 Hz, according to the following math:

Nyquist – (signal freq – Nyquist)
24000 – (24332 – 24000)
24000 – 332
23668 Hz

So, looking at the top 60 dB of the signal content (shown in Figure 3): the resulting actual output of the LPCM signal will contain:

  1. the original fundamental frequency at 2212 Hz
  2. four harmonics of that frequency (shown as the other red numbers in Figure 3), and
  3. four more frequencies that are not harmonically related to the fundamental (the blue numbers)

 

Fig 3. The frequency content of the actual output of the signal from an LPCM system with a 48 kHz sampling rate. The frequencies indicated in blue are the aliased artefacts.
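The frequencies in that list can be calculated directly. The Python sketch below takes the odd-order harmonics of 2212 Hz up to the 17th (the highest one visible in the top 60 dB of the plot) and works out which of them stay where they are and which are aliased in a 48 kHz system. The function name is made up for this example.

```python
def alias_frequency(freq_hz, fs_hz):
    """Frequency between 0 and Nyquist at which freq_hz appears when sampled at fs_hz."""
    folded = freq_hz % fs_hz
    if folded > fs_hz / 2:
        folded = fs_hz - folded
    return folded

fs = 48000
fundamental = 2212

in_band = []  # harmonics below Nyquist: they stay where they are
aliased = []  # (original frequency, frequency after aliasing)
for order in range(1, 18, 2):  # odd orders 1, 3, 5, ... 17
    harmonic = order * fundamental
    if harmonic <= fs / 2:
        in_band.append(harmonic)
    else:
        aliased.append((harmonic, alias_frequency(harmonic, fs)))

print("harmonically related:", in_band)
print("aliased:", aliased)
```

This gives the fundamental plus four harmonics (6636, 11060, 15484 and 19908 Hz), and four aliased components at 23668, 19244, 14820 and 10396 Hz, none of which is a multiple of 2212 Hz.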

As you may already know, an LPCM system has a low-pass filter at its output stage – part of the system that is used to convert the signal back to an analogue output. That low pass filter typically has a cutoff frequency around the Nyquist frequency of the system. However, the artefacts that we have created here have aliased down to frequencies below the Nyquist within the digital domain – so, by the time the signal reaches the low pass filter at the output (known as a “reconstruction filter”) they’re already in the audio band, and therefore they’re not filtered out.

So, as we can see in this rather simple example: it is easily possible that a digital audio system that has some processing (specifically “non-linear” processing) can create harmonics that are higher than the Nyquist frequency and will have “aliases” below the Nyquist frequency, and therefore will not be removed by an anti-aliasing filter.

Since the aliased artefacts are not harmonically related to the fundamental frequency, they are more easily audible than “normal” distortion artefacts that generate harmonically-related artefacts. There are a couple of reasons for this, but the most obvious one can be demonstrated by sweeping the frequency of the fundamental. If the artefacts are harmonically related, then as the fundamental frequency of the signal goes up, so do the artefacts. However, if the artefacts are the result of aliasing, then as the fundamental frequency of the signal goes up, some of the artefacts go down in frequency, which sounds quite strange…

The example I gave above (of clipping) is just one way to create distortion that generates harmonically-related artefacts that alias in the system. Lots of different processes can create those artefacts. One of the usual suspects is a poorly-made sampling rate converter.

Many systems use sampling rate converters for different reasons. For example, if you have a loudspeaker or processor that has a lot of filtering in its processing chain, the best architecture is to run the digital signal processing (the DSP) at a constant (or “fixed”) sampling rate, regardless of the sampling rate of the incoming signal. This is because, if you were to change sampling rates in the DSP to match the incoming signal, you would have to load an entirely new set of coefficients (a fancy word that basically means “multiplication values inside the digital filters”) into the processor. This takes some time, and you don’t want to miss the first part of the song every time the sampling rate changes while you’re waiting to load a bunch of new coefficients into your filters… So, instead, the smart thing to do is to keep the DSP running at a constant rate, and sample rate convert all incoming signals to the internal sampling rate. This way, there’s no dropout at the start of the song.

However, you have to be careful if you do this, since a poorly-made sampling rate converter will certainly create aliasing artefacts.

In part 5 of this series of postings, I described one kind of test that can be made on an audio system. This test consists of sending a sine wave with a swept frequency into the system and recording its output. You then do a spectrogram of the output, looking for signals at frequencies other than the one you sent in.

To get an idea of what aliasing will look like in this plot, I made a DSP algorithm that creates the same kinds of artefacts. The resulting plot is shown in Figure 4, below. (Remember that this is a measurement of a system that I made to intentionally generate similar artefacts to aliasing – this isn’t actually the output of a system that is aliasing).

 

Fig 4. An example of an analysis of a system that has the same kind of artefacts as a system that is aliasing. Notice that, as the original signal increases in frequency (the straight diagonal line), some of the aliasing artefacts decrease in frequency.

Now that you know what to look for in the plot, let’s look at the measurements of some commercially-available systems. Figure 5, below, is a measurement of a system that has two problems. One can be seen as the vertical lines – these are “skip/insert” artefacts that I described in an earlier posting. The aliasing artefacts can also be seen in this plot. Note that, in this case, the input and output of the system are both digital connections to my measurement equipment.

 

Fig 5. An example of aliasing (and skip insert artefacts) in a commercially-available digital audio system. The original signal was a 48 kHz, 16-bit .wav file.

 

If I send a signal at a different sampling rate into the same system, I get a different behaviour. This is not unusual in systems with sampling rate converters. In this plot, you can see the skip/insert artefacts (the vertical stripes), the aliasing artefacts, and the obvious band-limiting of the system. Notice that nothing above about 24 kHz comes out of the system, which would mean that, internally, it is probably running at a sampling rate of 48 kHz. (The input signal in this measurement was at 192 kHz and my analysis system was running at 96 kHz.)

Fig 6. An example of aliasing (and skip insert artefacts) in the same commercially-available digital audio system as shown in Figure 5. The original signal was a 192 kHz, 16-bit .wav file.

Let’s look at another system. In this case, I put a 48 kHz, 16-bit .flac file on a hard drive, and played it through another digital audio system, again capturing its digital output. The result of this is shown in Figure 6, below.

Fig 6. An example of a lack of aliasing in another commercially-available digital audio system. The original signal was a 48 kHz, 16-bit .flac file.

As you can see in Figure 6, this system is behaving very well in this particular test. I see the nice, clean signal with only one frequency at only one time. No artefacts down to 100 dB below the signal level. This is good.

Now let’s test exactly the same system, at exactly the same sampling rate, again with a .flac file – but this time with a 24-bit word length in the file. The result of this is shown in Figure 7.

Fig 7. Artefacts in the same commercially-available digital audio system as shown in Figure 6. The original signal was a 48 kHz, 24-bit .flac file.

So, by going from a 16-bit file to a 24-bit file, this system obviously behaves very, very differently. It now has harmonic distortion (the straight diagonal lines running parallel to the fundamental frequency), aliasing of those harmonics when they go beyond 24 kHz, and strange noises as well (the large area of blue blobs in the lower left corner, and surrounding the fundamental frequency all the way up).

Those “strange noises” –  the blobs – are probably artefacts caused by a lossy codec similar to MP3. Typically, systems like this are built to reduce the data rate of the audio signal by trying to predict what you can’t hear in the signal – and leaving that out. In doing so, they create errors that produce noise, so the encoder tries to shape that noise so that it “hides” under the signal that it keeps. The end result looks something like the blobs shown in Figure 7… For a more thorough discussion of this, see this posting.

So, based only on the information from this test, we can guess that the system might be decoding the 24-bit file, “transcoding” it to a lossy format, and transmitting that through the system. However, this is just a guess based on one test… So it could easily be wrong.

One thing we can conclude, however, is that the 48 kHz / 16-bit file behaves MUCH better than a 48 kHz / 24-bit file in this system… So, in this particular case, a higher resolution is not necessarily better…

I should also point out that the digital output of that system was capable of outputting 24 bits. The reason I’m pointing this out is that many people think that if a system or device has a digital output, then it is good. This is too simple a conclusion to make, because, as I’m trying to illustrate with this series of postings, the “weak link” in the chain is very likely NOT the physical output of the system. It’s more likely some part of the processing in the DSP chain (for example, a poorly-made sampling rate converter that aliases) or a poorly-implemented clocking system (for example, a skip/insert strategy).

For more aliasing fun…

If you’re intrigued by this, and you’d like to compare the aliasing caused by other sampling rate converters, I’d recommend checking out the page at http://src.infinitewave.ca. They plot the signals with a linear frequency scale instead of a logarithmic one. Consequently, the sweep of the fundamental looks like a curve (instead of the straight lines in my plots) but the harmonic distortion and aliasing artefacts are easier to see as being related to the fundamental.

 

Addendum – 2018/05/14

Last week, I ran a quick test on another commercially-available device – this time, a stand-alone audio file player with a digital output. I was running the test using a 44.1 kHz, 16-bit FLAC file, but the device had a 48 kHz output. The interesting thing about this one was that the artefacts that showed up were almost exclusively aliasing errors. So, I thought it would be interesting to show the plots here.

Figure 8. The results of a test on a different commercially-available audio file player with a digital output. The original file was a 44.1 kHz, 16 bit FLAC file with a signal level of 0 dB FS. The output of the device was running at 48 kHz. The aliasing artefacts are easily visible, even within 50 dB of the signal level!

 

Figure 9. The same output as was shown in Figure 8, but in this plot, with an analysis depth of 100 dB.

 

Typical Errors in Digital Audio: Part 5 – What time is it there?

I’m originally from Newfoundland – one of the few places in the world with a 1/2-hour time zone. So, when it’s 10:00 a.m. in Montreal, it’s 11:30 a.m. in St. John’s – my home town. This meant that, when I was a kid 40 years ago, and we would call our relatives in Toronto or Germany to wish them a Merry Christmas, there were two questions that you could always rely on being asked: (1) what’s the weather like there? and (2) what time is it there?

These days, I have a similar problem that is well-described by “Segal’s Law”. My iPhone and my wristwatch (an old analogue one with hands that go around pointing at the floor and the fridge…) are never synchronised… This is because of two things: (1) I probably did a bad job of setting my watch and (more importantly) (2) my watch runs just a little bit slowly…

So, let’s say, for example, that I set my watch to be EXACTLY in sync with my phone on a Monday morning at 9:00 a.m. As the week goes by, my iPhone and my watch drift apart, and, just for the sake of argument let’s say that, one week later, when my iPhone turns over to 9:00 a.m. on Monday morning, my wristwatch turns over to 8:59 a.m. So, I lose 1 minute per week on my watch.

(It’s pretty safe to assume that my iPhone is also not perfect – but it’s different because, every once in a while, it compares its internal clock with another, more accurate clock somewhere else via a connection across the Internet (which, we will assume, for the purposes of this discussion, works).)

Let’s consider this from a strange point of view. Let’s assume that

  • I’m checking the time on my watch every minute, on the minute
  • someone else is “fixing” my watch every week so that it’s correct at 9:00 a.m. on Mondays. They do this by adjusting the watch to the correct time 30 seconds before the iPhone says it’s 9:00 a.m.
  • I don’t know that they’re doing this for me…

If we think about this from my perspective, I’ll live in a strange world where 8:59 on Mondays never exists. This is because at 8:58 and 30 seconds (on my watch), my friend re-sets the time to 8:59 and 30 seconds (while I’m not looking) to synchronise with the iPhone…

 

If my watch was running fast – say, gaining one minute each week, then I would live in a different strange universe where 9:00 happens twice every Monday morning…

 

The basic problem here is that we have two clocks that do not run at the same rate – but they are expected to do so. So, we synchronise them regularly (in the above example, on Monday mornings at 9:00) – but between those synchronisation events, they drift apart in time.

 

So what?

The example above is very, very similar to the way a digital audio streaming system works – especially if you’re using a wireless connection between the transmitting device and a receiver.

Let’s say that you’re playing a sound file that was recorded at 44.1 kHz and streaming it wirelessly to a receiver. I’m trying to be as generic as possible here, but I could be talking about a Bluetooth connection to a pair of headphones or a WiFi connection via DLNA to a device connected to a pair of loudspeakers, for example…

It is not unusual with such a connection for the transmitter to collect up a block of audio samples – say, 64 of them – and send them to the receiver’s input buffer. The receiver then pulls those samples out, one by one, and (eventually) sends them to a digital-to-analogue converter that produces a signal that (eventually) comes out as an audio signal. Then, 64/44100’ths of a second later (64 samples later) the transmitter sends another block, and so on and so on until the song ends.

This system works well if the clock inside the transmitter and the clock inside the receiver are perfectly synchronised. We can even be a little generous and say that they can drift apart a little – but not so much that we either run out of samples to play (because the receiver is playing them out faster than they’re coming in from the transmitter) or that we have samples left over to play when the next block comes in (because the receiver is playing them out slower than they’re coming in from the transmitter).
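To get a feeling for the time scales involved, here is a small Python sketch. The block length and sampling rate are the ones in the example above; the 100 ppm mismatch between the two clocks is an assumption chosen for this illustration, not a measurement of any device.

```python
def block_period_ms(block_length, fs):
    """Time between blocks, in milliseconds."""
    return block_length / fs * 1000.0

def seconds_per_sample_of_drift(fs, mismatch_ppm):
    """How long it takes two clocks to drift apart by one sample."""
    drift_samples_per_second = fs * mismatch_ppm * 1e-6
    return 1.0 / drift_samples_per_second

fs = 44100
print(f"A 64-sample block arrives every {block_period_ms(64, fs):.3f} ms")
print(f"With a 100 ppm mismatch, the clocks drift apart by one sample "
      f"every {seconds_per_sample_of_drift(fs, 100):.3f} s")
```

So, with that (assumed) mismatch, the receiver gains or loses a sample roughly every quarter of a second, which means that the problem has to be dealt with somehow.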

 

Dealing with this problem the right way

The right way to deal with this issue is for the receiver to always be checking what time it thinks it is when the block arrives from the transmitter. If the block arrives a little early, then the receiver should think “hmmmm, my clock is going too slowly – I’ll speed it up a bit”. If the block arrives a little late, then the receiver should adjust its clock to go a little slower.

So, in this case, the receiver has a basic, nominal speed for its internal clock, but it’s constantly adjusting it to be faster and slower to try to match the clock of the transmitter. However, it can only do this adjustment at the block rate: the frequency at which the blocks of samples arrive, which is dependent on the block length (how many samples are in each block) and the sampling rate (how many samples per second). (Of course, this can result in “jitter and wander” problems if you’re not careful, which I won’t talk about here, so you have to pay a little attention to how quickly you’re adjusting your clock rate… but that’s “just” a matter of correct implementation.)
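The idea of nudging the receiver's clock once per block can be sketched as a very simple simulation. This is only a toy model under my own assumptions: the receiver judges "early" or "late" by looking at how full its buffer is when a block arrives, and it uses a plain proportional correction. The function name, the gain, and the 50 ppm mismatch are all hypothetical, and a real device would need a more careful design to avoid adding jitter.

```python
def simulate_clock_tracking(tx_rate_hz=44100.0 * (1 + 50e-6),
                            rx_nominal_hz=44100.0,
                            block_len=64,
                            target_fill=128.0,
                            gain=10.0,
                            n_blocks=5000):
    """Toy model of a receiver that adjusts its clock once per block."""
    fill = target_fill                      # samples waiting in the buffer
    rx_rate = rx_nominal_hz                 # receiver's current clock rate
    block_period = block_len / tx_rate_hz   # seconds between block arrivals
    for _ in range(n_blocks):
        # Between arrivals, the receiver plays out samples at its current rate.
        fill -= rx_rate * block_period
        # Then a new block lands in the buffer.
        fill += block_len
        # A fuller buffer than expected means the receiver is too slow,
        # so it speeds up a little (and the other way around).
        rx_rate = rx_nominal_hz + gain * (fill - target_fill)
    return rx_rate, fill


rx_rate, fill = simulate_clock_tracking()
print("receiver rate:", rx_rate, "Hz, buffer fill:", fill, "samples")
```

In this model the receiver's rate settles at the transmitter's rate, and the buffer never runs dry or overflows. The gain sets how quickly that happens, which is exactly the "how quickly you're adjusting your clock rate" trade-off.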

 

Dealing with this problem the wrong way

There is another way to deal with this problem, which, unfortunately, has measurable and possibly audible consequences. This implementation is basically the same as my original example, where I had a friend “fixing” my wristwatch once a week. You have a transmitter that sends blocks of samples to the receiver – and although these two devices should have exactly the same clock rate, they don’t.

Let’s say, for example, that the receiver is playing the samples faster than they’re being sent by the transmitter. This means that the two will slowly drift farther and farther apart until, eventually, the receiver will have to play a sample, but nothing has come in from the transmitter yet, so there’s no sample there to play. In this case, the receiver says “no problem, I’ll just play the last sample again, and the next block will come in while I’m doing that” – so it inserts an extra sample that is just a duplicate of the previous one.

If the receiver’s clock is going slower than the transmitter’s, then, as the two drift farther apart, we will get to a moment where the receiver will receive a new block of samples but it’s not done playing all of the samples in the previous block yet. In this event, it says “no problem, I’ll just leave that last sample out and move on to the next block to catch up” – so it skips a sample.

This is called a “Skip / Insert” strategy for dealing with clock synchronisation. It’s done by software and hardware engineers because it’s simple to implement, and, in many cases, a manufacturer can get away with this, since it is rarely audible, for a couple of reasons that are explained below.

Can this be measured?

The simple answer to this is “yes” – and it can be measured in a number of different ways. I’ll show one way below…

Can I hear it?

The honest answer to this question is “sometimes” – but it’s not as easy to detect as one might think. Of course, a skip/insert event (a duplicated sample or a dropped one) creates an artefact. However, the magnitude of this artefact relative to the “correct” signal is dependent on when it happens.

Let’s take a look at a couple of simple cases. We’ll “transmit” one period of a sine wave that should come out on the other side of the system looking like Figure 1.

Fig 1: The original signal that we want to transmit

But what happens if we don’t get a block in time to keep outputting a signal? We insert a duplicate sample and hope that the block comes in before we have to send out another one. Examples of this are shown in Figures 2 and 3, below.

Fig 2: Insert example 1
Fig 3: Insert example 2

You’ll probably notice that it’s much easier to see which sample I duplicated in Figure 3 than in Figure 2. In Figure 3 it was sample number 26 that was duplicated. In Figure 2 it’s sample number 13.

The reason it’s easier to see the error in Figure 3 is that duplicating the sample causes an obvious change in the slope of the signal, whereas in Figure 2 it does not – the slope of the signal is 0, and by duplicating a sample, I am also making it 0 – but for a slightly longer time.

This does not mean that we did not generate an error. It just means that we’ll probably “get away with it” in the case of Figure 2, and we probably won’t in the case of Figure 3.
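If you want to put a number on "how obvious" a duplicated sample is, one rough way is to look at the biggest change in slope in the signal. The sketch below is my own illustration, not the code used to make the figures: it assumes one period of a sine wave that is 50 samples long, and the indices 12 (near the peak) and 25 (at the zero crossing) are counted from 0, so they will not necessarily match the sample numbers in the plots.

```python
import numpy as np


def insert_duplicate(signal, index):
    """Return a copy of the signal with the sample at `index` played twice."""
    return np.insert(signal, index + 1, signal[index])


def worst_kink(signal):
    """Largest absolute second difference: a rough measure of slope change."""
    return np.max(np.abs(np.diff(signal, n=2)))


n = np.arange(50)
sine = np.sin(2 * np.pi * n / 50)            # one period, 50 samples (assumed)

at_peak = insert_duplicate(sine, 12)           # slope is close to 0 here
at_zero_crossing = insert_duplicate(sine, 25)  # slope is at its steepest here

print("original:        ", worst_kink(sine))
print("insert at peak:  ", worst_kink(at_peak))
print("insert at zero:  ", worst_kink(at_zero_crossing))
```

With these assumptions, the insert at the peak produces a kink that is no bigger than what is already in the sine wave, whereas the insert at the zero crossing produces one that is several times larger.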

However, since the drifting of the two clocks (in the receiver and transmitter) is not dependent on the signal, there’s no way to know when this is going to happen.

And, of course, if this happens in the middle of a snare drum hit or a ssssinger sssstarting a word in a ssssong with the letter “s” – then we also won’t hear it because there’s so much going on (frequency-wise) that the artefact will be buried in the mess.

Also, since this clock drifting is usually not completely regular, the errors do not usually come in at a regular rate (although I’ve seen exceptions…). So, it’s not like you can listen for “a click every second” or “one per minute”. They happen when they happen – hopefully when you’re not listening and/or when the tune is busy enough to hide it.

 

A skip event is similar to an insert, as you can see in the two examples in Figures 4 and 5.

Fig 4: Skip example 1
Fig 5: Skip example 2

Again, I’ve intentionally put in these two skips in places where they are least obvious (Figure 4) and most obvious (Figure 5).

 

The real world

One of the tests that can be done on an audio system is to send a sinusoidal signal with a swept frequency through a system, capture the output, and then do a spectrogram of the result. In theory, if you see anything other than a single frequency at any one time at the output, then you know that something has happened to the signal. You would probably then need to go back and look at the output signal itself to start evaluating exactly what happened… This is a test that is used to evaluate one aspect of the performance of different sampling rate converters, for example, at this site.

Let’s take a sine sweep and run it through a system. The sweep goes up logarithmically in frequency from 20 Hz to about 90% of Nyquist (which would correspond to 20,000 Hz in a system running at 44.1 kHz) over 60 seconds and has a level of -1 dB FS. We’ll then capture the output in a system that is behaving perfectly and do a spectrogram of this, looking for artefacts down to some level below the signal level. (If you’re really geeky, you’ll know that this signal-to-error ratio is dependent on the window length of the FFT I’m using to create the spectrogram – but this is beyond our discussion today…).
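For anyone who wants to try this at home, a test signal like the one described above can be generated with a few lines of Python. This is a sketch under my own assumptions (the function name and default values are mine, and it leaves out the dither and the conversion to 16 bits), so don't take it as the exact signal used for the plots.

```python
import numpy as np


def log_sweep(fs=44100, duration_s=60.0, f_start=20.0, f_stop=None, level_dbfs=-1.0):
    """Sine sweep whose frequency rises logarithmically from f_start to f_stop."""
    if f_stop is None:
        f_stop = 0.9 * fs / 2            # about 90% of Nyquist
    t = np.arange(int(round(fs * duration_s))) / fs
    k = np.log(f_stop / f_start)
    # The instantaneous frequency is f_start * exp(k * t / duration_s).
    # The phase is 2*pi times the integral of that frequency over time.
    phase = 2 * np.pi * f_start * duration_s / k * (np.exp(k * t / duration_s) - 1)
    amplitude = 10 ** (level_dbfs / 20)  # -1 dB FS is about 0.891 of full scale
    return amplitude * np.sin(phase)


sweep = log_sweep()
print(len(sweep), "samples, peak level", np.max(np.abs(sweep)))
```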

An example of the output of a system that is behaving well is shown in Figure 6.

 

Fig 6. A spectrogram of a sinusoidal signal, swept in frequency over 60 seconds. Notice that there are no additional signals within 50 dB (the scale on the right) of the signal.

You may notice that the plot looks a little “wide” in the beginning. This is because the window length of the FFT I’m using to analyse the signal isn’t long enough to get a precise analysis of a low-frequency signal. So, this is an artefact of the analysis – not an error in the playback system.

What happens if we have random skip/insert events in the system? This is shown in Figure 7.

Fig 7. Intentionally-created skip/insert events seen as artefacts in the frequency domain.

The signal in Figure 7 was one that I created – I intentionally made skip/insert events at random times and applied them to my test signal.
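Making this kind of damaged signal is not complicated. Here is one possible way to do it, as a sketch: the function name, the use of a seeded random generator, and the choice to pick "skip" or "insert" with equal probability are all my assumptions for the sake of illustration.

```python
import numpy as np


def apply_random_skip_insert(signal, n_events, seed=0):
    """Drop or duplicate single samples at random positions.

    Returns the damaged signal and a list of (position, kind) pairs,
    where the positions refer to the ORIGINAL signal.
    """
    rng = np.random.default_rng(seed)
    candidates = np.arange(1, len(signal) - 1)
    positions = np.sort(rng.choice(candidates, size=n_events, replace=False))
    kinds = rng.choice(["skip", "insert"], size=n_events)

    pieces = []
    start = 0
    for pos, kind in zip(positions, kinds):
        if kind == "skip":
            pieces.append(signal[start:pos])       # leave out sample `pos`
        else:
            pieces.append(signal[start:pos + 1])   # up to and including `pos`
            pieces.append(signal[pos:pos + 1])     # then play it again
        start = pos + 1
    pieces.append(signal[start:])
    return np.concatenate(pieces), list(zip(positions.tolist(), kinds.tolist()))


test_signal = np.sin(2 * np.pi * 1000 * np.arange(44100) / 44100)
damaged, events = apply_random_skip_insert(test_signal, n_events=5)
print(events)
```

Each skip makes the output one sample shorter and each insert makes it one sample longer, which is the same thing that happens to the timing of the audio in a real device.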

There are two things to notice here. The first is that each event is visible as a vertical “spike” in the plot. This is because a skip/insert event will cause a short, wide-band “burst” that sounds like a click. The second is that the bandwidth of the click is dependent on when it happens relative to the signal. For example, the skip/insert events in Figures 2 and 4 would not create as much high-frequency energy as the ones in Figures 3 and 5. So, the bigger the effect on the slope of the signal, the more high-frequency energy we’ll get in our “click” sound. Since the maximum slope of a signal increases with frequency, this also means that low-frequency signals will likely produce lower-bandwidth artefacts.

Now let’s look at the results from some real-world devices and systems that are commercially available.

Fig 8. The same test run on a commercially-available system/device. If you’re curious about some of the information listed in the plot, you can decode it as follows: The title “44k1_16_-1dBFS_chan1_100dB_snr” means that the original file I was playing was a 44.1 kHz / 16 bit file. The level of the sinusoidal sweep was -1 dB FS, and it was TPDF dithered. The analysis we’re looking at here is for channel 1 (the left channel), and we’re looking for artefacts down to 100 dB below the signal level. The “96000” you see on the top left of the plot indicates that the output of the system was captured at a sampling rate of 96 kHz (the internal sampling rate of the sound card that I used to do this measurement).

 

As you can see in Figure 8, there was one skip/insert event that happened during the 60 seconds I was running this test. Remember that the time that that event happened had nothing to do with the frequency it was playing. It just happens when it happens due to the relationship between the transmitter’s and the receiver’s clock speeds.

 

Fig 9. Another commercially-available system/device.

 

Figure 9 shows the results from a different system/device that obviously uses a skip/insert strategy to deal with clock synchronisation problems. It also obviously has some serious clock issues, since it has to correct on the order of approximately once a second…

 

Fig 10. Another commercially-available system/device.

Figure 10 shows the results from a different system/device that uses a skip/insert strategy – but appears to do so at scheduled intervals. In this case, there is a high probability of getting a skip/insert event every 10 seconds with the counter starting at the instant I started hearing the music.

 

Addendum 1

Inquisitive readers may be asking why it is that, although I’m doing an analysis down to -101 dB FS (100 dB below the signal level of -1 dB FS), you can’t see the effects of the dither noise floor in my original 16-bit file (which is normally assumed to be at -93 dB FS). This is because the -93 dB FS estimate of a dither signal assumes that you are looking at the total energy from the entire frequency band. The spectrograms above are based on FFT’s that split up the total frequency band into “slices” (called frequency bins) – and the total energy in each of these bins is less than the total energy in all of them (one person clapping is not as loud as 1000 people clapping at the same time…). If we wanted to see the dither noise, I would have had to set my analysis to go down approximately 30 dB lower – but the actual value for this is dependent on the relationship between the sampling rate, the window length of the FFT’s, and the windowing function that I’m using.
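The "one person clapping" idea can be turned into a rough calculation. The sketch below assumes that the dither noise is white (so its energy is spread evenly across all of the frequency bins) and it ignores the effect of the windowing function. The FFT length of 2048 is only a hypothetical example, chosen because it happens to give a number close to 30 dB; it is not necessarily the window length used for the plots above.

```python
import math


def per_bin_noise_floor_db(total_noise_dbfs, n_fft):
    """Approximate level of white noise in ONE FFT bin.

    Assumes the noise energy is shared equally between the n_fft / 2 bins
    from 0 Hz to Nyquist, and ignores the windowing function.
    """
    n_bins = n_fft // 2
    return total_noise_dbfs - 10 * math.log10(n_bins)


# Hypothetical example: -93 dB FS of total noise, 2048-point FFT.
print(per_bin_noise_floor_db(-93.0, 2048))
```

With those assumptions, each bin sits roughly 30 dB below the total, which is why the noise floor does not show up in an analysis that only goes down to -101 dB FS.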

 

Addendum 2

Do not bother contacting me to ask which “commercially-available system/device” I measured and in which I found these errors. I’m not doing this to get anyone in trouble. I’m just doing this to try to illustrate common errors that I see often when I evaluate and test audio devices.

And besides, it would not be fair for me to rat on specific companies, systems, or devices, since, in some cases, these errors may have already been fixed with a firmware update, meaning that “naming names” would be irrelevant and unnecessarily detrimental.

But, I will say that I see this problem often. A rough estimate is that I would see errors like this on roughly half of the commercially-available devices and systems I test. It can also be sneaky, as we saw in Figures 8 and 10. Sometimes you get one of these clicks only once in a minute. So, if you do a 10-second measurement to test if your wireless audio receiver is “bit accurate” – the answer can be “yes” – but if you keep measuring for 1 or 2 minutes, you find out the answer is “no”…

 

Addendum 3

If it helps, I could have used the example of a leap year instead of two clocks at the beginning. The reason we have a February 29 every 4 years is that our calendar “runs” a little faster than the time it takes us to get around the sun (because a “year” is actually 365.25 days long…). So, every 4 years we have to “insert” a day to put the two clocks back in sync.

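Also, since a “year” is not exactly 365.25 days long, the calendar skips the leap day in most century years (1900 had no February 29, but 2000 did). On top of that, the Earth’s rotation is not perfectly steady, so we have the occasional “leap second” as well. But most people don’t notice this, since it’s rarely useful as an excuse when you’ve missed a meeting…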

Typical Errors in Digital Audio: Part 3 – Aliasing

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

In the first posting in this series, I talked about how digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time… In the second posting in the series, we looked at a “trick” for dealing with the issue of quantisation (the fact that we have a limited resolution for measuring the amplitude of the audio signal). This trick is to add dither (a fancy word for “noise”) to the signal before we quantise it in order to randomise the error and turn it into noise instead of distortion.

In this posting, we’ll look at some of the problems incurred by the way we carve up time into discrete moments when we grab those samples.

Let’s make a wheel that has one spoke. We’ll rotate it at some speed, and make a film of it turning. We can define the rotational speed in RPM – rotations per minute, but this is not very useful. In this case, what’s more useful is to measure the wheel rotation speed in degrees per frame of the film.

 

Fig 1. The position of a clockwise-rotating wheel (with only one spoke) for 9 frames of a film. Each column shows a different rotational speed of the wheel. The far left column is the slowest rate of rotation. The far right column is the fastest rate of rotation. Red wheels show the frame in which the sequence starts repeating.

 

Take a look at the left-most column in Figure 1. This shows the wheel rotating 45º each frame. If we play back these frames, the wheel will look like it’s rotating 45º per frame. So, the playback of the wheel rotating looks the same as it does in real life.

This is more or less the same for the next two columns, showing rotational speeds of 90º and 135º per frame.

However, things change dramatically when we look at the next column – the wheel rotating at 180º per frame. Think about what this would look like if we played this movie (assuming that the frame rate is pretty fast – fast enough that we don’t see things blinking…) Instead of seeing a rotating wheel with only one spoke, we would see a wheel that’s not rotating – and with two spokes.

This is important, so let’s think about this some more. This means that, because we are cutting time into discrete moments (each frame is a “slice” of time) and at a regular rate (I’m assuming here that the frame rate of the film does not vary), then the movement of the wheel is recorded (since our 1 spoke turns into 2) but the direction of movement is not. (We don’t know whether the wheel is rotating clockwise or counter-clockwise. Both directions of rotation would result in the same film…)

Now, let’s move over one more column – where the wheel is rotating at 225º per frame. In this case, if we look at the film, it appears that the wheel is back to having only one spoke again – but it will appear to be rotating backwards at a rate of 135º per frame. So, although the wheel is rotating clockwise, the film shows it rotating counter-clockwise at a different (slower) speed. This is an effect that you’ve probably seen many times in films and on TV. What may come as a surprise is that this never happens in “real life” unless you’re in a place where the lights are flickering at a constant rate (as in the case of fluorescent or some LED lights, for example).

Again, we have to consider the fact that if the wheel actually were rotating counter-clockwise at 135º per frame, we would get exactly the same thing on the frames of the film as when the wheel is rotating clockwise at 225º per frame. These two events in real life will result in identical photos in the film. This is important – so if it didn’t make sense, read it again.

This means that, if all you know is what’s on the film, you cannot determine whether the wheel was going clockwise at 225º per frame, or counter-clockwise at 135º per frame. Both of these conclusions are valid interpretations of the “data” (the film). (Of course, there are more – the wheel could have rotated clockwise by 360º+225º = 585º or counter-clockwise by 360º+135º = 495º, for example…)
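The rule for what the film "shows" can be written as a small function. This is just a sketch with names of my own choosing: positive numbers mean clockwise and negative numbers mean counter-clockwise, and the function returns the smallest rotation that would produce the same frames.

```python
def apparent_rotation(degrees_per_frame):
    """Smallest rotation (in degrees per frame) that gives the same film.

    Positive values are clockwise, negative values are counter-clockwise.
    The result is always in the range -180 < result <= 180.
    """
    wrapped = degrees_per_frame % 360
    if wrapped > 180:
        wrapped -= 360
    return wrapped


for speed in (45, 90, 135, 180, 225, 585, -495):
    print(speed, "looks like", apparent_rotation(speed))
```

Notice that 225º, 585º and -495º all end up looking like -135º (135º counter-clockwise), and that 180º is the special case where the direction cannot be known at all.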

Since these two interpretations of reality are equally valid, we call the one we know is wrong an alias of the correct answer. If I say “The Big Apple”, most people will know that this is the same as saying “New York City” – it’s an alias that can be interpreted to mean the same thing.

Wheels and Slinkies

We people in audio commit many sins. One of them is that, every time we draw a plot of anything called “audio” we start out by drawing a sine wave. (A similar sin is committed by musicians who, at the first opportunity to play a grand piano, will play a middle-C, as if there were other notes in the world.) The question is: what, exactly, is a sine wave?

Get a Slinky – or if you don’t want to spend money on a brand name, get a spring. Look at it from one end, and you’ll see that it’s a circle, as can be (sort of) seen in Figure 2.

Fig 2. A Slinky, seen from one end. If I had really lined things up, this would just look like a shiny circle.

Since this is a circle, we can put marks on the Slinky at various amounts of rotation, as in Figure 3.

Fig 3. The same Slinky, marked in increasing angles of 45º.

Of course, I could have put the 0º mark anywhere. I could have also rotated counter-clockwise instead of clockwise. But since both of these are arbitrary choices, I’m not going to debate either one.

Now, let’s rotate the Slinky so that we’re looking at it from the side. We’ll stretch it out a little too…

Fig 4. The same Slinky, stretched a little, and viewed from the side.

Let’s do that some more…

Fig 5. The same Slinky, stretched more, and viewed from the “side” (in a direction perpendicular to the axis of the rotation).

When you do this, and you look at the Slinky directly from one side, you are able to see the vertical change of the spring from the centre as a result of the change in rotation. For example, we can see in Figure 6 that, if you mark the 45º rotation point in this view, the distance from the centre of the spring is 71% of the maximum height of the spring (at 90º).

Fig 6. The same markings shown in Figure 3, when looking at the Slinky from the side. Note that, if we didn’t have the advantage of a little perspective (and a spring made of flat metal), we would not know whether the 0º point was closer or further away from us than the 180º point. In other words, we wouldn’t know if the Slinky was rotating clockwise or counter-clockwise.

So what? Well, basically, the “punch line” here is that a sine wave is actually a “side view” of a rotation. So, Figure 7 shows a measurement – a capture – of the amplitude of the signal every 45º.

Fig 7. Each measurement (a black “lollipop”) is a measurement of the vertical change of the signal as a result of rotating 45º.

Since we can now think of a sine wave as a rotation of a circle viewed from the side, it should be just a small leap to see that Figure 7 and the left-most column of Figure 1 are basically identical.
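If you don't have a Slinky handy, the "side view" can be calculated instead. The little sketch below (with a function name that I made up) gives the height of a point on a circle with a radius of 1 after it has been rotated by some angle, which is all that a sine function is.

```python
import math


def side_view_height(angle_deg):
    """Height of a point on a unit circle, seen from the side."""
    return math.sin(math.radians(angle_deg))


# One measurement every 45 degrees, like the marks on the Slinky.
for angle in range(0, 360, 45):
    print(angle, round(side_view_height(angle), 2))
```

The value at 45º comes out as about 0.71, which is the 71% of the maximum height that you can see on the side of the spring.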

Let’s make audio equivalents of the different columns in Figure 1.

Fig 8. A sampled cosine wave where the frequency of the signal is equivalent to 90º per sample period. This is identical to the “90º per frame” column in Figure 1.
Fig 9. A sampled cosine wave where the frequency of the signal is equivalent to 135º per sample period. This is identical to the “135º per frame” column in Figure 1.
Fig 10. A sampled cosine wave where the frequency of the signal is equivalent to 180º per sample period. This is identical to the “180º per frame” column in Figure 1.

Figure 10 is an important one. Notice that we have a case here where there are exactly 2 samples per period of the cosine wave. This means that our sampling frequency (the number of samples we make per second) is exactly twice the frequency of the signal. If the signal gets any higher in frequency than this, then we will be making fewer than 2 samples per period. And, as we saw in Figure 1, this is where things start to go haywire.

Fig 11. A sampled cosine wave where the frequency of the signal is equivalent to 225º per sample period. This is identical to the “225º per frame” column in Figure 1.

Figure 11 shows the equivalent audio case to the “225º per frame” column in Figure 1. When we were talking about rotating wheels, we saw that this resulted in a film that looked like the wheel was rotating backwards at the wrong speed. The audio equivalent of this “wrong speed” is “a different frequency” – the alias of the actual frequency. However, we have to remember that both the correct frequency and the alias are valid answers – so, in fact, both frequencies (or, more accurately, all of the frequencies) exist in the signal.

So, we could take Fig 11, look at the samples (the black lollipops) and figure out what other frequency fits these. That’s shown in Figure 12.

Fig 12. The red signal and the black samples of it are the same as was shown in Figure 11. However, another frequency (the blue signal) also fits those samples. So, both the red signal and the blue signal exist in our system.
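You can check this for yourself numerically. The following sketch (my own, using 16 samples as an arbitrary length) takes samples of a cosine wave that advances 225º per sample and of one that advances 135º per sample, and confirms that the two sets of samples are the same.

```python
import numpy as np

n = np.arange(16)                           # sample numbers (arbitrary length)
fast = np.cos(np.radians(225.0 * n))        # the signal we meant to record
slow = np.cos(np.radians(135.0 * n))        # its alias below Nyquist

samples_match = np.allclose(fast, slow)
print("Same samples?", samples_match)
```

Since the samples are identical, there is nothing in the stored data that can tell you which of the two signals was the "real" one.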

Moving up in frequency one more step, we get to the right-hand column in Figure 1, whose equivalent, including the aliased signal, is shown in Figure 13.

Fig 13. A signal (the red curve) that has a frequency equivalent to 280º of rotation per sample, its samples (the black lollipops) and the aliased additional signal that results (the blue curve).

 

 

Do I need to worry yet?

Hopefully, now, you can see that an LPCM system has a limit with respect to the maximum frequency that it can deal with appropriately. Specifically, the signal that you are trying to capture CANNOT exceed one-half of the sampling rate. So, if you are recording a CD, which has a sampling rate of 44,100 samples per second (or 44.1 kHz) then you CANNOT have any audio signals in that system that are higher than 22,050 Hz.

That limit is commonly known as the “Nyquist frequency”, named after Harry Nyquist – one of the people who figured out that this limit exists.

In theory, this is always true. So, when someone did the recording destined for the CD, they made sure that the signal went through a low-pass filter that eliminated all signals above the Nyquist frequency.

In practice, however, there are many cases where aliasing occurs in digital audio systems because someone wasn’t paying enough attention to what was happening “under the hood” in the signal processing of an audio device. This will come up later.

 

Two more details to remember…

There’s an easy way to predict the output of a system that’s suffering from aliasing if your input is sinusoidal (and therefore contains only one frequency). The frequency of the output signal will be the same distance from the Nyquist frequency as the frequency of the input signal. In other words, the Nyquist frequency is like a “mirror” that “reflects” the frequency of the input signal to another frequency below Nyquist.

This can be easily seen in the upper plot of Figure 14. The distance between the input signal and the Nyquist frequency is the same as the distance between the output signal and the Nyquist frequency.

Also, since the Nyquist frequency acts as a mirror, the frequencies of the input signal and the output signal will move in opposite directions: as the input frequency goes up, the frequency of the alias goes down (this point will help later).
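The "mirror" can be written as a short function. This is a sketch with a name and example frequencies that I've chosen for illustration; it assumes a single sinusoidal input and it also handles inputs above the sampling rate by first wrapping them back into the range from 0 Hz to the sampling rate.

```python
def alias_frequency(f_in_hz, fs_hz):
    """Frequency (between 0 Hz and Nyquist) that a sampled sinusoid appears at."""
    folded = f_in_hz % fs_hz        # wrap into the range 0 to fs
    if folded > fs_hz / 2:
        folded = fs_hz - folded     # reflect around the Nyquist frequency
    return folded


fs = 44100
for f_in in (1000, 25000, 26000):
    print(f_in, "Hz in,", alias_frequency(f_in, fs), "Hz out")
```

For example, an input at 25,000 Hz is 2,950 Hz above the Nyquist frequency of 22,050 Hz, so it comes out at 19,100 Hz, which is 2,950 Hz below it. Raising the input to 26,000 Hz moves the output down to 18,100 Hz.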

 

Fig 14. Two plots showing the same information about an Input Signal above the Nyquist frequency and the output alias signal. Notice that, in the linear plot on top, it’s easier to see that the Nyquist frequency is the mirror point at the centre of the frequencies of the Input and Output signals.

 

Usually, frequency-domain plots are done on a logarithmic scale, because this is more intuitive for us humans, who hear logarithmically. (For example, we hear two consecutive octaves on a piano as having the same “interval” or “width”. We don’t hear the upper octave as being twice as wide, like a measurement system does. That’s why music notation does not get wider on the top, with a really tall treble clef.) This means that it’s not as obvious that the Nyquist frequency is in the centre of the frequencies of the input signal and its alias below Nyquist.