In a previous posting, I showed some plots that displayed the probability density functions (or PDF) of a number of commercial audio recordings. (If you are new to the concept of a probability density function, then you might want to at least have a look at that posting before reading further…)
I’ve been doing a little more work on this subject, with some possible implications on how to interpret those plots. Or, perhaps more specifically, with some possible implications on possible conclusions to be drawn from those plots.
Full-band examples
To start, let’s create some noise with a desired PDF, without imposing any frequency limitations on the signal.
To do this, I’ve ported equations from “Computer Music: Synthesis, Composition, and Performance” by Charles Dodge and Thomas A. Jerse, Schirmer Books, New York (1985) to Matlab. That code is shown below in italics, in case you might want to use it. (No promises are made regarding the code quality… However, I will say that I’ve written the code to be easily understandable, rather than efficient – so don’t make fun of me.) I’ve made the length of the noise samples 2^16 because I like that number. (Actually, it’s for other reasons involving plotting the results of an FFT, and my own laziness regarding frequency scaling – but that’s my business.)
Uniform (aka Rectangular) Distribution
uniform = rand(2^16, 1);
Fig 1: The PDF and the spectrum (1-octave smoothed) of a noise signal with a rectangular distribution. Note that there is a DC component, since there are no negative values in the signal.
Of course, as you can see in the plots in Figure 1, the signal is not “perfectly” rectangular, nor is it “perfectly” flat. This is because it’s noise. If I ran exactly the same code again, the result would be different, but also neither perfectly rectangular nor flat. Of course, if I ran the code repeatedly, and averaged the results, the average would become “better” and “better”.
Fig 2: The PDF and the spectrum (1-octave smoothed) of a noise signal with a linear distribution. Note that there is a DC component, since there are no negative values in the signal.
Triangular Distribution
triangular = rand(2^16, 1) – rand(2^16, 1);
Fig 3: The PDF and the spectrum (1-octave smoothed) of a rise signal with a triangular distribution. Note that there is no DC component, since the PDF is symmetrical across the 0 line.
Exponential Distribution
lambda = 1; % lambda must be greater than 0
exponential_temp = rand(2^16, 1) / lambda;
if any(exponential_temp == 0) % ensure that no values of exponential_temp are 0
error(‘Please try again…’)
end
exponential = -log(exponential_temp);
Fig 4: The PDF and the spectrum (1-octave smoothed) of a rise signal with an exponential distribution. Note that there is a DC component, since there are no negative values in the signal. Note as well that the values can be significantly higher than 1, so you might incur clipping if you use this without thinking…
Bilateral Exponential Distribution (aka Laplacian)
Fig 5: The PDF and the spectrum (1-octave smoothed) of a rise signal with a bilateral exponential distribution. Note that there is no DC component, since the PDF is symmetrical across the 0 line. Note as well that the values can be significantly higher than 1 (or less than -1), so you might incur clipping if you use this without thinking…
Gaussian
sigma = 1;
xmu = 0; % offset
n = 100; % number of random number vectors used to create final vector (more is better)
xnover = n/2;
sc = 1/sqrt(n/12);
total = sum(rand(2^16, n), 2);
gaussian = sigma * sc * (total – xnover) + xmu;
Fig 6: The PDF and the spectrum (1-octave smoothed) of a rise signal with a Gaussian distribution. Note that there is no DC component, since the PDF is symmetrical across the 0 line. Note as well that the values can be significantly higher than 1 (or less than -1), so you might incur clipping if you use this without thinking…
Of course, if you are using Matlab, there is an easier way to get a noise signal with a Gaussian PDF, and that is to use the randn() function.
The effects of band-passing the signals
What happens to the probability distribution of the signals if we band-limit them? For example, let’s take the signals that were plotted above, and put them through two sets of two second-order Butterworth filters in series, one set producing a high-pass filter at 200 Hz and the other resulting in a low-pass filter at 2 kHz .(This is the same as if we were making a mid-range signal in a 4th-order Linkwitz-Riley crossover, assuming that our midrange drivers had flat magnitude responses far beyond our crossover frequencies, and therefore required no correction in the crossover…)
What happens to our PDF’s as a result of the band limiting? Let’s see…
Fig 7: The PDF of noise with a rectangular distribution that has been band-limited from 200 Hz to 2 kHz.
Fig 8: The PDF of noise with a linear distribution that has been band-limited from 200 Hz to 2 kHz.
Fig 9: The PDF of noise with a triangular distribution that has been band-limited from 200 Hz to 2 kHz.
Fig 10: The PDF of noise with an exponential distribution that has been band-limited from 200 Hz to 2 kHz.
Fig 11: The PDF of noise with a Laplacian distribution that has been band-limited from 200 Hz to 2 kHz.
Fig 12: The PDF of noise with a Gaussian distribution that has been band-limited from 200 Hz to 2 kHz.
So, what we can see in Figures 7 through 12 (inclusive) is that, regardless of the original PDF of the signal, if you band-limit it, the result has a Gaussian distribution.
And yes, I tried other bandwidths and filter slopes. The result, generally speaking, is the same.
One part of this effect is a little obvious. The high-pass filter (in this case, at 200 Hz) removes the DC component, which makes all of the PDF’s symmetrical around the 0 line.
However, the “punch line” is that, regardless of the distribution of the signal coming into your system (and that can be quite different from song to song as I showed in this posting) the PDF of the signal after band-limiting (say, being sent to your loudspeaker drivers) will be Gaussian-ish.
And, before you ask, “what if you had only put in a high-pass or a low-pass filter?” – that answer is coming in a later posting…
Post-script
This posting has a Part 1 that you’ll find here, and a Part 3 that you’ll find here.
Often, people want simple answers to what are usually complicated questions. “What is the best restaurant in the world?” “Is this new movie worth seeing?” “Is it better for ‘the environment’ to buy cloth or disposable diapers?” “Is this Metallica album any good?”
My rule of thumb to any question like this is that, if the answer is “it depends….” then I immediately trust the person doing the answering. Once upon a time, the answer to the first question was “Noma in Copenhagen” if you believed one rating, but if you don’t like eating yogurt with ants, perhaps you might disagree… Some people screamed with delight when they saw the latest Star Wars trailer – others aren’t into science fiction so much, so they might not be the first in line to buy tickets. It turns out that the cloth vs. disposable is, in part, dependent on geography and natural resources.
However, it’s easy to go to many websites and learn that Death Magnetic is an affront to humanity due to a heavy-handed use of dynamic range compression, and the engineers involved should be on a “most wanted” poster with bounties on their heads…
Personally, I disagree. True, Death Magnetic isn’t the best album to use to show what a wide-dynamic-range recording sounds like – but that’s just one aspect of everything about that recording. Similarly, it would be misguided to say that “Casablanca” is a bad movie because of it’s obvious lack of colours – there are other aspects of that movie that might make it worth watching…
Now, before we go any further, don’t send me emails and comments about this. If you stick with me, you’ll see that my point is not to argue the merits of Death Magnetic. If you don’t like it, I respect your opinion, and I won’t make you listen to it. (And, just for the record, while I type this, I’m listening to an old recording of Schoenberg String Quartets on my headphones – hopefully proof that I’m open to listening to other albums as well…)
My problem with the people I’m trying to make fun of is that they use a single measure – a one-dimensional analysis – to judge the quality of something that has more dimensions. For example, no one ever complains about the overall frequency range of Death Magnetic. Maybe that’s really good (whatever that means).
We can even go further in our mission to pick nits: How, exactly, does one measure or rate the “dynamic range” of a recording? Spoiler alert: this question will not be answered by the end of this blog posting. I just hope to show that answering this question is difficult – and that you shouldn’t trust someone else’s rating…
To start, let’s get back-to-basics. The dynamic range of something is a measure or rating of its loudest level to its quietest. So, we can talk about the dynamic range of an audio device – comparing the level at which it can go no louder to the level of its noise floor (both of which are dependent on the frequency to which you want to pay attention). Or, we could talk about the dynamic range of a conversation over dinner – from the loudest moment in the evening when the drunk, belligerent person was shouting down the table at someone he disagreed with, to that quiet moment early in the evening at 8:20 p.m. when no one at the table could think of something to say.
Let’s take a different recording as an example, since it will make the show-and-tell a little easier.
If I take Jennifer Warnes’s 1986 recording of “Bird on a Wire” from the album “Famous Blue Raincoat” and plot the sample values of the left channel on a linear scale ranging from -1 to 1, it will look like Figure 1.
Fig 1: The individual values of all of the samples in “Bird on a Wire”, plotted on a linear scale.
This shows us some stuff. For example, we can see that there are no samples louder than 0.5 of maximum. We can also see that it’s a bit “spiky” – probably drum beats, if we don’t know the tune. There’s also a part of the tune centred around the 150 second mark that’s probably a bit louder, and a section afterwards that is a bit quieter. However, what is the “dynamic range” of this song?
Attempt #1
One way to answer the question is to compare the loudest peak in the song to the average level (also known as the “crest factor” of the signal). Let’s try that…
Using Matlab, I found out that this is a single sample with a value of 0.4973 that occurs 259.4884 seconds into the tune (sample #11,443,439, if you’re wondering). Then I calculated the overall RMS value of the entire track (the square Root of the Mean of the sample values individually Squared) and found out that it’s 0.0530.
Therefore, our “dynamic range” of this recording is 20*log10(0.4973 / 0.0530) = 19.45 dB
That’s an answer that was easy to calculate. However, to borrow a quote from Pauli: “That is not only not right; it is not even wrong.”
This is because the value of a single sample in a song cannot be considered to be the perceived peak in the music – even if it is the biggest one. In addition, the total average the power of all of the samples in the song means nothing, since you can’t listen to the entire song at once. It’s like saying “what is the elevation of the Alps?” You can do the average, and come up with a number, but it won’t matter much for planning your hike over Mont Blanc…
Attempt #2
One of the problems with Figure 1 is that it shows the sample values on a linear scale, which has little to do with the way we hear. For example, we hear a drop from a linear value of 1 down to a linear value of 0.5 the same as going from 0.5 down to 0.25 – and 0.25 down to 0.125… (I was halving the signal in each of those steps). So, let’s re-plot the values on a logarithmic scale, which is a better visual representation of how we hear the signal. This is shown in Figure 2.
Fig 2: The same data that was shown in Figure 1, plotted on a decibel scale. It says “dB FS” – but it’s not really. It’s actually just 20*log10(abs(sample value)).
Things look a little less different now – although, to be fair, I have a range of 100 dB on the Y-axis – which is big. Note that the samples with a value of 0 are not plotted on this graph, since they will have a value of -∞ dB, which would require a bigger computer screen. So, if you could zoom in on this plot, you’d see holes where those sample values occur.
Let’s zoom in a little on the vertical axis so that things are a little more friendly.
Fig 3: The same information as is shown in Figure 2 with a reduced range in the Y-axis.
Okay, so we still see that there’s a quiet section around the 200-second mark, but the loud section just before it doesn’t seem as louder as it did in the linear plot.
There is a quick lesson to be learned here: linear plotting of audio tends to exaggerate the louder stuff. This is because, in a linear world, the area of 0.5 up to 1.0 is half of the plot (if we only talk about the positive world), but in the log world (which our ears live in) it’s only the top 6 dB. (Similarly, if you’re plotting in linear frequency, the top octave on a piano is more important than all of the others combined… which is not true.)
Another way to see this exaggeration is to look at the dip in level at about the 220 second mark. This is much more evident as a “trough” in Figure 3 than in Figure 1, because the low-level samples get equal attention in the log-scale plot.
Why have I mentioned this? It’s because if you go to Youtube and watch videos of people ranting about the “loudness wars” they often do demonstrations where they use compression to reduce the dynamic range of old recordings and play “before and after” results. As they do this, the video shows what looks like a real-time plot of a linear representation of the results. Apart from the fact that it’s likely that you’re actually looking at a scaling of a plot in a graphics editor, and not in an audio editor, the linear representation is not a fair one to show what a compressor does to an audio signal. Note that I’m not making any comment about the content of those videos – I’m just warning you about jumping to conclusions based on the visual representation of the data.
Okay, back to our story… How can we find the dynamic range of this recording?
Well, our second possibility is to look for a loud section in Figure 3 and compare it to a quiet section in Figure 3. The problem is, how do we find those sections? We’ve already established that looking at the loudest single sample in the entire song is a silly idea. So is finding the RMS of the entire track. Let’s try something else. Let’s do a running RMS of the song. This is shown in Figure 4.
Fig 4: The same data shown in Figure 3 (in blue) and a running RMS value (in red) with a 10-second time constant.
Now, one of the problems with an RMS calculation is that part of the calculation is to take the mean (or average) of a bunch of values. However, you have to decide how many values to average – in other words, “how long a slice of time are we taking?” In the plot above, the RMS value (shown in red) is calculated using a 10-second window (which is why it starts at the 5-second mark). For each point on that red line, I took all of the audio sample 5 seconds before to 5 seconds afterwards, squared them, found the mean value, and then took the square root of that. This can give me some indication of the level of the music in those 10 seconds – but we don’t hear all 10 seconds at once… so now, it’s like asking the average height of Mont Blanc (instead of the average height of the Alps). So, this information is still not really useful.
One thing is interesting, though… Look at where the red line says the quietest moment in the song is. It’s not centred around the deep trough at 220 seconds. It’s a little later than that. This is because a narrow, deep notch in the blue signal gets averaged out of existence in the RMS calculation – making it less significant.
Another interesting thing is to note how low the RMS value is relative to the instantaneous values. (In other words, the red line is a lot lower than the top of the blue curve.) This is because the peaks in the signal are short – but that’s not visible because I’m plotting the whole tune. If I zoom in to only one second of music (in Figure 5) this becomes more intuitive…
Fig 5: A zoomed-in portion of Figure 4, showing only 1 second of music. Remember that the red line still shows the RMS value of the surrounding 10 seconds of music, 9 seconds of which are not shown here.
So, let’s reduce our RMS time constant to see what happens…
Fig 6: The same data shown in Figure 3 (in blue) and a running RMS value (in red) with a 1-second time constant.
Now we’re starting to see a little more variation in the RMS signal, which might be a little closer to the way we hear things. But it’s still not a good representation because the RMS time window is 1 second – and we don’t hear 1 second of music simultaneously either…
So, how much time is “now” when we’re listening to music? Some people say 30 milliseconds is a good number, since anything outside that window can be heard as an echo… This is a mis-interpretation of a mis-interpretation, but at least it gives us a number to put into the math, so let’s try that.
Fig 7: The same information as shown in Figure 3 (in blue) and a running RMS (in red) with a 30 ms time constant.
Now, if you squint just right, the red curve basically looks like a lower copy of the blue curve, so it’s probably not useful at this magnification. Let’s zoom in to see if it’s makes more sense
Fig 7: The same information as shown in Figure 5, but with a 30 ms time constant for the RMS calculation.
Enough… Hopefully you can see where I’m headed with this: no where… The simple problem is that, in order to find the dynamic range of a piece of music, we need to compare the loudest moment to the quietest moment. Unfortunately, none of what I’ve shown you so far is a measurement of either of these two values.
We can do other things. We can band-limit the signal (since neither 20 kHz nor 20 Hz have much contribution to how loud something sounds). We can play with time constants (maybe using a short one to invent a number for the peaks, and a longer one for the quiet moments). We can do other, more complicated stuff, including applying filters (simple ones like the ones used in the ITU-1770 standard, or complicated ones like the ones suggested by companies who make gear).
We also have to consider whether our perception of how loud something is is related to the average level or the peak level (JJ points out in his lecture at about 20:38 that different persons will disagree on this…).
Ultimately, we get to a point where we have to say that a single number to represent the dynamic range of a song is a nice number to have, but probably not related to how you perceive it. And if it is, it might not be related to how I perceive it…
And, in the end, the REAL message is that it doesn’t matter. Deciding that a recording is “good” because it has been rated to have a wide dynamic range by someone’s calculation is like saying that “Casablanca” isn’t a good movie because it doesn’t have any colours in it – or choosing a restaurant based on a single rating from one critic. No matter what the rating is, if you don’t want to eat live ants crawling on yogurt, you should probably go somewhere else…
Addendum: Before you comment on this posting, please go and watch the two lectures by JJ here and here. If the content of your comment indicates that you have not watched these two lectures, I will probably delete it from my website. You’ve been warned… I will also delete your comment if you mention Death Magnetic – if you want to talk about that, make your own website. You, too, have been warned.
Interesting that they include people who “buy” digital downloads in the “own” category, since normal consumers typically don’t know that they’re only buying access to the music from most sites, and not actually purchasing the content, like we did in the old days…
Stephen Colbert once said that George W. Bush was a man of conviction. He believed the same thing on Wednesday that he believed on Monday, no matter what happened on Tuesday….
Often people ask me questions about sound quality. In “the old days”, it was something like “which is better, analogue or digital?” Later, it became “which is better, MP3 or Ogg Vorbis (or something else)?”. These days, it’s something like “which streaming service has the best quality?” or “Is high-resolution audio really worth it?” Or, it’s a more general question like “what loudspeaker (or headphones) would you recommend?”
My answer to these questions is always the same. It’s a combination of “it depends….” and “whatever I tell you today, it might be different tomorrow…”Something that is true on Monday may not still be true on Wednesday…
Recently, during a discussion about something else, I told someone that many, if not most, mobile devices will clip (and therefore) distort a signal if you try to boost it (e.g. with a “bass boost” or a “pop” setting instead of playing the signals “flat”), but they won’t if you cut. Therefore, on a mobile device, it’s smarter to cut than to boost a signal.
I made that statement based on some past measurements that were done that showed that, if you put a 0 dB FS signal at a low frequency (say, 80 Hz) on different mobile devices, and turned the “bass boost” (or equivalent, such as a “pop” or “rock” setting) on, the signal would be often be clipped, and therefore it would increase the level of distortion – sometimes quite dramatically. The measurements also showed that this was independent of volume setting. So, turning down the level didn’t help – it just made things quieter, but maintained the same THD value. This is likely because the processing in such devices & software was done (a) before the volume control and (b) in a fixed-point system that does not have a carefully-managed headroom.
Fig 1. The minijack headphone output of a mobile device that was measured a long time (about 2 years) ago. This was done at a maximum volume setting (notice the reading of 2.38 V peak-to-peak in the centre bottom of the display) playing a .wav file with an 80 Hz sine wave (displayed on the bottom left). The EQ in the device was set to “Flat”.
Fig 2. The output of the same device at the same maximum volume setting, The EQ was set to boost the bass signals. Notice that the signal is now clipped.
Fig 3. The output of the same device at a lower volume setting (notice that the Peak-to-Peak measurement shows 291 mV on the centre bottom of the display), The EQ was set to boost the bass signals. Notice that the signal is still clipped as much as it was at a high volume setting – only a little extra noise is visible, because we’re closer to the noise floor of the device.
So, just to be sure that this was still true on newer devices and software, I did a couple of quick measurements. I put a logartihmically-swept sine wave with a level of 0 dB FS and a frequency range of 20 Hz to 2 kHz on a current mobile audio device with a minijack headphone output. With the output volume at maximum and the EQ set to “OFF” or “FLAT”, I recorded the output and did a little analysis.
Fig 4. A plot of the absolute value of the signal as it is swept in frequency from 20 Hz to 2000 Hz. Notice that the y-axis is zoomed in to a total range of only 2 dB.
Although Figure 4 is a plot of the time-domain output of the system (in other words, I’m just plotting 20*log10(abs(signal)), it can be read as a frequency response plot (which is why I’ve labelled it that way on the x-axis). Actually, though, I have to be explicit and say that we’re actually looking at the absolute value of the signal itself, and the y-axis is labelled as the frequency at the time of the signal coming in.
If we zoom in on the signal, we get the plot shown in Figure 5.
Fig 5. Zooming in to the plot in figure 1, when the signal is sweeping between 40 Hz and 41 Hz. I’m looking for clipping, but I don’t see anything worth worrying about.
Then I turned the bass boost setting to “ON” and repeated the recording, without changing anything else. The result is shown in Figure 6.
Fig 6. the equivalent of Figure 1, but with bass boost set to ON on the mobile device. Again, this can be read as a frequency response plot – but notice that the vertical scale is now 20 dB.
Zooming in on the same 40 Hz region, we see the following:
Fig 7. An equivalent to Figure 2 – but with the bass boost set to ON on the mobile device. Again, it’s nice to see that there is no significant clipping…
So, as can be seen in those plots, the clipping problem that used to be quite obvious in some mobile devices has been corrected on this newer device due to a difference in the way the signal processing is implemented in the software.
However, as can be seen in Figure 6, this solution to the problem was to drop the midrange and treble by about 6 dB. (It’s also interesting that we lose almost 6 dB at 20 Hz when we think that we are boosting the “bass” – but that’s another discussion…) This might be considered to be a smart solution, since the listener can just turn up the level to compensate for the loss, if they wish. However, it does mean that, on this particular device, with this particular software version, if you do turn on the “bass boost” function, you’ll lose about 6 dB from your maximum level (which is equivalent to 4 times the sound power) in the midrange and high frequency regions (say, from about 600 Hz and up, give or take…). So, distortion has been traded for a lower maximum output level in the comparison between these two devices and software versions, two years apart.
So, generally speaking, it seems that I have fallen victim to exactly the problem that I often warn people to avoid. I believed the same thing on Wednesday that I believed on Monday… Then again, I know for certain that there are many people who are still walking around with the same distorting software / device that I complained in the “old days” (those first set of measurements are only 2 years old…). And, if you’re concerned about maximum output levels, the “new” solution might also not be optimal for your preferences.
So, it seems that “it depends” is still the safest answer…