achieving distance and depth in stereo recordings – one man’s opinion

I had an interesting email from an old recording-engineer friend of mine this week regarding a debate he had with a student concerning the issue of “depth” in recordings (in his specific case, 2-channel stereo recordings done with an ORTF mic configuration). This got me thinking back to a bunch of thoughts I had once-upon-a-time about distance perception, and a newer bunch of thoughts about loudspeaker directivity. Now, those two bunches of thoughts are congealing into a single idea regarding how to achieve (and experience) a reasonable perceived sensation of distance and depth in 2-channel stereo.

To start, some definitions:

When I say “stereo” I mean “2-channel sound recording”.
“Distance” to a source in a stereo recording is the perceived distance between the listener and the (probably phantom) image.
“Depth” in a stereo recording is the difference in the perceived distances from the listener to the closest and farthest (probably phantom) images (e.g. the distance to the concert master vs. the distance to the xylophone in a symphony orchestra).

Step 1: Distance perception in real life

Go to an anechoic chamber with a loudspeaker and a friend. Sit there and close your eyes and get your friend to place the loudspeaker some distance from you. Keep your eyes closed, play some sounds out of the loudspeaker and try to estimate how far away it is. You will be wrong (unless you’re VERY lucky). Why? It’s because, in real life with real sources in real spaces, distance information (in other words, the information that tells you how far away a sound source is) comes mainly from the relationship between the direct sound and the early reflections. If you get the direct sound only, then you get no distance information. Add the early reflections and you can very easily tell how far away it is. This has been proven in lots of “official” listening tests. (For example, go check out this report as a basic starting point).

Anecdote #1: Back in the old days when I was working on my Ph.D. we had an 8-loudspeaker system in the lab – one speaker every 45° in a circle around the listening position. We were trying to build a multichannel room simulator where we were building a sound field, piece by piece – the direct sound and (up to 3rd-order) early reflections had the “correct” panning, delay and gain, and we added a diffuse field to tail in behind it. One of the interesting things that I found with that system was that the simulated distance to the source was easy to achieve with just the 1st-order reflections, but that the precision of that perceived distance increased as we added 2nd- and 3rd-order reflections. (We didn’t have enough computing power to simulate higher-order reflections at the time. It would be interesting to go back and try again to see what would happen with higher-order stuff now that my Mac has gotten a little faster…) Another interesting thing (although, in retrospect, it shouldn’t surprise anyone) was that the location and the distance to the simulated sound source were also easy to determine without the direct sound being part of the sound field at all. Just the 1st- to 3rd-order reflections by themselves were enough to tell you where things were.
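To make the “panning, delay and gain” idea a bit more concrete, here is a minimal sketch of how the direct sound and the 1st-order reflections could be computed with the image-source method. This is not the code from that lab system; it is a simplified illustration that assumes a 2D rectangular room, a listener facing the +y direction, a speed of sound of 343 m/s, and a single made-up absorption coefficient for every wall. The function name and parameters are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly room temperature)

def first_order_reflections(source, listener, room, absorption=0.3):
    """Direct sound plus 1st-order wall reflections in a 2D rectangular room.

    source, listener: (x, y) in metres. room: (width, depth), with walls at
    x=0, x=width, y=0 and y=depth. The listener is assumed to face +y.
    absorption is a hypothetical energy absorption coefficient per wall.
    """
    sx, sy = source
    width, depth = room
    # Mirror the source across each wall to get one image source per wall.
    images = {
        "direct": (sx, sy),
        "left wall": (-sx, sy),
        "right wall": (2 * width - sx, sy),
        "rear wall": (sx, -sy),
        "front wall": (sx, 2 * depth - sy),
    }
    reflection_gain = math.sqrt(1.0 - absorption)  # pressure gain for one bounce
    arrivals = []
    for name, (ix, iy) in images.items():
        dx, dy = ix - listener[0], iy - listener[1]
        dist = math.hypot(dx, dy)
        bounce = 1.0 if name == "direct" else reflection_gain
        arrivals.append({
            "path": name,
            "delay_ms": 1000.0 * dist / SPEED_OF_SOUND,
            "gain": bounce / dist,  # 1/r spreading loss
            # 0 degrees = straight ahead, positive = to the listener's right
            "azimuth_deg": math.degrees(math.atan2(dx, dy)),
        })
    return sorted(arrivals, key=lambda a: a["delay_ms"])

for a in first_order_reflections((3.0, 6.0), (3.0, 2.0), (6.0, 8.0)):
    print(f'{a["path"]:>10}: {a["delay_ms"]:6.2f} ms, '
          f'gain {a["gain"]:.3f}, azimuth {a["azimuth_deg"]:7.1f} deg')
```

Each entry gives the three things a room simulator of this kind needs per arrival: when it arrives, how loud it is, and which direction to pan it to. Higher-order reflections would come from mirroring the image sources again.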

Step 2: Distance perception in a recording

It’s been well-known for many years that the apparent distance to a sound source in a stereo recording is controllable by the so-called “dry-wet” ratio – in other words, the relative levels of the direct sound and the reverb. I first learned this in the booklet that came with my first piece of recording gear – an Alesis Microverb. To be honest – this is a bit of an over-simplification, but done in good faith for people who are at the knowledge level one would typically have if one were an Alesis Microverb customer. The people at another reverb unit manufacturer know that the truth requires a little more detail. For example, their flagship reverb unit uses correctly-positioned and correctly-delayed early reflections (calculated using ray tracing, apparently) to deliver a believable room size and sound source location in that room.

If you’re thinking in terms of a stereo microphone pair, then consider it this way: you want your microphone configuration to be reasonably good at acting like a decent panning algorithm. At the very least, you should ensure that you don’t have conflicting information between the interchannel time and the interchannel amplitude differences for your direct sound and the early reflections. For example, if you have a pair of near-coincident cardioids, but they’re “toed-in” instead of “toed-out” (i.e. the left mic is pointing to the right and the right mic is pointing to the left), you have a problem: the earlier channel will not be the louder channel for sound sources and reflections that are not on-axis to the pair. This makes for conflicting, and therefore confusing, information for your brain.
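The toed-in problem is easy to check with a little arithmetic. The sketch below is only an illustration: it assumes ideal cardioids, a far-field source (a plane wave), the usual ORTF numbers of 17 cm spacing and capsules angled ±55°, and a speed of sound of 343 m/s. The function names are my own.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def cardioid_gain(source_deg, axis_deg):
    """Ideal cardioid sensitivity: 1.0 on-axis, 0.0 directly behind."""
    return 0.5 * (1.0 + math.cos(math.radians(source_deg - axis_deg)))

def interchannel_differences(source_deg, spacing_m=0.17,
                             half_angle_deg=55.0, toed_in=False):
    """Far-field time and level differences for a near-coincident cardioid pair.

    Angles are measured from straight ahead, positive to the right.
    Positive results mean the RIGHT channel is earlier / louder.
    """
    # Toed-out: the left capsule aims left. Toed-in: the left capsule aims right.
    left_axis = half_angle_deg if toed_in else -half_angle_deg
    right_axis = -left_axis
    # The capsule positions do not change, so the time difference is the same
    # in both cases; only the level difference flips sign.
    time_diff_ms = (1000.0 * spacing_m
                    * math.sin(math.radians(source_deg)) / SPEED_OF_SOUND)
    level_diff_db = 20.0 * math.log10(
        cardioid_gain(source_deg, right_axis)
        / cardioid_gain(source_deg, left_axis)
    )
    return time_diff_ms, level_diff_db

for toed_in in (False, True):
    dt, dl = interchannel_differences(30.0, toed_in=toed_in)
    agree = (dt > 0) == (dl > 0)
    print(f"toed_in={toed_in}: right earlier by {dt:.2f} ms, "
          f"right louder by {dl:.1f} dB, cues agree: {agree}")
```

For a source 30° to the right, the toed-out pair gives a right channel that is both earlier and louder. Toe the same capsules in and the right channel is still earlier, but now it is the quieter one.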

Anecdote #2: I did a recording for Atma once-upon-a-time in a large church in Montreal with a very long reverb time. During the sessions, I sat in the church (no control room), about 20 m from the mic pair. So, when I and the organist discussed what take to do next, we were talking live in the same room – no talkback speakers. During the editing for this disc, I happened to be shuttling around, looking for the beginning of a take – so I’d drop the cursor somewhere on the screen and hit “play” quickly to see where I was. One of the takes ended with the organist asking “did we get it?” and I responded “yup” quickly and loudly. It just so happened that, when I was shuttling around, looking for the right take, I hit “play” at the beginning of the “yup” and then quickly hit “stop”. The interesting thing is that it sounded, for that split second, like I was right next to the microphones – not 20 m away like I knew I was. So, I hit “play” again, and this time didn’t hit stop. This time, I sounded far away. What’s going on? Well, because the church was so big, it was possible to hit the stop button before any of the first reflections came in (save maybe the one off the floor), so it was possible (with a fast enough thumb on the transport buttons of the editing machine) to make the recording of my voice anechoic. The result was that I sounded 0 m away instead of 20 m.
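To get a feeling for why the size of the room matters here, it helps to work out how long after the direct sound a single reflection arrives. The sketch below does that for one flat surface using the mirror-image geometry. I never measured that church, so every dimension in the example (a 1.5 m mouth height, a 3 m mic height, a side wall 10 m away, and a small room for comparison) is a made-up number for illustration only, as is the 343 m/s speed of sound.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def reflection_gap_ms(separation_m, source_offset_m, mic_offset_m):
    """Delay of a single-surface reflection relative to the direct sound.

    separation_m: source-to-mic distance measured along the surface.
    source_offset_m, mic_offset_m: perpendicular distance of each from
    the reflecting surface (floor, wall, ...).
    """
    direct = math.hypot(separation_m, source_offset_m - mic_offset_m)
    # The reflected path equals the straight line from the mirror image of
    # the source (on the far side of the surface) to the microphone.
    reflected = math.hypot(separation_m, source_offset_m + mic_offset_m)
    return 1000.0 * (reflected - direct) / SPEED_OF_SOUND

# Hypothetical geometries, not measurements of the actual church.
print(f"floor, talker 20 m away:     {reflection_gap_ms(20.0, 1.5, 3.0):.1f} ms")
print(f"side wall 10 m away:         {reflection_gap_ms(20.0, 10.0, 10.0):.1f} ms")
print(f"small room, wall 1.5 m away: {reflection_gap_ms(2.0, 1.5, 1.5):.1f} ms")
```

With these made-up numbers the floor reflection follows almost immediately, but the first wall reflection in the big room arrives several times later than it would in a small room. That long gap is what leaves a brief window of effectively dry sound.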

The moral of the stories thus far? In order to deliver a perception of precise distance and depth (even if it’s not accurate…) you need early reflections in the recording, and they have to be panned and delayed appropriately.

Step 3: The delivery

Think back to Step 1. We agreed (or at least I said…) that early reflections tell your brain how far away the sound source is. Now think of a loudspeaker in a listening room.

Case #1: If you have an anechoic room, there are no early reflections, and, regardless of how far away the loudspeakers are, a sound source in the recording without early reflections (i.e. a close-mic’ed vocal) will sound much closer to you than the loudspeakers.

Case #2: If you have a listening room with early reflections, but the loudspeakers are directional such that there is no energy being delivered to the side walls (for example, a dipole with the angles carefully chosen to point the null of the loudspeaker at the point of specular reflection from the side wall), then the result is the same as in Case 1. This time there are no early reflections because of loudspeaker directivity instead of wall absorption, but the effect at the listening position is the same.
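Finding where to point that null is a small geometry exercise. The sketch below is a simplified plan-view (2D) illustration that assumes a flat side wall at x = 0 and an ideal dipole whose nulls sit exactly 90° from its main axis. Real loudspeakers will not be this tidy, and the function name and coordinates are my own.

```python
import math

def side_wall_null_geometry(speaker, listener):
    """Specular reflection off a side wall at x = 0, found via the image source.

    speaker, listener: (x, y) in metres, both with x > 0.
    Returns the reflection point on the wall, the angle (seen from the
    speaker) between the listener and that point, and how far off the
    dipole's main axis the listener ends up once a null is aimed at it.
    """
    sx, sy = speaker
    lx, ly = listener
    # The wall mirrors the speaker to (-sx, sy). The straight line from that
    # image to the listener crosses the wall at the reflection point.
    t = sx / (sx + lx)
    reflection_point = (0.0, sy + t * (ly - sy))
    to_wall = (reflection_point[0] - sx, reflection_point[1] - sy)
    to_listener = (lx - sx, ly - sy)
    dot = to_wall[0] * to_listener[0] + to_wall[1] * to_listener[1]
    cos_angle = dot / (math.hypot(*to_wall) * math.hypot(*to_listener))
    separation_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    # An ideal dipole has its nulls 90 degrees away from its main axis.
    listener_off_axis_deg = abs(90.0 - separation_deg)
    return reflection_point, separation_deg, listener_off_axis_deg

# Hypothetical layout: speaker 1 m from the wall, listener 2.5 m from the wall.
point, sep, off_axis = side_wall_null_geometry((1.0, 0.0), (2.5, 2.6))
print(f"reflection point on wall: y = {point[1]:.2f} m")
print(f"listener vs. reflection point, seen from speaker: {sep:.1f} deg")
print(f"listener is {off_axis:.1f} deg off the dipole's main axis")
```

The closer the angle between the listener and the reflection point is to 90°, the less the listener is pushed off the dipole's main axis when the null is aimed at the wall.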

Case #3: If you have a listening room with early reflections, and the loudspeakers are omni-directional, then the early reflections from the side walls tell you how far away the loudspeakers are. Therefore, the close-mic’ed vocal track from Case #1 cannot sound any closer than the loudspeakers – your brain is too smart to be told otherwise.

The punchline

So, if you want to achieve precision in the distance and depth of your stereo recordings (whether you’re on the recording end or the playback end) you’re going to need to make sure that you have a reasonable mix of the following:

Early reflections in the recording itself have to be there, arriving at the right times, with the right gains and the right panning.
Not much energy in the early reflections in your listening room – either by putting some absorption on the walls in the right places, or by having reasonably directional loudspeakers (or both).

earfluff and eyecandy

mostly audio, but with some other stuff occasionally
