Asked to comment on John Newton’s recent Intelligencer article on recording and archiving at the BSO, here, and on the comments that followed it, BMInt writer David Griesinger responded in extraordinary detail (for a comment). We publish his response as an article.
A reply to the comments after John Newton’s article would have to be book-length to cover the points raised, and in the end I think no one would be convinced. But I might be able to add some water to the fire. To begin with the subject of SACD vs. PCM, I would like to direct readers to the excellent Wikipedia article here. I quote the following paragraphs:
Audible differences compared to PCM/CD
In the audiophile community, the sound from the SACD format is thought to be significantly better than that of older-format Red Book CD recordings. However, in September 2007 the Audio Engineering Society published the results of a year-long trial in which a range of subjects, including professional recording engineers, were asked to discern the difference between SACD and compact disc audio (44.1 kHz/16 bit) under double-blind test conditions. Out of 554 trials, there were 276 correct answers, a 49.8% success rate corresponding almost exactly to the 50% that would have been expected by chance guessing alone. The authors suggested that different mixes for the two formats might be causing perceived differences, and commented:
Now, it is very difficult to use negative results to prove the inaudibility of any given phenomenon or process. There is always the remote possibility that a different system or more finely attuned pair of ears would reveal a difference. But we have gathered enough data, using sufficiently varied and capable systems and listeners, to state that the burden of proof has now shifted. Further claims that careful 16/44.1 encoding audibly degrades high resolution signals must be supported by properly controlled double-blind tests.
This conclusion is contentious among a large segment of audio engineers who work with high resolution material and many within the audiophile community. Some have questioned the basic methodology and the equipment used in the AES study.
Double-blind listening tests in 2004 between DSD and 24-bit, 176.4 kHz PCM recordings reported that among test subjects no significant differences could be heard. DSD advocates and equipment manufacturers continue to assert an improvement in sound quality above PCM 24-bit 176.4 kHz. Despite both formats’ extended frequency responses, it has been shown people cannot distinguish audio with information above 21 kHz from audio without such high-frequency content.
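For readers who like to check the arithmetic, here is a minimal back-of-the-envelope sketch in Python (my own check, not part of the AES study) of just how close 276 correct answers out of 554 trials is to pure guessing:

```python
import math

correct, trials = 276, 554
rate = correct / trials                          # 0.498, the 49.8% quoted above

# Normal approximation to the binomial: how many standard errors is this
# result from the 50% that blind guessing would produce?
standard_error = math.sqrt(0.5 * 0.5 / trials)
z = (rate - 0.5) / standard_error

print(f"success rate {rate:.3f}, {abs(z):.2f} standard errors from chance")
```

The answer comes out to less than a tenth of a standard error from chance, which is why the study's authors felt entitled to shift the burden of proof.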
Readers with an open mind and a sense of humor might enjoy reading a paper I wrote on this subject in 2003. It was intended as a serious but lighthearted look at the possible reasons there might be differences between audio formats with and without frequencies above 20 kHz. To quote from my web page:
Being currently over 60, and having in my youth studied information theory, I have a low tolerance for claims that “high definition” recording is anything but a marketing gimmick. I keep, like the Great Randi, trying to find a way to prove it. Well, I got the idea that maybe some of the presumably positive results on the audibility of frequencies above 18,000 Hz were due to intermodulation distortion that could convert energy in the ultrasonic range into sonic frequencies. So I started measuring loudspeakers for distortion of different types, and looking at the HF content of current disks. The result is the paper below, which is a HOOT! Anytime you want a good laugh, take a read. Slides from the AES convention in Banff on intermodulation distortion in loudspeakers and its relationship to “high definition” audio.
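The mechanism is easy to demonstrate numerically. Here is a minimal sketch (my own illustrative Python, not the measurement setup from the Banff slides): two ultrasonic tones pass through a weak memoryless nonlinearity of the kind a loudspeaker can exhibit, and a difference tone appears squarely in the audible band.

```python
import numpy as np

fs = 96_000                                   # sample rate high enough to carry ultrasonic content
t = np.arange(fs) / fs                        # one second of signal
# Two tones the listener cannot hear directly.
x = 0.5 * np.sin(2 * np.pi * 22_000 * t) + 0.5 * np.sin(2 * np.pi * 25_000 * t)

# A crude quadratic nonlinearity; the 0.05 coefficient is an illustrative
# value, not a measured driver.
y = x + 0.05 * x**2

# The quadratic term generates a difference tone at 25 kHz - 22 kHz = 3 kHz.
spectrum = np.abs(np.fft.rfft(y)) / len(y)
freqs = np.fft.rfftfreq(len(y), 1 / fs)
level_3k = 20 * np.log10(spectrum[np.argmin(np.abs(freqs - 3_000))] + 1e-12)
print(f"intermodulation product at 3 kHz: {level_3k:.0f} dB re full scale")
```

Play only the ultrasonic tones through a perfectly linear system and you hear nothing; add a little nonlinearity and a 3 kHz tone appears. That is how ultrasonic program content can change what reaches the ear even though the ultrasonic content itself is inaudible.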
Single-bit converters are cheap, but their signal-to-noise ratio is approximately 0 dB, and the noise spectrum is white, or uniform with frequency. This means that if you run them at 2.8 megahertz, the S/N at 10 kHz is 2,800,000/10,000, a power ratio of 280, or about 24 dB. To make them work at all, you have to convert the output of the converter to analog, pass it through an extremely complex filter, and feed it back to the input. The feedback reduces the noise at audio frequencies while increasing it at frequencies we hope are inaudible. The result is a bit stream at 2.8 MHz that you can convert to PCM with some straightforward digital mathematics. The converter used in SACD equipment was designed by Bob Adams at Analog Devices. The conversion filter is a work of art, fiendishly complex but with decent linearity at audio frequencies. The filter defines the properties of the output signal, which is far from the “purity” often claimed. The converter noise starts to rise dramatically above 20 kHz, as can be seen in the paper above. Fortunately, no one can hear it.
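A toy noise-shaping loop makes the idea concrete. The sketch below (a first-order modulator in Python, far simpler than Bob Adams's multi-order design, and purely illustrative) runs a 1 kHz tone through a 1-bit quantizer inside a feedback loop at the SACD bit rate, then compares the noise below 20 kHz with the noise far above it:

```python
import numpy as np

fs = 2_822_400                               # the SACD bit rate: 64 x 44.1 kHz
n = 1 << 16
t = np.arange(n) / fs
x = 0.5 * np.sin(2 * np.pi * 1_000 * t)      # a 1 kHz test tone

# First-order delta-sigma loop: an integrator feeding a 1-bit quantizer, with
# the quantizer output fed back and subtracted from the input.  The quantizer
# alone is hopelessly noisy; the feedback pushes that noise upward in
# frequency, out of the audio band.
integrator = 0.0
feedback = 0.0
bits = np.empty(n)
for i in range(n):
    integrator += x[i] - feedback
    feedback = 1.0 if integrator >= 0.0 else -1.0
    bits[i] = feedback

# Even without shaping, spreading white quantization noise over 2.8 MHz leaves
# only 10 kHz / 2.8 MHz of its power below 10 kHz, i.e. the roughly 24 dB
# improvement quoted in the text (10*log10(280) is about 24.5 dB).
spectrum = np.abs(np.fft.rfft(bits * np.hanning(n))) ** 2
freqs = np.fft.rfftfreq(n, 1 / fs)
audio_band = spectrum[(freqs > 2_000) & (freqs < 20_000)].mean()
ultrasonic = spectrum[(freqs > 100_000) & (freqs < 1_000_000)].mean()
print(f"noise density above 100 kHz is {10 * np.log10(ultrasonic / audio_band):.0f} dB "
      f"higher than in the audio band")
```

The shaped spectrum behaves the way the paragraph above describes: quiet in the audio band, rising steeply above 20 kHz.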
Single-bit converters seem to be a godsend, but they are extremely sensitive to clock jitter. Great care must be taken with the clock signals that drive them, and such care is often, perhaps usually, lacking. But this is another story.
For reasons never explained, Analog Devices decided to bring the raw converter output to a pin on the converter chip, even though the chip was designed for PCM output. Engineers at Philips and Sony, looking for some way to get consumers to junk their CDs and buy some new product, jumped at this so-called “pure” signal. Early demos were either deliberately or ignorantly falsified to make the SACD sound better than PCM. In one case I think there was even a different mix used in the comparisons. So we have SACD. With some imagination, and audio engineers HAVE to be possessed with imagination, SACD sounds infinitely better.
Microphone technique and loudspeaker reproduction
As for the illusion that is audio reproduction, my current research is into physical models of human hearing based on what we know about the mechanics of the ear, the information passed upward to the brain, and the properties of speech and music. Too much to explain here. But a few facts may be helpful.
First, speech and music both depend on sounds that are largely pitched tones rich in upper harmonics. Why pitch? Why many upper harmonics? Imagine that the ear were capable of very high pitch discrimination (and it is). Then it could use pitch discrimination to filter out environmental noise and other signals, and concentrate on a particular sound it found important (it can). A consequence is that the upper harmonics of useful sounds (speech) are where the information is. Take away everything below 1000 Hz and we can still understand speech very well, sometimes even better than when the lower frequencies are present.
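The “everything below 1000 Hz” experiment is easy to try yourself. A minimal sketch, assuming you have a mono speech recording named speech.wav on hand (the filename and the sixth-order filter are my choices, not anything canonical):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

rate, speech = wavfile.read("speech.wav")          # any mono speech recording
speech = speech.astype(np.float64)

# Remove everything below 1 kHz; the information-bearing upper harmonics remain.
sos = butter(6, 1000, btype="highpass", fs=rate, output="sos")
filtered = sosfilt(sos, speech)

# Normalize and write out; listen and judge the intelligibility for yourself.
filtered = 0.9 * filtered / np.max(np.abs(filtered))
wavfile.write("speech_above_1kHz.wav", rate, filtered.astype(np.float32))
```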
This all matters because, particularly in the presence of noise and reverberation, we localize sounds primarily at frequencies above 1000 Hz. Head shadowing is large at these frequencies, and we can detect sound direction to 2 degrees or better, good enough to separate the instruments in a string quartet at a distance of 80 feet or more (at 80 feet, 2 degrees subtends just under three feet, roughly the spacing between adjacent players). In great acoustics we can use these abilities sitting in a concert seat, but they are useless in listening to a typical recording.
In two-channel stereo, sound from the left speaker diffracts around the head to the right ear, and sound from the right speaker diffracts around the head to the left ear. If both speakers are producing the same sound we hear a “phantom center,” because the signals at our two ears are identical. But what we hear is NOT the signal the speakers produced. The sound diffracted around the head interferes with the direct sound from each speaker. At low frequencies the two signals add, but at about 1600 Hz the time delay is sufficient for the diffracted sound to cancel the direct sound, creating a very audible dip at that frequency. Vocalists panned to the center are always heard with this odd frequency response unless the recording engineer attempts to correct it, which assumes the engineer knows the size of the listener's head.
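A few lines of arithmetic show where that dip comes from. In the sketch below (my own illustration; the 0.3 ms crosstalk delay and 0.7 relative level are assumed round numbers for a typical head and speakers at plus and minus 30 degrees), each ear receives the direct sound plus the delayed, attenuated copy diffracted around the head, and their sum develops a deep notch:

```python
import numpy as np

tau = 0.3e-3        # assumed extra path delay of the around-the-head crosstalk, seconds
gain = 0.7          # assumed relative level of the diffracted copy

freqs = np.arange(100.0, 5000.0)                     # 100 Hz to 5 kHz in 1 Hz steps
ear_response = np.abs(1.0 + gain * np.exp(-2j * np.pi * freqs * tau))
response_db = 20 * np.log10(ear_response)

print(f"boost at low frequencies: {response_db[0]:+.1f} dB")
print(f"deepest notch near {freqs[np.argmin(response_db)]:.0f} Hz, "
      f"{response_db.min():+.1f} dB")
```

With these assumed numbers the notch lands just below 1.7 kHz, in the region of the roughly 1600 Hz dip described above, and it moves with head size, which is why the engineer cannot fully correct it without knowing the listener's head.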
This diffracted sound also strongly affects the localization of signals that are NOT panned exactly to the center, because the principal frequencies used for localizing sound in a complex scene are concentrated in the region of these interference dips. The result is that some frequencies are localized in different directions than others. The brain has to average somehow to decide where the image should be, and the data on which to base that decision are continually changing. The result is a high degree of uncertainty about where any sound between the center and left or right really comes from. These topics are covered in the following paper (check out figure 7):
The bottom line is that for any sound source not panned to the exact center of a stereo image, and for any listener not seated exactly between the speakers, the precise location of the sound is not stable. It is thus impossible with loudspeaker stereo to have precise localization of sounds between center and left, or between center and right. We may still think we can localize the sound (the brain does not like uncertainty), but we really can't. When there is a small delay between the left and right loudspeaker signals, such as occurs when a pair of microphones is spaced 17 to 24 centimeters apart, these delays add to the frequency-dependent confusion of localization. The image is widened, but the uncertainty increases. Loudspeaker stereo can NEVER approach our ability to localize in a natural environment. So the recording engineer typically spreads the sound out wider than we would hear it in a concert. That, plus ample imagination, creates the illusion we call stereo.
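To put numbers on the spaced-pair case, here is a far-field sketch (the 20 cm spacing is an illustrative value inside the 17 to 24 cm range mentioned above):

```python
import math

SPEED_OF_SOUND = 343.0        # m/s
spacing = 0.20                # 20 cm between the two capsules

for angle_deg in (10, 30, 60):
    # Far-field time-of-arrival difference between the two microphones.
    delay_ms = spacing * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND * 1000
    print(f"source {angle_deg:2d} degrees off axis -> {delay_ms:.2f} ms between channels")
```

Delays of a few tenths of a millisecond are comparable to, or larger than, natural interaural delays, and they are layered on top of the loudspeaker crosstalk delays described earlier.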
But it gets worse. If we use a coincident “point” microphone such as a Calrec Soundfield, crossed hypercardioids, etc., the microphone patterns are also less sharp than our ears. The most directive configuration is a pair of figure-of-eight patterns crossed at 90 degrees, the Blumlein arrangement. The source angle that produces a one-dB difference between the channels is 3.3 degrees, not as good as the ~2-degree acuity of natural hearing. Other patterns are worse. Two cardioid microphones angled 120 degrees apart need a source angle difference of 11.5 degrees to achieve the same 1 dB of channel difference.
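The 3.3-degree figure for the Blumlein pair is easy to reproduce. In the sketch below (my own check of the arithmetic), each figure-of-eight capsule's sensitivity is the cosine of the angle between the source and that capsule's axis, with the axes aimed 45 degrees left and right of center:

```python
import math

def blumlein_difference_db(source_deg):
    """Interchannel level difference for crossed figure-of-eights at 90 degrees."""
    left = math.cos(math.radians(source_deg - 45.0))
    right = math.cos(math.radians(source_deg + 45.0))
    return 20.0 * math.log10(left / right)

for deg in (1.0, 2.0, 3.3, 5.0):
    print(f"source {deg:3.1f} degrees off center -> "
          f"{blumlein_difference_db(deg):.2f} dB between channels")
```

It takes 3.3 degrees of source displacement to produce the 1 dB difference, versus roughly 2 degrees of acuity for natural hearing, and broader patterns need correspondingly more.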
I own three soundfield microphones, including a Calrec mark four, and have made several recordings of orchestral and chamber music with them while simultaneously recording with a dummy head that has accurate models of my own pinna, ear canals, and eardrums. When these recordings are played back through special headphones equalized with probe microphones at my own eardrums, the realism is startling. I have compared the binaural recordings to the B-format soundfield recordings many times, using any reproduction technique I can think of. The closest match is when I play the soundfield recordings through headphones. The timbre and the apparent distance of the music are unchanged, but the instruments are all jammed up in the center because the soundfield simply cannot separate them as well as the binaural head.
Playing the soundfield recordings through loudspeakers is hopeless. The sound is invariably muddy and distant. You cannot use any type of first-order microphone without putting the microphones much closer to the music than would be ideal for concert listening. This increases the direct to reverberant ratio and widens the image enough that the recording begins to have the clarity of natural hearing.
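The distance trade-off can be put roughly into numbers. In a statistically diffuse field the direct sound falls 6 dB per doubling of distance while the reverberant level stays roughly constant, so the direct-to-reverberant ratio goes as 20*log10 of the critical distance over the microphone distance. A sketch, with an assumed hall-sized critical distance of 5 m (my round number, not a measurement):

```python
import math

critical_distance = 5.0   # assumed distance at which direct and reverberant energy are equal

for r in (2.0, 5.0, 10.0, 20.0):
    # Direct level falls with 1/r^2; reverberant level is roughly constant,
    # so the ratio between them depends only on r relative to the critical distance.
    dr_db = 20.0 * math.log10(critical_distance / r)
    print(f"{r:5.1f} m from the source -> direct-to-reverberant ratio {dr_db:+.1f} dB")
```

With these assumed numbers a microphone 2 m from a section enjoys roughly 14 dB more direct-to-reverberant ratio than a listener 10 m away, which is the clarity a close pair buys and a distant pair cannot.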
In my own experience as a recording engineer I find that some variety of multi-miking is essential, although I seldom use more than about 14 microphones. Most ensembles are too large to find a single point at which the sound is both balanced and clear. The fact that a single microphone pair must be closer to the conductor than a perfect concert seat means that instruments in the back of the orchestra invariably sound too soft and too distant. Early recordings moved the orchestras into the middle of a hall precisely to mitigate the early reflections of a stage house that increase the sonic distance of more distant instruments. Under these conditions a simple microphone technique has a chance of working.
Recording a small group is a different matter. I routinely use a miniature soundfield to record the Boston Camerata if a multichannel recording is not wanted. But I always use at least two additional microphones for the hall sound. And nearly always a few accents are helpful.
Two-channel stereo is limiting. It is very helpful to reproduce the sound with more than two loudspeakers. Three speakers give you twice the localization accuracy of two if you fully utilize the center, and the sweet spot is also greatly broadened. With careful multi-miking and a good surround setup an exceedingly musical mix can be made. An accurate illusion of depth can be achieved through careful use of artificial reverberation, a field in which I have both expertise and equipment. See “Recording the Verdi Requiem in Surround and High Definition Video.”
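One way to see why a real center channel helps: with pairwise, constant-power panning over left, center, and right, each pan only has to bridge half the angle that a plain stereo pair must span, which is the sense in which the localization accuracy roughly doubles. A minimal sketch (the speaker angles and the pairwise scheme are my illustrative choices, not a description of my mixing setup):

```python
import math

SPEAKERS = [-30.0, 0.0, 30.0]      # left, center, right, in degrees

def pan_gains(source_deg):
    """Constant-power pan between the adjacent pair of speakers that brackets the source."""
    for a, b in zip(SPEAKERS, SPEAKERS[1:]):
        if a <= source_deg <= b:
            x = (source_deg - a) / (b - a)           # 0 at speaker a, 1 at speaker b
            return {a: math.cos(x * math.pi / 2), b: math.sin(x * math.pi / 2)}
    raise ValueError("source outside the speaker span")

for angle in (-15.0, 30.0):
    gains = pan_gains(angle)
    pretty = ", ".join(f"{spk:+.0f} deg: {g:.3f}" for spk, g in gains.items())
    print(f"source at {angle:+.0f} deg -> {pretty}")
```

Only the two speakers bracketing the source carry energy, so a half-left image never depends on the far-right speaker, and the phantom images have to bridge only 30 degrees instead of 60.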
I gave up using a soundfield microphone or a dummy head as a main microphone because you cannot derive a discrete center channel from them. I rely largely on Schoeps supercardioid microphones to capture as much direct sound from a section as possible, with as little leakage from other sections and as few early reflections as possible. I don't think I can convince anyone of the virtues of this type of technique, but (nearly) every commercially successful engineer of classical (or pop) music does pretty much the same thing. They would gladly do something else if it worked better.