Incidentally, the female vocalist in this recording also used a condenser mike, which she held in her hand, so its distance from her mouth kept varying. It's worth noting that the DSD/SACD system crashed and burned on some of her sibilants and also on some of her "T" sounds, generating a spurious spitty distortion in reaction to these sounds, which contained rich high frequency energy, apparently too much for the DSD/SACD algorithms to handle cleanly. Thus, the treble overload distortion problem, which we first observed and reported as our first objection to DSD, many years ago when Sony gave its first demos of DSD at AES, is still very much with us.
      There is a silver lining to this sad and stormy saga. Record industry insiders are telling us that, within the industry, DSD/SACD is on its way out, and indeed is already regarded as a dead format. The only place it is still alive is in the minds of consumers, who were temporarily seduced into the SACD camp by the fact that, among consumer software releases, SACD jumped out ahead of DVD-A at the starting gate. But, in the recording studios, where tomorrow's consumer software releases are already being made, PCM, in the form of DVD-A or DVD 24/96, already dominates, and very few professionals are even taking DSD/SACD seriously any more. They simply can't take seriously a music recording and archiving system where the signal out sounds so different from the signal in. They can't take seriously a music recording and archiving system that loses so much intimate musical information, timbre, and texture. And they want to be able to creatively control the voicing and subtle coloring of their musical document themselves (say by using different sounding mikes for different musical instruments) -- not have the recording system impose its own strong coloring upon them (and upon you the listener) without anyone having a choice.


Why?

      It is obvious that DSD/SACD has some severe sonic problems, as we have chronicled above and in previous articles. Now it's time to ask why. There are several crucial questions you should ask, and we should try to answer.
      Why does DSD/SACD sound so different from DVD-A?
      Why does DSD/SACD sound so different from a live mike feed, so different from the original music?
      Why did DSD/SACD lose so much musical information coming directly from the violin and the close-up guitar?
      Why does DSD/SACD convert musical treble information into bursts of white noise?
      Why does DSD/SACD distort when there's high treble energy, as with sibilants and cymbals?
      First, DSD/SACD sounds so very different from DVD-A for a very simple reason. DVD-A intrinsically has 24 bits of musical resolution, whereas DSD/SACD intrinsically has only 6 bits of musical resolution (especially for music's trebles; below we'll discuss variations with frequency). Additionally, DSD/SACD attempts to artificially boost its intrinsic resolution via the use of smoke and mirrors, and the particular choice of smoke and mirrors it employs also changes and colors the sound, making its sound even more different from DVD-A's.
      Second, DSD/SACD sounds very different from the live mike feed, and from the original music, for much the same reasons. DVD-A is evidently very accurate, accurately revealing the sound of the live mike feed and the sound of real musical instruments (indeed even revealing the slight glare introduced by solid state chips handling the signal). Since DSD/SACD sounds very different from a competing system that is accurate to the original input signal, it logically follows that DSD/SACD will sound very different from the original input signal. Again, the root of the problem for DSD/SACD is twofold. It intrinsically has only 6 bits of resolution. And the smoke and mirror tricks, which it uses to artificially boost its resolution, themselves have sonic consequences that make the signal output from DSD/SACD sound very different from the signal input to DSD/SACD.
      Third, DSD/SACD loses a lot of musical information because 6 bits of resolution aren't nearly enough to capture the musical information the human ear can appreciate -- and because the smoke and mirrors tricks used to artificially boost DSD/SACD's seeming resolution beyond 6 bits have dire consequences, effectively erasing information in the time domain. Let's plunge now into a more detailed explanation.

Intrinsic Resolution

      You probably already know that 16 bits of musical resolution, as on the classic CD format, is not quite enough to optimally capture music to the satisfaction of human hearing ability. In other words, human hearing listening to music live can resolve musical details and information to a finer degree than 16 bits' worth, and thus 16 bits are not sufficient to accurately render music, for purposes of appreciation by human hearing. You've surely read many articles verifying that music sounds audibly better when the playback equipment uses interpolation to enhance the 16 bit native resolution coming off a CD up to say 20 bits. If human hearing could not hear and appreciate reproduction of music with a resolution of 20 bits instead of 16 bits, then we would not be able to hear (let alone appreciate and enjoy) any difference between the 16 bit playback and the enhanced 20 bit playback. The fact that we do hear and appreciate and want that improvement is in itself proof we can hear 20 bits of resolution, and therefore that a recording/reproduction system must have at least 20 bits of resolution, not just 16 bits, if it is to sound truly accurate, truly revealing of the many subtleties that make music sound real and live instead of canned. DVD-A has a full 24 bits of musical resolution, and it sounds much more subtly nuanced and real than 16 bit CD, so these extra bits above 16 are truly important. We also verified this ourselves in our original research, documented in IAR issues #49 (volume 5) and onwards. We measured musically audible differences between audio components, and were able to show that audible differences occurred even way down at the 20 bit level of resolution or beyond. This proved that a digital system should have at least 20 bits of resolution, if it were not to audibly lose musical information. And it suggests that the 24 bit resolution of DVD-A is a wise, useful, and perhaps necessary achievement, to obtain audible accuracy.
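      A quick way to sanity check these bit counts is to convert them into dynamic range: each bit of linear PCM resolution is worth roughly 6.02 dB, a standard result. The short Python sketch below is our own illustration, not drawn from any format specification:

```python
import math

# Dynamic range of linear PCM quantization:
# 20*log10(2**bits), i.e. roughly 6.02 dB per bit.
def pcm_dynamic_range_db(bits: int) -> float:
    return 20 * math.log10(2 ** bits)

for bits in (16, 20, 24):
    print(bits, "bits:", round(pcm_dynamic_range_db(bits), 1), "dB")
# 16 bits: 96.3 dB   (classic CD)
# 20 bits: 120.4 dB
# 24 bits: 144.5 dB  (DVD-A)
```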
      In contrast, DSD/SACD has only 6 bits of intrinsic informational resolution. This is obviously much worse than the 24 bit resolution of DVD-A, much worse than the resolution needed to give a system audible accuracy. How much worse? How about 262,144 times worse? The resolving power of DVD-A is 262,144 times finer than DSD/SACD's intrinsic capability. The intrinsic information recovery capability of DSD/SACD is 262,144 times cruder than that of DVD-A, and roughly 262,144 times cruder than human hearing requires for audible accuracy. In other words, DVD-A can capture many subtle details of the real sound of musical instruments, and human hearing can hear and appreciate these many subtle details that make music sound real and live - but DSD/SACD will miss capturing and reproducing all these subtle details, because DSD/SACD can't even resolve (can't see or capture) these musical details in the input music waveform. In fact, DSD/SACD's intrinsic resolution loses all pieces of input musical information unless they are huge, obvious generalized trends of the gross music waveform outline. DSD/SACD's intrinsic resolution can't even see the smaller steps of the music waveform input to it, and indeed can't even see the steps until they are huge, 262,144 times larger than the steps that DVD-A can see in the same input music waveform.
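      For readers who like to check arithmetic, here is a short Python sketch (ours, not from any Sony or DVD Forum document) of where these numbers come from. The 2.8224 MHz figure is DSD's standard 1-bit sample rate, 64 times CD's 44.1 kc:

```python
import math

# The comparison above: 24-bit PCM has 2**24 amplitude levels,
# a 6-bit system only 2**6; the ratio between them is 2**18.
dvd_a_levels = 2 ** 24            # 16,777,216 levels
six_bit_levels = 2 ** 6           # 64 levels
ratio = dvd_a_levels // six_bit_levels
print(ratio)                      # 262144

# One way to arrive at the 6-bit figure: DSD runs 1-bit samples at
# 64x the 44.1 kc CD rate, and log2(64) = 6.
dsd_rate = 2_822_400              # DSD sample rate in Hz (64 x 44,100)
oversample = dsd_rate // 44_100   # 64
print(int(math.log2(oversample))) # 6
```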
      A visual analogy might help you get a feel for just how big a difference this is. Imagine that poker chips (or coins) came in many different colors. Now, imagine a tall stack of these poker chips, in a rich variety of colors. Imagine that your eyes can discern each poker chip in the stack, and that you can appreciate all the varied colors of the different poker chips in the stack, as your eye progressively scans the stack from bottom up to the top. That's like being able to discern all the different nuances of a musical note as the note progresses from beginning to end. You can hear that at the beginning of a violin note there's a sharp bite when the bow first attacks the string; you can hear that the note changes color during the sustain; you can hear that the sustain is not a constant tone, but as the note progresses it varies in color, varying with bowing pressure, and with the subtle noises that gut strings make when stroking steel; you can hear another color change in the color of the note when the violinist suddenly reverses bowing direction. Your ears, and hopefully the recording system, have the resolution to hear all these many coloration changes in the violin note as it progresses - just as, in our visual analogy, your vision has the resolution to see the different colors in the stack of poker chips.
      Now, imagine that the resolution of your eyesight were poorer and cruder, in fact 262,144 times cruder. Then you would not be able to see the individual poker chips in the stack. You would not be able to see and appreciate all their different colors in the stack, progressing from bottom to top. Indeed, you'd be lucky to even be able to tell that there was any sort of object, such as a stack or stick or something in front of you, unless it was very, very big. How big? If your visual resolution were 262,144 times cruder, you would not even be able to detect the presence of this object until it was as tall as a skyscraper half a mile high (twice as tall as the Empire State Building) (and, since our visual field is two dimensional, the 'column' of poker chips would also have to be as fat across as the skyscraper is tall, i.e. half a mile wide, before your vision could even detect its presence).
      So, to complete this visual analogy, imagine a solid wall composed of many poker chip stacks. The wall is 262,144 chips tall, half a mile tall, and there are enough stacks side by side so that this wall is half a mile wide. If you have the crude intrinsic resolution vision of DSD/SACD, you can just barely discern that there is an object of some sort in front of you (just as DSD/SACD can just barely discern that there is some generic sort of violin note playing). You would not be able to discern any details or texture within this huge wall or object in front of you. It would just appear to be a monolithic blob, and your perception of its many colors would be simplified into a single monochromatic color that was some sort of average color for all the varied poker chips.
      If however your visual resolution is 262,144 times better, then you can see each of the 262,144 individual poker chips in each of the stacks, and you can see and appreciate all the different varied colors as each stack progresses from bottom to top (just as you can appreciate all the nuanced changes in color as the violin note progresses from attack to decay, those crucial nuances which make the violin sound vivid and real and live).
      So far we've looked at just the comparative ratio between the intrinsic resolution of the two competing systems, which is a very large number, 262,144. While we still have this visual analogy fresh in our minds, it's also worthwhile briefly looking at the overall total intrinsic resolutions of these two systems.
      Imagine a town (or lake or country restaurant) that's 32 miles away from you. You've made this 32 mile trip many times, so you know the route well, with all the thousands, indeed millions of different varied details you can see along the way. Now, imagine our stack of multicolored poker chips turned on its side so its height runs along the ground, and imagine that the height of the stack was 32 miles, so that this stack of poker chips laid down now reaches the entire length of your 32 mile trip. This 32 mile trip represents the total height of the music waveform, encoded into digital. The total intrinsic resolution of a 24 bit system, such as DVD-A, is so fine that it can pick out each individual poker chip in that 32 mile long stack, so you can see and appreciate the different varied color of each detail chip in this very long stack. If my arithmetic is correct, there are 16,777,216 poker chips in that stack laid on its side, and 24 bit digital can see each and every one of them, and pass along to you the information about the varied nature of each detail. This visual analogy, translated back into audio, means that, with a music waveform 32 miles high in amplitude, a 24 bit digital system such as DVD-A can pass along to you the fine details that make music sound real and live, details no thicker than a poker chip in that 32 mile high stack, details such as the buzzing and groaning of gut and rosin when they first attack steel in the playing of a violin.
      This amazing feat is also a testament to human hearing. We can hear and appreciate musical details that are merely about 1/16,777,216 of the full music waveform amplitude, and that sensitivity is important in telling us that music sounds real, live, vivid. If the amplitude of a music waveform were 32 miles high, we could hear and appreciate subtle musical details no thicker than a poker chip, and that's why we can hear and appreciate the finer detail, the more realistic sounding detail, that a 24 bit system can give us (compared say to 16 bit CD).
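      You can verify the poker chip arithmetic yourself with the short Python sketch below (ours); dividing a 32 mile stack into 2**24 chips implies a chip roughly 3 mm thick, which is our assumption of a typical real chip's thickness:

```python
# Checking the analogy: a stack of 2**24 poker chips (one chip per
# 24-bit amplitude step) stretching 32 miles implies a chip roughly
# 3 mm thick -- about right for a real poker chip.
MILE_M = 1609.344                    # meters per mile
stack_m = 32 * MILE_M                # the 32-mile full-scale stack
chip_mm = stack_m / 2 ** 24 * 1000   # thickness of one chip, in mm
print(round(chip_mm, 2))             # ~3.07 mm

# The 6-bit system's smallest detectable "object" is 1/64 of the stack:
chunk_m = stack_m / 64
print(round(chunk_m))                # ~805 m, i.e. about half a mile
```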
      How well does DSD/SACD's intrinsic resolution fare in resolving these same musical details? How well does it fare in perceiving each of the 16,777,216 poker chips in that 32 mile long stack laid on its side, in that 32 mile long journey you make? How well does it fare in reproducing for your ears the many individual details, each merely the thickness of a poker chip, in that 32 mile long journey your ears make in tracking the amplitude of the music waveform? Your amazing ears can hear and appreciate details about as small as a poker chip, even in a music waveform 32 miles high. How well does DSD/SACD give you the resolution you need? As you might already guess, the answer is, not well at all. In point of fact, on your 32 mile journey, the intrinsic resolution of DSD/SACD won't allow it to even detect any object smaller than half a mile long. The intrinsic resolution of DSD/SACD can only manage to divide the total amplitude of the music waveform into 64 crude chunks. If a musical detail is any smaller than 1/64 of the full amplitude, then the intrinsic resolution of DSD/SACD can't even see that it's there. That's equivalent to not being able to even detect any object alongside the road, on your 32 mile journey, unless that object is bigger than half a mile in size. Note that, if the object is bigger than half a mile, DSD/SACD still can't see any textural details within this huge object, nor any varied colors within it; it can only just barely detect that some sort of huge blob is indeed there.
      Returning from our visual analogy back to the music waveform proper, we see that DSD/SACD intrinsically chops the full scale of maximum amplitude into 64 crude, huge chunks, whereas 24 bit PCM, e.g. DVD-A, resolves that same full scale amplitude into much finer detail, namely 16,777,216 fine elements of subtle musical information. Even if a violin tone were at full scale amplitude, note that dividing it into just 64 segments yields only the crudest, most general, simplest information about the ongoing violin note. DSD/SACD can do tolerably well at tracking the overall sine wave shape of the continuing, ongoing violin note, and slight overall modifications to that overall sinusoidal shape, such as wrought by the presence of its second and third harmonic overtones. However, all the manifold subtle noises -- which tell you that the violin note is just starting its attack, or that the bowing is reversing, or that the violin sounds real and live because during the note you hear timbral and textural sounds of gut and rosin scraping steel -- all these crucial details are never even perceived by DSD/SACD's crude intrinsic 64 segment resolution. All these details are represented by sudden, singular, individualized kinks or spikes added to the overall sine wave shape of the simple underlying violin tone, but all these individual, distinct spikes are smaller than 1/64 of full scale in size, so the intrinsic resolution of DSD/SACD never even perceives them.
      And of course most violin notes are recorded at nowhere near full scale amplitude (indeed, most classical music occurs at roughly 1/10 of full scale amplitude). If a typical violin note is at 1/10 of a recording system's full scale amplitude, then the whole amplitude of the sine wave would be represented by merely 6 segments in DSD/SACD's intrinsic resolution.
      That's why DSD/SACD can only intrinsically recognize the broadest, most general overall outlines of the ongoing violin note, and can at best perceive it as a simple ongoing sine wave, with perhaps a second and third harmonic - but no starting or stopping sounds, and none of the natural timbral and textural information that makes the violin sound real, vivid, live. That overly simplified portrait, of a simple ongoing sine wave tone with just a couple of harmonics, naturally sounds smoothed down, rounder, more syrupy and liquid, softer, prettified, easier on the ears, and perhaps euphonically preferable for relaxing background cocktail music.
      In contrast, a real violin's sound suddenly attacks, suddenly reverses direction, and even during continuous parts of a note emits buzzing and scraping sounds that bring expressive emotion to the music -- that emote, implore, shout, weep, or celebrate, constantly changing in sympathy with the musical composition and at the command of the artist's interpretation. These real violin sounds demand and command your attention, especially because they are constantly changing, individualized sounds. Real violin sounds are usually not gentle, smoothed down, unchanging, simple, continuous ongoing tones, wherein those constantly changing, individualized natural sounds are obliterated into 1984-ish uniformity, whereby a (say) five second long violin note sounds generically the same throughout its duration, with the second through fifth temporal seconds simply being generalized clones of the same averaged sound you heard during the first temporal second.
      Also, as you can imagine, music reproduced with only 6 bits of resolution would sound pretty awful (crude, distorted, and chopped up). On a musical note at full scale loudness, distortion would be 1.6% of full scale, and this would probably already sound quite ugly since the nature of the distortion would not be benign sounding harmonic distortion. But classical music spends most of its time with its loudest notes around the 10% level, so you'd hear all these musical passages with 16% distortion. And of course as soon as some important individual piece of musical timbral information dropped in size below 1/64 of full scale, it would disappear entirely, which is 100% distortion.
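      These distortion percentages are simple arithmetic on a 64-level grid, as this short Python sketch (ours, with the article's figures rounded to 1.6% and 16%) confirms:

```python
# Quantization step of a 6-bit (64-level) system, relative to the signal.
full_scale_step = 1 / 64                  # one step as a fraction of full scale
print(round(full_scale_step * 100, 2))    # 1.56 (% of full scale)

# For a passage peaking at 10% of full scale, the same absolute step
# is ten times larger relative to the music itself:
quiet_passage = 0.10
print(round(full_scale_step / quiet_passage * 100, 1))  # 15.6 (%)
```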
      Let's take a moment to explain how the 6 bit intrinsic resolution of DSD/SACD comes about. Consider a sine wave, at the 20 kc upper edge of traditional music spectrum. It has a complete cycle within that 1/20,000 of a second, and this complete cycle has two peaks whose amplitude needs to be sampled (a positive peak and a negative peak). Thus, the minimum sampling rate for 20 kc music has to be at least 40 kc (and is usually 44.1 to 48 kc because of practical requirements). A 1 bit sampling system, such as DSD/SACD, can improve its true intrinsic resolution by oversampling beyond this 44.1 or 48 kc minimal requirement, and by then legitimately simply averaging these

(Continued on page 42)