Monday, February 14, 2011

The Basics of Digital audio

The theory of digital sound

Basic signal theory

As you probably know, sound is air which is vibrating very quickly. The speed of these vibrations (the number of back-and-forth movements per second) is called "frequency", which is a very important property of sound, especially music. The frequency of a sound is measured in Hz (=Hertz, named after a man called Hertz :-/ who did a lot of research into electromagnetic waves some time ago). Most people can hear frequencies in the range between 100Hz and 15000Hz. Some people can hear very high frequencies above 19000Hz, but scientists usually assume that the human ear is able to discern frequencies between 20Hz and 20000Hz, since those numbers make their calculations a lot easier.
Here are a few examples of different frequencies, if you'd like to play with them for a while:
60 Hz
440 Hz
-very- low
too high
Another very important property of sound is its level; most people call it volume. It is measured in dB (=decibel, named after a man called Decibel (NOT!!) all right, his real name was Bell, but he did invent the telephone and that is why we Dutch people still say 'mag ik hier misschien even bellen?' ('may I make a call here?') when we want to use your phone).
So why don't we measure loudness in bels instead of decibels? Well, mainly because your ear really can discern an incredible range of different loudness levels (a number with 11 zeroes), so they had to think of a trick (which I'm not going to explain here, sorry!) to be able to describe an incredible range with only a few numbers. They agreed to use tenths of bels, decibels, dB, instead of bels.

Most professional audio equipment uses a VU meter (=Volume Unit meter) which shows you the input or output level of your equipment. This is very convenient, but only if you know how to use it: a general rule is to set up the input and output levels of your equipment so that the loudest part of the piece you want to record/play approaches the 0dB lights. It is important to stay on the lower side of 0dB, because if you don't, your sound will be distorted badly and there's no way to restore that. If you're recording to (analog!) tape, instead of (digital) hard disk, you can increase the levels a bit, because there is enough so-called 'headroom' (=ability to amplify a little more without distortion) to push the VU meters to +6dB. There is some more information on calibrating equipment levels in the recording section below.
Some examples of different levels, if you'd like to play with them for a while:
0.0dB = 100% (maximum level)
-6.0dB = 50.0% (half amplitude)
-18.0dB = 12.5% (very quiet)
+6.0dB = 200% (a little too loud: a lot of distortion)
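The percentages in the list above can be checked with a few lines of Python. This is only a sketch, and it assumes the percentages are amplitude ratios, for which the usual rule is 20 times the base-10 logarithm of the ratio. The function names are made up for this example.

```python
import math

def db_to_percent(db):
    """Convert a level in dB (relative to 0dB) to an amplitude percentage."""
    return 100.0 * 10 ** (db / 20.0)

def percent_to_db(percent):
    """Convert an amplitude percentage back to a level in dB."""
    return 20.0 * math.log10(percent / 100.0)

# The four levels from the list above.
for db in (0.0, -6.0, -18.0, 6.0):
    print(f"{db:+5.1f} dB = {db_to_percent(db):6.1f}%")
```

The results come out slightly off the round numbers in the list (50% is really about -6.02dB), which is why people simply say "6dB" for a doubling or halving.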
Okay, now that you know the most important things about sound, let's finally go to the digital bit (ooh, a pun :-/ ): I've just told you about the properties of 'normal' (analog) sound. Now I'll tell you what the most important properties of digital sound are.

Digital Audio Theory

First of all, the famous 'sample rate'. The sample rate of a piece of digital audio is defined as 'the number of samples recorded per second'. Sample rates are measured in Hz, or kHz (kiloHertz, a thousand samples per second). The most common sample rates used in multimedia applications are:
8000 Hz: really yucky
11025 Hz: not much better
22050 Hz: only use it if you have to
Professionals use higher rates:
32000 Hz: only a couple of old samplers
44100 Hz: ahh, what a relief
48000 Hz: some audio cards, DAT recorders

Some modern equipment has the processing power required to enable even higher rates: 96000Hz or even an awesome 192000Hz will possibly / probably be the professional (DVD?) standard rates in a couple of years. The advantage of a higher sample rate is simple: increased sound quality. The disadvantage is also simple: a sample with a higher sample rate requires an awful lot more disk space than a low-rate sample. But with the hard disk and CD-R prices of today that isn't too much of a problem anymore.
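To get a feeling for the disk space involved, here is a rough sketch. It assumes uncompressed 16-bit stereo sound unless told otherwise (my assumption, not a rule), and the function name is invented for this example.

```python
def bytes_per_minute(sample_rate_hz, bits_per_sample=16, channels=2):
    """Size of one minute of uncompressed audio, in bytes."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60

for rate in (8000, 22050, 44100, 96000, 192000):
    megabytes = bytes_per_minute(rate) / 1_000_000
    print(f"{rate:>6} Hz: {megabytes:6.1f} MB per minute")
```

The size grows in step with the sample rate, so a 96000Hz recording takes a bit more than twice the space of the same recording at 44100Hz.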

....But Why?!

To answer that, let's look at a single period of a simple sine wave:
  • it starts at zero..
  • ..then it goes way up..
  • ..then it goes back to zero..
  • ..then it goes way down..
  • ..then it goes back to zero.
  • and so on...Sine waves sure have monotonous lives ;-)
[Figure: a sine wave]
When recording a certain frequency, you will need at least (but preferably more than) two samples for each period, to accurately record its peak and valley. This means you will need a sample rate which is at least (more than) twice as high as the highest frequency you'd like to record, which, for humans, is around 20000Hz. That's why the pros use 44100Hz or higher as the minimum sample rate! They can record frequencies up to 22050Hz with that. (Now you know why an 8000 Hz sample sounds so horrible: it only plays back a tiny part of what we can hear!)
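Why "preferably more than" two samples? A small sketch shows it: with exactly two samples per period, everything depends on where in the wave the samples happen to land. The 4000Hz tone and 8000Hz rate are just example numbers, and the helper names are made up.

```python
import math

def highest_recordable(sample_rate_hz):
    """Highest frequency that still gets two samples per period."""
    return sample_rate_hz / 2

def sample_sine(freq_hz, sample_rate_hz, count, phase=0.0):
    """Take `count` samples of a sine wave with the given starting phase."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz + phase)
            for n in range(count)]

# A 4000Hz tone at 8000Hz: exactly two samples per period.
unlucky = sample_sine(4000, 8000, 8)                   # lands on the zero crossings
lucky = sample_sine(4000, 8000, 8, phase=math.pi / 2)  # lands on peak and valley

print([round(x, 3) for x in unlucky])
print([round(x, 3) for x in lucky])
print(highest_recordable(44100))
```

In the unlucky case every sample is (practically) zero and the tone disappears; in the lucky case you catch every peak and valley. A little extra room above "twice the frequency" takes the luck out of it.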

Using an even higher samplerate, like 96000Hz, you can record higher frequencies, but you won't hear things like 48000Hz anyway. That's not the main goal of those super-rates. If you record at 96000Hz, you will have more than four samples for each 20000Hz period, so the chance of losing high frequencies will decrease dramatically! It will take quite a few years for consumer level soundcards to support these numbers, though. There are a few pro cards which already do, but you could easily buy a small car for the same money...

That's enough about frequency for now. As I said before, another very important property of sound is its level. Let's have a look at how digital audio cards process the sound levels.


The capacity of digital audio cards is measured in bits, e.g. 8-bit soundcards, 16-bit soundcards. The number of bits a sound card can manage tells you something about how accurately it can record sound: it tells you how many different levels it can detect. Each extra bit on a sound card gives you another 6dB of accurately represented sound (Why? Because each extra bit doubles the number of levels, and doubling a level always adds about 6dB). This means 8-bit soundcards have a dynamic range (=difference between the softest possible signal and the loudest possible signal) of 8x6dB=48dB. Not a lot, since people can hear up to 120dB. So, people invented 16-bit audio, which gives us 16x6dB=96dB. That's still not 120dB, but as you know, CDs sound really good, compared to tapes. Some freaks, that's including myself ;-) want to be able to make full use of the ear's potential by spending money on soundcards with 18-bit, 20-bit, or even 24-bit or 32-bit ADCs (Analog to Digital Converters, the gadgets that create the actual sample) which gives them dynamic ranges of 108dB, 120dB, or even 144dB or 192dB.
Unfortunately, all of the dynamic ranges I mentioned are strictly theoretical maximum levels. There's absolutely no way in the world you'll get 96dB out of a standard 16-bit multimedia sound card!!! Most professional audio card manufacturers are quite proud of a dynamic range over 90dB on a 16-bit audio card. This is partly because of the fact that it's not that easy to put a lot of electronic components on a small area without a lot of different physical laws trying to get attention. Induction, conduction or even bad connections or (very likely) cheap components simply aren't very friendly to the dynamic range and overall quality of a soundcard. But there's another problem, that will become clear in the next paragraph.
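The "6dB per bit" rule can be checked with a short sketch. It assumes the dynamic range is simply the ratio between the number of levels and one level, expressed in dB; the function name is invented for this example.

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range of a converter with the given number of bits."""
    levels = 2 ** bits
    return 20 * math.log10(levels)

for bits in (8, 16, 18, 20, 24, 32):
    exact = dynamic_range_db(bits)
    print(f"{bits:2d} bits: {exact:6.1f} dB (rule of thumb: {bits * 6} dB)")
```

The exact figure is about 6.02dB per bit, so the rule of thumb is slightly on the low side, but close enough for everyday use.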

Quantization noise

Back in the old days, when the first digital pianos were put on the market (most of us weren't even born yet), nobody really wanted them. Why not? Such a cool and modern instrument, and you could even choose a different piano sound!

The problem with those things was that they weren't as sophisticated as today's digital music equipment. Mainly because they didn't feature as many bits (and so they weren't even half as dynamic as the real thing) but also because they had a very clearly rough edge at the end of the samples.

[Figure: quantization noise]
Imagine a piano sample like the one you see here. It slowly fades out until you hear nothing.
At least, that's what you'll want... As you can see by looking at the two separate images, that's not at all what you get... These images are both extreme close-ups of the same area of the original piano sample. The upper image could be the soft end of a piano tone. The lower image, however, looks more like morse code than a piano sample! The sample has been converted to 8 bit, which leaves only 256 levels instead of the original 65536. The result is devastating.

Imagine playing the digital piano in a very soft and subtle way. What would you get? Some futuristic composition for square waves! That's not what you paid for ;-) This froth is called quantization noise, because it is noise that is generated by (bad) quantization.
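Here is a small sketch of what happens to the soft end of a tone. The "tail" is a made-up sine wave at 1% of the maximum level (not a real piano sample), and the helper names are invented for this example.

```python
import math

def quantize(value, bits):
    """Round a value between -1.0 and 1.0 to the nearest level of a signed sample."""
    max_level = 2 ** (bits - 1) - 1
    return math.floor(value * max_level + 0.5)

def quiet_tail(amplitude, count=200, period=50):
    """A soft sine wave standing in for the end of a fading note."""
    return [amplitude * math.sin(2 * math.pi * n / period) for n in range(count)]

tail = quiet_tail(0.01)

levels_8 = {quantize(v, 8) for v in tail}
levels_16 = {quantize(v, 16) for v in tail}

print("8-bit levels used: ", sorted(levels_8))
print("16-bit levels used:", len(levels_16))
```

At 8 bits the whole tail is squeezed into three numbers (-1, 0 and 1), which is exactly the square-wave froth described above. At 16 bits the same tail still uses a couple of dozen different levels.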

There is a way to prevent this from happening, though. While sampling the piano, the soundcard can add a little noise to the signal (about 3-6dB, that's literally a bit of noise). The noise nudges a very soft signal across the nearest step now and then, more often when the signal is close to that step and less often when it is far away. That way, you get a little more realistic variation instead of a square wave. The funny part is that you hardly hear the noise, because it's so soft and it doesn't change as much as the recorded signal, so your ears automatically forget it. This technique is called dithering. It is also used in some graphics programs, e.g. for reducing the number of colors in an image.
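One way to picture dithering: a signal smaller than one step always rounds to the same number, but with a little random noise added it rounds up some of the time, and the average follows the original. The sketch below assumes the simplest kind of noise (evenly spread over one step); real sound cards may use something fancier. The input value of 0.3 steps is just an example.

```python
import math
import random

def quantize(value):
    """Round to the nearest whole quantization step."""
    return math.floor(value + 0.5)

def quantize_with_dither(value, rng):
    """Add up to half a step of random noise either way, then round."""
    return quantize(value + rng.uniform(-0.5, 0.5))

rng = random.Random(1)   # fixed seed so the sketch is repeatable
signal = 0.3             # a constant input of 0.3 steps (example value)
count = 20000

plain = [quantize(signal) for _ in range(count)]
dithered = [quantize_with_dither(signal, rng) for _ in range(count)]

plain_mean = sum(plain) / count
dithered_mean = sum(dithered) / count
print(plain_mean, dithered_mean)
```

Without dither the 0.3 is simply lost (every sample is 0). With dither each sample is either 0 or 1, but the average lands close to 0.3, so the soft signal survives underneath a little noise.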


Another problem with digital audio equipment is called jitter. Until now, I've always assumed that the soundcard recorded the sample at exactly 44100Hz, taking one sample every 1/44100 second. Unfortunately that is -totally- unrealistic. There *always* is a tiny timing error which causes the sample to be taken just a little too late or just a little too soon.

Does this make a big difference then? Well, you could start nagging about everything, but then you'd probably have bought a more expensive soundcard in the first place. The really bad part is that the damage done by jitter depends on the frequency. Because it's related to the timing of the sample, the card stores the level the signal had a moment too soon or too late. The faster the signal is changing at that moment, the bigger the error, so high frequencies suffer the most. Typical jitter times range between 1.0 x 10^-9 seconds (that's a NANOsecond, read: almost nothing) and 1.0 x 10^-7 seconds (that's a hundred NANOseconds, not a lot more) but they make the difference between a 'pro' sound and a 'consumer' sound on e.g. different CD-players.

Digitizing sound

When you record a sample with your sound card, it goes through a lot of stages before you can store it on your hard disk as a sound file. Fortunately you don't have to worry about these stages, because modern sound cards and samplers take care of them for you.
I'm going to be a big bore and tell you about these stages anyway.
Let's see what happens when you press 'rec':
  • The sound card starts a very accurate stopwatch (the sample rate).
  • Then it transforms the sound coming in: it simply cuts off the very high frequencies which it cannot handle. This cripples the sound a lot, but it is required to prevent even more serious damage to the sound, which would make the sound unrecognizable. This is a low-pass (cut the 'high' frequencies, let the 'low' frequencies pass through) anti-aliasing (it prevents the 'foldover' described below) filter (because it takes away some parts and leaves the rest).
  • Every time the stopwatch has completed a cycle, the sound card's ADC looks at the filtered input signal. It measures how loud the incoming sound is at that exact moment in time (very much like a microphone would measure air pressure) and transforms the loudness level into the nearest digital number.
  • Then it shouts that number to the computer, which stores it somewhere in memory, and probably later on a hard disk.
[Figure: the Analog to Digital Conversion process]
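The recording steps above can be sketched in a few lines of Python. This is a toy model, not how a real sound card is built: the incoming sound is a made-up 440Hz tone, the low-pass filter step is left out (the tone is assumed to be filtered already), and all names are invented for this example.

```python
import math

def record(signal, sample_rate_hz, duration_s, bits=16):
    """Sample `signal` (a function of time in seconds, returning -1.0..1.0)."""
    max_level = 2 ** (bits - 1) - 1
    samples = []
    for n in range(round(sample_rate_hz * duration_s)):
        t = n / sample_rate_hz        # the stopwatch completes a cycle
        level = signal(t)             # the ADC looks at the input signal
        number = math.floor(level * max_level + 0.5)   # nearest digital number
        samples.append(number)        # "shout" it to the computer's memory
    return samples

def tone(t):
    """A 440Hz sine wave at half the maximum level (example input)."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)

samples = record(tone, 44100, 0.01)
print(len(samples), samples[:5])
```

One hundredth of a second at 44100Hz gives 441 numbers, and that list of numbers is all the computer ever knows about the sound.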

Sound card manufacturers put a brickwall-filter (look at the image below!) in their sound card, to prevent a very nasty side-effect called 'foldover'. Foldover is a pretty difficult concept, but I'll try to keep it simple.
It's more or less the same thing that happens when you look at a car's wheel when it drives past you very quickly. You'll sometimes see the wheel moving backwards. Another example can be found in old western movies where you'll see a train going by. The 'wheels' of the train will be moving backwards too, if the train's going fast enough.
All these 'illusions' are foldover-effects. They occur when a fast system at regular intervals analyzes something which is moving even faster than the system itself.
When recording at 22050Hz, your sound card will simply not be able to record any frequencies above 11025Hz, because you need at least two samples for each period, as described above. Without the low-pass filter, the sound card would blindly try to record those frequencies. But afterwards, when you play back the sample, you'll hear a totally different frequency instead of the original one. Just like the car's wheel that seems to be moving backwards, while it really isn't.
(The frequency you'll actually hear equals the sampling frequency minus the original frequency, e.g. 22050-12050=10000Hz, instead of the original frequency, in this case 12050Hz).
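The foldover calculation can be tried out with a short sketch. The function name is made up, and the formula is the general "fold back around half the sample rate" rule, of which the subtraction above is one case. The second part checks that the samples of the original tone and of the folded tone really are the same numbers.

```python
import math

def folded_frequency(freq_hz, sample_rate_hz):
    """Frequency you hear when freq_hz is sampled without a low-pass filter."""
    remainder = freq_hz % sample_rate_hz
    return min(remainder, sample_rate_hz - remainder)

print(folded_frequency(12050, 22050))   # the example from the text: 10000

# Sample both tones at 22050Hz and compare them sample by sample.
rate = 22050
original = [math.cos(2 * math.pi * 12050 * n / rate) for n in range(20)]
folded = [math.cos(2 * math.pi * 10000 * n / rate) for n in range(20)]
largest_difference = max(abs(a - b) for a, b in zip(original, folded))
print(largest_difference)
```

The difference is practically zero: once the samples have been taken, the sound card has no way of telling the 12050Hz tone from a 10000Hz one.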
[Figure: a brickwall filter at 4000Hz]
Therefore, the maximum frequency that can be recorded with a certain sample rate is half the sample rate. That frequency is called the Nyquist frequency, sometimes abbreviated to fN, after a man named Harry Nyquist, who worked at Bell Telephone Laboratories and more or less invented audio sampling. A big guy in digital audio. Anyway, to prevent all that from happening, the sound card manufacturers put a special filter in their card (see the figure of the brickwall filter).
This low-pass filter removes high frequencies like any equalizer or Hi-Cut switch does, except it is *much* more aggressive. You can see that the filter allows all sound below 1000Hz to pass through, and that it gives the frequency range of 1000Hz-3500Hz a small boost. (This boost is necessary to be able to cut off the higher frequencies with such violence.) Frequencies above 4000Hz are eliminated extremely aggressively. That is why they call it a brickwall filter, because of the wall-like slope.
The filter displayed above might be used for a sample rate of about 8000Hz, since an 8000Hz sample has a Nyquist frequency, the maximum recordable frequency, of 4000Hz. This makes it very important to choose the appropriate sample rate for your sample; that is, if you've got a legitimate reason not to record at 44100Hz, or higher ;-)
