How is audio represented with numbers in computers?

Audio Problem Overview


I like thinking about how everything can be and is represented by numbers. For example, plaintext is represented by a code like ASCII, and images are represented by RGB values. These are the simplest ways to represent text and images.

What is the simplest way that audio can be represented with numbers? I want to learn how to write programs that work with audio and thought this would be a good way to start. I can't seem to find any good explanations on the internet, though.

Audio Solutions


Solution 1 - Audio

Physically, as you probably know, audio is a vibration. Typically, we're talking about vibrations of air between approximately 20 Hz and 20,000 Hz. That means the air is moving back and forth 20 to 20,000 times per second.

If you measure that vibration and convert it to an electrical signal (say, using a microphone), you'll get an electrical signal with the voltage varying in the same waveform as the sound. For a hypothetical pure tone (a single frequency, like a tuning fork), that waveform will match that of the sine function.

Now we have an analogue signal: the voltage. Still not digital. But we know this voltage varies between (for example) -1 V and +1 V. We can, of course, attach a volt meter to the wires and read the voltage.

Arbitrarily, we'll change the scale on our volt meter: we'll multiply the volts by 32767. It now reads -1 V as -32767 and +1 V as 32767. Oh, and it'll round to the nearest integer.

Now, we hook our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD.

This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.
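
As a minimal sketch of that description (my own illustration, not part of the original answer): the loop below "reads the volt meter" 44,100 times per second for one second, multiplies the -1 V to +1 V signal by 32767, and rounds to the nearest integer, yielding signed 16-bit linear PCM samples. The 440 Hz tone is an arbitrary choice.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const double PI = acos(-1.0);
    const double SAMPLE_FREQ = 44100.0;  /* meter readings per second */
    const double TONE_FREQ = 440.0;      /* arbitrary pure tone */
    const unsigned int NSAMPLES = 44100; /* one second of audio */
    unsigned int t;
    for (t = 0; t < NSAMPLES; ++t) {
        /* The "voltage", between -1.0 and +1.0. */
        double voltage = sin(2 * PI * TONE_FREQ * t / SAMPLE_FREQ);
        /* Scale by 32767 and round: one 16-bit linear PCM sample. */
        int16_t sample = (int16_t)lround(voltage * 32767.0);
        if (t < 5)
            printf("sample %u = %d\n", t, sample);
    }
    return EXIT_SUCCESS;
}

Two such streams, interleaved, would give the stereo data described above.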

Solution 2 - Audio

Minimal C audio generation example

The example below generates a pure 1000 Hz sine wave in raw format. At the common 44.1 kHz sampling rate, it lasts 4 seconds.

main.c:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    FILE *f;
    const double PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    const unsigned int NSAMPLES = 4 * SAMPLE_FREQ;
    uint16_t ampl;
    uint8_t bytes[2];
    unsigned int t;

    f = fopen("out.raw", "wb");
    for (t = 0; t < NSAMPLES; ++t) {
        /* Map the sine's [-1.0, 1.0] range onto [0, UINT16_MAX]. */
        ampl = UINT16_MAX * 0.5 * (1.0 + sin(PI2 * t * 1000.0 / SAMPLE_FREQ));
        /* Write the sample in big-endian byte order (u16be for ffplay). */
        bytes[0] = ampl >> 8;
        bytes[1] = ampl & 0xFF;
        fwrite(bytes, 2, sizeof(uint8_t), f);
    }
    fclose(f);
    return EXIT_SUCCESS;
}

GitHub upstream.

Generate out.raw:

gcc -std=c99 -o main main.c -lm
./main

Play out.raw directly:

sudo apt-get install ffmpeg
ffplay -autoexit -f u16be -ar 44100 -ac 1 out.raw

or convert to a more common audio format and then play with a more common audio player:

ffmpeg -f u16be -ar 44100 -ac 1 -i out.raw out.flac
vlc out.flac

Generated FLAC file: https://github.com/cirosantilli/media/blob/master/canon.flac

Parameters explained at: https://superuser.com/questions/76665/how-to-play-a-pcm-file-on-an-unix-system/1063230#1063230

Related: https://stackoverflow.com/questions/55286149/what-ffmpeg-command-to-use-to-convert-a-list-of-unsigned-integers-into-an-audio

Tested on Ubuntu 18.04.

Canon in D in C

Here is a more interesting synthesis example.

Outcome: https://www.youtube.com/watch?v=JISozfHATms

main.c

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint16_t point_type_t;

double PI2;

void write_ampl(FILE *f, point_type_t ampl) {
    uint8_t bytes[2];
    bytes[0] = ampl >> 8;
    bytes[1] = ampl & 0xFF;
    fwrite(bytes, 2, sizeof(uint8_t), f);
}

/* https://en.wikipedia.org/wiki/Piano_key_frequencies */
double piano_freq(unsigned int i) {
    return 440.0 * pow(2, (i - 49.0) / 12.0);
}

/* Chord formed by the nth note of the piano. */
point_type_t piano_sum(unsigned int max_ampl, unsigned int time,
        double sample_freq, unsigned int nargs, unsigned int *notes) {
    unsigned int i;
    double sum = 0;
    for (i = 0 ; i < nargs; ++i)
        sum += sin(PI2 * time * piano_freq(notes[i]) / sample_freq);
    return max_ampl * 0.5 * (nargs + sum) / nargs;
}

enum notes {
    A0 = 1, AS0, B0,
    C1, C1S, D1, D1S, E1, F1, F1S, G1, G1S, A1, A1S, B1,
    C2, C2S, D2, D2S, E2, F2, F2S, G2, G2S, A2, A2S, B2,
    C3, C3S, D3, D3S, E3, F3, F3S, G3, G3S, A3, A3S, B3,
    C4, C4S, D4, D4S, E4, F4, F4S, G4, G4S, A4, A4S, B4,
    C5, C5S, D5, D5S, E5, F5, F5S, G5, G5S, A5, A5S, B5,
    C6, C6S, D6, D6S, E6, F6, F6S, G6, G6S, A6, A6S, B6,
    C7, C7S, D7, D7S, E7, F7, F7S, G7, G7S, A7, A7S, B7,
    C8,
};

int main(void) {
    FILE *f;
    PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    point_type_t ampl;
    point_type_t max_ampl = UINT16_MAX;
    unsigned int t, i;
    unsigned int samples_per_unit = SAMPLE_FREQ * 0.375;
    unsigned int *ip[] = {
        (unsigned int[]){4, 2, C3, E4},
        (unsigned int[]){4, 2, G3, D4},
        (unsigned int[]){4, 2, A3, C4},
        (unsigned int[]){4, 2, E3, B3},

        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, C3, G3},
        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, G3, B3},

        (unsigned int[]){4, 3, C3, G4, E5},
        (unsigned int[]){4, 3, G3, B4, D5},
        (unsigned int[]){4, 2, A3,     C5},
        (unsigned int[]){4, 3, E3, G4, B4},

        (unsigned int[]){4, 3, F3, C4, A4},
        (unsigned int[]){4, 3, C3, G4, G4},
        (unsigned int[]){4, 3, F3, F4, A4},
        (unsigned int[]){4, 3, G3, D4, B4},

        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, G3, D4, D5},
        (unsigned int[]){2, 3, G3, D4, B4},

        (unsigned int[]){2, 3, A3, C4, C5},
        (unsigned int[]){2, 3, A3, C4, E5},
        (unsigned int[]){2, 2, E3,     G5},
        (unsigned int[]){2, 2, E3,     G4},

        (unsigned int[]){2, 3, F3, A3, A4},
        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 3, C3,     E4},
        (unsigned int[]){2, 3, C3,     G4},

        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 3, F3, A3, C5},
        (unsigned int[]){2, 3, G3, B3, B4},
        (unsigned int[]){2, 3, G3, B3, G4},

        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){1, 3, C4, E4, E5},
        (unsigned int[]){1, 3, C4, E4, G5},
        (unsigned int[]){1, 2, G3,     G5},
        (unsigned int[]){1, 2, G3,     A5},
        (unsigned int[]){1, 2, G3,     G5},
        (unsigned int[]){1, 2, G3,     F5},

        (unsigned int[]){3, 3, A3, C4, E5},
        (unsigned int[]){1, 3, A3, C4, E5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, F5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, D5},
    };
    f = fopen("canon.raw", "wb");
    for (i = 0; i < sizeof(ip) / sizeof(int*); ++i) {
        unsigned int *cur = ip[i];
        unsigned int total = samples_per_unit * cur[0];
        for (t = 0; t < total; ++t) {
            ampl = piano_sum(max_ampl, t, SAMPLE_FREQ, cur[1], &cur[2]);
            write_ampl(f, ampl);
        }
    }
    fclose(f);
    return EXIT_SUCCESS;
}

GitHub upstream.

For YouTube, I prepared it as:

wget -O canon.png https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/The_C_Programming_Language_logo.svg/564px-The_C_Programming_Language_logo.svg.png
ffmpeg -loop 1 -y -i canon.png -i canon.flac -shortest -acodec copy -vcodec vp9 canon.mkv

as explained at: https://superuser.com/questions/700419/how-to-convert-mp3-to-youtube-allowed-video-format/1472572#1472572

Tested on Ubuntu 18.04.

Physics

Audio is encoded as a single number for every moment in time (per channel). Compare that to video, which needs WIDTH * HEIGHT numbers per moment in time.

This number is then converted to the linear displacement of the diaphragm of your speaker:

|   /
|  /
|-/
| | A   I   R
|-\
|  \
|   \
<-> displacement

|     /
|    /
|---/
|   | A I   R
|---\
|    \
|     \
<---> displacement

|       /
|      /
|-----/
|     | A I R
|-----\
|      \
|       \
<-----> displacement

The displacement pushes air backwards and forwards, creating pressure differences, which travel through air as P-waves.

Only changes in displacement produce sound: a constant signal, even at the maximum value, produces no sound; the diaphragm just stays at a fixed position.

The sampling frequency determines how often the diaphragm position is updated.

44.1 kHz is a common sampling frequency because humans can hear frequencies up to about 20 kHz, and because the Nyquist-Shannon sampling theorem requires sampling at more than twice the highest frequency to be reproduced.

The sampling frequency is analogous to the FPS for video, although its value is much higher than the 25 FPS (cinema) to 144 FPS (high-end gaming monitors) range we commonly see for video.
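
As a small illustration of the Nyquist limit (my own sketch, not part of the original answer): sampling a tone above half the sample rate produces exactly the same samples as a lower-frequency tone, so content above 22.05 kHz simply cannot be represented at 44.1 kHz. The 25 kHz tone below is an arbitrary choice.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double fs = 44100.0;          /* sample rate */
    const double f_high = 25000.0;      /* above fs / 2 = 22050 Hz */
    const double f_alias = fs - f_high; /* its 19100 Hz alias */
    const double pi = acos(-1.0);
    /* sin(2*pi*(fs - f)*n/fs) = -sin(2*pi*f*n/fs), so both tones
       produce the same sample values up to a sign flip. */
    for (int n = 0; n < 5; ++n) {
        double s_high = sin(2.0 * pi * f_high * n / fs);
        double s_alias = -sin(2.0 * pi * f_alias * n / fs);
        printf("n=%d  25 kHz: %+.6f  19.1 kHz (negated): %+.6f\n",
               n, s_high, s_alias);
    }
    return 0;
}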

Formats

Uncompressed: raw PCM, as generated above, or WAV, which is essentially PCM data plus a small metadata header.

Compressed: in practice, most people deal exclusively with compressed formats, which make files and streams much smaller. Some of those formats take into account characteristics of the human ear to further compress the audio in a lossy way that people won't notice. The most popular royalty-free formats as of 2019 appear to be Ogg Vorbis and Opus (lossy) and FLAC (lossless).

Biology

Humans perceive sound mostly through its frequency decomposition (i.e., its Fourier transform).

This is because the basilar membrane in the inner ear has regions that resonate at different frequencies along its length.

Therefore, when synthesizing music, we think more in terms of adding up frequencies than of points in time. This is illustrated by the Canon in D synthesis example above.

This leads to thinking in terms of a 1D vector between 20Hz and 20kHz for each point in time.

The mathematical Fourier transform loses the notion of time, so what we do when synthesizing is to take short groups of points, sum up the frequencies for each group, and effectively take the Fourier transform over that window.

Luckily, the Fourier transform is linear, so we can just add up and normalize displacements directly.

The size of each group of points leads to a time-frequency precision tradeoff, mediated by the same mathematics as Heisenberg's uncertainty principle.

Wavelets may be a more precise mathematical description of this intermediate time-frequency representation.
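
To make the "groups of points" idea concrete, here is a small sketch of my own (not from the original answer): a naive DFT of one window of samples, which estimates how much of each frequency bin is present in that window. The window size and the two tone frequencies are arbitrary illustrative choices.

#include <math.h>
#include <stdio.h>

#define N 1024  /* window size: the "group of points" */

int main(void) {
    const double fs = 44100.0;
    const double pi = acos(-1.0);
    double x[N];
    /* A window containing two summed tones (additive synthesis). */
    for (int n = 0; n < N; ++n)
        x[n] = 0.6 * sin(2.0 * pi * 440.0 * n / fs)
             + 0.4 * sin(2.0 * pi * 880.0 * n / fs);
    /* Naive DFT magnitude for a few bins around the two tones.
       Bin k corresponds to frequency k * fs / N. */
    for (int k = 5; k <= 25; ++k) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; ++n) {
            re += x[n] * cos(2.0 * pi * k * n / N);
            im -= x[n] * sin(2.0 * pi * k * n / N);
        }
        printf("bin %2d (%7.1f Hz): magnitude %8.2f\n",
               k, k * fs / N, sqrt(re * re + im * im));
    }
    return 0;
}

The magnitudes peak near the bins closest to 440 Hz and 880 Hz, which is essentially what a spectrum display shows.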

Quick ways to generate common tones out of the box

The amazing FFmpeg library covers several of them: https://stackoverflow.com/questions/5109038/linux-sinus-audio-generator/57610684#57610684

sudo apt-get install ffmpeg
ffmpeg -f lavfi -i "sine=frequency=1000:duration=5" out.wav

LMMS

This is so incredibly easy to use, it should be your first try if you just want to generate some MIDI tracks with lots of synthesis options out of the box.

Example: https://askubuntu.com/questions/709673/save-as-midi-when-playing-from-vmpk-qsynth/1298231#1298231

MuseScore

https://github.com/musescore/MuseScore

The best FOSS scorewriter GUI I've seen so far. You can really compose for an orchestra with this.

Csound

https://en.wikipedia.org/wiki/Csound

https://github.com/csound/csound

Program that reads a custom XML format that allows you to create some very funky arbitrary synthesized sounds and tunes.

sudo apt install csound

Here's a really cool and advanced demo: https://github.com/csound/csound/blob/b319c336d31d942af2d279b636339df83dc9f9f9/examples/xanadu.csd rendered at: https://www.youtube.com/watch?v=7fXhVMDCfaA

abcmidi

Nice project that converts MIDI to the ABC notation and vice versa, allowing you to edit a MIDI file in your text editor: https://sound.stackexchange.com/questions/39457/how-to-open-midi-file-in-text-editor/50058#50058

Python pyo

https://github.com/belangeo/pyo

Python sound library.

Got it to work after a bit of frustration: https://stackoverflow.com/questions/32445375/pyo-server-boot-fails-with-pyolib-core-pyoserverstateexception-on-ubuntu-14-0/64960589#64960589

MusicXML

https://en.wikipedia.org/wiki/MusicXML

An attempt to standardize music sheet representation.

However, I can't easily find how to convert it to an audio format from the command line: https://stackoverflow.com/questions/33775336/convert-musicxml-to-wav/64974602#64974602

Other high level out-of-box open source synthesizers for Linux

If you are going down this road, you might as well have a look at the big boys to learn about common synthesis techniques.

Solution 3 - Audio

Audio can be represented by digital samples. Essentially, a sampler (also called an analog-to-digital converter, or ADC) grabs a value of the audio signal every 1/fs seconds, where fs is the sampling frequency. The ADC then quantizes the signal, which is a rounding operation. So if your signal ranges from 0 to 3 volts (the full-scale range), a sample will be rounded to, for example, a 16-bit number. In this example, a 16-bit number is recorded once every 1/fs seconds.

So, for example, most WAV/MP3 files sample the audio signal at 44.1 kHz. I don't know how much detail you want, but there's this thing called the "Nyquist sampling rate" that says the sampling frequency must be at least twice the highest frequency you want to capture. So in your WAV/MP3 file you are at best going to be able to hear frequencies up to about 22 kHz.

There is a lot of detail you can go into in this area. The simplest form would certainly be the WAV format. It is uncompressed audio. Formats like MP3 and Ogg have to be decompressed before you can work with them.
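
Since WAV comes up as the simplest container, here is a small sketch of my own (not from the original answer) that wraps one second of 16-bit mono PCM in a minimal 44-byte RIFF/WAVE header; the tone frequency and the file name are arbitrary choices.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* WAV header fields are little-endian. */
static void write_u32le(FILE *f, uint32_t v) {
    uint8_t b[4] = { v & 0xFF, (v >> 8) & 0xFF, (v >> 16) & 0xFF, (v >> 24) & 0xFF };
    fwrite(b, 1, 4, f);
}

static void write_u16le(FILE *f, uint16_t v) {
    uint8_t b[2] = { v & 0xFF, (v >> 8) & 0xFF };
    fwrite(b, 1, 2, f);
}

int main(void) {
    const uint32_t sample_rate = 44100;
    const uint16_t channels = 1, bits = 16;
    const uint32_t nsamples = sample_rate;                  /* 1 second */
    const uint32_t data_size = nsamples * channels * bits / 8;
    const double pi = acos(-1.0);
    FILE *f = fopen("tone.wav", "wb");

    /* RIFF header */
    fwrite("RIFF", 1, 4, f);
    write_u32le(f, 36 + data_size);   /* remaining file size */
    fwrite("WAVE", 1, 4, f);
    /* fmt chunk: PCM, mono, 44100 Hz, 16-bit */
    fwrite("fmt ", 1, 4, f);
    write_u32le(f, 16);               /* fmt chunk size */
    write_u16le(f, 1);                /* audio format 1 = PCM */
    write_u16le(f, channels);
    write_u32le(f, sample_rate);
    write_u32le(f, sample_rate * channels * bits / 8); /* byte rate */
    write_u16le(f, channels * bits / 8);               /* block align */
    write_u16le(f, bits);
    /* data chunk: raw little-endian signed 16-bit samples */
    fwrite("data", 1, 4, f);
    write_u32le(f, data_size);
    for (uint32_t n = 0; n < nsamples; ++n) {
        int16_t s = (int16_t)(32767.0 * sin(2.0 * pi * 440.0 * n / sample_rate));
        write_u16le(f, (uint16_t)s);
    }
    fclose(f);
    return 0;
}

The resulting tone.wav should open in any ordinary audio player, unlike the headerless .raw files above, which need the format spelled out on the command line.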

Solution 4 - Audio

The simplest way to represent sound as numbers is PCM (Pulse Code Modulation). This means that the amplitude of the sound is recorded at a set frequency (each amplitude value is called a sample). CD quality sound, for example, uses 16-bit samples (in stereo) at a frequency of 44,100 Hz.

A sample can be represented as an integer (usually 8, 12, 16, 24 or 32 bits) or a floating point number (usually a 32-bit float, sometimes a 64-bit double). The number can be either signed or unsigned.

For 16-bit signed samples the value 0 would be in the middle, and -32768 and 32767 would be the maximum amplitudes. For 16-bit unsigned samples the value 32768 would be in the middle, and 0 and 65535 would be the maximum amplitudes.

For floating point samples the usual format is that 0 is in the middle, and -1.0 and 1.0 are the maximum amplitudes.
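
As a quick sketch of my own (not from the original answer) showing how these representations relate, the helpers below convert a signed 16-bit sample to the unsigned and floating point conventions described above.

#include <stdint.h>
#include <stdio.h>

/* Signed 16-bit sample -> unsigned 16-bit: shift the midpoint from 0 to 32768. */
static uint16_t s16_to_u16(int16_t s) {
    return (uint16_t)(s + 32768);
}

/* Signed 16-bit sample -> float in roughly [-1.0, 1.0]. */
static float s16_to_float(int16_t s) {
    return (float)s / 32768.0f;
}

int main(void) {
    int16_t samples[] = { -32768, -16384, 0, 16384, 32767 };
    for (int i = 0; i < 5; ++i)
        printf("signed %6d -> unsigned %5u -> float %+f\n",
               samples[i], s16_to_u16(samples[i]), s16_to_float(samples[i]));
    return 0;
}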

The PCM data can then be compressed, for example using MP3.

Solution 5 - Audio

I think samples of the waveform at a specific sample frequency would be the most basic representation.

Solution 6 - Audio

Have you ever looked at a waveform close up? The Y-axis is simply represented as an integer, typically in 16 bits.

Solution 7 - Audio

I think a good way to start playing with audio would be with Processing and Minim. This program will draw the frequency spectrum of sound from your microphone!

import ddf.minim.*;
import ddf.minim.analysis.*;
 
AudioInput in;
FFT fft;
 
void setup()
{
  size(1024, 600);
  noSmooth();
  Minim.start(this);
  in = Minim.getLineIn();
  fft = new FFT(in.bufferSize(), in.sampleRate());
}
 
void draw()
{
  background(0);
  fft.forward(in.mix);
  stroke(255);
  for(int i = 0; i < fft.specSize(); i++)
    line(i*2+1, height, i*2+1, height - fft.getBand(i)*10);
}
 
void stop()
{
  in.close();
  Minim.stop();
  super.stop();
}

Solution 8 - Audio

Look up things like analog-to-digital conversion. That should get you started. These devices can convert an audio signal (sine waves) into digital representations. So, a 16-bit ADC would be able to represent a sine wave with values between -32768 and 32767. This is in fixed point. It is also possible to do it in floating point (not usually recommended for performance reasons, but it may be needed for range reasons). The opposite (digital-to-analog conversion) happens when we convert numbers to sine waves. This is handled by something called a DAC.

Solution 9 - Audio

There are 2 steps involved in converting actual analogous audio into a digital form.

  1. Sampling
  2. Quantization

Sampling

The rate at which a continuous waveform (in this case, audio) is sampled is called the sampling rate. The frequency range perceived by humans is 20 - 20,000 Hz. Following the Nyquist sampling theorem, CDs use a sampling rate of 44,100 Hz, which covers frequencies in the range 0 - 22,050 Hz.

Quantization

The discrete set of values received from the 'Sampling' phase now needs to be mapped to a finite number of levels. An 8-bit quantization provides 256 possible values, while a 16-bit quantization provides 65,536 values.
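
A small sketch of my own (not from the original answer) to show what those quantization levels mean in practice: it requantizes a few 16-bit samples down to 8 bits and prints the rounding error introduced.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t samples[] = { -32768, -12345, -1, 0, 1, 123, 12345, 32767 };
    for (int i = 0; i < 8; ++i) {
        /* Map the signed sample to 0..65535 and keep only the top 8 bits:
           65,536 levels collapse to 256 levels. */
        uint16_t u = (uint16_t)(samples[i] + 32768);
        uint8_t q = (uint8_t)(u >> 8);
        /* Expand back to 16 bits to see the quantization error. */
        int16_t back = (int16_t)((q << 8) - 32768);
        printf("16-bit %6d -> 8-bit level %3d -> back %6d (error %3d)\n",
               samples[i], q, back, samples[i] - back);
    }
    return 0;
}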

Solution 10 - Audio

The answers all relate to sampling frequency, but don't address the question. A particular snapshot of a sound would, I imagine, include individual amplitudes for a lot of different frequencies (say you hit both an A and a C simultaneously on a keyboard, with the A being louder). How does that get recorded in a 16 bit number? If all you are doing is measuring amplitude (how loud the sound is), how do you get the different notes?

Ah! I think I get it from this comment: "This number is then converted to the linear displacement of the diaphragm of your speaker." The notes appear through how fast the diaphragm is vibrating. That's why you need the 44,100 different values per second. A note is somewhere on the order of 1000 hertz, so a pure note would make the diaphragm move in and out about 1000 times per second. A recording of a whole orchestra has many different notes all over the place, and that miraculously can be converted into a single time history of diaphragm motion. 44,100 times per second the diaphragm is instructed to move in or out a little bit, and that simple (long) list of numbers can represent anything from Beyoncé to Beethoven!
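
To make that concrete, here is a small sketch of my own (not part of the original answer): it sums a louder A (440 Hz) and a quieter C (523.25 Hz) into a single stream of samples, one number per instant, which is exactly what gets sent to the diaphragm. The amplitudes 0.7 and 0.3 are arbitrary choices.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double fs = 44100.0;  /* diaphragm positions per second */
    const double pi = acos(-1.0);
    /* A louder A4 and a quieter C5 played together. */
    const double a4 = 440.0, c5 = 523.25;
    for (int n = 0; n < 10; ++n) {
        double mixed = 0.7 * sin(2.0 * pi * a4 * n / fs)
                     + 0.3 * sin(2.0 * pi * c5 * n / fs);
        /* One number per moment in time encodes both notes at once. */
        printf("t = %8.6f s  diaphragm displacement = %+.5f\n", n / fs, mixed);
    }
    return 0;
}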

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type: Original Author (Original Content on Stackoverflow)
Question: Paige Ruten (View Question on Stackoverflow)
Solution 1: derobert (View Answer on Stackoverflow)
Solution 2: Ciro Santilli Путлер Капут 六四事 (View Answer on Stackoverflow)
Solution 3: devin (View Answer on Stackoverflow)
Solution 4: Guffa (View Answer on Stackoverflow)
Solution 5: Jimmy (View Answer on Stackoverflow)
Solution 6: Chris (View Answer on Stackoverflow)
Solution 7: Nathan (View Answer on Stackoverflow)
Solution 8: sybreon (View Answer on Stackoverflow)
Solution 9: Joel Menezes (View Answer on Stackoverflow)
Solution 10: curious (View Answer on Stackoverflow)