Algorithm to mix sound

Tags: Algorithm, Audio

Algorithm Problem Overview


I have two raw sound streams that I need to add together. For the purposes of this question, we can assume they are the same bit rate and bit depth (say 16-bit samples at a 44.1 kHz sample rate).

Obviously if I just add them together I will overflow and underflow my 16-bit space. If I add them together and divide by two, then the volume of each is halved, which isn't correct sonically - if two people are speaking in a room, their voices don't become quieter by half, and a microphone can pick them both up without hitting the limiter.

  • So what's the correct method to add these sounds together in my software mixer?
  • Am I wrong and the correct method is to lower the volume of each by half?
  • Do I need to add a compressor/limiter or some other processing stage to get the volume and mixing effect I'm trying for?

-Adam

Algorithm Solutions


Solution 1 - Algorithm

You should add them together, but clip the result to the allowable range to prevent over/underflow.

In the event of clipping occurring, you will introduce distortion into the audio, but that's unavoidable. You can use your clipping code to "detect" this condition and report it to the user/operator (the equivalent of the red 'clip' light on a mixer...)

You could implement a more "proper" compressor/limiter, but without knowing your exact application, it's hard to say if it would be worth it.

If you're doing lots of audio processing, you might want to represent your audio levels as floating-point values, and only go back to the 16-bit space at the end of the process. High-end digital audio systems often work this way.
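
A minimal sketch of the add-then-clip approach with clip detection, assuming 16-bit input buffers; the function name and signature are illustrative, not taken from the answer:

#include <cstddef>
#include <cstdint>

// Sum two 16-bit streams in a wider type, hard-clip to the 16-bit range,
// and report whether clipping occurred (the software "red light").
bool mixAndClip(const int16_t* a, const int16_t* b, int16_t* out, std::size_t n)
{
    bool clipped = false;
    for (std::size_t i = 0; i < n; ++i) {
        int32_t sum = int32_t(a[i]) + int32_t(b[i]);  // cannot overflow in 32 bits
        if (sum > INT16_MAX) { sum = INT16_MAX; clipped = true; }
        if (sum < INT16_MIN) { sum = INT16_MIN; clipped = true; }
        out[i] = int16_t(sum);
    }
    return clipped;
}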

Solution 2 - Algorithm

I'd prefer to comment on one of the two highly ranked replies but owing to my meager reputation (I assume) I cannot.

The "ticked" answer: add together and clip is correct, but not if you want to avoid clipping.

The answer with the link starts with a workable voodoo algorithm for two positive signals in [0,1] but then applies some very faulty algebra to derive a completely incorrect algorithm for signed values and 8-bit values. The algorithm also does not scale to three or more inputs (the product of the signals will go down while the sum increases).

So: convert the input signals to float and scale them to [0,1] (e.g. a signed 16-bit value would become
float v = ( s + 32767.0 ) / 65536.0 (close enough...))
and then sum them.

To scale the input signals you should probably do some actual work rather than multiply by or subtract a voodoo value. I'd suggest keeping a running average volume and then if it starts to drift high (above 0.25 say) or low (below 0.01 say) start applying a scaling value based on the volume. This essentially becomes an automatic level implementation, and it scales with any number of inputs. Best of all, in most cases it won't mess with your signal at all.
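
A rough sketch of that running-average auto-level, assuming float samples already summed into one value per frame; the 0.25/0.01 thresholds come from the answer, while the smoothing constant and the gain cap are placeholders of mine:

#include <algorithm>
#include <cmath>

// Running-average level control: track the average absolute mixed level
// and derive a gain from it only when the level drifts out of range.
struct AutoLevel {
    float avg  = 0.0f;   // running average of |mixed|
    float gain = 1.0f;

    float process(float mixed) {
        avg += 0.001f * (std::fabs(mixed) - avg);        // slow running average
        if (avg > 0.25f)
            gain = 0.25f / avg;                          // drifting hot: scale down
        else if (avg > 0.0f && avg < 0.01f)
            gain = std::min(4.0f, 0.01f / avg);          // drifting quiet: scale up (capped)
        else
            gain = 1.0f;                                 // normal range: leave the signal alone
        return mixed * gain;
    }
};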

Solution 3 - Algorithm

There is an article about mixing here. I'd be interested to know what others think about this.

Solution 4 - Algorithm

Most audio mixing applications will do their mixing with floating point numbers (32 bit is plenty good enough for mixing a small number of streams). Translate the 16 bit samples into floating point numbers with the range -1.0 to 1.0 representing full scale in the 16 bit world. Then sum the samples together - you now have plenty of headroom. Finally, if you end up with any samples whose value goes over full scale, you can either attenuate the whole signal or use hard limiting (clipping values to 1.0).

This will give much better sounding results than adding 16 bit samples together and letting them overflow. Here's a very simple code example showing how you might sum two 16 bit samples together:

short sample1 = ...;
short sample2 = ...;
float samplef1 = sample1 / 32768.0f;
float samplef2 = sample2 / 32768.0f;
float mixed = samplef1 + samplef2;
// reduce the volume a bit:
mixed *= 0.8f;
// hard clipping
if (mixed > 1.0f) mixed = 1.0f;
if (mixed < -1.0f) mixed = -1.0f;
short outputSample = (short)(mixed * 32767.0f); // 32767 so +1.0 doesn't overflow a short

Solution 5 - Algorithm

"Quieter by half" isn't quite correct. Because of the ear's logarithmic response, dividing the samples in half will make it 6-db quieter - certainly noticeable, but not disastrous.

You might want to compromise by multiplying by 0.75. That will make it roughly 2.5 dB quieter, but will lessen the chance of overflow and also lessen the distortion when it does happen.
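
For reference, the decibel figures above follow from 20·log10(gain); a tiny check:

#include <cmath>
#include <cstdio>

int main() {
    // 20 * log10(gain) converts a linear gain factor to decibels.
    std::printf("x0.50 -> %.1f dB\n", 20.0 * std::log10(0.50)); // about -6.0 dB
    std::printf("x0.75 -> %.1f dB\n", 20.0 * std::log10(0.75)); // about -2.5 dB
    return 0;
}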

Solution 6 - Algorithm

I cannot believe that nobody knows the correct answer. Everyone is close enough, but still, it's mostly philosophy. The nearest, i.e. the best, was: (s1 + s2) - (s1 * s2). It's an excellent approach, especially for MCUs.

So, the algorithm goes:

  1. Find out the volume at which you want the output sound to be. It can be the average or the maximum of one of the signals.
    factor = average(s1) (this assumes both signals are already in range, i.e. not exceeding 32767.0)
  2. Normalize both signals with this factor:
    s1 = (s1/max(s1))*factor
    s2 = (s2/max(s2))*factor
  3. Add them together and normalize the result with the same factor
    output = ((s1+s2)/max(s1+s2))*factor

Note that after step 1 you don't really need to go back to integers; you can work with floats in the -1.0 to 1.0 interval and convert back to integers at the end with the previously chosen factor. I hope I haven't made a mistake, as I'm in a hurry.
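
A hedged sketch of those three steps in floating point, assuming the full buffers are available so max() can be computed up front; the function and helper names are mine, not the answer's:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Largest absolute value in a float buffer (samples in [-1, 1]).
static float peakOf(const float* s, std::size_t n) {
    float m = 0.0f;
    for (std::size_t i = 0; i < n; ++i) m = std::max(m, std::fabs(s[i]));
    return m;
}

// Steps 2 and 3: scale each input so its peak sits at `factor`, sum them,
// then renormalise the sum back to the same factor.
void mixNormalised(const float* s1, const float* s2, float* out,
                   std::size_t n, float factor /* step 1, e.g. average(s1) */)
{
    const float p1 = peakOf(s1, n);
    const float p2 = peakOf(s2, n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (p1 > 0.0f ? s1[i] / p1 : 0.0f) * factor
               + (p2 > 0.0f ? s2[i] / p2 : 0.0f) * factor;
    const float ps = peakOf(out, n);
    if (ps > 0.0f)
        for (std::size_t i = 0; i < n; ++i)
            out[i] = (out[i] / ps) * factor;
}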

Solution 7 - Algorithm

You can also buy yourself some headroom with an algorithm like y= 1.1x - 0.2x^3 for the curve, and with a cap on the top and bottom. I used this in Hexaphone when the player is playing multiple notes together (up to 6).

float waveshape_distort( float in ) {
  if(in <= -1.25f) {
    return -0.984375f;                       // curve value at -1.25, held flat below
  } else if(in >= 1.25f) {
    return 0.984375f;                        // curve value at +1.25, held flat above
  } else {
    return 1.1f * in - 0.2f * in * in * in;  // soft cubic curve through the origin
  }
}

It's not bullet-proof - but will let you get up to 1.25 level, and smoothes the clip to a nice curve. Produces harmonic distortion, which sounds better than clipping and may be desirable in some circumstances.

Solution 8 - Algorithm

If you need to do this right, I would suggest looking at open source software mixer implementations, at least for the theory.

Some links:

Audacity

GStreamer

Actually you should probably be using a library.

Solution 9 - Algorithm

You're right about adding them together. You could always scan the sum of the two files for peak points and scale the entire file down if they hit some kind of threshold (or if the average of a point and its surrounding samples hits a threshold).
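
A minimal sketch of that peak-scan-then-scale idea, assuming both streams fit in memory so the whole sum can be scanned first; the names are illustrative:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum two 16-bit streams, then scale the whole mix down only if its peak
// exceeds full scale (a single-threshold version of the idea above).
std::vector<int16_t> mixWithPeakScaling(const std::vector<int16_t>& a,
                                        const std::vector<int16_t>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    std::vector<int32_t> sum(n);
    int32_t peak = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum[i] = int32_t(a[i]) + int32_t(b[i]);
        peak = std::max(peak, std::abs(sum[i]));
    }
    const double scale = (peak > INT16_MAX) ? double(INT16_MAX) / peak : 1.0;
    std::vector<int16_t> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = int16_t(std::lround(sum[i] * scale));
    return out;
}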

Solution 10 - Algorithm

Convert the samples to floating point values ranging from -1.0 to +1.0, then:

out = (s1 + s2) - (s1 * s2);

Solution 11 - Algorithm

I think that, so long as the streams are uncorrelated, you shouldn't have too much to worry about; you should be able to get by with clipping. If you're really concerned about distortion at the clip points, a soft limiter would probably work OK.

Solution 12 - Algorithm

> Convert the samples to floating point values ranging from -1.0 to +1.0, then:
>
> out = (s1 + s2) - (s1 * s2);

This will introduce heavy distortion when |s1 + s2| approaches 1.0 (at least it did when I tried it while mixing simple sine waves). I have read this recommendation in several places, but in my humble opinion it is a useless approach.

What happens physically when waves 'mix' is that their amplitudes add, just as many of the posters here have already suggested. Either

  • clip (which distorts the result as well), or
  • sum your 16-bit values into a 32-bit number and then divide by the number of sources (that's what I would suggest, as it's the only way I know of that avoids distortion; a sketch follows below)
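
A sketch of that second option (accumulate in 32 bits, divide by the source count); the function name and buffer layout are illustrative:

#include <cstddef>
#include <cstdint>

// Average N source buffers: accumulate each frame in 32 bits, then divide
// by the number of sources. This cannot clip, but each source ends up
// N times quieter.
void mixByAveraging(const int16_t* const* sources, std::size_t numSources,
                    int16_t* out, std::size_t numSamples)
{
    for (std::size_t i = 0; i < numSamples; ++i) {
        int32_t acc = 0;
        for (std::size_t s = 0; s < numSources; ++s)
            acc += sources[s][i];
        out[i] = int16_t(acc / int32_t(numSources));
    }
}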

Solution 13 - Algorithm

Since your profile says you work in embedded systems, I will assume that floating point operations are not always an option.

> So what's the correct method to add these sounds together in my software mixer?

As you guessed, adding and clipping is the correct way to go if you do not want to lose volume on the sources. With samples that are int16_t, you need the sum to be int32_t, then limit it and convert back to int16_t.

> Am I wrong and the correct method is to lower the volume of each by half?

Yes. Halving of volume is somewhat subjective, but what you see here and there is that halving the volume (loudness) is a decrease of about 10 dB (dividing the power by 10, or the sample values by 3.16). But you obviously mean halving the sample values. That is a 6 dB decrease, a noticeable reduction, but not quite as much as halving the volume (the loudness table [there][1] is very useful).

With this 6 dB reduction you will avoid all clipping. But what happens when you want more input channels? For four channels, you would need to divide the input values by 4, that is, lower them by 12 dB, thus going to less than half the loudness for each channel.

> Do I need to add a compressor/limiter or some other processing stage to get the volume and mixing effect I'm trying for?

You want to mix, not clip, and not lose loudness on the input signals. This is not possible, not without some kind of distortion.

As suggested by Mark Ransom, a solution to avoid clipping while not losing as much as 6 dB per channel is to hit somewhere in between "adding and clipping" and "averaging".

That is for two sources: adding, dividing by somewhere between 1 and 2 (reduce the range from [-65536, 65534] to something smaller), then limiting.
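
A fixed-point sketch of that middle ground (relevant if floating point is off the table), with 1.5 as an arbitrary example divisor between 1 and 2:

#include <cstdint>

// Sum two 16-bit samples, attenuate by a factor between 1 and 2
// (here 1.5, done purely in integer math), then hard-limit.
int16_t mixTwo(int16_t a, int16_t b)
{
    int32_t sum = int32_t(a) + int32_t(b);   // range [-65536, 65534]
    sum = (sum * 2) / 3;                     // divide by 1.5
    if (sum > INT16_MAX) sum = INT16_MAX;    // limiter
    if (sum < INT16_MIN) sum = INT16_MIN;
    return int16_t(sum);
}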

If you often clip with this solution and it sounds too harsh, then you might want to soften the limit knee with a compressor. This is a bit more complex, since you need to make the dividing factor dependent on the input power. Try the limiter alone first, and consider the compressor only if you are not happy with the result.

[1]: https://www.gcaudio.com/tips-tricks/the-relationship-of-voltage-loudness-power-and-decibels/ "voltage_power_loudness"

Solution 14 - Algorithm

I did it this way once: I used floats (samples between -1 and 1) and initialized an "autoGain" variable with a value of 1. I would add all the samples together (could also be more than 2), then multiply the outgoing signal by autoGain. If the absolute value of the sum before multiplication was higher than 1, I would assign 1/that sum to autoGain. This effectively makes autoGain smaller than 1, say 0.7, and is equivalent to an operator quickly turning down the main volume as soon as he sees that the overall sound is getting too loud. Then, over an adjustable period of time, I would add to autoGain until it finally got back to 1 (our operator has recovered from the shock and is slowly cranking up the volume :-)).
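
A rough sketch of that scheme, assuming float samples in [-1, 1]; the recovery rate is a placeholder for the "adjustable period of time":

#include <algorithm>
#include <cmath>

// Duck immediately when the summed signal would exceed 1, then let the gain
// creep back towards 1 over time.
struct AutoGainMixer {
    float autoGain = 1.0f;
    float recoveryPerSample = 1.0e-5f;   // placeholder recovery speed

    float mix(const float* samples, int count) {
        float sum = 0.0f;
        for (int i = 0; i < count; ++i) sum += samples[i];   // add all inputs
        const float mag = std::fabs(sum);
        if (mag > 1.0f)
            autoGain = std::min(autoGain, 1.0f / mag);       // operator yanks the fader down
        else
            autoGain = std::min(1.0f, autoGain + recoveryPerSample); // ...and eases it back up
        return sum * autoGain;
    }
};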

Solution 15 - Algorithm

// #include <algorithm>
// short ileft, nleft; ...
// short iright, nright; ...

// Mix
float hiL = ileft + nleft;
float hiR = iright + nright;

// Clipping
short left = std::max(-32768.0f, std::min(hiL, 32767.0f));
short right = std::max(-32768.0f, std::min(hiR, 32767.0f));

Solution 16 - Algorithm

I did the following thing:

MAX_VAL = the full-scale value for 8-bit, 16-bit, or whatever bit depth
dst_val = your base audio sample
src_val = sample to add to base

Res = (((MAX_VAL - dst_val) * src_val) / MAX_VAL) + dst_val

Scale src by the normalized headroom left above dst, i.e. (MAX_VAL - dst_val) / MAX_VAL, and add it to dst_val. It will never clip, never be less loud, and sounds absolutely natural.

Example:

250.5882 = (((255 - 180) * 240) / 255) + 180

And this sounds good :)
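
A small sketch of that formula for 8-bit samples; note that it assumes non-negative values in [0, MAX_VAL], as in the example above (signed audio would need to be offset into that range first, which is an assumption of mine rather than something stated in the answer):

#include <cstdint>

// Mix src into dst using the headroom left above dst.
// Assumes unsigned samples in [0, MAX_VAL].
uint8_t mixHeadroom(uint8_t dst, uint8_t src)
{
    const int32_t MAX_VAL = 255;
    const int32_t res = ((MAX_VAL - dst) * src) / MAX_VAL + dst;
    return uint8_t(res);   // e.g. dst = 180, src = 240 -> 250
}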

Solution 17 - Algorithm

I found a new way to add samples such that they can never exceed a given range. The basic idea is to map values in the range -1 to 1 into a range from approximately -infinity to +infinity, add everything together, and reverse the initial transformation. I came up with the following formulas for this:

f(x) = -x / (|x| - 1)

f'(x) = x / (|x| + 1)    (the inverse of f)

o = f'( Σ f(s) )

I tried it out and it does work, but for multiple loud sounds the resulting audio sounds worse than just adding the samples together and clipping every value which is too big. I used the following code to test this:

#include <math.h>
#include <stdio.h>
#include <float.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <sndfile.h>

// fabs wasn't accurate enough
long double ldabs(long double x){
  return x < 0 ? -x : x;
}

// -Inf<input<+Inf, -1<=output<=+1
long double infiniteToFinite( long double sample ){
  // if the input value was too big, we'll just map it to -1 or 1
  if( isinf(sample) )
    return sample < 0 ? -1. : 1.;
  long double ret = sample / ( ldabs(sample) + 1 );
  // Just in case of calculation errors
  if( isnan(ret) )
    ret = sample < 0 ? -1. : 1.;
  if( ret < -1. )
    ret = -1.;
  if( ret > 1. )
    ret = 1.;
  return ret;
}

// -1<=input<=+1, -Inf<output<+Inf
long double finiteToInfinite( long double sample ){
  // if out of range, clamp to 1 or -1
  if( sample > 1. )
    sample = 1.;
  if( sample < -1. )
    sample = -1.;
  long double res = -( sample / ( ldabs(sample) - 1. ) );
  // sample was too close to 1 or -1, return largest long double
  if( isinf(res) )
    return sample < 0 ? -LDBL_MAX : LDBL_MAX;
  return res;
}

// -1<input<1, -1<=output<=1 | Try to avoid input values too close to 1 or -1
long double addSamples( size_t count, long double sample[] ){
  long double sum = 0;
  while( count-- ){
    sum += finiteToInfinite( sample[count] );
    if( isinf(sum) )
      sum = sum < 0 ? -LDBL_MAX : LDBL_MAX;
  }
  return infiniteToFinite( sum );
}

#define BUFFER_LEN 256

int main( int argc, char* argv[] ){

  if( argc < 3 ){
    fprintf(stderr,"Usage: %s output.wav input1.wav [input2.wav...]\n",*argv);
    return 1;
  }

  {
    SNDFILE *outfile, *infiles[argc-2];
    SF_INFO sfinfo;
    SF_INFO sfinfo_tmp;

    memset( &sfinfo, 0, sizeof(sfinfo) );

    for( int i=0; i<argc-2; i++ ){
      memset( &sfinfo_tmp, 0, sizeof(sfinfo_tmp) );
      if(!( infiles[i] = sf_open( argv[i+2], SFM_READ, &sfinfo_tmp ) )){
        fprintf(stderr,"Could not open file: %s\n",argv[i+2]);
        puts(sf_strerror(0));
        goto cleanup;
      }
      printf("Sample rate %d, channel count %d\n",sfinfo_tmp.samplerate,sfinfo_tmp.channels);
      if( i ){
        if( sfinfo_tmp.samplerate != sfinfo.samplerate
         || sfinfo_tmp.channels != sfinfo.channels
        ){
          fprintf(stderr,"Mismatching sample rate or channel count\n");
          goto cleanup;
        }
      }else{
        sfinfo = sfinfo_tmp;
      }
      continue;
      cleanup: {
        while(i--)
          sf_close(infiles[i]);
        return 2;
      }
    }

    if(!( outfile = sf_open(argv[1], SFM_WRITE, &sfinfo) )){
      fprintf(stderr,"Could not open file: %s\n",argv[1]);
      puts(sf_strerror(0));
      for( int i=0; i<argc-2; i++ )
        sf_close(infiles[i]);
      return 3;
    }

    double inbuffer[argc-2][BUFFER_LEN];
    double outbuffer[BUFFER_LEN];

    size_t max_read;
    do {
      max_read = 0;
      memset(outbuffer,0,BUFFER_LEN*sizeof(double));
      for( int i=0; i<argc-2; i++ ){
        memset( inbuffer[i], 0, BUFFER_LEN*sizeof(double) );
        size_t read_count = sf_read_double( infiles[i], inbuffer[i], BUFFER_LEN );
        if( read_count > max_read )
          max_read = read_count;
      }
      long double insamples[argc-2];
      for( size_t j=0; j<max_read; j++ ){
        for( int i=0; i<argc-2; i++ )
          insamples[i] = inbuffer[i][j];
        outbuffer[j] = addSamples( argc-2, insamples );
      }
      sf_write_double( outfile, outbuffer, max_read );
    } while( max_read );

    sf_close(outfile);
    for( int i=0; i<argc-2; i++ )
      sf_close(infiles[i]);
  }

  return 0;
}

Solution 18 - Algorithm

Thank you everyone for sharing your ideas; recently I have also been doing some work related to sound mixing. I have been experimenting with this issue too, and it may help you :).

Note that I am using an 8 kHz sample rate and 16-bit (SInt16) samples in an iOS RemoteIO AudioUnit.

In my experiments, the best result I found was something different from all these answers, but the basic idea is the same (as Roddy suggested):

"You should add them together, but clip the result to the allowable range to prevent over/underflow".

But what is the best way to add them without overflow/underflow?

Key idea: You have two sound waves, say A and B, and the resultant wave C is the superposition of A and B. Samples in a limited bit range may overflow. So we first find the largest amount by which the superposed waveform crosses the upper limit and the largest amount by which it crosses the lower limit. Then we subtract the upper overshoot from the positive portion of the waveform and add the lower overshoot to the negative portion. VOILA ... you are done.

Steps:

  1. First traverse the data once to find the largest overshoot above the upper limit and the largest overshoot below the lower limit.
  2. Make a second pass over the audio data: subtract the upper overshoot from the positive samples and add the lower overshoot to the negative samples.

The following code shows the implementation.

static SInt32 upSideDownValue = 0;
static SInt32 downSideUpValue = 0;
#define SINT16_MIN (-32768)
#define SINT16_MAX 32767
SInt16* mixTwoVoice (SInt16* RecordedVoiceData, SInt16* RealTimeData, SInt16 *OutputData, unsigned int dataLength){

SInt32 tempDownUpSideValue = 0;
SInt32 tempUpSideDownValue = 0;
//calibrate maker loop
for(unsigned int i=0;i<dataLength ; i++)
{
    SInt32 summedValue = RecordedVoiceData[i] + RealTimeData[i];
    
    if(SINT16_MIN < summedValue && summedValue < SINT16_MAX)
    {
        //the value is within range -- good boy
    }
    else
    {
        //out of range -- record how far we overshot on each side
        if(summedValue < SINT16_MIN)
        {
            //check the downside -- to calibrate
            SInt32 tempCalibrateValue = SINT16_MIN - summedValue;
            if(tempDownUpSideValue < tempCalibrateValue)
                tempDownUpSideValue = tempCalibrateValue;
        }
        else
        {
            //check the upside ---- to calibrate
            SInt32 tempCalibrateValue = summedValue - SINT16_MAX;
            if(tempUpSideDownValue < tempCalibrateValue)
                tempUpSideDownValue = tempCalibrateValue;
        }
    }
}

//here we need some function which will gradually set the value
downSideUpValue = tempDownUpSideValue;
upSideDownValue = tempUpSideDownValue;

//real mixer loop
for(unsigned int i=0;i<dataLength;i++)
{
    SInt32 summedValue = RecordedVoiceData[i] + RealTimeData[i];
    
    if(summedValue < 0)
    {
        OutputData[i] = summedValue + downSideUpValue;
    }
    else if(summedValue > 0)
    {
        OutputData[i] = summedValue - upSideDownValue;
    }
    else
    {
        OutputData[i] = summedValue;
    }
}

return OutputData;
}

It works fine for me. I intend to later change the values of upSideDownValue & downSideUpValue gradually, to get a smoother output.

Solution 19 - Algorithm

This question is old, but here is a valid method, IMO.

  1. Convert both samples to power.
  2. Add both samples in the power domain.
  3. Normalize the result so that the maximum value doesn't go over your limit.
  4. Convert back to amplitude.

You can do the first two steps together, but you will need the maximum and minimum to normalize in a second pass for steps 3 and 4.
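
One possible reading of those steps, heavily hedged: the interpretation of "power" as the signed square of the amplitude is mine, not the answer's, and the normalisation target of 1.0 is a placeholder:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Square (keeping the sign), sum, normalise the summed power to at most 1,
// then take the square root back to amplitude.
void mixInPowerDomain(const float* s1, const float* s2, float* out, std::size_t n)
{
    float maxPower = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = s1[i] * std::fabs(s1[i]) + s2[i] * std::fabs(s2[i]); // steps 1-2
        maxPower = std::max(maxPower, std::fabs(out[i]));
    }
    const float scale = (maxPower > 1.0f) ? 1.0f / maxPower : 1.0f;   // step 3
    for (std::size_t i = 0; i < n; ++i) {
        const float p = out[i] * scale;
        out[i] = std::copysign(std::sqrt(std::fabs(p)), p);           // step 4
    }
}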

I hope it helps someone.

Solution 20 - Algorithm

I'd say just add them together. If you're overflowing your 16 bit PCM space, then the sounds you're using are already incredibly loud to begin with and you should attenuate them. If that would cause them to be too soft by themselves, look for another way of increasing the overall volume output, such as an OS setting or turning the knob on your speakers.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Adam Davis | View Question on Stackoverflow
Solution 1 - Algorithm | Roddy | View Answer on Stackoverflow
Solution 2 - Algorithm | podperson | View Answer on Stackoverflow
Solution 3 - Algorithm | Ben Dyer | View Answer on Stackoverflow
Solution 4 - Algorithm | Mark Heath | View Answer on Stackoverflow
Solution 5 - Algorithm | Mark Ransom | View Answer on Stackoverflow
Solution 6 - Algorithm | Dalen | View Answer on Stackoverflow
Solution 7 - Algorithm | Glenn Barnett | View Answer on Stackoverflow
Solution 8 - Algorithm | krusty.ar | View Answer on Stackoverflow
Solution 9 - Algorithm | Jon Smock | View Answer on Stackoverflow
Solution 10 - Algorithm | user226799 | View Answer on Stackoverflow
Solution 11 - Algorithm | Tony Arkles | View Answer on Stackoverflow
Solution 12 - Algorithm | Michael Beer | View Answer on Stackoverflow
Solution 13 - Algorithm | Gauthier | View Answer on Stackoverflow
Solution 14 - Algorithm | Andi | View Answer on Stackoverflow
Solution 15 - Algorithm | Luka | View Answer on Stackoverflow
Solution 16 - Algorithm | Julian Wingert | View Answer on Stackoverflow
Solution 17 - Algorithm | Daniel Abrecht | View Answer on Stackoverflow
Solution 18 - Algorithm | Ratul Sharker | View Answer on Stackoverflow
Solution 19 - Algorithm | Patrick Allard Gagné | View Answer on Stackoverflow
Solution 20 - Algorithm | Adam Rosenfield | View Answer on Stackoverflow