Randomly Pick Lines From a File Without Slurping It With Unix

Linux, Unix, Awk, Random Sample, File Processing

Linux Problem Overview


I have a file with 10^7 lines, from which I want to choose 1/100 of the lines at random. This is the AWK code I have, but it slurps the whole file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach to do it?

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {  
  for ( i=1;i<=c ;i++ )  { 
    num=int(rand() * c)
    if ( a[num] ) {
        print a[num]
        delete a[num]
        d++
    }
    if ( d == c/100 ) break
  }
 }' file

Linux Solutions


Solution 1 - Linux

If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?

In the second case, just keep each line with 1% probability...

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

If you'd like the header line plus a random sample of lines after, use:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'

Solution 2 - Linux

You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with Perl (and without loading the entire file into memory):

cat your_file.txt | perl -n -e 'print if (rand() < .01)'

(simpler form, from comments):

perl -ne 'print if (rand() < .01)' your_file.txt 

Solution 3 - Linux

I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)

Original thread: Submitted for your review -- random sampling in awk.

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()

}

NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next

}

NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }

}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" \
            > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]

}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR  
} 

Solution 4 - Linux

This should work on most any GNU/Linux machine.

$ shuf -n $(( $(wc -l < "$file") / 100 )) "$file"

I'd be surprised if the GNU shuf command managed memory inappropriately.

Solution 5 - Linux

I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the "for each line, print it if rand() < 0.01" approach, so it can be useful if you intend to run tasks like this many (thousands, millions) of times. It is known as reservoir sampling, and this page has a pretty good explanation of a version of it that is applicable to your situation.
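As an illustration of the technique the answer refers to (my sketch, not code from the original answer), the core of reservoir sampling, Algorithm R, can be written in a few lines of Python. It keeps exactly k items in memory no matter how long the input is:

```python
import random

def reservoir_sample(iterable, k):
    """Return k items drawn uniformly at random from an iterable of
    unknown length, holding only k items in memory (Algorithm R)."""
    reservoir = []
    for n, item in enumerate(iterable):
        if n < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # item number n is kept with probability k / (n + 1)
            j = random.randrange(n + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A file object is itself an iterable of lines, so `reservoir_sample(open("file"), 1000)` samples 1000 lines in one pass. Note this fixes the sample *size* k rather than the sampling *rate*, so to get 1% you still need to know (or estimate) the line count.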

Solution 6 - Linux

The problem of how to uniformly sample N elements out of a large population (of unknown size) is known as Reservoir Sampling. (If you like algorithms problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)

A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.

Solution 7 - Linux

In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no solution has suggested that yet. I had to solve the same problem and I wrote the following awk program for sampling:

#!/usr/bin/env awk -f
BEGIN{
    srand();
    if(k=="") k=10
}

NR <= k {
    reservoir[NR-1] = $0;
    next;
}

{ i = int(NR * rand()) }

i < k { reservoir[i] = $0 }

END {
    for (i in reservoir) {
        print reservoir[i];
    }
}

If saved as sample_lines and made executable, it can be run like: ./sample_lines -v k=5 input_file. If k is not given, then 10 will be used by default.

Then figuring out what k is has to be done separately, for example by setting -v "k=$(dc -e "$(cat input_file | wc -l) 100 / n")"

Solution 8 - Linux

You could do it in two passes over the file:

  • Run through the file once, just to count how many lines there are
  • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
  • Run through the file once more and pick out the lines at the selected positions

Example in python:

fn = '/usr/share/dict/words'

from random import randint
from sys import stdout

count = 0
with open(fn) as f:
   for line in f:
      count += 1

selected = set()
while len(selected) < count//100:
   selected.add(randint(0, count-1))

index = 0
with open(fn) as f:
   for line in f:
      if index in selected:
          stdout.write(line)
      index += 1

Solution 9 - Linux

Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines inside the per-line rule (the !/^$/ block) instead of in END. That way, you only hold 100 lines at a time.
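A minimal sketch of this idea in Python (my illustration, not part of the original answer): buffer 100 lines at a time and emit one line chosen at random from each chunk.

```python
import random

def sample_one_per_chunk(lines, chunk_size=100):
    """Yield one randomly chosen line from each block of chunk_size
    lines, holding at most chunk_size lines in memory at a time."""
    buf = []
    for line in lines:
        buf.append(line)
        if len(buf) == chunk_size:
            yield random.choice(buf)
            buf = []
    if buf:  # leftover lines at EOF form a final, smaller chunk
        yield random.choice(buf)
```

Streaming a file is then just `sample_one_per_chunk(open("file"))`. Note the trade-off: this guarantees exactly one line per 100, but it is not quite a uniform 1% sample, since it can never pick two lines from the same block of 100.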

Solution 10 - Linux

If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. Make two passes over the file: one to get the number of lines (as with wc -l), and one to select the sample:

file=/some/file
awk -v percent=0.01 -v n="$(wc -l < "$file")" '
  BEGIN {srand(); p = int(n * percent)}
  rand() * n-- < p {p--; print}' < "$file"

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | neversaint | View Question on Stackoverflow
Solution 1 - Linux | cadrian | View Answer on Stackoverflow
Solution 2 - Linux | Bill | View Answer on Stackoverflow
Solution 3 - Linux | Steven Huwig | View Answer on Stackoverflow
Solution 4 - Linux | ashawley | View Answer on Stackoverflow
Solution 5 - Linux | advait | View Answer on Stackoverflow
Solution 6 - Linux | Tudor Bosman | View Answer on Stackoverflow
Solution 7 - Linux | kqr | View Answer on Stackoverflow
Solution 8 - Linux | sth | View Answer on Stackoverflow
Solution 9 - Linux | Travis Jensen | View Answer on Stackoverflow
Solution 10 - Linux | Stephane Chazelas | View Answer on Stackoverflow