Select random lines from a file

Tags: Bash, Shell, Random, Text Processing

Bash Problem Overview


In a Bash script, I want to pick N random lines from an input file and write them to another file.

How can this be done?

Bash Solutions


Solution 1 - Bash

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output

Solution 2 - Bash

Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<"$input_file" sort -R | head -n "$lines"

# If the file has duplicates that must never cause duplicate results
<"$input_file" sort -u | sort -R | head -n "$lines"

# If the file has blank lines that must be filtered, use sed
<"$input_file" sed $'/^[ \t]*$/d' | sort -R | head -n "$lines"

Of course, the input-file redirection at the start of each command can be replaced with any piped standard input. Both sort -R and the $'...\t...' quoting (which lets sed match tab characters) work on GNU/Linux and BSD/macOS.

Solution 3 - Bash

According to a comment on the shuf answer, the commenter shuffled 78 000 000 000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

It is this fast because it does not read the whole file: it moves the file pointer 10 times and prints the line after each position.
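
The powershuf source is not reproduced here, so the following is only a minimal Python sketch of the idea described above: jump to random byte offsets and return the next complete line. The function name sample_by_seek and its parameters are made up for illustration. One consequence of the approach is that lines following long lines are picked more often, and the same line can come up twice, so it trades uniformity for speed.

```python
import os
import random


def sample_by_seek(path, n, rng=random):
    """Return n lines, each found by jumping to a random byte offset.

    Hypothetical helper, not the powershuf implementation.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    picked = []
    with open(path, "rb") as f:
        for _ in range(n):
            f.seek(rng.randrange(size))
            f.readline()         # discard the rest of the line we landed in
            line = f.readline()  # first complete line after the pointer
            if not line:         # landed in the last line: wrap to the start
                f.seek(0)
                line = f.readline()
            picked.append(line.decode().rstrip("\n"))
    return picked
```

The cost depends on n and the line lengths near each offset, not on the size of the file, which is why the run time barely changes between a small file and a very large one.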

Gitlab Repo

Old attempt

First I needed a file of 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU: shuf did not use multiple threads, so it pinned one core at 100% while the other 15 sat idle.

Python is what I regularly use so that's what I'll use to make this faster:

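#!/usr/bin/env python3
import random

with open("lines_78000000000.txt", "rb") as f:
    # Count newlines in 64 KiB chunks instead of loading the file.
    count = 0
    while True:
        buffer = f.read(65536)
        if not buffer:
            break
        count += buffer.count(b"\n")

    # Every line in this test file is a single newline byte, so line k
    # starts at byte offset k - 1 and can be reached with a direct seek.
    for _ in range(10):
        f.seek(random.randint(1, count) - 1)
        print(f.readline().decode(), end="")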

This got me just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.

I know it can get faster but I'll leave some room to give others a try.

Line counter source: Luther Blissett

Solution 4 - Bash

My preferred option is very fast. I sampled a tab-delimited data file with 13 columns and 23.1M rows (2.0GB uncompressed).

# randomly select about 5% of the lines in the file,
# always keep the header row, skip blank lines, new seed on each run

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
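
The awk filter keeps each line independently with probability 0.05, so the output holds roughly 5% of the rows and not an exact count. Here is a rough Python sketch of the same logic, for readers who want to see the selection rule spelled out. The name bernoulli_sample and the rate parameter are illustrative and do not come from the answer.

```python
import random


def bernoulli_sample(lines, rate, rng=random):
    """Yield the header plus each non-blank line with probability `rate`."""
    for number, line in enumerate(lines, start=1):
        if line.rstrip("\n") == "":
            continue  # mirrors the awk pattern !/^$/
        if number == 1 or rng.random() <= rate:
            yield line  # mirrors: rand() <= .05 || FNR==1
```

Because every line is decided on its own, the input can be streamed once without knowing its length in advance. When exactly N lines are required, shuf -n N is the better fit.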

Solution 5 - Bash

seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'
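
The one-liner above prints a single random line. To get N distinct lines in the same style, random.sample can stand in for random.choice. This is only a sketch: pick_lines is a hypothetical helper, and the list of numbers stands in for the output of seq 1 100.

```python
import random


def pick_lines(lines, n, rng=random):
    """Return up to n distinct entries of `lines`, in random order."""
    lines = list(lines)
    return rng.sample(lines, min(n, len(lines)))


numbers = [f"{i}\n" for i in range(1, 101)]  # stands in for `seq 1 100`
print("".join(pick_lines(numbers, 5)), end="")
```

To read from a pipe instead, pass sys.stdin.readlines() as the first argument. As with the one-liner, this loads the whole input into memory.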

Solution 6 - Bash

# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: Number of lines to sample (N)
rand_line_sampler() {
	N_t=$(awk 'END {print NR}' "$1") # Total number of lines

	N_t_m_d=$(( N_t - $2 - 1 )) # Total lines minus desired lines, minus 1

	N_d_m_1=$(( $2 - 1 )) # Number of desired lines minus 1

	# vector to have the 0 (fail) with size of N_t_m_d 
	echo '0' > vector_0.temp
	for i in $(seq 1 1 $N_t_m_d); do
	        echo "0" >> vector_0.temp
	done

	# vector to have the 1 (success) with size of desired number of lines
	echo '1' > vector_1.temp
	for i in $(seq 1 1 $N_d_m_1); do
	        echo "1" >> vector_1.temp
	done

	cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

	# Prefix each line with its 0/1 flag, keep flagged lines, drop the flag
	paste -d" " rand_vector.temp "$1" |
	sed -n 's/^1 //p' > sampled_file.txt # file with the sampled lines

	rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"
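
The function builds a vector of N ones and (total - N) zeros, shuffles it, and keeps the lines that line up with a one. The same idea fits in a few lines of Python, shown here as a sketch only. mask_sample is a hypothetical name and no temporary files are involved.

```python
import random


def mask_sample(lines, n, rng=random):
    """Keep exactly n lines, chosen by a shuffled 0/1 mask."""
    lines = list(lines)
    n = min(n, len(lines))
    mask = [1] * n + [0] * (len(lines) - n)  # n successes, the rest failures
    rng.shuffle(mask)
    return [line for keep, line in zip(mask, lines) if keep]
```

As in the shell version, the sampled lines come out in their original file order, not in shuffled order.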

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | user121196 | View Question on Stackoverflow
Solution 1 - Bash | dogbane | View Answer on Stackoverflow
Solution 2 - Bash | user881480 | View Answer on Stackoverflow
Solution 3 - Bash | Stein van Broekhoven | View Answer on Stackoverflow
Solution 4 - Bash | Merlin | View Answer on Stackoverflow
Solution 5 - Bash | Andelf | View Answer on Stackoverflow
Solution 6 - Bash | andrec | View Answer on Stackoverflow