Bash Script: count unique lines in file


Bash Problem Overview


Situation:

I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:

ip.ad.dre.ss[:port]

Desired result:

There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort that reduces it to lines of the format

ip.ad.dre.ss[:port] count

where count is the number of occurrences of that specific address (and port). No special work needs to be done; treat different ports as different addresses.

So far, I'm using this command to scrape all of the IP addresses from the log file:

grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt

From that, I can use a fairly simple regex to filter out all of the addresses that were sent by my own address (which I don't care about).
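
A sketch of that filtering step, assuming (purely for illustration) that my own address is 10.1.2.3:

grep -v -E '^10\.1\.2\.3(:|$)' ips.txt > remote_ips.txt

The (:|$) part makes sure an address like 10.1.2.30 isn't accidentally dropped as well.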

I can then use the following to extract the unique entries:

sort -u ips.txt > intermediate.txt

What I don't know is how to aggregate the line counts with sort.

Bash Solutions


Solution 1 - Bash

You can use the uniq command to get counts of repeated lines; the input has to be sorted first:

sort ips.txt | uniq -c

To get the most frequent results at the top (thanks to Peter Jaric):

sort ips.txt | uniq -c | sort -bgr
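
For example, on a small hypothetical ips.txt (the addresses below are invented), that pipeline prints the count in front of each line, most frequent first:

$ sort ips.txt | uniq -c | sort -bgr
     12 192.168.1.7:80
      3 172.16.4.2:443
      1 10.0.0.9:53

Here -b ignores the leading blanks that uniq -c adds, -g compares the counts numerically, and -r reverses the order so the largest counts come first.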

Solution 2 - Bash

To count just the total number of unique lines (i.e. counting each duplicated line only once), we can use uniq or awk together with wc:

sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l

Awk's arrays are associative (hash-based), so it can avoid the full sort and may run a little faster.
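
The '!seen[$0]++' idiom prints a line only the first time it is seen. A more verbose equivalent, shown purely as a sketch of what it does:

awk '{ if (seen[$0]++ == 0) print }' ips.txt | wc -l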

Generating a test file:

$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175

real    0m1.193s
user    0m0.701s
sys     0m0.388s

$ time awk '!seen[$0]++' random.txt | wc -l
31175

real    0m0.675s
user    0m0.108s
sys     0m0.171s


Solution 3 - Bash

This is the fastest way to get the counts of the repeated lines and have them nicely printed, sorted from the least frequent to the most frequent:

awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
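
On a small hypothetical ips.txt (the same invented addresses as above), the output is one "count address" pair per line, least frequent first:

$ awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
1 10.0.0.9:53
3 172.16.4.2:443
12 192.168.1.7:80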

If you don't care about performance and you want something easier to remember, then simply run:

sort ips.txt | uniq -c | sort -n

PS:

sort -n parses the field as a number, which is what we want since we're sorting by the counts.
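
A tiny demonstration of why the numeric flag matters here: with a plain lexicographic sort, a count of 10 would sort before a count of 9.

$ printf '9 foo\n10 bar\n' | sort
10 bar
9 foo
$ printf '9 foo\n10 bar\n' | sort -n
9 foo
10 bar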

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author     | Original Content on Stackoverflow
Question           | Wug                 | View Question on Stackoverflow
Solution 1 - Bash  | Michael Hoffman     | View Answer on Stackoverflow
Solution 2 - Bash  | qwr                 | View Answer on Stackoverflow
Solution 3 - Bash  | Luca Mastrostefano  | View Answer on Stackoverflow