Bash Script: count unique lines in file


Bash Problem Overview


Situation:

I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:

ip.ad.dre.ss[:port]

Desired result:

There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort that reduces it to lines of the format

ip.ad.dre.ss[:port] count

where count is the number of occurrences of that specific address (and port). No special work needs to be done; treat different ports as different addresses.

So far, I'm using this command to scrape all of the IP addresses from the log file:

grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt

From that, I can use a fairly simple regex to filter out all of the addresses that were sent by my own address (which I don't care about).
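
A sketch of that filtering step, assuming (purely for illustration) that my own address is 10.1.2.3:

grep -v -E '^10\.1\.2\.3(:|$)' ips.txt > remote_ips.txt

The (:|$) part makes sure an address like 10.1.2.30 isn't accidentally dropped as well.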

I can then use the following to extract the unique entries:

sort -u ips.txt > intermediate.txt

What I don't know is how to aggregate the line counts with sort.

Bash Solutions


Solution 1 - Bash

You can use the uniq command to get counts of repeated lines; the input has to be sorted first:

sort ips.txt | uniq -c

To get the most frequent results at the top (thanks to Peter Jaric):

sort ips.txt | uniq -c | sort -bgr
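
For example, on a small hypothetical ips.txt (the addresses below are invented), that pipeline prints the count in front of each line, most frequent first:

$ sort ips.txt | uniq -c | sort -bgr
     12 192.168.1.7:80
      3 172.16.4.2:443
      1 10.0.0.9:53

Here -b ignores the leading blanks that uniq -c adds, -g compares the counts numerically, and -r reverses the order so the largest counts come first.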

Solution 2 - Bash

To count just the total number of unique lines (i.e. counting each duplicated line only once), we can use uniq or awk together with wc:

sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l

Awk's arrays are associative (hash-based), so it can avoid the full sort and may run a little faster.
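
The '!seen[$0]++' idiom prints a line only the first time it is seen. A more verbose equivalent, shown purely as a sketch of what it does:

awk '{ if (seen[$0]++ == 0) print }' ips.txt | wc -l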

Generating a test file:

$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175

real    0m1.193s
user    0m0.701s
sys     0m0.388s

$ time awk '!seen[$0]++' random.txt | wc -l
31175

real    0m0.675s
user    0m0.108s
sys     0m0.171s


Solution 3 - Bash

This is the fastest way to get the counts of the repeated lines and have them nicely printed, sorted from the least frequent to the most frequent:

awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
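
On a small hypothetical ips.txt (the same invented addresses as above), the output is one "count address" pair per line, least frequent first:

$ awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
1 10.0.0.9:53
3 172.16.4.2:443
12 192.168.1.7:80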

If you don't care about performance and you want something easier to remember, then simply run:

sort ips.txt | uniq -c | sort -n

PS:

sort -n parses the field as a number, which is what we want since we're sorting by the counts.
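
A tiny demonstration of why the numeric flag matters here: with a plain lexicographic sort, a count of 10 would sort before a count of 9.

$ printf '9 foo\n10 bar\n' | sort
10 bar
9 foo
$ printf '9 foo\n10 bar\n' | sort -n
9 foo
10 bar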

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author     | Original Content on Stackoverflow
Question           | Wug                 | View Question on Stackoverflow
Solution 1 - Bash  | Michael Hoffman     | View Answer on Stackoverflow
Solution 2 - Bash  | qwr                 | View Answer on Stackoverflow
Solution 3 - Bash  | Luca Mastrostefano  | View Answer on Stackoverflow