Count line lengths in file using command line tools
Tags: Bash, Shell, Command Line, Scripting

Problem Overview
Problem
If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?
Example:
file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
Running count_line_lengths file.txt
would give:
Length Occurrences
1 1
2 2
4 3
5 1
6 2
7 2
Ideas?
Bash Solutions
Solution 1 - Bash
This pipeline:
- counts the line lengths using awk, then
- sorts the (numeric) line lengths using sort -n, and finally
- counts the occurrences of each unique line length using uniq -c.
$ awk '{print length}' input.txt | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
In the output, the first column is the number of lines with the given length, and the second column is the line length.
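If you want the columns in the order shown in the problem statement (length first), a second awk pass can swap uniq -c's columns; this is a minor variation on the pipeline above:

```shell
awk '{print length}' input.txt | sort -n | uniq -c | awk '{print $2, $1}'
```

uniq -c prints the count before the value, so the final awk simply reverses the two fields.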
Solution 2 - Bash
Pure awk
awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt
4 3
5 1
6 2
7 2
1 1
2 2
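Note that for (i in a) visits the array keys in arbitrary order, which is why the output above is unordered; piping through sort -n gives ascending lengths:

```shell
awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt | sort -n
```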
Solution 3 - Bash
Using bash arrays:
#!/bin/bash
# Build a histogram of line lengths, indexed by length.
# IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line; do
((histogram[${#line}]++))
done < file.txt
echo "Length Occurrence"
for length in "${!histogram[@]}"; do
printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done
Example run:
$ ./t.sh
Length Occurrence
1 1
2 2
4 3
5 1
6 2
7 2
Solution 4 - Bash
$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt
Output
6 2
1 1
4 3
7 2
2 2
5 1
Solution 5 - Bash
You can accomplish this by using basic unix utilities only:
$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2
How does it work?
- Here's the source file:
$ cat file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
- Replace each line of the source file with its length:
$ for line in $(cat file.txt); do printf $line | wc -c; done
4
2
1
6
4
4
7
5
2
7
6
- Sort and count the number of length occurrences:
$ for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
- Swap and format the numbers:
$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2
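One caveat: for line in $(cat file.txt) splits its input on any whitespace, so a line containing spaces would be counted word by word. A while read loop (a sketch of the same counting step, everything else unchanged) avoids that:

```shell
while IFS= read -r line; do
  printf '%s' "$line" | wc -c
done < file.txt | sort -n | uniq -c
```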
Solution 6 - Bash
If you allow for the columns to be swapped and don't need the headers, something as easy as
while read line; do echo -n "$line" | wc -m; done < file | sort | uniq -c
(without any advanced tricks with sed or awk) will work. The output is:
1 1
2 2
3 4
1 5
2 6
2 7
One important thing to keep in mind: wc -c counts bytes, not characters, and will not give the correct length for strings containing multibyte characters; hence the use of wc -m here.
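The difference is easy to demonstrate with a multibyte character: in UTF-8, é occupies two bytes but is a single character.

```shell
printf 'héllo' | wc -c   # 6 bytes
printf 'héllo' | wc -m   # 5 characters in a UTF-8 locale (6 in the C locale)
```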
Solution 7 - Bash
Try this:
awk '{print length}' FILENAME
Or, if you want the longest line length:
awk '{ln=length} ln>max{max=ln} END {print FILENAME " " max}' FILENAME
You can combine the above command with find using the -exec option.
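For example, to print the longest line length of every .txt file under the current directory (the pattern is just an illustration; since -exec ... {} \; runs a fresh awk process per file, max starts at zero for each one):

```shell
find . -name '*.txt' -exec awk '{ln=length} ln>max{max=ln} END {print FILENAME " " max}' {} \;
```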