Count line lengths in file using command line tools

Bash, Shell, Command Line, Scripting

Bash Problem Overview


Problem

If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?

Example:

file.txt

this
is
a
sample
file
with
several
lines
of
varying
length

Running count_line_lengths file.txt would give:

Length Occurrences
1      1
2      2
4      3
5      1
6      2
7      2

Ideas?

Bash Solutions


Solution 1 - Bash

This

  • prints the length of each line using awk, then
  • sorts the line lengths numerically using sort -n, and finally
  • counts how often each length occurs using uniq -c.

$ awk '{print length}' file.txt | sort -n | uniq -c
      1 1
      2 2
      3 4
      1 5
      2 6
      2 7

In the output, the first column is the number of lines with the given length, and the second column is the line length.
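
If the output should match the layout asked for in the question (length first, with a header line), a final awk step can swap the two columns. A minimal sketch, assuming the function name count_line_lengths from the question (it is not an existing tool) and a header text chosen here for illustration:

```shell
# Sketch: print "length count" rows with a header, sorted by length.
# count_line_lengths is the hypothetical name used in the question.
count_line_lengths() {
    awk '{ print length($0) }' "$1" |
        sort -n |
        uniq -c |
        awk 'BEGIN { print "Length Occurrences" }
             { printf "%-6s %s\n", $2, $1 }'
}
```

Called as count_line_lengths file.txt on the sample file, this should print the table shown in the question.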

Solution 2 - Bash

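Pure awk (the order in which for (i in a) visits the lengths is unspecified, so the output is not necessarily sorted):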

awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt

4 3
5 1
6 2
7 2
1 1
2 2
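
The lengths in this output are not in ascending order. One way to get them sorted, sketched here under the assumption that an extra sort process is acceptable (length_histogram is a hypothetical name), is to pipe the pairs through sort -n:

```shell
# Sketch: tally line lengths in awk, then sort the "length count" pairs
# numerically by length. length_histogram is a hypothetical name.
length_histogram() {
    awk '{ ++count[length($0)] }
         END { for (len in count) print len, count[len] }' "$1" |
        sort -n
}
```

Sorting after the tally keeps the awk part unchanged and only has to sort one row per distinct length, not one row per input line.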

Solution 3 - Bash

Using bash arrays:

#!/bin/bash

# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
while IFS= read -r line; do
    ((histogram[${#line}]++))
done < file.txt

echo "Length Occurrence"
for length in "${!histogram[@]}"; do
    printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done

Example run:

$ ./t.sh
Length Occurrence
1      1
2      2
4      3
5      1
6      2
7      2

Solution 4 - Bash

$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt
Output (hash keys come back in no particular order):
6 2
1 1
4 3
7 2
2 2
5 1

Solution 5 - Bash

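You can accomplish this using basic Unix utilities only. Note that for line in $(cat file.txt) splits on any whitespace, so this only works when no line contains spaces or tabs: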

$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2

How it works:

  1. Here's the source file:

    $ cat file.txt
    this
    is
    a
    sample
    file
    with
    several
    lines
    of
    varying
    length
    

  2. Replace each line of the source file with its length:
    $ for line in $(cat file.txt); do printf $line | wc -c; done
    4
    2
    1
    6
    4
    4
    7
    5
    2
    7
    6
    

  3. Sort and count the number of length occurrences:
    $ for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c
    1 1
    2 2
    3 4
    1 5
    2 6
    2 7
    

  4. Swap and format the numbers:
    $ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
    1 1
    2 2
    4 3
    5 1
    6 2
    7 2
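
A variant of the same pipeline that reads whole lines, so that spaces inside a line are counted too. This is only a sketch: count_with_wc is a hypothetical name, and the final awk step replaces the sed/printf pair for swapping the columns:

```shell
# Sketch: measure each whole line with wc -c (bytes), then count and
# swap columns. count_with_wc is a hypothetical name.
count_with_wc() {
    while IFS= read -r line; do
        # '%s' prints the line verbatim, even if it contains % or \
        printf '%s' "$line" | wc -c
    done < "$1" |
        sort -n |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

With a file containing the lines "a b", "xyz" and "hello world", count_with_wc reports two lines of length 3 and one of length 11.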
    

Solution 6 - Bash

If you allow for the columns to be swapped and don't need the headers, something as easy as

while IFS= read -r line; do printf '%s' "$line" | wc -m; done < file.txt | sort -n | uniq -c

(without any advanced tricks with sed or awk) will work. The output is:

1 1
2 2
3 4
1 5
2 6
2 7

One important thing to keep in mind: wc -c counts bytes, not characters, so it will not give the correct length for strings containing multibyte characters. That is why wc -m is used here.
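
The difference can be seen on a single word. A minimal sketch, assuming UTF-8 encoded input; word_bytes and word_chars are hypothetical helper names, and the ï in "naïve" is written as its two UTF-8 bytes so the example does not depend on the encoding of the script itself:

```shell
# Sketch: byte count vs. character count for the word "naïve".
# The ï is spelled as its two UTF-8 bytes (octal 303 257).
# tr strips the padding some wc implementations add.
word_bytes() { printf 'na\303\257ve' | wc -c | tr -d ' '; }
word_chars() { printf 'na\303\257ve' | wc -m | tr -d ' '; }

word_bytes   # prints 6: wc -c counts bytes
word_chars   # prints 5 when the current locale is UTF-8
```

The result of wc -m depends on the locale, so the character count is only meaningful when LANG or LC_ALL selects the encoding the file actually uses.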

References:

man uniq(1)

man sort(1)

man wc(1)

Solution 7 - Bash

Try this:

awk '{print length}' FILENAME

Or, if you want the length of the longest line:

awk '{ln=length} ln>max{max=ln} END {print FILENAME " " max}' file.txt

You can combine the above command with find using its -exec option.
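
A minimal sketch of that combination, assuming the goal is one "path length" line per file. The function name longest_lines and the *.txt filter are illustrative choices, not part of the original answer:

```shell
# Sketch: print "path longest-line-length" for every *.txt file under a
# directory. longest_lines and the *.txt filter are hypothetical choices.
longest_lines() {
    find "$1" -type f -name '*.txt' -exec \
        awk 'length($0) > max { max = length($0) }
             END { print FILENAME, max + 0 }' {} \;
}
```

Because -exec ... \; starts a separate awk process per file, max starts from zero for each one; max + 0 makes an empty file print 0 rather than an empty field.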

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author          Original Content on Stack Overflow
Question            Pete Hamilton            View Question on Stack Overflow
Solution 1 - Bash   Ignacio Vazquez-Abrams   View Answer on Stack Overflow
Solution 2 - Bash   iruvar                   View Answer on Stack Overflow
Solution 3 - Bash   Adrian Frühwirth         View Answer on Stack Overflow
Solution 4 - Bash   jfs                      View Answer on Stack Overflow
Solution 5 - Bash   Maksym Ganenko           View Answer on Stack Overflow
Solution 6 - Bash   imrek                    View Answer on Stack Overflow
Solution 7 - Bash   Sergio Marsilli          View Answer on Stack Overflow