How to count differences between two files on Linux?

Shell, Count, Diff

Shell Problem Overview


I need to work with large files and must find the differences between two of them. I don't need the differing content itself, only the number of differences.

To find the number of differing rows I came up with

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

And it works, but is there a better way to do it?

And how can I count the exact number of differences (with standard tools like bash, diff, awk, sed, or some old version of perl)?

Shell Solutions


Solution 1 - Shell

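If you want to count the number of lines that are different, use this:

diff -U 0 file1 file2 | grep ^@ | wc -l

Strictly speaking, this counts hunks: with -U 0 each line starting with @ introduces one run of adjacent changed lines, so several consecutive changed lines count once. Doesn't John's answer (Solution 2 below) double count the different lines?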
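To make the difference between hunks and lines concrete, here is a minimal sketch. The helper name count_hunks and the sample file contents are made up for illustration, and it assumes a diff that accepts -U 0 (GNU diffutils does).

```shell
#!/bin/sh
# count_hunks FILE1 FILE2 (hypothetical helper name)
# Prints the number of hunks, i.e. runs of adjacent changed lines.
count_hunks() {
    # -U 0 drops context lines, so every "@@" header marks one changed run.
    # grep -c counts the matching lines itself, so no "wc -l" is needed.
    diff -U 0 "$1" "$2" | grep -c '^@@'
}

# Demo on throwaway files: lines 2-3 and line 5 differ (three lines, two runs).
demo=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$demo/old"
printf 'a\nX\nY\nd\nE\n' > "$demo/new"
count_hunks "$demo/old" "$demo/new"   # prints 2
rm -r "$demo"
```

Three lines changed in the demo, yet the result is 2, because the two adjacent changes form a single hunk.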

Solution 2 - Shell

diff -U 0 file1 file2 | grep -v ^@ | wc -l

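Subtract 2 from the result for the two file-name lines (--- and +++) at the top of the diff listing. Note that a modified line is counted twice, once as removed and once as added. Unified format is probably a bit faster than side-by-side format.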
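A possible variation that avoids the manual subtraction and reports removed and added lines separately is sketched below. The helper names count_removed_lines and count_added_lines are hypothetical, and the sketch assumes the unified output always starts with exactly two header lines.

```shell
#!/bin/sh
# Hypothetical helpers: count the "-" and "+" lines of a unified diff separately.
count_removed_lines() {
    # tail -n +3 skips the "--- file1" and "+++ file2" header lines;
    # "@@" hunk headers never match the pattern, so they are ignored.
    diff -U 0 "$1" "$2" | tail -n +3 | grep -c '^-'
}
count_added_lines() {
    diff -U 0 "$1" "$2" | tail -n +3 | grep -c '^+'
}

# Demo: line 2 is modified and a fourth line is appended.
demo=$(mktemp -d)
printf 'a\nb\nc\n' > "$demo/old"
printf 'a\nB\nc\nd\n' > "$demo/new"
echo "removed: $(count_removed_lines "$demo/old" "$demo/new")"   # removed: 1
echo "added: $(count_added_lines "$demo/old" "$demo/new")"       # added: 2
rm -r "$demo"
```

Keeping the two counts apart sidesteps the double counting: a modified line shows up once in each number instead of twice in a single total.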

Solution 3 - Shell

If using Linux/Unix, what about comm -23 file1 file2 to print lines in file1 that aren't in file2, comm -23 file1 file2 | wc -l to count them, and similarly comm -13 ... for lines that are only in file2? Note that comm expects both files to be sorted.
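Because comm only works on sorted input, a small wrapper that sorts copies first may be convenient. This is a sketch under assumptions: the helper name count_only_in_first is invented here, and sorting large files costs extra time and temporary disk space.

```shell
#!/bin/sh
# count_only_in_first FILE1 FILE2 (hypothetical helper name)
# Prints how many lines occur in FILE1 but not in FILE2, ignoring line order.
count_only_in_first() {
    work=$(mktemp -d) || return 1
    # comm needs sorted input; LC_ALL=C keeps sort and comm on the same collation.
    LC_ALL=C sort "$1" > "$work/first"
    LC_ALL=C sort "$2" > "$work/second"
    # -2 hides lines unique to FILE2, -3 hides lines common to both files.
    LC_ALL=C comm -23 "$work/first" "$work/second" | wc -l
    rm -r "$work"
}

# Demo: same fruit in a different order, plus a few extras on each side.
demo=$(mktemp -d)
printf 'cherry\napple\nbanana\n' > "$demo/one"
printf 'banana\ndate\napple\nelderberry\n' > "$demo/two"
count_only_in_first "$demo/one" "$demo/two"   # cherry only: 1
count_only_in_first "$demo/two" "$demo/one"   # date and elderberry: 2
rm -r "$demo"
```

Unlike diff, this treats the files as unordered collections of lines, so lines that merely moved are not reported as differences.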

Solution 4 - Shell

Since every differing line in diff's default output starts with a < or > character, I would suggest this:

diff file1 file2 | grep '^[<>]' | wc -l

By using only < or > in the pattern, you can count the differences in just one of the files.
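If both per-file counts are wanted, a single pass with awk can produce them together. The following is only a sketch: the helper name count_both_sides and the output layout (two numbers on one line) are choices made for this example.

```shell
#!/bin/sh
# count_both_sides FILE1 FILE2 (hypothetical helper name)
# Prints two numbers: lines only in FILE1, then lines only in FILE2.
count_both_sides() {
    diff "$1" "$2" | awk '
        /^</ { left++ }     # line present only in the first file
        /^>/ { right++ }    # line present only in the second file
        END  { printf "%d %d\n", left, right }'
}

# Demo: three lines of the first file were replaced by three new ones.
demo=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$demo/old"
printf 'a\nX\nY\nd\nE\n' > "$demo/new"
count_both_sides "$demo/old" "$demo/new"   # prints: 3 3
rm -r "$demo"
```

For identical files diff prints nothing, and the END block still reports 0 0 because unset awk variables format as zero.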

Solution 5 - Shell

I believe the correct solution is the one given in another answer, that is:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1

Solution 6 - Shell

If you're dealing with files whose content is analogous and sorted the same way line for line (like CSV files describing similar things), and you want to find, for example, the 2 differences in the following files:

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

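from itertools import zip_longest

file1, file2 = "a", "b"  # placeholder paths for the two files to compare
different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest pads the shorter file with None, so surplus lines count as differences
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1
print(different_lines)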
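The same position-by-position comparison can also be sketched with awk, which the question lists among the acceptable tools. The helper name count_positional_diffs is hypothetical, and the sketch assumes the path of the second file contains no backslashes (awk interprets escape sequences in -v assignments).

```shell
#!/bin/sh
# count_positional_diffs FILE1 FILE2 (hypothetical helper name)
# Compares line N of FILE1 with line N of FILE2 and prints how many positions differ.
count_positional_diffs() {
    awk -v other="$2" '
        {
            # Read the matching line from the second file; a missing line counts as a difference.
            # Appending "" forces a string comparison, so "1.0" and "1" are not treated as equal.
            if ((getline line < other) <= 0 || (line "") != ($0 "")) n++
        }
        END {
            # Lines left over in the second file are differences too.
            while ((getline line < other) > 0) n++
            print n + 0
        }
    ' "$1"
}

# Demo with the two CSV files shown above.
demo=$(mktemp -d)
printf 'min,max\n1,5\n3,4\n-2,10\n' > "$demo/a"
printf 'min,max\n2,5\n3,4\n-1,1\n' > "$demo/b"
count_positional_diffs "$demo/a" "$demo/b"   # prints 2
rm -r "$demo"
```

Both files are streamed line by line, so memory use stays flat even for large inputs. As with the Python version, a single inserted line shifts every later line and makes all of them count as different.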

Solution 7 - Shell

Here is a way to count any kind of difference between two files, using a regex to define the unit of comparison (here ., which matches any character except a newline):

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

An excerpt from man git-diff:

--patience
    Generate a diff using the "patience diff" algorithm.

--word-diff[=<mode>]
    Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.

    porcelain
        Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input are represented by a tilde ~ on a line of its own.

--word-diff-regex=<regex>
    Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it was already enabled.

    Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!) for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.

    For example, --word-diff-regex=. will treat each character as a word and, correspondingly, show differences character by character.

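pcre2grep is part of the pcre2-utils package on Ubuntu 20.04. If both files sit inside a Git working tree, add --no-index so that git diff compares them as plain files.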

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author    Original Content on Stackoverflow
Question            Zsolt Botykai      View Question on Stackoverflow
Solution 1 - Shell  Josh               View Answer on Stackoverflow
Solution 2 - Shell  John Kugelman      View Answer on Stackoverflow
Solution 3 - Shell  dubiousjim         View Answer on Stackoverflow
Solution 4 - Shell  Michal Nemec       View Answer on Stackoverflow
Solution 5 - Shell  tsusanka           View Answer on Stackoverflow
Solution 6 - Shell  Daniel Lee         View Answer on Stackoverflow
Solution 7 - Shell  vstepaniuk         View Answer on Stackoverflow