How to count differences between two files on Linux?
Shell Problem Overview
I need to work with large files and must find the differences between two of them. I don't need the differing content itself, only the number of differences.
To find the number of different rows I came up with:
diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l
It works, but is there a better way to do it?
And how can I count the exact number of differences (with standard tools like bash, diff, awk, sed, or some old version of perl)?
Shell Solutions
Solution 1 - Shell
If you want to count the number of places where the files differ (each @ line in unified output marks one hunk, i.e. one contiguous run of changed lines), use this:
diff -U 0 file1 file2 | grep ^@ | wc -l
Doesn't John's answer double count the different lines?
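To make it concrete what the command above counts, here is a minimal sketch. The helper name count_hunks and the throwaway files are made up for illustration. Each line starting with @ in unified output opens one hunk, so the pipeline counts contiguous changed regions rather than individual lines:

```shell
# count_hunks is a hypothetical helper name, not part of diff.
count_hunks() {
    # With -U 0 every hunk header starts with "@@"; grep -c counts those lines.
    diff -U 0 "$1" "$2" | grep -c '^@'
}

# Demo with throwaway files: two adjacent changed lines form a single hunk.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmp/f1"
printf 'a\nX\nY\nd\n' > "$tmp/f2"
echo "hunks: $(count_hunks "$tmp/f1" "$tmp/f2")"   # prints "hunks: 1"
rm -rf "$tmp"
```

Two lines changed here, but the count is 1 because they sit next to each other.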
Solution 2 - Shell
diff -U 0 file1 file2 | grep -v '^@' | wc -l
Subtract 2 from the result to account for the two file-name lines at the top of the diff
listing. Unified format is probably a bit faster than side-by-side format.
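The subtraction of the two header lines can also be folded into the pipeline itself. This is only a sketch, and count_changed_lines is a hypothetical name. Note that a modified line shows up once as removed and once as added, so it contributes 2 to the total:

```shell
# count_changed_lines is a hypothetical helper name.
count_changed_lines() {
    # tail -n +3 drops the "---" and "+++" file-name headers;
    # grep -c then counts removed (-) and added (+) lines and skips "@@" headers.
    diff -U 0 "$1" "$2" | tail -n +3 | grep -c '^[-+]'
}

# Demo with throwaway files: 2 removed lines + 2 added lines.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmp/f1"
printf 'a\nX\nY\nd\n' > "$tmp/f2"
echo "changed lines: $(count_changed_lines "$tmp/f1" "$tmp/f2")"   # prints "changed lines: 4"
rm -rf "$tmp"
```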
Solution 3 - Shell
If using Linux/Unix, what about comm -23 file1 file2
to print lines in file1 that aren't in file2, comm -23 file1 file2 | wc -l
to count them, and similarly comm -13 for lines that are only in file2?
Note that comm expects both files to be sorted.
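A hedged sketch of the comm approach follows. Keep in mind that comm expects sorted input and that its -1, -2 and -3 options each suppress a column (lines only in the first file, lines only in the second file, lines common to both), so -23 leaves just the lines unique to the first file. The helper name only_in_first is made up for illustration:

```shell
# only_in_first is a hypothetical helper name.
only_in_first() {
    s1=$(mktemp)
    s2=$(mktemp)
    # comm needs sorted input, so sort copies of both files first.
    sort "$1" > "$s1"
    sort "$2" > "$s2"
    # -2 hides lines unique to the second file, -3 hides lines common to both.
    comm -23 "$s1" "$s2" | wc -l
    rm -f "$s1" "$s2"
}

# Demo with throwaway files: "b" and "c" exist only in the first file.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmp/f1"
printf 'a\nX\nY\nd\n' > "$tmp/f2"
echo "only in first: $(only_in_first "$tmp/f1" "$tmp/f2")"
rm -rf "$tmp"
```

Because the files are sorted first, this treats them as sets of lines: a line that merely moved to another position is not counted as a difference.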
Solution 4 - Shell
Since every output line that differs starts with a < or > character, I would suggest this:
diff file1 file2 | grep '^[<>]' | wc -l
By using only < or > in the pattern you can count the differences in only one of the files.
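Counting one side at a time might look like the sketch below. The helper name count_side is hypothetical, and the marker is passed in quotes so the shell does not treat it as a redirection:

```shell
# count_side is a hypothetical helper name.
# $1 is the marker: "<" for lines only in the first file, ">" for the second.
count_side() {
    diff "$2" "$3" | grep -c "^$1"
}

# Demo with throwaway files: "b" and "c" are replaced by a single "X".
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/f1"
printf 'a\nX\n' > "$tmp/f2"
echo "first file:  $(count_side '<' "$tmp/f1" "$tmp/f2")"   # prints "first file:  2"
echo "second file: $(count_side '>' "$tmp/f1" "$tmp/f2")"   # prints "second file: 1"
rm -rf "$tmp"
```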
Solution 5 - Shell
I believe the correct solution is in this answer, that is:
$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1
Solution 6 - Shell
If you're dealing with files with analogous content that should be sorted the same line-for-line (like CSV files describing similar things) and you would, e.g., want to find 2 differences in the following files:
File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1
you could implement it in Python like this:
from itertools import zip_longest

# file1 and file2 are the paths of the two files to compare
different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest pads the shorter file with None, so surplus lines count as differences
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1
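The same position-by-position comparison can also be sketched with awk, which the question lists among the acceptable tools. The function name count_positional_diffs is hypothetical. Be aware that awk -v interprets backslash escapes, so this sketch assumes file paths without backslashes:

```shell
# count_positional_diffs is a hypothetical helper name.
count_positional_diffs() {
    awk -v other="$2" '
        # Read the line at the same position from the second file.
        # A missing line or a different line counts as one difference.
        # Appending "" forces a string comparison, so "1.0" and "1" differ.
        { if ((getline o < other) <= 0 || ($0 "") != (o "")) n++ }
        # Lines left over in the second file also count.
        END { while ((getline o < other) > 0) n++; print n + 0 }
    ' "$1"
}

# Demo using the two sample files shown above.
tmp=$(mktemp -d)
printf 'min,max\n1,5\n3,4\n-2,10\n' > "$tmp/a"
printf 'min,max\n2,5\n3,4\n-1,1\n' > "$tmp/b"
echo "different lines: $(count_positional_diffs "$tmp/a" "$tmp/b")"   # prints "different lines: 2"
rm -rf "$tmp"
```

Unlike diff, this never resynchronises: one inserted line makes every following line count as different, which is only appropriate when the files are expected to line up row by row.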
Solution 7 - Shell
Here is a way to count any kind of difference between two files, with a regex specifying what counts as one unit of difference. Here the regex is . (any character except newline):
git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l
An excerpt from man git-diff:
```
--patience
    Generate a diff using the "patience diff" algorithm.
--word-diff[=
    character at the beginning of the line and extending to the end of the line. Newlines in the input
    are represented by a tilde ~ on a line of its own.
--word-diff-regex=
```
pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
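If git and pcre2grep are not available, a much cruder alternative (a different technique from the one above, not a replacement for it) is cmp -l, which prints one line per differing byte position. This sketch assumes the two files have the same length and that bytes were only replaced, never inserted or deleted. The helper name count_differing_bytes is made up for illustration:

```shell
# count_differing_bytes is a hypothetical helper name.
count_differing_bytes() {
    # cmp -l lists every byte offset at which the two files differ, one per line.
    cmp -l "$1" "$2" | wc -l
}

# Demo with throwaway files: the bytes at positions 2 and 4 differ.
tmp=$(mktemp -d)
printf 'abcd\n' > "$tmp/f1"
printf 'aXcY\n' > "$tmp/f2"
echo "differing bytes: $(count_differing_bytes "$tmp/f1" "$tmp/f2")"
rm -rf "$tmp"
```

A single inserted byte shifts everything after it, so for files that differ by insertions or deletions the count becomes meaningless and a diff-based approach is the better fit.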