Fastest way to tell if two files have the same contents in Unix/Linux?

Tags: Linux, File, Unix, Diff

Linux Problem Overview


I have a shell script in which I need to check whether two files contain the same data or not. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.

Here's the line:

diff -q $dst $new > /dev/null

if ($status) then ...

Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?

Linux Solutions


Solution 1 - Linux

I believe cmp will stop at the first byte difference:

cmp --silent "$old" "$new" || echo "files are different"

Solution 2 - Linux

Like @Alex Howansky, I have used 'cmp --silent' for this. But I need both a positive and a negative response, so I use:

cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'

I can then run this in the terminal or over ssh to check files against a constant file.

Solution 3 - Linux

To quickly and safely compare any two files:

if cmp --silent -- "$FILE1" "$FILE2"; then
  echo "files contents are identical"
else
  echo "files differ"
fi

It's readable, efficient, and works for any file names, including ones containing ", `, spaces, or $().
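Since the question is about doing this for a lot of files, here is a minimal sketch of the same quoted cmp test inside a loop. The function name list_changed and its two directory arguments are hypothetical, made up for illustration; it assumes both directories hold files with matching names.

```shell
# list_changed NEW_DIR DST_DIR   (hypothetical helper, not from the answers above)
# Prints the name of every regular file in NEW_DIR whose contents differ
# from the same-named file in DST_DIR, or that is missing from DST_DIR.
list_changed() {
  new_dir=$1
  dst_dir=$2
  for new in "$new_dir"/*; do
    [ -f "$new" ] || continue          # skip directories and an unmatched glob
    name=${new##*/}                    # strip the directory part
    dst=$dst_dir/$name
    # A missing destination counts as changed; otherwise let cmp decide.
    if [ ! -f "$dst" ] || ! cmp -s -- "$new" "$dst"; then
      printf '%s\n' "$name"
    fi
  done
}
```

Calling list_changed ./incoming ./current would print one changed file name per line, which can then be fed to whatever copy or sync step the script needs.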

Solution 4 - Linux

Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.

But, if you are going to use the cmp command (and don't need/want to be verbose) you can just grab the exit status. Per the cmp man page:

> If a FILE is '-' or missing, read standard input. Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.

So, you could do something like:

cmp --silent -- "$FILE1" "$FILE2"
STATUS=$?  # "$?" holds the exit status of the cmp that just ran

if [[ $STATUS -ne 0 ]]; then  # non-zero: the files differ (1) or cmp hit trouble (2)
    echo "do a command on $FILE1"  # placeholder: put the real command here
else
    echo "do something else"       # placeholder for the identical case
fi

EDIT: Thanks for the comments everyone! I updated the test syntax here. However, I would suggest you use Vasili's answer if you are looking for something similar to this answer in readability, style, and syntax.
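The man page excerpt above lists three exit statuses, and a plain zero/non-zero test lumps "different" together with "trouble". A small sketch that keeps them apart might look like this; the function name compare_status and the words it prints are my own choices, not part of cmp.

```shell
# compare_status FILE1 FILE2   (hypothetical helper)
# Prints "identical", "different", or "error" depending on cmp's exit status.
compare_status() {
  status=0
  cmp -s -- "$1" "$2" 2>/dev/null || status=$?
  case $status in
    0) echo "identical" ;;
    1) echo "different" ;;
    *) echo "error" ;;     # 2 means trouble, e.g. a missing or unreadable file
  esac
}
```

This way a missing file is not silently treated as a content difference.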

Solution 5 - Linux

For files that are identical, any method has to read both files in full, even if that read happened in the past (for example, when a checksum was computed).

There is no alternative. So creating hashes or checksums at some point in time requires reading the whole file. Big files take time.

File metadata retrieval is much faster than reading a large file.

So, is there any file metadata you can use to establish that the files are different? File size? Or even the results of the file command, which reads only a small portion of the file?

File size example code fragment:

  ls -l -- "$1" "$2" |
  awk 'NR==1{a=$5} NR==2{b=$5}
       END{val=(a==b)?0 :1; exit( val) }'

[ $? -eq 0 ] && echo 'same size' || echo 'different'

If the files are the same size then you are stuck with full file reads.
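Putting the two steps together, a rough sketch could check the sizes first and only fall back to a full comparison when they match. It assumes GNU stat (the -c %s format prints the size in bytes); the function name same_contents is hypothetical.

```shell
# same_contents FILE1 FILE2   (hypothetical helper; assumes GNU stat)
# Returns 0 if the contents match, 1 if they differ, 2 if a size lookup fails.
same_contents() {
  size1=$(stat -c %s -- "$1" 2>/dev/null) || return 2
  size2=$(stat -c %s -- "$2" 2>/dev/null) || return 2
  # Different sizes: the contents cannot match, so skip the full read.
  [ "$size1" -eq "$size2" ] || return 1
  # Same size: only a byte-by-byte comparison can settle it.
  cmp -s -- "$1" "$2"
}
```

It can be used like cmp itself, e.g. same_contents "$dst" "$new" || echo "files differ".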

Solution 6 - Linux

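You can compare using a checksum algorithm such as SHA-256:

sha256sum oldFile > oldFile.sha256

echo "$(cut -d ' ' -f 1 oldFile.sha256)  newFile" | sha256sum --check

The checksum file stores the digest followed by the original file name, so only the digest field is reused for newFile. If the contents match, the result will be:

newFile: OK

If the files differ, the result will be: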

newFile: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
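Checksums pay off mainly when one reference file is compared against many others, because the reference only has to be hashed once. A minimal sketch of that idea, with hypothetical function names file_hash and matches_hash:

```shell
# file_hash FILE   (hypothetical helper)
# Prints only the SHA-256 digest of FILE. Reading from stdin keeps the
# file name out of sha256sum's output.
file_hash() {
  sha256sum < "$1" | awk '{print $1}'
}

# matches_hash HASH FILE   (hypothetical helper)
# Succeeds if FILE's digest equals the previously computed HASH.
matches_hash() {
  [ "$(file_hash "$2")" = "$1" ]
}

# Example use (file names are placeholders):
#   ref=$(file_hash constant.file)
#   matches_hash "$ref" candidate.file || echo "candidate.file differs"
```

For a one-off comparison of two files this is slower than cmp, since both files are always read in full.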

Solution 7 - Linux

Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:

[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ

real    0m0.008s
user    0m0.008s
sys     0m0.000s
diff false

real    0m0.009s
user    0m0.007s
sys     0m0.001s
cmp false
cp: overwrite ‘test.copy’? y

real    0m0.966s
user    0m0.447s
sys     0m0.518s
diff true

real    0m0.785s
user    0m0.211s
sys     0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh

I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So, if you want to use cmp -s to act on whether two files match:

identical (){
  echo "$1" and "$2" are the same.
  echo This is a function, you can put whatever you want in here.
}
different () {
  echo "$1" and "$2" are different.
  echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"
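One caveat with the A && B || C pattern: if the function after && returns a non-zero status, the one after || runs as well. A sketch using if/else avoids that; the handler names on_identical and on_different and the wrapper dispatch_on_compare are placeholders of my own.

```shell
# Hypothetical handlers; replace the bodies with real work.
on_identical() { echo "$1 and $2 are the same."; }
on_different() { echo "$1 and $2 are different."; }

# dispatch_on_compare FILE1 FILE2   (hypothetical helper)
# Runs exactly one handler, chosen only by cmp's result.
dispatch_on_compare() {
  if cmp -s -- "$1" "$2"; then
    on_identical "$1" "$2"
  else
    on_different "$1" "$2"
  fi
}
```

Here the else branch depends only on cmp, not on whether the first handler succeeded.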

Solution 8 - Linux

You can also try the cksum command:

chk1=$(cksum file1 | awk '{print $1}')  # field 1 is the CRC checksum
chk2=$(cksum file2 | awk '{print $1}')

if [ "$chk1" -eq "$chk2" ]
then
  echo "Files are identical"
else
  echo "Files are not identical"
fi

The cksum command prints a CRC checksum followed by the byte count of the file; the first field, used above, is the checksum. See 'man cksum'.
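Since cksum reports both values, a slightly stricter sketch can compare the checksum and the byte count together. The function names cksum_id and same_cksum are hypothetical.

```shell
# cksum_id FILE   (hypothetical helper)
# Prints "CRC SIZE" for FILE. Reading from stdin keeps the file name
# out of cksum's output.
cksum_id() {
  cksum < "$1" | awk '{print $1, $2}'
}

# same_cksum FILE1 FILE2   (hypothetical helper)
# Succeeds if both the checksum and the byte count match.
same_cksum() {
  [ "$(cksum_id "$1")" = "$(cksum_id "$2")" ]
}
```

The CRC is only 32 bits wide, so a match is strong evidence rather than proof; when it matters, confirm with cmp.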

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | JDS | View Question on Stackoverflow
Solution 1 - Linux | Alex Howansky | View Answer on Stackoverflow
Solution 2 - Linux | pn1 dude | View Answer on Stackoverflow
Solution 3 - Linux | VasiliNovikov | View Answer on Stackoverflow
Solution 4 - Linux | Gregory Martin | View Answer on Stackoverflow
Solution 5 - Linux | jim mcnamara | View Answer on Stackoverflow
Solution 6 - Linux | rafael prudencio cruz | View Answer on Stackoverflow
Solution 7 - Linux | Jack Simth | View Answer on Stackoverflow
Solution 8 - Linux | Nono Taps | View Answer on Stackoverflow