Comparing two files in the Linux terminal

Linux, Terminal, Diff, File Comparison

Linux Problem Overview


There are two files, "a.txt" and "b.txt", each containing a list of words. I want to find which words are in "a.txt" but not in "b.txt".

I need an efficient algorithm because I need to compare two dictionaries.

Linux Solutions


Solution 1 - Linux

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

Solution 2 - Linux

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding column. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax (process substitution, available in bash and similar shells) to sort the files on the fly; if they are already sorted, you don't need this.
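
As a rough illustration of the three columns, the sketch below builds two small word lists in a temporary directory (the words, the variable names and the temp directory are made up for the example). It sorts into files first, which should also work in shells that lack <(...), and it sets LC_ALL=C on the assumption that sort and comm must agree on the collation order.

```shell
#!/bin/sh
# Hypothetical sample data; the words and the temp directory are only for illustration.
export LC_ALL=C                      # make sort and comm use the same ordering
tmp=$(mktemp -d)
printf '%s\n' pear apple fig kiwi > "$tmp/a.txt"
printf '%s\n' kiwi apple lime     > "$tmp/b.txt"

# Sort once into files instead of using <(sort ...).
sort "$tmp/a.txt" > "$tmp/a.sorted"
sort "$tmp/b.txt" > "$tmp/b.sorted"

only_a=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")   # keep column 1: lines only in a.txt
only_b=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")   # keep column 2: lines only in b.txt
both=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted")     # keep column 3: lines in both files

printf 'only in a.txt:\n%s\n' "$only_a"
printf 'only in b.txt:\n%s\n' "$only_b"
printf 'in both:\n%s\n' "$both"
rm -rf "$tmp"
```

Each option removes a column, so -23 leaves only the first one, which is the answer to the original question.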

Solution 3 - Linux

If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:

git diff --no-index a.txt b.txt

Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach against some of the other answers here:

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.


Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

> This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.

Solution 4 - Linux

Try sdiff (man sdiff)

sdiff -s file1 file2

Solution 5 - Linux

You can use the diff tool in Linux to compare two files. The --changed-group-format and --unchanged-group-format options (GNU diff) let you filter the output down to the data you need.

Each of these options accepts one of the following three values to select what is printed for that group:

  • '%<' prints the lines from FILE1

  • '%>' prints the lines from FILE2

  • '' (empty string) prints nothing for that group.

> E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt

[root@vmoracle11 tmp]# cat file1.txt 
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt 
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt 
test two
test four
test eight
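
For the opposite direction, a minimal sketch using '%>' is shown below. It assumes GNU diff (these options are not part of POSIX diff) and recreates the two sample files from the session above in a temporary directory; the variable names are made up for the example.

```shell
#!/bin/sh
# Sample files copied from the session above; the temp directory is only for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$tmp/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$tmp/file2.txt"

# '%>' keeps the FILE2 lines of every differing group;
# the empty unchanged format drops lines common to both files.
# diff exits with status 1 when the files differ, hence the "|| true".
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$tmp/file1.txt" "$tmp/file2.txt" || true)

printf '%s\n' "$only_in_file2"
rm -rf "$tmp"
```

Keep in mind that diff compares the files line by line in order, so word lists that hold the same words in a different order will be reported as different unless both are sorted first.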

Solution 6 - Linux

You can also use colordiff, which displays the output of diff with colors.

About vimdiff: it allows you to compare files over SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

Solution 7 - Linux

Also, do not forget about mcdiff - Internal diff viewer of GNU Midnight Commander.

For example:

mcdiff file1 file2

Enjoy!

Solution 8 - Linux

Use comm -13 (requires sorted files) to list the lines that appear only in the second file:

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four

Solution 9 - Linux

Here is my solution:

mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
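
Applied directly to the question, the same grep flags can list the words that are in a.txt but not in b.txt in a single step. The sketch below is only an illustration: the word lists, the variable names and the temp directory are made up for the example.

```shell
#!/bin/sh
# Hypothetical word lists; names and contents are made up for the example.
tmp=$(mktemp -d)
printf '%s\n' colour color centre axb > "$tmp/a.txt"
printf '%s\n' color a.b cent         > "$tmp/b.txt"

# -F: treat patterns as fixed strings, so "a.b" does not match "axb"
# -x: a pattern must match the whole line, so "cent" does not match "centre"
# -v: print the lines that do NOT match any pattern
# -f: read the patterns (one per line) from b.txt
extra_in_a=$(grep -Fxvf "$tmp/b.txt" "$tmp/a.txt")

printf '%s\n' "$extra_in_a"
rm -rf "$tmp"
```

Unlike comm, this needs no sorting and keeps the words in the order in which they appear in a.txt.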

Solution 10 - Linux

You can also use:

sdiff file1 file2

This displays the differences side by side in your terminal.

Solution 11 - Linux

diff a.txt b.txt | grep '^<'

You can then pipe to cut for clean output:

diff a.txt b.txt | grep '^<' | cut -c 3-
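
The filtering and the prefix removal can also be done in one step with sed. The following is a sketch under the assumption of diff's default output format, where lines found only in the first file start with "< "; the sample files and variable names are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample files; b.txt contains a line with "<" in it on purpose.
tmp=$(mktemp -d)
printf '%s\n' alpha beta gamma    > "$tmp/a.txt"
printf '%s\n' alpha 'x < y' gamma > "$tmp/b.txt"

# Match the "< " marker only at the start of a line, strip it,
# and print just the lines where the substitution happened (-n ... p).
# The "> x < y" line from b.txt is not picked up because of the ^ anchor.
only_in_a=$(diff "$tmp/a.txt" "$tmp/b.txt" | sed -n 's/^< //p')

printf '%s\n' "$only_in_a"
rm -rf "$tmp"
```

As with any plain diff, the result depends on line order, so sort both files first if they are unordered word lists.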

Solution 12 - Linux

You can use awk for this. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

The awk:

$ awk '
NR==FNR {                    # while reading the first file (b.txt)
    seen[$0]                 # store each word as a key in the seen hash
    next                     # skip to the next line
}                            # from here on: a.txt (any file after the first)
!($0 in seen)' b.txt a.txt   # print the word if it is not in seen

Duplicates are output:

four
four

To avoid duplicates, add each newly met word in a.txt to the seen hash:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if word is not hashed to seen
    seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates 
    print                    # and output it
}' b.txt a.txt

Output:

four

If the word lists are comma-separated, like:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you have to do a couple of extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

Output this time:

four
five,six
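
One caveat with the NR==FNR test used in the scripts above: if the first file is empty, NR==FNR is also true while the second file is being read, so nothing is printed. A possible workaround, sketched here with made-up sample data and variable names, is to select the first file by name instead:

```shell
#!/bin/sh
# Hypothetical sample data: an empty b.txt and a short a.txt with a duplicate.
tmp=$(mktemp -d)
: > "$tmp/b.txt"
printf '%s\n' one two two > "$tmp/a.txt"

# FILENAME == ARGV[1] is true only while the first file argument is read,
# even when that file contains no lines at all.
extra=$(awk '
FILENAME == ARGV[1] { seen[$0]; next }    # remember every word of b.txt
!($0 in seen)       { seen[$0]; print }   # print each unseen a.txt word once
' "$tmp/b.txt" "$tmp/a.txt")

printf '%s\n' "$extra"
rm -rf "$tmp"
```

This variant assumes the two files are passed under different names, since the comparison is made on the file name as given on the command line.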

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Ali Imran | View Question on Stack Overflow
Solution 1 - Linux | Fengya Li | View Answer on Stack Overflow
Solution 2 - Linux | Anders Johansson | View Answer on Stack Overflow
Solution 3 - Linux | joelostblom | View Answer on Stack Overflow
Solution 4 - Linux | mudrii | View Answer on Stack Overflow
Solution 5 - Linux | Manjula | View Answer on Stack Overflow
Solution 6 - Linux | FindlinuxOne | View Answer on Stack Overflow
Solution 7 - Linux | Iurii Golskyi | View Answer on Stack Overflow
Solution 8 - Linux | Chris Seymour | View Answer on Stack Overflow
Solution 9 - Linux | Ali Imran | View Answer on Stack Overflow
Solution 10 - Linux | otto_ | View Answer on Stack Overflow
Solution 11 - Linux | chrisequalsdev | View Answer on Stack Overflow
Solution 12 - Linux | James Brown | View Answer on Stack Overflow