bash, Linux: Set difference between two text files

BashFile IoSet Difference

Bash Problem Overview


I have two files A-nodes_to_delete and B-nodes_to_keep. Each file has a many lines with numeric ids.

I want to have the list of numeric ids that are in nodes_to_delete but NOT in nodes_to_keep, e.g. alt text .

Doing it within a PostgreSQL database is unreasonably slow. Any neat way to do it in bash using Linux CLI tools?

UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq, sort and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.

Bash Solutions


Solution 1 - Bash

The comm command does that.

Solution 2 - Bash

Somebody showed me how to do exactly this in sh a couple months ago, and then I couldn't find it for a while... and while looking I stumbled onto your question. Here it is :

set_union () {
   sort $1 $2 | uniq
}

set_difference () {
   sort $1 $2 $2 | uniq -u
}

set_symmetric_difference() {
   sort $1 $2 | uniq -u
}

Solution 3 - Bash

Use comm - it will compare two sorted files line by line.

The short answer to your question

This command will return lines unique to deleteNodes, and not in keepNodes.

comm -1 -3 <(sort keepNodes) <(sort deleteNodes)

Example setup

Let's create the files named keepNodes and deleteNodes, and use them as unsorted input for the comm command.

$ cat > keepNodes <(echo bob; echo amber;)
$ cat > deleteNodes <(echo bob; echo ann;)

By default, running comm without arguments prints 3 columns with this layout:

lines_unique_to_FILE1
    lines_unique_to_FILE2
        lines_which_appear_in_both

Using our example files above, run comm without arguments. Note the three columns.

$ comm <(sort keepNodes) <(sort deleteNodes)
amber
    ann
        bob

Suppressing column output

Suppress column 1, 2 or 3 with -N; note that when a column is hidden, the whitespace shrinks up.

$ comm -1 <(sort keepNodes) <(sort deleteNodes)
ann
	bob
$ comm -2 <(sort keepNodes) <(sort deleteNodes)
amber
	bob
$ comm -3 <(sort keepNodes) <(sort deleteNodes)
amber
	ann
$ comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
ann
$ comm -2 -3 <(sort keepNodes) <(sort deleteNodes)
amber
$ comm -1 -2 <(sort keepNodes) <(sort deleteNodes)
bob

Sorting is important!

If you execute comm without first sorting the file, it fails gracefully with a message about which file is not sorted.

comm: file 1 is not in sorted order

Solution 4 - Bash

comm was specifically designed for this kind of use case, but it requires sorted input.

awk is arguably a better tool for this as it's fairly straight forward to find set difference, doesn't require sort, and offers additional flexibility.

awk 'NR == FNR { a[$0]; next } !($0 in a)' nodes_to_keep nodes_to_delete

Perhaps, for example, you'd like to only find the difference in lines that represent non-negative numbers:

awk -v r='^[0-9]+$' 'NR == FNR && $0 ~ r {
    a[$0]
    next
} $0 ~ r && !($0 in a)' nodes_to_keep nodes_to_delete

Solution 5 - Bash

Maybe you need a better way to do it in postgres, I can pretty much bet that you won't find a faster way to do it using flat files. You should be able to do a simple inner join and assuming that both id cols are indexed that should be very fast.

Solution 6 - Bash

Another portable solution, which also works in case of multisets, a set which allows multiple instance of an element, is to use grep with patterns in a separate file:

grep -Fvx -f B A

The parameters:

  • -f: a file containing a list of patterns, one by line
  • -F: treat the patterns as string, not regex
  • -x: match whole lines in A-nodes_to_delete
  • -v: invert the matching (match if does not match)

If the patterns in B do not match a line in A, the command outputs the line otherwise nothing.

A nice feature of this solution is that it is possible to make it work with multi-column files (for A) while comm and uniq -u solutions require one column files.

Solution 7 - Bash

So, this is slightly different from the other answers. I can't say that a C++ compiler is exactly a "Linux CLI tool", but running g++ -O3 -march=native -o set_diff main.cpp (with the below code in main.cpp can do the trick):

#include<algorithm>
#include<iostream>
#include<iterator>
#include<fstream>
#include<string>
#include<unordered_set>

using namespace std;

int main(int argc, char** argv) {
	ifstream keep_file(argv[1]), del_file(argv[2]);
	unordered_multiset<string> init_lines{istream_iterator<string>(keep_file), istream_iterator<string>()};
	string line;
	while (getline(del_file, line)) {
		init_lines.erase(line);
	}
	copy(init_lines.begin(),init_lines.end(), ostream_iterator<string>(cout, "\n"));
}

To use, simply run set_diff B A (not A B, since B is nodes_to_keep) and the resulting difference will be printed to stdout.

Note that I've forgone a few C++ best practices to keep the code simpler.

Many additional speed optimizations could be made (at the price of more memory). mmap would also be particularly useful for large data sets, but that'd make the code much more involved.

Since you mentioned that the data sets are large, I thought that reading nodes_to_delete a line at a time might be a good idea to reduce memory consumption. The approach taken in the code above isn't particularly efficient if there are lots of dupes in your nodes_to_delete. Also, order is not preserved.


Something easier to copy and paste into bash (i.e. skipping creation of main.cpp):

g++ -O3 -march=native -xc++ -o set_diff - <<EOF
#include<algorithm>
#include<iostream>
#include<iterator>
#include<fstream>
#include<string>
#include<unordered_set>

using namespace std;

int main(int argc, char** argv) {
        ifstream keep_file(argv[1]), del_file(argv[2]);
        unordered_multiset<string> init_lines{istream_iterator<string>(keep_file), istream_iterator<string>()};
        string line;
        while (getline(del_file, line)) {
                init_lines.erase(line);
        }
        copy(init_lines.begin(),init_lines.end(), ostream_iterator<string>(cout, "\n"));
}
EOF

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAdam MatanView Question on Stackoverflow
Solution 1 - BashmswView Answer on Stackoverflow
Solution 2 - BashslinkpView Answer on Stackoverflow
Solution 3 - BashactivedecayView Answer on Stackoverflow
Solution 4 - BashJohn BView Answer on Stackoverflow
Solution 5 - BashDark CastleView Answer on Stackoverflow
Solution 6 - BashAlenversView Answer on Stackoverflow
Solution 7 - BashYenForYangView Answer on Stackoverflow