C - How to implement Set data structure?

C Problem Overview

Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements in a set will be of the same type and there is a huge RAM memory.

As I know, for integers it can be done really fast'N'easy using value-indexed arrays. But I'd like to have a very general Set data type. And it would be nice if a set could include itself.

C Solutions

Solution 1 - C

There are multiple ways of implementing set (and map) functionality, for example:

tree-based approach (ordered traversal)
hash-based approach (unordered traversal)

Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.

Beware of the advantages and disadvantages of hash-based vs. tree-based approaches.

You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:

all hashables in a bucket have the same hash value
a bucket can be implemented as a dynamic array or linked list of hashables
a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)

With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.

You would have to implement:

the hash function for the type being hashed
an equality function for the type being used to test whether two hashables are equal or not
the hash-set contains/insert/remove functionality.

You can also use open addressing as an alternative to maintaining and managing buckets.

Solution 2 - C

Sets are usually implemented as some variety of a http://en.wikipedia.org/wiki/Binary_tree">binary tree. http://en.wikipedia.org/wiki/Red_black_tree">Red black trees have good worst case performance.

These can also be used to build an http://en.wikipedia.org/wiki/Associative_array">map</a> to allow key / value lookups.

This approach requires some sort of ordering on the elements of the set and the key values in a map.

I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.

Solution 3 - C

If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).

Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.

This approach will probably require some sort of other data structure (or function) to translate from the member data type to the position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal,compelment) very very easy. And it is only suitable for relatively small sets, you wouldn't want to use it for sets of 32-bit integers I don't suppose.

Solution 4 - C

The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.

The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get into large enough sizes to compete with pointers this has a very small probability of failing.

Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.

(There's similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)

Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.

Content Type	Original Author	Original Content on Stackoverflow
Question	psihodelia	View Question on Stackoverflow
Solution 1 - C	vladr	View Answer on Stackoverflow
Solution 2 - C	andand	View Answer on Stackoverflow
Solution 3 - C	High Performance Mark	View Answer on Stackoverflow
Solution 4 - C	David Thornley	View Answer on Stackoverflow