what is the difference between set and unordered_set in C++?

C++AlgorithmData StructuresC++11

C++ Problem Overview


I came across this good question, which is similar but not at all same since it talks about Java, which has different implementation of hash-tables, by virtue of having synchronized accessor /mutators: https://stackoverflow.com/questions/40471/differences-between-hashmap-and-hashtable

So what is the difference in C++ implementation of set and unordered_set? This question can be of course extended to map vs unordered_map and so on for other C++ containers.

Here is my initial assessment:

set: While the standard doesn't explicitly ask it to be implemented as trees, the time-complexity constraint asked for its operations for find/insert, means it will always be implemented as a tree. Usually as RB tree (as seen in GCC 4.8), which is height-balanced. Since they are height balanced, they have predictable time-complexity for find()

Pros: Compact (compared to other DS in comparison)

Con: Access time complexity is O(lg n)

unordered_set: While the standard doesn't explicitly asks it to be implemented as trees, the time-complexity constraint asked for its operations for find/insert, means it will always be implemented as a hash-table.

Pros:

  1. Faster (promises amortized O(1) for search)
  2. Easy to convert basic primitives to thread-safe, as compared to tree-DS

Cons:

  1. Look up not guaranteed to be O(1). Theoretical worst case is O(n).
  2. Not as compact as tree (for practical purposes load factors is never 1).

Note: The O(1), for hashtable comes from the assumption that there are no collision. Even with load-factor of .5, every second variable insertion is leading to collision. It could be observed that the load-factor of hash-table is inversely proportional to the number of operations required for accessing a element in it. More we reduce #operations, sparser hash-table. When the element stored are of size comparable to pointer, then overhead is quite significant.

Did I miss any difference between map/set for performance analysis that one should know?

C++ Solutions


Solution 1 - C++

I think you've generally answered your own question, however, this:

> Not as compact as tree. (for practical purposes load factors is never 1)

is not necessarily true. Each node of a tree (we'll assume it's a red-black tree) for a type T utilizes space that is equal to at least 2 * pointer_size + sizeof(T) + sizeof(bool). This may be 3 * pointer size depending on whether the tree contains a parent pointer for each tree node.

Compare this to a hash-map: there will be wasted array space for each hash map due to the fact that load factor < 1 as you've said. However, assuming the hash map uses singly linked lists for chaining (and really, there's no real reason not to), each element inserted take only sizeof(T) + pointer size.

Note that this analysis ignores any overhead which may come from extra space used by alignment.

For any element T which has a small size (so, any basic type), the size of the pointers and other overhead dominates. At a load factor of > 0.5 (for example) the std::unordered_set may indeed use up less memory than the equivalent std::set.

The other big missing point is the fact that iterating through a std::set is guaranteed to produce an ordering from smallest to largest, based on the given comparison function, while iterating through an std::unordered_set will return the values in a "random" order.

Solution 2 - C++

Another difference (though not performance-related) is that set insertion doesn't invalidate iterators, while unordered_set insertion can if it triggers a rehash. In practice it's a pretty minor concern, since references to the actual elements remain valid.

Solution 3 - C++

Yuushi addresses spatial efficiency and other points well already; just a few other parts of the question I'll comment on...

> The O(1), for hashtable comes from the assumption that there are no collision.

That's not true. What O(1) means is not that the first lookup attempt will always succeed, it's that there is - on average - a constant number of attempts needed, rather than something that grows as the number of values grows. For example, with an unordered_set or ..._map, the max_load_factor defaults to 1.0 on construction, and if load factor approaches that with a good hash function, the average number of elements that hash to any one bucket will be around 2 regardless of how many values are in the table.

> Even with load-factor of .5, every second variable insertion is leading to collision.

True, but it doesn't get as dire as you might intuitively expect: that average chain length of 2 at 1.0 load factor's not bad.

> It could be observed that the load-factor of hash-table is inversely > proportional to the number of operations required for accessing a > element in it. More we reduce #operations, sparser hash-table.

There's definitely a correlation (it's not inverse).

Solution 4 - C++

In some case set is more convenient.

For example using vector as key:

set<vector<int>> s;
s.insert({1, 2});
s.insert({1, 3});
s.insert({1, 2});

for(const auto& vec:s)
    cout<<vec<<endl;   // I have override << for vector
// 1 2
// 1 3 

The reason why vector<int> can be in set because vector override operator<.

But if you use unordered_set<vector<int>> you have to create a hash function for vector<int>, because vector does't have a hash function, so you have to define one like:

struct VectorHash {
    size_t operator()(const std::vector<int>& v) const {
        std::hash<int> hasher;
        size_t seed = 0;
        for (int i : v) {
            seed ^= hasher(i) + 0x9e3779b9 + (seed<<6) + (seed>>2);
        }
        return seed;
    }
};

vector<vector<int>> two(){
    //unordered_set<vector<int>> s; // error vector<int> doesn't  have hash function
    unordered_set<vector<int>, VectorHash> s;
    s.insert({1, 2});
    s.insert({1, 3});
    s.insert({1, 2});

    for(const auto& vec:s)
        cout<<vec<<endl;
    // 1 2
    // 1 3
}

you can see that in some case unordered_set is more complicated.

Mainly cited from: https://stackoverflow.com/a/29855973/6329006

More difference between unordered_set and set see this: https://stackoverflow.com/a/52203931/6329006

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAjeet GangaView Question on Stackoverflow
Solution 1 - C++YuushiView Answer on Stackoverflow
Solution 2 - C++dhaffeyView Answer on Stackoverflow
Solution 3 - C++Tony DelroyView Answer on Stackoverflow
Solution 4 - C++JayhelloView Answer on Stackoverflow