What is the underlying data structure of a STL set in C++?

C++ Problem Overview

I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?

I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?

Also, how does insert() work for a set? How does the set check whether an element already exists in it?

I read on wikipedia that another way to implement a set is with a hash table. How would this work?

C++ Solutions

Solution 1 - C++

Step debug into g++ 6.4 stdlibc++ source

Did you know that on Ubuntu's 16.04 default g++-6 package or a GCC 6.4 build from source you can step into the C++ library without any further setup?

By doing that we easily conclude that a Red-black tree used in this implementation.

This makes sense, since std::set can be traversed in order, which would not be efficient in if a hash map were used.

main.cpp

#include <cassert>
#include <set>

int main() {
    std::set<int> s;
    s.insert(1);
    s.insert(2);
    assert(s.find(1) != s.end());
    assert(s.find(2) != s.end());
    assert(s.find(3) == s3.end());
}

Compile and debug:

g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out

Now, if you step into s.insert(1) you immediately reach /usr/include/c++/6/bits/stl_set.h:

487 #if __cplusplus >= 201103L
488       std::pair<iterator, bool>
489       insert(value_type&& __x)
490       {
491     std::pair<typename _Rep_type::iterator, bool> __p =
492       _M_t._M_insert_unique(std::move(__x));
493     return std::pair<iterator, bool>(__p.first, __p.second);
494       }
495 #endif

which clearly just forwards to _M_t._M_insert_unique.

So we open the source file in vim and find the definition of _M_t:

      typedef _Rb_tree<key_type, value_type, _Identity<value_type>,
           key_compare, _Key_alloc_type> _Rep_type;
       _Rep_type _M_t;  // Red-black tree representing set.

So _M_t is of type _Rep_type and _Rep_type is a _Rb_tree.

OK, now that is enough evidence for me. If you don't believe that _Rb_tree is a Black-red tree, step a bit further and read the algorithm.

unordered_set uses hash table

Same procedure, but replace set with unordered_set on the code.

This makes sense, since std::unordered_set cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.

Stepping into insert leads to /usr/include/c++/6/bits/unordered_set.h:

415       std::pair<iterator, bool>
416       insert(value_type&& __x)
417       { return _M_h.insert(std::move(__x)); }

So we open the source file in vim and search for _M_h:

      typedef __uset_hashtable<_Value, _Hash, _Pred, _Alloc>  _Hashtable;
      _Hashtable _M_h;

So hash table it is.

std::map and std::unordered_map

Analogous for std::set vs std:unordered_set: https://stackoverflow.com/questions/18414579/stdmap-what-data-structure-is-inside/51945119#51945119

Performance characteristics

You could also infer the data structure used by timing them:

Graph generation procedure and Heap vs BST analysis and at: https://stackoverflow.com/questions/6147242/heap-vs-binary-search-tree-bst/29548834#29548834

We clearly see for:

std::set, a logarithmic insertion time
std::unordered_set, a more complex hashmap pattern:
- on the non-zoomed plot, we clearly see the backing dynamic array doubling on huge one off linearly increasing spikes
- on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::map, except for very small map sizes
  
  Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
  
  I believe this is due to average linearly increasing linked list walks withing each bin. Then when the array doubles, we have more bins, so shorter walks.

Solution 2 - C++

As KTC said, how std::set is implemented can vary -- the C++ standard simply specifies an abstract data type. In other words, the standard does not specify how a container should be implemented, just what operations it is required to support. However, most implementations of the STL do, as far as I am aware, use red-black trees or other balanced binary search trees of some kind (GNU libstdc++, for instance, uses red-black trees).

While you could theoretically implement a set as a hash table and get faster asymptotic performance (amortized O(key length) versus O(log n) for lookup and insert), that would require having the user supply a hash function for whatever type they wanted to store (see [Wikipedia's entry on hash tables][2] for a good explanation of how they work). As for an implementation of a binary search tree, you wouldn't want to use an array -- as Raul mentioned, you would want some kind of Node data structure.

[2]: http://en.wikipedia.org/wiki/Hash_table "Wikipedia's entry on hash tables"

Solution 3 - C++

You could implement a binary search tree by first defining a Node struct:

struct Node
{
  void *nodeData;
  Node *leftChild;
  Node *rightChild;
}

Then, you could define a root of the tree with another Node *rootNode;

The Wikipedia entry on Binary Search Tree has a pretty good example of how to implement an insert method, so I would also recommend checking that out.

In terms of duplicates, they are generally not allowed in sets, so you could either just discard that input, throw an exception, etc, depending on your specification.

Solution 4 - C++

> I understand STL sets are based on the abstract data structure of a binary search tree. > So what is the underlying data structure? An array?

As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.) though but there may be other implementations that exist.

> Also, how does insert() work for a > set?

It depends on the underlying implementation of your set. If it is implemented as a binary tree, Wikipedia has a sample recursive implementation for the insert() function. You may want to check it out.

> How does the set check whether an > element already exists in it?

If it is implemented as a tree, then it traverses the tree and check each element. However, sets do not allow duplicate elements to be stored though. If you want a set that allows duplicate elements, then you need a multiset.

> I read on wikipedia that another way > to implement a set is with a hash > table. How would this work?

You may be referring to a hash_set, where the set is implemented using hash tables. You'll need to provide a hash function to know which location to store your element. This implementation is ideal when you want to be able to search for an element quickly. However, if it is important for your elements to be stored in particular order, then the tree implementation is more appropriate since you can traverse it preorder, inorder or postorder.

Solution 5 - C++

How a particular container is implemented in C++ is entirely implementation dependent. All that is required is for the result to meet the requirements set out in the standard, such as complexity requirement for the various methods, iterators requirements etc.

Solution 6 - C++

cppreference says:

> Sets are usually implemented as red-black trees.

I checked, and both libc++ and libstdc++ do use red-black trees for std::set.

std::unordered_set was implemented with a hash table in libc++ and I presume the same for libstdc++ but didn't check.

Edit: Apparently my word is not good enough.

libc++: 1 2
libstdc++: 1

Solution 7 - C++

Chiming in on this, because I did not see anyone explicitly mention it... The C++ standard does not specify the data structure to use for std::set and std::map. What it does however specify is the run-time complexity of various operations. The requirements on computational complexity for the insert, delete and find operations more-or-less force an implementation to use a balanced tree algorithm.

There are two common algorithms for implementing balanced binary trees: Red-Black and AVL. Of the two, Red-Black is a little bit simpler of an implementation, requiring 1 less bit of storage per tree node (which hardly matters, since you're going to burn a byte on it in a simple implementation anyway), and is a little bit faster than AVL on node deletions (this is due to a more relaxed requirement on the balancing of the tree).

All of this, combined with std::map's requirement that the key & datum are stored in an std::pair force this all upon you without explicitly naming the data structure you must use for the container.

This all, in turn is compounded by the c++14/17 supplemental features to the container that allow splicing of nodes from one tree into another.

Content Type	Original Author	Original Content on Stackoverflow
Question	zebraman	View Question on Stackoverflow
Solution 1 - C++	Ciro Santilli Путлер Капут 六四事	View Answer on Stackoverflow
Solution 2 - C++	Toli	View Answer on Stackoverflow
Solution 3 - C++	Raul Agrait	View Answer on Stackoverflow
Solution 4 - C++	jasonline	View Answer on Stackoverflow
Solution 5 - C++	KTC	View Answer on Stackoverflow
Solution 6 - C++	Timmmm	View Answer on Stackoverflow
Solution 7 - C++	jschmerge	View Answer on Stackoverflow