Compelling examples of custom C++ allocators?

Tags: C++, Memory Management, Std, Memory Alignment, Allocator

C++ Problem Overview


What are some really good reasons to ditch std::allocator in favor of a custom solution? Have you run across any situations where it was absolutely necessary for correctness, performance, scalability, etc? Any really clever examples?

Custom allocators have always been a feature of the Standard Library that I haven't had much need for. I was just wondering if anyone here on SO could provide some compelling examples to justify their existence.

C++ Solutions


Solution 1 - C++

As I mention here, I've seen Intel TBB's custom STL allocator significantly improve performance of a multithreaded app simply by changing a single

std::vector<T>

to

std::vector<T,tbb::scalable_allocator<T> >

(this is a quick and convenient way of switching the allocator to use TBB's nifty thread-private heaps; see page 7 in this document)

Solution 2 - C++

One area where custom allocators can be useful is game development, especially on game consoles, as they have only a small amount of memory and no swap. On such systems you want to make sure that you have tight control over each subsystem, so that one uncritical system can't steal the memory from a critical one. Other things like pool allocators can help to reduce memory fragmentation. You can find a long, detailed paper on the topic at:
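The budgeting idea above can be sketched as a fixed-capacity pool that reserves its whole budget up front: a subsystem that allocates only from its own pool can never starve another subsystem, and recycling blocks of one size avoids fragmenting the global heap. The class name and sizes below are illustrative, not taken from EASTL:

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-capacity pool: all memory is reserved once, up front,
// so this subsystem can never grow past its budget at runtime.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size), storage_(block_size * block_count) {
        // Thread every block onto a free list.
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(storage_.data() + i * block_size);
    }
    void* allocate(std::size_t bytes) {
        if (bytes > block_size_ || free_list_.empty()) throw std::bad_alloc();
        void* p = free_list_.back();   // hand out the most recently freed block
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_list_.push_back(static_cast<char*>(p)); }
    std::size_t free_blocks() const { return free_list_.size(); }
private:
    std::size_t block_size_;
    std::vector<char> storage_;        // the whole budget, allocated once
    std::vector<char*> free_list_;
};
```

A real game-console pool would sit on statically reserved memory rather than a `std::vector`, but the invariant is the same: running out raises an error inside this subsystem instead of silently eating another subsystem's memory.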

EASTL -- Electronic Arts Standard Template Library

Solution 3 - C++

I am working on an mmap-allocator that allows vectors to use memory from a memory-mapped file. The goal is to have vectors whose storage lives directly in the virtual memory mapped by mmap. Our problem is to improve the reading of really large files (>10GB) into memory with no copy overhead, which is why I need this custom allocator.

So far I have the skeleton of a custom allocator (which derives from std::allocator); I think it is a good starting point for writing your own allocators. Feel free to use this piece of code in whatever way you want:

#include <memory>
#include <stdio.h>

namespace mmap_allocator_namespace
{
        // See StackOverflow replies to this answer for important commentary about inheriting from std::allocator before replicating this code.
        template <typename T>
        class mmap_allocator: public std::allocator<T>
        {
public:
                typedef size_t size_type;
                typedef T* pointer;
                typedef const T* const_pointer;

                template<typename _Tp1>
                struct rebind
                {
                        typedef mmap_allocator<_Tp1> other;
                };

                pointer allocate(size_type n, const void *hint=0)
                {
fprintf(stderr, "Alloc %zu bytes.\n", n*sizeof(T));
                        return std::allocator<T>::allocate(n, hint);
                }

                void deallocate(pointer p, size_type n)
                {
fprintf(stderr, "Dealloc %zu bytes (%p).\n", n*sizeof(T), (void*)p);
                        return std::allocator<T>::deallocate(p, n);
                }

                mmap_allocator() throw(): std::allocator<T>() { fprintf(stderr, "Hello allocator!\n"); }
                mmap_allocator(const mmap_allocator &a) throw(): std::allocator<T>(a) { }
                template <class U>                    
                mmap_allocator(const mmap_allocator<U> &a) throw(): std::allocator<T>(a) { }
                ~mmap_allocator() throw() { }
        };
}

To use this, declare an STL container as follows:

using namespace std;
using namespace mmap_allocator_namespace;

vector<int, mmap_allocator<int> > int_vec(1024, 0, mmap_allocator<int>());

It can be used, for example, to log whenever memory is allocated. The rebind struct is necessary; otherwise the vector container uses the superclass's allocate/deallocate methods.

Update: The memory mapping allocator is now available at https://github.com/johannesthoma/mmap_allocator and is LGPL. Feel free to use it for your projects.

Solution 4 - C++

I'm working with a MySQL storage engine that uses C++ for its code. We're using a custom allocator to use the MySQL memory system rather than competing with MySQL for memory. It allows us to make sure we're using memory as the user configured MySQL to use, and not "extra".
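A minimal sketch of this routing pattern, with hypothetical `host_alloc`/`host_free` stand-ins for the host's real memory API (MySQL's actual functions are not shown here; these wrappers just count bytes so the example is self-contained):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-ins for a host application's memory API. In a real
// storage engine these would call the host's allocation functions so the
// memory is charged against the host's configured budget.
static std::size_t g_host_bytes = 0;
void* host_alloc(std::size_t n) {
    void* p = std::malloc(n);
    if (!p) throw std::bad_alloc();
    g_host_bytes += n;                 // host-side accounting
    return p;
}
void host_free(void* p) { std::free(p); }

// Allocator that routes all STL container memory through the host's API.
template <class T>
struct HostAllocator {
    using value_type = T;
    HostAllocator() noexcept = default;
    template <class U> HostAllocator(const HostAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(host_alloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { host_free(p); }
};
template <class T, class U>
bool operator==(const HostAllocator<T>&, const HostAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const HostAllocator<T>&, const HostAllocator<U>&) noexcept { return false; }
```

With this in place, `std::vector<int, HostAllocator<int>>` draws from the host's accounting rather than the plain global heap.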

Solution 5 - C++

It can be useful to use custom allocators to use a memory pool instead of the heap. That's one example among many others.

For most cases, this is certainly a premature optimization. But it can be very useful in certain contexts (embedded devices, games, etc).

Solution 6 - C++

When working with GPUs or other co-processors it is sometimes beneficial to allocate data structures in main memory in a special way. This special way of allocating memory can be implemented in a custom allocator in a convenient fashion.

The reason why custom allocation through the accelerator runtime can be beneficial when using accelerators is the following:

  1. through custom allocation the accelerator runtime or driver is notified of the memory block
  2. in addition the operating system can make sure that the allocated block of memory is page-locked (some call this pinned memory), that is, the virtual memory subsystem of the operating system may not move or remove the page within or from memory
  3. if 1. and 2. hold and a data transfer between a page-locked memory block and an accelerator is requested, the runtime can directly access the data in main memory since it knows where it is and it can be sure the operating system did not move/remove it
  4. this saves one memory copy that would occur with memory allocated in a non-page-locked way: the data first has to be copied in main memory to a page-locked staging area from which the accelerator can initiate the data transfer (through DMA)

Solution 7 - C++

I haven't written C++ code with a custom STL allocator, but I can imagine a webserver written in C++ that uses a custom allocator for automatic deletion of the temporary data needed to respond to an HTTP request. The custom allocator can free all temporary data at once, once the response has been generated.
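C++17's polymorphic allocators make this "free everything at once" pattern nearly a one-liner. The sketch below imagines a request handler (the handler and its contents are illustrative, not from the answer): all per-request temporaries draw from one monotonic arena, nothing is freed individually, and the whole arena is reclaimed when the resource goes out of scope.

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical request handler: every temporary allocation for this response
// comes from a single monotonic arena and is released wholesale at the end.
std::size_t handle_request() {
    char buffer[4096];  // request-scoped scratch space (could be heap-backed)
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof buffer);

    // Temporaries built while generating one response.
    std::pmr::vector<std::pmr::string> headers(&arena);
    headers.emplace_back("Content-Type: text/html");
    headers.emplace_back("Connection: close");
    return headers.size();
}   // arena destroyed here: all temporaries reclaimed in one shot
```

Before `std::pmr`, the same effect required writing the arena and the allocator adapter by hand.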

Another possible use case for a custom allocator (which I have used) is writing a unit test to prove that a function's behavior doesn't depend on some part of its input. The custom allocator can fill up the memory region with any pattern.
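A sketch of such a test allocator: every freshly allocated byte is stamped with a deliberate garbage pattern (0xCD is an arbitrary choice here). If a function's result changes when the pattern changes, it was reading memory it never initialized.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Test-only allocator: stamps freshly allocated memory with a fixed pattern
// so that reads of uninitialized memory produce a recognizable value.
template <class T>
struct FillAllocator {
    using value_type = T;
    unsigned char pattern = 0xCD;   // arbitrary "garbage" byte
    FillAllocator() noexcept = default;
    template <class U>
    FillAllocator(const FillAllocator<U>& o) noexcept : pattern(o.pattern) {}
    T* allocate(std::size_t n) {
        void* p = ::operator new(n * sizeof(T));
        std::memset(p, pattern, n * sizeof(T));   // stamp before handing out
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const FillAllocator<T>&, const FillAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const FillAllocator<T>&, const FillAllocator<U>&) noexcept { return false; }
```

Running the test suite twice with two different patterns (say 0xCD and 0x00) and comparing results is one cheap way to flush out this class of bug.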

Solution 8 - C++

I'm using custom allocators here; you might even say it was to work around other custom dynamic memory management.

Background: we have overloads for malloc, calloc, free, and the various variants of operator new and delete, and the linker happily makes STL use these for us. This lets us do things like automatic small object pooling, leak detection, alloc fill, free fill, padding allocation with sentries, cache-line alignment for certain allocs, and delayed free.

The problem is, we're running in an embedded environment -- there isn't enough memory around to actually do leak detection accounting properly over an extended period. At least, not in the standard RAM -- there's another heap of RAM available elsewhere, through custom allocation functions.

Solution: write a custom allocator that uses the extended heap, and use it only in the internals of the memory leak tracking architecture... Everything else defaults to the normal new/delete overloads that do leak tracking. This avoids the tracker tracking itself (and provides a bit of extra packing functionality too, since we know the size of tracker nodes).

We also use this to keep function cost profiling data, for the same reason; writing an entry for each function call and return, as well as thread switches, can get expensive fast. Custom allocator again gives us smaller allocs in a larger debug memory area.

Solution 9 - C++

I am using a custom allocator for counting the number of allocations/deallocations in one part of my program and measuring how long it takes. There are other ways this could be achieved but this method is very convenient for me. It is especially useful that I can use the custom allocator for only a subset of my containers.
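One way such a counting allocator can look (a sketch, not the answerer's actual code): shared counters record every allocate/deallocate call, and only the containers that opt in are instrumented.

```cpp
#include <cstddef>
#include <new>
#include <vector>

struct AllocStats { std::size_t allocs = 0, deallocs = 0, bytes = 0; };

// Allocator that counts calls; rebound copies share the same stats object,
// so a container and its internal nodes all report to one place.
template <class T>
struct CountingAllocator {
    using value_type = T;
    AllocStats* stats;
    explicit CountingAllocator(AllocStats* s) noexcept : stats(s) {}
    template <class U>
    CountingAllocator(const CountingAllocator<U>& o) noexcept : stats(o.stats) {}
    T* allocate(std::size_t n) {
        ++stats->allocs;
        stats->bytes += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ++stats->deallocs;
        ::operator delete(p);
    }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>& a, const CountingAllocator<U>& b) noexcept { return a.stats == b.stats; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>& a, const CountingAllocator<U>& b) noexcept { return !(a == b); }
```

Because the allocator is a template parameter of the container, instrumentation is per-container: a `std::vector<int, CountingAllocator<int>>` is counted while every other vector in the program is untouched.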

Solution 10 - C++

One essential situation: When writing code that must work across module (EXE/DLL) boundaries, it is essential to keep your allocations and deletions happening in only one module.

Where I ran into this was a plugin architecture on Windows. It is essential that, for example, if you pass a std::string across the DLL boundary, any reallocations of the string occur from the heap where it originated, NOT the heap in the DLL, which may be different*.

*It's more complicated than this actually: if you are dynamically linking to the CRT, this might work anyway. But if each DLL statically links the CRT, you are heading into a world of pain, where phantom allocation errors continually occur.

Solution 11 - C++

Obligatory link to Andrei Alexandrescu's CppCon 2015 talk on allocators:

https://www.youtube.com/watch?v=LIb3L4vKZ7U

The nice thing is that just devising them makes you think of ideas of how you would use them :-)

Solution 12 - C++

A custom allocator is a reasonable way to securely erase memory before it is deallocated.

#include <cstddef>
#include <new>
#include <openssl/crypto.h>  // for OPENSSL_cleanse

template <class T>
class allocator
{
public:
    using value_type    = T;

    allocator() noexcept {}
    template <class U> allocator(allocator<U> const&) noexcept {}

    value_type*  // Use pointer if pointer is not a value_type*
    allocate(std::size_t n)
    {
        return static_cast<value_type*>(::operator new (n*sizeof(value_type)));
    }

    void
    deallocate(value_type* p, std::size_t n) noexcept  // Use pointer if pointer is not a value_type*
    {
        OPENSSL_cleanse(p, n * sizeof(value_type));  // n is the element count, so scrub n * sizeof(value_type) bytes
        ::operator delete(p);
    }
};
template <class T, class U>
bool
operator==(allocator<T> const&, allocator<U> const&) noexcept
{
    return true;
}
template <class T, class U>
bool
operator!=(allocator<T> const& x, allocator<U> const& y) noexcept
{
    return !(x == y);
}

Recommend using Hinnant's allocator boilerplate: https://howardhinnant.github.io/allocator_boilerplate.html

Solution 13 - C++

One example of a time I have used these was working with very resource-constrained embedded systems. Let's say you have 2k of RAM free and your program has to use some of that memory. You need to store, say, 4-5 sequences somewhere that's not on the stack, and additionally you need very precise control over where these things get stored. This is a situation where you might want to write your own allocator. The default implementations can fragment the memory, which might be unacceptable if you don't have enough memory and cannot restart your program.

One project I was working on was using AVR-GCC on some low-powered chips. We had to store 8 sequences of variable length but with a known maximum. The standard library implementation of the memory management is a thin wrapper around malloc/free, which keeps track of where to place items by prepending every allocated block of memory with a pointer to just past the end of that allocated piece of memory. When allocating a new piece of memory, the standard allocator has to walk over each of the pieces of memory to find the next block that is available where the requested size of memory will fit. On a desktop platform this would be very fast for this few items, but you have to keep in mind that some of these microcontrollers are very slow and primitive in comparison. Additionally, the memory fragmentation issue was a massive problem, which meant we really had no choice but to take a different approach.

So what we did was to implement our own memory pool. Each block of memory was big enough to fit the largest sequence we would need in it. This allocated fixed-sized blocks of memory ahead of time and marked which blocks of memory were currently in use. We did this by keeping one 8-bit integer where each bit represented whether a certain block was used. We traded off memory usage here to make the whole process faster, which in our case was justified as we were pushing this microcontroller chip close to its maximum processing capacity.
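The scheme described above can be sketched like this; the slot size and count here are illustrative desktop values rather than the original AVR ones:

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-size slots with a single 8-bit mask recording which slots are in use:
// allocation is a bit scan, deallocation clears one bit, and the memory
// layout is fully known at compile time.
constexpr std::size_t kSlotSize = 32;  // big enough for the longest sequence
constexpr std::size_t kSlots = 8;      // one bit per slot in a uint8_t

class BitmapPool {
public:
    void* allocate() {
        for (std::size_t i = 0; i < kSlots; ++i) {
            std::uint8_t bit = std::uint8_t(1u << i);
            if (!(used_ & bit)) {          // first free slot wins
                used_ |= bit;
                return slots_ + i * kSlotSize;
            }
        }
        return nullptr;                    // pool exhausted: caller must handle it
    }
    void deallocate(void* p) {
        std::size_t i = (static_cast<char*>(p) - slots_) / kSlotSize;
        used_ &= std::uint8_t(~(1u << i)); // clear the slot's bit
    }
    int used_count() const {
        int n = 0;
        for (std::size_t i = 0; i < kSlots; ++i) n += (used_ >> i) & 1;
        return n;
    }
private:
    char slots_[kSlots * kSlotSize];
    std::uint8_t used_ = 0;                // one bit per slot
};
```

The trade the answer describes is visible here: every slot is as large as the worst case, but allocation never fragments and never walks a heap.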

There's a number of other times I can see writing your own custom allocator in the context of embedded systems, for example if the memory for the sequence isn't in main ram as might frequently be the case on these platforms.

Solution 14 - C++

I personally use Loki::Allocator / SmallObject to optimize memory usage for small objects; it shows good efficiency and satisfying performance if you have to work with moderate amounts of really small objects (1 to 256 bytes). It can be up to ~30 times more efficient than standard C++ new/delete allocation when allocating moderate amounts of small objects of many different sizes.

There is also a VC-specific solution called "QuickHeap". It brings the best possible performance (allocate and deallocate operations just read and write the address of the block being allocated or returned to the heap, in up to 99.(9)% of cases, depending on settings and initialization), but at the cost of a notable overhead: it needs two pointers per extent and one extra pointer for each new memory block. It is the fastest possible solution for working with huge (10,000+) amounts of objects being created and deleted, provided you don't need a big variety of object sizes. It creates an individual pool for each object size, from 1 to 1023 bytes in the current implementation, so initialization costs may diminish the overall performance boost, but one can go ahead and allocate/deallocate some dummy objects before the application enters its performance-critical phase(s).

The issue with the standard C++ new/delete implementation is that it's usually just a wrapper around C malloc/free allocation, which works well for larger blocks of memory, like 1024+ bytes. For small allocations it has notable overhead in terms of performance and, sometimes, extra memory used for bookkeeping too. So, in most cases custom allocators are implemented to maximize performance and/or minimize the amount of extra memory needed for allocating small (≤1024 bytes) objects.

Solution 15 - C++

For shared memory it is vital that not only the container's bookkeeping structures but also the data it contains are stored in shared memory.

The allocator of Boost.Interprocess is a good example. However, as you can read here, this alone does not suffice to make all STL containers shared-memory compatible (due to different mapping offsets in different processes, raw pointers might "break").

Solution 16 - C++

Some time ago I found this solution very useful: Fast C++11 allocator for STL containers. It speeds up STL containers on VS2017 (~5x) as well as on GCC (~7x). It is a special-purpose allocator based on a memory pool, and it can be used with STL containers only thanks to the mechanism you are asking about.

Solution 17 - C++

In a graphics simulation, I've seen custom allocators used for

  1. Alignment constraints that std::allocator didn't directly support.
  2. Minimizing fragmentation by using separate pools for short-lived (just this frame) and long-lived allocations.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Naaff | View Question on Stackoverflow
Solution 1 - C++ | timday | View Answer on Stackoverflow
Solution 2 - C++ | Grumbel | View Answer on Stackoverflow
Solution 3 - C++ | Johannes Thoma | View Answer on Stackoverflow
Solution 4 - C++ | Thomas Jones-Low | View Answer on Stackoverflow
Solution 5 - C++ | Martin Cote | View Answer on Stackoverflow
Solution 6 - C++ | Sebastian | View Answer on Stackoverflow
Solution 7 - C++ | pts | View Answer on Stackoverflow
Solution 8 - C++ | leander | View Answer on Stackoverflow
Solution 9 - C++ | Jørgen Fogh | View Answer on Stackoverflow
Solution 10 - C++ | Stephen | View Answer on Stackoverflow
Solution 11 - C++ | einpoklum | View Answer on Stackoverflow
Solution 12 - C++ | Jarom Nelson | View Answer on Stackoverflow
Solution 13 - C++ | shuttle87 | View Answer on Stackoverflow
Solution 14 - C++ | Fractal Multiversity | View Answer on Stackoverflow
Solution 15 - C++ | ted | View Answer on Stackoverflow
Solution 16 - C++ | no one special | View Answer on Stackoverflow
Solution 17 - C++ | Adrian McCarthy | View Answer on Stackoverflow