How does jemalloc work? What are the benefits?

Firefox Problem Overview

Firefox 3 came with a new allocator: jemalloc.

I have heard at several places that this new allocator is better. The top Google results don't gave any further information though and I am interested in how exactly it works.

Firefox Solutions

Solution 1 - Firefox

jemalloc first appeared for FreeBSD, the brainchild of one "Jason Evans", hence the "je". I would ridicule him for being egotistical had I not once written an operating system called paxos :-)

See this PDF for full details. It's a white paper describing in detail how the algorithms work.

The main benefit is scalability in multi-processor and multi-threaded systems achieved, in part, by using multiple arenas (the chunks of raw memory from which allocations are made).

In single-threaded situations, there is no real benefit to multiple arenas so a single arena is used.

However, in multi-threaded situations, many arenas are created (four times as many arenas as there are processors), and threads are assigned to these arenas in a round-robin fashion.

This means that lock contention can be reduced since, while multiple threads may call malloc or free concurrently, they'll only contend if they share the same arena. Two threads with different arenas will not affect each other.

In addition, jemalloc tries to optimise for cache locality since the act of fetching data from RAM is much slower than using data already in the CPU caches (no different in concept to the difference between fast fetching from RAM versus slow fetching from disk). To that end, it first tries to minimise memory usage overall since that is more likely to ensure the application's entire working set is in cache.

And, where that can't be achieved, it tries to ensure that allocations are contiguous, since memory allocated together tends to be used together.

From the white paper, these strategies seem to give similar performance to current best algorithms for single threaded use while offering improvements for multi-threaded usage.

Solution 2 - Firefox

There is one interesting source: the C-source itself: https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp (old)

In the beginning, a short summary describes how it works roughly.

// This allocator implementation is designed to provide scalable performance
// for multi-threaded programs on multi-processor systems.  The following
// features are included for this purpose:
//
//   + Multiple arenas are used if there are multiple CPUs, which reduces lock
//     contention and cache sloshing.
//
//   + Cache line sharing between arenas is avoided for internal data
//     structures.
//
//   + Memory is managed in chunks and runs (chunks can be split into runs),
//     rather than as individual pages.  This provides a constant-time
//     mechanism for associating allocations with particular arenas.
//
// Allocation requests are rounded up to the nearest size class, and no record
// of the original request size is maintained.  Allocations are broken into
// categories according to size class.  Assuming runtime defaults, 4 kB pages
// and a 16 byte quantum on a 32-bit system, the size classes in each category
// are as follows:
//
//   |=====================================|
//   | Category | Subcategory    |    Size |
//   |=====================================|
//   | Small    | Tiny           |       4 |
//   |          |                |       8 |
//   |          |----------------+---------|
//   |          | Quantum-spaced |      16 |
//   |          |                |      32 |
//   |          |                |      48 |
//   |          |                |     ... |
//   |          |                |     480 |
//   |          |                |     496 |
//   |          |                |     512 |
//   |          |----------------+---------|
//   |          | Sub-page       |    1 kB |
//   |          |                |    2 kB |
//   |=====================================|
//   | Large                     |    4 kB |
//   |                           |    8 kB |
//   |                           |   12 kB |
//   |                           |     ... |
//   |                           | 1012 kB |
//   |                           | 1016 kB |
//   |                           | 1020 kB |
//   |=====================================|
//   | Huge                      |    1 MB |
//   |                           |    2 MB |
//   |                           |    3 MB |
//   |                           |     ... |
//   |=====================================|
//
// NOTE: Due to Mozilla bug 691003, we cannot reserve less than one word for an
// allocation on Linux or Mac.  So on 32-bit *nix, the smallest bucket size is
// 4 bytes, and on 64-bit, the smallest bucket size is 8 bytes.
//
// A different mechanism is used for each category:
//
//   Small : Each size class is segregated into its own set of runs.  Each run
//           maintains a bitmap of which regions are free/allocated.
//
//   Large : Each allocation is backed by a dedicated run.  Metadata are stored
//           in the associated arena chunk header maps.
//
//   Huge : Each allocation is backed by a dedicated contiguous set of chunks.
//          Metadata are stored in a separate red-black tree.
//
// *****************************************************************************

Though, a more depth algorithm analysis is missing.

Solution 3 - Firefox

As for what benefits jemalloc brought to mozilla, per <http://blog.pavlov.net/2008/03/11/firefox-3-memory-usage/> (also first google result for mozilla+jemalloc):

[...]concluded that jemalloc gave us the smallest amount of fragmentation after running for a long period of time. [...] Our automated tests on Windows Vista showed a 22% drop in memory usage when we turned jemalloc on.

Solution 4 - Firefox

Aerospike implemented jemalloc back in a private branch in 2013. In 2014, it was incorporated into Aerospike 3.3. Psi Mankoski just wrote about Aerospike's implementation, plus when and how to effectively use jemalloc, for High Scalability.

jemalloc really helped Aerospike take advantage of modern multithreaded, multi-CPU, multi-core computer architectures. There are also some very important debugging capabilities built in to jemalloc to manage arenas. The debugging allowed Psi to be able to tell, for instance, what was a true memory leak, versus what was the result of memory fragmentation. Psi also discusses how thread cache and per-thread allocation provided an overall performance (speed) improvement.

Content Type	Original Author	Original Content on Stackoverflow
Question	Albert	View Question on Stackoverflow
Solution 1 - Firefox	paxdiablo	View Answer on Stackoverflow
Solution 2 - Firefox	Albert	View Answer on Stackoverflow
Solution 3 - Firefox	Nickolay	View Answer on Stackoverflow
Solution 4 - Firefox	Peter Corless	View Answer on Stackoverflow