C++ Lock-Free Object Pool

In C++, you often need a container where the objects will remain at a fixed location in memory so that you can easily refer to them individually. For example, our game could contain hundreds of thousands of Entity's and Component's, and our game systems need handles (e.g. pointers) to the individual objects they are interested in (e.g. the avatar, enemies, doors, etc.). We don't want to have to loop over the whole list of objects every frame to find the particular ones we want.

We can't use (e.g.) a std::vector for this, because these objects would move as other entries were destroyed, invalidating any pointers/handles to them. We could dynamically allocate their memory and just store their pointers, but dynamic memory allocations are slow, as is pointer-hopping through memory.

To achieve this we will create an object pool container called Pool, shown below. The objects themselves won't be reused; rather, we will emplace objects as needed, reusing their memory instead. The object constructor and destructor are used for initialization and cleanup of the objects themselves.

One important thing we want for this container is concurrency: we want to allow many threads to safely create, access, and destroy different objects in this pool all at the same time, lock-free. Multiple threads can of course read the same object at once, but it's important to note that our Pool class doesn't provide any guard against several threads operating on the same object. In many cases that's not needed anyway: Component's, for example, are updated before other systems need to read them, and game systems should not still be trying to read Entity's that are being unloaded from the game world.

Pool Class Members

Our Pool class will contain several member variables, listed below. mStorage is a pointer to an array of memory for the objects we'll be storing, and mSize and mCapacity are the current and maximum number of objects that can be stored.

To find which slots in the memory buffer are free to create objects, we will use a singly-linked list, mFreeList. This is simply a pointer to an array of integers: for a free slot at array index cIndex, the value stored in mFreeList[cIndex] is the index of the next free slot in the list. mHeadNodeIndex stores the location of the head node of the list. For example, with a capacity of 5 and objects living in slots 1 and 3, the head node might be 0 and the chain of free slots 0 -> 2 -> 4. We could use a bit array instead to use less memory, but using a free list gives us an O(1) lookup for finding a free slot (as we'll see further below), and is thus significantly faster to use.

template <typename DataType, bool Awaitable = false>
class Pool
{
    //...
private:

    //Get the cache line size (typically 64 bytes)
    static constexpr auto sMemberAlign = 
        std::hardware_destructive_interference_size;

    //Modified frequently during operations:
    alignas(sMemberAlign) std::atomic<int> mSize = 0;
    alignas(sMemberAlign) GuardedIndex mHeadNodeIndex;

    //Not modified post-allocation:
    alignas(sMemberAlign) std::byte* mStorage = nullptr; //Object memory
    int* mFreeList = nullptr;
    int mCapacity = 0;
};

Since this class is multithreaded, the member variables that are frequently modified are aligned to separate cache lines via std::hardware_destructive_interference_size. This prevents false sharing between them: this way a write to one variable won't trigger unnecessary reloading of neighboring variables from memory.
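
As a quick illustration (a sketch only, not part of the Pool), consider two counters that are each written frequently by a different thread:

//Without padding, the counters will likely share one cache line, so writes
//by one thread keep evicting the other thread's copy (false sharing):
struct Unpadded
{
    std::atomic<int> mCounterA;
    std::atomic<int> mCounterB;
};

//With padding, each counter gets its own cache line and the threads
//no longer interfere with each other:
struct Padded
{
    alignas(std::hardware_destructive_interference_size)
        std::atomic<int> mCounterA;
    alignas(std::hardware_destructive_interference_size)
        std::atomic<int> mCounterB;
};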

Another thing to note is that mHeadNodeIndex is not simply an integer but is a GuardedIndex, as shown below:

//Alignas so can be used naturally in std::atomic
class alignas(2*alignof(int)) GuardedIndex
{
public:    
    constexpr GuardedIndex() = default;
    constexpr GuardedIndex(int aIndex) : mIndex(aIndex) {}

    constexpr int Get_Index() const { return mIndex; }
    constexpr void Set_Index(int aIndex) { mIndex = aIndex; ++mGuardCount; }

private:
    int mIndex = 0;
    unsigned int mGuardCount = 0; //Unsigned so rollover is defined
};

static_assert(std::atomic<GuardedIndex>::is_always_lock_free, 
    "This will be much slower! Consider reducing the size of the members!");

This is just an index paired with a guard count, which gets incremented every time we set a new index as the head node of the free list. This count is used to guard against the ABA problem, as seen further below. GuardedIndex is also over-aligned to help ensure that std::atomic<GuardedIndex> can be implemented lock-free.

Note that if this container was single-threaded, we could save memory by embedding the free list within the memory for the object storage. This is because any given slot will either be empty and contain a node for the free list, or have an object and not be on the free list at all. Since it is multithreaded though, this would lead to undefined behavior as we'll discuss further below.
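
For reference, here's a minimal sketch (not something our Pool uses) of how a single-threaded pool could embed the free list in the object storage, with each slot being a union that holds either the next-free index or a live object:

//Sketch only: one slot of a hypothetical single-threaded pool
template <typename DataType>
union SingleThreadedSlot
{
    int mNextFreeIndex; //Valid only while the slot is free
    DataType mObject;   //Valid only while the slot holds an object

    SingleThreadedSlot() : mNextFreeIndex(-1) {}
    ~SingleThreadedSlot() {} //The pool destroys mObject before reusing the slot
};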

By using an array to store our objects, we'll be able to quickly iterate over them if we need to. However, one side effect of this is that the container is not growable: growing would require not only locking the Pool itself, but also all of the objects within it, as their memory would be moved. This is not a bad thing though, as it forces us to be more careful about the memory usage of our program.

Therefore it's important to allocate enough memory to cover the maximum number of objects that our Pool will need. While this is often known ahead of time, if not we can optionally allow our Pool container to be awaitable: if the Pool is full, an emplacing thread will be able to wait (block) on mSize for a slot to be open.

Initialization and Cleanup

To initialize our Pool container, we allocate its memory and set up the singly-linked free list. Again, the free list is just an array of integers, where each one is the index of the next slot in the linked list. Thus the values are initialized to 1, 2, 3, ..., N, -1, where the -1 is a sentinel value indicating there are no remaining free slots.

//Class template parameters removed for brevity
template <typename AllocatorType>
void Pool::Allocate(AllocatorType& aAllocator, int aCapacity)
{
    //Allocate the object memory
    static constexpr auto sDataAlign = std::max(alignof(DataType), 
        sMemberAlign);
    mStorage = aAllocator.Allocate(aCapacity * sizeof(DataType), sDataAlign);
    mCapacity = aCapacity;

    //Allocate the free list memory
    static constexpr auto sListAlign = std::max(alignof(int), sMemberAlign);
    auto cFreeListMemory = aAllocator.Allocate(aCapacity * sizeof(int), 
        sListAlign);
    mFreeList = reinterpret_cast<int*>(cFreeListMemory);

    //Initialize free list
    for (int ci = 0; ci < (aCapacity - 1); ++ci)
        new (&mFreeList[ci]) int(ci + 1);
    new (&mFreeList[aCapacity - 1]) int(-1);

    //Publish free list head node. 
    mHeadNodeIndex.Set_Index(0);
}

The location of the head node is stored in mHeadNodeIndex non-atomically: no other threads can be using the Pool while we're allocating memory. All of the stores in Allocate() will be synchronized-with other threads by whatever mechanism is already in place to inform them that the Pool has been initialized (e.g. launching of threads).
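
For example, here's a minimal usage sketch (the allocator and object types are hypothetical): the Pool is allocated before the worker threads are launched, so the thread creation itself publishes Allocate()'s stores to them:

LinearAllocator cAllocator;          //Hypothetical allocator type
Pool<Enemy> cEnemies;                //Enemy is a hypothetical object type
cEnemies.Allocate(cAllocator, 1024); //No other threads are running yet

std::vector<std::jthread> cWorkers;
for (int ci = 0; ci < 4; ++ci)
{
    //Launching a thread synchronizes-with its execution, so each worker
    //is guaranteed to see the fully-initialized Pool
    cWorkers.emplace_back([&cEnemies]()
    {
        auto [cCreated, cIndex] = cEnemies.Emplace();
        if (cCreated)
            cEnemies.Erase(cIndex);
    });
}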

The Free() method is straightforward: it cleans up the Pool by freeing the memory we allocated earlier. Note that we don't have to destroy the free list entries that we created since int's are trivially destructible.

//Class template parameters removed for brevity
template <typename AllocatorType>
void Pool::Free(AllocatorType& aAllocator)
{
    Assert(empty(), "Objects not destroyed!\n");

    //Note: no destruction of the free list entries is needed: int is trivially destructible
    aAllocator.Free(reinterpret_cast<std::byte*>(mFreeList));
    mFreeList = nullptr;
    aAllocator.Free(mStorage);
    mStorage = nullptr;

    mCapacity = 0;
}

Also, to assert that the container is empty, we can do a relaxed load of the container size: its value is already synchronized with whatever mechanism informed this thread that other threads have stopped working on this object (e.g. thread join). Similarly, all loads of the container size can be relaxed as there is no additional data we need to synchronize. Note that any query of the current size could be immediately out of date if other threads are operating on the Pool.

//Class template parameters removed for brevity
int Pool::size() const
{
    return mSize.load(std::memory_order::relaxed);
}

bool Pool::empty() const
{
    return (size() == 0);
}

Emplacing Objects

To emplace an object in the Pool, the main thing we have to do is reserve a slot from the mFreeList to put it in. To do that, we try to advance mHeadNodeIndex to the next node in the linked list using a compare-and-swap (CAS) atomic operation, compare_exchange_weak(). If we succeed, then we've popped the (former) head node off of the free list, and can create our object at that index.

To advance mHeadNodeIndex, we first read its (guarded) value, and go to that index in the free list. We then read THAT value to get the next entry in the free list. Finally, we set this new index in the GuardedIndex (incrementing the guard count), and attempt to CAS it with mHeadNodeIndex. If the CAS succeeds, we emplace our new object at the given slot in the object memory, and increment the container size mSize.

There could be many threads attempting to reserve the head node at the same time. If another thread beats us, the CAS operation will fail its comparison because it will instead see the new mHeadNodeIndex value that the other thread has stored, rather than the original value that we expected. In this case, we try again until we either succeed or the Pool becomes full. The code to implement this is shown below:

//Class template parameters removed for brevity
template <typename... ArgumentTypes>
std::pair<bool, int> Pool::Emplace(ArgumentTypes&&... aArguments)    
{
    //Get initial value
    auto cHead = std::atomic_ref(mHeadNodeIndex);
    GuardedIndex cHeadEntry = cHead.load(std::memory_order::acquire);

    //Loop until we reserve the head node for our object
    int cHeadIndex;
    GuardedIndex cNextSlotEntry;
    do
    {
        //Bail if we're full
        cHeadIndex = cHeadEntry.Get_Index();
        if (cHeadIndex == -1)
            return { false, cHeadIndex };

        //Find the next slot in the free list
        auto cListNode = std::atomic_ref(mFreeList[cHeadIndex]);
        auto cNextSlotIndex = cListNode.load(std::memory_order::relaxed);

        //Set next index and guard against ABA
        cNextSlotEntry = cHeadEntry;
        cNextSlotEntry.Set_Index(cNextSlotIndex);

        //Set new value for head index
    }
    while(!cHead.compare_exchange_weak(cHeadEntry, cNextSlotEntry, 
        std::memory_order::acquire, std::memory_order::acquire));

    //We have exclusive access to this slot. Create new object
    std::byte* cAddress = mStorage + cHeadIndex * sizeof(DataType);
    new (cAddress) DataType(std::forward<ArgumentTypes>(aArguments)...);

    //Update the size and return
    mSize.fetch_add(1, std::memory_order::relaxed);
    return { true, cHeadIndex };
}

Since the non-atomic mHeadNodeIndex and mFreeList variables are accessed by multiple threads, we use std::atomic_ref to operate on them safely. We could have made them std::atomic in the first place, but then the operations on them in Allocate() and Free() would have been unnecessarily slow: those are single-threaded contexts and thus don't need synchronization.

Most of the atomic operations in this function are std::memory_order::acquire. The loads of cHead (before the loop and on CAS failure) must be acquire to synchronize with the store to the mFreeList entry at the head index made in Erase() (shown below). This ensures that we read and install the correct node as the new head node of our free list. The load of the mFreeList array value can be relaxed though, as there is no additional data that we need to synchronize.

We use the acquire ordering on CAS success as well, to prevent the creation of the object from being reordered before the CAS operation. Note that the ordering does not need to be std::memory_order::acq_rel because we aren't making any changes here that we need to synchronize with other threads (other than the modification of mHeadNodeIndex itself). We also don't need release semantics as none of the statements in the loop can get ordered after the CAS loop anyway (it would break the program, violating the as-if rule).

The update of mSize can also be relaxed, because no additional data needs to be synchronized with its change. Note that atomic read-modify-write operations (such as fetch_add() and compare_exchange_weak()) are guaranteed to load the previously-stored value regardless of memory order (or else they wouldn't work properly!).

There is a corner case we have to be careful about: one thread could very quickly emplace several objects, then erase the first object, all while another thread is within this loop. This looping thread may see the original head index (with value A) but not its changes (to B, C, ...) from the emplaces and its reversion back to the original value (A) from the erase. This is known as the ABA problem, and our thread would succeed in its CAS operation if it weren't for the guard count. If it did succeed, it could swap the wrong index (B) in for the new head node, causing a potential memory stomp (at B) and undefined behavior. By incrementing the guard count with each swap with GuardedIndex, the CAS operation for this scenario will fail and it will safely retry instead.
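
To see the guard in action, here's a contrived single-threaded walkthrough of that scenario (a sketch with hypothetical slot indices). Without the guard count the final comparison would only see that the index is A again and the CAS would succeed; with it, the mismatched counts make it fail:

std::atomic<GuardedIndex> cHead { GuardedIndex(0) }; //Head = slot A (0)

//The looping thread reads the head...
GuardedIndex cExpected = cHead.load(); //{index 0, guard 0}

//...while another thread pops A, pops B, then pushes A back via an erase:
GuardedIndex cOther = cHead.load();
cOther.Set_Index(1); cHead.store(cOther); //Head = B: {index 1, guard 1}
cOther.Set_Index(2); cHead.store(cOther); //Head = C: {index 2, guard 2}
cOther.Set_Index(0); cHead.store(cOther); //Head = A: {index 0, guard 3}

//The looping thread tries to install B as the new head. The index part of
//its expected value matches (0 == 0), but the guard count doesn't (0 != 3),
//so the bitwise CAS comparison fails and the thread safely retries:
GuardedIndex cDesired = cExpected;
cDesired.Set_Index(1);
bool cSucceeded = cHead.compare_exchange_strong(cExpected, cDesired);
//cSucceeded == false, and cExpected now holds {index 0, guard 3}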

Finally, as we said before, we can't embed the free list inside of the object memory because it would yield undefined behavior. This can happen when two threads enter the loop at about the same time. When one thread succeeds and creates an object, it would overwrite the memory for the linked list node, destroying it. If the other thread then tries to read the linked list node there, it's too late as it no longer exists! Although the result of this read is guaranteed to be discarded as the CAS operation will fail, accessing an object beyond its lifetime is undefined behavior. The compiler may optimize our program assuming that this can never occur, leading to unexpected consequences.

Accessing and Erasing Objects

To access an object, we have to first reinterpret its address from std::byte* to DataType*, then std::launder the pointer. This laundering doesn't execute any code; it instead informs the compiler that it can't make any assumptions about where the object came from when doing optimizations. This is because we will likely create objects at this location many times throughout the lifetime of the Pool.

//Class template parameters removed for brevity
DataType& Pool::operator[](int aIndex)
{
    //Non-const method calls the const method
    const auto& cConstThis = *const_cast<const Pool*>(this);
    return const_cast<DataType&>(cConstThis[aIndex]);
}

const DataType& Pool::operator[](int aIndex) const
{
    const std::byte* cAddress = mStorage + aIndex * sizeof(DataType);
    return *std::launder(reinterpret_cast<const DataType*>(cAddress));
}

To erase an object, we simply call its destructor. The tricky part is adding this node back to the head of the free list. We do this with another CAS loop to update mHeadNodeIndex: we read the index of the current head node, set it in the free list at the input index, and try to set the input index as the new mHeadNodeIndex:

//Class template parameters removed for brevity
void Pool::Erase(int aIndex)
{
    //Destroy the object
    (*this)[aIndex].~DataType();

    //Add this index to the front of the list
    auto cHead = std::atomic_ref(mHeadNodeIndex);
    GuardedIndex cHeadEntry = cHead.load(std::memory_order::relaxed);
    GuardedIndex cNewHeadEntry;
    do
    {
        //Store the current head node in our free list slot
        auto cListNode = std::atomic_ref(mFreeList[aIndex]);
        cListNode.store(cHeadEntry.Get_Index(), std::memory_order::relaxed);

        //Set our index and guard against ABA
        cNewHeadEntry = cHeadEntry;
        cNewHeadEntry.Set_Index(aIndex);

        //Set new value
    }
    while (!cHead.compare_exchange_weak(cHeadEntry, cNewHeadEntry, 
        std::memory_order::release, std::memory_order::relaxed));

    //Decrease the size
    Decrease_Size();
}

The loads of cHead are relaxed as we don't need to synchronize (read) any additional data. The CAS-success ordering is std::memory_order::release so that the write to the mFreeList array will be visible to threads acquire'ing this head node in the Emplace() function, forming an acquire-release pair. Also note that release ordering prevents the object destruction from being reordered to after the CAS operation.

Since these CAS operations synchronize the changes to the free list, both the writes to it here and its reads in Emplace() can be relaxed. Operations on mFreeList still need to be atomic though, as a thread could be writing to an index here while another is trying to read it in Emplace(), which would otherwise be a data race (undefined behavior).

The implementation of Decrease_Size() is discussed in the Awaiting section below.

Awaiting

If we want to wait for room when the Pool is full, we can call Emplace_Await() below. It repeatedly calls Emplace() until it succeeds; each time it fails, it waits until it's notified to wake up again. When it calls wait(), it first loads mSize: if it loads a value different from mCapacity (something was potentially erased), it returns immediately so we can try to emplace again. If it loads mCapacity, the thread blocks; when it unblocks it reads mSize again, and if it sees mCapacity again (e.g. from ABA) it blocks again, else it returns so we can retry the emplace.

//Class template parameters removed for brevity
template <typename... ArgumentTypes>
int Pool::Emplace_Await(ArgumentTypes&&... aArguments) requires (Awaitable)
{
    while(true)
    {
        auto [cCreated, cIndex] = 
            Emplace(std::forward<ArgumentTypes>(aArguments)...);

        if(cCreated)
            return cIndex;

        //The pool was full, wait until we're notified
        mSize.wait(mCapacity, std::memory_order::acquire);
    }
}

Note that the load within wait() has acquire memory ordering, to make sure that this thread synchronizes with the release decrement in Decrease_Size() and thus sees the new head index from the Erase(). It could be relaxed, since loading a stale index would just cause another retry, but retrying is slow.

Note that it's OK if we load an out-of-date value for mSize: we'll either retry when we should be waiting (where we'll either succeed or block eventually) or we'll wait when we should retry. In this case, we've observed the (out-of-date) mCapacity value, but we'll unblock soon as we are "eligible to be unblocked" by the notify_all() in Decrease_Size() below:

//Class template parameters removed for brevity
void Pool::Decrease_Size()
{
    //Release if awaiting (Syncs indices), else relaxed (nothing to sync)
    static constexpr auto sOrder = Awaitable ? 
        std::memory_order::release : std::memory_order::relaxed;
    [[maybe_unused]] auto cPriorSize = mSize.fetch_sub(1, sOrder);

    if constexpr (Awaitable)
    {
        if (cPriorSize == mCapacity)
            mSize.notify_all();
    }
}

If the Pool isn't Awaitable, then all we do is relaxed-decrement mSize and return, as there's nothing else to do or synchronize. However, if we are awaiting, the decrement has release semantics, forming an acquire-release pair with the wait in Emplace_Await(), which synchronizes the head index update in Erase().

If the container was full, we call notify_all() to wake all threads waiting on mSize. We use notify_all() instead of notify_one() because we'll only call it once and we want to make sure we don't miss any waiting threads.

Conclusion

The object Pool lets us create, access, and destroy different objects lock-free in multithreaded environments. This container is essential for efficient asynchronous programming, especially in gaming, where we can use Pool's to manage the hundreds of thousands of game Entity's and Component's that we may need for large open-world games. This allows us to (e.g.) simultaneously load and unload different chunks of the world, while also (de)spawning enemies, pedestrians, vehicles, and any other objects and components that your game might need.

I should note that while I have written this code for an abstract machine, I have only tested this code on x86, and have not yet been able to do so on an ARM processor, which has a more relaxed memory model. Thus there may be subtle problems with the chosen atomic memory orders that I haven't found yet. If you're targeting a platform using ARM (e.g. Android, iOS), please review the memory ordering carefully and let me know if you encounter any problems!
