Home

PLF C++ Library - plf::colony

Introduction

A colony is the highest-performance C++ template-based data container for high-modification scenarios with unordered data. Specifically it provides better performance than other std:: library containers when:

  1. Insertion order is unimportant,
  2. Insertions and erasures to the container are occurring frequently in realtime, i.e. in performance-critical code, and/or
  3. Pointers which point to non-erased container elements must not be invalidated by insertion or erasure.

The benchmarks section below is the best place to get a feel for colony's general speed characteristics and performance benefits.

As explored further in the benchmarks there are some vector/deque modifications which can outperform colony during iteration while maintaining pointer validity, but at a cost to usability and memory usage. Colony's other advantages include the freeing and recycling of unused memory on-the-fly, the guaranteed stability of pointers/references/iterators to non-erased elements (which makes programming with containers of inter-related data structures faster and much easier), and broad cross-compiler support.

It can be reasonably compared to a "bag" or "bucket-array" styled structure, where each element location can be emptied and skipped over. Unlike a bucket array it does not use keys, iterates faster due to the use of a jump-counting skipfield instead of a boolean field, and frees or recycles unused memory from erased element locations on the fly. As it was initially developed predominantly for game development, colony favours larger-than-scalar-type performance (structs/classes greater than 128 bits in total) over scalar-type (float, int, etc.) performance.

Motivation

Note: initial motivation for the project came from video game engine development. Since then it has become a more generalized container; nevertheless, the initial motivation is presented below.
When working on video game engines we are predominantly dealing with collections of data where:

  1. Elements within data collections refer to elements within other data collections (through a variety of methods - indices, pointers, etc). An example is a game entity referring to both a texture object and collision blocks, as well as sound data. These references must stay valid throughout the course of the game/level. For this reason, any container (or use of a container) which causes pointer or index invalidation can create difficulties or necessitate workarounds.
  2. Order is unimportant for the most part. The majority of data collections are simply iterated over, transformed, referred to and utilized with no regard to order.
  3. Erasing or otherwise removing or deactivating objects occurs frequently in-game and in realtime (though often erasures will be implemented to occur at the end of a frame due to multithreading concerns). An example could be destroying a wall, or a game enemy. For this reason methods of erasure which create strong performance penalties are avoided.
  4. Creating new objects and adding them into the gameworld on-the-fly is also common - for example, a tree which drops leaves every so often, or a quadtree.
  5. We don't always know in advance how many elements there will be in a container at the beginning of development, or even at the beginning of a level during play - an example of this being an MMORPG (massively multiplayer online role-playing game). In an MMORPG the number of game entities fluctuates based on how many players are playing, though there may be maximum capacity limits. Genericized game engines in particular have to adapt to considerably different user requirements and scopes. For this reason extensible containers which can expand and contract in realtime are usually necessary.
  6. Depending on the complexity and scope of any given game, we can be dealing with anywhere between 10 and 100000 objects in a given area. We are not typically dealing with the types of capacities necessitated by large-scale mathematical or statistical applications.
  7. For performance reasons, memory storage which is more-or-less contiguous is preferred. Lists, vectors of pointers to dynamically-allocated objects, and maps as implemented in the standard library are unusable.
  8. Memory wastage is avoided, and in particular, any container which allocates upon initialisation tends to be avoided as this can incur purposeless memory and performance costs.

To meet these requirements, game developers tend to either (a) develop their own custom containers for a given scenario or (b) develop workarounds for the failings of std::vector. These workarounds are many and varied, but the most common are probably:

  1. Using a boolean flag (or similar) to indicate the inactivity of an object (as opposed to actually erasing from the vector). When erasing, one simply adjusts the boolean flag, and when iterating, items with the adjusted boolean flag are skipped. External elements refer to elements within the container via indexes rather than pointers (which can be invalidated upon insertion).

    Advantages: Fast erasure.

    Disadvantages: Slow to iterate due to branching.

  2. Utilizing a vector of data with a secondary vector of indexes. When erasing, the erasure occurs in the vector of indexes, not the vector of data, and when iterating, one iterates over the vector of indexes, accessing the data in the vector of data via each index.

    Advantages: Faster iteration.

    Disadvantages: Erasure still incurs some reallocation cost, which can increase jitter.

  3. Combining a swap-and-pop mechanism with some form of dereferenced lookup system to enable contiguous element iteration (sometimes known as a 'packed array', various other names). In this case when erasing we swap the back element from the vector of elements with the element being erased, then pop the (swapped) back element. When iterating over the data we simply iterate through the vector of elements. However to maintain valid external links to elements (which may be moved at any time) we must also maintain a stable vector of indexes (or similar) and update the index numbers corresponding with the element being erased and the element being moved. In this case external objects referring to elements within the container must store a pointer to the index for the element in question. There are many varied alternatives to this method. The method is more useful when the container's data is mostly being iterated over with fewer references to the individual elements from outside objects.

    Advantages: Iteration is at standard vector speed.

    Disadvantages: Erase could be slow if objects are large and swap cost is therefore large. Insertion cost is larger than other techniques. All references to elements incur additional costs due to a two-fold reference mechanism.


All three techniques have the disadvantage of slow singular insertions, and the first two will also continually expand memory usage when erasing and inserting over periods of time. The third deals better with this scenario as it swaps from the back rather than leaving gaps in the element vector, but its performance will suffer if elements within the container are heavily referred to by external objects/elements.

Colony is an attempt to bring a more generic solution to this domain. It has the advantage of good iteration speed while maintaining a similar erasure speed to the boolean technique described above, and without causing pointer invalidation upon either erasure or insertion. Its insertion speed is also much faster than a vector's. Memory from erased elements is either reused by subsequent insertions or released to the OS on-the-fly. It achieves these ends via a number of new approaches: a jump-counting skipfield (instead of a boolean field), a linked chain of increasingly-large memory blocks (instead of a singular memory block or vector of blocks), and a custom internal reduced stack (based on plf::stack) to enable erased memory location re-use.

More data on the performance characteristics of colony versus other containers and workarounds can be found in the benchmarks section; read on below to gain an understanding of the mechanics.

Implementation

plf::colony uses a chained-group memory allocation pattern with a growth factor of 2 (doubly-linked chains of element "groups" containing memory blocks with additional structure metadata, including in this case per-memory-block jump-counting skipfields). A growth factor of 2 tends to give the best performance in the majority of scenarios (which is why it is used by the majority of vector implementations). Colony's minimum and maximum memory block sizes can also be adjusted to suit individual use-cases. Using a multiple-memory-block approach removes the necessity for data reallocation upon insertion, and because data is not reallocated, all references/pointers/iterators to container elements stay valid post-insertion.

Since std::vector is the most widely-used and commonly-understood of the std:: library containers, we here contrast the storage mechanism of a colony with that of a vector:

[Figure: inserting into a full vector vs. inserting into a full colony]

The lack of reallocation is also why insertion into a colony is faster than insertion into a std::vector. Now we look at erasure. When an element is erased in a std::vector, the following happens:

[Figure: randomly erasing from a vector]

When an element is erased in a colony, this happens:

[Figure: randomly erasing from a colony]

This is why a colony has a performance advantage over std::vector when erasing.

Upon subsequent insertions to a colony post-erasure, colony will check to see if its erased-location stack is empty. If it is empty, it inserts to the end of the colony, creating a new group if the last group is full. If it is not empty, it pops an erased element location off the stack and reuses that memory address for the newly-inserted element. If you erase all elements in any given group in a colony, the group is removed from the colony's chain and released to the OS - at that point, any of that group's erased element locations present in the erased-location stack are also removed.

License

plf::colony is under a permissive zlib license. This means: the software is used on an 'as-is' basis; it can be modified and used in commercial software; and the authors are not liable for any damages arising from its use. The distribution of a modified version of the software is subject to the restrictions described in the license text.

Download

Download here (26kb zip file) or view the repository

The colony library is a simple .h header file, to be included via an #include directive.

In addition, if you are interested in benchmarking, you can also download the plf benchmark suite (36kb zip file), which includes plf::nanotimer.

Function Descriptions and syntax

Colony meets the requirements of the C++ Container, AllocatorAwareContainer, and ReversibleContainer concepts.

For the most part the syntax and semantics of colony's functions are very similar to those of the std:: C++ library containers. The formal description is as follows:

template <class T, class Allocator = std::allocator<T>, typename Skipfield_Type = unsigned short> class colony

T - the element type. In general T must meet the requirements of Erasable, CopyAssignable and CopyConstructible.
However, if emplace is utilized to insert elements into the colony, and no functions which involve copying or moving are utilized, T is only required to meet the requirements of Erasable.
If move-insert is utilized instead of emplace, T must also meet the requirements of MoveConstructible .

Allocator - an allocator that is used to acquire memory to store the elements. The type must meet the requirements of Allocator. The behavior is undefined if Allocator::value_type is not the same as T.

Skipfield_Type - an unsigned integer type. This type is used to form the skipfield which skips over erased T elements. The maximum size of element memory blocks is constrained by this type's bit depth (due to the nature of a jump-counting skipfield). As an example, unsigned short is 16-bit on most platforms, which constrains the size of individual memory blocks to a maximum of 65535 elements. unsigned short has been found to be the optimal type for performance based on benchmarking. However, this template parameter is important for a variety of scenarios relating to performance or memory usage. In the case of small collections (e.g. < 512 elements) in a memory-constrained environment, it is useful to reduce the memory usage of the skipfield by reducing its bit depth to an 8-bit type such as unsigned char. The reduced skipfield size will also reduce cache saturation in this case, without impacting iteration speed due to the low element total. In the case of very large collections (millions of elements) where memory usage is not a concern, changing the skipfield bit depth to a larger type leads to slightly increased iteration performance (offset by decreased erasure and insertion performance) due to the larger memory block sizes made possible by the larger bit depth. Since these scenarios vary on a per-case basis, it is best to leave the control in the user's hands.

Basic example of usage

#include <iostream>
#include "plf_colony.h"

int main(int argc, char **argv)
{
  plf::colony<int> i_colony;

  // Insert 100 ints:
  for (int i = 0; i != 100; ++i)
  {
    i_colony.insert(i);
  }

  // Erase half of them (erase() returns an iterator to the next element; the loop's ++ then skips one):
  for (plf::colony<int>::iterator it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    it = i_colony.erase(it);
  }

  // Total the remaining ints:
  int total = 0;

  for (plf::colony<int>::iterator it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    total += *it;
  }

  std::cout << "Total: " << total << std::endl;
  std::cin.get();
  return 0;
} 

Example demonstrating pointer stability

#include <iostream>
#include "plf_colony.h"

int main(int argc, char **argv)
{
  plf::colony<int> i_colony;
  plf::colony<int>::iterator it;
  plf::colony<int *> p_colony;
  plf::colony<int *>::iterator p_it;

  // Insert 100 ints to i_colony and pointers to those ints to p_colony:
  for (int i = 0; i != 100; ++i)
  {
    it = i_colony.insert(i);
    p_colony.insert(&(*it));
  }

  // Erase half of the ints:
  for (it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    it = i_colony.erase(it);
  }

  // Erase half of the int pointers:
  for (p_it = p_colony.begin(); p_it != p_colony.end(); ++p_it)
  {
    p_it = p_colony.erase(p_it);
  }

  // Total the remaining ints via the pointer colony (pointers will still be valid even after insertions and erasures):
  int total = 0;

  for (p_it = p_colony.begin(); p_it != p_colony.end(); ++p_it)
  {
    total += *(*p_it);
  }

  std::cout << "Total: " << total << std::endl;
  
  if (total == 2500)
  {
    std::cout << "Pointers still valid!" << std::endl;
  }
  
  std::cin.get();
  return 0;
} 

Iterator Invalidation

All read-only operations, swap, std::swap: Never
clear, reinitialize, operator =: Always
reserve, shrink_to_fit: Only if capacity is changed
change_group_sizes, change_minimum_group_size, change_maximum_group_size: Only if the supplied minimum group size is larger than the smallest group in the colony, or the supplied maximum group size is smaller than the largest group in the colony
erase: Only for the erased element. If an iterator is == end() it may be invalidated if the last element in the colony is erased, in some cases. If a reverse_iterator is == rend() it may be invalidated if the first element in the colony is erased, in some cases
insert, emplace: If an iterator is == end() it may be invalidated by a subsequent insert/emplace, in some cases

Member types

Member type Definition
value_type T
allocator_type Allocator
skipfield_type Skipfield_Type
size_type Allocator::size_type (pre-C++11), std::allocator_traits<Allocator>::size_type (post-C++11)
difference_type Allocator::difference_type (pre-C++11), std::allocator_traits<Allocator>::difference_type (post-C++11)
reference Allocator::reference (pre-C++11), value_type & (post-C++11)
const_reference Allocator::const_reference (pre-C++11), const value_type & (post-C++11)
pointer Allocator::pointer (pre-C++11), std::allocator_traits<Allocator>::pointer (post-C++11)
const_pointer Allocator::const_pointer (pre-C++11), std::allocator_traits<Allocator>::const_pointer (post-C++11)
iterator BidirectionalIterator
const_iterator Constant BidirectionalIterator
reverse_iterator BidirectionalIterator
const_reverse_iterator Constant BidirectionalIterator

Constructors

default colony()
explicit colony(const allocator_type &alloc)
fill explicit colony(size_type n, Skipfield_type min_group_size = 8, Skipfield_type max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
explicit colony(size_type n, const value_type &element, Skipfield_type min_group_size = 8, Skipfield_type max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
range template<typename InputIterator> colony(const InputIterator &first, const InputIterator &last, Skipfield_type min_group_size = 8, Skipfield_type max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
copy colony(const colony &source)
move colony(colony &&source) noexcept (C++11 and upwards)
initializer list colony(const std::initializer_list<value_type> &element_list, Skipfield_type min_group_size = 8, Skipfield_type max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
Some constructor usage examples

Iterators

All iterators are bidirectional but also provide >, <, >= and <= for convenience (for example, in for loops). Functions for iterator, reverse_iterator, const_iterator and const_reverse_iterator follow:

operator *
operator ->
operator ++
operator --
operator =
operator ==
operator !=
operator <
operator >
operator <=
operator >=
base() (reverse_iterator and const_reverse_iterator only)

All operators have O(1) amortised time complexity. Originally there were +=, -=, + and - operators; however, the time complexity of these varied from O(n) to O(1) depending on the underlying state of the colony, averaging at O(log n). As such, they could not be supplied as iterator operators (as per the C++ standard's complexity requirements for iterators) and have instead been transplanted to colony's advance(), next(), prev() and distance() member functions. Greater-than/lesser-than operator usage indicates whether an iterator is higher/lower in position compared to another iterator in the same colony (i.e. closer to the end/beginning of the colony). const_iterator is provided for both C++03 and C++11/14/etc. compilers.

Member functions

Insert

single element iterator insert (const value_type &val)
fill iterator insert (const size_type n, const value_type &val)
range template <class InputIterator> iterator insert (const InputIterator first, const InputIterator last)
move iterator insert (value_type&& val) (C++11 and upwards)
initializer list iterator insert (const std::initializer_list<value_type> il)

Erase

single element iterator erase(const_iterator it)
range void erase(const_iterator first, const_iterator last)

Other functions

Non-member functions

Benchmarks

Last updated 8-2-2017 v3.87

  1. Test setup
  2. Tests design
  3. Raw performance benchmarks (against standard containers)
  4. Comparative performance benchmarks (against modified standard containers)
  5. Low-modification scenario test
  6. High-modification scenario test
  7. Referencing scenario test
  8. Overall performance conclusions

Test machine setup

The test setup is an Intel E8500 with 8GB RAM, using GCC 5.1 x64 as the compiler. The OS is a stripped-back Windows 7 x64 SP1 installation with most services, background tasks (including explorer) and all networking disabled. Build settings are "-O2 -march=native -std=c++11 -fomit-frame-pointer". Results under MSVC 2015 update 3, on an Intel Xeon E3-1241 (Haswell core), can be viewed here. There is no commentary for the MSVC results.

The source code for the benchmarks can be found in the colony page downloads section.

General test design

Tests are based on a sliding scale of number of runs vs number of elements, so a test with only 10 elements in a container may average 100000 runs to guarantee a more stable result average, whereas a test with 100000 elements may only average 10 runs. This tends to give adequate results without overly lengthening test times. I have not included results involving reserve() functions, as the differences to overall insertion performance were insignificant.

Insertion: is into empty containers for the raw and comparative tests, inserting single elements at a time. For the 'scenario' tests there is also ongoing insertion at random intervals. This matches the use case of colony, where insertion on-the-fly is expected to be one of the driving factors of usage. Insertion is always at the most performant location for the specific container - for example, the front for a list, or the back for a vector.
Erasure: initially takes place in an iterative fashion for the raw tests, erasing elements at random while iterating through the container. The exception is the tests involving a remove_if pattern (pointer_deque and indexed_vector), which make a secondary pass when using this pattern.
Iteration: is straightforward iteration from the start to the end of each container. Typically there are more runs for iteration than for the other tests, as iteration is a much quicker procedure and more runs deliver a more stable average.

Raw performance tests

Before we begin measuring colony against containers or container modifications which do not invalidate links on erasure or insertion, we need to identify which containers are good candidates for comparison based on raw performance, without regard to linkage invalidation. With that in mind the following tests compare colony against the main standard library containers. Tests are carried out on the following types: (a) an 8-bit type, i.e. char, (b) a 32-bit type, i.e. int, (c) a 64-bit type, i.e. double, (d) a small struct containing two pointers and four scalar types, and (e) a large struct containing two pointers, four scalar types, a large array of ints and a small array of chars.

The first test measures time to insert N elements into a given container, the second measures the time taken to erase 25% of those same elements from the container, and the third test measures iteration performance after the erasure has taken place. Erasure tests avoid the remove_if pattern for the moment to show standard random-access erasure performance more clearly (this pattern is explored in the second test). Both linear and logarithmic views of each benchmark are provided in order to better show the performance of lower element amounts.

Insertion Performance

[Test result graphs for the five types, at linear and logarithmic scales]

A predictable pattern for all but the large struct test is shown for insertion:
std::deque dominates insertion, with plf::colony equal after about 100 elements - but for large structs, colony's performance completely eclipses std::deque. This is because libstdc++'s implementation of deque caps the memory block size at 512 bytes, whereas the large struct in question is ~506 bytes (depending on platform), meaning deque essentially becomes a std::list but with additional overheads. Colony avoids this downfall due to its memory allocation pattern of basing memory block sizes on fixed numbers of elements with a growth factor of 2, not fixed numbers of bytes. std::vector is nearly on a par with std::deque for very small element types with element numbers greater than a thousand, but becomes worse the larger the stored type is and the fewer stored elements there are.
std::list, std::map and std::multiset all perform poorly by contrast, with the exception of large structs, where std::list comes in 2nd place to colony. Overall, plf::colony and std::deque dominate.

Erase Performance

Here we forward-iterate over each container and erase 25% of all elements at random. If (due to the variability of random number generators) 25% of all elements have not been erased by the end of the container iteration, the test will reverse-iterate through the container and randomly erase the remaining necessary number of elements until that 25% has been reached.

[Test result graphs for the five types, at linear and logarithmic scales]

Across all types plf::colony dominates performance, with std::list coming close behind. std::deque and std::vector show predictably poor performance since a remove_if pattern is not being used - as much as 100000x worse than plf::colony and std::list for large numbers of large types.

Post-erasure Iteration Performance

Since data is typically iterated across more than it is erased or inserted, iteration speed is, for many areas, more important than erase or insertion performance, despite the fact that it almost always takes factors of ten less time than either of those two.

[Test result graphs for the five types, at linear and logarithmic scales]

std::vector and std::deque come in first and second place for most types, with colony in third place - the only place where this does not occur is for large structs, where colony dominates std::deque after approximately 20000 elements are inserted. Once again this is due to the structure of deque as explained in the insertion conclusion above.

For under 1000 elements, std::list is about on par with both std::deque and std::vector, which otherwise dominate these tests, with std::vector taking 1st place. However, the element count below which std::list remains on par decreases as the stored type gets larger, suggesting that its performance in this case is due to some effect of the CPU cache or implementation. Querying the GCC mailing list about this resulted in the following response, which I believe to be accurate due to the correlation between std::list iteration performance and type size: "I suspect that for working sets which fit into the first-level cache of the CPU, the simpler iterators for std::list are faster than std::deque because the additional memory accesses in std::list are cheaper than the fairly complicated iterator implementation in std::deque". What this suggests is that for typical programs, where more than one data set is competing for space in the L1 or L2 caches, std::list performance will not follow the pattern above and will generally be poor.

Raw tests Conclusion

From the above data we can see that std::list is not a good contender against plf::colony, as it has weaker insertion and erasure performance, and the only scenario where it has good iteration performance is where (a) the amount of data in the container is small enough to fit entirely into the cache and (b) that data set is the only data set being operated on by the program in question, and in fact by the computer as a whole. That being a relatively uncommon case, std::list is not a general contender.

std::deque is a contender, having strong insertion performance and excellent iteration performance but poor non-remove_if erasure performance - however, std::deque invalidates pointers upon erasure, meaning it requires modification to be used in a way comparable to colony. std::vector is a slightly weaker contender, having weak insertion performance and worse non-remove_if erasure performance than std::deque; however, its iteration performance is always the best, being entirely contiguous in memory, as opposed to deque, which is only partially contiguous. std::vector invalidates pointers on both insertion and erasure, meaning it will also require modification to compare to colony.

Comparative performance tests

Colony is primarily designed for scenarios where good insertion/erasure/iteration performance is required while guaranteeing linkage stability for outside elements referring to elements within the container, and where ordered insertion is unimportant. The two containers from the raw performance tests which may compare both in performance and usage (after modification) are std::deque and std::vector. std::list does not meet these requirements, as it has poor insertion and iteration performance.

pointer_deque and indexed_vector

Because std::deque does not invalidate pointers to elements upon insertion to the back or front, we can guarantee that pointers won't be invalidated during unordered insertion. This means we can use a modification called a 'pointer-to-deque deque', or pointer_deque. Here we take a deque of elements and construct a secondary deque containing pointers to each element in the first deque. The second deque functions as an erasable iteration field for the first deque, i.e. when we erase, we only erase from the pointer deque, and when we iterate, we iterate over the pointer deque and access only those elements it points to. In doing so we reduce erase times for larger-than-scalar types, as it is computationally cheaper to reallocate pointers (upon erasure) than larger structs. We also avoid reallocation during erasure for the element deque, meaning pointers to elements within the element deque stay valid.

We cannot employ quite the same technique with std::vector, because it reallocates during insertion to the back once capacity is reached. But since indexes stay valid regardless of whether a vector reallocates, we can employ a similar tactic using indexes instead of pointers, which we'll call an indexed_vector. In this case we use a secondary vector of indexes to iterate over the vector of elements, and only erase from the vector of indexes. This strategy has the advantage of potentially lower memory usage, as the bit depth of the indexes can be reduced to match the maximum known number of elements, but it loses a small amount of performance due to the pointer addition necessary to utilise indexes instead of pointers. In addition, outside objects referring to elements within the indexed_vector must use indexes instead of pointers, which means the outside object must know both the index and the container it is indexing; a pointer approach can ignore this and simply point to the element in question.

We will also compare these two container modifications using a remove_if pattern for erasure vs regular erasure, by adding an additional boolean field to indicate erasure to the original stored struct type, and utilizing two passes - the first to randomly flag elements as being ready for erasure via the boolean field, the second using the remove_if pattern.

vector_bool and deque_bool

A second modification approach, which we'll call a vector_bool, is a very common approach in a lot of game engines - a bool or similar type is added to the original struct or class, and this field is tested against to see whether or not the object is 'active' (true) - if inactive (false), it is skipped over. We will also compare this approach using a deque.

packed_deque

packed_deque is an implementation of a 'packed array' as described in the motivation section earlier, but using deques instead of vectors or arrays. As we've seen in the raw performance benchmarks, (GCC) libstdc++'s deque is almost as fast as vector for iteration, but about twice as fast for back insertion and random-location erasure. It also doesn't invalidate pointers upon insertion. These properties become important when designing a container which is meant to handle large numbers of insertions and random-location erasures. In the case of a packed array, random-location erasures don't really happen - the 'erased' elements simply get replaced with elements from the back - so erasure speed is not as critical. Insertion speed is critical, however, as insertion will always consume significantly more CPU time than iteration.

With that in mind, my implementation uses two std::deques internally: one containing structs which package together the container's element type and a pointer, and one containing pointers to each of the 'package' structs in the first deque. The latter is used by the container's 'handle' class to enable external objects to refer to container elements. The pointer within each package in turn points to the package's corresponding 'handle' pointer in the second deque. This enables the container to update the handle pointer when and if a package is moved from the back upon an erasure.

Anyone familiar with packed array-style implementations can skip this paragraph. For anyone who isn't, this is how it works when an element is erased from packed_deque, unless the element in question is already at the back of the deque. It:

  1. Uses the pointer within the package to find the 'handle' pointer which pointed to the erased element, and adds it to a free list.
  2. Moves the package at the back of the container to the location of the package containing the element being erased.
  3. Uses the pointer in the package which has just been moved to update the corresponding handle pointer, to correctly point to the package's new location.
  4. Pops the back package off the first deque (this should be safe to destruct after the move - if it's not, the element type's implementation is broken).

In this way, the data in the first deque stays contiguous and is hence fast to iterate over. And any handles referring to the back element which got moved stay valid after the erasure.
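The erase steps above can be sketched as follows. This is a simplified stand-in, not the real plf::packed_deque interface: the free list in step 1 is reduced to nulling the handle, and the types are illustrative. Each package pairs the element with a pointer back to its handle slot, so the handle can be repointed when the back package is moved into an erased slot.

```cpp
#include <deque>

struct package
{
    int element;
    package **handle; // points back at the handle slot that points at us
};

struct packed
{
    std::deque<package> packages;  // contiguous-ish element storage
    std::deque<package *> handles; // slots external objects point into

    package **insert(int value)
    {
        // deque push_back does not invalidate pointers to existing elements
        packages.push_back({value, nullptr});
        handles.push_back(&packages.back());
        packages.back().handle = &handles.back();
        return &handles.back(); // external objects store this handle
    }

    void erase(package **handle)
    {
        package *doomed = *handle;
        if (doomed != &packages.back())
        {
            *doomed = packages.back();  // step 2: move back package into the hole
            *(doomed->handle) = doomed; // step 3: repoint the moved package's handle
        }
        packages.pop_back();            // step 4: pop the back package
        *handle = nullptr;              // step 1, simplified: retire the handle
    }

    int sum() const
    {
        int total = 0;
        for (const package &p : packages) // packed data: no skipfield, no branches
            total += p.element;
        return total;
    }
};
```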

This implementation will not work well under MSVC as MSVC's deque implementation performs badly.

Tests

Since neither indexed_vector nor pointer_deque will have erasure time benefits for small scalar types, and because game development is predominantly concerned with storage of larger-than-scalar types, we will only test using small structs from this point onwards. In addition, we will test 4 levels of erasure and subsequent iteration performance: 0% of all elements, 25% of all elements, 50% of all elements, and 75% of all elements.

Insertion

Click images or hover over to see results at linear scale instead

test result graph

For insertion plf::colony outperforms the others for greater than 100 and less than 20000 elements. Below 100 elements it is outperformed by pointer_deque and deque_bool, and above 20000 elements it is outperformed by deque_bool. packed_deque consistently comes 4th, and both vector methods perform poorly by contrast.

Erasure

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph

Here the gap consistently widens between the candidates as erasure percentage increases. The two boolean skipfield methods obviously dominate performance, being the easiest and fastest to implement in terms of erasure. Above 25% erasure both of the remove_if variants outperform the others, with packed_deque and colony criss-crossing each other in terms of performance. The non-remove_if variants of pointer_deque and indexed_vector of course perform poorly.

Post-erasure Iteration

Click images or hover over to see results at linear scale instead

test result graph

Colony performs the worst out of the lot for iteration with zero erasures, with deque_bool coming in slightly worse only for large numbers of elements. Unsurprisingly packed_deque performs the best out of the lot as this constitutes pure contiguous iterative performance with no skipfield or iteration field. While close, the pointer_deque approach has a slight performance advantage over the indexed_vector.

test result graph

Now we begin to see the advantage of a jump-counting skipfield over a boolean skipfield: boolean skipfields require branching code to iterate over, and with 25% of all elements erased this results in a large number of branch mispredictions and cache misses. Other than that, the results are much the same as the 0% test.
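A toy sketch of why jump-counting iteration avoids that branching (this is an illustration of the skipfield concept, not plf::colony's actual implementation): the skipfield entry at the start of a run of erased elements stores the run length, so the iterator advances with an unconditional addition - zero for a non-erased element, the run length for an erased one. The skipfield here has one trailing entry past the end, an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// skipfield.size() == elements.size() + 1; entry i is 0 if element i is not
// erased, or the length of the erased run if i starts one.
int sum_via_skipfield(const std::vector<int> &elements,
                      const std::vector<std::uint16_t> &skipfield)
{
    int total = 0;
    std::size_t i = skipfield[0]; // jump over any erased run at the front
    while (i < elements.size())
    {
        total += elements[i];
        ++i;
        i += skipfield[i]; // branchless skip: adds 0 or the erased-run length
    }
    return total;
}
```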

test result graph

At 50% randomised erasures, a CPU's branch prediction cannot work at all, and so the boolean approaches degrade significantly.

test result graph

At this point we can clearly see that the boolean approaches are not useful in terms of iteration.

Summary: for iteration packed_deque comes 1st, pointer_deque 2nd, indexed_vector 3rd, colony 4th, vector_bool 5th and deque_bool 6th.

'Real-world' scenario testing - low modification

While testing iteration, erasure and insertion separately can be a useful tool, these isolated tests don't tell us how containers perform under real-world conditions, as under most use-cases we will not simply insert a bunch of elements, erase some of them, then iterate once over the data. To get more valid results, we need to think about the kinds of use-cases we may have for different types of data - in this case, video-game data.

In this test we simulate the use-case of a container for general video game objects: actors/entities, enemies etc. Initially we insert a number of small structs into the container, then simulate in-game 'frames'. We iterate over container elements every frame, and erase (at random locations) and insert 1% of the original number of elements for every minute of game-time, ie. every 3600 frames assuming 60 frames per second. We measure the total time taken to simulate this scenario for 108000 frames (half an hour of simulated game-time at 60 frames per second), as well as the amount of memory used by the container at the end of the test. We then re-test this scenario with 5% of all elements being inserted/erased, then 10%.

With some reluctance I have also included the results for std::list in this test and the high-modification tests, despite the fact that the earlier 'raw' performance tests show that it is not a contender for colony, at least in the same problem domain. This is because some people were not able to relate the raw performance test outcomes to the expected outcomes of subsequent tests. By contrast there is even less point in including std::map/std::multiset for these tests, as their raw iteration, erasure and insertion outcomes were much worse than those of the other containers.

Click images or hover over to see results at linear scale instead

Performance results

test result graph
test result graph
test result graph

Memory results

test result graph
test result graph
test result graph

'Real-world' scenario testing - high modification

Same as the previous test but here we erase/insert 1% of all elements per-frame instead of per 3600 frames, then once again increase the percentage to 5% then 10% per-frame. This simulates the use-case of continuously-changing data, for example video game bullets, decals, quadtree/octree nodes, cellular/atomic simulation or weather simulation.

Performance results

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph

Memory results

test result graph
test result graph
test result graph

Obviously if you are doing a lot of referencing into a packed_deque from outside objects, you may encounter some speed problems due to the dereferencing system - which the next test will cover. On the other hand, if you are mainly iterating over unordered data, erasing occasionally, and only ever referring outwards from elements within the packed_deque, it will be an ideal solution.

'Real-world' scenario-testing: referencing with interlinked containers

In order to completely test plf::colony against a packed-array-style implementation, it is necessary to measure performance while exploiting both containers' linking mechanisms - for colony this is pointers or iterators; for a packed array this is a handle, or more specifically a pointer to a pointer (or the equivalent index-based solution). Because that additional dereferencing is a performance loss and potential cache miss, any test which involves a large amount of inter-linking between elements in multiple containers should lose some small amount of performance when using a packed_deque instead of a colony. Since games typically have high levels of interlinking between elements in multiple containers (as described in the motivation section), this is relevant to performance concerns.

Consequently, this test utilizes four instances of the same container type, each containing different element types:

  1. A 'collisions' container (which could represent collision rectangles within a quadtree/octree/grid/etc)
  2. An 'entities' container (which could represent general game objects) and
  3. Two subcomponent containers (these could be sprites, sounds, or anything else).
test class diagram

This is actually a very low number of inter-related containers for a game, but we're reducing the number of components in this case just to simplify the test. Elements in the 'entities' container link to elements in all three of the other containers. In turn, the elements in the collisions container link back to the corresponding elements in the entities container. The subcomponent containers do not link to anything.

In the test, elements are first added to all four containers and interlinked (as this is a simplified test, there's a one-to-one ratio of entity elements to 'collision' and subcomponent elements). The core test process after this point is similar to the modification scenarios tested earlier: we iterate over the entity container once every frame, adding a number from both the entities and subcomponents to a total. We also erase a percentage of entities per-frame (and their corresponding subcomponents and collision blocks) - similar to the earlier tests.

However during each frame we also iterate over the 'collisions' container and erase a percentage of these elements (and their corresponding entities and subcomponents) at random as well. This could be seen as simulating entities being removed from the game based on collisions occurring, but is mainly to test the performance effects of accessing the subcomponents via a chain of handles versus a chain of pointers. Then, again during each frame, we re-add a certain number of entities, collision blocks and subcomponents back into the containers based on the supplied percentage value. Since both colony and packed_deque essentially re-use erased-element memory locations, this tests the efficacy of each container's mechanism for doing so (packed_deque's move-and-pop + handle free-list, versus colony's stack + skipfield).

Since neither container will grow substantially in memory usage over time as a result of this process, a longer test length is not necessary like it was for the earlier modification scenario-testing with indexed_vector and pointer_deque. Testing on both plf::colony and plf::packed_deque showed that both of their test results increased linearly according to the number of simulated frames in the test (indexed_vector and pointer_deque have a more exponential growth). Similarly to the modification tests above, we will start with 1% of all elements being erased/inserted per 3600 frames, then 5% and 10%, then move up to 1% of all elements being erased/inserted per-frame, then 5% per-frame, then 10% per-frame.

Performance results

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph
test result graph
test result graph
test result graph

As we can see from the performance results, at low levels of modification packed_deque has a small performance advantage over colony, until about 9000 elements. After 9000 elements, colony has a larger performance advantage. And the higher the level of modification, the fewer elements there have to be in the containers before colony has an advantage. At 1% modification per frame, only 200 elements are needed, while at 5% per-frame and above, colony always has a strong advantage.

Memory results

test result graph
test result graph
test result graph
test result graph
test result graph
test result graph

As we can see the memory results don't really change between different levels of modification - packed_deque always has a small advantage over colony in terms of memory usage. The memory difference could be mitigated somewhat by setting a smaller maximum group size for colony, but this will likely come at an expense of speed.

Overall Performance Conclusions

For situations where unordered, often-inserted/erased content is required, colony provides a convenient solution, while also outperforming the alternatives for the most part. Where modification rates and the number of elements are low, a packed-array-style structure like packed_deque may be your best solution both in terms of performance and memory usage. However once the numbers of elements and rates of modification begin to rise, a colony shines forth. In addition, a packed array will be affected adversely when the type is larger or non-trivially movable, due to the necessity of moving elements from the back when erasing.

Using a boolean skipfield in combination with a vector or deque is of course a no-go, due to its poor iteration performance once the number of erasures increases. Similarly, using an erasable iteration field as was used with indexed_vector and pointer_deque results in both wasteful memory consumption and poor performance once the number of modifications or elements becomes too high.

In short, use a packed-array where both the rate of modification is <= 10% of all elements per 3600 iterations over data and the number of elements is <= 9000, or if external object access to container elements via the dereferencing system is uncommon, or if all memory is at a premium. In all other cases involving unordered data, use colony.

Frequently Asked Questions

  1. TLDR, what is a colony?

    A combination of a linked-list of increasingly-large memory blocks with metadata, combined with a jump-counting skipfield, combined with an erased-element memory-location stack for later reinsertions, resulting in a std::-styled C++ template data container with positive performance characteristics while maintaining pointer stability to container elements.

  2. Where is it worth using a colony in place of other std:: containers?

    As mentioned, it is worthwhile for performance reasons in situations where the order of container elements is not important and:

    1. Insertion order is unimportant
    2. Insertions and erasures to the container occur frequently in performance-critical code, and
    3. Links to non-erased container elements may not be invalidated by insertion or erasure.

    Under these circumstances a colony will generally out-perform other std:: containers. In addition, because it never invalidates pointer references to container elements (except when the element being pointed to has been previously erased) it may make many programming tasks involving inter-relating structures in an object-oriented or modular environment much faster, and could be considered in those circumstances.

  3. What are some examples of situations where a colony might improve performance?

    Some ideal situations to use a colony: cellular/atomic simulation, persistent octrees/quadtrees, game entities or destructible objects in a video game, particle physics - anywhere objects are being created and destroyed continuously. Also, anywhere a vector of pointers to dynamically-allocated objects or a std::list would typically be used in order to preserve object references, but where order is unimportant.

  4. What situations should you explicitly not use a colony for?

    A colony should not be used as a stack, ie. erasing backwards from the back, then refilling, then erasing from the back again, etcetera. In this case you should use a stack-capable container, ie. plf::stack, std::vector or std::deque. The reason is that erasing backwards sequentially creates the greatest time complexity for skipfield updates, as does reinserting to the start of a sequentially-erased skipblock (which is what stack usage will entail). This effect is mitigated somewhat if an entire group is erased, in which case it is released to the OS and subsequent insertions will simply be pushing to the back without the need to update a skipfield, but you'd still incur the skipfield update cost during erasure. Although in practice the performance difference is small due to the low cost of the operations and the small size of the skipfield, in general you should avoid erasing sequentially during reverse iteration except where necessary, as the jump-counting skipfield format is optimized for either random erasure or sequential erasure while iterating forwards.

  5. What are the time complexities for general operations?

    Insert (single): O(1) amortised unless prior erasures have occurred in the usage lifetime of the colony instance. If prior erasures have occurred, updating the skipfield may require a memmove operation, which creates a variable time complexity depending on the range of skipfield needing to be copied (though in practice this will resolve to a singular raw memory block copy in most scenarios). This is O(random) with the range of the random number being between O(1) and O(std::numeric_limits<skipfield_type>::max() - 2). Average time complexity varies based on erasure pattern, but with a random erasure pattern it's closer to O(1) amortised.

    Insert (multiple): O(N) unless prior erasures have occurred. See Insert (single) for the rules in that case.

    Erase (single): If erasures to elements consecutive with the element being erased have not occurred, or only consecutive erasures before the element being erased have occurred, O(1) amortised. If consecutive erasures after the element being erased have occurred, updating the skipfield requires a memmove operation or vectorized update of O(n) complexity, where n is the number of consecutive erased elements after the element being erased. This is O(random) with the range of the random number being between 1 and std::numeric_limits<skipfield_type>::max() - 2. Average time complexity varies based on erasure pattern, but with a random erasure pattern it's closer to O(1) amortised.

    Erase (multiple): ~O(log N).

    std::find: O(n)

    Iterator operations:
    ++ and -- : O(1) amortized
    begin()/end(): O(1)
    advance/next/prev: between O(1) and O(n), depending on current location, end location and state of colony. Average ~O(log N).

  6. If the time complexities of the insert/erase functions are (mostly) O(random, ranged), why are they still fast?

    The time complexities are largely based on the skipfield updates necessary for the jump-counting skipfield pattern. The skipfield for each group is contiguous and separate from the skipfields for other groups, and so fits into the cache easily (unless the skipfield type is large), thus any changes to it can occur quickly - time complexity is no indicator of performance on a modern CPU for anything less than very large amounts of N (or when the type of N is large). The colony implementation uses memmove to modify the skipfield instead of iterative updates for all but one of the insert/erase operations, which decreases performance cost. memmove will typically be implemented as a single raw memory chunk copy. There is one rarer case in erase which does not use memmove, when an element is erased and is surrounded on both sides by consecutive erased elements. In this case it isn't possible to update the skipfield using memmove because the requisite numbers do not exist in the skipfield and therefore cannot be copied, so it is implemented as a vectorized iterative update instead. Again, due to a low amount of branching and the small size of each skipfield the time taken for this update is still small.

  7. Is it similar to a deque?

    A deque is reasonably dissimilar to a colony - being a double-ended queue, it requires a different internal framework. It typically uses a vector of memory blocks, whereas the colony implementation uses a linked list of memory blocks, essentially. A deque can't technically use a linked list of memory blocks because it will make some random_access iterator operations (eg. + operator) non-O(1). In addition, being a double-ended queue makes having a growth factor for memory blocks problematic because the rules for growth at each end of the queue become difficult to implement in a way which increase performance for all scenarios without memory bloat.

    A deque and colony have no comparable performance characteristics except for insertion (assuming a good deque implementation). Deque erasure performance varies wildly depending on the implementation compared to std::vector, but is generally similar to vector erasure performance. A deque invalidates pointers to subsequent container elements when erasing elements, which a colony does not.

  8. What are the thread-safe guarantees?

    Unlike a std::vector, a colony can be read from and written to at the same time (assuming different locations for read and write), however it cannot be iterated over and written to at the same time. If we look at a (non-concurrent implementation of) std::vector's threadsafe matrix to see which basic operations can occur at the same time, it reads as follows (please note push_back() is the same as insertion in this regard):

    std::vector   Insertion   Erasure   Iteration   Read
    Insertion     No          No        No          No
    Erasure       No          No        No          No
    Iteration     No          No        Yes         Yes
    Read          No          No        Yes         Yes

    In other words, multiple reads and iterations over iterators can happen simultaneously, but the potential reallocation and pointer/iterator invalidation caused by insertion/push_back and erasure means those operations cannot occur at the same time as anything else.

    Colony on the other hand does not invalidate pointers/iterators to non-erased elements during insertion and erasure, resulting in the following matrix:

    plf::colony   Insertion   Erasure   Iteration   Read
    Insertion     No          No        No          Yes
    Erasure       No          No        No          Mostly*
    Iteration     No          No        Yes         Yes
    Read          Yes         Mostly*   Yes         Yes

    * Erasures will not invalidate iterators unless the iterator points to the erased element.

    In other words, reads may occur at the same time as insertions and erasures (provided that the element being erased is not the element being read), and multiple reads and iterations may occur at the same time, but iterations may not occur at the same time as an erasure or insertion, as either of these may change the state of the skipfield which is being iterated over. Note that iterators pointing to end() may be invalidated by insertion.

    So, colony could be considered more inherently threadsafe than a (non-concurrent implementation of) std::vector, but still has some areas which would require mutexes or atomics to navigate in a multithreaded environment.

  9. Any pitfalls to watch out for?

    1. Because erased-element memory locations will be reused by insert() and emplace(), insertion position is essentially random unless no erasures have been made, or an equal number of erasures and insertions have been made.
    2. For architectural reasons (empty groups cannot be present), reserve can only reserve a number of elements up to the maximum bit-depth of the skipfield type.
  10. Am I better off storing iterators or pointers to colony elements?

    Testing so far indicates that storing pointers, then using get_iterator_from_pointer() when or if you need to erase the element being pointed to, yields better performance than storing iterators and calling erase on the iterator directly. This is simply due to the size of iterators (three pointers).

  11. Any special-case uses?

    In the special case where many, many elements are being continually erased/inserted in realtime, you might want to experiment with limiting the size of your internal memory groups in the constructor. The form of this is as follows:
    plf::colony<object> a_colony;
    a_colony.change_group_sizes(500, 5000);

    where the first number is the minimum size of the internal memory groups and the second is the maximum size. Note these can be the same size, resulting in an unchanging group size for the lifetime of the colony (unless change_group_sizes is called again or operator = is called).

    One reason to do this is that it is slightly slower to pop an element location off the internal erased-location-recycling stack than it is to insert a new element to the end of the colony (the default behaviour when there are no previously-erased elements). If there are any erased elements in the colony, the colony will recycle those memory locations, unless the entire group is empty, at which point the group is freed back to the OS. So if a group size is large and many, many erasures occur but the group is not completely emptied, (a) the number of erased element locations in the recycling stack could get large and increase memory usage, and (b) iteration performance may suffer due to large memory gaps between any two non-erased elements. In that scenario you may want to experiment with benchmarking and limiting the minimum/maximum sizes of the groups, to find the optimal size for a specific use case.

    Please note that the fill, range and initializer-list constructors can also take group size parameters, making it possible to construct filled colonies using custom group sizes.

  12. What is colony's Abstract Data Type (ADT)?

    Though I am happy to be proven wrong, I suspect colony is its own abstract data type. While it is similar to a multiset or bag, those utilize key values and are not sortable (by means other than automatically by key value). Colony does not utilize key values, is sortable, and does not provide the sort of functionality one would find in a bag (eg. counting the number of times a specific value occurs). Some have suggested similarities to deque - but as described earlier, the three core aspects of colony are:

    1. A multiple-memory-block based allocation pattern which allows for the removal of memory blocks when they become empty of elements.
    2. A skipfield to indicate erasures instead of reallocating elements, the iteration of which should typically not necessitate the use of branching code.
    3. A mechanism for recording erased element locations to allow for reuse of erased element memory space upon subsequent insertion.

    The only aspect out of these which deque also shares is a multiple-memory-block allocation pattern - not a strong association. Unlike colony, deques do not guarantee pointer validity to non-erased elements post-insertion or post-erasure. Similarly, if we look at a multiset, an unordered one could be implemented on top of a colony by utilizing a hash table (and would in fact be more efficient than most non-flat implementations), but the fact that something must be added to make it a multiset (the taking and storing of key values) means colony is not a multiset.

  13. Exception Guarantees?

    All operations which allocate memory have strong exception guarantees and will roll back if an exception occurs, except for operator = which has a basic exception guarantee (see below). For colony, iterators are bounded by asserts in debug mode, but unbounded in release mode, so it is possible for an incorrect use of an iterator (iterating past end(), for example) to trigger an out-of-bounds memory exception. These are the only two areas where exceptions can occur.

    The reason why operator = only has a basic guarantee is that it does not utilize the copy-swap idiom, as the copy-swap idiom significantly increases the chance of an exception occurring - this is because the most common way an exception tends to occur during a copy operation is due to a lack of memory space, and the copy-swap idiom doubles the memory requirement for a copy operation by constructing the copy before destructing the original data. This is doubly inappropriate in game development, which colony was initially developed for, where memory constraints are often critical and the runtime lag from memory spikes can cause detrimental game performance. So in the case of colony, if a non-memory-allocation-based exception occurs during copy, the = operators will have already destructed their data, so the containers will be empty and cannot roll back - hence they have a basic guarantee, not a strong guarantee.

  14. Iterators not following rule of zero?

    It was found under GCC and clang that the high-modification scenario benchmarks suffered an 11% overall performance loss when iterator copy and move constructors/operators were not explicitly defined, possibly because GCC did not know to vectorise the operations without explicit code, or because the compiler implemented the copy/swap idiom instead of simply copying directly. No performance gain was found under any tests when removing the copy and move constructors/operators.

  15. Why must groups be removed when empty?

    Two reasons:

    1. Standards compliance: if groups aren't removed then ++ and -- iterator operations become O(random) in terms of time complexity, making them non-compliant with the C++ standard. At the moment they are O(1) amortised: typically one update for both skipfield and element pointers, but two if a skipfield jump takes the iterator beyond the bounds of the current group and into the next group. But if empty groups were allowed, there could be anywhere between 1 and size_type empty groups between the current element and the next. Essentially you get the same scenario as you do when iterating over a boolean skipfield.
    2. Performance: iterating over empty groups is slower, cache-wise, than if they are not present - and if you have to allow for empty groups while iterating, you have to include a while loop in every iteration operation, increasing cache misses and code size. Lastly, pushing to the back of a colony is faster than recycling memory locations. When a group is removed, its recyclable memory locations are also removed from the stack, hence subsequent insertions are more likely to be pushed to the back of the colony.
  16. Why use a stack instead of a free list for erased location storage?

    Two reasons:

    1. Colony elements are not guaranteed to be of sufficient size (32-bit on x86, 64-bit or greater on other architectures) to store a pointer. For smaller types and 64-bit pointers, a free list union would end up wasting more space than a 16-bit skipfield. While the use-cases in games indicate larger-than-scalar types, there has been some interest and indication of usefulness from the high-performance-computing and storage domains, and I cannot predict their data patterns or usage.
    2. A free list will likely incur a significant number of cache misses and subsequent jitter when a colony memory block is freed to the OS. With a stack the erased locations are in contiguous memory space, so it is extremely fast and cache-friendly to process the stack and remove erased locations belonging to the memory block being freed. On the other hand with a free list, removing the erased locations involves iterating through the list - which will most likely skip randomly throughout the colony blocks, involving many cache misses - in order to locate the nodes belonging to the erased memory block and joining previous nodes to the next non-erased-block locations.

    In short, it's probably not an effective solution in this case.
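The cache-friendliness argument in point 2 can be sketched as follows, with illustrative stand-in types for colony's internals: because the erased-location stack is contiguous, purging the entries that belong to a freed memory block is a single linear pass with no pointer-chasing.

```cpp
#include <algorithm>
#include <vector>

// Remove from the contiguous erased-location stack every location that falls
// inside the memory block being freed - one cache-friendly O(n) sweep.
void purge_freed_block(std::vector<int *> &erased_locations,
                       const int *block_begin, const int *block_end)
{
    erased_locations.erase(
        std::remove_if(erased_locations.begin(), erased_locations.end(),
                       [&](const int *location)
                       { return location >= block_begin && location < block_end; }),
        erased_locations.end());
}
```

A free list storing the same information would require walking node-to-node across arbitrary colony blocks to find and unlink the affected entries, which is the cache-miss cost the answer above describes.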

  17. Why use three pointers for iterators rather than a group pointer, single index and then dereferencing?

    It's faster. In attempting on two separate occasions to switch to an index-based iterator approach, utilizing three separate pointers was notably faster.

  18. Group sizes - what are they based on, how do they expand, etc

    Group sizes start from either the default minimum size (8 elements, larger if the type stored is small) or an amount defined by the programmer (with a minimum of 3 elements). Subsequent group sizes then increase the total capacity of the colony by a factor of 2 (so, 1st group 8 elements, 2nd 8 elements, 3rd 16 elements, 4th 32 elements, etcetera) until the maximum group size is reached. The default maximum group size is the maximum possible number that the skipfield bitdepth is capable of representing (std::numeric_limits<skipfield_type>::max()). By default the skipfield bitdepth is 16, so the maximum size of a group is 65535 elements. However the skipfield bitdepth is also a template parameter which can be set to any unsigned integer type - unsigned char, unsigned int, std::uint64_t, etc. Unsigned short (guaranteed to be at least 16 bits, equivalent to C++11's std::uint_least16_t type) was found to have the best performance in real-world testing, due to the balance between memory contiguousness, memory waste and the restriction on skipfield update time complexity. This also initially fitted the use-case of gaming better (as games tend to utilize lower numbers of elements than some other fields), and that was the primary development field at the time.

  19. Why store a size member for each group when this can be obtained via reinterpret_cast<element_pointer_type>(skipfield) - elements?

    Because it's faster. While it can be obtained that way, having it precalculated gives a small but significant benefit. And it's only an additional 16 bits per group in most cases.



Version history

plf:: library and this page Copyright (c) 2017, Matthew Bentley