
PLF C++ Library - Core2 Benchmarks

Main benchmarks last updated 8-2-2017 with colony v3.87. plf::list auxiliary benchmarks added 11-03-2018.

  1. Test setup
  2. Test design
  3. Raw performance benchmarks (against standard containers)
  4. Comparative performance benchmarks (against modified standard containers)
  5. Low-modification scenario test
  6. High-modification scenario test
  7. Referencing scenario test
  8. List-specific tests
  9. Overall performance conclusions

Test machine setup

The test setup is an Intel E8500, 8GB ram, running GCC 5.1 x64 as compiler. The OS is a stripped-back Windows 7 x64 SP1 installation with most services and background tasks (including explorer) disabled, and all networking disabled. Build settings are "-O2 -march=native -std=c++11 -fomit-frame-pointer".

The source code for the benchmarks can be found in the colony page downloads section.

General test design

The first (roughly) 10% of all test runs are discarded in order to 'warm up' the cache and get statistically-meaningful results. Tests are based on a sliding scale of number of runs vs number of elements, so a test with only 10 elements in a container may average 100000 runs to guarantee a more stable result average, whereas a test with 100000 elements may only average 10 runs. This tends to give adequate results without overly lengthening test times. I have not included results involving 'reserve()' as yet.

Insertion: into empty containers for the raw and comparative tests, inserting single elements at a time. For the 'scenario' tests there is also ongoing insertion at random intervals. This matches the use case of colony, where on-the-fly insertion is expected to be one of the driving factors of usage. Insertion is always at the most performant location for the specific container, for example the front for list, or the back for vector.
Erasure: initially takes place in an iterative fashion for the raw tests, erasing elements at random as we iterate through the container. The exception is the tests involving a remove_if pattern (pointer_deque and indexed_vector), which perform the actual removal in a secondary pass.
Iteration: straightforward iteration from the start to the end of each container. There are typically more runs for iteration than for the other tests, because iteration is a much quicker procedure and more runs deliver a more stable average.

Raw performance tests

Before we begin measuring colony against containers or container modifications which do not invalidate links on erasure or insertion, we need to identify which containers are good candidates for comparison based on raw performance, without regard to linkage invalidation. With that in mind, the following tests compare colony against the main standard library containers. Tests are carried out on the following types: (a) an 8-bit type i.e. char, (b) a 32-bit type i.e. int, (c) a 64-bit type i.e. double, (d) a small struct containing two pointers and four scalar types, and (e) a large struct containing two pointers, four scalar types, a large array of ints and a small array of chars.

The first test measures the time taken to insert N elements into a given container, the second measures the time taken to erase 25% of those same elements from the container, and the third measures iteration performance after that erasure has taken place. The erasure tests avoid the remove_if pattern for the moment, to show standard random-location erasure performance more clearly (the remove_if pattern is explored in the comparative tests later). Both linear and logarithmic views of each benchmark are provided in order to better show the performance at lower element counts.

Insertion Performance

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph
test result graph
test result graph

Insertion shows a predictable pattern for all but the large struct test:
std::deque dominates insertion, with plf::colony roughly equal after about 100 elements; for large structs, however, colony's performance completely eclipses std::deque's. This is because libstdc++'s implementation of deque caps the memory block size at 512 bytes, whereas the large struct in question is ~506 bytes (depending on platform), meaning deque essentially becomes a std::list with additional overheads. Colony avoids this downfall due to its memory allocation pattern, which bases memory block sizes on fixed numbers of elements with a growth factor of 2, not fixed numbers of bytes. std::vector is nearly on a par with std::deque for very small element types with element counts greater than a thousand, but becomes progressively worse the larger the stored type is and the fewer elements there are.
std::list, std::map and std::multiset all perform poorly by contrast, with the exception of large structs, where std::list comes in 2nd place to colony. Overall, plf::colony and std::deque dominate.
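
To make the block-sizing point concrete, libstdc++ calculates its elements-per-block roughly as follows (512 bytes is its default buffer size; the struct sizes shown here are purely illustrative):

#include <cstddef>
#include <cstdio>

// Rough equivalent of libstdc++'s elements-per-block calculation for deque.
constexpr std::size_t deque_block_elements(std::size_t type_size)
{
    return type_size < 512 ? 512 / type_size : 1;
}

int main()
{
    std::printf("int (4 bytes): %zu elements per block\n", deque_block_elements(4));     // 128
    std::printf("small struct (~48 bytes): %zu elements per block\n", deque_block_elements(48));  // 10
    std::printf("large struct (~506 bytes): %zu elements per block\n", deque_block_elements(506)); // 1 - one allocation per element, much like std::list
}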

Erase Performance

Here we forward-iterate over each container and erase 25% of all elements at random. If (due to the variability of random number generators) 25% of all elements have not been erased by the end of the container iteration, the test will reverse-iterate through the container and randomly erase the remaining necessary number of elements until that 25% has been reached.
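
The forward-iteration pass looks roughly like the following sketch (not the benchmark's exact code; the 1-in-4 odds and function name are illustrative):

#include <random>

// Erase roughly 25% of elements at random while forward-iterating.
template <class Container>
void erase_quarter_at_random(Container &container, std::mt19937 &rng)
{
    std::uniform_int_distribution<int> chance(1, 4); // ~1-in-4 odds per element

    for (auto it = container.begin(); it != container.end();)
    {
        if (chance(rng) == 1)
            it = container.erase(it); // erase() returns the next valid iterator
        else
            ++it;
    }

    // The real test then reverse-iterates, erasing further elements at random,
    // until exactly 25% of the original element count has been removed.
}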

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph
test result graph
test result graph

Across all types plf::colony dominates performance, with std::list coming close behind. std::deque and std::vector have predictably poor performance as a remove_if pattern is not being used, as much as 100000x worse than plf::colony and std::list for large numbers of large types.

Post-erasure Iteration Performance

Since data is typically iterated over more often than it is erased or inserted, iteration speed is, for many fields, more important than erase or insertion performance, despite the fact that iteration almost always takes orders of magnitude less time than either of those operations.

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph
test result graph
test result graph

std::vector and std::deque come in first and second place for most types, with colony in third place - the only place where this does not occur is for large structs, where colony dominates std::deque after approximately 20000 elements are inserted. Once again this is due to the structure of deque as explained in the insertion conclusion above.

For under 1000 elements, std::list is about on par with std::deque and std::vector, which dominate these tests with std::vector taking 1st place. However, the number of elements below which std::list stays competitive decreases as the stored type gets larger, suggesting that its performance in this case is due to some effect of the CPU cache or the implementation. Querying the GCC mailing list about this resulted in the following response, which I believe to be accurate given the correlation between std::list iteration performance and type size: "I suspect that for working sets which fit into the first-level cache of the CPU, the simpler iterators for std::list are faster than std::deque because the additional memory accesses in std::list are cheaper than the fairly complicated iterator implementation in std::deque". What this suggests is that for typical programs, where more than one data set is competing for space in the L1 or L2 caches, std::list performance will not follow the pattern above and will generally be poor.

Raw tests Conclusion

From the above data we can see that std::list is not a good contender against plf::colony: it has weaker insertion and erase performance, and the only scenario in which it has good iteration performance is where (a) the amount of data in the container is small enough to fit entirely into the cache, and (b) that data set is the only data set being operated on by the program in question, and indeed by the computer as a whole. That being a relatively uncommon case, std::list is not a general contender.

std::deque is a contender, having strong insertion performance and excellent iteration performance, but poor non-remove_if erasure performance. However, std::deque invalidates pointers upon erasure, meaning it requires modification to be used in a way comparable to colony. std::vector is a slightly weaker contender, having weak insertion performance and worse non-remove_if erasure performance than std::deque; however its iteration performance is always the best, as it is entirely contiguous in memory, unlike deque, which is only partially contiguous. std::vector invalidates pointers on both insertion and erasure, meaning it will also require modification to compare to colony.

Comparative performance tests

Colony is primarily designed for scenarios where good insertion/erasure/iteration performance is required while guaranteeing link stability for outside objects referring to elements within the container, and where ordered insertion is unimportant. The two containers from the raw performance tests which may compare in both performance and usage (after modification) are std::deque and std::vector. std::list does not meet these requirements, as it has poor insertion and iteration performance.

pointer_deque and indexed_vector

Because std::deque does not invalidate pointers to existing elements upon insertion to the back or front, we can guarantee that pointers won't be invalidated during unordered insertion. This means we can use a modification called a 'pointer-to-deque deque', or pointer_deque. Here we take a deque of elements and construct a secondary deque containing pointers to each element in the first deque. The second deque functions as an erasable iteration field for the first deque i.e. when we erase, we only erase from the pointer deque, and when we iterate, we iterate over the pointer deque and access only those elements it points to. In doing so we reduce erase times for larger-than-scalar types, as it is computationally cheaper to reallocate pointers (upon erasure) than larger structs. By doing this we avoid reallocation within the element deque during erasure, meaning pointers to elements within it stay valid.
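
A minimal sketch of the idea, using an illustrative 'Entity' element type (this is not the benchmark's actual pointer_deque code):

#include <deque>

struct Entity { float x, y; int health; };

struct pointer_deque
{
    std::deque<Entity>   elements; // never erased from - pointers to elements stay valid
    std::deque<Entity *> field;    // erasable iteration field

    void insert(const Entity &e)
    {
        elements.push_back(e);             // back insertion doesn't invalidate element pointers
        field.push_back(&elements.back());
    }

    // Erasure only removes the (cheap-to-shift) pointer, never the element itself:
    void erase(std::deque<Entity *>::iterator it) { field.erase(it); }

    template <class Function>
    void for_each(Function f)
    {
        for (Entity *e : field) // iterate the pointer field, access elements through it
            f(*e);
    }
};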

We cannot employ quite the same technique with std::vector, because it reallocates during insertion to the back once capacity is reached. But since indexes stay valid regardless of whether a vector reallocates, we can employ a similar tactic using indexes instead of pointers, which we'll call an indexed_vector. In this case we use a secondary vector of indexes to iterate over the vector of elements, and only erase from the vector of indexes. This strategy has the advantage of potentially lower memory usage, as the bitdepth of the indexes can be reduced to match the maximum known number of elements, but it loses a small amount of performance due to the pointer addition necessary to use indexes instead of pointers. In addition, outside objects referring to elements within the indexed_vector must use indexes instead of pointers, which means the outside object must know both the index and the container it indexes into; a pointer approach can ignore this and simply point to the element in question.
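
The equivalent sketch for indexed_vector simply replaces the pointers with indexes (again, illustrative rather than the benchmark's code):

#include <cstdint>
#include <vector>

struct Entity { float x, y; int health; };

struct indexed_vector
{
    std::vector<Entity>        elements; // may reallocate on insertion - indexes survive, pointers wouldn't
    std::vector<std::uint32_t> field;    // erasable iteration field of indexes

    void insert(const Entity &e)
    {
        elements.push_back(e);
        field.push_back(static_cast<std::uint32_t>(elements.size() - 1));
    }

    void erase(std::vector<std::uint32_t>::iterator it) { field.erase(it); }

    template <class Function>
    void for_each(Function f)
    {
        for (const std::uint32_t index : field) // one extra addition per access: base + index
            f(elements[index]);
    }
};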

We will also compare these two container modifications using a remove_if pattern for erasure versus regular erasure, by adding an additional boolean field (indicating erasure) to the original stored struct type and using two passes: the first randomly flags elements as ready for erasure via the boolean field, the second removes them using the remove_if pattern.
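
Applied to the pointer_deque variant, that two-pass pattern looks roughly like this (names and odds are illustrative):

#include <algorithm>
#include <deque>
#include <random>

struct FlaggedEntity { float x, y; int health; bool erased = false; };

void two_pass_erase(std::deque<FlaggedEntity *> &field, std::mt19937 &rng)
{
    std::uniform_int_distribution<int> chance(1, 4);

    // Pass 1: randomly flag roughly 25% of elements for erasure via the boolean field.
    for (FlaggedEntity *e : field)
        if (chance(rng) == 1)
            e->erased = true;

    // Pass 2: compact the iteration field in a single remove_if sweep.
    field.erase(std::remove_if(field.begin(), field.end(),
                               [](const FlaggedEntity *e) { return e->erased; }),
                field.end());
}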

vector_bool and deque_bool

A second modification approach, which we'll call a vector_bool, is a very common approach in a lot of game engines - a bool or similar type is added to the original struct or class, and this field is tested against to see whether or not the object is 'active' (true) - if inactive (false), it is skipped over. We will also compare this approach using a deque.
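
A sketch of that pattern (member names are illustrative):

#include <vector>

struct GameObject { float x, y; bool active = true; };

void update_all(std::vector<GameObject> &objects)
{
    for (GameObject &o : objects)
    {
        if (!o.active)  // erased objects are merely flagged inactive and skipped -
            continue;   // this per-element branch is what hurts iteration as erasures accumulate
        o.x += 1.0f;    // ...per-object processing...
    }
}

// 'Erasure' is then simply: object.active = false;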

packed_deque

packed_deque is an implementation of a 'packed array' as described in the motivation section earlier, but using deques instead of vectors or arrays. As we've seen in the raw performance benchmarks, (GCC) libstdc++'s deque is almost as fast as vector for iteration, but about twice as fast for back insertion and random-location erasure. It also doesn't invalidate pointers to elements upon insertion. These things become important when designing a container which is meant to handle large numbers of insertions and random-location erasures. Although in the case of a packed array random-location erasures don't really happen - the 'erased' elements just get replaced with elements from the back - erasure speed is not as critical, but insertion speed is, as insertion will always consume significantly more CPU time than iteration.

With that in mind my implementation uses two std::deque's internally: one containing structs which package together the container's element type and a pointer, and one containing pointers to each of the 'package' structs in the first deque. The latter is what is used by the container's 'handle' class to enable external objects to refer to container elements. The pointer in the package itself in turn points to the package's corresponding 'handle' pointer in the second deque. This enables the container to update the handle pointer when and if a package is moved from the back upon an erasure.

Anyone familiar with packed array-style implementations can skip this paragraph. For anyone who isn't, this is how it works when an element is erased from packed_deque, unless the element in question is already at the back of the deque. It:

  1. Uses the pointer within the package to find the 'handle' pointer which pointed to the erased element, and adds it to a free list.
  2. Moves the package at the back of the container to the location of the package containing the element being erased.
  3. Uses the pointer in the package which has just been moved to update the corresponding handle pointer, to correctly point to the package's new location.
  4. Pops the back package off the first deque (it should be safe to destruct after the move - if it's not, the element's implementation is broken).

In this way the data in the first deque stays contiguous and is hence fast to iterate over, and any handles referring to the back element which got moved stay valid after the erasure.
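
Below is a heavily simplified sketch of that mechanism. It is not plf::packed_deque's actual code - the 'package', 'handle' and free-list names are mine, and allocator and edge-case details are ignored - but it shows the two-deque layout and the four erasure steps described above:

#include <deque>
#include <utility>
#include <vector>

template <class T>
class packed_deque_sketch
{
    struct package
    {
        T element;
        package **handle_slot; // points back at this package's entry in 'handles'
    };

    std::deque<package>     packages;  // packed element storage, iterated directly
    std::deque<package *>   handles;   // external handles point at entries in here
    std::vector<package **> free_list; // handle slots available for reuse

public:
    using handle = package **; // what external objects store

    handle insert(const T &value)
    {
        packages.push_back({value, nullptr}); // back insertion keeps existing pointers valid

        handle h;
        if (!free_list.empty())
        {
            h = free_list.back();
            free_list.pop_back();
            *h = &packages.back();
        }
        else
        {
            handles.push_back(&packages.back());
            h = &handles.back();
        }

        packages.back().handle_slot = h;
        return h;
    }

    void erase(handle h)
    {
        package *victim = *h;
        free_list.push_back(h);                   // 1. recycle the erased element's handle slot

        if (victim != &packages.back())
        {
            *victim = std::move(packages.back()); // 2. move the back package into the gap
            *(victim->handle_slot) = victim;      // 3. repoint the moved package's handle
        }
        packages.pop_back();                      // 4. pop the (now moved-from) back package
    }

    T &get(handle h) { return (*h)->element; }    // external access: a double dereference
};

External code stores a handle (a pointer into the second deque) and accesses elements via get(), which is the double dereference examined in the referencing tests further below.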

This implementation will not work well under MSVC as MSVC's deque implementation performs badly.

Tests

Since neither indexed_vector nor pointer_deque will have erasure time benefits for small scalar types, and because game development is predominantly concerned with storage of larger-than-scalar types, we will only test using small structs from this point onwards. In addition, we will test 4 levels of erasure and subsequent iteration performance: 0% of all elements, 25% of all elements, 50% of all elements, and 75% of all elements.

Insertion

Click images or hover over to see results at linear scale instead

test result graph

For insertion plf::colony outperforms the others for greater than 100 and less than 20000 elements. Below 100 elements it is outperformed by pointer_deque and deque_bool, and above 20000 elements it is outperformed by deque_bool. packed_deque consistently comes 4th, and both vector methods perform poorly by contrast.

Erasure

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph

Here the gap consistently widens between the candidates as erasure percentage increases. The two boolean skipfield methods obviously dominate performance, being the easiest and fastest to implement in terms of erasure. Above 25% erasure both of the remove_if variants outperform the others, with packed_deque and colony criss-crossing each other in terms of performance. The non-remove_if variants of pointer_deque and indexed_vector of course perform poorly.

Post-erasure Iteration

Click images or hover over to see results at linear scale instead

test result graph

Colony performs the worst of the lot for iteration with zero erasures, except that deque_bool comes in slightly worse for large numbers of elements. Unsurprisingly, packed_deque performs the best of the lot, as this constitutes pure contiguous iteration with no skipfield or iteration field. While close, the pointer_deque approach has a slight performance advantage over the indexed_vector.

test result graph

Now we begin to see the advantage of a jump-counting skipfield over a boolean skipfield: boolean skipfields require branching code to iterate over, and 25% of all elements being erased at random represents a large number of mispredicted branches and cache misses. Other than that, the results are much the same as in the 0% test.
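
To illustrate the difference - this is not colony's actual skipfield code, which is considerably more involved, just a sketch of the iteration cost under each approach:

#include <cstddef>
#include <vector>

// Boolean skipfield: one data-dependent branch per element, unpredictable once erasures are random.
int sum_boolean(const std::vector<int> &elements, const std::vector<char> &erased)
{
    int total = 0;
    for (std::size_t i = 0; i != elements.size(); ++i)
    {
        if (erased[i])  // frequently mispredicted at 25-75% random erasure
            continue;
        total += elements[i];
    }
    return total;
}

// Jump-counting skipfield (simplified): skipfield[i] holds the length of the erased run
// starting at i (0 if element i is not erased), with one trailing 0 as an end sentinel,
// so whole erased runs are skipped with an addition rather than a branch per element.
int sum_jump_counting(const std::vector<int> &elements, const std::vector<std::size_t> &skipfield)
{
    int total = 0;
    for (std::size_t i = skipfield[0]; i < elements.size(); i += 1 + skipfield[i + 1])
        total += elements[i];
    return total;
}

The boolean version pays a branch per element, while the jump-counting version skips whole erased runs with an addition - which is why its iteration time stays comparatively flat as the erasure percentage rises.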

test result graph

At 50% randomised erasures, a CPU's branch prediction cannot work at all, and so the boolean approaches degrade significantly.

test result graph

At this point we can clearly see that the boolean approaches are not useful in terms of iteration.

Summary: for iteration packed_deque comes 1st, pointer_deque 2nd, indexed_vector 3rd, colony 4th, vector_bool 5th and deque_bool 6th.

'Real-world' scenario testing - low modification

While testing iteration, erasure and insertion separately can be a useful tool, they don't tell us how containers perform under real-world conditions, as under most use-cases we will not simply be inserting a bunch of elements, erasing some of them, then iterating once. To get more valid results, we need to think about the kinds of use-cases we may have for different types of data, in this case, video-game data.

In this test we simulate the use case of a container for general video game objects, actors/entities, enemies etc. Initially we insert a number of small structs into the container, then simulate in-game 'frames'. Every frame we iterate over the container's elements, and for every minute of game time (i.e. 3600 frames, assuming 60 frames per second) we erase (at random locations) or insert 1% of the original number of elements. We measure the total time taken to simulate this scenario for 108000 frames (half an hour of simulated game time), as well as the amount of memory used by the container at the end of the test. We then re-test this scenario with 5% of all elements being inserted/erased, then 10%.
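
As a rough sketch of the per-frame procedure (the real benchmark code is in the colony page downloads section; here std::vector and an illustrative 'Small' struct stand in for whichever container and element type is under test):

#include <cstddef>
#include <random>
#include <vector>

struct Small { int value; /* the real test uses a struct with two pointers and four scalars */ };

unsigned long run_low_modification_scenario(std::vector<Small> &container, std::mt19937 &rng,
                                            std::size_t original_size, unsigned int frames,
                                            double percent)
{
    const std::size_t per_minute = static_cast<std::size_t>(original_size * (percent / 100.0));
    unsigned long total = 0;

    for (unsigned int frame = 0; frame != frames; ++frame)
    {
        for (const Small &s : container)   // per-frame iteration over all elements
            total += static_cast<unsigned long>(s.value);

        if (frame % 3600 == 0)             // once per simulated minute of game time:
        {
            for (std::size_t i = 0; i != per_minute && !container.empty(); ++i)
            {
                std::uniform_int_distribution<std::size_t> pick(0, container.size() - 1);
                container.erase(container.begin() + pick(rng)); // random-location erasure
            }
            for (std::size_t i = 0; i != per_minute; ++i)
                container.push_back(Small{static_cast<int>(rng() & 255)}); // ongoing insertion
        }
    }
    return total;
}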

With some reluctance I have also included the results for std::list in this test and the high-modification tests, despite the fact that the earlier 'raw' performance tests show it is not a contender for colony, at least in the same problem domain. This is because some people were not able to relate the raw performance test outcomes to the expected outcomes of subsequent tests. By contrast there is even less point in including std::map/std::multiset in these tests, as their raw iteration, erasure and insertion results were much worse than those of the other containers.

Click images or hover over to see results at linear scale instead

Performance results

test result graph
test result graph
test result graph

Memory results

test result graph
test result graph
test result graph

'Real-world' scenario testing - high modification

Same as the previous test but here we erase/insert 1% of all elements per-frame instead of per 3600 frames, then once again increase the percentage to 5% then 10% per-frame. This simulates the use-case of continuously-changing data, for example video game bullets, decals, quadtree/octree nodes, cellular/atomic simulation or weather simulation.

Performance results

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph

Memory results

test result graph
test result graph
test result graph

Obviously if you are doing a lot of referencing into a packed_deque from outside objects, you may encounter some speed problems due to the dereferencing system - which the next test will cover. On the other hand, if you are mainly iterating over unordered data, erasing occasionally, and only ever referring outwards from elements within the packed_deque, it will be an ideal solution.

'Real-world' scenario-testing: referencing with interlinked containers

In order to completely test plf::colony against a packed-array-style implementation, it is necessary to measure performance while exploiting both containers' linking mechanisms - for colony this is pointers or iterators, for a packed array it is a handle, or more specifically a pointer to a pointer (or the equivalent index-based solution). Because that additional dereferencing is a performance loss and a potential cache miss, any test which involves a large amount of inter-linking between elements in multiple containers should lose a small amount of performance when using a packed_deque instead of a colony. Since games typically have high levels of interlinking between elements in multiple containers (as described in the motivation section), this is relevant to performance concerns.

Consequently, this test utilizes four instances of the same container type, each containing different element types:

  1. A 'collisions' container (which could represent collision rectangles within a quadtree/octree/grid/etc)
  2. An 'entities' container (which could represent general game objects), and
  3. Two subcomponent containers (these could be sprites, sounds, or anything else).
test class diagram

This is actually a very low number of inter-related containers for a game, but we're reducing the number of components in this case just to simplify the test. Elements in the 'entities' container link to elements in all three of the other containers. In turn, the elements in the collisions container link back to the corresponding elements in the entities container. The subcomponent containers do not link to anything.
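
As a sketch of the linkage involved, in the pointer-linked colony variant (member names are illustrative):

struct entity; // forward declaration for the link back from collision blocks

struct subcomponent { int data; };

struct collision_block
{
    entity *owner;   // link back into the 'entities' container
    int x, y, w, h;
};

struct entity
{
    collision_block *collision; // link into the 'collisions' container
    subcomponent    *sub1;      // links into the two
    subcomponent    *sub2;      //   subcomponent containers
    int data;
};

// The containers themselves would be plf::colony<entity>, plf::colony<collision_block> and
// two plf::colony<subcomponent> instances - colony guarantees these pointers stay valid
// across insertion and erasure. The packed_deque variant stores handles instead of raw
// pointers, adding an extra dereference (and potential cache miss) per access.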

In the test, elements are first added to all four containers and interlinked (as this is a simplified test, there's a one-to-one ratio of entity elements to 'collision' and subcomponent elements). The core test process after this point is similar to the modification scenarios tested earlier: we iterate over the entity container once every frame, adding a number from both the entities and subcomponents to a total. We also erase a percentage of entities per-frame (and their corresponding subcomponents and collision blocks) - similar to the earlier tests.

However during each frame we also iterate over the 'collisions' container and erase a percentage of those elements (and their corresponding entities and subcomponents) at random as well. This could be seen as simulating entities being removed from the game based on collisions occurring, but is mainly there to test the performance effects of accessing the subcomponents via a chain of handles versus a chain of pointers. Then, again during each frame, we re-add a certain number of entities, collision blocks and subcomponents back into the containers based on the supplied percentage value. Since both colony and packed_deque essentially re-use erased-element memory locations, this tests the efficacy of each container's mechanism for doing so (packed_deque's move-and-pop + handle free-list, versus colony's stack + skipfield).

Since neither container will grow substantially in memory usage over time as a result of this process, a longer test length is not necessary, unlike the earlier modification scenario-testing with indexed_vector and pointer_deque. Testing on both plf::colony and plf::packed_deque showed that their results increase linearly with the number of simulated frames in the test (indexed_vector and pointer_deque show more exponential growth). As in the modification tests above, we start with 1% of all elements being erased/inserted per 3600 frames, then 5% and 10%, then move up to 1% of all elements being erased/inserted per frame, then 5% per frame, then 10% per frame.

Performance results

Click images or hover over to see results at linear scale instead

test result graph
test result graph
test result graph
test result graph
test result graph
test result graph

As we can see from the performance results, at low levels of modification packed_deque has a small performance advantage over colony, until about 9000 elements. After 9000 elements, colony has a larger performance advantage. And the higher the level of modification, the fewer elements there have to be in the containers before colony has an advantage. At 1% modification per frame, only 200 elements are needed, while at 5% per-frame and above, colony always has a strong advantage.

Memory results

test result graph
test result graph
test result graph
test result graph
test result graph
test result graph

As we can see the memory results don't really change between different levels of modification - packed_deque always has a small advantage over colony in terms of memory usage. The memory difference could be mitigated somewhat by setting a smaller maximum group size for colony, but this will likely come at an expense of speed.

List-specific tests

Here we take a look at some results relating specifically to plf::list versus std::list, namely reverse, remove, unique, clear and destruction. Results for remove_if should be the same as the results for remove. All benchmarks insert a number of ints with random values, iterate over the list erasing some percentage of elements at random, then perform the function in question. The reason for randomly erasing some of the elements first is that, for some of these functions, element order is irrelevant to the processing (e.g. reverse merely flips the previous and next pointers on each node), and plf::list takes advantage of this by iterating linearly over its memory blocks and skipping erased elements. So it is important to know how having some erased elements affects the performance results.
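
As a simplified illustration of that block-linear approach, here is roughly how a reverse over blocks of nodes can work - this is not plf::list's actual node or block layout, just the idea:

#include <utility>
#include <vector>

struct node
{
    node *next;
    node *prev;
    int   value;
    bool  erased; // in this sketch erased nodes are simply flagged and skipped
};

struct node_block
{
    std::vector<node> nodes; // one contiguous memory block of nodes
};

// Since element order is irrelevant to reverse, the blocks can be walked linearly
// (cache-friendly) rather than following next-pointers around memory, swapping the
// previous/next pointers of each non-erased node as we go.
void block_linear_reverse(std::vector<node_block> &blocks)
{
    for (node_block &block : blocks)
        for (node &n : block.nodes)
            if (!n.erased)
                std::swap(n.next, n.prev);
}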

test result graph

Below 500 elements, std::list performs about the same as plf::list. After 500 elements there is an increasing benefit to plf::list. On average plf::list is 217% faster than std::list, and for the highest number of elements is 807% faster.

test result graph

For remove/remove_if, plf::list is consistently ~100% faster than std::list.

test result graph

On average, plf::list is 73% faster than std::list for unique.

test result graph

For clear(), plf::list is between 147000% and 1072000% faster, depending on how many elements are in the list. Unlike std::list, plf::list's clear() does not free its allocated memory blocks, but instead reserves them for future insertions. It should be noted that because the stored type in question (int) is trivially destructible, individual processing of nodes is avoided in this context, at least for plf::list. The results may be different if non-trivially-destructible types and/or pointers are utilized.

test result graph

Unlike clear(), list destruction does of course free the allocated memory blocks for plf::list. Regardless, plf::list is on average 3700% faster than std::list for destruction. In this context both lists are unchanged after insertion. Let's see how those results alter once we sort the list elements, thereby disrupting the original linear order of elements in memory:

test result graph

Essentially plf::list's results don't change much. However std::list's result for the largest number of elements increases by almost 100%: on average std::list goes from 3700% slower than plf::list to 5700% slower, and for the highest number of elements the difference goes from 1300% slower to 5800% slower.

Overall Performance Conclusions for colony benchmarks

For situations where unordered, often-inserted/erased content is required, colony provides a convenient solution, while also outperforming the alternatives for the most part. Where modification rates and the number of elements are low, a packed-array-style structure like packed_deque may be your best solution both in terms of performance and memory usage. However once the numbers of elements and rates of modification begin to rise, a colony shines forth. In addition, a packed array will be affected adversely when the type is larger or non-trivially movable, due to the necessity of moving elements from the back when erasing.

Using a boolean skipfield in combination with a vector or deque is of course a no-go, due to its poor iteration performance once the number of erasures increases. Similarly, using an erasable iteration field as was done with indexed_vector and pointer_deque results in both wasteful memory consumption and poor performance once the number of modifications or elements becomes too high.

In short, use a packed-array where both the rate of modification is <= 10% of all elements per 3600 iterations over data and the number of elements is <= 9000, or if external object access to container elements via the dereferencing system is uncommon, or if all memory is at a premium. In all other cases involving unordered data, use colony.

plf:: library and this page Copyright (c) 2017, Matthew Bentley