PLF C++ Library - Blog

23-09-2021 - The mistake in human thinking

Where do we go wrong with our thinking about categories? I like to think of the classic example of the lump of wood and the chair. Take a lump of wood and by degrees gradually transform it into a chair. At what point does it become a chair? Aristotelian logic does not have an answer. Fuzzy logic, sort of, does. By degrees, it is, a chair - that is to say at a certain point it maps 40% to what most of us would consider a chair, based on what our brains input of what a 'chair' is (based on examples we have been given) and our own personal aggregate experience of chairs. Then later, 55% a chair, and so on and suchforth. The question is wrong: it's not, "when is it a chair?" or worse, "when do we all agree it's a chair?" - it's "to what extent does it map to our concept of a chair?". The reality is, there's just stuff, and stuff happening. There's a particular configuration of atoms, and we map that to the concept 'chair' based on what is valuable to us. If a chair didn't have any utility to us (be that function or beauty (beauty being an attraction to what is functional to us in some manner, biologically-speakin')) we wouldn't bother to name it, most likely.

A chair doesn't exist, our concept of chair utility exists, and we map that onto our environment. The question of 'what is art?' follows the same guidelines, as art has multiple utilities that are not consistent from 'artwork' to 'artwork'.

So now we get onto the programming side of things. Classical polymorphism is an extension of this original misunderstanding - class chair, subclass throne, subclass beanbag etc. At what point is it not 'of class chair'? That's the wrong question. Map to the utility. And since things rarely have less than two modes of utility, this is where data-oriented design steps in. I'm thinking specifically of ECS design in this case. What properties does this object have? Yes it's a chair, but is it breakable? Is it flammable? What are the modes of utility we are mapping for in any given project? After that point, apply whatever label you want. Whereas polymorphism over-reaches and falsely assigns the importance of the label itself as being comparable to the utility of the object.

It turns out this is a far more useful way of thinking about things, albiet one that requires more care and attention to specifics. One that matches our biology, not the aristotelian abstraction that's been ladled on top of that for millenia. For starters it's much easier to extract the utility, and secondly it's a more efficient way of -thinking- about the objects. We are just a collection of atoms. Programs are just a collection of bytes. Map to that.

08-09-2021 - Colony != Hive

To be clear, I'm still running with hive for the standards committee proposal, however I realised that the proposal is significantly divergent from the colony implementation, so I've setup hive as it's own container, separate from colony. The differences are as follows:

The result is a container which (currently as of time of writing) only runs under GCC10, until MSVC and clang get their act together regards C++20 support, and which is 24kb lighter. Hive won't be listed on the website here as it doesn't fit the PLF scope ("portable library framework", so C++98/03/11/14/17/20 and compatible with lots of compilers), however it'll be linked to on the colony page.

The other advantage of keeping colony separate from hive is that colony is insulated from any future changes necessitated to hive, meaning it becomes a more stable platform. Obviously any good ideas can be ported in between hive and colony.

06-09-2021 - POT Blocks paper errata

There was a casual minor mistake in the Random access of elements in data containers which use multiple memory blocks of increasing capacity paper (also know as 'POT blocks bit tricks', now updated). The 5th line of the equation should've read "k := (b != 0) << (b + n)" but instead read "k := (j != 0) << (b + n)" which neither matched the earlier description of the equation in the paper, nor my corresponding C++ code. This resulted in the technique not working in the first block, which it does now if you follow it through.

17-06-2021 - Colony == Hive

While it might've been noticed by some of the more observant, colony is now optionally typedef'd as hive, as this is the name the committee feels more comfortable with going forward. In terms of my implementation the default name will always be colony, as that's the name I am most comfortable with. There is a paper which explains the name change here. The typedef is only available under C++11 and above due to the limitations of template typedefs in C++98/03. plf::colony_priority and plf::colony_limits are also typedef'd as plf::hive_priority and plf::hive_limits. I'm not... 100% happy about the change, but c'est la vie.

13-06-2021 - More benchmark-based progression

In order to reduce the memory footprint of plf::queue compared to std::queue, there are now two modes that can be selected as a template parameter. 'Priority' is now the second template parameter, before Allocator, as I figure people will likely use Priority more. But at any rate, the new default mode, priority == plf::memory, tends to break the block sizes down into subsets of size(), whereas the original mode, priority == plf::speed, tends to allocate blocks of capacity == size(). The performance loss is relatively low for plf::memory and is still faster than std::deque in a queue context, but the memory savings are massive. More information can be found in the updated 'real-world' benchmarks.

07-06-2021 - Benchmark-based progression

Let's explore the idea of basing design of a product on benchmarks. There are a number of flaws with this: for starters, you can't do an initial design based on benchmarks, you have to build something to benchmark first. So you need an initial concept based on logic and reasoning about your hardware, before you can begin benchmarking. But, in order to get that initial logic, you have to know basic performance characteristics and how they are achieved on a code level - and the only way to know that is via benchmarks either yourself or someone else has done on other products in the past (or having an in-depth mechanical knowledge of how things are working on the hardware level). Otherwise you're just theorising about the hardware, which has limitations.

The second flaw is that, of course, benchmark results will change based on your platform (and compiler) - a platform with a cacheline width of, say, 4k, will benchmark differently from one with a width of 64k. But unless you actually benchmark your code on at least One platform, you don't really know what you're dealing with. And most platforms are reasonably similar nowadays in the sense of all having heavily-pipelined, branch-predicted, cache-locality-driven performance.

So I'm exploring the idea here of benchmark-based design, which is, beyond the initial drawing board part of design, at a certain point you flip over into basing design changes based on what your benchmarks tell you. Nothing revelatory here, something which most companies do, if they're any good. But you want to put the horse before the cart - this involves trying things you think might not work, as well as things you think might. You might be surprised how often your assumptions are challenged. The important part of this is discovering things that work better, then working backwards to figure out how and why they do that.

I've found this kind of workflow doesn't sync well with the use of git/subversion/mercurial/etc, as often I'm benchmarking up to 5 different alternative approaches at once across multiple types of benchmark (if not several different desktops) to see which ones work best. That's how I get performance. But tracking that number of alternative branches with a git-like pipeline just overcomplicates things to no end. So my use of github etc is purely functional - it provides a common and accepted output for those wanting the stable builds.

To expand upon this, plf::queue's original benchmarks were based on the same kind of raw performance benchmarks I used with plf::stack - however this was, somewhat, inappropriate, given that the way a queue is used is very different from the way most containers are used. So I put together a benchmark based on more 'real world' queue usage - that is to say, having an equal number of pushes and pops most of the time, then increasing or decreasing the overall number of elements at other points in time - and tracking the overall performance over time. The intro stats on the queue homepage have been updated to reflect this.

This obviously yielded somewhat different results to the 'raw' stack-like benchmarks, and while the results were still good, I felt I could do better. After reasoning about what the blocks in the stack were doing for a while, I tried a number of different approaches, including some I thought logically couldn't improve performance, and found that some of the more illogical approaches worked better. But I also tracked memory usage and found that the more 'illogical' techniques were using lower amounts of memory, and were also able to re-use the blocks more often, which explained things. Using this new knowledge I was able to construct new versions which not only used less memory (more-or-less on a par with std::queue) but performed better.

Benchmarking in this way is tedious - any good benchmark has to be tested against multiple element types and double-checked on alternative sets of hardware. but when I see people out there making performance assumptions about code, when they haven't actually benchmarked their assumptions, I know this is the right way. You can't guarantee performance if you don't perform.

27-05-2021 - list update

The most recent v2.0 plf::list update is more of a maintenance release than anything revelatory - though it does fix a significant bug in C++20 sentinel range insert (insert(position, iterator_type1, iterator_type2)), adds sentinel support to assign, overhauls range/initializer_list insert/assign with the same optimisations present in fill-insert, and adds support for move_iterator's to range insert/assign.

The other thing it does is add user-defined sort technique use to sort() - via the PLF_SORT_FUNCTION macro. Ala colony, the user must define that macro to be the name of the sort function they want sort() to use internally, and that function must take the exact same parameters as std::sort(). This means you can use std::stable_sort, and others like gfx::timsort. In addition that macro must be defined before plf_list.h is #included. If it is not defined, std::sort will be the function used. It still uses the same pointer-sort-hack to enable sorting, but for example, stable sorting may be more useful in some circumstances.

17-05-2021 - plf::queue [updated]

Taron M raised the idea of implementing a faster queue, given that std::queue can only run on deque at best, and std::deque's implementation in both libstdc++ and MSVC mean that it effectively becomes a linked list for element types above a certain size. Which is a bit bogus. The result is plf::queue, which is a *lot* faster than std::queue in real-world benchmarking, as in:

* Averaged across total numbers of elements ranging between 10 and 1000000, with the number of samples being 126 and the number of elements increasing by 1.1x each sample. Benchmark in question is a pump test, involving pushing and popping consecutively while adjusting the overall number of elements over time. Full benchmarks are here.

Structurally it's very similar to plf::stack, which made it relatively easy to develop quickly - multiple memory blocks, 2x growth factor, etcetera. We initially spitballed a few interesting ideas to save space in the event of a ring-like scenario, such as re-using space at the start of the first memory block rather than creating a new one; but ultimately the overhead from doing these techniques would've both complicated and killed performance. Also, it has shrink_to_fit(), reserve(), trim() and a few other things, much like plf::stack.

01-05-2021 - Switch trick

I was investigating a particular trick I'd used for 4-way decision processes in colony, as part of my speech writing process for my talk at CPPnow! 2021. Initially I held my current line, which was that using a switch for this process resulted in better/faster compiler output, both in debug and release builds. This used to be true at one point, however I realised that was a long time ago, and the container had changed since then, and the compilers had changed substantially since then, sooo. Was this still true?

The trick itself involves situations where you have a 3 or 4-way if/then/else block based on the outcome of two boolean decisions, A and B. You bitshift one of the booleans up by one, combine the numbers using logical OR, and put your 3/4 outcomes in 3 switch branches. Initially I did some assembly output analysis based on this simple set of code:

int main()
	int x = rand() % 5, y = rand() % 2, z = rand() % 3, c = rand() % 4, d = rand() % 5;

	const bool a =(rand() % 20) > 5;
	const bool b =(rand() & 1) == 1;

	if (a && b) d = y * c;
	else if (a && !b) c = 2 * d;
	else if (!a && b) y = c / z;
	else z = c / d * 5;

	return x * c * d * y * z;

vs this code:

int main()
	int x = rand() % 5, y = rand() % 2, z = rand() % 3, c = rand() % 4, d = rand() % 5;

	const bool a =(rand() % 20) > 5;
	const bool b =(rand() & 1) == 1;

	switch ((a << 1) | b)
		case 0: z = c / d * 5; break;
		case 1: y = c / z; break;
		case 2: c = 2 * d; break;
		default: d = y * c;

	return x * c * d * y * z;

And, indeed the switch version produced about half the number of branching instructions, at both O0 and O2. However this was inconsistent between compilers (MSVC was the worst, producing 10 (!) branching instructions for the O0 build. How? Also - WWWWhyyyyy???). Here's some graphs from the (now removed) slides:

Switch vs if assembly lines

Switch vs if branch lines

As you can see there's a significant difference there. All compilers are their most recent non-beta versions at the time of writing. But assembly analysis is not benchmarking. So I proceeded to change all the four-way branches in colony itself to a variety of different configurations - some using switch as above, some using if, some using if but with logical AND instead of && as above. I tested across a wide variety of benchmarks in the benchmark suite, and found that I was wrong. The switch variant was no longer in the fore as far as performance was concerned, though it was for earlier versions of GCC (circa GCC5). In the end I went with a logical AND variant, as it seems to have better performance across the largest number of benchmarks.

So now you know - doing a C++ talk can sometimes result in better code.

Also, I now have a new paper, as promised - on modulated jump-counting patterns. These allow a user, for example, to have 2 or more different types of objects in the array/storage area which the skipfield corresponds to, and conditionally skip-or-process either type of object during iteration. For example, in an array containing red objects and green objects, this allows the dev to iterate over only the red objects, or only the green objects, or both, with O(1) ++ iteration time. It is an extension and modification of jump-counting skipfields and can be used with either the high or low-complexity jump-counting skipfields. I am unsure of what this might be entirely useful for at this point - possibly some kind of very specific multiset.

23-04-2021 - A new paper

I have another paper in the pipeline, this time describing a new application of modified jump-counting skipfields, but this paper is something different:
Random access of elements in data containers using multiple memory blocks of increasing capacity

It describes a method which allows the use of increasingly-large memory block capacities, in random-access multiple-memory-block containers such as deques, while retaining O(1) access to any element based upon it's index. It's not useful for colony, but it may be useful for other things. I have a custom container based on this principle, but it's not ready for primetime yet - it hasn't been optimized. Ideally I'd get that out the door before releasing this, but we none of us know when we're going to die, so might as well do it now.

The method requires a little modification for use in a deque - specifically, you would have to expand the capacities in both directions, when inserting to the front or back of the container. But is doable. I came up with the concept while working on plf::list, but have subsequently seen others mention the same concept online in other places. But no-one's actually articulated the specifics of how to do it, which is what this paper intends.

Hopefully is of use.

24-03-2021 - Summarising DoD, maybe

If I could summarise data-oriented design in a heartbeat, it would be that:
traditional comp sci refocuses on micro perf too much, whereas most of the actual performance changes happen at the macro structure level, and what constitutes optimal macro level structure largely changes from platform to platform so it's better to provide developers with the right tools to fundamentally manage their own core structures and code flow with ease and simplicity, to match those platforms, rather than supplying those structures yourself and pretending the micro-optimisations you're doing on those structures are going to have a major impact on performance.

Making things more complicated in order to obtain minor performance optimisations (like the majority of std:: library additions) obscures and complicates attempts to make the real macro changes. Memorising the amount of increasingly verbose nuance necessary to get the nth-level performance out of the standard library does not contribute necessarily to better code, cleaner code, or more importantly, better-performing code.

I note this with full awareness of the part I have to play in bringing colony to the standard, or attempting to do so. It is better to have this thing than not to have it; that's the core thing. It is an optional macro level structure that can have a fundamental impact on performance if used instead of, say, std::list or vector in particular scenarios. But the amount of bullcrap I've had to go through in terms of understanding all the micro-opt/nuance stuff in C++ like allocators vs allocator_traits, alignment, assign, emplace vs move vs insert etc, has been a major impediment to progress.

Ultimately fate will decide whether this was worth it. I'm erring on the side of 'maybe'. Object orientation isn't going away anytime soon, so it's best that programmers have adequate tools to reach performance milestones without reinventing wheels.

In summary, Keep It Simple + think macro-performance, largely.

06-03-2021 - if constexpr (!all_the_things)

Constexpr-everything hasn't been critically-assessed. There is the potential for cache issues when constexpr functions return large amounts of data, or when functions returning small amounts of data are called many times, for example, and the issue of where that data is stored (stack or heap) and how long it is available for (program lifetime?). There is an awful mess of issues for any respectable compiler to sort through, and I don't envy those developers. What is clear is that it is not a necessary win. My own testing has shown constexpr-everything versions of colony to underperform the original by 2% when using GCC10 in regular code and while I haven't looked into why this is to any great extent, my guesses are along the lines of the following:

  1. Possible cache thrashing issues from storing function return data from many separate function calls.
  2. Creating in-app data when in fact the original function call would've been faster.

For an example of the latter, think about size() in std::vector. This can be calculated in most implementations by (vector.end_iterator - vector.memory_block), both of which will most likely be in cache at the time of calling size(). That's if size isn't already a member variable. Calculating a minus operation from stuff that's already in cache is about 100x faster than making a call out to main memory for a compile-time-stored value of this function, if that is necessary.

In addition constexpr function calls

  1. shift the responsibility of storage of pre-calculated results from the programmer to the compiler, and removes the ability for the programmer to think about the cache locality/optimal storage of precalculated results
  2. shift the decision of whether or not to evaluate at runtime/compile-time to the compiler, where in some cases doing it at compile-time and storing the result may decrease performance (see the Doom 3 BFG edition technical notes for their experience with pre-calc'ing meshes vs calculating them on the fly)
  3. may dramatically increase code size/file size in some cases if the return results of constexpr functions are large
  4. may dramatically increase compile times

At any rate, I have a constexpr-everything ver of colony, if people want to try such things out. Just email me.

29-01-2021 - A brief history of Colony

This was originally intended to be a chapter in what was to be a Data Driven Design fieldguide book based on various author's experiences, spear-headed by Mike Acton, way back in 2019. Unfortunately that book never came to fruition, so rather than leave this writing to dissolve into digital dust, I've guest-published it on Arne Mertz's blog - which has been sadly dormant of recent. Maybe someone else should volunteer to write something as well?

06-01-2021 - Bypassing bad bugs

The new version of colony manages to bypass the bad GCC9/10 optimisation bug which dropped performance in certain tests by up to 6-10%. I'm not sure what caused the bug exactly, but hopefully they figure it out. In this case, it seems to've been bypassed by refactoring out some common code into functions, suggesting potentially a GCC bug in large functions. Also, the other benefits are there's less code now, and certain benchmarks perform better even with GCC8. So all good.

The path to finding this 'solution' was fairly circuitous, involving me realising what a load of BS RAII is sometimes. It's nice when it works, but when you figure out that objects allocating their own sub-objects (as opposed to simply incorporating subobjects as flat data or arrays within the parent object) can lead to a much larger number of allocations/deallocations/try-and-catch blocks than is really necessary, you have to wonder just how much of IT infrastructure is slowed down by A instantiating child object B which in turn instantiates C etc. Object orientation is nice, to a point; it helps you think about code. But the buck should stop there. When you are able to see what your design is doing performance-wise with your data as you're supplying it, that can become your orientation from that point onwards.

03-01-2021 - Happy New Rand!

plf::rand is an adaptation of Melissa O'Neill's lightning-fast PCG random number generator with a fallback to xorshift for C++98/03. Read the page for details on the whys and wherefores of why C's rand() isn't good enough for most benchmarking. In addition, thanks to Tomas Maly, Colony now has much better support for over-aligned types. In the past, I initially allocated the memory blocks for the elements and skipfield of each group structure separately (a group is an element memory block + metadata struct). However, thankfully someone pointed out to me that this was needless and only one allocation per group was necessary. The downside came in the form of allocators not being fully aware of the alignment requirements of the allocation (since I was allocating in unsigned chars of enough size to cover both the elements and skipfield).

This meant that although the type's spacing between elements was maintained by the type itself, allocations were not aligning to a memory locaiton divisible by the alignment itself. This meant two things: SIMD wasn't actually possible, and performance wasn't as good as it could be. By aligning correctly via Tomas's suggested code and my subsequent adaptations, 2% performance increase has been realised in high-modification (large amounts of insert and erase) benchmarking in non-SIMD code, and SIMD code is now possible, as long as one bears in mind the limitations of using SIMD with a colony container.

Since we lose speed if we allocate the skipfield and element memory blocks separately, Tomas's idea was to allign a block in multiples of the element's alignment - this by necessity increases the memory usage slightly, as the chances of the skipfield + elements total size is unlikely to be an exact multiple of the type alignment. However the more elements you have in the colony, the more insignificant this memory usage difference becomes, simply because the total element capacity of each group becomes larger. In the current code, without doing any reserve()'ing or specifying minimum/maximum block capacities, at 100 small-struct (48-byte type) elements inserted individually) there is approximately 1 byte wasted per-element compared to the previous non-aligned code. However at 1000 elements this becomes around 0.1 byte per element, and at 100000 elements it's around 0.001 bytes per element, etc. So, the tradeoff it worth it.

So as it turns out, the phrase "it takes a village to raise a child" is also largely true of good code.

06-11-2020 - Colony 6.01 is released

Well that was fast. Anyway, non-member std::erase and std::erase_if overloads for colony are now implemented - for those who're unfamiliar, these're effectively the equivalent of remove and remove_if in std::list - erase if == value, or erase if predicate condition is true, respectively. They are C++20 requirements. They've also been added for plf::list, and various redundant assertion removals and macro renamings have been implemented for all three of colony, list and plf::stack. There's also been a correction to the noexcept guarantee of splice (more lenient), and range-erase now returns an iterator.

Specifying a return iterator for range-erase in colony seems almost pointless, given that there's no reallocation, so the return iterator will almost always be the last iterator of the const_iterator first, const_iterator last pair. However if last was end(), the new value of end() (if it has changed) will be returned. end() can change if the memory block it was associated with becomes empty of elements and is therefore removed from the iterative sequence.

In the last case either the user submitted end() as last, or they incremented an iterator that happened to point to the final element in the colony and submitted that as last. The latter is the only valid reason to return an iterator from the function, as it may occur as part of a loop which is erasing elements and ends when end() is reached. If end() is changed by the erasure of an entire memory block, but the iterator being used in the loop does not accurately reflect end()'s new value after the erasure, that iterator could effectively iterate past end() and the loop would never finish.

26-10-2020 - Colony 6 is released

Colony v6.00 is designed to be more in line with the current proposed draft for the standard library. It also contains numerous improvements such as:

For reference, one should not expect perfect performance results under GCC9/10 presently - there's a performance bug which's been discovered here which gets in the way, at least on a small subset of tests. Hopefully it will be fixed. Thankfully the bug does not exist in GCC8 nor any other compiler and overall, performance is better.

For anyone who's been following colony development you already know that some of the constructors have been modified to take a minimal struct called plf::limits - this contains a min and max size_t member, and a constructor. It allows block capacity limits to be supplied by the user to constructors without creating overload conflicts with the fill constructor. It is also used now by the reshape() and block_limits() functions - which are the new names for the set_block_capacity_limits() and get_block_capacity_limits() functions. In consultation with SG14, the feeling was that 'reshape' better described the function's behaviour, as it will adjust existing memory block sizes where necessary and reallocate elements.

In addition, there were several helper functions which were easily replicated by combinations of other functions - these have been removed:

The main point of doing this was to present a minimal interface for the user, and only provide supplementary functions where critical - for example, get_iterator_from_pointer() still exists because it is not replicable with other functions.

I can expect this implementation to change subtley as the standardisation process continues - if it continues. For anyone wanting a more concrete solution which will not change (but without the above benefits as have been explained), the final version of colony v5 + it's documentation will be maintained here. Any serious bugs that are discovered, will be fixed in this version as well going forward.

Stay safe, keep healthy, and keep coding-

Matt out.

26-06-2020 - Indiesort

plf::indiesort grew out of the plf::list project, which used a similar technique for it's internal sort function. Unlike indiesort, plf::list's sort is able to just change the previous/next pointers rather than reallocating elements, so it's a lot faster. But indiesort has it's place outside of plf::list. For one, it's much faster for use with std::list than std::list's own internal sort function, asides from when dealing with large types. And for other containers, it does a great job of cutting down sort time (compared to std::sort) when dealing with large or non-trivally-copyable/movable types - up to 130% faster in a vector or array.

How does it achieve this? The basic approach is, create an array with pointers to each element, then sort the pointers instead of the elements, via the value of the elements they point to. The difficult part was then finding a way of redistributing the elements again without just brute force copying them all into a temporary buffer then back to the range/container (which is what I initially did with colony's sort()). Eventually I worked out the most efficient solution, which only requires one element to be copied into a memory buffer as opposed to all of them.

Hence, most elements only get copied once using this technique, some twice, and some not at all (if they haven't moved during sorting). So for any situation where copying an element is expensive, it shines. And of course, the sort techniques which rely on random-access are typically much, much faster than those which don't (another reason why std::list::sort() is slow).

11-12-2019 - Naming things is hard

There's a point in time in every man's life where he reflects upon the mistakes he made. Today is not that day. Today is the day I tell you about how I'm fudging my mistake. Without further ado, here is the problem:

The original skipfield pattern used by colony was the 'advanced jump-counting skipfield pattern'. How did it get that name? Well, for starters, a jump-counting skipfield pattern does exactly that - it counts the jumps between one non-skipped item and the next, in an efficient manner. The difference between what I previously called the 'advanced jump-counting skipfield' and the 'simple jump-counting skipfield' was that one was, to my way of thinking at the time, simpler. The 'advanced' pattern did some more complicated stuff. It was also (to my way of thinking at the time, again) more useful essentially - more versatile.

However several years on I started to think about using the 'simple' version in more advanced scenarios - thinking about this only became possible because of some of the different ways I was organising colony on a structural level - but I started to notice that, actually, when viewed in a slightly different manner, it wasn't so much 'simpler' as 'different'. Doing the same things that I did with the 'advanced' version was possible, but required additional data.

To be specific I'm talking about changing a skipped node to an unskipped node. Originally I thought this could not be done with the simple version with any efficiency - while the advanced version stored information in it's nodes about their distance from the beginning of a run of skipped nodes (a 'skipblock'), the simple version didn't. Then I realised it could be done at least for the start and end nodes of skipblock, which do store information. Then I realised (much later) that you could, if you had prior knowledge of the start and end locations of a given skipblock, do anything you wanted to any of the nodes within that skipblock.

As my understand of this pattern grew, and I replaced the skipfield in colony with the 'simple' version, I faced a problem. For starters, it was no longer simpler than the 'advanced' version - it was merely different. But I'd already named, and published, a paper on the advanced version which explicitly stated aspects of the simple version, which were no longer true from my understanding. For this reason, because I needed a new name for this more complex version of the 'simple' pattern, I just called it the Bentley pattern as a working title.

Now, if you know me, or you've been to New Zealand you're realise that this kind of self-aggrandisement (naming an algorithm after oneself) doesn't necessarily sit well with me, or my culture - even though, in computer circles that's what we do all the time - that's where the term 'boolean' comes from (George Boole). But I couldn't think of anything else. However now I feel like I've come up with a solution, although it invalidates the original paper, given that I'm changing both the names for the advanced jump-counting skipfield and the 'simple' jump-counting skipfield.

The fudge is changing 'advanced' to 'high complexity' and 'simple' to 'low complexity', although in this case complexity explicitly refers to time-complexity, not overall complexity of the algorithms. Specifically the low-complexity jump-counting pattern allows for mostly O(1) completion of all operations, but requires additional operational data, which the high-complexity algorithm does not. Many of the procedures of the high-complexity jump-counting skipfield have undefined time complexity, as the amount of work done is determined by the length of any skipblocks involved.

In actual operational practice the performance difference between using the low-complexity and high-complexity algorithms in colony was good but not maximal, as time complexity is not a very good predictor of performance, but any performance increase is worth having. At any rate, I hope that clears things up. For future purposes the current algorithm driving the skipfields in colony is the low-complexity jump-counting pattern, and the previous one was the high-complexity jump-counting pattern. The papers written have been updated to reflect this, and hopefully someday I will find a publisher for the second. They are both available on the main page.

On the plus side I've now been able to re-integrate some of the data I had to cut out of the high-complexity jump-counting paper, in order to make it fit within the original journal's page limits. So now there's info on intersection, parallel processing... etcetera etcetera. Don't get me wrong, they were great - they just had strict limits on how much they could publish. With other journals I've dealt with, the general academic 'stance' is chronically bad, prone to valuing style over substance and valuing the worst criteria as the highest criteria. A lot of grandstanding and adolescent behaviour with a desire to look 'authoritative'. And from what I hear, a lot of academia is prone to this.

24-10-2019 - Yet another guest post :)

Once again Arne has allowed me to spool off my brain noise into bits and bytes, representing characters and other assorted connotations of meaning. Much thanks to him and the post is here, this time the topic is Time complexity and why it's relevance is entirely dependant on hardware, essentially.

01-8-2019 - SIMD and colony

I made a modification to plf::colony which enables it to be used with SIMD gather techniques, using the skipfields as masks for processing. I'd like some feedback on how useful this is, and whether it's worth keeping or throwing away. Feel free to email me (see main page).

The way it works in colony is that the skipfield is 0 when an item is unskipped, non-zero when it is skipped. Providing a developer direct access to the skipfield allows them to use SIMD to create a gather mask by transforming the skipfield in parallel to whatever the particular architecture requires eg. for AVX2 and above, this's a vector of integers where each element whose highest bit is one, will be read into SIMD registers.

The new function (with the underwhelming name of "get_raw_memory_block_pointers()"), returns a pointer to a dynamically-allocated struct of the following type:

struct raw_memory_block_pointers : private uchar_allocator_type { // array of pointers to element memory blocks: aligned_pointer_type *element_memory_block_pointers; // array of pointers to skipfield memory blocks: skipfield_pointer_type *skipfield_memory_block_pointers; // array of the number of elements in each memory block: skipfield_type *block_sizes; // size of each array: size_type number_of_blocks; };

There is a destructor so all the sub-struct deallocations are taken care of upon 'delete' of the returned struct.

By using these data sets a programmer can create an appropriate mask for each memory block by processing the skipfield into a separate array/vector, then using that mask perform a Gather operation on the element memory blocks to fetch active elements into SIMD registers. And if desired, and if the processor supports it, a subsequent Scatter back into the colony element memory blocks.

I am wondering how useful this is in comparison to manually creating a mask by iterating over the colony conventionally. The latter approach also ignores memory block boundaries, so might be more useful in that respect, even if you can't use SIMD to create the mask in that case.

There's a demo of how to use the new function in the test suite, which I've just updated in the repo. Let me know how you get on.

05-6-2019 - Win10 Simplifier

This is a tool I've been working on for a while, and deem ready for public service. It automates a lot of changes I make to typical Win10 installations, as well as automating the running of a load of tools such as ShutUp10 and Win10 Debloater. Hope you find it useful.

16-4-2019 - Constexpr all the things, including That Thing

I never really 'got' constexpr, being a construct that largely does what an optimizing compiler does anyway, but then I read some article by some guy, and it made a bit more sense. Still, I've never found a use for it, much less one that resulted in a performance difference. Then I ran across one scenario where a user had a moveable-but-non-copyable type, and that caused issues with some of the functions which consolidated a colony. Reason was that they were using a function which copy-constructed the existing colony into a new colony, then moved the new colony to the old. This is a shrink-to-fit as far as a colony is concerned.

The copy-construction was obviously a problem for a non-copyable type, so I told the function to move the elements instead if they were movable. However, the branch where elements were copy-constructed was still compiled, resulting in type violation errors, despite the decision being for-all-intents-and-purposes compile-time. This is where constexpr came in handy. With "if constexpr" you can guarantee that the path not taken will not actually be compiled, and by using type traits in conjunction with it you can avoid type errors for branches with operations which would be illegal on that type.

That's, of course, if your compiler supports "if constexpr" properly. Up until some time last year, MSVC had a ton of bugs around constexpr, and all compilers currently seem to require a flag specifying that C++17 is required before they'll enable "if constexpr". With any luck, before the year is out, that might change (being 2 years on since 2017). Regardless, all three constainers - plf::stack, plf::list and plf::colony - now use constexpr at all type-traits-determined if statements (plus a couple of other places) when C++17 is available/specified. In some scenarios I found this also reduces code size and increases performance, depending on the functions the program uses. Yell out if you get any bugs.

30-1-2019 - Memory retention

I've been toying with the idea of retaining some or all memory blocks within colony when they become empty, then simply reusing them when the rest of the blocks become full. The problem with this is that it adds a bit of code, fills up your memory faster, and doesn't have any performance advantages. I've benchmarked this on core2 and haswell, and basically the best, most optimized solution I could come up with (which was - only retain blocks of maximum size, or the last block in the chain), just scraped by as being equal to the version of colony without block retention.

Unfortunately, that performance gain you get from not deallocating and then reallocating later, is pretty small when compared to the small amount of extra code and handling necessary to store the list of empty groups. So what's my recommendation? Well, use an allocator. Colony, like most containers, takes an allocator as a template argument, so you can use a pool allocator or similar to reuse old memory space rather than freeing to the system all the time. Of course, the allocator has it's own (small) amount of overhead, but it's better than bogging colony down with the code that 90% of people won't use.

For anyone who's interested, the benchmark results are here. Peace out.

16-1-2019 - Full benchmarks for v5 available + plf::timsort removed + containers updated

Here. The biggest noticable difference, as the previous benchmarks were done with colony v4.00, is erasure. Big, big improvements there.

Also, plf::timsort has been removed as fairly major changes are coming through for the original gfx::timsort and the changes I made to the code are not substantial enough to justify maintaining a fork. Instead the original non-forked project, gfx::timsort is now supported internally in plf::colony and plf::list, and will be used instead of std::sort whenever the header for that code is included before the colony/list header in any project. However, as always, in most cases you should stick to using std::sort, except in the areas where timsort excels (sequences which are already partially sorted, or partially/fully reversed).

In addition, all containers have had their compiler/library detection code updated to better reflect the status of libc++ (should not disable type trait support when using clang + libc++ now). Reverse() within plf::list has been simplified/optimized, with the following results on haswell:
Where erasures have occured prior to reversal, up to 38% performance increase for lower numbers of elements and 7% performance increase on average. Where erasures have not occured prior to reversal, variable results but no change to performance on average. For between 10 and 120 elements, average roughly 8% performance increase, between 120 and 1000 elements average 3% performance loss, for between 1000 and 100000 elements on average there is no change to performance.

Interestingly for me, std::list's reverse() performance got one hell of a lot better between whatever version of libstdc++ GCC 7.1 and 8.2 use respectively. At 7.1, plf::list was ~492% faster than std::list's reverse. Now, even though plf::list's reverse() has gotten faster, for gcc 8.1 it's only 70% faster on average than std::list. I'd have to go back and re-run the tests to make sure this wasn't some benchmarking mistake, but eh - better things to do...

10-1-2019 - A summary of performance differences between colony v4.5 and v5.01

Tests in question were run on a haswell processor, under GCC 8.1.

Referencer tests (multiple collections of interlinked small structs): Widely variable performance differences depending on number of elements, but on average 2% performance gain for higher modification rates and 0.8% performance loss for lower modification rates.
General use tests (insertion, erasure and iteration on the fly measured over time): Up to 6% performance gain for high modification rates and numbers of elements under 19000. Average gain of 0.5%.

Raw tests:
Large struct: no change to insertion or iteration, 3% average performance improvement for erasure with up to 9% improvement for very large numbers of elements.
Small struct: no change to insertion or iteration, for erasure there was up to 3% performance decrease for numbers of elements under 30, up to 5% performance improvement for numbers of elements over 200000, on average no significant difference.
Int: no change to iteration, insertion: up to 9% worse performance for under 150 elements, up to 25% better performance for more than 130000 elements, erasure: 2% worse performance on average.
Double: no change to iteration, widely variable results for insertion, on average 2% worse, for erasure more consistently 1% better on average.
Char: 45% faster iteration on average (not sure how that works but I imagine bumping the minimum alignment width to 2 * sizeof(short) ie. 32-bits has something to do with it), widely variable results for insertion but 1% slower on average, 2% faster erasure on average.

The datasheets for the results above are here.

9-1-2019 - Differences between performance of colony using char vs short for skipfield

Basically colony is usually most effective with the default unsigned short skipfield type - I've come across no benchmark scenarios yet where unsigned int gives a performance advantage, even for vast numbers of elements. There's a number of reasons for that - cache space saving, the removal of memory blocks when they become empty and the skipfield-type-imposed limitation on the size of those memory blocks and subsequent statistical likelihood of them being empty. Basically, the skipfield patterns used for colony require that the number of elements per block cannot be larger than the maximum size representable by the skipfield type ie. 65535 for unsigned short, 255 for unsigned char. But, sometimes you only want a small number of elements - under 1000 to be specific. Where this is the case, you may want to swap to using unsigned char instead of unsigned short for your colony skipfield type.

Skipfield type is the second parameter in the colony template, so plf::colony<small_struct, unsigned char> temp; will give you an unsigned char skipfield type for your container instance. How does this work out performance-wise? Well, it varies from CPU to CPU. For core2 you can sometimes get a 2%-20% performance advantage over unsigned short, depending on the number of elements you have and ratio of insertion/erasure-to-iteration for your container usage. Very high levels of modification don't benefit significantly from an unsigned char skipfield regardless of number of elements. On ivy bridge, there is seldom a performance advantage for using unsigned char. Mostly it is a performance detriment.

Here's the results for Haswell (GCC 8.1) when using unsigned char skipfields instead of unsigned short skipfields:

Referencer tests (multiple collections of interlinked small structs): Up to 500% (!) loss for > 65000 elements, high modification rates, on average 26% loss. Up to 5% gain for numbers of elements under 1000 and low modification rates.
General use tests (insertion, erasure and iteration on the fly measured over time): Up to 4% loss for large numbers of elements. Average loss of 2%.

Raw tests (large struct and int not measured):
Small struct: up to 10% loss for iteration, average 6%. Widely variable results for insertion, on average 4% gain. Erasure 1% loss on average.
Double: up to 14% loss for iteration, average 3%. Widely variable results for insertion, on average 1% loss. Erasure 3% loss on average.
Char: up to 10% gain for iteration, average 6%. Widely variable results for insertion, on average 3% loss. No change for erasure.

Memory usage is a different story again and is of course consistent across CPUs. Across the board, in terms of the benchmarks I was running (with small structs around 48 bytes) the amount of memory saving going to an unsigned char skipfield was about 2% for under 1000 elements. Of course this will be more if you're storing a scalar type of some sort, and hence the skipfield takes up a larger part of the overall memory cost. But once you get up to larger numbers of elements, you can sometimes end up saving 50% memory overall - this is not due to the skipfield type per se, merely due to the effect it has on limiting the size of the element memory blocks. You can also accomplish this with the change_block_sizes() functions. Since with an unsigned char skipfield type you can't have a memory block larger than 255 elements, this means that your unused capacity in a new memory block is never going to be more than 254 elements. However, with an unsigned short skipfield type the unused space in a new memory block could be up to 65535 elements wide. This can be a significant amount of memory if your element type is large.

It should be noted that the largest performance decrease above 1000 elements when using a skipfield type of unsigned char instead of unsigned short, was 2% on core2 and 7% on ivy bridge. And the bulk of the performance increases for using an unsigned char type occured on core2 and under 256 elements. Hence my recommendation is: if you want performance, stick with the default skipfield type for the most part. But depending on your CPU and context, for under 1000 elements you may or may not get a significant performance increase using an unsigned char skipfield type. And if your focus is on saving memory, change to unsigned char and/or change your minimum and maximum memory block sizes.

To throw an additional spanner into the works, I decided to customise a version of colony for small sizes and unsigned char skipfields, so that the size_type and difference_type typedefs used throughout the template did not allow for more than 32767 elements in the colony total. Basically this is to see whether a specialized 'small colony' container would be useful as opposed to having one container with a template parameter for the skipfield. I found that this was not worthwhile - the performance saving was generally 0 while the memory saving was between 4% and 2% when the number of elements was under 40, and less than 1% if the number of elements was over 40 (some variability but I put this down to undiagnosed bugs arising from not static_cast'ing the new size_types in all circumstances - and I don't have time or interest in pinning it down). Adding that to the additional overhead required to keep two versions of colony maintained, it turned out to be a worthless gesture, but one worth trying at least.

The datasheets for the results above are here.

13-10-2018 - Making in a film in C++ then upscaling it with weird nerd sh*t

~11 years ago a made a film called o2 in C++ - rather, I generated the visuals in C++ by using SDL and just breaking things. Mostly I used sin & cos & rgb values (which is immediately apparent when you watch it). Sometimes I ended up using the values left over in memory from previous programs to accidentally make random stuff (like in the opening sequence) - I'm not actually sure if you can still do this on modern OS's (some of my experiences suggest not).

Anyway, this footage was cut together and I used it to make a 15-minute music video comprising five of the songs from this electronic album. Unfortunately the video seems to be a litmus test for how awful your video codec is. Divx (remember that?) couldn't compress it at all, Xvid did the best job in 2007, but it took several years before h264 became widespread enough (and the encoders for it good enough) so I could actually make a decent non-lossless compress of the film.

Unfortunately youtube dials down the bitrate hard enough that even with a great encode, the video still turned it into an unviewable sea of shifting mpeg blocks. Even when upscaled to 720p with lanczos, the bitrate was not sufficient to avoid the clustertruck of minecraftification. But lately I've been wondering if it was ever going to be possible to make this viewable online. Luckily, I discovered Waifu2x. Waifu2x is an attempt to use a neural net to upscale anime videos & pictures using sample sets to tell the upscaler what the result should look like at a higher resolution. It is extremely precise and crazy. Plus, Super-Nerdcore.

So, I thought I'd give that a shot - it needs an nvidia GPU and CUDA and A BUNCH OF CUDA STUFF but it all works well. The process was surprisingly easy, once I found the right tools. Anyway, I upscaled this film from it's original SD resolution to 4k - and it worked. If anything, it made the upscaled film look better. So, I finally uploaded it to youtube, it converted to 1080p (the film's still in 4:3 aspect ratio, and youtube don't like that for 4k), and it's still not great - though sufficiently more viewable due to the higher bitrate afforded by 1080p, there's still quite a lot of minecraftification happening. So instead, here's a 3GB download of the film at 4k: enjoy.

For reference, the lossless-compressed (UTvideo) version of the film is 106GB.

05-10-2018 - Colony 5 - the new mechanisms

As mentioned on the colony page and in countless as-yet unaccepted standards proposals, any colony implementation is structured around 3 aspects:

In colony v4.5 the third of these was changed from a stack of memory locations to a series of per-memory-block free lists. In colony v5.0 both the second and third change; the third changes from per-memory-block singly-linked free lists of erased element locations, to per-memory-block doubly-linked free lists of skipblocks. A skipblock, in colony terminology, refers to any sequence of contiguous erased elements - including sequences of only one erased element. By recording skipblock locations instead of singular erased element locations, we improve performance by decreasing the number of skipfield nodes which need to be altered upon erasure or insertion. But, this also enables a further improvement.

By only storing skipblocks we are able to change the second aspect above (the skipfield) to the Bentley pattern, instead of using the Advanced Jump-counting skipfield pattern. Without going into it in great detail, the Bentley pattern is a refinement of the simple variant of the jump-counting skipfield pattern. It does not require the middle nodes of a skipblock to be updated upon changes to the skipblock - however, only the beginning and end nodes of the skipblock may be changed. But since we're only keeping records of skipblocks instead of individual erased nodes, this means we can choose which erased element within the skipblock to reuse, and only change a beginning or end node.

In addition, reusing the beginning or end node of a skipblock (current implementation reuses the beginning because of a benchmarked performance advantage in terms of multiple insertions and thus reusing multiple sequential nodes from left to right instead of from right to left) is advantageous for iteration performance; reusing a middle node means splitting a skipblock in two, increasing the overall number of skips during iteration.

Without writing a paper here (I am writing a paper - just not right here), the simple upshot of this is that all skipfield updates for single insertions and erasures are now O(1) amortized, and skipfield updates for multple insertions/erasures are O(1) per-memory-block affected - as opposed to previous versions where all these were undefined in terms of time complexity. Insertion, erasure and iteration become faster in the context of multiple erasures and insertions over time; as the number of skipblocks - and the subsequent number of jumps required during iteration - is reduced substantially.

A lingering question might be, why the change from a singly-linked per-memory-block free list to doubly-linked? Well, turns out if you're storing/reusing singular erased element locations, singly-linked is fine - you make the newly-erased element the new head of the free list and point it's 'next' index to the previous head - regardless of scenario. However if you're storing whole skipblocks, when you reach the scenario where you're erasing an element that is directly between two skipblocks, you have to join the skipblocks and remove the record of one of the skipblocks from the free list - which requires either:

  1. Traversing the whole singly-linked free list to find the skipblock previous to the removed skipblock, and changing it's 'next' index to point to the skipblock after the removed skipblock, or
  2. Changing the singly-linked free list to a doubly-linked free list and updating 'next' and 'previous' indexes respectively for the skipblocks before and after the removed skipblock in the free list sequence.

Across multiple benchmarks the second solution works out to be faster as the first requires jumping all over the memory block in the case of many non-contiguous erasures and subsequently many skipblocks. When the elements in question are large, this can have huge cache effects, since the free list index info is stored within the erased element's memory space, not it's skipfield node. The performance cost of having to update and maintain both a previous and next index is minimal by contrast.

A side-effect of this is that elements are now over-aligned to (sizeof(skipfield type) * 2) (or the type alignment if that is larger). So, assuming the default skipfield type of unsigned short, a colony storing char or short has that type overaligned to 32bits (assuming unsigned short is 16 bits on that platform), in order to store both a next and previous index when an element gets erased and becomes part of a free list. The performance cost of this is minimal, the storage cost is larger (though not as large as it was before v4.5 when colony was still using a pointer stack), but, colony was never designed for small types - if you store chars or shorts in a colony you are already wasting space due to the skipfield node being larger/as large as the element itself. So not a huge loss from my perspective.

So yeah - big update, big change, for the better. Hope everyone gets some use out of it!

02-08-2018 - In the name of Elegance

I did another guest post on Arne's blog about the dangers of letting area-specific 'elegant' solutions dominate a language.

10-07-2018 - The results of Assumptions

The next time someone tells you that std::fill_n or std::fill is as fast as memset, get them to benchmark it.

My results:

Xubuntu 18, Core2Duo E8500 CPU, GCC 7.3

Results in release mode (-O2 -march=native): ============================================= 2018-07-10 21:28:11 Running ./google_benchmark_test Run on (2 X 3800.15 MHz CPU s) CPU Caches: L1 Data 32K (x2) L1 Instruction 32K (x2) L2 Unified 6144K (x1) ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 16488 ns 16477 ns 42460 memory_fill 16493 ns 16493 ns 42440 memory_memset 8414 ns 8408 ns 83022 Results in debug mode (-O0): ============================= ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 87209 ns 87139 ns 8029 memory_fill 94593 ns 94533 ns 7411 memory_memset 8441 ns 8434 ns 82833 Results in -O3 (clang at -O2 is much the same): ================================================ ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 8437 ns 8437 ns 82799 memory_fill 8437 ns 8437 ns 82756 memory_memset 8436 ns 8436 ns 82754
The code (using google benchmark):
#include <cstring>
#include <algorithm>
#include <benchmark/benchmark.h>

double total = 0;

static void memory_memset(benchmark::State& state)
	int ints[50000];

	for (auto _ : state)
		std::memset(ints, 0, sizeof(int) * 50000);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

static void memory_filln(benchmark::State& state)
	int ints[50000];
	for (auto _ : state)
		std::fill_n(ints, 50000, 0);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

static void memory_fill(benchmark::State& state)
	int ints[50000];
	for (auto _ : state)
		std::fill(std::begin(ints), std::end(ints), 0);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

// Register the function as a benchmark

int main (int argc, char ** argv)
    benchmark::Initialize (&argc, argv);
    benchmark::RunSpecifiedBenchmarks ();
	printf("Total = %f\n", total);
    return 0;

Note: The for-loop counting the array contents is necessary for the benchmark loops not to get optimized out by clang (which detects that the arrays are unused and removes them).

10-06-2018 - Google benchmark and some strange results

So I took part in an exercise to try and understand google benchmark, as a small part of the larger task of understanding the google toolchains en generale. Google documentation is... lets say, "sparse", sometimes inaccurate and typically lacking in reasonable explanations for newbs, so it took quite a bit of searching to understand what the different macros were doing. Final result: figured out they were running each process a certain number of times to get an estimate as to how many iterations would be necessary to achieve statistically-meaningful results, then running the benchmark post-warmup. Plan to submit a merge request to the google benchmark docs, at some point when I have time.

At any rate, I was surprised to find the insertion (push_back) results did not match my own benchmarks in windows (I was in this case running goog-bench on xubuntu 18 on a core2). Deque, bless it's little libstdc++ heart, was outperforming it by a factor of 2 - which was unusual. For 100000 insertions, libstdc++'s deque should've been performing approximately 781 allocations per-run, as it allocates memory blocks in 512-byte chunks; whereas colony, with it's memory block growth factor of 2, should've performed approximately 12 allocations total. Memory allocation being an expensive operation in general, under my windows benchmarks colony had largely outperformed deque in terms of insertion. So the only conclusion I could come to...

... was that windows memory allocation sucks. I fired up Win7, checked my libstdc++/gcc version numbers (gcc 7.3 x64 on both linux and windows/mingw), downloaded python, cmake, google benchmark & test, and recompiled the test under nuwen mingw. Sure enough, same GCC, same optimization flags (-o2 -march=native), same compilation under Codelite, but now colony outperformed deque by a factor of 2. I re-ran tests both on linux and windows, just to make sure there weren't any fluke values, but it was consistent. BTW, all speedstep/power-saving/services/AV/networking/UAC/etcetera/etcetera/etcetera are disabled on both setups. I also tested a box with an Ivy bridge CPU and Windows 10, and the results were exactly the same. So there you have it; memory allocation under windows is breathtakingly slow. But that wasn't the only surprise. Take a look at the comparitive results below:

Windows results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 897430145 benchmark_deque_creation 128 ns 128 ns 4985723 benchmark_vector_creation 1 ns 1 ns 897430145 benchmark_colony_insertion 281589 ns 275333 ns 2493 benchmark_deque_insertion 546001 ns 546003 ns 1000 benchmark_vector_insertion 736900 ns 736903 ns 1122 benchmark_colony_erasure 713045 ns 713048 ns 897 benchmark_deque_erasure 183300495 ns 183301175 ns 4 benchmark_deque_erasure_remove_if 336472 ns 328826 ns 2040 benchmark_vector_erasure 312000513 ns 312002000 ns 2 benchmark_vector_erasure_remove_if 260000 ns 260002 ns 2640 benchmark_colony_iterate_and_sum 121685 ns 121686 ns 6410 benchmark_deque_iterate_and_sum 95949 ns 95949 ns 7479 benchmark_vector_iterate_and_sum 69534 ns 69535 ns 8974
Linux results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 885853600 benchmark_deque_creation 42 ns 42 ns 16841287 benchmark_vector_creation 1 ns 1 ns 886048673 benchmark_colony_insertion 427091 ns 427091 ns 1639 benchmark_deque_insertion 253210 ns 253210 ns 2764 benchmark_vector_insertion 700716 ns 700709 ns 997 benchmark_colony_erasure 743468 ns 743458 ns 939 benchmark_deque_erasure 94806854 ns 94806240 ns 7 benchmark_deque_erasure_remove_if 297223 ns 297222 ns 2355 benchmark_vector_erasure 96288165 ns 96288299 ns 7 benchmark_vector_erasure_remove_if 169655 ns 169655 ns 4100 benchmark_colony_iterate_and_sum 120794 ns 120793 ns 5801 benchmark_deque_iterate_and_sum 95539 ns 95539 ns 7300 benchmark_vector_iterate_and_sum 69133 ns 69133 ns 10100

Note: Both remove_if and non-remove_if erase benchmarks randomly remove 1 in every 8 elements from the total in the container.
You can see non-remove_if erasure (erasing randomly during iteration) in both vector and deque is 2-3x as slow under windows compared to linux - this indicates that, not only are memory Allocations slower, but also memory Copies are slower. However this particular result did not show up on the Win10/Ivy Bridge setup (where the non-remove_if erase results were comparitively equal between linux and windows), indicating that either Win10 is better at memory copies or windows is better at handling memory copies on Ivy Bridge CPUs than Core2's. The "creation" benchmarks (instantiate an instance of a template, check it's size, then destroy it) show the initial allocation/deallocation of deque (unlike vector and colony, libstdc++'s deque implementation allocates it's first memory block upon instantiation instead of upon first insertion) was roughly 4x slower in windows! Lastly, vector erasure with remove_if is almost twice as fast under linux as opposed to windows.

So how did windows memory system get so inefficient? Or perhaps a better question, how did linux's memory management get so efficient? Perhaps that's the right question. Millions of contributors worldwide with an emphasis on performance and stability not "new features" or shiny interfaces, probably. One might ask, why not test this under MSVC? Well, I already did that with my regular benchmarks, and they showed much the same results as mingw, only slower, so that's probably not necessary. Given that colony was initially designed for game engines and games almost universally are designed for windows on the desktop/laptop, this is not a great loss for colony, at least in it's original use-case. And of course it's non-remove_if erasure results are still the best out of all containers. The last thing to note from these results is that colony insertion is noticably slower under linux than under windows. I'm not sure why this should be, on the same compiler, same processor etc. So there's something to look into there.

For anyone wanting to run these benchmarks themselves, here is the code:

UPDATE 19-06-2018: I tried doing a simple google benchmark with memory allocation/deallocation only. This was consistent with the findings for the allocation/deallocation of deque - about 37ns for Xubuntu (GCC) and about 176ns for Windows 7 (mingw-GCC - ~220ns under MSVC2017). The results were almost exactly the same on the Ivy Bridge/Win10 setup. In addition I also tested memset, which was 20-30% slower on windows, depending on the CPU/platform. The .cpp for this test has been included in the zip above.

20-04-2018 - Colony 4.5 vs colony 4.0 benchmark results

Performance increase/decrease for raw performance tests on average across all numbers of elements, erasing 25% of all elements during erasure:
Operation Char Int Double Small Struct Large Struct
Insert +2% +3% +5% +4% 0%
Erase +0% +26% +13% +10% +5%
Iterate -43% -1% 0% 0% 0%

The negative result for char iteration is due to the fact that the char type has to be overaligned to the same size as the skipfield type (unsigned short) in order for the free list mechanism to work in v4.5 (v4.0 used a pointer stack). This can be mitigated by using unsigned char for the skipfield type in colony's template arguments (only really useful for small collections < 1024 elements).

Erasure performance for small structs, across all numbers of elements, erasing 75% of all elements:

+35% average, +58% max (low number of elements), +20% min (high number of elements)

General use-case benchmarking, small struct, on average across all numbers of elements and rates of modification:

Low-modification: +1%,
High-modification: +2.5%

Referencer use-case benchmarking, small struct, on average across all numbers of elements and levels of modification:

+2%, increased lead vs packed-array types

02-04-2018 - Colony 4.5 released - major update.

So here's the thing. colony 4.5 is a lot faster than it's predecessor, due to ditching a stack for erased locations and employing per-memory-block free-lists (plus a global intrusive list of blocks-with-erasures) instead. It also uses less memory, and enables erase to be more exception-safe due to a lack of allocation. Also it supports overaligned types! Here's how that all came about:

Someone at the container-focused C++ committee meeting in Jacksonville said colony shouldn't use a stack for erasures because that could create an allocation exception upon expanding the stack. Now that's partially true, erase can throw exceptions anyway because (a) a destructor might throw an exception - for example, if an object has a file open and can't close it on destruction for some reason - and (b) an invalid/uninitialized iterator could be supplied to the function. Regardless, it's a good point, and I started looking into how to make that not happen. I came back to an old idea I'd had for using per-memory-block free lists instead of a stack for recording erased/reusable locations for future insertions.

The main reason I didn't go with that in the first instance was because I didn't really understand how explicit casting works in C++; originally I thought that I would need to use unions to make a free list work in this case, which don't support basic object-oriented functionality like constructors etc. When in reality, if you have a pointer to something in C++, you can reinterpret_cast anything to it, provided you don't overflow bit boundaries and mess something up. I don't particularly see why C++ doesn't allow one to do this with just regular objects, but I'm sure there's some eldritch reasoning.

The second reason I thought it wouldn't work was because free lists typically utilize pointers, and while fast, pointers are large(ish) - so using colony with scalar types would mean overaligning them to 64-bit most of the time - not cool man!!! Because of my experiences writing plf::list, I was more experienced with thinking through this approach with pointers in mind. But, I realised eventually that since the free list is per-block I could use indexes relative to the start of that memory block, instead of pointers, and that these indexes only had to be of the same type as the skipfield (unsigned short by default) - so this would work fine for everything except char - where the type would have to be overaligned to unsigned short. Not a huge loss since I don't see people using char with colony much, if at all... (and people can always change the skipfield type to unsigned char for small collections and get better performance anyway)

So what I wound up with was (a) per-block free-lists of indexes via reinterpret_cast'ing of the 'next' free list index into the current index's erased element's memory space, and (b) per-block free list heads (also an index) and (c) a global intrusive singly-linked list of all memory blocks which have erasures (with a global list head). Pretty cool. No additional memory allocations necessary to record the erasures. No wasted memory (asides from the extra per-block head, and 'next' pointer for the global groups-with-erasures list). Better performance due to a lack of allocations upon erasure. Lastly, no having to process the stack to remove erased locations when a memory block is freed to the OS. When you get rid of a memory block now, it's free list of course goes with it- so all you have to do is remove it from the global intrusive list of memory blocks with erasures (pretty quick).

The nice thing about this exercise is how easy it was - I didn't expect the upgrade to be anything but a bundle of heavily-bugged freak-outs, but the initial work was done in about 3 hours, while the rest of it - fixing up capacity(), setting up overaligning for char under C++11 etc, finding the edge-cases, optimizing, took about a week or so. The side-benefit of setting up for overaligning when the skipfield type is larger than type T ended up being support for over-aligned types, which is also cool. I don't forsee a hell of a lot of use of SIMD with colony (for the reasons stated here) but at least it's more possible now.

By the way, rules for colony behaviour when the skipfield type is larger than the element type are as follows:

Also, colony is now exactly three years old since first release! ... Happy Birthday?!

3-12-2017 - 'auto': Dangerous Gimmick, or merely the innocuous result of too many programmers huffing the 'terse code' paint?

For 30 years, comp sci teachers have been trying to educate programmers out of using meaninglessly terse variable, class and object names. But now, the battle appears to be being lost, as too many programmers are failing to recognise that typenames are not an exception.

For the sake of argument, let's say I'm stupid. I like to write code like "int i = 10;". 'i' has an inherent meaning. Sure, I can look at the code and infer the meaning, but that increases cognitive load - or maybe I'm the kind of programmer that takes a perverse pride in my ability to rapidly infer meaning out of meaningless code. If the latter, I'm even dumber than I thought because I failed to realise that those brain cycles could be better spent on other things.

Let's separate this out into 4 concepts: "terseness", "aesthetics", "readability" and "meaningfulness". Compare and contrast the following two paragraphs:

"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements."


"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements. This makes it beneficial for programs where there is substantial interlinking between different sets of elements and where elements are being erased and inserted continuously in real time."

The first paragraph is more terse. Aesthetically, there is no substantial difference between the two, unless you aesthetically prefer shorter writings - this is because aesthetics are largely subjective and meaningless, typically based more on familiarity than logic. The readability of both paragraphs is the same - it's just that the second paragraph makes you read more. But the meaningfulness of the second paragraph is greater. The second sentence in the paragraph *can* be inferred from the first, but doing so (a) increases the cognitive load of the reader and (b) increases the chance they might create an incorrect inference in their own minds, purely through the flexibility of the human brain and thinking.

So we can see that readability and aesthetics do not necessarily have anything to do with terseness or meaningfulness, but terseness and meaningfulness can be opposed. And while there are certainly ways of writing which are both more terse and more meaningful at the same time, often there's a tradeoff. Many programmers appear to side on the 'fewer keystrokes' way of thinking, but this is a mistake if you understand that code is always read more than it is written, even if it's a program that only you yourself will ever read. Going over and over your code is made more tedious and more time-consuming if you waste cycles decrypting the semantics of it.

For this reason, teachers of comp sci will tell you to give your variables and classnames real names, ones which indicate their purpose. Even if "i" is a counter in a loop, just calling it "counter" negates other possible meanings of the variable, and makes the code more meaningful as a result. So why, now, are programmers in multiple languages opting for code similar to "auto i = 0;"?

Let's, again, break this down into multiple categories. The two I'm familiar with are "type cascading" and "terseness". We already understand how terseness is problematic, but to go into more depth, look at the following two sections of C++ code:

for (auto i = some_stuff.begin(); i != some_stuff.end(); ++i)
   // do stuff with 'i'


for (vector<int>::iterator current_stuff = some_stuff.begin(); current_stuff != some_stuff.end(); ++current_stuff)
   // do stuff with 'current_stuff'

So, we're forcing the reader to infer three things in the first section: the type of 'some_stuff', the type of 'i' (iterator or pointer to an element, , presumably) and the meaning of 'i' in the enclosed code. Now, if the loop is reasonably short and uncomplicated, perhaps the omission of a meaningful name for 'i' is not that important - it's right there, after all. If the loop contains a lot of code however, 'i' makes it less meaningful, particularly if there are other meaningless variable names like 'j'. But the worse of the two is the type inference. Not only does the reader have to infer the type of 'i' by going back through the code and finding out what 'some_stuff' refers to, but they must now also infer whether 'i' is an iterator or a pointer.

So while we save keystrokes with the first section, and from a certain point of view it could be considered more aesthetically " pleasing", it is also much more meaningless, and creates more cognitive load, rather than less. This is bad code: not because it's terse, but because you waste more human energy and brain cycles in the code's lifetime.

Type-cascading is another anti-pattern. If you're using 'auto' across multiple functions which call each other in a heirarchy, where the initial type is specified at point 'A', all that means is that at point 'B' you have to navigate back through multiple function calls to find the actual type - wasting time and human energy. It's something that can be done more meaningfully using typedef's, but overall in terms of code meaningfulness should be avoided, where possible. Templates are an example of useful type cascading, but it can be taken to extremes, and when it does, the signal-to-noise ratio becomes strong.

Now, I'm not saying 'auto' is incredibly worthless or that there couldn't be some scenario where it's even necessary - but the way it's used in general in C++ is wrong - it increases terseness at the cost of meaningfulness, and sometimes it doesn't even increase terseness (see 'int', 'char')! It comes across to me as a gimmick, something to make the language more attractive to programmers who're used to the same feature being present in another language. But it's a mistake to think that in all those years promoting meaningful naming in code, that we should leave type definitions out of that equation.

Addendum: I wrote a related article called "Terseness: how little is too much?" on Arne Mertz's blog, which explores the concept of meaningful naming in general.

Contact: footer
plf:: library and this page Copyright (c) 2017, Matthew Bentley