PLF C++ Library - Blog

10-07-2018 - The results of Assumptions

The next time someone tells you that std::fill_n or std::fill is as fast as memset, get them to benchmark it.

My results:

Xubuntu 18, Core2Duo E8500 CPU, GCC 7.3

Results in release mode (-O2 -march=native): ============================================= 2018-07-10 21:28:11 Running ./google_benchmark_test Run on (2 X 3800.15 MHz CPU s) CPU Caches: L1 Data 32K (x2) L1 Instruction 32K (x2) L2 Unified 6144K (x1) ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 16488 ns 16477 ns 42460 memory_fill 16493 ns 16493 ns 42440 memory_memset 8414 ns 8408 ns 83022 Results in debug mode (-O0): ============================= ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 87209 ns 87139 ns 8029 memory_fill 94593 ns 94533 ns 7411 memory_memset 8441 ns 8434 ns 82833 Results in -O3 (clang at -O2 is much the same): ================================================ ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 8437 ns 8437 ns 82799 memory_fill 8437 ns 8437 ns 82756 memory_memset 8436 ns 8436 ns 82754
The code (using google benchmark):
#include <cstring>
#include <algorithm>
#include <benchmark/benchmark.h>

double total = 0;


static void memory_memset(benchmark::State& state)
{
	int ints[50000];
	
	for (auto _ : state)
	{
		std::memset(ints, 0, sizeof(int) * 50000);
	}

	for (int counter = 0; counter != 50000; ++counter)
	{
		total += ints[counter];
	}
}


static void memory_filln(benchmark::State& state)
{
	int ints[50000];
	
	for (auto _ : state)
	{
		std::fill_n(ints, 50000, 0);
	}

	for (int counter = 0; counter != 50000; ++counter)
	{
		total += ints[counter];
	}
}


static void memory_fill(benchmark::State& state)
{
	int ints[50000];
	
	for (auto _ : state)
	{
		std::fill(std::begin(ints), std::end(ints), 0);
	}

	for (int counter = 0; counter != 50000; ++counter)
	{
		total += ints[counter];
	}
}


// Register the function as a benchmark
BENCHMARK(memory_filln);
BENCHMARK(memory_fill);
BENCHMARK(memory_memset);



int main (int argc, char ** argv)
{
    benchmark::Initialize (&argc, argv);
    benchmark::RunSpecifiedBenchmarks ();
	printf("Total = %f\n", total);
	getchar();
    return 0;
}

Note: The for-loop counting the array contents is necessary for the benchmark loops not to get optimized out by clang (which detects that the arrays are unused and removes them).

10-06-2018 - Google benchmark and some strange results

So I took part in an exercise to try and understand google benchmark, as a small part of the larger task of understanding the google toolchains en generale. Google documentation is... lets say, "sparse", sometimes innaccurate and typically lacking in reasonable explanations for newbs, so it took quite a bit of searching to understand what the different macros were doing. Final result: figured out they were running each process a certain number of times to get an estimate as to how many iterations would be necessary to achieve statistically-meaningful results, then running the benchmark post-warmup. Plan to submit a merge request to the google benchmark docs, at some point when I have time.

At any rate, I was surprised to find the insertion (push_back) results did not match my own benchmarks in windows (I was in this case running goog-bench on xubuntu 18 on a core2). Deque, bless it's little libstdc++ heart, was outperforming it by a factor of 2 - which was unusual. For 100000 insertions, libstdc++'s deque should've been performing approximately 781 allocations per-run, as it allocates memory blocks in 512-byte chunks; whereas colony, with it's memory block growth factor of 2, should've performed approximately 12 allocations total. Memory allocation being an expensive operation in general, under my windows benchmarks colony had largely outperformed deque in terms of insertion. So the only conclusion I could come to...

... was that windows memory allocation sucks. I fired up Win7, checked my libstdc++/gcc version numbers (gcc 7.3 x64 on both linux and windows/mingw), downloaded python, cmake, google benchmark & test, and recompiled the test under nuwen mingw. Sure enough, same GCC, same optimization flags (-o2 -march=native), same compilation under Codelite, but now colony outperformed deque by a factor of 2. I re-ran tests both on linux and windows, just to make sure there weren't any fluke values, but it was consistent. BTW, all speedstep/power-saving/services/AV/networking/UAC/etcetera/etcetera/etcetera are disabled on both setups. I also tested a box with an Ivy bridge CPU and Windows 10, and the results were exactly the same. So there you have it; memory allocation under windows is breathtakingly slow. But that wasn't the only surprise. Take a look at the comparitive results below:

Windows results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 897430145 benchmark_deque_creation 128 ns 128 ns 4985723 benchmark_vector_creation 1 ns 1 ns 897430145 benchmark_colony_insertion 281589 ns 275333 ns 2493 benchmark_deque_insertion 546001 ns 546003 ns 1000 benchmark_vector_insertion 736900 ns 736903 ns 1122 benchmark_colony_erasure 713045 ns 713048 ns 897 benchmark_deque_erasure 183300495 ns 183301175 ns 4 benchmark_deque_erasure_remove_if 336472 ns 328826 ns 2040 benchmark_vector_erasure 312000513 ns 312002000 ns 2 benchmark_vector_erasure_remove_if 260000 ns 260002 ns 2640 benchmark_colony_iterate_and_sum 121685 ns 121686 ns 6410 benchmark_deque_iterate_and_sum 95949 ns 95949 ns 7479 benchmark_vector_iterate_and_sum 69534 ns 69535 ns 8974
Linux results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 885853600 benchmark_deque_creation 42 ns 42 ns 16841287 benchmark_vector_creation 1 ns 1 ns 886048673 benchmark_colony_insertion 427091 ns 427091 ns 1639 benchmark_deque_insertion 253210 ns 253210 ns 2764 benchmark_vector_insertion 700716 ns 700709 ns 997 benchmark_colony_erasure 743468 ns 743458 ns 939 benchmark_deque_erasure 94806854 ns 94806240 ns 7 benchmark_deque_erasure_remove_if 297223 ns 297222 ns 2355 benchmark_vector_erasure 96288165 ns 96288299 ns 7 benchmark_vector_erasure_remove_if 169655 ns 169655 ns 4100 benchmark_colony_iterate_and_sum 120794 ns 120793 ns 5801 benchmark_deque_iterate_and_sum 95539 ns 95539 ns 7300 benchmark_vector_iterate_and_sum 69133 ns 69133 ns 10100

Note: Both remove_if and non-remove_if erase benchmarks randomly remove 1 in every 8 elements from the total in the container.
You can see non-remove_if erasure (erasing randomly during iteration) in both vector and deque is 2-3x as slow under windows compared to linux - this indicates that, not only are memory Allocations slower, but also memory Copies are slower. However this particular result did not show up on the Win10/Ivy Bridge setup (where the non-remove_if erase results were comparitively equal between linux and windows), indicating that either Win10 is better at memory copies or windows is better at handling memory copies on Ivy Bridge CPUs than Core2's. The "creation" benchmarks (instantiate an instance of a template, check it's size, then destroy it) show the initial allocation/deallocation of deque (unlike vector and colony, libstdc++'s deque implementation allocates it's first memory block upon instantiation instead of upon first insertion) was roughly 4x slower in windows! Lastly, vector erasure with remove_if is almost twice as fast under linux as opposed to windows.

So how did windows memory system get so inefficient? Or perhaps a better question, how did linux's memory management get so efficient? Perhaps that's the right question. Millions of contributors worldwide with an emphasis on performance and stability not "new features" or shiny interfaces, probably. One might ask, why not test this under MSVC? Well, I already did that with my regular benchmarks, and they showed much the same results as mingw, only slower, so that's probably not necessary. Given that colony was initially designed for game engines and games almost universally are designed for windows on the desktop/laptop, this is not a great loss for colony, at least in it's original use-case. And of course it's non-remove_if erasure results are still the best out of all containers. The last thing to note from these results is that colony insertion is noticably slower under linux than under windows. I'm not sure why this should be, on the same compiler, same processor etc. So there's something to look into there.

For anyone wanting to run these benchmarks themselves, here is the code:
google_benchmark_colony_deque_vector.zip.

UPDATE 19-06-2018: I tried doing a simple google benchmark with memory allocation/deallocation only. This was consistent with the findings for the allocation/deallocation of deque - about 37ns for Xubuntu (GCC) and about 176ns for Windows 7 (mingw-GCC - ~220ns under MSVC2017). The results were almost exactly the same on the Ivy Bridge/Win10 setup. In addition I also tested memset, which was 20-30% slower on windows, depending on the CPU/platform. The .cpp for this test has been included in the zip above.

20-04-2018 - Colony 4.5 vs colony 4.0 benchmark results

Performance increase/decrease for raw performance tests on average across all numbers of elements, erasing 25% of all elements during erasure:
Operation Char Int Double Small Struct Large Struct
Insert +2% +3% +5% +4% 0%
Erase +0% +26% +13% +10% +5%
Iterate -43% -1% 0% 0% 0%

The negative result for char iteration is due to the fact that the char type has to be overaligned to the same size as the skipfield type (unsigned short) in order for the free list mechanism to work in v4.5 (v4.0 used a pointer stack). This can be mitigated by using unsigned char for the skipfield type in colony's template arguments (only really useful for small collections < 1024 elements).

Erasure performance for small structs, across all numbers of elements, erasing 75% of all elements:

+35% average, +58% max (low number of elements), +20% min (high number of elements)

General use-case benchmarking, small struct, on average across all numbers of elements and rates of modification:

Low-modification: +1%,
High-modification: +2.5%

Referencer use-case benchmarking, small struct, on average across all numbers of elements and levels of modification:

+2%, increased lead vs packed-array types

02-04-2018 - Colony 4.5 released - major update.

So here's the thing. colony 4.5 is a lot faster than it's predecessor, due to ditching a stack for erased locations and employing per-memory-block free-lists (plus a global intrusive list of blocks-with-erasures) instead. It also uses less memory, and enables erase to be more exception-safe due to a lack of allocation. Also it supports overaligned types! Here's how that all came about:

Someone at the container-focused C++ committee meeting in Jacksonville said colony shouldn't use a stack for erasures because that could create an allocation exception upon expanding the stack. Now that's partially true, erase can throw exceptions anyway because (a) a destructor might throw an exception - for example, if an object has a file open and can't close it on destruction for some reason - and (b) an invalid/uninitialized iterator could be supplied to the function. Regardless, it's a good point, and I started looking into how to make that not happen. I came back to an old idea I'd had for using per-memory-block free lists instead of a stack for recording erased/reusable locations for future insertions.

The main reason I didn't go with that in the first instance was because I didn't really understand how explicit casting works in C++; originally I thought that I would need to use unions to make a free list work in this case, which don't support basic object-oriented functionality like constructors etc. When in reality, if you have a pointer to something in C++, you can reinterpret_cast anything to it, provided you don't overflow bit boundaries and mess something up. I don't particularly see why C++ doesn't allow one to do this with just regular objects, but I'm sure there's some eldritch reasoning.

The second reason I thought it wouldn't work was because free lists typically utilize pointers, and while fast, pointers are large(ish) - so using colony with scalar types would mean overaligning them to 64-bit most of the time - not cool man!!! Because of my experiences writing plf::list, I was more experienced with thinking through this approach with pointers in mind. But, I realised eventually that since the free list is per-block I could use indexes relative to the start of that memory block, instead of pointers, and that these indexes only had to be of the same type as the skipfield (unsigned short by default) - so this would work fine for everything except char - where the type would have to be overaligned to unsigned short. Not a huge loss since I don't see people using char with colony much, if at all... (and people can always change the skipfield type to unsigned char for small colelctions and get better performance anyway)

So what I wound up with was (a) per-block free-lists of indexes via reinterpret_cast'ing of the 'next' free list index into the current index's erased element's memory space, and (b) per-block free list heads (also an index) and (c) a global intrusive singly-linked list of all memory blocks which have erasures (with a global list head). Pretty cool. No additional memory allocations necessary to record the erasures. No wasted memory (asides from the extra per-block head, and 'next' pointer for the global groups-with-erasures list). Better performance due to a lack of allocations upon erasure. Lastly, no having to process the stack to remove erased locations when a memory block is freed to the OS. When you get rid of a memory block now, it's free list of course goes with it- so all you have to do is remove it from the global intrusive list of memory blocks with erasures (pretty quick).

The nice thing about this exercise is how easy it was - I didn't expect the upgrade to be anything but a bundle of heavily-bugged freak-outs, but the initial work was done in about 3 hours, while the rest of it - fixing up capacity(), setting up overaligning for char under C++11 etc, finding the edge-cases, optimizing, took about a week or so. The side-benefit of setting up for overaligning when the skipfield type is larger than type T ended up being support for over-aligned types, which is also cool. I don't forsee a hell of a lot of use of SIMD with colony (for the reasons stated here) but at least it's more possible now.

By the way, rules for colony behaviour when the skipfield type is larger than the element type are as follows:

Also, colony is now exactly three years old since first release! ... Happy Birthday?!

3-12-2017 - 'auto': Dangerous Gimmick, or merely the innocuous result of too many programmers huffing the 'terse code' paint?

For 30 years, comp sci teachers have been trying to educate programmers out of using meaninglessly terse variable, class and object names. But now, the battle appears to be being lost, as too many programmers are failing to recognise that typenames are not an exception.

For the sake of argument, let's say I'm stupid. I like to write code like "int i = 10;". 'i' has an inherent meaning. Sure, I can look at the code and infer the meaning, but that increases cognitive load - or maybe I'm the kind of programmer that takes a perverse pride in my ability to rapidly infer meaning out of meaningless code. If the latter, I'm even dumber than I thought because I failed to realise that those brain cycles could be better spent on other things.

Let's separate this out into 4 concepts: "terseness", "aesthetics", "readability" and "meaningfulness". Compare and contrast the following two paragraphs:

"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements."

versus

"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements. This makes it beneficial for programs where there is substantial interlinking between different sets of elements and where elements are being erased and inserted continuously in real time."

The first paragraph is more terse. Aesthetically, there is no substantial difference between the two, unless you aesthetically prefer shorter writings - this is because aesthetics are largely subjective and meaningless, typically based more on familiarity than logic. The readability of both paragraphs is the same - it's just that the second paragraph makes you read more. But the meaningfulness of the second paragraph is greater. The second sentence in the paragraph *can* be inferred from the first, but doing so (a) increases the cognitive load of the reader and (b) increases the chance they might create an incorrect inference in their own minds, purely through the flexibility of the human brain and thinking.

So we can see that readability and aesthetics do not necessarily have anything to do with terseness or meaningfulness, but terseness and meaningfulness can be opposed. And while there are certainly ways of writing which are both more terse and more meaningful at the same time, often there's a tradeoff. Many programmers appear to side on the 'fewer keystrokes' way of thinking, but this is a mistake if you understand that code is always read more than it is written, even if it's a program that only you yourself will ever read. Going over and over your code is made more tedious and more time-consuming if you waste cycles decrypting the semantics of it.

For this reason, teachers of comp sci will tell you to give your variables and classnames real names, ones which indicate their purpose. Even if "i" is a counter in a loop, just calling it "counter" negates other possible meanings of the variable, and makes the code more meaningful as a result. So why, now, are programmers in multiple languages opting for code similar to "auto i = 0;"?

Let's, again, break this down into multiple categories. The two I'm familiar with are "type cascading" and "terseness". We already understand how terseness is problematic, but to go into more depth, look at the following two sections of C++ code:

for (auto i = some_stuff.begin(); i != some_stuff.end(); ++i)
{
   // do stuff with 'i'
}

vs

for (vector<int>::iterator current_stuff = some_stuff.begin(); current_stuff != some_stuff.end(); ++current_stuff)
{
   // do stuff with 'current_stuff'
}

So, we're forcing the reader to infer three things in the first section: the type of 'some_stuff', the type of 'i' (iterator or pointer to an element, , presumably) and the meaning of 'i' in the enclosed code. Now, if the loop is reasonably short and uncomplicated, perhaps the omission of a meaningful name for 'i' is not that important - it's right there, after all. If the loop contains a lot of code however, 'i' makes it less meaningful, particularly if there are other meaningless variable names like 'j'. But the worse of the two is the type inference. Not only does the reader have to infer the type of 'i' by going back through the code and finding out what 'some_stuff' refers to, but they must now also infer whether 'i' is an iterator or a pointer.

So while we save keystrokes with the first section, and from a certain point of view it could be considered more aesthetically " pleasing", it is also much more meaningless, and creates more cognitive load, rather than less. This is bad code: not because it's terse, but because you waste more human energy and brain cycles in the code's lifetime.

Type-cascading is another anti-pattern. If you're using 'auto' across multiple functions which call each other in a heirarchy, where the initial type is specified at point 'A', all that means is that at point 'B' you have to navigate back through multiple function calls to find the actual type - wasting time and human energy. It's something that can be done more meaningfully using typedef's, but overall in terms of code meaningfulness should be avoided, where possible. Templates are an example of useful type cascading, but it can be taken to extremes, and when it does, the signal-to-noise ratio becomes strong.

Now, I'm not saying 'auto' is incredibly worthless or that there couldn't be some scenario where it's even necessary - but the way it's used in general in C++ is wrong - it increases terseness at the cost of meaningfulness, and sometimes it doesn't even increase terseness (see 'int', 'char')! It comes across to me as a gimmick, something to make the language more attractive to programmers who're used to the same feature being present in another language. But it's a mistake to think that in all those years promoting meaningful naming in code, that we should leave type definitions out of that equation.

Addendum: I wrote a related article called "Terseness: how little is too much?" on Arne Mertz's blog, which explores the concept of meaningful naming in general.

Contact: footer
plf:: library and this page Copyright (c) 2017, Matthew Bentley