PLF C++ Library - Blog

13-10-2018 - Making in a film in C++ then upscaling it with weird nerd sh*t

~11 years ago a made a film called o2 in C++ - rather, I generated the visuals in C++ by using SDL and just breaking things. Mostly I used sin & cos & rgb values (which is immediately apparent when you watch it). Sometimes I ended up using the values left over in memory from previous programs to accidentally make random stuff (like in the opening sequence) - I'm not actually sure if you can still do this on modern OS's (some of my experiences suggest not).

Anyway, this footage was cut together and I used it to make a 15-minute music video comprising five of the songs from this electronic album. Unfortunately the video seems to be a litmus test for how awful your video codec is. Divx (remember that?) couldn't compress it at all, Xvid did the best job in 2007, but it took several years before h264 became widespread enough (and the encoders for it good enough) so I could actually make a decent non-lossless compress of the film.

Unfortunately youtube dials down the bitrate hard enough that even with a great encode, the video still turned it into an unviewable sea of shifting mpeg blocks. Even when upscaled to 720p with lanczos, the bitrate was not sufficient to avoid the clustertruck of minecraftification. But lately I've been wondering if it was ever going to be possible to make this viewable online. Luckily, I discovered Waifu2x. Waifu2x is an attempt to use a neural net to upscale anime videos & pictures using sample sets to tell the upscaler what the result should look like at a higher resolution. It is extremely precise and crazy. Plus, Super-Nerdcore.

So, I thought I'd give that a shot - it needs an nvidia GPU and CUDA and A BUNCH OF CUDA STUFF but it all works well. The process was surprisingly easy, once I found the right tools. Anyway, I upscaled this film from it's original SD resolution to 4k - and it worked. If anything, it made the upscaled film look better. So, I finally uploaded it to youtube, it converted to 1080p (the film's still in 4:3 aspect ratio, and youtube don't like that for 4k), and it's still not great - though sufficiently more viewable due to the higher bitrate afforded by 1080p, there's still quite a lot of minecraftification happening. So instead, here's a 3GB download of the film at 4k: enjoy.

For reference, the lossless-compressed (UTvideo) version of the film is 106GB.

05-10-2018 - Colony 5 - the new mechanisms

As mentioned on the colony page and in countless as-yet unaccepted standards proposals, any colony implementation is structured around 3 aspects:

In colony v4.5 the third of these was changed from a stack of memory locations to a series of per-memory-block free lists. In colony v5.0 both the second and third change; the third changes from per-memory-block singly-linked free lists of erased element locations, to per-memory-block doubly-linked free lists of skipblocks. A skipblock, in colony terminology, refers to any sequence of contiguous erased elements - including sequences of only one erased element. By recording skipblock locations instead of singular erased element locations, we improve performance by decreasing the number of skipfield nodes which need to be altered upon erasure or insertion. But, this also enables a further improvement.

By only storing skipblocks we are able to change the second aspect above (the skipfield) to the Bentley pattern, instead of using the Advanced Jump-counting skipfield pattern. Without going into it in great detail, the Bentley pattern is a refinement of the simple variant of the jump-counting skipfield pattern. It does not require the middle nodes of a skipblock to be updated upon changes to the skipblock - however, only the beginning and end nodes of the skipblock may be changed. But since we're only keeping records of skipblocks instead of individual erased nodes, this means we can choose which erased element within the skipblock to reuse, and only change a beginning or end node.

In addition, reusing the beginning or end node of a skipblock (current implementation reuses the beginning because of a benchmarked performance advantage in terms of multiple insertions and thus reusing multiple sequential nodes from left to right instead of from right to left) is advantageous for iteration performance; reusing a middle node means splitting a skipblock in two, increasing the overall number of skips during iteration.

Without writing a paper here (I am writing a paper - just not right here), the simple upshot of this is that all skipfield updates for single insertions and erasures are now O(1) amortized, and skipfield updates for multple insertions/erasures are O(1) per-memory-block affected - as opposed to previous versions where all these were undefined in terms of time complexity. Insertion, erasure and iteration become faster in the context of multiple erasures and insertions over time; as the number of skipblocks - and the subsequent number of jumps required during iteration - is reduced substantially.

A lingering question might be, why the change from a singly-linked per-memory-block free list to doubly-linked? Well, turns out if you're storing/reusing singular erased element locations, singly-linked is fine - you make the newly-erased element the new head of the free list and point it's 'next' index to the previous head - regardless of scenario. However if you're storing whole skipblocks, when you reach the scenario where you're erasing an element that is directly between two skipblocks, you have to join the skipblocks and remove the record of one of the skipblocks from the free list - which requires either:

  1. Traversing the whole singly-linked free list to find the skipblock previous to the removed skipblock, and changing it's 'next' index to point to the skipblock after the removed skipblock, or
  2. Changing the singly-linked free list to a doubly-linked free list and updating 'next' and 'previous' indexes respectively for the skipblocks before and after the removed skipblock in the free list sequence.

Across multiple benchmarks the second solution works out to be faster as the first requires jumping all over the memory block in the case of many non-contiguous erasures and subsequently many skipblocks. When the elements in question are large, this can have huge cache effects, since the free list index info is stored within the erased element's memory space, not it's skipfield node. The performance cost of having to update and maintain both a previous and next index is minimal by contrast.

A side-effect of this is that elements are now over-aligned to (sizeof(skipfield type) * 2) (or the type alignment if that is larger). So, assuming the default skipfield type of unsigned short, a colony storing char or short has that type overaligned to 32bits (assuming unsigned short is 16 bits on that platform), in order to store both a next and previous index when an element gets erased and becomes part of a free list. The performance cost of this is minimal, the storage cost is larger (though not as large as it was before v4.5 when colony was still using a pointer stack), but, colony was never designed for small types - if you store chars or shorts in a colony you are already wasting space due to the skipfield node being larger/as large as the element itself. So not a huge loss from my perspective.

So yeah - big update, big change, for the better. Hope everyone gets some use out of it!

02-08-2018 - In the name of Elegance

I did another guest post on Arne's blog about the dangers of letting area-specific 'elegant' solutions dominate a language.

10-07-2018 - The results of Assumptions

The next time someone tells you that std::fill_n or std::fill is as fast as memset, get them to benchmark it.

My results:

Xubuntu 18, Core2Duo E8500 CPU, GCC 7.3

Results in release mode (-O2 -march=native): ============================================= 2018-07-10 21:28:11 Running ./google_benchmark_test Run on (2 X 3800.15 MHz CPU s) CPU Caches: L1 Data 32K (x2) L1 Instruction 32K (x2) L2 Unified 6144K (x1) ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 16488 ns 16477 ns 42460 memory_fill 16493 ns 16493 ns 42440 memory_memset 8414 ns 8408 ns 83022 Results in debug mode (-O0): ============================= ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 87209 ns 87139 ns 8029 memory_fill 94593 ns 94533 ns 7411 memory_memset 8441 ns 8434 ns 82833 Results in -O3 (clang at -O2 is much the same): ================================================ ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- memory_filln 8437 ns 8437 ns 82799 memory_fill 8437 ns 8437 ns 82756 memory_memset 8436 ns 8436 ns 82754
The code (using google benchmark):
#include <cstring>
#include <algorithm>
#include <benchmark/benchmark.h>

double total = 0;

static void memory_memset(benchmark::State& state)
	int ints[50000];

	for (auto _ : state)
		std::memset(ints, 0, sizeof(int) * 50000);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

static void memory_filln(benchmark::State& state)
	int ints[50000];
	for (auto _ : state)
		std::fill_n(ints, 50000, 0);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

static void memory_fill(benchmark::State& state)
	int ints[50000];
	for (auto _ : state)
		std::fill(std::begin(ints), std::end(ints), 0);

	for (int counter = 0; counter != 50000; ++counter)
		total += ints[counter];

// Register the function as a benchmark

int main (int argc, char ** argv)
    benchmark::Initialize (&argc, argv);
    benchmark::RunSpecifiedBenchmarks ();
	printf("Total = %f\n", total);
    return 0;

Note: The for-loop counting the array contents is necessary for the benchmark loops not to get optimized out by clang (which detects that the arrays are unused and removes them).

10-06-2018 - Google benchmark and some strange results

So I took part in an exercise to try and understand google benchmark, as a small part of the larger task of understanding the google toolchains en generale. Google documentation is... lets say, "sparse", sometimes inaccurate and typically lacking in reasonable explanations for newbs, so it took quite a bit of searching to understand what the different macros were doing. Final result: figured out they were running each process a certain number of times to get an estimate as to how many iterations would be necessary to achieve statistically-meaningful results, then running the benchmark post-warmup. Plan to submit a merge request to the google benchmark docs, at some point when I have time.

At any rate, I was surprised to find the insertion (push_back) results did not match my own benchmarks in windows (I was in this case running goog-bench on xubuntu 18 on a core2). Deque, bless it's little libstdc++ heart, was outperforming it by a factor of 2 - which was unusual. For 100000 insertions, libstdc++'s deque should've been performing approximately 781 allocations per-run, as it allocates memory blocks in 512-byte chunks; whereas colony, with it's memory block growth factor of 2, should've performed approximately 12 allocations total. Memory allocation being an expensive operation in general, under my windows benchmarks colony had largely outperformed deque in terms of insertion. So the only conclusion I could come to...

... was that windows memory allocation sucks. I fired up Win7, checked my libstdc++/gcc version numbers (gcc 7.3 x64 on both linux and windows/mingw), downloaded python, cmake, google benchmark & test, and recompiled the test under nuwen mingw. Sure enough, same GCC, same optimization flags (-o2 -march=native), same compilation under Codelite, but now colony outperformed deque by a factor of 2. I re-ran tests both on linux and windows, just to make sure there weren't any fluke values, but it was consistent. BTW, all speedstep/power-saving/services/AV/networking/UAC/etcetera/etcetera/etcetera are disabled on both setups. I also tested a box with an Ivy bridge CPU and Windows 10, and the results were exactly the same. So there you have it; memory allocation under windows is breathtakingly slow. But that wasn't the only surprise. Take a look at the comparitive results below:

Windows results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 897430145 benchmark_deque_creation 128 ns 128 ns 4985723 benchmark_vector_creation 1 ns 1 ns 897430145 benchmark_colony_insertion 281589 ns 275333 ns 2493 benchmark_deque_insertion 546001 ns 546003 ns 1000 benchmark_vector_insertion 736900 ns 736903 ns 1122 benchmark_colony_erasure 713045 ns 713048 ns 897 benchmark_deque_erasure 183300495 ns 183301175 ns 4 benchmark_deque_erasure_remove_if 336472 ns 328826 ns 2040 benchmark_vector_erasure 312000513 ns 312002000 ns 2 benchmark_vector_erasure_remove_if 260000 ns 260002 ns 2640 benchmark_colony_iterate_and_sum 121685 ns 121686 ns 6410 benchmark_deque_iterate_and_sum 95949 ns 95949 ns 7479 benchmark_vector_iterate_and_sum 69534 ns 69535 ns 8974
Linux results:
-------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------- benchmark_colony_creation 1 ns 1 ns 885853600 benchmark_deque_creation 42 ns 42 ns 16841287 benchmark_vector_creation 1 ns 1 ns 886048673 benchmark_colony_insertion 427091 ns 427091 ns 1639 benchmark_deque_insertion 253210 ns 253210 ns 2764 benchmark_vector_insertion 700716 ns 700709 ns 997 benchmark_colony_erasure 743468 ns 743458 ns 939 benchmark_deque_erasure 94806854 ns 94806240 ns 7 benchmark_deque_erasure_remove_if 297223 ns 297222 ns 2355 benchmark_vector_erasure 96288165 ns 96288299 ns 7 benchmark_vector_erasure_remove_if 169655 ns 169655 ns 4100 benchmark_colony_iterate_and_sum 120794 ns 120793 ns 5801 benchmark_deque_iterate_and_sum 95539 ns 95539 ns 7300 benchmark_vector_iterate_and_sum 69133 ns 69133 ns 10100

Note: Both remove_if and non-remove_if erase benchmarks randomly remove 1 in every 8 elements from the total in the container.
You can see non-remove_if erasure (erasing randomly during iteration) in both vector and deque is 2-3x as slow under windows compared to linux - this indicates that, not only are memory Allocations slower, but also memory Copies are slower. However this particular result did not show up on the Win10/Ivy Bridge setup (where the non-remove_if erase results were comparitively equal between linux and windows), indicating that either Win10 is better at memory copies or windows is better at handling memory copies on Ivy Bridge CPUs than Core2's. The "creation" benchmarks (instantiate an instance of a template, check it's size, then destroy it) show the initial allocation/deallocation of deque (unlike vector and colony, libstdc++'s deque implementation allocates it's first memory block upon instantiation instead of upon first insertion) was roughly 4x slower in windows! Lastly, vector erasure with remove_if is almost twice as fast under linux as opposed to windows.

So how did windows memory system get so inefficient? Or perhaps a better question, how did linux's memory management get so efficient? Perhaps that's the right question. Millions of contributors worldwide with an emphasis on performance and stability not "new features" or shiny interfaces, probably. One might ask, why not test this under MSVC? Well, I already did that with my regular benchmarks, and they showed much the same results as mingw, only slower, so that's probably not necessary. Given that colony was initially designed for game engines and games almost universally are designed for windows on the desktop/laptop, this is not a great loss for colony, at least in it's original use-case. And of course it's non-remove_if erasure results are still the best out of all containers. The last thing to note from these results is that colony insertion is noticably slower under linux than under windows. I'm not sure why this should be, on the same compiler, same processor etc. So there's something to look into there.

For anyone wanting to run these benchmarks themselves, here is the code:

UPDATE 19-06-2018: I tried doing a simple google benchmark with memory allocation/deallocation only. This was consistent with the findings for the allocation/deallocation of deque - about 37ns for Xubuntu (GCC) and about 176ns for Windows 7 (mingw-GCC - ~220ns under MSVC2017). The results were almost exactly the same on the Ivy Bridge/Win10 setup. In addition I also tested memset, which was 20-30% slower on windows, depending on the CPU/platform. The .cpp for this test has been included in the zip above.

20-04-2018 - Colony 4.5 vs colony 4.0 benchmark results

Performance increase/decrease for raw performance tests on average across all numbers of elements, erasing 25% of all elements during erasure:
Operation Char Int Double Small Struct Large Struct
Insert +2% +3% +5% +4% 0%
Erase +0% +26% +13% +10% +5%
Iterate -43% -1% 0% 0% 0%

The negative result for char iteration is due to the fact that the char type has to be overaligned to the same size as the skipfield type (unsigned short) in order for the free list mechanism to work in v4.5 (v4.0 used a pointer stack). This can be mitigated by using unsigned char for the skipfield type in colony's template arguments (only really useful for small collections < 1024 elements).

Erasure performance for small structs, across all numbers of elements, erasing 75% of all elements:

+35% average, +58% max (low number of elements), +20% min (high number of elements)

General use-case benchmarking, small struct, on average across all numbers of elements and rates of modification:

Low-modification: +1%,
High-modification: +2.5%

Referencer use-case benchmarking, small struct, on average across all numbers of elements and levels of modification:

+2%, increased lead vs packed-array types

02-04-2018 - Colony 4.5 released - major update.

So here's the thing. colony 4.5 is a lot faster than it's predecessor, due to ditching a stack for erased locations and employing per-memory-block free-lists (plus a global intrusive list of blocks-with-erasures) instead. It also uses less memory, and enables erase to be more exception-safe due to a lack of allocation. Also it supports overaligned types! Here's how that all came about:

Someone at the container-focused C++ committee meeting in Jacksonville said colony shouldn't use a stack for erasures because that could create an allocation exception upon expanding the stack. Now that's partially true, erase can throw exceptions anyway because (a) a destructor might throw an exception - for example, if an object has a file open and can't close it on destruction for some reason - and (b) an invalid/uninitialized iterator could be supplied to the function. Regardless, it's a good point, and I started looking into how to make that not happen. I came back to an old idea I'd had for using per-memory-block free lists instead of a stack for recording erased/reusable locations for future insertions.

The main reason I didn't go with that in the first instance was because I didn't really understand how explicit casting works in C++; originally I thought that I would need to use unions to make a free list work in this case, which don't support basic object-oriented functionality like constructors etc. When in reality, if you have a pointer to something in C++, you can reinterpret_cast anything to it, provided you don't overflow bit boundaries and mess something up. I don't particularly see why C++ doesn't allow one to do this with just regular objects, but I'm sure there's some eldritch reasoning.

The second reason I thought it wouldn't work was because free lists typically utilize pointers, and while fast, pointers are large(ish) - so using colony with scalar types would mean overaligning them to 64-bit most of the time - not cool man!!! Because of my experiences writing plf::list, I was more experienced with thinking through this approach with pointers in mind. But, I realised eventually that since the free list is per-block I could use indexes relative to the start of that memory block, instead of pointers, and that these indexes only had to be of the same type as the skipfield (unsigned short by default) - so this would work fine for everything except char - where the type would have to be overaligned to unsigned short. Not a huge loss since I don't see people using char with colony much, if at all... (and people can always change the skipfield type to unsigned char for small colelctions and get better performance anyway)

So what I wound up with was (a) per-block free-lists of indexes via reinterpret_cast'ing of the 'next' free list index into the current index's erased element's memory space, and (b) per-block free list heads (also an index) and (c) a global intrusive singly-linked list of all memory blocks which have erasures (with a global list head). Pretty cool. No additional memory allocations necessary to record the erasures. No wasted memory (asides from the extra per-block head, and 'next' pointer for the global groups-with-erasures list). Better performance due to a lack of allocations upon erasure. Lastly, no having to process the stack to remove erased locations when a memory block is freed to the OS. When you get rid of a memory block now, it's free list of course goes with it- so all you have to do is remove it from the global intrusive list of memory blocks with erasures (pretty quick).

The nice thing about this exercise is how easy it was - I didn't expect the upgrade to be anything but a bundle of heavily-bugged freak-outs, but the initial work was done in about 3 hours, while the rest of it - fixing up capacity(), setting up overaligning for char under C++11 etc, finding the edge-cases, optimizing, took about a week or so. The side-benefit of setting up for overaligning when the skipfield type is larger than type T ended up being support for over-aligned types, which is also cool. I don't forsee a hell of a lot of use of SIMD with colony (for the reasons stated here) but at least it's more possible now.

By the way, rules for colony behaviour when the skipfield type is larger than the element type are as follows:

Also, colony is now exactly three years old since first release! ... Happy Birthday?!

3-12-2017 - 'auto': Dangerous Gimmick, or merely the innocuous result of too many programmers huffing the 'terse code' paint?

For 30 years, comp sci teachers have been trying to educate programmers out of using meaninglessly terse variable, class and object names. But now, the battle appears to be being lost, as too many programmers are failing to recognise that typenames are not an exception.

For the sake of argument, let's say I'm stupid. I like to write code like "int i = 10;". 'i' has an inherent meaning. Sure, I can look at the code and infer the meaning, but that increases cognitive load - or maybe I'm the kind of programmer that takes a perverse pride in my ability to rapidly infer meaning out of meaningless code. If the latter, I'm even dumber than I thought because I failed to realise that those brain cycles could be better spent on other things.

Let's separate this out into 4 concepts: "terseness", "aesthetics", "readability" and "meaningfulness". Compare and contrast the following two paragraphs:

"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements."


"Colony is a container with fast insertion and erasure times, which doesn't invalidate pointers to non-erased elements. This makes it beneficial for programs where there is substantial interlinking between different sets of elements and where elements are being erased and inserted continuously in real time."

The first paragraph is more terse. Aesthetically, there is no substantial difference between the two, unless you aesthetically prefer shorter writings - this is because aesthetics are largely subjective and meaningless, typically based more on familiarity than logic. The readability of both paragraphs is the same - it's just that the second paragraph makes you read more. But the meaningfulness of the second paragraph is greater. The second sentence in the paragraph *can* be inferred from the first, but doing so (a) increases the cognitive load of the reader and (b) increases the chance they might create an incorrect inference in their own minds, purely through the flexibility of the human brain and thinking.

So we can see that readability and aesthetics do not necessarily have anything to do with terseness or meaningfulness, but terseness and meaningfulness can be opposed. And while there are certainly ways of writing which are both more terse and more meaningful at the same time, often there's a tradeoff. Many programmers appear to side on the 'fewer keystrokes' way of thinking, but this is a mistake if you understand that code is always read more than it is written, even if it's a program that only you yourself will ever read. Going over and over your code is made more tedious and more time-consuming if you waste cycles decrypting the semantics of it.

For this reason, teachers of comp sci will tell you to give your variables and classnames real names, ones which indicate their purpose. Even if "i" is a counter in a loop, just calling it "counter" negates other possible meanings of the variable, and makes the code more meaningful as a result. So why, now, are programmers in multiple languages opting for code similar to "auto i = 0;"?

Let's, again, break this down into multiple categories. The two I'm familiar with are "type cascading" and "terseness". We already understand how terseness is problematic, but to go into more depth, look at the following two sections of C++ code:

for (auto i = some_stuff.begin(); i != some_stuff.end(); ++i)
   // do stuff with 'i'


for (vector<int>::iterator current_stuff = some_stuff.begin(); current_stuff != some_stuff.end(); ++current_stuff)
   // do stuff with 'current_stuff'

So, we're forcing the reader to infer three things in the first section: the type of 'some_stuff', the type of 'i' (iterator or pointer to an element, , presumably) and the meaning of 'i' in the enclosed code. Now, if the loop is reasonably short and uncomplicated, perhaps the omission of a meaningful name for 'i' is not that important - it's right there, after all. If the loop contains a lot of code however, 'i' makes it less meaningful, particularly if there are other meaningless variable names like 'j'. But the worse of the two is the type inference. Not only does the reader have to infer the type of 'i' by going back through the code and finding out what 'some_stuff' refers to, but they must now also infer whether 'i' is an iterator or a pointer.

So while we save keystrokes with the first section, and from a certain point of view it could be considered more aesthetically " pleasing", it is also much more meaningless, and creates more cognitive load, rather than less. This is bad code: not because it's terse, but because you waste more human energy and brain cycles in the code's lifetime.

Type-cascading is another anti-pattern. If you're using 'auto' across multiple functions which call each other in a heirarchy, where the initial type is specified at point 'A', all that means is that at point 'B' you have to navigate back through multiple function calls to find the actual type - wasting time and human energy. It's something that can be done more meaningfully using typedef's, but overall in terms of code meaningfulness should be avoided, where possible. Templates are an example of useful type cascading, but it can be taken to extremes, and when it does, the signal-to-noise ratio becomes strong.

Now, I'm not saying 'auto' is incredibly worthless or that there couldn't be some scenario where it's even necessary - but the way it's used in general in C++ is wrong - it increases terseness at the cost of meaningfulness, and sometimes it doesn't even increase terseness (see 'int', 'char')! It comes across to me as a gimmick, something to make the language more attractive to programmers who're used to the same feature being present in another language. But it's a mistake to think that in all those years promoting meaningful naming in code, that we should leave type definitions out of that equation.

Addendum: I wrote a related article called "Terseness: how little is too much?" on Arne Mertz's blog, which explores the concept of meaningful naming in general.

Contact: footer
plf:: library and this page Copyright (c) 2017, Matthew Bentley