PLF C++ Library - plf::reorderase

Intro
Motivation
License
Download
Function syntax
Benchmarks
Version History
Contact

Introduction

Reorderase (portmanteau of 'reorder erase') is an evolution of the classic "swap-and-pop" idiom commonly used to speed up singular erasures on deques and vectors. In essence that technique involves swapping the element you want to erase with the back element of the container, then using pop_back() to erase the element. Since popping the back element involves no element relocation, whereas erasing from a non-back location in a vector/deque requires moving every single element after the erased element backwards by 1, this has a strong performance advantage, particularly when size() is large and/or the element type is large/non-trivially-copyable. The only caveat is, you have to be okay with the order of elements in the container changing when you erase.

We can optimize this technique further by removing the unnecessary swap and simply copying or moving the back element to the location of the element we wish to erase, then popping. This reduces a 5-item operation (allocate swap buffer, copy to swap, copy back to location, copy swap buffer to back, pop) down to a 2-item operation (move/copy + pop) and removes the allocation/deallocation required for the swap. Reorderase (single erasure) does this and also accomodates trivial types, non-movable types and types which might throw when copying (the latter requires a temporary buffer in case the copy fails).

Reorderase goes further: the same technique used for single erasures can be extended to ranged erasures (ie. erase(first, last)) with great results - again, provided that one does not care about element reordering. And it can also be used to construct alternatives to std::erase_if, std::erase and std::partition which have strong performance advantages with larger element types.

In addition a few new algorithms become apparent: unordered erase_if-style erasure for only a sub-range of a container's elements. This basically takes elements from the back of the container to fill in the holes. A slightly different method of partitioning to the 3 major vendor implementations is introduced (plf::partition), which performs better on some CPU architectures and worse on others (older architectures tend to benefit more). As well as it's destructive equivalent (plf::destructive_partition), which leaves moved-from elements at the end of the range instead of swapping.

A performance summary when using std::vector follows, using averages from benchmarks across four different element types (4-byte, 8-byte, 40-byte and 490-byte) and numbers of elements randing from 10 to ~920000/870000 (depending on test). The numbers of elements are increased by a factor of 1.1 for each measurement, hence there are more measurements of lower numbers of elements and the results are thererby skewed toward low numbers of elements. Reorderase's performance advantages increase with larger numbers of elements, hence the averages below can be seen as very conservative. To clarify, 100% faster == twice as fast. 200% faster == three times as fast:

Single erasures: 1010566% faster than .erase(position) on average (ie. 10157 x original speed), with a minimum of 10% faster for a vector of 10 ints, and a maximum of 25974010% faster for 870000 490-byte structs.
Ranged erasures: 12570% faster than .erase(first, last) on average, with a minimum of 20% faster for a vector of 10 ints, and a maximum of 77679% for 920000 490-byte structs.
std::erase_if/std::erase-style erasures: This typically only outperforms the std:: equivalents on types larger than 25 bytes. 38% faster than std::erase_if(container, predicate)/std::erase(container, value) on average, with a minimum of 10% slower for 50 doubles, and a maximum of 180% for 920000 490-byte structs. The results for std::partition are the same (they use the same algorithm).

See the individual benchmarks for more specifics for different types and numbers of elements.

Motivation

Although the swap-and-pop idiom is relatively well-known in computing circles, a lot of people do not know about it. In addition, many who do know the technique use the original swap technique rather than the optimized move-and-replace one. This implementation is the first I've seen for ranged erasures, or erase_if alternatives (have subsequently seen Brent Freidman's implementation from the 2015 unstable_remove paper attempt, which is similar in terms of logic but not as optimized). Non-back vector erasure and non-front/back deque erasure have appalling performance, but many programmers in my experience do not appear to realise the extent to which this hampers them.

Benchmarks

License

plf::reorderase is under a permissive zlib license. This means: Software is used on 'as-is' basis. It can be modified and used in commercial software. Authors are not liable for any damages arising from its use. The distribution of a modified version of the software is subject to the following restrictions:

The authorship of the original software must not be misrepresented,
Altered source versions must not be misrepresented as being the original software, and
The license notice must not be removed from source distributions.

Download

Download here (4kb zip file) or view the repository

The reorderase function library is a simple .h header file, to be included with a #include command.
In addition if you are interested in benchmarking you can also download the plf benchmark suite.

Function Descriptions and syntax

template <class container_type, class iterator_type = typename container_type::iterator> constexpr iterator_type reorderase (container_type &container, iterator_type position)
Erase single element. Use wherever you would use a container's .erase(position) member function normally.
Example: std::vector<int> i(20, 10); plf::reorderase(i, i.begin() + 10);
template <class container_type, class iterator_type = typename container_type::iterator> constexpr iterator_type reorderase (container_type &container, iterator_type first, iterator_type last)
Erase a range of elements. Use wherever you would use a container's .erase(first, last) member function normally. Example:
std::vector<int> i(20, 10); plf::reorderase(i, i.begin() + 5, i.end() - 4);
template <class container_type, class predicate_function> constexpr void reorderase_all_if (container_type &container, predicate_function predicate)
Erase all elements in a container which match the predicate. Use wherever you would use std::erase_if normally. Example using a lambda:
std::vector<int> i({1, 2, 3, 4, 4, 5, 4, 6, 7, 8, 9}); plf::reorderase_all_if(i, [](typename container_type::value_type x){return x < 5;});
template <class container_type, class value_type> constexpr void reorderase_all (container_type &container, value_type value)
Erase all elements in a container which are equal to value. Use wherever you would use std::erase normally. value must be able to be static_cast'd to container_type::value_type. Example:
std::vector<int> i({1, 2, 3, 4, 4, 5, 4, 6, 7, 8, 9}); plf::reorderase_all(i, 4);
template <class container_type, class iterator_type, class predicate_function> constexpr void reorderase_all_if (container_type &container, iterator_type first, iterator_type last, predicate_function predicate)
Erase all elements in a sub-range of a container which match the predicate. Equivalent to std::erase_if, if there were an equivalent of std::erase_if for sub-ranges of containers. Example using a lambda:
std::vector<int> i({1, 2, 3, 4, 4, 5, 4, 6, 7, 8, 9}); std::vector<int>::iterator it1 = i.begin(), it2 = it1 + 5; plf::reorderase_all_if(i, it1, it2, [](typename container_type::value_type x){return x < 5;});
template <class container_type, class iterator_type, class value_type> constexpr void reorderase_all (container_type &container, iterator_type first, iterator_type last, value_type value)
Erase all elements in a sub-range of a container which are equal to value. Will use appropriate elements from the back of the container to fill the holes. Equivalent to std::erase, if there were an equivalent of std::erase for sub-ranges of containers. value must be able to be static_cast'd to container_type::value_type. Example:
std::vector<int> i({1, 2, 3, 4, 4, 5, 4, 6, 7, 8, 9}); std::vector<int>::iterator it1 = i.begin(), it2 = it1 + 5; plf::reorderase_all(i, it1, it2, 4);
template <class iterator_type, class predicate_function> constexpr iterator_type partition (iterator_type first, iterator_type last, predicate_function predicate)
My personal algorithm for partitioning. Example using a lambda:
std::vector<int> i({1, 2, 3, 4, 4, 5, 4, 6, 7, 8, 9}); plf::partition(i.begin() + 2, i.begin() + 5, [](typename container_type::value_type x){return x < 5;});
template <class iterator_type, class predicate_function> constexpr iterator_type destructive_partition (iterator_type first, iterator_type last, predicate_function predicate)
Destructively partition all elements which match the predicate in a range ie. move elements from the back of the range (which don't match the predicate) on top of elements matching the predicate, leaving a trailing range of moved-from elements at the back of the range. Use wherever you would use std::partition but only if you don't want to preserve the elements which would be, in std::partition, moved to the back of the range. Returns an iterator pointing to the first moved-from element in the trailing range. Example using a lambda:
std::vector<int> i({1, 2, 3, 4, 4, 5, 4, 6, 7, 8, 9}); plf::destructive_partition(i.begin() + 2, i.begin() + 5, [](typename container_type::value_type x){return x < 5;});

const_iterators are not supported for the obvious reason that the content of the position and first iterators will get written to.

You might ask, why return an iterator? With regular vector/deque erase member functions, the location of the next element post-erasure is always the same place as the first element you've erased, ie. for singular erasures, position is the next element's location post-erasure, as every element gets shifted back by one. Same for range erasure - first is always the returned location. Therefore returning an iterator from erase is a waste of CPU cycles. Well, unfortunately the optimisations for this algorithm for deque (erasing from the front where appropriate instead of moving elements) mean that the return value isn't necessarily first or position - it could be position + 1, or last.

Okay, last question: these algorithms are only for random_access containers, is there any point to making them compatible with bidirectional containers like colony, std::list, etc? No. Those containers do not have performance issues with erasure like deque/vector/static_vector do. It's only deques and vectors that have the massive amount of element relocation caused by erasures.

Benchmarks

Benchmark results for reorderase v1.00 under GCC 11.2 x64 on an Intel i7 9750 are below. The benchmark test code can be downloaded above. Benchmark methodology is the same as those on the main benchmarks page. The measurements are done on a logarithmic scale, hence the averages presented in the text are skewed toward lower numbers of elements, since there are more samples of those. Higher numbers of elements consistently show the strongest performance improvements from reorderase, so this skews against reorderase.

Singular erasures comparison

To start we'll analyse singular element erasures. Here we erase 25% of all elements in a container while iterating.

Click images or hover over to see results at linear scale instead

Singular erasures of ints results in a skewed average of reorderase being 104014% faster than vector.erase, with a maximum of 1165378% faster for 870000 elements. First measurement being a factor of ten faster than vector.erase is at 3000 elements. Minimum speed increase is 10% at 10 elements.

Singular erasures of doubles result in a skewed average of reorderase being 197548% faster than vector.erase, with a maximum of 2098764% faster for 920000 elements. First measurement being a factor of ten faster than vector.erase is at 2000 elements. Minimum speed increase is 15% at 10 elements.

Singular erasures of small structs result in a skewed average of reorderase being 1255606% faster than vector.erase, with a maximum of 12990540% faster for 920000 elements. First measurement being a factor of ten faster than vector.erase is at 300 elements. Minimum speed increase is 52% at 10 elements.

Singular erasures of doubles result in a skewed average of reorderase being 2485098% faster than vector.erase, with a maximum of 25974010% faster for 920000 elements. First measurement being being a factor of ten faster than vector.erase is at 55 elements. Minimum speed increase is 135% at 10 elements.

The skewed average across all types and numbers of elements is 2375914% for reorderase.

Range erasures comparison

Here we erase 25% of all elements via erasing random ranges of elements, with a maximum random range of the square root of the number of elements.

Click images or hover over to see results at linear scale instead

Range erasures of ints result in a skewed average of reorderase being 10826% faster than vector.erase, with a maximum of 92351% faster for 920000 elements. First measurement being being a factor of ten faster than vector.erase is at 13957 elements. Minimum speed increase is 20% at 10 elements.

Singular erasures of doubles result in a skewed average of reorderase being 15327% faster than vector.erase, with a maximum of 90208% faster for 920000 elements. First measurement being being a factor of ten faster than vector.erase is at 6500 elements. Minimum speed increase is 49% at 10 elements.

Singular erasures of small structs result in a skewed average of reorderase being 15315% faster than vector.erase, with a maximum of 63566% faster for 920000 elements. First measurement being being a factor of ten faster than vector.erase is at 1050 elements. Minimum speed increase is 98% at 10 elements.

Singular erasures of doubles result in a skewed average of reorderase being 12128% faster than vector.erase, with a maximum of 74438% faster for 920000 elements. First measurement being being a factor of ten faster than vector.erase is at 450 elements. Minimum speed increase is 117% at 10 elements.

The skewed average across all types and numbers of elements is 13624% faster for reorderase.

std::erase_if comparison

Here we erase 25% of all elements via std::erase_if, the reorderase equivalent being reorderase_all_if.

Click images or hover over to see results at linear scale instead

erase_if of ints results in a skewed average of reorderase_all_if being 7% faster than std::erase_if, with a lot of variability. Numbers of elements under 19 perform marginally (up to 2%) better with std::erase_if, probably due to the low amount of element movement and greater simplicity of the erase_if algorithm.

erase_if of doubles results in a skewed average of reorderase_all_if being 7% slower than std::erase_if, with a lot of variability. Uncertain why doubles perform worse than int's here, but the finding is consistent across multiple machines. Not worth it here.

erase_if of small structs results in a skewed average of reorderase_all_if being 22% faster than std::erase_if, with a minimum of 7% and a maximum of 52%.

erase_if of large structs results in a skewed average of reorderase_all_if being 154% faster than std::erase_if, with a minimum of 50% and a maximum of 282%.

The skewed average across all types and numbers of elements is 43% faster for reorderase, though this is heavily weighted by the large struct results.

Version history

2024-04-10: v1.1.2 - Change to vcpkg-compatible versioning. plf::make_move_iterator changed to be static. plf::equal_to uses references now and is entirely noexcept.
2023-10-11: v1.11 - Introduced sub-range based overloads for reorderase_all_if/reorderase_all (currently bidirectional and above containers only). Re-wrote all reorderase_all/reorderase_all_if overloads to return count of elements erased to match return type on std::erase_if/std::erase.
2023-09-04: v1.10 - Implemented my std::partition equivalent + plf::destructive_partition, and implemented reorderase_all_if/reorderase_all in terms of the shared sub-function.
2023-08-10: v1.07 - Update to address shared tools.
2023-07-30: v1.06 - Further optimisation to range reorderase with deque. memmove changed to memcpy as areas don't overlap.
2023-07-23: v1.05 - Correction to type trait semantics.
2023-07-09: v1.04 - Correction to deque support. If container::value_type isn't nothrow_move_assignable or nothrow_copy_assignable, reorderase(first, last) now uses container's own allocator for the temp buffer instead of default std::allocator.
2023-07-04: v1.03 - Minor corrections and tidy-ups. Remove unnecessary conversions in range. Disable annoying MSVC warning.
2023-07-04: v1.02 - Return values now included for reorderase (not for reorderase_all or reorderase_all_if) since optimisations for deque mean that elements are simply erased if they're at the front of the container, instead of moving elements - therefore the user can't rely on using position or first as the next iterator position. No performance change above 0.1% for vector benchmarks, surprisingly.
2023-07-03: v1.01 - Fix for using reorderase(first, last) with std::deque. Generic fix for random_access non-contiguous containers (C++20 and above only). Performance improvements when erasing from back of containers (and from front of containers for deque) - 2% increase in overall benchmark performance.
2023-07-01: v1.00 - First release.