plf::list is a drop-in higher-performance replacement for std::list, with (on average) *:
Like std::list, plf::list maintains stable pointer/iterator links to list elements regardless of erasure, insertion, merging, splicing, reversing and sorting. Its functionality replicates std::list with two minor exceptions: the removal of single-element and ranged splices when the source is not *this (splicing whole lists from source into *this is still available), and the inclusion of several useful functions: reserve(), shrink_to_fit(), trim_capacity(), memory(), and overloads for single-element and ranged splice without a source list argument (the source must always be *this). Non-member function overloads for std::erase, std::erase_if and std::swap are also implemented.
It achieves its speed via the following:
The first increases insertion and erasure speed since individual per-element
allocations and deallocations are no longer necessary, and also increases the
speed of iteration due to increased element contiguousness in memory.
The second re-uses erased element memory locations, but in a way which promotes
element contiguousness (positioning the inserted element as close to the
insertion position as possible in memory). Several other methods were trialed,
but this had the best balance of reinsertion speed and subsequent iteration
speed.
The third is a simple method which allows any linked list to use
quicksort-style internal sorting rather than the slower mergesort-based
internal sorting typically utilized, but without invalidating
pointers/iterators to elements and without the disadvantage of slower sort
speed for larger-than-scalar types typically evidenced with a quicksort
method.
* These results are not achievable with std::list + an allocator for the reasons described here. Results are from benchmarks performed on a Haswell-based CPU under GCC 8.1. Average percentages are obtained by testing with numbers of elements ranging from 10 to 1000000 and, for the use-case, insertion, erasure and iteration benchmarks, with five different type sizes ranging from char to a very large struct.
Linked lists have many useful characteristics: they splice together without losing pointer validity to contained elements, they sort quickly for large types, and their erasure speed is very good. Unfortunately the base implementation of most linked lists disallows decent performance for insertion - specifically, most linked lists allocate per-node, which in the modern age is particularly slow. The C++ std::list specification effectively disallows clustered allocations by requiring support for partial list splicing - ie. if a block of nodes is allocated instead of one node, the block cannot be split between lists without invalidating pointers to nodes within the block. This inability to allocate multiple nodes at once has a negative effect on insertion, iteration and erasure due to cache effects and the increased number of allocation/deallocation calls.
plf::list is a re-think from the ground up of how to structure a linked list around modern hardware. This means grouped allocations, re-use of memory and sort modification. The result is not something which can compete with vector for iteration performance, but which easily beats it for non-back insertion and erasure speeds, and increases its competitiveness in aspects like sorting. It is designed as a drop-in replacement for std::list, as well as a demonstration of how lists can perform outside of the C++ specification's requirement of allowing partial-list splicing.
plf::list is designed for better push_back() performance than push_front() performance, in keeping with the most typical container usage patterns.
As detailed earlier, extensive testing showed block sizes of 2048 elements to be optimal across types for balancing the reinsertion mechanism. What this means is that, for any given block, the free list may contain element locations in any area of that block, and the location popped off the top of the list to be used for reinsertion may not be the location closest to the insertion position (out of all locations in the free list). If we decrease the block size, we increase the probability of any free list location being closer to the insertion position. However, making the block size too low results in lowered iteration speeds, as fewer elements are contiguous.
Other methods were tested for the reinsertion mechanism, with the dual aim of quick reinsertion and finding a reinsertion location as close to the insertion position as possible. In no particular order they are:
As mentioned, the latter is the chosen method - however there were several sub-methods of this which did not get selected. The first was to process the selected free list to find the location within that free list closest to the insertion position. This did not result in a net gain when benchmarked with subsequent iterations balanced against the additional reinsertion cost. The second sub-method was searching from the last memory block in the list to the first sequentially, instead of radiating outwards from the insertion position's block - again this was slower, due to the likelihood of finding a location further away in memory than with the radiant pattern.
The reason why the radiant pattern works best is because in terms of iteration, if you find a location in as close a memory block as possible, you increase the overall statistical likelihood that the linked list sequence will also follow the sequence of the memory blocks in memory (provided the blocks are allocated close to each other, this will affect cache locality).
Normally it is impossible for a container with a bidirectional iterator (such as a list) to utilize std::sort, as it requires a random-access iterator. This method allows any container to use std::sort (or any sort technique that requires random-access iterators). It has a memory cost of N pointers over a normal list sort - one pointer for each element in the container. Each element has its node's memory address inserted into an array - the order of the insertion is not important. Hence, while with a traditional linked list the list must be iterated over to add the addresses to the array, with plf::list the nodes of all memory blocks can be processed sequentially in order, speeding up the process of filling the array.
Once this array of N pointers is filled with the addresses of all active nodes in the list, we then use a templated dereferencing functor in order to sort these pointers based on the value of the elements within the nodes that they point to:
template <class comparison_function>
struct sort_dereferencer
{
	comparison_function stored_instance;

	sort_dereferencer(const comparison_function &function_instance):
		stored_instance(function_instance)
	{}

	sort_dereferencer()
	{}

	bool operator() (const node_pointer_type first, const node_pointer_type second) const
	{
		return stored_instance(first->element, second->element);
	}
};
This function object takes whatever sort function is supplied to the sort operation, and uses it to sort the pointers instead of the elements. Since an array is a linear structure and can hence utilize a random-access iterator, we can now sort the array of pointers using std::sort (or any technique), the function object above and whichever sort function is supplied.
Once the pointers are sorted according to the sort function and the elements themselves, the final process is simple:
The resultant sort process is roughly 72% faster than std::list's sort, when averaged across all datatypes and numbers of elements. Unlike mergesort, it is not stable (in this context, 'stable' means that two elements with the same sort value retain their original relative order once sorted), because std::sort is not guaranteed to be stable; std::stable_sort can be used instead if this is important.
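The whole technique can be illustrated with a simplified, self-contained sketch. Note that the `node` struct, `dereference_less` comparator and `pointer_sort` function below are illustrative stand-ins for demonstration, not plf::list's actual internals:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified stand-in for a doubly-linked list node - not plf::list's actual node type:
struct node
{
	node *next;
	node *previous;
	int element;
};

// Comparator which sorts node pointers by the elements they point to,
// mirroring the role of the sort_dereferencer functor above:
struct dereference_less
{
	bool operator() (const node *first, const node *second) const
	{
		return first->element < second->element;
	}
};

// Gather node addresses into an array, sort the pointers with std::sort,
// then rewrite the next/previous links to follow the sorted pointer order.
// Returns the new head node:
node * pointer_sort(std::vector<node> &nodes)
{
	if (nodes.empty()) return nullptr;

	std::vector<node *> pointers;
	pointers.reserve(nodes.size());

	for (node &n : nodes) pointers.push_back(&n); // gathering order is unimportant

	std::sort(pointers.begin(), pointers.end(), dereference_less());

	for (std::size_t i = 0; i != pointers.size(); ++i)
	{
		pointers[i]->previous = (i == 0) ? nullptr : pointers[i - 1];
		pointers[i]->next = (i + 1 == pointers.size()) ? nullptr : pointers[i + 1];
	}

	return pointers.front();
}
```

The elements themselves never move in memory - only the next/previous pointers are rewritten - which is why existing pointers and iterators to elements remain valid after sorting.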
The internal structure of plf::list is as follows: a custom vector holds the individual memory blocks that the elements are stored within, plus each memory block's metadata (the combination of memory block and metadata is termed a 'group'). Part of the metadata is each block's 'free list' of erased memory locations. When a memory block becomes empty of elements it is either freed to the OS or moved to the back of the vector for subsequent re-use. The implementation defines the protocol for deciding whether or not to free the block; in my implementation the decision is based on whether or not the memory block is of maximum size, and also whether it is close to the back of the vector.
When an insertion occurs, if there are no erased locations the new element is pushed to the next available element location in the last memory block and then linked into insertion position. Otherwise the free list of the memory block which the insertion position is within is checked - if empty, we check each memory block to the left and right in sequence until a non-empty free list is found. The selected non-empty free list has its top element popped off and re-used.
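The outward search can be sketched as follows. This is a hypothetical model in which each block's free list is reduced to a simple count of its erased locations - the `free_counts` vector, the `nearest_block_with_erasures` function name and the -1 'no block found' convention are all illustrative, not part of plf::list:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: free_counts[i] stands in for the number of erased
// locations in memory block i. Starting from the block containing the
// insertion position, radiate outwards, checking nearer blocks first
// (left before right at equal distance). Returns the index of the nearest
// block with erased locations, or -1 if there are none:
std::ptrdiff_t nearest_block_with_erasures(const std::vector<std::size_t> &free_counts,
	std::size_t start)
{
	const std::ptrdiff_t size = static_cast<std::ptrdiff_t>(free_counts.size());

	for (std::ptrdiff_t distance = 0; distance != size; ++distance)
	{
		const std::ptrdiff_t left = static_cast<std::ptrdiff_t>(start) - distance;
		const std::ptrdiff_t right = static_cast<std::ptrdiff_t>(start) + distance;

		if (left >= 0 && free_counts[static_cast<std::size_t>(left)] != 0) return left;
		if (right < size && free_counts[static_cast<std::size_t>(right)] != 0) return right;
	}

	return -1; // no erased locations in any block - push to the back block instead
}
```

Because the search radiates outwards, the re-used location is in as close a memory block to the insertion position as possible, which supports the cache-locality reasoning described earlier.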
To avoid double-destruction of elements when clearing or destroying a plf::list, element nodes which are added to their memory block's free list have their 'next' pointer set to NULL - since free lists only utilize the previous pointer, and non-freelist node pointers cannot be == NULL, this has no effect on operation. When a plf::list is cleared or destroyed, instead of iterating over the list normally, we can more quickly iterate directly over the memory blocks themselves, destroying any element of any node where the next pointer is != NULL. The reason this is faster is because in this method, each subsequent node is contiguous in memory while iterating, compared to a regular list traversal which can jump all over the place.
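The fast clear/destroy path can be sketched like so. Again, the `node` struct and helper functions are hypothetical stand-ins, and for demonstration purposes the sketch counts live elements rather than destructing them:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in node - not plf::list's actual node type:
struct node
{
	node *next;
	node *previous;
	int element;
};

// Walk each memory block linearly (contiguous in memory, unlike a regular
// list traversal) and process only live nodes - erased nodes on the free
// list are identifiable by next == nullptr:
std::size_t count_live_nodes(const std::vector<std::vector<node> > &blocks)
{
	std::size_t total = 0;

	for (const std::vector<node> &block : blocks)
	{
		for (const node &n : block)
		{
			if (n.next != nullptr) ++total; // a live node - destruct here in a real clear
		}
	}

	return total;
}

// Hypothetical demo: three live nodes spread across two blocks, plus two
// erased (free-listed) nodes whose next pointers are null:
std::size_t demo()
{
	static node sentinel = {nullptr, nullptr, 0}; // any non-null next value marks a live node

	std::vector<std::vector<node> > blocks(2);
	blocks[0] = { {&sentinel, nullptr, 1}, {nullptr, nullptr, 0}, {&sentinel, nullptr, 2} };
	blocks[1] = { {nullptr, nullptr, 0}, {&sentinel, nullptr, 3} };

	return count_live_nodes(blocks);
}
```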
plf::list is under a permissive zlib license. This means: Software is used on 'as-is' basis. It can be modified and used in commercial software. Authors are not liable for any damages arising from its use. The distribution of a modified version of the software is subject to the following restrictions:
Download here or view
the repository
The list library is a simple .h header file, to be included via an #include directive.
In addition if you are interested in benchmarking you can also download the plf benchmark suite.
plf::list meets the technical specifications and has the same functions and iterator functions as std::list, with the exception of the removal of range or single-element splices between distinct plf::lists. Only full-list (ie. all-element) splices between plf::lists are allowed. Range and single-element splices within a plf::list (ie. source == *this) are allowed and are detailed below. In addition, plf::list does not guarantee that contiguous and equal elements will be in the same order after sorting, as std::list does (ie. the sort style is not a 'stable'-type). This is due to the way that plf::list::sort() gathers element pointers in the 'gather' phase of the algorithm (see above).
To specify an alternative sort technique, #define PLF_SORT_FUNCTION as the name of the sort function you wish to use. The sort function in question must take the same parameters as std::sort - for example, #define PLF_SORT_FUNCTION std::stable_sort. This macro must be defined before plf_list.h is #include'd.
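For example, a hypothetical translation unit demonstrating the ordering requirement (plf_list.h is assumed to be available on the include path):

```cpp
// The macro must appear before the header is included:
#define PLF_SORT_FUNCTION std::stable_sort
#include <algorithm>   // supplies std::stable_sort
#include "plf_list.h"  // plf::list::sort() will now call std::stable_sort internally
```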
Additional functions specific to plf::list follow:
void trim_capacity() noexcept
Removes any trailing groups within the container substructure, which may
be present in the scenario of many erasures. Unlike a regular
shrink_to_fit, does not shrink capacity to the size needed to contain the
currently-stored elements, simply frees up unused groups. Runtime cost is
significantly lower than shrink_to_fit, but may not free as much memory.
Example: i_list.trim_capacity();
void shrink_to_fit()
Reduces container capacity to the amount necessary to store all
currently stored elements, consolidates elements and removes any erased
locations. Invalidates all pointers, iterators and references to elements
within the container.
Example: i_list.shrink_to_fit();
void reserve(size_type reserve_amount)
Preallocates memory space sufficient to store the number of elements
indicated by reserve_amount
. If possible the algorithm will
re-use existing empty memory blocks, so long as they are large enough. This
function is necessary due to the growth factor of plf::list and the effect
of many needless smaller memory block allocations on performance.
Example: i_list.reserve(15);
size_type capacity() const noexcept
Returns total number of elements currently able to be stored in
container without expansion.
Example: std::cout << i_list.capacity()
<< std::endl;
void splice(const_iterator position, const_iterator location) noexcept
Equivalent to std::list's single-element splice operation when the source is *this. Reorders the list so that the element at 'location' in the list is
before 'position'. O(1) time complexity operation with no copying or moving
(only previous and next pointers are altered). Good for shifting locations
of large elements, non-trivially-copyable elements, and
non-copyable/movable elements. If position and location are the same, no
operation will occur. If the source of location is not *this, behaviour is undefined.
Example: plf::list<int> list1 = {1, 2, 3,
4, 5};
plf::list<int>::iterator it = --(list1.end());
// Shift 5 in front of 1:
list1.splice(list1.begin(), it);
void splice(const_iterator position, const_iterator first, const_iterator
last) noexcept
Equivalent to std::list's range splice operation if the source of the range is *this. Same as single splice but with a range of first->(last - 1)
ie. last points to the element one-past the back element of the range, not the back element itself. Also O(1). If first is later
than last in the list's iterative sequence, behaviour is undefined. If
position is within the range of first->last, behaviour is
undefined. If range source is not *this, behaviour is undefined.
Example: plf::list<int> list1 = {1, 2, 3,
4, 5};
plf::list<int>::iterator it1 = list1.begin();
plf::list<int>::iterator it2 = std::next(list1.begin(), 2);
plf::list<int>::iterator it3 = list1.end();
// Shift 1, 2 to after 5:
list1.splice(it3, it1, it2);
template <class predicate_function>
iterator unordered_find_single(predicate_function predicate) const noexcept
iterator unordered_find_single(value_type element_to_match) const noexcept
Searches across the entire list in an unordered fashion (iterating
linearly in memory over the node blocks rather than following the linked
list sequence - much faster than std::find()) and returns an iterator to
the first node which matches the supplied element. Will return end() if the
element isn't found.
Example: plf::list<int> list1 = {1, 2, 3,
4, 5};
plf::list<int>::iterator it1 = list1.unordered_find_single(3);
std::cout << *it1;
template <class predicate_function>
plf::list<plf::list<value_type>::iterator>
unordered_find_multiple(predicate_function predicate, size_type
number_of_elements_to_find) const noexcept
plf::list<plf::list<value_type>::iterator>
unordered_find_multiple(value_type element_to_match, size_type
number_of_elements_to_find) const noexcept
Searches across the entire list in an unordered fashion for the supplied
element, collecting iterators to matching elements, and returns a plf::list
of iterators pointing to all matching nodes once either (a) the specified
number of elements is found or (b) the end of the list is reached. Much
faster than std::find().
Example: plf::list<int> list1 = {1, 2, 3,
4, 5, 5, 4, 3, 2, 1, 3, 5, 6, 1, 3};
plf::list<plf::list<int>::iterator> nodes_found =
list1.unordered_find_multiple(1, 2);
std::cout << *(nodes_found.begin());
template <class predicate_function>
plf::list<plf::list<value_type>::iterator>
unordered_find_all(predicate_function predicate) const noexcept
plf::list<plf::list<value_type>::iterator>
unordered_find_all(value_type element_to_match) const noexcept
Searches across the entire list in an unordered fashion for the supplied
element and returns a plf::list of iterators pointing to all nodes which
match the element. Much faster than std::find().
Example: plf::list<int> list1 = {1, 2, 3,
4, 5, 5, 4, 3, 2, 1};
plf::list<plf::list<int>::iterator> nodes_found =
list1.unordered_find_all(3);
std::cout << *(nodes_found.begin());
Benchmark results for plf::list v1.85 under GCC 9.2 x64 on an Intel Xeon
E3-1241 (Haswell) are available here.
Some results for an early version of plf::list under GCC 5.1 x64 on an Intel
E8500 (Core2) are here.
There are two parts to the answer to that question, but both say 'no'.
Part 1 involves the standard operations: insert, erase, iteration. In order for a std::list + allocator to maintain the 16% iteration performance increase that plf::list has (which is the result of more contiguous storage), after multiple non-back/front erasures and insertions, the actual memory location of an inserted element must be as close as possible to the element pointed to by 'it' in 'list.insert(it, value);'. This means the allocator has to be able to take memory hints for allocating new elements, but also the linked list itself must supply these hints. Given that no current std::list implementations (that I am aware of) do this, this will not occur. However even if it did occur, most allocators use global free-lists or other less-efficient techniques for determining the closest insertion point for a newly-allocated element, so you would not gain the same post-erase insert performance unless you used the same (or better) techniques as plf::list for managing memory and figuring out optimal insertion locations.
Part 2 involves the less-used operations: reverse, clear, find_if, find, sort, and list destruction. These all utilize plf::list's contiguous node chunks to linearly iterate across nodes regardless of order for part or all of their function - as opposed to following the linked list's iterative sequence. What this means is the cache-friendliness and prefetch-friendliness of these operations is greater than a std::list + allocator can be, because a std::list cannot iterate linearly across its 'owned' memory space in the allocator. In addition, even if you somehow managed to get a std::list implementation talking to an allocator well enough that all of this occurred, you would lose cycles due to the interaction. eg. having to use an iterator supplied by the allocator, along with whatever overhead was incurred by the allocator ensuring that unused memory spaces are skipped (jump-counting skipfield, for example).
Only one, and that is that partial (range) splicing between std::lists using the same allocator can occur without creating problems. There are no performance advantages, that I am aware of.
Here is a thread-safety matrix for a non-concurrent implementation of std::vector, showing which basic operations can safely occur at the same time:
std::vector | Insertion | Erasure | Iteration | Read |
Insertion | No | No | No | No |
Erasure | No | No | No | No |
Iteration | No | No | Yes | Yes |
Read | No | No | Yes | Yes |
Here is plf::list's matrix:
plf::list | Insertion | Erasure | Iteration | Read |
Insertion | No | No | Mostly** | Yes |
Erasure | No | No | Mostly*** | Mostly* |
Iteration | Mostly** | Mostly*** | Yes | Yes |
Read | Yes | Mostly* | Yes | Yes |
* Erasures will not invalidate iterators
unless the iterator points to the erased element.
** Iteration will not be affected by insertion unless the element being
iterated to/from is the element which is being inserted in front of.
*** Iteration will not be affected by erasure unless the element being
iterated to/from is the element being erased.
Contact:
plf:: library and this page Copyright (c) 2024, Matthew Bentley