This work was done as part of making the new Points to Curves geometry node for Blender.
# Abstract
An array of random values should be split into sequential sub-arrays that can be iterated over and passed to other algorithms. The index of the resulting sub-array is defined by the value of the element in the source array. This is grouping of elements: each group is a set of equal values taken from different positions in the source array. Instead of the element value, or a pointer to it, the index/position in the source array is stored.
To make things simpler, the random values in the source array are deduplicated and compressed, so they lie in a contiguous range from 0 up to the number of groups, without gaps. This can be done with hash maps or in other ways; it is not the main problem of this paper.
The size of each group is unknown at the beginning. The source array values are distributed randomly, and their positions do not depend on each other.
To make things simpler, the number of groups is assumed to be found at the start by some other algorithm as well.
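A tiny made-up example of the expected input and output, using the offset-based layout that the rest of the paper builds towards (the numbers are illustrative only):
```cpp
/* Illustrative only, not part of the benchmark.
 * Source array: the value of each element is its group index. */
const int group_indices[] = {2, 0, 2, 1, 0};
/* Expected result: source positions grouped by value and laid out sequentially. */
const int grouped_positions[] = {1, 4, 3, 0, 2}; /* Group 0, then group 1, then group 2. */
const int group_offsets[] = {0, 2, 3, 5};        /* Span of group g: [offsets[g], offsets[g + 1]). */
```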
Topology-related applications in geometry:
1. Access to any element of topology domain A that is connected to any element of domain B. A and B can have multiple unique relations. Group sizes can be defined or limited by some topological rule.
2. Combining user data into new sets. Group sizes cannot be determined in constant time.
This paper presents a set of algorithms tested with different numbers and sizes of groups on a large amount of data.
# Dataset
## Node tree designed to test all algorithms on different sizes of randomly generated groups

Everything is inside a repeat zone so the test is iterated 50 times. Each node within the group uses the previously created geometry. Allocations do not reuse the memory of the previous iteration, and the nodes only run sequentially.
## Number of groups
The groups come from a source array filled with random numbers in a range from 1 to:
| № | Number of groups + 1 | Average group size |
| - | - | - |
| 1 | 1 | 5.000.000 |
| 2 | 5 | 2.000.000 |
| 3 | 10 | 1.000.000 |
| 4 | 50 | 200.000 |
| 5 | 100 | 100.000 |
| 6 | 500 | 20.000 |
| 7 | 1.000 | 10.000 |
| 8 | 5.000 | 2.000 |
| 9 | 10.000 | 1.000 |
| 10 | 50.000 | 200 |
| 11 | 100.000 | 100 |
| 12 | 500.000 | 20 |
| 13 | 1.000.000 | 10 |
| 14 | 5.000.000 | 2 |
| 15 | 10.000.000 | 1 |
The node itself additionally deduplicates the randomly generated group values, so if there were gaps, the indices are shifted to remove them.
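The node tree itself cannot be reproduced here; as a rough stand-in for what it produces, the test input could be generated like this (plain `std::mt19937` is used only for illustration, it is not what the node tree does):
```cpp
#include <random>

/* Illustrative only: fill the source array with random group values in [1, max_value]. */
static void fill_random_groups(const int max_value, MutableSpan<int> group_indices)
{
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> distribution(1, max_value);
  for (int &value : group_indices) {
    value = distribution(rng);
  }
}
```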
## Points amount
**10.000.000**.
## Benchmark CPU
The benchmark was run on a Ryzen 7 1700 (16 threads).
# Google Sheets diagram of timings
[Group algorithms benchmark](https://docs.google.com/spreadsheets/d/1YEX6KRk4mTFfJZmpmEyqD7f3ztlynO0dzQhMk8OqpUQ/edit?usp=sharing)
# Task
The simplest way to solve this problem is to use the default containers.
## Just appending to vectors without reserving
```cpp
static Array<Vector<int>> groups_in_vectors(const int group_total,
                                            const Span<int> group_indices)
{
  Array<Vector<int>> all(group_total);
  for (const int i : group_indices.index_range()) {
    const int group = group_indices[i];
    all[group].append(i);
  }
  return all;
}
```
The `groups_in_vectors` function builds an array of vectors. Each vector is a group and contains indices into the source array. The group value is used as an index to select the vector to append to.
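For context, a hypothetical consumer of this result might look like the sketch below (it only illustrates the layout and is not part of the benchmark):
```cpp
/* Illustrative only: each vector holds the source-array positions of one group. */
static int64_t total_grouped_indices(const Span<Vector<int>> groups)
{
  int64_t total = 0;
  for (const Vector<int> &group : groups) {
    total += group.size();
  }
  return total;
}
```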

1. A few large groups give good algorithm speed; the 0-part of the x-axis looks interesting.
2. The 10.000.000 part of the diagram can be misleading. Vectors can have an embedded (inline) buffer, which is not bad for groups whose sizes fit into this buffer. But plot interpolation and measurement inaccuracy distort this effect a lot; in reality, most of the curves still have to grow upwards.
Negative sides of the approach:
1. May increase memory fragmentation and reduce overall performance for a lot of small groups.
2. A lot of allocation and deallocation of memory in the algorithm body.
3. Cannot be parallelized.
4. Speed depends on the number of groups: the fewer groups there are, the better the cache behaves for the reallocate-and-copy of a vector.
5. Bad CPU cache behavior when accessing elements of a random group in the result.
# Splitting into offset accumulation and index writing stages
The size of each group has to be deduced so that the indices of each group can be placed into their own sequential memory span.
Knowing the sizes of all groups means the offsets into the result array can be known too; this is the accumulation of offsets. An offset is the start of the current span of memory. Via index math, the start of the next span can be sampled too, and the distance between both starts is the size of the current span. Both values live in one sequential array, so sampling both has good cache behavior.
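A small sketch of this offset math with made-up sizes (the helper below is illustrative, not part of the benchmark code): group sizes `{3, 1, 2}` accumulate, with the extra cap element, to offsets `{0, 3, 4, 6}`.
```cpp
/* Illustrative only: the span of group `group` is [offsets[group], offsets[group + 1]). */
static IndexRange group_range(const Span<int> offsets, const int group)
{
  const int start = offsets[group];
  const int size = offsets[group + 1] - start;
  return IndexRange(start, size);
}
```
This span lookup is what the `OffsetIndices` wrapper used below provides.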
## Offsets computing
```cpp
void count_indices(const Span<int> indices, MutableSpan<int> counts)
{
  threading::parallel_for(indices.index_range(), 4096, [&](const IndexRange range) {
    for (const int i : indices.slice(range)) {
      atomic_add_and_fetch_int32(&counts[i], 1);
    }
  });
}
```
The `count_indices` function calculates the size of each group. The integer increments can be done atomically across the parallel loops.
```cpp
OffsetIndices<int> accumulate_counts_to_offsets(MutableSpan<int> counts_to_offsets,
                                                const int start_offset = 0)
{
  int offset = start_offset;
  for (const int i : counts_to_offsets.index_range().drop_back(1)) {
    const int count = counts_to_offsets[i];
    counts_to_offsets[i] = offset;
    offset += count;
  }
  counts_to_offsets.last() = offset;
  return OffsetIndices<int>(counts_to_offsets);
}
```
The number of groups alone cannot be used as the size of the offsets array. To store one extra last element as a cap, which is needed to find the size of the last real group, the array should be allocated with the number of groups + 1 elements. In the `accumulate_counts_to_offsets` function, this last extra element is handled separately.
```cpp
void build_reverse_offsets(const Span<int> indices, MutableSpan<int> offsets)
{
  count_indices(indices, offsets);
  accumulate_counts_to_offsets(offsets);
}
```
The final offsets array is produced by `build_reverse_offsets`.
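A hedged usage sketch of these helpers (the allocation size follows the +1 cap described above; the helper itself is illustrative only):
```cpp
/* Illustrative only: build offsets for `group_total` groups and read the size of group 0. */
static int size_of_first_group(const int group_total, const Span<int> group_indices)
{
  Array<int> offsets_data(group_total + 1, 0); /* Counts start at zero, plus the cap element. */
  build_reverse_offsets(group_indices, offsets_data);
  const OffsetIndices<int> offsets(offsets_data);
  return int(offsets[0].size());
}
```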

Benchmark results of the `build_reverse_offsets` test. The diagram shows two trends:
1. `count_indices` depends on the sizes of the groups. If multiple threads atomically write to the same memory, it becomes slow. If the groups are not too big, but still randomly distributed in memory, it is still bad, now because of the cache. This causes the bad results in the 0-part of the x-axis of the diagram.
2. `accumulate_counts_to_offsets` depends on the number of groups. This is a linear dependency with a small factor. The bad results in the 10.000.000-part of the x-axis are caused by it.
## Counting source array elements to get group sizes
The main target for optimization is the `count_indices` function.
### Groups value sorting
The problem that seems to be the most important: **a lot of writes to random memory**.
It causes slow performance due to both cache misses and contention between threads.
Solution: do not write randomly, but grouped. The most general and ideal way to group is sorting.
Sorting is a way to order items based on a rule, so it can be used to order the groups.
```cpp
static void count_indices_sorted(const Span<int> indices, MutableSpan<int> counts)
{
  Array<int> indices_to(indices.size());
  std::iota(indices_to.begin(), indices_to.end(), 0);
  parallel_sort(indices_to.begin(), indices_to.end(), [&](const int a, const int b) {
    return indices[a] < indices[b];
  });
  for (const int index : indices_to) {
    counts[indices[index]]++;
  }
}
```
The `count_indices_sorted` function does not use multithreading outside of the sorting. It tests the performance gained by sorting alone.
Sorting can be expensive, depending on the size and order of the data.

The best result with sorting is comparable to the worst result of random writing. It makes no sense to analyse the group-amount dependency for such a simple variant.
```cpp
static void count_indices_sorted_parallel(const Span<int> indices, MutableSpan<int> counts)
{
  Array<int> indices_to(indices.size());
  std::iota(indices_to.begin(), indices_to.end(), 0);
  parallel_sort(indices_to.begin(), indices_to.end(), [&](const int a, const int b) {
    return indices[a] < indices[b];
  });
  threading::parallel_for(indices_to.index_range(), 4096, [&](const IndexRange range) {
    for (const int index : indices_to.as_span().slice(range)) {
      atomic_add_and_fetch_int32(&counts[indices[index]], 1);
    }
  });
}
```
The only difference between `count_indices_sorted_parallel` and `count_indices_sorted` is that the writing is now multithreaded.

1. Sorting had the smallest contribution to the time cost, but adding the multithreaded write changes that.
2. The sort makes the dependence on the number of groups extremely weak. There are fluctuations, but they are very small (compared to the 10x differences in the previous charts).
The result is still worse than the original. It is more stable, but worse.
```cpp
static void count_indices_parallel_segments(const Span<int> indices, MutableSpan<int> counts)
{
  Array<int> indices_to(indices.size());
  std::iota(indices_to.begin(), indices_to.end(), 0);
  threading::parallel_for(indices_to.index_range(), 1024, [&](const IndexRange range) {
    /* Sort only the local segment, so equal group values become contiguous runs. */
    MutableSpan<int> r_indices = indices_to.as_mutable_span().slice(range);
    parallel_sort(r_indices.begin(), r_indices.end(), [&](const int a, const int b) {
      return indices[a] < indices[b];
    });
    int start = 0;
    while (true) {
      const Span<int> local = r_indices.drop_front(start);
      if (local.is_empty()) {
        break;
      }
      /* Find the end of the run of the current group and add its length in one atomic op. */
      const int group = indices[local.first()];
      const int *first_other = std::find_if(
          local.begin(), local.end(), [&](const int index) { return indices[index] != group; });
      const int current_size = int(std::distance(local.begin(), first_other));
      start += current_size;
      atomic_add_and_fetch_int32(&counts[group], current_size);
    }
  });
}
```
1. Now the sorting is also part of the multithreading. This is good, since sorting is rarely linear in size, which means the smaller the parallel chunks of the sort, the greater the total speed.
2. The multithreaded write is now done once per group run in each segment. This is possible because the local sort turns each segment into sorted runs of groups.
Now, a very large number of groups is the main cost, because each thread still has to do many write operations.

### Results of offset accumulation

Ranking of the results:
1. `count_indices_parallel_segments`.
2. `count_indices`.
3. `count_indices_sorted_parallel`.
4. `count_indices_sorted`.
Both `count_indices_parallel_segments` and `count_indices` have the best results, but a different dependency on the number of groups.
The `count_indices_parallel_segments` function is also more stable; `count_indices` has a sharp tenfold drop.
If the group sizes are statically known, it is better to choose one of the functions explicitly.
If the groups are user dependent, it is best to use only `count_indices_parallel_segments`.
*Performance may vary on different hardware, and the dependency can shift; the error can still be too big.*
The best functions can be combined if the size of the groups is known in advance:
```cpp
static void count_indices_best(const Span<int> indices,
                               const int average_group_size,
                               MutableSpan<int> counts)
{
  if (average_group_size > 10000) {
    count_indices_parallel_segments(indices, counts);
  }
  else {
    count_indices(indices, counts);
  }
}
```

## Offsets-based indices writing
The offsets now exist. The start position in the result array can be sampled from the offsets using the group value from the source array.
### Writing indices and incrementing offsets at the same time in a simple loop
A supporting `counts` array is used as an array of iterators. The iterators are just ints: element `i` of the array is the iterator for group `i`. Each iterator is used to write every value into its unique position within the group span.
```cpp
static Array<int> accumulate_and_write(const OffsetIndices<int> offsets,
                                       const Span<int> group_indices)
{
  Array<int> counts(offsets.size(), 0);
  Array<int> indices(group_indices.size());
  for (const int64_t index : group_indices.index_range()) {
    const int group = group_indices[index];
    indices[offsets[group][counts[group]]] = int(index);
    counts[group]++;
  }
  return indices;
}
```

1. The cache is good for few groups, but this means an expensive dependency on the number of groups.
## Parallel indices writing
```cpp
static void make_stabil_order_in_groups(const OffsetIndices<int> offsets,
                                        MutableSpan<int> indices)
{
  threading::parallel_for(offsets.index_range(), 256, [&](const IndexRange range) {
    for (const int64_t index : range) {
      MutableSpan<int> group = indices.slice(offsets[index]);
      parallel_sort(group.begin(), group.end());
    }
  });
}
```
For the multithreaded algorithms, the `make_stabil_order_in_groups` function sorts the indices within each group to make the result order stable.
It is used in all the following algorithms.
### Simple parallel writing with array of iterators
```cpp
static Array<int> reverse_indices_in_small_groups(const OffsetIndices<int> offsets,
                                                  const Span<int> group_indices)
{
  Array<int> counts(offsets.size(), -1);
  Array<int> results(group_indices.size());
  threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
    for (const int64_t i : range) {
      const int group_index = group_indices[i];
      const int index_in_group = atomic_add_and_fetch_int32(&counts[group_index], 1);
      results[offsets[group_index][index_in_group]] = int(i);
    }
  });
  make_stabil_order_in_groups(offsets, results);
  return results;
}
```
The `reverse_indices_in_small_groups` function is the same as `accumulate_and_write`, but the indices are now written from a parallel loop.

1. Multiple threads atomically increment the same counter many times. This causes a huge performance drop for a few groups, the 0-part of the diagram.
### Simple parallel writing with array of iterators + rewriting of original values in main loop for better cache
```cpp
static Array<int> reverse_indices_in_groups(const OffsetIndices<int> offsets,
                                            MutableSpan<int> group_indices)
{
  Array<int> counts(offsets.size(), -1);
  threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
    for (int &group_index : group_indices.slice(range)) {
      const int index_in_group = atomic_add_and_fetch_int32(&counts[group_index], 1);
      /* Overwrite the source value with the destination position in the result array. */
      group_index = offsets[group_index][index_in_group];
    }
  });
  Array<int> results(group_indices.size());
  threading::parallel_for(results.index_range(), 2048, [&](const IndexRange range) {
    for (const int index : range) {
      results[group_indices[index]] = index;
    }
  });
  make_stabil_order_in_groups(offsets, results);
  return results;
}
```
`reverse_indices_in_groups` is the same as `reverse_indices_in_small_groups`. The main change is that the destination position in the result array is computed and stored in the main loop, which gives better cache behavior. All indices still have to be written to the final array, but the second parallel loop is much faster.

1. The result is better than `reverse_indices_in_small_groups`, but not by much.
2. Mutable source values are not a common case, so this can't be used as a general solution.
### Simple parallel writing with array of iterators + rewriting of original values in main loop for better cache + thread-local copy of source values
```cpp
static Array<int> reverse_indices_in_groups_copy(const OffsetIndices<int> offsets,
                                                 const Span<int> group_indices_src)
{
  Array<int> group_indices(group_indices_src.size());
  Array<int> counts(offsets.size(), -1);
  threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
    /* Copy the source values into a thread-local part of the temporary array. */
    MutableSpan<int> local_group_indices = group_indices.as_mutable_span().slice(range);
    const Span<int> local_group_indices_src = group_indices_src.slice(range);
    std::copy(local_group_indices_src.begin(),
              local_group_indices_src.end(),
              local_group_indices.begin());
    for (int &group_index : local_group_indices) {
      const int index_in_group = atomic_add_and_fetch_int32(&counts[group_index], 1);
      group_index = offsets[group_index][index_in_group];
    }
  });
  Array<int> results(group_indices.size());
  threading::parallel_for(results.index_range(), 2048, [&](const IndexRange range) {
    for (const int index : range) {
      results[group_indices[index]] = index;
    }
  });
  make_stabil_order_in_groups(offsets, results);
  return results;
}
```
This is a copy of `reverse_indices_in_groups`, but each thread has a local temporary span into which it copies the source values.

1. Copying the values is nearly free, improves cache locality for the next part of the code, and avoids dirtying the source data.
2. It is better to use this one as the general solution.
### Simple iota-sorting of groups
```cpp
static Array<int> gather_reverse(const OffsetIndices<int> offsets, const Span<int> group_indices)
{
  Array<int> results(group_indices.size());
  std::iota(results.begin(), results.end(), 0);
  const auto comparator = [&](const int a, const int b) {
    return group_indices[a] < group_indices[b];
  };
  parallel_sort(results.begin(), results.end(), comparator);
  make_stabil_order_in_groups(offsets, results);
  return results;
}
```
Group sorting! Again!
This is the easiest option, since index reordering is just sorting. Faster sorting algorithms could be considered, or one could be implemented that already knows about the offsets. *But that seems too complicated for now.*

1. An average result that fluctuates slightly, but it is not the best in any part of the overall standings.
### Parallel segment processing: iota-sorting and writing with a local array of iterators
```cpp
static Array<int> reverse_indices_in_groups_simple_sort(const OffsetIndices<int> offsets,
                                                        const Span<int> group_indices)
{
  Array<int> indices(group_indices.size());
  std::iota(indices.begin(), indices.end(), 0);
  Array<int> results(group_indices.size());
  Array<int> counts(offsets.size(), 0);
  threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
    MutableSpan<int> r_indices = indices.as_mutable_span().slice(range);
    parallel_sort(r_indices.begin(), r_indices.end(), [&](const int a, const int b) {
      return group_indices[a] < group_indices[b];
    });
    /* Thread-local iterators: count the local segment, then reserve a block in each group. */
    Array<int> thread_counts(offsets.size(), 0);
    for (const int i : r_indices) {
      thread_counts[group_indices[i]]++;
    }
    for (const int i : thread_counts.index_range()) {
      const int size = thread_counts[i];
      thread_counts[i] = atomic_add_and_fetch_int32(&counts[i], size) - size;
    }
    for (const int i : r_indices) {
      const int group_index = group_indices[i];
      const int index = thread_counts[group_index];
      thread_counts[group_index]++;
      results[offsets[group_index][index]] = i;
    }
  });
  make_stabil_order_in_groups(offsets, results);
  return results;
}
```
Combining sorting and the iterator arrays.

1. Looks good for large groups, but the result at the end is so bad that it needs a logarithmic scale...
If there are many groups, the thread-local iterator array becomes too large. This is the main problem: a high cost for the cache and redundant work.
But the direction is right! Everything should be fine if the thread-local array can be avoided.
### Parallel segment processing: iota-sorting and writing without local array of iterators
```cpp
static Array<int> reverse_indices_in_groups_complex_sort(const OffsetIndices<int> offsets,
                                                         const Span<int> group_indices)
{
  Array<int> indices(group_indices.size());
  std::iota(indices.begin(), indices.end(), 0);
  Array<int> results(group_indices.size());
  Array<int> counts(offsets.size(), 0);
  threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
    MutableSpan<int> r_indices = indices.as_mutable_span().slice(range);
    parallel_sort(r_indices.begin(), r_indices.end(), [&](const int a, const int b) {
      return group_indices[a] < group_indices[b];
    });
    int start = 0;
    while (true) {
      const Span<int> local = r_indices.drop_front(start);
      if (local.is_empty()) {
        break;
      }
      /* Find the run of the current group, reserve a block for it with one atomic add,
       * and copy the whole run into the result array at once. */
      const int group = group_indices[local.first()];
      const int *first_other = std::find_if(local.begin(), local.end(), [&](const int index) {
        return group_indices[index] != group;
      });
      const int current_size = int(std::distance(local.begin(), first_other));
      start += current_size;
      const int current_start = atomic_add_and_fetch_int32(&counts[group], current_size) -
                                current_size;
      const IndexRange final_range = offsets[group].slice(current_start, current_size);
      MutableSpan<int> dst = results.as_mutable_span().slice(final_range);
      const Span<int> src = local.take_front(current_size);
      std::copy(src.begin(), src.end(), dst.begin());
    }
  });
  make_stabil_order_in_groups(offsets, results);
  return results;
}
```
Read and write operations have different costs.
Operations on random memory are worse than operations on a sequential span.
Operations over a large memory range fall out of the cache easily.
For this reason, it is better to read memory several times in order and then write it randomly. This is how the use of a thread-local array is avoided.

1. The best stable result. Ideal as a general solution.
### Group indices writing results

1. `reverse_indices_in_groups_complex_sort`: a stable and good result.
2. `accumulate_and_write`: the best result for a few big groups.
3. `reverse_indices_in_groups_copy`: the best result for a lot of small groups.
4. `reverse_indices_in_groups`: the best result for a lot of small groups, but can't be used in the common case.
5. `reverse_indices_in_small_groups`: not the best, but good for a lot of small groups.
6. `gather_reverse`: just sorting; average everywhere, best nowhere.
7. `reverse_indices_in_groups_simple_sort`: seems good for a few big groups, but grows exponentially.
The `reverse_indices_in_groups` function requires mutable source values as input. It *could be* faster than `reverse_indices_in_groups_complex_sort`, but it cannot be used in the general case.
*Performance may vary on different hardware, and the dependency can shift; the error can still be too big.*
The best functions can be combined if the size of the groups is known in advance:
```cpp
static Array<int> write_indices_best(const OffsetIndices<int> offsets,
                                     const int average_group_size,
                                     const Span<int> group_indices)
{
  if (average_group_size > 1000) {
    return reverse_indices_in_groups_complex_sort(offsets, group_indices);
  }
  else {
    return reverse_indices_in_groups_copy(offsets, group_indices);
  }
}
```

## Reusing of support data
Both `reverse_indices_in_groups_complex_sort` and `count_indices_parallel_segments` use the same supporting data and have their best speed in the same range. The functions can be joined.
```cpp
static void group_build(const Span<int> group_indices,
                        MutableSpan<int> counts_to_offsets,
                        MutableSpan<int> results)
{
  Array<int> indices(group_indices.size());
  constexpr int segment_size = 1024;
  const bool last_small_segment = bool(group_indices.size() % segment_size);
  const int total_segments = int(group_indices.size() / segment_size) + int(last_small_segment);
  Array<MutableSpan<int>> thread_local_indices(total_segments);
  threading::parallel_for_each(IndexRange(total_segments), [&](const int segment_index) {
    /* Sort the local segment and accumulate the group counts, like
     * `count_indices_parallel_segments`, but keep the sorted segment for the second pass. */
    MutableSpan<int> &local_indices = thread_local_indices[segment_index];
    local_indices = indices.as_mutable_span().slice_safe(segment_index * segment_size,
                                                         segment_size);
    std::iota(local_indices.begin(), local_indices.end(), segment_index * segment_size);
    parallel_sort(local_indices.begin(), local_indices.end(), [&](const int a, const int b) {
      return group_indices[a] < group_indices[b];
    });
    int start = 0;
    while (true) {
      const Span<int> local = local_indices.drop_front(start);
      if (local.is_empty()) {
        break;
      }
      const int group = group_indices[local.first()];
      const int *first_other = std::find_if(local.begin(), local.end(), [&](const int index) {
        return group_indices[index] != group;
      });
      const int current_size = int(std::distance(local.begin(), first_other));
      start += current_size;
      atomic_add_and_fetch_int32(&counts_to_offsets[group], current_size);
    }
  });
  OffsetIndices<int> offsets = offset_indices::accumulate_counts_to_offsets(counts_to_offsets);
  Array<int> counts(offsets.size(), 0);
  threading::parallel_for_each(IndexRange(total_segments), [&](const int segment_index) {
    /* Reuse the already sorted segments to write the indices, like
     * `reverse_indices_in_groups_complex_sort`. */
    int start = 0;
    while (true) {
      const Span<int> local = thread_local_indices[segment_index].drop_front(start);
      if (local.is_empty()) {
        break;
      }
      const int group = group_indices[local.first()];
      const int *first_other = std::find_if(local.begin(), local.end(), [&](const int index) {
        return group_indices[index] != group;
      });
      const int current_size = int(std::distance(local.begin(), first_other));
      start += current_size;
      const int current_start = atomic_add_and_fetch_int32(&counts[group], current_size) -
                                current_size;
      const IndexRange final_range = offsets[group].slice(current_start, current_size);
      MutableSpan<int> dst = results.slice(final_range);
      const Span<int> src = local.take_front(current_size);
      std::copy(src.begin(), src.end(), dst.begin());
    }
  });
  make_stabil_order_in_groups(offsets, results);
}
```
*"Separate offsets accumulation and indices writing" is the result of running `reverse_indices_in_groups_complex_sort` and `count_indices_parallel_segments` separately.*

# Final function
Different functions prevail in different ranges of the x-axis. If only the best function is used for each group size, the results are as follows:

The `count_indices_best` function is now useless: the range it was chosen for is covered by `write_indices_best`. But the `write_indices_best` function is useless too: in the new range (0-10.000 groups) `group_build` wins, and it joins both stages in a single function.
The previously shown functions differ in signature here because the result array is provided by the caller.
```cpp
static void groups_indices_to_groups(const Span<int> group_indices,
                                     MutableSpan<int> counts_to_offsets,
                                     MutableSpan<int> result_indices)
{
  /* The group-size threshold is a magic number derived empirically.
   * To work around accuracy and error problems in this kind of group size
   * estimation, an epsilon of 100 is applied in the downward direction. */
  if (result_indices.size() / counts_to_offsets.size() > 900) {
    group_build(group_indices, counts_to_offsets, result_indices);
  }
  else {
    /* The default path: count, accumulate offsets, then write the indices. */
    build_reverse_offsets(group_indices, counts_to_offsets);
    const OffsetIndices<int> offsets(counts_to_offsets);
    reverse_indices_in_groups_copy(offsets, group_indices, result_indices);
  }
}
```
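To tie things together, a hedged sketch of how a caller might use the final function (the setup below is an illustration under the assumptions described in the abstract, not code from the node):
```cpp
/* Hypothetical caller: `group_indices` holds deduplicated values in [0, group_total). */
static int64_t example_usage(const int group_total, const Span<int> group_indices)
{
  Array<int> offsets_data(group_total + 1, 0);
  Array<int> r_indices(group_indices.size());
  groups_indices_to_groups(group_indices, offsets_data, r_indices);

  const OffsetIndices<int> offsets(offsets_data);
  int64_t total = 0;
  for (const int group : IndexRange(group_total)) {
    /* All source positions that belong to `group`, in stable ascending order. */
    const Span<int> indices_of_group = r_indices.as_span().slice(offsets[group]);
    total += indices_of_group.size();
  }
  /* Every source position ends up in exactly one group. */
  return total;
}
```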
Benchmark results for `groups_indices_to_groups`:
