# Slightly faster offline range queries
Warning: the following algorithm is of questionable utility. However, I find it quite aesthetic.
Suppose we are given an array $A$ of $N$ elements and need to process $Q$ queries "given $l,r$, what is $f(A_l, A_{l+1},\dots,A_r)$?". We assume that the problem is offline, i.e., we don't need to answer each query before we're given the next one. We also assume that $f$ is associative and has an identity element (every operation I've had to use has one). Note that if $f$ is invertible, a normal fenwick tree suffices, and we don't need this dance.
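As an aside, here is what the invertible case looks like for comparison (my own illustration): with $f = +$, two prefix queries and a subtraction answer any $[l,r]$ online.

```C++
#include <bits/stdc++.h>
using namespace std;

// Standard sum fenwick tree (0-indexed interface, 1-indexed internally).
struct Fenwick {
    vector<long long> t;
    Fenwick(int n) : t(n + 1, 0) {}
    void update(int i, long long v) { for (i++; i < (int)t.size(); i += i & -i) t[i] += v; }
    long long prefix(int i) { long long s = 0; for (i++; i > 0; i -= i & -i) s += t[i]; return s; }
    // Only possible because + has an inverse; min/max/gcd have none.
    long long query(int l, int r) { return prefix(r) - prefix(l - 1); }
};
```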
This task can be classically solved using segment trees. However, let's try to get a tiny bit more performance by using fenwick trees instead.
Let $I$ be the identity for $f$ (i.e., the element such that $f(a,I)=a$). We start by creating a fenwick tree $T$ filled with $I$, and we sort all queries by their $r$ value so they can be processed in increasing order of $r$. To answer all queries with a specific $r$, we first set $T_i = A_i$ for all $i \leq r$ (each position gets set exactly once over the whole sweep). Now, to answer a particular query $[l,r]$, we simply query $[l, N-1]$. Because every element of the fenwick tree in $[0,r]$ has been added, this includes everything in $[l,r]$. It also includes the suffix $[r+1,N-1]$, but those positions won't affect the answer, as they are still identity elements. To query suffixes of the fenwick tree, we index it in reverse (i.e., $i$ becomes $N-1-i$), turning suffix queries into prefix queries.
When is this useful? One reasonable candidate is finding the 2 largest distinct numbers in an interval (see https://open.kattis.com/problems/aneasyarray); a sparse table doesn't work there, since its overlapping blocks double-count. Another candidate is matrix multiplication. A sketch of a suitable merge for the first candidate is given below.
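Here is a minimal sketch of such a merge (my own illustration, not taken from the linked problem; `Top2` and `merge_top2` are illustrative names). It is associative with identity $\{-\infty, -\infty\}$:

```C++
#include <bits/stdc++.h>
using namespace std;

// The two largest distinct values seen so far; a >= b, with INT_MIN as a
// "no value" sentinel (assumes all array values exceed INT_MIN).
// An array element v is represented as {v, INT_MIN}.
struct Top2 {
    int a, b;
};
const Top2 IDENTITY = {INT_MIN, INT_MIN};

// Associative merge: the two largest distinct values of a union depend
// only on the two largest distinct values of each part.
Top2 merge_top2(Top2 x, Top2 y) {
    int c[4] = {x.a, x.b, y.a, y.b};
    sort(c, c + 4, greater<int>());
    Top2 r = {c[0], INT_MIN};
    for (int i = 1; i < 4; i++)
        if (c[i] != r.a) { r.b = c[i]; break; } // largest value below r.a
    return r;
}
```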
Here is how to solve range minimum queries using this trick (we will be using [this problem](https://judge.progolymp.se/contests/rangequeries/problems/rmq)). The implementation is decently optimized and runs in about 0.11s, but it can be optimized further (down to 0.03s).
```C++
#include <bits/stdc++.h>
using namespace std;
typedef long long ll;
typedef vector<int> vi;
typedef vector<vi> vvi;
typedef pair<int, int> p2;
#define rep(i, high) for (int i = 0; i < (high); i++)
#define repp(i, low, high) for (int i = (low); i < (high); i++)
#define repe(i, container) for (auto& i : container)
#define sz(container) ((int)container.size())
#define all(x) begin(x),end(x)
// Assumes queries to be [l,r] (inclusive, 0-indexed).
template<typename T>
vector<T> OfflineRangeQueries(const vector<T>& arr, const vector<pair<int, int>>& queries, T identity, function<T(T,T)> f)
{
    int n = arr.size();
    int q = queries.size();
    // Tag each query with its original index so answers come out in input order.
    vector<array<int, 3>> queries_sorted(q);
    for (int i = 0; i < q; i++) queries_sorted[i] = { queries[i].first, queries[i].second, i };
    sort(queries_sorted.begin(), queries_sorted.end(), [](const array<int, 3>& a, const array<int, 3>& b)
    {
        return a[1] < b[1];
    });
    // 1-indexed fenwick tree over the reversed array, filled with the identity.
    vector<T> fenwick(n + 1, identity);
    vector<T> ans(q);
    int qp = 0;
    for (int i = 0; i < n; i++)
    {
        // Point-update position i; its reversed, 1-indexed position is n - i.
        for (int j = n - i; j < n + 1; j += (j & -j)) fenwick[j] = f(fenwick[j], arr[i]);
        // Answer all queries whose right endpoint is i.
        while (qp < q && queries_sorted[qp][1] == i)
        {
            // A prefix query in the reversed tree is the suffix query [l, n-1].
            T res = identity;
            for (int j = n - queries_sorted[qp][0]; j > 0; j -= (j & -j)) res = f(res, fenwick[j]);
            ans[queries_sorted[qp][2]] = res;
            qp++;
        }
    }
    return ans;
}
int main()
{
    cin.tie(0)->sync_with_stdio(0);
    int n, q;
    cin >> n >> q;
    vi nums(n);
    repe(v, nums) cin >> v;
    vector<p2> queries(q);
    rep(i, q) cin >> queries[i].first >> queries[i].second;
    vi ans = OfflineRangeQueries<int>(nums, queries, int(1e9), [](int a, int b)
    {
        return min(a, b);
    });
    repe(v, ans) cout << v << "\n";
    return 0;
}
```
# Advanced: extra performance
## Not using lambda
Using lambda functions results in slightly worse codegen than passing a struct with `operator()`, as is done for priority queues with custom comparators. If the compiler were smarter, it would recognize that the function is known at compile time, but I digress.
With lambda: https://godbolt.org/z/cvxocYTWc
With struct and operator(): https://godbolt.org/z/jvG4ex65P
Even with `-O3`, the problem persists. So by sacrificing some ergonomics, we can get more speed.
[Gist](https://gist.github.com/Matistjati/3623e8dc77dc5a03515580b47604a177) (~0.10s). A minimal sketch of the interface change follows.
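Here is a hedged sketch of the idea (simplified, not copied from the gist; `Min` and `PrefixQuery` are illustrative names): template on the functor type so that `operator()` can be inlined.

```C++
#include <bits/stdc++.h>
using namespace std;

// Struct with operator(), mirroring custom priority_queue comparators.
struct Min {
    int operator()(int a, int b) const { return min(a, b); }
};

// Templating on F (instead of taking function<T(T,T)>) lets the compiler
// inline f into the hot loop; std::function forces an indirect call.
template<typename T, typename F>
T PrefixQuery(const vector<T>& fenwick, int j, T identity, F f) {
    T res = identity;
    for (; j > 0; j -= j & -j) res = f(res, fenwick[j]);
    return res;
}

int main() {
    vector<int> fenwick = {0, 3, 1, 7, 1}; // toy 1-indexed tree
    cout << PrefixQuery(fenwick, 4, int(1e9), Min{}) << "\n"; // prints 1
}
```

Because `F` is a concrete type known at compile time, the call to `f` inlines into the loop, whereas `std::function` goes through type erasure.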
## Counting sort
We can also gain more performance by replacing the comparison sort of the queries with a counting sort (the keys are just the $r$ values, which lie in $[0,N)$). Doing it with `vector<vector<int>>` is terribly slow, so instead we will use a flat array; a sketch is given below the gist.
[Gist](https://gist.github.com/Matistjati/4ba1cba30f3dc4d5984ac73320d1235f) (~0.08s)
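A hedged sketch of the bucketing (my own illustration; the gist may differ): count queries per right endpoint, prefix-sum the counts into bucket starts, then scatter query indices into a flat `order` array.

```C++
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n = 8; // array size; right endpoints lie in [0, n)
    vector<pair<int, int>> queries = {{2, 7}, {0, 3}, {1, 3}, {4, 7}};
    int q = queries.size();

    // Count how many queries end at each r, then prefix-sum into bucket starts.
    vector<int> start(n + 1, 0);
    for (auto& [l, r] : queries) start[r + 1]++;
    partial_sum(start.begin(), start.end(), start.begin());

    // Scatter query indices into a flat array, sorted by r in O(n + q).
    vector<int> order(q);
    for (int i = 0; i < q; i++) order[start[queries[i].second]++] = i;

    for (int i : order) cout << i << ' '; // prints: 1 2 0 3
    cout << '\n';
}
```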
## Fast IO + Pragma
Let's mechanically do the standard stuff. Add `#pragma GCC optimize("O3")` and `#pragma GCC target("avx2,bmi")`. avx2 helps with fenwick trees; bmi has `blsr`, which allows us to do `x -= x & -x` in one instruction. Then, add fast input using [blazingio](https://github.com/purplesyringa/blazingio/blob/master/blazingio.min.hpp). avx2 probably also helps blazingio.
The biggest speedup here comes from blazingio, as we were previously IO-bound.
[Gist](https://gist.github.com/Matistjati/d5cb83ddbdc010f08d7bf29f9ff5f370) (~0.03s)
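The top of the file might then look roughly like this (an illustrative sketch; the gist has the actual setup):

```C++
// Sketch of the file header (see the gist for the full setup).
#pragma GCC optimize("O3")
#pragma GCC target("avx2,bmi") // avx2 for vector code, bmi for blsr (x -= x & -x)
#include <bits/stdc++.h>
// Paste blazingio.min.hpp here: it swaps the cin/cout machinery
// for much faster buffered IO without changing the calling code.
using namespace std;
```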
## Comparisons
Tests were run on yosupo's [static RMQ](https://judge.yosupo.jp/problem/staticrmq).
- Offline fenwick: [49ms](https://judge.yosupo.jp/submission/297636)
- Arpa's trick: [53ms](https://judge.yosupo.jp/submission/297616)
- Kactl's RMQ: [54ms](https://judge.yosupo.jp/submission/297619)
- Kactl's segtree with O(N) build: [79ms](https://judge.yosupo.jp/submission/297625)
- Kactl's Segtree: [95ms](https://judge.yosupo.jp/submission/297623)
## SIMD
I spent around 6 hours trying to use SIMD gather to optimize the fenwick tree query. The conclusion is that the compiler generates pretty much optimal code from the normal fenwick tree implementation. Do not try to beat the compiler in this instance.
Let $f$ be our query function. Let's assume it's simple enough to have a SIMD operation that allows us to "merge" two vectors quickly. The scalar code is as follows:
```C++
int j; // our query index
T res = identity;
for (; j > 0; j -= (j & -j))
res = f(res, fenwick[j]);
```
Let $g(j, x)$ denote the result of applying `j -= j & -j` to $j$, $x$ times.
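For reference, a scalar version of $g$ (my own illustration; `j & (j - 1)` clears the lowest set bit, same as `j -= j & -j`):

```C++
// g(j, x): strip the lowest set bit of j, x times (or until j hits 0).
int g(int j, int x) {
    while (x-- > 0 && j) j &= j - 1; // one blsr instruction with bmi
    return j;
}
```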
If we can compute $g$ quickly, SIMD-friendly pseudocode could look like the following:
```C++
int j; // our query index
T acc[4] = {identity, identity, identity, identity};
int indices[4] = {g(j,0), g(j,1), g(j,2), g(j,3)};
while (indices[0]) // lane 0 always runs out of set bits last
{
    rep(k,4) acc[k] = f(acc[k], tree[indices[k]]);
    rep(k,4) indices[k] = g(indices[k], 4);
}
T res = f(f(acc[0], acc[1]), f(acc[2], acc[3]));
```
We maintain correctness by having `tree[0]` contain an identity element, so lanes whose index reaches $0$ early just keep merging in the identity. In this loop we're doing 3 things:
1. Loading values from the fenwick tree at 4 different indices
2. Merging them with `acc`
3. Updating the indices
Step 2 is super fast and great. Step 1 works well enough with a gather. The most annoying part is step 3: one option is to advance all indices forward 4 times per iteration (with a lane width of 8, that becomes 8 advances, etc.). 8 lanes gives [51ms](https://judge.yosupo.jp/submission/297645), 4 lanes gives [50ms](https://judge.yosupo.jp/submission/297648) and 2 lanes gives [53ms](https://judge.yosupo.jp/submission/297649). We need explicit SIMD, as the compiler is unlikely to autovectorize this (for good reasons, it seems). A sketch of the 4-lane version follows.
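For concreteness, here is a hedged sketch of a 4-lane variant for range minimum (my own reconstruction, not the linked submissions; `QueryMin4` is an illustrative name, and it assumes `fenwick[0]` holds the identity and AVX2 is enabled):

```C++
#include <bits/stdc++.h>
#include <immintrin.h> // requires AVX2, e.g. #pragma GCC target("avx2")
using namespace std;

// 4-lane gather query for range minimum. Lane k walks the indices
// g(j,k), g(j,k+4), g(j,k+8), ... so the lanes partition the chain.
// Requires fenwick[0] == identity so exhausted lanes read a harmless value.
int QueryMin4(const vector<int>& fenwick, int j) {
    const int identity = INT_MAX;
    // Start lanes at g(j,0), g(j,1), g(j,2), g(j,3).
    int idx[4] = {j, 0, 0, 0};
    for (int k = 1; k < 4; k++) idx[k] = idx[k - 1] & (idx[k - 1] - 1);
    __m128i ind = _mm_loadu_si128((__m128i*)idx);
    __m128i acc = _mm_set1_epi32(identity);
    while (idx[0]) { // lane 0 always has the most set bits left
        // Step 1: gather fenwick[ind[k]] for all 4 lanes at once.
        __m128i vals = _mm_i32gather_epi32(fenwick.data(), ind, 4);
        // Step 2: merge into the accumulators.
        acc = _mm_min_epi32(acc, vals);
        // Step 3: clear the lowest set bit of every lane, 4 times.
        for (int s = 0; s < 4; s++)
            ind = _mm_and_si128(ind, _mm_sub_epi32(ind, _mm_set1_epi32(1)));
        idx[0] = _mm_cvtsi128_si32(ind);
    }
    // Horizontal min over the 4 lanes.
    int out[4];
    _mm_storeu_si128((__m128i*)out, acc);
    return min(min(out[0], out[1]), min(out[2], out[3]));
}
```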
Another option would be to precompute $g$ and store it in a flat array. This is a terrible idea: if our lane width is 4, we would use $5\cdot 10^5 \cdot 5 \cdot 4$ B = 10MB of memory, pushing us firmly out of L3 on most online judge hardware. The fact that updating the indices with SIMD arithmetic is fine performance-wise indicates that we are pretty much memory bound. This is not surprising, as the fenwick tree itself already needs $4 \cdot 5 \cdot 10^5$ B = 2MB of memory, so adding more memory loads would be detrimental.
In conclusion, we are likely memory bound. This analysis probably carries over to other uses of fenwick trees as well, given that when they are the intended solution, authors usually set large constraints like $N \geq 5 \cdot 10^5$. To increase performance for large $N$, we should consider structures with other memory layouts; in particular, ones with better cache locality, such as [wide segment trees](https://en.algorithmica.org/hpc/data-structures/segment-trees/), seem more fruitful.