# Mildly Inconvenient Array Problem

## Subtask 1: $O(NQ)$

In this subtask, handling each query in $O(N)$ suffices. To handle `+` updates, simply iterate over the interval and add $w$. To handle `?` queries, we can use [Kadane's algorithm](https://en.wikipedia.org/wiki/Maximum_subarray_problem#No_empty_subarrays_admitted). We iterate over the interval, keeping a running sum, the sum of $[l,i]$. If this ever becomes negative, then it's better to start a new interval, so we reset the running sum to $0$. Take the max over all such runs.

```python
l, r = map(int, input().split())
l -= 1
r -= 1
ans = 0
s = 0
for i in range(l, r + 1):
    s += arr[i]
    ans = max(ans, s)
    s = max(s, 0)
```

Why does this work? One way to think of it is that we compute the max subarray for each possible right endpoint of the interval. Each iteration checks one endpoint. In this view, it can be thought of as a DP:

$$DP[i]=\text{max subarray with right endpoint } i$$

And then we get two transitions: extend the best subarray ending at $i-1$, or start a new one. We can ignore negative running sums, because the answer is never negative (it is at least $0$, as the problem allows the empty subarray).

```python
n, q = map(int, input().split())
arr = list(map(int, input().split()))
for _ in range(q):
    c, *rest = input().split()
    if c == '+':
        l, r, w = map(int, rest)
        for i in range(l - 1, r):
            arr[i] += w
    else:
        l, r = map(int, rest)
        ans = 0
        pref = 0
        for i in range(l - 1, r):
            pref += arr[i]
            ans = max(ans, pref)
            pref = max(pref, 0)
        print(ans)
```

## Subtask 2: point updates

Most problems that can be solved using divide and conquer can be made dynamic using [segment trees](https://cp-algorithms.com/data_structures/segment_tree.html). Let's see how. Suppose we have a function $f(l,r)$ that computes the max subarray in the interval $[l,r]$. To divide, we split the problem into the parts $[l, mid]$ and $(mid, r]$.
Then, we need to compute the best subarray entirely in $[l,mid]$, the best subarray entirely in $(mid, r]$, and the best subarray that crosses $mid$.

![image](https://hackmd.io/_uploads/BJ4LY-hV-l.png)

To compute the left and right halves, we can simply call $f(l,mid)$ and $f(mid+1,r)$. To compute the best crossing value, we compute the max suffix value in $[l,mid]$ and the max prefix value in $(mid, r]$. The best interval crossing $mid$ is then the sum of these two values. So our recursive function needs to return:

- Largest subarray in $[l,r]$
- Largest possible suffix in $[l,r]$
- Largest possible prefix in $[l,r]$

To compute the largest suffix and prefix from the children, we also need the sum of the interval. Then, the function looks as follows:

```python
arr = ...

class Ret:
    def __init__(self, x):
        self.sum = x
        x = max(0, x)
        self.max_subarray = x
        self.max_prefix = x
        self.max_suffix = x

def merge(a, b):
    ret = Ret(0)
    ret.max_subarray = max(a.max_subarray, b.max_subarray, a.max_suffix + b.max_prefix)
    ret.max_prefix = max(a.max_prefix, a.sum + b.max_prefix)
    ret.max_suffix = max(b.max_suffix, b.sum + a.max_suffix)
    ret.sum = a.sum + b.sum
    return ret

def f(l, r):
    if l == r:
        return Ret(arr[l])
    mid = (l + r) // 2
    return merge(f(l, mid), f(mid + 1, r))
```

If we store the values of $f$ in a segment tree, we can do point updates and range queries in $O(\log(N))$. Unfortunately, this doesn't easily extend to range updates, because `max_subarray`, `max_prefix` and `max_suffix` can change unpredictably under range updates.

## All queries and updates cover the whole array

In this subtask, all updates affect every interval. For an interval of length $k$ with sum $m$, if we add $x$ to every element, its new sum will be $kx+m$. Multiple updates stack nicely: adding $x$ and then $y$ is the same as if we had just added $x+y$. So if we could somehow extract all $O(N^2)$ intervals as linear functions $kx+m$, and then query them efficiently, we could solve the problem.
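To make the line view concrete, here is a brute-force sketch (quadratic, for illustration only; the function names are mine, not from the solution): every interval of length $k$ with original sum $m$ becomes a line $kx+m$, where $x$ is the total amount added to the whole array so far.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>
using namespace std;
typedef long long ll;

// Collect every interval [i, j] of arr as a line (k, m): after a total of x
// has been added to every element, the interval's sum equals k*x + m.
vector<pair<ll, ll>> all_interval_lines(const vector<ll>& arr) {
    vector<pair<ll, ll>> lines = {{0, 0}};  // the empty subarray
    for (size_t i = 0; i < arr.size(); i++) {
        ll m = 0;
        for (size_t j = i; j < arr.size(); j++) {
            m += arr[j];
            lines.push_back({(ll)(j - i + 1), m});
        }
    }
    return lines;
}

// Max subarray after adding x to every element: max over all lines of k*x + m.
ll max_subarray_after_add(const vector<pair<ll, ll>>& lines, ll x) {
    ll best = 0;
    for (auto& [k, m] : lines)
        best = max(best, k * x + m);
    return best;
}
```

The convex hull trick described next replaces the linear scan in `max_subarray_after_add` with an $O(\log(N))$ query over a filtered set of lines.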
First, we can realize that $1 \leq k \leq N$, and for each $k$, we can greedily select the largest $m$. This reduces the final set of candidates to size $O(N)$. If we could compute these, we could then build a convex hull trick data structure, which allows us to find the line with the largest value at coordinate $x$ in $O(\log(N))$ time. You can read more about this data structure [here](https://hackmd.io/I4227EHWRuO16T9odRgZWw).

Now, how do we compute these lines efficiently? I have described that [here](https://hackmd.io/DpZ_bPNWSd6nwov6Ym1n2w#Problem-max-subarray-with-updates). Make sure to read about the Minkowski sum right above it.

So the solution now is to:

- Do a divide and conquer, using Minkowski sums to build $O(N\log(N))$ candidate lines.
- Filter these and build a convex hull trick.
- Answer each query in $O(\log(N))$ by querying the convex hull trick.

## All updates cover the whole array

We can realize that in the previous solution, the divide and conquer builds a segment tree over the array. If we just make sure to save the partial results, we can do the same as in the point update/range query variant: for each node, build a convex hull trick over the prefix max, suffix max and best subarray lines in the interval. To answer queries, take the $O(\log(N))$ nodes that cover the interval, query each of the 3 containers to obtain `prefmax`, `sufmax` and `max_subarray`, and then merge all nodes. $O(N\log(N)+Q\log(N)^2)$ in total.

## Full solution at different speeds

Let's consider adding lazy propagation to the segment tree. For nodes entirely within or outside the interval, it's easy. But if the update partially intersects the node, as far as I'm aware, we have no choice but to rebuild the node's structures, which takes $O(N)$ if we have to rebuild a big node. To remedy this, let's instead divide the array into blocks of size $B$ (sqrt decomposition). Each block stores line containers that can answer `max_prefix`, `max_suffix`, `max_subarray`.
Now, we handle the cases as follows:

1. Block completely inside update: keep a per-block lazy counter, similar to the solutions to subtasks 3 and 4.
2. Block partially inside update: rebuild the block completely.
3. Block completely inside query: query for `max_prefix`, `max_suffix`, `max_subarray` using the block's lazy.
4. Block partially inside query: do it naively in $O(B)$.

Now, how long does this take per query?

1. $O(\frac{N}{B})$: we do $O(1)$ work per block, and there are up to $\frac{N}{B}$ blocks.
2. $O(B\log(B))$: there can be at most two such blocks (at the update endpoints), and rebuilding $K$ items takes $O(K\log(K))$.
3. $O(\frac{N}{B}\log(B))$: we make 3 binary searches per block.
4. $O(B)$: we iterate over at most $B$ elements.

Selecting the optimal $B$, we get something like $O(Q\sqrt{N}\log(N))$. Depending on your implementation, this might be enough for subtask 5. If you just want subtask 5, rebuilding in $O(B^2)$ with a blocksize around 150 is probably going to be faster, especially if you write it SIMD-friendly.

From here, there is no single insight that will get your solution to run fast enough. I'm going to detail how I got my solution to run in ~2 seconds on Kattis. The way to get a full solution (or subtask 5) is all about constant-factor optimization.

First, memory usage. No matter what we do, we're going to have to perform a LOT of operations. On my own computer (slightly faster than Kattis), my solution executes around $1.5 \cdot 10^{10}$ operations in $1.5$ seconds, and makes around $4\cdot 10^9$ memory reads/writes. Each read and write to main memory (RAM, the thing you likely have 8 or 16 GB of) takes around $100$ ns (an extremely rough and pessimistic estimate). With this estimate, the program would run for $6$ minutes, just waiting for memory reads/writes. The way to get around this is by ensuring that all our data structures fit in cache. Cache is a set of smaller, much faster memories.
You can read about it [here](https://en.algorithmica.org/hpc/external-memory/hierarchy/). Now, how big is Kattis' cache? We can find out the size of the L3 cache by submitting the bash program

```bash
cat /proc/cpuinfo | grep "cache"
```

to e.g. [Treirad](https://open.kattis.com/problems/treirad). Looking at the debugging hints, we see

```
cache size      : 8192 KB
cache_alignment : 64
cache size      : 8192 KB
cache_alignment : 64
cache size      : 8192 KB
cache_alignment : 64
cache size      : 8192 KB
cache_alignment : 64
```

So we have around 8 MB of memory to work with. Now, why care about this? As described previously, we would store:

- Original array: $4 \cdot 10^5 \cdot 8$ bytes
- Prefix max: $4 \cdot 10^5 \cdot 8 \cdot 2$ bytes over all blocks ($N$ lines as `pair<ll,ll>`)
- Suffix max: $4 \cdot 10^5 \cdot 8 \cdot 2$ bytes over all blocks
- Subarray max: $4 \cdot 10^5 \cdot 8 \cdot 2$ bytes over all blocks

This gets us to around $22$ MB, so we have already exceeded our quota. Even if we can optimize it slightly, we will need more data structures later. So this approach will not work.

Let's use the fact that all queries are given to us offline. Conceptually, we are currently answering queries as follows:

![image](https://hackmd.io/_uploads/BJK5UMhNbx.png)

Per query, we go left to right over all blocks. Thus, we need a data structure per block. Instead, we can process each block independently:

![image](https://hackmd.io/_uploads/HkE6UGnNWg.png)

This way, we only need $O(B)$ memory per block, which is nothing. Additionally, even though just storing the input array takes 3.2 MB, we will only ever use a slice of size $B$ of it at any given moment, so that's free. Instead, we will need to store all queries in memory, along with a list of size $O(Q)$ for all answers. Since $Q$ is smaller, this already helps. Even better, we will only ever make rather sequential and local accesses to these, which is cache-friendly.
By being mindful of these, my solution misses L1 around $10^9$ times (~2%), and misses L3 (going all the way to main memory) around $3 \cdot 10^5$ times (~0%). For some tips on profiling, you can read [my presentation about constant-factor optimization](https://typst.app/project/rx5zlAIz1sBvxDXehaWpOB).

## Block fully contained in update

Now to describe my optimizations of the basic sqrt template described above. Blocks completely contained in updates are $O(1)$ per block: increment the block's lazy. This is already super fast and not the bottleneck.

## Block partially contained in query

For blocks partially contained in queries, simply solve them naively in $O(B)$. This is already super fast and not the bottleneck.

## Block fully contained in query

Now, to answer queries that completely contain a block. In each block, we will store the hulls as three `vector<Line>`, one each for pref, suf and subarray. Each `Line` is a `pair<short,ll>`. Notably, we do not store `xRight` (the $x$ of the intersection with the next line in the hull), as many implementations do.

We will actually delay answering the query until the block experiences a partial update, forcing a rebuild. To do this, we store waiting queries, recording the lazy at the time of the query and the index of the query. Then, when we rebuild (or when we are done with a block), we answer all waiting queries. Answering in this instance means updating the partial information (remember that we're answering queries in a weird order). If there are few queries (say, fewer than 100), we will simply ternary search each hull.
Since we are ternary searching over a discrete function, a binary-search style works faster:

```c++
bool better(const Line& a, const Line& b, ll x) {
    return a.eval(x) >= b.eval(x);
}

ll eval(const vector<Line>& sp, ll x) {
    int l = 0, r = sp.size() - 1;
    while (l < r) {
        int m = (l + r) / 2;
        if (better(sp[m], sp[m + 1], x)) r = m;
        else l = m + 1;
    }
    return sp[l].eval(x);
}
```

If there are many queries, we have a situation with a small array we're searching over (at most $B$, which is around 500) and lots of queries. In this case, we can afford to compute `xRight` arrays and binary search over these, giving a normal C++ `lower_bound` for each query. We could of course sort all queries and do a 2-pointer. However, even with a good radix sort, this was consistently slower (which makes sense, since we're trading $O(Q\log(B))$ for $O(Q\log(Q)+B)$, and $B$ is smaller).

## Block partially contained in update

This is by far the slowest part of the code: rebuilding the hulls. We are primarily limited by how much memory we can read/write per second, so let's try to minimize this: as mentioned earlier, store `Line`s as `pair<short,ll>`.

Now for the big optimization: instead of rebuilding everything from scratch, store a segment tree over the block. By proper use of lazy propagation and only rebuilding segment tree nodes that partially intersect the update interval, we can handle updates in $O(B)$ instead of $O(B\log(B))$. The initial construction of course costs $O(B\log(B))$ per block, but this is basically free.

Per node, we store the convex hull of the prefix sums and suffix sums of $[l,r]$. To rebuild a node, we first need to push the lazy to the children. This is pretty easy: just add $k\cdot \texttt{lazy}$ to each line. Next, we can build the prefix by `memcpy`:ing the left and right child's prefixes, and then adding the left child's sum to the right child's half. The `memcpy` is important for performance. After this, take the convex hull of it. Similarly for the suffix.
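As a minimal sketch of the lazy push described above (the `Line` struct here is illustrative — the solution stores `pair<short,ll>`, but a named struct reads better): adding a pending value `lazy` to every element turns each stored line $(k, m)$ into $(k, m + k\cdot\texttt{lazy})$, which preserves the hull, since evaluating the shifted lines at $x$ is the same as evaluating the old lines at $x + \texttt{lazy}$.

```cpp
#include <vector>
using namespace std;
typedef long long ll;

struct Line {
    short k;  // length of the underlying prefix/suffix
    ll m;     // current sum of that prefix/suffix
    ll eval(ll x) const { return (ll)k * x + m; }
};

// Push a node's pending range-add into one of its hulls.
// Every element gained `lazy`, so a length-k sum gains k*lazy.
void push_lazy(vector<Line>& hull, ll lazy) {
    for (Line& l : hull) l.m += (ll)l.k * lazy;
}
```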
Store the `max_subarray` lines as a dense vector for performance: a `vector<vector<ll>> all_lines`, where `all_lines[x][k] = max m`. Minkowski sum the left child's suffix and the right child's prefix to get the lines crossing $mid$. Finally, merge the children's `all_lines` with a simple, clean loop. Make sure to never sort.

To do the Minkowski sum without `xRight`, use something like this:

```c++
bool intersect_less(const Line& a1, const Line& a2, const Line& b1, const Line& b2) {
    big_signed numA = (big_signed)a2.m - (big_signed)a1.m;
    big_signed denA = (big_signed)a1.k - (big_signed)a2.k;
    big_signed numB = (big_signed)b2.m - (big_signed)b1.m;
    big_signed denB = (big_signed)b1.k - (big_signed)b2.k;
    return numA * denB < numB * denA;
}
```

To take convex hulls quickly, use something like this:

```c++
inline bool intersection_geq(const Line& A, const Line& B, const Line& C) {
    auto left = (big_signed)(B.m - A.m) * (A.k - C.k);
    auto right = (big_signed)(C.m - A.m) * (A.k - B.k);
    return left >= right;
}

void hullify_vector(vector<Line>& lines) {
    int hsz = 0;
    for (auto& l : lines) {
        while (hsz >= 2 && intersection_geq(lines[hsz - 2], lines[hsz - 1], l)) {
            --hsz;
        }
        lines[hsz++] = l;
    }
    lines.resize(hsz);
}
```

## Various small optimizations/tips

- Bit-pack your queries so everything fits in cache:

  ```c++
  struct Query {
      bool update : 1;
      unsigned int l : 19;
      unsigned int r : 19;
      int w : 25;
  };
  ```

- Don't be stupid like me: you don't need 128-bit ints to compare lines, as using `long long` will never overflow (the slope is small).
- Dynamically choosing the blocksize based on the ratio of queries to updates might help. To do this, create a large set of tests with different characteristics, then run your solution on each test with different blocksizes, and try to figure out the best one. My solution does not do this, so it isn't strictly needed.
- Make sure to write SIMD-friendly loops where possible.
  Possible places: merging answers (prefmax, sufmax, subarray, sum), merging `all_lines`, `memcpy` (?), `put_tag` (?), partially contained queries. To enable SIMD, add

  ```c++
  #pragma GCC optimize("O3")
  #include <bits/allocator.h>
  #pragma GCC target("avx2")
  ```

  to the top of your code.
- NEVER allocate new memory. Allocate some memory up-front and reuse it. If you just need some temporary memory, `static` is your friend. Spam `.clear()`.
- For reference, $B=512$ is optimal for my solution, but $B=1024$ is pretty comparable.
- I tried optimizing binary search by precomputing a tighter bound based on the largest set bit. This helps a ton in the average case, but not the worst case.
- For profiling, `__attribute__((noinline))` might be helpful, so your functions actually show up. Of course, this should only be done for functions that do a lot of work, where the function-call overhead is negligible.
- My solution runs in ~2.5 seconds on Kattis, so there's a good margin.