# [Tall Monument](https://po2punkt0.kattis.com/problems/tallmonument)
You are given two positive integers $N$ and $T$ ($N \cdot T \leq 3 \cdot 10^5$), followed by $N$ pairs of numbers $(s_i,a_i)$ ($1 \leq s_i \leq T, -10^9 \leq a_i \leq 10^9$). Your goal is to choose a subset $S$ of these pairs such that $$\sum_{(s_i,a_i)\in S} s_i = T$$ and $$\sum_{(s_i,a_i)\in S} \text{sign}(a_i) \cdot 2^{|a_i|}$$ is maximized.
## Analysis
This is a special case of knapsack where all of our weights are small and the profits are huge powers of two. Naivly implementing knapsack using bigints would run in something like $\mathcal{O}(N \cdot \text{goalWeight} \cdot \text{maxProfit})$ which would be WAYYYYY too slow for this problem so we have to somehow exploit the fact that all profits are powers of two.
## All $a_i$ are small
When $|a_i| \leq 100$ we can do a normal knapsack dp with int128s.
```cpp=
for (int i = 0; i < n; i++) {
for (int j = 0; j <= t; j++)
dp[i + 1][j] = dp[i][j];
for (int j = si[i]; j <= t; j++) {
dp[i + 1][j] = max(dp[i + 1][j], dp[i][j - si[i]] + ai[i]);
}
}
```
## Splitting the problem into two parts
Let's divide the pairs into two groups, one where all $a_i$ are non-negative and the other containing all negative ones. For the non-negative group we perform the above dp and calculate for each height the maximum profit. For the negative group we flip all the signs and do the same as before but trying to minimize the profit, then we can combine these two groups to find the final answer.
```cpp=
using i128=__int128_t;
pair<i128, i128> best{-INF, INF};
for (int i = 0; i <= t; i++) {
i128 pos = dp_pos[sz(si_pos)][i];
i128 neg = dp_neg[sz(si_neg)][t - i];
if (pos + best.second > neg + best.first)
best = {pos, neg};
}
```
## Sqrt decomp
We will use a similar approach to the subtask, but instead of tracking the profit as a single integer, we will represent it as a sparse vector containing only the bits that are turned on.
How long can these vectors get? They are strictly bounded by $N$, since there are only $N$ different bits we can possibly turn on. Interestingly, if we look closer at the DP transitions, we notice that we only have $T$ layers, and each layer adds at most one bit. Therefore, we are bounded above by $\min(N,T)$. Given in the problem constrainst ($N \cdot T \leq 3\cdot 10^5$), we know that $\min(N,T) \leq \sqrt{3*10^5}$.
With careful implementation, we can perform all relevant operations on these sparse numbers in $\mathcal{O}(\min(N,T))$ thereby solving the problem in $\mathcal{O}(NT \cdot \min(N,T)) = \mathcal{O}((NT)^{1.5})$ which is sufficient for AC(100).
:::spoiler Example implementation
```cpp=
#include <bits/stdc++.h>
using namespace std;
using ll = long long;
const ll mod = 1e9 + 7;
ll binpow(ll b, ll e) {
b %= mod;
ll res = 1;
while (e) {
if (e & 1)
res = (res * b) % mod;
b = (b * b) % mod;
e >>= 1;
}
return res;
}
#define sz(c) ((int)c.size())
using num_t = vector<pair<int, uint32_t>>;
int main() {
cin.tie(0)->sync_with_stdio(0);
int n, t;
cin >> n >> t;
vector<pair<int, int>> sa_pos, sa_neg;
for (int i = 0; i < n; i++) {
int s, a;
cin >> s >> a;
if (a >= 0)
sa_pos.emplace_back(s, a);
else
sa_neg.emplace_back(s, -a);
}
auto add_num = [](const num_t& l, const num_t& r) {
num_t res;
int lp = 0;
int rp = 0;
int cexp = 0;
uint64_t cval = 0;
while (lp < sz(l) || rp < sz(r)) {
int lexp = lp < sz(l) ? l[lp].first: 1e9;
int rexp = rp < sz(r) ? r[rp].first: 1e9;
int exp = min(lexp, rexp);
if (cval && cexp < exp) {
res.emplace_back(cexp, cval);
cval = 0;
}
uint64_t val = cval;
if (lp < sz(l) && l[lp].first == exp) val += l[lp++].second;
if (rp < sz(r) && r[rp].first == exp) val += r[rp++].second;
cval = val >> 32;
cexp = exp + 1;
if ((uint32_t)val) res.emplace_back(exp, (uint32_t)val);
}
if (cval) res.emplace_back(cexp, cval);
return res;
};
auto cmp_num = [](num_t l, num_t r) {
int lp = sz(l) - 1;
int rp = sz(r) - 1;
while (lp >= 0 && rp >= 0) {
if (l[lp].first != r[rp].first)
return l[lp].first < r[rp].first;
if (l[lp].second != r[rp].second)
return l[lp].second < r[rp].second;
lp--, rp--;
}
return rp >= 0;
};
auto calc = [t, add_num](vector<num_t>& dp, vector<pair<int, int>> sa, auto cmp) {
dp.resize(t + 1);
for (int i = 0; i < sz(sa); i++) {
for (int j = t; j >= sa[i].first; j--) {
if (j - sa[i].first && dp[j - sa[i].first].empty())
continue;
num_t nxt = add_num(dp[j - sa[i].first], {{sa[i].second / 32, ((uint32_t)1) << (sa[i].second % 32)}});
if (dp[j].empty() || cmp(dp[j], nxt))
dp[j] = nxt;
}
}
};
vector<num_t> nums_pos, nums_neg;
calc(nums_pos, sa_pos, cmp_num);
auto cmp_num2 = bind(cmp_num, placeholders::_2, placeholders::_1);
calc(nums_neg, sa_neg, cmp_num2);
bool valid = false;
array<num_t, 2> best;
for (int i = 0; i <= t; i++) {
auto pos = nums_pos[i];
auto neg = nums_neg[t - i];
if (i && pos.empty())
continue;
if (t - i && neg.empty())
continue;
if (!valid) {
best[0] = pos;
best[1] = neg;
valid = true;
continue;
}
auto tmp1 = add_num(pos, best[1]);
auto tmp2 = add_num(neg, best[0]);
if (cmp_num(tmp2, tmp1)) {
best[0] = pos;
best[1] = neg;
}
}
ll ans = 0;
for (auto [i, j] : best[0]) {
ans += binpow(2, i * 32) * j;
ans %= mod;
}
for (auto [i, j] : best[1]) {
ans -= binpow(2, i * 32) * j;
ans %= mod;
}
if (ans < 0)
ans += mod;
cout << ans << '\n';
}
```
:::
## How to mog everyone by being over 17x faster :moyai: :goat:
Using square-roots is kinda cringe, and luckily for us it turns out that we are able to exchange them for a log-factor.
There are multiple ways to achieve this, but here is what I consider the most straightforward method. We want to perform the same DP, but each operation has to be significantly more efficient. Let's sort the pairs such that smaller $a_i$ values are processed first. This gives rise to a crucial property: at any moment, the highest set bit (hsb) is at most $\mathcal{O}(\log(\min(N,T)))$ greater than the current $a_i$. We can make this be $\mathcal{O}(1)$ nodes by using a base larger than $2^{\log(\min(N,T))}$ instead of base $2$ in our tree.
This means that when creating a new number from an old one, only the final suffix of our bit representation needs to change. We will exploit this by utilizing *persistent sparse numbers*. Instead of copying the entire vector to create a new number, we represent the numbers as nodes in a tree (similar to a linked list). Creating a new number simply requires branching of a path off the tree.
We can now quickly add new bits to the numbers, but how do we quickly compare two numbers? Well by binary-lifting and string-hashing of course :upside_down_face: In this tree each node stores the exponent that it represents, the using binary-lifting and string-hashing we can quickly compute how far up the two numbers are equal.
Putting all of this together creates a solution running in $\mathcal{O}(NT \cdot \log(\min(N,T)))$. For anyone that wants to implement this I will warn you that there are a lot of edgecases that I haven't covered here, good luck!
:::info
It is also possible to solve the problem using persistent lazy sparse segment trees without needing the insight of dividing into groups depending in the sign of $a_i$, exactly how is left as an exercise to the reader. [This](https://codeforces.com/blog/entry/115626) might also be relevant - although I haven't looked too closely at it :D
:::