# [Identifying Cells (Hard)](https://po2punkt0.kattis.com/problems/cellidentifikationsvar)
# Subtask 0: $K \leq 21$
This is the original version of Identifying Cells. One solution is to use [SOS DP](https://codeforces.com/blog/entry/45223) to count the number of submasks for every possible query.
There is a simpler solution: for every mask, compute the largest and smallest submask:
```c++
for(int mask = (1 << k)-1; mask >= 0; mask--){
rep(i, 0, k){
MIN[mask] = min(MIN[mask], MIN[(mask|(1 << i))]);
MAX[mask] = max(MAX[mask], MAX[(mask|(1 << i))]);
}
}
```
To check if there is more than one match, we can simply check if `MIN` is not the same as `MAX`. If there is exactly one match, it is the min (or max; they are the same in this case).
# Subtask 1: $K \leq 28$
We can use ideas from [this](https://codeforces.com/blog/entry/103799) codeforces blog to do SOS DP faster than $O(2^{28})$. This is pretty painful and as far as I'm aware, it cannot be optimized beyond this subtask.
# Faster solutions
The naive $O(NQ)$ solution looks something like the following:
```c++
for (int i = 0; i < n; i++)
{
if ((types[i] & query) == query)
{
match = i;
n_matches++;
}
}
```
The real complexity is something like $O(\frac{NQK}{w})$, where $w$ is the word size of the computer. In this case, $w=64$. However, if we have very large bitsets, it is more like $256$ because of SIMD. This specific loop can also be SIMD:d for increased performance, but as far as I'm aware, this cannot even solve $K=21$.
For this to get full points, it would have to process $NKQ=2.4\cdot 10^{12}$ bits, or 60 GB/s (given the 5-second time limit). This does not seem completely impossible if we answer queries offline with loop blocking. Let's look for an easier solution instead.
For each bit, let's precompute a bitset of which patterns from the biology book have that bit enabled. To answer a given query, we can bitwise and (`&`) together the bitsets for every bit in the query that is set to 1 (if the query bit is 0, everything is valid, so we ignore it). To find the answer, we check whether the count is 0,1, or greater. If it's 1, we call `_Find_first`. This runs in $O(NK+\frac{Q(NK)}{w})$. Not an obvious improvement, but it's likely to run faster than the previous solution, as the actual operations we're doing are simpler (we're just ANDing together bitsets).
The real improvement comes from this allowing us to precompute. We're going to use a precomputation inspired by the method of four russians. Split $K$ into let's say 6 parts, 10 bits each. For each of the $2^{10}$ bit patterns, precompute the result of ANDing together all the bitsets that correspond to the 1 bits in that pattern. This allows us to speed up queries by a factor of 6, while keeping the precomputation time reasonable.
Doing the precomputation naively is fast enough: let $p$ be the number of bits in each chunk. Then, $O(\frac{p2^pN}{w})$ suffices. It can be sped up to $O(\frac{2^pN}{w})$ using DP, but this only results in a speedup of about $10$% (if we increase $p$, then the memory usage will also increase).
Don't forget to enable SIMD: otherwise, the problem is impossible as far as I'm aware. On Kattis, my solution runs in ~3s, with a 5s time limit. I believe it should be possible to improve it further, by processing queries offline and trading $O(\text{precomp})$ memory with $O(Q)$ memory, keeping everything in cache. This is what Erik Adebahr did to get ~1.6s runtime.