# advanced sequence challenges

In what follows we explore certain classes of challenges for sequence prediction models. These challenges are motivated by the following observation, made by a number of research groups:

> Word-error-rate / next-token prediction tests are intrinsically naive: on the one hand, text-search algorithms would master them easily; on the other, predicting the next word/token is not a good measure of the "intelligence" of a model. Associative-recall tests, concerned mainly with bigrams (pairs of words/tokens), are better: e.g. they allow one to assess whether a model can learn transitive relations (as in MQAR).

In essence, the challenges involve analyzing the content of presented contexts for the occurrence of groups of _important_ information and of _relations_ between them. Further observations, in our opinion, put these challenges into a proper formal form, devoid of anthropomorphic motivations, and lead to a better understanding of AI models.

In the post https://magic.dev/blog/100m-token-context-windows the following observations about the construction of challenges were made:

- the context should be full of _haystack_, meaning a large number of tokens irrelevant to the final answer (complicating the "what is relevant" problem),
- the tokens to look for / pay attention to should be fairly generic (no special tokens, which would make the haystack easily discardable just by single-token classification),
- ignoring the arrow ("→") notation of the authors, the tests boil down to the following questions/challenges:

**Challenge type 1 (list-contain challenge)**: Does the long context contain a given list of subsequences?

**Challenge type 2 (set-contain challenge)**: Does the long context contain a given set of subsequences?

## exploring the challenges

### examples of challenges

**list-contain example**

This challenge deals with the existence of a predefined list of subsequences, e.g. `['AA', 'BB', 'CC']` (using Python notation), in a context; they should appear in order, but can be separated by an arbitrary amount of "hay" tokens, meaning tokens irrelevant to the answer. In order to avoid the trap of singling out relevant parts of the context just by the type of tokens used, we shall use a very conservative, 4-letter alphabet (we will use `ABCD`, but the DNA nucleotides, `ACTG`, could be used instead).

If the sequences `AA`, `BB`, `CC` appear in the context as non-overlapping subsequences, in the given order, we classify such contexts as `Yea` contexts; e.g. the following ones are `Yea` contexts:

```
ctx1='ABCABCAABCBCBCBBCDCDCCABC'
ctx2='ABCABCAABBBCBCBCBBCDCDCCABC'
```

It might not be easy for the reader to find the required subsequences; let us help with this task by using dots, `.`, for what (ultimately) constitutes the "haystack":

```
ctx1='......AA......BB....CC...'
ctx2='......AABB............CC...'
```

Clearly, finding such lists of subsequences is not trivial (but doable by classical string algorithms; see the sketch below). In addition, due to the very limited alphabet, the relevant tokens may appear many times all over the context; they may also appear "accidentally", meaning there is a non-zero probability of a random context being a `Yea` context. This will be addressed in the precise formulation of the challenges which follows.
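For concreteness, here is a minimal sketch of such a classical check (the function name and the greedy strategy are our illustration, not part of the challenge definition). Each element of the list is treated as a contiguous run of tokens, as in the examples above; greedy leftmost matching suffices to decide in-order, non-overlapping occurrence:

```
def contains_list(ctx: str, needles: list[str]) -> bool:
    """Do the needles occur in ctx, in order, as non-overlapping
    contiguous runs? Everything else counts as haystack."""
    pos = 0
    for needle in needles:
        found = ctx.find(needle, pos)
        if found == -1:
            return False
        pos = found + len(needle)  # next needle must start after this one
    return True

# the two Yea examples above, and a Nay one
assert contains_list('ABCABCAABCBCBCBBCDCDCCABC', ['AA', 'BB', 'CC'])
assert contains_list('ABCABCAABBBCBCBCBBCDCDCCABC', ['AA', 'BB', 'CC'])
assert not contains_list('ABCABCABCABCABCABC', ['AA', 'BB', 'CC'])
```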
**set-contain example**

For such challenges the _set_ of subsequences, `{'AA', 'BB', 'CC'}` (again in Python notation), should be present in a given context. In other words, these subsequences must appear in the context for a `Yea` classification, but their order can be arbitrary. The following contexts (with the haystack masked again) lead to `Yea` answers:

```
ctx1='....BB..AA.........CC...'
ctx2='......CCAABB.......CC...'
```

(In `ctx2` the subsequence `CC` is present twice; either occurrence can be marked as hay, and the context will still be a `Yea` for the set-contain challenge.)

On the other hand, for both of the presented challenges (list-contain and set-contain), contexts such as the ones below are categorized as `Nay`:

```
ctx1='ABCABCABCABCABCABC'
ctx2='AAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBCACAAAAA'
```
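The classical check for this variant reduces to the list-contain sketch above: any valid set of non-overlapping occurrences has some left-to-right order, so it suffices to try every ordering of the needles (for the short lists considered here this brute force is entirely adequate; again our illustration, not part of the challenge definition):

```
from itertools import permutations

def contains_set(ctx: str, needles: list[str]) -> bool:
    """Set-contain: the needles must occur as non-overlapping runs,
    in any order, so try every ordering with the greedy checker."""
    return any(contains_list(ctx, list(p)) for p in permutations(needles))

# the masked examples above ('.' is just more haystack here)
assert contains_set('....BB..AA.........CC...', ['AA', 'BB', 'CC'])
assert contains_set('......CCAABB.......CC...', ['AA', 'BB', 'CC'])
assert not contains_set('ABCABCABCABCABCABC', ['AA', 'BB', 'CC'])
```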
### generalizations

A number of important challenges can be reformulated as list-contain or set-contain challenges. Clearly, associative-recall tasks are just simple special cases of problems of the above type. A good challenge is to find out which of the multitude of AI models can solve them at all, and if so, with what parameters and with how many training resources. Note that, due to the very limited alphabet, the contexts are expected to be long (starting with a few hundred tokens and ending with tens of thousands of tokens).

Further possible applications of the above challenges may involve:

- putting restrictions on the distances between tokens of the list (list-contain challenge),
- posing other types of challenges, e.g. testing inference of the structure of transitive/reflexive relations (as in MQAR),
- predicting numerical values (e.g. for binding affinities etc.), via the introduction of a larger number of answer classes.

Clearly, having an AI algorithm capable of learning and solving such challenges would be of advantage, and, in such a case, observing the process of adaptation of the model to the challenge (during training) will shed light on the capabilities and emergent internal structures of the model.

## vocabulary vs. marker sequences

In order to be directly applicable to existing AI models, we introduce the following simple structure of "markers", which keeps the problem in the given category (of vocabulary-starved challenges). Since no _special tokens_ should be introduced, token sequences called _marker sequences_ will be employed. These are few-token sequences that have a function for the problem, but are composed of the same vocabulary as everything else (including the haystack). The _marker sequences_ will be special in the sense that finding them in a random context is practically impossible (in contrast to finding subsequences from the challenges, e.g. the subsequence `AA` in a DNA context, which is very common).

### marker sequences

For now we introduce the following marker sequences:

- a "(" marker,
- a ")" marker,
- a "Q" marker,
- markers for the allowed answer classes (`Yea` and `Nay`, or as many as the number of categories the problem requires).

(All of these marker sequences are just sequences of tokens from the alphabet, of length between 10 and 30, which is enough to make random collisions practically impossible.)

The problem is now formulated as follows:

> In a vocabulary-starved language (such as DNA) a context is given as a sequence of tokens. This context may or may not contain marker sequences. For a challenge of list-contain type: find all elements hidden between the bracket markers, and check whether they contain the given list of token subsequences. If so, continue after the `Q` marker sequence with the `Yea` marker sequence; else, continue with the `Nay` marker sequence.

In other words, without introducing any new _tokens_, we present a context to the model, and _if_ this context contains bracket markers, e.g.

```
test_ctx='....(sx)....(sy)....(sz)...Q'
```

then the subsequence `sx` should equal `AA`, `sy` should equal `BB` and `sz` should equal `CC` if the answer is to be `Yea` when predicting list-contain for `['AA', 'BB', 'CC']` (as in the examples above); in such a case, upon seeing the marker sequence `Q`, the model should predict a number of tokens reproducing the unique marker sequence for the `Yea` answer.

Recalling the example `ctx1='......AA......BB....CC...'` and setting `(='DAD'`, `)='ADA'`, `Q='BAB'`, models will be presented with

```
ctx1='...DADAAADA...DADBBADA...DADCCADA........BAB'
```

and should continue with, e.g., `CCCCC` if `Yea='CCCCC'`, while for contexts not fulfilling the task of the challenge the model should continue with `DDDDD` if `Nay='DDDDD'`. A sketch of this construction is given at the end of this section.

Notes:

1. The first thing the model must learn is to reproduce one of the two (in `Yea`/`Nay` problems) answer marker sequences upon seeing the marker sequence `Q`; this alone requires perfect token-level accuracy for the continuation after the `Q` sequence.
2. The model must then either:
   - understand that there are special sequences, the bracket markers; failing to do so will completely confuse the model, as short interior sequences such as `AA` are expected to appear randomly in any context, or
   - learn whole sequences such as `(AA)` as contained in the investigated list, i.e. treat the problem as a list-contain problem with `['(AA)', '(BB)', '(CC)']` (inserting the long marker sequences for `(` and `)` where appropriate); while this is possible, the markers for `(` and `)` are usually long (to ensure uniqueness in random contexts), and the model might find it hard to focus on the short parts between the brackets, essentially making this approach equivalent to the one above, i.e. understanding that brackets are present.
3. Learning the bracket structure essentially gets rid of the haystack; now the model must learn the task of checking whether the found sequence of subsequences is equal to the required sequence (list-contain challenge), or has exactly the same set of subsequences as the required set (set-contain challenge).

As a side note, one may test whether a model has any chance of solving such challenges by first introducing an extra token `.` into the vocabulary for the haystack, and presenting the problem in this masked form (with the haystack masked out; a sine qua non condition for having any chance to solve the full challenge).
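To make the construction concrete, here is a minimal sketch of a context builder, using the toy markers from the example above (the function name and its parameters are our illustration; real marker sequences, as stated above, would be 10 to 30 tokens long, so that random collisions are practically impossible):

```
import random

# toy markers from the example above; real markers are 10-30 tokens long
LB, RB, Q = 'DAD', 'ADA', 'BAB'
YEA, NAY = 'CCCCC', 'DDDDD'
ALPHABET = 'ABCD'

def build_context(needles: list[str], ctx_len: int = 60, seed: int = 0) -> str:
    """Imprint the bracketed needles, in list order, at random
    positions of a random haystack, then append the Q marker."""
    rng = random.Random(seed)
    ctx = [rng.choice(ALPHABET) for _ in range(ctx_len)]
    cuts = sorted(rng.sample(range(ctx_len), len(needles)))
    # insert right-to-left so earlier cut positions stay valid
    for cut, needle in zip(reversed(cuts), reversed(needles)):
        ctx[cut:cut] = list(LB + needle + RB)
    return ''.join(ctx) + Q

ctx = build_context(['AA', 'BB', 'CC'])
# a training pair for this Yea context would be (ctx, YEA)
```

A `Nay` example can be produced the same way with a list violating the target (or with no imprint at all), paired with `NAY`.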
## preliminary implementation

At present we are using:

```
# answer marker
A = 'AAAAAA' + 'CCCCCC' + 'TTTTTT' + 'GGGGGG'

# begin/end bracket markers
B = 'CCCCCC' + 'TTTTTT' + 'CCCCCC'
E = 'GGGGGG' + 'TTTTTT' + 'GGGGGG'
```

These sequences do not appear at all in the DNA sequence "NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0" which we work with.

For the encoded information an isomer class was proposed:

```
from pydantic import BaseModel  # assuming pydantic's BaseModel

class IsoCategory(BaseModel):
    name: str
    base_forms: list[tuple[str, ...]]
    permuted_allowed: bool
    result_marker: str
```

with implementations of the type:

```
yea_protein = IsoCategory(
    name='yea',
    base_forms=[('AA', 'CC', 'TT')],
    permuted_allowed=False,
    result_marker='AAAAAA' + 'AAAAAA' + 'TTTTTT' + 'TTTTTT' + 'GGGGGG'
)

nay_protein = IsoCategory(
    name='nay',
    base_forms=[('AA', 'CC', 'GG'), ('AA', 'TT', 'GG'), ('CC', 'TT', 'GG')],
    permuted_allowed=True,
    result_marker='GGGGGG' + 'GGGGGG' + 'CCCCCC' + 'CCCCCC' + 'TTTTTT'
)
```

The parameter `permuted_allowed` effectively allows switching between list-contain and set-contain presentations of the problem. Contexts of length `>=200` were randomly generated, and `yea`- or `nay`-subsequences were imprinted at random locations in these contexts.
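A minimal sketch of how this imprinting might look (our illustration of the procedure just described, not the actual generation code; `imprint` and its parameters are hypothetical names, while `B`, `E`, `A` are the markers defined above):

```
import random

DNA = 'ACGT'

def imprint(category: IsoCategory, ctx_len: int = 200, seed: int = 0) -> str:
    """One training example: a random DNA context with one of the
    category's base forms imprinted between the bracket markers B/E,
    followed by the answer marker A and the category's result marker."""
    rng = random.Random(seed)
    forms = list(rng.choice(category.base_forms))
    if category.permuted_allowed:
        rng.shuffle(forms)  # set-contain: any order is a valid presentation
    ctx = [rng.choice(DNA) for _ in range(ctx_len)]
    cuts = sorted(rng.sample(range(ctx_len), len(forms)))
    # insert right-to-left so earlier cut positions stay valid
    for cut, form in zip(reversed(cuts), reversed(forms)):
        ctx[cut:cut] = list(B + form + E)
    return ''.join(ctx) + A + category.result_marker

example = imprint(yea_protein)
```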
### results

Here we report the results for the list-contain challenge with 3 subsequences, prepared as described above.

#### FNets

FNets with various numbers of regions were trained on the challenge. The results depend strongly on the number of regions. For `>=19` regions, FNets discovered that only `Yea` or `Nay` answers were allowed (and almost no other answers were produced). Starting with around 28 regions, FNets found a way to solve the problem and chose the proper answer most of the time.

![image](https://hackmd.io/_uploads/SJvXczYmyl.png)

The results presented were obtained by 1 CPU core running the training for 10 seconds (10 epochs), although similar results are obtained already with as few as 5 epochs, as illustrated in the figure below. Indeed, it is the structure of the FNet, and not the number of training epochs, which determines the FNet's performance on these challenges.

![image](https://hackmd.io/_uploads/rkB1cGKQ1g.png)

#### Hyena models

We attacked the problem with some of the Hyena models (mainly the small ones, such as `LongSafari/hyenadna-small-32k-seqlen-hf` or `LongSafari/hyenadna-medium-160k-seqlen-hf`, as the contexts were <1000 tokens long). The structure of the model(s) was taken, but any pre-training was wiped out before starting the challenge training.

![image](https://hackmd.io/_uploads/rkvHkv5X1e.png)

Figure: statistics of results from 150 training runs of the `LongSafari/hyenadna-small-32k-seqlen-hf` model (not pretrained).

As always with such models, it takes very many epochs to get reasonable results. The model was run on an A100 GPU for over an hour (in total). Starting with around 50 epochs, the model discovered that only 2 answers were allowed, but even then the generated answers rarely reproduced the required sequences (the `Yea`/`Nay` result markers) exactly: there were shifts by a few tokens, or some misplaced tokens, so a criterion was implemented to accept an answer when `>=80%` of its tokens agreed with one of the result markers (one plausible form of this criterion is sketched at the end of this section). However, it should be noted that the Hyena models _never_ actually solved the problem; the results stagnated around 48% correct predictions and never progressed further, even after thousands of epochs.

#### Transformer models

For this part we used InstaDeepAI Nucleotide Transformer models, specifically the smallest, `InstaDeepAI/nucleotide-transformer-v2-50m-multi-species` version (as context lengths were limited in this study). Again, only the structure of the model was taken (and not the weights corresponding to DNA-specific pretrainings). The model behaved qualitatively similarly to the Hyena models, with the exception that the NT models were quite capable of learning the exact form of the `Yea`/`Nay` marker sequences. As is visible from the plot below, these models do not seem capable of solving the masked-haystack problem of type 1 (list-contain) beyond a 50% probability of being correct, basically meaning that the origin of the correct assignment of the answer remained unresolved by the model.

![image](https://hackmd.io/_uploads/B1IfxP9X1e.png)

Figure: statistics of results from 150 training runs of the `InstaDeepAI/nucleotide-transformer-v2-50m-multi-species` model (not pretrained).
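For completeness, one plausible form of the `>=80%` acceptance criterion mentioned above (our reconstruction, since the evaluation code is not shown; the function name, the shift tolerance and its default are assumptions):

```
def accept_answer(answer: str, markers: list[str],
                  threshold: float = 0.8, max_shift: int = 3) -> bool:
    """Accept the answer if, for some marker and some small shift,
    at least `threshold` of the marker's tokens are reproduced
    position-wise by the answer."""
    for marker in markers:
        for shift in range(-max_shift, max_shift + 1):
            hits = sum(1 for i in range(len(marker))
                       if 0 <= i + shift < len(answer)
                       and marker[i] == answer[i + shift])
            if hits / len(marker) >= threshold:
                return True
    return False

# e.g. accept_answer(generated, [yea_protein.result_marker,
#                                nay_protein.result_marker])
```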