# Computational Subway Surfers: a CSS Virtual Machine I made [a reverse engineering challenge](https://github.com/UMD-CSEC/UMDCTF-Public-Challenges/tree/main/UMDCTF2025/rev/css) for UMDCTF 2025 called `computational-subway-surfers`, which went unsolved during the competition. I then put a \$50 bounty on the challenge, and after a couple implementation fixes had to be made, the bounty was claimed by Zel on May 3rd (~7.5 days after the competition began). In this document I will explain some of the internals of the virtual machine, how the program works, and the expected approach to solving. ## The virtual machine The challenge download is named `css.html`, and it implements a virtual machine using only HTML elements and CSS animations. The idea for the challenge was inspired by [mooninaut's CSS turing machine](https://github.com/Mooninaut/css-is-turing-complete), which is implemented [here](https://mooninaut.github.io/css-is-turing-complete/turing-1.html). While mooninaut claims that their implementation is the first functioning Turing machine written in pure CSS, I'm not completely convinced by their implementation. The biggest problem the tape is moved left and right by separate finite animations, which puts a limit to how much the tape can move left and right. Notably, if the Turing machine moves the tape left and right infinitely, eventually the tape will stop moving. So, I decided I would make my own implementation to fix this problem. After some tinkering around, I had a solution that involved keeping a "left" and "right" button on each tape cell, which could theoretically be plugged into mooninaut's solution to create a working Turing machine. But I also wanted to turn this into a UMDCTF challenge, and I had already written [a Turing machine challenge](https://github.com/UMD-CSEC/UMDCTF-Public-Challenges/tree/main/UMDCTF2024/rev/undecidable) last year and found the process incredibly painful and limiting. So, I decided to work at a higher level and implement a virtual machine in CSS. ### Program loop Mooninaut's approach to creating a CSS Turing machine is to have `:hover` animations trigger different parts of the machine. To eliminate the need for the user to move the cursor around, the different parts of the machine are driven by infinitely repeating animations that move them under the cursor at the right times. In our VM, we will have just one global animation that moves relevant areas under the cursor. ### Keeping state There is one mechanism that allows us to keep state in CSS, which will be used everywhere in this VM. This is best exemplified when looking at the tape cell, an element that stores one bit of information. Mooninaut's tape cell has five components: two blocks for writing 0 and 1 ("write-0" and "write-1"), two blocks for reading 0 and 1, and one block ("bit") that covers one of the reading blocks. These blocks are placed using the CSS grid layout. By default, the "bit" block is in column 2, but it also has a paused animation that can move it to column 1. Abridged styles are below: ```css @keyframes bit { from { grid-column: 2; } to { grid-column: 1; } } .bit { grid-column: 2; grid-row: 2; animation: bit 0.01s forwards steps(1) paused; } ``` A hover over "write-1" sets the `animation-play-state` property to `running`, which moves "bit" to column 1. Importantly, the "bit" block stays in column 1 even after the cursor leaves the "write-1" block. Hovering over "write-0" sets the `animation` property to `none`, which resets the animation state and moves "bit" back to column 2. ![ezgif-1de2991fd29bf4](https://hackmd.io/_uploads/SJ-x8bbxeg.gif) For my VM, I removed the reading blocks since they don't do anything, and I also put everything on the same row since these will be stacked vertically to form a tape of bits. ![ezgif-150c28125619ae](https://hackmd.io/_uploads/SkPYrMZgxx.gif) To "read" a bit from a cell, we place the cursor on the two places where the "bit" block could be. One of these places is covered by the "bit" block, but the other can go through to trigger a `:hover` animation on an element below. ### Performing computation Since this is a VM, we'd like to be able to do many different kinds of operations on data (arithmetic, bitwise operations, comparisons, etc.). However, having a single global animation to drive our program means our cursor essentially moves on a fixed path. To deal with this, all operations are performed using lookup tables. Selecting the right operation is just a matter of bringing the correct lookup table to the top, and then the same sequence of actions can be used for every operation. How many bits should the VM process at a time? We could do just one bit per operand, and our operations would be logic gates. But things would be a lot easier (and faster) if we can operate on multiple bits at a time. With $n$ bits, each operand takes on $2^n$ values, and our lookup table has to be size $4^n$ (assuming two operands). Picking $n=4$ seemed like a decent balance between convenience and lookup table size ($256$ in this case). From now on, we'll refer to a group of 4 cells as a "byte". We'll have quite a few registers in our VM, implemented as one-byte tapes. To perform a lookup on two operands, we test each bit of the first operand. For each bit that is set, we shift the lookup table up by its corresponding value. So, the first bit moves the table up by 1 unit, the second by 2, the third by 4, and the fourth by 8. We do the same with the second operand, but we move the table left instead. These shifts are achieved by hiding/unhiding elements of the corresponding width and height. Now, if the cursor moves to where top left corner of the table normally is, we hit the entry at the row equal to the first operand, and the column equal to the second operand. Each lookup table entry will move some bits around in a "result" area. These bits will cover the "write" blocks in the destination operand so that only the correct result can be written to the destination. Finally, we need some buttons to reset all the shifting we're doing. The entire process looks like this: 1. Reset positions of everything 2. Move the correct lookup table to the top (using an animation to change `z-index`) 3. Move the first operand into place (this can be done simultaneously with the previous step) 4. Read bits of first operand, shifting the lookup table up accordingly 5. Reset the tape positions 6. Move the second operand into place 7. Read bits of second operand, shifting the lookup table left accordingly 8. Reset the tape positions 9. Go to lookup table 10. Move destination operand into place 11. Write the result into the destination operand Now we can do arithmetic. Here, we're adding 9 + 10 (and getting 3, as expected): ![ezgif-2aa364dc732ae2](https://hackmd.io/_uploads/HkQZcI-lle.gif) To make our life a little easier when programming, we also designate a special `flags` register that can be set by the lookup table to hold extra information, like the carry bit from addition. ### Control flow After an instruction completes, we want to move to the next instruction in the list, or potentially jump to somewhere completely different. To achieve this, we use the same trick we did for the lookup tables: the list instructions will be prefixed by shift elements whose heights are powers of two (times the unit size). After each instruction, we add a "jump" block that hides/unhides the desired shift elements to set the offset for the next instruction. This block also resets the global animation so there's no risk of hitting a block from another instruction after jumping. Conditional jump instructions will have a second "jump" block that points to a destination address. This sits under the first "jump" block, and its `z-index` property can be modified by the lookup table to move in front of the first "jump" block. ### Offsets into memory The same trick of adding shift elements also applies to tapes. So, in addition to moving a tape to a specific location, we can also specify the offset into the tape that we want to load. There is also an additional problem of loading a tape at a non-constant offset (i.e., one specified by another register). To accomplish this, we can add an instruction for every relevant tape that takes in up to two registers as inputs and loads the offset specified by the operands. For tapes that are more than 256 bytes, more instructions can be added to load larger offsets. To prevent offsets from resetting every instruction, these instructions cover the reset buttons with special reset buttons that don't do that. Then we make another instruction that uncovers the normal reset buttons. Here's a demo of how that looks. First, we load 5 into one of the registers, and use that register as an index into the read-only memory, pulling out the value 7. This is then copied into another register. ![ezgif-7b4aeed7c994f9](https://hackmd.io/_uploads/SkMlhTXgxe.gif) That's pretty much everything there is in terms of VM internals. ## The program Programming the virtual machine is just a matter of creating some operations and writing assembly with these instructions. You can see all 1,000 or so lines of assembly that I wrote for this program [here](https://github.com/UMD-CSEC/UMDCTF-Public-Challenges/blob/main/UMDCTF2025/rev/css/code.asm). ### Taking input Input is specified by a `<select>` tag for each character: ![image](https://hackmd.io/_uploads/ByNBjRmexx.png) I originally wanted to have input come from an `<input>` of type `text`, but CSS can't actually read a `value` attribute that's not explicitly specified. The best it can do is read the `:valid` pseudo-class, which gives one bit of information based on whether the input satisfies the regex in the `pattern` attribute. After some poking around, the best option for taking input seemed to be from `<select>`, and then I could read the `:checked` pseudo-class for each `<option>`. To transfer information to the program, I used had a CSS variable for each bit I wanted to set, and stored a linear combination of the bits in the read-only memory. The end result looks something like this: ```css! :root{ --b0-0: 0; --b0-1: 0; --b0-2: 0; --b0-3: 0; --b0-4: 0; --b0-5: 0; --b0-6: 0; --b1-0: 0; --b1-1: 0; --b1-2: 0; --b1-3: 0; --b1-4: 0; --b1-5: 0; --b1-6: 0; --b2-0: 0; --b2-1: 0; --b2-2: 0; --b2-3: 0; --b2-4: 0; --b2-5: 0; --b2-6: 0; --b3-0: 0; --b3-1: 0; --b3-2: 0; --b3-3: 0; --b3-4: 0; --b3-5: 0; --b3-6: 0; --b4-0: 0; --b4-1: 0; --b4-2: 0; --b4-3: 0; --b4-4: 0; --b4-5: 0; --b4-6: 0; --b5-0: 0; --b5-1: 0; --b5-2: 0; --b5-3: 0; --b5-4: 0; --b5-5: 0; --b5-6: 0; --b6-0: 0; --b6-1: 0; --b6-2: 0; --b6-3: 0; --b6-4: 0; --b6-5: 0; --b6-6: 0; --b7-0: 0; --b7-1: 0; --b7-2: 0; --b7-3: 0; --b7-4: 0; --b7-5: 0; --b7-6: 0;} #rom > .t-cell:nth-child(5) > .covr { translate: calc(mod(var(--b0-0) + var(--b0-1) + var(--b0-3) + var(--b0-4) + var(--b0-5) + var(--b1-1) + var(--b1-3) + var(--b1-5) + var(--b1-6) + var(--b2-2) + var(--b2-4) + var(--b2-5) + var(--b3-1) + var(--b3-2) + var(--b3-4) + var(--b3-5) + var(--b4-3) + var(--b4-4) + var(--b4-5) + var(--b5-2) + var(--b5-3) + var(--b5-4) + var(--b5-6) + var(--b6-1) + var(--b6-3) + var(--b7-1) + var(--b7-2) + var(--b7-3) + var(--b7-5) + var(--b7-6), 2) * var(--nsize)); } ``` The flag format is encoded separately from the flag body; in particular, the encodings of the characters in the flag body are randomized so that any specific bit is unpredictable. ### Flag format check The first check of the program is for the flag format. This one's a simple check against constants, where there's a specific instruction that takes in a byte and an index and checks if the byte at that index is the expected one (otherwise it jumps to a fail label). Code below: ```arm ffmtcheck: rld1 ~~, r2, r3 ; load rom[r2] (r3 doesn't matter because it's 0) cff fail, rom, r2 ; check if rom[r2] matches expected output at r2 ublk ; unblock shift-resets add1 r2, r2, ~~ ; r2 += 1 jle ffmtcheck, r2, rom+14 ; if r2 <= 14, jump to ffmtcheck ... fail: halt fail ``` ### Main loop The main loop encodes the inputted flag using a [spinal code](https://dl.acm.org/doi/pdf/10.1145/2342356.2342363), which is an (incredibly cursed) error-correcting code. The way it works is that the message is split into blocks, which are fed into non-invertible hash functions one block at a time to form a sort of "spine" of values. Each value on the spine is fed into a PRNG (which can also be non-invertible), and the encoded message is the outputs of the PRNG. It's as insane as it sounds. ![image](https://hackmd.io/_uploads/ryn9lY4xlx.png) In this program, we take one output from each PRNG and compare it with the expected value, and add the Hamming distance between the outputs to an accumulator. At the end of the main loop, there's a check that the accumulator is less than 2 (that is, there is at most a 1 bit error). The main operation that this program does is elliptic curve point multiplication. I chose a Montgomery curve since point addition and doubling in $(X,Z)$ coordinates can be done efficiently (in 3 and 2 multiplications respectively). It also lends itself well to using the Montgomery ladder for point multiplication. The specific curve used was $y^2 = x^3 + 62406 x^2 + x$ over $\mathbb{F}_{2^{127}-1}$. The modulus was chosen to be a Mersenne prime for easy modular arithmetic, and the coefficient of 62406 was the smallest such that the order and twist order were 4 times a prime (this is not really that important). For notational purposes, let $X(n,P)$ be the $x$-coordinate of $n \cdot P$. Define $P$ to be a point on the curve with $x$-coordinate $3$, and let $Q$ be a point described in the next section. The hashing operation of the spinal code transforms the incoming spine value based on the message block to get some value $k$, and then returns $X(k,P)$. The PRNG simply takes in $s$ and returns $X(s,Q)$. The astute reader may recognize that this is very similar to an infamous PRNG known as Dual_EC_DRBG, and this is very much intentional. In fact, the final check of the program is to find the "backdoor" in this instantiation of Dual_EC_DRBG. ### Final check The points $P$ and $Q$ in the last section are constructed so that $P = q \cdot Q$ for some $q$. This $q$ is defined to be a 57-bit number, taking one bit from every fourth byte in the encoded flag. This is akin to a sort of puncturing of the spinal code (see section 5 of the spinal code paper). At the end of the spinal code, we end up with some spine value $s_n$. The program computes $R = s_n \cdot P$ and $S = s_n \cdot Q$, and verifies that $q \cdot S = R$ (in the $x$-coordinate), which is equivalent to checking that $q \cdot Q = P$. ## The solve path The spinal code can be decoded with a BFS, where on each level we bruteforce the next 8 bits of the message. We prune any branches that exceed a total Hamming distance of 1. To reduce memory cost, candidates can be pruned early by looking ahead by one step. However, the number of possible decodings with at most 1 bit error grows linearly with depth, and so decoding time is quadratic. To resolve this, it is advantageous to resolve the final check first by solving the discrete log between $P$ and $Q$. Since $q$ is 57 bits, [Pollard's kangaroo](https://en.wikipedia.org/wiki/Pollard%27s_kangaroo_algorithm) can solve the discrete log in approximately $2^{28.5}$ steps. This can be reduced to $2^{27.5}$ steps by using Pohlig-Hellman to reduce the problem to a 2-bit and a 55-bit discrete log. My implementation runs in 14 minutes (if lucky; Pollard's kangaroo is probabilistic, and each failure is +25 minutes), and one could also parallelize this to get further speedups. Solving the discrete log resolves 1 bit of ambiguity for every 16 bits, which lets us prune the BFS to run much more efficiently. At the end, we are left with 3 candidates, only one of which decodes to valid characters. ![image](https://hackmd.io/_uploads/Sy9UG4Sxgx.png) The full solve script for this challenge can be found [here](https://github.com/UMD-CSEC/UMDCTF-Public-Challenges/blob/main/UMDCTF2025/rev/css/solve.sage). ## The future of CSS A CSS virtual machine is a bit of a one-time gimmick, so I probably won't feature this in another challenge. I do have future plans for this concept though; maybe a C to CSS compiler could be feasible...