Try   HackMD

Basic Bit Manipulation

Transacting on Ethereum is extremely expensive, so sometimes, it makes sense to aim for more granular control over your code. One way to achieve this is to utilize bit operations! They can also be quite useful outside of optimization. I’ve posted a few bit twiddling techniques to my Twitter @fiveoutofnine:

These operations/tricks are pretty different from other aspects of coding, so it may be fairly esoteric to those without experience. Although I tried to give an application example with each Tweet, it’s still hard to see where/how to apply them yourself. A good example of a project with more extensive applications/context is my on-chain chess engine, but the contracts are probably a bit hard to follow for beginners.

In this post, I attempt to explain some of these tricks and explain my thought process when applying them with a much simpler game: tic-tac-toe.

Basic Bit Operations

First, here are the basic bit operations that are used in this tic-tac-toe example: >>, <<, &, |, and ^. Feel free to skip this section if you are already familiar with them!

>> (SHR)

Digits are moved, or shifted, to the right. Source.

Example: 1010 >> 1 = 101 (in decimal: 10 >> 1 = 5)

Note: shifting an integer m right by n bits results in m2n.

<< (SHL)

Digits are moved, or shifted, to the left. Source.

Example: 1010 << 2 = 101000 (in decimal: 10 << 2 = 40).

Note: shifting an integer m left by n bits results in m2n.

& (AND)

A bitwise AND is a binary operation that takes two equal-length binary representations and performs the logical AND operation on each pair of the corresponding bits. Thus, if both bits in the compared position are 1, the bit in the resulting binary representation is 1 (1 × 1 = 1); otherwise, the result is 0 (1 × 0 = 0 and 0 × 0 = 0). Source.

Example:

  0101 (in decimal: 5)
& 0011 (in decimal: 3)
= 0001 (in decimal: 1).

Note: since & only “accepts” the bit if both numbers’ bits are 1, you can construct a bitmask to select which bits of a number you want to read. For example, for some number m, if we want to read every other bit, we can compute m & 1010...1010.

| (OR)

A bitwise OR is a binary operation that takes two bit patterns of equal length and performs the logical inclusive OR operation on each pair of corresponding bits. The result in each position is 0 if both bits are 0, while otherwise the result is 1. Source.

Example:

  0101 (in decimal: 5)
| 0011 (in decimal: 3)
= 0111 (in decimal: 7)

Note: since | “accepts” the bit as long as either numbers’ bits are 1, you can write a number’s bits into another number, as long as the first number’s bits in the same position are all 0.

^ (XOR)

A bitwise XOR is a binary operation that takes two bit patterns of equal length and performs the logical exclusive OR operation on each pair of corresponding bits. The result in each position is 1 if only one of the bits is 1, but will be 0 if both are 0 or both are 1. In this we perform the comparison of two bits, being 1 if the two bits are different, and 0 if they are the same. Source.

Example:

  0101 (in decimal: 5)
^ 0011 (in decimal: 3)
= 0110 (in decimal: 6)

Note: ^ can be helpful in conjunction with & and | when overwriting a number’s bits with another set of bits. For example, suppose we want to overwrite 10010’s last 3 bits (i.e. 010) with 101:

  • First, we construct the bitmask to select the bits we want (in this case, all bits except for the last 3) with ^: 11111 ^ 00111 = 11000.
  • Next, we “delete” (replace with 0) 10010’s last 3 bits with &: 10010 & (11000) = 10000.
  • Finally, we write in 101, the desired bits, with |: 10000 | 00101 = 10101.
  • Altogether, we have (10010 & (1111 ^ 111)) | 101 = 101010.

Note: numbers in this section with only 0s and 1s are in binary representation unless otherwise specified.

Try working through the examples and notes to gain some foundational intuition!

Overview of Tic-tac-toe

Before writing the contract, let’s lay out the basic things required for a game of tic-tac-toe. It is a 2 player game, and each player only has 1 type of move they can play (i.e. placing their respective symbols into an open square). When a move is inputted by a player, we have to validate a few things:

  • The player is 1 of the 2 players in the game.
  • It is the player’s turn to play a move.

Next, we have to check that the move is legal.

Finally, we must check if the game has ended (i.e. either player has formed 3 consecutive dots).

Writing this out as pseudocode in a function called playMove, we have:

function playMove(position):
    // Player validation
    if ("player is not one of the 2 in the game")
        revert Error();
    if ("it is not the player's turn")
        revert Error();

    // Move validation
    if ("the position has already been played in the board")
        revert Error();

    applyMove(position)

    // Game ending scenario
    if ("either player has formed 3 consecutive dots")
        ("set the winner and end the game")

Board/Game State Representation

Now that we have an idea of all the information to store, let’s come up with a representation in code. Keep in mind that the design should allow for efficient computation: the code will read the data and compute with it, so operations such as accessing, decoding, and storing should be cheap. Once we achieve both, we can look towards using as few bits as possible because storage costs are (generally) the most expensive on the EVM.

Base Representation

A naive approach might look something like:

struct Game {
    uint8[9] board; // 0 = empty, 1 = player 0, 2 = player 1
    bool turn; // false = player 0, true = player 1
    bool hasGameEnded;
    bool playerZeroWon; // only set if `hasGameEnded` is true
    bool playerOneWon; // only set if `hasGameEnded` is true
    address playerZero;
    address playerOne;
}

If I was satisfied with this, I would make sure struct Game was tightly packed, then proceed.

However, uint8[9] board uses a lot of unnecessary reads/writes, and we don’t need 8 bits to store data for a position on the board; since there are only 3 valid states, 2 bits (22=4) are enough. Next, bool turn can easily be 1 bit: 0 if false, 1 if true. Lastly, because bool hasGameEnded, bool playerZeroWon, and bool playerOneWon work in conjunction to express the state of the game (only 4 possibilities), we can combine them into 2 bits:

  • the game is ongoing (00),
  • player zero has won (10),
  • player one has won (11),
  • and game is a draw (01).

Altogether, we use just 21 bits (down from 89+8+8+8+8=104 when using the struct above) to represent everything:

[000000000000000000][00][0]
    • First 2 * 9 = 18 bits denote the board
    • Next 2 bits denote whether game is ongoing
    • Last bit denotes whose turn it is to play

Bitpacking Games Into uint256s Instead of Using Structs

If we were to use a struct, we’d have to use uint24, so, instead, let’s store the players’ addresses separately in 2 mappings and store the game data in a third mapping:

mapping(uint256 => address) public playerZeros;
mapping(uint256 => address) public playerOnes;
mapping(uint256 => uint256) public games;

This way, we can store the games’ data slightly cheaper by bitpacking consecutive games’ data. The idea is to reduce as many zero slot writes as possible because writing to zero slots costs 20k gas, whereas writing to nonzero slots only costs 5k gas. We achieve this by bitpacking as many games’ data into a uint256 (in this case, 25621=12, so we can fit up to 12 games in 1 uint256). This way, only every 12nth game will cost 20k gas to store, and every other game will only cost 5k gas. Retrieval and writing remains efficient:

// Each `uint256` value has 12 games bitpacked into it.
mapping(uint256 => uint256) public gameStates;

// Each game is 21 bits, so the desired mask is
// `0b111111111111111111111 = 0x1FFFFF` (21 `1`s).
uint256 internal constant GAME_MASK = 0x1FFFFF;

function retrieveGame(uint256 _gameId) external view returns (uint256) {
    // Divide by 12 and take mod 12 because 12 games are bit packed into
    // each `uint256`.
    uint256 key = _gameId / 12;
    uint256 index = _gameId % 12;

    // Bit shift by `21 * index` to get to the desired game.
    return (gameStates[key] >> (21 * index)) & GAME_MASK;
}

function writeGame(uint256 _gameId, uint256 _gameState) external {
    uint256 key = _gameId / 12;
    uint256 index = _gameId % 12;

    // Read the entire `uint256` into memory first because it's cheaper to make all
    // modifications in memory before writing the final result to storage.
    uint256 gameUint256 = gameStates[key];

    // This process overwrites whatever exists in the desired 21 bit slot with
    // `_gameState` (see `XOR` in "Basic Bit Operations").
    gameUint256 &= type(uint256).max ^ (GAME_MASK << index);
    gameUint256 |= _gameState << (index * 21);

    // Write to storage.
    gameStates[key] = gameUint256;
}

Final Representation

Now, we have 3 mappings to keep track of all games’ data and players. With this set-up, storing each new game to storage costs roughly

20000+20000+20000+5000121346154

gas on average (2 zero slot writes for each of the players’ addresses, and 6154 gas for the game’s state).

However, since addresses are only 160 bits, we can remove a write to storage by bitpacking the game’s state with one of the players’ addresses. Let’s arbitrarily pick player zero’s address for this. Combined, we have:

// The first 21 bits are games' data, and the remaining 160 bits are player 0
// addresses.
mapping(uint256 => uint256) public playerZerosAndGames;
mapping(uint256 => address) public playerOnes;

Again, retrieval and writing is efficient:

uint256 internal currentGameId;

function createNewGame(address _playerZero, address _playerOne) {
    uint256 gameId = currentGameId++;
    // The first 21 bits are the game. We don't need to set anything for this
    // because the all bits are 0 for the starting position.
    playerZerosAndGames[gameId] = uint256(uint160(_playerZero));
    playerOnes[gameId] = _playerOne;
}

function retrieveGame(uint256 _gameId) external view returns (uint256) {
    return playerZerosAndGames[_gameId] >> 160;
}

function writeGame(uint256 _gameId, uint256 _gameState) external {
    uint160 playerZero = uint160(playerZerosAndGames[_gameId]);
    // Bit shift `_gameState` left by 160 to make space for player 0's address.
    playerZerosAndGames[_gameId] = (_gameState << 160) | playerZero;
}

Note: in this particular case, since we need a minimum of 2 storage slots per new game anyway (2 player addresses already exceed 1 32-byte word), cutting down on the number of bits used to represent the game is not necessary—all else equal, using 96 bits (256160=96) is just as efficient. However, I thought I’d describe the process here anyway, for when it might be applicable.

Fine-tuning the Representation

For tic-tac-toe, the representation above is good enough. But if it were a different application, there might be extra considerations to take. These are highly dependent on the situation, but here are some common ones:

Frequently Accessed Data

The ordering of the first/last n bits might help save on computation: they only require one of >> or & to access. Thus, if some information needs to be accessed more frequently than the others, it’d probably be wise to place it at the start/end.

Structuring Data To Remove Expensive Computation (Use The Value Itself!)

An expensive computation on the EVM is the switch statement. Keeping this in mind, it may be wise to structure your representation to remove them. For example, suppose a 3-bit portion maps to 8 different outputs:

if      (portion == 0) return  3;
else if (portion == 1) return  7;
else if (portion == 2) return 12;
else if (portion == 3) return  6;
else if (portion == 4) return 13;
else if (portion == 5) return  9;
else if (portion == 6) return  3;
else                   return 15;

Instead, you could use portion as a value itself! The following code block is equivalent to the series of if/else statements above:

//   value << wordSize * index
// ----------------------------
//   (  15 <<        4 *     7)
// | (   3 <<        4 *     6)
// | (   9 <<        4 *     5)
// | (  13 <<        4 *     4)
// | (   6 <<        4 *     3)
// | (  12 <<        4 *     2)
// | (   7 <<        4 *     1)
// | (   3 <<        4 *     0)
// = 0xF39D6C73.
//
// 15 is just 0b1111 (i.e. masks 4 bits).
// `portion << 2` is the same thing as `portion * 4`.
// We multiply by 2 because each word is 4 bits long.
return (0xF39D6C73 >> (portion << 2)) & 15;

See: How to turn if blocks into 3 bit operations for small gas savings.

On a related note, you can often structure your representation and assign values in a way that allows for efficient group conditional checks (i.e. form it in a way that makes it easy to categorize). Some simple examples:

  • if the value is even, perform x, otherwise perform y;
  • if the value is greater than m, perform x, otherwise perform y;
  • if the value is a multiple of 8, perform x, otherwise perform y;
  • if the value’s 2nd to 5th bits are 010, perform x, otherwise perform y.

You can also combine one or more of these to form composite checks. For example, if the value is even and greater than m, perform x, otherwise perform y.

See {Chess-generateMoves} in Chess.sol.

Pedantic Bit Packing: Is 21 Bits Good Enough?

You may have realized that, for the tic-tac-toe example, we use 2 bits (22=4 total possibilities) per position, when there are only 3 possible states (empty, occupied by player 0, or occupied by player 1). Why not save 3 bits (2222222223 is 15 bits long) by using a base 3 representation? We can then use just 18 bits (15+3=18) instead of 21—isn’t this better?

Not really. Besides the point that we bitpack the game data with player 0’s address, consider the following: 18 bits allows us to store 14 (25618=14) games per uint256, which means it costs 20000+500013146071 gas on average to store a game’s data. Comparatively, 21 bits allows us to store 12 games per uint256, or 20000+50001112=6520 gas on average. Although 449 gas is “saved” on the storage side, decoding and encoding base 3 is considerably more expensive with many division and mod operations (vs almost entirely using bit operations)[1].

Valid reasons (maybe) to be that pedantic about the number of bits is that (1) it results in major gas savings or (2) computation complexity for encoding/decoding is not an issue (e.g. it is only encoded/decoded in a view function). On-chain art is a category of projects/communities that are open to excessive efforts to compress storage, as well as a realm where it [generally] makes sense to do so.

[1] We also end up using the extra slot to make a slight optimization in checking whether a player won. See subsection "Checking If a Player Won" in section "Check Game Ending State".

User Validation

Recall that our player 0 is stored in mapping(uint256 => uint256) public playerZerosAndGames, where the first 160 bits represent player 0, and player 1 is stored in mapping(uint256 => address) public playerOnes. Thus, to check that msg.sender is either player 0 or player 1, we have:

address playerZero = address(uint160(playerZerosAndGames[_gameId]));
address playerOne = playerOnes[_gameId];

require(msg.sender == playerZero || msg.sender == playerOne);

Next, to check whether it's msg.sender's turn to play, recall that the last bit of the board denotes whose turn it is to play: 0 if it's player 0's turn, and 1 if it's player 1's turn. In terms of bit operations: if board & 1 == 0, it is player 0's turn, and if board & 1 == 1, it is player 1's turn.

uint256 playerZeroAndGame = playerZerosAndGames[_gameId];
// Store it in memory to compute upon, then push it to storage later all at once.
// We access the game by bitshifting `playerZeroAndGame` to the right by 160.
uint256 inMemGame = playerZeroAndGame >> 160;
address playerZero = address(uint160(playerZeroAndGame));

// `inMemGame & 1` retrieves whose turn it is to play.
require(
    // `msg.sender` is player 0, and it is their turn.
    (msg.sender == playerZero && inMemGame & 1 == 0)
        // `msg.sender` is player 1, and it is their turn.
        || (msg.sender == playerOnes[_boardId] && inMemGame & 1 == 1)
);

Move Validation

After we have verified that msg.sender is the correct address to play, we must verify whether the move they inputted is valid. There are 2 necessary checks for this:

  1. the game is not over,
  2. and the position they inputted must be empty on the board.

Recall that if the 3rd to 2nd last bits are 00, the game is still ongoing. Therefore, checking whether the game is not over is as simple as accessing those 2 bits, and checking if they equal 0:

// 3 is just 0b11 (i.e. masks 2 bits)
require((inMemGame >> 1) & 3 == 0);

Next, let _index denote the user's input, where they correspond to the following positions of the board:

8 7 6
5 4 3
2 1 0

Then, accessing the position corresponding to _index is just a matter of shifting to the correct position and, once again, masking 2 bits with 3 (0b11). This value must equal 0 (i.e. the square is empty) for the move to be valid:

// First, bit shift right by 3 to get the relevant bits (i.e. just the bits
// denoting the board). Then, we bit shift to the correct position by shifting
// right by `_index << 1` bits (note: each position takes up 2 bits).
require(((inMemGame >> 3) >> (_index << 1)) & 3 == 0);

Check Game Ending State

Intuition

There are 4 possible ways for a player to win:

  1. form a horizontal line of 3,
  2. form a vertical line of 3,
  3. form a diagonal line of 3 from the bottom right corner to the top left corner,
  4. and form a diagonal line of 3 from the bottom left corner to the top right corner.

If either player has satisfied one of those, the game is over. How do we determine that?

Well, for each winning pattern, we can select the bits that correspond to the squares in that pattern. Then, if these bits all represent squares played entirely by player 0 or entirely by player 1, the game is over.

Winning Pattern Bitmasks

Recall from earlier that we can select bits with the & operator and a mask. Let's denote

  1. HORIZONTAL for rows of 3,
  2. VERTICAL for columns of 3,
  3. BR_TO_TL_DIAGONAL for the diagonal from the bottom right corner of the board to the top left corner,
  4. and BL_TO_TR_DIAGONAL for the diagonal from the bottom left corner of the board to the top right corner.

The following code block shows how they're computed (remember each square is represented by 2 bits):

//     1. 00 00 00
//        00 00 00
//        11 11 11
//     => 0b111111 = 0x3F
uint256 internal constant HORIZONTAL = 0x3F;
//     2. 00 00 11
//        00 00 11
//        00 00 11
//     => 0b11000011000011 = 0x30C3
uint256 internal constant VERTICAL = 0x30C3;
//     3. 11 00 00
//        00 11 00
//        00 00 11
//     => 0b110000001100000011 = 0x30303
uint256 internal constant BR_TO_TL_DIAGONAL = 0x30303;
//     4. 00 00 11
//        00 11 00
//        11 -- --
//     => 0b1100110011 = 0x333
uint256 internal constant BL_TO_TR_DIAGONAL = 0x333;

Applying The Bitmasks

The position we apply the mask on is also important. For example, if we apply the horizontal bit mask at the square of index 0, we (correctly) select the following squares marked by x:

• • •
• • •
x x x

Now, suppose we apply it at the square of index 1:

• • •
• • x
x x •

This is incorrect. The correct positions to apply them for each mask are shown below:

--------------------------------------------------
| Bitmask           | Applied At | Selected Bits |
|===================|============|===============|
|                   | • • •      | • • •         |
|                   | • • •      | • • •         |
|                   | • • x      | x x x         |
|                   |------------|---------------|
|                   | • • •      | • • •         |
|  HORIZONTAL       | • • x      | x x x         |
|                   | • • •      | • • •         |
|                   |------------|---------------|
|                   | • • x      | x x x         |
|                   | • • •      | • • •         |
|                   | • • •      | • • •         |
|===================|============|===============|
|                   | • • •      | • • x         |
|                   | • • •      | • • x         |
|                   | • • x      | • • x         |
|                   |------------|---------------|
|                   | • • •      | • x •         |
| VERTICAL          | • • •      | • x •         |
|                   | • x •      | • x •         |
|                   |------------|---------------|
|                   | • • •      | x • •         |
|                   | • • •      | x • •         |
|                   | x • •      | x • •         |
|===================|============|===============|
|                   | • • •      | x • •         |
| BR_TO_TL_DIAGONAL | • • •      | • x •         |
|                   | • • x      | • • x         |
|===================|============|===============|
|                   | • • •      | • • x         |
| BL_TO_TR_DIAGONAL | • • •      | • x •         |
|                   | x • •      | x • •         |
--------------------------------------------------

Try working through them on paper!

Checking If a Player Won

For each selected square, the corresponding pair of bits in the mask are 0b11, or 3 in decimal representation. All other bits are 0. Then, if the selected portion denotes squares entirely played by either player, it should be a scalar multiple of the mask that was used:

  • If 0b01 (1 in decimal) represents a square played by player 0, and all the selected squares were played by player 0, the selected portion should equal MASK3.
  • Similarly, if 0b10 (2 in decimal) represents a square played by player 1, and all the selected squares were played by player 1, the selected portion should equal 2MASK3.

This is fairly efficient to check, but we can do slightly better. If we use 0b00 for empty squares, 0b01 for player 0, and 0b10 for player 1, 0b11 represents nothing. What happens if we use 0b11 instead of 0b10 for player 1's moves? The selected portion should equal MASK, instead of 2MASK3! This is more efficient to check, so let's go with this.

Implementation

x denotes the position the code is at, and - denotes bits that have been bitshifted off (and thus no longer relevant).

// @return 0 if ongoing, 1 if player 0 has won, and 2 if player 1 has won.
function checkGameState(uint256 _board) internal pure returns (uint256) {
    // Last 4 bits denote information about whether the game is over and whose turn it is to
    // play, so we ignore them.
    _board >>= 4;
    // • • •
    // • • •
    // • • x

    uint256 gameState = checkGameStateHelper(_board, HORIZONTAL);
    if (gameState != 0) return gameState;

    gameState = checkGameStateHelper(_board, VERTICAL);
    if (gameState != 0) return gameState;

    gameState = checkGameStateHelper(_board, BR_TO_TL_DIAGONAL);
    if (gameState != 0) return gameState;

    _board >>= 2;
    // • • •
    // • • •
    // • x -
    gameState = checkGameStateHelper(_board, VERTICAL);
    if (gameState != 0) return gameState;

    _board >>= 2;
    // • • •
    // • • •
    // x - -
    gameState = checkGameStateHelper(_board, VERTICAL);
    if (gameState != 0) return gameState;
    gameState = checkGameStateHelper(_board, BL_TO_TR_DIAGONAL);
    if (gameState != 0) return gameState;

    _board >>= 2;
    // • • •
    // • • x
    // - - -
    gameState = checkGameStateHelper(_board, HORIZONTAL);
    if (gameState != 0) return gameState;

    _board >>= 6;
    // • • x
    // - - -
    // - - -
    gameState = checkGameStateHelper(_board, HORIZONTAL);
    if (gameState != 0) return gameState;

    return 0;
}

function checkGameStateHelper(uint256 _board, uint256 _mask)
    internal pure
    returns (uint256)
{
    return (_board & _mask) == _mask
        ? 2 // All squares are player 1's.
        : (_board & _mask) == (_mask / 3)
            ? 1 // All square are player 0's.
            : 0; // No win situation.
}

Check Game Ending State w/ While Loop (bonus)

This section is just for fun (+has some cool bit twiddling patterns).

Intuition

The code above looks kinda ugly/redundant. Let's use a while loop instead, so we don't have checkGameStateHelper in so many lines. We can do this by bitpacking the win-pattern-checking masks consecutively into a uint256 masks (in the order they are used), as well as bitpacking the number of bits the board is shifted by after each check into a uint256 boardShifts.

We make the checks in the following order:

Bitmask Right Shift to Board After Check
HORIZONTAL 0
VERTICAL 0
BR_TO_TL_DIAGONAL 2
VERTICAL 2
VERTICAL 0
BL_TO_TR_DIAGONAL 2
HORIZONTAL 6
HORIZONTAL 0
Derivation
There are 8 possible game ending scenarios for each player. Let * denote the bit
we analyze from, and ^ denote the line we analyze:
In board position
                                         • • •
                                         • • •
                                         • • *,
    1. • • •
       • • •    (HORIZONTAL)
       ^ ^ ^

    2. • • ^
       • • ^    (VERTICAL)
       • • ^

    3. ^ • •
       • ^ •    (BR_TO_TL_DIAGONAL)
       • • ^
In board position
                                         • • •
                                         • • •
                                         • * •,
                      (bitshift right by 2 from previous position)
    4. • ^ •
       • ^ •    (VERTICAL)
       • ^ •
In board position
                                         • • •
                                         • • •
                                         * • •,
                      (bitshift right by 2 from previous position)
    5. ^ • •
       ^ • •    (VERTICAL)
       ^ • •

    6. • • ^
       • ^ •    (BL_TO_TR_DIAGONAL)
       ^ • •
In board position
                                         • • •
                                         • • *
                                         • • •,
                      (bitshift right by 2 from previous position)
    7. • • •
       ^ ^ ^    (HORIZONTAL)
       • • •
In board position
                                         • • *
                                         • • •
                                         • • •,
                      (bitshift right by 6 from previous position)
    8. ^ ^ ^
       • • •    (HORIZONTAL)
       • • •

The checks we make in order, summarized into a table:
         | Bitmask                     | Right Shift to The Board After Check |
         |-----------------------------|--------------------------------------|
         | HORIZONTAL                  | 0                                    |
         | VERTICAL                    | 0                                    |
         | BR_TO_TL_DIAGONAL           | 2                                    |
         | VERTICAL                    | 2                                    |
         | VERTICAL                    | 0                                    |
         | BL_TO_TR_DIAGONAL           | 2                                    |
         | HORIZONTAL                  | 6                                    |
         | HORIZONTAL                  | 0                                    |

Computing masks and boardShifts

Since the largest mask is 18 bits, and 918<256, we can safely reserve 18 bits for each "word" within uint256 masks:
HORIZONTAL|(VERTICAL<< 181)|(BR_TO_TL_DIAGONAL<< 182)|(VERTICAL<< 183)|(VERTICAL<< 184)|(BL_TO_TR_DIAGONAL<< 185)|(HORIZONTAL<< 186)|(HORIZONTAL<< 187)=0xFC003F00CCC186106187030306184003F

Next, since the largest shift is 6, we can safely use 3 bits ((13=8)<6) for each "word" within uint256 boardShifts:
0|(0 << 31)|(2 << 32)|(2 << 33)|(0 << 34)|(2 << 35)|(6 << 36)=0x190480

Implementation

function checkGameStateWithWhile(uint256 _board) internal pure returns (uint256) {
    // Last 4 bits denote information about whether the game is over and whose turn
    // it is toplay, so we ignore them.
    _board >>= 4;

    uint256 masks = 0xFC003F00CCC186106187030306184003F;

    // Since the largest shift is 6, we can safely reserve 3 bits (1 << 3 = 8 < 6) for each
    // shift:
    //                                      0
    //                                      | (0 << 3 * 1)
    //                                      | (2 << 3 * 2)
    //                                      | (2 << 3 * 3)
    //                                      | (0 << 3 * 4)
    //                                      | (2 << 3 * 5)
    //                                      | (6 << 3 * 6)
    //                                      = 0x190480
    uint256 boardShifts = 0x190480;

    while (masks != 0) {
        // 0x3FFFF = 0b11...11 (18 1's)
        uint256 mask = masks & 0x3FFFF;
        if (_board & mask == mask) return 2;
        else if (_board & mask == (mask / 3)) return 1;

        _board >>= (boardShifts & 7);
        masks >>= 18;
        boardShifts >>= 3;
    }

    return 0;
}

Conclusion

So many people see low-level solidity optimizations from people like @transmissions11
and think that it's the key to optimizing their project

but what they don't realize is that the real savings come from efficient and well-designed mechanisms and architectures

Source: Tweet by 0xSubway

It's easy to get caught up in tactical tricks to make near-obsolete optimizations to your smart contracts. In fact, most of the techniques I shared on Twitter were never even intended to be "optimization tips," (bit twiddling can be quite useful outside of optimization) but I sort of ended up falling down that hole.

Of course, it's always fun to discover and share tricks like this! But many of them (especially bit twiddling) have extremely niche applications, so ask yourself whether it really makes sense to obfuscate your code for small optimizations before applying them.

Extra Reading/Resources

If you want to see some more advanced stuff, check out:

Acknowledgements

Thanks to people that helped proofread and give suggestions :)

If you found this helpful, consider following me on Twitter @fiveoutofnine.