# Atari-style fast CPU-only chunky to planar
Notation:
> Letters denote pixels and digits denote the bits of a pixel. A dash represents an unused bit set to 0. For instance, a 4-color pixel stored in a byte may be written as `------b1b0`.
### 3-color sprites
Each byte stores four 2-bit pixels:
```
a1a0b1b0c1c0d1d0 e1e0f1f0g1g0h1h0
i1i0j1j0k1k0l1l0 m1m0n1n0o1o0p1p0
...
```
Precompute two lookup tables, each holding 256 words (one per possible byte value). For a byte holding pixels x, y, u, v:
```
[0] x1x0 y1y0 u1u0 v1v0 => x1y1u1v1----x0y0u0v0----
[1] x1x0 y1y0 u1u0 v1v0 => ----x1y1u1v1----x0y0u0v0
```
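The two tables above could be generated offline along these lines. This is a minimal sketch in Python used purely as a modeling language; the function name `build_2bpp_tables` is hypothetical and not part of the original notes.

```python
def build_2bpp_tables():
    """Return two 256-entry lists of 16-bit words, one per byte position."""
    tables = [[0] * 256 for _ in range(2)]
    for value in range(256):
        hi = 0  # bit 1 of each pixel: x1y1u1v1
        lo = 0  # bit 0 of each pixel: x0y0u0v0
        for shift in (6, 4, 2, 0):  # leftmost pixel (x) first
            pixel = (value >> shift) & 0b11
            hi = (hi << 1) | (pixel >> 1)
            lo = (lo << 1) | (pixel & 1)
        # Table [0] fills the upper nibble of each byte, table [1] the lower.
        tables[0][value] = (hi << 12) | (lo << 4)
        tables[1][value] = (hi << 8) | lo
    return tables


tables = build_2bpp_tables()
print(hex(tables[0][0b11100100]))  # pixels 3,2,1,0 -> 0xc0a0
```

ORing an entry from table [0] with an entry from table [1] never collides, because the two tables occupy disjoint nibbles of each byte.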
Finally, the chunky-to-planar algorithm:
```asm68k
# [a0] chunky buffer
# [a1] sprite words
# [a2] precomputed array
# begin *block*
clr.w d0 # {4} | these instructions
move.b (a0)+,d0 # {8} | may be part of
add.w d0,d0 # {4} | effect renderer
move.w 0(a2,d0.w),d1 # {14} pixel a1a0b1b0c1c0d1d0
clr.w d0 # {4}
move.b (a0)+,d0 # {8}
add.w d0,d0 # {4}
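# NOTE: table [1] starts 512 bytes (256 words) after table [0]; the 68000's
# d8(An,Xn) displacement is only 8 bits, so real code needs a second
# address register pointing at table [1]
or.w 512(a2,d0.w),d1 # {14} pixel e1e0f1f0g1g0h1h0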
# end *block*
# d1 = a1b1c1d1e1f1g1h1 a0b0c0d0e0f0g0h0
movep.w d1,row+0(a1) # {16}
# repeat *block*
# d1 = i1j1k1l1m1n1o1p1 i0j0k0l0m0n0o0p0
movep.w d1,row+1(a1) # {16}
```
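Average cycles per pixel is: `(16+14)/4 + 16/8 = 9.5` :+1: (per byte of 4 pixels: 16 cycles to fetch and scale the index plus 14 for the table lookup; one 16-cycle `movep.w` per 8 pixels)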
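To check the bit layout of `d1` and the effect of `movep.w`, the block can be modeled in Python. This is only a sketch: the helper names (`spread_2bpp`, `c2p_block_2bpp`, `movep_w`) are hypothetical, and the table entries are computed on the fly instead of being read from a precomputed array.

```python
def spread_2bpp(value, shift):
    """Model of one table entry: x1y1u1v1 / x0y0u0v0 nibbles, moved right by `shift`."""
    hi = lo = 0
    for s in (6, 4, 2, 0):  # leftmost pixel first
        pixel = (value >> s) & 0b11
        hi = (hi << 1) | (pixel >> 1)
        lo = (lo << 1) | (pixel & 1)
    return ((hi << 12) | (lo << 4)) >> shift


def c2p_block_2bpp(byte0, byte1):
    """Model of one *block*: table [0] lookup ORed with table [1] lookup."""
    return spread_2bpp(byte0, 0) | spread_2bpp(byte1, 4)


def movep_w(memory, address, d1):
    """Model of movep.w: high byte to `address`, low byte to `address + 2`."""
    memory[address] = (d1 >> 8) & 0xFF
    memory[address + 2] = d1 & 0xFF


# Pixels a..h = 0,1,2,3,3,2,1,0 packed into two chunky bytes.
d1 = c2p_block_2bpp(0b00011011, 0b11100100)
print(hex(d1))  # 0x3c5a: high byte holds a1..h1, low byte holds a0..h0

sprite = bytearray(4)
movep_w(sprite, 0, d1)  # first 8 pixels go to the even bytes
```

A second block written with `movep_w(sprite, 1, ...)` fills the odd bytes, which matches the `row+0` / `row+1` pattern in the assembly.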
#### What about blitter c2p?
##### Input
```
a1a0b1b0c1c0d1d0 e1e0f1f0g1g0h1h0
i1i0j1j0k1k0l1l0 m1m0n1n0o1o0p1p0
A1A0B1B0C1C0D1D0 E1E0F1F0G1G0H1H0
I1I0J1J0K1K0L1L0 M1M0N1N0O1O0P1P0
```
-----
* Swap 8x2
```
a1a0b1b0e1e0f1f0 A1A0B1B0E1E0F1F0
c1c0d1d0g1g0h1h0 C1C0D1D0G1G0H1H0
i1i0j1j0m1m0n1n0 I1I0J1J0M1M0N1N0
k1k0l1l0o1o0p1p0 K1K0L1L0O1O0P1P0
```
* Swap 2x1
```
a1a0 b1b0 -> a1a0 c1c0
c1c0 d1d0 -> b1b0 d1d0
```
```
a1a0c1c0e1e0g1g0 A1A0C1C0E1E0G1G0
b1b0d1d0f1f0h1h0 B1B0D1D0F1F0H1H0
i1i0k1k0m1m0o1o0 I1I0K1K0M1M0O1O0
j1j0l1l0n1n0p1p0 J1J0L1L0N1N0P1P0
```
* Swap 1x1
```
a1 a0 -> a0 b0
b1 b0 -> a1 b1
```
##### Output
```
a0b0c0d0e0f0g0h0 i0j0k0l0m0n0o0p0 <- bitplane #0
A0B0C0D0E0F0G0H0 I0J0K0L0M0N0O0P0
a1b1c1d1e1f1g1h1 i1j1k1l1m1n1o1p1 <- bitplane #1
A1B1C1D1E1F1G1H1 I1J1K1L1M1N1O1P1
```
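When experimenting with the swap passes, it helps to have a slow CPU-side reference that maps the four input words directly to the four output words shown above. The sketch below does not model the blitter passes themselves; it only computes the expected end result, and the names `plane_bits` and `reference_c2p` are hypothetical.

```python
def plane_bits(word, bit):
    """Collect bit `bit` (0 or 1) of the eight 2-bit pixels in a 16-bit word."""
    out = 0
    for shift in range(14, -2, -2):  # leftmost pixel first
        out = (out << 1) | ((word >> (shift + bit)) & 1)
    return out


def reference_c2p(words):
    """Map 4 chunky words (a-h, i-p, A-H, I-P) to the 4 planar output words."""
    w0, w1, w2, w3 = words
    return [
        (plane_bits(w0, 0) << 8) | plane_bits(w1, 0),  # a0..p0, bitplane #0
        (plane_bits(w2, 0) << 8) | plane_bits(w3, 0),  # A0..P0
        (plane_bits(w0, 1) << 8) | plane_bits(w1, 1),  # a1..p1, bitplane #1
        (plane_bits(w2, 1) << 8) | plane_bits(w3, 1),  # A1..P1
    ]


# a-h all color 3, i-p all color 0, A-H all color 1, I-P all color 2.
print([hex(w) for w in reference_c2p([0xFFFF, 0x0000, 0x5555, 0xAAAA])])
```

Comparing the output of each candidate pass sequence against this reference is a quick way to catch a mis-ordered swap.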
### 15-color sprites
Each byte stores two 4-bit pixels:
```
a3a2a1a0 b3b2b1b0 c3c2c1c0 d3d2d1d0
e3e2e1e0 f3f2f1f0 g3g2g1g0 h3h2h1h0
i3i2i1i0 j3j2j1j0 k3k2k1k0 l3l2l1l0
m3m2m1m0 n3n2n1n0 o3o2o1o0 p3p2p1p0
...
```
Precompute four lookup tables, each holding 256 longwords (one per possible byte value). For a byte holding pixels x, y:
```
[0] x3x2x1x0 y3y2y1y0 => x3y3------x2y2------x1y1------x0y0------
[1] x3x2x1x0 y3y2y1y0 => --x3y3------x2y2------x1y1------x0y0----
[2] x3x2x1x0 y3y2y1y0 => ----x3y3------x2y2------x1y1------x0y0--
[3] x3x2x1x0 y3y2y1y0 => ------x3y3------x2y2------x1y1------x0y0
```
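The four longword tables above could be generated as follows. As before, this is a sketch in Python used as a modeling language, and `build_4bpp_tables` is a hypothetical name.

```python
def build_4bpp_tables():
    """Return four 256-entry lists of 32-bit values, one per byte position."""
    tables = [[0] * 256 for _ in range(4)]
    for value in range(256):
        x, y = value >> 4, value & 0x0F  # left and right pixel of the byte
        base = 0
        for bit in (3, 2, 1, 0):
            pair = (((x >> bit) & 1) << 1) | ((y >> bit) & 1)
            # Each plane owns one byte; the pair starts in its top two bits.
            base |= pair << (bit * 8 + 6)
        for n in range(4):
            # Table [n] is table [0] moved right by 2*n bits within each byte.
            tables[n][value] = base >> (2 * n)
    return tables


tables = build_4bpp_tables()
print(hex(tables[0][0xF0]))  # x = 15, y = 0 -> 0x80808080
```

Because the largest shift is 6 bits, a pair never crosses from one plane byte into the next.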
Finally, the chunky-to-planar algorithm:
```asm68k
# [a0] chunky buffer
# [a1] sprite words
# [a2] precomputed array
# begin *block*
clr.w d0 # {4} | these instructions
move.b (a0)+,d0 # {8} | may be part of
lsl.w #2,d0 # {10} | effect renderer
move.l 0(a2,d0.w),d1 # {18} pixel a3a2a1a0b3b2b1b0
# NOTE: tables [1]..[3] start 1024, 2048 and 3072 bytes after table [0];
# the 68000's d8(An,Xn) displacement is only 8 bits, so real code needs
# a separate address register for each table
clr.w d0 # {4}
move.b (a0)+,d0 # {8}
lsl.w #2,d0 # {10}
or.l 1024(a2,d0.w),d1 # {20} pixel c3c2c1c0d3d2d1d0
clr.w d0 # {4}
move.b (a0)+,d0 # {8}
lsl.w #2,d0 # {10}
or.l 2048(a2,d0.w),d1 # {20} pixel e3e2e1e0f3f2f1f0
clr.w d0 # {4}
move.b (a0)+,d0 # {8}
lsl.w #2,d0 # {10}
or.l 3072(a2,d0.w),d1 # {20} pixel g3g2g1g0h3h2h1h0
# end *block*
# d1 = a3b3c3d3e3f3g3h3 a2b2c2d2e2f2g2h2 \
# a1b1c1d1e1f1g1h1 a0b0c0d0e0f0g0h0
# Is it correct below?
movep.l d1,row+0(a1) # {24}
# repeat *block*
# d1 = i3j3k3l3m3n3o3p3 i2j2k2l2m2n2o2p2 \
# i1j1k1l1m1n1o1p1 i0j0k0l0m0n0o0p0
movep.l d1,row+1(a1) # {24}
```
Each block converts 8 pixels: the first lookup costs `4+8+10+18 = 40` cycles, each of the other three costs `4+8+10+20 = 42`, and `movep.l` adds 24. Average cycles per pixel is: `(40 + 3*42 + 24)/8 = 23.75`
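The 4-bit block and the `movep.l` store can be modeled the same way to see where each plane byte ends up. This is a sketch with hypothetical helper names (`spread_4bpp`, `c2p_block_4bpp`, `movep_l`); table entries are computed on the fly instead of being read from a precomputed array.

```python
def spread_4bpp(value, n):
    """Model of table [n]: one (x, y) bit pair per plane byte, shifted right by 2*n."""
    x, y = value >> 4, value & 0x0F
    out = 0
    for bit in (3, 2, 1, 0):
        pair = (((x >> bit) & 1) << 1) | ((y >> bit) & 1)
        out |= pair << (bit * 8 + 6)
    return out >> (2 * n)


def c2p_block_4bpp(chunky):
    """Model of one *block*: OR the four lookups into d1."""
    d1 = 0
    for n, value in enumerate(chunky[:4]):
        d1 |= spread_4bpp(value, n)
    return d1


def movep_l(memory, address, d1):
    """Model of movep.l: four bytes, most significant first, to every other address."""
    for i in range(4):
        memory[address + 2 * i] = (d1 >> (24 - 8 * i)) & 0xFF


# Pixels a..h = 8,4,2,1,8,4,2,1 packed into four chunky bytes.
d1 = c2p_block_4bpp(bytes([0x84, 0x21, 0x84, 0x21]))
print(hex(d1))  # 0x88442211: plane 3, plane 2, plane 1, plane 0

sprite = bytearray(8)
movep_l(sprite, 0, d1)
```

In this model the plane-3 byte lands at the lowest address and the plane-0 byte at the highest. Whether that matches the intended layout depends on how the sprite words at `a1` are ordered, which the notes leave open.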