# Broadcasting in binary ops
Tensor broadcasting is a powerful feature in NumPy and other array computing frameworks, allowing operations on tensors of different but compatible shapes. In simple terms, broadcasting takes a smaller tensor and "stretches" it across a larger tensor, without creating redundant copies of the data.
Below are some examples of broadcasting operations using NumPy:
![image](https://hackmd.io/_uploads/BJADaxJekx.png)
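For reference, a couple of minimal, illustrative NumPy snippets exercising the same rules (the shapes here are made up and not tied to the figure):

```python
import numpy as np

# A (4, 3) matrix plus a (3,) vector: the vector is stretched across all 4 rows.
a = np.arange(12).reshape(4, 3)
b = np.array([10, 20, 30])
print((a + b).shape)      # (4, 3)

# A (4, 1) column times a (1, 3) row: both operands are stretched to (4, 3).
col = np.array([[1], [2], [3], [4]])
row = np.array([[10, 20, 30]])
print((col * row).shape)  # (4, 3)
```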
## Compatible Shapes
Two tensors are compatible for broadcasting if, for each corresponding dimension, the dimension sizes are equal or one of them is $1$.
Formally, given two tensors $A[a_1, a_2, \dots, a_n]$ and $B[b_1, b_2, \dots, b_n]$ , their shapes are compatible if $\forall i \in \{1,2,\dots,n\}$, either $a_i = b_i$ or $a_i = 1$ or $b_i = 1$. The output shape $s$ is then computed as the maximum size along each dimension, $s_i = \max(a_i, b_i)$.
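This rule translates directly into a small helper. The sketch below is illustrative: the name `broadcast_output_shape` is made up, and equal ranks are assumed, as in the definition above.

```python
def broadcast_output_shape(shape_a, shape_b):
    """Return the broadcast output shape s, or raise if the shapes are incompatible."""
    if len(shape_a) != len(shape_b):
        raise ValueError("both shapes must have the same rank n")
    out = []
    for a_i, b_i in zip(shape_a, shape_b):
        # Compatible iff a_i == b_i, or one of them is 1.
        if a_i != b_i and a_i != 1 and b_i != 1:
            raise ValueError(f"incompatible sizes {a_i} and {b_i}")
        out.append(max(a_i, b_i))
    return tuple(out)

print(broadcast_output_shape((3, 2, 1, 8), (3, 1, 4, 8)))  # (3, 2, 4, 8)
```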
## Broadcasting definition
For a tensor $A[a_1, a_2, \dots, a_n]$ and a target shape $s = [m_1,\dots, m_n]$ with $a_k \in \{1,m_k\}$, the operation $C = \text{broadcast}(A, s)$ replicates the content of $A$ along the $k$-th dimension $m_k$ times for every $k$ with $a_k = 1$. The output tensor $C$ has shape $s$.
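In NumPy this operation is exposed as `np.broadcast_to`, which produces the replicated view without copying any data: the broadcast dimension simply gets a zero stride.

```python
import numpy as np

A = np.arange(4, dtype=np.int64).reshape(1, 4)  # a_1 = 1
C = np.broadcast_to(A, (3, 4))                  # s = [3, 4]

print(C.shape)    # (3, 4)
print(C.strides)  # (0, 8): stride 0 along the replicated dimension, so no data is copied
```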
## Implementing broadcasting
To implement broadcasting without physically copying tensor data for every broadcast dimension, we need to carefully iterate over the tensor elements multiple times in a way that mirrors a single pass over a materialized broadcast tensor. For `tt-metal`, this means iterating over tiles instead of individual elements, which makes things slightly more complicated.
For broadcasting over any dimension other than the last two, we can simplify things by iterating over that dimension multiple times. For instance, let's consider a 4D tensor and the very first dimension:
![image](https://hackmd.io/_uploads/BkdYpxJxJe.png)
- Here the first dimension is $a_1 = 3$. If we had $a_1 = 1$ and needed to broadcast it $m_1$ times, we'd simply iterate over all the tiles of the tensor in the outermost loop $m_1$ times instead of just $1$. This effectively replicates the entire 3D "slice" without any physical copying.
- Similarly, the next dimension here is $a_2 = 2$. If we had $a_2 = 1$ and needed to broadcast it $m_2$ times, we'd iterate over every tile inside every colored 3D slice of the entire tensor $m_2$ times, before moving on to the next index of the outermost dimension.
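Both cases above boil down to repeating loop passes rather than copying data. Below is a minimal Python sketch of the resulting tile-visit order; it is purely illustrative (in `tt-metal` this would be the reader kernel's tile loop), and the two innermost dimensions are assumed not to be broadcast here.

```python
def outer_dim_broadcast_order(a1, a2, m1, m2, tiles_per_slice):
    """Yield (i1, i2, tile) indexes of the input tensor in the order a
    materialized [m1, m2, ...] broadcast would be read."""
    assert a1 in (1, m1) and a2 in (1, m2)
    for _ in range(m1 if a1 == 1 else 1):          # repeat the whole tensor m1 times if a1 == 1
        for i1 in range(a1):
            for _ in range(m2 if a2 == 1 else 1):  # repeat each 3D slice m2 times if a2 == 1
                for i2 in range(a2):
                    for t in range(tiles_per_slice):
                        yield (i1, i2, t)

# Example: a [1, 2, ...] tensor broadcast to [3, 2, ...], with 4 tiles per 2D slice.
print(len(list(outer_dim_broadcast_order(1, 2, 3, 2, 4))))  # 24 tile reads from 8 input tiles
```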
## Challenges with Innermost Dimensions
Broadcasting over the two innermost dimensions poses additional challenges due to tiling.
1. **Broadcasting Last Dimension:**
When $a_4 = 1$, each tile in the input tensor contains a single column of actual values, with the rest as padding. To broadcast to width $m_4$, we replicate that single column across the tile width $W_t$ to form a full tile. We then need to handle full tiles as well as partial ones, iterating as follows (see the sketch after the figure):
- Push the full tile $q$ times, where $q = \lfloor m_4 / W_t\rfloor$.
- Follow up with a single partial tile if the remainder ($r$) is not zero: $r = m_4 - W_t\,q$.
![tile01](https://hackmd.io/_uploads/S1NcqMygJx.png)
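A sketch of this push loop, assuming the 32-wide `tt-metal` tile and hypothetical `full_tile` / `partial_tile` buffers prepared as described above:

```python
TILE_W = 32  # tile width W_t

def last_dim_broadcast_tiles(full_tile, partial_tile, m4):
    """Yield the tile sequence covering an output width of m4 columns.

    full_tile:    the single input column replicated across all W_t columns.
    partial_tile: the column replicated across only r = m4 - W_t*q columns,
                  with the remaining columns left as padding.
    """
    q, r = divmod(m4, TILE_W)
    for _ in range(q):       # push the full tile q times
        yield full_tile
    if r != 0:               # one partial tile covers the remainder
        yield partial_tile

# Example: m4 = 70 -> two full tiles plus one partial tile with r = 6 columns.
print(len(list(last_dim_broadcast_tiles("F", "P", 70))))  # 3
```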
2. **Broadcasting Penultimate Dimension:**
Broadcasting the penultimate dimension ($a_3 = 1$) also requires careful consideration. First, let's consider the case where the last dimension, $a_4$, is not broadcast. This means we have $\lceil a_4 / W_t \rceil$ tiles for each fixed pair of indexes along the two outer dimensions, and each tile holds only a single row of data, the rest being padding.
- If $m_3 \le H_t$, where $H_t$ is the tile height, then we need to replicate the first row of each tile $m_3$ times, while the total tile count remains the same.
- If $m_3 > H_t$, then the total tile count increases by a factor of $\lceil m_3 / H_t\rceil$. That is, we first have to iterate over every tile and fill it entirely with the single data row replicated $H_t$ times. We will perform $q = \lfloor m_3 / H_t\rfloor$ such iterations. After that, if the remainder $r = m_3 - H_t\,q$ is not zero, we run the iteration one more time, filling the tiles partially with $r$ rows of data.
- Since the first $q$ iterations push the same sequence of full tiles, an obvious optimization is to construct the sequence only once, store it in an intermediate CB, and reuse it on every iteration (see the sketch after the figure).
![tile02](https://hackmd.io/_uploads/BkEaagJlye.png)
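The same structure in a short sketch, with pre-built full and partial tile sequences standing in for the intermediate CB (names are hypothetical):

```python
TILE_H = 32  # tile height H_t

def penultimate_dim_broadcast(full_tiles, partial_tiles, m3):
    """Yield the tile sequence covering m3 output rows for one row of input tiles.

    full_tiles:    the input tiles with their single data row replicated H_t times.
    partial_tiles: the same tiles with only r = m3 - H_t*q rows filled.
    Both sequences are built once and reused, mirroring the intermediate CB.
    """
    q, r = divmod(m3, TILE_H)
    for _ in range(q):           # q passes over the reused full tiles
        yield from full_tiles
    if r != 0:                   # one final pass with partially filled tiles
        yield from partial_tiles

# Example: 3 tiles along the last dimension, m3 = 40 -> one full pass + one partial pass.
print(len(list(penultimate_dim_broadcast(["F1", "F2", "F3"], ["P1", "P2", "P3"], 40))))  # 6
```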
3. **Broadcasting both innermost dimensions:**
The final case we'll consider is when we need to broadcast both $a_3 = 1$ and $a_4 = 1$ dimensions. This means that each tile of the input tensor contains a single data element and everything else is padding. In this situation we have to combine both approaches covered earlier. We need to repeat the single element both along rows and columns, forming full tiles when possible and handling partial rows and columns otherwise.
- First, construct a full tile by replicating the single data value across all $H_t$ rows and $W_t$ columns.
- We compute $q_r = \lfloor m_3 / H_t \rfloor$ and $q_c = \lfloor m_4 / W_t \rfloor$ to indicate how many full tiles we'll need to iterate over, as well as the remainders $r_r = m_3 - H_t\,q_r$ and $r_c = m_4 - W_t\,q_c$ used to populate partial tiles.
- We put 4 tiles in an intermediate CB: a full tile $T_F$, a tile $T_H$ with $H_t$ partial rows of length $r_c$, a tile $T_W$ with $W_t$ partial columns of length $r_r$, and a tile $T_0$ with $r_r$ partial rows of length $r_c$.
- We loop $q_r$ times, and in each iteration we push the $T_F$ tile $q_c$ times, followed by a single $T_H$ tile (only if $r_c \neq 0$).
- Then, if $r_r \neq 0$, we push the $T_W$ tile $q_c$ times, followed by a single $T_0$ tile (again, only if $r_c \neq 0$). See the sketch after the figure.
![tile03](https://hackmd.io/_uploads/BkMomm1e1g.png)
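Putting it together, a sketch of the push order for one $m_3 \times m_4$ output block, using the four staged tiles (again purely illustrative):

```python
TILE_H, TILE_W = 32, 32  # H_t and W_t

def both_dims_broadcast(T_F, T_H, T_W, T_0, m3, m4):
    """Yield the tile grid covering an m3 x m4 output block built from a single value.

    T_F: full tile; T_H: H_t rows x r_c cols; T_W: r_r rows x W_t cols;
    T_0: r_r rows x r_c cols -- the four tiles staged in the intermediate CB.
    """
    q_r, r_r = divmod(m3, TILE_H)
    q_c, r_c = divmod(m4, TILE_W)
    for _ in range(q_r):          # q_r rows of full-height tiles
        for _ in range(q_c):
            yield T_F
        if r_c != 0:
            yield T_H             # a partial-width tile closes each row
    if r_r != 0:                  # the final, partial-height row of tiles
        for _ in range(q_c):
            yield T_W
        if r_c != 0:
            yield T_0

# Example: m3 = 40, m4 = 70 -> a 2 x 3 grid of tiles: F F H / W W 0.
print(list(both_dims_broadcast("F", "H", "W", "0", 40, 70)))
```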