# [Design Practice Sharing] Part III – Handling Alignment and Cascade in Stream Converters

In hardware data movers, alignment isn’t always perfect, especially when handling AXI-MM transfers that start at unaligned addresses or contain only partial data.

This article shares how I tackled unaligned and narrow transfers while building a **Sparse to Continuous Stream Converter IP**, designed for PCIe DMA scenarios like Xilinx CPM QDMA.

---

## 🚧 The Challenge

AXI-MM interfaces allow transactions to begin at arbitrary byte offsets and send partial beats. But AXI-Stream expects **fully aligned, gap-free, in-order data**.

So what happens when:

- Data starts at byte 1 instead of byte 0?
- You only receive 3 valid bytes instead of 4?
- The last transfer has just 2 bytes?

You need **alignment handling** and **data cascade logic**.

---

## 🧠 Design Breakdown

To solve this, I implemented:

### 🔹 Partial Buffer

Stores partial input data that cannot yet form a complete AXIS beat.

### 🔹 Shift & Merge Logic

Shifts incoming valid bytes into alignment with the output LSB (Byte0). Merges with buffered data from previous beats.

### 🔹 Valid Byte Counter

Tracks accumulated valid bytes to determine when a full beat is ready.

---

## 🔄 Behavior Summary

1. **First beat**: If start offset ≠ 0, shift data to align with LSB.
2. **Intermediate beats**: Merge new data with buffered data from the previous beat.
3. **When ≥ N bytes collected**: Output a full AXIS beat.
4. **Final beat**: If < N bytes remain, output partial data and set `axis_invalid_cnt`.

## 🧩 Data Merge Implementation Strategy

Two candidate architectures are proposed for merging and aligning sparse data into a continuous stream:

### Option 1: Dual Register Merge Strategy

- Utilizes two N-bit registers (e.g., 64-bit): Reg A and Reg B.
- Reg A holds the first beat, right-shifted by the invalid byte count.
- Reg B holds the next beat, left-shifted by the number of valid bytes in Reg A.
- The output is generated by merging Reg A and Reg B (`Reg A | Reg B`).
- Remaining bytes from Reg B can be retained for the next merge cycle.

**Pros:**

- Simpler control for small widths.
- Per-beat management is intuitive.

**Cons:**

- Requires dual-register coordination.
- Shift logic must be aware of valid byte counts per beat.

### Option 2: Wide Shift Register Strategy

- Uses a single 2×N-bit register (e.g., 128-bit for 64-bit data width).
- Each new input beat is placed in the upper half.
- Previous data is shifted to the lower half automatically.
- A global right shift is applied based on the first beat's invalid byte count.
- The lower half then represents the final merged aligned output.

**Pros:**

- Compact implementation with centralized shift logic.
- Easier to pipeline and scale.

**Cons:**

- Consumes more registers.
- May stress timing paths at higher data widths.

## 🖼️ Example Diagram

### Transaction from Sparse to Continuous

![image](https://hackmd.io/_uploads/H1-kcFaggx.png)

### Dual Register Merge Strategy

![image](https://hackmd.io/_uploads/B1QR9l0gee.png)

### Wide Shift Register Strategy

![image](https://hackmd.io/_uploads/rkuOdp6lex.png)

---

## ✅ Key Takeaways

- **Alignment always starts from Byte0** on AXIS output.
- **Order must be preserved** across beat boundaries.
- **Cascading** allows data to be carried forward until complete.

This mechanism ensures smooth AXI-Stream flow even when source data is sparse, misaligned, or narrow.

---

#AXIStream #IPDesign #DataMover #FPGA #StreamingArchitecture #QDMA #UnalignedTransfer #SystemDesign #RTL
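
For readers who want to experiment with the shift, merge, and cascade behavior described in this article before writing RTL, here is a minimal behavioral sketch in Python. It is not the IP's actual implementation and it is not cycle-accurate. The function name `sparse_to_continuous`, the `(data, offset, valid)` input format, and the reading of `axis_invalid_cnt` as "number of unused byte lanes in the last beat" are all assumptions made for illustration. The sketch also assumes the valid bytes within each input beat are contiguous. A Python integer stands in for a 2×N-byte accumulator with Byte0 at the LSB.

```python
from typing import Iterable, List, Tuple

# One input beat: (data word, first valid byte lane, number of valid bytes).
# This format is a hypothetical stand-in for AXI-MM data plus byte strobes.
InBeat = Tuple[int, int, int]
# One output beat: (data word, invalid byte count). The second field models
# axis_invalid_cnt, assumed here to mean "unused lanes in this beat".
OutBeat = Tuple[int, int]


def sparse_to_continuous(beats: Iterable[InBeat], n_bytes: int) -> List[OutBeat]:
    """Pack sparse input beats into gap-free output beats of n_bytes each."""
    beat_mask = (1 << (8 * n_bytes)) - 1
    wide = 0    # models a 2*N-byte accumulator, Byte0 at the LSB
    count = 0   # valid byte counter: bytes currently held in `wide`
    out: List[OutBeat] = []

    for data, offset, valid in beats:
        if offset < 0 or valid < 0 or offset + valid > n_bytes:
            raise ValueError("valid bytes must fit inside one beat")

        # Shift: drop the invalid low lanes so the first valid byte
        # lands on Byte0, and mask off anything above the valid bytes.
        aligned = (data >> (8 * offset)) & ((1 << (8 * valid)) - 1)

        # Merge: place the new bytes directly above the buffered ones.
        # count < n_bytes here, so the result always fits in 2*N bytes.
        wide |= aligned << (8 * count)
        count += valid

        # Cascade: once a full beat is available, emit the lower half
        # and carry the leftover bytes down for the next merge.
        if count >= n_bytes:
            out.append((wide & beat_mask, 0))
            wide >>= 8 * n_bytes
            count -= n_bytes

    # Final beat: flush any leftover bytes and report the unused lanes.
    if count > 0:
        out.append((wide & beat_mask, n_bytes - count))
    return out
```

With a 4-byte bus, feeding `[(0x44332211, 1, 3), (0x88776655, 0, 4), (0x0000AA99, 0, 2)]` (first beat starts at byte 1, last beat carries only 2 bytes) yields `[(0x55443322, 0), (0x99887766, 0), (0xAA, 3)]`: two full beats followed by a one-byte final beat with three invalid lanes. A model like this can serve as a reference when checking either merge strategy in simulation.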