
# Response

Dear Editor in Chief, dear Associate Editor, dear reviewers,

Thank you for your considerate feedback and comments on our original submission. Below is our response to each individual point raised by the two reviewers.

## Reviewer 1

- "The hybrid bonding assumed by the authors rather points to oxide bonding, hence, the very high vertical interconnect density. There are processes, however, that this density may not be supported. Therefore, it would make sense to check how the results would change for fabrication processes based on microbumps on the order of tens of μm (i.e., 55 μm and below towards 10 μm)."

-> Indeed, there are many ways to implement a 3D stack. We explore F2F with high interconnect density, but, as you say, other options are possible. These include not only F2F with larger µbumps, but also F2B with TSVs, which introduce area overhead for TSV insertion and keep-out zones (KOZ, where no active devices can be placed) and further limit the interconnect density. We believe that future 3D designs will definitely require more than two layers, and that variable 3D interconnect density must be taken into account. Here, however, we focus on a two-die stack simply because we want to understand the benefits of fine-grain system partitioning. An open question is whether the proposed methods could be extended to more than two layers while somehow controlling the number of interconnects between the dies. Moreover, the 3D pitch considered in this work is around 1.75 µm, which, given the area footprint of the 3D stack, yields close to 35k 3D bumps available across the F2F boundary. Looking at the number of 3D nets produced by each partitioning, 13 to 72% of the total 3D structures are already used even before power routing. This means that a larger 3D pitch, even as little as 10 µm, would render the whole 3D stack infeasible. Of course, this depends on the system architecture and the partitioning scenario.
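As a rough sanity check of the pitch argument above, one can assume (this is an assumption for illustration, not a figure taken from the manuscript) that the bumps sit on a uniform square grid over a fixed footprint $A$, so that the bump count scales as $N(p) \approx A / p^2$. Scaling the roughly 35k bumps quoted at $p = 1.75$ µm to $p = 10$ µm, and comparing with the 13 to 72% usage quoted above, gives:

$$
N(10\,\mu\text{m}) \approx N(1.75\,\mu\text{m}) \left(\frac{1.75}{10}\right)^{2} \approx 35\,000 \times 0.0306 \approx 1.1\text{k}
$$

$$
N_{\text{3D nets}} \approx 0.13 \times 35\,000 \;\text{ to }\; 0.72 \times 35\,000 \approx 4.6\text{k} \;\text{ to }\; 25\text{k}
$$

Under that assumption, even the least demanding partitioning would need roughly four times more 3D connections than a 10 µm pitch could offer, before any power routing.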
Other system architectures may obviously have very different pitch requirements. Incidentally, automated partitioning could, and eventually should, be driven by the 3D pitch. We added a sentence clarifying this point in the first part of Section IV.

- "Results with some other circuits other that LDPC, which is wire-dominated, would offer more insight as to whether 3D is useful or not. Table I shows for L4S that gains when migrating to 3D may not be so great (i.e., total cap. does not really decrease)."

-> One of the aims of these experiments was to highlight the difference between highly and sparsely interconnected designs. The LDPCs used here were engineered for that purpose, each representing one end of the spectrum. L4S has many local connections by design, but few global interconnections. At the other end of the spectrum, L4F has a dense global interconnect. Indeed, L4S shows little improvement from going 3D, but L4F definitely does, not only in total capacitance but especially in design utilization. This translates directly into area and cost gains. 2D integration of a heavily interconnected design requires the design utilization to drop to 40%, while 3D can bring the design utilization up to the level of a sparsely interconnected design. This also indicates the direction in which system architecture developments could go further (e.g., MemPool from Prof. Luca Benini and ETHZ). The literature tends to agree that 3D is useful if the technology enables it. Through these experiments, we want to show that the type of design matters as much as the partitioning and clustering decisions. We appended a clarification along these lines to Section V and clarified, at the beginning of Section V, our purpose in choosing those two designs.

- "I fully agree with this statement: " Working directly with timings would be an interesting extension, but the interaction required between the various tools is outside the scope of the present work."
I understand the difficulty of this task but it would be nice to have such results."

-> The major difficulty today in performing such an exercise is the automated generation of timing budgets for each die, since the dies are processed individually. Even with the most recent tools at our disposal as of 2023, this is not doable. It can be achieved for simpler design partitioning such as memory-on-logic, but not for logic-on-logic designs as large as the L4S and L4F presented in this work. The related sentence in the manuscript has been updated to clarify this point.

- "References directly relating to this work, such as "T. Yan, Q. Dong, Y. Takashima, and Y. Kajitani, “How Does Partitioning Matter for 3D Floorplanning?” Proceedings of the ACM International Great Lakes Symposium on VLSI, pp. 73-76, April/May 2006" should be included in the list of refs. There aren't only two groups making research on 3D around the globe."

-> This is indeed a fair point, and the state of the art has been expanded to reflect this remark.

- "It is stated: "They were both implemented using an advanced PDK node and totaled around one million standard cells" Which PDK? This should be stated clearly along with some info relating to the constraints used to produce the initial 2D synthesized netlist and placement later on."

-> **DM** -> Minor edits have been inserted within the text.

## Reviewer 2

- "This submission is suggested going to TCAD or TVLSI, because most of content is related to physical implementation level, not system level. And, there are many existing papers published in the physical design related conference proceedings and journals."

-> In our experiments, we used a design with a saturated global interconnect, which represents an actual limitation for many system designers.
Nowadays, one would avoid an architecture such as the one engineered through L4F (highly interconnected blocks), and for good reason: the design utilization (DU) suffers greatly from metal-layer congestion, and the resulting need to decrease the DU, and thus waste precious silicon real estate, removes most of the appeal. We try to show that a 3D stacking scheme makes such a design entirely viable by bringing the DU to the level of a more classical implementation represented by L4S (which in turn benefits less from a 3D implementation). Furthermore, 3D makes sense from a system perspective, especially for embedded systems, as it contributes to wire-length reduction and thus decreases power dissipation in the metal layers. A good example of such a highly interconnected design is MemPool [1], which poses challenges for a 2D implementation that are somewhat relaxed by 3D alternatives [2]. This goes well beyond performance and power savings. Given 3D technology, future system architects could exploit the increased interconnect supply to redesign the system architecture: more distributed communication (more parallel communication channels) with wider data interfaces (e.g., AXI width can go up to 1024 bits). We appended a clarification along these lines to Section V and clarified, at the beginning of Section V, our purpose in choosing those two designs.

- "The evaluated methodology requires 2D P&R before partitioning. However, the solution quality of this kind of partitioning might be limited by the initial 2D placed gate-level netlist. In fact, the partitioning does not necessarily rely on a 2D placed netlist. Some evidences indicate that partitioning for a pure 3D methodology is desired."

-> This remark is entirely on point! However, extending placement to the third dimension is far from simple. Commercial tools as of 2023 extend existing 2D PnR to 3D [3] and do not yet provide automated partitioning decisions.
Our goal in this work is to understand whether fine-grain logic-on-logic can be interesting and what the trade-offs are. The state of the art has been updated to include a recent work with a native 3D placement scheme, which is indeed a step on the still-long road toward 'true' 3D-ICs.

- "In introduction, "The placement of the 3D structures is however fully automated and _optimal_ with respect to the first die in the flow, minimizing the distance between each 3D structure with its connected gate." Please be careful for the claim of "optimal"."

-> This is a fair remark. The sentence has been updated to: "3D structures are allocated to be as close as possible to the 2D pin to which it will have to connect, but lacking the information pertaining to what and where it will be connected in the subsequent die."

- "There are keeping new papers published. Comparison with state-of-the-art works is required. For example, Pruek Vanna-Iampikul, Chengjia Shao, Yi-Chen Lu, Sai Pentapati, and Sung Kyu Lim. 2021. Snap-3D: A Constrained Placement-Driven Physical Design Methodology for Face-to-Face-Bonded 3D ICs. In Proceedings of the 2021 International Symposium on Physical Design (ISPD '21). Association for Computing Machinery, New York, NY, USA, 39–46."

-> This is absolutely right, and the state of the art has been expanded to include more recent works.

- "In experiments, two testcases were used. Two testcases might be too few for the evaluation. Especially, their results showed that none of the evaluated three clustering methods can be a general solution for 3D IC designs."

-> Reviewer 1 raised a similar concern, and we address this point in a dedicated response above. The two designs engineered for these experiments represent both ends of the interconnect density spectrum: L4S is highly local with fewer global connections, whilst L4F has a very dense global interconnect.
Using those two, we show that 3D has limited impact on the local interconnect, but a significant impact on the global interconnect, which hosts the longer wires. This is especially noticeable in the design utilization, which doubles for L4F but barely changes for L4S. Concerning the clustering methods, we did not aim to highlight a general solution that should be a "go-to" method for designing 3D-ICs. We want to show that, while the literature puts the pre-partitioning clustering problem aside, it can have a non-negligible impact on the partitioning quality and on the enablement of 3D designs.

[1] MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect, Cavalcante et al., https://doi.org/10.23919/DATE51398.2021.9474087

[2] MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration, Cavalcante et al., https://doi.org/10.23919/DATE54114.2022.9774726

[3] Integrity 3D-IC Platform, Cadence, https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/soc-implementation-and-floorplanning/integrity-3dic-platform.html
