
# Reading List of Carloni's works
###### Tags: `Thoughts`

This review lists the publications by Luca Carloni that look interesting to me. Among 149 papers since 1996, 32 are picked here and divided into four areas: latency insensitive design, automatic design exploration, memory optimisation and others. In case you have limited time to read it through, I give a relevance factor (RF) out of 10 for each paper, indicating how related that paper is to my current research area:

* RF between 7 and 10: Very close to what we are working on (worth careful reading).
* RF between 3 and 6: Unclear how related it is, but we should be aware of this work.
* RF between 0 and 2: Interesting work but unlikely to be related to my current research.

In addition, I give some comments on the papers listed below.

### Latency Insensitive Design
---

#### 1.(RF = 7) G. Zacharopoulos, L. Ferretti, G. Ansaloni, G. Di Guglielmo, L. Carloni, and L. Pozzi, Compiler-Assisted Selection of Hardware Acceleration Candidates from Application Source Code (not available yet), The Proceedings of the International Conference on Computer Design (ICCD), 2019.

The paper is not available yet, but I find it interesting as their prior work recognises a subgraph and customises the architecture in hardware. This may be helpful when we automate code partitioning between SS and DS from the input code.

<details>
<summary>Abstract</summary>

*Related work (?), TCAD2018: Georgios Zacharopoulos, Lorenzo Ferretti, Emanuele Giaquinta, Giovanni Ansaloni, Laura Pozzi. RegionSeeker: Automatically Identifying and Selecting Accelerators from Application Source Code, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2018. [Link to paper](https://ieeexplore.ieee.org/document/8323207).*

*Abstract(TCAD2018): Embedded systems present stringent and often conflicting requirements.
On the one side, the need for high performance within a tight energy budget favors inflexible Application Specific Integrated Circuit (ASIC) implementations; on the other side, a short time-to-market demands programmability. Hybrid architectures such as special-purpose customized processors represent an attractive solution, as they are programmable by software, but use dedicated hardware to accelerate parts of the computation. In such a scenario, the capability of automatically identifying the computation parts to be realized in hardware is highly desirable, in order to reduce design time and effort. **This paper aims at advancing the state-of-the-art in this field. We recognize that subgraphs of control flow graphs having a single input control point and a single output control point, that we call regions, are good targets for the synthesis of application specific hardware accelerators. We therefore provide a method to identify them and an LLVM-based toolchain (named RegionSeeker) that, analyzing a software application, automatically selects its most profitable regions given an area constraint.** Experimental evidence shows that the accelerators identified by RegionSeeker provide a speedup of up to $4.6\times$ and, on average, approximately 30% higher speedup is achieved compared to state-of-the-art identification techniques.*

</details>

#### 2.(RF = 9) L. P. Carloni, [From Latency-Insensitive Design to Communication-Based System-Level Design](https://ieeexplore.ieee.org/document/7299248/?arnumber=7299248), The Proceedings of the IEEE, Vol. 103, No. 11, November 2015.

This is the 'annoying' paper we found during the FPGA 2020 rebuttal, which led me to these papers. It introduces the work on combining SS and DS hardware.
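
To remind myself what "latency insensitive" means in practice, here is a minimal sketch, assuming a purely feed-forward channel with no back-pressure. It is not the protocol from the paper: `relay_station`, `shell`, `running_sum` and `run` are hypothetical names I made up, and `None` stands in for the void (stalling) symbol. It only illustrates the core idea that a stallable module produces the same stream of valid data no matter how many one-cycle relay stations sit on its input wire.

```python
VOID = None  # stand-in for the void/stalling symbol (so real data must not be None)


def relay_station(stream):
    """Delay a channel by one clock cycle; the register initially holds VOID."""
    held = VOID
    out = []
    for item in stream:
        out.append(held)   # emit what was latched in the previous cycle
        held = item        # latch the current item
    out.append(held)       # one extra cycle to flush the last latched item
    return out


def shell(step, state, stream):
    """Wrap a stallable module: fire on valid input, stall (emit VOID) otherwise."""
    out = []
    for item in stream:
        if item is VOID:
            out.append(VOID)              # module keeps its state and waits
        else:
            state, result = step(state, item)
            out.append(result)
    return out


def strip_voids(stream):
    """Keep only the valid data items, dropping stalled cycles."""
    return [x for x in stream if x is not VOID]


def running_sum(state, x):
    """Toy module: accumulate the inputs and output the running total."""
    state += x
    return state, state


def run(inputs, n_relays):
    """Feed `inputs` through `n_relays` relay stations into the wrapped module."""
    channel = list(inputs)
    for _ in range(n_relays):
        channel = relay_station(channel)
    return shell(running_sum, 0, channel)


for n in range(4):
    # Cycle-by-cycle traces differ, but the valid data is always the same.
    print(n, run([1, 2, 3, 4], n), strip_voids(run([1, 2, 3, 4], n)))
```

The cycle-accurate traces differ only in where the voids appear, which is the kind of equivalence the LID papers below make precise. A real design also needs stop signals and buffering for back-pressure, which this sketch deliberately leaves out.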
<details>
<summary>Abstract</summary>

*Abstract: By the end of the 20th century, the continuous progress of the semiconductor industry brought a major transformation in the design of integrated circuits: as the speed of global wires could not keep up with the speed of ever-smaller transistors, the digital chip became a distributed system. This fact broke the synchronous paradigm assumption, i.e., the foundation of those computer-aided design (CAD) flows which had made possible three decades of unique technology progress: from chips with thousands of transistors to systems on chips (SoCs) with over a billion transistors. Latency-insensitive design (LID) is a correct-by-construction design methodology that was originally developed to address this challenge while preserving as much as possible the synchronous assumption. A broad new approach that transforms the fundamentals of how complex digital systems are assembled, LID introduces the protocols and shells paradigm, which offers several main benefits: modularity (by reconciling the synchronous paradigm with the dominant impact of global interconnect delays that characterizes nanometer technologies), scalability (by making key properties of the design be correct by construction through interface synthesis), flexibility (by simplifying the design and validation of a system through the separation of communication from computation), and efficiency (by enabling the reuse of predesigned components, thus reducing the overall design time). This paper overviews the principles and practice of LID, offers a retrospective on related research over the past decade, and looks ahead in proposing the protocols and shells paradigm as the foundation to bridge the gap between system-level and logic/physical design, a requisite to cope with the complexity of engineering future SoC platforms.*

</details>

#### 3.(RF = 8) R. Collins and L.P. Carloni, [Flexible Filters in Stream Programs](https://dl.acm.org/citation.cfm?id=2539041), ACM Transactions on Embedded Computing Systems, Vol. 13, No. 3, December 2013.

This is interesting work. They introduce a flexible filter to allocate the workload both statically and dynamically in a streaming application.

<details>
<summary>Abstract</summary>

*The stream-processing model is a natural fit for multicore systems because it exposes the inherent locality and concurrency of a program and highlights its separable tasks for efficient parallel implementations. **We present flexible filters, a load-balancing optimization technique for stream programs. Flexible filters utilize the programmability of the cores in order to improve the data-processing throughput of individual bottleneck tasks by “borrowing” resources from neighbors in the stream. Our technique is distributed and scalable because all runtime load-balancing decisions are based on point-to-point handshake signals exchanged between neighboring cores. Load balancing with flexible filters increases the system-level processing throughput of stream applications, particularly those with large dynamic variations in the computational load of their tasks. We empirically evaluate flexible filters in a homogeneous multicore environment over a suite of five real-world stream programs.***

</details>

#### 4.(RF = 8) L.P. Carloni and A.L. Sangiovanni-Vincentelli, [A Framework for Modeling the Distributed Deployment of Synchronous Designs](http://www.cs.columbia.edu/~luca/research/dsddaFMSD06.pdf), in Formal Methods in Systems Design - An International Journal, © Springer-Verlag, Vol. 28, No. 2, March 2006.

This paper is the officially published version of paper 5 (paper 5 was published in a workshop). I think this paper is one of the most important papers for me to read. It formally models the LID circuit (to be summarised after reading it).
<details>
<summary>Abstract</summary>

*Synchronous specifications are appealing in the design of large scale hardware and software systems because of their properties that facilitate verification and synthesis. When the target architecture is a distributed system, implementing a synchronous specification as a synchronous design may be inefficient in terms of both size (memory for software implementations or area for hardware implementations) and performance. A more elaborate implementation style where the basic synchronous paradigm is adapted to distributed architectures by introducing elements of asynchrony is, hence, highly desirable. **Building on the tagged-signal model, we present a modeling for the distributed deployment of synchronous design. We offer a comparative exposition of various design approaches (synchronous, asynchronous, GALS, latency-insensitive, and synchronous programming) and we provide some insight on the role of signal absence in modeling synchronization in distributed concurrent systems. Finally, we compare two distinct methodologies, desynchronization and latency-insensitive design, and we elaborate on possible options to combine their results.***

</details>

#### ~~5.(RF = 9) L.P. Carloni and A.L. Sangiovanni-Vincentelli, [A Formal Modeling Framework for Deploying Synchronous Designs on Distributed Architectures](http://www.cs.columbia.edu/~luca/research/dsddaFMGALS03.pdf), First International Workshop on Formal Methods for Globally Asynchronous Locally Synchronous Architectures (FMGALS), 2003.~~

See the newer version in 4.

<details>
<summary>Abstract</summary>

*Synchronous specifications are appealing in the design of large scale hardware and software systems because of their properties that facilitate verification and synthesis.
When the target architecture is a distributed system, implementing a synchronous specification as a synchronous design may be inefficient in terms of both size (memory for software implementations or area for hardware implementations) and performance. A more elaborate implementation style where the basic synchronous paradigm is adapted to distributed architectures by introducing elements of asynchrony is, hence, highly desirable. This approach has to conjugate the desire of maintaining the theoretical properties of synchronous designs with the efficiency of implementations where the constraints imposed by synchrony are relaxed. Two interesting avenues have been recently pursued to achieve this goal:*

1. *Latency insensitive protocols [9,10] motivated by hardware implementations, where long paths between the design components may introduce delays that force the overall clock of the system to run too slow in order to maintain synchronous behavior. This approach introduces additional elements in the design to allow the implementation to maintain the throughput that could have been achieved with communication delays of the same order of the clock of the subsystems at the price of additional latency.*
2. *Desynchronization [3,4,20] motivated by software implementations, where processes that compose the large scale system are locally implemented synchronously while their communication is implemented in an asynchronous style. This approach allows also to run each of the process at its own “speed”.*

***By using the Lee and Sangiovanni-Vincentelli (LSV) tagged-signal model [19] as a common framework, we offer a comparative exposition of these approaches and we show their precise relationship. In doing so, we also provide some insight on the role of signal absence in synchronous, asynchronous, and globally-asynchronous locally-synchronous (GALS) design styles.***

</details>

#### 6.(RF = 7) L.P. Carloni and A.L. Sangiovanni-Vincentelli, [Combining Retiming and Recycling to Optimize the Performance of Synchronous Circuits](http://www.cs.columbia.edu/~luca/research/rscSBCCI03.pdf), Proceedings of the 16th Symposium on Integrated Circuits and System Design (SBCCI), 2003.

I was kind of disappointed by the failure of C-slowing retiming last month. And this paper makes me excited again! It shows how to do retiming and recycling to improve the performance of LID.

<details>
<summary>Abstract</summary>

*Recycling was recently proposed as a system-level design technique to facilitate the building of complex System-on-Chips (SOC) by assembling pre-designed components. Recycling allows us to model the communication patterns among the components, analyze the impact of interconnect latency on the overall data processing throughput, and manage computation/communication tradeoffs to optimize the performance of the system. **In this paper, we present recycling as a circuit-level design technique for optimizing the performance of sequential circuits beyond what can be achieved by retiming. We also provide a theoretical framework to guide the simultaneous application of the two techniques. Our model identifies the conditions under which an optimally-retimed synchronous circuit can be further sped-up and determines the amount of the resulting performance gain.***

</details>

#### 7.(RF = 8) L.P. Carloni, K.L. McMillan, A. Saldanha, and A.L. Sangiovanni-Vincentelli, [A Methodology for Correct-by-Construction Latency-Insensitive Design](http://www.cs.columbia.edu/~luca/research/lidBestOfICCAD.pdf), Reprinted (first published in 1999) as selected paper in A. Kuehlmann (Ed.), "The Best of ICCAD - 20 Years of Excellence in Computer-Aided Design", Kluwer Academic Publishers, 2003.

<details>
<summary>Abstract</summary>

*In Deep Sub-Micron (DSM) designs, performance will depend critically on the latency of long wires.
We propose a new synthesis methodology for synchronous systems that makes the design functionally insensitive to the latency of long wires. **Given a synchronous specification of a design, we generate a functionally equivalent synchronous implementation that can tolerate arbitrary communication latency between latches. By using latches we can break a long wire in short segments which can be traversed while meeting a single clock cycle constraint. The overall goal is to obtain a design that is robust with respect to delays of long wires, in a shorter time by reducing the multiple iterations between logical and physical design, and with performance that is optimized with respect to the speed of the single components of the design. In this paper we describe the details of the proposed methodology as well as report on the latency insensitive design of PDLX, an out-of-order microprocessor with speculative-execution.***

</details>

#### 8.(RF = 8) L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli, [Theory of Latency-Insensitive Design](http://www.cs.columbia.edu/~luca/research/lipTransactions.pdf), IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Vol. 20, No. 9, September 2001.

<details>
<summary>Abstract</summary>

*The theory of latency-insensitive design is presented as the foundation of a new correct-by-construction methodology to design complex systems by assembling intellectual property components. Latency-insensitive designs are synchronous distributed systems and are realized by composing functional modules that exchange data on communication channels according to an appropriate protocol. The protocol works on the assumption that the modules are stallable, a weak condition to ask them to obey. The goal of the protocol is to guarantee that latency-insensitive designs composed of functionally correct modules behave correctly independently of the channel latencies.
This allows us to increase the robustness of a design implementation because any delay variations of a channel can be “recovered” by changing the channel latency while the overall system functionality remains unaffected. As a consequence, an important application of the proposed theory is represented by the latency-insensitive methodology to design large digital integrated circuits by using deep submicrometer technologies.*

</details>

This paper formally describes the LID model and proves its properties. I think it can serve as a foundation for our analysis, or at least we can borrow some existing terms from it.

#### 9.(RF = 8) L.P. Carloni and A.L. Sangiovanni-Vincentelli, [Performance Analysis and Optimization of Latency-Insensitive Systems](http://www.cs.columbia.edu/~luca/research/rrrDAC00.pdf), The Proceedings of the Design Automation Conference (DAC), 2000.

<details>
<summary>Abstract</summary>

*Latency insensitive design has been recently proposed in literature as a way to design complex digital systems, whose functional behavior is robust with respect to arbitrary variations in interconnect latency. However, this approach does not guarantee the same robustness for the performance of the design, which indeed can experience big losses. **This paper presents a simple, yet rigorous, method to (1) model the key properties of a latency insensitive system, (2) analyze the impact of interconnect latency on the overall throughput, and (3) optimize the performance of the final implementation.***

</details>

This paper is interesting. It formally models the LID graph and uses recycling to insert buffers between the connections to perform retiming. I think this may be useful when we apply multi-clock-domain techniques to different components. We can insert buffers into the handshaking connections to increase the clock frequency if that is the bottleneck.

#### ~~10.(RF = 8) L.P. Carloni, K.L. McMillan, A. Saldanha, and A.L. Sangiovanni-Vincentelli, [A Methodology for Correct-by-Construction Latency-Insensitive Design](http://www.cs.columbia.edu/~luca/research/lidICCAD99.pdf), The Proceedings of the International Conference on Computer-Aided Design (ICCAD), 1999.~~

See the newer version in 7.

<details>
<summary>Abstract</summary>

*In Deep Sub-Micron (DSM) designs, performance will depend critically on the latency of long wires. We propose a new synthesis methodology for synchronous systems that makes the design functionally insensitive to the latency of long wires. Given a synchronous specification of a design, we generate a functionally equivalent synchronous implementation that can tolerate arbitrary communication latency between latches. By using latches we can break a long wire in short segments which can be traversed while meeting a single clock cycle constraint. The overall goal is to obtain a design that is robust with respect to delays of long wires, in a shorter time by reducing the multiple iterations between logical and physical design, and with performance that is optimized with respect to the speed of the single components of the design. **In this paper we describe the details of the proposed methodology as well as report on the latency insensitive design of PDLX, an out-of-order microprocessor with speculative-execution.***

</details>

#### 11.(RF = 6) L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli, [Latency-Insensitive Protocols](http://www.cs.columbia.edu/~luca/research/lipCAV99.pdf), In N. Halbwachs and D. Peled, editors, Proc. of the 11th Intl. Conf. on Computer-Aided Verification (CAV), LNCS 1633, © Springer-Verlag, 1999.

<details>
<summary>Abstract</summary>

*The theory of latency insensitive design is presented as the foundation of a new correct by construction methodology to design very large digital systems by assembling blocks of Intellectual Properties.
Latency insensitive designs are synchronous distributed systems and are realized by assembling functional modules exchanging data on communication channels according to an appropriate protocol. The goal of the protocol is to guarantee that latency insensitive designs composed of functionally correct modules, behave correctly independently of the wire delays. A latency-insensitive protocol is presented that makes use of relay stations buffering signals propagating along long wires. To guarantee correct behavior of the overall system, modules must satisfy weak conditions. The weakness of the conditions makes our method widely applicable.*

</details>

### Automatic Design Exploration
---

#### 12.(RF = 8) L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, [COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators](http://www.cs.columbia.edu/~luca/research/piccolboni_CODES17.pdf), Presented at the International Conference on Hardware/Software Codesign & System Synthesis (CODES+ISSS), 2017. ACM Transactions on Embedded Computing Systems, Vol. 16, No. 5s, September 2017.

*Abstract: Hardware accelerators are key to the efficiency and performance of system-on-chip (SoC) architectures. With high-level synthesis (HLS), designers can easily obtain several performance-cost trade-off implementations for each component of a complex hardware accelerator. **However, navigating this design space in search of the Pareto-optimal implementations at the system level is a hard optimization task. We present COSMOS, an automatic methodology for the design-space exploration (DSE) of complex accelerators, that coordinates both HLS and memory optimization tools in a compositional way.** First, thanks to the co-design of datapath and memory, COSMOS produces a large set of Pareto-optimal implementations for each component of the accelerator.
Then, COSMOS leverages compositional design techniques to quickly converge to the desired trade-off point between cost and performance at the system level. When applied to the system-level design (SLD) of an accelerator for wide-area motion imagery (WAMI), COSMOS explores the design space as completely as an exhaustive search, but it reduces the number of invocations to the HLS tool by up to 14.6×.*

#### 13.(RF = 7) L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, [Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms](http://www.cs.columbia.edu/~luca/research/piccolboni_HPEC17.pdf), Proceedings of the Twenty-First Annual Conference on High Performance Extreme Computing (HPEC), 2017.

*Abstract: Accelerators are specialized hardware designs that generally guarantee two to three orders of magnitude higher energy efficiency than general-purpose processor cores for their target computational kernels. To cope with the complexity of integrating many accelerators into heterogeneous systems, we have proposed Embedded Scalable Platforms (ESP) that combines a flexible architecture with a companion system-level design (SLD) methodology. In ESP, we leverage high-level synthesis (HLS) to expedite the design of accelerators, improve the process of design-space exploration (DSE), and promote the reuse of accelerators across different target systems-on-chip (SoCs). HLS tools offer a powerful set of parameters, known as knobs, to optimize the architecture of an accelerator and evaluate different trade-offs in terms of performance and costs. However, exploring a large region of the design space and identifying a rich set of Pareto-optimal implementations are still complex tasks.
The standard knobs, in fact, operate only on loops and functions present in the high-level specifications, but they cannot work on other key aspects of SLD such as I/O bandwidth, on-chip memory organization, and trade-offs between the size of the local memory and the granularity at which data is transferred and processed by the accelerators. To address these limitations, we augmented the set of HLS knobs for ESP with three additional knobs, named eXtended Knobs (XKnobs). **We used the XKnobs for exploring two selected kernels of the wide-area motion imagery (WAMI) application.** Experimental results show that the DSE is broadened by up to 8.5x for the performance figure (latency) and 3.5x for the implementation costs (area) compared to use only the standard knobs.*

#### 14.(RF = 6) M. M. Ziegler, H.-Y. Liu, and L. P. Carloni, [Scalable Auto-Tuning of Synthesis Parameters for Optimizing High-Performance Processors](http://www.cs.columbia.edu/~luca/research/ziegler_ISLPED16.pdf), The Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2016.

*Abstract: Modern logic and physical synthesis tools provide numerous options and parameters that can drastically impact design quality; however, the large number of options leads to a complex design space difficult for human designers to navigate. By employing intelligent search strategies and parallel computing we can tackle this parameter tuning problem, thus automating one of the key design tasks conventionally performed by a human designer. In this paper we present a novel learning-based algorithm for synthesis parameter optimization. This new algorithm has been integrated into our existing autonomous parameter-tuning system, which was used to design multiple 22nm industrial chips and is currently being used for 14nm chips.
These techniques show, on average, over 40% reduction in total negative slack and over 10% power reduction across hundreds of 14nm industrial processor macros while reducing overall human design effort. We also present a new higher-level system that manages parameter tuning of multiple designs in a scalable way. This new system addresses the needs of large design teams by prioritizing the tuning effort to maximize returns given the available compute resources.*

#### 15.(RF = 7) Y. Jung, J. Koo, K. Stratos, and L. P. Carloni, [A Probabilistic Ranking Model for Audio Stream Retrieval](https://dl.acm.org/citation.cfm?id=2927013), International Workshop on Multimedia Analysis and Retrieval for Multimodal Interaction (MARMI), 2016.

*Abstract: In Audio Stream Retrieval (ASR) systems, clients periodically query an audio database with an audio segment taken from the input audio stream to keep track of the flow of the stream in the original content sources or to compare two differently edited streams. **We recently developed a series of ASR applications such as broadcast monitoring systems, automatic caption fetching systems, and automatic media edit tracking systems. Based on this experience, we propose a probabilistic ranking model designed for ASR systems. In order to train and test the model, we create a new set of audio streams and make it publicly available.** Our experiments with these new streams confirm that the proposed ranking model works effectively with the retrieved results and reduces the errors when used in various ASR applications.*

#### 16.(RF = 5) M. M. Ziegler, H.-Y. Liu, G. Gristede, B. Owens, R. Nigaglioni, and L. P. Carloni, [A Synthesis-Parameter Tuning System for Autonomous Design-Space Exploration](http://www.cs.columbia.edu/~luca/research/ziegler_DATE16.pdf), The Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2016.

*Abstract: Advanced logic and physical synthesis tools provide a vast number of tunable parameters that can significantly impact physical design quality, but the complexity of the parameter design space requires intelligent search algorithms. To fully utilize the optimization potential of these tools, we propose SynTunSys, a system that adds a new level of abstraction between designers and design tools for managing the design space exploration process. SynTunSys takes control of the synthesis-parameter tuning process, i.e., job submission, results analysis, and next-step decision making, by automating a key portion of a human designer’s decision process. We present the overall organization of SynTunSys, describe its main components, and provide results from employing it for the design of an industrial chip, the IBM z13 22nm high-performance server chip. During this major design, SynTunSys provided significant savings in human design effort and achieved a quality of results beyond what human designers alone could achieve, yielding on average a 36% improvement in total negative slack and a 7% power reduction.*

#### 17.(RF = 8) R. K. Brayton, L. P. Carloni, A.L. Sangiovanni-Vincentelli, and T. Villa, [Design Automation of Electronic Systems: Past Accomplishments and Challenges Ahead [Scanning the Issue]](https://ieeexplore.ieee.org/document/7302635?arnumber=7302635), The Proceedings of the IEEE, Vol. 103, No. 11, November 2015.

*Abstract: The articles in this special issue provides an overview of and a perspective on the evolution of electronic design automation (EDA), and offers a perspective on some of the principal avenues of future development.*

#### 18.(RF = 5) Y. Jung, M. Petracca, and L. P. Carloni, [Cloud-Aided Design for Distributed Embedded Systems](http://www.cs.columbia.edu/~luca/research/jung_IEEEDT.pdf), IEEE Design & Test, Vol. 31, No. 4, July/August 2014.

*Abstract: This paper presents how to use cloud computing for designing distributed embedded systems. The cloud is used as a simulation platform. The simulation environment targets the design and testing of distributed embedded systems executing applications that can access cloud services. A networked VP can run on a cloud through the infrastructure as a service (IaaS) model.*

#### 19.(RF = 6) G. Di Guglielmo, C. Pilato, and L. P. Carloni, [A Design Methodology for Compositional High-Level Synthesis of Communication-Centric SoCs](https://ieeexplore.ieee.org/document/6881455), The Proceedings of the Design Automation Conference (DAC), 2014.

*Abstract: Systems-on-chip are increasingly designed at the system level by combining synthesizable IP components that operate concurrently while interacting through communication channels. CAD-tool vendors support this System-Level Design approach with high-level synthesis tools and libraries of interface primitives implementing the communication protocols. These interfaces absorb timing differences in the hardware-component implementations, thus enabling compositional design. However, they introduce also new challenges in terms of functional correctness and performance optimization. We propose a methodology that combines performance analysis and optimization algorithms to automatically address the issues that SoC designers may accidentally introduce when assembling components that are specified at the system level.*

#### 20.(RF = 7) H.-Y. Liu and L.P. Carloni, [On Learning-Based Methods for Design-Space Exploration with High-Level Synthesis](https://ieeexplore.ieee.org/document/6560643), The Proceedings of the Design Automation Conference (DAC), 2013.

*Abstract: This paper makes several contributions to address the challenge of supervising HLS tools for design space exploration (DSE).
We present a study on the application of learning-based methods for the DSE problem, and propose a learning model for HLS that is superior to the best models described in the literature. In order to speedup the convergence of the DSE process, we leverage transductive experimental design, a technique that we introduce for the first time to the CAD community. Finally, we consider a practical variant of the DSE problem, and present a solution based on randomized selection with strong theory guarantee.* #### 21.(RF = 6) H.-Y. Liu, I. Diakonikolas, M. Petracca, and L.P. Carloni, [Supervised Design Space Exploration by Compositional Approximation of Pareto Sets](http://www.cs.columbia.edu/~luca/research/capsDAC11.pdf), The Proceedings of the Design Automation Conference (DAC), 2011. *Abstract: Technology scaling allows the integration of billions of transistors on the same die but CAD tools struggle in keeping up with the increasing design complexity. Design productivity for multi-core SoCs increasingly depends on creating and maintaining reusable components and hierarchically combining them to form larger composite cores. Characterizing such composite cores with respect to their power/performance tradeoffs is critical for design reuse across various products and relies heavily on synthesis tools. We present CAPS, an online adaptive algorithm that efficiently explores the design space of any given core and returns an accurate characterization of its implementation tradeoffs in terms of an approximate Pareto set. It does so by supervising the order of the time-consuming logic-synthesis runs on the core’s components. Our algorithm can provably achieve the desired precision on the approximation in the shortest possible time, without having any a-priori information on any component. We also show that, in practice, CAPS works even better than what is guaranteed by the theory.* ### Memory Optimisation --- #### 22.(RF = 6) D. Giri, P. Mantovani and L. P. 
Carloni, [Runtime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms](https://sld.cs.columbia.edu/pubs/giri_aspdac19.pdf), (Invited Paper). Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC), 2019. This work applies both dynamic and static techniques to the memory architecture. It may be useful if I move on to off-chip memory. *Abstract: In heterogeneous systems-on-chip, the optimal choice of the cache-coherence model for a loosely-coupled accelerator may vary at each invocation, depending on workload and system status. **We propose a runtime adaptive algorithm to manage the coherence of accelerators. The algorithm’s choices are based on the combination of static and dynamic features of the active accelerators and their workloads.** We evaluate the algorithm by leveraging our FPGA-based platform for rapid SoC prototyping. Experimental results, obtained through the deployment of a multi-core and multi-accelerator system that runs Linux SMP, show the benefits of our approach in terms of execution time and memory accesses.* #### 23.(RF = 5) E. G. Cota, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, [An Analysis of Accelerator Coupling in Heterogeneous Architectures](http://www.cs.columbia.edu/~luca/research/cota_DAC15.pdf), The Proceedings of the Design Automation Conference (DAC), 2015. *Abstract: Existing research on accelerators has emphasized the performance and energy efficiency improvements they can provide, devoting little attention to practical issues such as accelerator invocation and interaction with other on-chip components (e.g. cores, caches). In this paper we present a quantitative study that considers these aspects by implementing seven high-throughput accelerators following three design models: tight coupling behind a CPU, loose out-of-core coupling with Direct Memory Access (DMA) to the LLC, and loose out-of-core coupling with DMA to DRAM. 
A salient conclusion of our study is that working sets of non-trivial size are best served by loosely-coupled accelerators that integrate private memory blocks tailored to their needs.* #### 24.(RF = 6) C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, [System-Level Memory Optimization for High-Level Synthesis of Component-Based SoCs](http://www.cs.columbia.edu/~luca/research/pilato_CODESISSS2014.pdf), The Proceedings of the International Conference on Hardware/Software Codesign & System Synthesis (CODES+ISSS), 2014. *Abstract: The design of specialized accelerators is essential to the success of many modern Systems-on-Chip. Electronic system-level design methodologies and high-level synthesis tools are critical for the efficient design and optimization of an accelerator. Still, these methodologies and tools offer only limited support for the optimization of the memory structures, which are often responsible for most of the area occupied by an accelerator. To address these limitations, we present a novel methodology to automatically derive the memory subsystems of SoC accelerators. Our approach enables compositional design-space exploration and promotes design reuse of the accelerator specifications. We illustrate its effectiveness by presenting experimental results on the design of two accelerators for a high-performance embedded application.* #### 25.(RF = 8) H.-Y. Liu, M. Petracca, and L.P. Carloni, [Compositional System-Level Design Exploration with Planning of High-Level Synthesis](http://www.cs.columbia.edu/~luca/research/liu_DATE12.pdf), The Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2012. (Best Paper Award). *Abstract: The growing complexity of System-on-Chip (SoC) design calls for an increased usage of transaction-level modeling (TLM), high-level synthesis tools, and reuse of pre-designed components. 
In the framework of a compositional methodology for efficient SoC design exploration we present three main contributions: a concise library format for characterization and reuse of components specified in high-level languages like SystemC; an algorithm to prune alternative implementations of a component given the context of a specific SoC design; and an algorithm that explores compositionally the design space of the SoC and produces a detailed plan to run high-level synthesis on its components for the final implementation. The two algorithms are computationally efficient and enable an effective parallelization of the synthesis runs. Through a case study, we show how our methodology returns the essential properties of the design space at the system level by combining the information from the library of components and by identifying automatically those having the most critical impact on the overall design.* ### Others --- #### 26.(RF = 5) L. P. Carloni, A. B. Kahng, S. Muddu, A. Pinto, K. Samadi, and P. Sharma, [Accurate Predictive Interconnect Modeling for System-Level Design](http://www.cs.columbia.edu/~luca/research/nocs_TransVLSI10.pdf), IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 18, No. 4, April 2010. *Abstract: We propose new accurate predictive models for the delay, power, and area of buffered interconnects to enable a more effective system-level design exploration with existing and future nanometer technology processes. We show that our models are significantly more accurate than previous models—essentially matching sign-off analyses. We integrate our models in the COSI-OCC communication synthesis infrastructure and show how they impact the feasibility and optimality of the network-on-chip architectures that are synthesized by this tool.* #### 27.(RF = 6) L.P. Carloni, R. Passerone, A. Pinto and A.L. 
Sangiovanni-Vincentelli, [Languages and Tools for Hybrid Systems Design](http://www.cs.columbia.edu/~luca/research/hybridFnT.pdf), Foundations and Trends in Electronic Design Automation, Vol. 1, No. 1/2, Jul 2006. *Abstract: The explosive growth of embedded electronics is bringing information and control systems of increasing complexity to every aspects of our lives. The most challenging designs are safety-critical systems, such as transportation systems (e.g., airplanes, cars, and trains), industrial plants and health care monitoring. The difficulties reside in accommodating constraints both on functionality and implementation. The correct behavior must be guaranteed under diverse states of the environment and potential failures; implementation has to meet cost, size, and power consumption requirements. The design is therefore subject to extensive mathematical analysis and simulation. However, traditional models of information systems do not interface well to the continuous evolving nature of the environment in which these devices operate. Thus, in practice, different mathematical representations have to be mixed to analyze the overall behavior of the system. Hybrid systems are a particular class of mixed models that focus on the combination of discrete and continuous subsystems. There is a wealth of tools and languages that have been proposed over the years to handle hybrid systems. However, each tool makes different assumptions on the environment, resulting in somewhat different notions of hybrid system. This makes it difficult to share information among tools. Thus, the community cannot maximally leverage the substantial amount of work that has been directed to this important topic. In this paper, we review and compare hybrid system tools by highlighting their differences in terms of their underlying semantics, expressive power and mathematical mechanisms. 
We conclude our review with a comparative summary, which suggests the need for a unifying approach to hybrid systems design. As a step in this direction, we make the case for a semantic-aware interchange format, which would enable the use of joint techniques, make a formal comparison between different approaches possible, and facilitate exporting and importing design representations.* #### 28.(RF = 3) L. Piccolboni, G. Di Guglielmo, and L. P. Carloni, [KAIROS: Incremental Verification in High-Level Synthesis through Latency-Insensitive Design](https://sld.cs.columbia.edu/pubs/piccolboni_fmcad19.pdf), Formal Methods in Computer-Aided Design (FMCAD), 2019. This is more related to heavyweight verification. John may be interested in this work. *Abstract: High-level synthesis (HLS) improves design productivity by replacing cycle-accurate specifications with untimed or transaction-based specifications. Obtaining high-quality RTL implementations requires significant manual effort from designers, who must manipulate the code and evaluate different HLS-knob settings. These modifications can introduce bugs in the RTL implementations. We present KAIROS, a methodology for incremental formal verification in HLS. KAIROS verifies the equivalence of the RTL implementations the designer subsequently derives from the same specification by applying code manipulations and knobs.* #### 29.(RF = 1) L. Piccolboni, G. Di Guglielmo, and L. P. Carloni, [Securing Accelerators with Dynamic Information Flow Tracking](https://arxiv.org/pdf/1903.06801.pdf), Hardware Demo presented at the IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 2019. Security implementation in HLS. *Abstract: Systems-on-chip (SoCs) are becoming heterogeneous: they combine general-purpose processor cores with application-specific hardware components, also known as accelerators, to improve performance and energy efficiency. 
The advantages of heterogeneity, however, come at a price of threatening security. The architectural dissimilarities of processors and accelerators require revisiting the current security techniques. With this hardware demo, we show how accelerators can break dynamic information flow tracking (DIFT), a well-known security technique that protects systems against software-based attacks. We also describe how the security guarantees of DIFT can be re-established with a hardware solution that has low performance and area penalties.* #### 30.(RF = 3) D. Jahier Pagliari, M. R. Casu, and L. P. Carloni, [Acceleration of Microwave Imaging Algorithms for Breast Cancer Detection via High-Level Synthesis](https://ieeexplore.ieee.org/document/7357152), The Proceedings of the International Conference on Computer Design (ICCD), 2015. Potential benchmarks. *Abstract: We present the system-level design of two accelerators for two microwave imaging algorithms for breast cancer detection. The accelerators were designed in SystemC and optimized via High-Level Synthesis (HLS). The two algorithms stress the capabilities of commercial HLS tools in different ways: the first is communication-bound and requires careful pipelining of communication and computation; the second is computation-bound and requires the implementation of mathematical functions that are not properly supported by HLS tools. Still, in the span of four months we were able to design and validate about one hundred alternative implementations, targeting a Zynq SoC platform. Furthermore, we were pleased to obtain results that are superior to a previous RTL implementation, which confirms the remarkable progress of HLS tools.* #### 31.(RF = 2) K. Bhardwaj, P. Mantovani, L. P. Carloni and S. M. 
Nowick, [Towards a Complete Methodology for Synthesizing Bundled-Data Asynchronous Circuits on FPGAs](https://ieeexplore.ieee.org/abstract/document/8824912), The Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2019. *Abstract: Asynchronous circuits are gaining momentum as a promising low-power alternative to the conventional synchronous design approaches. In particular, single-rail bundled-data design style has seen significant interest both for designing GALS systems and in the emerging area of neuromorphic computing. However, there has been only limited research on implementing these asynchronous circuits on commercial FPGAs, which can be challenging due to the use of relative timing constraints in these designs for correct operation. This paper proposes a systematic CAD methodology to synthesize efficiently bundled-data asynchronous circuits on commercial FPGAs, achieving a two-fold goal for the target implementation: robustness and high performance. The methodology is targeted to the existing Xilinx Vivado tool set. As a case study, two asynchronous NoC switches are prototyped on Xilinx Virtex 7 in 28 nm: one supporting unicast, and the other also handling multicast. The former shows significant energy and idle power improvements, with some performance benefits, over a high-performance synchronous FPGA-based switch. The asynchronous multicast router also shows promising energy and performance results. Although a NoC case study is used, the proposed approach is general and can be used for other bundled-data asynchronous circuits.* #### 32.(RF = 1) N. Bombieri, H.-Y. Liu, F. Fummi, and L.P. Carloni, [A Method to Abstract RTL IP Blocks into C++ Code and Enable High-Level Synthesis](http://www.cs.columbia.edu/~luca/research/bombieri_DAC13.pdf), The Proceedings of the Design Automation Conference (DAC), 2013. 
*Abstract: We present a method to automatically generate a synthesizable C++ specification from the given RTL design of an IP block, by abstracting away most of its micro-architectural characteristics while preserving its functionality. The goal is twofold: recover the IP block specification for system-level design, and enable the derivation of more optimized implementations through high-level synthesis. The C++ specification can be generated with different interfaces thus allowing the IP model to be reused across different system platforms. Experimental results show that the proposed approach not only enhances the reusability of the recovered IP block but also unveils a richer design space to explore.*