# COWOMO'18 Abstracts
### Florian Arrestier, IETR - INSA Rennes
#### Papify: automated instrumentation for rapid prototyping and design space exploration
When doing design space exploration, it is necessary to have metrics to determine the value of a given solution. There are two possible paths to gather those metrics. A first approach is to build model(s) of the application and the system being designed and predict the desired values. A second approach is to perform real measurements on the application and the system.
In this demonstration, we address the latter approach and present PAPIFY, an automated instrumentation tool for dataflow-based applications. PAPIFY is integrated into the rapid prototyping tool PREESM. PREESM is a framework that uses a high-level description of the target architecture (S-LAM model), a dataflow description of the application (PiSDF MoC) and a scenario defining constraints linking the two. PREESM schedules and maps the application and generates compilable code for the target architecture within seconds, based on the constraints defined in the scenario. Using PAPIFY, the user can define a list of events to be monitored for specific actors through a simple graphical interface. The event monitoring calls are automatically inserted into the generated code, giving users access to real measurement values within seconds.
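The instrumentation pattern can be illustrated with a small Python analogue (the real tool inserts PAPI calls into the C code generated by PREESM; the actor name, event names, and registry below are purely illustrative):

```python
import time
from collections import defaultdict

# Events the user selected per actor in the graphical interface
# (names are illustrative; the real tool records PAPI hardware events).
monitored_events = {"sobel_filter": ["firings", "exec_time_us"]}
results = defaultdict(dict)

def papify(actor_name):
    """Wrap an actor with monitoring calls, mimicking the calls PAPIFY
    inserts automatically into the code generated by PREESM."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            elapsed_us = (time.perf_counter() - start) * 1e6
            stats = results[actor_name]
            if "firings" in monitored_events[actor_name]:
                stats["firings"] = stats.get("firings", 0) + 1
            if "exec_time_us" in monitored_events[actor_name]:
                stats["exec_time_us"] = stats.get("exec_time_us", 0.0) + elapsed_us
            return out
        return wrapper
    return decorator

@papify("sobel_filter")
def sobel_filter(pixels):
    # Placeholder actor body; real actor code is generated by PREESM.
    return [p * 2 for p in pixels]

sobel_filter([1, 2, 3])
sobel_filter([4, 5, 6])
print(results["sobel_filter"]["firings"])  # prints 2
```

In the generated C code, the same wrapping is performed with PAPI event sets around each monitored actor firing.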
### Daniel Madronal, UPM
#### Many-core energy-based LSLA model: development, microbenchmarking and future applications
In this work, we propose to generate power and energy consumption models for the MPPA-256 Bostan manycore platform by Kalray. This platform gathers 2 Input/Output (I/O) subsystems together with 16 clusters of 16 processors each. To carry out the communication among resources, each I/O subsystem has 4 DMAs and each cluster has 1 DMA. These DMAs are connected to a Network on Chip (NoC). To simplify the models, we are working on generating a Linear System-Level Architecture (LSLA) Model of Architecture (MoA). To do so, we consider the clusters as the processing elements, while a combination of several communication nodes represents the NoC and the DMAs connecting all the resources. To perform this modelling, we focus on the use of microbenchmarks to stress each resource independently.
Once this analysis is finished, we will use the mathematical model of the platform to automatically map and schedule applications on the target platform. To do so, we will use the PREESM framework, which will help us speed up the process. Finally, we want to go one step further and define a new concept: the entropy of a dataflow graph. Analogously to the relationship between parallelism and execution time, entropy would help developers intuitively identify the least energy-consuming network when several of them are compared. Preliminary studies have suggested that, given a specific workload, the energy consumed during the processing part of the system remains almost constant when parallelism increases. Energy consumption variations are instead highly related to the data transmission strategy. Thus, the dataflow graph entropy would be strongly associated with the data communication distribution.
### Johan Lilius, Abo Akademi
#### Task-based run-times for dataflow networks
Task-based run-times have been proposed as a way to achieve good automatic load-balancing on homogeneous multi-cores. Originally implemented in the Cilk language, such features have been incorporated into mainstream products like Intel Cilk++ and OpenMP. In recent work [SIPS2017], we have shown how to translate dataflow graphs to be executed on a task-based run-time (WOOL). The advantage of this approach is that it gives a naturally self-timed execution of the dataflow graph without the need for extensive offline scheduling. The self-timed execution also removes the need for synchronisations when actors in the dataflow network have very different execution times. Furthermore, the approach is independent of the number of cores, and thus allows the application to scale.
In this talk we will review the basic ideas of the translation, discuss its restrictions, and present some extensions towards heterogeneous architectures.
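Self-timed execution of a dataflow graph can be sketched with a minimal two-actor pipeline (the actual translation targets the WOOL task library in C; the actor names and structure here are illustrative):

```python
import queue
import threading

# A two-actor pipeline (producer -> doubler) executed in self-timed
# fashion: each actor fires as soon as its input tokens are available,
# with no global schedule and no explicit synchronisation points.
edge = queue.Queue()
out = []

def producer():
    for v in range(5):
        edge.put(v)          # emit one token per firing
    edge.put(None)           # end-of-stream marker

def doubler():
    while True:
        token = edge.get()   # blocks until a token arrives (self-timing)
        if token is None:
            break
        out.append(token * 2)

threads = [threading.Thread(target=producer), threading.Thread(target=doubler)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out)  # [0, 2, 4, 6, 8]
```

Because each firing is triggered purely by token availability, the same program runs unchanged regardless of how many cores execute the underlying task pool.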
### Raquel Lazcano
#### Towards combining polyhedral optimizations and dataflow
### Deepayan Bhowmik
#### Area-Energy Aware Dataflow Optimisation of Visual Tracking Systems
This work presents an orderly dataflow-optimisation approach suitable for area- and energy-aware computer vision applications on FPGAs. Vision systems are increasingly being deployed in power-constrained scenarios, where the dataflow model of computation has become popular for describing complex algorithms. The dataflow model allows processing datapaths composed of several independent and well-defined computations. However, compilers are often unsuccessful in identifying domain-specific optimisation opportunities, resulting in wasted resources and power consumption. We present a methodology for the optimisation of dataflow networks, according to patterns often found in computer vision systems, focusing on identifying optimisations which are not discovered automatically by an optimising compiler. Code transformation using profiling and refactoring provides opportunities to optimise the design, targeting FPGA implementations and focusing on area and power reduction. Applying our refactoring methodology to a complex visual tracking algorithm resulted in significant reductions in power consumption and resource usage.
### Bruno Bodin
#### Navigating the Real-time 3D Scene Understanding Landscape
The visual understanding of 3D environments in real-time and at low power is a major computational challenge. This is central to applications such as industrial robotics and autonomous vehicles. In this presentation we will discuss the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable the delivery of SLAM (Simultaneous Localisation and Mapping), by supporting application specialists in selecting and configuring the appropriate algorithm and the appropriate hardware to meet their performance, accuracy, and energy consumption goals.
### Eduardo Juarez & Ruben Salvador
#### UPM-CITSEM-GDEM Research Activities Overview
This presentation provides a general overview of the current active research lines at the UPM-CITSEM-GDEM research group:
Although dataflow specifications expose intrinsic parallelism, the paradigm itself does not exploit the potential parallelism available inside actors. The combination of polyhedral transformations and dataflow can be a step forward towards larger degrees of parallelization. The idea of this research line is to include polyhedral transformations within the Preesm and Spider workflows both at design and run-time.
Mathematical energy consumption Models of Architecture (MoA) to automatically map and schedule applications on a target platform are needed for fast Design Space Exploration (DSE). Power and energy consumption models that belong to the class of Linear System-Level Architecture (LSLA) MoAs, are being explored in this research line for the manycore MPPA-256 Bostan device. A first version of the automatic C code instrumentation using PAPI within the PREESM framework, using the in-house developed PAPIFY tool, is presented. The intention is to include this monitoring tool within the MPPA to enable real-time power and energy consumption estimation to drive system self-adaptation.
Architecture and performance modeling aims to assist designers during the DSE phase by providing early design estimations of different runtime system features. Current proposals suffer from one main issue: the model strongly depends on the applications used as training inputs, losing estimation accuracy for applications not appearing in the training set. This work investigates methods for application-agnostic training of performance/efficiency MoAs. To that end, it will study the use, as training inputs, of standard computation and communication pattern operators found to be the constituent parts of an algorithm decomposition. The expectation is that, for different applications, compositionality arises from the combination of the (estimations of the) different pattern operators that build that particular new application. Target platforms are embedded GPUs and FPGA architectures defined using HLS.
Cyber-Physical Systems are entities expected to operate autonomously in highly uncertain environments, for which neither traditional system design techniques nor hardware-only dynamic reconfiguration approaches suffice anymore. Holistic, seamless system reconfiguration needs to be supported out of the box. However, both standard and custom design approaches for heterogeneous embedded systems, usually based on traditional imperative languages, provide neither sane ways to express dynamism to cope with uncertainty nor enough analysability of the application at design time to optimize final system costs (memory, area, etc.). This work, by using a well-defined Model of Computation able to capture the reconfigurability features of the application, investigates runtime-efficient HW/SW reconfiguration to seamlessly move actor execution across the different computational resources of the target platform, a SoC-FPGA. Three main elements are to be studied in this work: (i) lightweight intermediate graph representations (IR), (ii) just-in-time compilation, and (iii) hardware composition techniques. A suitable reconfiguration granularity that leverages the IR will be investigated, so it can be used by a runtime manager to provide system adaptation at (coarse/fine-grained) hardware and software levels.
### Alain Girault
#### A multi-criteria static scheduling heuristic to optimize the execution time, the failure rate, the power consumption, and the temperature in multicores
We address the problem of computing a static schedule of a DAG of tasks onto a multicore architecture, with the goal of optimizing four criteria: execution time, reliability, maximum power consumption, and peak temperature. We propose a ready-list scheduling heuristic: it builds a static schedule of the given DAG of tasks onto the given multicore such that its reliability, power consumption, and temperature remain below three given thresholds, and such that its total execution time is as low as possible. We actively replicate tasks to increase the reliability, use Dynamic Voltage and Frequency Scaling to decrease the power consumption, and insert cooling times to control the peak temperature. We advocate that, when one wants to optimize multiple criteria, it makes more sense to build a set of solutions, each one corresponding to a different tradeoff between those criteria, rather than to build a single solution. This is all the more true when the criteria are antagonistic, which is the case here: for instance, improving the reliability requires adding some redundancy to the schedule (in our case spatial redundancy), which penalizes the execution time. For this reason, we build a Pareto front in the 4D space (exec. time, reliability, power, temp.) by varying the three thresholds on the reliability, power, and temperature. Comparisons show that the schedules produced by our heuristic are on average only 10% worse than the optimal schedules (computed by an ILP program), and 35% better than the ones generated by the PowerPerf-PET heuristic from the literature.
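The overall shape of a ready-list heuristic can be sketched as follows. This is a simplified makespan-only version, not the authors' multi-criteria heuristic; in the real algorithm, the reliability, power, and temperature thresholds would be checked at the placement step, with replication, DVFS, and cooling times inserted as needed.

```python
# A minimal ready-list scheduler: tasks become ready when all their DAG
# predecessors are scheduled; the shortest ready task is placed on the
# core that can start it earliest.
def list_schedule(durations, deps, n_cores=2):
    core_free = [0.0] * n_cores          # next free time per core
    finish = {}                          # task -> finish time
    remaining = set(durations)
    while remaining:
        # tasks whose predecessors are all scheduled
        ready = [t for t in remaining if all(p in finish for p in deps.get(t, []))]
        # shortest-task-first among ready tasks
        t = min(ready, key=lambda t: durations[t])
        est = max([finish[p] for p in deps.get(t, [])] or [0.0])
        core = min(range(n_cores), key=lambda c: max(core_free[c], est))
        start = max(core_free[core], est)
        finish[t] = start + durations[t]
        core_free[core] = finish[t]
        remaining.remove(t)
    return max(finish.values())          # makespan

durations = {"a": 2, "b": 3, "c": 1, "d": 2}
deps = {"c": ["a"], "d": ["a", "b"]}
print(list_schedule(durations, deps))    # prints 5.0
```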
### Michael Masin
#### Cross-layer Model-based Multi-objective HW/SW Co-design Using Mathematical Programming (MILP) in the PREESM=>AOW=>PREESM Toolchain
In this presentation we show how large-scale applications modelled as SDF graphs with hundreds to thousands of actors can be mapped to heterogeneous HW/SW architecture platforms and scheduled to optimize multiple objectives and constraints using a MILP solver.
A proof of concept integrates IBM's AOW tool, which performs the optimization, into INSA's PREESM HW/SW co-design flow.
### Hugo Daniel Meyer
#### Modeling and Simulation of Complex Industrial Systems
Nowadays, most industrial systems include Distributed Computing Systems (DCS) with many application processes that compete for the available resources to carry out their tasks. Their sub-systems are normally built separately, and during integration the main objective is to guarantee that functional requirements are met. However, due to the complexity involved in systems integration, one of the big challenges in complex distributed systems is to make an efficient use of the available resources, or to reduce the cost paid to obtain the expected result. When designing and developing such complex systems, pure analytical or experimental techniques are insufficient or very costly. For these reasons, system architecture modelling and simulation techniques are preferred, since they allow designers to easily perform design space exploration, what-if analysis and evaluation of Key Performance Indicators (KPI). This work focuses on addressing what we consider the three critical challenges in complex DCS: (a) system monitoring; (b) modelling and simulation; and (c) actuation and tuning. We will discuss the main challenges of modelling software processes and resource utilization in photolithography machines produced by ASML, and how such complex systems can be mimicked with high-level, trace-based simulation in two different tools (SESAME and OMNeT++). The main findings so far will be presented, as well as initial experimental results in which we measure the accuracy of the proposed models considering different metrics.
### Jani Boutellier
#### Combining dataflow with functional programming
For decades, the dataflow abstraction has been used for describing the functionality of computation-dominated algorithms that are common, e.g., in signal processing. Lately, the research focus of dataflow-based programming has been moving towards coarse-grain dataflow, where programs are described with a relatively low number of dataflow nodes, and the detailed node functionality is written in a procedural language such as C/C++. This approach works well when the dataflow program is compiled for a target device that has conventional general-purpose processor (GPP) cores, but turns out to be problematic when the target device has programmable data-parallel accelerator cores, such as graphics processing units (GPUs). In general, GPUs cannot be programmed with the same language as GPPs, and hence dataflow programs that target heterogeneous devices may require two descriptions for each dataflow node: one for GPPs and one for GPUs.
This talk presents preliminary results on combining coarse-grain dataflow programming with the functional programming language Halide. Halide is a relatively new language designed for image processing. We present an approach where the application is divided into dataflow actors on a high level, and the detailed functionality of actors is expressed in Halide. As Halide offers code generation for GPPs, GPUs, and DSPs (Digital Signal Processors), a single language is sufficient for targeting all core types of a heterogeneous target device. Experimental results are presented for three applications.
### Renaud De Landtsheer
#### Placer: a Design-time Model-based Tool for Mapping Task-based SW onto Heterogeneous HW
Placer is a model-based tool that optimizes the mapping of task-based software onto heterogeneous hardware. It takes several models as input: notably, a model of the software that identifies tasks, their data dependencies and the various task implementations available for execution on different hardware, and a model of the hardware that describes various properties related to processing and I/O transfer capabilities. The outcome of running Placer is a proposed mapping of the software tasks onto the hardware. The mapping includes an assignment of tasks to processing elements (CPU core, FPGA, etc.) and a global schedule for initiating tasks and transmissions. Placer is also able to handle transmission delay and routing. Tasks can have several implementations targeting different computing models, each suitable for particular hardware processing (FPGA or CPU, for instance). Alternatively, even when targeting particular hardware processing, different task implementations may be provided to offer different trade-offs between speed, memory usage, etc. Placer then also selects the most appropriate implementation for these tasks. Contrary to other optimisation tools that optimise placement and schedule as separate processes, Placer optimizes all these aspects of the mapping in a single optimization process, so it can reason globally about the schedule, placement, and implementation selection at once. Placer is available open source.
Placer relies on the OscaR.cp constraint programming engine to find high-quality solutions and to cope with this very complex multi-aspect optimization problem.
The talk will focus on the modelling language of the Placer tool and will illustrate how it performs on some simple yet representative examples on a heterogeneous hardware platform.
Placer is developed within the TANGO H2020 project (grant RIA 687584): http://tango-project.eu/
Placer development repository: https://github.com/TANGO-Project/placer
OscaR Team, OscaR: Scala in OR, 2012, https://bitbucket.org/oscarlib/oscar
TANGO blog post, “TANGO Development-time tooling: An initial usage of Placer”
### Yehya Nasser
#### Power Modeling on FPGA: A Neural Model for RT-Level Power Estimation
Today, reducing power consumption is a major concern, especially for small embedded devices. Power optimization is required all along the design flow, but particularly in the first steps, where it has the strongest impact. In this work, we propose new power models based on neural networks that predict the power consumed by digital operators implemented on Field Programmable Gate Arrays (FPGAs). These operators are interconnected, and the statistical information of data patterns is propagated among them. The obtained results make an overall power estimation of a specific design possible. A comparison is performed to evaluate the accuracy of our power models against the estimations provided by the Xilinx Power Analyzer (XPA) tool. Our approach is verified at system level, where different processing systems are implemented. A mean absolute percentage error of less than 8% is obtained compared to the classic Xilinx power estimation flow.
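The accuracy figure above is a mean absolute percentage error (MAPE); a minimal sketch of the metric, with illustrative numbers rather than measured data:

```python
def mape(reference, predicted):
    """Mean absolute percentage error between reference and predicted values."""
    return 100.0 * sum(abs((r - p) / r) for r, p in zip(reference, predicted)) / len(reference)

xpa_power = [120.0, 95.0, 200.0]     # hypothetical XPA reference estimates (mW)
model_power = [126.0, 90.0, 210.0]   # hypothetical neural-model predictions (mW)
print(round(mape(xpa_power, model_power), 2))  # prints 5.09
```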
### Simei Yang
#### Evaluation and Design of a Runtime Manager for Ultra-Low Power Cluster-Based Multicores
Energy efficiency is a challenge in today's multi-core embedded systems. Due to dynamic variations in application execution, run-time management is needed to appropriately set the platform configuration. The objective of this work is to efficiently combine run-time task mapping and Dynamic Frequency Scaling (DFS) to achieve better energy efficiency under timing constraints. These techniques are applied in a hybrid approach comprising design-time and run-time stages. Based on the mappings prepared at design time for each application, two different run-time mapping algorithms are considered. The proposed algorithms aim to reduce the execution time of active applications, to support lower frequency configurations. We evaluate our approach with different streaming multimedia applications on a homogeneous multi-core platform. Compared to related work in our experimental setting, the proposed approach decreases the execution time by 6.5% before DFS. After DFS, the dynamic energy consumption is reduced by 18.7% thanks to the lower frequency configuration.
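Why lowering frequency saves dynamic energy can be illustrated with the standard first-order CMOS dynamic-power model, P = C·V²·f. This is a textbook approximation, not the paper's platform model, and the numbers below are purely illustrative:

```python
# First-order CMOS dynamic-power model: P = C_eff * V^2 * f.
# For a fixed cycle count, energy = P * (cycles / f), so the frequency
# cancels and savings come from the lower voltage the slower clock permits.
def dynamic_energy(c_eff, volt, freq, cycles):
    power = c_eff * volt**2 * freq      # dynamic power (W)
    return power * (cycles / freq)      # energy over the workload (J)

high = dynamic_energy(1e-9, 1.1, 1.0e9, 1e6)   # 1 GHz at 1.1 V
low = dynamic_energy(1e-9, 0.9, 0.5e9, 1e6)    # 500 MHz at 0.9 V
print(high > low)  # prints True
```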
### Steven Derrien
#### Using Polyhedral Techniques to Tighten WCET Estimates of Optimized Code
The ARGO H2020 European project aims at developing a Worst-Case Execution Time (WCET)-aware parallelizing compilation toolchain. This toolchain operates on Scilab and Xcos inputs, and targets ScratchPad memory (SPM)-based multi-cores. Data-layout and loop transformations play a key role in this flow as they improve SPM efficiency and reduce the number of accesses to shared main memory. In this work, we study how these transformations impact WCET estimates of sequential codes. We demonstrate that they can bring significant improvements of WCET estimates (up to 2.7×) provided that the WCET analysis process is guided with automatically generated flow annotations obtained using polyhedral counting techniques.
### Maxime Pelcat
#### Design Productivity of CAD Tools
The complexity of hardware systems is currently growing faster than the productivity of system designers and programmers. This phenomenon, called the Design Productivity Gap, results in inflating design costs.
In this presentation, the notion of Design Productivity is discussed, as well as a metric to assess the Design Productivity of a High-Level Synthesis (HLS) method versus a manual hardware description. The Design Productivity metric evaluates the trade-off between design efficiency and implementation quality. The method is generic enough to be used for comparing several HLS methods of different natures, opening opportunities for further progress in Design Productivity. This metric is a first step towards comparison of CAD tools for system design.
### François Verdier
#### PowClkARCH SystemC-TLM/C++ library
I will present the work done at LEAT on the PowClkARCH library, which can be used at the very beginning of the SoC design flow. The library is written in SystemC-TLM and C++; it extracts all the power and energy features from a TLM simulation model, providing an efficient energy estimate of a new SoC.
### Hai Nam Tran
#### Toward Efficient Many-core Scheduling of Partial Expansion Graphs
Transformation of synchronous data flow graphs (SDF) into equivalent homogeneous SDF representations has been extensively applied as a pre-processing stage when mapping signal processing algorithms onto parallel platforms. While this transformation helps fully expose task and data parallelism, it also presents several limitations such as an exponential increase in the number of actors and excessive communication overhead. Partial expansion graphs were introduced to address these limitations for multi-core platforms. However, existing solutions are not well-suited to achieve efficient scheduling on many-core architectures. In this presentation, we develop a new approach that employs cyclo-static data flow techniques to provide a simple but efficient method of coordinating the data production and consumption in the expanded graphs. We demonstrate the advantage of our approach through experiments on real application models.
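The HSDF expansion mentioned above is governed by the SDF balance equations: for each edge, q[src]·prod = q[dst]·cons, and full expansion creates q[a] copies of each actor a, which is the blow-up partial expansion seeks to avoid. A minimal sketch computing the repetitions vector for an illustrative two-actor graph:

```python
from fractions import Fraction
from math import lcm

def repetitions(edges, actors):
    """Repetitions vector of a consistent, connected SDF graph.
    edges: list of (src, prod, dst, cons) tuples."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate rates along edges
        changed = False
        for src, prod, dst, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    # scale to the smallest integer solution
    scale = lcm(*(r.denominator for r in q.values()))
    return {a: int(r * scale) for a, r in q.items()}

# Example: A produces 3 tokens per firing, B consumes 2 per firing.
print(repetitions([("A", 3, "B", 2)], ["A", "B"]))  # prints {'A': 2, 'B': 3}
```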
### Shuvra S. Bhattacharyya
#### Software Synthesis from Dataflow Schedule Graphs
In dataflow-based design processes for signal and information processing, scheduling is a critical task that affects practical measures of performance, including latency, throughput, energy consumption, and memory requirements. Dataflow schedule graphs (DSGs) provide a formal abstraction for representing schedules in dataflow-based design processes. The DSG abstraction allows designers to model a schedule as a separate dataflow graph, thereby providing a formal, abstract (platform- and language-independent) representation for the schedule. In this presentation, we review the DSG representation, and introduce a design methodology that is based on explicit specifications of application graphs and schedules as cooperating dataflow models. We also develop new software synthesis techniques for automatically deriving efficient implementations from these coupled application and schedule models. We demonstrate the proposed methodology and software synthesis techniques through various experiments, including a demonstration involving real-time detection of people and vehicles using acoustic and seismic sensors.
### Guillaume Delbergue
#### Open-source virtual platforms, the key to build a new collaborative era in electronics
Methodologies based on the SystemC/TLM standard provide, early in the development, a virtual version of the (future) system, commonly called a virtual platform. Their efficiency in securing hardware and software design as early as possible is widely accepted. However, these benefits are insufficient to make the technology widespread.
Today, the requirements have evolved, and the SystemC/TLM simulation standard no longer meets expectations. The technology is not easily accessible to stand-alone software developers or to small teams building new devices. In an era where open source and community development take a central place, collaborative tools around virtual platforms are missing.
We aim to present what we believe is the future of embedded system development: what the collaborative properties of virtual platforms are, which technological issues need to be addressed first, and how open-source solutions unlock the potential to build large collaborative projects with no initial fee.
### Sebastien Le Nours
#### A Hybrid Simulation Approach for Fast and Accurate Timing Analysis of Multi-Processor Platforms Considering Communication Resources Conflicts
In the early design phase of embedded systems, discrete-event simulation is extensively used to analyse time properties of hardware-software architectures. Improvement of simulation efficiency has become imperative for tackling the ever-increasing complexity of multi-processor execution platforms. The fundamental limitation of current discrete-event simulators lies in the time-consuming context switching required to simulate concurrent processes. In this talk, we present a new simulation approach that reduces the number of events managed by a simulator while preserving the timing accuracy of hardware-software architecture models. The proposed simulation approach abstracts the simulated processes by an equivalent executable model which computes the synchronization instants with no involvement of the simulation kernel. Compared to traditional lock-step simulation approaches, the proposed approach enables significant simulation speed-up with no loss of timing accuracy. Application of the simulation approach and the achieved improvement of simulation efficiency will be presented considering various case studies.
### Claudio Rubattu
#### H2020 Cerbero - MDC Tutorial