2022 Conference on Advanced Topics and Auto Tuning in High-Performance Scientific Computing

# 2022 Conference on Advanced Topics and Auto Tuning in High-Performance Scientific Computing

<center>March 29-30, at National Cheng Kung University</center>

### Program

| Time (GMT+8) | March 29 (Tuesday) | March 30 (Wednesday) |
| -------- | -------- | -------- |
| 09:00 - 09:10 | Opening | Opening |
| 09:10 - 09:35 | [Kengo Nakajima](#Kengo-Nakajima) | [Masatoshi Kawai](#Masatoshi-Kawai) |
| 09:35 - 10:00 | [Li-Da Tong](#Li-Da-Tong) | [Yueh-Nan Chen](#Yueh-Nan-Chen) |
| 10:00 - 10:10 | Break | Break |
| 10:10 - 10:35 | [Hiroyuki Takizawa](#Hiroyuki-Takizawa) | [Tetsuya Hoshino](#Tetsuya-Hoshino) |
| 10:35 - 11:00 | [Yu-Ting Wu](#Yu-Ting-Wu) | [Torbjörn Nordling](#Torbjörn-Nordling) |
| 11:00 - 13:00 | Lunch Break | Lunch Break |
| 13:10 - 13:35 | [Takahiro Katagiri](#Takahiro-Katagiri) | [Satoshi Ohshima](#Satoshi-Ohshima) |
| 13:35 - 14:00 | [Wei-Fan Hu](#Wei-Fan-Hu) | [Cheng-Fang Su](#Cheng-Fang-Su) |
| 14:00 - 14:10 | Break | Break |
| 14:10 - 14:35 | [Takeshi Fukaya](#Takeshi-Fukaya) | [Takeshi Iwashita](#Takeshi-Iwashita) |
| 14:35 - 15:00 | [Te-Sheng Lin](#Te-Sheng-Lin) | [Yu-Hsun Lee](#Yu-Hsun-Lee) |
| 15:00 - 15:20 | Tea Break | Tea Break |
| 15:20 - 15:55 | [Toshiyuki Imamura](#Toshiyuki-Imamura) | [Takeshi Terao](#Takeshi-Terao) |
| 15:55 - 16:20 | [Mei-Heng Yueh](#Mei-Heng-Yueh) | [Masahiro Hamano](#Masahiro-Hamano) |
| 16:20 - 16:30 | Break | Break |
| 16:30 - 16:45 | [Makoto Morishita](#Makoto-Morishita) | [Sameer Deshmukh](#Sameer-Deshmukh) |
| 16:45 - 17:00 | [Yen-Chen Chen](#Yen-Chen-Chen) | [Muhammad Ridwan Apriansyah](#Muhammad-Ridwan-Apriansyah) |
| 17:00 - 17:15 | [Chung-Yu Shih](#Chung-Yu-Shih) | [Thomas Spendlhofer](#Thomas-Spendlhofer) |
| 17:15 - 17:30 | [Chang-Wen Liang](#Chang-Wen-Liang) | [Wei-Ann Lin](#Wei-Ann-Lin) |
| 18:00 - 20:00 | Banquet | Closing |

---

### Kengo Nakajima

nakajima@cc.u-tokyo.ac.jp
The University of Tokyo/RIKEN R-CCS

Title: *“Communication-Computation Overlapping for Preconditioned Iterative Solvers”*

Abstract:
Preconditioned parallel solvers based on the Krylov iterative method are widely used in scientific and engineering applications. Communication overhead is a critical issue when executing these solvers on large-scale massively parallel supercomputers. The author has been developing methods to improve the parallel performance of such solvers by communication-computation overlapping. In the present work, we investigate reordering methods for communication-computation overlapping in the ICCG solvers of a parallel finite element application (GeoFEM/Cube). Evaluations on the Oakforest-PACS with Intel Xeon Phi processors and on Wisteria/BDEC-01 (Odyssey) with A64FX are presented.

[Back to program](#Program)

---

### Li-Da Tong

ldtong@math.nsysu.edu.tw
Time: 09:35 - 10:00
Department of Applied Mathematics & Institute of Precision Medicine, National Sun Yat-sen University

Title: *“Practical Applications of Artificial Intelligence”*

Abstract: We start our survey of practical AI applications with chemical engineering process optimization. Using operational data that a dehydrogenation reactor system has accumulated in the past, provided by several companies, we model and analyze the system with neural networks to improve its conversion rate and selection rate. With our in-house machine learning method, the prediction of production capacity can reach more than 95%. We then give a short introduction to other applications.

[Back to program](#Program)

---

### Hiroyuki Takizawa

takizawa@tohoku.ac.jp
Tohoku University

Title: *“A Data-driven Approach for Making Better Use of Compilers”*

Abstract: Today, a single HPC platform can have multiple compilers, each of which provides many option flags.
Since those compilers have different optimization capabilities, it can be challenging to select an appropriate build configuration, such as the best available compiler and its option flags, for each application code. In this work, therefore, we apply machine learning to select a proper build configuration from performance counter values. Our evaluation results demonstrate that our neural network model with careful feature selection can outperform other typical prediction models. By properly selecting a build configuration, the proposed approach can significantly reduce the execution time in comparison with simply using the default compiler and its optimization level, such as gcc with the O2 flag.

[Back to program](#Program)

---

### Yu-Ting Wu

bulawu@gmail.com
National Cheng Kung University

Title: *“Parallel Computing of Large Eddy Simulation of Large Wind Farms”*

Abstract: A pseudo-spectrum-based large-eddy simulation (LES) model, coupled with a dynamic actuator-disk model, is used to investigate the turbine power production and the turbine wake distribution in elongated wind farms, where streamwise turbine spacings of 7, 9, 12, 15, and 18 rotor diameters are considered. Two incoming flow conditions, three wind turbine arrangements, and the five turbine spacings are involved in this study, which leads to a total of 30 LES wind farm scenarios. The two incoming flow conditions have the same mean velocity of 9 m s$^{-1}$ but different turbulence intensity levels (i.e., 7% and 11%) at the hub height level. The considered turbine arrangements are the perfectly-aligned, laterally-staggered, and vertically-staggered layouts. The simulated results show that the turbine power production improves significantly as the streamwise turbine spacing increases.
As the streamwise turbine spacing increases from 7 to 18 rotor diameters, the overall averaged power output rises by about 27% in the staggered wind farms and about 38% in the aligned wind farms. In a large wind farm, scenarios with a turbine spacing of 12d or greater lead to an increasing trend in the power production of the downstream turbines under the high-turbulence inflow condition, and avoid the degradation of the turbine power output under the low-turbulence inflow condition.

[Back to program](#Program)

---

### Takahiro Katagiri

katagiri@cc.nagoya-u.ac.jp
Nagoya University

Title: *“Adaptation of XAI to Auto-tuning of Numerical Libraries”*

Abstract: Recently, Explainable AI (XAI) has become known as one of the most important issues in computer systems, since results from AI are not based on explainable output. The author thinks XAI is a crucial issue not only for the AI field but also for the field of numerical computation. Meanwhile, Auto-Tuning (AT) facilities have been developed for two decades. In this presentation, we focus on XAI for the AT of numerical libraries. The target numerical library is a highly accurate matrix multiplication library. AI is adapted to an AT function that selects the best implementation among the 11 implementations in the library. This includes selecting the best implementation on CPU and GPU. In our adaptation, LIME and SHAP are used to bring XAI to the AT function of the numerical library. We utilize a GPU-based supercomputer, the Supercomputer “Flow” Type II Subsystem at the Information Technology Center, Nagoya University, to obtain training data for the target AI model, a random-forest regressor. According to the evaluation results, the XAI tools can explain well the behaviour of the best (predicted) parameters from the learned model.

[Back to program](#Program)

---

### Wei-Fan Hu

wfhu@math.ncu.edu.tw
National Central University

Title: *“Machine Learning Approximation for Solving Sharp Interface Problems”*

Abstract: In this talk, a new Discontinuity Capturing Shallow Neural Network (DCSNN) for approximating $d$-dimensional piecewise continuous functions and for solving sharp interface problems is developed. There are three novel features in the present network; namely, (i) jump discontinuity is captured sharply, (ii) it is completely shallow, consisting of only one hidden layer, and (iii) it is completely mesh-free for solving partial differential equations (PDEs). We first continuously extend the $d$-dimensional piecewise continuous function into $(d+1)$-dimensional space by augmenting one coordinate variable to label the pieces of the discontinuous function, and then construct a shallow neural network to express this new augmented function. Since only one hidden layer is employed, the number of training parameters (weights and biases) scales linearly with the dimension and the number of neurons used in the hidden layer. For solving elliptic interface equations, the network is trained by minimizing the mean squared error loss that consists of the residuals of the governing equation, the boundary condition, and the interface jump conditions. We compare the results with those obtained by the traditional grid-based immersed interface method (IIM), which is designed particularly for elliptic interface problems. The present results show better accuracy than the ones obtained by IIM. We conclude by solving a six-dimensional problem to show the capability of the present network for high-dimensional applications.
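
The augmentation idea in this abstract can be sketched in a few lines. The following is only an illustration under stated assumptions, not the speaker's implementation: the sigmoid activation, the absence of an output bias, the labels -1 and +1, and all function names (`make_network`, `count_parameters`, `evaluate`, `label`) are hypothetical choices made here. It shows a one-hidden-layer network that takes the $d$ coordinates plus one label coordinate, and counts its parameters to make the linear scaling visible.

```python
import math
import random


def make_network(d, n_hidden, seed=0):
    """Random parameters for a one-hidden-layer network on (d + 1) inputs.

    The extra input is the label coordinate that marks which piece of the
    piecewise continuous function a point belongs to (assumed layout).
    """
    rng = random.Random(seed)
    return {
        "W": [[rng.uniform(-1, 1) for _ in range(d + 1)] for _ in range(n_hidden)],
        "b": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "c": [rng.uniform(-1, 1) for _ in range(n_hidden)],
    }


def count_parameters(net):
    """Hidden weights + hidden biases + output weights."""
    return sum(len(row) for row in net["W"]) + len(net["b"]) + len(net["c"])


def evaluate(net, x, z):
    """Evaluate the augmented network at spatial point x with label z."""
    inputs = list(x) + [z]
    total = 0.0
    for w_row, b, c in zip(net["W"], net["b"], net["c"]):
        pre = sum(w * v for w, v in zip(w_row, inputs)) + b
        total += c / (1.0 + math.exp(-pre))  # sigmoid activation (assumption)
    return total


def label(x, interface=0.5):
    """Hypothetical labelling: -1 on one side of a planar interface, +1 on the other."""
    return -1.0 if x[0] < interface else 1.0


if __name__ == "__main__":
    net = make_network(d=6, n_hidden=10)
    print("parameters:", count_parameters(net))
    left, right = [0.499] + [0.0] * 5, [0.501] + [0.0] * 5
    # The network is smooth in (x, z); the jump in x comes only from the label.
    print(evaluate(net, left, label(left)), evaluate(net, right, label(right)))
```

With this layout the parameter count is $N(d+1) + N + N = N(d+3)$ for $N$ hidden neurons, which is linear in both $d$ and $N$. Training against the PDE residual and jump conditions is omitted here.
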

[Back to program](#Program)

---

### Takeshi Fukaya

fukaya@iic.hokudai.ac.jp
Hokkaido University

Title: *“Performance Evaluation of Various Algorithms for Computing Tall-skinny QR Factorization”*

Abstract: Tall-skinny QR factorization, the QR factorization of a tall and skinny matrix, has many applications such as the solution of least squares problems, preprocessing for SVD, and vector orthogonalization. Various numerical algorithms have been proposed for computing tall-skinny QR factorization, for example, Householder QR, TSQR, and Cholesky QR and its variants, and they have different characteristics. In this talk, we report the results of a performance evaluation of these algorithms conducted on recent distributed parallel computer systems. We present the strong scalability of each algorithm and analyze it based on the breakdown of the execution time, including the time for communication.

[Back to program](#Program)

---

### Te-Sheng Lin

tslin@math.nctu.edu.tw
National Yang Ming Chiao Tung University

Title: *“A Shallow Ritz Method for Elliptic Problems with Singular Sources”*

Abstract: A shallow Ritz-type neural network for solving elliptic equations with delta function singular sources on an interface is developed. We first introduce the energy functional of the problem and then transform the contribution of the singular sources into a regular surface integral along the interface. The original problem is then reformulated as a minimization problem. We propose a shallow Ritz-type neural network with one hidden layer to approximate the global minimizer of the energy functional. As a result, the network is trained by minimizing a loss function that is a discrete version of the energy. In addition, we include the level set function of the interface as a feature input and find that it significantly improves the training efficiency and accuracy.
We perform a series of numerical tests to show the accuracy of the present network and its capability for problems in irregular domains and higher dimensions.

[Back to program](#Program)

---

### Toshiyuki Imamura

imamura.toshiyuki@riken.jp
RIKEN R-CCS, co-authors: Takeshi Terao (RIKEN R-CCS), Takuya Ina (JAEA), Yusuke Hirota (University of Fukui), Yuki Uchino (Shibaura Institute of Technology), and Katsuhisa Ozaki (Shibaura Institute of Technology)

Title: *“Roadmap and the Performance Benchmark of EigenExa on Fugaku”*

Abstract: The eigensolver is one of the most strongly demanded scientific tools in modern simulations. RIKEN has been offering high-performance and reliable eigensolvers since 2012 and 2020 for the K computer and the supercomputer Fugaku, respectively. ‘EigenExa’ refers to the heart of our eigensolver project and is the leading software for large-scale and highly parallel computations. We released version 2.11 in 2021 as a standard eigensolver library on Fugaku, which employs the banded-matrix algorithm. The traditional one-stage scheme is also available, as in most state-of-the-art eigensolvers such as ScaLAPACK and ELPA. We will present the current status of EigenExa, demonstrate the preliminary performance benchmark on Fugaku up to 16384 nodes, and exhibit the roadmap to emerging hardware systems from the perspectives of new algorithms, accuracy, and reproducibility.

[Back to program](#Program)

---

### Mei-Heng Yueh

yue@ntnu.edu.tw
National Taiwan Normal University

Title: *“Optimal Mass Transportation Map with Applications”*

Abstract: Optimal mass transportation (OMT) aims to find a measure-preserving map that minimizes a given cost function. The OMT map has been widely applied to various tasks in computer vision and deep learning. With the efficient energy minimization algorithm that we developed for measure-preserving mapping, the computation of the OMT map can be carried out effectively.
In this talk, I will introduce the projected gradient method for computing the OMT map and demonstrate its application in manifold registration and image processing.

[Back to program](#Program)

---

### Makoto Morishita

morishita@hpc.itc.nagoya-u.ac.jp
Nagoya University

Title: *“Adaptation of Automatic Tuning to a CMOS Annealing Machine”*

Abstract: Annealing computers are expected to show superiority over conventional computers in several combinatorial optimization problems. In this study, we use software auto-tuning (AT) to optimize performance parameters, such as hyperparameters and annealing parameters, when solving optimization problems on a CMOS annealing machine, which is one of the “quantum-inspired” annealing computers. We then show the usefulness of AT technology for the CMOS annealing machine.

[Back to program](#Program)

---

### Yen-Chen Chen

chen-yenchen842@g.ecc.u-tokyo.ac.jp
The University of Tokyo

Title: *“A Parallel-in-time Method for Compressible Fluid Simulation”*

Abstract: Conventional High-Performance Computing (HPC) methods for time-dependent problems are reaching an acceleration limit in the spatial domain. As we move toward the exaflops supercomputing era, the parallelization of numerical PDE solvers reaches its saturation point in the spatial dimension and is restricted by the time domain instead. This restriction has led to the development of parallel-in-time (PinT) methods. Well-known PinT methods such as parareal and multigrid reduction in time (MGRIT) have provided reasonable acceleration for implicit schemes for partial differential equations (PDEs). However, to the best of our knowledge, very few PinT methods have been tested with explicit schemes. Moreover, the few existing studies on PinT methods with explicit schemes show poor acceleration compared to spatial parallelization. This research introduces a parallel-in-time method optimized to work with explicit schemes.
The proposed method constructs a structure of multiple coarsening layers and solves the parareal algorithm from coarse to fine layers. The proposed method is optimized based on the number of available cores to improve the efficiency of parallel-in-time solvers with a limited number of processors. This research presents a numerical experiment on a two-dimensional compressible viscous flow around a circular cylinder, using explicit time-marching schemes as relaxation methods. The results show that the proposed parallel-in-time method can improve the computational efficiency of explicit solvers compared to pure spatial parallelization.

[Back to program](#Program)

---

### Chung-Yu Shih

davidavidshih11@gmail.com
National Central University
Advisor: Feng-Nan Hwang hwangf@math.ncu.edu.tw

Title: *“A Hybrid Data Assimilation Method Based on Extended Kalman Filter and Long Short-term Memory for Traffic Flow Prediction Problems”*

Abstract: In recent years, the development of smart cities has driven the demand for traffic flow forecasting. Two traffic flow prediction techniques are currently available. One is a mathematical model based on traffic flow theory, which produces stable and accurate long-term results under ideal conditions. The other is a machine learning method based on training data, which provides information beyond theory. In this talk, we propose a hybrid data assimilation framework to take advantage of the strengths of both classes of techniques. The key ingredients of the proposed method are an extended Kalman filter (EKF), which improves the prediction of the numerical simulation by eliminating the negative impact of an inaccurate initial condition caused by observation error, and the long short-term memory (LSTM) method, a machine learning method.
As the numerical simulator, a kernel component of the predictive tool, we use an explicit Godunov’s method to discretize the Lighthill-Whitham-Richards model, where the MacNicholas formulation is used as the fundamental relation between the velocity and density. EKF assimilates the background value obtained by numerical simulation with LSTM predictions as pseudo observations to obtain a more accurate initial condition for the next prediction. Based on real historical highway traffic data, our experimental results show that the LSTM-EKF method can successfully filter out part of the noise from the observation error and performs better than traditional EKF and LSTM methods.

[Back to program](#Program)

---

### Chang-Wen Liang

treewithout@gmail.com
National Central University
Advisor: Feng-Nan Hwang hwangf@math.ncu.edu.tw

Title: *“Nonlinear Elimination Preconditioned Inexact Newton Algorithms for a Full Space-time Formulation of Hyperbolic Equations”*

Abstract: As the computing power of the latest parallel computer systems increases dramatically, fully coupled space-time solution algorithms for time-dependent PDEs have recently gained popularity for temporal domain parallelism. Such a space-time algorithm requires solving the resulting large, sparse, nonlinear systems in an all-at-once manner. A robust and efficient nonlinear solver plays an essential role as a critical kernel of the whole solution algorithm. In this talk, we study some nonlinear preconditioned Newton algorithms for the space-time formulation of hyperbolic equations in the presence of shocks. In that case, the history of the nonlinear residual norm for the classical inexact Newton method with backtracking (INB) suffers from a long stagnation due to strong local nonlinearity. Nonlinear preconditioning such as nonlinear elimination has been shown to be a practical technique for improving the robustness of INB for many different types of PDEs, but it does not work well for hyperbolic PDEs.
We propose a new variant of nonlinear elimination preconditioners designed for hyperbolic PDEs, which takes their characteristics into account to overcome the difficulties. We performed a comparative performance study of two nonlinear preconditioned iterative algorithms, INB-ANE and NEPIN, in which nonlinear elimination serves as right and left nonlinear preconditioning, respectively, in conjunction with inexact Newton algorithms. Taking the Riemann problems for Burgers' equation and the two-phase flow equation as examples, we show numerically that NEPIN outperforms INB-ANE in the sense that the number of inexact Newton iterations required to converge is almost independent of both the time-step and the mesh sizes. NEPIN succeeds because, compared to INB-ANE, it identifies the correct shock location more quickly and introduces less interface pollution after the subspace correction and before the global update.

[Back to program](#Program)

---

### Masatoshi Kawai

kawai@cc.u-tokyo.ac.jp
The University of Tokyo, co-author: Akihiro Ida (Japan Agency for Marine-Earth Science and Technology)

Title: *“Numerical Evaluation of Dynamic Core Binding Library with H-matrix Application”*

Abstract: As represented by the H-matrix, there are many applications in which it is difficult to balance loads among the units of parallelization, and complex implementations are often required to achieve an equal load balance. This research introduces a library that equalizes the load of each core in a node for applications that use MPI and OpenMP hybrid parallelization and for which it is not easy to keep the load evenly balanced. In this library, by changing the number of cores assigned to each MPI process, the amount of computation per core can be made even, even if the load across MPI processes is uneven.
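
The idea of giving busier MPI processes more cores can be illustrated with a small allocation routine. This is a hedged sketch, not the library's actual API or algorithm: the function name `assign_cores`, the rule that every process keeps at least one core, and the largest-remainder rounding are all assumptions chosen for illustration.

```python
def assign_cores(loads, total_cores):
    """Split total_cores among processes roughly in proportion to their loads.

    Every process keeps at least one core (assumption); the remaining cores
    are shared proportionally, using largest-remainder rounding.
    """
    n = len(loads)
    if total_cores < n:
        raise ValueError("need at least one core per process")
    spare = total_cores - n
    total_load = sum(loads)
    if total_load == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * load / total_load for load in loads]
    cores = [1 + int(share) for share in shares]
    leftover = total_cores - sum(cores)
    # Hand the cores lost to rounding down to the largest fractional parts.
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        cores[i] += 1
    return cores


if __name__ == "__main__":
    loads = [10, 30, 20, 60]            # hypothetical per-process workloads
    cores = assign_cores(loads, 48)     # hypothetical 48-core node
    for load, c in zip(loads, cores):
        print(f"load {load:3d} -> {c:2d} cores, {load / c:.2f} load per core")
```

With an equal split of 12 cores per process, the per-core load in this example would range from about 0.8 to 5.0; the proportional assignment narrows that range considerably, which is the effect the abstract describes.
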

[Back to program](#Program)

---

### Yueh-Nan Chen

yuehnan@mail.ncku.edu.tw
National Cheng Kung University

Title: *“Performance of Quantum State Transfer on IBM-Q, Ion-Q, and QuTech Devices”*

Abstract: Quantum state transfer (QST) provides a method to send arbitrary quantum states from one system to another. Such a concept is crucial for transmitting quantum information into quantum memories, quantum processors, and quantum networks. In this talk, I will first introduce the concept of EPR steering. I will then describe the temporal analogue of EPR steering, i.e., temporal quantum steering. For practical applications, I will show that the temporal steerability is preserved when the QST process is perfect and decreases under imperfect QST processes. We then apply the temporal steerability measurement technique to benchmark quantum devices, including the IBM Quantum Experience, QuTech Quantum Inspire, and the Ion-Q system, under QST tasks. The experimental results show that the temporal steerability decreases as the circuit depth increases. Moreover, we show that the no-signaling-in-time condition can be violated because of the intrinsic non-Markovian effect of the devices.

[Back to program](#Program)

---

### Tetsuya Hoshino

hoshino@cc.u-tokyo.ac.jp
The University of Tokyo
co-authors: Akihiro Ida (Japan Agency for Marine-Earth Science and Technology), Toshihiro Hanawa (The University of Tokyo)

Title: *“Optimizations of H-matrix Vector Multiplication for A64FX”*

Abstract: The supercomputer Wisteria/BDEC-01, which started operation at the Information Technology Center of the University of Tokyo in May 2021, is equipped with the A64FX processor, as is the supercomputer Fugaku. The A64FX is the first processor that implements the Scalable Vector Extension (SVE), and since there are few evaluation cases, clarifying its performance characteristics is an urgent issue.
Hierarchical matrices (H-matrices) are one of the matrix approximation methods and are effective for dense matrices that appear as coefficient matrices in the boundary element method. In order to speed up HACApK, which is an H-matrix library, it is necessary to efficiently handle H-matrices, which have a complex data structure. In this presentation, we focus on H-matrix-vector multiplication and evaluate it on A64FX, Intel Xeon CascadeLake, and AMD EPYC Rome.

[Back to program](#Program)

---

### Torbjörn Nordling

t@nordlinglab.org
National Cheng Kung University

Title: *“Feature Selection under Uncertainty--Avoiding the Combinatorial Explosion by Making the Problem Seemingly Harder”*

Abstract: Feature selection is a challenging fundamental problem encountered every time one constructs a model of a system. Typically, prior knowledge of which features are essential to measure and include is used. In the absence of such knowledge, various feature selection methods based on trying different feature combinations are used, ranging from exhaustive testing of every possible combination of features to various heuristic search algorithms. In this presentation, I discuss feature selection under uncertainty caused by measurement noise in a linear regression problem. I demonstrate that, despite the noise seemingly making the feature selection harder, it can be used to estimate uncertainty sets of each feature and convert the problem into a feature-selection-under-uncertainty problem for which a provably necessary subset of features can be found. To encourage discussion and hopefully provide an intuitive understanding, this is all presented using one single example motivated by the need to infer gene regulatory networks.

[Back to program](#Program)

---

### Satoshi Ohshima

ohshima@cc.nagoya-u.ac.jp
Nagoya University

Title: *“QR Factorization of Block Low-rank Matrices on Multi-Instance GPU”*

Abstract: QR factorization is one of the important computations and is used in various numerical simulations. Many large-scale simulations using huge matrices require both a large amount of memory and a long computation time. Low-rank approximation methods are expected to decrease both of them. The block low-rank (BLR) matrix is one of the low-rank approximation methods, and QR factorization of BLR matrices has already been implemented. To reduce its execution time, we are trying to utilize Multi-Instance GPU (MIG). In this talk, we present the current status.

[Back to program](#Program)

---

### Cheng-Fang Su

scf1204@nctu.edu.tw
National Yang Ming Chiao Tung University

Title: *“Quantum circuit design for computer-assisted Shor’s algorithm”*

Abstract: This talk will first briefly introduce Shor’s algorithm for cracking RSA encryption and explain our achievements in circuit design. Our team successfully constructed the universal quantum gate for Shor’s algorithm and derived the cost of this quantum circuit to estimate its complexity. If researchers can improve the hardware in the future, the quantum circuit design proposed in our study can factorize any composite number. This talk is based on joint work with Chair Professor Chi-Chuan Hwang from National Cheng Kung University.

[Back to program](#Program)

---

### Takeshi Iwashita

iwashita@iic.hokudai.ac.jp
Hokkaido University

Title: *“Subspace Correction Preconditioning and Deflation Method for a Sequence of Linear Systems and Condition Number Estimation by Using Error Vector Sampling”*

Abstract: In this talk, we introduce an efficient use of error vector sampling for accelerating the convergence of the preconditioned CG method and for condition number estimation. First, we focus on solving a series of linear systems with an identical (or similar) coefficient matrix.
The linear systems are sequentially processed due to the dependence of the right-hand side vector on the solution vector of the prior linear system. For such a problem, we investigate the subspace correction and deflation methods to accelerate the convergence of the Krylov subspace method. Practically, these acceleration methods work well when the range of the auxiliary matrix contains eigenvectors corresponding to small eigenvalues of the coefficient matrix. We have developed a new auxiliary matrix construction method to identify the approximation of the eigenvectors with small eigenvalues using error vector sampling in the prior solution step. Numerical tests confirm that both subspace correction and deflation methods with the generated auxiliary matrix accelerate the convergence of the iterative solver. Next, we show that the error vector sampling is effective to estimate the condition number of the coefficient matrix.

[Back to program](#Program)

---

### Yu-Hsun Lee

andylee0914.tw@gmail.com
National Cheng Kung University

Title: *“Reliability of GPU-accelerated Vortex Filament Evolution under the Biot–Savart Law”*

Abstract: In this talk, we investigate the numerical treatment of the Biot–Savart law and provide computation reliability for vortex filament evolution based on Rosenhead's regularized Biot–Savart law for incompressible fluid. We reproduce the numerical reconnection of vortex filaments by GPU computation and establish reliability based on numerical experiments. We measure the rounding error quantitatively by comparing the numerical solutions with quadruple precision and multiple-precision arithmetic.
Our proposed GPU computation enables us to investigate the relation of temporal and spatial discretization parameters to mitigate disturbance in numerical solutions in reasonable computational time.

[Back to program](#Program)

---

### Takeshi Terao

takeshi.terao@riken.jp
RIKEN
co-authors: Toshiyuki Imamura (RIKEN), Katsuhisa Ozaki (Shibaura Institute of Technology)

Title: *“Development of a Reliable Eigensolver on Supercomputer”*

Abstract: We are developing a reliable eigensolver on a supercomputer. Recent supercomputers can solve problems of, for example, one million dimensions with a reasonable computational cost. However, the errors in the computational results are often opaque. The proposed eigensolver computes the eigenvalues together with their errors, and the error bounds are computed using a numerical verification method. Several numerical verification methods for eigenvalues have been proposed, but they differ in computational speed and overestimation of errors. In the proposed method, the best verification method is selected according to the tolerance given by the user. The presentation will compare the computational performance with EigenExa on the Fugaku supercomputer.

[Back to program](#Program)

---

### Masahiro Hamano

hamano@gs.ncku.edu.tw
National Cheng Kung University

Title: *“Probabilistic Computation and Stochastic Exponential Comonad”*

Abstract: Recent trends in probabilistic programming for statistical modelling require a clear semantic account of computation, separate from statistical algorithms yet aimed at explainable outcomes. Probabilistic semantics is more widely applied to transition systems with continuous state spaces for concurrent systems such as process calculi. The category-theoretical notion of monad, which is ubiquitous in computer science, needs to be accommodated to stochasticity, as the notion has provided a foundation for modelling computation and programming.
An instance of this has already been established by the probabilistic foundation of the Giry monad, which yields stochastic relations (a.k.a. Markov transition kernels), analogous to what the category of relations has been to deterministic discrete systems. This talk presents how the notion of monad arises consistently with a duality in probabilistic computation. The duality, which is the origin of Girard's Linear Logic and was later extended over discrete probability by Danos-Ehrhard's probabilistic coherent spaces ('11), is shown to generalise to continuous measurable spaces. The generalisation enables us to extract an intrinsic limit for the modal operator (the exponential) by genuine (Lebesgue monotone) convergence. The functorial tensor product by Staton's s-finite kernels ('17) (for commutative programming semantics) also plays a crucial role, making information flow pictorial, which guarantees that our semantics is powerful enough to have feedback as well as iteration.

[Back to program](#Program)

---

### Sameer Deshmukh

sameer.deshmukh@rio.gsic.titech.ac.jp
Tokyo Institute of Technology
Contact: Rio Yokota: rioyokota@gsic.titech.ac.jp

Title: *“Acceleration of $O(N)$ Solvers for Large Dense Matrices”*

Abstract: Low-rank approximation of the off-diagonal blocks of large dense matrices arising from boundary element methods, electrostatics, and weather prediction allows us to reduce the time needed for a direct factorization from $O(N^3)$ to $O(N)$. In this talk, we will give a brief overview of the math behind such methods, followed by our work on accelerating them on highly parallel shared and distributed memory computers.

[Back to program](#Program)

---

### Muhammad Ridwan Apriansyah

ridwan@rio.gsic.titech.ac.jp
Tokyo Institute of Technology
Contact: Rio Yokota: rioyokota@gsic.titech.ac.jp

Title: *“Parallel QR Factorization of Block Low-rank Matrices”*

Abstract: Matrices from many scientific applications have been shown to have rank-deficient blocks on the off-diagonal.
In this talk, we present two new algorithms to perform QR factorization of such matrices. Our algorithms exploit the Block Low-Rank (BLR) structure to achieve more than an order of magnitude faster computation time than the state-of-the-art vendor-optimized dense QR factorization. Our algorithms are also robust to ill-conditioning and show promising parallel scalability on multicore architectures. We discuss how to adapt the existing blocked and tiled algorithms to exploit BLR structure and present numerical experiment results on a shared memory system.

[Back to program](#Program)

---

### Thomas Spendlhofer

thomas@rio.gsic.titech.ac.jp
Tokyo Institute of Technology
Contact: Rio Yokota: rioyokota@gsic.titech.ac.jp

Title: *“Iterative Refinement with Hierarchical Low-rank Preconditioners Using Mixed Precision”*

Abstract: LU factorization is one of the standard algorithms used to solve linear systems of the form $Ax=b$. However, due to the limits of numerical calculations, it fails to produce an accurate solution if the matrix $A$ is very ill-conditioned. In such a case, iterative refinement is commonly employed to increase the accuracy of the initial solution. Recently, it has been demonstrated that mixed precision iterative refinement is capable of accurately solving systems (in double precision) up to a condition number of $\kappa_{\infty} \leq 10^{16}$, even if the factorization is computed in single precision. Nonetheless, this algorithm still requires a full factorization of the matrix $A$, resulting in a complexity of $O(n^3)$. By making use of hierarchical low-rank approximation, that initial complexity could be reduced to $O(n)$, resulting in a decreased runtime. This method, along with the results of an initial performance analysis, will be introduced in my talk.
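
The basic mixed precision refinement loop that this abstract builds on can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's method: it uses a plain dense LU factorization rather than a hierarchical low-rank one, emulates single precision by rounding through `struct`, performs the triangular solves in double precision with the single-precision factors, and all function names are hypothetical.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_single(a):
    """LU with partial pivoting; every stored value is rounded to single precision."""
    n = len(a)
    lu = [[to_single(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_single(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_single(lu[i][j] - lu[i][k] * lu[k][j])
    return lu, perm


def lu_solve(lu, perm, b):
    """Forward and backward substitution with the (low-precision) factors."""
    n = len(b)
    y = [b[p] for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def residual(a, x, b):
    """Residual b - A x, evaluated in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine(a, b, iterations=5):
    """Factor once in single precision, then correct the solution in double."""
    lu, perm = lu_single(a)
    x = lu_solve(lu, perm, b)
    history = [max(abs(r) for r in residual(a, x, b))]
    for _ in range(iterations):
        correction = lu_solve(lu, perm, residual(a, x, b))
        x = [xi + di for xi, di in zip(x, correction)]
        history.append(max(abs(r) for r in residual(a, x, b)))
    return x, history


if __name__ == "__main__":
    n = 4
    hilbert = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
    rhs = [sum(row) for row in hilbert]  # exact solution is close to all ones
    solution, history = refine(hilbert, rhs)
    print("solution:", solution)
    print("residual history:", history)
```

The expensive factorization happens once in low precision, and each refinement step costs only a residual evaluation and two triangular solves. Replacing the dense factorization with a cheaper approximate one, as the abstract proposes, keeps the same loop structure.
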

[Back to program](#Program)

---

### Wei-Ann Lin

r28071012@ncku.edu.tw
National Cheng Kung University
Advisor: Ray-Bin Chen rbchen@ncku.edu.tw

Title: *“Tree-based Gaussian Process with Many Qualitative Factors”*

Abstract: In computer experiments, Gaussian process models are commonly used for emulation. However, when both qualitative and quantitative factors are present in the experiments, emulation using Gaussian process models becomes challenging. In particular, when the experiments involve many qualitative factors, existing methods in the literature become cumbersome due to the curse of dimensionality. Motivated by the computer simulations for the design of a cooling system, we propose a new tree-based Gaussian process for emulating computer experiments with many qualitative and quantitative factors. The proposed method incorporates tree structures to model the qualitative factors, with Gaussian process models in the leaf nodes for modeling the quantitative factors. Numerical simulations as well as a real example for the design of a cooling system show that the proposed method enjoys good prediction accuracy while retaining model interpretability.

[Back to program](#Program)

---

### ending