# eScience-conference-2019

# Keynote: Randy Olson - USC (rolson@usc.edu)

Tying science and filmmaking together.

**ABT Narrative**
* ABT - And, But, Therefore: key to storytelling, a 3-part structure
* How this relates to scientific research: there is a problem - we are trying to solve this problem
* Replacing `and` with `but` changes the dynamic of a story
* "Rule of Replacing" used by the South Park creators
* Narrative vs. non-narrative
* 2 audiences: the inner group (e.g. your lab) and the outer group (everyone else, who doesn't know your work)
* Figure out the setup, the problem, and the solution (what we're doing about it) - the ABT template
* Narrative is what everyone else needs in order to understand

**Book: Houston, We Have a Narrative - Why Science Needs a Story**

ABT training: conditioning to develop narratives; narrative training via story circles.
2-step model: 1. demo day - introduction to ABT as a group; 2. story circles - 10 one-hour sessions.
"And" is the simplest form of agreement, common ground, understanding.
* Distilling work
* Consensus building / cultural divides
* Jargon - develop a good narrative first; jargon can be important, but don't begin with jargon
* Consistency
* Interviews / press conferences

**Book: Narrative is Everything**

Nicholas Kristof NYT article about narrative.

**Book: The One Thing**

---

# Wednesday September 25, 2019 10:30am - 11:00am

### defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two computing environments, Cray Urika-GX and Eddie, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

**Various models developed:**
* PAPER object model (data - British Library Newspapers)
* ALTO object model (British Library Books)
* FMP object model (Find My Past Newspapers)
* reconstruction of article format

Digital collection variables: Dataset (collection), Period, Structure, XML schema, Space, Model.
NLP processing of data.

**Text mining queries:** ALTO model, PAPERS model, NZPP model, FMP model.

Installing defoe: conda environment. Running defoe: submit the source code to Spark (see the sketch below).
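defoe itself is driven by a spark-submit command that names an object model and a query module (see its repository); the following is only a minimal PySpark sketch of the kind of parallel keyword query it runs, with placeholder file paths and query terms.

```python
# Minimal PySpark sketch of a defoe-style keyword-frequency query.
# Paths and the keyword list are placeholders; defoe itself parses the
# collection-specific XML (ALTO, PAPERS, FMP, ...) before this stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyword-query").getOrCreate()
sc = spark.sparkContext

keywords = {"cholera", "famine"}                # example query terms
docs = sc.wholeTextFiles("data/books/*.txt")    # (path, text) pairs

counts = (docs
          .flatMap(lambda kv: kv[1].lower().split())
          .filter(lambda w: w in keywords)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```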
---

### Understanding a Rapidly Expanding Refugee Camp Using Convolutional Neural Networks and Satellite Imagery

**Susanne Benz (UC San Diego)**, Hogeun Park (UC San Diego), Jiaxin Li (UC San Diego), Daniel Crawl (UC San Diego), Jessica Block (UC San Diego), Mai Nguyen (UC San Diego), and Ilkay Altintas (UC San Diego)

In summer 2017, close to one million Rohingya, an ethnic minority group in Myanmar, fled to Bangladesh due to the persecution of Muslims. This large influx of refugees has settled around existing refugee camps. Because of this dramatic expansion, the newly established Kutupalong-Balukhali expansion site lacked basic infrastructure and public services. While Non-Governmental Organizations (NGOs) such as the Refugee Relief and Repatriation Commissioner (RRRC) conducted a series of counting exercises to understand the demographics of refugees, our understanding of camp formation is still limited. Since the household-type survey is time-consuming and does not capture geo-information, we propose to use a combination of high-resolution satellite imagery and machine learning (ML) techniques to assess the spatiotemporal dynamics of the refugee camp. Four Very High Resolution (VHR) images (i.e., WorldView-2) are analyzed to compare the camp pre- and post-influx. Using deep learning and unsupervised learning, we organized the satellite image tiles of a given region into geographically relevant categories. Specifically, we used a pre-trained convolutional neural network (CNN) to extract features from the image tiles, followed by cluster analysis to segment the extracted features into similar groups. Our results show that the size of the built-up area increased significantly from 0.4 km2 in January 2016 and 1.5 km2 in May 2017 to 8.9 km2 in December 2017 and 9.5 km2 in February 2018. Through the benefits of unsupervised machine learning, we further detected the densification of the refugee camp over time and were able to display its heterogeneous structure. The developed method is scalable and applicable to rapidly expanding settlements across various regions, and is thus a useful tool to enhance our understanding of the structure of refugee camps, which enables us to allocate resources for humanitarian needs to the most vulnerable populations.

From UC San Diego - GPS. A sketch of the CNN-plus-clustering step follows below.
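A minimal sketch of the methodology described in the abstract: features from a pre-trained CNN followed by clustering. The ResNet50 backbone, Keras/scikit-learn stack, tile paths, and number of clusters are illustrative assumptions, not the authors' actual setup.

```python
# Extract features from image tiles with a pre-trained CNN, then cluster
# the tiles into similar groups (e.g. built-up vs. not built-up).
import glob
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import KMeans

cnn = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def tile_features(paths):
    feats = []
    for p in paths:
        img = image.load_img(p, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), 0))
        feats.append(cnn.predict(x, verbose=0)[0])
    return np.array(feats)

tiles = sorted(glob.glob("tiles/*.png"))          # placeholder tile directory
X = tile_features(tiles)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(dict(zip(tiles, labels)))
```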
---

### ENVRI-FAIR - Interoperable Environmental FAIR Data and Services for Society, Innovation and Research

Andreas Petzold (Forschungszentrum Jülich GmbH), Ari Asmi (University of Helsinki), Alex Vermeulen (Lund University), Gelsomina Pappalardo (CNR Institute of Methodologies for Environmental Analysis), Daniele Bailo (Istituto Nazionale di Geofisica e Vulcanologia), Dick Schaap (MARIS B.V.), Helen M. Glaves (British Geological Survey), Ulrich Bundke (Forschungszentrum Jülich GmbH), and **Zhiming Zhao (University of Amsterdam)**

ENVRI-FAIR is the connection of the Cluster of European Environmental Research Infrastructures (ENVRI) to the European Open Science Cloud (EOSC). The overarching goal of ENVRI-FAIR is that, at the end of the project, all participating RIs have built a set of FAIR data services which enhances the efficiency and productivity of researchers, supports innovation, enables data- and knowledge-based decisions and connects the ENVRI Cluster to the EOSC. This goal is reached by: (1) well-defined community policies and standards on all steps of the data life cycle, aligned with the wider European policies as well as with international developments; (2) each participating RI having sustainable, transparent and auditable data services, for each step of the data life cycle, compliant with the FAIR principles; (3) focusing the proposed work on the implementation of prototypes for testing pre-production services at each RI, with the catalogue of prepared services defined for each RI independently, depending on the maturity of the involved RIs; (4) exposing the complete set of thematic data services and tools provided by the ENVRI cluster under the EOSC catalogue of services.

Looking at assets - data, models, software - how can these be made accessible to the community?
Research Infrastructures (RIs): trying to find solutions to help RIs. ENVRI-FAIR helps RIs with understanding and documentation. EU Open Science Cloud requirements: modelling, diversity, being part of the global network.

**Problem 1: how to get RIs to talk to each other**
* develop common vocabularies; find common parts across RIs
* Open Distributed Processing (ODP); the ENVRI Reference Model was the end result
* ontology description of RIs: www.oil-e.net
* ontologies stored in an RDF triplestore, creating an ENVRI community knowledge base (a sketch follows after these notes)

**Problem 2: identify common problems and development plans**
* data culture and FAIR data
* the difficult part of FAIRness: how to assess FAIRness?
* used a custom GO FAIR template; collaborating with GO FAIR and other communities
* created an ENVRI-FAIR questionnaire and evaluated the answers

**Problem 3: how to organize joint activities?**
* working with developer teams in the RIs and ENVRI
* created Agile teams to tackle use cases
* how to find an automated solution?
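A minimal sketch of the "ontologies in an RDF triplestore" idea using rdflib; the file name and the query are placeholders, and the real ENVRI knowledge base is served from a triplestore endpoint rather than a local file.

```python
# Load an OIL-E/ENVRI-style RDF description and query it with SPARQL.
from rdflib import Graph

g = Graph()
g.parse("envri-ri-description.ttl", format="turtle")   # hypothetical export

query = """
SELECT ?cls (COUNT(?inst) AS ?n)
WHERE { ?inst a ?cls }
GROUP BY ?cls
ORDER BY DESC(?n)
"""
# List the classes used to describe the research infrastructures,
# ranked by how many instances each has.
for cls, n in g.query(query):
    print(cls, n)
```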
---

### Data Identification and Process Monitoring for Reproducible Earth Observation Research

**Bernhard Gößwein (Vienna University of Technology)**, Tomasz Miksa (Vienna University of Technology & SBA Research), Andreas Rauber (Vienna University of Technology), and Wolfgang Wagner (Vienna University of Technology)

Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of data is often the same, for example Sentinel-2 satellites operated by the European Space Agency, but the way data is pre-processed, corrected, updated, and later analysed may differ among the backends. Backends often lack mechanisms for data versioning; for example, data corrections are not tracked. Furthermore, an evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders reproducibility of earth observation experiments. In this paper, we present how the infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on recommendations of the Research Data Alliance regarding data identification and the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, providing also performance and storage measures to evaluate the impact of the modifications. The results indicate reproducibility can be supported with minimal performance and storage overhead.

Earth observation (EO) science; openEO.
Problem: what is the backend and what is the input? Results cannot be reproduced because the backend and the input are unknown.
Goals for reproducibility:
* document software
* identification of changing data - provide an easy way to cite and re-use data
* comparable results - see if a real scientific phenomenon occurred

Methodology: RDA recommendations on data identification; the VFramework and context model.
openEO project - a common EO interface enabling interoperability.
EODC - versioning via GitHub; data identification - data PIDs and PID storage; part of the H2020 openEO project.
Future work: adaptation to the future openEO API; implementation on different backend types, e.g. non-file-based.

---

### A Hybrid Algorithm for Mineral Dust Detection Using Satellite Data

Peichang Shi (University of Maryland), Qianqian Song (University of Maryland), Janita Patwardhan (University of Maryland), Zhibo Zhang (University of Maryland), **Jianwu Wang (University of Maryland)**, and Aryya Gangopadhyay (University of Maryland)

Mineral dust, defined as aerosol originating from the soil, can have various harmful effects on the environment and human health. The detection of dust, and particularly incoming dust storms, may help prevent some of these negative impacts. In this paper, using satellite observations from the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO), we compared several machine learning algorithms to traditional physical models and evaluated their performance regarding mineral dust detection. Based on the comparison results, we proposed a hybrid algorithm to integrate the physical model with the data mining model, which achieved the best accuracy among all the methods. Further, we identified the ranking of different channels of MODIS data based on the importance of the band wavelengths in dust detection. Our model also showed the quantitative relationships between the dust and the different band wavelengths.

**What is mineral dust?** Soil particles in the air; adversely affects air quality and human health; changes the temperature structure of the atmosphere.
Satellites providing data: MODIS (Aqua and Terra satellites), CALIPSO (lidar sensor).
Algorithm development process: simple physical algorithms; machine learning methods.
Tried 5 approaches: LR, RF, ANN, SVM, and ensemble learning with multiple stacking classifiers. The logistic regression ML method worked best.
Results: variable selection using the Lasso; algorithms saved for reproducibility. A sketch of the hybrid physical/ML idea follows below.
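A hedged sketch of the hybrid idea: a simple physical brightness-temperature test combined with a logistic regression over MODIS band features. The band indices, threshold, and toy data are assumptions for illustration, not the paper's actual channels or values.

```python
# Combine a physical threshold rule with an ML classifier for dust detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # stand-in for 5 MODIS band features
y = (X[:, 0] - X[:, 1] > 0.2).astype(int)   # stand-in dust labels (e.g. from CALIPSO)

ml_model = LogisticRegression().fit(X, y)

def physical_flag(btd_11_12):
    """Toy split-window test: a negative 11-12 micron BT difference suggests dust."""
    return (btd_11_12 < 0).astype(int)

def hybrid_predict(X, btd_11_12):
    # Use the ML probability, but let the physical test confirm borderline cases.
    p_ml = ml_model.predict_proba(X)[:, 1]
    return ((p_ml > 0.5) | (physical_flag(btd_11_12) == 1)).astype(int)

print(hybrid_predict(X[:5], X[:5, 0] - X[:5, 1]))
```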
---

### Workflow Design Analysis for High Resolution Satellite Image Analysis

**Ioannis Paraskevakos (Rutgers University)**, Matteo Turilli (Rutgers University), Bento Collares Gonçalves (Stony Brook, NY), Heather Lynch (Stony Brook, NY), and Shantenu Jha (Rutgers University and Brookhaven National Laboratory)

Ecological sciences are using imagery from a variety of sources to monitor and survey populations and ecosystems. Very High Resolution (VHR) satellite imagery provides an effective dataset for large-scale surveys. Convolutional Neural Networks have successfully been employed to analyze such imagery and detect large animals. As the datasets increase in volume, O(TB), and number of images, O(1k), utilizing High Performance Computing (HPC) resources becomes necessary. In this paper, we investigate task-parallel, data-driven workflow designs to support imagery analysis pipelines with heterogeneous tasks on HPC. We analyze the capabilities of each design when processing a dataset of 3,000 VHR satellite images for a total of 4 TB. We experimentally model the execution time of the tasks of the image processing pipeline. We perform experiments to characterize the resource utilization, total time to completion, and overheads of each design. Based on the model, overhead and utilization analysis, we show which design approach is best suited for scientific pipelines with similar characteristics.

Satellite imagery analysis application; processes WorldView-3 (WV03) images; implemented on the RADICAL Ensemble Toolkit (EnTK) - see the sketch below.
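A minimal sketch of an EnTK-style task-parallel pipeline, loosely following the examples in the RADICAL EnTK documentation; the executable, resource description, and RabbitMQ settings are placeholders and may need adjusting to the installed EnTK version. It is not the paper's actual pipeline.

```python
# One stage of independent image-processing tasks, described with EnTK.
from radical.entk import Pipeline, Stage, Task, AppManager

p = Pipeline()
s = Stage()

for i in range(4):                          # e.g. one task per image tile
    t = Task()
    t.executable = '/bin/echo'              # stand-in for the real analysis code
    t.arguments = ['processing tile %d' % i]
    s.add_tasks(t)

p.add_stages(s)

amgr = AppManager(hostname='localhost', port=5672)    # RabbitMQ endpoint
amgr.resource_desc = {'resource': 'local.localhost',   # placeholder resource
                      'walltime': 10,
                      'cpus': 4}
amgr.workflow = [p]
amgr.run()
```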
---

### Connect your Research Data with Collaborators and Beyond

**Amit Chourasia, SDSC**

Data is an integral part of scientific research. With a rapid growth in data collection and generation capability and an increasingly collaborative nature of research activities, data management and data sharing have become central and key to accomplishing research goals. Researchers today have a variety of solutions at their disposal, from local storage to cloud-based storage. However, most of these solutions focus on hierarchical file and folder organization. While such an organization is pervasively used and quite useful, it relegates information about the context of the data, such as descriptions and collaborative notes about the data, to external systems. This spread of information into different silos impedes the flow of research activities. In this tutorial, we will introduce and provide hands-on experience with the SeedMeLab platform, which provides a web-based data management and data sharing cyberinfrastructure. SeedMeLab enables research groups to manage, share, search, visualize, and present their data in a web-based environment using an access-controlled, branded, and customizable website they own and control. It supports storing and viewing data in a familiar tree hierarchy, but also supports formatted annotations, lightweight visualizations, and threaded comments on any file/folder. The system can be easily extended and customized to support metadata, job parameters, and other domain- and project-specific contextual items. The software is open source and available as an extension to the popular Drupal content management system. For more information visit: http://SeedMeLab.org

Motivation: data curation, data use/re-use.
Research teams' stumbling blocks: access control, collaboration, storage, transfer, automation.
Problems:
* information dispersion - scattered data is harder to manage and use
* weak presentation - data requires context to give it depth and meaning
* dark data - data that exists but is hard to find

SeedMe data hub: discovery, discussion, description, display.
Use cases for SeedMeLab: strengthen your brand with your data; use as a collaboration hub; use for a data management plan. Example: Laser Plasma Lab @ UCSD.
Application hub examples: CIPRES Gateway (compute in a website), GenApp Gateway.

Architecture of SeedMeLab: client/server. The client is a web browser or a REST client (custom web applications). Content management: language interpreter, database, website management - uses a CMS.
Drupal content management system - one of the major CMSs; the CMS is customized to handle data management: account management, site layout, security, content management.
SeedMeLab adds specialized modules for Drupal plus a specialized REST client. Open source.

SeedMeLab features:
* FolderShare and FolderShare REST - enable the file-sharing system; every file and folder has a description
* Formatter Suite - tools for formatting content
* Chart Suite - preview visualizations of data
* CILogon authentication
* 500 MB-2 GB uploads will break
* SeedMeLab hosting managed by SDSC; regulatory-compliant data under the Sherlock Cloud package

amit@sdsc.edu

---

**Odd Erik Gundersen**

The scientific method in empirical AI research: beliefs about the AI method, study design, hypothesis, prediction, experiment, results, interpretation.
Reproducibility spectrum: R.D. Peng, Science 2011; V. Stodden - replication and reproduction definitions; S.N. Goodman, Fanelli, Ioannidis - definitions of reproducibility.
Reproducibility in AI research: producing the same results using the same AI method, based on the documentation made by the original research team. Documentation includes: method, data, experiment.
Degrees of reproducibility:
* R1 - share all
* R2 - data but no code
* R3 - written text only

ML: data used for training and results; share code, hardware info.
Reproducibility metrics; machine learning platforms were examined and scored.

---

# eScience 2019 - Day 2 - September 26, 2019

## Keynote

**Manish Parashar** "Exploring the Future of Facilities-based, Data-Driven Science"

Large-scale experimental and observational facilities, individually and collectively, provide new opportunities for data-driven research across a wide range of science and engineering domains. These facilities provide shared-use infrastructure, instrumentation, and data products that are openly accessible to a broad community of researchers and educators. However, as these facilities grow in scale and provide increasing volumes of data and data products, effectively using them has become a significant challenge. In this talk, I will explore new opportunities enabled by these facilities as well as the new challenges presented. I will also explore how the cyberinfrastructure continuum, from the edge to extreme-scale systems, can be harnessed to support end-to-end data-driven workflows. Specifically, I will explore approaches for intelligent data delivery, in-transit data processing and edge-core integration. This research is part of the Continuum Computing project at the Rutgers Discovery Informatics Institute.

NSF Ocean Observatories Initiative (OOI): ooinet.oceanobservatories.org
Facility-based data-driven science challenges: data products are many and diverse; data volumes are large, e.g. LIGO, LSST.
Facility-based data-driven science is exciting, wide-ranging and highly impactful: open, near-real-time, democratized access to data from geographically distributed sensors, instruments, and experiments.
Use case: tsunami early warning. Goal: combine multiple data sources to cover the whole spectrum of events.

Data access challenges:
* Example 1: large-volume, high-data-rate, geographically distributed datasets (OOI, LIGO, SKA, LSST); large numbers of geographically distributed users; communities with overlapping interests.
* Example 2: data transfer performance for large-scale climate data analysis.

Addressing data access: prefetching; a hybrid prefetching model; an association-based prediction model (see the sketch below).
Data recommendation: data objects and accesses are correlated along many dimensions - can we leverage domain knowledge, data models and knowledge graphs to recommend (and push) data to users?
Challenges: data access, discovery, integration; leveraging data models; adding domain knowledge.
Data processing using edge and in-transit resources: leverage resources at the edge and on in-transit nodes.
Summary: data-driven science and engineering research enabled by large-scale, shared-use experimental and observational facilities presents new opportunities for discovery.
http://parashar.rutgers.edu parashar@rutgers.edu
Questions: about curation and cleaning - how to deploy in the pipeline; community-based infrastructure - how to implement.
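The association-based prediction idea from the notes above, sketched generically (not the project's actual model): learn which data object tends to be requested after which, and prefetch the most likely successor. The access log and object names are invented.

```python
# Learn successor counts from an access log, then prefetch the most
# frequent successor of the object just requested.
from collections import defaultdict, Counter

access_log = ["ooi/ctd_2019_01", "ooi/ctd_2019_02", "ooi/ctd_2019_01",
              "ooi/ctd_2019_02", "ooi/adcp_2019_02"]

successors = defaultdict(Counter)
for prev, nxt in zip(access_log, access_log[1:]):
    successors[prev][nxt] += 1

def prefetch_candidate(just_accessed):
    """Return the most likely next object, or None if unseen."""
    ranked = successors.get(just_accessed)
    return ranked.most_common(1)[0][0] if ranked else None

print(prefetch_candidate("ooi/ctd_2019_01"))   # -> "ooi/ctd_2019_02"
```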
---

**Yoshio Tanaka, National Institute of Advanced Industrial Science and Technology (AIST), Japan**

**Future Vision of e-Science Based on the Insights Gained Through Experiences**

About 20 years have passed since the term "e-Science" was coined. Since then, rapid development of new information technologies has greatly impacted the research methods and lifecycle of the scientific enterprise. State-of-the-art technologies such as big data analytics, IoT, AI, and robotics may solve currently impossible problems; however, there still remain fundamental problems that must be solved to make the vision of e-Science a reality. In this talk, the future vision of e-Science will be presented based on the insights gained through experiences from past research such as Grid and Cloud computing, which aimed to build on-demand and dynamic virtual infrastructures based on requirements created by application usage. Through this examination, fundamental problems will be identified and mapped to the current e-Science landscape. Understanding these issues will lead to solutions that must be achieved in order to make the vision of e-Science a reality.

AIST - AI Bridging Cloud Infrastructure (ABCI).
Asia GEO Grid initiative project: combining satellite and in-situ data; geospatial information processing; global monitoring of hot areas; change detection and recognition using ML. PRAGMA.
AI Platform / Data Exploitation Platform (tentative name) in Japan: provide a rapid PoC on-demand platform; infrastructure of the data exploitation platform.
Challenges for making the vision of e-Science a reality: easy-to-use interfaces and workflows, scalability/performance, application development, security, data interoperation, FAIR.
Overall design of the platform infrastructure.

---

**Chaitan Baru, Senior Science Advisor, National Science Foundation**

**Knowledge as Infrastructure**

Knowledge networks that encode curated information about data, tools, processes, and the science itself will be essential cyberinfrastructure for the future. Recognizing the importance of knowledge graphs, the US National Science Foundation has embraced the creation of an Open Knowledge Network through its newly announced Convergence Accelerator pilot. The Convergence Accelerator pilot started with a singular vision: to identify areas of research where investment in convergent approaches -- those bringing together people across disciplines to solve problems -- has the potential to yield high-benefit results. The Convergence Accelerator seeks to expand and refine NSF's efforts to support fundamental scientific exploration through partnerships that potentially include stakeholders from industry, government, nonprofits and other sectors. Tracing the evolution of eScience topics from its first year of establishment in 2005, one finds topics such as web services, service-oriented architecture, grid computing, high-performance computing, data management and digital repositories, dataflow, middleware, and cloud computing. In the future, eScience will be assisted by AI -- intelligent agents, smart instruments, smart tutors, machine learning and deep learning for data interpretation, and the like. Knowledge graphs are essential to the success of such AI and are, therefore, central to eScience. This talk will introduce challenges and opportunities of knowledge networks and will provide an overview of the NSF Convergence Accelerator for the Open Knowledge Network.
NSF Convergence Accelerator: a new organizational structure to accelerate the transition of convergence research into practice in areas of national importance; designed for larger, national-scale teams and cohorts.
OKN - Open Knowledge Network: motivation.

---

**Roselyne Tchoua, University of Chicago**

**Active Learning Yields Better Training Data for Scientific Named Entity Recognition**

**Roselyne Tchoua (University of Chicago)**, Aswathy Ajith (University of Chicago), Zhi Hong (University of Chicago), Logan Ward (Argonne National Laboratory), Kyle Chard (University of Chicago), Debra Audus (National Institute of Standards and Technology), Shrayesh Patel (University of Chicago), Juan de Pablo (University of Chicago), and Ian Foster (Argonne National Laboratory)

Despite significant progress in natural language processing, machine learning models require substantial expert annotated training data to perform well in tasks such as named entity recognition (NER) and entity relations extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad-hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We have previously designed polyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to efficiently obtain more annotations from experts and improve performance. PolyNER requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical natural language processing toolkit, highlighting the potential for human-computer partnership in domain-specific scientific NER.

PolyNER system: streamlines machine learning to find materials-science facts in text. Polymer extraction is a manual process - extracting facts about polymers. Scientific Information Extraction (IE).

**Challenges in information extraction:**
* non-standard naming, acronyms/synonyms, families/classes
* verbiage and vocabularies: polysemy and synonymy - the same word can mean different things in different contexts, and many words can mean the same thing
* lack of available training data for NER / scientific NER
* training data is expensive to obtain and prepare

PolyNER goal: slash the cost of NER training by using bootstrap methods to optimize the effectiveness/impact of expert labeling.
Process for finding polymers in text: remove English words (spaCy, NLTK), use part-of-speech tagging, remove numbers and unwanted characters.
Multi-level expertise for candidate labeling: labels provided by untrained crowds and by expert polymer scientists. A sketch of the active learning loop follows below.
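A hedged sketch of active learning by uncertainty sampling, the general technique named in the title; PolyNER's real pipeline classifies word vectors of candidate entities, whereas the features and the "expert oracle" here are toy stand-ins.

```python
# Repeatedly train a classifier, ask the "expert" to label the candidates
# it is least certain about, and retrain on the growing labeled set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 20))              # candidate-entity features (toy)
true_labels = (X_pool[:, 0] > 0).astype(int)     # hidden "expert" answers (toy)

labeled = list(range(10))                        # small seed set
for _ in range(5):
    clf = LogisticRegression().fit(X_pool[labeled], true_labels[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labeled] = np.inf                # don't re-ask about labeled items
    ask = np.argsort(uncertainty)[:20]           # most uncertain candidates
    labeled.extend(ask.tolist())                 # "expert" labels them

print("labeled examples used:", len(labeled))
print("pool accuracy:", clf.score(X_pool, true_labels))
```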
---

# Poster session 2:30p - 3:30p

***Presentation*** post posters for shared service

---

**Michael Wyatt**

**Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers**

Michela Taufer (The University of Tennessee), Stephen Thomas (The University of Tennessee), Michael Wyatt (The University of Tennessee), Tu Mai Anh Do (University of Southern California), Loïc Pottier (University of Southern California), Rafael Ferreira da Silva (University of Southern California), Harel Weinstein (Cornell University), Michel A. Cuendet (Cornell University; Lausanne University Hospital), Trilce Estrada (University of New Mexico), and Ewa Deelman (University of Southern California)

Molecular Dynamics (MD) simulations executed on state-of-the-art supercomputers are producing data at a rate faster than it can be written out to disk. In situ and in transit analysis of data produced by MD simulations reduces the original volume of information by several orders of magnitude, thereby alleviating the negative impact of the I/O bottleneck. This work focuses on characterizing the impact of in situ and in transit analytics on the overall MD workflow performance, and the capability for capturing rapid, rare events in the simulated molecular system. The MD simulation and analysis processes share data via remote direct memory access (RDMA) using DataSpaces. Our metrics of interest are time spent waiting in I/O or frames lost by the MD simulation, and idle time by the analysis. We measure these metrics for a diverse set of molecular systems, characterize their trends for in situ and in transit configurations, and model which frames are dropped and which ones are analyzed for a real use case. The insights gained from this study are generally applicable to in situ and in transit workflows that require optimization of parameters to minimize loss in workflow performance and analytic accuracy.

Notes: collective variables; data workflow modeling for analytics; generator of MD frames. A toy model of the frame-drop behaviour follows below.
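A toy illustration (not the paper's model) of when an unbuffered in situ analysis drops frames: any frame that arrives while the analysis is still busy with the previous one is lost.

```python
# The simulation emits a frame every t_sim seconds; the analysis takes
# t_ana seconds per frame and, without buffering, skips frames that
# arrive while it is busy.
def dropped_frames(n_frames, t_sim, t_ana):
    analysed, busy_until = 0, 0.0
    for i in range(n_frames):
        t_emit = i * t_sim
        if t_emit >= busy_until:        # analysis is idle: take this frame
            analysed += 1
            busy_until = t_emit + t_ana
    return n_frames - analysed

# e.g. a frame every 2 s with 5 s of analysis per frame:
# roughly two thirds of the frames are dropped.
print(dropped_frames(n_frames=100, t_sim=2.0, t_ana=5.0))
```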
---

# Afternoon Keynote

**Maryann Martone, UC San Diego**

The launch of several international large brain projects indicates that we are still far from understanding the brain at even a basic level, let alone being able to intervene meaningfully in most degenerative, psychiatric and traumatic brain disorders. Such projects reflect the idea that neuroscience needs to be placed on a more data-rich, computational footing to address the inherent complexity of the nervous system. But should we just be looking towards big science to produce comprehensive and integrated data and tools? What about the thousands of studies conducted by individual investigators and small teams, so-called "long tail data"? How does the regular practice of neuroscience need to change to address grand challenges in brain science? Across the breadth of academia, researchers are defining new modes of scholarship designed to take advantage of 21st century technology for linking and distributing information. Principles, best practices and tools for networked scholarship are emerging. Chief among these is the move towards open science, making the products of research as open as possible to ensure their broadest use. Second, increased recognition that research outputs should not only include journal articles and books, but data, tools and workflows. Third, that research outputs should be FAIR: Findable, Accessible, Interoperable and Reusable - the characteristics required for making digital objects maximally useful for both humans and machines. FAIR encompasses the agreement upon and use of community standards for data exchange. Finally, that citation and credit systems be redesigned to reflect the broadening of scientific output. In this presentation, I will discuss the community and technical infrastructure for moving neuroscience towards an open, FAIR and citable science, highlighting our experiences in building and maintaining the Neuroscience Information Framework and other related projects. I will also provide an example of work underway in the spinal cord injury community to come together around the sharing and integration of long tail data.

NIF - Neuroscience Information Framework: an initiative of the NIH Blueprint consortium of institutes; neuinfo.org
FORCE11 Manifesto: FAIR, open, research-object-based, citable, ecosystem. FAIR in neuroscience is needed.
Big national brain projects - researchers working on brain initiatives creating huge pools of data; small brain projects - long-tail science.
Facilities of Research Excellence in Spinal Cord Injury (FORE-SCI); "syndromic space".
FAIR requires both technical and social infrastructure. Data management in labs is a major problem - very few practices in the lab. Neuroscience data repositories; standards in neuroscience.
Open Data Commons for Spinal Cord Injury: https://scicrunch.org/odc-sci
INCF - a standards organization to support global neuroscience. BIDS (Brain Imaging Data Structure), from Stanford: a file organization standard and a metadata standard.
Resource Identification Initiative.
FAIR partnership: researchers, repositories and registries, indexers, aggregators.

---

# Friday - Day 3

# Keynote

**Dieter Kranzlmüller, Ludwig-Maximilians-Universität München**

As a leadership-class computing facility, the mission of the Leibniz Supercomputing Centre (LRZ) is to enable significant achievement and advancement in science with powerful computational resources, including dedicated hardware and software coupled with computational science expertise for specific research domains. One such focus area is environmental science, which is highly relevant for our daily lives and society. With LRZ's latest supercomputer SuperMUC-NG and its combination of a powerful general-purpose architecture with integrated data science and AI capabilities, users from the environmental sciences increase the size and resolution of their models while reducing the necessary computing time.
The major factor of success, however, is the partnership between the domain scientists and the computational specialists at LRZ, which includes every step along the path to discovery, from dedicated training through the entire scientific workflow during production runs. This talk introduces the LRZ partnership model and highlights a number of example use cases from environmental computing.

Leibniz Supercomputing Centre (LRZ); SuperMUC-NG - CPUs only, no GPUs. The experts are the valuable assets, not the hardware.
Case study 1: SeisSol - numerical simulation of seismic wave phenomena.
Case study 2: StMGP / StMUV project ePIN - electronic pollen information network; monitoring and prediction of pollen flight in Bavaria using a network of online pollen monitors.
Mantle circulation model - studies continental drift.
Bringing computer scientists and research scientists together. Hot-water cooling system (45°C water).
Dieter Kranzlmüller, kranzlmuller@lrz.de

---

**Chreston Miller**

**Timing is Everything: Identifying Diverse Interaction Dynamics in Scenario and Non-Scenario Meetings**

In this paper we explore the use of temporal patterns to define interaction dynamics between different kinds of meetings. Meetings occur on a daily basis and include different behavioral dynamics between participants, such as floor shifts and intense dialog. These dynamics can tell a story of the meeting and provide insight into how participants interact. We focus our investigation on defining diversity metrics to compare the interaction dynamics of scenario and non-scenario meetings. These metrics may be able to provide insight into the similarities and differences between scenario and non-scenario meetings. We observe that certain interaction dynamics can be identified through temporal patterns of speech intervals, i.e., when a participant is talking. We apply the principles of Parallel Episodes in identifying moments of speech overlap, e.g., interaction "bursts", and introduce Situated Data Mining, an approach for identifying repeated behavior patterns based on situated context. Applying these algorithms provides an overview of certain meeting dynamics and defines metrics for meeting comparison and diversity of interaction. We tested on a subset of the AMI corpus and developed three diversity metrics to describe similarities and differences between meetings. These metrics also present the researcher with an overview of interaction dynamics and point to points-of-interest for analysis.

Scenario vs. non-scenario: comparison amongst different small-group meeting types. Scenario - each participant is assigned a role; non-scenario - regular meetings.
Observation on data comparison: differences/similarities of two or more datasets are characterized by the data distribution - need to be careful when comparing two datasets.
Speech burst - a burst of intense activity. Situated data mining (SDM) - repeating data patterns. A sketch of overlap detection from speech intervals follows below.
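A hedged sketch of the basic building block described above: given each participant's speech intervals (start, end) in seconds, find the moments where two or more people talk at once - candidate interaction "bursts". This is a simple sweep over interval endpoints, not the paper's Parallel Episodes algorithm itself.

```python
# Find regions where at least two participants speak simultaneously.
def overlap_regions(speakers):
    """speakers: dict of participant -> list of (start, end) speech intervals."""
    events = []
    for intervals in speakers.values():
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))
    events.sort()
    regions, active, region_start = [], 0, None
    for t, delta in events:
        active += delta
        if active >= 2 and region_start is None:
            region_start = t
        elif active < 2 and region_start is not None:
            regions.append((region_start, t))
            region_start = None
    return regions

meeting = {"A": [(0.0, 5.0), (9.0, 12.0)],
           "B": [(4.0, 7.0), (11.0, 15.0)]}
print(overlap_regions(meeting))   # -> [(4.0, 5.0), (11.0, 12.0)]
```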
---

**Junan Guo (University of California San Diego), Subhasis Dasgupta (University of California San Diego), and Amarnath Gupta (University of California San Diego)**

**Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health**

We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researcher, we developed an "investigative exploration" system called BOUTIQUE that allows a user to perform multi-step visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE include its ability to handle heterogeneous types of data provided by a polystore, and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.

Looking at HIV-related social media from Twitter. Questions: are people who are at high risk for HIV aware of the preventive measures available to them? How are they becoming aware of these measures?
The AWESOME ----