# eScience-conference-2019

# Keynote: Randy Olson - USC (rolson@usc.edu)

Tying science and filmmaking together.

**ABT Narrative**
* ABT - And, But, Therefore: key to storytelling, a 3-part structure
* How this relates to scientific research: there is a problem - we are trying to solve this problem
* Replacing `and` with `but` changes the dynamic of a story
* "Rule of Replacing" used by the South Park creators
* Narrative vs. non-narrative
* 2 audiences: the inner group (e.g. your lab) and the outer group (everyone else, who doesn't know your work)
* Figure out the setup, the problem, and the solution (what we're doing about it) - the ABT template
* Narrative is what everyone else needs in order to understand

**Book: Houston, We Have a Narrative - Why Science Needs a Story**

ABT training: conditioning to develop narratives; narrative training via story circles.
2-step model: 1. demo day - introduction to ABT as a group; 2. story circles - 10 one-hour sessions.
"And" is the simplest form of agreement, common ground, understanding.
* Distilling work
* Consensus building / cultural divides
* Jargon - develop a good narrative first; jargon can be important, but don't begin with jargon
* Consistency
* Interviews / press conferences

**Book: Narrative is Everything**

Nicholas Kristof NYT article about narrative.

**Book: The One Thing**

---

# Wednesday September 25, 2019 10:30am - 11:00am

### defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two computing environments, Cray Urika-GX and Eddie, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

**Various models developed:**
* PAPER object model (data - British Library Newspapers)
* ALTO object model (British Library Books)
* FMP object model (Find My Past Newspapers)
* reconstruction of article format

Digital collection variables: Dataset (collection), Period, Structure, XML schema, Space, Model.
NLP processing of data.

**Text mining queries:** ALTO model, PAPERS model, NZPP model, FMP model.

Installing defoe: conda environment. Running defoe: submit the source code to Spark (see the sketch below).
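defoe itself is driven by a spark-submit command that names an object model and a query module (see its repository); the following is only a minimal PySpark sketch of the kind of parallel keyword query it runs, with placeholder file paths and query terms.

```python
# Minimal PySpark sketch of a defoe-style keyword-frequency query.
# Paths and the keyword list are placeholders; defoe itself parses the
# collection-specific XML (ALTO, PAPERS, FMP, ...) before this stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyword-query").getOrCreate()
sc = spark.sparkContext

keywords = {"cholera", "famine"}                # example query terms
docs = sc.wholeTextFiles("data/books/*.txt")    # (path, text) pairs

counts = (docs
          .flatMap(lambda kv: kv[1].lower().split())
          .filter(lambda w: w in keywords)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```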
---

### Understanding a Rapidly Expanding Refugee Camp Using Convolutional Neural Networks and Satellite Imagery

**Susanne Benz (UC San Diego)**, Hogeun Park (UC San Diego), Jiaxin Li (UC San Diego), Daniel Crawl (UC San Diego), Jessica Block (UC San Diego), Mai Nguyen (UC San Diego), and Ilkay Altintas (UC San Diego)

In summer 2017, close to one million Rohingya, an ethnic minority group in Myanmar, fled to Bangladesh due to the persecution of Muslims. This large influx of refugees has settled around existing refugee camps. Because of this dramatic expansion, the newly established Kutupalong-Balukhali expansion site lacked basic infrastructure and public services. While Non-Governmental Organizations (NGOs) such as the Refugee Relief and Repatriation Commissioner (RRRC) conducted a series of counting exercises to understand the demographics of refugees, our understanding of camp formation is still limited. Since the household-type survey is time-consuming and does not capture geo-information, we propose to use a combination of high-resolution satellite imagery and machine learning (ML) techniques to assess the spatiotemporal dynamics of the refugee camp. Four Very High Resolution (VHR) images (i.e., WorldView-2) are analyzed to compare the camp pre- and post-influx. Using deep learning and unsupervised learning, we organized the satellite image tiles of a given region into geographically relevant categories. Specifically, we used a pre-trained convolutional neural network (CNN) to extract features from the image tiles, followed by cluster analysis to segment the extracted features into similar groups. Our results show that the size of the built-up area increased significantly from 0.4 km2 in January 2016 and 1.5 km2 in May 2017 to 8.9 km2 in December 2017 and 9.5 km2 in February 2018. Through the benefits of unsupervised machine learning, we further detected the densification of the refugee camp over time and were able to display its heterogeneous structure. The developed method is scalable and applicable to rapidly expanding settlements across various regions, and is thus a useful tool to enhance our understanding of the structure of refugee camps, which enables us to allocate resources for humanitarian needs to the most vulnerable populations.

From UC San Diego - GPS. A sketch of the CNN-plus-clustering step follows below.
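A minimal sketch of the methodology described in the abstract: features from a pre-trained CNN followed by clustering. The ResNet50 backbone, Keras/scikit-learn stack, tile paths, and number of clusters are illustrative assumptions, not the authors' actual setup.

```python
# Extract features from image tiles with a pre-trained CNN, then cluster
# the tiles into similar groups (e.g. built-up vs. not built-up).
import glob
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import KMeans

cnn = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def tile_features(paths):
    feats = []
    for p in paths:
        img = image.load_img(p, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), 0))
        feats.append(cnn.predict(x, verbose=0)[0])
    return np.array(feats)

tiles = sorted(glob.glob("tiles/*.png"))          # placeholder tile directory
X = tile_features(tiles)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(dict(zip(tiles, labels)))
```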
---

### ENVRI-FAIR - Interoperable Environmental FAIR Data and Services for Society, Innovation and Research

Andreas Petzold (Forschungszentrum Jülich GmbH), Ari Asmi (University of Helsinki), Alex Vermeulen (Lund University), Gelsomina Pappalardo (CNR Institute of Methodologies for Environmental Analysis), Daniele Bailo (Istituto Nazionale di Geofisica e Vulcanologia), Dick Schaap (MARIS B.V.), Helen M. Glaves (British Geological Survey), Ulrich Bundke (Forschungszentrum Jülich GmbH), and **Zhiming Zhao (University of Amsterdam)**

ENVRI-FAIR is the connection of the Cluster of European Environmental Research Infrastructures (ENVRI) to the European Open Science Cloud (EOSC). The overarching goal of ENVRI-FAIR is that, at the end of the project, all participating RIs have built a set of FAIR data services which enhances the efficiency and productivity of researchers, supports innovation, enables data- and knowledge-based decisions and connects the ENVRI Cluster to the EOSC. This goal is reached by: (1) well-defined community policies and standards on all steps of the data life cycle, aligned with the wider European policies as well as with international developments; (2) each participating RI having sustainable, transparent and auditable data services, for each step of the data life cycle, compliant with the FAIR principles; (3) focusing the proposed work on the implementation of prototypes for testing pre-production services at each RI, with the catalogue of prepared services defined for each RI independently, depending on the maturity of the involved RIs; (4) exposing the complete set of thematic data services and tools provided by the ENVRI cluster under the EOSC catalogue of services.

Looking at assets - data, models, software - how can these be made accessible to the community?
Research Infrastructures (RIs): trying to find solutions to help RIs. ENVRI-FAIR helps RIs with understanding and documentation. EU Open Science Cloud requirements: modelling, diversity, being part of the global network.

**Problem 1: how to get RIs to talk to each other**
* develop common vocabularies; find common parts across RIs
* Open Distributed Processing (ODP); the ENVRI Reference Model was the end result
* ontology description of RIs: www.oil-e.net
* ontologies stored in an RDF triplestore, creating an ENVRI community knowledge base (a sketch follows after these notes)

**Problem 2: identify common problems and development plans**
* data culture and FAIR data
* the difficult part of FAIRness: how to assess FAIRness?
* used a custom GO FAIR template; collaborating with GO FAIR and other communities
* created an ENVRI-FAIR questionnaire and evaluated the answers

**Problem 3: how to organize joint activities?**
* working with developer teams in the RIs and ENVRI
* created Agile teams to tackle use cases
* how to find an automated solution?
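A minimal sketch of the "ontologies in an RDF triplestore" idea using rdflib; the file name and the query are placeholders, and the real ENVRI knowledge base is served from a triplestore endpoint rather than a local file.

```python
# Load an OIL-E/ENVRI-style RDF description and query it with SPARQL.
from rdflib import Graph

g = Graph()
g.parse("envri-ri-description.ttl", format="turtle")   # hypothetical export

query = """
SELECT ?cls (COUNT(?inst) AS ?n)
WHERE { ?inst a ?cls }
GROUP BY ?cls
ORDER BY DESC(?n)
"""
# List the classes used to describe the research infrastructures,
# ranked by how many instances each has.
for cls, n in g.query(query):
    print(cls, n)
```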
---

### Data Identification and Process Monitoring for Reproducible Earth Observation Research

**Bernhard Gößwein (Vienna University of Technology)**, Tomasz Miksa (Vienna University of Technology & SBA Research), Andreas Rauber (Vienna University of Technology), and Wolfgang Wagner (Vienna University of Technology)

Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of data is often the same, for example Sentinel-2 satellites operated by the European Space Agency, but the way data is pre-processed, corrected, updated, and later analysed may differ among the backends. Backends often lack mechanisms for data versioning; for example, data corrections are not tracked. Furthermore, an evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders reproducibility of earth observation experiments. In this paper, we present how the infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on recommendations of the Research Data Alliance regarding data identification and the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, providing also performance and storage measures to evaluate the impact of the modifications. The results indicate reproducibility can be supported with minimal performance and storage overhead.

Earth observation (EO) science; openEO.
Problem: what is the backend and what is the input? Results cannot be reproduced because the backend and the input are unknown.
Goals for reproducibility:
* document software
* identification of changing data - provide an easy way to cite and re-use data
* comparable results - see if a real scientific phenomenon occurred

Methodology: RDA recommendations on data identification; the VFramework and context model.
openEO project - a common EO interface enabling interoperability.
EODC - versioning via GitHub; data identification - data PIDs and PID storage; part of the H2020 openEO project.
Future work: adaptation to the future openEO API; implementation on different backend types, e.g. non-file-based.

---

### A Hybrid Algorithm for Mineral Dust Detection Using Satellite Data

Peichang Shi (University of Maryland), Qianqian Song (University of Maryland), Janita Patwardhan (University of Maryland), Zhibo Zhang (University of Maryland), **Jianwu Wang (University of Maryland)**, and Aryya Gangopadhyay (University of Maryland)

Mineral dust, defined as aerosol originating from the soil, can have various harmful effects on the environment and human health. The detection of dust, and particularly incoming dust storms, may help prevent some of these negative impacts. In this paper, using satellite observations from the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO), we compared several machine learning algorithms to traditional physical models and evaluated their performance regarding mineral dust detection. Based on the comparison results, we proposed a hybrid algorithm to integrate the physical model with the data mining model, which achieved the best accuracy among all the methods. Further, we identified the ranking of different channels of MODIS data based on the importance of the band wavelengths in dust detection. Our model also showed the quantitative relationships between the dust and the different band wavelengths.

**What is mineral dust?** Soil particles in the air; adversely affects air quality and human health; changes the temperature structure of the atmosphere.
Satellites providing data: MODIS (Aqua and Terra satellites), CALIPSO (lidar sensor).
Algorithm development process: simple physical algorithms; machine learning methods.
Tried 5 approaches: LR, RF, ANN, SVM, and ensemble learning with multiple stacking classifiers. The logistic regression ML method worked best.
Results: variable selection using the Lasso; algorithms saved for reproducibility. A sketch of the hybrid physical/ML idea follows below.
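A hedged sketch of the hybrid idea: a simple physical brightness-temperature test combined with a logistic regression over MODIS band features. The band indices, threshold, and toy data are assumptions for illustration, not the paper's actual channels or values.

```python
# Combine a physical threshold rule with an ML classifier for dust detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # stand-in for 5 MODIS band features
y = (X[:, 0] - X[:, 1] > 0.2).astype(int)   # stand-in dust labels (e.g. from CALIPSO)

ml_model = LogisticRegression().fit(X, y)

def physical_flag(btd_11_12):
    """Toy split-window test: a negative 11-12 micron BT difference suggests dust."""
    return (btd_11_12 < 0).astype(int)

def hybrid_predict(X, btd_11_12):
    # Use the ML probability, but let the physical test confirm borderline cases.
    p_ml = ml_model.predict_proba(X)[:, 1]
    return ((p_ml > 0.5) | (physical_flag(btd_11_12) == 1)).astype(int)

print(hybrid_predict(X[:5], X[:5, 0] - X[:5, 1]))
```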
---

### Workflow Design Analysis for High Resolution Satellite Image Analysis

**Ioannis Paraskevakos (Rutgers University)**, Matteo Turilli (Rutgers University), Bento Collares Gonçalves (Stony Brook, NY), Heather Lynch (Stony Brook, NY), and Shantenu Jha (Rutgers University and Brookhaven National Laboratory)

Ecological sciences are using imagery from a variety of sources to monitor and survey populations and ecosystems. Very High Resolution (VHR) satellite imagery provides an effective dataset for large-scale surveys. Convolutional Neural Networks have successfully been employed to analyze such imagery and detect large animals. As the datasets increase in volume, O(TB), and number of images, O(1k), utilizing High Performance Computing (HPC) resources becomes necessary. In this paper, we investigate task-parallel, data-driven workflow designs to support imagery analysis pipelines with heterogeneous tasks on HPC. We analyze the capabilities of each design when processing a dataset of 3,000 VHR satellite images for a total of 4 TB. We experimentally model the execution time of the tasks of the image processing pipeline. We perform experiments to characterize the resource utilization, total time to completion, and overheads of each design. Based on the model, overhead and utilization analysis, we show which design approach is best suited for scientific pipelines with similar characteristics.

Satellite imagery analysis application; processes WorldView-3 (WV03) images; implemented on the RADICAL Ensemble Toolkit (EnTK) - see the sketch below.
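A minimal sketch of an EnTK-style task-parallel pipeline, loosely following the examples in the RADICAL EnTK documentation; the executable, resource description, and RabbitMQ settings are placeholders and may need adjusting to the installed EnTK version. It is not the paper's actual pipeline.

```python
# One stage of independent image-processing tasks, described with EnTK.
from radical.entk import Pipeline, Stage, Task, AppManager

p = Pipeline()
s = Stage()

for i in range(4):                          # e.g. one task per image tile
    t = Task()
    t.executable = '/bin/echo'              # stand-in for the real analysis code
    t.arguments = ['processing tile %d' % i]
    s.add_tasks(t)

p.add_stages(s)

amgr = AppManager(hostname='localhost', port=5672)    # RabbitMQ endpoint
amgr.resource_desc = {'resource': 'local.localhost',   # placeholder resource
                      'walltime': 10,
                      'cpus': 4}
amgr.workflow = [p]
amgr.run()
```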
---

### Connect your Research Data with Collaborators and Beyond

**Amit Chourasia, SDSC**

Data is an integral part of scientific research. With a rapid growth in data collection and generation capability and an increasingly collaborative nature of research activities, data management and data sharing have become central and key to accomplishing research goals. Researchers today have a variety of solutions at their disposal, from local storage to cloud-based storage. However, most of these solutions focus on hierarchical file and folder organization. While such an organization is pervasively used and quite useful, it relegates information about the context of the data, such as descriptions and collaborative notes about the data, to external systems. This spread of information into different silos impedes the flow of research activities. In this tutorial, we will introduce and provide hands-on experience with the SeedMeLab platform, which provides a web-based data management and data sharing cyberinfrastructure. SeedMeLab enables research groups to manage, share, search, visualize, and present their data in a web-based environment using an access-controlled, branded, and customizable website they own and control. It supports storing and viewing data in a familiar tree hierarchy, but also supports formatted annotations, lightweight visualizations, and threaded comments on any file/folder. The system can be easily extended and customized to support metadata, job parameters, and other domain- and project-specific contextual items. The software is open source and available as an extension to the popular Drupal content management system. For more information visit: http://SeedMeLab.org

Motivation: data curation, data use/re-use.
Research teams' stumbling blocks: access control, collaboration, storage, transfer, automation.
Problems:
* information dispersion - scattered data is harder to manage and use
* weak presentation - data requires context to give it depth and meaning
* dark data - data that exists but is hard to find

SeedMe data hub: discovery, discussion, description, display.
Use cases for SeedMeLab: strengthen your brand with your data; use as a collaboration hub; use for a data management plan. Example: Laser Plasma Lab @ UCSD.
Application hub examples: CIPRES Gateway (compute in a website), GenApp Gateway.

Architecture of SeedMeLab: client/server. The client is a web browser or a REST client (custom web applications). Content management: language interpreter, database, website management - uses a CMS.
Drupal content management system - one of the major CMSs; the CMS is customized to handle data management: account management, site layout, security, content management.
SeedMeLab adds specialized modules for Drupal plus a specialized REST client. Open source.

SeedMeLab features:
* FolderShare and FolderShare REST - enable the file-sharing system; every file and folder has a description
* Formatter Suite - tools for formatting content
* Chart Suite - preview visualizations of data
* CILogon authentication
* 500 MB-2 GB uploads will break
* SeedMeLab hosting managed by SDSC; regulatory-compliant data under the Sherlock Cloud package

amit@sdsc.edu

---

**Odd Erik Gundersen**

The scientific method in empirical AI research: beliefs about the AI method, study design, hypothesis, prediction, experiment, results, interpretation.
Reproducibility spectrum: R.D. Peng, Science 2011; V. Stodden - replication and reproduction definitions; S.N. Goodman, Fanelli, Ioannidis - definitions of reproducibility.
Reproducibility in AI research: producing the same results using the same AI method, based on the documentation made by the original research team. Documentation includes: method, data, experiment.
Degrees of reproducibility:
* R1 - share all
* R2 - data but no code
* R3 - written text only

ML: data used for training and results; share code, hardware info.
Reproducibility metrics; machine learning platforms were examined and scored.

---

# eScience 2019 - Day 2 - September 26, 2019

## Keynote

**Manish Parashar** "Exploring the Future of Facilities-based, Data-Driven Science"

Large-scale experimental and observational facilities, individually and collectively, provide new opportunities for data-driven research across a wide range of science and engineering domains. These facilities provide shared-use infrastructure, instrumentation, and data products that are openly accessible to a broad community of researchers and educators. However, as these facilities grow in scale and provide increasing volumes of data and data products, effectively using them has become a significant challenge. In this talk, I will explore new opportunities enabled by these facilities as well as the new challenges presented. I will also explore how the cyberinfrastructure continuum, from the edge to extreme-scale systems, can be harnessed to support end-to-end data-driven workflows. Specifically, I will explore approaches for intelligent data delivery, in-transit data processing and edge-core integration. This research is part of the Continuum Computing project at the Rutgers Discovery Informatics Institute.

NSF Ocean Observatories Initiative (OOI): ooinet.oceanobservatories.org
Facility-based data-driven science challenges: data products are many and diverse; data volumes are large, e.g. LIGO, LSST.
Facility-based data-driven science is exciting, wide-ranging and highly impactful: open, near-real-time, democratized access to data from geographically distributed sensors, instruments, and experiments.
Use case: tsunami early warning. Goal: combine multiple data sources to cover the whole spectrum of events.

Data access challenges:
* Example 1: large-volume, high-data-rate, geographically distributed datasets (OOI, LIGO, SKA, LSST); large numbers of geographically distributed users; communities with overlapping interests.
* Example 2: data transfer performance for large-scale climate data analysis.

Addressing data access: prefetching; a hybrid prefetching model; an association-based prediction model (see the sketch below).
Data recommendation: data objects and accesses are correlated along many dimensions - can we leverage domain knowledge, data models and knowledge graphs to recommend (and push) data to users?
Challenges: data access, discovery, integration; leveraging data models; adding domain knowledge.
Data processing using edge and in-transit resources: leverage resources at the edge and on in-transit nodes.
Summary: data-driven science and engineering research enabled by large-scale, shared-use experimental and observational facilities presents new opportunities for discovery.
http://parashar.rutgers.edu parashar@rutgers.edu
Questions: about curation and cleaning - how to deploy in the pipeline; community-based infrastructure - how to implement.
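The association-based prediction idea from the notes above, sketched generically (not the project's actual model): learn which data object tends to be requested after which, and prefetch the most likely successor. The access log and object names are invented.

```python
# Learn successor counts from an access log, then prefetch the most
# frequent successor of the object just requested.
from collections import defaultdict, Counter

access_log = ["ooi/ctd_2019_01", "ooi/ctd_2019_02", "ooi/ctd_2019_01",
              "ooi/ctd_2019_02", "ooi/adcp_2019_02"]

successors = defaultdict(Counter)
for prev, nxt in zip(access_log, access_log[1:]):
    successors[prev][nxt] += 1

def prefetch_candidate(just_accessed):
    """Return the most likely next object, or None if unseen."""
    ranked = successors.get(just_accessed)
    return ranked.most_common(1)[0][0] if ranked else None

print(prefetch_candidate("ooi/ctd_2019_01"))   # -> "ooi/ctd_2019_02"
```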
---

**Yoshio Tanaka, National Institute of Advanced Industrial Science and Technology (AIST), Japan**

**Future Vision of e-Science Based on the Insights Gained Through Experiences**

About 20 years have passed since the term "e-Science" was coined. Since then, rapid development of new information technologies has greatly impacted the research methods and lifecycle of the scientific enterprise. State-of-the-art technologies such as big data analytics, IoT, AI, and robotics may solve currently impossible problems; however, there still remain fundamental problems that must be solved to make the vision of e-Science a reality. In this talk, the future vision of e-Science will be presented based on the insights gained through experiences from past research such as Grid and Cloud computing, which aimed to build on-demand and dynamic virtual infrastructures based on requirements created by application usage. Through this examination, fundamental problems will be identified and mapped to the current e-Science landscape. Understanding these issues will lead to solutions that must be achieved in order to make the vision of e-Science a reality.

AIST - AI Bridging Cloud Infrastructure (ABCI).
Asia GEO Grid initiative project: combining satellite and in-situ data; geospatial information processing; global monitoring of hot areas; change detection and recognition using ML. PRAGMA.
AI Platform / Data Exploitation Platform (tentative name) in Japan: provide a rapid PoC on-demand platform; infrastructure of the data exploitation platform.
Challenges for making the vision of e-Science a reality: easy-to-use interfaces and workflows, scalability/performance, application development, security, data interoperation, FAIR.
Overall design of the platform infrastructure.

---

**Chaitan Baru, Senior Science Advisor, National Science Foundation**

**Knowledge as Infrastructure**

Knowledge networks that encode curated information about data, tools, processes, and the science itself will be essential cyberinfrastructure for the future. Recognizing the importance of knowledge graphs, the US National Science Foundation has embraced the creation of an Open Knowledge Network through its newly announced Convergence Accelerator pilot. The Convergence Accelerator pilot started with a singular vision: to identify areas of research where investment in convergent approaches -- those bringing together people across disciplines to solve problems -- has the potential to yield high-benefit results. The Convergence Accelerator seeks to expand and refine NSF's efforts to support fundamental scientific exploration through partnerships that potentially include stakeholders from industry, government, nonprofits and other sectors. Tracing the evolution of eScience topics from its first year of establishment in 2005, one finds topics such as web services, service-oriented architecture, grid computing, high-performance computing, data management and digital repositories, dataflow, middleware, and cloud computing. In the future, eScience will be assisted by AI -- intelligent agents, smart instruments, smart tutors, machine learning and deep learning for data interpretation, and the like. Knowledge graphs are essential to the success of such AI and are, therefore, central to eScience. This talk will introduce challenges and opportunities of knowledge networks and will provide an overview of the NSF Convergence Accelerator for the Open Knowledge Network.
NSF Convergence Accelerator: a new organizational structure to accelerate the transition of convergence research into practice in areas of national importance; designed for larger, national-scale teams and cohorts.
OKN - Open Knowledge Network: motivation.

---

**Roselyne Tchoua, University of Chicago**

**Active Learning Yields Better Training Data for Scientific Named Entity Recognition**

**Roselyne Tchoua (University of Chicago)**, Aswathy Ajith (University of Chicago), Zhi Hong (University of Chicago), Logan Ward (Argonne National Laboratory), Kyle Chard (University of Chicago), Debra Audus (National Institute of Standards and Technology), Shrayesh Patel (University of Chicago), Juan de Pablo (University of Chicago), and Ian Foster (Argonne National Laboratory)

Despite significant progress in natural language processing, machine learning models require substantial expert annotated training data to perform well in tasks such as named entity recognition (NER) and entity relations extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad-hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We have previously designed polyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to efficiently obtain more annotations from experts and improve performance. PolyNER requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical natural language processing toolkit, highlighting the potential for human-computer partnership in domain-specific scientific NER.

PolyNER system: streamlines machine learning to find materials-science facts in text. Polymer extraction is a manual process - extracting facts about polymers. Scientific Information Extraction (IE).

**Challenges in information extraction:**
* non-standard naming, acronyms/synonyms, families/classes
* verbiage and vocabularies: polysemy and synonymy - the same word can mean different things in different contexts, and many words can mean the same thing
* lack of available training data for NER / scientific NER
* training data is expensive to obtain and prepare

PolyNER goal: slash the cost of NER training by using bootstrap methods to optimize the effectiveness/impact of expert labeling.
Process for finding polymers in text: remove English words (spaCy, NLTK), use part-of-speech tagging, remove numbers and unwanted characters.
Multi-level expertise for candidate labeling: labels provided by untrained crowds and by expert polymer scientists. A sketch of the active learning loop follows below.
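A hedged sketch of active learning by uncertainty sampling, the general technique named in the title; PolyNER's real pipeline classifies word vectors of candidate entities, whereas the features and the "expert oracle" here are toy stand-ins.

```python
# Repeatedly train a classifier, ask the "expert" to label the candidates
# it is least certain about, and retrain on the growing labeled set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 20))              # candidate-entity features (toy)
true_labels = (X_pool[:, 0] > 0).astype(int)     # hidden "expert" answers (toy)

labeled = list(range(10))                        # small seed set
for _ in range(5):
    clf = LogisticRegression().fit(X_pool[labeled], true_labels[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labeled] = np.inf                # don't re-ask about labeled items
    ask = np.argsort(uncertainty)[:20]           # most uncertain candidates
    labeled.extend(ask.tolist())                 # "expert" labels them

print("labeled examples used:", len(labeled))
print("pool accuracy:", clf.score(X_pool, true_labels))
```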
---

# Poster session 2:30p - 3:30p

***Presentation*** post posters for shared service

---

**Michael Wyatt**

**Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers**

Michela Taufer (The University of Tennessee), Stephen Thomas (The University of Tennessee), Michael Wyatt (The University of Tennessee), Tu Mai Anh Do (University of Southern California), Loïc Pottier (University of Southern California), Rafael Ferreira da Silva (University of Southern California), Harel Weinstein (Cornell University), Michel A. Cuendet (Cornell University; Lausanne University Hospital), Trilce Estrada (University of New Mexico), and Ewa Deelman (University of Southern California)

Molecular Dynamics (MD) simulations executed on state-of-the-art supercomputers are producing data at a rate faster than it can be written out to disk. In situ and in transit analysis of data produced by MD simulations reduces the original volume of information by several orders of magnitude, thereby alleviating the negative impact of the I/O bottleneck. This work focuses on characterizing the impact of in situ and in transit analytics on the overall MD workflow performance, and the capability for capturing rapid, rare events in the simulated molecular system. The MD simulation and analysis processes share data via remote direct memory access (RDMA) using DataSpaces. Our metrics of interest are time spent waiting in I/O or frames lost by the MD simulation, and idle time by the analysis. We measure these metrics for a diverse set of molecular systems, characterize their trends for in situ and in transit configurations, and model which frames are dropped and which ones are analyzed for a real use case. The insights gained from this study are generally applicable to in situ and in transit workflows that require optimization of parameters to minimize loss in workflow performance and analytic accuracy.

Notes: collective variables; data workflow modeling for analytics; generator of MD frames. A toy model of the frame-drop behaviour follows below.
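A toy illustration (not the paper's model) of when an unbuffered in situ analysis drops frames: any frame that arrives while the analysis is still busy with the previous one is lost.

```python
# The simulation emits a frame every t_sim seconds; the analysis takes
# t_ana seconds per frame and, without buffering, skips frames that
# arrive while it is busy.
def dropped_frames(n_frames, t_sim, t_ana):
    analysed, busy_until = 0, 0.0
    for i in range(n_frames):
        t_emit = i * t_sim
        if t_emit >= busy_until:        # analysis is idle: take this frame
            analysed += 1
            busy_until = t_emit + t_ana
    return n_frames - analysed

# e.g. a frame every 2 s with 5 s of analysis per frame:
# roughly two thirds of the frames are dropped.
print(dropped_frames(n_frames=100, t_sim=2.0, t_ana=5.0))
```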
---

# Afternoon Keynote

**Maryann Martone, UC San Diego**

The launch of several international large brain projects indicates that we are still far from understanding the brain at even a basic level, let alone being able to intervene meaningfully in most degenerative, psychiatric and traumatic brain disorders. Such projects reflect the idea that neuroscience needs to be placed on a more data-rich, computational footing to address the inherent complexity of the nervous system. But should we just be looking towards big science to produce comprehensive and integrated data and tools? What about the thousands of studies conducted by individual investigators and small teams, so-called "long tail data"? How does the regular practice of neuroscience need to change to address grand challenges in brain science? Across the breadth of academia, researchers are defining new modes of scholarship designed to take advantage of 21st century technology for linking and distributing information. Principles, best practices and tools for networked scholarship are emerging. Chief among these is the move towards open science, making the products of research as open as possible to ensure their broadest use. Second, increased recognition that research outputs should not only include journal articles and books, but data, tools and workflows. Third, that research outputs should be FAIR: Findable, Accessible, Interoperable and Reusable - the characteristics required for making digital objects maximally useful for both humans and machines. FAIR encompasses the agreement upon and use of community standards for data exchange. Finally, that citation and credit systems be redesigned to reflect the broadening of scientific output. In this presentation, I will discuss the community and technical infrastructure for moving neuroscience towards an open, FAIR and citable science, highlighting our experiences in building and maintaining the Neuroscience Information Framework and other related projects. I will also provide an example of work underway in the spinal cord injury community to come together around the sharing and integration of long tail data.

NIF - Neuroscience Information Framework: an initiative of the NIH Blueprint consortium of institutes; neuinfo.org
FORCE11 Manifesto: FAIR, open, research-object-based, citable, ecosystem. FAIR in neuroscience is needed.
Big national brain projects - researchers working on brain initiatives creating huge pools of data; small brain projects - long-tail science.
Facilities of Research Excellence in Spinal Cord Injury (FORE-SCI); "syndromic space".
FAIR requires both technical and social infrastructure. Data management in labs is a major problem - very few practices in the lab. Neuroscience data repositories; standards in neuroscience.
Open Data Commons for Spinal Cord Injury: https://scicrunch.org/odc-sci
INCF - a standards organization to support global neuroscience. BIDS (Brain Imaging Data Structure), from Stanford: a file organization standard and a metadata standard.
Resource Identification Initiative.
FAIR partnership: researchers, repositories and registries, indexers, aggregators.

---

# Friday - Day 3

# Keynote

**Dieter Kranzlmüller, Ludwig-Maximilians-Universität München**

As a leadership-class computing facility, the mission of the Leibniz Supercomputing Centre (LRZ) is to enable significant achievement and advancement in science with powerful computational resources, including dedicated hardware and software coupled with computational science expertise for specific research domains. One such focus area is environmental science, which is highly relevant for our daily lives and society. With LRZ's latest supercomputer SuperMUC-NG and its combination of a powerful general-purpose architecture with integrated data science and AI capabilities, users from the environmental sciences increase the size and resolution of their models while reducing the necessary computing time.
The major factor of success, however, is the partnership between the domain scientists and the computational specialists at LRZ, which includes every step along the path to discovery, from dedicated training through the entire scientific workflow during production runs. This talk introduces the LRZ partnership model and highlights a number of example use cases from environmental computing.

Leibniz Supercomputing Centre (LRZ); SuperMUC-NG - CPUs only, no GPUs. The experts are the valuable assets, not the hardware.
Case study 1: SeisSol - numerical simulation of seismic wave phenomena.
Case study 2: StMGP / StMUV project ePIN - electronic pollen information network; monitoring and prediction of pollen flight in Bavaria using a network of online pollen monitors.
Mantle circulation model - studies continental drift.
Bringing computer scientists and research scientists together. Hot-water cooling system (45°C water).
Dieter Kranzlmüller, kranzlmuller@lrz.de

---

**Chreston Miller**

**Timing is Everything: Identifying Diverse Interaction Dynamics in Scenario and Non-Scenario Meetings**

In this paper we explore the use of temporal patterns to define interaction dynamics between different kinds of meetings. Meetings occur on a daily basis and include different behavioral dynamics between participants, such as floor shifts and intense dialog. These dynamics can tell a story of the meeting and provide insight into how participants interact. We focus our investigation on defining diversity metrics to compare the interaction dynamics of scenario and non-scenario meetings. These metrics may be able to provide insight into the similarities and differences between scenario and non-scenario meetings. We observe that certain interaction dynamics can be identified through temporal patterns of speech intervals, i.e., when a participant is talking. We apply the principles of Parallel Episodes in identifying moments of speech overlap, e.g., interaction "bursts", and introduce Situated Data Mining, an approach for identifying repeated behavior patterns based on situated context. Applying these algorithms provides an overview of certain meeting dynamics and defines metrics for meeting comparison and diversity of interaction. We tested on a subset of the AMI corpus and developed three diversity metrics to describe similarities and differences between meetings. These metrics also present the researcher with an overview of interaction dynamics and point to points-of-interest for analysis.

Scenario vs. non-scenario: comparison amongst different small-group meeting types. Scenario - each participant is assigned a role; non-scenario - regular meetings.
Observation on data comparison: differences/similarities of two or more datasets are characterized by the data distribution - need to be careful when comparing two datasets.
Speech burst - a burst of intense activity. Situated data mining (SDM) - repeating data patterns. A sketch of overlap detection from speech intervals follows below.
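A hedged sketch of the basic building block described above: given each participant's speech intervals (start, end) in seconds, find the moments where two or more people talk at once - candidate interaction "bursts". This is a simple sweep over interval endpoints, not the paper's Parallel Episodes algorithm itself.

```python
# Find regions where at least two participants speak simultaneously.
def overlap_regions(speakers):
    """speakers: dict of participant -> list of (start, end) speech intervals."""
    events = []
    for intervals in speakers.values():
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))
    events.sort()
    regions, active, region_start = [], 0, None
    for t, delta in events:
        active += delta
        if active >= 2 and region_start is None:
            region_start = t
        elif active < 2 and region_start is not None:
            regions.append((region_start, t))
            region_start = None
    return regions

meeting = {"A": [(0.0, 5.0), (9.0, 12.0)],
           "B": [(4.0, 7.0), (11.0, 15.0)]}
print(overlap_regions(meeting))   # -> [(4.0, 5.0), (11.0, 12.0)]
```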
---

**Junan Guo (University of California San Diego), Subhasis Dasgupta (University of California San Diego), and Amarnath Gupta (University of California San Diego)**

**Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health**

We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researcher, we developed an "investigative exploration" system called BOUTIQUE that allows a user to perform multi-step visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE include its ability to handle heterogeneous types of data provided by a polystore, and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.

Looking at HIV-related social media from Twitter. Questions: are people who are at high risk for HIV aware of the preventive measures available to them? How are they becoming aware of these measures?
The AWESOME ----