BACKUP Software vs Data

:::warning
# BACKUP Software vs Data
:::

This draft collects notes and references on how and why research data and research software can or should be treated as the same concept, and when they should not. The public has read/write access.

# Software and its quantum states - When is software data and when is it not?

initial authors/editors
* Alexander Struck https://orcid.org/0000-0002-1173-9228
* Jan Philipp Dietrich https://orcid.org/0000-0002-4309-6431
* Stephan Druskat https://orcid.org/0000-0003-4925-7248

contributing authors
* first last ORCID

## Introduction

In research data management and related contexts, software is often simply treated as data. However, software has characteristics that are quite distinct from, and sometimes opposed to, those of classical data. These differences were recently discussed in the context of software and data citation (https://doi.org/10.7287/peerj.preprints.2630v1). In that paper, Daniel Katz and colleagues identify five distinctions between software and data:

* Software is executable, data is not
* Data provides evidence, software provides a tool
* Software is a creative work, scientific data are facts or observations
* Software suffers from a different type of bit rot than data: It is frequently built to use other software, leading to complex dependencies, and these dependent software packages also frequently change
* The lifetime of software is generally not as long as that of data

> [name=lx] except for Fortran code in meteorology? COBOL in financing and air traffic control? ...
> What are the reasons for the different lifetimes? Prototyping? NIH syndrome? Short-term project funding?
>
> [name=lx] More literature on the topic that I will look into:
> https://peerj.com/articles/cs-163/#p-14

### Software is Data

* **Data archives** The same archives are typically used for software as for data
* **Software as input**

### Software is not Data

* **Data vs Software repositories** Different repositories are usually used because of different assumptions about persistence and memory requirements; also: repositories vs. archives (e.g., GitHub vs. Software Heritage/Zenodo)
* **Working modes** Measurement data is gathered and worked with, but usually not worked on. Software source code (as data) is generally worked on.

### Text ???

We argue that software is actionable/executable, whereas data is processed. Software acts as input (just like data) for linters, compilers/interpreters and anything else that automatically evaluates software. The software itself may not be changed, but it serves as input for, e.g., a hash function; think of digitally signed software. Software is most often constructed by humans, but in some cases it is created by other software; think of templating and of efforts to construct secure/safe/provable software (CITE).

Given these differences, but also the similarities between software and classical data, we argue that, similar to quantum physics, the perspective on a piece of software (the "measurement") determines whether it behaves like classical data or not. The analogy ends there: in contrast to quantum physics, the measurement does not really alter the characteristics of the object. It is nevertheless important to acknowledge that the perspective defines the rules for how the software should be handled. We argue that for some applications software should be treated as data, while in other cases it should certainly not!
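
The remark about hash functions and linters can be made concrete. The following is a minimal sketch and not part of the original notes: the `SOURCE` string and the helper names `as_data` and `as_software` are hypothetical, and only Python's standard `ast` and `hashlib` modules are assumed. It shows the same piece of source text first being measured like data (parsed and hashed, never run) and then being executed like software.

```python
import ast
import hashlib

# Hypothetical one-function "research software", held as plain text.
SOURCE = "def mean(values):\n    return sum(values) / len(values)\n"


def as_data(source: str) -> dict:
    """Treat the source as data: it is read and measured, never run."""
    raw = source.encode("utf-8")
    tree = ast.parse(source)  # the parsing step a linter or compiler starts with
    functions = [
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    ]
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),  # fingerprint, e.g. for signing
        "bytes": len(raw),
        "functions": functions,
    }


def as_software(source: str, values: list) -> float:
    """Treat the same source as software: execute it and call what it defines."""
    namespace = {}
    exec(compile(source, "<mean>", "exec"), namespace)
    return namespace["mean"](values)


if __name__ == "__main__":
    print(as_data(SOURCE))
    print(as_software(SOURCE, [1.0, 2.0, 3.0]))
```

Neither function changes `SOURCE`; only the way it is looked at differs, which is the point of the "measurement" analogy.
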
Comparing, for instance, data repositories and software repositories, one finds quite different approaches and assumptions about persistence or memory requirements; archiving, on the other hand, is often performed with identical tools for both. The latter may not be ideal because of differing metadata (or presentation) requirements.

To highlight the need (not) to distinguish software and data, we discuss different perspectives on software below and try to give some guidance on how software should be treated in each case.

TL;DR: Software and data are (not) the same! (It depends on the perspective.)

### FAIR principles applied to data and software

https://www.slideshare.net/danielskatz/fair-is-not-fair-enough-particularly-for-software-citation-availability-or-quality?next_slideshow=1

https://figshare.com/articles/FAIR_enough_Can_we_already_benefit_from_applying_the_FAIR_data_principles_to_software_/7449239

- FAIR principles exist and are partly mentioned in the activities, but our activities go beyond them
- Our contribution is to FAIR Principles (!FAIR4RS) vs.
- This discussion is also important for understanding the FAIR4RS principles currently being developed (vs. the FAIR data principles).

### Definitions

Given the ambiguity about whether software can be treated as data, it is important to define clearly how the terms "software" and "data" are used below. To distinguish them verbally, "software" refers to executable code, while "data" refers to all kinds of data that are not software. This makes the two non-overlapping, complementary categories of data in the broader sense of a collection of bits and bytes.

### Cross-Dependency Data & Software

in RSE4NFDI proposal:

''The quality of data is [in]extricable from the quality and documentation of processes linked to it, and these processes are built on and with software.''

and also

''Data is meaningless without the software needed to read, process and interpret it.''

> [name=lx] Opinions differ here, since data should be documented in a way that makes it independent of any particular application. For example, citation data downloaded from Web of Science must be processable in any programming language.

''However, software is different in nature. What drives the change in data is different from what drives the change in software.''

- [ ] extract more from here: https://github.com/danielskatz/software-vs-data#software-is-executable-data-is-not

MORE QUESTIONS (to be answered in a different section):

- How independent is one thing from the other?
- Does software only have a purpose if there is data?
- Could data be handled without software?
- Does data have a "product" state?
- Is data "engineered" in the same way software is engineered in software engineering?
- Is there functionally "correct" data?
- Does data have behaviour in the way that software does?
- What is the relation between call graphs in software and different segmentations in data?

### Aspects of Time

Software that is maintained has different temporal cycles than data. Interdependencies arise when software is required to handle certain data. It is unclear how long software and data each remain useful on their own, and this may differ significantly between disciplines (think HEP vs. art history). At the same time, a data set may still be useful centuries later for asking different questions. In disciplines like the history of science, both data and software may also serve as input for investigating how they were handled or created in the past. Over time, the underlying hardware may change and software may cease to be executable.
Likewise, storage media may fail, and the responsible (data and/or software) curator then needs to copy or move the content to a new location. Unfortunately, there are also cases where the source code of a piece of software is lost and only the executable remains for further use. Depending on its importance, reverse engineering with a disassembler (IDA, Ghidra, ...) may be required; in that case software is both the data input and the output. There are also cases where descriptive metadata is lost and the remaining data cannot be interpreted (CITE case with satellite data).

---

## Activities around Research Software and Data

> Do not draw distinctions here, at most "tease" them

To cover the relevant aspects of both data and software, we look at the activities related to each. The social sciences have attempted to get a grasp of research activities, e.g., Latour (1986). Scholarly activities were described more precisely by Unsworth (2000) for the so-called "Digital Humanities". He listed

* Discovering
* Annotating
* Comparing
* Referring
* Sampling
* Illustrating
* Representing

as activities to be formally described in their digital information processing environment. Benardou et al. (2010) review some of the related later work and mention activities like

* monitoring
* extracting
* accessing
* networking
* verifying
* creating and
* sharing

which already sound a bit more technical.

To the best of our knowledge, no attempt has been made to identify and describe the activities of the emerging role of the "Research Software Engineer". Some early papers (TODO CITE SOME) discuss the environment, the position and a few activities related to researchers handling software. In our listing, we attempt to add the activities related to software and data to the known ones, and to show for the most common activities across disciplines how software and data compare or must be differentiated.

### Funding

> Generating **funding** to finance the development and maintenance of software and data and other planned activities.

Writing software is an activity most researchers are paid for. Collecting or measuring data involves humans, sensors and other tools, which need to be paid for. We consider funding an external activity, initiated by researchers writing grant applications. This may require a conceptualization of research objects; for example, some funders require a DMP and/or SMP (TODO examples). Funding may also be generated through licensing or support contracts, paid-for workshops and so on. In our experience, this is more relevant for research software than for data.

### Creation

> **Creation** of software or data

Software needs to be created, most often by humans. Data is created by measurement (and similar activities). RSEs will most likely create software for their own use. Contract developers create software for external use in research projects.

- Conceptualization
- Software: e.g., Methodologies: agile / waterfall, etc.
- Data: Measurement, elicitation, creation from scratch, collection

### Accessing, Searching, Discovering

> **Discovering** relevant software or data for planned research projects/exercises.

The F in FAIR stands for findable. Both software and data require search strategies tailored to specific needs, and we see an abundance of different platforms and kinds of information behavior (or processing). CITE Howison.

- Aggregation

### (Re-) Use

> **(Re-) Use** of software or data for research purposes.

Sustainable software can be (re-)used, although we also see the forced reuse of unsustainable software (legacy code). The reuse of data has been encouraged for a few years and is reflected in RDM activities and platforms like re3data.org. We differentiate between usage and reuse, as the latter may involve third parties who were not involved in the creation of the software or data.
Reuse of data or software requires evaluation/verification and probably also the activity of comparing, as suggested by Unsworth. The scholarly primitives listed above mostly relate to information processing. We include *comparing*, *verifying*, *sampling* and *illustrating* here as general usage. We expect the most interesting differences between software and data here.

### Publishing, Sharing, Licensing and Archiving

> **Publishing** software or data to make it accessible to others (including licensing and archiving).

These activities are a prerequisite for finding and reusing software and data. We see data journals and software journals, and several types of repositories for software, for data, or with mixed content. TODO Examples

Some journals require data and/or code to be shared for paper review, which presents new challenges for reviewers. Best practices for publishing research objects like software and data include (but are not limited to) licensing information and persistent identifiers for proper citation.

Licenses for software (LINK) differ significantly from the licenses recommended for data, e.g., Creative Commons (LINK). Depending on your jurisdiction, you may be able to receive a patent for software and/or even for data, e.g., genomes(?). In other jurisdictions you cannot claim any intellectual property on raw data and must first produce added value, e.g., by aggregation and augmentation, before you can put a license on such data. Legal aspects of software are still under debate (CITE Anzt2020).

> **[TODO]** include legal aspects

### Referencing and Credit Giving

> **Referencing** data or software to make sources transparent and give credit to its authors.

We consider data and software citation to be still in its infancy, although referencing is one of the building blocks of science. Citation of research objects is still mostly tied to paper publications, but CodeMeta and CFF are dedicated to software. TODO Data Citation Standard?
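
To illustrate what software-specific citation metadata can look like, here is a minimal sketch that writes the text of a `CITATION.cff` file using keys from the Citation File Format schema 1.2.0. The function name `minimal_cff` and all values in the usage example (title, version, author name, release date) are hypothetical placeholders. The sketch does no YAML escaping, so a real project would more likely use a YAML library or a dedicated tool.

```python
def minimal_cff(title, version, family_name, given_name, date_released):
    """Return the text of a minimal CITATION.cff file (CFF schema 1.2.0).

    Assumption: none of the values contain double quotes, since this
    sketch does not escape them.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'date-released: "{date_released}"',
        "authors:",
        f'  - family-names: "{family_name}"',
        f'    given-names: "{given_name}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    print(
        minimal_cff(
            title="example-analysis-tool",
            version="1.4.2",
            family_name="Doe",
            given_name="Jane",
            date_released="2020-01-31",
        )
    )
```

The explicit `version` field is what lets a citation point at one specific release, which matters because software tends to change more often than data.
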
[STEPHAN] extract from Katz talk at FORCE2019, see references

### Evaluation

> **Evaluating** software or data in terms of usability but also visibility and impact.

Researcher activities are monitored externally for evaluation purposes. Publications of data or software, and citations to them, are used as indicators of activity and attention. A positive evaluation of research projects may lead to more funding. Researchers must also evaluate third-party software and data for reuse (after finding something). The different aspects will be discussed from the stakeholders' point of view.

READ
* https://www.researchgate.net/publication/267369044_Modellierung_und_Ontologien_im_Wissensmanagement_--_Erfahrungen_aus_drei_Projekten_im_Umfeld_von_Europeana_und_des_DFG-Exzellenzclusters_Bild_Wissen_Gestaltung_an_der_Humboldt-Universitat_zu_Berlin
* Unsworth 2000 http://people.virginia.edu/~jmu2m//Kings.5-00/primitives.html
* Benardou 2010 A Conceptual Model for Scholarly Research Activity
* Latour 1986 Laboratory Life: The Construction of Scientific Facts
* and others

---

## Stakeholders

> Distinctions are drawn and discussed here

Build on existing stakeholder lists (de-RSE Position, Druskat & Katz) and develop them further (similar to activities).

Based on the lifecycles of software and data, we identify the following stakeholders.

### Researchers ...

Researchers can be both creators and end users of software and data at the same time. Depending on factors such as job description, domain, etc., the balance between the two roles within one and the same person may differ greatly. RSEs working in an RSE group which supports research groups and projects are on one end of the spectrum, while domain researchers who write small scripts every now and then are on the other end.
Similarly, research technicians who oversee instrument measurements but do not themselves analyse the data they create are on one end of the spectrum, while domain researchers who select subsets of a larger dataset for their research are on the opposite end. In the following, we try to discuss both aspects, creation and use, largely separately, but because the borders between the two roles are fuzzy, the distinctions may not always be as clean-cut as they appear.

#### ... as creators

> including RSEs

- include industry

Humans devise apparatus to measure phenomena in nature. Most often sensory data is collected, and sometimes the collection is automated in software procedures. Exploring and interpreting data may require software for calculation, visualization and so on. Input (raw data) is transformed by software (into processed data) or may be created by it (as simulated data) (CITE Katz2016).

At times data and software are interrelated to such an extent that one is meaningless without the other, e.g., if neither could be reused in other circumstances without the other. That may be the case when software is not modular or when a specific type of software is required to handle certain (formatted) data.

##### Creator

One of the most important indicators that research data and software should be treated differently is licensing. The Creative Commons FAQ explicitly recommends considering different licenses for software: https://creativecommons.org/faq/#can-i-apply-a-creative-commons-license-to-software

With data, especially in the humanities, certain privacy aspects play a role and may hinder further usage or impose restrictions. Depending on the jurisdiction, software may or may not be eligible for copyright or patent protection, and the handling of research data differs as well.

The creator of a new product, say a web-based platform that combines research software and data to showcase results, needs to understand the different licensing schemes and their consequences.

#### ... as end users

- include industry

From the researcher's perspective, data, whether collected or created by the researcher or taken from existing research, is usually the basis of research. It is the basis for testing a hypothesis or gaining new insights. In contrast, software, whether created or just applied by the researcher, is usually not at the center of an analysis but serves as a tool to create or process the data to be analyzed.

Technically, data is usually considerably bigger in terms of memory requirements (somewhere in the range of MB to TB) than software, which is usually in the range of KB to MB. After being collected, data might be processed further, but it is usually not altered anymore and is simply kept as it was received. Software, on the other hand, might be updated quite frequently during the analysis process and might even require updates after the corresponding research has been finalized, due to changes in its dependencies. Data in this context is quite stable, while software is continuously evolving. This also means that software has to be maintained even after the corresponding research has been finalized so that others can use it, while data can usually be reused without the burden of continuous maintenance.

For citations, the higher update frequency of software compared to data also makes stating the version used even more important for software than it is for data. When archiving software and data for a publication, the requirements differ: data primarily demands a service with sufficient resources, while software primarily demands good version management.

The structural differences are also reflected in the different licensing options. While broadly used data licenses such as Creative Commons (CC) can in theory be applied to software as well, software mostly uses licenses specialized to its requirements, such as the GPL.
For funding, the differences are less pronounced: both data and software are seen as means to answer a research question, and there is usually little dedicated funding for either.

##### Reuse

If a researcher wants to reuse someone else's data or software, this should be possible. Software may require a specific data format as input and may produce a particular output format. But the data does not have to be the same as in the original research.

##### Discovery

Finding software may differ from finding data because the two are not necessarily stored in the same (kind of) repository (CITE Struck). The attention research data has acquired in recent years has led to a vast landscape of data repositories, each with its own particular ideas on certain aspects. Some of them have been evaluated using concepts like the Data Seal of Approval. A central registry has been established to index most known data repositories. Some of the repositories listed there may also hold software, but this aspect is neglected in the registry. A global registry of software resources may ease the finding and evaluation of research software.

##### Evaluation

Evaluating third-party data or software involves different evaluation schemes. Is the data augmented with enough metadata, and does a piece of research software have enough documentation to start with?

READ chaoss.community/metrics

##### Management

The handling of software and data differs, as can be seen in the different aspects dealt with in software management plans (SMP) vs. data management plans (DMP). More recently, output management plans seem to play a role, too. https://wellcome.ac.uk/funding/guidance/how-complete-outputs-management-plan

##### End user

Researchers intending to reuse either software or data need to understand the different licenses.

### Funders

Next to text publications, any data or software created during research activity should be considered a research result.
In the context of Open Science, all of it should be made accessible to reviewers, the community and interested parties where possible.

Is software more expensive than data? That may depend on the discipline and/or the complexity of the software to be written and maintained. A seasoned FORTRAN or COBOL developer could probably ask for a different salary than a PHP or Python developer, solely based on the number of available developers. On the other hand, the creation of data in experiments like ATLAS at CERN requires so many financial and human resources that the cost of software development may be dwarfed in this example. A general answer to this question is currently out of sight.

- [ ] link some recommendations from DFG etc

Reproducibility is gaining momentum and the FAIR principles are being applied. More on that later.

- include industry

##### Funder

Funders may encourage creators to use licenses that are as open for reuse as possible. (LINK DFG, others) Some funders may evaluate the use of licenses for future funding?

### Society

Society may have some interest in data and/or research software. We currently assume that research journalism mainly focuses on research output such as books and papers. Society engages with research software, e.g., through SETI@home, or benefits from research software/data, e.g., in the form of weather forecasts. Taxpayer money goes into the creation of both data and software, but it is hard to account for it.

### Research disciplines

E.g., Software Engineering uses software as research data.

> [name=lx] This needs more elaboration. Which stakeholder cares about this?

### Artifact managers

Research Data Managers and Data Curators are job roles that have been established in the recent past. (TODO cite papers investigating this) Software is another artifact that is created, and its handling is less well established(?).

### Aggregators

### Publishers and publishing platforms

### Platform vendors

- Versioning
- e.g., CI/automated processing?
- What is "at stake?"
- commercial vs non-commercial?

### Computing centres and infrastructure providers

---

## Conclusion

Consequences of differences for:

- RDM vs RSM
- Important to analyse the meta layer, i.e., do more of the things we have done in this paper to better understand a complex scenario

# References:

:book: https://www.zotero.org/groups/2557322/position_on_software_and_data