This document draft collects notes and references on how and why research data and research software can or should be seen as the same concept, and when they should not.
The public has read/write access.
# Software and its quantum states - When is software data and when is it not?
initial authors/editors
* Alexander Struck https://orcid.org/0000-0002-1173-9228
* Jan Philipp Dietrich https://orcid.org/0000-0002-4309-6431
* Stephan Druskat https://orcid.org/0000-0003-4925-7248
contributing authors
* first last ORCID
> Classification: Working paper in digital scholarship
MATRIX https://docs.google.com/spreadsheets/d/19A58qf2QVDejdRDR8i2xZDQca1rplNQm57sjS4N4pAo/edit#gid=0
ZOTERO
https://www.zotero.org/groups/2557322/position_on_software_and_data/items/2WC2MP5A/item-details
WORKPLAN
https://hackmd.io/Rb_Atn3ARGKTaib9dwDOlA?both
## Abstract
...
## Introduction
In the context of research data management and related fields, software is often simply treated as data. However, software has characteristics that are quite distinct from, and sometimes opposed to, those of classical data. The differences have recently been discussed in the context of software and data citation (https://doi.org/10.7287/peerj.preprints.2630v1). In this paper, Daniel Katz and colleagues identify five distinctions between software and data:
* Software is executable, data is not
* Data provides evidence, software provides a tool
* Software is a creative work, scientific data are facts or observations
* Software suffers from a different type of bit rot than data: It is frequently built to use other software, leading to complex dependencies, and these dependent software packages also frequently change
* The lifetime of software is generally not as long as that of data
> [name=lx] except for Fortran code in meteorology? COBOL in finance and air traffic control? ...
> What are the reasons for different lifetimes? Prototyping? NIH syndrome? Short-term project funding?
>
> [name=lx] More literature on the topic I will look into
> https://peerj.com/articles/cs-163/#p-14
### Software is Data
* **Data archives** The same archives are typically used for software as for data
* **Software as input**
### Software is not Data
* **Data vs Software repositories** Usually different repositories are used due to different assumptions about persistence and memory requirements; also: repositories vs. archives (e.g., GitHub vs. Software Heritage/Zenodo)
* **Working modes** Measurement data is gathered and worked with, but usually not worked on. Software source code (as data) is generally worked on.
### Text ???
We argue that software is actionable/executable, whereas data is rather processed. Software acts as input (just like data) when it comes to linters, compilers/interpreters, and anything else that automatically evaluates software. The software itself may not be changed, but it serves as input for, e.g., a hash function; think of digitally signed software. Software is most often constructed by humans, but in some cases software may be created by other software; think of templating and efforts to construct secure/safe/provable software (CITE).
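As a minimal sketch of this dual role (assuming a hypothetical script file `analysis.py` in the working directory), the following Python snippet treats a piece of software as plain data by feeding its bytes into a hash function, as is done when software is checksummed or digitally signed:

```python
import hashlib
from pathlib import Path

def checksum(path: str) -> str:
    """Return the SHA-256 digest of a file, treating it as opaque bytes."""
    data = Path(path).read_bytes()          # the software is read as data here
    return hashlib.sha256(data).hexdigest()

# The very same file can be executed as software (e.g. `python analysis.py`)
# and, as below, hashed as data:
print(checksum("analysis.py"))
```

A signing tool does essentially the same: it never runs the code, it only reads and transforms its bytes.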
Because of these differences, but also the similarities, between software and classical data, we argue that - similar to quantum physics - the perspective on a piece of software (the "measurement") determines whether it behaves like classical data or not. The analogy ends here: in contrast to quantum physics, the measurement does not actually alter the characteristics of the object. Still, it is important to acknowledge that the perspective defines the rules for how the software should be handled. We argue that while for some applications software should be treated as data, in other cases it certainly should not. Comparing, for instance, data repositories and software repositories, one finds quite different approaches and assumptions about persistence or storage requirements; on the other hand, archiving is often performed with identical tools for both. The latter may not be ideal due to different metadata (or presentation) requirements.
To highlight the need to distinguish (or not to distinguish) software and data, we discuss different perspectives on software in the following and try to give some guidance on how software should be treated in each case.
TL;DR: Software and data are (not) the same! (It depends on the perspective.)
### FAIR principles applied to data and software
https://www.slideshare.net/danielskatz/fair-is-not-fair-enough-particularly-for-software-citation-availability-or-quality?next_slideshow=1
https://figshare.com/articles/FAIR_enough_Can_we_already_benefit_from_applying_the_FAIR_data_principles_to_software_/7449239
- There are FAIR principles; they are partly mentioned in the activities, but our activities go beyond them
- Our contribution is to the FAIR principles (!FAIR4RS)
vs.
- There are FAIR principles; they are partly mentioned in the activities, but our activities go beyond them
- This discussion is also important for understanding the FAIR4RS principles currently being developed (vs. the data-oriented FAIR principles).
### Definitions
Given the ambiguity of whether software can be treated as data or not, it is important to clearly define how the terms "software" and "data" are used in the following.
To be able to verbally distinguish between them, "software" will refer in the following to executable code, while "data" refers to all kinds of data which are not software, making both non-overlapping, complementary categories of data in the broader sense of a collection of bits and bytes.
### Cross-Dependency Data & Software
From the RSE4NFDI proposal: ''The quality of data is inextricable from the quality and documentation of processes linked to it, and these processes are built on and with software.'' and also ''Data is meaningless without the software needed to read, process and interpret it.''
> [name=lx] Opinions differ here, since data should be documented in a way that makes it independent of any particular application. For example, citation data downloaded from Web of Science must be handleable in any programming language.
''However, software is different in nature. What drives the change in data is different from what drives the change in software.''
- [ ] extract more from here:
https://github.com/danielskatz/software-vs-data#software-is-executable-data-is-not
MORE QUESTIONS (to be answered in a different section):
* how independent is one thing from the other? Does software only have a purpose if there is data?
* (lx thinks: Input-Processing-Output as a paradigm requires data to be 'involved'. At the same time, we deploy software if we are too lazy to 'calculate' results in the brain or on paper.)
* Could data be handled without software?
* (lx thinks: If data had the 'state' of 'visualized' then humans may not need software anymore because now they can 'make sense' of data.)
* Does data have a "product" state?
* lx thinks: we may have a 'life cycle' ... which may include collection/cleansing/transformation/publishing/...
* Is data "engineered" in the same way software is engineered in software engineering?
* lx thinks: data should not be engineered but rather 'gathered by sensors', with a definition of how to represent analog signals (okay, this may be engineering)
* Is there functionally "correct" data?
* lx thinks: There is error correction built into some data processing. An invalid ISBN can easily be 'spotted' by software (see the sketch after this list).
* lx thinks: We have hash functions to indicate un-altered data and we also have 'code signing' mainly for security reasons.
* Does data have behaviour in the way that software does?
* lx asks: Do we want data to 'behave' at all?
* What is the relation between call graphs in software and different segmentations in data?
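As a hedged illustration of the ISBN remark above, here is a minimal Python sketch of ISBN-10 check-digit validation; it is one concrete case where data carries built-in redundancy that software can verify:

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 check digit: the weighted sum of the ten characters
    (weights 10 down to 1, 'X' counting as 10) must be divisible by 11."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for weight, char in zip(range(10, 0, -1), chars):
        if char == "X" and weight == 1:   # 'X' is only allowed as the check digit
            value = 10
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += weight * value
    return total % 11 == 0

print(is_valid_isbn10("0-306-40615-2"))  # True: the classic textbook example
print(is_valid_isbn10("0-306-40615-3"))  # False: check digit corrupted
```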
### Aspects of Time
Software has different temporal cycles than data, at least if the former is maintained. We have interdependencies if software is required to handle certain data. It is unclear how long software and data are each useful on their own, and this may differ significantly between disciplines (think HEP vs. art history). At the same time, a certain data set may be useful centuries later to ask different questions, or -- comparable to software -- in disciplines like the history of science both may be input for further investigations into how they were handled/created in the past.
Over time, the underlying hardware may change and software may cease to be executable. At the same time, storage media may degrade, and the responsible (data and/or software) curator would need to copy or move them to a new location. Unfortunately, there are also cases where the source code of a piece of software is lost and only the executable remains for further use. Depending on its importance, reverse engineering with a disassembler (IDA, Ghidra, ...) may be required, in which case the software itself is both data input and output. There are also cases where descriptive metadata is lost and the remaining data does not allow interpretation (CITE case with satellite data).
### Aspect of Size
Technically, data is usually considerably bigger in terms of storage requirements (somewhere in the range of MB to TB) than software, which is usually in the range of KB to MB in size.
However, calculations may have a significant memory footprint before the computed results 'become' data again.
---
## Activities around Research Software and Data (sd)
> Do not elaborate on differences here; at most hint at ("tease") them
In order to cover relevant aspects of both data and software, we will look at the activities of digital scholarship related to each.
Research activities have been discussed in the social sciences, e.g. by Latour (1986) :warning:.
Scholarly activities were described more precisely for the digital humanities by Unsworth (2000) :warning:. He listed
* discovering,
* annotating,
* comparing,
* referring,
* sampling,
* illustrating, and
* representing
as activities to be formally described within their respective digital information processing environment. Bernardou et al. (2010) :warning: review some of the related later work and mention activities, such as
* monitoring
* extracting
* accessing
* networking
* verifying
* creating, and
* sharing,
which, compared to Unsworth's terms, more clearly relate to *digital* scholarship.
For software-related activities, Damerow et al. (2020) draft a profile map for the role of the Research Software Engineer (RSE), which includes
* software development,
* user interface \[work\],
* computer science \[activities\],
* domain-specific \[activities\],
* \[activities\] around software,
* \[activities around\] data management systems/information storage systems,
* data\[-related activities\],
* people-related \[activities\],
* research output\[-related\] activities,
* community \[activities\],
* other \[activities\],
* other tasks as assigned.
Additionally, some early papers (:warning: TODO CITE SOME) discuss the environment, the position and a few activities related to researchers handling software.
As none of these lists is comprehensive or focused enough to differentiate between their application to software versus data, we propose a new list of activities in the following.
We argue that they apply across disciplines, and we will compare how they are applied to software and data respectively.
### Funding (:heavy_check_mark: ready for review)
> Generating **funding**, and using it to finance the development and maintenance of software, and the creation and maintenance of data, as well as related activities.
Funding is a relevant activity for both software and data.
Researchers and Research Software Engineers are paid to create software; researchers and research technicians are paid to undertake data elicitation, collection, or measurement.
Furthermore, the collection or measurement of data incurs costs for the acquisition, use and maintenance of sensors and other tools.
Similarly, the use of software may incur licensing or service costs.
While the use of some datasets may also incur costs, e.g., some linguistic corpora can only be accessed via paid-for services, licensing costs are less common for data than for software.
Similarly, in our experience, funding for support contracts or paid-for workshops more commonly has to be acquired for software than for data.
We consider funding an external activity, initiated by researchers writing grant applications.
This may require conceptualization of research objects.
Some funders, for example, require a data management plan and/or a software management plan to be submitted with the grant application (:warning: TODO examples).
### Creation (:heavy_check_mark: ready for review)
> **Creation** of software or data.
*Software creation* is the creation of the original digital artifacts that make up the original software output: source code, documentation, metadata, configurations.
It includes the creation of these artifacts by humans, with the help of hardware such as personal computers and software such as operating systems, runtime environments, development environments, compilers or interpreters, linters and code checkers, test frameworks, documentation frameworks, virtual machines and containers, version control systems, collaboration platforms, and continuous integration and deployment pipelines.
It does not include the packaging into the original software product of non-original source code, i.e. the dependencies of the original software output, which we understand to be an aspect of **(re-)use**.
*Data creation* is the creation of raw digital data.
It includes the recording, elicitation, collection, or measurement of these data with the help of hardware such as recorders, sensors, instruments, and more complex structures such as observatories.
It does not include the processing of the raw digital data into another form or format, such as through transcription, conversion, etc., which we understand to be an aspect of **(re-)use**.
Original software and data are usually created within research projects to facilitate their own research.
The creation of both may be carried out by contractors.
> lx thinks: Best-practice guidelines/standards for software creation vs. data creation could make sense here. What about the example of Agile/Waterfall/whatever for software vs. disciplinary guidelines to gather data? (A. von Humboldt collected data and sketched flora/fauna in a notebook, which may still be commonplace in anthropology. Climate data gathering is a completely different game.)
:point_down: Rather belongs under Researchers > Creators
~~The creation of software primarily includes the creation of the original artifacts making up the software output: source code, documentation, metadata, configuration parameters, etc.
In contrast, the creation of software does not include embedding non-original research software libraries as dependencies, which we see as an aspect of **(re-) use**.
It does, however, include use of other existing non-research software, such as operating systems, runtime environments, development environments, version control software, continuous integration and deployment software, compilers and interpreters, language libraries, etc.
The creation activities are usually carried out by humans, but some - such as source code generation, the creation of useful representations of documentation or some aspects of configuration - can be carried out by software as well.
Software can be created following different methods, such as agile or waterfall methodologies.~~
~~Data is also created through different methods such as elicitation, collection, or measurement.
These sub-activities usually involve software, or hardware such as recording devices, sensors, or more complex apparati such as observatories.
Humans feature in the creation process of data as selectors - e.g., by deciding which phenomena should be observed - and in some cases in the creation of necessary artifacts such as documentation and metadata, if this is not carried out by the instruments used to create the primary data.~~
~~The creation of software requires knowledge about all parts involved in the process of software creation: computing hardware, operating systems, runtime environments, compilers or interpreters, programming languages and their syntax, and infrastructure components such as version control systems, development platforms, etc.~~
~~- Conceptualization
- Software: e.g., Methodologies: agile / waterfall, etc.
- Data: Measurement, elicitation, creation from scratch, collection~~
:point_up_2:
### Accessing, Searching, Discovering
> **Discovering** relevant software or data for planned research projects/exercises.
The F in FAIR stands for findable. Both software and data require search strategies tailored to specific needs, and we see an abundance of different platforms and information behavior (or processing). CITE Howison.
- Aggregation
### (Re-) Use
> **(Re-) Use** of software or data for research purposes.
Sustainable software can be (re-)used, although we also see the forced reuse of unsustainable software (legacy code). The reuse of data has been encouraged for a few years and is reflected in RDM activities and platforms like re3data.org. We differentiate between usage and reuse, as the latter may involve third parties who were not involved in the creation of the software or data.
Reuse of data or software requires evaluation/verification and probably also the activity of comparing, as suggested by Unsworth.
The scholarly primitives listed above mostly relate to information processing. We include *comparing*, *verifying*, *sampling*, and *illustrating* here as general usage. The most interesting differences between software and data are expected here.
### Publishing, Sharing, Licensing and Archiving
> **Publishing** software or data to make it accessible to others (including licensing and archiving).
These activities are a requirement for finding and reusing software and data. We see data journals and software journals (e.g. JOSS and JORS), and several types of repositories for software (e.g. swMath), for data (e.g. Data), or with mixed content (e.g. Zenodo). Examples include
Sharing data and/or code is required by some journals for paper review, which presents new challenges for the reviewers.
Best practices for publishing research objects like software and data include (but are not limited to) licensing information and persistent identification for proper citations.
Licenses for software (LINK) differ significantly from recommended licenses for data, e.g. Creative Commons (LINK). Depending on the jurisdiction, you may be able to receive a patent for software and/or even for data, e.g. genomes(?). In other jurisdictions you cannot claim any intellectual property on raw data, but must produce added value, e.g. by aggregation and augmentation, before you can put a license on such data. Legal aspects of software are still under debate (CITE Anzt2020).
> **[TODO]** include legal aspects
### Referencing and Credit Giving
> **Referencing** data or software to make sources transparent and give credit to its authors.
We consider data and software citation to still be in their infancy, although referencing is one of the building blocks of science. Citing research objects is mostly tied to paper publications, but we see CodeMeta and the Citation File Format (CFF) dedicated to software. TODO Data Citation Standard?
[STEPHAN] extract from Katz talk at FORCE2019, see references
### Evaluation
> **Evaluating** software or data in terms of usability but also visibility and impact.
Researcher activities are monitored externally for evaluation purposes. Publications of data or software and citations to those are used as indicators of activity and attention. Positive evaluation of research projects may lead to more funding.
Researchers must also evaluate third-party software and data for reuse (after finding something).
The different aspects will be discussed from the stakeholders' point of view.
READ
* https://www.researchgate.net/publication/267369044_Modellierung_und_Ontologien_im_Wissensmanagement_--_Erfahrungen_aus_drei_Projekten_im_Umfeld_von_Europeana_und_des_DFG-Exzellenzclusters_Bild_Wissen_Gestaltung_an_der_Humboldt-Universitat_zu_Berlin
* Latour 1986 Laboratory Life the construction of scientific facts
* and others
---
## Stakeholders
> Here, differences are drawn and discussed
Based on existing stakeholder lists (de-RSE position paper, Druskat & Katz), and further developed (similar to the activities).
Based on the lifecycles of software and data, we identify the following stakeholders.
### Researcher (:heavy_check_mark: ready for review)
The researcher is the most common but also the most diverse ~~research related~~ stakeholder dealing with software and data.
From the researcher's perspective, data, either **created** by the researcher or **re-used** from existing research, is usually the basis of research. It is the basis for testing a hypothesis ~~or to gain new insights~~. In contrast, software, whether created or merely applied by the researcher, is usually not at the center of an analysis but serves as a tool to create or process the data to be analyzed. Exceptions to this rule are researchers targeting the software itself as their research objective. In that case, the software is treated as data forming the basis of research rather than as a tool.
The researcher can be both creator and end user of software and data at the same time. Depending on factors such as job description, research project, domain, group structure etc., the balance between both roles may differ greatly between researchers or even over time for the very same researcher.
Research Software Engineers supporting research groups and projects are on one side of the spectrum, while domain researchers who mainly apply existing tools and models are on the other end.
Similarly, research technicians who oversee instrument measurements but do not analyse the data they create are on one end of the spectrum, while domain researchers who build their research on existing databases are on the opposite end.
Today, data **creation** mostly happens through sensors and/or software, but data is sometimes also collected directly by the researcher (e.g. surveys). While the process of creating the data can be seen as creative work, the data itself usually is not. In contrast, software is usually directly **created** by the researcher and seen as creative work. At times, data and software are interrelated to such an extent that one is meaningless without the other, e.g. if neither could be reused in other circumstances without the other. That may be the case when software is not modular or when a specific type of software is required to handle certain (formatted) data.
When data or software is not created by the researcher, appropriate sources first have to be **discovered**. In research this often happens through literature research. Especially for data, a common workflow is to take up data published together with a previous research study and build the new research on top of it. Software is less commonly published with (or even mentioned in) a scientific publication, requiring separate efforts to discover appropriate tools. However, with the increasing acknowledgment of software as an important part of the research process, this might change and the discovery process might become more similar for data and software in the future.
When not going via a literature search, **discovering** may differ for software and data, as they are not necessarily stored in the same (kind of) repository (CITE Struck). The attention research data has acquired during the last years has led to a vast landscape of data repositories, each with their own particular ideas on certain aspects. Some of them have been evaluated using concepts like the Data Seal of Approval. A central registry has been established to index most known data repositories. Some repositories listed there may also hold software, but this aspect is neglected in the registry. A global registry of software resources may ease the finding and evaluation of research software.
Once discovered, **evaluation** is the second important step. Evaluating third-party data or software involves different evaluation schemes: is the data augmented with enough metadata, and does a piece of research software come with enough documentation to start with?
Researchers intending to reuse either software or data also need to evaluate the licenses used and make sure that they are compatible with the intended application.
While proper licensing is by now quite established for software, and license **evaluation** has become a rather simple task due to heavy standardization, in particular in the field of open source software, the situation for data sets is often less transparent: data still often lacks proper license information, and data licenses seem to be less standardized.
When **re-using** software or data, properly **referencing** it is an important step for the researcher to give credit to the original source and to make their own research transparent and more reproducible. In terms of referencing, software and data currently suffer from similar problems: both tend to go unmentioned in publications, and references to them usually do not carry the same level of prestige as references to scientific paper publications.
After being collected, data might be further processed but is usually not altered anymore and is kept as it was received. Software, on the other hand, might be updated quite frequently during the analysis process and might even require updates after the corresponding research has been finalized, due to changes in software dependencies. Data in this context is quite stable, while software is continuously evolving.
This also means that software has to be maintained even after the corresponding research has been finalized to enable others to use it, while data can usually be reused without the burden of continuous maintenance.
For citations, the higher frequency of updates for software compared to data makes mentioning the used version even more important for software than for data. When archiving software and data for a publication, the requirements also differ: data primarily demands a service with sufficient resources, while software primarily demands good version management.
The structural differences are also reflected in the different options when it comes to **licensing**. While broadly used data licenses such as Creative Commons (CC) theoretically cover software as well, it is not recommended to use them for that purpose (https://creativecommons.org/faq/#can-i-apply-a-creative-commons-license-to-software); instead, licenses specialized to the specific requirements of software, such as the GPL, are mainly used. In addition, especially in the humanities, privacy aspects play a role and may hinder further usage of or pose restrictions on the used data. Depending on the global region, software may or may not be assigned copyright or a patent, with the handling of research data differing as well.
The creator of a new product, say a web-based platform that combines research software and data to showcase results, needs to understand the different licensing schemes and their consequences.
For **funding**, the differences are not that pronounced, as both data and software are seen as means to answer a research question, and usually little dedicated funding is received for either. However, requirements on software and data related to a project might differ, also reflected in the need for separate data management plans (DMPs) and software management plans (SMPs). More recently, outputs management plans seem to play a role, too (https://wellcome.ac.uk/funding/guidance/how-complete-outputs-management-plan).
:::info
TL;DR: For a researcher, data is (mostly) the source for gaining new insights while software is (mostly) a research tool. Due to these quite distinct primary purposes, the handling of software and data is in most cases also quite different.
:::
### Funders
In addition to text publications, any data or software created during research activities should be considered a research result. In the context of Open Science, all of it should be made accessible to reviewers, the community, and interested parties where possible.
Is software more expensive than data? That may depend on the discipline and/or the complexity of the software to be written and maintained. A seasoned FORTRAN or COBOL developer could probably ask for a different salary than a PHP or Python developer, solely based on the number of available developers. On the other hand, the creation of data during experiments like ATLAS at CERN requires so many financial and human resources that the cost of software development may be dwarfed in this example. A general answer to this question is currently out of sight.
- [ ] link some recommendations from DFG etc
Reproducibility is gaining momentum and the FAIR principles are finding application. More on that later.
- include industry
##### Funder
Funders may encourage creators to use licenses that are as open for reuse as possible (LINK DFG, others). Some funders may evaluate the choice of license for future funding?
### Society
Society may have some interest in data and/or research software. We currently assume that research journalism mainly focuses on research output such as books and papers. Society engages with research software, e.g., SETI@home, or benefits from research software and data, e.g., in the form of weather forecasts. Taxpayer money goes into the creation of both data and software, but it is hard to account for it.
### Research disciplines
E.g., Software Engineering uses software as research data.
> [name=lx] This needs more elaboration. Which stakeholder cares about this?
### Artifact managers
Research Data Managers and Data Curators are job roles that have been established in the recent past. (TODO cite papers investigating this if there are any)
Software is a different artifact; its handling is in parts comparable to that of data and in other parts distinct. Other (more analogue) research results, like architectural paper models or minerals, are sometimes stored in archives and presented in exhibitions (for the public). In some environments, software and data are merged in ways that make them hard to distinguish; think of VR (or AR), where software utilizes predefined (or live) data to create (or augment) virtual (or real-world) environments, e.g. in surgery. Some research projects result in a web-based platform where both data and software need to be entangled (often in VMs or containerized environments) to portray the research question(s) and outcome. We will focus on data and software here.
A recent job advertisement detailed the tasks (activities) of a developer to be employed to shape the needed infrastructure:
- Implementing, developing and maintaining innovative software solutions within the team
- Administrating software applications
- Analyzing requirements and system designs
- Documenting solutions and results
- Contributing, evaluating, and revising existing methodologies and technologies continuously and independently
This is an example where research software is **created for an artifact manager** to handle research data.
This newly founded research institute wants to ''establish[...] a central digital infrastructure for acquisition, processing, long-term archiving and distribution of data and research results''. The general term ''research results'' may include software.
Job descriptions for research data managers / coordinators we came across in the recent past often included
- conceptualizing infrastructure
- implementing infrastructure for publishing, finding and long-term archiving
- maintenance and improvement of existing infrastructure
- conceptualizing and hosting workshops for RDM (we also see workshops being held by the Data and Software Carpentries)
- writing reports/papers/
- utilizing APIs and handling of data in several formats and metadata standards
Such infrastructure eases the **reuse** by offering a platform to **publish** and **discover** research data and/or software for that matter.
There are different metadata schemes to be considered by this agent for research data:
- Dublin Core
- METS
- ...todo more...
and research software:
- CFF
- CodeMeta
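As an illustrative sketch (field selection and values are invented and not taken from any official profile), the following Python snippet builds a minimal CodeMeta-style record for a piece of software and a Dublin Core-style record for a dataset, showing that the properties that matter differ between the two:

```python
import json

# Minimal CodeMeta-style description of a piece of research software
# (fields drawn from schema.org/CodeMeta; values are made up).
software_record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-analysis-tool",
    "version": "1.2.0",
    "programmingLanguage": "Python",
    "codeRepository": "https://example.org/repo/example-analysis-tool",
    "license": "https://spdx.org/licenses/GPL-3.0-or-later",
}

# Minimal Dublin Core-style description of a dataset (values are made up).
dataset_record = {
    "dc:title": "Example measurement series",
    "dc:creator": "Doe, Jane",
    "dc:date": "2020-06-01",
    "dc:format": "text/csv",
    "dc:rights": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(software_record, indent=2))
print(json.dumps(dataset_record, indent=2))
```

Note that the software record carries properties such as version, programming language and code repository, which have no obvious counterpart in a typical dataset description.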
Both software and data require curation. Data at CERN is discarded to keep the remaining data size manageable (TODO cite). Data must be described by metadata to allow reuse; for example, column headers should be explained to ease interpretation.
Software should be described in a different way (see documentation in the **Creation** section). Most of the time, in-line documentation of source code may suffice, but artifact managers may document dependencies/requirements for research software in their platform.
**Documenting** research results in such a way may ease the **evaluation** of an organization and lead to new **funding**. Documentation differs for data and software (see the **Creation** section), and artifact managers may have different competencies accordingly. At the same time, software and data may not be distinguished by the artifact manager if an archival system / publishing platform is not detailed enough in its metadata and only handles BLOBs.
**Publishing** is a prerequisite for referencing both data and software. Some artifact managers may document **references** because they may influence **evaluation**. Data citation
:::warning
SD: How does curation differ for software vs data? E.g., do many vs few versions make a difference? What about necessary/useful metadata?
:::
Research Data Management is often located at the academic library (https://libraries.mit.edu/data-management/), whereas Research Software Consultants may be centralized at ''data'' (computing) centers, like this group:
https://srcc.stanford.edu/about/people
Artifact managers may provide services like workshops on data handling and maybe even statistical data analysis.
HPC or cloud architects, who can also be seen as (software) artifact managers, provide consulting services.
If legal services are offered by these two groups, their content certainly differs as well, as different licenses are appropriate for data and software.
While artifact managers may be the agents that deal with research results (on behalf of researchers), the following section describes platforms which may be used.
### Publishers and Aggregators
As mentioned in the previous section, data and software may be **published in the same repository** on the same platform. There are reasons to keep data and software side by side:
- belong together and create one context
- software may depend on certain formats and/or data requires certain software to be handled
- containers allow encapsulating software, all its dependencies and mounting of data, benefitting evaluation, long-term availability, ...
- 'live' visualization of processed data
- ?DO WE HAVE MORE?
There are reasons to **publish data and software in different places**:
- Dedicated extensive metadata may enable ''better'' search functionality which may result in high precision (and appropriate recall).
- Software could be applied to several different datasets (from several disciplines, for example clustering algorithms) and does not depend on a data publication (but maybe on certain import formats)
- Data may be interesting for a certain discipline and belong in a disciplinary data repository
- licensing, patents, and similar restrictions may apply and require (or prohibit) publication in certain (local) platforms
- TODO MORE
This is an effort to list repositories for research software:
https://github.com/NLeSC/awesome-research-software-registries/blob/master/README.md
Similar (listing) efforts exist and have been created for awareness. Some are documented in Struck (2018).
Research data repositories are listed at https://re3data.org/ and indexed with some metadata. Unfortunately, this registry of repositories has some shortcomings: it is only an indexer, and it fails at times to index software repositories properly, e.g. https://www.re3data.org/repository/r3d100011713. Usage data is not available for evaluation.
Publishing software (and its dependencies) together with sample data in containerized environments (or VMs) may ease **evaluation** for **re-use** and/or **research performance/ROI**.
> Do we have better terms for distinguishing evaluation aims?
Artifact managers may run platforms where source code and/or compiled applications (in different versions) are made available. Other interesting attempts include swmath.org, where citations to software in published papers are documented.
A real aggregation of data happens in these repositories.
An aggregator of software is a global platform like GitHub; a local GitLab instance may aggregate the research software of an organization.
As mentioned in the section on artifact managers, some disciplinary organizations may insist on their own platform to publish their ''results'', be it data or software.
### Platform vendors
> lx: what kind of platform are we thinking of there? Software platforms may be 'computing centres' below.
- Versioning
- e.g., CI/automated processing?
- What is "at stake?"
- commercial vs non-commercial?
### Computing centres and infrastructure providers (:heavy_check_mark: ready for review)
No matter whether their services are focussed on **creation** or **use** of data or software, infrastructure providers will mainly have to look into the infrastructure requirements which come with it. The key aspects in this analysis usually differ depending on whether it is software or data.
For a heavily software-based service, computing power as well as data exchange rates between compute nodes are typically the key factors for good overall performance. For heavily data-based tasks, storage requirements and archiving/backup capabilities are more the center of attention, as well as fast and broad network connections between compute nodes and storage facilities. Correspondingly, **evaluating** an infrastructure provider or planning a new infrastructure will be based on different measures for data-centric applications compared to software-centric applications.
In terms of **referencing**, data is typically less demanding, as referencing is mainly important for backtracking the source of a data set. For software, however, there is the issue of software dependencies. Infrastructure providers typically need to spend considerable resources on keeping the software stack up to date and compatible with the software running on it, requiring complex referencing concepts, as one piece of software might have different version requirements for the same dependency than another piece of software running on that infrastructure. Hence, **referencing** for data is mainly a transparency issue, whereas for software it is crucial for its **use**.
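As a small, hedged illustration of such conflicting version requirements (all package names and version numbers are invented), the following sketch checks whether two pieces of software running on the same infrastructure can share one installed version of a common dependency:

```python
# Hypothetical version requirements of two tools running on the same
# infrastructure; names and versions are invented for illustration.
requirements = {
    "climate-model": {"numlib": {"2.0", "2.1"}},
    "plot-toolbox":  {"numlib": {"1.9", "2.0"}},
}

def shared_versions(dependency: str) -> set:
    """Return the dependency versions acceptable to every tool that needs it."""
    acceptable = None
    for tool, deps in requirements.items():
        if dependency in deps:
            versions = deps[dependency]
            acceptable = versions if acceptable is None else acceptable & versions
    return acceptable or set()

# If the intersection is empty, the provider must offer several versions
# side by side (e.g. via environment modules or containers).
print(shared_versions("numlib"))  # {'2.0'} -> one shared install suffices
```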
:::info
TL;DR: For infrastructure providers and computing centres, data and software are mainly different things, as they impose quite different requirements. Data applications mainly require storage and network capabilities, whereas software applications mostly benefit from computing power.
:::
---
## Conclusion
While, formally, software can always be described as data, a deeper look into the different activities attached to software and data, and the different stakeholders having to deal with them, shows that in many instances there are far more differences than similarities between the two. It is important to acknowledge their different nature, characteristics and needs when it comes to handling them (e.g. by creating separate software and data policies instead of trying to cover both with the same rules).
As with quantum states, while something can simultaneously be software and data, it will in most cases only behave as one of the two.
:::warning
Consequences of differences for:
- RDM vs RSM
- Important to analyse the meta layer, i.e., do more of the things we have done in this paper to better understand a complex scenario
:::
# References:
:book: https://www.zotero.org/groups/2557322/position_on_software_and_data
:working group: https://www.rd-alliance.org/groups/fair-4-research-software-fair4rs-wg