# Hashed Data Elision: Problem Statement and Areas of Work
## Abstract
###### tags: `article / in process`
This document discusses the privacy and human rights benefits of data minimization via the methodology of hashed data elision and how it can help protocols to fulfill the guidelines of [RFC 6973: Privacy Considerations for Internet Protocols](https://datatracker.ietf.org/doc/rfc6973/) and [RFC 8280: Research into Human Rights Protocol Considerations](https://datatracker.ietf.org/doc/rfc8280/). Additional details discuss how the extant Gordian Envelope draft can provide additional benefits in these categories.
## Status of This Memo
[TBW]
## Copyright Notice
[TBW]
# 1. Introduction
Current IETF guidelines for [privacy](https://datatracker.ietf.org/doc/rfc6973/) and [human rights considerations](https://datatracker.ietf.org/doc/rfc8280/) in internet protocols lack the specificity needed for practical implementation, leaving protocols exposed to privacy threats such as correlation, secondary use, and unnecessary disclosure of data.
IETF released guidelines for privacy considerations in 2013 with [RFC 6973](https://datatracker.ietf.org/doc/rfc6973/) and then expanded upon that with human-rights considerations in 2017 with [RFC 8280](https://datatracker.ietf.org/doc/rfc8280/). Both RFCs provide thoughtful ideas for how privacy can be improved in internet protocols, and how that can support human rights on the internet.
However, as generalized guidelines, the RFCs don't provide the specifics that might be required to incorporate them into new protocols. This document suggests more specific areas of work based in part on the Data Minimization suggestions of §6.1 of RFC 6973, and expands them to also support some of the Human Rights Guidelines outlined in §6.2 of RFC 8280.
# 2. Problem Statement
## 2.1. Correlation, Secondary Use, and Disclosure All Threaten Privacy
Often, digital data transmission operates on an all-or-nothing basis: sharing data means full disclosure. This can threaten privacy in multiple ways:
* Correlation can combine data from different sources, unintentionally revealing comprehensive individual data, significantly more than was intended. This is highlighted as a problem in §5.2.1 of RFC 6973.
* Secondary Use occurs when data acquirers repurpose data beyond its original intent. This is highlighted as a problem in §5.2.3 of RFC 6973.
* Disclosure of any sort can reveal more data than was required for a use, and that extra data can then create prejudice or otherwise disadvantage the individual whose data has been disclosed. This is highlighted as a problem in §5.2.4 of RFC 6973.
Methodologies for minimizing the amount of data shared at any one time can reduce all of these privacy dangers.
## 2.2. Data Minimization through Anonymity or Pseudonymity is Insufficient
§6.1 of RFC 6973 lists anonymity and pseudonymity as two methodologies for creating data minimization. This means removing uniquely identifying data and/or reducing the amount of personal data that is transmitted.
Though anonymity and pseudonymity are minimal requirements for improving the privacy of digital data, they are insufficient. Best addressing privacy requires reducing the amount of all data found in any disclosure to the bare minimum required for that specific disclosure.
## 2.3. Simplistic Data Minimization Can Hinder Other Human Rights Solutions
Simplistic data minimization focuses on cutting out content that is not required for a specific task. Doing this is a necessity to improve privacy through data minimization, but again it's not sufficient.
This is because simplistic data minimization excises the data entirely, leaving no trace of what was removed, which can cause problems for the Integrity and potentially the Authenticity of the original data set. These are needed per the Guidelines for Human Rights, as outlined in §6.2.16 and §6.2.17 of RFC 8280.
A better solution for data minimization is required, which does not ignore other Human Rights needs as it improves privacy. Hashed data elision can provide such a solution.
## 2.4. Any Data Can Be Too Much Data
There are many situations where data minimization is important because a party needs to learn data that they do not already know. However, there are other situations where a party doesn't need to know some freeform data, but instead requires proof that a specific data precept is true. The traditional example is proof that someone is 21 or older, for buying alcohol in the US.
In these cases, privacy threats can be reduced even more by providing no data, simply the proof that a certain precept is true. This can offer very strong protection against Correlation (§5.2.1 of RFC 6973) and obviously minimizes Disclosure (§5.2.4 of RFC 6973).
Though some systems such as BBS+ Signatures and other Zero-Knowledge Proof systems can support superior anti-correlation with "proof of knowledge of the undisclosed signature", a simpler salted hashed data elision can often provide easier solutions for many classes of "inclusion" proofs.
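The role of the salt can be illustrated with a minimal Python sketch. It is not taken from any specification: the claim string `age_over_21=true`, the helper names, and the 16-byte salt length are all assumptions chosen for illustration. It shows that an unsalted hash of a low-entropy value can be recovered by simply trying every candidate, while a salted hash cannot be matched without the salt.

```python
import hashlib
import secrets
from typing import List, Optional


def digest(value: str, salt: bytes = b"") -> str:
    """Return the hex SHA-256 of salt || value (the salt is optional)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def guess(target: str, candidates: List[str]) -> Optional[str]:
    """Try to recover a value by hashing every candidate without a salt."""
    for candidate in candidates:
        if digest(candidate) == target:
            return candidate
    return None


# Hypothetical claim with a very small value space.
candidates = ["age_over_21=true", "age_over_21=false"]
claim = "age_over_21=true"
salt = secrets.token_bytes(16)  # assumed salt length, kept secret by the holder

unsalted = digest(claim)
salted = digest(claim, salt)

print(guess(unsalted, candidates))    # the claim is recovered by brute force
print(guess(salted, candidates))      # no candidate matches without the salt
print(digest(claim, salt) == salted)  # the holder can still prove the claim
```

A holder who later wishes to prove the claim reveals the value together with its salt, and the verifier recomputes the digest.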
# 3. Areas of Work
## 3.1. Core Areas of Work
This section tries to identify and structure areas of work to address the aforementioned topics by turning the guidelines of RFC 6973 and RFC 8280 into more precise specifications or requirements. It focuses on hashed data elision as a core area of work, but in a section on optional areas of work discusses other advancements that can further support RFC 6973 and especially RFC 8280.
### 3.1.1. Support Data Minimization
As suggested by RFC 6973, Data Minimization is a prime methodology for improving privacy and reducing problems such as Correlation, Secondary Use, and Disclosure.
To fully support Data Minimization, a specification must:
1. Allow for the elision of some content from a larger package of data.
2. Allow for the holder of that data to do that elision, rather than restricting it to only issuers.
### 3.1.2. Incorporate Deterministic Hashing
As noted in §2.3, above, simplistic Data Minimization can cause other human rights problems such as a lack of Authenticity or Integrity checking. This can be resolved in a specification by requiring a fingerprint that can be used to verify elided data. It must:
1. Allow elided data to be verified with a fingerprint.
2. Maintain the validity of authenticity checks such as signatures through that fingerprint.
3. Ensure that the fingerprint is unidirectional, so that the fingerprint can prove the existence of the data, but the data cannot be derived from the fingerprint.
This can typically be done through a hash function such as SHA-256 or a newer function such as BLAKE3. Combined with the requirements of §3.1.1, above, it would require data to be hashed prior to its elision and for any signature to cover the hashes, not the unhashed data.
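The combination of these requirements can be sketched in a few lines of Python. This is an illustrative toy, not the Gordian Envelope algorithm: the entry layout, the length-prefixed field encoding, the sorted concatenation used for the root digest, and all function names are assumptions. The point it demonstrates is that if a signature covers a root digest built from per-field digests, a holder can replace any field with its digest and the root (and so the signature) remains valid.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def entry_digest(entry: dict) -> bytes:
    """Digest of one entry; an elided entry already carries its digest."""
    if "digest" in entry:
        return entry["digest"]
    name = entry["name"].encode("utf-8")
    value = entry["value"].encode("utf-8")
    # Length-prefix the name so ("ab", "c") and ("a", "bc") hash differently.
    return sha256(len(name).to_bytes(4, "big") + name + value)


def root_digest(entries: List[dict]) -> bytes:
    """Digest over all entry digests; this is what a signature would cover."""
    return sha256(b"".join(sorted(entry_digest(e) for e in entries)))


def elide(entries: List[dict], name: str) -> List[dict]:
    """Replace the named entry with its digest; any holder can do this."""
    return [
        {"digest": entry_digest(e)} if e.get("name") == name else e
        for e in entries
    ]


# Hypothetical credential data.
credential = [
    {"name": "name", "value": "Alice Example"},
    {"name": "birthdate", "value": "1990-01-01"},
    {"name": "state", "value": "CA"},
]

redacted = elide(credential, "birthdate")
print(root_digest(redacted) == root_digest(credential))  # True
```

For brevity the fields here are unsalted; low-entropy values such as dates would additionally need a salt to resist guessing.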
### 3.1.3. Enable Inclusion Proofs
Because data does not always need to be shared to provide the verification required by a validator, support of data proofs can provide additional privacy and human rights benefits. To support this, a specification must:
1. Allow for the revelation of specific fingerprints.
2. Support the easy creation of an inclusion proof that demonstrates how specific data can be hashed to create that specific fingerprint.
3. Enable any holder to create that inclusion proof, not just an issuer.
Through this methodology, a holder can create a proof for a specific bit of data, such as their residence in a specific country or state, demonstrate that proof's creation, and show that it matches the hash of elided data. However, the holder does so only if and when they wish: the data is never known unless they do so.
Though other methodologies exist for proving the content of data, such as Zero-Knowledge Proofs and BBS+ Signatures, inclusion proofs based on hashes provide a much easier solution that is pragmatically more likely to be implemented and thus is more accessible and useable today.
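A hash-based inclusion proof of this kind can be sketched as follows. The sketch assumes a verifier who already holds a set of fingerprints from a fully elided document, and a holder who kept the original field plus a salt; the field names, the encoding, and the functions `salted_digest` and `verify_inclusion` are hypothetical and not drawn from any specification.

```python
import hashlib
import hmac
import secrets
from typing import Iterable


def salted_digest(name: str, value: str, salt: bytes) -> bytes:
    """SHA-256 over length-prefixed salt, name, and value."""
    parts = [salt, name.encode("utf-8"), value.encode("utf-8")]
    encoded = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hashlib.sha256(encoded).digest()


def verify_inclusion(
    published: Iterable[bytes], name: str, value: str, salt: bytes
) -> bool:
    """Check that the revealed field hashes to one of the known fingerprints."""
    candidate = salted_digest(name, value, salt)
    return any(hmac.compare_digest(candidate, d) for d in published)


# Holder side: hypothetical fields, each with its own secret salt.
fields = {"country": "US", "state": "CA", "birthdate": "1990-01-01"}
salts = {name: secrets.token_bytes(16) for name in fields}

# What the verifier sees up front: fingerprints only, no data.
published = {salted_digest(n, v, salts[n]) for n, v in fields.items()}

# Later, the holder chooses to prove residence in a specific state.
print(verify_inclusion(published, "state", "CA", salts["state"]))  # True
print(verify_inclusion(published, "state", "NY", salts["state"]))  # False
```

Nothing about `country` or `birthdate` is learned by the verifier in this exchange, because their salts are never revealed.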
### 3.1.4. Facilitate Herd Privacy
Support for inclusion proofs can also allow for the use of herd privacy, where data about a specific user is contained within a much larger hash of data, which can be widely published without danger. This puts all the agency for data revelation in an individual user's hands, and does it without any need to "phone home", meaning that not even the original publisher of the data would know when that data was being checked.
To ensure that inclusion proofs can be extended to herd privacy, a specification must:
1. Use a branching structure for data storage such as a Merkle Tree where hashes can be further hashed together at higher levels in a well-known, regularized way.
2. Allow for the publication of top-level or high-level hashes.
3. Enable individual holders to reveal paths that connect their individual data up to the top-level or high-level hash through any number of branches.
4. Build that structure in such a way that a minimum of other hashes are revealed when a user reveals a path to their own data; or else ensure that any other hashes revealed are worthless without knowledge of secret data, such as a salt.
5. Otherwise support the creation of inclusion proofs for proving their low-level individual data.
Ensuring herd privacy in part focuses upon empowering the user, but it also depends on a thoughtful creation of the hash tree structure, such that other information can't be guessed from the revelation of hashes.
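The requirements above can be sketched with a small salted Merkle tree in Python. This is a generic illustration under stated assumptions, not the structure defined by any draft: the `0x00`/`0x01` domain-separation prefixes, the duplication of the last node on odd-sized levels, the record strings, and all function names are choices made for the example.

```python
import hashlib
import secrets
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(salt: bytes, record: bytes) -> bytes:
    return h(b"\x00" + salt + record)


def node_hash(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)


def padded(level: List[bytes]) -> List[bytes]:
    """Duplicate the last node when a level has an odd number of nodes."""
    return level + [level[-1]] if len(level) % 2 == 1 else level


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return every level of the tree, from the leaves up to the root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = padded(levels[-1])
        levels.append(
            [node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def prove(levels: List[List[bytes]], index: int) -> List[Tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    path = []
    for level in levels[:-1]:
        level = padded(level)
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(root: bytes, salt: bytes, record: bytes, path) -> bool:
    acc = leaf_hash(salt, record)
    for sibling, sibling_is_left in path:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Hypothetical records for five holders, each with a private salt.
records = [f"holder-{i}:resident=CA".encode() for i in range(5)]
salts = [secrets.token_bytes(16) for _ in records]
levels = build_levels([leaf_hash(s, r) for s, r in zip(salts, records)])
root = levels[-1][0]  # only this value needs to be published

path = prove(levels, 2)
print(verify(root, salts[2], records[2], path))  # True
```

Only the root is published. A holder reveals their own record, salt, and sibling path when they choose to, and the sibling hashes exposed along the way disclose nothing about other holders' records without those holders' salts.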
## 3.2. Optional Areas of Work
Using hashed data elision as a foundation would improve the privacy of almost any IETF protocol.
The [Gordian Envelope Internet-Draft](https://datatracker.ietf.org/doc/draft-mcnally-envelope/) is one example of a specification that supports hashed data elision. It could be used to enable all of the Core Areas of Work. It also goes further, incorporating additional functionality that can provide better support for RFC 6973 and RFC 8280 through additional features, including the following.
### 3.2.1. Extend Support to Encryption & Compression
A hashed data elision system can be expanded to support both encryption and compression functions, as encrypted and compressed data can also be represented by their hashes without revealing any information about the original data.
Incorporating encryption into a data specification offers the highest level of privacy and of data minimization possible, as data can only be viewed by select individuals with the decryption key. This is especially important for Confidentiality, which is referenced in §6.2.15 of RFC 8280.
Hashing encrypted or compressed data primarily improves Authenticity, per §6.2.17 of RFC 8280. As with other sorts of elided data, signatures will remain valid even following encryption or compression, provided the signatures are applied to the data hash, not the original data.
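The compression case can be sketched with Python's standard library. The element layout (`form`, `digest`, `payload`) and the function names are assumptions for illustration only. The element keeps the digest of the original content, so anything that was signed over that digest stays valid while the content itself is stored in compressed form.

```python
import hashlib
import zlib


def digest(content: bytes) -> bytes:
    return hashlib.sha256(content).digest()


def compress_element(content: bytes) -> dict:
    """Swap content for its compressed form, keeping the original digest."""
    return {
        "form": "compressed",
        "digest": digest(content),
        "payload": zlib.compress(content),
    }


def decompress_element(element: dict) -> bytes:
    """Restore the content and confirm it matches the declared digest."""
    content = zlib.decompress(element["payload"])
    if digest(content) != element["digest"]:
        raise ValueError("payload does not match declared digest")
    return content


original = b"a long, repetitive attachment " * 50
element = compress_element(original)

print(element["digest"] == digest(original))   # digest is unchanged
print(len(element["payload"]) < len(original))  # payload is smaller
print(decompress_element(element) == original)  # content round-trips
```

An encrypted element could follow the same pattern, with ciphertext as the payload and the plaintext digest retained; that is not shown here because the Python standard library does not include an authenticated cipher.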
### 3.2.2. Address Additional Human Rights Threats
As currently imagined, the Gordian Envelope Internet Draft also offers support for several other Guidelines for Human Rights Considerations that are listed in §6.2 of RFC 8280:
* Privacy (§6.2.2). Besides the obvious privacy benefits of data minimization, Gordian Envelope also improves privacy through optional usage of metadata, which can be used to document the sensitivity of contents, retention limits, etc.
* Accessibility (§6.2.11). Metadata can also be used to ensure accessibility and internationalization of data through inclusions of references with a variety of localizations.
* Censorship Resistance (§6.2.6). Gordian Envelope is built to support SCIDs, or self-certifying identifiers, which can be used to avoid reuse of existing identifiers that might be associated with persons or content.
* Open Standards (§6.2.7). As an Internet Draft, Gordian Envelope represents an open standard. It can support interoperable exchange of data, which is vital for human rights.
* Heterogeneity Support (§6.2.8), Adaptability (§6.2.18). Gordian Envelope builds its data format on triples: assertions of subjects, predicates, and objects. This format can easily be adapted to a wide variety of data formatting styles.
* Localization (§6.2.12), Decentralization (§6.2.13). As an open standard that solely utilizes other open standards such as well-known hashing and encryption algorithms, Gordian Envelope is built to avoid centralization. This is further supported by Gordian Envelope's Heterogeneity Support, its Adaptability, and its Reliability.
* Reliability (§6.2.14), Integrity (§6.2.16). Gordian Envelope is built on CBOR, which means that data is self-describing. It is also hashed. This improves its Reliability and Integrity, while the self-description also makes data stored in Gordian Envelope more interoperable, and thus less subject to centralization.
### 3.2.3. Keep It Simple
Support for privacy and for human rights has another requirement: it needs to be kept simple so that it finds actual use.
[Gordian Envelope](https://datatracker.ietf.org/doc/draft-mcnally-envelope/) is a fundamentally simple data format that only achieves complexity through iterative structure design.
# 4. Privacy Considerations
As outlined, the concept of hashed data elision and, more specifically, the Gordian Envelope specification provide a wide variety of privacy advancements.
The biggest remaining privacy concern is accidental correlation, which can arise if different parties hold different versions of the same data that have been elided in different ways. This is currently seen as an acceptable side-effect of an elision system that allows for Authenticity and Integrity in the system, and can be offset by careful creation of Envelope structures, such as gathering small groups of data into distinct, elided branches.
However, the question also remains open as to whether there might be more expansive and more automated solutions.
# 5. Security Considerations
The biggest security considerations focus on the strength of hashing algorithms (and encryption algorithms if they're used). Potential threats to hashes and encryption such as quantum computing would also result in threats to any hashed data elision system.
# 6. IANA Considerations
IANA is separately being queried on the allocation of certain CBOR tags.
# 7. Informative References
[envelope] McNally, W., Allen, C., "The Envelope Structured Data Format", March 2023, <https://datatracker.ietf.org/doc/draft-mcnally-envelope/>.
[RFC6973] Cooper, A., Tschofenig, H., Aboba, B., Peterson, J., Morris, J., Hansen, M., Smith, R., "Privacy Considerations for Internet Protocols", RFC 6973, July 2013, <https://datatracker.ietf.org/doc/rfc6973/>.
[RFC8280] ten Oever, N., Cath, C., "Research into Human Rights Protocol Considerations", RFC 8280, October 2017, <https://datatracker.ietf.org/doc/rfc8280/>.
[RFC8949] Bormann, C., Hoffman, P., "Concise Binary Object Representation (CBOR)", RFC 8949, December 2020, <https://datatracker.ietf.org/doc/rfc8949/>.
[IF WE WANT TO SUBMIT THIS ANYMORE, WE SHOULD CONSIDER OTHER REFERENCES TO ADD ON TOPIC]