# Enabling Risk Scoring for **Driving Event Machine Learning Labels** through Data Provenance

**Safer & better mobility through verifiable Data Chains.**

Presented by Daniel Alvarez-Coello (BMW Group), Juan Caballero (Spherity GmbH), Andrey Orlov (Spherity GmbH), Dr. Carsten Stöcker (Spherity GmbH), Daniel Wilms (BMW Group)

March 12, 2020, Munich/Dortmund, Germany

Keywords: decentralized identity, vehicle identity, digital twinning, cryptographically secured data chains, verifiable credentials, blockchain, machine learning, agile driving, data chain provenance, audit trails, reputation system, W3C DIDs, W3C VCs, cyber-physical systems, risk scoring

## Abstract

Driven by technological innovation and organic ecosystem growth, mobility value chains are changing significantly, from monolithic, closed systems to distributed, open ones. Data flows are increasingly defined dynamically and stretch across multiple organizational boundaries and even legal jurisdictions. The trustworthiness and accuracy of output data (such as that generated by in-car sensors) along distributed digital mobility value chains is of increasing importance for safety, reliability, and even the regulation of mobility systems; this importance will only grow as machine learning ("ML") systems become more central to mobility. Mobility systems, however, have a very low tolerance for fraud and abuse. They *must* be made and kept highly resilient against malicious actors exploiting the cyber-physical system with falsified data, because of the impact this data can have on safety systems and real-world outcomes.
While it is not yet a widespread issue, we can assume that in the near future, malicious actors will bring fabricated data into mobility systems, whether to manipulate the outcomes of traffic systems, to divert resources from legitimate data sources, or to commit industrial sabotage (for instance, lowering the safety ratings of competitors or sabotaging the safety of a specific physical traffic system). While we can hope these kinds of malicious acts will occur rarely, it seems safe to assume that the merely greedy will more readily insert themselves into open marketplaces for machine-learning data sets; once there, they might sell falsified, fabricated, or duplicated "spam" data that will need to be identified and rejected for the safety of the consuming algorithms. Given these incentives, **resilience** against malicious actors will quickly become a key requirement for cyber-physical value chains in the mobility sector. This may include some combination of behavioral or trust analytics of data vendors, mandatory audits of entire data sets, and a system of **comprehensive** data provenance (enabled by harmonized data standards and global identity) that enables tracing any data back to verifiable origins in individual cars.

The growing use of data and the internet of things (IoT) across every industry -- including logistics, manufacturing, and mobility -- is ushering in an era of data abundance. Abundant data can certainly be a good thing, but only if the risk of using it can be analyzed by automated scoring methods. Businesses must know the origin and risks of data from different sources before using them; they need highly automated and verifiable instruments for assessing the **provenance of data**, which is vital for ML applications [ref](https://www.forbes.com/sites/forbestechcouncil/2019/05/22/four-reasons-data-provenance-is-vital-for-analytics-and-ai/#11c89aae57d6).
In this paper, we propose and detail one such mechanism for establishing data provenance in real time along a data chain that sources driving-event data and processes it into an ML label. When the data provenance of a given dangerous-driving machine-learning label is known, a scoring model can be applied to it that calculates the risks of consuming this label for system control or responsible decision-making. This example, chosen for clarity and simplicity, is nevertheless applicable much more widely to any digital data chain, as we believe all machine-learning labels will, with time, need some degree of rating and scoring to be used safely, legally, and/or responsibly.

## Problem: Adversarial Authenticity Assurance for ML Labels

The problem of detecting fake data is not new, and machine learning essentially inherits that problem from the data it consumes. Pictures and videos can be fabricated in ways so sophisticated that it is impossible to immediately separate truth from lies. The development of Generative Adversarial Networks ([GANs](https://arxiv.org/abs/1406.2661)) marked a milestone in using machine learning to create next-generation fakes derived from training data sets: fakes that look at least superficially authentic to human observers, with many realistic characteristics, provided the training data was sufficiently high-quality and authentic. It is not difficult to create fake entities -- whether people or vehicles -- and fake content, such as pictures or telematic data sets, on their behalf. As the consequences of fake data can be very serious, owners of platforms will be held accountable for the distribution of fake data sets and for damages resulting from the use of this data. Realistic assessments of risk and liability may well hold back the development and application of these technologies until provenance can be adequately verified for the data sets training and feeding the algorithms.
Trust in the authenticity, integrity and quality of a given ML data label can be established by the following mechanisms:

| # | Approach | Description | Quality of Trust Insights |
| -------- | -------- | -------- | -------- |
| 1 | Local reputation systems | Correlation of events and feedback scores about the identity subjects that created the data sets | low |
| 2 | Algorithmic analysis | Analysis of output data sets based on machine learning | low to medium |
| 3 | End-to-end data provenance | Authenticity analysis to verify the concrete origin of data and the integrity of the entire data chain | medium |
| 4 | Data provenance with scoring | Analysis of the authenticity and integrity of the data chain **and** scoring of the entities involved based on their life-cycle certificates and historic events, where available | high |

Other hybrid models combining algorithmic analysis, data provenance and scoring can be developed as well. Fundamentally, though, all analyses are stronger when combined with a reputation/provenance-based scoring mechanism, and these mechanisms require the broadest and most inclusive reputation system possible. For this reason, a discussion of how to develop better, more global reputation systems proves unavoidable, since ML alone cannot adequately verify that input data, or labels based on it, have not been falsified using similar ML.

### Approach #1: Today's Local Reputation Systems

Today, reputation systems work at scale only on monolithic platforms such as Amazon or Facebook, and even there they require significant maintenance overhead and incentivization. Typically, a marketplace has a native reputation system that works independently and is abstracted from unique personal identities. The lack of robust verification and scoring mechanisms allows participants to manipulate these scores, in turn distorting users' perception of them or of their competitors.
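As a rough illustration of how such a hybrid model might combine the signals from the table above, consider the following sketch. The inputs, weights, and cap value are illustrative assumptions, not derived from this paper:

```python
def hybrid_trust(reputation: float, algorithmic: float, provenance_ok: bool) -> float:
    """Blend signals from Approaches 1-3 into one confidence value.

    Assumes 'reputation' and 'algorithmic' are normalized to [0, 1].
    The weights (0.4 / 0.6) and the 0.2 cap are illustrative only.
    A broken provenance chain caps the result, reflecting that
    end-to-end integrity is necessary but not sufficient for trust.
    """
    blended = 0.4 * reputation + 0.6 * algorithmic
    return blended if provenance_ok else min(blended, 0.2)
```

The design choice here mirrors the table's ordering: reputation and algorithmic analysis contribute gradations of confidence, while provenance acts as a gate that no amount of good reputation can override.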
Over time, the price points and incentives give way to a marketplace for spam and fraud: vendors and merchants of the Amazon platform buy positive or negative reviews on Facebook to influence online reputations. \[[ref](https://thehustle.co/amazon-fake-reviews), [ref](https://www.datasciencecentral.com/profiles/blogs/exposing-potential-fraud-in-amazon-reviews)\]

Current reputation systems work within silos and are not suitable for an open mobility system with an even broader variety of participants. Fundamentally, reputation is no easier to enforce and no harder to falsify than the authenticity of data. The integrity and authenticity of the data cannot be checked without high-level access to the identity registries of a centralized platform, even assuming the integrity of the contents of that ideally well-governed registry. Decentralized reputation systems and pseudonymizing token-curated registries, both enabled by blockchain, would be able to verify and create unique digital identities for all participants of an open system and aggregate reputation data across all the platforms where the data subject has consented to be correlated for reputational purposes. However, this so-called "web of trust" approach to federated/open reputational publishing is at an early stage, and its unique attack vectors have yet to be tested by real-world usage [ref](https://github.com/WebOfTrustInfo/rwot9-prague/blob/master/final-documents/reputation-interpretation.md). Until such systems reach maturity, these kinds of decentralized reputation scores can be taken as one data source among many for a probabilistic, hybrid scoring model.

### Approach #2: Limitations of Algorithmic Analysis on Output

Analysing the output data of an IoT device or an algorithm with ML algorithms was effective enough in the early days of AI, but it is starting to falter.
There are techniques to determine whether a given data set is fake or real:

| Output vector | Description | Image Processing Example |
| -------- | -------- | -------- |
| Object features | Analysis of selected features or places of an object where the algorithms creating the fake typically fail | Visible artifacts at the intersection of hair and body in a picture of a human |
| Format features | Analysis of content characteristics tied to a particular format | Fake images tend to have smoother textures |
| Neural monitoring, a.k.a. reflexive monitoring | Analysis of which neurons and layers of the network are activated in the identification/processing of real and fake images | Testing how other advanced algorithms respond to previously-sorted authentic and falsified images |

Static criteria for all three of these vectors of analysis can be provided manually, but since adversarial networks are trained on historic data, they quickly overcome any analysis that is not also adversarial and self-refining. This turns all three of these methods into three separate fronts in an "arms race" between self-refining algorithms, in which none of the above is a silver bullet so much as an additional attack vector. The reader should note the logical circularity in this exercise of categorization: all three amount to using machine learning to identify the byproducts of simpler or older machine learning, in an ongoing process that is never complete until falsifiers stop advancing their methods beyond currently-identified weaknesses. The examples are all drawn from today's arms race over GAN-generated images, because it is the most advanced and widely-publicized example of adversarial cat-and-mouse games over analysis algorithms. AI-powered malware and malware detection have begun to attract [some attention](https://www.dailydot.com/debug/ai-malware/) in recent years, but they remain a largely theoretical topic for the time being.
Much less \[peer-reviewed, or even published\] research has been done on falsifying driving-event data sets, or on identifying those falsifications. It can be assumed that the analytical approaches mentioned above will not be sufficient to distinguish real from fake driving-event data sets.

### Approach #3: End-to-End Data Provenance

In today's digital world, the phrase "end-to-end" has a very reassuring ring to it, as does "authenticity." Being able to trace a piece of data to its genesis -- to the exact device that first took a measurement or registered an event -- and every data event along that chain, is understandably a high bar of confidence. Being able to log, or even dereference and identify, every actor that performed a transformation or computation, all the way back to a strongly-identified origin, validates a complete data chain. However, the completeness and validity of a data chain is not the same as certainty about its results -- if anything, the final data is only as reliable and trustworthy as the weakest link in that chain. And in a dynamic data chain, or an unpredictable real-world context, *which* link was chosen at every stage in that chain is often the result of the kinds of analyses mentioned in Approaches 1 & 2. The net reliability is greater, but problems can be introduced anywhere along the chain (particularly in cross-silo, cross-reputation-system chains) that are not detected by the process of *validating* that chain.

Approach 3 can be applied fruitfully and effectively within a closed system, where all the actors can be traced and analyzed by a centralized coordinator. Take, for example, an end-to-end machine-learning provenance system that operates entirely within the biggest silo/perimeter of all, Amazon.
There, engineers were able to build a system for cataloguing and analyzing the whole process of ML training, including all its inputs and refinements, by isolating logical and physical operations -- because they had complete control and tracking of all the inputs and actors in the system [ref](http://learningsys.org/nips17/assets/papers/paper_13.pdf). The gains in speed and accuracy of training are noteworthy, but as systems like this grow, the approach gets more, not less, expensive to apply at scale. And in less monolithic environments than Amazon, with spontaneous inputs and unknown actors, it would be essentially unthinkable. As for extending the ML strategies from Approach #2 to data provenance across dynamic chains, this is even harder on a technical level, since origins and data trails that are not externally verifiable are fundamentally even easier to forge than the data itself. Thus, this output-analytical approach **cannot** judge the quality or reliability of data about the *life-cycle* of a vehicle, of *telematics* data, or of *driving-event* machine-learning labels, which involve too many factors not reflected in the data analyzed.

In summary, a dynamic chain requires not just end-to-end traceability, but a way to "fork off" from that trail, querying or tracking the unknown actors along the chain. A system that scores, judges, and queries these unknown actors is the most advanced and thorough kind of data chain -- not just comprehensive and authenticated, but rich with links out to additional identity, data and history.

### Approach #4: Global Provenance and Global Scoring

The richest data trails rely on a mechanism for pulling in that "outside data," which we call a **scoring model**. This model "scores," or assesses, the risk of using machine-learning data labels by assigning relative values of trustworthiness or validation to all unknown actors.
Since minimizing net risk is a major safety **requirement** for any distributed mobility system, the greatest accuracy possible is required in assessing the actors and agents in a system that are the least historically known or least predictable. Lacking recourse to a centralized global oracle for trust and accuracy, *the next best thing to a utopian mechanism for global oversight is a far more practicable mechanism for global interoperability of basic facts for reputation and auditing.*

In order to address these shortcomings, we propose anchoring as much data as possible to a decentralized identity meta-platform [ref](https://github.com/WebOfTrustInfo/rwot9-prague/blob/master/final-documents/CooperationBeatsAggregation.md) and using the PKI functions of those publicly-anchored identities to electronically sign each genesis and transformation event, turning the data chain into a *globally verifiable* and *reputation-enabled* data chain. Only a reputation system anchored to the most neutral and complete audit trail possible can score the trustworthiness of self-refining algorithms to an appropriate degree of trust.

![](https://i.imgur.com/ZYhWGeS.png)

A scoring algorithm can request life-cycle credentials from the verifiable digital twins of the identity subjects involved in a data processing chain. This scoring metric could thus reflect the overall, aggregated trustworthiness and accuracy of a machine-learning label. Consumers of these labels will presumably pay more for, or base their risk decisions on, a more trustworthy label.

## Solution Strategy: Global Provenance powered by Verifiable Data Chains

\[*Note: this paper applies the current W3C terminology around decentralized identifiers, verifiable credentials and verifiable data chains to the processing of driving-event data.
If this terminology is unfamiliar to you, reading Appendix A (an introduction to these) before continuing is recommended.*\]

The application we are proposing turns driving-event data into decentralized autonomic data ("DAD"). This also enables the *post facto* verification of any given data flow and an estimation of the trustworthiness of a machine-learning algorithm's output data. As the outputs of ML algorithms can be decisive inputs to several control, risk and business systems in mobility, it is important that any entity be able to evaluate the relative trustworthiness of any algorithm's inputs and outputs.

Our approach uses **cryptographic data structures** (aka credential chaining) to link data objects and to establish a method for data-flow provenance [1]. These need to be identified by externally-resolvable, "public" DIDs of significant age, as explained in Appendix A, to ensure they have built up a positive reputation without dodging a negative one. These public anchors for reputation are a requirement for adequate objectivity in any mature reputation system. They map well to the "trust anchors" (usually large institutions with public governance and uniform processes) that structure most public-private trust frameworks stabilizing and safeguarding decentralized data systems.

![](https://i.imgur.com/yshLD0u.png)

By "data-flow provenance," we mean a mechanism for tracing data points, and the history of control over them, through a processing system that registers every transformation of said data points. This includes flows with multiple sources, collective sensor fusion, and processing by machine-learning algorithms. Comprehensive data-flow provenance entails not just tracing custody of the data, but also verifying the end-to-end integrity of every data flow, including any transformations (additions, deletions, modifications, combinations, and ML processing).
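To make this concrete, here is a minimal sketch of such a hash-linked, signed data chain. The field names (`blockId`, `previousBlockHash`, `iss`) mirror the data-model fragment shown later, but the code is illustrative only: an HMAC stands in for the DID-bound ES256K signatures a real implementation would use, and all function names are assumptions:

```python
import hashlib
import hmac
import json
import uuid

def sha1_of(obj) -> str:
    """Canonical content hash of a block (SHA-1 here only to match the
    short hashes in the data-model fragment, not a security recommendation)."""
    return hashlib.sha1(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def make_block(payload: dict, key: bytes, issuer_did: str, prev=None) -> dict:
    """Wrap a data point in a signed envelope hash-linked to its predecessor."""
    block = {
        "blockId": str(uuid.uuid4()),
        "iss": issuer_did,
        "data": payload,
        "previousBlockId": prev["blockId"] if prev else None,
        "previousBlockHash": sha1_of(prev) if prev else None,
    }
    body = json.dumps(block, sort_keys=True).encode()
    block["signature"] = hmac.new(key, body, hashlib.sha256).hexdigest()
    return block

def verify_chain(blocks: list, key: bytes) -> bool:
    """Walk the chain from genesis to tip, checking each block's signature
    and its hash link to the predecessor."""
    prev = None
    for block in blocks:
        unsigned = {k: v for k, v in block.items() if k != "signature"}
        body = json.dumps(unsigned, sort_keys=True).encode()
        expected = hmac.new(key, body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, block["signature"]):
            return False  # envelope was altered after signing
        if prev is not None and block["previousBlockHash"] != sha1_of(prev):
            return False  # link to predecessor broken
        prev = block
    return True
```

A consumer holding the whole chain can then detect tampering at any stage: altering a single data point invalidates both that block's signature and the hash link of every later block.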
In the world of adversarial training, this can provide essential **timestamping** and version-control **auditing**, since most algorithms get less reliable over time as adversarial algorithms outstrip them in complexity. Knowing exactly *when* a label or algorithm did its work also tells you what version or stage of evolution it was at when that work was completed -- how useful or trustworthy its results are depends on what has happened since.

When a DID is bound intimately to a machine, which thus cryptographically signs all the data it emits, the provenance chain of the data flow can provide the foundation for verifiable claims and attestations about the data flow itself, as well as for reputation mechanisms. These novel *verifiable data chains* and reputation mechanisms allow trustless actors to assess trustworthiness, reliability or risk metrics of that machine. These judgments can be made directly from the data of that machine and/or indirectly, by consulting public/open registries and reputation systems that have been tracking it.

Without needing to leave the mobility sector, the applications for these kinds of verifiable data chains extend to any number of use cases: real-time vehicle valuation, dangerous-driving assessment, road and obstacle mapping, usage-based insurance (UBI), reliable feedback loops into Driver-Assistance System (DAS) and autonomous-driving infrastructures, cooperative mobility systems, and more generally vehicle communications (of both the vehicle-to-vehicle/"V2V" and vehicle-to-infrastructure/"V2I" varieties). Our verifiable data chain concept supports the overall goal of demonstrating working blockchain/DLT technology in real-time, data-driven use cases that can be scaled today to improve digital value chains.
### Verifiable data chains of driving events

A data "chain" is any cryptographic data structure that "chains" signed data objects together (with unidirectional or bidirectional "links" between objects), establishing a navigational method for extensive data-flow provenance and auditing. Data-flow provenance allows the verification of the end-to-end integrity of every data-flow object and its transformations (additions, deletions, modifications, combinations, and machine-learning processing).

![](https://i.imgur.com/9WbT6xI.png)

The processing of driving-event data is already operational in multiple disciplines within the mobility sector. Driving-event data processing can include multiple data sources, parties, algorithms and processing steps. A human or non-human end-user of driving-event data chains needs to be able to validate the trustworthiness and accuracy of the chain's output data. This requirement achieves critical importance when the output data is used in safety- or security-relevant use cases, or to make economic decisions with significant commercial consequences. Indeed, any algorithmic economic decision-making risks significant consequences at scale.

### Minimal Data Model for Maximal Interoperability

Establishing a verifiable data chain requires all the linked data points and metadata to be **signed** by known/knowable identities. In the DID/VC system, these identities are represented by private/public keypairs, which cryptographically sign "envelopes" containing data points. (Today, these can be VPs, VCs, or, in a more trusted, controlled environment, bare JWEs, but in time more hybrid solutions will probably arise. The W3C is working with IETF and IANA to standardize data encodings extending this model into other kinds of data networks beyond TCP/IP.) The following code fragment illustrates how a DID/VC schema can be wrapped around an ML payload as verification-enabling, chain-linking metadata.
It contains a machine-learning label ("red traffic light"), information about the algorithm that created the label (Algorithm 1), a link to the previous data-chain block (previous block ID), and the signatures and cryptographic traces needed to verify and audit the metadata.

<pre><code>HEADER: TOKEN TYPE & SIGNATURE ALGORITHM
{
  "typ": "JWT",
  "alg": "ES256K-R"
}
</code></pre>

<pre><code>PAYLOAD: DATA
{
  "iat": 1546724123,
  "exp": 1546810523,
  "signer": {
    "type": "algorithm",
    "name": "Algorithm 1"
  },
  "data": {
    "claim": {
      "predictionLabel": "red traffic light, red traffic signal, stoplight",
      "predictionProb": "0.983483",
      "did": "did:ethr:0xe405b9ecb83582e4edc546ba27867ee6f46a940d"
    },
    "previousBlockId": "b86d95d0-1131-11e9-982e-51c29ca1f26e",
    "previousBlockHash": "307b817de9b7175db0ded0ea9576027efd64fb21"
  },
  "iss": "did:ethr:0x5ed65343eda1c46566dff6774132830b2b821b35"
}
</code></pre>

The data chain object can be verified by validating the signature of the payload and its history. Cryptographic data chains enable users to validate the provenance of entire driving-event data-processing chains, including the authenticity and integrity of the input data, the output data, and the provenance of sensing devices and processing algorithms. They also allow blacklists or watchlists to be updated in the case of bad actors or recalled hardware.

## Solution Architecture: DID-rooted data processing chains

To jump rapidly from the data model to the bird's-eye view: tightly binding decentralized identity to the individual data point, while counterintuitively inefficient at that scale, is actually the most elegant and simple way to make the data truly portable, and it makes reputation assessments silo-proof. We will now outline how this works at scale.
To establish a more secure, traceable, and privacy-preserving foundation for low-trust and no-trust machine-to-machine communications, we recommend:

- establishing verifiable data chains for all driving-event data processing, including assessment of its inputs and outputs,
- providing a DID for every entity and data set to harmonize and streamline this verification infrastructure across all contexts, and
- anchoring the data chains to cooperatively-maintained DID registries (such as validated lists of OEM issuance DIDs, the decentralized equivalent of traditional TCP/IP-based Certificate Authorities). These can streamline rapid and high-trust verification of hardware and software, without the risk and overhead of internet connections.

The verifiable digital twins of all these physical machines, data-streaming sensors, and version-controlled algorithms are, in such a system, discoverable and addressable via APIs listed in their DIDs. The more natively and deeply interoperable these DIDs are, the less complexity and error is introduced by this resolution process; they do not all need to be contained in a unitary registry and addressed by the same DID method, but that would be the simplest implementation. With such addressability, granular access control can be powerfully implemented. Any appropriately-privileged or -vetted counterparty can then query the digital twin as needed for information about all the organizations, sensors, telematics devices, data sets, external data sources, software algorithms and users involved in the data chain.

![](https://i.imgur.com/drw8siJ.png)

This approach would be of particular value in situations where validation or benchmarking data (or even factory-recall information, or firmware updates) were available about the sensing devices, vehicles and algorithms implicated in the processing of driving-event data.
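As a hedged sketch of the registry-anchoring recommendation above: before trusting a data-chain envelope, a verifier can check the issuer DID against a cooperatively-maintained registry of OEM issuance DIDs. The registry contents, field names, and helper below are hypothetical; in production, the registry would be a governed, on-ledger list rather than a local dictionary:

```python
# Hypothetical registry of OEM issuance DIDs -- the decentralized
# analogue of a Certificate Authority list. Entries and field names
# are illustrative assumptions, not a published schema.
OEM_ISSUER_REGISTRY = {
    "did:ethr:0x5ed65343eda1c46566dff6774132830b2b821b35": {
        "organization": "Example OEM",
        "status": "active",
    },
    "did:ethr:0x00000000000000000000000000000000deadbeef": {
        "organization": "Recalled Device Vendor",
        "status": "revoked",
    },
}

def issuer_is_trusted(issuer_did: str) -> bool:
    """Accept a data-chain envelope only if its 'iss' DID is both listed
    in the registry and still active (not revoked or recalled)."""
    entry = OEM_ISSUER_REGISTRY.get(issuer_did)
    return entry is not None and entry["status"] == "active"
```

Because the registry is keyed by DID, the same lookup also supports the blacklist/watchlist updates mentioned earlier: revoking an entry immediately invalidates future envelopes from that issuer without touching the chains themselves.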
In combination with a reputation or validation system, any user could calculate trustworthiness and accuracy metrics about the output data, derived in part from the trustworthiness and accuracy of its inputs, and the inputs of those inputs. Provided that the economic incentives for participation are carefully monitored in their implementation, decentralized reputation methods can be integrated for the scoring of both individual digital twins and entire data chains. Further theoretical and standardization work on data-chain trustworthiness and accuracy metrics remains to be done to model and accelerate the development of such systems at scale. But the net gains in efficiency and enforceable safety standards seem safe to assume, validating the core concepts.

### Dangerous driving event data chain for automotive use cases

Dangerous driving events can be divided into two groups: (1) the interaction between a driver's vehicle and the road environment, and (2) the interaction between a driver's vehicle and nearby vehicles [5]. Diverse methods for enhancing driving safety have been proposed. Such methods can be roughly classified as passive or active. Passive methods (e.g., seat-belts, airbags, and anti-lock braking systems), which have significantly reduced traffic fatalities, were originally introduced to diminish the degree of injury from an accident. By contrast, active methods are designed to prevent accidents from occurring. Driver assistance systems (DAS) are designed to alert the driver -- or an autonomous driving module -- as quickly as possible to a potentially dangerous situation. The two classes of driving events may occur simultaneously and lead to serious traffic situations. The automotive industry is working on active methods and systems, including machine-learning algorithms, to analyze these two kinds of events and identify *dangerous situations* from data collected by various sensors and from external sources.
The machine-learning output labels about dangerous curves, road obstacles or poor vehicle conditions are fed into control, transaction and risk systems. In distributed mobility systems, the trustworthiness and accuracy of the output labels must be independently verifiable.

> Key question: How can I trust vehicle identity data, third-party data and machine learning labels that are created and processed along a distributed mobility value chain?

To achieve trustworthiness of output labels, we propose to integrate the historic driving-event data from the DID-anchored verifiable data chains described above with a recurrent neural network (RNN) machine-learning algorithm that builds a verifiable driving solution. This verifiability makes the solution agile, in that it can refine or even simplify itself over time as new options come into its own data supply chain, rather than being manually upgraded in a top-down way. This is one of many extensions that could be built once the foundation is laid for this global-data architecture:

1. End-to-end integration of remote sensing (telematics) data could be tightly integrated with RNN machine-learning algorithms through an interoperable data model
2. Cryptographically secured and blockchain-enabled data chains move data out of silos for dispute resolution by key exchange, while still enabling strong cyber-physical binding to physical assets
3. Reputation systems, and the scoring mechanisms analysing them, could be made adequately objective if contributed to (even if unevenly) by all market players
4. An interoperable decentralized identity and verifiable digital-twinning protocol interacts with other value chains more reliably

### RNN Dangerous Driving Algorithm Input

BMW ...
### Data Provenance with Scoring

Data provenance about the entities involved in a data processing chain and the resulting machine-learning labels (using DIDs, VCs and DLT to ensure uniform metadata) provides the foundation for sophisticated forms of risk scoring, including the kind of actuarial primitives needed for what the insurance industry calls "Insure AI". Harmonized data (and, more importantly, metadata) is the key to AI objectivity, whether managed in traditional top-down ways, by new forms of reputation, or by new forms of actuarial accounting and trustworthiness ratings. [ref](https://www.brookings.edu/research/how-insurance-can-mitigate-ai-risks/)

The verifiable credentials about the identity subjects -- such as the vehicle, pre-processing and ML algorithms -- can be processed in a scoring model to further improve the quality of risk data about a given driving-event machine-learning label.

![](https://i.imgur.com/HW4t5YE.png)

Assessing the provenance of vehicle data based on the vehicle's credential is prior art. Gathering, analyzing and scoring provenance information by extracting, analyzing and benchmarking metadata and common artifacts about a given machine-learning (ML) configuration (datasets, models, predictions, evaluations and training runs) is a developing field that increases the reliability and security of machine-learning algorithms [ref](http://learningsys.org/nips17/assets/papers/paper_13.pdf).

### Reference Implementation

BMW and Spherity implemented the verifiable data chain for a supervised-learning scenario, with an RNN algorithm for dangerous driving and cloud infrastructure, by integrating historic dangerous-driving event data sets that were used to train an RNN model and to simulate vehicle telematic data streams. In a second iteration of the project, a fleet of real cars and their live, real-world data will be integrated with this validated data-chain infrastructure.
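One hedged sketch of such a scoring model: aggregate the per-entity trust scores carried by the life-cycle credentials into a single score for the resulting label. The `trustScore` field and the weakest-link weighting below are assumptions for illustration, not part of the reference implementation:

```python
def score_label(credentials: list) -> float:
    """Aggregate life-cycle credential scores for every identity subject
    in a processing chain (vehicle, sensors, algorithms) into one risk
    score for the resulting ML label.

    Assumes each credential carries a per-entity 'trustScore' in [0, 1].
    The blend reflects the weakest-link intuition from the provenance
    discussion: a chain is dominated by its least trustworthy entity,
    softened by the average of the rest. Weights are illustrative.
    """
    if not credentials:
        return 0.0  # no provenance information: no basis for trust
    scores = [c["trustScore"] for c in credentials]
    return 0.7 * min(scores) + 0.3 * (sum(scores) / len(scores))
```

A consumer could then set a policy threshold on this score, e.g. accepting a dangerous-driving label for a control system only above 0.8, while tolerating lower scores for purely statistical aggregation.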
The cryptographic data structure enabled us:

- to prove the integrity of the data chain,
- to identify all the entities involved in the creation of a specific machine-learning label, and
- to request life-cycle credentials of these entities in order to feed a scoring model for the respective machine-learning label.

![](https://i.imgur.com/gLfdfuU.png)

## Future Business Models

We can only speculate about what new markets and forms of business could arise once this kind of provenance infrastructure is in production. Here are a few hypotheses:

1) **Everyone (asset OEMs, owners, and operators) will sell data, data access, and data verification**. Or, to put it another way, it would be logical to share the revenues from the sale of high-quality, end-to-end verifiable data among all the parties verified. This would incentivize ongoing cooperation, harmonization, and maintenance of shared infrastructure.
2) **Certification companies** on the model of today's hardware- and software-certifying bodies (like TÜV and Underwriters Laboratories) will likely develop new offerings to certify ML products. It seems reasonable that such certifications would validate the integrity of a processing chain and the provenance of its data. In combination with their more traditional certificates for enterprises, machines, algorithms and infrastructure, companies like TÜV could even operate their own permissioned, secure verifiable data chain infrastructures and sell access to OEM customers, small players, regulators, and arbitrators on different terms.
3) **Algorithm developers & operators** will likely compete on how smoothly they integrate with this certification/auditing ecosystem. A market premium (or access to the more desirable data marketplaces) will likely be contingent on the verifiable origin and quality of their products.
This could be combined with finance and insurance products, such as:

4) **Traditional credit scoring**, such as the kind used to rate individual creditworthiness or bond ratings, could factor in risk-assessment ratings of products sold or used by enterprises, or the life-cycle certificates of their products.

5) **Insure AI** will swallow the software swallowing the world. Insurance pools, products, and derivatives associated with the risk of a given data processing chain can insure the risks, accuracy, and/or predictability of an ML label for consumption by a third party. This could revolutionize conventional insurance, or it could create new ecosystems of interdependent or competing middlemen in a more "decentralized" form of risk pooling; this is particularly hard to predict. [ref](https://www.forbes.com/sites/forbestechcouncil/2019/05/22/four-reasons-data-provenance-is-vital-for-analytics-and-ai/#6be6420e57d6)

Overall, the costs incurred by poor data quality and data manipulation can be reduced significantly, and the economic opportunities outlined above will rapidly become feasible as the minimum and average level of data quality in marketplaces rises. The kinds of discovery and forensic audits required by routine regulatory compliance, criminal investigations, and dispute resolution could be executed much more efficiently once entire data processing pipelines become verifiable to any auditor with the right consents or credentials. This also fosters innovation and business-process agility, as individual actors (even non-human ones!) can better assess the risks of relying on data sets, data sources, and algorithms dynamically or spontaneously.

## Outlook

We look forward to field-testing our solution in a complete mobility ecosystem.
## References

[1] [A DID for Everything - Rebooting Web of Trust Working Draft](https://github.com/WebOfTrustInfo/rwot7/blob/master/draft-documents/A_DID_for_everything.md)
[2] [W3C DID Specification](https://w3c-ccg.github.io/did-spec/)
[3] [W3C Verifiable Claims Working Group](https://www.w3.org/2017/vc/WG/)
[4] [Decentralized Identity Foundation](https://identity.foundation)
[5] [Dangerous Driving Event Analysis System by a Cascaded Fuzzy Reasoning Petri Net](https://www.researchgate.net/publication/224650669_Dangerous_Driving_Event_Analysis_System_by_a_Cascaded_Fuzzy_Reasoning_Petri_Net)
[6] [Detecting Fake Content: One Of The Biggest Challenges For 2020](https://www.forbes.com/sites/forbestechcouncil/2020/01/02/detecting-fake-content-one-of-the-biggest-challenges-for-2020/#476378021219)
[7] [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661)
[8] [Decentralized Identity as a Meta-platform: How Cooperation Beats Aggregation](https://nbviewer.jupyter.org/github/WebOfTrustInfo/rwot9-prague/blob/master/final-documents/CooperationBeatsAggregation.pdf)

## Appendix A: An Overview of Underlying Technologies (W3C Standards)

#### Decentralized identifiers (DIDs)

The number of possible connections between any given set of entities in a mobility system grows combinatorially into an impossibly large number. Yet in today's user journeys and business environments, agents (whether human, machine, or software) increasingly need to communicate, access or transact with a diverse group of these interconnected entities to achieve their goals. This requires an interoperable and ubiquitous method to address, verify and connect these elements. We propose to adopt the open decentralized identifier (DID) standard as an interoperable addressing standard and to establish mechanisms to resolve DIDs across multiple centralized or decentralized mobility systems [2]. DIDs are the *atomic units* of a new layer of decentralized identity infrastructure.
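Structurally, a DID is a simple three-part URI, which makes addressing uniform across entity types. A minimal parser sketch (illustrative only; real resolvers apply the full ABNF from the W3C specification):

```python
def parse_did(did: str) -> tuple:
    # A DID has the shape did:<method>:<method-specific-id>;
    # the id part may itself contain further colons (e.g. did:web).
    scheme, method, idstring = did.split(":", 2)
    if scheme != "did" or not method or not idstring:
        raise ValueError(f"not a valid DID: {did!r}")
    return method, idstring

# The ethr DID used in our development work resolves to its method and id:
assert parse_did("did:ethr:0x5ed65343eda1c46566dff6774132830b2b821b35") == (
    "ethr", "0x5ed65343eda1c46566dff6774132830b2b821b35")
```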
DIDs were designed to function as identifiers for people, but the architecture can readily be extended to any entity. We use DIDs to help identify and manage data sets, objects, machines or software agents through their digital twins, securely tracking locations, events, and pure data objects as they relate to the entity and/or its digital twin.

DIDs are derived from public/private key pairs and registered in an immutable registry for discovery purposes. We use innovative cryptographic solutions for secure key management, such as fragmenting the private key of a DID so that it never exists in its entirety in any one place. This key management solution is very effective for securely signing transactions and vouching for the authenticity of metadata, for example for smartphones, algorithms or data sets. An integration of the key management technology into embedded devices is on our technology roadmap.

A DID has the following required syntax:

`did:method:idstring`

We are registering DIDs on the Ethereum blockchain via the standard W3C DID method *ethr* for our development work. In this method, the address of any valid Ethereum keypair is used as the identifier string. Thus, our DIDs look like this:

`did:ethr:0x5ed65343eda1c46566dff6774132830b2b821b35`

As our technology stack is blockchain-agnostic, any other DID method based on an alternative blockchain can be integrated and used. Rotation is part of compliance with the W3C standards, which includes "rotating" a DID record to point to a record on another blockchain, under a new method and a new native address there.

#### Verifiable credentials

DIDs are only the base layer of decentralized identity infrastructure. The next layer up (where most of the value is unlocked) consists of highly portable documents called "verifiable credentials" ("VCs") [3,4].
This is the technical term for a digitally signed electronic data structure that conforms to the interoperability standards currently being refined as "recommendations" in the W3C Verifiable Credentials Working Group. Verifiable credentials can either be self-issued by an entity such as a machine, to provide a proof of the authenticity and integrity of data, or be issued by an "issuer", usually an institution like an original equipment manufacturer (OEM), a government, an auditor or certification authority like TÜV in Germany, a service provider, or a bank. Since issuers tend to be institutions with publicly verifiable reputations, their digital signatures are relatively easy to find, verify, and trust; this trust trickles down to the entities about which they issue credentials, which they allow to be widely, publicly, and automatically verified.

In mobility systems, any entity might want (or need) to transact with any other entity. This means entities need ways to engage with each other dynamically, ideally even on demand. It is already growing impracticable to pre-define which entities are permitted to interact with which others, making traditional security perimeters and methods impracticable in the process. To ensure efficient transactions, any new entity in a mobility value chain must be able to independently verify counterparties which may be completely unknown to local or familiar authorities and reputation registries. To achieve this kind of dynamic communication, the only options are extreme centralization (one entity controlling reputation and communications across all organizations and platforms globally) or extreme decentralization (deliberately anti-monopolistic protocols serving as a meta-platform, structured by open standards).
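To make the issue/verify cycle concrete, here is a toy credential sketch. The field names loosely follow the W3C data model, but the HMAC is only a dependency-free stand-in for a real public-key signature suite, and the issuer DIDs and claims are invented for illustration:

```python
import hashlib
import hmac
import json

SECRET = b"issuer-demo-key"  # stand-in for the issuer's private key

def issue(issuer: str, subject: str, claims: dict) -> dict:
    # Sign a canonical serialization of the credential body.
    body = {"issuer": issuer, "credentialSubject": {"id": subject, **claims}}
    proof = hmac.new(SECRET, json.dumps(body, sort_keys=True).encode(),
                     hashlib.sha256).hexdigest()
    return {**body, "proof": proof}

def verify(vc: dict) -> bool:
    # Recompute the proof over everything except the proof itself.
    body = {k: v for k, v in vc.items() if k != "proof"}
    expected = hmac.new(SECRET, json.dumps(body, sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, vc["proof"])

vc = issue("did:ethr:0xoem", "did:ethr:0xvehicle",
           {"calibrationReport": "passed"})
assert verify(vc)
vc["credentialSubject"]["calibrationReport"] = "failed"  # tampered claim
assert not verify(vc)
```

With a real signature suite, any third party holding the issuer's public key (resolved via its DID) can run the verification, without contacting the issuer.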
We are part of a broader, trans-industry movement to use the DID/VC standards and protocols as such a decentralizing meta-platform; anchoring identities and verifiable credentials on a publicly legible, consensual distributed ledger moves the *cryptographic root of trust* from central systems into a decentralized, interoperable infrastructure.

### Digital twins that are verifiable

A digital twin is a digital representation of a biological entity (human, living organism, organization), a physical entity (objects, machines), a digital entity (digital asset, software agent) or any system formed of any combination of individual entities. Digital twins can represent objects and entities as varied as IoT sensors, ECUs, spare parts, vehicles, traffic lights, access gates, human users, or a city, and anything in between. More recently, they have started to be used to represent intangible entities like services, code, data, processes and knowledge. Digital twin data can consist of any life-cycle attributes, metadata, readings from external sensors, tamper-proofed telematics outputs, or even compute data and other traffic metrics.

A verifiable digital twin is a digital twin whose attributes are represented by verifiable credentials. These attributes, such as a birth certificate, an authentication proof, a calibration report or sensor data attestations, can be independently verified by any third party. This type of digital twin provides verifiable data about its creation, life-cycle, sensor readings, actuator commands or transactions. These verifiable data can be used for audit trails, for high-stakes decision-making, and for feedback loops in (autonomous) control systems.
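One way to model a verifiable digital twin in code (an illustrative sketch, not our implementation; the credential shape, DIDs and allow-list verifier are assumptions): the twin admits an attribute only when it arrives inside a credential that passes an externally supplied verifier, and every accepted credential lands on an audit trail.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VerifiableTwin:
    # Digital twin whose attributes are admitted only via verified credentials.
    did: str
    verify: Callable[[dict], bool]            # pluggable credential verifier
    attributes: dict = field(default_factory=dict)
    audit_trail: list = field(default_factory=list)

    def ingest(self, credential: dict) -> bool:
        if not self.verify(credential):
            return False                      # reject unverifiable claims
        self.attributes.update(credential["claims"])
        self.audit_trail.append(credential)   # every accepted claim is traceable
        return True

# Demo verifier: accept only credentials from a known issuer allow-list.
trusted = {"did:ethr:0xoem", "did:ethr:0xtuev"}
twin = VerifiableTwin("did:ethr:0xvehicle",
                      verify=lambda c: c["issuer"] in trusted)

assert twin.ingest({"issuer": "did:ethr:0xoem",
                    "claims": {"birthCertificate": "2020-03-12"}})
assert not twin.ingest({"issuer": "did:ethr:0xmallory",
                        "claims": {"safetyRating": "A+"}})
assert twin.attributes == {"birthCertificate": "2020-03-12"}
```

In production the verifier would check cryptographic proofs against the issuer's DID document rather than a static allow-list, but the separation of twin state from credential verification is the design point.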
## Appendix B: Taxonomy for DAD Provenance

In dangerous driving event processing, the following entity classes have to be accounted for:

| # | Identity Subject | Provenance Credentials |
| -------- | -------- | -------- |
| 1 | Vehicle Identity | Vehicle life-cycle, vehicle configuration, historic events |
| 2 | Telematics data sets | Integrity and authenticity of the telematics data generated by a given vehicle |
| 3 | Intermediaries | Integrity and authenticity of the telematics data distributed by intermediaries |
| 4 | Pre-processing Algorithm(s) | Configuration, accuracy and historic performance of a given pre-processing algorithm |
| 5 | Machine Learning Algorithm(s) | Benchmarking, accuracy and historic performance of an ML algorithm and its respective training data |

A verifiable data chain would allow us to assess the integrity and transparency of driving event data processing when multiple third parties are involved, which will commonly be the case in future mobility systems. It also yields reputational information about the identity subjects that is relevant to the trust in, and payment of, third parties. A scoring algorithm can request life-cycle credentials from the verifiable digital twins of the identity subjects; the resulting scoring metric could thus reflect the overall, aggregated trustworthiness and accuracy of a machine learning label. Consumers of these labels will presumably pay more for a more trustworthy label, as labels can be expected to compete on trust in such a market.

#### Design principles

For our digital twin and data chain integration work, we are applying the following design principles:

| Principle | Description |
|--------|--------|
| **From VINs to value chains** | Abstracting the *concept of identity* to a mobility system of vehicles, IoT, road & travel infrastructure, ML agents, driving-event data sets, autonomous driving/DAS feedback loops, markets and humans. Exchanging data among these traceable entities. E2E data provenance along a value chain. |
| **Blockchain-agnostic** | Use of blockchain for anchoring attestations or verifiable claims. The decision on which blockchain to anchor claims is based on user preferences or economic metrics such as transaction costs. Use of fiat-backed stable coins for micropayments. |
| **Scalable integration** | Integrated technology stack consisting of off-chain data structures, serverless cloud infrastructure, high-performance IoT data stream queues, secure key management, DIDs, ML agents, sensor data, data chain fusion and blockchain connectors. |
| **Responsibility segregation** | Implementation of this common pattern for microservices design, which supports scalability and maintainability of the infrastructure. |
| **Global standards** | Use of existing W3C, Industry 4.0 and automotive data standards and semantic models to ensure adaptability and portability of our solution. |
| **Business value** | Focusing on simple data integrity and authenticity problems within existing value chains. Retrofitting existing infrastructures to scale adoption. |