---
robots: noindex, nofollow
tags: WIP, bc, concept
---

# Structured Data & Graph Models

## Classic data structures

Software engineers are taught very early in their careers a number of classic types of structured data, in particular arrays, lists, sets, and dictionaries. Each of these data structures is used to organize and store data in a way that is efficient and easy to access.

* Arrays are collections of data that are stored in a contiguous block of memory and can be accessed using an index.
* Lists are similar to arrays, but the elements in a list are not required to be stored in contiguous memory locations.
* Sets are collections of data in which each element is unique.
* Dictionaries are data structures that store data in key-value pairs.

Each of these data structures has its own advantages and disadvantages, and software engineers choose the appropriate data structure based on the needs of the specific application they are working on.

## Graph Data Structures

However, beyond these classics, there is a family of structured data that leverages graphs. Unfortunately, until recently, these graph data structures were rarely taught outside of graduate-level programs, and they are still often left out of accelerated software engineering programs, so their advantages and limitations are less well known. What graph-based structures add is the ability to represent relationships between different pieces of data.

At the bottom of graph structures are these basic types (sketched in code after this list):

* A directed graph is a type of graph in which the edges have a direction and can only be traversed in that direction.
* An undirected graph is a type of graph in which the edges do not have a direction and can be traversed in either direction.
* A weighted graph is a type of graph in which each edge has a weight or cost associated with it.
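To make these concrete, here is a minimal sketch in Python of how each basic graph type can be represented with nothing more than adjacency structures. The node names and the `neighbors` helper are my own invention for illustration, not from any particular library.

```python
# Directed graph: an edge A -> B does not imply B -> A.
directed = {
    "A": ["B"],   # A -> B
    "B": ["C"],   # B -> C
    "C": [],
}

# Undirected graph: each edge is recorded in both directions.
undirected = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
}

# Weighted graph: each edge carries a cost, here as (neighbor, weight) pairs.
weighted = {
    "A": [("B", 5)],
    "B": [("C", 2)],
    "C": [],
}

def neighbors(graph, node):
    """Return the nodes (or weighted edges) reachable in one hop from `node`."""
    return graph.get(node, [])

print(neighbors(directed, "A"))    # ['B']
print(neighbors(undirected, "B"))  # ['A', 'C']
print(neighbors(weighted, "A"))    # [('B', 5)]
```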
## Graph Models

In turn, these different graph structures are constructed into graph models. These include entity-relationship diagrams, property graphs, and semantic graphs:

* An entity-relationship diagram is a type of graph model that is used to represent the relationships between different entities in a database. In this model, entities are represented as nodes, and the relationships between entities are represented as edges.
* Property graphs are a type of graph model that is used to represent data that has properties associated with it. In this model, the nodes represent entities, and the edges represent relationships between entities. Each node and edge can have multiple properties associated with it, which can be used to provide additional information about the data.
* Semantic graphs are a type of graph model that is used to represent data in a way that is understandable by machines. In this model, the nodes represent entities, and the edges represent relationships between entities. The relationships between entities are defined using a formal language, which allows machines to understand the meaning of the data.

Software engineers can leverage these graph models to represent structured data in more flexible and expressive ways than are possible with classic structured data.

## Graph Model Architectures

At a yet higher level, each of these graph models has support in various software architectures:

* LPG (Labeled Property Graph): This is a type of *property graph* in which both the nodes and the edges have labels associated with them. These labels can be used to provide additional information about the data and the relationships between the data. Labeled property graphs are a useful tool for representing structured data because they allow software engineers to include more information about the data and its relationships than is possible with other types of graph models.
* RDF (Resource Description Framework): This is a type of *semantic graph*. In RDF, data is represented using triples, which consist of a subject, a predicate, and an object. These triples are written in a formal language that allows machines to understand the meaning of the data and the relationships between the data. (When a triple is extended with the identifier of the named graph it belongs to, the result is a quad.)
* JSON-LD (JavaScript Object Notation for Linked Data): This is a variant of RDF (and thus a *semantic graph*) that focuses on representing data in a way that is both human- and machine-readable. It is based on the JSON format, which is a popular data interchange format, and uses the Linked Data principles to add additional semantic information to the data. (A small example follows this list.)
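As a concrete illustration, here is a minimal JSON-LD document, built in Python for consistency with the other sketches on this page. The person and identifier URLs are invented, and I assume the schema.org vocabulary for the `@context`:

```python
import json

# A minimal, hypothetical JSON-LD document. The @context maps short
# names like "name" and "knows" to full schema.org URIs; @id gives the
# entity a global name; @type says what kind of thing it is.
doc = {
    "@context": "https://schema.org",
    "@id": "https://example.com/people/alice",
    "@type": "Person",
    "name": "Alice Example",
    "knows": {"@id": "https://example.com/people/bob"},
}

print(json.dumps(doc, indent=2))
```

Note that this is plain JSON to a system that knows nothing about Linked Data, yet it carries enough semantic information for an RDF-aware system to expand it into triples.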
## The Linked Data Principles

In particular, RDF & JSON-LD use the Linked Data principles, a set of best practices for sharing structured data on the web. The Linked Data principles were first proposed by Tim Berners-Lee, the inventor of the World Wide Web, in 2006. They are as follows:

* Use URIs (Uniform Resource Identifiers) as names for things.
* Use HTTP URIs so that people can look up those names.
* When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
* Include links to other URIs, so that they can discover more things.

The Linked Data principles are designed to make it easier to share and integrate structured data on the web, and to enable the creation of a global, interconnected web of data. By following these principles, publishers of structured data can make their data more easily accessible and discoverable, and users of structured data can more easily access and integrate data from multiple sources.

One challenge with the Linked Data principles is that they rely on the use of URIs as names for things. URIs can be complex and difficult to understand for some users, which can make the principles hard to apply in practice. Another challenge is that they rely on standardized formats such as RDF and SPARQL; while these formats are well suited to representing structured data, they can also be complex and difficult to use.

Additionally, the Linked Data principles alone do not provide any guidance on how to manage the quality or reliability of the data that is published using them. This can make it difficult to ensure the accuracy and integrity of the data, which in turn can make it challenging to use the data in a trustworthy and reliable manner.

Overall, while the Linked Data principles are a useful framework for publishing and using structured data on the web, they have some challenges and weaknesses that need to be considered when implementing them in practice. (A small sketch of data published according to these principles follows.)
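Here is a minimal sketch of what data published according to these principles can look like, using rdflib (one common Python RDF library). The example.com URIs are placeholders rather than real, dereferenceable names:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF

# Hypothetical HTTP URIs, per the first two principles: they name
# things *and* could be looked up in a browser.
alice = URIRef("https://example.com/people/alice")
bob = URIRef("https://example.com/people/bob")

g = Graph()
g.add((alice, FOAF.name, Literal("Alice Example")))
g.add((alice, FOAF.knows, bob))  # a link to another URI (fourth principle)

# Serialize as Turtle, one of the standard RDF formats (third principle).
print(g.serialize(format="turtle"))
```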
### Differences between LPG and RDF/JSON-LD

Both LPG and RDF/JSON-LD represent data in graph-like structures. Both are useful for representing structured data, but they have some key differences.

One is the way in which the relationships between entities are represented. In an LPG, the edges between nodes represent relationships between entities, and these relationships can be labeled with additional information. In RDF, the relationships between entities are represented using triples. One advantage of LPGs is that they are generally easier to understand and work with than RDF: the relationships between entities are represented using edges, which are intuitive and easy to understand. In contrast, the triples in RDF can be more difficult to understand and work with, especially for those who are not familiar with the formal language used to represent the relationships.

Another difference between LPG and RDF/JSON-LD is the way in which the data is stored and accessed in a persistent database. In an LPG, the data is stored in a way that is optimized for storage and querying in a SQL-like format (e.g. Neo4j's Cypher query language). In RDF, the data is typically stored in a triple store, which is a database that stores triples and allows them to be queried using the more complex SPARQL query language.

Another difference is that RDF/JSON-LD is a standardized format for representing data, whereas LPG is not standardized. This means that RDF data can be easily shared and integrated with other systems, whereas labeled property graphs may require more work to integrate with other systems. Additionally, RDF/JSON-LD has a rich set of tools and frameworks for working with the data, whereas the tools and frameworks for working with labeled property graphs may be more limited.

RDF/JSON-LD is powerful for representing complex data and relationships, but it can be complex and difficult to work with, especially for those who are not familiar with the formal language used to represent the data.

## What are the advantages of JSON-LD over RDF?

JSON-LD is more compact and easier to read than RDF, which can make it easier to work with large datasets. JSON-LD also allows data to be expressed using multiple vocabularies, which can make it more flexible and expressive than the original RDF format.

One potential limitation of JSON-LD is that it may not be as expressive as some other formats (such as the emerging RDF-star), and it still requires some understanding of RDF's formal language for describing semantic data. JSON-LD also requires more effort to integrate with systems that only support plain JSON, as the data conversion is not lossless, in particular for cryptographic authentication.

## The Open World Model

The "open world" model (or "open world" assumption) is a philosophical approach to knowledge representation that is used in some graph data models, including RDF (Resource Description Framework) and OWL (Web Ontology Language). The "open world" model is based on the idea that the information in a graph database is incomplete and may change over time, and that new information can be added to the database at any time. In contrast, the "closed world" model assumes that the information in a database is complete, and that any information not present in the database is assumed to be false.

One of the main advantages of the "open world" model is that it allows for more flexible and expressive querying of the data. This can be particularly useful in applications that need to handle large and complex datasets, where it may not be possible to represent all of the information in the database upfront. Additionally, the "open world" model allows the data to evolve over time, which can be useful in applications that need to integrate data from multiple sources.

One of the main limitations of the "open world" model is that it can be more difficult to work with than the "closed world" model. This is because the "open world" model requires queries to be formulated in such a way that they can handle incomplete or uncertain information, which can be more complex and difficult to do. Additionally, the "open world" model does not provide a way to make definitive statements about the data, which can be a disadvantage in applications that need to make decisions based on the data. (The sketch below illustrates the distinction.)

Another weakness of the "open world" model is that it can lead to inconsistencies in the data, as new information can be added to the database at any time. This can make it difficult to maintain the integrity and consistency of the data, and can require additional checks and validation to ensure that the data remains consistent.
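A small sketch of the distinction, again using rdflib; the people and URIs are invented:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF

# Hypothetical URIs for illustration.
alice = URIRef("https://example.com/people/alice")
bob = URIRef("https://example.com/people/bob")
carol = URIRef("https://example.com/people/carol")

g = Graph()
g.add((alice, FOAF.knows, bob))

# The triple "alice knows carol" is absent from the graph.
print((alice, FOAF.knows, carol) in g)  # False

# Closed-world reading: "alice does not know carol."
# Open-world reading:   "we do not *know* whether alice knows carol."
# A later g.add() could make the statement true without contradicting
# anything already in the graph.
```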
### The Open World Model and Authentication

The "open world" model is challenging to authenticate cryptographically because it allows for the addition of new information to the database at any time. In a traditional cryptographic authentication system, the authenticity of a piece of information is determined by comparing it to a known reference, such as a digital signature or a cryptographic hash. However, in the "open world" model, the reference may change over time as new information is added to the database. This means that it is difficult to determine the authenticity of a piece of information with certainty, as new information may be added to the database that invalidates the reference.

One way to address this challenge is to use a dynamic reference, such as a Decentralized Identifier (DID), that can be updated whenever new information is added to the database. This allows the authenticity of the information to be verified in real time, as the reference is always up to date. However, this approach can be resource-intensive, is subject to denial of service, and may not be suitable for large or complex datasets.

Another way to address this challenge is to use a combination of cryptographic and non-cryptographic authentication methods. For example, the "open world" model can be combined with a trust model, in which the information is authenticated based on its provenance and the trustworthiness of its source. This allows the authenticity of the information to be verified in a more flexible and scalable way, but it may not provide the same level of security as a purely cryptographic authentication system.

In summary, the "open world" model is challenging to authenticate cryptographically because it allows for the addition of new information at any time. This makes it difficult to determine the authenticity of a piece of information with certainty, as the reference may change over time.

### The "Open World" Model and Privacy

The "open world" model presents several personal privacy challenges because it allows for the addition of new information to the database at any time. In a traditional personal privacy system, the privacy of an individual's data is protected by limiting access to the data and by restricting the ability to add new information to the database. In the "open world" model, neither access to the data nor the ability to add new information is as restricted, which can put the personal privacy of individuals at risk.

One privacy challenge of the "open world" model is the risk of unauthorized access to the data. In the "open world" model, the data is easily linked and correlated, which can make it easy for unauthorized individuals to access the data without the consent of the individuals it describes. This can put personal privacy at risk, as data may be accessed and used without its subjects' knowledge or consent.

Another privacy challenge of the "open world" model is the risk of data leakage. In the "open world" model, the data is generally distributed across multiple nodes in the network, which can make it difficult to control and protect. This can lead to data leakage, in which the data is inadvertently disclosed to unauthorized individuals or systems, exposing personal data to unauthorized parties.

## SIDENOTE: Graph Query Languages

Cypher is an LPG query language that uses a human-readable, ASCII-based syntax similar to SQL, while SPARQL uses a more formal syntax based on the RDF data model. Another difference between Cypher and SPARQL is the way they match patterns in the graph: a Cypher query draws the pattern being sought as ASCII art, while a SPARQL query expresses it as sets of triple patterns that mirror the underlying RDF triples. In practice, Cypher is generally easier to read and write than SPARQL, but SPARQL is more flexible and allows the user to specify more complex queries. (The sketch at the end of this sidenote shows the same query in both languages.)

Thus a strength of Cypher is that it is a relatively simple and intuitive language, which makes it easy to learn and use. This makes Cypher a good choice for applications that need to query a graph database quickly and easily, such as in an interactive environment or as part of a web application. Additionally, Cypher has a rich set of tools and frameworks that support its use, including an online console, a graphical query builder, and integration with various programming languages.

One of the main weaknesses of Cypher is that it is not a fully standardized language: it is primarily supported by Neo4j and its ecosystem of tools and frameworks. This can make it difficult to use Cypher with other graph databases or to integrate Cypher-based applications with other systems. Additionally, Cypher is not as powerful or expressive as some other query languages, such as SPARQL, which means that it may not be suitable for complex or specialized querying tasks.
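As a sketch, here is the same question ("who does Alice know?") expressed in both languages, held in Python strings. The data model, names, and labels are invented, and neither query is wired to a real database here:

```python
# Cypher (LPG): draw the pattern as ASCII art.
cypher_query = """
MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(friend:Person)
RETURN friend.name
"""

# SPARQL (RDF): state the pattern as triples.
sparql_query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?friendName
WHERE {
  ?alice foaf:name "Alice" .
  ?alice foaf:knows ?friend .
  ?friend foaf:name ?friendName .
}
"""

print(cypher_query)
print(sparql_query)
```

The two queries ask for the same answer, but the Cypher version reads almost like the diagram you would draw on a whiteboard, while the SPARQL version exposes the triple structure of the underlying data.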
## SIDENOTE: Labeled Property Graphs & Authenticated Data & Privacy

Labeled property graphs, also known as LPGs, are a type of graph model that is used to represent data that has properties associated with it. In this model, the nodes represent entities, and the edges represent relationships between entities. Both the nodes and the edges can have labels associated with them, which can provide additional information about the data and its relationships.

One potential advantage of using LPGs for cryptographic authentication is that the labels associated with the nodes and edges can provide additional information that can be used to verify the authenticity of the data. For example, a label on a node could specify the source of the data, and a label on an edge could specify the type of relationship between the two nodes. This additional information could be used to verify the authenticity of the data using techniques such as digital signatures and hash functions.

Another advantage of using LPGs for personal privacy is that the labels associated with the nodes and edges can be used to control access to the data. For example, a label on a node could specify the level of access required to view the data, and a label on an edge could specify the type of relationship between the two nodes. This additional information could be used to restrict access to the data to only those individuals who have the appropriate level of access.

However, it should be noted that LPGs are not standardized, which can make it more challenging to use labeled property graphs for cryptographic authentication and privacy in practice.

## SIDENOTE: An interesting possible future? RDF* (aka RDF-star)

RDF-star is a proposed extension to the Resource Description Framework (RDF). RDF-star is designed to provide additional features and capabilities for working with RDF data, and offers several benefits over the original RDF format.

The main benefit of RDF-star is that it allows an entire triple to be used as the subject or object of another triple. This makes it possible to make statements *about* statements (for example, to record who asserted a triple, or how confident we are in it) without the verbose reification machinery that the original RDF format requires. This makes RDF-star more expressive and flexible than the original RDF format, allows it to represent more complex data and relationships, and lets that data be represented more compactly and efficiently, which can make it easier to work with large datasets.

RDF-star is accompanied by SPARQL-star, an extension of the SPARQL query language that can match and return these quoted triples. This can make it easier to work with complex RDF data, and can improve the performance of certain types of queries.

At this point I have not seen any proposals to extend RDF-star's features to JSON-LD.

## PERSONAL SIDENOTE: Gordian Envelopes

My own Gordian Envelope architecture comes closest to RDF-star, but without the semantic requirements of RDF and RDF-star. Like NoSQL databases, Gordian Envelope tries to be somewhat agnostic to any specific structure, so it can support all of them. (The sketch below shows the RDF-star style of statement-about-statement metadata that both approaches target.)
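To make the comparison concrete, here is a rough sketch of that statement-about-statement capability, written in Turtle-star syntax and held in a plain Python string, since library support for RDF-star is still uneven. The names, predicates, and certainty value are all invented for illustration:

```python
# A hypothetical Turtle-star document. The << ... >> form quotes a
# triple so that it can itself be the subject of further statements.
turtle_star = """
PREFIX :   <https://example.com/ns#>
PREFIX ex: <https://example.com/people/>

ex:alice :knows ex:bob .

# The quoted triple is the subject of two further statements,
# recording provenance and confidence about the base assertion.
<< ex:alice :knows ex:bob >> :statedBy ex:carol ;
                             :certainty 0.9 .
"""

print(turtle_star)
```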