# (Lack of?) OCA structure
The OCA essence is data defined via a capture base and metadata via overlays. Metadata is by its nature optional, thus all overlays tend to be optional. The overlays defined so far are universal across jurisdictions and use cases. They decorate raw bytes of data with additional meaning. They classify distinct characteristics that are part of, let's call it, the `enriched data types ontology`. In other words, they constitute an ontology/classification over raw bytes.
Some of the current overlays are borrowed from other ontologies, e.g., unit. Some can be inferred, e.g., character encoding.
> Inference works much like `type inference` in some `strongly typed` programming languages, where a type annotation is not explicitly required; it is deduced.
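To make the split concrete, here is a minimal sketch of a capture base plus one overlay, and of how an absent character-encoding annotation could be deduced. The field names loosely follow the OCA spec but should be treated as illustrative, not as the normative serialization:

```
# A minimal sketch of the capture base / overlay split.
# Field names loosely follow the OCA spec; illustrative, not normative.

capture_base = {
    "type": "spec/capture_base/1.0",
    "classification": "",              # optional higher-level classification
    "attributes": {
        "sensor_id": "Text",
        "sensor_value": "Numeric",
    },
}

# An overlay decorates the raw attributes with extra meaning.
character_encoding_overlay = {
    "type": "spec/overlays/character_encoding/1.0",
    "default_character_encoding": "utf-8",
}

def encoding_for(bundle_overlays: list[dict]) -> str:
    """Return the declared encoding, or fall back to an inferred default."""
    for overlay in bundle_overlays:
        if overlay["type"].startswith("spec/overlays/character_encoding"):
            return overlay["default_character_encoding"]
    return "utf-8"  # deduced default when no annotation exists, like type inference
```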
The metadata defined through existing overlays, while universally applicable to any use case, is at the same time heavily influenced by external factors. The most significant is the need to present data in various ways through user interfaces.
The demand for a proper foundational ontology is a prerequisite for the next challenge: [ontology-based data integration](https://en.wikipedia.org/wiki/Ontology-based_data_integration). Before going deeper into this topic, let's define what actually justifies taking on such a challenge.
One of the primary characteristics (selling points) of OCA is data harmonization/data integration through the metadata-enrichment process. The `enriched data types ontology`, i.e., the currently defined overlays, enables integration at the level of enriched data types, much like a strongly typed programming language enables certain operations on values of the same type. Integrating datasets with enriched data types ends up in a larger dataset, compliant with the same characteristics as its sources.
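The type-checker analogy can be sketched in a few lines. The dataset shape and the `Numeric[unit=Cel]` notation below are hypothetical; the point is only that integration is permitted when the enriched types agree:

```
# A sketch of integration at the level of enriched data types: two datasets
# may be merged only when their (hypothetical) enriched types agree, much
# like a type checker permitting operations on matching types.

from typing import Any

def integrate(a: dict[str, Any], b: dict[str, Any]) -> dict[str, Any]:
    """Merge two datasets whose schemas carry the same enriched types."""
    if a["types"] != b["types"]:
        raise TypeError("datasets are not compliant with the same enriched types")
    return {"types": a["types"], "records": a["records"] + b["records"]}

ds1 = {"types": {"sensor_value": "Numeric[unit=Cel]"}, "records": [{"sensor_value": 21.5}]}
ds2 = {"types": {"sensor_value": "Numeric[unit=Cel]"}, "records": [{"sensor_value": 19.0}]}

merged = integrate(ds1, ds2)  # larger dataset, same characteristics as its sources
```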
Some industries define their own classifications, e.g., SNOMED, to create unambiguous characteristics and relations for concepts. Many corporate environments define internal classifications to unambiguously navigate across [heterogeneous data sources](http://pubs.sciepub.com/acis/3/1/3/). Thus, [global and local](https://www.diag.uniroma1.it/degiacom/papers/2001/CaDL01swws.pdf) classifications give a different perspective and add even more metadata to raw bytes.
To conduct data integration at a higher level than the foundational ontology, models defined with OCA must include higher-level classifications. While the capture base comes with the `classification` attribute, consider the following example:
```
# Measurement schema
sensor_id: string
sensor_value: number
device_id: string
location: string
occurred_at: datetime
...
```
Measurement is a consequence of an act of capturing a value at a specified point in time. This is the essence, and everything else is metadata, which is optional. Furthermore, measurement schemas will vary across origins and/or jurisdictions, complying with the specific needs of a given use case.
Rewriting the above into:
```
# Measurement schema
sensor_id: string
sensor_value: number
occurred_at: datetime
```
and adding metadata:
```
device_id: string
location: string
```
inherently decouples measurement from context (where/when/... it happened). Note it is a different approach from classical relational modeling. If relations are in the game, the above capture base looks like the following:
```
sensor_id: SAID
value: number
device_id: SAID
location_id: SAID
occurred_at: datetime
```
The difference is significant, and it implies how to model and represent the data. The `classification` attribute comes at the schema level, but it can just as well sit at the metadata level. To support it from the technical side, consider the following snippet given in CESR pseudo code:
```=1
{ sensor_id: 5, sensor_value: 1 }
---AAB # protocol selection
protocol metadata, i.e. when/what happened
---AAC # protocol selection
protocol metadata, i.e. geospatial location of where it happened
...
```
Line `1` specifies the underlying message, a captured record. Lines `2-5` enrich the message with metadata. The metadata is arbitrary but compliant with the protocol. The protocol is a technical representation of a classification, i.e., an ontology. The message may have an arbitrary number of enriching protocols, varying with the use case.
Such an approach is just an example, but it also implies a paradigm shift in what is metadata and what is the underlying message. The proposed decoupling imposes explicit separation and involves a mental-model change as well.
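A consumer of such a framed message only needs to separate the record from the protocol sections. The sketch below does exactly that; the `---AAx` markers are taken from the pseudo code above, not from the actual CESR grammar, and the payloads are rendered as valid JSON with invented metadata:

```
# A sketch of consuming the framed message above. The `---AAx` markers come
# from the pseudo code, not the real CESR grammar; payloads are invented.

import json

raw = """\
{"sensor_id": 5, "sensor_value": 1}
---AAB
{"occurred_at": "2023-03-01T12:00:00Z"}
---AAC
{"lat": 52.23, "lon": 21.01}
"""

lines = raw.strip().splitlines()
message = json.loads(lines[0])          # the underlying captured record

protocols: dict[str, dict] = {}
current = None
for line in lines[1:]:
    if line.startswith("---"):
        current = line[3:]              # protocol selection code, e.g. AAB
    elif current is not None:
        protocols[current] = json.loads(line)  # metadata compliant with that protocol
```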
### Towards strict and hierarchical semantic models
Data models represented with OCA benefit from the foundational ontology and, optionally, from higher-level classifications (global or local), given via the `classification` attribute in the capture base.
Environments encompassing many datasets, utilizing various types of database systems, lack a common perspective on the underlying data. Each database system comes with a custom data model and custom semantics, defined by modelers who are unaware of the semantics in the other systems. [Heterogeneous database systems](https://en.wikipedia.org/wiki/Heterogeneous_database_system) overcome the differences by unifying the access interface and enable schema and semantic integration. While schema integrations are doable by applying proper transformations, semantics require more effort: mappings must be made carefully and consciously. Classifications are invaluable allies when working with semantics and creating these mappings.
Classification is a hierarchical structure of information, also called a semantic model. Classifications aim to be accurate, unambiguous, and complete. When they exist, they help with data management at scale, because they enable interoperability among previously heterogeneous and siloed datasets. At scale, this is a significant factor to consider.
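What such a mapping looks like can be shown in a short sketch. The concept IDs, source names, and column names below are hypothetical; the point is that two heterogeneous schemas become comparable once mapped onto one shared classification:

```
# A sketch of semantic mapping: two heterogeneous schemas mapped, carefully
# and consciously, onto one shared classification. All names are hypothetical.

MAPPINGS = {
    "warehouse_db": {"temp_c": "concept:AmbientTemperature",
                     "dev":    "concept:MeasuringDevice"},
    "lab_system":   {"temperature": "concept:AmbientTemperature",
                     "instrument":  "concept:MeasuringDevice"},
}

def to_concepts(source: str, row: dict) -> dict:
    """Rewrite a source-specific row into shared classification concepts."""
    mapping = MAPPINGS[source]
    return {mapping[col]: value for col, value in row.items() if col in mapping}

# Both rows become comparable under the shared classification:
a = to_concepts("warehouse_db", {"temp_c": 21.5, "dev": "D-1"})
b = to_concepts("lab_system", {"temperature": 19.0, "instrument": "L-7"})
```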
## The semantics culture
Enriching new or existing datasets with metadata coming from semantic models requires work and habits. While semantics is the study of meaning and all data have meaning, not every organization shall consider semantics a primary factor to address, even though all organizations manage data. Adding metadata to any data has a cost that the benefits must outweigh. The [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) is an example of where that did not happen, ending in unsuccessful adoption.
The semantics culture is about adding new habits to the organization's daily activities. It starts with thinking about semantics as a primary concern within the organization. Simply applying technologies that enable semantic enrichment does not solve the underlying problem. Technical patterns do not solve political issues but add a burden in terms of cost that management may simply not accept. Siloed and heterogeneous datasets are a barrier to getting the big picture of the organization's data. While adding semantics enriches the data, the consumers and benefits of these enrichments must be clearly defined.
The scale of the maintained datasets is a significant entry factor when considering semantics as a valuable addition to an organization. Semantics becomes essential for reporting or analytics across datasets, as it is the unifying factor.
# Technical stack movements
## Semantic Engine (SE)
In principle, SE organizes communication among data producers and data consumers, enabling a party potentially interested in consuming a dataset to actually do so (after following a sophisticated business process). The semantics part of SE is supported by the so-called `enriched data types ontology`, the foundational OCA ontology. Take a look at how others enable full-fledged semantics support, e.g., [here](https://blog.cambridgesemantics.com/knowledge-graphs-the-future-of-data-integration) or [here](https://www.thehyve.nl/articles/tools-building-knowledge-graphs). Thus, the full product (or a set of products?) shall consist of:
- a data integration part heavily relying on semantic models;
- a set of features enabling auto-generation/semi-auto-generation of semantic models;
- auto-scanning of datasets and applying semantic models to enrich the data;
- potentially visualizing the end result in the data consumer environment.
The current scope of SE as a product focuses more on who has what than on the benefits of applying semantics. The `enriched data types ontology` may be useful in the actual postprocessing of the data, i.e., at the point where the data consumer and data producer(s) are done with the sophisticated business process and have exchanged the datasets. The primary feature, however, is to pair interested parties. Beyond that, the fundamental feature of SE is to allow a data consumer to run queries among data producers in a distributed fashion using the `MapReduce` pattern, complying with all data producers' regulations and consents. Thus, the result of interacting with SE is a visualisation, a unit, or a collection of data. Finally, it shall behave much like ChatGPT and provide the desired end result in a compact form.
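The distributed-query idea can be sketched minimally. The producer structure and the `consents` hook below are hypothetical stand-ins for the real regulations-and-consents machinery; the sketch only shows the map step over producers and the reduce step over partial results:

```
# A sketch of the distributed query idea: map a query over data producers,
# respect each producer's consents, and reduce partial results into one
# compact answer. The producer interface is hypothetical.

from functools import reduce

def map_query(producer, query):
    """Each producer evaluates the query locally, honouring its own consents."""
    if not producer["consents"](query):
        return []
    return [r for r in producer["records"] if query(r)]

def run(producers, query):
    partials = [map_query(p, query) for p in producers]        # map step
    return reduce(lambda acc, part: acc + part, partials, [])  # reduce step

producers = [
    {"records": [{"v": 1}, {"v": 5}], "consents": lambda q: True},
    {"records": [{"v": 7}],           "consents": lambda q: False},  # opted out
]

result = run(producers, lambda r: r["v"] > 2)  # -> [{'v': 5}]
```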
## OCA
The newest 1.0 specification shapes OCA to support various use cases under a common denominator: decentralized semantics and task-specific objects (overlays). OCA has two main areas of interest that have so far emerged crystal clear:
- IoT
- SSI.
### Criticism and improvements
IoT lacks proper support for large-scale deployments, starting from a lack of semantics culture. Industries like IoT, usually backed by big players, require semantics as a first-class citizen, spanning the whole organization with proper classification and crossing all possibly interested branches.
SSI seems to be overinvested, supporting cases that may eventually become relevant, but only in the next several years. The most prominent feature recognized in the community in early 2023 is the ability to provide translations into many languages, but that is far from a fundamental cutting-edge feature.
The primary (cutting-edge) OCA feature is (or shall be) the ability to work across classifications and jurisdictions and to capture as much metadata for data as is feasible and reasonable, enabling continuous enrichment of the data with metadata. It adds proper support for higher-level components like SE. The inherent end goal would be reorienting current developments and road maps, e.g., supporting existing global classifications as a first-class feature.
OCA is solely a technical pattern applied to a given problem. Before starting with OCA, a `Semantics Culture` must be established within the organization so everyone clearly understands how it maps to the technical solution or why it is actually needed.
OCA shall not focus on cases that do not have semantics issues. A semantic problem occurs where there is ambiguity. Both reporting and analytics, the most beneficial areas of having rich semantics, must have straightforward recipes to resolve the ambiguity.
#### Decentralised semantics
Decentralised Semantics (DS) enables a distinct representation of the characteristics specified within the Semantics. Let's define all the Semantics characteristics as a set `S`; an interested party `A` may choose a subset `U`, that is, `U ⊆ S`. It is in `A`'s best interest to choose characteristics that fit its intent, but to achieve this, `A` must learn what these characteristics are about. In many cases this is an unnecessary step, as the applicability of a subset `U` is use-case dependent. It is, therefore, more convenient to impose specific characteristics for certain use cases and specify `U_{1}, U_{2}, ..., U_{n}`.
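The subset mechanics translate directly into code. The characteristic names and the two presets below are hypothetical; the only claim is that imposed, use-case-specific presets spare `A` from studying `S` itself:

```
# The subset mechanics: S is the full set of Semantics characteristics, and
# use-case presets U_1, U_2, ... spare party A from studying S itself.
# The characteristic names are hypothetical.

S = frozenset({"label", "unit", "encoding", "format", "conformance"})

# Imposed, use-case dependent presets instead of ad-hoc choices:
U_1 = frozenset({"label", "unit"})            # presentation use case
U_2 = frozenset({"encoding", "conformance"})  # validation use case

assert U_1 <= S and U_2 <= S  # every preset must satisfy U ⊆ S
```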
DS has another capability that plays well with the authentication layer. It enables multi-governance and cross-jurisdiction use cases by allowing authorities to express their intent over only a subset of a larger whole. In conjunction with the authentication layer, it enables an authority to benefit from digital signatures by signing (confirming) specific characteristics of the Semantics. The authority is capable of `dig_sign(V)` where `V ⊆ U`, but by doing so, the authority lacks the big picture, that is `U`, and leaves `U - V` outside the signature. It is in the authority's best interest to sign `U` (i.e., a certain OCA Bundle) rather than `V`, because only such an operation guarantees context preservation as a whole: what was agreed by the authority is used by third parties in exactly the same way. Third parties use `U`, not `V ∪ Z`, where `Z` is an addition from a third party. If, however, the authority agrees to manipulation of `V` such that `V ∪ Z ⊆ U`, there must be an additional explicit mechanism, equipped with capabilities, defining what `Z` is. Such a mechanism is out of scope of DS.
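Why signing `U` beats signing `V` can be demonstrated with a digest as a stand-in for a real digital signature. The `dig_sign` helper below is hypothetical (a plain SHA-256 over a canonical serialization), as are the characteristic names:

```
# A sketch of why signing U beats signing V: a digest over the whole set
# preserves context, while a digest over a subset V leaves U - V outside
# the signature. hashlib stands in for a real `dig_sign`.

import hashlib

def dig_sign(characteristics: frozenset) -> str:
    """Stand-in for a digital signature over a canonical serialization."""
    canonical = ",".join(sorted(characteristics))
    return hashlib.sha256(canonical.encode()).hexdigest()

U = frozenset({"label", "unit", "encoding"})
V = frozenset({"label"})            # V ⊆ U

sig_whole = dig_sign(U)             # context preserved as a whole
sig_part = dig_sign(V)              # U - V is not covered by this signature

# A third party extending V with its own Z breaks what was agreed:
Z = frozenset({"format"})
assert dig_sign(V | Z) != sig_whole
```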
To preserve the above conclusions within Decentralised Semantics, it must move towards a federated ecosystem, where pieces are assembled from units and free form is an undesirable trait.
#### Spanning layer
The underlying OCA assumptions fit the [hourglass model](https://arxiv.org/pdf/1607.07183.pdf) assumptions, but

### Summary
Semantics Culture changes the mental model within the organization. The outcome of introducing it shall be semantic models that bring abstract structure and hierarchy to the organization's data. With semantic models in place, the organization starts looking at the underlying technology to map the semantic models onto the organization's datasets. The organization may furthermore opt in to SE for further data mining.
To saturate current market demands, OCA should evolve and become the spanning layer across classifications.
All the battles across environments and ecosystems (e.g., SSI) then become obsolete, because they do not have semantics issues.