# Content Normalisation for Democratic Web

##### tags: `web3` `dweb` `d2web` `decentralised web` `democratic web` `web2` `keyword` `keyphrase` `EsteroDAO`

#### authors: avneet.eth, neiman.eth

## Abstract

**Web3**, or **decentralised web**, or **dWeb**, in its current form covers barely a fraction of web2's surface area, and it is fair to say that its core infrastructure is still in its early days. As a result, ad space in web3 is still a near-abstract idea and has drawn little thought so far. In this document, we explore web3 infrastructure targeting ad space through keyword tokenisation. The proposed infrastructure is then imagined in a **democratic web** setting.

## Terms

**web3**, or **dWeb**, or **decentralised web**, is an internet framework characterised by its use of a distributed and decentralised network of resources, such as frontend hosting, backend computation and access control, distributed database or ledger access etc.

**d2web**, or **democratic dWeb**, or simply **democratic web**, is a decentralised internet framework that is also democratic. Democraticity here refers to the ability of dWeb residents to democratically enact rules for their decentralised space. By default, a democratic web must also be decentralised, since decentralisation is assumed to be a precondition for democraticity.

In this document, we will use the term dWeb for decentralised web (web3 ⊆ dWeb) and d2web for democratic web.

## Introduction

Ad space in dWeb generates value by targeting interests conveyed through content by its residents; this is essentially the same principle used in web2 so far. To elaborate, consider a search engine frontpage for dWeb (e.g. [esteroids.eth](https://esteroids.eth.limo)) that contains a dedicated **ad container** for advertisements. In the web2 world (e.g. Google), this container is sold as ad space by the parent company, and the resulting monetisation of user searches is maximally centralised and benefits the parent company alone. This puts users of the internet in the web2 world at great risk, since most of their interaction with the internet is mediated by a centralised entity (e.g. Google); the disadvantages include consumer behaviour targeting and exploitation, spam advertisements, phishing scams etc. The problem of course gets worse if the said parent company is an outright market monopoly and ill-regulated.

Let's now reimagine this scenario in the dWeb universe. To begin with, any search engine frontpage for dWeb must be DAO-controlled instead of run by one central agency. The outright benefit of this is that residents may now democratically enact the rules for monetisation, potentially benefit directly from the said monetisation process (through their own experience in the dWeb space), and also be able to control adverse effects such as spam and phishing by decentralising access control and governance of the dWeb frontpage.

## Tokenisation

Once a self-regulating decentralised governance framework exists as described in the previous section, the next critical step is the monetisation of ad space. This step requires us to find methods for tokenisation of user interest. This is the technical aspect that we will discuss in detail in the next sections.

### User Interest

**Tokenisation of User Interest**, or TUI, is a self-explanatory term that outlines the process of a) defining 'user interest' in a given context, and b) codifying those interests in a deterministic manner (aka tokenisation).

As usual, the devil in TUI lies in the details. In this case, the questions are: what is a good proxy for user interest, and how does one codify such a proxy? Straight off the bat, we will go with the natural analogy of a dWeb search engine frontpage; in this scenario, user interest is naturally mapped by the **keywords** or **keyphrases** that a user searches for. For example, if a user searches for a certain keyword or keyphrase, then the entity owning the token(s) for that keyword or keyphrase programmatically dictates how to legitimately utilise and monetise the ad space container that is triggered by the user search. Other possible scenarios also exist, such as when more than one entity owns the tokens to keywords that make up a keyphrase; we will elaborate on such scenarios as well in this draft. The second and last remaining piece of the puzzle is then the normalisation and codification of keywords and keyphrases. This subject will form the backbone of this document.

## Keywords and keyphrases

This section outlines the proposed process to codify and tokenise keywords and keyphrases. To begin with, we assert that a keyphrase is a **convolution** of keywords, where the operation of convolution must be explicitly and deterministically defined. This essentially means that a keyphrase can be tokenised by tokenising its keywords independently and then applying a **convolution** operation. This significantly eases our global tokenisation process, since we can concentrate on tokenising keywords first.

To outline this process in sufficient detail, we consider a reasonably complicated search input by a user of the form:

#### <span style="color:red"> `I was watching three black Jaguars with the most beautiful patterns swimming with other Jaguars`</span>

Let's now attempt to normalise this input first and then tokenise the user interest based on the normalised search input. The first step in normalisation of user input is **stemming**.

### Stemming

Stemming is a process through which a derived or inflected word is reduced to its root form, primarily through suffix-stripping; note that suffix-stripping is not always sufficient. For example, the words `sitting`, `sits` and `seated` all stem from the same root word `sit`, although `seated` cannot be stemmed by suffix-stripping alone. In our example sentence with Jaguars, the stemmed output will look like:

#### <span style="color:red"> `I was watch three black Jaguar with the most beautiful pattern swim with other Jaguar`</span>

Other examples: `affection`, `affecting`, `affector`, `affectation` and `affected` all stem from the root word `affect`.
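To make the stemming step concrete, below is a minimal sketch using NLTK's Porter stemmer; the library choice is an assumption for illustration only. Note that a real stemmer is somewhat more aggressive than the hand-stemmed output above: it lowercases its input and can over-strip (e.g. Porter reduces `was` to `wa`), which is why practical pipelines treat stemming as approximate.

```python
# Minimal stemming sketch using NLTK's Porter stemmer (an assumed
# dependency; any suffix-stripping stemmer would do).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["watching", "Jaguars", "patterns", "swimming"]:
    print(word, "->", stemmer.stem(word))
# watching -> watch, Jaguars -> jaguar,
# patterns -> pattern, swimming -> swim

# Suffix-stripping alone cannot handle irregular forms:
print(stemmer.stem("sitting"))  # -> 'sit'
print(stemmer.stem("sits"))     # -> 'sit'
print(stemmer.stem("seated"))   # -> 'seat', not 'sit'
```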
### Lemmatisation

Lemmatisation is a slightly more complex, context-based process of reducing words to their root form. It is used to reduce words that are not simple suffixed versions of their root word via pluralisation, verbisation etc. For instance, lemmatisation involves the process of

- reducing comparative and superlative forms to their root word, such as `better`, `best` → `good`, and
- reducing past and future forms of auxiliary verbs to the root form, such as `was`, `will` → `is`.

Upon lemmatisation, the output of our example sentence looks like:

#### <span style="color:red"> `I is watch three black Jaguar with beautiful pattern swim with other Jaguar`</span>

### Pruning

Further, especially in the context of search engines, lemmatisation is followed by another process, aka **pruning**, which removes indeterministic nouns (`other`), pronouns (`I`), prepositions (`with`) and auxiliary verbs (`is`) altogether. Pruning also involves the removal of repeating words. Upon pruning, the output of our example sentence looks like:

#### <span style="color:red"> `watch three black jaguar beautiful pattern swim`</span>

Content removed through pruning is typically irrelevant to search results, which is what allows this reduction. In general natural language processing, pruning is context-based and not always useful.
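For concreteness, here is a minimal sketch of lemmatisation followed by pruning, using NLTK's WordNet lemmatiser and its English stopword list (assumed dependencies, chosen only for illustration). Note that standard lemmatisers reduce `was` to the infinitive `be` rather than to `is` as in the hand-worked output above; either convention works as long as it is applied deterministically.

```python
# Minimal lemmatisation + pruning sketch using NLTK (assumed
# dependency). First run may require the corpus downloads below.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

lemmatiser = WordNetLemmatizer()

# Context-based reductions beyond suffix-stripping:
print(lemmatiser.lemmatize("better", pos="a"))  # -> 'good'
print(lemmatiser.lemmatize("was", pos="v"))     # -> 'be' (the text uses 'is')

# Pruning: drop stopwords (pronouns, prepositions, auxiliary verbs,
# indeterministic nouns like 'other') and de-duplicate repeats while
# preserving first-occurrence order.
lemmatised = ("i is watch three black jaguar with beautiful "
              "pattern swim with other jaguar").split()
stop = set(stopwords.words("english"))  # contains 'i', 'is', 'with', 'other'
pruned = list(dict.fromkeys(w for w in lemmatised if w not in stop))
print(" ".join(pruned))  # -> 'watch three black jaguar beautiful pattern swim'
```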
### De-verbisation

De-verbisation is one of the more complex tasks, where some of the verbs may be removed from the reduced keyphrase (after pruning) depending on their relative or absolute importance. In our example, `watch` is nearly irrelevant to the tokenisation of the keyphrase, whereas `swim` is highly relevant. In such a scenario, `watch` may be removed from the reduced keyphrase. Naturally, codifying such a threshold is not straightforward and may be highly contextual for examples other than the one presented here. More discussion and proposals for this process are available at the end of this document. Upon de-verbisation, we get

#### <span style="color:red"> `three black jaguar beautiful pattern swim`</span>

This keyphrase is now ready to be tokenised and convoluted.

## Tokenisation and Convolution

Let us now tokenise the set of keywords in <span style="color:red"> `three black jaguar beautiful pattern swim`</span> and consider the possible orderings of keywords, which are encoded in the convolution operation.

(TBA)

# Appendix

### De-verbisation via Term Frequency

(TBA)
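While this appendix is still to be written, the title suggests one plausible mechanism, sketched below purely as a hypothetical illustration and not as the authors' method: a verb is dropped when its relative term frequency in some reference corpus is too high to signal distinctive user interest. The toy corpus, the verb set and the threshold are all assumptions made for this sketch.

```python
# Hypothetical sketch of term-frequency-based de-verbisation: a verb
# is dropped when it is too common in a reference corpus to signal
# distinctive user interest. The corpus, verb set and threshold are
# toy illustrations, not part of the proposal.
from collections import Counter

# Toy reference corpus; in practice this could be a large log of
# search inputs or a general text corpus.
reference_corpus = (
    "watch watch watch watch watch watch watch watch "
    "read read read swim run swim"
).split()

counts = Counter(reference_corpus)
total = sum(counts.values())

def keep_verb(verb: str, threshold: float = 0.2) -> bool:
    """Keep a verb only if its relative term frequency is below a
    (hypothetical) threshold, i.e. it is rare enough to be distinctive."""
    return counts[verb] / total < threshold

keyphrase = "watch three black jaguar beautiful pattern swim".split()
verbs = {"watch", "swim"}  # assumed to come from a POS tagger

deverbised = [w for w in keyphrase if w not in verbs or keep_verb(w)]
print(" ".join(deverbised))  # -> 'three black jaguar beautiful pattern swim'
```

In this toy corpus, `watch` dominates (relative frequency ≈ 0.57) and is removed, while `swim` (≈ 0.14) survives, reproducing the de-verbised keyphrase from the main text.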