Metadata specification

## Metadata standard

Any file that contains text usually holds, within a given span of that text, a large amount of information about its content. We can leverage this information by generating document **metadata**.

We propose to organize the metadata for each text document as a dictionary (key-value pairs) with the following fields:

### Metadata format:

| Name | Type | Approximate size (bytes) | Example | Description |
| - | - | - | - | - |
| Person-Entities | String | 100 on avg | Brandon Mahone,Arran Sym | Comma-separated person entities |
| Norp-Entities | String | 100 on avg | Asian,Korean | Nationalities or religious or political groups. |
| Fac-Entities | String | 100 on avg | Cafe Olympico,the Stade Olympique | Buildings, airports, highways, bridges, etc. |
| Org-Entities | String | 100 on avg | Twitter,Arbitrum | Companies, agencies, institutions, etc. |
| Gpe_Entities | String | 100 on avg | US,Japan | Countries, cities, states. |
| Loc-Entities | String | 100 on avg | New England,Europe | Non-GPE locations, mountain ranges, bodies of water. |
| Product-Entities | String | 100 on avg | Google Docs,Discord,Notion | Objects, vehicles, foods, etc. (Not services.) |
| Event-Entities | String | 100 on avg | China Construction Expo,World Cup | Named hurricanes, battles, wars, sports events, etc. |
| Work-Of-Art-Entities | String | 100 on avg | cyberpunk Cthulhu China,midjourney | Titles of books, songs, etc. |
| Law-Entities | String | 100 on avg | Quest Protocol,The Toucan Protocol | Named documents made into laws. |
| Date-Entities | String | 100 on avg | July 19,November 1 | Absolute or relative dates or periods. |
| Language-Entities | String | 100 on avg | English, Spanish | Any named language. |
| Money-Entities | String | 100 on avg | 10 yuan,$13.7 billion | Monetary values, including unit. |
| Quantity-Entities | String | 100 on avg | 23 metres,approximately 227 tonnes | Measurements, as of weight or distance. |
| Time-Entities | String | 100 on avg | 4pm EST/ 9pm UTC,10:30pm | Times smaller than a day. |
| Keywords | String | 100 on avg | season quest,exploring defi | Any keywords containing 1 or 2 words |
| Categories | String | 100 on avg | music,sports,finance | Odysee categories (https://help.odysee.tv/category-categories/categories/) |
| Content-Hash | Integer | 8 | 2296761207 | Hash of the content |
| Major-Language | String | 10 on avg | 'en' | The main language of the text |
| Other-Languages | String | 100 on avg | 'es,de' | Other languages present in the text, excluding 'Major-Language' |
| Alpha-Token-Count | Integer | 8 | 46 | Number of alphabetic tokens in the text |
| Alpha-Char-Count | Integer | 8 | 157 | Number of alphabetic characters in the text |
| Total-Char-Count | Integer | 8 | 241 | Number of characters in the text |
| Non-Alpha-Char-Count | Integer | 8 | 84 | Number of non-alphabetic characters in the text |
| Alpha-Ratio | Float | 8 | 0.65145 | Alpha-Char-Count divided by Total-Char-Count |
| Embeddings | Array(Float) | 12344 | 1536x1 vector | Embeddings of the model 'text-embedding-ada-002' |
| Text-Style | String | 100 to 200 | romantic | Style of the text, generated with an LLM |
| Text-Summary | String | 200 to 1000 | some summary | LLM-generated short text summary |
| Text-Title | String | 100 to 500 | some title | LLM-generated text title |
| Transcription | List(Dict) | unlimited | [{'text': 'hello everyone', 'start': '0', 'end': '100'}, {'text': 'text continues', 'start': '100', 'end': '200'}] | Text transcription |

### Use cases

- Content discovery
- Recommendation systems, ranking by relevance
- Clustering, analytics

### Examples of metadata

- Example [metadata](https://drive.google.com/file/d/1Ozyzn6a64-7vu0cd18JjhyKEaV8r7d3j/view?usp=sharing) for text documents. The JSON document contains 50 rows for 50 files from Arweave.

### Compatibility

In an ideal world, tags would support rich data types such as arrays, vectors, and numbers. However, as of now, the most common assumption at the gateway level is that tag values are strings. Reference gateways also implement only exact-match access patterns.
Goldsky also allows wildcards and fuzzier matching, but it too always assumes string values. In the spirit of compatibility, we will release our string tags first (in lowercase). For numerical tags, we will attempt to convert to strings by bucketing. For arrays, we will publish each array item as a separate tag, so category [music, film] becomes tag category:music and tag category:film. Vectors will be published as strings, for a possible future implementation of vector search at the gateway level or offloaded to the client.

### Discovery

- All entities are dynamic, i.e. not predetermined, and will grow in cardinality over time. DataOS will serve as the authoritative source of these possible entities as they appear.
- Categories are consistent for each version of metadata released. The current categories are: 'Pop Culture', 'Artists', 'Education', 'Comedy', 'Lifestyle', 'Music', 'Sports', 'Gaming', 'Tech', 'Finance', 'Spirituality', 'News & Politics', 'Universe', 'Rabbit Hole'
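
The tag conversion described under Compatibility could be sketched roughly as follows. This is a minimal illustration, not the released implementation: the helper names `bucket_number` and `to_tags`, the bucket width of 100, the `lo-hi` bucket label format, the JSON serialization of vectors, and the choice to lowercase both tag names and string values are all assumptions made for the example.

```python
import json
from typing import Any, Dict, List, Tuple

Tag = Tuple[str, str]  # (tag name, tag value), both plain strings


def bucket_number(value: float, width: int = 100) -> str:
    """Return a label for the half-open bucket [lo, lo + width) containing value.

    The width and the "lo-hi" label format are assumptions for illustration.
    """
    lo = int(value // width) * width
    return f"{lo}-{lo + width}"


def to_tags(metadata: Dict[str, Any], bucket_width: int = 100) -> List[Tag]:
    """Flatten a metadata dict into string-only tags (hypothetical helper)."""
    tags: List[Tag] = []
    for name, value in metadata.items():
        key = name.lower()
        if isinstance(value, list) and value and all(isinstance(x, float) for x in value):
            # Vector: published as one string (JSON encoding is an assumption).
            tags.append((key, json.dumps(value)))
        elif isinstance(value, list):
            # Array: one separate tag per item.
            tags.extend((key, str(item).lower()) for item in value)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            # Number: converted to a string by bucketing.
            tags.append((key, bucket_number(value, bucket_width)))
        else:
            # Plain string: lowercased as-is.
            tags.append((key, str(value).lower()))
    return tags


example = {
    "Categories": ["Music", "Film"],
    "Alpha-Char-Count": 157,
    "Major-Language": "en",
    "Embeddings": [0.1, 0.2],
}
tags = to_tags(example)
for tag_name, tag_value in tags:
    print(f"{tag_name}:{tag_value}")
```

Because every value ends up as a string, a gateway that only supports exact matching can still answer a query such as `categories:music` or `alpha-char-count:100-200`. A bucket width this coarse would not suit a ratio field such as Alpha-Ratio, which would need its own bucketing scheme.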