# Introduction
This folder contains all the knowledge nuggets discovered by the AICoE search and discovery team.
# Contents of the repo
## The repo consists of four folders
- **Common folder** - Contains all data scraped from websites by Soham across different domains and for different nuggets.
- **Media folder** - Contains manually scraped data comprising different knowledge nuggets across the media domain.
- **Retail folder** - Contains manually scraped data comprising different knowledge nuggets across different retail domains.
- **Scripts folder** - Contains the scripts present across all the folders (common, media and retail) and their subfolders (abbreviations, wordcrawl, spell crawl, etc.).
## Common Folder
It comprises data scraped from websites for different domains, including Grocery, Home & Living, Fashion, Beauty, Digital & Media.
*Process flow of web data scraping*

*The common folder comprises multiple subfolders, which include:*
***Abbreviations***
This comprises various categories of products on the 1mg site, as listed below: A. common abbreviations, B. famous people, C. famous firms.
| category | abbreviation | expansion | description |
|-|-|-|-|
| common | 2TG | 2 Temple Gardens | Company |
***Famous People***
This comprises a directory of aliases of famous people across different generations.
| name | aliases |
|-|-|
| Dr. Rajendra Prasad | Desh Ratna, Ajatshatru |
***Ingestion API***
This can be used to create an API that processes JSON data and creates triplets of data nuggets, which are later output as protobuf messages.
The files in this directory are:
1. Collection grid
2. Data collection documents
3. Flowchart
4. Ingestion API
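
The JSON-to-triplet step described above might look roughly like the sketch below. The record layout, the `name` subject key and the function name `json_to_triplets` are assumptions made for illustration, not the actual Ingestion API. Serialization to protobuf is left out because the message schema is not described here.

```python
import json


def json_to_triplets(raw, subject_key="name"):
    """Turn each JSON record into (subject, predicate, object) triplets.

    Hypothetical helper: assumes `raw` is a JSON array of flat objects
    and that `subject_key` names the field used as the subject.
    """
    triplets = []
    for record in json.loads(raw):
        subject = record[subject_key]
        for key, value in record.items():
            if key == subject_key:
                continue
            # A list value yields one triplet per element.
            values = value if isinstance(value, list) else [value]
            for item in values:
                triplets.append((subject, key, item))
    return triplets


# Sample record modeled on the Famous People example in this README.
sample = '[{"name": "Dr. Rajendra Prasad", "aliases": ["Desh Ratna", "Ajatshatru"]}]'
triplets = json_to_triplets(sample)
for triplet in triplets:
    print(triplet)
```

Each triplet could then be copied into whatever protobuf message type the API defines.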
***Janaki Scripts***
It comprises a dictionary of synonyms, hyponyms, word breaks and other commonly used words for items on Ajio and Jiomart.
***Word Forms***
It comprises a repository of word forms, including superlatives and adjectives, derived from basic root forms.
| root form | word form |
|-|-|
| close | closer, closely, closed, closing, closes, disclose |
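
One way such a table could be used in search is to invert it, so that any word form maps back to its root. The sketch below assumes the data is available as a dictionary of root to word forms; the in-memory layout and the name `build_root_index` are illustrative assumptions, not the repository's actual format.

```python
def build_root_index(word_forms):
    """Map every word form (and the root itself) to its root form."""
    index = {}
    for root, forms in word_forms.items():
        index[root.lower()] = root
        for form in forms:
            index[form.lower()] = root
    return index


# Sample row taken from the table above.
word_forms = {
    "close": ["closer", "closely", "closed", "closing", "closes", "disclose"],
}
root_index = build_root_index(word_forms)
print(root_index.get("closing"))  # close
```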
***Pincode Collections***
The folder contains pincodes of various locations in India in JSON format.
| Name | Pincode |
|-|-|
| Laxmipur | 721242 |
***Vocab Dictionaries***
It comprises dictionaries for various domains, including Fashion, Electronics, Grocery, Generic and Healthcare. The data is scraped from Ajio, Bigbasket, Myntra, Reliance Digital, 1mg and others.
***Miscellaneous***
It comprises all miscellaneous crawled content that does not fit into any of the above folders. E.g., Nutrient Value Queries ppt, Recipe Search ppt.
## Media Folder
The media industry often uses abbreviations and slang, predominantly among critics and talk show hosts. Keeping in mind the search terms that potential users type for movies and series, we have prepared a Knowledge Nugget for the media domain.
*Process flow of web data scraping*

*The Media folder comprises the following categories:*
* Abbreviations & Acronyms
* Slangs
***Abbreviations & Acronyms***
Critics, talk show hosts and influencers often type acronyms to avoid the hassle of writing the full name, so a dictionary of acronyms and abbreviations in the repository is critical to providing a good user search experience.
| Acronym | Full form | Category |
|-|-|-|
|BBB| Band Baja Baraat | Movie |
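
As an illustration of how an acronym dictionary like this could feed search, the sketch below expands known acronyms inside a user query. The dictionary layout and the function name `expand_acronyms` are assumptions for the example; the JSON files in the repository may be structured differently.

```python
import re

# Sample entry taken from the table above; keys are stored in upper case.
ACRONYMS = {"BBB": "Band Baja Baraat"}


def expand_acronyms(query, acronyms):
    """Replace each token that is a known acronym with its full form."""
    def replace(match):
        token = match.group(0)
        # Unknown tokens are returned unchanged.
        return acronyms.get(token.upper(), token)

    return re.sub(r"[A-Za-z0-9]+", replace, query)


expanded = expand_acronyms("bbb songs", ACRONYMS)
print(expanded)  # Band Baja Baraat songs
```

Matching is case-insensitive here because users rarely type acronyms in consistent case.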
***Slangs***
Many critics and talk show hosts use slang, or coin slang words intentionally, to create interest among users or to leave a positive impression on their followers.
| Slang | Show Name | Category |
|-|-|-|
|1983 | 83| Slang|
## Retail Folder
*The Retail folder comprises multiple domains, which include:*
* Beauty
* Digital
* Education
* Fashion
* Food
* Grocery
* Health
* Home & Living
*Each folder provides relevant files and datasets for a particular project. Each folder may contain a subset of the following subfolders:*
* Documents
* Data Files
* API Documentation
* Data Dictionaries
* Readme
### **Beauty**
Many bloggers and influencers who promote products use slang and abbreviations for product and brand names. Keeping in mind the terminology that potential users use for beauty products, we have prepared a Knowledge Nugget for the beauty domain.
The Beauty domain comprises the following subfolders:
***Abbreviations***
It comprises a JSON file with a list of abbreviations associated with the beauty industry.
| Acronym | Full form | Category |
|-|-|-|
|SPF | Sun Protection Factor | Industry |
***Indic***
It comprises a JSON file with a list of Indic translations associated with the beauty industry.
| Product | Indic Translation | Language |
|-|-|-|
|Vermilion | Kum Kum| Hindi|
***Slangs***
It comprises a JSON file with a list of slang terms associated with the beauty industry.
| Category | Product | Slang|
|-|-|-|
|Perfume | Perfume | Cologne|
***Synonyms***
It comprises a JSON file with a list of synonyms associated with the beauty industry.
| Category | Product | Synonyms |
|-|-|-|
|Face Makeup | BB Cream | Beauty balm cream, Blemish balm cream, Beblesh balm, Blemish base |
### **Digital**
The digital industry has many items and products that are called by different names in different parts of the world. Keeping in mind the terminology that potential users use for digital products, we have prepared a Knowledge Nugget for the digital domain.
The Digital domain comprises the following subfolders:
***Abbreviations***
It comprises a JSON file with a list of abbreviations associated with the digital industry.
| Acronym | Full form | Category |
|-|-|-|
|AUX | Auxiliary | Product |
***Indic***
It comprises a JSON file with a list of Indic translations associated with the digital industry.
| Gadget | Indic Translation | Language |
|-|-|-|
|Watch | Ghadi| Hindi|
***Slangs***
It comprises a JSON file with a list of slang terms associated with the digital industry.
| Category | Product | Slang|
|-|-|-|
|Mobile & Accessories | Mobile| Smartphone|
***Synonyms***
It comprises a JSON file with a list of synonyms associated with the digital industry.
| Category | Product | Synonyms |
|-|-|-|
|Household Items | Extension Board | Power Strip, Extension Cord, Charging Ports, Surge Protectors |
### **Education**
The folder contains:
* Notes for classes 6 to 12 from the BYJU'S website as a zip file
* A script to crawl these notes from the website
### **Fashion**
The fashion industry has many items and products that are called by different names in different parts of the world. Keeping in mind the terminology that potential users use for fashion products, we have prepared a Knowledge Nugget for the fashion domain.
The Fashion domain comprises the following subfolders:
***Abbreviations***
It comprises a JSON file with a list of abbreviations associated with the fashion industry.
| Acronym | Full form | Category |
|-|-|-|
|LP | Louis Philippe | Brand |
***Catalog***
It comprises subfolders with datasets related to the fashion domain, which include:
* Ajio brand description
* Express Scrapping
* Luxury Brands
* Myntra Attributes
***Indic***
It comprises a JSON file with a list of Indic translations associated with the fashion industry.
| Product | Indic Translation | Language |
|-|-|-|
|Leather | Chamada| Hindi|
***Review***
It comprises data crawled from Myntra, including reviews and ratings along with other metadata for products on Myntra.
***Slangs***
It comprises a JSON file with a list of slang terms associated with the fashion industry.
| Category | Product | Slang|
|-|-|-|
|Mobile & Accessories | Mobile| Smartphone|
***Synonyms***
It comprises a JSON file with a list of synonyms associated with the fashion industry.
| Category | Product | Synonyms |
|-|-|-|
|Unisex | Tshirt | Tee, Tee-shirt, t-shirt, polo-shirt, polo neck |
### **Food**
The folder contains:
* **Indian Restaurants** - A list of popular restaurants listed on the Dineout website.
E.g., 24/7, One8 Commune
* **Archana's Recipes** - This comprises various types of recipes from the Archana's Kitchen website.
E.g., 10-minutes-oats-unni-appam-nei-appam-recipe, 100-whole-wheat-bread-with-instant-yeast.
### **Grocery**
India is a multilingual country where many grocery items are called by different names in different states; for example, potato is called aloo or batata. Hence, we have prepared a Knowledge Nugget for the grocery domain.
The Grocery domain comprises the following subfolders:
***Abbreviations***
It comprises a JSON file with a list of abbreviations associated with the grocery industry.
| Acronym | Full form | Category |
|-|-|-|
|KMF| Karnataka Milk Federation | Dairy Product |
***Catalog***
It comprises subfolders with datasets related to the grocery domain, which include:
* Data dictionary
* Jiomart image crawl
* Nutrition data
* Quantity entity extractor
***Indic***
It comprises a JSON file with a list of Indic translations associated with the grocery industry.
| Product | Hindi | Marathi | Kannada | Tamil | Telugu | Malayalam | Punjabi | Gujarati |
|-|-|-|-|-|-|-|-|-|
| Oil| Tel | Tel | Taila | Enney | Nooni | Enna | Tel | Tela |
It also comprises Indic translation data crawled from Jiomart.
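
A translation table like the one above could be inverted so that a regional term in a query resolves to its product. The sketch below assumes each row is a dictionary keyed by the column names; that layout and the name `build_indic_lookup` are assumptions for illustration, not the actual file format.

```python
# Sample row taken from the table above.
indic_rows = [
    {"Product": "Oil", "Hindi": "Tel", "Marathi": "Tel", "Kannada": "Taila",
     "Tamil": "Enney", "Telugu": "Nooni", "Malayalam": "Enna",
     "Punjabi": "Tel", "Gujarati": "Tela"},
]


def build_indic_lookup(rows):
    """Map each lower-cased regional term to its product and languages."""
    lookup = {}
    for row in rows:
        product = row["Product"]
        for language, term in row.items():
            if language == "Product":
                continue
            # If a term is shared by several products, the first one wins
            # in this simple sketch.
            entry = lookup.setdefault(
                term.lower(), {"product": product, "languages": []}
            )
            entry["languages"].append(language)
    return lookup


indic_lookup = build_indic_lookup(indic_rows)
print(indic_lookup["tel"])
```

A term such as "tel" is shared by several languages, so the lookup keeps a list of languages per term.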
***Pricing***
It comprises crawled data on products, orders and quantities alongside their MRP.
***Recipe***
The folder contains a repository of various food items, B2B catalogs, recipes and scripts. This helps classify items as recipes and expand the query.
E.g., 7UP, Cornflour
***Review***
The folder contains
* Product review links as a text file
* A compilation of product reviews with their associated ratings and comments
* A Python script for crawling the reviews
***Slangs***
It comprises a JSON file with a list of slang terms associated with the grocery industry.
| Category | product | Slang|
|-|-|-|
|Vegetables | Vegetable | Veggie |
***Spell Check***
The folder comprises the following resources:
* A dictionary of various names and their misspellings
* Translations of common words from Hindi to other languages
* Product orders
* Various translations
***Synonyms***
It comprises a JSON file with a list of synonyms associated with the grocery industry.
| Category | Product | Synonyms |
|-|-|-|
|Dairy Product | Cream | Milk fat, Butter Fat, Skimmed milk, Malai |
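
A synonym list like this is typically used for query expansion: if the user's term matches a product or any of its synonyms, the search runs over all of them. The sketch below assumes a dictionary of product to synonyms; the layout and the name `expand_query` are illustrative assumptions rather than the repository's actual schema.

```python
# Sample row taken from the table above.
synonyms = {"Cream": ["Milk fat", "Butter Fat", "Skimmed milk", "Malai"]}


def expand_query(query, synonyms):
    """Return the product and all its synonyms if the query matches one."""
    needle = query.strip().lower()
    for product, names in synonyms.items():
        variants = [product] + names
        if needle in (variant.lower() for variant in variants):
            return variants
    # No match: fall back to the original query only.
    return [query]


print(expand_query("malai", synonyms))
```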
### **Health**
The folder contains data crawled from 1mg, comprising:
- List of medicines
- List of products
- List of words on 1mg

The crawled data includes the following categories:
* fitness
* healthcare devices
* personal care
* ayurveda
* homeopathy
| Category | example |
|-|-|
| homeopathy | "SBL Kali Bromatum 0/21 LM", "SBL Aralia Racemosa Mother Tincture Q" |
### **Home & Living**
It comprises data crawled from:
* Bed Bath & Beyond
* Qalara
* Stylumia
***Bed Bath & Beyond***
This folder is a repository of data files from the Bed Bath & Beyond website, which contains:
* Image files from Bed Bath & Beyond
* A collection of URLs
* Product descriptions

| Category | example |
|-|-|
| bedding | https://www.bedbathandbeyond.com/store/category/bedding/10001 |
***Qalara***
The folder contains a repository of images and product descriptions of various products from the Qalara catalog.
| product ID | Material |
|-|-|
| 10101110197 | Wool, Bamboo |
***Stylumia***
This folder contains scripts to scrape data similar to that in the Qalara catalog.
| website | list |
|-|-|
| amazon | Mooas Loop Bathroom Clock, Shower Clock |
| dunelm | Massey Clock, Heycroft Clock, Delphi Mantle Clock |