# Scripts
## `settings.py`
This script contains the addresses of the databases and collections used by the pipeline.
* **MONGO_URL_SOURCE**: host address of mongodb containing the resources
* **MONGO_URL_SINK**: host address of the mongodb where we write results (usually the same as the source)
* **MONGO_DB_SINK**: the database where we want to write the merged collection
* **MONGO_COLL_SINK**: the name of the target collection
**THERE IS A PARAMETER `NO_URL_VALIDATION`; MAKE SURE IT IS SET TO `False` WHEN NOT EXPERIMENTING**
### IMPORTANT:
There are two parameters, `MONGO_DB_SOURCE` and `MONGO_DB_LATEST`:
* `MONGO_DB_LATEST` refers to the sprint database and ideally contains all collections that get updated
* `MONGO_DB_SOURCE` is basically `zippia_pipeline` and contains the collections that are not updated periodically

Check the source `(db, collection)` pairs before running.
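A minimal sketch of what these settings might look like; all hosts and names below are placeholder values, not the real configuration:

```python
# Placeholder values only -- substitute the real hosts and names.
MONGO_URL_SOURCE = "mongodb://localhost:27017"  # mongodb holding the resources
MONGO_URL_SINK = "mongodb://localhost:27017"    # mongodb receiving results (usually the same)
MONGO_DB_SINK = "merged_db"                     # database for the merged collection
MONGO_COLL_SINK = "merged_companies"            # target collection name

# Keep False for real runs; True disables URL validation while experimenting.
NO_URL_VALIDATION = False
```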
## `source_info.py`
This script contains a dictionary whose keys are the resources we fetch information from. Each entry records the database and collection name corresponding to the resource, plus optional filter and projection arguments.
```python
"CompanyCulturalImage": {
"db": MONGO_COLL_COMPANY_CULTURAL_IMAGES[0],
"collection": MONGO_COLL_COMPANY_CULTURAL_IMAGES[1],
"field_path": "",
"kwargs" : {},
'projection' : {
'companyID' : 1,
'pictures.keep' : 1,
'pictures.original' : 1
}
},
"mission_statements": {
"field_path": "cleanMissionStatement",
"db": MONGO_COLL_MISSION_STATEMENTS[0],
"collection": MONGO_COLL_MISSION_STATEMENTS[1],
"kwargs" : {}
}
```
* **db**: database name
* **collection**: collection name
* **field_path**: set when we are only interested in a subdocument of the full document; only that subdocument/field is extracted
* **kwargs**: filter arguments for find query of mongodb
* **projection**: projection arguments for find query of mongodb
### IMPORTANT:
Verify the flags for each collection before running. This script also checks that every collection exists and contains at least one document.
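The existence check could be sketched like this; `validate_sources` and its error messages are illustrative, not the actual implementation. The client argument is duck-typed, so it works with a `pymongo.MongoClient` or any stand-in:

```python
def validate_sources(client, sources):
    """Fail fast if any (db, collection) pair is missing or empty.

    client: a pymongo.MongoClient (or anything with the same indexing API);
    sources: the resource dictionary from source_info.py.
    """
    for name, info in sources.items():
        db = client[info["db"]]
        if info["collection"] not in db.list_collection_names():
            raise ValueError(f"{name}: missing collection {info['db']}.{info['collection']}")
        # limit=1 keeps the check cheap on large collections
        if db[info["collection"]].count_documents({}, limit=1) == 0:
            raise ValueError(f"{name}: collection {info['db']}.{info['collection']} is empty")
```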
## `merge.py`
This script is unlikely to change, because it simply composes all the other logic parts and runs a very simple merging procedure.
### General steps:
* Given a batch of company ids
* Download all their data from all resources
* Merge each company individually
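The steps above can be sketched as follows; `download_batch` and `merge_company` are stand-ins for the real helpers in `merge.py`:

```python
# Hedged sketch of the batch loop; helper names are illustrative.
def merge_batch(company_ids, sources, download_batch, merge_company):
    """Download every resource for a batch of ids, then merge per company."""
    # One {resource -> {company_id -> document}} mapping for the whole batch.
    data = {name: download_batch(name, info, company_ids)
            for name, info in sources.items()}
    merged = []
    for cid in company_ids:
        # Collect this company's document from each resource that has it.
        per_company = {name: docs[cid] for name, docs in data.items() if cid in docs}
        merged.append(merge_company(cid, per_company))
    return merged
```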
### Individual Merge
1. Take the data of a single company from all resources
2. First, do extra-field generation
* Gathering all name fields from all resources into aliases
* Creating a compound field such as `state_city`, so that majority voting yields a consistent location in one step instead of requiring separate, field-specific logic
3. Take each field separately and apply premerge processing if necessary
* Filtering bad descriptions
* Checking urls
4. Apply merging function
* Raw priority
* Majority
* Interpolation window
* Concatenation
5. Postmerge processing. Examples include (but are not limited to):
* Converting numerical values to discrete (revenue...)
* Splitting merged (city,state) tuple into separate fields
* Capping the number of image links at a maximum threshold
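Putting steps 1-5 together, the per-company flow is roughly the following; every helper name here is illustrative rather than the real function in `merge.py`:

```python
def merge_single(resources, augment_pre, premerge_fns, merge_fns, augment_post, default_fn):
    """resources: {resource_name -> document}; returns a single merged document."""
    resources = augment_pre(resources)               # step 2: state_city, aliases, ...
    fields = {f for doc in resources.values() for f in doc}
    merged = {}
    for field in sorted(fields):
        # step 1: gather this field's value from every resource that has it
        values = {src: doc[field] for src, doc in resources.items() if field in doc}
        if field in premerge_fns:                    # step 3: filtering, URL checks, ...
            values = premerge_fns[field](values)
        fn = merge_fns.get(field, default_fn)        # step 4: raw priority by default
        if fn == "AVOID":
            continue                                 # derived separately in postmerge
        merged[field] = fn(values)
    return augment_post(merged)                      # step 5: split city/state, cap images, ...
```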
## `download_filters.py`
This script contains logic functions for filtering documents from certain resources. It has a dictionary which looks like:
```python
SOURCES_DOWNLOAD_FN = {
'CompanyCulturalImage': pictures_filter,
'CompanyLogos' : logos_filter,
'mission_statements' : statements_filter
}
```
Write new functions and add new resources to this dictionary per demand.
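Assuming each filter maps a raw document to a cleaned document (or `None` to drop it), something in the spirit of `pictures_filter` might look like this; the exact keep/drop rule is an assumption based on the `pictures.keep` projection shown earlier:

```python
def pictures_filter(doc):
    """Keep only pictures explicitly flagged keep=True; drop docs with none left."""
    pictures = [p for p in doc.get("pictures", []) if p.get("keep")]
    if not pictures:
        return None  # nothing worth keeping -> drop the document
    return {**doc, "pictures": pictures}
```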
## `augmentation.py`
This script contains logic for adding synthetic fields to documents before merging, and extra fields after merging.
### Before merging
The only instance where we need this is constructing the `state_city` field across all resources. When selecting the best value, the city and the state must stay consistent with each other; otherwise we could end up with a city located in a different state.
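A minimal sketch of that construction, assuming the per-resource documents expose plain `city` and `state` fields (the actual field names and separator may differ):

```python
def add_state_city(doc):
    """Fuse state and city into one value so majority voting keeps them consistent."""
    city, state = doc.get("city"), doc.get("state")
    if city and state:
        return {**doc, "state_city": f"{state}_{city}"}
    return doc  # leave documents without a full location untouched
```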
### After merging
Several functionalities:
* Splitting `city_state` into individual fields
* Populating all fields corresponding to names as aliases
* Adding detailed address while making sure it's consistent with city/state
* Extracting company domain
* Adding the initial category based on industry (note that this is a temporary field); the actual generation happens in `update_industries.py`
* Making `primary_sic` and `secondary_sic` fields after populating siccodes
* Adding discrete size/revenue fields
* Adding company display website
* Capping urls in company images
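For example, the `state_city` split could look like this, mirroring the hypothetical fused format from the pre-merge step (the `_` separator is an assumption):

```python
def split_state_city(doc):
    """Write the winning fused value back into separate state/city fields."""
    doc = dict(doc)                      # don't mutate the caller's document
    fused = doc.pop("state_city", None)
    if fused:
        state, _, city = fused.partition("_")
        doc["state"], doc["city"] = state, city
    return doc
```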
## `premerge.py`
As mentioned above, for each field we build a `{resource -> value}` dictionary across all resources and then derive the merged value from it. This script contains the preprocessing functions applied per field.
```python
PREMERGE_FN_MAP = {
COMPANY_DESCRIPTION_FIELD : filter_bad_descriptions,
COMPANY_IMAGES_FIELD : partial(check_urls , data_type = "Image" , set_rv_to_v = True),
COMPANY_LOGO_FIELD : partial(check_urls , data_type = "Image" , set_rv_to_v = True) ,
COMPANY_URLS : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
URL_FIELD : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
DOMAIN_ALIASES : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
COMPANY_DOMAIN_FIELD : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
POPULATED_INDUSTRY : remove_unmappable_industries
}
```
Write new functions and add new resources to this dictionary per demand.
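Each premerge function receives the `{resource -> value}` dictionary for one field and returns a cleaned version of it. An illustrative take on `filter_bad_descriptions` (the 10-word threshold is an invented heuristic, not the real rule):

```python
def filter_bad_descriptions(values):
    """Drop descriptions that are not strings or are too short to be useful."""
    return {
        src: text
        for src, text in values.items()
        if isinstance(text, str) and len(text.split()) >= 10
    }
```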
## `merging_functional.py`
This script contains all the logic functions for merging values from different resources into a final value. Examples of such functions are:
* raw_priority (**THIS IS THE DEFAULT**)
* majority (city state)
* concatenation (urls for example)
* selection of url based on type flags
* window interpolation of numerical values
* definite fixed source
```python
FIELD_MERGING_FN = {
URL_FIELD : partial(url_selective , fn = raw_priority),
COMPANY_DOMAIN_FIELD : partial(url_selective , fn = raw_priority),
COMPANY_FOUNDED_DATE_FIELD : partial(majority , map_fn = int),
ACTUAL_REVENUE : partial(window_interpolate , window_size=3 , pick = 'priority'),
ACTUAL_EMPLOYEE_COUNT : raw_priority,
COMPANY_STATE_CITY_FIELD : majority,
COMPANY_CITY_FIELD : "AVOID",
COMPANY_STATE_FIELD : "AVOID",
COMPANY_ALIASES : concat,
SICCODES : concat,
POPULATED_INDUSTRY : raw_priority,
MAPPED_CATEGORY : "AVOID",
COMPANY_TYPE: majority,
COMPANY_LOCATION: "AVOID",
COMPANY_STOCK_TICKER_FIELD : majority,
}
```
**PLEASE NOTE THAT THE DEFAULT IS RAW PRIORITY; ANYTHING ELSE RELATED TO ANY FIELD SHOULD BE ADDED TO THIS DICTIONARY**
You can write pretty much any function you want. Use the `functools` library (e.g. `functools.partial`) to attach custom arguments more easily.
**USE "AVOID" for fields where you don't want any merge happening (fields with special separate derivation/merging)**
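Minimal sketches of the three most common strategies; the priority order list is hypothetical (the real ranking lives in `merging_functional.py`):

```python
from collections import Counter

# Hypothetical resource ranking, highest priority first.
PRIORITY = ["manual", "mission_statements", "CompanyCulturalImage"]

def raw_priority(values, priority=PRIORITY):
    """Pick the value from the highest-priority resource that provides one."""
    for src in priority:
        if src in values:
            return values[src]
    return next(iter(values.values()), None)   # fall back to any value

def majority(values, map_fn=None):
    """Pick the most common value across resources, optionally normalizing first."""
    vals = [map_fn(v) if map_fn else v for v in values.values()]
    return Counter(vals).most_common(1)[0][0] if vals else None

def concat(values):
    """Order-preserving, de-duplicated union of list-valued fields (e.g. aliases)."""
    out = []
    for v in values.values():
        for item in (v if isinstance(v, list) else [v]):
            if item not in out:
                out.append(item)
    return out
```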
# Running Instructions:
Read the information regarding `settings.py` and `source_info.py` above.
Ideally, you only need to modify the database and collection names in `settings.py`. `source_info.py` contains no addresses; it instead makes explicit which resources we need. To modify filters and projections, edit `source_info.py`.
Then simply run `python merge.py`.
**IMPORTANT NOTE :: `update_industries.py` should be run before `suppliment.py`, because all industry/category field setting has been moved exclusively to that script. `suppliment.py` should only add the startup and fortune flags**
# Additional scripts
### `create_test_set.csv`
For downloading a batch of companies and generating per-field sheets for analysis
### `gridsearch.py`
For testing different logic functions on a field, comparing against a verification sheet, and outputting metrics
### `compare_sheets.py`
Comparison utilities and metrics