# Scripts
## `settings.py`
This script contains the addresses of the databases and collections used by the pipeline.
* **MONGO_URL_SOURCE**: host address of mongodb containing the resources
* **MONGO_URL_SINK**: host address of the mongodb where we write results (usually the same as the source)
* **MONGO_DB_SINK**: the database where we want to write the merged collection
* **MONGO_COLL_SINK**: the name of the target collection
**THERE IS A PARAMETER `NO_URL_VALIDATION`; MAKE SURE IT IS SET TO `False` WHEN NOT EXPERIMENTING**
### IMPORTANT:
There are two parameters, `MONGO_DB_SOURCE` and `MONGO_DB_LATEST`:
* `MONGO_DB_LATEST` refers to the sprint database and ideally contains all collections that get updated
* `MONGO_DB_SOURCE` is basically `zippia_pipeline` and contains the collections that are not updated periodically

Check the source `(db, collection)` pairs before running.
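A minimal sketch of what these settings might look like; all hosts and names below are placeholder values, not the real configuration:

```python
# Placeholder values only -- substitute the real hosts and names.
MONGO_URL_SOURCE = "mongodb://localhost:27017"  # mongodb holding the resources
MONGO_URL_SINK = "mongodb://localhost:27017"    # mongodb receiving results (usually the same)
MONGO_DB_SINK = "merged_db"                     # database for the merged collection
MONGO_COLL_SINK = "merged_companies"            # target collection name

# Keep False for real runs; True disables URL validation while experimenting.
NO_URL_VALIDATION = False
```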
## `source_info.py`
This script contains a dictionary whose keys are the resources we fetch information from. Each entry records the database and collection name corresponding to the resource, plus optional filter and projection arguments.
```python
"CompanyCulturalImage": {
"db": MONGO_COLL_COMPANY_CULTURAL_IMAGES[0],
"collection": MONGO_COLL_COMPANY_CULTURAL_IMAGES[1],
"field_path": "",
"kwargs" : {},
'projection' : {
'companyID' : 1,
'pictures.keep' : 1,
'pictures.original' : 1
}
},
"mission_statements": {
"field_path": "cleanMissionStatement",
"db": MONGO_COLL_MISSION_STATEMENTS[0],
"collection": MONGO_COLL_MISSION_STATEMENTS[1],
"kwargs" : {}
}
```
* **db**: database name
* **collection**: collection name
* **field_path**: set when we are only interested in a subdocument of the full document; only that subdocument/field is extracted
* **kwargs**: filter arguments for find query of mongodb
* **projection**: projection arguments for find query of mongodb
### IMPORTANT:
Verify the flags for each collection before running. This script also checks that every collection exists and contains at least one document.
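The existence check could be sketched like this; `validate_sources` and its error messages are illustrative, not the actual implementation. The client argument is duck-typed, so it works with a `pymongo.MongoClient` or any stand-in:

```python
def validate_sources(client, sources):
    """Fail fast if any (db, collection) pair is missing or empty.

    client: a pymongo.MongoClient (or anything with the same indexing API);
    sources: the resource dictionary from source_info.py.
    """
    for name, info in sources.items():
        db = client[info["db"]]
        if info["collection"] not in db.list_collection_names():
            raise ValueError(f"{name}: missing collection {info['db']}.{info['collection']}")
        # limit=1 keeps the check cheap on large collections
        if db[info["collection"]].count_documents({}, limit=1) == 0:
            raise ValueError(f"{name}: collection {info['db']}.{info['collection']} is empty")
```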
## `merge.py`
This script is unlikely to change, because it simply composes all the other logic parts and runs a very simple merging procedure.
### General steps:
* Given a batch of company ids
* Download all their data from all resources
* Merge each company individually
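The steps above can be sketched as follows; `download_batch` and `merge_company` are stand-ins for the real helpers in `merge.py`:

```python
# Hedged sketch of the batch loop; helper names are illustrative.
def merge_batch(company_ids, sources, download_batch, merge_company):
    """Download every resource for a batch of ids, then merge per company."""
    # One {resource -> {company_id -> document}} mapping for the whole batch.
    data = {name: download_batch(name, info, company_ids)
            for name, info in sources.items()}
    merged = []
    for cid in company_ids:
        # Collect this company's document from each resource that has it.
        per_company = {name: docs[cid] for name, docs in data.items() if cid in docs}
        merged.append(merge_company(cid, per_company))
    return merged
```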
### Individual Merge
1. Take the data of a single company from all resources
2. First, do extra-field generation
* Gathering all name fields from all resources into aliases
* Creating a compound field such as `state_city`, so that majority voting yields a consistent location in one step instead of requiring separate, field-specific logic
3. Take each field separately and apply premerge processing if necessary
* Filtering bad descriptions
* Checking urls
4. Apply merging function
* Raw priority
* Majority
* Interpolation window
* Concatenation
5. Postmerge processing. Examples include (but are not limited to):
* Converting numerical values to discrete (revenue...)
* Splitting merged (city,state) tuple into separate fields
* Capping the number of image links at a maximum threshold
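Putting steps 1-5 together, the per-company flow is roughly the following; every helper name here is illustrative rather than the real function in `merge.py`:

```python
def merge_single(resources, augment_pre, premerge_fns, merge_fns, augment_post, default_fn):
    """resources: {resource_name -> document}; returns a single merged document."""
    resources = augment_pre(resources)               # step 2: state_city, aliases, ...
    fields = {f for doc in resources.values() for f in doc}
    merged = {}
    for field in sorted(fields):
        # step 1: gather this field's value from every resource that has it
        values = {src: doc[field] for src, doc in resources.items() if field in doc}
        if field in premerge_fns:                    # step 3: filtering, URL checks, ...
            values = premerge_fns[field](values)
        fn = merge_fns.get(field, default_fn)        # step 4: raw priority by default
        if fn == "AVOID":
            continue                                 # derived separately in postmerge
        merged[field] = fn(values)
    return augment_post(merged)                      # step 5: split city/state, cap images, ...
```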
## `download_filters.py`
This script contains logic functions for filtering documents from certain resources. It has a dictionary which looks like:
```python
SOURCES_DOWNLOAD_FN = {
'CompanyCulturalImage': pictures_filter,
'CompanyLogos' : logos_filter,
'mission_statements' : statements_filter
}
```
Write new functions and add new resources to this dictionary per demand.
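Assuming each filter maps a raw document to a cleaned document (or `None` to drop it), something in the spirit of `pictures_filter` might look like this; the exact keep/drop rule is an assumption based on the `pictures.keep` projection shown earlier:

```python
def pictures_filter(doc):
    """Keep only pictures explicitly flagged keep=True; drop docs with none left."""
    pictures = [p for p in doc.get("pictures", []) if p.get("keep")]
    if not pictures:
        return None  # nothing worth keeping -> drop the document
    return {**doc, "pictures": pictures}
```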
## `augmentation.py`
This script contains logic for adding synthetic fields to documents before merging, and extra fields after merging.
### Before merging
The only instance where we need this is constructing the `state_city` field across all resources. When selecting the best value, the city and the state must stay consistent with each other; otherwise we could end up with a city located in a different state.
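A minimal sketch of that construction, assuming the per-resource documents expose plain `city` and `state` fields (the actual field names and separator may differ):

```python
def add_state_city(doc):
    """Fuse state and city into one value so majority voting keeps them consistent."""
    city, state = doc.get("city"), doc.get("state")
    if city and state:
        return {**doc, "state_city": f"{state}_{city}"}
    return doc  # leave documents without a full location untouched
```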
### After merging
Several functionalities:
* Splitting `city_state` into individual fields
* Populating all fields corresponding to names as aliases
* Adding detailed address while making sure it's consistent with city/state
* Extracting company domain
* Adding the initial category based on industry (note that this is a temporary field); the actual generation happens in `update_industries.py`
* Making `primary_sic` and `secondary_sic` fields after populating siccodes
* Adding discrete size/revenue fields
* Adding company display website
* Capping urls in company images
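For example, the `state_city` split could look like this, mirroring the hypothetical fused format from the pre-merge step (the `_` separator is an assumption):

```python
def split_state_city(doc):
    """Write the winning fused value back into separate state/city fields."""
    doc = dict(doc)                      # don't mutate the caller's document
    fused = doc.pop("state_city", None)
    if fused:
        state, _, city = fused.partition("_")
        doc["state"], doc["city"] = state, city
    return doc
```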
## `premerge.py`
As mentioned above, for each field we build a `{resource -> value}` dictionary across all resources and then derive the merged value from it. This script contains the preprocessing functions applied per field.
```python
PREMERGE_FN_MAP = {
COMPANY_DESCRIPTION_FIELD : filter_bad_descriptions,
COMPANY_IMAGES_FIELD : partial(check_urls , data_type = "Image" , set_rv_to_v = True),
COMPANY_LOGO_FIELD : partial(check_urls , data_type = "Image" , set_rv_to_v = True) ,
COMPANY_URLS : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
URL_FIELD : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
DOMAIN_ALIASES : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
COMPANY_DOMAIN_FIELD : partial(check_urls , data_type = "Text" , set_rv_to_v = False) ,
POPULATED_INDUSTRY : remove_unmappable_industries
}
```
Write new functions and add new resources to this dictionary per demand.
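Each premerge function receives the `{resource -> value}` dictionary for one field and returns a cleaned version of it. An illustrative take on `filter_bad_descriptions` (the 10-word threshold is an invented heuristic, not the real rule):

```python
def filter_bad_descriptions(values):
    """Drop descriptions that are not strings or are too short to be useful."""
    return {
        src: text
        for src, text in values.items()
        if isinstance(text, str) and len(text.split()) >= 10
    }
```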
## `merging_functional.py`
This script contains all the logic functions for merging values from different resources into a final value. Examples of such functions are:
* raw_priority (**THIS IS THE DEFAULT**)
* majority (city state)
* concatenation (urls for example)
* selection of url based on type flags
* window interpolation of numerical values
* definite fixed source
```python
FIELD_MERGING_FN = {
URL_FIELD : partial(url_selective , fn = raw_priority),
COMPANY_DOMAIN_FIELD : partial(url_selective , fn = raw_priority),
COMPANY_FOUNDED_DATE_FIELD : partial(majority , map_fn = int),
ACTUAL_REVENUE : partial(window_interpolate , window_size=3 , pick = 'priority'),
ACTUAL_EMPLOYEE_COUNT : raw_priority,
COMPANY_STATE_CITY_FIELD : majority,
COMPANY_CITY_FIELD : "AVOID",
COMPANY_STATE_FIELD : "AVOID",
COMPANY_ALIASES : concat,
SICCODES : concat,
POPULATED_INDUSTRY : raw_priority,
MAPPED_CATEGORY : "AVOID",
COMPANY_TYPE: majority,
COMPANY_LOCATION: "AVOID",
COMPANY_STOCK_TICKER_FIELD : majority,
}
```
**PLEASE NOTE THAT THE DEFAULT IS RAW PRIORITY; ANYTHING ELSE RELATED TO ANY FIELD SHOULD BE ADDED TO THIS DICTIONARY**
You can write pretty much any function you want. Use the `functools` library (e.g. `functools.partial`) to attach custom arguments more easily.
**USE "AVOID" for fields where you don't want any merge happening (fields with special separate derivation/merging)**
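Minimal sketches of the three most common strategies; the priority order list is hypothetical (the real ranking lives in `merging_functional.py`):

```python
from collections import Counter

# Hypothetical resource ranking, highest priority first.
PRIORITY = ["manual", "mission_statements", "CompanyCulturalImage"]

def raw_priority(values, priority=PRIORITY):
    """Pick the value from the highest-priority resource that provides one."""
    for src in priority:
        if src in values:
            return values[src]
    return next(iter(values.values()), None)   # fall back to any value

def majority(values, map_fn=None):
    """Pick the most common value across resources, optionally normalizing first."""
    vals = [map_fn(v) if map_fn else v for v in values.values()]
    return Counter(vals).most_common(1)[0][0] if vals else None

def concat(values):
    """Order-preserving, de-duplicated union of list-valued fields (e.g. aliases)."""
    out = []
    for v in values.values():
        for item in (v if isinstance(v, list) else [v]):
            if item not in out:
                out.append(item)
    return out
```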
# Running Instructions:
Read the information regarding `settings.py` and `source_info.py` above.
Ideally, you only need to modify the database and collection names in `settings.py`. `source_info.py` contains no addresses; it instead makes explicit which resources we need. To modify filters and projections, edit `source_info.py`.
Then simply run `python merge.py`.
**IMPORTANT NOTE :: `update_industries.py` should be run before `suppliment.py`, because all industry/category field setting has been moved exclusively to that script. `suppliment.py` should only add the startup and fortune flags**
# Additional scripts
### `create_test_set.csv`
For downloading a batch of companies and generating per-field sheets for analysis
### `gridsearch.py`
For testing different logic functions on a field, comparing against a verification sheet, and outputting metrics
### `compare_sheets.py`
Comparison utilities and metrics