# Introduction
The control lists involve two sources:
- Manually curated control list loaded from a CSVs
- Webshrinker API to query new categories
Given a domain of a URL, one of the following criteria will be returned
- `full-deny`: the URL, domain and content is blocked
- ...
To reduce the number of queries to the API and speed up the process, several measurements are applied:
- a cache of the Websrhinker requests are stored in MySQL
- Webshrinker categories are no older than 30 days
- the manually curated control list is stored persistently in MySQL
- the MySQL data relate to the control lists is pre-loaded into Redis in a pre-digested format
# Webshrinker Service:
Webshrinker is a website categorization service that returns a list of categories for the queried domain.
The full documentation can be found here: https://docs.webshrinker.com/v3/website-category-api.html
Webshrinker uses 2 taxonomies for website categories, IAB and simplified. The server uses the simplified taxonomy.
Webshrinker offers 2 ways to authenticate: Basic HTTP Authentication and Pre-Signed URLs. The server uses Pre-Signed URLs authentication (WebshrinkerWrapper.js)
## Setting up webshrinker keys
The credentials for authentication are stored in defined/webshrinker.json in the following format:
{
"ACCESS_KEY": "[access_key]",
"SECRET_KEY": "[secret_key]"
}
This file needs to be created manually during the installation of the server and stored in the "defined" directory
# MySQL tables
## controllists table
contains the manually curated control list
popoulated from the data/controllist.csv file
The criteria can be one of the following options:
full_deny, full_allow, only_domain, only_url
```
CLEAN_DOMAIN | CRITERIA
news.example.com | full_allow
example.com | full_deny
another.com | only_url
```
## webshrinker table
Cache table for responses from webshrinker.
contains the response from the webshrinker API and a timestamp
```
| DOMAIN | RESPONSE | TIMESTAMP |
+--------------+------------------------------------------------+-----------------------+
| chase.com | {"data": [{"url": "chase.com", "categories": | 2023-03-02 09:01:57 |
| | [{"id": "business", "label": "Business"}]}]} | |
```
## webshrinkercategories table
contains the rules for each category
populated from the data/categories.csv file
follows the same criterias as the controllists table
```
| CATEGORY | CRITERIA |
+---------------------+-------------+
| Adult Content | full_deny |
| Gambling | only_url |
```
## log table missing
blah, blah
## Mysql to redis
Each table is associated with a reddis database:
- controllists table -> Database 1
- webshrinker table -> Database 2
Note that the webshrinker cache database in redis contains the category and not the full reponse like the webshrinker cache table.
-webshrinkercategories table -> Database 3
Add Figure in google docs
## Instructions for re-installing DB
blah blah
# CSV control lists management
## Location and Format of CSV files:
Both CSV file are located in the "data" folder in the server's root directory
controllist.csv
```clean_domain,criteria
example.com,full_deny
google.com,only_url
yahoo.com,only_domain
```
categories.csv
```
category,criteria
Adult Content,full_deny
Adult,full_deny
Economy and Finance,full_deny
Gambling,only_url
Real Estate,only_domain
```
the categories in category column correlate with the categories defined by Webshrinker Service.
a Complete list of categories can be found here: https://webshrinker.com/webshrinker-categories
## Manually adding a domain to the control list file:
1. open controllist.csv file located in the data folder
2. on the second line (under the header) add the new domain in the following format:
domain.tld,rule
example:
example.com,full_deny
3. changes to the csv file will only take effect when reinstalling the server or after dropping the mysql database and reinstalling it. (add instructions for reinstalling db?)
# CSV control lists management
## Manually adding a domain directly to the MySQL db:
1. log into the MySQL DB:
`mysql -u [db username] --password webtracker`
2. you will be prompt to type the username's password
3. Add the domain to the controllists table:
`INSERT INTO controllists (CLEAN_DOMAIN, CRITERIA) VALUES ('[domain]', '[criteria]');`
keep in mind that when adding a domain directly to the DB it will not be added to the controllist.csv file and therefore will be lost next time the server is installed.
## TODO: Dump and Restore Webshrinker Cache and Log tables
# Redis management:
The server uses the redis npm package version 2.8.0 - https://www.npmjs.com/package/redis/v/2.8.0
not to be confused with Node-Redis package.
## Useful redis commands:
To connect to redis server from a terminal window: redis-cli
after connecting to the redis client you can use the following commands:
To switch between active databases
SELECT [num of database, a number between 0-15]
To see all keys within a selected database
KEYS *
To see the value of a specific key
GET [key name]
To see the value of a hash value (Database 2)
HGETALL [key name]
Wiping a redis DBs:
Select the DB to be deleted:
SELECT [num of db, 0-15]
Wipe out the selected DB:
FLUSHDB
Wiping all redis databases:
FLUSHALL
# Workflows
## On Server installation:
1. the controllist table is created, populated from the controllist.csv file, and then loaded into reddis DB 1 (ControlListsTableClass.js)
*any previously existing records in DB 1 will be flushed
2. the webshrinkercategories table is created, populated from the categories.csv file, and then loaded into reddis DB 3 (ControlListsTableClass.js)
*any previously existing records in DB 1 will be flushed
3. the webshrinker table is created.
4. when the installation is finished there will be another check by app.js if the controllist has been successfully loaded to reddis
## On server (re)start:
app.js will check if there are records in each of the redis databases - if there are, it does nothing, if there aren't - it will load the records from the MSQL tables into reddis.
This check is done to prevent loading the reddis databases each time we restart the server.
The redis database is not wiped when the server is restarted, but it is wiped if the machine is restarted.
Note that changes that are made to any of the CSV files will only take effect when the server is reinstalled, not when the server or the machine is restarted.
## Workflow of control list request:
1. the extension makes a post request with a url to the tracking/controllists route
2. the queryURL method is called (controlLists.js)
3. the subdomain is removed from the url
4. the domain is checked against the controllist redis db, if found the matching rule is returned
5. the domain is checked against the webshrinker cache redis db, if found and the record is less than 90 days old the matching rule is returned
6. if the domain wasn't found in the control list and the webshrinker cache or it's older than 90 days the server will make an API call to the webshrinker service (WebshrinkerWrapper.js)
7. the response is stored in the webshrinker cache MySQL table (if the domain existed before, the record will get updated with the new response and timestamp)
8. a record is created in redis db 2. note that there may be a small differece in the timestamp between the webshrinker table and the redis db
9. the webshrinker category will be checked against the webshrinker category database (3)
10. the matching rule for the category is returned.
# Instructions on how to re-start everything
this section could simply refer to other sections: how to flush the database, how to flush redis, and the installations instructions in the other wiki, anything else?