# COMS6111 Project 2
## Team Members
<!-- Your name and Columbia UNI, and your teammate's name and Columbia UNI -->
| Name | UNI |
| - | - |
| Cherry Chu | ccc2207 |
| Deka Auliya Akbar | da2897 |
<!-- A README file including the following information: -->
## Files in Submission
<!-- - A list of all the files that you are submitting -->
| File | Description |
| - | - |
| `requirements.txt` | List of required packages |
| `env-vm.yml` | List of required packages under conda environment for the vm |
| `run.py` | The command-line interface for the Iterative Set Expansion (ISE) program, used to extract k tuples from documents. |
| `config.py` | Constants and configuration values for the program, such as MAX_CHARS, ITERATION_LIMIT, and TIMEOUT |
| `ise_extract.py` | The main controller / orchestrator of the whole ISE pipeline, from the initial query to the extracted relation tuples. |
| `relation_utils.py` | A special data structure and utility for storing extracted relations |
| `search_utils.py` | Utilities related to searching and scraping documents for a query |
| `output` | A folder that contains the transcripts of our implementation |
## How to Run the Program
<!-- - A clear description of how to run your program. Note that your project must compile/run in a Google Cloud VM that you set up exactly following our instructions. Provide all commands necessary to install the required software and dependencies for your program. -->
### How to Setup the VM for Required Dependencies
<!--
- install python 3.7
- install conda
- update conda environments
- need to install requirements from conda env file / requirements.txt
TODO: -> check if it really works on a VM
-->
1. Install python3.7 on VM
```bash
# generic update
sudo apt-get update
# install git and Python tooling
sudo apt-get install git python-virtualenv python-dev
# install python 3.7
# Follow https://linuxize.com/post/how-to-install-python-3-7-on-ubuntu-18-04/
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
```
2. Install conda on VM
```bash
# install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
# run the Miniconda installer (accept the prompts)
bash ~/miniconda.sh
source ~/.bashrc
# don't automatically activate the base environment on shell startup
conda config --set auto_activate_base false
```
3. Setup conda Environment
Option 1: Create env from env file (recommended)
```bash
# create new environment
conda env create -f env-vm.yml
```
Option 2: Create env from scratch with env file
```bash
# create virtualenv if it doesn't exist
conda create --name 6111 python=3.7
conda activate 6111
# update with env file
conda env update -f env-vm.yml --name 6111
```
Option 3: Install from requirements.txt
```bash
# create virtualenv if it doesn't exist
conda create --name 6111 python=3.7
conda activate 6111
pip3 install -r requirements.txt
```
Option 4: Create env from scratch and install packages manually
```bash
# create virtualenv if it doesn't exist
conda create --name 6111 python=3.7
conda activate 6111
# update by manually installing packages
conda install requests beautifulsoup4
pip install --upgrade google-api-python-client
conda install -c conda-forge python-dotenv lxml
pip install stanfordnlp
```
### How to Run the Program
1. Change directory and activate virtual environment
```bash
# from home directory of a VM user (see credentials below)
# gcloud beta compute --project [PROJECT_ID] ssh --zone "us-east1-d" "[user]@cs6111-instance"
cd 6111project
# or if using tar gz
# tar -xzvf proj2.tar.gz
# cd proj2
# activate conda environment
conda activate 6111
```
2. Make the Program Executable
We have already made the program executable, but in case the file permissions have changed, run:
`chmod +x ./run.py`
3. Run the Program
`./run.py [API_KEY] [SEARCH_ID] [RELATION] [THRESHOLD] "[QUERY]" [K]`
**Arguments:**
- API_KEY: your Google Custom Search JSON API key (see the Credentials section below)
- SEARCH_ID: your Google Custom Search Engine ID (see the Credentials section below)
- RELATION: an integer between 1 and 4 indicating the relation to extract (1: Schools_Attended, 2: Work_For, 3: Live_In, 4: Top_Member_Employees)
- THRESHOLD: a float indicating the minimum extraction confidence required for a tuple to be accepted
- QUERY: a quoted list of words that serves as the seed query for the chosen relation
- K: an integer greater than 0, the number of tuples requested in the output
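For example, to request 10 tuples for the Work_For relation with a 0.35 confidence threshold, starting from the seed query "bill gates microsoft" (illustrative values; substitute your own API key and engine ID):
`./run.py <API_KEY> <SEARCH_ID> 2 0.35 "bill gates microsoft" 10`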
## Project Design
<!-- - A clear description of the internal design of your project, explaining the general structure of your code (i.e., what its main high-level components are and what they do), as well as acknowledging and describing all external libraries that you use in your code -->
### General Structure of the Code
<!-- Add some diagrams and pipeline
- CLI App
- ISE Component
- Relation Component
- Search and Scraper Component\
-->
| Component | Source Code | Description |
| - | - | - |
| CLI Application | `run.py` | The main command line interface which receives initial user input (secret credentials, type of relation, initial query, threshold, and K) and calls `IterativeSetExpansion` to extract relations |
| Iterative Set Expansion (ISE) pipeline | `ise_extract.py` | The controller / orchestrator of the whole pipeline, from the initial query through the query-expansion iterations |
| Extracted Relations | `relation_utils.py` | A special data structure and utility which stores the extracted relations |
| Search | `search_utils.py` | Methods for searching documents given a query |
| Scraper | `search_utils.py` | Methods for fetching web documents and preprocessing / post-processing the scraped content |
### Use of External Libraries
| Library | Usage Description |
| - | - |
| `google-api-python-client` | For searching web documents given a query term |
| `requests` | For fetching web documents over HTTP |
| `beautifulsoup4` | For parsing the scraped web documents |
| `pdfminer.six` | For extracting text from PDF documents |
| `lxml` | For parsing the scraped web documents |
| `stanfordnlp` | For processing text data and extracting named entities and KBP relations |
## Project Implementation of Iterative Set Expansion (ISE)
<!-- A detailed description of how you carried out Step 3 in the "Description" section above
-->
### Overview
<!-- The overall pipeline of the query expansion program from user input to termination is depicted in the figure below.
-->

The diagram above depicts the overall pipeline from the user's query to the extraction of k relation tuples; a simplified sketch of the main loop is shown below.
<!-- - Add some diagrams and pipeline
- Overall Pipeline:
- User and Flow
- user issued a query
- scrape documents from the query result (scrapable or not)
- scrape if scrapable and preprocess the document, truncate
- for the doc, do the first ner annotation pipeline -> tokenization of tokens, sentence, and ner tagging
- filter sentences based on the ner tags according to the ner tags of the chosen relation
- if sentence matches the condition, fed this sentence to the second kbp annotaiton pipeline -> kbp tagging
- check if the kbp matches the chosen relation
- if matches, check the confidence level
- if it's above the confidence
- add tuple to extracted relation data
structure -->
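To make the flow concrete, here is a minimal sketch of the main ISE loop in Python. The helper functions `search_fn`, `scrape_fn`, and `extract_fn` are placeholders for the search, scraping, and annotation steps described in the following sections; they are illustrative and do not match our code exactly.
```python
# Minimal, illustrative sketch of the ISE main loop described above.
# `search_fn`, `scrape_fn`, and `extract_fn` are placeholders for the
# search, scraping, and annotation steps; they are not our exact API.
def iterative_set_expansion(seed_query, relation, threshold, k,
                            search_fn, scrape_fn, extract_fn):
    extracted = {}     # tuple -> highest confidence seen so far
    processed = set()  # tuples already used as queries
    query = seed_query

    while True:
        for result in search_fn(query):        # top results for the current query
            text = scrape_fn(result)           # fetch + clean (HTML pages only)
            if not text:
                continue
            for tup, conf in extract_fn(text, relation):
                if conf >= threshold:
                    tup = tuple(part.lower() for part in tup)   # normalize case
                    extracted[tup] = max(conf, extracted.get(tup, 0.0))

        # Stop once at least k tuples have been extracted.
        if len(extracted) >= k:
            break

        # Otherwise, use the highest-confidence tuple not yet used as a query.
        ordered = sorted(extracted.items(), key=lambda kv: kv[1], reverse=True)
        next_tup = next((t for t, _ in ordered if t not in processed), None)
        if next_tup is None:
            break                               # stalled: nothing left to expand
        processed.add(next_tup)
        query = " ".join(next_tup)

    return sorted(extracted.items(), key=lambda kv: kv[1], reverse=True)
```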
### Scraping Method
We used the `requests`, `beautifulsoup4`, and `lxml` packages to scrape and preprocess scraped pages.
#### Fetch Content from a Webpage
We only scrape HTML documents, so we use the `mime` metadata of each search result to check whether the result is an HTML page. Next, we only scrape documents that we are allowed to scrape, which we determine by checking the website's `robots.txt`. If scraping is allowed, we fetch the HTML content by following the `link` of the result item using `requests`. We also examine the HTTP response code, since some webpages respond with error codes such as 404. If the page is available for scraping, we scrape its content.
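The sketch below illustrates these checks, assuming a search result item that exposes `link` and an optional `mime` field; the helper name `fetch_page` is ours for illustration and is not necessarily what our code uses.
```python
import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def fetch_page(item, timeout=10):
    """Illustrative sketch: return the raw HTML of a search result, or None."""
    # 1. Skip results whose MIME type indicates a non-HTML document.
    if item.get("mime") and item["mime"] != "text/html":
        return None

    url = item["link"]

    # 2. Respect the site's robots.txt before fetching.
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        robots.read()
        if not robots.can_fetch("*", url):
            return None
    except Exception:
        pass  # if robots.txt is unreachable, fall through and try the page

    # 3. Fetch the page and check the HTTP status code (e.g., skip 404s).
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    if resp.status_code != 200:
        return None
    return resp.text
```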
#### Scraping and Preprocessing the Fetched content
We extract the textual content of a webpage using `BeautifulSoup` and `lxml`. Since the original content is raw HTML, it is noisy, so we remove all irrelevant tag elements and perform data cleaning to obtain the clean text content of the page.
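A sketch of this cleaning step is shown below; the exact set of tags removed in our code may differ.
```python
from bs4 import BeautifulSoup

def extract_text(html):
    """Illustrative sketch: strip non-content elements and return plain text."""
    soup = BeautifulSoup(html, "lxml")
    # Drop elements that carry no article text (the tag list here is illustrative).
    for tag in soup(["script", "style", "noscript", "header", "footer", "nav"]):
        tag.decompose()
    # Collapse the remaining text and normalize whitespace.
    return " ".join(soup.get_text(separator=" ").split())
```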
### Iterative Set Expansion
#### Document Preprocessing
For efficiency, we truncate large documents to 20,000 characters.
#### Data Structure: Extracted Relations
We implemented a special data structure that encapsulates the management of extracted relation tuples. Its main methods are listed below, followed by a simplified sketch.
| Method | Description |
| - | - |
| `add_tuple_conf(tuple, conf)` | Add a tuple whose confidence is above the threshold if it does not yet exist in the extracted relations. If it already exists, compare against the confidence of the existing tuple and keep the higher of the two values. <br/><br/> Note that we normalize tuples by **lowercasing** them before adding them to the extracted relations. |
| `is_processed(tuple)` | Check whether a tuple has already been processed |
| `mark_tuple_as_processed(tuple)` | Mark a tuple as processed. This happens whenever we finish processing a tuple in an ISE iteration after the first one (in the first iteration we process the initial query, not a tuple) |
| `get_unprocessed_tuple()` | Return the unprocessed tuple with the highest confidence, by sorting tuples in descending order of confidence and skipping tuples that have already been processed |
| `get_ordered_tuples_by_conf()` | Return the list of (tuple, confidence) pairs sorted in descending order of confidence |
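The sketch below follows the method names in the table; the internal representation is simplified for illustration.
```python
class ExtractedRelations:
    """Simplified sketch of the extracted-relations store described above."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.relations = {}      # normalized tuple -> best confidence seen
        self.processed = set()   # tuples already used as ISE queries

    def add_tuple_conf(self, tup, conf):
        if conf < self.threshold:
            return
        tup = tuple(part.lower() for part in tup)   # case normalization
        # For duplicates, keep the highest confidence seen so far.
        if conf > self.relations.get(tup, 0.0):
            self.relations[tup] = conf

    def is_processed(self, tup):
        return tup in self.processed

    def mark_tuple_as_processed(self, tup):
        self.processed.add(tup)

    def get_ordered_tuples_by_conf(self):
        return sorted(self.relations.items(), key=lambda kv: kv[1], reverse=True)

    def get_unprocessed_tuple(self):
        for tup, _ in self.get_ordered_tuples_by_conf():
            if not self.is_processed(tup):
                return tup
        return None
```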
#### First Annotation Pipeline: Named Entity Recognition (NER)
In this first annotation pipeline, we use the `stanfordnlp` library to split the cleaned scraped document into sentences, run NER annotation, and filter the sentences by checking whether each sentence contains the named-entity tags required by the chosen relation. Sentences that satisfy this condition are added to a list of filtered sentences, which is later used by the second annotation pipeline.
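A sketch of this filtering step is shown below. It assumes the document is annotated through the CoreNLP client bundled with `stanfordnlp`; the mapping from relation to required entity types is an example and may not match our code exactly.
```python
from stanfordnlp.server import CoreNLPClient

LOCATION_TAGS = {"LOCATION", "CITY", "STATE_OR_PROVINCE", "COUNTRY"}

# Example mapping: a sentence must contain at least one entity from every group.
REQUIRED_NER = {
    "Schools_Attended": [{"PERSON"}, {"ORGANIZATION"}],
    "Work_For": [{"PERSON"}, {"ORGANIZATION"}],
    "Live_In": [{"PERSON"}, LOCATION_TAGS],
    "Top_Member_Employees": [{"ORGANIZATION"}, {"PERSON"}],
}

def filter_sentences(text, relation):
    """Illustrative sketch: keep sentences whose NER tags match the relation."""
    filtered = []
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner"],
                       timeout=30000, memory="4G") as client:
        doc = client.annotate(text)
        for sentence in doc.sentence:
            tags = {token.ner for token in sentence.token}
            if all(tags & group for group in REQUIRED_NER[relation]):
                filtered.append(" ".join(token.word for token in sentence.token))
    return filtered
```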
#### Second Annotation Pipeline: Knowledge Base Population (KBP)
In the second annotation pipeline, we use the `stanfordnlp` library to take each sentence from the filtered list and run KBP annotation on it to extract relation tuples. We first check whether the sentence contains a KBP relation matching the chosen relation; if it does, we check whether the confidence is higher than the threshold, and if so we add the extracted relation tuple to the extracted-relations data structure.
Note that because the text is scraped from web pages, sentence structure can be unreliable. We initially encountered problems when annotating very long sentences, so to avoid this issue we limit the sentences passed to the KBP annotator to 500 characters.
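A sketch of this step is shown below, again using the CoreNLP client through `stanfordnlp`. The `target_kbp_relations` argument stands for the set of KBP relation names corresponding to the chosen relation (e.g., `per:schools_attended` for Schools_Attended), and `store` is the extracted-relations data structure sketched earlier; the annotator list and field access are assumptions for illustration, not necessarily our exact code.
```python
from stanfordnlp.server import CoreNLPClient

MAX_SENT_CHARS = 500   # sentences longer than this are skipped, as described above

def extract_relations(sentences, target_kbp_relations, threshold, store):
    """Illustrative sketch: run KBP on the filtered sentences and collect tuples."""
    annotators = ["tokenize", "ssplit", "pos", "lemma", "ner",
                  "depparse", "coref", "kbp"]
    with CoreNLPClient(annotators=annotators, timeout=60000, memory="4G") as client:
        for sent in sentences:
            if len(sent) > MAX_SENT_CHARS:
                continue
            doc = client.annotate(sent)
            for sentence in doc.sentence:
                for triple in sentence.kbpTriple:
                    # Keep triples of the chosen relation above the threshold.
                    if (triple.relation in target_kbp_relations
                            and triple.confidence >= threshold):
                        store.add_tuple_conf((triple.subject, triple.object),
                                             triple.confidence)
```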
### Observation of the output
When we ran queries and compared our results against the transcripts from the reference implementation, we noticed several factors that can contribute to differences in the results.
#### Google search engine
Google search results vary over time because they are based on page ranking, which changes constantly due to search hits, content changes, and improvements to the search algorithm. Therefore, for the same query, the URLs retrieved in each iteration can differ over time, and as a result the extracted relations can also differ.
#### Non-static webpage
In addition, some news webpages display related articles with snippets on the same page for easy browsing. The related articles shown on a page change over time as newer, more relevant articles replace older ones. Therefore, even though we scrape the same webpage, the tuples extracted from it can change over time because of differences in these related articles.
## Credentials
<!-- - Your Google Custom Search Engine JSON API Key and Engine ID (so we can test your project) -->
| VM | Search Engine API Key | Search Engine ID |
| - | - | - |
| ccc2207 | AIzaSyCu-UfCuTGjzX0cGktfKoN6iC5a3eFci8Y | 018423500619609660246:0jukua5kuhl |
| da2897 | AIzaSyD5mzbTkFuhB-8mCwraRz7KmQBQxXQatTM | 004590458276941574971:abdovzhok53 |

| Google Project ID | VM ID | User |
| - | - | - |
| graceful-creek-266703 | cs6111-instance | cherrychu_120 |
| coms6111-268404 | cs6111-instance | da2897 |
## Additional Information
<!-- Any additional information that you consider significant -->
1. We perform case normalization on the extracted tuples by lowercasing them.
2. When checking whether a query has already been processed, we also check whether the current query is a subset of a previously processed query (see the sketch after this list).
3. Aside from truncating documents to 20,000 characters, we also limit sentence length to 500 characters during the second (KBP) annotation pipeline.
4. There are several issues and caveats when running ISE over dynamic webpages, such as changing search results and changing page content, as described in [Observation of the output](#observation-of-the-output).
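A minimal sketch of the subset check mentioned in item 2 (names here are illustrative):
```python
def query_already_processed(query, processed_queries):
    """True if the query was processed before or is a word-subset of a processed query."""
    words = set(query.lower().split())
    return any(words <= set(prev.lower().split()) for prev in processed_queries)
```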