# COMS6111 Project 2
## Team Members
<!-- Your name and Columbia UNI, and your teammate's name and Columbia UNI -->
| Name | UNI |
| - | - |
| Cherry Chu | ccc2207 |
| Deka Auliya Akbar | da2897 |
<!-- A README file including the following information: -->
## Files in Submission
<!-- - A list of all the files that you are submitting -->
| File | Description |
| - | - |
| `requirements.txt` | List of required packages |
| `env-vm.yml` | List of required packages under conda environment for the vm |
| `run.py` | The command-line interface for the Iterative Set Expansion (ISE) program, used to extract k tuples from documents. |
| `config.py` | Constants and configuration values for the program, such as MAX_CHARS, ITERATION_LIMIT, and TIMEOUT |
| `ise_extract.py` | The main controller / orchestrator of the whole ISE pipeline, from the initial query to the extracted relation tuples. |
| `relation_utils.py` | A special data structure and utility for storing extracted relations |
| `search_utils.py` | Utilities related to searching and scraping documents for a query |
| `output` | A folder that contains the transcripts of our implementation |
## How to Run the Program
<!-- - A clear description of how to run your program. Note that your project must compile/run in a Google Cloud VM that you set up exactly following our instructions. Provide all commands necessary to install the required software and dependencies for your program. -->
### How to Setup the VM for Required Dependencies
<!--
- install python 3.7
- install conda
- update conda environments
- need to install requirements from conda env file / requirements.txt
TODO: -> check if it really works on a VM
-->
1. Install python3.7 on VM
```bash
# generic update
sudo apt-get update
# install git and Python tooling
sudo apt-get install git python-virtualenv python-dev
# install python 3.7
# Follow https://linuxize.com/post/how-to-install-python-3-7-on-ubuntu-18-04/
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
```
2. Install conda on VM
```bash
# install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
# run the Miniconda installer (accept the prompts)
bash ~/miniconda.sh
source ~/.bashrc
# don't automatically activate the base environment on shell startup
conda config --set auto_activate_base false
```
3. Setup conda Environment
Option 1: Create env from env file (recommended)
```bash
# create new environment
conda env create -f env-vm.yml
```
Option 2: Create env from scratch with env file
```bash
# create virtualenv if it doesn't exist
conda create --name 6111 python=3.7
conda activate 6111
# update with env file
conda env update -f env-vm.yml --name 6111
```
Option 3: Install from requirements.txt
```bash
# create virtualenv if it doesn't exist
conda create --name 6111 python=3.7
conda activate 6111
pip3 install -r requirements.txt
```
Option 4: Create env from scratch and install packages manually
```bash
# create virtualenv if it doesn't exist
conda create --name 6111 python=3.7
conda activate 6111
# update by manually installing packages
conda install requests beautifulsoup4
pip install --upgrade google-api-python-client
conda install -c conda-forge python-dotenv lxml
pip install stanfordnlp
```
### How to Run the Program
1. Change directory and activate virtual environment
```bash
# from home directory of a VM user (see credentials below)
# gcloud beta compute --project [PROJECT_ID] ssh --zone "us-east1-d" "[user]@cs6111-instance"
cd 6111project
# or if using tar gz
# tar -xzvf proj2.tar.gz
# cd proj2
# activate conda environment
conda activate 6111
```
2. Make the Program Executable
We have already made the program executable, but in case the file permissions have changed, run:
`chmod +x ./run.py`
3. Run the Program
`./run.py [API_KEY] [SEARCH_ID] [RELATION] [THRESHOLD] "[QUERY]" [K]`
**Arguments:**
- API_KEY: your Google Custom Search JSON API key (see the Credentials section below)
- SEARCH_ID: your Google Custom Search Engine ID (see the Credentials section below)
- RELATION: an integer between 1 and 4 indicating the relation to extract (1: Schools_Attended, 2: Work_For, 3: Live_In, 4: Top_Member_Employees)
- THRESHOLD: a float indicating the minimum extraction confidence required for a tuple to be accepted
- QUERY: a quoted list of words that serves as the seed query for the chosen relation
- K: an integer greater than 0, the number of tuples requested in the output
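For example, to request 10 tuples for the Work_For relation with a 0.35 confidence threshold, starting from the seed query "bill gates microsoft" (illustrative values; substitute your own API key and engine ID):
`./run.py <API_KEY> <SEARCH_ID> 2 0.35 "bill gates microsoft" 10`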
## Project Design
<!-- - A clear description of the internal design of your project, explaining the general structure of your code (i.e., what its main high-level components are and what they do), as well as acknowledging and describing all external libraries that you use in your code -->
### General Structure of the Code
<!-- Add some diagrams and pipeline
- CLI App
- ISE Component
- Relation Component
- Search and Scraper Component\
-->
| Component | Source Code | Description |
| - | - | - |
| CLI Application | `run.py` | The main command line interface which receives initial user input (secret credentials, type of relation, initial query, threshold, and K) and calls `IterativeSetExpansion` to extract relations |
| Iterative Set Expansion (ISE) pipeline | `ise_extract.py` | The controller / orchestrator of the whole pipeline, from the initial query through the query-expansion iterations |
| Extracted Relations | `relation_utils.py` | A special data structure and utility which stores the extracted relations |
| Search | `search_utils.py` | Methods for searching documents given a query |
| Scraper | `search_utils.py` | Methods for fetching web documents and preprocessing / post-processing the scraped content |
### Use of External Libraries
| Library | Usage Description |
| - | - |
| `google-api-python-client` | For searching web documents given a query term |
| `requests` | For fetching web documents over HTTP |
| `beautifulsoup4` | For parsing the scraped web documents |
| `pdfminer.six` | For extracting text from PDF documents |
| `lxml` | For parsing the scraped web documents |
| `stanfordnlp` | For processing text data and extracting named entities and KBP relations |
## Project Implementation of Iterative Set Expansion (ISE)
<!-- A detailed description of how you carried out Step 3 in the "Description" section above
-->
### Overview
<!-- The overall pipeline of the query expansion program from user input to termination is depicted in the figure below.
-->

The diagram above depicts the overall pipeline from the user's query to the extraction of k relation tuples; a simplified sketch of the main loop is shown below.
<!-- - Add some diagrams and pipeline
- Overall Pipeline:
- User and Flow
- user issued a query
- scrape documents from the query result (scrapable or not)
- scrape if scrapable and preprocess the document, truncate
- for the doc, do the first ner annotation pipeline -> tokenization of tokens, sentence, and ner tagging
- filter sentences based on the ner tags according to the ner tags of the chosen relation
- if sentence matches the condition, fed this sentence to the second kbp annotaiton pipeline -> kbp tagging
- check if the kbp matches the chosen relation
- if matches, check the confidence level
- if it's above the confidence
- add tuple to extracted relation data
structure -->
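To make the flow concrete, here is a minimal sketch of the main ISE loop in Python. The helper functions `search_fn`, `scrape_fn`, and `extract_fn` are placeholders for the search, scraping, and annotation steps described in the following sections; they are illustrative and do not match our code exactly.
```python
# Minimal, illustrative sketch of the ISE main loop described above.
# `search_fn`, `scrape_fn`, and `extract_fn` are placeholders for the
# search, scraping, and annotation steps; they are not our exact API.
def iterative_set_expansion(seed_query, relation, threshold, k,
                            search_fn, scrape_fn, extract_fn):
    extracted = {}     # tuple -> highest confidence seen so far
    processed = set()  # tuples already used as queries
    query = seed_query

    while True:
        for result in search_fn(query):        # top results for the current query
            text = scrape_fn(result)           # fetch + clean (HTML pages only)
            if not text:
                continue
            for tup, conf in extract_fn(text, relation):
                if conf >= threshold:
                    tup = tuple(part.lower() for part in tup)   # normalize case
                    extracted[tup] = max(conf, extracted.get(tup, 0.0))

        # Stop once at least k tuples have been extracted.
        if len(extracted) >= k:
            break

        # Otherwise, use the highest-confidence tuple not yet used as a query.
        ordered = sorted(extracted.items(), key=lambda kv: kv[1], reverse=True)
        next_tup = next((t for t, _ in ordered if t not in processed), None)
        if next_tup is None:
            break                               # stalled: nothing left to expand
        processed.add(next_tup)
        query = " ".join(next_tup)

    return sorted(extracted.items(), key=lambda kv: kv[1], reverse=True)
```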
### Scraping Method
We used the `requests`, `beautifulsoup4`, and `lxml` packages to scrape and preprocess scraped pages.
#### Fetch Content from a Webpage
We only scrape HTML documents, so we use the `mime` metadata of each search result to check whether the result is an HTML page. Next, we only scrape documents that we are allowed to scrape, which we determine by checking the website's `robots.txt`. If scraping is allowed, we fetch the HTML content by following the `link` of the result item using `requests`. We also examine the HTTP response code, since some webpages respond with error codes such as 404. If the page is available for scraping, we scrape its content.
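The sketch below illustrates these checks, assuming a search result item that exposes `link` and an optional `mime` field; the helper name `fetch_page` is ours for illustration and is not necessarily what our code uses.
```python
import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def fetch_page(item, timeout=10):
    """Illustrative sketch: return the raw HTML of a search result, or None."""
    # 1. Skip results whose MIME type indicates a non-HTML document.
    if item.get("mime") and item["mime"] != "text/html":
        return None

    url = item["link"]

    # 2. Respect the site's robots.txt before fetching.
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        robots.read()
        if not robots.can_fetch("*", url):
            return None
    except Exception:
        pass  # if robots.txt is unreachable, fall through and try the page

    # 3. Fetch the page and check the HTTP status code (e.g., skip 404s).
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    if resp.status_code != 200:
        return None
    return resp.text
```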
#### Scraping and Preprocessing the Fetched content
We extract the textual content of a webpage using `BeautifulSoup` and `lxml`. Since the original content is raw HTML, it is noisy, so we remove all irrelevant tag elements and perform data cleaning to obtain the clean text content of the page.
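A sketch of this cleaning step is shown below; the exact set of tags removed in our code may differ.
```python
from bs4 import BeautifulSoup

def extract_text(html):
    """Illustrative sketch: strip non-content elements and return plain text."""
    soup = BeautifulSoup(html, "lxml")
    # Drop elements that carry no article text (the tag list here is illustrative).
    for tag in soup(["script", "style", "noscript", "header", "footer", "nav"]):
        tag.decompose()
    # Collapse the remaining text and normalize whitespace.
    return " ".join(soup.get_text(separator=" ").split())
```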
### Iterative Set Expansion
#### Document Preprocessing
For efficiency, we truncate large documents to 20,000 characters.
#### Data Structure: Extracted Relations
We implemented a special data structure that encapsulates the management of extracted relation tuples. Its main methods are listed below, followed by a simplified sketch.
| Method | Description |
| - | - |
| `add_tuple_conf(tuple, conf)` | Add a tuple whose confidence is above the threshold if it does not yet exist in the extracted relations. If it already exists, compare against the confidence of the existing tuple and keep the higher of the two values. <br/><br/> Note that we normalize tuples by **lowercasing** them before adding them to the extracted relations. |
| `is_processed(tuple)` | Check whether a tuple has already been processed |
| `mark_tuple_as_processed(tuple)` | Mark a tuple as processed. This happens whenever we finish processing a tuple in an ISE iteration after the first one (in the first iteration we process the initial query, not a tuple) |
| `get_unprocessed_tuple()` | Return the unprocessed tuple with the highest confidence, by sorting tuples in descending order of confidence and skipping tuples that have already been processed |
| `get_ordered_tuples_by_conf()` | Return the list of (tuple, confidence) pairs sorted in descending order of confidence |
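The sketch below follows the method names in the table; the internal representation is simplified for illustration.
```python
class ExtractedRelations:
    """Simplified sketch of the extracted-relations store described above."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.relations = {}      # normalized tuple -> best confidence seen
        self.processed = set()   # tuples already used as ISE queries

    def add_tuple_conf(self, tup, conf):
        if conf < self.threshold:
            return
        tup = tuple(part.lower() for part in tup)   # case normalization
        # For duplicates, keep the highest confidence seen so far.
        if conf > self.relations.get(tup, 0.0):
            self.relations[tup] = conf

    def is_processed(self, tup):
        return tup in self.processed

    def mark_tuple_as_processed(self, tup):
        self.processed.add(tup)

    def get_ordered_tuples_by_conf(self):
        return sorted(self.relations.items(), key=lambda kv: kv[1], reverse=True)

    def get_unprocessed_tuple(self):
        for tup, _ in self.get_ordered_tuples_by_conf():
            if not self.is_processed(tup):
                return tup
        return None
```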
#### First Annotation Pipeline: Named Entity Recognition (NER)
In this first annotation pipeline, we use the `stanfordnlp` library to split the cleaned scraped document into sentences, run NER annotation, and filter the sentences by checking whether each sentence contains the named-entity tags required by the chosen relation. Sentences that satisfy this condition are added to a list of filtered sentences, which is later used by the second annotation pipeline.
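A sketch of this filtering step is shown below. It assumes the document is annotated through the CoreNLP client bundled with `stanfordnlp`; the mapping from relation to required entity types is an example and may not match our code exactly.
```python
from stanfordnlp.server import CoreNLPClient

LOCATION_TAGS = {"LOCATION", "CITY", "STATE_OR_PROVINCE", "COUNTRY"}

# Example mapping: a sentence must contain at least one entity from every group.
REQUIRED_NER = {
    "Schools_Attended": [{"PERSON"}, {"ORGANIZATION"}],
    "Work_For": [{"PERSON"}, {"ORGANIZATION"}],
    "Live_In": [{"PERSON"}, LOCATION_TAGS],
    "Top_Member_Employees": [{"ORGANIZATION"}, {"PERSON"}],
}

def filter_sentences(text, relation):
    """Illustrative sketch: keep sentences whose NER tags match the relation."""
    filtered = []
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner"],
                       timeout=30000, memory="4G") as client:
        doc = client.annotate(text)
        for sentence in doc.sentence:
            tags = {token.ner for token in sentence.token}
            if all(tags & group for group in REQUIRED_NER[relation]):
                filtered.append(" ".join(token.word for token in sentence.token))
    return filtered
```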
#### Second Annotation Pipeline: Knowledge Base Population (KBP)
In the second annotation pipeline, we use the `stanfordnlp` library to take each sentence from the filtered list and run KBP annotation on it to extract relation tuples. We first check whether the sentence contains a KBP relation matching the chosen relation; if it does, we check whether the confidence is higher than the threshold, and if so we add the extracted relation tuple to the extracted-relations data structure.
Note that because the text is scraped from web pages, sentence structure can be unreliable. We initially encountered problems when annotating very long sentences, so to avoid this issue we limit the sentences passed to the KBP annotator to 500 characters.
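A sketch of this step is shown below, again using the CoreNLP client through `stanfordnlp`. The `target_kbp_relations` argument stands for the set of KBP relation names corresponding to the chosen relation (e.g., `per:schools_attended` for Schools_Attended), and `store` is the extracted-relations data structure sketched earlier; the annotator list and field access are assumptions for illustration, not necessarily our exact code.
```python
from stanfordnlp.server import CoreNLPClient

MAX_SENT_CHARS = 500   # sentences longer than this are skipped, as described above

def extract_relations(sentences, target_kbp_relations, threshold, store):
    """Illustrative sketch: run KBP on the filtered sentences and collect tuples."""
    annotators = ["tokenize", "ssplit", "pos", "lemma", "ner",
                  "depparse", "coref", "kbp"]
    with CoreNLPClient(annotators=annotators, timeout=60000, memory="4G") as client:
        for sent in sentences:
            if len(sent) > MAX_SENT_CHARS:
                continue
            doc = client.annotate(sent)
            for sentence in doc.sentence:
                for triple in sentence.kbpTriple:
                    # Keep triples of the chosen relation above the threshold.
                    if (triple.relation in target_kbp_relations
                            and triple.confidence >= threshold):
                        store.add_tuple_conf((triple.subject, triple.object),
                                             triple.confidence)
```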
### Observation of the output
When we ran queries and compared our results against the transcripts from the reference implementation, we noticed several factors that can contribute to differences in the results.
#### Google search engine
Google search results vary over time because they are based on page ranking, which changes constantly due to search hits, content changes, and improvements to the search algorithm. Therefore, for the same query, the URLs retrieved in each iteration can differ over time, and as a result the extracted relations can also differ.
#### Non-static webpage
In addition, some news webpages display related articles with snippets on the same page for easy browsing. The related articles shown on a page change over time as newer, more relevant articles replace older ones. Therefore, even though we scrape the same webpage, the tuples extracted from it can change over time because of differences in these related articles.
## Credentials
<!-- - Your Google Custom Search Engine JSON API Key and Engine ID (so we can test your project) -->
| VM | Search Engine API Key | Search Engine ID |
| - | - | - |
| ccc2207 | AIzaSyCu-UfCuTGjzX0cGktfKoN6iC5a3eFci8Y | 018423500619609660246:0jukua5kuhl |
| da2897 | AIzaSyD5mzbTkFuhB-8mCwraRz7KmQBQxXQatTM | 004590458276941574971:abdovzhok53 |

| Google Project ID | VM ID | User |
| - | - | - |
| graceful-creek-266703 | cs6111-instance | cherrychu_120 |
| coms6111-268404 | cs6111-instance | da2897 |
## Additional Information
<!-- Any additional information that you consider significant -->
1. We perform case normalization on the extracted tuples by lowercasing them.
2. When checking whether a query has already been processed, we also check whether the current query is a subset of a previously processed query (see the sketch after this list).
3. Aside from truncating documents to 20,000 characters, we also limit sentence length to 500 characters during the second (KBP) annotation pipeline.
4. There are several issues and caveats when running ISE over dynamic webpages, such as changing search results and changing page content, as described in [Observation of the output](#observation-of-the-output).
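A minimal sketch of the subset check mentioned in item 2 (names here are illustrative):
```python
def query_already_processed(query, processed_queries):
    """True if the query was processed before or is a word-subset of a processed query."""
    words = set(query.lower().split())
    return any(words <= set(prev.lower().split()) for prev in processed_queries)
```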