Deka Auliya Akbar
# COMS6111 Project 2

## Team Members
<!-- Your name and Columbia UNI, and your teammate's name and Columbia UNI -->

| Name | UNI |
| - | - |
| Cherry Chu | ccc2207 |
| Deka Auliya Akbar | da2897 |

<!-- A README file including the following information: -->
## Files in Submission
<!-- - A list of all the files that you are submitting -->

| File | Description |
| - | - |
| `requirements.txt` | List of required packages |
| `env-vm.yml` | List of required packages for the conda environment on the VM |
| `run.py` | The command line interface for the Iterative Set Expansion (ISE) program, which extracts any K tuples from documents |
| `config.py` | Constants and configuration for the program, such as MAX_CHARS, ITERATION_LIMIT, and TIMEOUT |
| `ise_extract.py` | The main controller / orchestrator of the whole ISE pipeline, from the initial query to extracting k-tuple relations |
| `relation_utils.py` | A special data structure and utilities for storing extracted relations |
| `search_utils.py` | Utilities for searching and scraping documents for a query |
| `output` | A folder that contains the transcripts of our implementation |

## How to Run the Program
<!-- - A clear description of how to run your program. Note that your project must compile/run in a Google Cloud VM that you set up exactly following our instructions. Provide all commands necessary to install the required software and dependencies for your program. -->

### How to Setup the VM for Required Dependencies
<!--
- install python 3.7
- install conda
- update conda environments
- need to install requirements from conda env file / requirements.txt

TODO: -> check if it really works on a VM
-->

1. Install python3.7 on VM

   ```bash
   # generic update
   sudo apt-get update

   # installing stuffs
   sudo apt-get install git python-virtualenv python-dev

   # install python 3.7
   # Follow https://linuxize.com/post/how-to-install-python-3-7-on-ubuntu-18-04/
   sudo apt install software-properties-common
   sudo add-apt-repository ppa:deadsnakes/ppa
   sudo apt install python3.7
   ```

2. Install conda on VM

   ```bash
   # install miniconda
   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh

   # run installer for miniconda (agree with the options)
   bash ~/miniconda.sh
   source ~/.bashrc

   # don't automatically run conda at initial
   conda config --set auto_activate_base false
   ```

3. Setup conda Environment

   Option 1: Create env from env file (recommended)

   ```bash
   # create new environment
   conda env create -f env-vm.yml
   ```

   Option 2: Create env from scratch with env file

   ```bash
   # create virtualenv if it doesn't exist
   conda create --name 6111 python=3.7
   conda activate 6111

   # update with env file
   conda env update -f env-vm.yml --name 6111
   ```

   Option 3: Install from requirements.txt

   ```bash
   # create virtualenv if it doesn't exist
   conda create --name 6111 python=3.7
   conda activate 6111
   pip3 install -r requirements.txt
   ```

   Option 4: Create env from scratch and install packages manually

   ```bash
   # create virtualenv if it doesn't exist
   conda create --name 6111 python=3.7
   conda activate 6111

   # update by manually installing packages
   conda install requests beautifulsoup4
   pip install --upgrade google-api-python-client
   conda install -c conda-forge python-dotenv lxml
   pip install stanfordnlp
   ```

### How to Run the Program

1. Change directory and activate virtual environment

   ```bash
   # from home directory of a VM user (see credentials below)
   # gcloud beta compute --project [PROJECT_ID] ssh --zone "us-east1-d" "[user]@cs6111-instance"
   cd 6111project

   # or if using tar gz
   # tar -xzvf proj2.tar.gz
   # cd proj2

   # activate conda environment
   conda activate 6111
   ```

2. Make the Program Executable

   We have already made the program executable, but in case the permission on the file has changed, restore it with:

   `chmod +x ./run.py`

3. Run the Program

   `./run.py [API_KEY] [SEARCH_ID] [RELATION] [THRESHOLD] "[QUERY]" [K]`

   **Arguments:**
   - RELATION: an integer between 1-4, 1: Schools_Attended, 2: Work_For, 3: Live_In, 4: Top_Member_Employees
   - THRESHOLD: a float indicating the "Extraction Confidence Level"
   - QUERY: a sequence of words indicating the tuple of [KEYWORD ATTRIBUTE]
   - K: an integer greater than 0, the number of tuples requested in the output

## Project Design
<!-- - A clear description of the internal design of your project, explaining the general structure of your code (i.e., what its main high-level components are and what they do), as well as acknowledging and describing all external libraries that you use in your code -->

### General Structure of the Code
<!--
Add some diagrams and pipeline
- CLI App
- ISE Component
- Relation Component
- Search and Scraper Component
-->

| Component | Source Code | Description |
| - | - | - |
| CLI Application | `run.py` | The main command line interface, which receives the initial user input (secret credentials, type of relation, initial query, threshold, and K) and calls `IterativeSetExpansion` to extract relations |
| Iterative Set Expansion (ISE) pipeline | `IterativeSetExpansion.py` | The controller / orchestrator of the whole query expansion pipeline, from the initial query through the query expansion iterations |
| Extracted Relations | `relation_utils.py` | A special data structure and utilities that store the extracted relations |
| Search | `search_utils.py` | Methods for searching documents given a query |
| Scraper | `search_utils.py` | Methods for preprocessing, scraping, and post-processing scraped documents |

### Use of External Libraries

| Library | Usage Description |
| - | - |
| `google-api-python-client` | For searching web documents given a query term |
| `requests` | For scraping web documents |
| `beautifulsoup4` | For parsing the scraped web documents |
| `pdfminer.six` | For scraping PDF documents |
| `lxml` | For parsing the scraped web documents |
| `stanfordnlp` | For processing text data and extracting named entities and KBP relations |

## Project Implementation of Iterative Set Expansion (ISE)
<!-- A detailed description of how you carried out Step 3 in the "Description" section above -->

### Overview
<!-- The overall pipeline of the query expansion program from user input to termination is depicted in the figure below. -->

![Overall Pipeline](https://i.imgur.com/ATuX3xh.png)

The diagram above depicts the overall pipeline, from the user query to extracting k relation tuples.

<!--
- Add some diagrams and pipeline
- Overall Pipeline:
- User and Flow
- user issued a query
- scrape documents from the query result (scrapable or not)
- scrape if scrapable and preprocess the document, truncate
- for the doc, do the first ner annotation pipeline -> tokenization of tokens, sentence, and ner tagging
- filter sentences based on the ner tags according to the ner tags of the chosen relation
- if sentence matches the condition, fed this sentence to the second kbp annotation pipeline -> kbp tagging
- check if the kbp matches the chosen relation
- if matches, check the confidence level
- if it's above the confidence
- add tuple to extracted relation data structure
-->

### Scraping Method

We used the `requests`, `BeautifulSoup4`, and `lxml` packages to scrape and preprocess pages.

#### Fetch Content from a Webpage

We only scrape HTML documents, so we use the `mime` metadata of each search result to check whether it is an HTML page. Next, we only scrape documents that are permitted to be scraped, which we determine by checking the `robots.txt` of the website. If scraping is allowed, we fetch the content of the HTML webpage by following the `link` in the result item using `requests`. We also examine the HTTP response code, as some webpages respond with an error code such as 404. If the page is available for scraping, we scrape the content.

#### Scraping and Preprocessing the Fetched content

We extract the textual content of a webpage using `BeautifulSoup` and `lxml`. Since the original text is in HTML format, it is noisy, so we remove all the irrelevant tag elements and perform data cleaning to get the cleaned text content of the HTML page.

### Iterative Set Expansion

#### Document Preprocessing

For efficiency, we truncate large documents to 20000 characters.

#### Data Structure: Extracted Relations

We implemented a special data structure to encapsulate the management of extracted relation tuples. Its methods include:

| Method | Description |
| - | - |
| `add_tuple_conf(tuple, conf)` | Adds a tuple whose confidence is higher than the threshold level if it does not yet exist in the extracted relations. If it already exists, checks the confidence of the existing tuple and updates the conf value if the new conf level is higher. <br/><br/> Note that we normalize tuples by **lowercasing** them before adding them to the extracted relations. |
| `is_processed(tuple)` | Checks whether a tuple has been processed |
| `mark_tuple_as_processed(tuple)` | Marks a tuple as processed. This occurs whenever we finish processing a new tuple in an ISE iteration after the first one (in the first iteration we process the initial query, not a tuple) |
| `get_unprocessed_tuple()` | Gets the next unprocessed tuple with the highest confidence, by sorting tuples in descending order of conf and checking whether each tuple has been processed |
| `get_ordered_tuples_by_conf()` | Returns the list of tuples and conf values sorted in descending order of conf |

#### First Annotation Pipeline: Named Entity Recognition (NER)

In this first annotation pipeline, we use the `stanfordnlp` library to tokenize the cleaned scraped document into sentences, perform NER annotation, and filter the sentences by checking whether each sentence contains the NER tags required by the chosen relation. If a sentence matches this condition, we add it to a list of filtered sentences, which is later used by the second annotation pipeline.

#### Second Annotation Pipeline: Knowledge Base Population (KBP)

In the second annotation pipeline, we use the `stanfordnlp` library to take a sentence from the filtered sentences and perform KBP annotation to extract relation tuples whose confidence is higher than the threshold. To do this, we first check whether the sentence contains a KBP relation matching the chosen relation. If it does, we check whether the conf is higher than the threshold, and if so we add the extracted relation tuple to the extracted relation data structure.

Note that because the text is scraped from web pages, the sentence structure is uncertain. We initially encountered problems when annotating very long sentences, so to avoid this issue we limit each sentence to be annotated to 500 characters.

### Observation of the output

When we ran queries and compared the results against the transcripts from the reference implementation, we noticed some factors that can contribute to differences in the results.

#### Google search engine

Google search results vary over time because they are based on page ranking, which changes constantly due to search hits, content changes, and improvements to the search algorithm. Therefore, for the same query, the URLs extracted in each iteration can differ over time, and as a result the extracted relations can also differ.

#### Non-static webpage

In addition, some news webpages show related articles with snippets on the same page for easy browsing. The related articles on a webpage are subject to change over time, as more recent and relevant articles replace the old ones. Therefore, even though we are scraping the same webpage, the tuples extracted from it can change over time because the related articles on the page differ.

## Credentials
<!-- - Your Google Custom Search Engine JSON API Key and Engine ID (so we can test your project) -->

| VM | Search Engine API Key | Search Engine ID |
| - | - | - |
| ccc2207 | AIzaSyCu-UfCuTGjzX0cGktfKoN6iC5a3eFci8Y | 018423500619609660246:0jukua5kuhl |
| da2897 | AIzaSyD5mzbTkFuhB-8mCwraRz7KmQBQxXQatTM | 004590458276941574971:abdovzhok53 |

| Google Project ID | VM ID | User |
| - | - | - |
| graceful-creek-266703 | cs6111-instance | cherrychu_120 |
| coms6111-268404 | cs6111-instance | da2897 |

## Additional Information
<!-- Any additional information that you consider significant -->

1. We perform case normalization on the extracted tuples, converting them to lowercase.
2. When checking whether a query has already been processed, we also check whether the whole current query is a subset of a previously processed query.
3. Aside from truncating each document to 20000 characters, we also limit the length of each sentence in the second (KBP) annotation pipeline to 500 characters.
4. There are several issues and caveats when implementing ISE over dynamic webpages, such as changing search results and changing content, as described in [Observation of the output](#Observation-of-the-output).
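
As an illustration of the extracted-relations behaviour described in this README (lowercase normalization, keeping the higher confidence for a duplicate tuple, and picking the next unprocessed tuple by confidence), here is a minimal sketch. Only the method names come from the method table in this README. The class name `ExtractedRelations`, the `threshold` constructor argument, the boolean return value of `add_tuple_conf`, and the choice to accept a confidence exactly equal to the threshold are assumptions made for illustration; this is not the actual `relation_utils.py`.

```python
from typing import Dict, List, Optional, Set, Tuple

Relation = Tuple[str, ...]


class ExtractedRelations:
    """Hypothetical store of relation tuples and their confidences."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._conf: Dict[Relation, float] = {}   # normalized tuple -> best conf
        self._processed: Set[Relation] = set()   # tuples already used as queries

    @staticmethod
    def _normalize(tup: Relation) -> Relation:
        # Case normalization: every element of the tuple is lowercased.
        return tuple(part.lower() for part in tup)

    def add_tuple_conf(self, tup: Relation, conf: float) -> bool:
        """Store the tuple if it passes the threshold and is new or improved."""
        if conf < self.threshold:  # assumption: conf == threshold is accepted
            return False
        key = self._normalize(tup)
        if conf > self._conf.get(key, float("-inf")):
            self._conf[key] = conf
            return True
        return False

    def is_processed(self, tup: Relation) -> bool:
        return self._normalize(tup) in self._processed

    def mark_tuple_as_processed(self, tup: Relation) -> None:
        self._processed.add(self._normalize(tup))

    def get_ordered_tuples_by_conf(self) -> List[Tuple[Relation, float]]:
        # Highest confidence first.
        return sorted(self._conf.items(), key=lambda item: item[1], reverse=True)

    def get_unprocessed_tuple(self) -> Optional[Relation]:
        # First tuple, in descending confidence order, not yet processed.
        for tup, _conf in self.get_ordered_tuples_by_conf():
            if tup not in self._processed:
                return tup
        return None


# Example usage with made-up tuples and confidences.
store = ExtractedRelations(threshold=0.7)
store.add_tuple_conf(("Bill Gates", "Microsoft"), 0.9)
store.add_tuple_conf(("bill gates", "microsoft"), 0.8)  # same tuple, lower conf: ignored
next_query = store.get_unprocessed_tuple()
```

Keeping the confidences in a dictionary keyed by the normalized tuple makes the duplicate check and the "keep the higher confidence" update a single lookup, at the cost of re-sorting whenever the next unprocessed tuple is requested.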
