# NLP Resume-Extraction Master

###### tags: `Meet.jobs` `NLP`

Last updated: 2022/08/30 by @husohome

# Purpose

This document gives an overview of the workflow of the resume-extraction AI.

- The main functionality we need now is to turn an uploaded resume (.pdf, or even .docx) into structured data in any form (it doesn't have to be tabular).
- The AI model may tabulate the parsed results of the resume into whatever fields it sees fit, while still letting applicants manually correct the results. The corrections can be fed back to the AI to make it smarter.
- Preferably, the structure should facilitate future tasks such as
    - recommending applicants to specific job positions or companies
    - extracting key phrases, or tapping into the skills and competencies implied by achievement statements
    - predicting successful hires (and selling this knowledge to the public or to applicants)

# Guiding Principles

- We should avoid training models from the ground up, and instead simply feed new training data to existing, pretrained models.
- The structure of the AI-parsed data should be compatible with, or easily adaptable to, the existing conventions of platforms and APIs in trades similar to Meet.jobs', including LinkedIn's API.
- **propose more**

# Workflow

The workflow will be organic, involving multiple rounds of trial and error and modification along the way. Each plan is given a nickname as a mnemonic. More specifically, the plans differ in how we label entities, and each always starts with defining a working protocol for the entity labels.

# Plan - Divide and Conquer

## 1. Overview

```graphviz
digraph D {
    node[shape="rect"]
    sectioning[shape="oval", label="Sectioning AI"]
    sec1[label="Section 1\n(e.g., Education)"]
    sec2[label="Section 2\n(e.g., Work)"]
    sec3[label="Section 3\n(e.g., Personal Info)"]
    text[label="Plain Text"]
    Resume -> text[label="either OCR or tika/pyPDF2"]
    text -> Preprocessor -> sectioning -> sec2
    sectioning -> sec1[label="classifies into"]
    sectioning -> sec3 [label="..."]
    m1[label="AI\nfor Section 1", shape="oval"]
    m2[label="AI\nfor Section 2", shape="oval"]
    m3[label="AI\nfor Section 3", shape="oval"]
    p1[label="Specialized\nPreprocessor\nfor\nSection 1" style="dotted"]
    p2[label="Specialized\nPreprocessor\nfor\nSection 2" style="dotted"]
    p3[label="Specialized\nPreprocessor\nfor\nSection 3" style="dotted"]
    sec1 -> p1 -> m1
    sec2 -> p2 -> m2
    sec3 -> p3 -> m3
    subgraph cluster_m1 {
        m1 -> m11[label="all components\ncould be further divided\ninto smaller modules (AIs)"]
        m11 [label="More Specialized AI"]
    }
}
```

1. Incoming resume PDFs are transformed into plain text; the `Preprocessor` then cleans up the text and makes it slightly more readable.
2. A sectioning AI classifies parts of the resume into different sections, such as Education, Work, Personal Info, Skills, etc.
3. For each section, a specialized preprocessor and AI model are trained to extract information (using `Named Entity Recognition`).

## 2. Prototyping and Defining Protocols

:::info
:bulb: For prototyping only

1. We use `pyPDF` in Python to transform PDFs to text (see the sketch below this box).
    - If it proves inadequate, we decide among `pyPDF2`, `tika`, or `easyOCR`.
2. We define *no preprocessors nor specialized preprocessors for the sections*.
    - If needed later, we optimize the following components: stopword removal, lemmatization, stemming, etc.
3. We only check the efficacy of the AI model for the `Work` section.
:::
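The following is a minimal sketch of the prototyping pipeline described in the box above: extract plain text with `pypdf` (the maintained successor of `pyPDF`/`pyPDF2`) and run a pretrained spaCy model over it, in line with the guiding principle of reusing pretrained models rather than training from scratch. The file name and the `en_core_web_sm` model are placeholders; our own entity labels will come from a fine-tuned model, not spaCy's default label set.

```python
# Minimal prototyping sketch: PDF -> plain text -> pretrained NER.
# Assumes `pip install pypdf spacy` and
# `python -m spacy download en_core_web_sm` have been run.
from pypdf import PdfReader
import spacy


def pdf_to_text(path: str) -> str:
    """Concatenate the extracted text of every page (no preprocessing, per protocol)."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


if __name__ == "__main__":
    text = pdf_to_text("resume.pdf")  # placeholder file name

    # Pretrained pipeline as a stand-in; a fine-tuned model replaces it later.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.label_, "->", ent.text)
```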
1. - [ ] Have a preliminary version of `entity labels` and `sections` ready by `some deadline (let's discuss)`. See [Meet.jobs Resume Extraction NLP - Entity and Section Labels](/AwbyyXrJSfWUnsyn1dxKhw) for documentation.
    - [ ] Repeat the steps below until inter-annotator agreement is 95% or higher.
    - [ ] Nick asks HR and marketing experts at Meet.jobs and checks compatibility with existing API conventions by `some deadline (let's discuss)`.
    - [ ] Nick, Tony, and PL separately annotate the same 100 resumes as a trial, to see whether the entity labels and sections make sense, by `some deadline (let's discuss)`. During the process, we discuss the feasibility of the entity labels.
    - [ ] Nick, Tony, and PL separately annotate the same 100 resumes, and Nick checks agreement until everyone is on the same page.
2. - [ ] Nick has the classifier model for sectioning ready by `some deadline (let's discuss)`. Tony and PL can join if they so desire. (A hypothetical baseline is sketched after this list.)
3. - [ ] Have a preliminary model ready for parsing the `Work` section by `some deadline (let's discuss)`.
    - [ ] Initial run
    - [ ] Hyperparameter optimization (usually takes one week)
4. `Let's define what is good performance.` (See the evaluation sketch at the end.)
    - Evaluated against public resume data?
    - On our data: precision? Recall? F1? Accuracy?
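As a point of reference for the sectioning classifier (item 2 above), a hypothetical bag-of-words baseline could classify each text block of a resume into a section label with TF-IDF features and logistic regression. The training examples and section labels below are invented for illustration and should be replaced by our annotated data.

```python
# Hypothetical baseline for the sectioning classifier: classify each
# text block of a resume into a section label. Toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_blocks = [
    "B.Sc. in Computer Science, National Taiwan University, 2013-2017",
    "Software Engineer at Acme Corp: built ETL pipelines in Python",
    "Email: jane@example.com | Taipei, Taiwan | +886-9xx-xxx-xxx",
]
train_sections = ["education", "work", "personal_info"]

# TF-IDF over unigrams/bigrams feeding a multiclass logistic regression.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_blocks, train_sections)

# Predict the section of a new, unseen block.
print(clf.predict(["M.Sc. in Statistics, 2019-2021"]))
```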
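To make "good performance" concrete for item 4, one common convention for NER is entity-level precision/recall/F1 with exact span matching: a predicted entity counts as correct only if its character span *and* label both match a gold annotation. The same function can double as a crude pairwise agreement check between annotators. This is a sketch of one possible convention (partial-overlap credit is another option), not a decided protocol; the spans and labels in the example are invented.

```python
# Entity-level precision/recall/F1 with exact (start, end, label) matching.
# One possible evaluation convention; not yet agreed upon.

def entity_prf1(gold: set, pred: set) -> tuple:
    """gold/pred are sets of (start, end, label) tuples for one document."""
    tp = len(gold & pred)                      # exact span-and-label matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with invented spans and labels:
gold = {(0, 15, "JOB_TITLE"), (19, 28, "COMPANY")}
pred = {(0, 15, "JOB_TITLE"), (19, 30, "COMPANY")}  # second span is off by two chars
print(entity_prf1(gold, pred))  # -> (0.5, 0.5, 0.5)
```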