# Home Office

## Bodies of Work

At a high level, the work required for Home Office breaks down into four buckets:

* Data Parsing
    * E-mail Parsing
    * OCR
* Data Processing (Batch)
    * Apply algorithms and create summary JSON to power the UI
* Algorithms
    1) Extractive
    2) Classifiers
* UI

## Sprint Roadmap

1) Sprint 1 (wrapping July 27th)
    * Batch processing/UI pipeline working
    * Demonstrate the solution can identify key names, locations, dates and identify the name of an MP and their constituency
        * Should probably incorporate Robinson's improvements here (QALEX + People Rez + Affiliations)
    * Incorporate @Kiran's UI improvements
    * Identify the questions being asked by the MP and constituent
    * Prove the ability to link the questions in outcome 2 to the 'application type'
2) Sprints 2-4 (???)
    * Data Science creates extractive/classification functions based on user input
    * Templated NLG
    * Integrate Robinson's People
    * Better drill-down to document
    * Work with client to draft report/presentation deck

## Data Processing + Parsing

1) Start with a case, consisting of an e-mail and multiple attachments
2) Parse attachments (running OCR if need be)
3) Run classification (sentence- and document-level) and extraction algorithms against each document, generating a doc-level JSON (see below)
4) Run code on all of the doc-level JSONs to aggregate and resolve the doc-level extractions/classifications into one case-level JSON file to power the UI

### Document-level JSON

Stored as `<doc_id>.json` (maybe hash the original filenames, assuming they're unique?). This would tell the UI where to find the local file `/standard/path/to/files/case-id/<doc_id>.<extension>`

```
{
    "extractions": [
        {
            "type": "mp_name",
            "span": [115, 150],
            "val": "Dominic Raab",
            "pred_proba": 0.95
        },
        {
            "type": "person",
            "span": [1115, 1150],
            "val": "Kiran Smith",
            "pred_proba": 0.91
        },
        ...
    ],
    "sentence_classifications": [
        {
            "type": "compassionate_need",
            "span": [160, 250],
            "key_span": [190, 210],
            "pred_proba": 0.85
        },
        ...
    ],
    "document_classifications": [
        {
            "type": "from_mp",
            "span": null,
            "pred_proba": 0.99
        },
        ...
    ],
    "text": "This is the full doc text to slice with spans.",
    "extension": "pdf"
}
```

### Summary-level JSON

Developed by Kiran based on UI considerations; see, e.g., https://github.com/PrimerAI/home-office/blob/4a6c2b7db814de830a9439c803d254bc5385eb37/ukvi-frontend/src/components/CaseView/_sample_case_3.json.

## Algorithms

### Extractive

Could be achieved by regex + logic or a QA model. Explicitly mentioned:

1) MP Name and Address
2) Constituent Name and Address
3) Constituent Nationality
4) Constituent DOB
5) Case Reference #
6) Future Dates
7) Questions from MP/Constituent

Potentially valuable as well:

* Key numbers (re-use primer-core)

### Classifiers

All explicitly mentioned. Sentence-level feels like the right level of granularity here for the following:

1) Compassionate Need
    * Funeral
    * Death/orphaned
    * Forced marriage
    * Critical/terminal illness
    * Emergency
    * Child exploitation
    * FGM
    * Rape/sexual assault
    * Torture
    * Persecution
    * Domestic violence
2) Process Urgency
    * Urgent travel
    * Deportation
    * Removal
    * Detention
    * Persecution
    * Asylum
3) Disqualifying Criteria
    * Criminal
4) Less urgent reasons (e.g., mental health conditions)
5) Complaint
6) Other British government departments
7) FCO or foreign governments
8) Asks/Request for Urgency
9) Expedite
10) Request
11) Sponsor Details (need more clarity here, though)

Document-level "classifiers"/filters:

* From MP
* From Constituent
* Angry/charged language (could also handle at sentence level -- not sure here?)
* Legal language (e.g., referencing sub-clauses of various documents, etc.)
* Irrelevant docs
* Passport scans

## UI

Just use a good component library (Carbon FTW!).

## Appendix

### User Workflow

1) Is the UK Home Office the right group to triage?
2) If so, is it urgent?
    * If not: form response
    * If so: commission to a different group
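
### Aggregation Sketch

Step 4 of Data Processing + Parsing (resolving the doc-level JSONs into one case-level JSON) could look roughly like the sketch below. It is only an illustration under assumptions: the function name `aggregate_case`, the `min_proba` threshold, and the output keys `extractions` and `sentence_flags` are hypothetical, and the output shape is not the summary-level schema the UI actually uses. It keeps the highest-confidence occurrence of each extracted value across documents and slices the sentence text out of each document using its span.

```python
from collections import defaultdict


def aggregate_case(docs, min_proba=0.5):
    """Merge doc-level dicts ({doc_id: doc_json}) into one case-level dict.

    The output shape here is hypothetical, not the real summary-level schema.
    """
    best = defaultdict(dict)  # extraction type -> value -> best record so far
    flags = []                # sentence-level classifications across all docs

    for doc_id, doc in docs.items():
        for ext in doc.get("extractions", []):
            if ext["pred_proba"] < min_proba:
                continue  # drop low-confidence extractions
            seen = best[ext["type"]].get(ext["val"])
            if seen is None or ext["pred_proba"] > seen["pred_proba"]:
                # Remember where the most confident occurrence came from,
                # so the UI can drill down to the source document.
                best[ext["type"]][ext["val"]] = {
                    "val": ext["val"],
                    "pred_proba": ext["pred_proba"],
                    "doc_id": doc_id,
                    "span": ext["span"],
                }

        for cls in doc.get("sentence_classifications", []):
            if cls["pred_proba"] < min_proba:
                continue
            start, end = cls["span"]
            flags.append({
                "type": cls["type"],
                "pred_proba": cls["pred_proba"],
                "doc_id": doc_id,
                "text": doc["text"][start:end],  # slice the sentence from the doc text
            })

    return {
        "extractions": {
            ext_type: sorted(by_val.values(),
                             key=lambda r: r["pred_proba"], reverse=True)
            for ext_type, by_val in best.items()
        },
        "sentence_flags": sorted(flags,
                                 key=lambda f: f["pred_proba"], reverse=True),
    }
```

Calling `aggregate_case({"<doc_id>": doc_json, ...})` with the parsed contents of each `<doc_id>.json` returns a single dict that could be written out as the case-level file. Deduplicating on the exact `val` string is a simplification; resolving name variants (e.g. via the people resolution work mentioned in the roadmap) would replace that step.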