# Home Office
## Bodies of Work
At a high level, the work required for Home Office breaks down into four buckets:
* Data Parsing
  * E-mail Parsing
  * OCR
* Data Processing (Batch)
  * Apply algorithms and create summary JSON to power the UI
* Algorithms
  1) Extractive
  2) Classifiers
* UI
## Sprint Roadmap
1) Sprint 1 (Wrapping July 27th)
   * Batch processing/UI pipeline working
   * Demonstrate that the solution can identify key names, locations, and dates, and identify the name of an MP and their constituency
   * Should probably incorporate Robinson's improvements here (QALEX + People Rez + Affiliations)
   * Incorporate @Kiran's UI improvements
   * Identify the questions being asked by the MP and constituent
   * Prove the ability to link the questions in outcome 2 to the 'application type'
2) Sprints 2-4 (???)
   * Data Science creates extractive/classification functions based on user input
   * Templated NLG
   * Integrate Robinson's People
   * Better drill-down to document
   * Work with client to draft report/presentation deck
## Data Processing + Parsing
1) Start with a case, consisting of an e-mail and multiple attachments
2) Parse attachments (running OCR if need be)
3) Run classification (sentence- and document-level) and extraction algorithms against each document, generating a doc-level JSON (see below)
4) Run code on all of the doc-level JSONs to aggregate and resolve the doc-level extractions/classifications into one case-level JSON file to power the UI
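Step 4 could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the names `aggregate_case` and `load_case`, the output keys, and the rule of keeping the highest-`pred_proba` value per extraction type are all assumptions, and the real case-level schema is driven by the UI.

```python
import json
from collections import defaultdict
from pathlib import Path


def aggregate_case(doc_jsons):
    """Merge doc-level dicts (keyed by doc_id) into one case-level dict.

    Hypothetical resolution rule: for each extraction type, keep the
    value with the highest pred_proba seen across all documents.
    """
    best = {}
    labels = defaultdict(list)
    for doc_id, doc in doc_jsons.items():
        for ext in doc.get("extractions", []):
            current = best.get(ext["type"])
            if current is None or ext["pred_proba"] > current["pred_proba"]:
                # Remember which document the winning value came from.
                best[ext["type"]] = {**ext, "doc_id": doc_id}
        for key in ("sentence_classifications", "document_classifications"):
            for cls in doc.get(key, []):
                labels[cls["type"]].append(
                    {
                        "doc_id": doc_id,
                        "span": cls["span"],
                        "pred_proba": cls["pred_proba"],
                    }
                )
    return {"extractions": best, "classifications": dict(labels)}


def load_case(case_dir):
    """Read every <doc_id>.json in a case directory (layout is assumed)."""
    return {
        path.stem: json.loads(path.read_text())
        for path in Path(case_dir).glob("*.json")
    }
```

Keeping `doc_id` on every aggregated item is what would let the UI drill down from the case view to the source document.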
### Document-level JSON
Stored as `<doc_id>.json` (maybe hash the original filenames, assuming they're unique?). This would tell the UI where to find the local file: `/standard/path/to/files/case-id/<doc_id>.<extension>`.
```
{
  "extractions": [
    {
      "type": "mp_name",
      "span": [115, 150],
      "val": "Dominic Raab",
      "pred_proba": 0.95
    },
    {
      "type": "person",
      "span": [1115, 1150],
      "val": "Kiran Smith",
      "pred_proba": 0.91
    },
    ...
  ],
  "sentence_classifications": [
    {
      "type": "compassionate_need",
      "span": [160, 250],
      "key_span": [190, 210],
      "pred_proba": 0.85
    },
    ...
  ],
  "document_classifications": [
    {
      "type": "from_mp",
      "span": null,
      "pred_proba": 0.99
    },
    ...
  ],
  "text": "This is the full doc text to slice with spans.",
  "extension": "pdf"
}
```
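As a rough sketch of how a consumer might use this structure: the helper names below are hypothetical, SHA-256 is just one possible choice for hashing filenames, and treating spans as half-open `[start, end)` character offsets into `text` is an assumption that the format above does not spell out.

```python
import hashlib
from pathlib import Path

# Placeholder root taken from the path pattern above.
FILES_ROOT = Path("/standard/path/to/files")


def make_doc_id(original_filename):
    """Hypothetical doc_id: SHA-256 hex digest of the original filename."""
    return hashlib.sha256(original_filename.encode("utf-8")).hexdigest()


def local_file_path(case_id, doc_id, doc):
    """Where the UI would look for the source file behind a doc-level JSON."""
    return FILES_ROOT / case_id / f"{doc_id}.{doc['extension']}"


def span_text(doc, span):
    """Slice the document text with a [start, end) character span."""
    if span is None:  # document-level classifications carry no span
        return None
    start, end = span
    return doc["text"][start:end]
```

For example, with `doc = {"text": "Dear Dominic Raab, ...", "extension": "pdf"}`, calling `span_text(doc, [5, 17])` returns `"Dominic Raab"`.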
### Summary-level JSON
Developed by Kiran based on UI considerations -- see, e.g., https://github.com/PrimerAI/home-office/blob/4a6c2b7db814de830a9439c803d254bc5385eb37/ukvi-frontend/src/components/CaseView/_sample_case_3.json.
## Algorithms
### Extractive
Could be achieved by Regex + logic or a QA model. Explicitly mentioned:
1) MP Name and Address
2) Constituent Name and Address
3) Constituent Nationality
4) Constituent DOB
5) Case Reference #
6) Future Dates
7) Questions from MP/Constituent
Potentially valuable as well:
* Key numbers (re-use primer-core)
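To make the "Regex + logic" option concrete, here is a minimal sketch for the Future Dates item. It assumes dates are written like "3 August 2018" (other formats would need more patterns); the function name and the `future_date` type label are hypothetical, and the output simply mirrors the doc-level extraction shape.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Assumed format: day, full English month name, four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def extract_future_dates(text, today):
    """Return doc-level-style extractions for dates strictly after `today`."""
    extractions = []
    for match in DATE_RE.finditer(text):
        day, month_name, year = match.groups()
        try:
            parsed = date(int(year), MONTHS.index(month_name) + 1, int(day))
        except ValueError:  # impossible dates such as "31 February 2018"
            continue
        if parsed > today:
            extractions.append(
                {
                    "type": "future_date",  # hypothetical type label
                    "span": [match.start(), match.end()],
                    "val": parsed.isoformat(),
                    "pred_proba": None,  # rule-based, so no model probability
                }
            )
    return extractions
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps batch runs reproducible.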
### Classifiers
All explicitly mentioned. Sentence-level feels like the right level of granularity here for the following:
1) Compassionate Need
   * Funeral
   * Death/orphaned
   * Forced marriage
   * Critical/terminal illness
   * Emergency
   * Child exploitation
   * FGM
   * Rape/sexual assault
   * Torture
   * Persecution
   * Domestic violence
2) Process Urgency
   * Urgent travel
   * Deportation
   * Removal
   * Detention
   * Persecution
   * Asylum
3) Disqualifying Criteria
   * Criminal
4) Less urgent reasons (e.g., mental health conditions)
5) Complaint
6) Other British government departments
7) FCO or foreign governments
8) Asks/Request for Urgency
9) Expedite
10) Request
11) Sponsor Details (need more clarity here, though)
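As a placeholder for what sentence-level output could look like before any model is trained, here is a naive keyword baseline. Everything in it is an assumption for illustration: the keyword lists, the `process_urgency` label, the crude punctuation-based sentence splitter, and the choice to leave `pred_proba` empty because a keyword rule has no probability to report.

```python
import re

# Hypothetical keyword lists; a trained classifier would replace these.
KEYWORDS = {
    "compassionate_need": ["funeral", "terminal illness", "forced marriage"],
    "process_urgency": ["deportation", "detention", "urgent travel"],
}

# Crude splitter: a sentence starts at a non-space character and runs to
# the next '.', '!' or '?'. Abbreviations will break it.
SENTENCE_RE = re.compile(r"\S[^.!?]*[.!?]?")


def classify_sentences(text):
    """Emit sentence_classifications-style dicts from keyword hits."""
    results = []
    for sent in SENTENCE_RE.finditer(text):
        for label, keywords in KEYWORDS.items():
            for keyword in keywords:
                hit = re.search(re.escape(keyword), sent.group(0), re.IGNORECASE)
                if hit is None:
                    continue
                results.append(
                    {
                        "type": label,
                        "span": [sent.start(), sent.end()],
                        # key_span is offset back into the full document text
                        "key_span": [sent.start() + hit.start(), sent.start() + hit.end()],
                        "pred_proba": None,
                    }
                )
                break  # at most one hit per label per sentence
    return results
```

A baseline like this mostly helps exercise the batch/UI pipeline end to end; it says nothing about how well a real classifier would perform.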
Document-level "classifiers"/filters:
* From MP
* From Constituent
* Angry/charged language (could also handle at sentence level -- not sure here?)
* Legal language (e.g., referencing sub-clauses of various documents, etc.)
* Irrelevant docs
* Passport scans
## UI
Just use a good component library (Carbon FTW!).
## Appendix
### User Workflow
1) Is the UK Home Office the right group to triage?
2) If so, is it urgent?
   a) If not: form response
   b) If so: commission to a different group