# ODTJ Pre-Sprint Plan
- [ ] Research
    - [ ] Collect 15 financial reports from 10 different banks
        - "Collect" means getting the download link for the file
        - Starting point should be this Google spreadsheet:
          which provides some info on how to find the data.
        - After downloading a report, also copy it to the dedicated S3 bucket
    - [ ] For each report: extract the data from it 'manually'
        - Write a Jupyter Notebook that extracts the data from the table(s) in the document
        - No need to process the extracted data or save it anywhere
        - Just generate a list of `dict`s, one for each row
        - Evaluate existing software libraries (e.g. tabula-py, pdfminer etc.) to see which best fits our needs
        - Store the notebooks in the Git repo as well
    - [ ] Download the rest of the files and put them in the S3 bucket
- [ ] Infrastructure
    - [ ] Deploy server with:
        - [ ] Pipelines
        - [ ] Sourcespec store
        - [ ] Auth
        - [ ] Redash
        - [ ] DB
    - [ ] Domain name
    - [ ] Create Git repo for research and code
    - [ ] S3 bucket for docs
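
As a rough illustration of the target output of the manual extraction step (a list of `dict`s, one per table row), the sketch below converts a table that has already been pulled out of a report into that shape. The function name `rows_to_dicts`, the sample column names, and the sample values are hypothetical, and it assumes the first row of the extracted table is the header.

```python
from typing import Dict, List, Sequence


def rows_to_dicts(table: Sequence[Sequence[str]]) -> List[Dict[str, str]]:
    """Turn a raw table (header row first) into one dict per data row.

    Assumption: `table` is whatever the extraction library returned,
    already converted to a list of rows of cell strings.
    """
    if not table:
        return []
    header = [cell.strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Skip rows that are entirely empty (common around table borders).
        if not any(cell.strip() for cell in row):
            continue
        # zip() stops at the shorter sequence, so ragged rows do not raise.
        records.append({key: cell.strip() for key, cell in zip(header, row)})
    return records


# Hypothetical sample; column names and values are made up for illustration.
sample = [
    ["country", "turnover", "profit_before_tax"],
    ["FR", "1200", "300"],
    ["", "", ""],
    ["DE", "800", "150"],
]
records = rows_to_dicts(sample)
```

Keeping the cell values as strings is deliberate here, since the plan says the extracted data does not need to be processed at this stage.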
### Questions for specialists
1. NBI (net banking income) <=> turnover - YES
2. income_tax <=> corporate_income_tax <=> tax_on_profit ?
total_tax_paid ! replace with Corporation
3. income_before_tax <=> profit_before_tax?
### Research Results:
#### Per file summary
- Location of data (original + Google Drive):
- How can we identify it?
#### Generic extractor parameters:
- e.g. Page num
- Table title
- ... etc
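
One possible way to record these parameters per report is a small data class, sketched below. `ExtractorParams` and its field names are hypothetical; only the page number and table title come from the list above, and any further parameters are intentionally left out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractorParams:
    """Hypothetical per-report settings for a generic table extractor."""

    page_num: int     # 1-based page on which the table appears (assumed convention)
    table_title: str  # title text used to identify the table on that page

    def __post_init__(self) -> None:
        # Reject obviously invalid page numbers early.
        if self.page_num < 1:
            raise ValueError("page_num must be >= 1")


# Made-up example values, for illustration only.
params = ExtractorParams(page_num=42, table_title="Country-by-country reporting")
params_as_dict = asdict(params)
```

A frozen data class keeps the settings immutable once defined, and `asdict` makes them easy to dump to JSON or YAML next to each report.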
#### List of tools to extract tables from PDFs
- [`tabula-py`](https://github.com/chezou/tabula-py) - Python wrapper over [`tabula-java`](https://github.com/tabulapdf/tabula-java) which powers http://tabula.technology/
    - It can guess the table locations and also takes coordinates if we can [get them somehow](https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want) (usually through the UI)
    - `tabula-java` seems to be the best tool for the task at the moment; most tools I found are wrappers over it
- [`pdftables`](https://github.com/okfn/pdftables) - 4-year-old Python project, forked and abandoned by OKFN
    - Relies on [`pdfminer`](http://www.unixuser.org/~euske/python/pdfminer/) to extract the text from the PDF
    - UPDATE: This library assumes the entire PDF is a table. It doesn't seem to have any way to extract only the table from a PDF that also has text.
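
A minimal sketch of how `tabula-py` could be called from a notebook, assuming `tabula-py` and a Java runtime are installed. `read_pdf` and its `pages`, `area`, `guess`, and `multiple_tables` arguments are part of the tabula-py API; the wrapper `extract_tables`, the helper `build_read_pdf_kwargs`, and the file name and coordinates in the usage note are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_read_pdf_kwargs(
    page: int, area: Optional[Sequence[float]] = None
) -> Dict[str, Any]:
    """Build keyword arguments for tabula.read_pdf.

    `area` is (top, left, bottom, right) in PDF points. When it is given,
    table guessing is switched off so the coordinates are used; otherwise
    tabula is asked to guess the table location.
    """
    kwargs: Dict[str, Any] = {"pages": page, "multiple_tables": True}
    if area is None:
        kwargs["guess"] = True
    else:
        if len(area) != 4:
            raise ValueError("area must be (top, left, bottom, right)")
        kwargs["guess"] = False
        kwargs["area"] = list(area)
    return kwargs


def extract_tables(
    pdf_path: str, page: int, area: Optional[Sequence[float]] = None
) -> List[List[dict]]:
    """Return one list of row dicts per table found on the page."""
    import tabula  # imported lazily: needs tabula-py and Java installed

    frames = tabula.read_pdf(pdf_path, **build_read_pdf_kwargs(page, area))
    # Each pandas DataFrame becomes a list of dicts, one per row.
    return [frame.to_dict(orient="records") for frame in frames]
```

For example, `extract_tables("report.pdf", page=12)` would let tabula guess the table location, while adding `area=(100, 50, 400, 550)` would restrict extraction to that region; both the file name and the coordinates are placeholders.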