---
tags: oki
---

# ODTJ Pre-Sprint Plan

- [ ] Research
    - [ ] Collect 15 financial reports from 10 different banks
        - "Collect" means:
            - Get the download link for the file
            - Starting point should be this Google spreadsheet: https://docs.google.com/spreadsheets/d/1SFjPW8oOA-1bQshwgzc4T5Og8f_xzRn2B5TIqGHo4nk/edit#gid=0 which provides some info on how to find the data.
            - After downloading the report, also copy it to the dedicated S3 bucket
    - [ ] For each report: extract the data from it 'manually'
        - Write a Jupyter Notebook that extracts the data from the table(s) in the document
        - Don't need to process the extracted data or save it anywhere
        - Just need to generate a list of `dict`s, one for each row.
        - Try to evaluate different existing software libraries (e.g. tabula-py, pdfminer, etc.) to see which best fits our needs
        - Store the notebooks in the Git repo as well
    - [ ] Download the rest of the files and put them in the S3 bucket
- [ ] Infrastructure
    - [ ] Deploy server with
        - [ ] Pipelines
        - [ ] Sourcespec store
        - [ ] Auth
        - [ ] Redash
        - [ ] DB
    - [ ] Domain name
    - [ ] Create git repo for research and code
    - [ ] S3 Bucket for docs

### Questions for specialists

1. NBI (net banking income) <=> turnover - YES
2. income_tax <=> corporate_income_tax <=> tax_on_profit ? total_tax_paid ! replace with Corporation tax paid
3. income_before_tax <=> profit_before_tax?

### Research Results:

#### Per file summary

1. Name/Year
    - URL:
    - Location of data (original + google drive):
    - How can we identify it?
2. ...

#### Generic extractor parameters:

- e.g. Page num
- Table title
- ... etc

#### List of tools to extract tables from PDFs

- [`tabula-py`](https://github.com/chezou/tabula-py)
    - Python wrapper over [`tabula-java`](https://github.com/tabulapdf/tabula-java), which powers http://tabula.technology/
    - It can guess the table locations and also takes coordinates if we can [get them somehow](https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want) (usually through the UI)
    - `tabula-java` seems to be the best tool for the task at the moment; most tools I found are wrappers over it
- [`pdftables`](https://github.com/okfn/pdftables)
    - 4-year-old Python project forked and abandoned by OKFN
    - Relies on [`pdfminer`](http://www.unixuser.org/~euske/python/pdfminer/) to extract the text from PDF
    - UPDATE: This library assumes the entire PDF is a table. It doesn't seem to have any way to extract only the table from a PDF that also has text.
