Statistics - Automation, ETL, App development

--- title: Statistics - Automation, ETL, App development tags: Templates, Talk description: View the slide with "Slide Mode". --- # Statistics - Automation, ETL, App development *"Would you be interested in preparing a concise 15 min presentation on what is going on in Stats department in terms of automation of ETL processes, and app development infrastructure you've been setting up lately?"*  slides: https://hackmd.io/@DavidZenz/BJGalwa4F --- # Where are we in statistics? I ## Automation - **ready** - **AID** (automatic indicator download) available via RStudio: https://rdev.wiiw.io code: http://gitlab.wiiw.io/DavidZenz/AID - monthly - scrapes/collects - extracts - transforms (if not already in wide format) as required by statistics this has to be in Excel format - implemented countries: Baltics & PL, and some of Alexandra's - **EU Comext** / **UN Comtrade** - twice a year (or more often if required) - bulk download - ETL (from raw to **wiiw client**) - copy of files to shared drive by hand --- # Where are we in statistics? II ## Apps ready ### wiiw client - data from EU Comext/UN Comtrade scripts - GUI - filtering - download ### VDE - ETL from db export - daily - GUI - filtering - visualization - download --- # AID in RStudio on server rdev ![](https://i.imgur.com/YYnpK97.png) https://rdev.wiiw.io --- # wiiw client GUI ![](https://i.imgur.com/7f3a9KV.png) http://client.wiiw.ac.at  --- # Where are we in statistics? III ## Automation - **in the works** - EU Comext / UN Comtrade - automatic sync of new files from file servers via cron (daily) - ETL to Apache Parquet - **wiiw client** to use Apache Parquet (instead of SQL as of today) - **to do** - expand **AID** to all other countries (this hasn't happened in 3yrs now) - other datasets on shared drive `W:\00 POOL\08 External Databases\` - automatic scraping/collection (*if possible*) - ETL to Apache Parquet - making them available via **wiiw client** -> plan to make all of this datasets available not via file share, but via client (*might need some kind of convention for variables*) - implement API for **wiiw client** for programmatic users (e.g. via `plumber`) --- # Where are we in statistics? IV ## Automation - **what should be done** - **AID** for all countries - ETL to Apache Parquet - from there - create Excel reports - or implement other system for - analyzing / consistency checking - visualization - etc. - **increase participation of others** skills needed - wide-ranging computer skills - *R* or *Python* required today --- # Code (R, Stata, ...) currently scattered all around the institute, mostly with no documentation at all - usually - personal hard drives - cloud environment - rather unusual - shared drive - local git - recommended - gitlab (http://gitlab.wiiw.io) ## What should be done - migrate to gitlab - establish - practice of documentation - version control - collaboration on code ## Tools - git - local IDE with git implementation (RStudio etc.) - Browser (https://rdev.wiiw.io) --- # gitlab ![](https://i.imgur.com/3S6XH2y.png) --- # Data storage I ## Status quo currently scattered all around the institute - **SQL database (processed)** but (more or less) only available for website users - **Shared drive on file server (`W:\`)** all possible updated and outdated files/data sets - **personal hard drives** this is much more common than one would expect - **cloud environment** not so common, but I know of at least one person - **rdev/rprod** raw and processed data on server, currently only http://client.wiiw.ac.at for usage of processed data ![](https://i.imgur.com/ix49atY.png) --- # Data storage II ## Need - **convention** - where to store raw data - "how to" process e.g. naming conventions of variables etc. - if possible - where to store processed data - **infrastructure** - Data Lake (raw) - Data Warehouse (processed) can consist of different forms of storage / databases - stat database - external databases - etc ## Difference | | Data Lake | Data Warehouse | | ------------- | ------------------------------------ | -------- | | Structure | Raw | Processed | | Purpose | Not yet determined | In Use | | Users | IT / Data Scientists / Statisticians | Statisticians / Economists | | Accessibility | Highly, quick to update | More complicated & costly to updated | ## There's more - might consider "Data Lakehouse" e.g. using Apache Parquet, processed, using some set of conventions --- # Thank you!