---
title: Statistics - Automation, ETL, App development
tags: Templates, Talk
description: View the slide with "Slide Mode".
---
# Statistics - Automation, ETL, App development
*"Would you be interested in preparing a concise 15 min presentation on what is going on in Stats department in terms of automation of ETL processes, and app development infrastructure you've been setting up lately?"*
<!-- Put the link to this slide here so people can follow -->
slides: https://hackmd.io/@DavidZenz/BJGalwa4F
---
# Where are we in statistics? I
## Automation
- **ready**
- **AID** (automatic indicator download)
available via RStudio: https://rdev.wiiw.io
code: http://gitlab.wiiw.io/DavidZenz/AID
- monthly
- scrapes/collects
- extracts
- transforms (if not already in wide format)
as required by statistics, the output has to be in Excel format (see the sketch below)
- implemented countries:
Baltics & PL, and some of Alexandra's
- **EU Comext** / **UN Comtrade**
- twice a year (or more often if required)
- bulk download
- ETL (from raw to **wiiw client**)
- files are still copied to the shared drive by hand
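
A minimal sketch of one AID-style step (download → reshape to wide → Excel), assuming a CSV source with `country`, `indicator`, `period` and `value` columns. The URL, column names and output file are hypothetical; the real code lives in the gitlab repo above.

```r
# Hypothetical AID-style step: download one indicator, reshape to wide, write Excel.
# The URL, column names and output file are placeholders, not the actual AID code.
library(readr)    # CSV download/parsing
library(tidyr)    # long -> wide reshaping
library(writexl)  # Excel output, as required by statistics

raw <- read_csv("https://stats.example.org/cpi_monthly.csv")

wide <- pivot_wider(
  raw,
  id_cols     = c(country, indicator),
  names_from  = period,   # e.g. "2021M01", "2021M02", ...
  values_from = value
)

write_xlsx(wide, "cpi_monthly_wide.xlsx")
```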
---
# Where are we in statistics? II
## Apps ready
### wiiw client
- data from EU Comext/UN Comtrade scripts
- GUI
- filtering
- download
### VDE
- ETL from db export
- daily
- GUI
- filtering
- visualization
- download
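
A minimal sketch of the filter / download pattern behind these GUIs, assuming a Shiny front end (the actual wiiw client / VDE apps may be built differently); the data frame and column names are placeholders.

```r
# Minimal Shiny sketch of the filter / download pattern; the data frame and
# column names are placeholders, not the actual wiiw client or VDE code.
library(shiny)

dat <- data.frame(
  country = c("AT", "PL", "EE"),
  year    = c(2020, 2021, 2021),
  value   = c(1.2, 3.4, 5.6)
)

ui <- fluidPage(
  selectInput("country", "Country", choices = unique(dat$country)),
  tableOutput("preview"),
  downloadButton("dl", "Download CSV")
)

server <- function(input, output) {
  filtered <- reactive(dat[dat$country == input$country, ])
  output$preview <- renderTable(filtered())
  output$dl <- downloadHandler(
    filename = function() "export.csv",
    content  = function(file) write.csv(filtered(), file, row.names = FALSE)
  )
}

shinyApp(ui, server)
```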
---
# AID in RStudio on server rdev

https://rdev.wiiw.io
---
# wiiw client GUI

http://client.wiiw.ac.at
<!--
- **twitter sentiment** for Visegrad countries (2019)
-->
---
# Where are we in statistics? III
## Automation
- **in the works**
- EU Comext / UN Comtrade
- automatic sync of new files from file servers via cron (daily)
- ETL to Apache Parquet
- **wiiw client** to use Apache Parquet
(instead of the SQL backend used today)
- **to do**
- expand **AID** to all other countries (still outstanding after three years)
- other datasets on shared drive `W:\00 POOL\08 External Databases\`
- automatic scraping/collection (*if possible*)
- ETL to Apache Parquet
- making them available via **wiiw client**
-> the plan is to make all of these datasets available via the client rather than the file share
(*might need some kind of naming convention for variables*)
- implement an API for **wiiw client** for programmatic users (e.g. via `plumber`; see the sketch below)
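
A sketch of what the Parquet backend plus a `plumber` endpoint could look like; the dataset path, column names and route are assumptions, not the existing wiiw client code.

```r
# plumber.R -- hypothetical read-only API on top of a Parquet dataset.
# Dataset path, column names and the endpoint are assumptions for illustration.
library(arrow)
library(dplyr)

ds <- open_dataset("/srv/data/comtrade_parquet")   # partitioned Parquet files

#* Trade flows for one reporter country
#* @param reporter ISO code, e.g. "AT"
#* @get /flows
function(reporter = "AT") {
  ds |>
    filter(reporter_iso == reporter) |>
    collect()
}
```

Such a file would be served with `plumber::pr_run(plumber::pr("plumber.R"), port = 8000)`.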
---
# Where are we in statistics? IV
## Automation
- **what should be done**
- **AID** for all countries
- ETL to Apache Parquet
- from there
- create Excel reports
- or implement other system for
- analyzing / consistency checking (see the sketch below)
- visualization
- etc.
- **increase participation of others**
skills needed:
- wide-ranging computer skills
- *R* or *Python* is required today
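
One possible shape for the analyzing / consistency-checking step on top of Parquet; the dataset path, indicator name and the 50 % threshold are placeholders, not an agreed rule.

```r
# Hypothetical consistency check on the Parquet store: flag month-on-month
# jumps above 50 %. Dataset path and column names are placeholders.
library(arrow)
library(dplyr)

suspicious <- open_dataset("/srv/data/aid_parquet") |>
  filter(indicator == "cpi") |>
  collect() |>
  arrange(country, period) |>
  group_by(country) |>
  mutate(mom_change = value / lag(value) - 1) |>
  ungroup() |>
  filter(abs(mom_change) > 0.5)   # review these by hand or report them
```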
---
# Code (R, Stata, ...)
currently scattered all around the institute, mostly with no documentation at all
- usually
- personal hard drives
- cloud environment
- rather unusual
- shared drive
- local git
- recommended
- gitlab (http://gitlab.wiiw.io)
## What should be done
- migrate to gitlab (see the sketch below)
- establish
- practice of documentation
- version control
- collaboration on code
## Tools
- git
- local IDE with git integration (RStudio etc.)
- browser (https://rdev.wiiw.io)
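
A minimal sketch of moving an existing script folder into gitlab from within R (using the `gert` package); the folder path and repo URL are placeholders, and plain command-line git does the same job.

```r
# Hypothetical one-off migration of an existing project folder to gitlab,
# done from R with the gert package; path and repo URL are placeholders.
library(gert)

repo <- git_init("~/projects/my_etl_script")     # turn the folder into a git repo
git_add(".", repo = repo)                        # stage all files
git_commit("Initial import of ETL script", repo = repo)
git_remote_add("http://gitlab.wiiw.io/<user>/my_etl_script.git", repo = repo)
git_push(remote = "origin", repo = repo)
```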
---
# gitlab

---
# Data storage I
## Status quo
currently scattered all around the institute
- **SQL database (processed)**
but (more or less) only available to website users
- **Shared drive on file server (`W:\`)**
a mix of up-to-date and outdated files/data sets
- **personal hard drives**
this is much more common than one would expect
- **cloud environment**
not so common, but I know of at least one person
- **rdev/rprod**
raw and processed data on the server; currently only http://client.wiiw.ac.at uses the processed data

---
# Data storage II
## Need
- **convention**
- where to store raw data
- "how to" process
e.g. naming conventions of variables etc. - if possible
- where to store processed data
- **infrastructure**
- Data Lake (raw)
- Data Warehouse (processed)
can consist of different forms of storage / databases
- stat database
- external databases
- etc
## Difference
| | Data Lake | Data Warehouse |
| ------------- | ------------------------------------ | -------- |
| Structure | Raw | Processed |
| Purpose | Not yet determined | In use |
| Users | IT / Data Scientists / Statisticians | Statisticians / Economists |
| Accessibility | Highly accessible, quick to update | More complicated & costly to update |
## There's more
- might consider a "Data Lakehouse"
e.g. processed data stored as Apache Parquet, following a set of conventions (see the sketch below)
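
A sketch of what a Parquet-based lakehouse could look like with the `arrow` package, assuming a country/year partitioning convention; all paths, partition keys and column names are placeholders.

```r
# Hypothetical "lakehouse" layout: processed data stored as a partitioned
# Parquet dataset under agreed folder and variable conventions.
library(arrow)
library(dplyr)

processed <- data.frame(
  country = c("AT", "PL"), year = c(2020, 2020),
  flow    = c("import", "export"), value = c(1.1, 2.2)
)

# write: one dataset per source, partitioned by agreed keys
write_dataset(processed, "/srv/lakehouse/comext",
              format = "parquet", partitioning = c("country", "year"))

# read: lazy query, only the partitions touched by the filter are scanned
open_dataset("/srv/lakehouse/comext") |>
  filter(country == "AT", year >= 2015) |>
  collect()
```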
---
# Thank you!