# Frictionless Data
![Frictionless Data](https://i.imgur.com/LTj6Kh9.png =200x)
Lightweight specifications and tooling to make it effortless to get, share, and validate data.
Note:
- 5 MINUTES FOR INTRO -> PROBLEM DEFINITIONS
---
# Who we are
**Paul Walsh**
Chief Product Officer
*Open Knowledge International*
https://github.com/pwalsh
**Adam Kariv**
Engineering Lead
*Open Knowledge International*
https://github.com/akariv
---
# Where we work
![Open Knowledge International](https://i.imgur.com/RJ1fNTl.png =200x)
[Open Knowledge International](https://okfn.org)
"A world where knowledge creates power for the many, not for the few."
----
## Mission
Open up all essential, public interest information and see it used to create insight that drives positive change.
----
## What we do
- technical platforms for open data
- thought leadership
- community facilitation
----
## CKAN [↗](https://ckan.org)
![CKAN](https://i.imgur.com/ZcEVt1k.png)
Note:
- The platform that started Open Knowledge
- in many ways, defined practices around the publication of open data.
- Widely installed by governments and public bodies
- Several commercial providers globally
----
## OpenSpending [↗](http://next.openspending.org)
![OpenSpending](https://i.imgur.com/dLsMbFc.png)
Note:
- One of the oldest Open Knowledge projects
- Where does my money go -> OS
- Newest iteration led by Adam
- Newest iteration based on FD
----
## OpenTrials [↗](https://opentrials.net)
![OpenTrials](https://i.imgur.com/IPPZSpU.png)
Note:
- Opening up clinical trial data
- Link together disparate sources
- Fix academic publication around trials (false claims)
- Fix public access to trial data (Ebola example)
----
## Open Data Index [↗](http://index.okfn.org)
![Open Data Index](https://i.imgur.com/K3U11KB.png)
Note:
- Crowdsourced assessment tool for government data
- Great example of a community-engaged project
- High degree of engagement from governments - this machine changes policy
----
## OKI Labs [↗](http://okfnlabs.org/)
![OKI Labs](https://i.imgur.com/wgK94LI.png)
Note:
- Ad hoc tech/data community around Open Knowledge
- Small meetups and knowledge sharing
- Skunkworks
----
## OK Network [↗](http://okfn.org/network/)
![OK Network](https://i.imgur.com/UF8AKhC.png)
Note:
- Globally distributed
- Highly autonomous
---
# Frictionless Data
Plumbing, not a platform.
- Generic
- Reusable
- Modular
- Needs-driven
Addressing the pain of working with public data.
Note:
- 10 MINUTES FOR THIS SECTION UNTIL DEEP DIVE
- FD is not a platform. It is plumbing for open/public data
- To understand the offering, we need to understand the problems it solves (segues into the next section)
---
# Problems
`{{ govt_dept }}` wants to publish data, or is already publishing data.
What can we expect?
----
## Tidiness
Where is the data in the file? (sheet, header rows…)
![Imgur](http://i.imgur.com/vq4uuy7.png "This is how the monthly Consumer Price Index report is published. It's a table in an Excel file, but it's still not tabular data.")
Note:
This is how the monthly Consumer Price Index report is published.
It's a table in an Excel file, but it's still not tabular data.
----
## Cleaning
Spurious values everywhere
![Imgur](http://i.imgur.com/e69wAH0.png)
Note:
This is how local municipality statistical data is published.
There's a '..' and a '-', which are not numbers but have special meaning,
as well as the blue background color...
----
## Data Format
Dates, Numbers etc.
![Imgur](http://i.imgur.com/n1hiD3E.png)
Note:
The same report for two different years is published in two different date formats.
----
## Validation
Constraints, mandatory columns, data types
![Imgur](http://i.imgur.com/45pJldj.png)
Note:
The 'LocalityCode' column looks like numbers but should be treated as strings.
The 'EstbYr' column is conceptually an integer but has bad values.
----
## File format
Which version of excel? Character encoding?
![Imgur](http://i.imgur.com/sNBDXTl.png)
Note:
All the icons lead to a separate Excel file from a different ministry or authority.
Each of these files was generated as a report from the same government system; nevertheless, each has a (slightly) different set of columns and is stored in a different version of MS Excel.
----
## Metadata
When was this file published? Who published it?
![Imgur](http://i.imgur.com/MpXqWrB.png)
Note:
Website says last update date was Dec. 29th, 2016.
Actual files are from 2017.
----
## Documentation
What does each column mean?
![Imgur](http://i.imgur.com/J3JdRe6.png)
Note:
I tried to get explanations from a government official about some columns of the national budget file.
His response: "What next? Shall I also make you a cup of coffee?"
---
# Solutions
- Frictionless Data is a set of specifications and tooling to address these problems and more.
- The solutions are all simple in isolation.
- Together, they build a powerful stack for working with open data, benefiting publishers and consumers.
Note:
- Here we do not want to dive into the FD spiel; we want to talk about solutions to the problems presented.
----
## On disk
```shell
/package/ ## files generally co-located
/package/datapackage.json ## descriptor
/package/data1.csv ## data source
/package/data2.csv ## data source
```
----
## Descriptor
A **Data Package** descriptor looks as follows:
```json
{
  "title": "Procurement Log May 2016",
  "name": "procurement-log-may-2016",
  "description": "Narrative description",
  "resources": [
    {... data resource ...},
    {... data resource ...}
  ],
  ... other properties ...
}
```
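To give a sense of how little machinery a consumer needs, a descriptor is plain JSON, readable with any standard library. A minimal stdlib-only sketch (the descriptor is inlined here so the snippet is self-contained; normally you would read `datapackage.json` from disk):

```python
import json

# A minimal Data Package descriptor, inlined so the sketch is
# self-contained. Field names follow the example above.
descriptor_json = """
{
  "title": "Procurement Log May 2016",
  "name": "procurement-log-may-2016",
  "resources": [
    {"name": "may-2016", "data": ["data1.csv"]},
    {"name": "june-2016", "data": ["data2.csv"]}
  ]
}
"""

package = json.loads(descriptor_json)

# Everything a consumer needs to locate the data is in the descriptor.
print(package["title"])                           # Procurement Log May 2016
print([r["name"] for r in package["resources"]])  # ['may-2016', 'june-2016']
```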
Note:
- Packages a data collection
- Provides metadata for the whole collection
- documentation can be a form of metadata
- Minimum required for interoperability
- A range of optional fields
----
## Descriptor
A **Data Resource** descriptor looks as follows:
```json
{
  "name": "may-2016",
  "period": "2016-05",
  "data": [ "http://example.com/procurement-may-2016.csv" ],
  "schema": {... table schema ...},
  "format": "csv",
  "encoding": "utf-8",
  ... other properties ...
}
```
Note:
- Packages a single source of data
- Provides metadata for that data
- documentation can be a form of metadata
- Minimum required for interoperability
- A range of optional fields
----
## Descriptor
A **Table Schema** descriptor looks as follows:
```json
{
  "fields": [
    { "name": "transaction_id", "type": "string" },
    { "name": "amount", "type": "number" },
    { "name": "year", "type": "integer",
      "constraints": {
        "required": true,
        "minimum": 1948,
        "maximum": 2017
      }
    }
  ]
}
```
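To show why a declarative schema is useful, here is a rough stdlib-only sketch of what a validator driven by the schema above might do. This is illustrative only; real implementations such as tableschema-py handle many more types, formats, and constraints:

```python
# Illustrative sketch of Table Schema-driven validation, stdlib only.
# The schema mirrors the example above; a real implementation
# (e.g. tableschema-py) is far more complete.

schema = {
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount", "type": "number"},
        {"name": "year", "type": "integer",
         "constraints": {"required": True, "minimum": 1948, "maximum": 2017}},
    ]
}

CASTS = {"string": str, "number": float, "integer": int}

def validate_row(row, schema):
    """Cast each cell to its declared type and check constraints.
    Returns a list of error messages (empty means the row is valid)."""
    errors = []
    for field, cell in zip(schema["fields"], row):
        constraints = field.get("constraints", {})
        if cell in ("", None):
            if constraints.get("required"):
                errors.append("%s is required" % field["name"])
            continue
        try:
            value = CASTS[field["type"]](cell)
        except ValueError:
            errors.append("%s: %r is not a %s" % (field["name"], cell, field["type"]))
            continue
        if "minimum" in constraints and value < constraints["minimum"]:
            errors.append("%s: %s below minimum" % (field["name"], value))
        if "maximum" in constraints and value > constraints["maximum"]:
            errors.append("%s: %s above maximum" % (field["name"], value))
    return errors

print(validate_row(["tx-001", "99.50", "2016"], schema))  # []
print(validate_row(["tx-002", "abc", "1890"], schema))    # two errors
```

The point of the spec is that this logic can be written once, in any language, and driven entirely by the descriptor.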
Note:
- Declarative schema for a tabular data source
- Implementation agnostic
- Not tied to a language, an ORM, or a database
- Designed for, but not limited to, data as text
- Minimum required for interoperability
- A range of optional fields
- Table Schema, on top of Data Resource and Data Package, is where things get interesting - it is a fundamental building block for higher-level work, which we'll see later.
----
## Conventions
- Conventions are formalised as specifications
- Specifications enable tooling
- Tooling targets publishers and consumers
Note:
- We try not to invent a new file format or a new technology
- We take widely used solutions and best practices, and simply codify them.
----
## Modular
![stack1](https://i.imgur.com/aQRXUnI.png =700x)
Note:
- A set of concepts designed for composition
- TODO: explain the bricks from the bottom up; that is the rest of the presentation
- TODO: repeat this slide with highlights of the parts in focus
----
## Extendable
![stack2](https://i.imgur.com/Miu0MrU.png =700x)
Note:
- The core specifications and libraries are building blocks for higher-level applications
- Primarily concerned at this level with
- Data processing (ETL)
- Data quality
----
## That's it
No magic, no amazing insights.
- Table Schema
- Data Resource
- Data Package
Simple, extendable, reusable.
Note:
- We build our world on these three fundamental concepts
- They provide the building blocks for working with tabular data
- Requirements that are driven by simplicity
- Extensibility and customisation by design
- Metadata that is human-editable and machine-usable
- Reuse of existing standard formats for data
- Language-, technology- and infrastructure-agnostic
---
# Specifications [↗](http://specs.frictionlessdata.io)
- [Table Schema](http://specs.frictionlessdata.io/table-schema/)
- [Data Resource](http://specs.frictionlessdata.io/data-resource/)
- [Tabular Data Resource](http://specs.frictionlessdata.io/tabular-data-resource/)
- [Data Package](http://specs.frictionlessdata.io/data-package/)
- [Tabular Data Package](http://specs.frictionlessdata.io/tabular-data-package/)
- [Fiscal Data Package](http://specs.frictionlessdata.io/fiscal-data-package/)
- [Data Package Identifier](http://specs.frictionlessdata.io/data-package-identifier/)
- [CSV Dialect](http://specs.frictionlessdata.io/csv-dialect/)
Note:
- In common:
- Descriptor
- Profiles
- Extendable
- Web oriented (JSON, CSV, Streaming)
- Reuse (JSON Pointer, etc)
- We'll just look a bit more at three:
- Data Package
- Tabular Data Package
- Table Schema
----
## Data Package [↗](http://specs.frictionlessdata.io/data-package/)
```json
{
  "title": "My Data Package",
  "resources": [
    {
      "data": [ "path/to-data.csv" ]
    }
  ]
}
```
Note:
- Yes, this is it - incredibly simple.
- Go to the spec page and also look at the optional properties - overview of the spec's main properties
- Show Data Resource as part of Data Package
----
## Tabular Data Package [↗](http://specs.frictionlessdata.io/tabular-data-package/)
```json
{
  "profile": "tabular-data-package",
  "title": "My Data Package",
  "resources": [
    {
      "data": [ "path/to-data.csv" ],
      "schema": {
        "fields": [
          { "name": "age", "type": "integer" }
        ]
      }
    }
  ]
}
```
Note:
- A profile of data package that extends the base spec
- This is the most common use case for public/open data: tabular resources / CSV
- tabular data in plain text - we need a schema
- Show Tabular Data Resource as part of Tabular Data Package
----
## Table Schema [↗](http://specs.frictionlessdata.io/table-schema/)
```json
{
  "fields": [
    {
      "name": "first_name",
      "type": "string"
    },
    {
      "name": "email",
      "type": "string",
      "format": "email"
    },
    {
      "name": "date_of_birth",
      "type": "date",
      "format": "YYYY-MM-DD",
      "constraints": { "required": true }
    }
  ]
}
```
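The `format` hints in the schema above can drive checks as well as casts. A stdlib-only sketch of the idea; note that mapping `YYYY-MM-DD` to strptime's `%Y-%m-%d`, and the simple email regex, are assumptions of this sketch, not part of the spec text:

```python
import re
from datetime import datetime

# Illustrative only: how a consumer might act on "format" hints.
# The email regex is deliberately simple, and the YYYY-MM-DD ->
# "%Y-%m-%d" mapping is an assumption of this sketch.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_email(value):
    return bool(EMAIL_RE.match(value))

def check_date(value, pattern="%Y-%m-%d"):
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False

print(check_email("ada@example.org"))  # True
print(check_date("1990-12-31"))        # True
print(check_date("31/12/1990"))        # False
```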
Note:
- implementation agnostic schemas for tabular data
- explain what is in it and why
- overview of main properties for spec
- Was designed for declaring schemas for text-based data. Can be used to develop things like declarative ORMs or form validation libraries. We've experimented with an ORM type lib in "jsontableschema-models-js"
---
# Libraries
Core libraries that implement the specifications.
- Data Package [Python](https://github.com/frictionlessdata/datapackage-py) [JavaScript](https://github.com/frictionlessdata/datapackage-js) [Ruby](https://github.com/frictionlessdata/datapackage-rb) [R](https://github.com/frictionlessdata/datapackage-r)
- Table Schema [Python](https://github.com/frictionlessdata/tableschema-py) [JavaScript](https://github.com/frictionlessdata/tableschema-js) [Ruby](https://github.com/frictionlessdata/jsontableschema-rb)
Note:
- These are the core on which other tools are developed
- We maintain the Python and JavaScript implementations
- Ruby maintained by the Open Data Institute
- R maintained by rOpenSci
- Both are maintained under our guidance
- More to come (tool fund ->)
----
## Aside: Tool Fund [↗](http://toolfund.frictionlessdata.io)
[![Tool Fund](http://i.imgur.com/GZRBXVF.png)](http://toolfund.frictionlessdata.io)
Note:
- We are offering mini-grants of $5,000 for implementations
- Pull some details from the website
---
# Demo [↗](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)
[![Frictionless Data Notebooks](http://i.imgur.com/PQVpLoe.png)](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)
Note:
- A short demonstration of the core libraries
---
# Core extensions and utilities
- [Tabulator](https://github.com/frictionlessdata/tabulator-py)
- [Table Schema SQL](https://github.com/frictionlessdata/jsontableschema-sql-py)
- [Table Schema BigQuery](https://github.com/frictionlessdata/jsontableschema-bigquery-py)
- [Table Schema Pandas](https://github.com/frictionlessdata/jsontableschema-pandas-py)
- [Coming to Pandas and Jupyter](https://github.com/pandas-dev/pandas/issues/14386)
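The value of Tabulator is a consistent iteration interface: whatever the source format, the caller always gets headers and plain rows. A rough stdlib illustration of that idea (this is not Tabulator's actual API, just the shape of the concept):

```python
import csv
import io
import json

# Conceptual sketch of a "consistent iteration interface": whatever
# the source format, the caller gets (headers, row iterator).
# NOT tabulator-py's actual API - an illustration of the idea only.
def open_table(text, fmt):
    if fmt == "csv":
        rows = list(csv.reader(io.StringIO(text)))
        return rows[0], iter(rows[1:])
    if fmt == "json":  # a list of uniform objects
        records = json.loads(text)
        headers = list(records[0].keys())
        return headers, ([r[h] for h in headers] for r in records)
    raise ValueError("unsupported format: %s" % fmt)

csv_headers, csv_rows = open_table("name,age\nAda,36\n", "csv")
json_headers, json_rows = open_table('[{"name": "Ada", "age": 36}]', "json")
print(csv_headers, list(csv_rows))    # ['name', 'age'] [['Ada', '36']]
print(json_headers, list(json_rows))  # ['name', 'age'] [['Ada', 36]]
```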
Note:
- Tabulator is a consistent iteration interface for reading and writing tabular data sources
- CSV, Excel, Google Sheets, JSON, NDJSON, Open Office (ODS)
---
# Higher-level applications
Higher-level applications build on the stack for data quality, validation, and complex processing pipelines.
----
## goodtables [↗](https://github.com/frictionlessdata/goodtables-py)
- Concerned with data quality (consistency, structure, schema)
- Generates detailed reports for end users (non-technical)
- Based on [Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec), can easily customise
- We are working on a web service around this
- goodtables.io: "Continuous Data Validation"
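To give a flavour of the structural checks such a tool runs (blank rows, duplicate headers, ragged rows), here is a stdlib sketch in that spirit; it is not goodtables' actual implementation:

```python
# Illustrative sketch of structural checks in the spirit of
# goodtables (duplicate headers, blank rows, ragged rows);
# not its actual code.
def structure_report(headers, rows):
    errors = []
    seen = set()
    for h in headers:
        if h in seen:
            errors.append("duplicate header: %s" % h)
        seen.add(h)
    for lineno, row in enumerate(rows, start=2):  # row 1 is the header
        if all(cell == "" for cell in row):
            errors.append("blank row at line %d" % lineno)
        elif len(row) != len(headers):
            errors.append("row at line %d has %d cells, expected %d"
                          % (lineno, len(row), len(headers)))
    return errors

print(structure_report(["id", "id", "name"],
                       [["1", "x", "Ada"], ["", "", ""]]))
# ['duplicate header: id', 'blank row at line 3']
```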
----
## Data Package Pipelines [↗](https://github.com/frictionlessdata/datapackage-pipelines)
- Stream-based ETL, producing Data Packages
- Meant for loading messy tabular data with a powerful standard library
----
## OpenSpending [↗](https://github.com/openspending/openspending)
- Platform for fiscal data
- OLAP API
- Advanced visualisation components
- Interactive data modeling
- All data handling based on Frictionless Data
- Fiscal Data package specification
- Flat file storage as single point of truth
- SQL and Elasticsearch databases derived
---
# Demo [↗](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)
[![Frictionless Data Notebooks](http://i.imgur.com/PQVpLoe.png)](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)
Note:
- What is goodtables
- How does it work
- What is the report
- Data Quality as a first order concern
- Data Quality spec
---
# Data Package Pipelines
A data processing framework
based on Data Packages
Note:
- 15 MINUTES for DPP section
----
## Philosophy and Origin
- Contribution oriented
- Many similar tasks vs. one big graph
- Data Package native
--
First use case:
scrape and process EU subsidy data from ~140 different data sources across Europe.
----
## Contribution oriented
- Coding vs Declaring:
- Define processing tasks in YAML, not code
- Validate YAMLs using schema
- Processing pipelines are built of small & simple building blocks
- A powerful 'standard library' of processing steps
----
## Contribution oriented
- Performance:
- Processing is done on data streams
- Limits memory, CPU and disk resource use
- Allows more flexibility in running pipelines concurrently (and lowers hardware requirements)
----
## Contribution oriented
- Maintenance:
- Easier to understand and maintain
- Limits the complexity of implementation
→ Allows contributions from non-technical users
----
## Many similar tasks
- Most ETL solutions focus on creating a big graph of tasks
![Imgur](http://i.imgur.com/KzBn4Pt.png)
Note:
Graph from Airflow's website, an ETL solution from Airbnb
----
## Many similar tasks
- Our use cases consist of many similar, disconnected sources
- Consider:
- Scraping and processing yearly budgets from Israel's 259 different municipalities
- Each one published in a slightly different format
- Each one in their own web portal etc.
----
## Data Package Native
Benefit from the entire toolset - DRY!
- Each step reads a datapackage from stdin
- Modify metadata
- Modify resources
- Emit a datapackage to stdout
Datapackage + Data are always valid
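The step contract above can be sketched as a pure function over a descriptor and a row stream. The actual stdin/stdout wire format datapackage-pipelines uses is more involved, and the `amount` cast here is a made-up example; this shows only the shape of a step:

```python
# Shape of a pipeline step as a pure function: take a datapackage
# descriptor plus an iterator of rows, return modified versions of
# both. The real datapackage-pipelines stdin/stdout framing is more
# involved; the "amount" cast is a hypothetical example.
def add_license_step(datapackage, resource_iterator):
    datapackage = dict(datapackage, license="CC-BY-SA")  # modify metadata
    def rows():
        for row in resource_iterator:                    # modify resources
            row["amount"] = float(row["amount"])         # e.g. cast a field
            yield row
    return datapackage, rows()

dp, rows = add_license_step({"name": "budget"},
                            iter([{"amount": "10.5"}, {"amount": "3"}]))
print(dp)          # {'name': 'budget', 'license': 'CC-BY-SA'}
print(list(rows))  # [{'amount': 10.5}, {'amount': 3.0}]
```

Because every step both consumes and emits a valid datapackage, steps compose freely and the metadata never drifts out of sync with the data.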
----
## Example
CO2 Emissions from World Development Indicators
![Imgur](http://i.imgur.com/SOp0FCr.png)
http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
----
## Pipeline
```yaml
worldbank-co2-emissions:
  pipeline:
    -
      run: add_metadata
      parameters:
        name: 'co2-emissions'
        title: 'CO2 emissions (metric tons per capita)'
        homepage: 'http://worldbank.org/'
    -
      run: add_resource
      parameters:
        name: 'global-data'
        url: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
        format: xls
        headers: 4
```
----
## Pipeline (cont.)
```yaml
    -
      run: stream_remote_resources
      cache: True
    -
      run: set_types
      parameters:
        resources: global-data
        types:
          "[12][0-9]{3}":
            type: number
    -
      run: dump.to_zip
      parameters:
        out-file: co2-emissions-wb.zip
```
----
## Running it
```bash
$ pip install datapackage-pipelines
$ dpp
Available Pipelines:
- ./worldbank-co2-emissions (*)
$ dpp run ./worldbank-co2-emissions
INFO :Main:RUNNING ./worldbank-co2-emissions
INFO :Main:- lib/add_metadata.py
INFO :Main:- lib/add_resource.py
INFO :Main:- lib/stream_remote_resources.py
INFO :Main:- lib/dump/to_zip.py
INFO :Main:DONE lib/add_metadata.py
INFO :Main:DONE lib/add_resource.py
INFO :Main:stream_remote_resources: OPENING http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
INFO :Main:stream_remote_resources: TOTAL 264 rows
INFO :Main:stream_remote_resources: Processed 264 rows
INFO :Main:DONE lib/stream_remote_resources.py
INFO :Main:dump.to_zip: INFO :Main:Processed 264 rows
INFO :Main:DONE lib/dump/to_zip.py
INFO :Main:RESULTS:
INFO :Main:SUCCESS: ./worldbank-co2-emissions
{'dataset-name': 'co2-emissions', 'total_row_count': 264}
```
----
## Custom processors
```python
# Add license information
from datapackage_pipelines.wrapper import ingest, spew
_, datapackage, resource_iterator = ingest()
datapackage['license'] = 'CC-BY-SA'
spew(datapackage, resource_iterator)
```
----
## What else?
- Advanced standard library
- Scheduled execution
- Execution dashboard
- Pipeline dependencies
- Caching results
----
## What else?
http://next.obudget.org/pipelines
---
# Find us
- [Website](http://frictionlessdata.io/)
- [Specifications](http://specs.frictionlessdata.io/)
- [Case Studies](http://frictionlessdata.io/case-studies/)
- [GitHub](https://github.com/frictionlessdata/)
- [Twitter](https://twitter.com/okfnlabs)
- [Gitter](https://gitter.im/frictionlessdata/chat)
- [Tool Fund](http://toolfund.frictionlessdata.io)
- [@pwalsh](https://github.com/pwalsh)
- [@akariv](https://github.com/akariv)
Note:
- 5 MINUTES: QA