Lightweight specifications and tooling to make it effortless to get, share, and validate data.
Paul Walsh
Chief Product Officer
Open Knowledge International
https://github.com/pwalsh
Adam Kariv
Engineering Lead
Open Knowledge International
https://github.com/akariv
"A world where knowledge creates power for the many, not for the few."
Open up all essential, public interest information and see it used to create insight that drives positive change.
Plumbing, not a platform.
Addressing the pain of working with public data.
{{ govt_dept }}
wants to publish data, or is already publishing data.
What can we expect?
Where is the data in the file? (sheet, header rows…)
Spurious values everywhere
Dates, numbers, etc.
constraints, mandatory columns, data types
Which version of Excel? Character encoding?
When was this file published? Who published it?
What does each column mean?
Frictionless Data is a set of specifications and tooling to address these problems and more.
The solutions are all simple in isolation.
Together, they build a powerful stack for working with open data, benefiting publishers and consumers.
/package/ ## files generally co-located
/package/datapackage.json ## descriptor
/package/data1.csv ## data source
/package/data2.csv ## data source
A Data Package descriptor looks as follows:
{
"title": "Procurement Log May 2016",
"name": "procurement-log-may-2016",
"description": "Narrative description",
"resources": [
{... data resource ...},
{... data resource ...}
],
... other properties ...
}
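Because the descriptor is plain JSON, it can be built and round-tripped with nothing more than a standard JSON library. A minimal sketch mirroring the descriptor above (the real `datapackage` Python library offers a richer `Package` class; this is just illustration):

```python
import json

# Build a Data Package descriptor as a plain dict,
# mirroring the example descriptor above.
descriptor = {
    "title": "Procurement Log May 2016",
    "name": "procurement-log-may-2016",
    "description": "Narrative description",
    "resources": [
        {
            "name": "may-2016",
            "data": ["http://example.com/procurement-may-2016.csv"],
            "format": "csv",
            "encoding": "utf-8",
        },
    ],
}

# Serialize to datapackage.json -- any JSON tool can read it back.
serialized = json.dumps(descriptor, indent=2)
roundtrip = json.loads(serialized)
```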
A Data Resource descriptor looks as follows:
{
"name": "may-2016",
"period": "2016-05",
"data": [ "http://example.com/procurement-may-2016.csv" ],
"schema": {... table schema ...},
"format": "csv",
"encoding": "utf-8",
... other properties ...
}
A Table Schema descriptor looks as follows:
{
"fields": [
{ "name": "transaction_id", "type": "string" },
{ "name": "amount", "type": "number" },
{ "name": "year", "type": "integer",
"constraints": {
"required": true,
"minimum": 1948,
"maximum": 2017
}
}
]
}
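The constraints in the schema above translate directly into simple checks. A hand-rolled sketch of validating the `year` field (a hypothetical helper for illustration, not the `tableschema` library API):

```python
# Constraints taken from the "year" field in the schema above
CONSTRAINTS = {"required": True, "minimum": 1948, "maximum": 2017}

def check_year(value, constraints=CONSTRAINTS):
    """Return True if a raw value satisfies the 'year' field definition."""
    if value is None:
        # missing value is only acceptable when not required
        return not constraints.get("required", False)
    year = int(value)  # "type": "integer" -> cast before checking bounds
    if "minimum" in constraints and year < constraints["minimum"]:
        return False
    if "maximum" in constraints and year > constraints["maximum"]:
        return False
    return True
```

For example, `check_year("2016")` passes, while `None`, `"1947"`, and `"2018"` all fail.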
No magic, no amazing insights.
Simple, extendable, reusable.
{
"title": "My Data Package",
"resources": [
{
"data": [ "path/to-data.csv" ]
}
]
}
{
"profile": "tabular-data-package",
"title": "My Data Package",
"resources": [
{
"data": [ "path/to-data.csv" ],
"schema": {
"fields": [
{ "name": "age", "type": "integer" }
]
}
}
]
}
{
"fields":[
{
"name": "first_name",
"type": "string"
},
{
"name": "email",
"type": "string",
"format": "email"
},
{
"name": "date_of_birth",
"type": "date",
"format": "YYYY-MM-DD",
"constraints": { "required": true }
}
]
}
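To show what these field definitions mean in practice, here is a hand-rolled cast of one row against the schema above (a sketch only; the `tableschema` libraries implement full, spec-compliant casting, and the email pattern here is deliberately loose):

```python
import datetime
import re

def cast_row(row):
    """Cast [first_name, email, date_of_birth] per the schema above.

    Raises ValueError when a value violates its field definition.
    """
    first_name, email, date_of_birth = row
    # "type": "string" needs no conversion.
    # "format": "email" -- a loose pattern, for illustration only.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        raise ValueError("invalid email: %r" % email)
    # "type": "date", "format": "YYYY-MM-DD", "required": true
    if not date_of_birth:
        raise ValueError("date_of_birth is required")
    dob = datetime.date.fromisoformat(date_of_birth)
    return [first_name, email, dob]
```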
Core libraries that implement the specifications.
Higher-level applications leverage the stack for data quality and validation, and for complex processing pipelines.
A data processing framework
based on Data Packages
First use case:
scrape and process EU subsidy data from ~140 different data sources across Europe.
→ Allows contribution from non-technical users
Our use case consists of many similar, disconnected sources
Consider:
Benefit from the entire toolset - DRY!
Data Package + data are always valid
CO2 Emissions from World Development Indicators
http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
worldbank-co2-emissions:
pipeline:
-
run: add_metadata
parameters:
name: 'co2-emissions'
title: 'CO2 emissions (metric tons per capita)'
homepage: 'http://worldbank.org/'
-
run: add_resource
parameters:
name: 'global-data'
url: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
format: xls
headers: 4
-
run: stream_remote_resources
cache: True
-
run: set_types
parameters:
resources: global-data
types:
"[12][0-9]{3}":
type: number
-
run: dump.to_zip
parameters:
out-file: co2-emissions-wb.zip
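The `set_types` step above uses the regular expression `[12][0-9]{3}` to match every year column (1000–2999) with a single rule, instead of listing dozens of columns by hand. A quick plain-Python illustration of what that pattern selects (not the dpp matcher itself, whose exact match semantics may differ):

```python
import re

# The same pattern used in the "set_types" step above
YEAR_PATTERN = re.compile(r"[12][0-9]{3}")

# Hypothetical header row from the World Bank spreadsheet
headers = ["Country Name", "Country Code", "1960", "2014", "Notes"]
year_columns = [h for h in headers if YEAR_PATTERN.fullmatch(h)]
# -> ["1960", "2014"]
```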
$ pip install datapackage-pipelines
$ dpp
Available Pipelines:
- ./worldbank-co2-emissions (*)
$ dpp run ./worldbank-co2-emissions
INFO :Main:RUNNING ./worldbank-co2-emissions
INFO :Main:- lib/add_metadata.py
INFO :Main:- lib/add_resource.py
INFO :Main:- lib/stream_remote_resources.py
INFO :Main:- lib/dump/to_zip.py
INFO :Main:DONE lib/add_metadata.py
INFO :Main:DONE lib/add_resource.py
INFO :Main:stream_remote_resources: OPENING http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
INFO :Main:stream_remote_resources: TOTAL 264 rows
INFO :Main:stream_remote_resources: Processed 264 rows
INFO :Main:DONE lib/stream_remote_resources.py
INFO :Main:dump.to_zip: INFO :Main:Processed 264 rows
INFO :Main:DONE lib/dump/to_zip.py
INFO :Main:RESULTS:
INFO :Main:SUCCESS: ./worldbank-co2-emissions
{'dataset-name': 'co2-emissions', 'total_row_count': 264}
# Add license information
from datapackage_pipelines.wrapper import ingest, spew

# ingest() yields the step parameters, the datapackage descriptor,
# and a lazy iterator over the resources' row streams
_, datapackage, resource_iterator = ingest()

# Modify the descriptor...
datapackage['license'] = 'CC-BY-SA'

# ...and pass the unmodified resource streams onwards
spew(datapackage, resource_iterator)
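Conceptually, the ingest/spew pair is just a pass-through over lazy row streams: a processor can patch the descriptor and wrap each resource iterator without ever loading a whole table into memory. A standard-library sketch of that idea (hypothetical names, not the dpp API):

```python
def add_license(datapackage, resource_iterator):
    """Pass-through processor: patch the descriptor, stream rows unchanged."""
    datapackage = dict(datapackage, license="CC-BY-SA")

    def passthrough(resources):
        for resource in resources:
            # rows stay lazy -- nothing is materialized here
            yield (row for row in resource)

    return datapackage, passthrough(resource_iterator)

# Usage: two tiny in-memory "resources" standing in for streamed tables
dp = {"name": "demo"}
resources = iter([iter([{"a": 1}, {"a": 2}]), iter([{"b": 3}])])
new_dp, new_resources = add_license(dp, resources)
rows = [list(r) for r in new_resources]
```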