# Frictionless Data

![Frictionless Data](https://i.imgur.com/LTj6Kh9.png =200x)

Lightweight specifications and tooling to make it effortless to get, share, and validate data.

Note:
- 5 MINUTES FOR INTRO -> PROBLEM DEFINITIONS

---

# Who we are

**Paul Walsh**
Chief Product Officer
*Open Knowledge International*
https://github.com/pwalsh

**Adam Kariv**
Engineering Lead
*Open Knowledge International*
https://github.com/akariv

---

# Where we work

![Open Knowledge International](https://i.imgur.com/RJ1fNTl.png =200x)

[Open Knowledge International](https://okfn.org)

"A world where knowledge creates power for the many, not for the few."

----

## Mission

Open up all essential, public interest information and see it used to create insight that drives positive change.

----

## What we do

- technical platforms for open data
- thought leadership
- community facilitation

----

## CKAN [↗](https://ckan.org)

![CKAN](https://i.imgur.com/ZcEVt1k.png)

Note:
- The platform that started Open Knowledge - in many ways, defined practices around the publication of open data.
- Widely installed by governments and public bodies
- Several commercial providers globally

----

## OpenSpending [↗](http://next.openspending.org)

![OpenSpending](https://i.imgur.com/dLsMbFc.png)

Note:
- One of the oldest Open Knowledge projects
- Where does my money go -> OS
- Newest iteration led by Adam
- Newest iteration based on FD

----

## OpenTrials [↗](https://opentrials.net)

![OpenTrials](https://i.imgur.com/IPPZSpU.png)

Note:
- Opening up clinical trial data
- Link together disparate sources
- Fix academic publication around trials (false claims)
- Fix public access to trial data (Ebola example)

----

## Open Data Index [↗](http://index.okfn.org)

![Open Data Index](https://i.imgur.com/K3U11KB.png)

Note:
- Crowdsourced assessment tool for government data
- Great example of a community-engaged project
- High degree of engagement from governments - this machine changes policy

----

## OKI Labs [↗](http://okfnlabs.org/)

![OKI Labs](https://i.imgur.com/wgK94LI.png)

Note:
- Ad hoc tech/data community around Open Knowledge
- Small meetups and knowledge sharing
- Skunkworks

----

## OK Network [↗](http://okfn.org/network/)

![OK Network](https://i.imgur.com/UF8AKhC.png)

Note:
- Globally distributed
- Highly autonomous

---

# Frictionless Data

Plumbing, not a platform.

- Generic
- Reusable
- Modular
- Needs-driven

Addressing the pain of working with public data.

Note:
- 10 MINUTES FOR THIS SECTION UNTIL DEEP DIVE
- FD is not a platform. It is plumbing for open/public data
- To understand the offering, we need to understand the problems it solves (segues into the next section)

---

# Problems

`{{ govt_dept }}` wants to publish data, or is already publishing data.

What can we expect?

----

## Tidiness

Where is the data in the file? (sheet, header rows…)

![Imgur](http://i.imgur.com/vq4uuy7.png "This is how the monthly Consumer Price Index report is published. It's a table in an Excel file, but it's still not tabular data.")

Note:
This is how the monthly Consumer Price Index report is published. It's a table in an Excel file, but it's still not tabular data.

----

## Cleaning

Spurious values everywhere

![Imgur](http://i.imgur.com/e69wAH0.png)

Note:
This is how local municipality statistical data is published. There's a '..' and a '-' - which are not numbers but have special meaning - as well as the blue background color...

----

## Data Format

Dates, Numbers etc.
![Imgur](http://i.imgur.com/n1hiD3E.png)

Note:
The same report for two different years is published in two different date formats.

----

## Validation

constraints, mandatory columns, data types

![Imgur](http://i.imgur.com/45pJldj.png)

Note:
The 'LocalityCode' column looks like numbers but should be treated as strings. The 'EstbYr' column is conceptually an integer but has bad values.

----

## File format

Which version of Excel? Character encoding?

![Imgur](http://i.imgur.com/sNBDXTl.png)

Note:
All the icons lead to a separate Excel file from a different ministry or authority. Each of these files was generated as a report from the same government system. Nevertheless, each one of them has a (slightly) different set of columns and is stored using a different version of MS Excel.

----

## Metadata

When was this file published? Who published it?

![Imgur](http://i.imgur.com/MpXqWrB.png)

Note:
Website says the last update date was Dec. 29th, 2016. Actual files are from 2017.

----

## Documentation

What does each column mean?

![Imgur](http://i.imgur.com/J3JdRe6.png)

Note:
I tried to get some explanations from a government official on some columns of the national budget file. His response: 'What next? Shall I also make you a cup of coffee?'

---

# Solutions

- Frictionless Data is a set of specifications and tooling to address these problems and more.
- The solutions are all simple in isolation.
- Together, they build a powerful stack for working with open data, benefiting publishers and consumers.

Note:
- Here we do not want to dive into the FD spiel. We want to talk about solutions to the problems presented.

----

## On disk

```shell
/package/                  ## files generally co-located
/package/datapackage.json  ## descriptor
/package/data1.csv         ## data source
/package/data2.csv         ## data source
```

----

## Descriptor

A **Data Package** descriptor looks as follows:

```json
{
  "title": "Procurement Log May 2016",
  "name": "procurement-log-may-2016",
  "description": "Narrative description",
  "resources": [
    {... data resource ...},
    {... data resource ...}
  ],
  ... other properties ...
}
```

Note:
- Packages a data collection
- Provides metadata for the whole collection
  - documentation can be a form of metadata
- Minimum required for interoperability
- A range of optional fields

----

## Descriptor

A **Data Resource** descriptor looks as follows:

```json
{
  "name": "may-2016",
  "period": "2016-05",
  "data": [
    "http://example.com/procurement-may-2016.csv"
  ],
  "schema": {... table schema ...},
  "format": "csv",
  "encoding": "utf-8",
  ... other properties ...
}
```

Note:
- Packages a single source of data
- Provides metadata for that data
  - documentation can be a form of metadata
- Minimum required for interoperability
- A range of optional fields

----

## Descriptor

A **Table Schema** descriptor looks as follows:

```json
{
  "fields": [
    { "name": "transaction_id", "type": "string" },
    { "name": "amount", "type": "number" },
    {
      "name": "year",
      "type": "integer",
      "constraints": {
        "required": true,
        "minimum": 1948,
        "maximum": 2017
      }
    }
  ]
}
```

Note:
- Declarative schema for a tabular data source
- Implementation agnostic
  - Not tied to a language, an ORM, or a database
- Designed for, but not limited to, data as text
- Minimum required for interoperability
- A range of optional fields
- Table Schema, on top of Data Resource and Data Package, is where things get interesting - it is a fundamental building block for higher-level work, which we'll see later.
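
----

## Table Schema in code

A minimal sketch of what such a descriptor enables, assuming the `tableschema` Python library (introduced later) and its `Schema.cast_row` method:

```python
# Cast raw text values (e.g. read from a CSV) against the declared Table Schema.
from tableschema import Schema

schema = Schema({
    'fields': [
        {'name': 'transaction_id', 'type': 'string'},
        {'name': 'amount', 'type': 'number'},
        {'name': 'year', 'type': 'integer'},
    ]
})

# Values arrive as strings; cast_row converts them to native Python types.
row = schema.cast_row(['tx-0001', '129.95', '2016'])
print(row)
```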
----

## Conventions

- Conventions are formalised as specifications
- Specifications enable tooling
- Tooling targets publishers and consumers

Note:
- We try not to invent a new file format or a new technology
- We take widely used solutions and formalise the best practices around them.

----

## Modular

![stack1](https://i.imgur.com/aQRXUnI.png =700x)

Note:
- A set of concepts designed for composition
- TODO explain the bricks from the bottom up - that is the rest of the presentation...
- TODO repeat this slide with highlights of parts in focus.....

----

## Extendable

![stack2](https://i.imgur.com/Miu0MrU.png =700x)

Note:
- The core specifications and libraries are building blocks for higher-level applications
- Primarily concerned at this level with
  - Data processing (ETL)
  - Data quality

----

## That's it

No magic, no amazing insights.

- Table Schema
- Data Resource
- Data Package

Simple, extendable, reusable.

Note:
- We build our world on these three fundamental concepts
- They provide the building blocks for working with tabular data
- Requirements that are driven by simplicity
- Extensibility and customisation by design
- Metadata that is human-editable and machine-usable
- Reuse of existing standard formats for data
- Language-, technology- and infrastructure-agnostic

---

# Specifications [↗](http://specs.frictionlessdata.io)

- [Table Schema](http://specs.frictionlessdata.io/table-schema/)
- [Data Resource](http://specs.frictionlessdata.io/data-resource/)
- [Tabular Data Resource](http://specs.frictionlessdata.io/tabular-data-resource/)
- [Data Package](http://specs.frictionlessdata.io/data-package/)
- [Tabular Data Package](http://specs.frictionlessdata.io/tabular-data-package/)
- [Fiscal Data Package](http://specs.frictionlessdata.io/fiscal-data-package/)
- [Data Package Identifier](http://specs.frictionlessdata.io/data-package-identifier/)
- [CSV Dialect](http://specs.frictionlessdata.io/csv-dialect/)

Note:
- In common:
  - Descriptor
  - Profiles
  - Extendable
  - Web oriented (JSON, CSV, Streaming)
  - Reuse (JSON Pointer, etc)
- We'll just look a bit more at three:
  - Data Package
  - Tabular Data Package
  - Table Schema

----

## Data Package [↗](http://specs.frictionlessdata.io/data-package/)

```json
{
  "title": "My Data Package",
  "resources": [
    {
      "data": [
        "path/to-data.csv"
      ]
    }
  ]
}
```

Note:
- Yes, this is it. Incredibly simple.
- Go to the spec page and also look at the optional properties
- Overview of main properties for spec
- Show Data Resource as part of Data Package
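
----

## Loading a Data Package

A minimal sketch of consuming this descriptor, assuming the `datapackage` Python library (introduced later) and its 1.x `Package` API:

```python
# Load a Data Package descriptor and inspect it.
from datapackage import Package

package = Package('datapackage.json')

print(package.valid)                # True if the descriptor matches the Data Package profile
print(package.descriptor['title'])  # 'My Data Package'

for resource in package.resources:
    print(resource.descriptor)      # each entry under "resources"
```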
----

## Tabular Data Package [↗](http://specs.frictionlessdata.io/tabular-data-package/)

```json
{
  "profile": "tabular-data-package",
  "title": "My Data Package",
  "resources": [
    {
      "data": [
        "path/to-data.csv"
      ],
      "schema": {
        "fields": [
          { "name": "age", "type": "integer" }
        ]
      }
    }
  ]
}
```

Note:
- A profile of Data Package that extends the base spec
- This is the most common use case for public/open data - tabular resources / CSV
- Tabular data in plain text - we need a schema
- Show Tabular Data Resource as part of Tabular Data Package

----

## Table Schema [↗](http://specs.frictionlessdata.io/table-schema/)

```json
{
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "email", "type": "string", "format": "email" },
    {
      "name": "date_of_birth",
      "type": "date",
      "format": "YYYY-MM-DD",
      "constraints": { "required": true }
    }
  ]
}
```

Note:
- Implementation-agnostic schemas for tabular data
- Explain what is in it and why
- Overview of main properties for spec
- Was designed for declaring schemas for text-based data. Can be used to develop things like declarative ORMs or form validation libraries. We've experimented with an ORM-type lib in "jsontableschema-models-js"

---

# Libraries

Core libraries that implement the specifications.

- Data Package: [Python](https://github.com/frictionlessdata/datapackage-py) [JavaScript](https://github.com/frictionlessdata/datapackage-js) [Ruby](https://github.com/frictionlessdata/datapackage-rb) [R](https://github.com/frictionlessdata/datapackage-r)
- Table Schema: [Python](https://github.com/frictionlessdata/tableschema-py) [JavaScript](https://github.com/frictionlessdata/tableschema-js) [Ruby](https://github.com/frictionlessdata/jsontableschema-rb)

Note:
- These are the core on which other tools are developed
- We maintain the Python and JavaScript implementations
- Ruby maintained by the Open Data Institute, R by rOpenSci - both under our guidance
- More to come (tool fund ->)

----

## Aside: Tool Fund [↗](http://toolfund.frictionlessdata.io)

[![Tool Fund](http://i.imgur.com/GZRBXVF.png)](http://toolfund.frictionlessdata.io)

Note:
- We are offering minigrants of $5000 for implementations
- Pull some details from the website

---

# Demo [↗](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)

[![Frictionless Data Notebooks](http://i.imgur.com/PQVpLoe.png)](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)

Note:
- A short demonstration of the core libraries

---

# Core extensions and utilities

- [Tabulator](https://github.com/frictionlessdata/tabulator-py)
- [Table Schema SQL](https://github.com/frictionlessdata/jsontableschema-sql-py)
- [Table Schema BigQuery](https://github.com/frictionlessdata/jsontableschema-bigquery-py)
- [Table Schema Pandas](https://github.com/frictionlessdata/jsontableschema-pandas-py)
- [Coming to Pandas and Jupyter](https://github.com/pandas-dev/pandas/issues/14386)

Note:
- Tabulator is a consistent iteration interface for reading and writing tabular data sources
  - CSV, Excel, Google Sheets, JSON, NDJSON, Open Office (ODS)
- A short usage sketch follows on the next slide
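
----

## Tabulator in code

A minimal usage sketch, assuming the `tabulator` Python library's `Stream` class; the file name is only an illustration:

```python
# One streaming iteration interface over many tabular formats
# (CSV here; Excel, ODS, Google Sheets etc. work the same way).
from tabulator import Stream

with Stream('procurement-may-2016.csv', headers=1) as stream:
    print(stream.headers)       # first row is used as the header row
    for row in stream.iter():   # rows are streamed, not loaded into memory at once
        print(row)
```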
---

# Higher-level applications

Higher-level applications leverage the stack for data quality and validation goals, and for complex processing pipelines.

----

## goodtables [↗](https://github.com/frictionlessdata/goodtables-py)

- Concerned with data quality (consistency, structure, schema)
- Generates detailed reports for end users (non-technical)
- Based on the [Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec), can easily customise
- We are working on a web service around this - goodtables.io: "Continuous Data Validation"

----

## Data Package Pipelines [↗](https://github.com/frictionlessdata/datapackage-pipelines)

- Stream-based ETL, producing Data Packages
- Meant for loading messy tabular data with a powerful standard library

----

## OpenSpending [↗](https://github.com/openspending/openspending)

- Platform for fiscal data
  - OLAP API
  - Advanced visualisation components
  - Interactive data modeling
- All data handling based on Frictionless Data
  - Fiscal Data Package specification
  - Flat file storage as single point of truth
  - SQL and Elasticsearch databases derived

---

# Demo [↗](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)

[![Frictionless Data Notebooks](http://i.imgur.com/PQVpLoe.png)](https://github.com/frictionlessdata/notebooks/blob/master/Introduction%20to%20Frictionless%20Data.ipynb)

Note:
- What is goodtables
- How does it work
- What is the report
- Data Quality as a first-order concern
- Data Quality spec

---

# Data Package Pipelines

A data processing framework based on Data Packages

Note:
- 15 MINUTES for DPP section

----

## Philosophy and Origin

- Contribution oriented
- Many similar tasks vs. one big graph
- Data Package native

First use case: scrape and process EU subsidy data from ~140 different data sources across Europe.

----

## Contribution oriented

- Coding vs Declaring:
  - Define processing tasks in YAML, not code
  - Validate YAMLs using a schema
  - Processing pipelines are built of small & simple building blocks
  - A powerful 'standard library' of processing steps

----

## Contribution oriented

- Performance:
  - Processing is done on data streams
  - Limits memory, CPU and disk resource use
  - Allows more flexibility in running pipelines concurrently (and keeps hardware requirements modest)

----

## Contribution oriented

- Maintenance:
  - Easier to understand and maintain
  - Limits the complexity of implementation -> allows contribution from non-technical users

----

## Many similar tasks

- Most ETL solutions focus on creating a big graph of tasks

![Imgur](http://i.imgur.com/KzBn4Pt.png)

Note:
Graph from Airflow's website, an ETL solution from Airbnb

----

## Many similar tasks

- Our use cases consist of many similar, disconnected sources
- Consider:
  - Scraping and processing yearly budgets from Israel's 259 different municipalities
  - Each one published in a slightly different format
  - Each one in their own web portal etc.

----

## Data Package Native

Benefit from the entire toolset - DRY!

- Each step reads a datapackage from stdin
- Modify metadata
- Modify resources
- Emit a datapackage to stdout

Datapackage + Data are always valid
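
----

## A step in code

A minimal sketch of a single step honouring this contract, using the same `ingest`/`spew` wrapper as the custom processor shown later; the row transformation here is purely illustrative:

```python
# Read the incoming datapackage and row streams, modify both, emit to stdout.
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

# Modify metadata
datapackage['name'] = 'my-dataset'

# Modify resources: strip whitespace from every string value as rows stream through
def process_rows(resource):
    for row in resource:
        yield {
            key: value.strip() if isinstance(value, str) else value
            for key, value in row.items()
        }

def process_resources(resources):
    for resource in resources:
        yield process_rows(resource)

# Emit a datapackage to stdout
spew(datapackage, process_resources(resource_iterator))
```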
----

## Example

CO2 Emissions from World Development Indicators

![Imgur](http://i.imgur.com/SOp0FCr.png)

http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel

----

## Pipeline

```yaml
worldbank-co2-emissions:
  pipeline:
    - run: add_metadata
      parameters:
        name: 'co2-emissions'
        title: 'CO2 emissions (metric tons per capita)'
        homepage: 'http://worldbank.org/'
    - run: add_resource
      parameters:
        name: 'global-data'
        url: "http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel"
        format: xls
        headers: 4
```

----

## Pipeline (cont.)

```yaml
    - run: stream_remote_resources
      cache: True
    - run: set_types
      parameters:
        resources: global-data
        types:
          "[12][0-9]{3}":
            type: number
    - run: dump.to_zip
      parameters:
        out-file: co2-emissions-wb.zip
```

----

## Running it

```bash
$ pip install datapackage-pipelines
$ dpp
Available Pipelines:
- ./worldbank-co2-emissions (*)

$ dpp run ./worldbank-co2-emissions
INFO :Main:RUNNING ./worldbank-co2-emissions
INFO :Main:- lib/add_metadata.py
INFO :Main:- lib/add_resource.py
INFO :Main:- lib/stream_remote_resources.py
INFO :Main:- lib/dump/to_zip.py
INFO :Main:DONE lib/add_metadata.py
INFO :Main:DONE lib/add_resource.py
INFO :Main:stream_remote_resources: OPENING http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.PC?downloadformat=excel
INFO :Main:stream_remote_resources: TOTAL 264 rows
INFO :Main:stream_remote_resources: Processed 264 rows
INFO :Main:DONE lib/stream_remote_resources.py
INFO :Main:dump.to_zip:
INFO :Main:Processed 264 rows
INFO :Main:DONE lib/dump/to_zip.py
INFO :Main:RESULTS:
INFO :Main:SUCCESS: ./worldbank-co2-emissions {'dataset-name': 'co2-emissions', 'total_row_count': 264}
```

----

## Custom processors

```python
# Add license information
from datapackage_pipelines.wrapper import ingest, spew

_, datapackage, resource_iterator = ingest()

datapackage['license'] = 'CC-BY-SA'

spew(datapackage, resource_iterator)
```

----

## What else?

- Advanced standard library
- Scheduled execution
- Execution dashboard
- Pipeline dependencies
- Caching results

----

## What else?

http://next.obudget.org/pipelines

---

# Find us

- [Website](http://frictionlessdata.io/)
- [Specifications](http://specs.frictionlessdata.io/)
- [Case Studies](http://frictionlessdata.io/case-studies/)
- [GitHub](https://github.com/frictionlessdata/)
- [Twitter](https://twitter.com/okfnlabs)
- [Gitter](https://gitter.im/frictionlessdata/chat)
- [Tool Fund](http://toolfund.frictionlessdata.io)
- [@pwalsh](https://github.com/pwalsh)
- [@akariv](https://github.com/akariv)

Note:
- 5 MINUTES: QA