# Feedback on Datahub Next
## Portal.js
> https://github.com/datopian/portal.js
- Brand it as something like "Hugo/Astro/Jekyll for data literate documents".
- Rill Developer is working on something similar but with Svelte.
- Taking this Frictionless dataset (https://portal-js.vercel.app/) as an example:
  - Is there a way to provide a small playground (Datasette style)? Would that be a component? (rough sketch below)
  - Is there a way to automatically "archive" the datasets?
  - Will this dataset be exposed publicly under a common API?
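A minimal sketch of what a Datasette-style playground could do behind the scenes, assuming DuckDB's Python package and using the OWID energy CSV referenced later in this note as a stand-in for a Portal.js dataset (the column names are taken from that dataset and may need adjusting):

```python
# Hypothetical playground backend: run an ad-hoc query against a published CSV
# without downloading or ingesting it first. The URL and columns come from the
# OWID energy dataset mentioned below; swap in any Portal.js-published file.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # needed to read files over HTTP
con.execute("LOAD httpfs")

URL = "https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv"

# A playground component would template the user's query in here.
result = con.sql(f"""
    SELECT country, year, renewables_share_energy
    FROM read_csv_auto('{URL}')
    WHERE country = 'World'
    ORDER BY year DESC
    LIMIT 5
""")
print(result)
```

The same kind of query could conceivably run fully client-side with DuckDB-WASM, which is what would make it a drop-in component.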
## Data Literate Notebooks
> https://datahub.io/docs/dms/notebook
- You could even embed pyscript and DuckDB/SQLite! :heart_eyes:
- Similar to evidence.dev. Is there a way to reuse/collaborate between the two projects?
- Ultimately, this feels more on the publishing side. I feel there is still something like Rath or Rill Dev that is needed before this to explore and understand the data. I'm thinking along the lines of the ideas I share in the Open Source Web Data IDE section (https://publish.obsidian.md/davidgasquez/Open+Data#Open+Source+Web+Data+IDE).
That said, this is the kind of doc I (and I'm sure lots of organizations) would love to be able to write:
```
## Renewable Energy
\```energy_data
select * from https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv
\```
Since the Industrial Revolution, the energy mix of most countries across the world has become...
...
...
...
### How much of our primary energy comes from renewables?
\```python
import something
daily_energy = something.process(energy_data)
\```
<LineChart
data={daily_energy}
x=date
y=energy
yAxisTitle="Energy line chart!"
/>
<importFromFile(graphs/complex_graph.json)/>
```
Basically, having a more accessible and open way to emulate the original research: https://ourworldindata.org/renewable-energy.
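As a rough sketch of what the `energy_data` fence above could compile to in a notebook runtime (names like `run_sql_block` are illustrative, not an existing DataHub API):

```python
# Illustrative only: a SQL fence becomes a named query whose result is exposed
# to later python/component blocks under the fence's name. Requires the duckdb
# and pandas packages.
import duckdb


def run_sql_block(sql: str):
    """Execute one SQL fence and return its result as a pandas DataFrame."""
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    return con.sql(sql).df()


energy_data = run_sql_block("""
    SELECT *
    FROM read_csv_auto('https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv')
""")

# Later blocks (the `python` fence, the <LineChart /> component) would receive
# `energy_data` by name and derive things like `daily_energy` from it.
print(energy_data.head())
```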
Looking at https://app.excalidraw.com/l/9u8crB2ZmUo/j2SkAQNJ8n, it seems this is moving closer to classic enterprise BI tools (Hex, Mode Analytics, Evidence.dev). This might be the right call, as there is no easy way to use Looker/Mode with random datasets, but I'm curious how these tools could be integrated/play well together.
> [rufus] this was a bit more towards what i call the "Data Project" product option vs the "Data README" product option. That said, the first part of this flow may be needed whatever you do.
Love the rest of the notebook discussion and ideas. They raised more questions for me. Feel free to ignore them, though.
- Would love to know if you have ideas around what can be learned from package managers like Cargo in Rust and Nix? **💬2023-03-13 have not really looked at cargo/nix. My overall sense was I got a bit less interested in package managers b/c of both the deno/go type approach plus the thought that this is something that emerges from a need**
- You wrote about checking if there was a need for a simple dataflow library. Did this result in https://github.com/datopian/datapipes and https://github.com/datahq/dataflows? What are the plans there? **💬2023-03-13 yes, those were two efforts. Still think there is some missing sweet spot of super simple. That said, my conclusion was this is something to "buy" not build as there is so much out there, especially e.g. prefect etc**
- re Aircan: this is the approach I'd love to explore more! Reusing/integrating the current tooling instead of building new tools. How can the problem be broken down in a way that lets us reuse more existing tooling and approaches?
## Datahub Next Architecture

- How does the user explore the data before _Edit & Push_? I guess the answer is the "Data Canvas" idea (load some data, explore it quickly). Mentioning this because I know these things will happen 99% of the time: **💬2023-03-13 great question. one way is to just allow people to push - does it matter if your tests fail when you first push to github.**
  - Data will be messy/incorrect
  - Some light transformations might be needed
- In my mind, there is an orthogonal box in that diagram: the Data Package Manager. You have it locally and work with it to edit/push; it needs to work with the "DB" and is used by the "presentation" layer. Not sure what the MVP is here, though. Specifically, I'd love to figure out how it could play well with the different boxes in the diagram.
- Reading the plan (https://datahub.io/notes/plan) and the v3 docs (https://datahub.io/docs/dms/datahub/v3) gives a great overview of the specifics! Again, I'm super stoked you are doing all of this in public! :heart_eyes:
- Workflows like Data Validation and Data Summary are quite common in enterprise and have great solutions. This was something I contributed to back at PL: https://github.com/bacalhau-project/amplify.
- I've been tracking GitHub Blocks (https://blocks.githubnext.com/); it could be something interesting to apply to your idea! **💬2023-03-13 had not seen this and this has a lot there.**
- Data APIs (https://datahub.io/docs/dms/data-api):
  - They can be cheap and very fast using R2 and CDNs alongside something like ROAPI (https://github.com/roapi/roapi) on a serverless platform. Not sure if this could even be done in WASM with things like DuckDB Shell (https://shell.duckdb.org/) or Seafowl (https://github.com/splitgraph/seafowl). Even faster if, in the backend, the data is transformed from CSVs to compressed Parquet files (see the sketch after this list). The folks at HuggingFace do that and it enables things like this: https://twitter.com/davidgasquez/status/1605174686081601536. The main insight here is that nowadays, most of the data CAN be processed on your laptop. I know this was something you were dealing with back in 2011 (https://www.youtube.com/watch?v=guNjG05ra9k).
  - Let me know if you want me to send a PR to https://datahub.io/notes/design-data-api with the ideas! **✅2023-03-13 👍 great idea ~rufus**
- I think there are lots of great ideas in ODF (https://github.com/open-data-fabric/open-data-fabric), Seafowl (https://seafowl.io/docs/guides/baking-dataset-docker-image), Datalad (https://www.datalad.org/) and Quilt (https://github.com/quiltdata/quilt) that could be salvaged/reused.
- More GitHub data examples (https://github.com/datopian/datahub-next/blob/main/content/notes/github-data-examples.md):
  - https://github.com/owid/owid-datasets **💬2023-03-13 great - please add it to the page**
- Tried to adapt the DataHub Next DMS drawing to the mental image I have of how everything fits together: https://excalidraw.com/#json=rOcjhjUYA68H4k-FvaJkm,6NUttcw__lu_UNQE8cN_qw
  - Raw: https://excalidraw.com/#room=b1845be57a46bccaf250,iK3i_MfTFuF8bn28eRxK3g
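A small sketch of the CSV-to-Parquet step from the Data APIs bullet above, assuming DuckDB in the backend; file names and the storage/CDN layer (R2, S3, ...) are placeholders:

```python
# Backend transformation sketch: convert the published CSV once into compressed
# Parquet, then upload the Parquet file to object storage/CDN (R2, S3, ...) where
# engines like DuckDB-WASM, ROAPI or Seafowl can query it with range requests.
# File names are placeholders.
import duckdb

con = duckdb.connect()
con.execute("""
    COPY (SELECT * FROM read_csv_auto('owid-energy-data.csv'))
    TO 'owid-energy-data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```

Serving columnar, compressed files is what makes the "cheap API on a CDN" and "process it on your laptop" angles viable: clients fetch only the columns and row groups they need.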

## Frictionless
> https://specs.frictionlessdata.io/data-package/
- https://github.com/frictionlessdata/framework is a great project, but I wonder how many new "custom checks" and features it could get if it moved closer to the "modern data stack" and integrated with other tooling, something like https://github.com/great-expectations/great_expectations or https://github.com/cleanlab/cleanlab (small validation sketch after this list). **💬2023-03-13 great expectations etc came along since Frictionless started and, IMO, has somewhat over-taken it now.**
- This is me again trying to think how these great protocols could be made less isolated from all the cool things happening in the data space. Instead of building custom tooling, build a protocol and integrate with external ones (so easy to say!).
- TODO (david)
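To make the validation point concrete, a minimal sketch using the Frictionless framework's built-in checks; the file name is a placeholder, and the report could just as well be handed to great_expectations-style tooling downstream:

```python
# Minimal validation sketch with the Frictionless framework. The CSV path is a
# placeholder; the report could feed a showcase page or another tool in the stack.
from frictionless import validate

report = validate("owid-energy-data.csv")

if not report.valid:
    # The report carries per-row/per-field errors (missing cells, type errors, ...).
    print(report)
```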
## Iterating the story
I'm in a company. I have a hunch I want to investigate with data.
In a company this is relatively easy to do:
- I know where to look.
- If the data is there, it is probably easily queryable (schemas, tests, ...) as there is a team (the data team) managing it.
- Getting external data is also easy. Run your Singer or Airbyte tap: https://airbyte.com/
- For each part of the stack, there are interoperable standards and tools. e.g. warehouses (S3, Redshift, Clickhouse), modeling (getdbt, Airflow + python), orchestration (prefect, dagster, airflow, ...) etc
With public data, it is much more painful:
e.g. I want to check if the number of graduates from $UNIVERSITY is correlated with the GDP of $COUNTRY
I want to be able to grab and work on a public data source as easily as I do in the company and reuse the tools in a data company.
```sql!
select
    date,
    graduates,
    gdp
from university_data
left join country_data
    using (date)  -- assuming both sources share a date/year column
```
or even something like `data get organization/university_data --json && data get organization/country_data --format json`.
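A hedged sketch of what such a `data get` command could do under the hood, using the Frictionless Package/Resource classes; the datapackage URL and resource name are hypothetical:

```python
# Hypothetical `data get organization/university_data --json` equivalent.
# The URL and resource name are made up for illustration.
import json

from frictionless import Package

pkg = Package("https://example.org/organization/university_data/datapackage.json")
resource = pkg.get_resource("university_data")

# Dump rows as JSON, ready to pipe into the next tool (DuckDB, dbt seeds, ...).
rows = resource.read_rows()
print(json.dumps([dict(row) for row in rows[:5]], default=str, indent=2))
```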
In a company, models compound. You don't have to derive the same models over and over. In Open Data, we usually only have the raw data.
We could collaborate on data like people are doing in some DAOs (https://github.com/duneanalytics/spellbook/tree/main/models, https://github.com/MetricsDAO/near_dbt/pull/98) and reuse those models in a permissionless way.
## Iterating data packaging workflow
### New dataset
- I come across an interesting dataset online. E.g: https://github.com/datasets/awesome-data/issues/334
- I want to create a page for it.
- Doing this should also publish a package + API.
- Two options:
  - Option 1: Add a page to the "wiki" (quick and dirty). Open a PR on a certain repository acting as a hub.
  - Option 2: Its own folder/repo 👈 preferred (rough sketch after this workflow)
- The previous step should link it to Datahub automatically. Alternatively, Datahub could scrape GitHub searching for compliant datasets.
  - i.e. datahub.io/@david/my-dataset automatically proxies to github.com/david/my-dataset (and gives an error if it's not there or not data ...)
- Showcase: turn README into a nice page maybe with extra features - https://datahub.io/notes/design-showcase
- MDX+Data aka data rich documents: markdown + MDX + data components for tables, graphs etc
- Proxy issues, PRs, ...
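For the "its own folder/repo" option, a small sketch of what generating the package could look like with the Frictionless `describe` helper; paths and the dataset name are placeholders:

```python
# Sketch: describe the raw files once, commit the generated datapackage.json next
# to them, and let DataHub pick the repo up (or proxy it) from there.
# Paths and the dataset name are placeholders.
from frictionless import describe

package = describe("data/*.csv", type="package")
package.name = "my-dataset"  # would back datahub.io/@david/my-dataset
package.to_json("datapackage.json")
```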
### Existing dataset in Datahub
- I saw the showcase/README from the previous workflow. I want to add a new graph or relation!
- For a simple graph, fork the repository and update the README. Make PR.
- _For a complex dataset? How can we compose data in this case?_ E.g:
### Existing dataset in X (HuggingFace, GitHub, Socrata, FTP, Random database)
- I come across an interesting dataset online. E.g: https://github.com/datasets/awesome-data/issues/334
- Since it is on Kaggle, I can somehow use a _Kaggle dataset adapter_ and use it as if it were any other dataset (rough adapter sketch after this list).
- It might get ingested in the backend to create a copy
- Same steps as the first workflow
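A very rough sketch of that adapter idea, assuming the official `kaggle` Python package with credentials configured; the dataset slug and paths are hypothetical:

```python
# Hypothetical Kaggle dataset adapter: pull the external dataset into a local
# folder, then wrap it as a data package so the rest of the workflow treats it
# like any other dataset. Dataset slug and paths are made up.
from frictionless import describe
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Ingest step: create a local copy of the external dataset.
api.dataset_download_files("some-org/some-dataset", path="data/", unzip=True)

# Adapter output: the same datapackage.json shape as any native DataHub dataset.
package = describe("data/*.csv", type="package")
package.to_json("datapackage.json")
```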
## Principles
If I had to summarize the principles/guidelines I have in mind while exploring this, it would be:
- Integrate protocols with the common data stack tools to avoid reinventing the wheel (ETL, formats, orchestrators, ...).
- Make things simpler (wrappers) and have strong opinions on top of the tools.
- Data as code. Workflows as code, ...
- Frictionless Data Composability/collaboration (e.g. build SQL models on top of other people's models).
- Build adapters as a community so data becomes connected.
- Simplicity/pragmatism (CSV, published data is better than almost-published data, ...)
- Packaging should be format/platform agnostic.