David
# Feedback on Datahub Next

## Portal.js

> https://github.com/datopian/portal.js

- Brand it as something like "Hugo/Astro/Jekyll for data literate documents".
- Rill Developer is working on something similar but with Svelte.
- Taking this Frictionless dataset (https://portal-js.vercel.app/):
    - Is there a way to provide a small playground (Datasette style)? Would that be a component?
    - Is there a way to automatically "archive" the datasets?
    - Will this dataset be exposed publicly under a common API?

## Data Literate Notebooks

> https://datahub.io/docs/dms/notebook

- You could even embed pyscript and DuckDB/SQLite! :heart_eyes:
- Similar to evidence.dev. Is there a way to reuse/collaborate between the two projects?
- Ultimately, this feels more on the publishing side. I feel there is still something like Rath or Rill Dev that is needed before this to explore and understand the data. I'm thinking along the lines of the ideas I share in the Open Source Web Data IDE section (https://publish.obsidian.md/davidgasquez/Open+Data#Open+Source+Web+Data+IDE).

That said, this is the kind of doc I (and I'm sure lots of organizations) would love to be able to write:

```
## Renewable Energy

\```energy_data
select * from https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv
\```

Since the Industrial Revolution, the energy mix of most countries across the world has become...
...
...
...

### How much of our primary energy comes from renewables?

\```python
import something
daily_energy = something.process(energy_data)
\```

<LineChart data={daily_energy} x=date y=energy yAxisTitle="Energy line chart!" />

<importFromFile(graphs/complex_graph.json)/>
```

Basically, having a more accessible and open way to emulate the original research: https://ourworldindata.org/renewable-energy.

By looking at https://app.excalidraw.com/l/9u8crB2ZmUo/j2SkAQNJ8n, it seems this is moving closer to classic enterprise BI tools (Hex, Mode Analytics, Evidence.dev).
This might be the right call as there is no easy way to use Looker/Mode with random datasets, but I'm curious how these tools could be integrated/play well together.

> [rufus] this was a bit more towards what i call the "Data Project" product option vs the "Data README" product option. That said, the first part of this flow may be needed whatever you do.

Love the rest of the notebook discussion and ideas. They gave me more questions. Feel free to ignore them though.

- Would love to know if you have ideas around what can be learned from package managers like cargo in Rust and Nix? **💬2023-03-13 have not really looked at cargo nix. My overall sense was i got a bit less interested in package managers b/c of both the deno/go type approach plus thought that this is something that emerges from a need**
- You wrote about checking if there was a need for a simple dataflow library? Did this result in https://github.com/datopian/datapipes and https://github.com/datahq/dataflows? What are the plans there? **💬2023-03-13 yes, those were two efforts. Still think there is some missing sweet spot of super simple. That said, my conclusion was this is something to "buy" not build as so much out there especially e.g. prefect etc**
- re Aircan: this is the approach I'd love to explore more! Reusing/integrating the current tooling instead of building new ones. How can the problem be broken down in a way that lets us reuse more existing tooling and approaches?

## Datahub Next Architecture

![](https://i.imgur.com/0VKNmv1.png)

- How does the user explore the data before _Edit & Push_? I guess the answer is the "Data Canvas" idea (load some data, explore it quickly). Mentioning it because I know these things will happen 99% of the time: **💬2023-03-13 great question.
one way is to just allow people to push - does it matter if your tests fail when you first push to github.**
    - Data will be messy/incorrect
    - Some light transformations might be needed
- In my mind, there is an orthogonal box in that diagram: the Data Package Manager. You have it locally and work with it to edit/push, it needs to work with the "DB" and is used by the "presentation" layer. Not sure what the MVP is here though. Specifically, I'd love to figure out how it could play well with the different boxes in the diagram.
- Reading the plan (https://datahub.io/notes/plan) and the v3 docs (https://datahub.io/docs/dms/datahub/v3) gives a great overview of the specifics! Again, I'm super stoked you are doing all of this in public! :heart_eyes:
- Workflows like Data Validation and Data Summary are quite common in enterprise and have great solutions. This was something I contributed to back in PL: https://github.com/bacalhau-project/amplify.
- Been tracking GitHub Blocks (https://blocks.githubnext.com/); it could be something interesting to apply to your idea! **💬2023-03-13 had not seen this and this has a lot there.**
- Data APIs (https://datahub.io/docs/dms/data-api):
    - They can be cheap and very fast using R2 and CDNs alongside something like ROAPI (https://github.com/roapi/roapi) in a serverless platform. Not sure if this could even be in WASM with things like DuckDB Shell (https://shell.duckdb.org/) or Seafowl (https://github.com/splitgraph/seafowl). Even faster if, in the backend, the data is transformed from CSVs to compressed Parquet files. The folks at HuggingFace do that and it enables things like this: https://twitter.com/davidgasquez/status/1605174686081601536. The main insight here is that nowadays, most of the data CAN be processed on your laptop. I know this was something you were dealing with back in 2011 (https://www.youtube.com/watch?v=guNjG05ra9k).
    - Let me know if you want me to send a PR to https://datahub.io/notes/design-data-api with the ideas! **✅2023-03-13 👍 great idea ~rufus**
- I think there are lots of great ideas in ODF (https://github.com/open-data-fabric/open-data-fabric), Seafowl (https://seafowl.io/docs/guides/baking-dataset-docker-image), Datalad (https://www.datalad.org/) and Quilt (https://github.com/quiltdata/quilt) that could be salvaged/reused.
- More GitHub data examples (https://github.com/datopian/datahub-next/blob/main/content/notes/github-data-examples.md):
    - https://github.com/owid/owid-datasets **💬2023-03-13 great - please add it to the page**
- Tried to adapt the DataHub Next DMS drawing to the mental image I have of how everything fits together: https://excalidraw.com/#json=rOcjhjUYA68H4k-FvaJkm,6NUttcw__lu_UNQE8cN_qw
    - Raw: https://excalidraw.com/#room=b1845be57a46bccaf250,iK3i_MfTFuF8bn28eRxK3g

![](https://i.imgur.com/SqBopEd.png)

## Frictionless

> https://specs.frictionlessdata.io/data-package/

- https://github.com/frictionlessdata/framework is a great project but I wonder how many new "custom checks" and new features it could get if it moved closer to the "modern data stack" and integrated with other tooling. Something like https://github.com/great-expectations/great_expectations or https://github.com/cleanlab/cleanlab. **💬2023-03-13 great expectations etc came along since Frictionless started and, IMO, has somewhat over-taken it now.**
- This is me again trying to think how these great protocols could be made with less isolation from all the cool things happening in the data space. Instead of building custom tooling, build a protocol and integrate with external tools (so easy to say!).
- TODO (david)

## Iterating the story

I'm in a company. I have a hunch I want to investigate with data. In a company this is relatively easy to do:

- I know where to look.
- If the data is there, it is probably easily queryable (schemas, tests, ...) as there is a team that is managing that (data team).
- Getting external data is also easy. Run your Singer or Airbyte tap: https://airbyte.com/
- For each part of the stack, there are interoperable standards and tools, e.g. warehouses (S3, Redshift, Clickhouse), modeling (getdbt, Airflow + python), orchestration (prefect, dagster, airflow, ...), etc.

With public data, it is much more painful: e.g. I want to check if the number of graduates from $UNIVERSITY is correlated with the GDP of $COUNTRY.

I want to be able to grab and work on a public data source as easily as I do in the company and reuse the tools in a data company.

```sql!
-- assumes both tables share a `date` column to join on
select date, graduates, gdp
from university_data
left join country_data using (date)
```

or even something like `data get organization/university_data --json && data get organization/country_data --format json`.

In a company, models compound. You don't have to derive the same models over and over. In Open Data, we usually only have the raw data.

Collaborate on data like people are doing in some DAOs (https://github.com/duneanalytics/spellbook/tree/main/models, https://github.com/MetricsDAO/near_dbt/pull/98). Reuse these models in a permissionless way.

## Iterating data packaging workflow

### New dataset

- I come across an interesting dataset online. E.g: https://github.com/datasets/awesome-data/issues/334
- I want to create a page for it.
- Doing this should also publish a package + API.
- Two options
    - Option 1: Add a page to the "wiki" (quick and dirty). Open a PR on a certain repository acting as a hub.
    - Option 2: Its own folder/repo 👈 preferred
- The previous step should link it to datahub automatically. Alternatively, Datahub could scrape GitHub searching for compliant datasets.
    - ie. datahub.io/@david/my-dataset automatically proxies to github.com/david/my-dataset (and gives error if not there or not data ...)
- Showcase: turn README into a nice page maybe with extra features
    - https://datahub.io/notes/design-showcase
    - MDX+Data aka data rich documents: markdown + MDX + data components for tables, graphs etc
    - Proxy issues, PRs, ...

### Existing dataset in Datahub

- I saw the showcase/README from the previous workflow. I want to add a new graph or relation!
- For a simple graph, fork the repository and update the README. Make PR.
- _For a complex dataset? How can we compose data in this case?_ E.g:

### Existing dataset in X (HuggingFace, GitHub, Socrata, FTP, Random database)

- I come across an interesting dataset online. E.g: https://github.com/datasets/awesome-data/issues/334
- Since it is on Kaggle, I can somehow use a _Kaggle dataset adapter_ and use it as if it was any other dataset.
- It might get ingested in the backend to create a copy
- Same steps as the first workflow

## Principles

If I had to summarize the principles/guidelines I have in mind while exploring this, it would be:

- Integrate protocols with the common data stack tools to avoid reinventing the wheel (ETL, formats, orchestrators, ...).
- Make things simpler (wrappers) and have strong opinions on top of the tools.
- Data as code. Workflows as code, ...
- Frictionless Data Composability/collaboration (e.g: build SQL models on top of other people's ones).
- Build adapters as a community so data becomes connected.
- Simplicity/pragmatism (CSV, published data is better than almost published one, ...)
- Packaging should be format/platform agnostic.
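
To make the adapter principle and the `datahub.io/@david/my-dataset` proxy rule a bit more concrete, here is a rough sketch of how I picture it. Everything in it is hypothetical: `DatasetRef`, `parse_datahub_path` and the `SOURCE_TEMPLATES` table are names I made up for illustration, and only the GitHub mapping comes from the notes above. The Kaggle and HuggingFace entries are just there to show how a community-maintained adapter table could grow.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical adapter table: platform name -> URL template of the upstream source.
# Only the GitHub mapping is described in the notes; the others are illustrative.
SOURCE_TEMPLATES = {
    "github": "https://github.com/{owner}/{name}",
    "kaggle": "https://www.kaggle.com/datasets/{owner}/{name}",
    "huggingface": "https://huggingface.co/datasets/{owner}/{name}",
}


@dataclass(frozen=True)
class DatasetRef:
    """Platform-agnostic pointer to a dataset (hypothetical structure)."""

    platform: str
    owner: str
    name: str

    def source_url(self) -> str:
        """Return the upstream URL this reference would proxy to."""
        template = SOURCE_TEMPLATES.get(self.platform)
        if template is None:
            raise ValueError(f"no adapter registered for {self.platform!r}")
        return template.format(owner=self.owner, name=self.name)


def parse_datahub_path(url: str) -> DatasetRef:
    """Turn datahub.io/@owner/name into a reference to the GitHub repo behind it."""
    # Accept URLs written without a scheme, as in the notes.
    parsed = urlparse(url if "://" in url else f"https://{url}")
    parts = [part for part in parsed.path.split("/") if part]
    is_dataset_path = (
        parsed.netloc == "datahub.io"
        and len(parts) == 2
        and parts[0].startswith("@")
        and len(parts[0]) > 1
    )
    if not is_dataset_path:
        raise ValueError(f"not a datahub.io/@owner/dataset URL: {url}")
    return DatasetRef(platform="github", owner=parts[0][1:], name=parts[1])


if __name__ == "__main__":
    ref = parse_datahub_path("datahub.io/@david/my-dataset")
    print(ref.source_url())
```

The point of keeping the reference separate from the URL template is that the same `owner/name` handle works no matter where the data lives, which is what I mean by packaging being platform agnostic. Checking that the repository actually exists and contains data (the "gives error if not there" part) would need a network call and is left out here.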
