David
# Feedback on Datahub Next

## Portal.js

> https://github.com/datopian/portal.js

- Brand it as something like "Hugo/Astro/Jekyll for data literate documents".
- Rill Developer is working on something similar but with Svelte.
- Taking this Frictionless dataset (https://portal-js.vercel.app/):
  - Is there a way to provide a small playground (Datasette style)? Would that be a component?
  - Is there a way to automatically "archive" the datasets?
  - Will this dataset be exposed publicly under a common API?

## Data Literate Notebooks

> https://datahub.io/docs/dms/notebook

- You could even embed pyscript and DuckDB/SQLite! :heart_eyes:
- Similar to evidence.dev. Is there a way to reuse/collaborate between the two projects?
- Ultimately, this feels more on the publishing side. I feel there is still something like Rath or Rill Dev that is needed before this to explore and understand the data. I'm thinking along the lines of the ideas I share in the Open Source Web Data IDE section (https://publish.obsidian.md/davidgasquez/Open+Data#Open+Source+Web+Data+IDE).

That said, this is the kind of doc I (and I'm sure lots of organizations) would love to be able to write:

```
## Renewable Energy

\```energy_data
select * from https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv
\```

Since the Industrial Revolution, the energy mix of most countries across the world has become...

...
...
...

### How much of our primary energy comes from renewables?

\```python
import something

daily_energy = something.process(energy_data)
\```

<LineChart data={daily_energy} x=date y=energy yAxisTitle="Energy line chart!" />

<importFromFile(graphs/complex_graph.json)/>
```

Basically, having a more accessible and open way to emulate the original research: https://ourworldindata.org/renewable-energy.

By looking at https://app.excalidraw.com/l/9u8crB2ZmUo/j2SkAQNJ8n, it seems this is moving closer to classic enterprise BI tools (Hex, Mode Analytics, Evidence.dev). This might be the right call, as there is no easy way to use Looker/Mode with random datasets, but I'm curious how these tools could be integrated/play well together.

> [rufus] this was a bit more towards what i call the "Data Project" product option vs the "Data README" product option. That said, the first part of this flow may be needed whatever you do.

Love the rest of the notebook discussion and ideas. They gave me more questions. Feel free to ignore them though.

- Would love to know if you have ideas around what can be learned from package managers like cargo in Rust and Nix? **💬2023-03-13 have not really looked at cargo nix. My overall sense was i got a bit less interested in package managers b/c of both the deno/go type approach plus thought that this is something that emerges from a need**
- You wrote about checking if there was a need for a simple dataflow library? Did this result in https://github.com/datopian/datapipes and https://github.com/datahq/dataflows? What are the plans there? **💬2023-03-13 yes, those were two efforts. Still think there is some missing sweet spot of super simple. That said, my conclusion was this is something to "buy" not build as so much out there especially e.g. prefect etc**
- re Aircan: this is the approach I'd love to explore more! Reusing/integrating the current tooling instead of building new ones. How can the problem be broken down in a way that lets us reuse more existing tooling and approaches?

## Datahub Next Architecture

![](https://i.imgur.com/0VKNmv1.png)

- How does the user explore the data before _Edit & Push_? I guess the answer is the "Data Canvas" idea (load some data, explore it quickly). Mentioning it because I know these things will happen 99% of the time: **💬2023-03-13 great question. one way is to just allow people to push - does it matter if your tests fail when you first push to github.**
  - Data will be messy/incorrect
  - Some light transformations might be needed
- In my mind, there is an orthogonal box in that diagram: the Data Package Manager. You have it locally and work with it to edit/push, it needs to work with the "DB", and it is used by the "presentation" layer. Not sure what the MVP is here though. Specifically, I'd love to figure out how it could play well with the different boxes in the diagram.
- Reading the plan (https://datahub.io/notes/plan) and the v3 docs (https://datahub.io/docs/dms/datahub/v3) gives a great overview of the specifics! Again, I'm super stoked you are doing all of this in public! :heart_eyes:
- Workflows like Data Validation and Data Summary are quite common in enterprise and have great solutions. This was something I contributed to back in PL: https://github.com/bacalhau-project/amplify.
- Been tracking GitHub blocks (https://blocks.githubnext.com/). It could be something interesting to apply to your idea! **💬2023-03-13 had not seen this and this has a lot there.**
- Data APIs (https://datahub.io/docs/dms/data-api):
  - They can be cheap and very fast using R2 and CDNs alongside something like ROAPI (https://github.com/roapi/roapi) in a serverless platform. Not sure if this could even be in WASM with things like DuckDB Shell (https://shell.duckdb.org/) or Seafowl (https://github.com/splitgraph/seafowl). Even faster if, in the backend, the data is transformed from CSVs to compressed Parquet files. The folks at HuggingFace do that and it enables things like this: https://twitter.com/davidgasquez/status/1605174686081601536. The main insight here is that nowadays, most of the data CAN be processed on your laptop. I know this was something you were dealing with back in 2011 (https://www.youtube.com/watch?v=guNjG05ra9k).
  - Let me know if you want me to send a PR to https://datahub.io/notes/design-data-api with the ideas! **✅2023-03-13 👍 great idea ~rufus**
- I think there are lots of great ideas in ODF (https://github.com/open-data-fabric/open-data-fabric), Seafowl (https://seafowl.io/docs/guides/baking-dataset-docker-image), Datalad (https://www.datalad.org/) and Quilt (https://github.com/quiltdata/quilt) that could be salvaged/reused.
- More GitHub data examples (https://github.com/datopian/datahub-next/blob/main/content/notes/github-data-examples.md):
  - https://github.com/owid/owid-datasets **💬2023-03-13 great - please add it to the page**
- Tried to adapt the DataHub Next DMS drawing to the mental image I have of how everything fits together: https://excalidraw.com/#json=rOcjhjUYA68H4k-FvaJkm,6NUttcw__lu_UNQE8cN_qw
  - Raw: https://excalidraw.com/#room=b1845be57a46bccaf250,iK3i_MfTFuF8bn28eRxK3g

![](https://i.imgur.com/SqBopEd.png)

## Frictionless

> https://specs.frictionlessdata.io/data-package/

- https://github.com/frictionlessdata/framework is a great project, but I wonder how many new "custom checks" and new features it could get if it moved closer to the "modern data stack" and integrated with other tooling. Something like https://github.com/great-expectations/great_expectations or https://github.com/cleanlab/cleanlab. **💬2023-03-13 great expectations etc came along since Frictionless started and, IMO, has somewhat over-taken it now.**
- This is me again trying to think how these great protocols could be made with less isolation from all the cool things happening in the data space. Instead of building custom tooling, build a protocol and integrate with external ones (so easy to say!).
- TODO (david)

## Iterating the story

I'm in a company. I have a hunch I want to investigate with data. In a company this is relatively easy to do:

- I know where to look.
- If the data is there, it is probably easily queryable (schemas, tests, ...) as there is a team that is managing that (data team).
- Getting external data is also easy. Run your Singer or Airbyte tap: https://airbyte.com/
- For each part of the stack, there are interoperable standards and tools, e.g. warehouses (S3, Redshift, Clickhouse), modeling (getdbt, Airflow + python), orchestration (prefect, dagster, airflow, ...), etc.

With public data, it is much more painful: e.g. I want to check if the number of graduates from $UNIVERSITY is correlated with the GDP of $COUNTRY.

I want to be able to grab and work on a public data source as easily as I do in the company and reuse the tools in a data company.

```sql!
select date, graduates, gdp from university_data left join country_data
```

or even something like `data get organization/university_data --json && data get organization/country_data --format json`.

In a company, models compound. You don't have to derive the same models over and over. In Open Data, we usually only have the raw data. Collaborate on data like people are doing in some DAOs (https://github.com/duneanalytics/spellbook/tree/main/models, https://github.com/MetricsDAO/near_dbt/pull/98). Reuse these models in a permissionless way.

## Iterating data packaging workflow

### New dataset

- I come across an interesting dataset online. E.g: https://github.com/datasets/awesome-data/issues/334
- I want to create a page for it.
- Doing this should also publish a package + API.
- Two options
  - Option 1: Add a page to the "wiki" (quick and dirty). Open a PR on a certain repository acting as a hub.
  - Option 2: Its own folder/repo 👈 preferred
- The previous step should link it to datahub automatically. Alternatively, Datahub could scrape GitHub searching for compliant datasets.
  - i.e. datahub.io/@david/my-dataset automatically proxies to github.com/david/my-dataset (and gives an error if not there or not data ...)
- Showcase: turn README into a nice page maybe with extra features
  - https://datahub.io/notes/design-showcase
  - MDX+Data aka data rich documents: markdown + MDX + data components for tables, graphs etc
- Proxy issues, PRs, ...

### Existing dataset in Datahub

- I saw the showcase/README from the previous workflow. I want to add a new graph or relation!
- For a simple graph, fork the repository and update the README. Make a PR.
- _For a complex dataset? How can we compose data in this case?_ E.g:

### Existing dataset in X (HuggingFace, GitHub, Socrata, FTP, Random database)

- I come across an interesting dataset online. E.g: https://github.com/datasets/awesome-data/issues/334
- Since it is on Kaggle, I can somehow use a _Kaggle dataset adapter_ and use it as if it was any other dataset.
- It might get ingested in the backend to create a copy
- Same steps as the first workflow

## Principles

If I had to summarize the principles/guidelines I have in mind while exploring this, it would be:

- Integrate protocols with the common data stack tools to avoid reinventing the wheel (ETL, formats, orchestrators, ...).
- Make things simpler (wrappers) and have strong opinions on top of the tools.
- Data as code. Workflows as code, ...
- Frictionless Data Composability/collaboration (e.g: build SQL models on top of other people's ones).
- Build adapters as a community so data becomes connected.
- Simplicity/pragmatism (CSV, published data is better than almost published one, ...)
- Packaging should be format/platform agnostic.
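
To make the "build SQL models on top of other people's ones" principle a bit more concrete, here is a rough sketch using Python's built-in `sqlite3`. Everything in it is an assumption for illustration only: the table contents are made up, joining on `date` is my guess at a key (the query in the story above doesn't name one), and the view names `graduates_vs_gdp` and `graduates_per_gdp` are hypothetical. It is not how Datahub does or should implement this, just the shape of the idea: one person publishes a model once, and someone else derives from it without redoing the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw tables standing in for two public datasets.
    create table university_data (date text, graduates integer);
    create table country_data (date text, gdp real);
    insert into university_data values ('2020', 1000), ('2021', 1200), ('2022', 1500);
    insert into country_data values ('2020', 2.0), ('2021', 4.0);

    -- Model published by one contributor (join key `date` is an assumption).
    create view graduates_vs_gdp as
    select u.date, u.graduates, c.gdp
    from university_data u
    left join country_data c on c.date = u.date;

    -- Model built by someone else on top of the first one.
    create view graduates_per_gdp as
    select date, graduates / gdp as ratio
    from graduates_vs_gdp
    where gdp is not null;
    """
)

# The left join keeps years without GDP data (gdp comes back as None).
rows = conn.execute("select * from graduates_vs_gdp order by date").fetchall()
# The derived model only sees years where both datasets have values.
derived = conn.execute("select * from graduates_per_gdp order by date").fetchall()
```

The same layering is what dbt-style projects do with `ref`-ed models; the sketch only swaps in SQLite views so it runs anywhere.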
