## Create the `Dataset` to push to a HF repo
This will enable you or your users to retrieve the dataset in two ways, each with its own advantages:
- The "Hugging Face" way: `ds = load_dataset("namespace/dataset-name")` (or with `split="splitname"` included). This keeps the actual data in the Hugging Face cache and is a bit more abstract but has better integration with the ecosystem.
- A "legacy" way that enables use
### Set up the PR following the [Hub instructions](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#how-do-i-manage-pull-requests-locally)
- Create a PR on Hugging Face (e.g. `pr/42`), either through the web UI or programmatically (see the sketch after this list).
- Clone the repo locally.
- Retrieve and check out the remote PR:
```
git fetch origin refs/pr/42:pr/42
git checkout pr/42
```
Note that locally, the branch is referred to as `pr/<#>`. On the remote side, it's referred to as `refs/pr/<#>`. Excluding the `refs/` for the remote will break things!
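If you prefer to open the PR programmatically rather than through the web UI, something like the following should work with `huggingface_hub` (the repo ID and title here are placeholders):
```
from huggingface_hub import HfApi

api = HfApi()

# Open a pull request on the dataset repo; the returned discussion object
# carries the PR number (e.g. 42 -> refs/pr/42 on the remote).
pr = api.create_pull_request(
    repo_id="imageomics/<repo-name>",
    repo_type="dataset",
    title="Rebuild dataset from CSV",
)
print(pr.num)
```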
### Make your changes.
Create a CSV storing the `file_path` for each data file and any other relevant information you want accessible for each entry.
For example, the [metadata.csv](https://huggingface.co/datasets/imageomics/rare-species/raw/refs%2Fpr%2F10/metadata.csv) for the rare-species dataset.
(Note that this CSV was created programmatically.)
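One way such a CSV might be generated (a rough sketch; the `images/` directory and `species` column are purely illustrative):
```
from pathlib import Path

import polars as pl

# Collect file paths relative to the repo root plus any per-file metadata.
# "images/" and the "species" column are illustrative -- adapt to your data.
rows = []
for path in sorted(Path("images").rglob("*.jpg")):
    rows.append({
        "file_path": str(path),
        "species": path.parent.name,  # e.g., class encoded in the folder name
    })

pl.DataFrame(rows).write_csv("metadata.csv")
```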
Use the CSV to create a `Dataset`.
```
import polars as pl
from datasets import Dataset, Features, Value, Image

# Load the metadata CSV into a polars DataFrame
df = pl.read_csv("metadata.csv")

# Put the image column first so it shows up first in the dataset previewer
cols = ["file_path"] + [c for c in df.columns if c != "file_path"]
ds0 = Dataset.from_polars(df.select(cols))

# Define your target schema: file_path becomes an Image feature,
# everything else stays a string
features = Features({
    "file_path": Image(),
    **{c: Value("string") for c in cols if c != "file_path"}
})

# Re-map with the target features; Image() loads each file_path as an image
ds = ds0.map(lambda ex: {"file_path": ex["file_path"]}, features=features)
```
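Before pushing, it can help to sanity-check the schema and a sample record:
```
# Optional check: confirm the schema and that the first image decodes
print(ds.features)
print(ds[0]["file_path"])  # should now be a PIL image, not a string
```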
Then push it up:
```
ds.push_to_hub(
repo_id="imageomics/<repo-name>",
revision="refs/pr/42",
commit_message="Rebuild dataset from CSV",
token=True
)
```
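To confirm the push worked, you can load the dataset back from the PR revision (a quick check; the repo name is a placeholder as above):
```
from datasets import load_dataset

# Load from the PR revision (not main) to verify the pushed Parquet files
ds_check = load_dataset(
    "imageomics/<repo-name>",
    revision="refs/pr/42",
    split="train",
)
print(ds_check)
```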
Now, you'll have the newly formatted Parquet dataset on the remote at
```
https://huggingface.co/datasets/imageomics/<repo-name>/tree/refs%2Fpr%2F42
```
But it won't be synced to your local clone yet, since you used the `push_to_hub` method rather than git.
To get it locally, do
```
git pull origin refs/pr/42:pr/42
```
Within a `scripts/` directory, create an export script so users can get the dataset onto their filesystem in the familiar format with minimal manual steps (a rough sketch of such a script follows the example link below).
e.g. for `rare-species`: [`scripts/export_rare_species.py`](https://huggingface.co/datasets/imageomics/rare-species/blob/refs%2Fpr%2F10/scripts/export_rare_species.py)
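A minimal sketch of what such an export script might look like (not the actual `rare-species` script; argument names and output layout are illustrative):
```
import argparse
from pathlib import Path

from datasets import load_dataset

# Illustrative export: writes each image back to disk under --dataset-path.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset-path", default="./exported_dataset")
args = parser.parse_args()

out_dir = Path(args.dataset_path)
out_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("imageomics/<repo-name>", split="train")
for i, example in enumerate(ds):
    # "file_path" is an Image feature, so each example holds a PIL image
    example["file_path"].convert("RGB").save(out_dir / f"{i:06d}.jpg")
```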
Also add a `scripts/requirements.txt` with the dependencies needed for the export.
e.g. for `rare-species`: [`scripts/requirements.txt`](https://huggingface.co/datasets/imageomics/rare-species/blob/refs%2Fpr%2F10/scripts/requirements.txt)
Then, a user should be able to do the following to get the structured dataset:
>
> To export the data into a format matching the previously used dataset structure, you may do the following without cloning the repository:
>
> 1. Create and activate a virtual environment.
> 2. Install dependencies.
> ```
> pip install -r \
> https://huggingface.co/datasets/imageomics/rare-species/resolve/main/scripts/requirements.txt
> ```
> 3. Run the export script. You may customize the output directory by specifying the `--dataset-path` argument.
> ```
> curl -s https://huggingface.co/datasets/imageomics/rare-species/raw/main/scripts/export_rare_species.py \
> | python3 - --dataset-path ./exported_dataset
> ```
>
> This will create a directory structure like the following:
> `<dataset-directory-hierarchy>`
Note that before the PR is merged, to test these remote dependency-install and execution commands, you'll need to use the URLs associated with the PR revision.
For the `rare-species` examples above, the URLs would change to:
```
https://huggingface.co/datasets/imageomics/rare-species/resolve/refs%2Fpr%2F10/scripts/requirements.txt
```
and
```
https://huggingface.co/datasets/imageomics/rare-species/raw/refs%2Fpr%2F10/scripts/export_rare_species.py
```