## Create the `Dataset` to push to a HF repo
This will enable you or your users to retrieve the dataset in two ways, each with its own advantages:
- The "Hugging Face" way: `ds = load_dataset("namespace/dataset-name")` (or with `split="splitname"` included). This keeps the actual data in the Hugging Face cache and is a bit more abstract but has better integration with the ecosystem.
- A "legacy" way that enables use
### Set up the PR following the [Hub instructions](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#how-do-i-manage-pull-requests-locally)
- Create a PR on Hugging Face (e.g. `pr/42`), either through the web UI or programmatically (see the sketch after this list).
- Clone the repo locally.
- Retrieve and check out the remote PR:
```
git fetch origin refs/pr/42:pr/42
git checkout pr/42
```
Note that locally, the branch is referred to as `pr/<#>`. On the remote side, it's referred to as `refs/pr/<#>`. Excluding the `refs/` for the remote will break things!
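If you prefer to open the PR programmatically rather than through the web UI, something like the following should work with `huggingface_hub` (the repo ID and title here are placeholders):
```
from huggingface_hub import HfApi

api = HfApi()

# Open a pull request on the dataset repo; the returned discussion object
# carries the PR number (e.g. 42 -> refs/pr/42 on the remote).
pr = api.create_pull_request(
    repo_id="imageomics/<repo-name>",
    repo_type="dataset",
    title="Rebuild dataset from CSV",
)
print(pr.num)
```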
### Make your changes.
Create a CSV storing the `file_path` for each data file and any other relevant information you want accessible for each entry.
For example, the [metadata.csv](https://huggingface.co/datasets/imageomics/rare-species/raw/refs%2Fpr%2F10/metadata.csv) for the rare-species dataset.
(Note that this CSV was created programmatically.)
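One way such a CSV might be generated (a rough sketch; the `images/` directory and `species` column are purely illustrative):
```
from pathlib import Path

import polars as pl

# Collect file paths relative to the repo root plus any per-file metadata.
# "images/" and the "species" column are illustrative -- adapt to your data.
rows = []
for path in sorted(Path("images").rglob("*.jpg")):
    rows.append({
        "file_path": str(path),
        "species": path.parent.name,  # e.g., class encoded in the folder name
    })

pl.DataFrame(rows).write_csv("metadata.csv")
```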
Use the CSV to create a `Dataset`.
```
import polars as pl
from datasets import Dataset, Features, Value, Image

# Load the metadata CSV into a polars DataFrame
df = pl.read_csv("metadata.csv")

# Put the image column first so it shows up first in the dataset previewer
cols = ["file_path"] + [c for c in df.columns if c != "file_path"]
ds0 = Dataset.from_polars(df.select(cols))

# Define your target schema: file_path becomes an Image feature,
# everything else stays a string
features = Features({
    "file_path": Image(),
    **{c: Value("string") for c in cols if c != "file_path"}
})

# Re-map with the target features; Image() loads each file_path as an image
ds = ds0.map(lambda ex: {"file_path": ex["file_path"]}, features=features)
```
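Before pushing, it can help to sanity-check the schema and a sample record:
```
# Optional check: confirm the schema and that the first image decodes
print(ds.features)
print(ds[0]["file_path"])  # should now be a PIL image, not a string
```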
Then push it up:
```
ds.push_to_hub(
repo_id="imageomics/<repo-name>",
revision="refs/pr/42",
commit_message="Rebuild dataset from CSV",
token=True
)
```
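To confirm the push worked, you can load the dataset back from the PR revision (a quick check; the repo name is a placeholder as above):
```
from datasets import load_dataset

# Load from the PR revision (not main) to verify the pushed Parquet files
ds_check = load_dataset(
    "imageomics/<repo-name>",
    revision="refs/pr/42",
    split="train",
)
print(ds_check)
```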
Now, you'll have the newly formatted Parquet dataset on the remote at
```
https://huggingface.co/datasets/imageomics/<repo-name>/tree/refs%2Fpr%2F42
```
But it won't be synced to your local clone yet, since you used the `push_to_hub` method rather than git.
To get it locally, do
```
git pull origin refs/pr/42:pr/42
```
Within a `scripts/` directory, create an export script so users can get the dataset onto their filesystem in the familiar format with minimal manual steps (a rough sketch of such a script follows the example link below).
e.g. for `rare-species`: [`scripts/export_rare_species.py`](https://huggingface.co/datasets/imageomics/rare-species/blob/refs%2Fpr%2F10/scripts/export_rare_species.py)
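A minimal sketch of what such an export script might look like (not the actual `rare-species` script; argument names and output layout are illustrative):
```
import argparse
from pathlib import Path

from datasets import load_dataset

# Illustrative export: writes each image back to disk under --dataset-path.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset-path", default="./exported_dataset")
args = parser.parse_args()

out_dir = Path(args.dataset_path)
out_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("imageomics/<repo-name>", split="train")
for i, example in enumerate(ds):
    # "file_path" is an Image feature, so each example holds a PIL image
    example["file_path"].convert("RGB").save(out_dir / f"{i:06d}.jpg")
```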
Also add a `scripts/requirements.txt` with the dependencies needed for the export.
e.g. for `rare-species`: [`scripts/requirements.txt`](https://huggingface.co/datasets/imageomics/rare-species/blob/refs%2Fpr%2F10/scripts/requirements.txt)
Then, a user should be able to do the following to get the structured dataset:
>
> To export the data into a format matching the previously used dataset structure, you may do the following without cloning the repository:
>
> 1. Create and activate a virtual environment.
> 2. Install dependencies.
> ```
> pip install -r \
> https://huggingface.co/datasets/imageomics/rare-species/resolve/main/scripts/requirements.txt
> ```
> 3. Run the export script. You may customize the output directory by specifying the `--dataset-path` argument.
> ```
> curl -s https://huggingface.co/datasets/imageomics/rare-species/raw/main/scripts/export_rare_species.py \
> | python3 - --dataset-path ./exported_dataset
> ```
>
> This will create a directory structure like the following:
> `<dataset-directory-hierarchy>`
Note that before the PR is merged, to test these remote dependency-install and execution commands, you'll need to use the URLs associated with the PR revision.
For the `rare-species` examples above, the URLs would change to:
```
https://huggingface.co/datasets/imageomics/rare-species/resolve/refs%2Fpr%2F10/scripts/requirements.txt
```
and
```
https://huggingface.co/datasets/imageomics/rare-species/raw/refs%2Fpr%2F10/scripts/export_rare_species.py
```