# Speech Enhancement Dataset Creation

This is a short guide to dataset creation for speech enhancement training and testing. Below are the dataset sources, the ways to import them locally, and the scripts to create the datasets, with the option to upload to the Hugging Face Hub or save locally.

**Adonis Asonitis, ETH Zurich**
*Zurich, November 2025*
**aasonitis@ethz.ch**

---

**TABLE OF CONTENTS**

[TOC]

# Speech Datasets

## HiFiTTS-2 Dataset

More information: [HiFiTTS-2 on Hugging Face](https://huggingface.co/datasets/nvidia/hifitts-2)

### Step 1: Create workspace directory

```sh
mkdir -p ~/datasets/hifitts2
cd ~/datasets/hifitts2
```

### Step 2: Download manifest and chapters JSON files

Replace `44khz` with `22khz` for the lower sampling rate:

```shell
wget https://huggingface.co/datasets/nvidia/hifitts-2/resolve/main/44khz/manifest_44khz.json
wget https://huggingface.co/datasets/nvidia/hifitts-2/resolve/main/44khz/chapters_44khz.json
```

### Step 3: Install NeMo Speech Data Processor (SDP)

Follow the instructions here: [NeMo Speech Data Processor](https://github.com/NVIDIA/NeMo-speech-data-processor)

### Step 4: Download audio using SDP

```sh
python /home/NeMo-speech-data-processor/main.py \
    --config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
    --config-name="config_44khz.yaml" \
    workspace_dir="/home/hifitts2" \
    max_workers=8
```

# Noise Datasets

## Room Impulse Response and Noise Database (RIRs)

Dataset #28 on OpenSLR: [OpenSLR RIRs](https://www.openslr.org/28/)

```sh
TARGET_DIR="/path/to/your/datasets/rirs_noises"
mkdir -p "$TARGET_DIR"

# Download directly into the target directory
wget -O "$TARGET_DIR/rirs_noises.zip" https://openslr.trmal.net/resources/28/rirs_noises.zip

# Unzip and clean up
unzip "$TARGET_DIR/rirs_noises.zip" -d "$TARGET_DIR"
rm "$TARGET_DIR/rirs_noises.zip"
```

## DEMAND - Natural Noise

```sh
TARGET_DIR="/path/to/your/datasets/demand"
mkdir -p "$TARGET_DIR"

# Download directly into the target directory
curl -L -o "$TARGET_DIR/demand.zip" \
  https://www.kaggle.com/api/v1/datasets/download/chrisfilo/demand

# Unzip and remove the archive
unzip "$TARGET_DIR/demand.zip" -d "$TARGET_DIR"
rm "$TARGET_DIR/demand.zip"
```

## WHAM - Natural Noise

Link to the Hugging Face-hosted dataset: [HF REPO](https://huggingface.co/datasets/nguyenvulebinh/wham)

```python
from datasets import load_dataset

dataset = load_dataset(
    "nguyenvulebinh/wham",
    cache_dir="/path/to/my/local/datasets/wham_cache",
    download_mode="force_redownload"
)
```

## TUT Urban Acoustic Scenes 2018, Development dataset

Link to the Zenodo-hosted dataset: [ZENODO](https://zenodo.org/records/1228142)

```sh
TARGET_DIR="/path/to/your/datasets/zenodo_1228142"
mkdir -p "$TARGET_DIR"

# Download the archive
wget -O "$TARGET_DIR/archive.zip" "https://zenodo.org/api/records/1228142/files-archive"

# Unzip into the target directory
unzip "$TARGET_DIR/archive.zip" -d "$TARGET_DIR"

# (Optional) Remove the zip to save space
rm "$TARGET_DIR/archive.zip"
```

## MUSAN - Noise & Music

```sh
TARGET_DIR="/path/to/your/datasets/musan"
mkdir -p "$TARGET_DIR"

# Download to the target directory
wget -O "$TARGET_DIR/musan.tar.gz" https://www.openslr.org/resources/17/musan.tar.gz

# Unpack
tar -xzf "$TARGET_DIR/musan.tar.gz" -C "$TARGET_DIR"

# Optional: remove the archive to save space
rm "$TARGET_DIR/musan.tar.gz"
```

---

# DATASET CREATION SCRIPT

This script generates synthetic clean-noisy audio pairs for speech enhancement research. It automatically applies various types of noise, reverberation, packet loss, and other audio degradations to clean speech recordings, producing paired datasets suitable for training or evaluating speech enhancement models.
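Before diving into the script's options, it helps to see the core operation it performs: mixing a noise recording into clean speech at a target signal-to-noise ratio. Below is a minimal sketch of that step; the function name and exact scaling are illustrative, not taken from the script itself.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`, then add it."""
    # Tile or trim the noise so it matches the length of the clean signal
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain such that 10 * log10(clean_power / (gain^2 * noise_power)) == snr_db
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Each of the `MIXTURE_SNR_*` ranges described in the configuration below parameterizes a random draw of `snr_db` that feeds a computation of this shape.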
## Features

- **Two distribution modes:**
  - **`general`**: Generates samples using a mixture strategy where multiple noise types can be applied simultaneously to each sample
  - **`custom`**: Generates fixed counts of individual noise types plus mixture samples, allowing precise control over the dataset composition
- **Multiple noise types:**
  - White noise
  - Natural noise (from noise datasets)
  - Packet loss
  - Reverberation (using RIR files)
  - Downsampling/upsampling artifacts
  - Mixture combinations of the above
- **Flexible output:**
  - **Local save mode**: Saves audio waveforms as WAV files in organized folders
  - **HuggingFace Hub mode**: Uploads datasets directly to the HuggingFace Hub with audio arrays

## Configuration

All parameters are configurable at the top of the script. You can customize:

### Batch Size and Dataset Size

```python
# Configuration
BATCH_SIZE = 10000    # Number of samples to accumulate before saving/uploading
TOTAL_ROWS = 1000000  # Total rows (used for reference)
RANDOM_SEED = 42      # Random seed for reproducibility
```

The script automatically scales the number of samples based on the selected split:

- **`train`**: 1,000,000 samples
- **`grpo`**: 100,000 samples
- **`test`**: 10,000 samples

The `BATCH_SIZE` determines how many samples are accumulated before saving (in local mode) or uploading (in HuggingFace mode). This helps manage memory usage and allows for incremental processing.
### Noise Type Probabilities (for mixture generation)

```python
# Parameters for mixture_to_clean noise type generation

# Probabilities for each noise type (0.0 to 1.0)
MIXTURE_PROB_DOWNSAMPLING = 0.25   # Probability of applying downsampling
MIXTURE_PROB_WHITE_NOISE = 0.2     # Probability of applying white noise
MIXTURE_PROB_PACKET_LOSS = 0.05    # Probability of applying packet loss
MIXTURE_PROB_NATURAL_NOISE = 0.7   # Probability of applying natural noise
MIXTURE_PROB_REVERBERATION = 0.05  # Probability of applying reverberation
MIXTURE_PROB_FALLBACK_NOISE = 0.9  # Probability of adding natural noise if no noise was applied

# SNR ranges (in dB) for each noise type
MIXTURE_SNR_WHITE_NOISE_MIN = -7     # Minimum SNR for white noise
MIXTURE_SNR_WHITE_NOISE_MAX = 5      # Maximum SNR for white noise
MIXTURE_SNR_NATURAL_NOISE_MIN = -8   # Minimum SNR for natural noise (main)
MIXTURE_SNR_NATURAL_NOISE_MAX = 15   # Maximum SNR for natural noise (main)
MIXTURE_SNR_REVERBERATION_MIN = -2   # Minimum SNR for reverberation
MIXTURE_SNR_REVERBERATION_MAX = 15   # Maximum SNR for reverberation
MIXTURE_SNR_FALLBACK_NOISE_MIN = -5  # Minimum SNR for fallback natural noise
MIXTURE_SNR_FALLBACK_NOISE_MAX = 15  # Maximum SNR for fallback natural noise

# Packet loss parameters
MIXTURE_PACKET_LOSS_DROP_PROB_MIN = 0.02   # Minimum drop probability
MIXTURE_PACKET_LOSS_DROP_PROB_MAX = 0.2    # Maximum drop probability
MIXTURE_PACKET_LOSS_DURATION_MS_MIN = 10   # Minimum packet duration (ms)
MIXTURE_PACKET_LOSS_DURATION_MS_MAX = 200  # Maximum packet duration (ms)

# RMS threshold for noise signal validation
MIXTURE_RMS_THRESHOLD = 0.001  # Minimum RMS value to consider noise a valid signal
```

**Dataset distributions** can be modified in the `DATASET_DISTRIBUTION_general` and `DATASET_DISTRIBUTION_custom` dictionaries.

## Deployment

### Prerequisites

1. **Install required packages:**

   ```bash
   pip install soundfile librosa scipy numpy torch datasets huggingface_hub psutil
   ```

2. **Configure data directories:** edit `CLEAN_SPEECH_DIRS`, `NOISE_DIRS`, and `RIR_DIRS` at the top of the script with absolute paths to your data directories.

### Usage

#### Local Save Mode (saves WAV files locally)

```bash
python create_dataset_SE.py \
    --identifier dataset_001 \
    --distribution general \
    --split train \
    --save_mode local \
    --output_dir /path/to/output
```

#### HuggingFace Hub Mode (uploads to HuggingFace)

```bash
python create_dataset_SE.py \
    --identifier dataset_001 \
    --distribution general \
    --split train \
    --save_mode hf \
    --hf_token YOUR_HF_TOKEN
```

Or set the token via an environment variable:

```bash
export HF_TOKEN=your_token_here
python create_dataset_SE.py \
    --identifier dataset_001 \
    --distribution general \
    --split train \
    --save_mode hf
```

### Command Line Arguments

- `--identifier`: Unique identifier for the dataset (used in the HuggingFace dataset name)
- `--distribution`: `general` (mixture-based) or `custom` (fixed counts per noise type)
- `--split`: `train` (1M samples), `grpo` (100K samples), or `test` (10K samples)
- `--save_mode`: `local` (save WAV files) or `hf` (upload to the HuggingFace Hub)
- `--output_dir`: Required for local mode; specifies where to save audio files
- `--hf_token`: Optional HuggingFace API token (can also be set via the `HF_TOKEN` environment variable)

## Output Structure

When saving locally, the script creates the following directory structure:

```
/path/to/output/
├── clean/
│   ├── clean_00000000.wav
│   ├── clean_00000001.wav
│   └── ...
├── noisy/
│   ├── dirty_00000000.wav
│   ├── dirty_00000001.wav
│   └── ...
├── metadata_batch_0000.jsonl
├── metadata_batch_0001.jsonl
└── ...
```

- **`clean/`**: Contains clean audio waveforms (44.1 kHz WAV files)
- **`noisy/`**: Contains the corresponding noisy audio waveforms (44.1 kHz WAV files)
- **`metadata_batch_*.jsonl`**: JSONL files containing metadata for each sample (sample ID, noise type, file paths, duration, etc.)
Each sample pair shares the same ID (e.g., `clean_00000000.wav` and `dirty_00000000.wav` form a matching pair). All audio files are saved at a 44.1 kHz sample rate and can be used directly for training or evaluation.

## SCRIPT LINK

LINK: [script](https://github.com/lucala/dac-se1/blob/main/create_data_SE.py)

---
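As a closing illustration, the mixture strategy described in the Configuration section amounts to an independent Bernoulli draw per noise type, with a fallback draw when nothing fired. A rough sketch of that decision logic (the constant values mirror the configuration above; the degradation names are illustrative labels, not the script's identifiers):

```python
# Sketch of the mixture-mode decision logic: each degradation is applied
# independently with its configured probability; if none fired, natural
# noise is added as a fallback with high probability.
import random

MIXTURE_PROB_DOWNSAMPLING = 0.25
MIXTURE_PROB_WHITE_NOISE = 0.2
MIXTURE_PROB_PACKET_LOSS = 0.05
MIXTURE_PROB_NATURAL_NOISE = 0.7
MIXTURE_PROB_REVERBERATION = 0.05
MIXTURE_PROB_FALLBACK_NOISE = 0.9

def choose_degradations(rng: random.Random) -> list:
    """Return the list of degradation names drawn for one sample."""
    applied = []
    for name, prob in [
        ("downsampling", MIXTURE_PROB_DOWNSAMPLING),
        ("white_noise", MIXTURE_PROB_WHITE_NOISE),
        ("packet_loss", MIXTURE_PROB_PACKET_LOSS),
        ("natural_noise", MIXTURE_PROB_NATURAL_NOISE),
        ("reverberation", MIXTURE_PROB_REVERBERATION),
    ]:
        if rng.random() < prob:
            applied.append(name)
    # Fallback: keep mostly-clean samples rare
    if not applied and rng.random() < MIXTURE_PROB_FALLBACK_NOISE:
        applied.append("natural_noise")
    return applied
```

With these defaults, the chance that a sample receives no degradation at all is under 2%, so nearly every generated pair is genuinely noisy.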