Data used in Inadequate sampling of the soundscape leads to overoptimistic estimates of recogniser performance: a case study of two sympatric macaw species

# Data used in Inadequate sampling of the soundscape leads to overoptimistic estimates of recogniser performance: a case study of two sympatric macaw species

---

This dataset contains four files:

1. ```training_data_features.csv```
2. ```R2_RF_evaluation.csv```
3. ```metrics.csv```
4. ```R2.rds```

All of these files were used in the analysis for a paper of the same name. The raw data used to create this dataset were collected from autonomous recording units in northern Costa Rica.

Here is a short explanation of each of the four files:

1. A template-matching process was used to identify candidate signals, then a one-second window was placed around each candidate signal. We extracted a total of 113 acoustic features using the warbleR package in R (R Core Team, 2020): 20 measurements of frequency, time, and amplitude parameters, and 93 Mel-frequency cepstral coefficients (MFCCs) (Araya‐Salas and Smith‐Vidaurre, 2017).
2. The results of manually checking the detections output by a trained random forest. The detections were initially output as selection tables. Individual sound files and their selection tables were then loaded in Raven Lite, and each detection was manually checked and labelled.
3. Performance metrics of the R2 recogniser.
4. The random forest model, created using tidymodels in R.

## Description of the Data and file structure

1. ```training_data_features.csv``` contains 118 columns:
   * 1: ```sound.files```: ID of the source recordings
   * 2-3: ```start:end```: start and end of clip
   * 4: ```selec```: clip ID within the ```sound.file```
   * 5-29: ```spectral measurements```: features extracted using the ```spectro_analysis``` function from the R package [```warbleR```](https://cran.r-project.org/web/packages/warbleR/warbleR.pdf)
   * 30-117: ```Mel-cepstral coefficients```: features extracted using the ```mfcc_stats``` function from the ```warbleR``` package
   * 118: ```actual label```: audio clip label, to species level where possible
2. ```R2_RF_evaluation.csv``` contains 8 columns:
   - ```id```: source of row
   - ```.pred_GGM```: the probability of that entry being a "GGM"
   - ```.pred_no```: the probability of that entry being a "no"
   - ```.pred_SCM```: the probability of that entry being a "SCM"
   - ```.row```: reference to the row number of ```training_data_features.csv```
   - ```._preds_class```: the predicted class of that entry
   - ```Common```: the validated class of that entry
   - ```.config```: model configuration reference
3. ```metrics.csv``` contains 5 columns:
   - ```Recogniser```: the recogniser ID
   - ```Type```: the stage in the process at which performance was estimated
   - ```target```: the class the performance is estimated on
   - ```Metric```: the performance metric used to estimate performance
   - ```values```: the estimated performance of that row

## Sharing/access Information

An R project and the html code are also stored on:

1. https://github.com/tclewis29/Species-specific-recogniser
2. [DOI: 10.5281/zenodo.7533386](https://doi.org/10.5281/zenodo.7533386)
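
As a quick check on the column layout of ```R2_RF_evaluation.csv``` given in the file structure section, the following minimal Python sketch cross-tabulates the predicted class against the validated class and reports, for each predicted class, the fraction of detections that manual checking confirmed. It is an illustration only, not the analysis used in the paper. It assumes the CSV header matches the column names listed in this README (```._preds_class``` and ```Common```). The helper names ```cross_tabulate``` and ```confirmed_fraction``` are hypothetical, and the toy rows are invented for demonstration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def cross_tabulate(rows, pred_col="._preds_class", true_col="Common"):
    """Count rows for each (predicted class, validated class) pair.

    The default column names are assumed from the README; adjust them
    if the header in the actual file differs.
    """
    return Counter((row[pred_col], row[true_col]) for row in rows)


def confirmed_fraction(counts):
    """For each predicted class, the share of rows whose validated class agrees."""
    totals, confirmed = Counter(), Counter()
    for (pred, true), n in counts.items():
        totals[pred] += n
        if pred == true:
            confirmed[pred] += n
    return {pred: confirmed[pred] / totals[pred] for pred in totals}


# Hypothetical toy rows with the README's column layout (NOT real data).
TOY_CSV = """id,.pred_GGM,.pred_no,.pred_SCM,.row,._preds_class,Common,.config
a,0.8,0.1,0.1,1,GGM,GGM,m1
b,0.6,0.3,0.1,2,GGM,no,m1
c,0.1,0.2,0.7,3,SCM,SCM,m1
d,0.2,0.7,0.1,4,no,no,m1
"""

toy_rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
toy_counts = cross_tabulate(toy_rows)
toy_fractions = confirmed_fraction(toy_counts)
print(toy_fractions)
```

To run the same summary on the real file, replace the toy reader with ```csv.DictReader(open("R2_RF_evaluation.csv", newline=""))```, assuming the file sits in the working directory.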