
# InfQuery Module

**slides:** [https://hackmd.io/@sonik/ByukrOckq](https://hackmd.io/@sonik/ByukrOckq)

> [name=Hamilakis Nicolas]

---

**TODO**: brief description

---

## Experiment format

Experiments are described in a YAML file, which is easy for humans to edit and easy for programs to parse. Schema validation guarantees that all information is entered correctly before the experiment is run.

```yaml=
settings:
  ...
experiments:
  - kwargs: ...
    input: ...
    output: ...
    models: ...
    task: ...
```

---

### Settings

A set of global settings and parameters to take into account when running experiments.

```yaml=
settings:
  sbatch_mode: array
  debug: false
  hostname: oberon
```

---

### experiments

A list of experiments to run.

```yaml=
experiments:
  - ...
  - ...
```

---

#### kwargs

Settings and parameters of an individual experiment.

```yaml=
kwargs:
  quantize_options:
    max_size_seq: 10200
    batch_size: 16
    strict: false
    convert_audio_to: ".wav"
  pseudo_probabilities_options:
    batch_size: 256
    decoding_span_size: 8
    inner_batch_size: 64
```

---

#### input

Specifies the input files of the experiment (audio, csv, etc.).

```yaml=
input:
  audio:
    ext: .wav
    method: glob
    root_dir: data/wav
  lexical:
    csv: gold.csv
    columns: 'id,filename,voice,frequency,word,phones,length,correct'
```

---

#### output

Specifies the formatting of output files.

```yaml=
output:
  quantization:
    keep: false
  pseudo_proba:
    keep: false
  lexical:
    keep: true
    action: merge_by_family
    format: csv
    by: length
```

---

#### models

Specifies which models the experiment runs on (InfQuery Selector).

```yaml=
model:
  model_root: /data/infTrain/models
  selector:
    en:
      - "baby50"
      - "baby100"
    en|fr:
      - "baby200"
    fr:
      - "baby3600"
    de: "all"
```

---

#### task

Specifies which task to run (quantization, pseudo_probability, abx, lexical, etc.).

```yaml=
task: "quantize | proba | lexical"
```

---

## Data types

Different data types are paired with dataloaders to allow integration and interaction with the tasks (e.g. audio, csv files, etc.).

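
To make the pairing of a data type with its dataloader concrete, here is a minimal Python sketch based on the `input` example above. The class names `AudioInput` and `LexicalInput`, the field `csv_path`, and the recursive glob are assumptions made for illustration; they are not taken from the InfQuery code.

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, List


@dataclass
class AudioInput:
    """Hypothetical data type mirroring the `input.audio` config entry."""
    root_dir: str
    ext: str = ".wav"
    method: str = "glob"

    def files(self) -> List[Path]:
        # Only the `glob` method from the example config is sketched here.
        if self.method != "glob":
            raise ValueError(f"unsupported method: {self.method}")
        # Assumption: the search is recursive below root_dir.
        return sorted(Path(self.root_dir).glob(f"**/*{self.ext}"))


@dataclass
class LexicalInput:
    """Hypothetical data type mirroring the `input.lexical` config entry."""
    csv_path: str
    columns: str  # comma-separated column names, as in the config

    def rows(self) -> Iterator[Dict[str, str]]:
        expected = self.columns.split(",")
        with open(self.csv_path, newline="") as handle:
            reader = csv.DictReader(handle)
            # Fail early if the file lacks a column named in the config.
            missing = set(expected) - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"missing columns: {sorted(missing)}")
            for row in reader:
                yield {name: row[name] for name in expected}
```

A task could then iterate over `AudioInput("data/wav").files()` or `LexicalInput("gold.csv", "id,word").rows()` without knowing how the files were discovered.
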

---

## Task Types

A task takes a model and a dataset (a list of data types) and produces a dataset as output.

Task list:

- audio quantization (CPC/Kmeans)
- pseudo-probabilities (LSTM/BERT)
- comparative evaluation (lexical, semantic, etc.)
- distance evaluation

---

## Task Run example

![schema](https://i.imgur.com/A8aVCE2.png)

---

## Task parallelization

Adding parallelization raises several issues:

- identifying bottlenecks (which sections to parallelize)
- memory management
- process management
- GPU allocation

---

![version1](https://i.imgur.com/gbkfdeH.png)

---

Parallelization happens at the model level: one process (sbatch or other) is created for each model, optionally via sbatch_array or another resource manager, and within each process the task_sequence runs sequentially.

---
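
The model-level scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the InfQuery implementation: the function names `pick_model` and `run_task_sequence` are hypothetical, and chaining each task's output into the next task is assumed from the definition of a task (dataset in, dataset out). `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each element of a job array.

```python
import os
from typing import Callable, Mapping, Optional, Sequence

# A task takes a model name and a dataset and returns a new dataset.
Task = Callable[[str, object], object]


def pick_model(models: Sequence[str],
               environ: Optional[Mapping[str, str]] = None) -> str:
    """Select the model this process handles from the Slurm array index."""
    environ = os.environ if environ is None else environ
    index = int(environ["SLURM_ARRAY_TASK_ID"])
    return models[index]


def run_task_sequence(model: str, tasks: Sequence[Task],
                      dataset: object) -> object:
    """Run the tasks one after another inside a single process."""
    for task in tasks:
        # Assumption: the output dataset of one task feeds the next one.
        dataset = task(model, dataset)
    return dataset
```

With a job submitted as an array of one element per model (for example `sbatch --array=0-3` for four models), each array element would call `pick_model` once and then `run_task_sequence` for its own model, so no two processes share a model.
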