# UMX-PRO description
UMX-PRO is a software package written in Python, built on the TensorFlow framework, that provides an off-the-shelf solution for music source separation (MSS).
MSS consists of extracting the different instrumental sounds from a mixture signal. In the scenario considered by UMX-PRO, a mixture signal is decomposed into a pre-defined set of so-called `targets`, such as: (scenario 1) {`vocals`, `bass`, `drums`, `guitar`, `other`} or (scenario 2) {`vocals`, `accompaniment`}.
The following key design choices were made for UMX-PRO:
• The software revolves around the training and inference of a deep neural network (DNN), building upon the TensorFlow v2 framework. The DNN implemented in UMX-PRO is based on a bidirectional LSTM (BLSTM) recurrent network. However, the software has been designed so that other network architectures can easily be plugged in, to allow for research and extensions.
• Given an appropriately formatted database (not part of UMX-PRO), the software trains the network. The database has to be split into `train` and `valid` subsets, each composed of folders called samples. All samples must contain the same set of audio files of identical duration, one for each desired target, for instance {vocals.wav, accompaniment.wav}; an example layout is sketched after this list. The software can handle any number of targets, provided they are all present in all samples. Since the model is trained jointly, a larger number of targets increases the GPU memory usage during training.
• The software comes with pre-trained models for the two scenarios mentioned above (5 targets and 2 targets).
• Once the models have been trained, they can be used to separate new mixtures through a dedicated `end-to-end` separation network. Notably, this end-to-end network comprises an optional refinement step called `expectation-maximization` (EM) that usually improves separation quality.
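For illustration, a minimal two-target database could be laid out on disk as follows. The track folder names are arbitrary examples; only the `train`/`valid` split and the identical set of per-sample target files are required:

```
dataset/
├── train/
│   ├── track_01/
│   │   ├── vocals.wav
│   │   └── accompaniment.wav
│   └── track_02/
│       ├── vocals.wav
│       └── accompaniment.wav
└── valid/
    └── track_03/
        ├── vocals.wav
        └── accompaniment.wav
```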
The software comes with full documentation, detailed commenting of the code and unit tests. In this short description, we just mention the core elements of interest:
• A `model` module implements the following classes (a minimal sketch follows the list):
◦ `STFTLayer` / `ISTFTLayer`: subclass `keras.layers.Layer` and encapsulate the transformations to and from the short-term Fourier domain.
◦ `BLSTMSpectralFilterLayer`: the core filter model, which takes a mixture spectrogram as input and outputs the spectrogram of a specific target.
◦ `get_joint_spectral_filter`: assembles several spectral filter layers (either the BLSTM implementation above or some other model that is not part of UMX-PRO) into a single Keras model that produces several target spectrograms at once. For convenience, this joint filter is available both as a functional Keras model and as a subclass of `keras.Model`.
◦ An `ExpectationMaximization` Keras `Layer` takes several estimates in the complex STFT domain, together with the complex STFT of the mixture, and refines those estimates through the EM algorithm (it can be understood as a TF2 implementation of the `norbert` Python toolbox).
◦ `UMX`: this subclassed Keras `Model` puts together everything mentioned above to take time-domain signals and produce separated time-domain targets.
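To make the above concrete, here is a minimal sketch of what an STFT layer and a BLSTM spectral filter can look like in TF2/Keras. This illustrates the general technique only, not the actual UMX-PRO implementation; all hyper-parameters and shapes are assumptions:

```python
import tensorflow as tf

class STFTLayer(tf.keras.layers.Layer):
    """Maps a waveform (batch, samples) to a complex STFT (batch, frames, bins)."""
    def __init__(self, frame_length=4096, frame_step=1024, **kwargs):
        super().__init__(**kwargs)
        self.frame_length, self.frame_step = frame_length, frame_step

    def call(self, waveform):
        return tf.signal.stft(waveform, self.frame_length, self.frame_step)

class BLSTMSpectralFilterLayer(tf.keras.layers.Layer):
    """Maps a mixture magnitude spectrogram to one target magnitude spectrogram."""
    def __init__(self, hidden_size=256, **kwargs):
        super().__init__(**kwargs)
        self.blstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden_size, return_sequences=True))
        self.to_bins = None  # built once the number of frequency bins is known

    def build(self, input_shape):
        self.to_bins = tf.keras.layers.Dense(input_shape[-1], activation="relu")

    def call(self, mix_magnitude):
        # (batch, frames, bins) -> (batch, frames, 2 * hidden) -> (batch, frames, bins)
        return self.to_bins(self.blstm(mix_magnitude))

# Usage: the magnitude of the mixture STFT feeds the filter.
mix = tf.random.normal([2, 44100])                   # (batch, samples)
spec = STFTLayer()(mix)                              # complex STFT
estimate = BLSTMSpectralFilterLayer()(tf.abs(spec))  # one target spectrogram
```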
• A `data` module implements `FixedSourcesTrackFolderDataset`, a TF v2 compliant, parallelizable data pipeline with the following features (a usage sketch follows the list):
◦ it takes as input a path with the structure described above.
◦ it enables optional augmentations: random track mixing, random target gains, stereo swapping.
◦ it randomly extracts excerpts of a given duration from the actual data.
◦ the pipeline does not require any pre-processing and thus supports streaming audio inputs.
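The following sketch shows how such a pipeline can be assembled with `tf.data`. The glob pattern and target names are assumptions, and random excerpting and augmentations are omitted, so this is a simplification of what `FixedSourcesTrackFolderDataset` actually does:

```python
import tensorflow as tf

TARGETS = ["vocals", "accompaniment"]

def load_track(track_dir):
    # Decode one wav per target; each is (samples, channels) in [-1, 1].
    stems = []
    for target in TARGETS:
        audio_bytes = tf.io.read_file(
            tf.strings.join([track_dir, "/", target, ".wav"]))
        audio, _ = tf.audio.decode_wav(audio_bytes)
        stems.append(audio)
    stems = tf.stack(stems)            # (targets, samples, channels)
    mixture = tf.reduce_sum(stems, 0)  # the mixture is the sum of the targets
    return mixture, stems

track_dirs = tf.data.Dataset.list_files("dataset/train/*", shuffle=True)
train_ds = (track_dirs
            .map(load_track, num_parallel_calls=tf.data.AUTOTUNE)
            .prefetch(tf.data.AUTOTUNE))
```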
• A `train` module implements the actual training of the network (a minimal sketch follows the list):
◦ It creates a joint spectral filter and the train/validation datasets.
◦ It creates a configuration profile with the parameters required to resume training and restore models.
◦ It trains the models under a distributed strategy, with early stopping and learning rate decay on plateau.
◦ It checkpoints regularly, monitors losses through TensorBoard, and regularly saves the model to the (Google) cloud.
◦ It saves the model both through its weights and in the `SavedModel` format for deployment.
◦ As an alternative to joint training, each source may be trained independently, with the resulting models only used jointly afterwards for inference.
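These features map directly onto standard Keras facilities. The sketch below illustrates them with trivial stand-ins for the model and data; the actual `train` module wires in the joint spectral filter and the dataset pipeline instead, and all paths and hyper-parameters here are placeholders:

```python
import numpy as np
import tensorflow as tf

# Trivial stand-ins so the sketch runs end to end.
x = np.random.rand(32, 16).astype("float32")
train_ds = tf.data.Dataset.from_tensor_slices((x, x)).batch(8)
valid_ds = train_ds

strategy = tf.distribute.MirroredStrategy()  # distributed training strategy
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(16)])
    model.compile(optimizer="adam", loss="mse")

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.5, patience=10),
    tf.keras.callbacks.ModelCheckpoint(            # regular checkpoints,
        "ckpt/umx-{epoch:04d}.weights.h5",         # e.g. to a cloud bucket
        save_weights_only=True),
    tf.keras.callbacks.TensorBoard(log_dir="logs"),  # loss monitoring
]
model.fit(train_ds, validation_data=valid_ds, epochs=5, callbacks=callbacks)
model.save("umx_saved_model")  # `SavedModel` export for deployment (TF2 behavior)
```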
• An `inference` module permits the use of the pre-trained models (a usage sketch follows the list):
◦ It restores a model from checkpoints, saved weights, or the `SavedModel` format.
◦ It takes audio files as input and produces separated files as output.
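A typical round trip through such an inference step looks like the sketch below; the output tensor layout, file names, and target list are assumptions for illustration, not the exact UMX-PRO API:

```python
import tensorflow as tf

model = tf.keras.models.load_model("umx_saved_model")  # or restore weights/checkpoints

audio_bytes = tf.io.read_file("mixture.wav")
mixture, rate = tf.audio.decode_wav(audio_bytes)       # (samples, channels)

# Assumed output layout: (batch, targets, samples, channels).
estimates = model(mixture[tf.newaxis, ...])

for i, name in enumerate(["vocals", "accompaniment"]):
    tf.io.write_file(name + ".wav",
                     tf.audio.encode_wav(estimates[0, i], rate))
```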
The rest of the software description can be found in the docs, which are automatically generated from the densely commented source code.