# Project : Speech Separation

#### Reference : https://github.com/JusperLee/Conv-TasNet

[TOC]

## TO DO LIST

- [ ] spectrogram
- [ ] Why is the output wave file bigger than the original file?
- [ ] DataLoader splits. The results are shorter than the input data during val/test. :warning:
- [ ] While mix_audio ranges between -1 and 1, the output values in Asteroid are very large ( ±thousands ). In this model the output range is similar to mix_audio, and subtracting spk1/spk2 from mix_audio leaves only a small difference, yet the result sounds good after multiplying spk1/spk2 by 100 or more.
- [x] Deep encoder / decoder : https://arxiv.org/abs/2002.08688
- [ ] Dilation / padding ( in D-conv ) :warning:
- [ ] Causal ( we only train non-causal models ) :warning:
- [ ] Real-time implementation ( needs a causal model ) : https://github.com/kaituoxu/Conv-TasNet/issues/9
- [ ] Getting sound input in Python
    * https://python-sounddevice.readthedocs.io/en/0.4.1/index.html
    - [ ] only supports one microphone
- [ ] Music separation
- [ ] End-to-end
- [ ] Explainable ML

---- python ----

- [ ] yield
- [x] logging
- [x] PyTorch groups / dilation :question:
- [x] zero_grad / gradient accumulation
- [x] DataParallel / model parallel ( use multiple GPUs to train one model )
    * https://ithelp.ithome.com.tw/articles/10226382
- [x] model.eval() / with torch.no_grad()
- [x] Skip connection / residual path :question:
- [x] YAML file
- [x] argparse
    * https://haosquare.com/python-argparse/
- [x] *args / **kwargs
- [ ] elementwise affine ( norm )
- [ ] torch.nn.utils.clip_grad_norm_
- [x] DataLoader collate_fn
- [x] PyTorch shuffle :question:
- [ ] librosa amplitude_to_db
- [x] PyTorch Identity // visdom ( visualization tool )

## Code review

* The model gets better results without skip connections. ( per JusperLee's experiments )
* Use `groups` to perform depthwise convolution.
* This model focuses on the separation part, and only uses a shallow encoder/decoder.
( linear ) --> Another paper tries a deeper encoder/decoder architecture, improving the average result by more than 1 dB.

### Data

* Mix_audio = Speaker1 + Speaker2.
* All audio is divided into 4 s chunks.
* How to split :
    1. Only use data longer than 4 s, and choose a **random start time**. Test data starts from 0 s.
    2. Use almost all the data, unless it is too short.
* Parameters :
    * `chunk_size` : 8000 Hz * 4 s = 32000. (default)
    * `least` : audio shorter than this value is ignored.
* If the audio is **longer** than `chunk_size` --> it may be divided into several 4 s parts. --> The longest audio in the training data is about 16 s.
* If the audio is **shorter** than `chunk_size` --> zero padding. --> A few clips are shorter than 2 s.

![](https://i.imgur.com/XrD2gpZ.png)

### Model

![](https://i.imgur.com/y44MOMm.png)

1. Network configuration :
    * N : output ( encoder ) & input ( decoder ) channels
    * L : kernel size in encoder & decoder
    * B : output channels of the bottleneck
    * Sc : :x:
    * H : output channels of the D-conv
    * P : kernel size in the D-conv
    * X : number of 1-D conv blocks in each repeat
    * R : number of repeats in the separation part

![](https://i.imgur.com/0yuW3RV.png)

```yaml=
#### network configuration
net_conf:
  N: 512
  L: 40
  B: 128
  H: 256
  P: 3
  X: 7
  R: 2
  norm: gln
  num_spks: 2
  activate: sigmoid
  causal: false
```

![](https://i.imgur.com/F5W5DGF.png)

## Feasible approach

1. IBM ( ideal binary mask )
    * Compare the two sources and take the larger value.
    * Needs the ground truth, but in practice we only have the mixed source.

## Model's problems

1. Conv-TasNet outperforms Deep Clustering on the WSJ0-2mix dataset, but it still has some problems.
![](https://i.imgur.com/eqLCkAM.png)
2. If we train on a dataset with only English speakers, the model can't properly separate speakers who speak different languages. --> On the other hand, Deep Clustering gets better results here.
3.
The number of outputs is fixed in Conv-TasNet, so it can't handle input sources with a different number of speakers.
    * One solution : train a model that separates out one speaker at a time, and treat the rest of the source as a new input.
![](https://i.imgur.com/MNHFQR1.png)

## What Conv-TasNet tries to improve

LSTM, STFT
![](https://i.imgur.com/B1PgDZx.png)

## Evaluation

* SNR ( Signal-to-Noise Ratio )
* SI-SDR = SI-SNR ( Scale-Invariant Signal-to-Distortion Ratio )
* SNR has a problem : simply scaling up the estimate X* gives a better loss value, so it is not a reasonable function for evaluating the result; we use SI-SDR instead.
* ![](https://i.imgur.com/GXpjjM3.png)

reference : http://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/SP%20(v3).pdf

* Other loss functions :
    * MSE
    * STOI ( Short-Time Objective Intelligibility )
    * PESQ ( Perceptual Evaluation of Speech Quality )

## Permutation Issue

PIT is not that good : https://arxiv.org/abs/1910.12706
![](https://i.imgur.com/yqsv1Ev.png)
![](https://i.imgur.com/dg5KzaS.png)

## TCN ( Temporal Convolutional Network )

* Separator

## Zero_grad / Gradient accumulation

Reference :
https://meetonfriday.com/posts/18392404/
https://www.zhihu.com/question/303070254

* PyTorch model training template :

```python=
## training step
for i, (image, label) in enumerate(train_loader):
    optimizer.zero_grad()            # reset gradients
    pred = model(image)
    loss = criterion(pred, label)
    loss.backward()                  # backprop -> compute gradients
    optimizer.step()                 # update the model's weights
```

* Why doesn't PyTorch zero the gradients automatically?
* `Gradient accumulation` :
    * GPU resources are limited, but we still need to train a big model. However, it is hard to get good results with a small batch size.
    * Actually, a small batch size is fine, but we need a trick when designing the training loop : **`Do not update the weights after every iteration; instead, update them after every few iterations.
`**
    * That is to say, we accumulate the backward losses over many iterations and then do a single update, which indirectly achieves the effect of a large batch size.
    * NOTICE : it is not as good as actually using a large batch size. The learning rate also needs to be adjusted, and training takes longer.
    * EXAMPLE : BERT.

```python=
## training step with gradient accumulation
for i, (image, label) in enumerate(train_loader):
    pred = model(image)
    loss = criterion(pred, label)
    loss = loss / accumulation_steps   # normalize the loss
    loss.backward()                    # accumulate gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()               # update the model's weights
        optimizer.zero_grad()          # reset gradients
```

## model.eval() / with torch.no_grad()

* model.eval()
    * Changes the forward behavior of Dropout / BatchNorm layers.
* with torch.no_grad()
    * Disables autograd, i.e. stops computing gradients, which reduces GPU memory usage, speeds things up, and allows a bigger batch.
* If you don't care about GPU memory usage and computing time, model.eval() alone is enough to get the correct answer.

## YAML file

Reference :
:star: https://www.wikiwand.com/zh-tw/YAML
https://medium.com/bryanyang0528/%E4%BD%BF%E7%94%A8-python-%E8%AE%80%E5%8F%96-yaml-%E6%AA%94%E6%A1%88-d3f413d7dd6
https://ithelp.ithome.com.tw/articles/10206454

* YAML can do everything JSON can do and more. It is a format for describing settings that people can understand at a glance.
* YAML doesn't allow the `TAB` character, so we can't use tabs to indent.

```yaml=
## This is a comment
## YAML can't use 'TAB' to indent
key1:
  a: 10
  b: !!float 0.5   ## !! : strict type declaration
key2:
  list_key:
    - If we add `-` in front of an item,
    - it will be parsed as a list.

## '|' : every line after this char is kept as its own line.
## 'key3' : {'This is a\nbook\n'}
key3: |
  This is a
  book

## '>' : lines are folded together; a new line starts only at a blank line or when the indentation changes.
## 'key4' : {'This is a book\nNewline\n'}
key4: >
  This is a
  book

  Newline

## Others : ?, &/*, ... -> see the references
```

```python=
## How to read a YAML file in Python
import yaml

with open('filename.yaml', 'r') as stream:
    data = yaml.safe_load(stream)   # yaml.load() now requires an explicit Loader
```

## Skip connection / Residual path

https://www.youtube.com/watch?v=6g2aPc0ol2Y&list=LL&index=3 ( 9:00 WaveNet path )
![](https://i.imgur.com/NlX2fWm.png)

* The skip connection is the next block's input. (left)
* The dimensionalities may differ, so we need a weight matrix to perform a linear transform and match the two tensors' dimensions.

https://ithelp.ithome.com.tw/articles/10204727
![](https://i.imgur.com/WTooGJD.png)

## argparse

### Basic manipulation

```python=
## args_demo.py ( don't name the script argparse.py, or it will shadow the stdlib module )
import argparse

parser = argparse.ArgumentParser(
    prog = 'My argparse test program',
    description = 'Describe the purpose of this program',
    epilog = 'This line will show at the end of --help')
parser.add_argument('-B', '--batch_size', default = 8, type = int,
                    help = 'Batch size')
parser.add_argument('-S', '--shuffle', default = 1, type = int,
                    help = 'Whether to shuffle the data or not')

args = parser.parse_args()   # get the values from the parser
print(args.batch_size)
print(args.shuffle)
```

```bash=
## How to use?
## use -h or --help to get the help message
$ python3 args_demo.py -h
## change a specific arg's value
$ python3 args_demo.py --batch_size 16
```

### Verbose

* store_true, conflict, choices, count.

```python=
## action = 'xxx'
parser.add_argument('--verbose', action = 'store_true')   # the argument is stored as a boolean
```

* Example :
![](https://i.imgur.com/yNqdc3C.png)

## *args / **kwargs

https://skylinelimit.blogspot.com/2018/04/python-args-kwargs.html

1. *args
    * Data is stored in a `tuple`.

```python=
def plus(*nums):
    res = 0
    for i in nums:
        res += i
    return res

plus(1, 2, 3, 4, 5)   # 15
```

2. **kwargs
    * Data is stored in a `dict`.
```python=
dt = {'sep': ' # ', 'end': '\n\n'}
print('hello', 'world', **dt)
# same as print('hello', 'world', sep=' # ', end='\n\n')
```

```python=
def test(**_settings):
    print(_settings)

test(name='Sky', attack=100, hp=500)
# {'name': 'Sky', 'attack': 100, 'hp': 500}
```

```python=
# mixed version
def test(a, *args, kw1, **kwargs):
    print(a, args, kw1, kwargs, sep=' # ')

test(1, 2, 3, 4, 5, kw1=6, g=7, f=8, l=9)
# a = 1
# args = (2, 3, 4, 5)
# kw1 = 6
# kwargs = {'g': 7, 'f': 8, 'l': 9}
```

## PyTorch groups / dilation

https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
Dilation : https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md
Group convolution :
https://www.cnblogs.com/shine-lee/p/10243114.html
https://blog.csdn.net/chenyuping333/article/details/82531047?utm_source=blogxgwz6

:::warning
When the model is divided into two parts ( as in AlexNet ), how are the new weights updated?
:::

![](https://i.imgur.com/NAYWD2x.png)
![](https://i.imgur.com/QupU5Gw.png)

* `groups` :
    1. Divide the input channels into `groups` parts, convolve each part separately, and concatenate the outputs.
    2. `in_channels` and `out_channels` must both be divisible by `groups`.
    3. If `groups = in_channels`, the operation is a `depthwise convolution`.

![](https://i.imgur.com/mkGDbfr.png)

## logging

https://editor.leonh.space/2022/python-log/

1. Priority : CRITICAL > ERROR > WARNING > INFO > DEBUG
    * The default level is `WARNING`.
    * Messages below the logger's current level are not displayed.

```python=
import logging

logging.debug('debug message')
logging.info('info message')
logging.warning('warning message')
logging.error('error message')
logging.critical('critical message')

# WARNING:root:warning message
# ERROR:root:error message
# CRITICAL:root:critical message
```

2. Logger / Handler
    * One logger can have more than one handler. If we add a StreamHandler and a FileHandler to the logger, one log message is shown on the screen and written to the .log file at the same time.
    * The output differs from the example above because of the Formatter.

```python=
## getLogger / setLevel / Handler
import logging

logger = logging.getLogger(name='dev')   # a named logger ( not the root logger )
logger.setLevel(logging.INFO)

stream_handler = logging.StreamHandler()      # show the message on stderr
file_handler = logging.FileHandler(filename)  # write the message to a .log file
logger.addHandler(stream_handler)
logger.addHandler(file_handler)

logger.debug('debug message')
logger.info('info message')
logger.warning('warning message')
logger.error('error message')
logger.critical('critical message')

# info message
# warning message
# error message
# critical message
```

* Formatter
![](https://i.imgur.com/ZfwsycM.png)

```python=
## Formatter
import logging

logger = logging.getLogger(name='dev')
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

logger.debug('debug message')
logger.info('info message')
logger.warning('warning message')
logger.error('error message')
logger.critical('critical message')

# 2005-03-19 15:10:26,618 - dev - DEBUG - debug message
# 2005-03-19 15:10:26,620 - dev - INFO - info message
# 2005-03-19 15:10:26,695 - dev - WARNING - warning message
# 2005-03-19 15:10:26,697 - dev - ERROR - error message
# 2005-03-19 15:10:26,773 - dev - CRITICAL - critical message
```

## DataLoader collate_fn

https://www.it145.com/9/177246.html
https://pytorch.org/docs/stable/data.html

* For custom batching, e.g. :
    * input samples with different sizes.
    * collating along a dimension other than the first.
    * padding sequences of various lengths.

## PyTorch shuffle

https://blog.csdn.net/qq_31049727/article/details/116206349

:::warning
shuffle tensor ??
DataLoader shuffle : already shuffles the data
random.shuffle
![](https://i.imgur.com/FjYHA9h.png)
:::

## PyTorch Identity

https://stackoverflow.com/questions/64229717/what-is-the-idea-behind-using-nn-identity-for-residual-learning

* Provides a cleaner API that users can easily understand.

```python=
## Keep the model structure the same in both cases.
batch_norm = nn.BatchNorm2d
if dont_use_batch_norm:
    batch_norm = nn.Identity   # argument-insensitive no-op layer
...
...
nn.Sequential(
    ...
    batch_norm(N, momentum=0.05),
    ...
)
```
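The sketch above can be made runnable. A minimal, self-contained example ( the `make_block` helper, layer sizes, and input shape are made up for illustration ); `nn.Identity` ignores any constructor arguments, so it can be called exactly like `nn.BatchNorm2d` :

```python
import torch
import torch.nn as nn

def make_block(use_batch_norm: bool) -> nn.Sequential:
    # nn.Identity ignores its constructor arguments, so both branches
    # can be built with the same BatchNorm-style call.
    batch_norm = nn.BatchNorm2d if use_batch_norm else nn.Identity
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        batch_norm(16, momentum=0.05),   # a no-op when it is nn.Identity
        nn.ReLU(),
    )

x = torch.randn(2, 3, 8, 8)
for flag in (True, False):
    y = make_block(flag)(x)
    assert y.shape == (2, 16, 8, 8)   # same structure and shape either way
```

Either variant then runs through the same training/evaluation code path, which is the point of using `nn.Identity` as a placeholder.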