# Project : Speech separation
#### Reference : https://github.com/JusperLee/Conv-TasNet
[TOC]
## TO DO LIST
- [ ] spectrogram
- [ ] Why is the size of the output wave file bigger than the original file?
- [ ] DataLoader splits. The length of the results is smaller than the input data during val/test. :warning:
- [ ] While the range of mix_audio is between -1 and 1, the output values in asteroid are very large ( ± thousands ). In this model, however, the range is similar to mix_audio; subtracting spk1/spk2 from mix_audio leaves only a small difference, yet the result becomes good after multiplying spk1/spk2 by 100 or a larger value.
- [x] Deep encoder / decoder : https://arxiv.org/abs/2002.08688
- [ ] dilated / padding ( in D-conv ) :warning:
- [ ] causal ( we only train in non-causal ) :warning:
- [ ] Real-time implement ( need to train a causal model ) : https://github.com/kaituoxu/Conv-TasNet/issues/9
- [ ] Getting sound input in python
* https://python-sounddevice.readthedocs.io/en/0.4.1/index.html
- [ ] Only supports one microphone
- [ ] music separation
- [ ] end-to-end
- [ ] Explainable ML
---- python ----
- [ ] yield
- [x] logging
- [x] Pytorch groups / dilation :question:
- [x] zero_grad/gradient accumulation
- [x] DataParallel / Model Parallel ( Use multiple GPUs to train a model )
* https://ithelp.ithome.com.tw/articles/10226382
- [x] model.eval() / with torch.no_grad()
- [x] Skip connection / Residual path :question:
- [x] YAML file
- [x] argparse
* https://haosquare.com/python-argparse/
- [x] *args / **kwargs
- [ ] elementwise affine ( norm )
- [ ] torch.nn.utils.clip_grad_norm_
- [x] DataLoader collate_fn
- [x] Pytorch shuffle :question:
- [ ] librosa amplitude_to_db
- [x] Pytorch identity
// visdom ( visualize tool )
## Code review
* Model has better result without skip-connection. ( by JusperLee's experiment )
* Use "groups" to perform depthwise convolution.
* This model focuses on the separation part and only uses a shallow ( linear ) encoder/decoder.
--> Another paper tries a deeper encoder/decoder architecture and improves the average result by more than 1 dB.
### Data
* Mix_audio = Speaker1 + Speaker2.
* All audio is divided into 4-second chunks.
* How to split :
1. Only use data longer than 4 s, and choose a **random start time**. Test data starts from 0 s.
2. Use almost all the data, unless it is too short.
* Parameters :
`chunk_size` : 8000 Hz * 4 s = 32000 samples. (default)
`least` : If the length of audio is less than this value, it will be ignored.
* If the audio is **longer** than `chunk_size` :
--> It may be divided into several 4-second parts.
--> The longest audio in the training data is about 16 s.
* If the audio is **shorter** than `chunk_size` :
--> Zero padding.
--> A few audio clips are shorter than 2 s.
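The splitting rules above can be sketched as follows ( a hypothetical helper, not the repo's actual loader; the 2-second `least` threshold is an assumed value ):

```python=
import numpy as np

CHUNK_SIZE = 8000 * 4  # 32000 samples = 4 s at 8 kHz
LEAST = 8000 * 2       # assumed threshold: ignore clips shorter than 2 s

def make_chunks(audio, chunk_size=CHUNK_SIZE, least=LEAST):
    """Split a 1-D waveform into fixed-size chunks, zero-padding short clips."""
    if len(audio) < least:
        return []  # too short -> ignored
    if len(audio) < chunk_size:
        pad = np.zeros(chunk_size - len(audio), dtype=audio.dtype)
        return [np.concatenate([audio, pad])]
    # long clip -> several non-overlapping 4-second parts
    return [audio[s:s + chunk_size]
            for s in range(0, len(audio) - chunk_size + 1, chunk_size)]

chunks = make_chunks(np.ones(70000, dtype=np.float32))
print([len(c) for c in chunks])  # [32000, 32000]
```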

### Model

1. Network configuration :
* N : Output ( Encoder ) & Input ( Decoder ) channel
* L : Kernel size in encoder & decoder
* B : Output channel in bottleneck
* Sc : Channels in the skip-connection path ( not present in the config below )
* H : Output channel D-conv
* P : Kernel size in D-conv
* X : Number of 1-D conv block in each repeat
* R : Number of repeats in separation part

```yaml=
#### network configure
net_conf:
N: 512
L: 40
B: 128
H: 256
P: 3
X: 7
R: 2
norm: gln
num_spks: 2
activate: sigmoid
causal: false
```

## Feasible approach
1. IBM ( ideal binary mask )
* Compare the two sources and take the larger value at each time-frequency bin.
* Needs ground truth, but in practice we only have the mixed source.
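A toy sketch of the IBM idea on magnitude spectrograms ( hypothetical function; assumes the ground-truth spectrograms of both speakers are available ):

```python=
import numpy as np

def ideal_binary_mask(spec1, spec2):
    """1 where source 1 has the larger magnitude, 0 otherwise ( needs ground truth )."""
    return (np.abs(spec1) >= np.abs(spec2)).astype(np.float32)

s1 = np.array([[3.0, 1.0], [0.5, 2.0]])
s2 = np.array([[1.0, 2.0], [4.0, 1.0]])
mask = ideal_binary_mask(s1, s2)
est1 = mask * (s1 + s2)  # apply the mask to the mixture spectrogram
print(mask)              # source 1 dominates at (0,0) and (1,1)
```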
## Model's problem
1. Conv-TasNet outperforms Deep Clustering on the WSJ0-2mix dataset, but it actually still has some problems.

2. If we train with a dataset containing only English speakers, the model can't properly separate speakers who are speaking different languages.
--> On the other hand, Deep Clustering gets better results in this case.
3. The number of outputs is fixed in Conv-TasNet, so it can't deal with input sources that have a different number of speakers.
* One solution :
Train a model that separates only one speaker at a time, and take the rest of the source as a new input.

## What Conv-TasNet trying to improve
LSTM, STFT

## Evaluation
* SNR ( Signal-to-Noise Ratio )
SI-SDR = SI-SNR ( Scale-Invariant Signal-to-Distortion Ratio )
* SNR has a problem : simply scaling up the estimate X* yields a better loss value, so it is not a reasonable function for evaluating the result. We use SI-SDR instead.

reference : http://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/SP%20(v3).pdf
* Other loss function :
MSE
STOI ( Short-Time Objective Intelligibility )
PESQ ( Perceptual Evaluation of Speech Quality )
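The scale invariance of SI-SNR can be sketched in plain NumPy ( not the repo's implementation; `eps` is an assumed numerical stabilizer ):

```python=
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-Invariant SNR in dB ( NumPy sketch )."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref  # project est onto ref
    e_noise = est - s_target
    return 10 * np.log10((s_target @ s_target + eps) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(32000)
noisy = ref + 0.1 * rng.standard_normal(32000)
# rescaling the estimate does not change SI-SNR ( unlike plain SNR )
print(si_snr(noisy, ref), si_snr(10.0 * noisy, ref))
```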
## Permutation Issue
PIT is not that good : https://arxiv.org/abs/1910.12706
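For context, the vanilla PIT objective that the paper above criticizes can be sketched as a brute-force search over speaker permutations ( hypothetical helper, not code from the paper ):

```python=
import itertools
import torch

def pit_loss(est, ref, loss_fn):
    """Permutation-invariant training loss.

    est, ref: (batch, num_spks, time). Tries every permutation of the
    estimated sources against the references and keeps the cheapest one.
    """
    num_spks = est.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_spks)):
        perm_loss = sum(loss_fn(est[:, p], ref[:, i]) for i, p in enumerate(perm))
        losses.append(perm_loss / num_spks)
    return torch.stack(losses).min()

est = torch.randn(4, 2, 8000)
ref = est[:, [1, 0]]            # same sources, references in swapped order
mse = lambda a, b: ((a - b) ** 2).mean()
print(pit_loss(est, ref, mse))  # ~0 : PIT finds the matching permutation
```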


## TCN ( Temporal Convolution Network )
* Separator
## Zero_grad / Gradient accumulation
Reference :
https://meetonfriday.com/posts/18392404/
https://www.zhihu.com/question/303070254
* Pytorch model training template :
```python=
## training step
for i, (image, label) in enumerate(train_loader):
    optimizer.zero_grad()  # reset gradients
    pred = model(image)
    loss = criterion(pred, label)
    loss.backward()        # backprop -> compute gradients
    optimizer.step()       # update the model's weights
```
* Why doesn't PyTorch zero the gradients automatically ?
* `Gradient accumulation` :
* GPU resources are limited, yet we still need to train a big model, which forces a small batch size. However, it is hard to get good results with a small batch size.
* Actually, a small batch size is fine, but we need a trick in the training loop :
**`Do not update the weights after every iteration; update them once every few iterations instead.`**
* That is to say, we accumulate the gradients from several backward passes and then do a single update, which indirectly achieves the effect of a large batch size.
* NOTICE : It is not quite as good as directly using a large batch size. The learning rate also needs to be adjusted accordingly, and training takes longer.
* EXAMPLE : BERT.
```python=
## training step with gradient accumulation
for i, (image, label) in enumerate(train_loader):
    pred = model(image)
    loss = criterion(pred, label)
    loss = loss / accumulation_steps  # normalize loss over accumulated steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update the model's weights
        optimizer.zero_grad()  # reset gradients
```
## model.eval() / with torch.no_grad()
* model.eval()
    * Changes the forward behavior of Dropout / BatchNorm layers.
* with torch.no_grad()
    * Deactivates autograd, i.e. stops building the computation graph, which reduces GPU memory usage, speeds things up, and allows larger batches.
* If you don't care about GPU memory usage and computing time, model.eval() alone is enough to get correct results.
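A minimal sketch of combining the two at inference time ( hypothetical toy model ):

```python=
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))

model.eval()             # Dropout stops dropping, BatchNorm uses running stats
with torch.no_grad():    # no computation graph is built -> less memory, faster
    out = model(torch.randn(4, 8))

print(model.training, out.requires_grad)  # False False
```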
## YAML file
Reference :
:star:https://www.wikiwand.com/zh-tw/YAML
https://medium.com/bryanyang0528/%E4%BD%BF%E7%94%A8-python-%E8%AE%80%E5%8F%96-yaml-%E6%AA%94%E6%A1%88-d3f413d7dd6
https://ithelp.ithome.com.tw/articles/10206454
* YAML can do everything JSON can do and more. It is a format for describing settings, designed so that people can immediately understand what it means.
* YAML doesn't allow the `TAB` character, so we can't use tabs to indent.
```yaml=
## This is a comment
## YAML can't use 'TAB' to indent
key1:
  a: 10
  b: !!float 0.5   ## !! : strict type declaration
key2:
  list_key:
    - If we add `-` in front of the option,
    - it will be regarded as a list.
## '|' : keep every newline literally.
## key3 == 'This is a\nbook\n'
key3: |
  This is a
  book
## '>' : fold newlines into spaces; a blank line ( or indentation change ) keeps a newline.
## key4 == 'This is a book\nNewline\n'
key4: >
  This is a
  book

  Newline
## Others : ?, &/*, ... -> see Reference
```
```python=
## How to read a YAML file in Python
import yaml

with open('filename.yaml', 'r') as stream:
    data = yaml.safe_load(stream)  # safe_load avoids the deprecated default Loader
```
## Skip connection / Residual path
https://www.youtube.com/watch?v=6g2aPc0ol2Y&list=LL&index=3 ( 9:00 wavenet path )

* The residual path is the next block's input; the skip-connection outputs are summed to form the final output. (left)
* The dimensionalities may not be the same, so we need a weight matrix to apply a linear transform that matches the two dimensionalities.
https://ithelp.ithome.com.tw/articles/10204727
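A sketch of a residual path where a 1x1 convolution does the linear transform on the shortcut when channel counts differ ( hypothetical module, not the repo's block ):

```python=
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual path; a 1x1 conv aligns channels when they differ."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv1d(in_ch, out_ch, 3, padding=1), nn.PReLU())
        self.proj = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.proj(x)  # shortcut matches the body's shape

x = torch.randn(2, 64, 100)             # (batch, channels, time)
print(ResidualBlock(64, 128)(x).shape)  # torch.Size([2, 128, 100])
```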

## argparse
### Basic manipulation
```python=
## argparse_test.py
## ( don't name this file argparse.py, or it will shadow the standard library module )
import argparse

parser = argparse.ArgumentParser(
    prog='My argparse test program',
    description='Describe the purpose of this program',
    epilog='This line will show at the end of --help')
parser.add_argument('-B', '--batch_size',
                    default=8,
                    type=int,
                    help='Batch size')
parser.add_argument('-S', '--shuffle',
                    default=1,
                    type=int,
                    help='Whether to shuffle the data or not')
args = parser.parse_args()  # get the values from the parser
print(args.batch_size)
print(args.shuffle)
```
```shell=
## How to use?
## use -h or --help to get the help message
$ python3 argparse_test.py -h
## change a specific argument's value
$ python3 argparse_test.py --batch_size 16
```
### Verbose
* store_true, conflict, choices, count.
```python=
## action = 'xxx'
parser.add_argument("--verbose",
action = 'store_true') # 引數儲存為 boolean
```
* Example :
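A hedged sketch of a few of those actions ( the option names and `choices` values here are made up for illustration ):

```python=
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--verbose', action='store_true')        # flag -> bool
parser.add_argument('--norm', choices=['gln', 'cln', 'bn'])  # restrict allowed values
parser.add_argument('-v', action='count', default=0)         # -vvv -> 3

args = parser.parse_args(['--verbose', '--norm', 'gln', '-vvv'])
print(args.verbose, args.norm, args.v)  # True gln 3
```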

## *args / **kwargs
https://skylinelimit.blogspot.com/2018/04/python-args-kwargs.html
1. *args
* Data will be stored in `Tuple`.
```python=
def plus(*nums):
    res = 0
    for i in nums:
        res += i
    return res

plus(1, 2, 3, 4, 5)
# 15
```
2. **kwargs
* Data will be stored in `Dict`.
```python=
dt = {'sep': ' # ', 'end': '\n\n'}
print('hello', 'world', **dt)
# print('hello', 'world', sep=' # ', end='\n\n')
```
```python=
def test(**_settings):
    print(_settings)

test(name='Sky', attack=100, hp=500)
# {'name': 'Sky', 'attack': 100, 'hp': 500}
```
```python=
# mixed version
def test(a, *args, kw1, **kwargs):
    print(a, args, kw1, kwargs, sep=' # ')

test(1, 2, 3, 4, 5, kw1=6, g=7, f=8, l=9)
# a = 1
# args = (2, 3, 4, 5)
# kw1 = 6
# kwargs = {'g': 7, 'f': 8, 'l': 9}
```
## Pytorch groups / dilation
https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
Dilation : https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md
Group convolution :
https://www.cnblogs.com/shine-lee/p/10243114.html https://blog.csdn.net/chenyuping333/article/details/82531047?utm_source=blogxgwz6
:::warning
When the model is divided into two parts ( e.g. across two GPUs, like AlexNet ), how are the new weights updated?
:::


* `groups` :
1. Divides the input channels into `groups` groups, convolves each separately, and concatenates all the outputs.
2. Both `in_channel` and `out_channel` must be divisible by `groups`.
3. If `groups = in_channel`, the operation is just like a `depthwise convolution`.
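Parameter counts make the depthwise case concrete ( a small sketch; the channel sizes are arbitrary ):

```python=
import torch
import torch.nn as nn

x = torch.randn(1, 8, 100)  # (batch, in_channels, time)

# groups=1 : ordinary convolution, every output channel sees every input channel
conv = nn.Conv1d(8, 16, kernel_size=3, padding=1)
# groups=8 == in_channels : depthwise convolution, one filter per channel
depthwise = nn.Conv1d(8, 8, kernel_size=3, padding=1, groups=8)

print(sum(p.numel() for p in conv.parameters()))       # 8*16*3 + 16 = 400
print(sum(p.numel() for p in depthwise.parameters()))  # 8*1*3 + 8 = 32
```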

## logging
https://editor.leonh.space/2022/python-log/
1. Priority : CRITICAL > ERROR > WARNING > INFO > DEBUG
* The default level is `WARNING`.
* When a message's priority is lower than the current logger's level, the message won't be displayed.
```python=
import logging
logging.debug('debug message')
logging.info('info message')
logging.warning('warning message')
logging.error('error message')
logging.critical('critical message')
# WARNING:root:warning message
# ERROR:root:error message
# CRITICAL:root:critical message
```
2. Logger / Handler
* One logger can have more than one handler. If we add a StreamHandler and a FileHandler to the logger, one log message can be displayed on the screen and written to a .log file at the same time.
* The output differs from the example above because of the Formatter : these handlers have none set, so only the raw message is printed.
```python=
## getLogger / setLevel / Handler
import logging
logger = logging.getLogger(name='dev')  # a named logger ( getLogger() with no name returns the root logger )
logger.setLevel(logging.INFO)
stream_handler = logging.StreamHandler() # display the message on stderr
file_handler = logging.FileHandler('dev.log')  # 'dev.log' is an example filename
logger.addHandler(stream_handler)
logger.addHandler(file_handler)
logger.debug('debug message')
logger.info('info message')
logger.warning('warning message')
logger.error('error message')
logger.critical('critical message')
# info message
# warning message
# error message
# critical message
```
* Formatter

```python=
## Formatter
import logging
logger = logging.getLogger(name='dev')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.debug('debug message')
logger.info('info message')
logger.warning('warning message')
logger.error('error message')
logger.critical('critical message')
# 2005-03-19 15:10:26,618 - dev - DEBUG - debug message
# 2005-03-19 15:10:26,620 - dev - INFO - info message
# 2005-03-19 15:10:26,695 - dev - WARNING - warning message
# 2005-03-19 15:10:26,697 - dev - ERROR - error message
# 2005-03-19 15:10:26,773 - dev - CRITICAL - critical message
```
## DataLoader collate_fn
https://www.it145.com/9/177246.html
https://pytorch.org/docs/stable/data.html
* To achieve custom batching, e.g.
* Input samples have different sizes.
* Collating along a dimension other than the first.
* Padding sequences of various lengths.
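A sketch of a padding `collate_fn` ( the name `pad_collate` is hypothetical; `pad_sequence` is PyTorch's helper for this ):

```python=
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    """Zero-pad variable-length 1-D tensors to the longest one in the batch."""
    lengths = torch.tensor([len(x) for x in batch])
    padded = pad_sequence(batch, batch_first=True)  # (batch, max_len)
    return padded, lengths

data = [torch.ones(5), torch.ones(3), torch.ones(8)]
loader = DataLoader(data, batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape, lengths)  # torch.Size([3, 8]) tensor([5, 3, 8])
```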
## Pytorch shuffle
https://blog.csdn.net/qq_31049727/article/details/116206349
:::warning
How to shuffle a tensor ?
* DataLoader( shuffle=True ) : already reshuffles the data every epoch.
* random.shuffle : works on Python lists, not tensors.
:::
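One common way to shuffle a tensor directly ( outside DataLoader ) is indexing with `torch.randperm` :

```python=
import torch

x = torch.arange(10)
perm = torch.randperm(len(x))  # a random permutation of the indices 0..9
shuffled = x[perm]             # index with it to shuffle the tensor
print(sorted(shuffled.tolist()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] : same elements, new order
```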
## Pytorch identity
https://stackoverflow.com/questions/64229717/what-is-the-idea-behind-using-nn-identity-for-residual-learning
* Mainly for a cleaner API that users can easily understand.
```python=
## Keep the model structure the same whether or not BatchNorm is used.
batch_norm = nn.BatchNorm2d
if dont_use_batch_norm:
    batch_norm = nn.Identity  # nn.Identity ignores its constructor arguments
...
...
nn.Sequential(
    ...
    batch_norm(N, momentum=0.05),
    ...
)
```
```