# Custom CPP and CUDA extensions

## Why use CPP/CUDA extensions?

- PyTorch provides a convenient way to write a CPP extension.
- Code may sometimes be better optimized for speed when run as a CPP extension. In other cases, your code may need to interact with C or CPP libraries.
- The custom CPP extension mechanism lets developers create **PyTorch Operators** out-of-source (separated from the PyTorch backend).
- PyTorch spares much of the boilerplate of integrating PyTorch operations while giving high flexibility in building PyTorch-based projects.
- Once the operation has been defined as a CPP extension, you can turn it into a native PyTorch function.
- Turning the operation into a native function is only a matter of **code organization**.

## The setup tool method

## The fusion method

- Because PyTorch only sees the individual operations involved in an algorithm, it launches as many CUDA kernels as those operations require. This can create a significant amount of overhead.
- Therefore, the fusion method is introduced: rewrite parts of the CPP extension and fuse particular groups of operations.
- Fusing means combining a few function implementations into a single function. This profits from fewer kernel launches.
- Fusion also encourages increased visibility of the global data flow.