# Codegen Inference
## FasterTransformer
What is FT? It is a bunch of fused CUDA kernels optimised for inference; in other words, transformer layers hand-written in C++/CUDA.
How do models work in FT? You take a trained model, replicate its architecture with the C++/FT layers, and convert its weights into the format those layers expect.
- You can also use FT kernels as ops inside PyTorch/TensorFlow, so the model runs within those frameworks.
- The current implementation matrix is here: https://github.com/NVIDIA/FasterTransformer#support-matrix
We can see CodeGen has not been implemented there, and GPT-J only has a Triton frontend.
### Triton Server
Triton is a way of deploying models: it sets up a server that handles model loading, request routing, etc., but you have to provide it with a runtime backend. Here "backend" means which kernels you are using: TensorFlow, PyTorch, TensorRT, or FasterTransformer.
- Triton is recommended to be run as a Docker container
- By default Triton does not support the FasterTransformer backend; we have to build it in ourselves
- Luckily Moyix maintains a Triton-with-FT Docker image: https://hub.docker.com/r/moyix/triton_with_ft
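For reference, this is roughly the model-repository layout Triton expects (the model name and file names here are illustrative, not the exact ones we used):
```
model_repository/
└── codegen-350M-gptj          # one directory per model (name is illustrative)
    ├── config.pbtxt           # backend: "fastertransformer", tensor shapes, etc.
    └── 1/                     # numbered version directory
        └── (FT weight files produced by the conversion step below)
```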
### How to do Codegen inference with FT
#### Make a FT model
- CodeGen is not available directly in FT, i.e. nobody has written a manual implementation of it in C++/FT layers.
- But its architecture is very similar to GPT-J, which is already implemented in C++/FT.
- Moyix used this to create a script that converts CodeGen models to the GPT-J architecture; the converted models are stored at https://huggingface.co/Moyix, i.e. all CG models converted to the GPT-J architecture.
- Now just convert the GPT-J torch checkpoint to a GPT-J FT model, with the Triton frontend configuration, using https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gptj/utils/huggingface_gptj_ckpt_convert.py (see the sketch after this list).
- Ta-da! The FT model is ready. Achieved this on 10 November or earlier.
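Roughly, the two conversion steps look like this (a sketch only; the checkpoint name, output paths, and flag names are assumptions, check each script's --help for the real arguments):
```
# 1. Grab one of Moyix's CodeGen checkpoints already converted to the GPT-J
#    architecture (repo name is illustrative).
git lfs install
git clone https://huggingface.co/moyix/codegen-350M-mono-gptj

# 2. Convert that HF GPT-J checkpoint into FasterTransformer format with the
#    script from the FT repo (flag names below are assumptions).
git clone https://github.com/NVIDIA/FasterTransformer
python FasterTransformer/examples/pytorch/gptj/utils/huggingface_gptj_ckpt_convert.py \
    --ckpt-dir codegen-350M-mono-gptj \
    --output-dir model_repository/codegen-350M-gptj/1 \
    --n-inference-gpus 1
```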
#### Serve the FT model (headache starts)
***If we lived in a timeline that could solve alignment, we would be lucky enough to have a Docker node too: deploy the Moyix image, just pass it the path to the FT models, and that's all. Alas!! Not a single Docker node.***
Easiest path:
`docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models moyix/triton_with_ft tritonserver --model-repository=/models`
- So what did we try after that? We converted this Docker image to a Singularity image:
```singularity build moyix_ft.sig docker://moyix/triton_with_ft:22.09```
- But Singularity doesn't have a --gpus flag; GPU access is an important factor when converting and running the models (see the sketch below for the Singularity equivalent).
- So we tried a lot of things, but could not deploy the Singularity container directly.
- Found out that Singularity could not read config.pbtxt from /fsx; this was the reason for the delay.
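What eventually has to run is something like this (a sketch; paths are illustrative): Singularity's GPU passthrough flag is --nv rather than --gpus, and the model repository gets bind-mounted from somewhere the container can actually read, i.e. not /fsx.
```
# GPU passthrough via --nv; bind-mount the model repo from local disk
# instead of /fsx (paths are illustrative)
singularity run --nv \
    --bind $PWD/model_repository:/models \
    moyix_ft.sig \
    tritonserver --model-repository=/models
```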
#### Write a client that interacts with FT server
Improved the client over the CodeGen proxy written by FauxPilot (see the request sketch below).
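For reference, this is the kind of request the client ends up sending to Triton (a sketch; the model name and the FT backend's tensor names/dtypes are assumptions based on the FauxPilot-style setup, the real ones live in config.pbtxt; the actual client also tokenises the prompt and decodes the returned output_ids back to text):
```
# is the server up?
curl -s localhost:8000/v2/health/ready

# minimal generate request against Triton's HTTP API (port 8000 by default)
curl -s -X POST localhost:8000/v2/models/codegen-350M-gptj/infer -d '{
  "inputs": [
    {"name": "input_ids",          "shape": [1, 4], "datatype": "UINT32", "data": [1, 2, 3, 4]},
    {"name": "input_lengths",      "shape": [1, 1], "datatype": "UINT32", "data": [4]},
    {"name": "request_output_len", "shape": [1, 1], "datatype": "UINT32", "data": [32]}
  ]
}'
```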
## ONNX inference
### Using transformers.onnx OPSET 12
### Using optimum
### Using transformers.onnx OPSET 13
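For reference, the corresponding exports look roughly like this (a sketch; the model name, CodeGen's opset support, and the exact CLI flags are assumptions, check the transformers/optimum docs):
```
# transformers.onnx export; opset is picked with --opset (12 or 13)
python -m transformers.onnx --model=Salesforce/codegen-350M-mono \
    --feature=causal-lm --opset 13 onnx_out/

# optimum export of the same model
optimum-cli export onnx --model Salesforce/codegen-350M-mono onnx_optimum_out/
```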