# Codegen Inference

## FasterTransformer

What is FT? It's a bunch of fused CUDA kernels optimised for inference; in other words, transformer layers written by hand in C++/CUDA.

How do models work in FT? You take a trained model, replicate its architecture with the pure C++ layers, and convert its weights into the format those layers expect.

- You can also use FT kernels as ops inside Torch/TensorFlow, to run the model from those frameworks.
- The current implementation matrix is here: https://github.com/NVIDIA/FasterTransformer#support-matrix

We can see Codegen hasn't been implemented there, and GPT-J only has a Triton frontend.

### Triton Server

Triton is a way of deploying things: it sets up a server that handles model loading, request serving, etc., but you have to provide the runtime backend, where "runtime" means which kernels you are using: TensorFlow, Torch, TensorRT, FasterTransformer.

- Triton is recommended to be run as a Docker container.
- By default Triton doesn't support the FasterTransformer backend; we have to build it manually.
- Luckily Moyix maintains a Triton-with-FT Docker image: https://hub.docker.com/r/moyix/triton_with_ft

### How to do Codegen inference with FT

#### Make an FT model

- Codegen is not available directly in FT, i.e. nobody has written a manual implementation of it in C++/FT layers.
- But its architecture is very similar to GPT-J, which is already implemented in C++/FT.
- Moyix used this to create a script that converts Codegen models to the GPT-J architecture; the converted models are stored at https://huggingface.co/Moyix, i.e. all CG models converted to the GPT-J architecture.
- Now just convert the GPT-J Torch checkpoints to GPT-J FT models, with the Triton frontend configuration: https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gptj/utils/huggingface_gptj_ckpt_convert.py
- Ta-da! The FT model is ready. (A sketch of this step is at the end of these notes.)

Achieved this on 10 November or earlier.

#### Serve the FT model (headache starts)

***If we lived in a timeline that could have solved alignment, we would be lucky enough to have a Docker node too: deploy the Moyix image and just pass it the path to the FT models, that's all. Alas! Not a single Docker node.***

Easiest path: `docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models moyix/triton_with_ft:22.09 tritonserver --model-repository=/models`

- So what did we try after that? Converted this Docker image to Singularity:

  ```
  singularity build moyix_ft.sig docker://moyix/triton_with_ft:22.09
  ```

- But Singularity doesn't have a `--gpus` flag (it uses `--nv` for NVIDIA GPU support); this is an important factor while converting the models.
- We tried a lot of things, but could not deploy the Singularity container directly.
- Found out that Singularity can't read `config.pbtxt` from /fsx; this was the reason for the delay.

#### Write a client that interacts with FT server

Improved on the Codegen proxy client written by fauxpilot. (A minimal Python sketch of such a client is at the end of these notes.)

## ONNX inference

### Using transformers.onnx OPSET 12

### Using optimum

### Using transformers.onnx OPSET 13
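Sketch referenced from "Make an FT model" above: pull one of the Codegen checkpoints already converted to the GPT-J architecture, then run the FT GPT-J converter on it. The repo id and the converter flags shown in the comment are assumptions, not verified; check https://huggingface.co/Moyix and the script's `--help` for the real values.

```python
# Sketch only: fetch a GPT-J-format Codegen checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

# Assumed repo name; the actual ids live under https://huggingface.co/Moyix.
ckpt_dir = snapshot_download(repo_id="moyix/codegen-350M-mono-gptj")
print(f"GPT-J-format Codegen weights are in {ckpt_dir}")

# Next step, run from a FasterTransformer checkout (flag names are assumptions):
#   python examples/pytorch/gptj/utils/huggingface_gptj_ckpt_convert.py \
#       --ckpt-dir <ckpt_dir> --output-dir <ft_models>/gptj --n-inference-gpus 1
```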
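Sketch referenced from "Write a client that interacts with FT server" above: a minimal Triton HTTP client for the FT backend. The model name `fastertransformer` and the tensor names (`input_ids`, `input_lengths`, `request_output_len`, `output_ids`) follow the fauxpilot/GPT-J example and are assumptions here; check them against the deployed `config.pbtxt`.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Token ids would normally come from the Codegen tokenizer; hard-coded here.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[32]], dtype=np.uint32)

def tensor(name, data):
    # Wrap a numpy array as a Triton InferInput.
    t = httpclient.InferInput(name, list(data.shape), "UINT32")
    t.set_data_from_numpy(data)
    return t

inputs = [
    tensor("input_ids", input_ids),
    tensor("input_lengths", input_lengths),
    tensor("request_output_len", request_output_len),
]

result = client.infer(model_name="fastertransformer", inputs=inputs)
print(result.as_numpy("output_ids"))  # generated token ids, decode with the tokenizer
```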