# ONNX, GGUF, MLIR
## ONNX Specification Issues
- https://github.com/onnx/onnx/issues/3651
- My fight-or-flight moment: dynamic attributes (https://github.com/onnx/onnx/issues/3420), a discussion that ended with a lot of unanswered questions. ONNX attributes are fixed when the graph is built; only inputs can vary at runtime (see the sketch after this list).
- Do not use Safetensors: https://github.com/huggingface/safetensors
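To make the attribute limitation concrete, here is a minimal sketch using the `onnx.helper` API. `Squeeze` shows the distinction: up to opset 12 its `axes` was a static attribute baked into the graph, and opset 13 had to promote it to a regular input before it could be computed at runtime.

```python
from onnx import helper

# Opset <= 12: `axes` is an attribute, fixed when the graph is built.
# There is no way to change it at runtime -- the limitation discussed
# in onnx/onnx#3420 for attributes in general.
static_squeeze = helper.make_node(
    "Squeeze", inputs=["x"], outputs=["y"], axes=[0]
)

# Opset 13 worked around this for Squeeze specifically by making
# `axes` a regular input, which other nodes can produce at runtime.
dynamic_squeeze = helper.make_node(
    "Squeeze", inputs=["x", "axes"], outputs=["y"]
)
```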
## GGUF Specification Issues for wonnx
- WGSL (WebGPU's shader language) currently has no 2-, 4-, or 8-bit integer types; its only numeric scalar types are 32-bit integers, float16, and float32: https://www.w3.org/TR/WGSL/#integer-types. So we will have to dequantize all those operations to float32 and lose a lot of performance.
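Concretely, the quantized weights have to be dequantized before (or while) they reach the GPU. Below is a minimal CPU-side sketch assuming a simplified symmetric 4-bit scheme with a single scale per block; real GGUF `Q4_0` blocks are laid out differently, so this is illustrative only.

```python
import numpy as np

def dequantize_q4_block(packed: np.ndarray, scale: np.float32) -> np.ndarray:
    """Unpack two 4-bit weights per byte and rescale to float32.

    `packed` is a uint8 array; each nibble holds an unsigned 4-bit
    value with an implicit zero-point of 8 (symmetric quantization).
    """
    lo = (packed & 0x0F).astype(np.float32)
    hi = (packed >> 4).astype(np.float32)
    out = np.empty(packed.size * 2, dtype=np.float32)
    out[0::2] = lo
    out[1::2] = hi
    # float32 buffer, ready for a WGSL f32 compute pipeline
    return (out - 8.0) * scale

# Example: one 32-weight block packed into 16 bytes.
block = np.random.randint(0, 256, size=16, dtype=np.uint8)
weights_f32 = dequantize_q4_block(block, np.float32(0.02))
```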
## Candle quantization issues
- See: https://github.com/huggingface/candle/issues/359
## MLIR
- Good for writing compilers: https://mlir.llvm.org/docs/Tutorials/Toy/Ch-3/
- Has SIMD (vector) and GPU dialects
- Modular claims matmul performance beyond C/Rust: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication
## Appendix
### Safetensors
| Format | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16/Fp8 |
| ----------------------- | --- | --- | --- | --- | --- | --- | --- |
| pickle (PyTorch) | ✗ | ✗ | ✗ | 🗸 | ✗ | 🗸 | 🗸 |
| H5 (Tensorflow) | 🗸 | ✗ | 🗸 | 🗸 | ~ | ~ | ✗ |
| SavedModel (Tensorflow) | 🗸 | ✗ | ✗ | 🗸 | 🗸 | ✗ | 🗸 |
| MsgPack (flax) | 🗸 | 🗸 | ✗ | 🗸 | ✗ | ✗ | 🗸 |
| Protobuf (ONNX) | 🗸 | ✗ | ✗ | ✗ | ✗ | ✗ | 🗸 |
| Cap'n'Proto | 🗸 | 🗸 | ~ | 🗸 | 🗸 | ~ | ✗ |
| Arrow | ? | ? | ? | ? | ? | ? | ✗ |
| Numpy (npy,npz) | 🗸 | ? | ? | ✗ | 🗸 | ✗ | ✗ |
| pdparams (Paddle) | ✗ | ✗ | ✗ | 🗸 | ✗ | 🗸 | 🗸 |
| SafeTensors | 🗸 | 🗸 | 🗸 | 🗸 | 🗸 | ✗ | 🗸 |
- Safe: Can I use a file randomly downloaded and expect not to run arbitrary code?
- Zero-copy: Does reading the file require more memory than the original file?
- Lazy loading: Can I inspect the file without loading everything? And load only some tensors in it without scanning the whole file (distributed setting)? (See the sketch after this list.)
- No file size limit: Is there a limit to the file size?
- Layout control: Lazy loading is not necessarily enough: if the information about tensors is spread out in the file, then even if it is lazily accessible you might have to read most of the file to get the tensors you need (incurring many disk -> RAM copies). Controlling the layout to keep fast access to single tensors is important.
- Flexibility: Can I save custom code in the format and use it later with zero extra code? (~ means we can store more than pure tensors, but no custom code)
- Bfloat16/Fp8: Does the format support native bfloat16/fp8 (meaning no weird workarounds are necessary)? This is becoming increasingly important in the ML world.
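To illustrate the lazy-loading and layout-control rows, a minimal sketch with the safetensors Python API (file and tensor names are made up):

```python
from safetensors import safe_open

# Only the small JSON header at the start of the file is parsed up
# front; tensor data is memory-mapped and read on demand.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    print(f.keys())                   # inspect without loading everything
    w = f.get_tensor("embed.weight")  # reads just this tensor's bytes
```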
### Main objections
- Pickle: Unsafe, runs arbitrary code
- H5: Apparently now discouraged for TF/Keras, though otherwise it actually seems like a great fit. Has some classic use-after-free issues: <https://www.cvedetails.com/vulnerability-list/vendor_id-15991/product_id-35054/Hdfgroup-Hdf5.html>. On a very different level than pickle security-wise. Also ~210k lines of code vs ~400 lines for safetensors currently.
- SavedModel: Tensorflow specific (it contains TF graph information).
- MsgPack: No layout control to enable lazy loading (important for loading specific parts in a distributed setting)
- Protobuf: Hard 2 GB max file size limit
- Cap'n'Proto: Float16 support is not present [link](https://capnproto.org/language.html#built-in-types), so a manual wrapper over a byte buffer would be necessary. Layout control seems possible but not trivial, as buffers have limitations [link](https://stackoverflow.com/questions/48458839/capnproto-maximum-filesize).
- Numpy (npz): No `bfloat16` support. Vulnerable to zip bombs (DOS). Not zero-copy.
- Arrow: No `bfloat16` support. Seems to require decoding [link](https://arrow.apache.org/docs/python/parquet.html#reading-parquet-and-memory-mapping).