## Motivation
Many PyIceberg users need to run the software in resource-limited environments like [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html).
They sometimes complain that `pyarrow` is [too large](https://pypi.org/project/pyarrow/#files). For instance, `pyarrow-17.0.0` for `linux x86_64` takes 40MiB to download and 100MiB on disk.
`pyiceberg` can use `fsspec` for file IO instead; however, since `fsspec` does not support Arrow, users might still need `pyarrow` for Arrow-related work.
So I propose using iceberg-rust as a pyiceberg FileIO implementation, which can:
- Handle IO operations like read/write/delete/list.
- Convert data streams between Iceberg and Arrow.
## Benefits
- Combine community efforts.
It's a significant loss that our community cannot benefit from the existing iceberg-java implementations. We have to build many things from scratch. However, thanks to Rust's excellent interoperability, we can address this issue.
By incorporating parts of iceberg-rust into pyiceberg, we can evolve the community together and ultimately power pyiceberg with a Rust core.
- Fast yet small pyiceberg
For Fast:
I admit I'm saying this without having run any benchmarks yet. But we can imagine a pyiceberg core that runs without the GIL and without the Python runtime cost. We can revisit this part after we've actually built it.
For Small:
Someone has already built a Python binding for [`arrow-rs`](https://github.com/apache/arrow-rs): [arro3](https://github.com/kylebarron/arro3). It only needs 1MiB on disk! And no numpy!
## Plan
Pyiceberg features a dynamic file IO system that enables users to implement their own solutions.
```python
def load_file_io(properties: Properties = EMPTY_DICT, location: Optional[str] = None) -> FileIO:
    # First look for the py-io-impl property to directly load the class
    if io_impl := properties.get(PY_IO_IMPL):
        if file_io := _import_file_io(io_impl, properties):
            logger.info("Loaded FileIO: %s", io_impl)
            return file_io
        else:
            raise ValueError(f"Could not initialize FileIO: {io_impl}")
```
So our first step could be to plug into this FileIO system. At this stage, users can use `iceberg-rust-fileio` as an alternative to `fsspec`.
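To illustrate how this kind of dynamic loading works, here is a minimal, stdlib-only sketch of resolving a dotted class path passed through `py-io-impl`. It is not pyiceberg's actual `_import_file_io`: the `import_file_io` helper, the `iceberg_rust_fileio` module name, and the `RustBackedFileIO` class are hypothetical stand-ins, and the module is faked in `sys.modules` so the example runs without any Rust extension installed.

```python
import importlib
import sys
import types
from typing import Any, Dict, Optional

PY_IO_IMPL = "py-io-impl"  # property key referenced in the excerpt above


def import_file_io(io_impl: str, properties: Dict[str, str]) -> Optional[Any]:
    """Hypothetical helper: resolve 'module.ClassName' and instantiate it."""
    module_name, _, class_name = io_impl.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a full path (module.ClassName), got: {io_impl}")
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The implementation is not installed; let the caller decide what to do.
        return None
    return getattr(module, class_name)(properties)


class RustBackedFileIO:
    """Hypothetical stand-in for a FileIO class exported by a Rust extension."""

    def __init__(self, properties: Dict[str, str]) -> None:
        self.properties = properties


# Assumption for the demo only: register a fake module so the import succeeds.
fake_module = types.ModuleType("iceberg_rust_fileio")
fake_module.RustBackedFileIO = RustBackedFileIO
sys.modules["iceberg_rust_fileio"] = fake_module

props = {PY_IO_IMPL: "iceberg_rust_fileio.RustBackedFileIO"}
file_io = import_file_io(props[PY_IO_IMPL], props)
```

Under these assumptions, switching implementations would only require changing the `py-io-impl` property, which is why the FileIO layer looks like a low-risk place to start.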
The next step is to make `pyarrow` optional too: we can provide `project_table` and other APIs that `pyiceberg` needs. This may require some refactoring to allow users to provide their own `py-arrow-impl`. The details can be worked out later.
## Questions
### How does Python call Rust code?
[pyo3](https://pyo3.rs/) is a great library that is widely used in the Rust community to build Python bindings. It allows us to build interoperable, zero-copy Python bindings easily.
<details>
<summary>Here is a quick example</summary>
In Rust, we write:
```rust
use pyo3::prelude::*;

#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

#[pymodule]
fn string_sum(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
```
In Python, we can call it:
```py
>>> import string_sum
>>> string_sum.sum_as_string(5, 20)
'25'
```
</details>
With pyo3, we can export a Python API with little extra effort.
### Are you trying to rewrite pyiceberg in Rust?
No, I'm not.
pyiceberg exists, and it works well. We should not break things that work.
In the future, pyiceberg might be powered by a Rust core, but we will ensure it's implemented without any breaking changes. As outlined in the plan section, we are introducing new features to pyiceberg and offering them as optional for users to try, allowing us to gradually stabilize these additions.
I expect pyiceberg to become *rusty* without any visible changes to users.