Many PyIceberg users need to run the software in resource-limited environments like AWS Lambda.
They sometimes complain that pyarrow is too large. For instance, pyarrow-17.0.0 for linux x86_64 needs 40MiB to download and 100MiB on disk. The pyiceberg library can use fsspec as its file IO instead; however, since fsspec has no arrow support, users might still need pyarrow for arrow-related work.
So I propose using iceberg-rust as the pyiceberg file IO. This would let the two communities evolve together, and it should be fast and small.
It's a significant loss that our community cannot benefit from the existing iceberg-java implementations. We have to build many things from scratch. However, thanks to Rust's excellent interoperability, we can address this issue.
By incorporating parts of iceberg-rust into pyiceberg, we can evolve the community together and ultimately power pyiceberg with a Rust core.
For Fast:
I apologize for saying this without conducting any benchmarks at this time. But we can imagine a pyiceberg core without GIL and runtime cost. We can revisit this part after we've actually built it.
For Small:
Someone has built arro3, a python binding for arrow-rs. It only needs 1MiB on disk! And no numpy!
Pyiceberg features a dynamic file IO system that enables users to implement their own solutions.
def load_file_io(properties: Properties = EMPTY_DICT, location: Optional[str] = None) -> FileIO:
    # First look for the py-io-impl property to directly load the class
    if io_impl := properties.get(PY_IO_IMPL):
        if file_io := _import_file_io(io_impl, properties):
            logger.info("Loaded FileIO: %s", io_impl)
            return file_io
        else:
            raise ValueError(f"Could not initialize FileIO: {io_impl}")
So our first step could be to plug into this file IO system. At this stage, users can use iceberg-rust-fileio as an alternative to fsspec.
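To make the opt-in mechanism concrete, here is a minimal, standard-library-only sketch of how a dotted class path stored under `py-io-impl` could be resolved at runtime. It is not PyIceberg's actual code: `import_file_io`, `FakeRustFileIO`, and the `demo_rust_fileio` module are hypothetical names used only for illustration, and a real Rust-backed FileIO would come from a compiled extension instead of the throwaway module registered here.

```python
import importlib
import sys
import types
from typing import Dict, Optional

PY_IO_IMPL = "py-io-impl"  # property key mentioned in the load_file_io comment


class FakeRustFileIO:
    """Hypothetical stand-in for a FileIO class exported by a Rust extension."""

    def __init__(self, properties: Dict[str, str]) -> None:
        self.properties = properties


def import_file_io(io_impl: str, properties: Dict[str, str]) -> Optional[object]:
    """Resolve a dotted 'module.ClassName' path and instantiate the class.

    Returns None when the module or class cannot be found, so the caller
    can decide whether to raise or fall back to another implementation.
    """
    module_name, _, class_name = io_impl.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)(properties)
    except (ImportError, AttributeError, ValueError):
        return None


# Register a throwaway module so the sketch runs without third-party packages.
_demo = types.ModuleType("demo_rust_fileio")
_demo.FakeRustFileIO = FakeRustFileIO
sys.modules["demo_rust_fileio"] = _demo

# A user opts in purely through configuration; no pyiceberg code changes needed.
props = {PY_IO_IMPL: "demo_rust_fileio.FakeRustFileIO"}
file_io = import_file_io(props[PY_IO_IMPL], props)
print(type(file_io).__name__)
```

The point of the sketch is that the choice of implementation lives entirely in a property, which is why a Rust-backed FileIO can be offered as an optional extra without touching existing users.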
The next step is to make pyarrow optional too: we can provide project_table and the other APIs that pyiceberg needs. This may need some refactoring to allow users to provide their own py-arrow-impl. The details can be worked out later.
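One way a pluggable arrow backend could be selected is sketched below. Everything here is an assumption made for illustration: the `py-arrow-impl` property key only mirrors the name floated above, and `is_available` and `pick_arrow_impl` are hypothetical helpers, not existing pyiceberg APIs. The sketch only checks which candidate modules are importable; it does not load or wrap them.

```python
import importlib.util
from typing import Dict, Optional, Sequence

PY_ARROW_IMPL = "py-arrow-impl"  # hypothetical property key, not an existing setting


def is_available(module_name: str) -> bool:
    """Return True if the module can be found without actually importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names with a missing parent or modules without a spec.
        return False


def pick_arrow_impl(properties: Dict[str, str], candidates: Sequence[str]) -> Optional[str]:
    """Pick an arrow backend: an explicit property wins, else the first candidate found."""
    explicit = properties.get(PY_ARROW_IMPL)
    if explicit is not None:
        if not is_available(explicit):
            raise ValueError(f"Could not initialize arrow implementation: {explicit}")
        return explicit
    for name in candidates:
        if is_available(name):
            return name
    return None


# Prefer pyarrow when it is installed, otherwise try a smaller binding.
chosen = pick_arrow_impl({}, ("pyarrow", "arro3"))
print(chosen)
```

With this shape, environments that already ship pyarrow keep working unchanged, while size-constrained deployments could install only a lighter backend.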
pyo3 is a great lib that is widely used in the rust community to build python bindings. It allows us to build interoperable, zero-copy python bindings easily.
In rust we write:
use pyo3::prelude::*;

/// Formats the sum of two numbers as a string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust.
#[pymodule]
fn string_sum(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
In Python, we can call it:
>>> import string_sum
>>> string_sum.sum_as_string(5, 20)
'25'
With pyo3, we can export a python API without extra effort.
No, I'm not.
pyiceberg exists, and it works well. We should not break things that work.
In the future, pyiceberg might be powered by a Rust core, but we will ensure it's implemented without any breaking changes. As outlined in the plan section, we are adding new features to pyiceberg and offering them as optional for users to try, which lets us stabilize these additions gradually.
I expect pyiceberg to become rusty without any visible changes to users.