Motivation

Many PyIceberg users need to run the software in resource-limited environments like AWS Lambda.

They sometimes complain that pyarrow is too large. For instance, pyarrow-17.0.0 for Linux x86_64 is a 40MiB download and takes 100MiB on disk.

PyIceberg can use fsspec for file IO instead; however, since fsspec has no Arrow support, users may still need pyarrow to handle Arrow-related work.

So I propose using iceberg-rust as the pyiceberg file IO, which can:

  • Handle IO operations such as read/write/delete/list.
  • Convert data streams between Iceberg and Arrow.

Benefits

  • Combine community efforts.

It's a significant loss that our community cannot benefit from the existing iceberg-java implementations. We have to build many things from scratch. However, thanks to Rust's excellent interoperability, we can address this issue.

By incorporating parts of iceberg-rust into pyiceberg, we can evolve the community together and ultimately power pyiceberg with a Rust core.

  • Fast yet small pyiceberg

For Fast:

I apologize for claiming this without having run any benchmarks yet. But we can imagine a pyiceberg core without the GIL and without Python runtime cost. We can revisit this part after we've actually built it.

For Small:

Someone has built arro3, a Python binding for arrow-rs. It only needs 1MiB on disk! And no numpy!

Plan

Pyiceberg features a dynamic file IO system that enables users to implement their own solutions.

def load_file_io(properties: Properties = EMPTY_DICT, location: Optional[str] = None) -> FileIO:
    # First look for the py-io-impl property to directly load the class
    if io_impl := properties.get(PY_IO_IMPL):
        if file_io := _import_file_io(io_impl, properties):
            logger.info("Loaded FileIO: %s", io_impl)
            return file_io
        else:
            raise ValueError(f"Could not initialize FileIO: {io_impl}")

So our first step could be to plug into this file IO system. At this stage, users can use iceberg-rust-fileio as an alternative to fsspec.
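To make the lookup concrete, here is a minimal sketch of how a dotted class path stored under py-io-impl could be resolved and instantiated. It is not the pyiceberg implementation: `import_file_io`, `DemoRustFileIO`, and the module name `demo_rust_fileio` are hypothetical names invented for illustration, and the stand-in class is registered in `sys.modules` only so the sketch runs without installing anything.

```python
import importlib
import sys
import types
from typing import Any, Dict, Optional

PY_IO_IMPL = "py-io-impl"  # property key, as used in the excerpt above


class DemoRustFileIO:
    """Hypothetical stand-in for a Rust-backed FileIO; not a real class."""

    def __init__(self, properties: Dict[str, str]) -> None:
        self.properties = properties


# Register the stand-in under a made-up module name so the lookup below
# works without any extra package being installed.
_demo_module = types.ModuleType("demo_rust_fileio")
setattr(_demo_module, "DemoRustFileIO", DemoRustFileIO)
sys.modules["demo_rust_fileio"] = _demo_module


def import_file_io(io_impl: str, properties: Dict[str, str]) -> Optional[Any]:
    """Resolve 'module.ClassName' and instantiate it; return None on failure."""
    module_name, _, class_name = io_impl.rpartition(".")
    if not module_name:
        return None  # no module part in the path
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # module is not installed
    cls = getattr(module, class_name, None)
    return cls(properties) if cls is not None else None


properties = {PY_IO_IMPL: "demo_rust_fileio.DemoRustFileIO"}
file_io = import_file_io(properties[PY_IO_IMPL], properties)
print(type(file_io).__name__)
```

Under this scheme, switching to a Rust-backed implementation would only require setting one property to the class path of the new FileIO; nothing else in user code has to change, and users who do not set it keep the existing behavior.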

The next step is to make pyarrow optional too: we can provide project_table and the other APIs that pyiceberg needs. This may take some refactoring to allow users to provide their own py-arrow-impl. The details can be worked out later.

Questions

How does Python call Rust code?

pyo3 is a great library that is widely used in the Rust community to build Python bindings. It allows us to build interoperable, zero-copy Python bindings easily.

Here is a quick example.

In Rust, we write:

use pyo3::prelude::*;

#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

#[pymodule]
fn string_sum(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}

In Python, we can call it:

>>> import string_sum
>>> string_sum.sum_as_string(5, 20)
'25'

With pyo3, we can export a Python API without much extra effort.

Are you trying to rewrite pyiceberg in Rust?

No, I'm not.

pyiceberg exists, and it works well. We should not break things that work.

In the future, pyiceberg might be powered by a Rust core, but we will ensure it's implemented without any breaking changes. As outlined in the plan section, we are introducing new features to pyiceberg and offering them as opt-in for users to try, allowing us to gradually stabilize these additions.

I expect pyiceberg to become rusty without any visible changes for users.