# IO interface in Julia

The current abstract type IO has no documented interface in Julia. This causes problems:

1. It's difficult and brittle to create your own subtype of IO, because you don't know what methods to implement.
2. Both current and new IO types are likely to hit slow fallback methods, because the current IO-related functions don't share any fast abstractions.

## Background reading

Please add more material here that might be relevant

* Issue: [readbytes! doesn't throw on a non-readable stream](https://github.com/JuliaLang/julia/issues/50584)
* Issue: [readbytes!: support the all keyword for all methods](https://github.com/JuliaLang/julia/issues/40793)
* Issue: [unsafe_read(io, pointer, n) should return the number of read bytes](https://github.com/JuliaLang/julia/issues/16656)
* Issue: [review of IO blocking behaviour](https://github.com/JuliaLang/julia/issues/24526)
* PR: [added docs for IO interface](https://github.com/JuliaLang/julia/pull/41291)
* Discourse post: [What is the IO interface?](https://discourse.julialang.org/t/what-is-the-io-interface/107350)

## About this proposal

One solution could be to simply thoroughly document the existing methods. Better documentation would certainly help. However, since the status quo is characterized by a _lack of interface_, documenting an existing ad-hoc interface is not likely to result in a consistent and cohesive API.

A more promising approach is to rethink IO from the ground up: Build a few low-level abstractions, and implement the existing functions in terms of these abstractions.

Therefore, this proposal is a reimagining of IO in Julia from the ground up. It adds several new functions, one trait, one new array type, and deprecates one existing function. I believe it's better to start from an idealistic point and then scale back the ambitions if the changes are considered too comprehensive.

Note that no changes proposed here should be breaking - everything can be put in a minor release.

## How does Rust do it?

When looking for good low-level abstractions, Rust is always worth a look. Let's look at the foundational IO abstractions in Rust. Rust manages to get very far with only 5 basic methods:

* Trait [Read](https://doc.rust-lang.org/std/io/trait.Read.html)
  - `read(&mut self, buf: &mut [u8]) -> io::Result<usize>`
* Trait [Write](https://doc.rust-lang.org/std/io/trait.Write.html)
  - `write(&mut self, buf: &[u8]) -> io::Result<usize>`
  - `flush(&mut self) -> io::Result<()>`
* Trait [BufRead](https://doc.rust-lang.org/std/io/trait.BufRead.html)
  - `fill_buf(&mut self) -> io::Result<&[u8]>`
  - `consume(&mut self, amt: usize)`

If we build on these abstractions, we need to re-fit them for Julia, so let's make the following changes:

1. We don't organize the methods by these three traits. Instead, we use one trait to enable dispatch between buffered and unbuffered IO. The other methods are ducktyped, and simply implemented directly without any Julia-level traits.
2. The slice (`&[u8]`) type does not exist in Julia. We mimic it with a new `AbstractVector{UInt8}` type that consists of a `MemoryRef{UInt8}` and a length. Let's call this a `MemoryView{UInt8}`.
3. There is no way in Rust to get the buffer of a `BufRead` without also filling it. This is annoying. We separate getting and filling the buffer into two distinct methods.

## Proposed core Julia IO interface

This proposal adds:

* New functions: `readinto!`, `getbuffer`, `fillbuffer`, `consume` and `readall!`
* The trait: `IOBuffering`

### Example implementation

An example implementation can be found here: https://github.com/jakobnissen/IO-playground (NB: work in progress!)

### Core IO interface

* Writers should implement `unsafe_write` and `flush`
* Buffered readers should implement: `getbuffer`, `fillbuffer`, `consume`. Readers are assumed to be buffered unless they explicitly opt out. Most readers will be buffered.
* Unbuffered readers should implement `Base.IOBuffering` and `readinto!`

### New functions

* `Base.IOBuffering(::Type) -> Union{Base.IsBuffered, Base.NotBuffered}`

  Trait to signal if a reader IO is buffered. Defaults to `IsBuffered()`. New unbuffered IO types `T` should signal they are unbuffered by implementing `Base.IOBuffering(::Type{T}) = Base.NotBuffered()`. Public but unexported.

* `getbuffer(::IO) -> MemoryView{UInt8}`

  Get a view into the current buffer containing all the data that is available for consumers to read. This data should not be mutated by consumers.

* `readinto!(io::IO, mem::MemoryView{UInt8}) -> Int`

  Read bytes from `io` into the start of `mem` and return the number of bytes read. This function should only read zero bytes if `mem` is empty, or if `io` is EOF. This has default implementations: For buffered readers, it simply copies from the buffer. For unbuffered readers, it ultimately calls `unsafe_read`.

* `consume(::IO, n::UInt) -> Nothing`

  Remove the first `n` bytes of buffered data from the IO. Bytes that have been consumed will not be returned from future calls to `getbuffer`. This function will error if `n` is larger than the number of bytes currently in the buffer.

* `fillbuffer(io::IO) -> Int`

  Fill data into `io`'s buffer from the data source that `io` wraps, returning the number of bytes copied. This function blocks until at least 1 byte has been copied, unless `io` is prevented from filling data into its buffer, e.g. because it's EOF, it's closed for reading, or because the type does not support filling its buffer. In that case it should return zero. Unless an IO is being concurrently mutated, the following should always hold:

  ```julia
  old_buffer = copy(getbuffer(io))
  n = fillbuffer(io)
  @assert length(getbuffer(io)) == length(old_buffer) + n
  @assert getbuffer(io)[1:length(old_buffer)] == old_buffer
  ```

  This function might error if the buffer is already full and cannot be grown, depending on the concrete IO type.

* `readall!(io::IO, mem::MemoryView{UInt8}) -> Int`

  Read bytes from `io` until `mem` is full or `io` is EOF. Returns the number of bytes read. This has a default implementation that calls `readinto!` in a loop.

* `read(::Type{T}, mem::MemoryView{UInt8})`

  Load a `T` from memory `mem`. `mem` can be assumed to have length `sizeof(T)`. Manually calling this function with a `mem` shorter than `sizeof(T)` is illegal. Implementing this method will allow you to read `T` from all IOs.

### Optional methods

These functions should have no fallback definition:

* `close`: Only for IO objects that can be closed. Must also implement `isopen`
* `closewrite(::IO)`: Only for when the write half of a stream can be closed independently of the read half.
* `isopen`: Must also implement `close`
* `reseteof`
* `mark`, `ismarked`, `unmark`: Must all be implemented together
* `reset`: Requires `mark` and `seek`
* `isreadable`
* `iswritable`
* `seek` (requires `position`. Implementing this gives: `seekstart`)
* `position` (implementing this and `seek` gives `skip`)
* `seekend` (requires `seek`)
* `lock` and `unlock`: Must be implemented together
* `bytesavailable`: Has default impl for buffered readers, may be implemented for unbuffered ones
* `readavailable`: Provided for buffered readers, or if `bytesavailable` is implemented
* `eof`: Has default impl for buffered readers, may be implemented for unbuffered ones
* `unsafe_read`: Has default impl for buffered readers, may be implemented for unbuffered ones

### Derived methods

The following methods are defined in terms of the core API, and so new subtypes of IO get them for free:

* `readbytes!`
* `read(::IO, T)`
* `read(::IO)`
* `read(::IO, String)`
* `read!`
* `readeach`
* `readall!` (new function)
* `readinto!` (buffered readers only, new function)
* `peek` (buffered readers only)
* `unsafe_read` (buffered readers only)
* `eof` (buffered readers only)
* `bytesavailable` (buffered readers only)
* `readavailable` (buffered readers only)
* `copyline` (buffered readers only)
* `readline` (buffered readers only)
* `readlines` (buffered readers only)
* `eachline` (buffered readers only)
* `readuntil` (buffered readers only)
* `copyuntil` (buffered readers only)

### Future changes

All types in Julia that are backed by contiguous memory are ultimately backed by `Memory`, except:

* `String` (and `SubString{String}`)
* `Symbol`

The existence of these types makes it awkward to abstract _all possible slices_ of Julia-owned memory that can be written to an IO as simply `MemoryView`. Therefore, the core writing API is defined in terms of `unsafe_write` operating on raw pointers.

However, it's not great that the core API is unsafe and pointer based - that is, that we require users to use pointers to implement a writeable IO. If the above types become `Memory`-backed in the future, we can change the required API in a backwards compatible manner, requiring `write(::IO, ::MemoryView{UInt8})` instead of `unsafe_write`. This is backwards compatible, because this `write` method has a default implementation based on `unsafe_write`.

## API for readable / writeable types

If you create a new type T and want to be able to read/write it to arbitrary IO objects:

* Read: Implement `sizeof` and `unsafe_read(::Type{T}, mem::MemoryView{UInt8})` (a new method)
* Write: Manually implement this yourself, by copying the data into a Memory and then writing the memory to the IO.

## Discussion:

### `readinto!` seems awfully similar to `readbytes!`. Do we need it?

We don't _need_ it, in the same sense that we don't _need_ a good API at all. But we _ought to want_ a good API. See the section on why `readbytes!` is bad.

### What's this `MemoryView` type?

We need a way to get an efficient view into the buffer of an `IO`. We could return a `SubArray{UInt8, 1, Memory{UInt8}, Tuple{UnitRange{Int64}}, true}` instead, but in my experience it can be tricky to get optimal performance out of subarrays in low-level code.
This is because `SubArray` abstracts over a lot of different array and index types (just look at the type! It's huge!), which means consumers can't assume much about the output type.

If you don't like my proposal for a memory view type, we can just say the API returns a `Tuple{MemoryRef{UInt8}, Int}` or even `Tuple{Memory{UInt8}, Int, Int}` instead - but I do think it's much nicer if this "memory with start and length" is an actual `AbstractVector{UInt8}`.

Also, practically speaking, if we don't use memory views, much of the code will end up looking like `foo(io::IO, v::Memory{UInt8}, start::Int, stop::Int)` instead of `foo(io::IO, v::MemoryView{UInt8})`, which is less convenient and requires unnecessary bounds checking.

### Why are the default implementations of `readinto!` and similar methods written in terms of `MemoryView`? Most users will probably read into a vector or such. Wouldn't that be a better type to use in the Base API?

A Vector can be trivially converted to a memory view (as can `Memory`, and views of the former). Therefore, by implementing `readinto!` for memory views, we get an implementation for all other memory-backed types for free.

### Do we need the `IOBuffering` trait? Why not ducktype?

Yeah, maybe we don't _need_ it as such. But after having tried out a toy implementation of this API, it's much nicer when one can dispatch on whether a given IO object provides buffering. Buffered IOs are generally _much_ easier to work with.

### Should we have a fallible API?

The current API generally handles problems by throwing errors. This is idiomatic in Julia, and is easier to use. However, it's easier to start from a fallible API and build an infallible (error-throwing) API on top of it, rather than going the other way. I have no strong opinions here, but perhaps other people do.

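To make the `IOBuffering` and `readinto!` questions above more concrete, here is a minimal sketch of how trait dispatch could tie the pieces together. It is not the proposal's implementation: `ToyReader` and `ToyUnbuffered` are hypothetical types, the trait and functions are defined locally rather than in `Base`, and a `view` of a `Vector{UInt8}` stands in for the proposed `MemoryView{UInt8}`, which does not exist yet.

```julia
# Local stand-ins for the proposed trait; none of these names exist in Base.
abstract type IOBuffering end
struct IsBuffered <: IOBuffering end
struct NotBuffered <: IOBuffering end
IOBuffering(::Type{<:IO}) = IsBuffered()   # proposed default: buffered

# Hypothetical buffered reader with a small fixed-size buffer.
mutable struct ToyReader <: IO
    source::Vector{UInt8}  # bytes not yet pulled into the buffer
    buffer::Vector{UInt8}  # internal buffer
    start::Int             # index of first unconsumed byte
    stop::Int              # index of last buffered byte
end
ToyReader(data::Vector{UInt8}, bufsize::Int=4) =
    ToyReader(copy(data), zeros(UInt8, bufsize), 1, 0)

# Assumption: a SubArray stands in for MemoryView{UInt8}.
getbuffer(io::ToyReader) = view(io.buffer, io.start:io.stop)

function consume(io::ToyReader, n::Integer)
    0 <= n <= io.stop - io.start + 1 ||
        throw(ArgumentError("cannot consume more bytes than are buffered"))
    io.start += n
    return nothing
end

function fillbuffer(io::ToyReader)
    # Move unconsumed bytes to the front, then append as much as fits.
    nbuffered = io.stop - io.start + 1
    copyto!(io.buffer, 1, io.buffer, io.start, nbuffered)
    io.start, io.stop = 1, nbuffered
    n = min(length(io.buffer) - nbuffered, length(io.source))
    copyto!(io.buffer, nbuffered + 1, io.source, 1, n)
    deleteat!(io.source, 1:n)
    io.stop += n
    return n   # zero at EOF (or, in this toy, when the buffer is full)
end

# Generic entry point: dispatch on the trait.
readinto!(io::IO, dst::AbstractVector{UInt8}) =
    readinto!(IOBuffering(typeof(io)), io, dst)

# Default for buffered readers: copy out of the buffer, filling it if empty.
function readinto!(::IsBuffered, io::IO, dst::AbstractVector{UInt8})
    isempty(dst) && return 0
    buf = getbuffer(io)
    if isempty(buf)
        fillbuffer(io) == 0 && return 0
        buf = getbuffer(io)
    end
    n = min(length(buf), length(dst))
    copyto!(dst, firstindex(dst), buf, firstindex(buf), n)
    consume(io, n)
    return n
end

# Hypothetical unbuffered reader: opts out and implements readinto! itself.
mutable struct ToyUnbuffered <: IO
    source::Vector{UInt8}
end
IOBuffering(::Type{ToyUnbuffered}) = NotBuffered()
function readinto!(io::ToyUnbuffered, dst::AbstractVector{UInt8})
    n = min(length(dst), length(io.source))
    copyto!(dst, firstindex(dst), io.source, 1, n)
    deleteat!(io.source, 1:n)
    return n
end

# Derived method: works for both kinds of reader.
function readall!(io::IO, dst::AbstractVector{UInt8})
    nread = 0
    while nread < length(dst)
        n = readinto!(io, view(dst, firstindex(dst)+nread:lastindex(dst)))
        n == 0 && break
        nread += n
    end
    return nread
end

io = ToyReader(Vector{UInt8}("hello world"))
dst = zeros(UInt8, 8)
nread = readall!(io, dst)   # loops over several refills of the 4-byte buffer
println(nread, " bytes read: ", String(dst[1:nread]))
```

In this sketch the derived `readall!` is written once against `readinto!`, and the buffered default never touches a pointer. Whether the real API should look like this is exactly what the discussion above leaves open.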
## Misc

### Thoughts

#### `readbytes!` is bad and should not be used to implement other methods

- It promises to resize the buffer, which is not generally possible for most buffer types, including the most obvious buffer type: `Memory`. Including the logic to conditionally resize the buffer also significantly complicates the implementation of this basic function, even though it's unnecessary in most cases.
- The `nb` keyword is superfluous: If the user wants to read fewer bytes than the buffer length, simply pass in a view of the buffer. If `nb` is larger than the buffer length, you can pass in a larger buffer, resizing it yourself.
- The `all` kwarg - which is true by default, but ONLY for `IOStream` - significantly changes the semantics of the function. This means that we essentially have the `IOStream` method of `readbytes!` behave very differently from the generic method.

But what if I want to:

- Read into an array, only doing one read call? Use `readinto!`
- Read N bytes into an array until EOF or the array is filled? Use `readall!`
- Read N bytes into an array, where N is shorter than the array length? Use `readall!` or `readinto!` with a view of the array
- Read until EOF into an array, growing the array to fit? Use `read`. We could have a dedicated function for this that allows you to reuse a `Vector` - but growing will likely cause allocations anyway so I see little point.

#### The `mark` functions

It's unfortunate that we are stuck with the `mark` family of functions. The implementations of these are generic over `IO`, and rely on reading and writing fields of the IO objects which may not exist. They also rely on other undocumented behaviour like the field being a signed integer, and that the position of an IO is never negative.

#### The `all` kwarg used in some IO methods

Some methods specific to the `IOStream` type take the `all` keyword (`read` and `readbytes!`).
This is bad API because setting the `all` keyword completely changes the semantics. It's sort of like if `append!` was instead written as `push!(;all=true)`. We should preserve the functionality of these keywords but not add them to any more types.

### Misc questions

* Should we have a `growbuffer` function that causes the IO to expand its buffer? I suspect most IO types would not support this anyway - perhaps users who need at least N bytes should bring their own growable buffer. On the other hand, having this function is very useful, almost essential, for being able to do zero-copy IO work.
* Should we document that `seek` is zero-indexed by default, such that we can implement the generic `seekstart(io) = seek(io, 0)`? Or is this a wrong assumption for certain IO types?

## Migration plan

How do we make these changes non-breaking for all existing user-defined IO types? The core problem is that the old functions are defined in terms of each other in an undocumented, unsystematic way. Changing the implementation of any of them to use the new API risks breakage, because users may rely on that function working without having defined the new API.

[this part is a WIP]

* Perhaps in all old methods, use `hasmethod` to check for the presence of the new API, then only use it if it's implemented?

### Clarification of existing documentation

* `unsafe_read` and `unsafe_write` should return the number of bytes read/written. The documentation should state that the caller only needs to guarantee that the pointer is valid and points to at least `nbytes` bytes. It is up to the implementer to make sure this function doesn't access OOB data in the IO itself.
* `close` and `isopen`: Implementing one of these also requires implementing the other, and `eof` if the stream is readable. Also, `close` will currently `flush`: This means it can't be used for read-only IOs. We need to figure out what to do here - perhaps call `flush` only if this method is implemented (using a `hasmethod` check).
* `closewrite`: Must implement `iswritable` and `eof`
* `reseteof`: Must implement `eof`
* `mark`, `ismarked`, `unmark`: Must all be implemented together. Requires `position`
* `reset`: Requires `mark` and `seek`
* `isreadable`: The current default implementation for this is incorrect, as it falls back to `isopen`. However, `isopen` explicitly says an IO can be closed and still contain readable data. Recommend changing default implementation to `isopen(io) || !eof(io)`.
* `iswritable`: Current default implementation is incorrect, since a writer might implement `closewrite`. Link to `closewrite` from docs.
* `seek`, `seekstart`, `position` and `seekend`: Link to each other. `seekstart` and `seekend` should require `position`. Should `seek` also, perhaps?
* `bytesavailable` and `readavailable`: Link to each other.

## Async and cancellation

All user-facing APIs are inherently blocking, and should be safe to call in a multi-tasking setting, without an expectation that there is a certain number of threads available. In order for blocking I/O to not block an entire thread or even the whole scheduler, all blocking operations must be implemented on top of async, scheduler-friendly primitives (`yield()`, `wait()`, etc.).

### Async

- We can provide a helper for the poll-and-yield pattern (`yield()`-based)
- We can provide a helper for libuv-integrated APIs (`wait()`-based)
- If the underlying API will block the thread for an arbitrary length of time (libc `read` et al.), then operations should be done on a dedicated thread
  - We can provide a helper that allocates a thread, puts the internal blocking operation on that thread, and waits in a scheduler-friendly way for that thread to finish. This can be `yield()`-based or `wait()`-based; benchmarking should guide this, or both can be provided.

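The helpers listed above could be sketched roughly as follows. Both function names are hypothetical and nothing here is part of the proposal. Note also that `Threads.@spawn` schedules the work on Julia's existing thread pool rather than allocating a dedicated OS thread, so the second helper only approximates the "dedicated thread" idea.

```julia
# Hypothetical poll-and-yield helper: `poll` is a non-blocking function that
# returns `nothing` when no result is ready yet. Between attempts we yield so
# other tasks on this thread can make progress.
function poll_and_yield(poll)
    while true
        result = poll()
        result === nothing || return result
        yield()
    end
end

# Hypothetical helper for operations that block the thread: run `f` as a
# separate task and wait for it. `fetch` parks the calling task in the
# scheduler instead of spinning.
function run_off_thread(f)
    task = Threads.@spawn f()
    return fetch(task)
end

# Usage: the poll function reports "not ready" twice before producing a value.
attempts = Ref(0)
value = poll_and_yield() do
    attempts[] += 1
    attempts[] >= 3 ? :ready : nothing
end
println(value, " after ", attempts[], " polls")

println(run_off_thread(() -> sum(1:100)))
```

A `wait()`-based helper for libuv-integrated APIs is left out, since its shape depends on which libuv handles are involved.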
### Cancellation

- Blocking operations may or may not support cancellation
- Cancellation support should be something that IO interfaces declare; if an IO interface doesn't support cancellation, then the IO user may choose to change behavior.
- We might need an agreed-upon cancellation API? See https://github.com/davidanthoff/CancellationTokens.jl. Should support polling and waiting methods.
- Async helpers should provide a way to accept a cancellation token (keyword argument, or scoped value, maybe?), and will query the token in the most low-latency manner possible.
  - Polling helper: polls the token
  - Libuv waiting helper: TODO
  - Dedicated thread helper: TODO
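As an illustration of the "polling helper polls the token" item, here is a minimal sketch. The token type is a hypothetical stand-in built on an atomic flag. It is not the API of CancellationTokens.jl, and the names `ToyCancelToken`, `cancel!`, `iscancelled`, `ToyCancelled` and `poll_with_cancel` are all assumptions made for this example.

```julia
# Hypothetical cancellation token: an atomic flag that any task may set.
struct ToyCancelToken
    cancelled::Threads.Atomic{Bool}
end
ToyCancelToken() = ToyCancelToken(Threads.Atomic{Bool}(false))

cancel!(token::ToyCancelToken) = (token.cancelled[] = true; nothing)
iscancelled(token::ToyCancelToken) = token.cancelled[]

# Hypothetical exception thrown when an operation is cancelled.
struct ToyCancelled <: Exception end

# Poll-and-yield loop that checks the token before every polling attempt.
# `poll` returns `nothing` while no result is available.
function poll_with_cancel(poll, token::ToyCancelToken)
    while true
        iscancelled(token) && throw(ToyCancelled())
        result = poll()
        result === nothing || return result
        yield()
    end
end

# Usage: another task cancels an operation that would never complete.
token = ToyCancelToken()
@async cancel!(token)
try
    poll_with_cancel(() -> nothing, token)
catch err
    err isa ToyCancelled || rethrow()
    println("operation was cancelled")
end
```

Passing the token explicitly is only one option; the keyword-argument and scoped-value variants mentioned above would change how the helper receives the token, not how it polls it.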