# IO interface in Julia

The current abstract type IO has no documented interface in Julia. This causes problems:

1. It's difficult and brittle to create your own subtype of IO, because you don't know what methods to implement.
2. Both current and new IO types are likely to hit slow fallback methods, because the current IO-related functions don't share any fast abstractions.
3. Current implementations, even among the few IOs in Base, are quite inconsistent; see e.g. [#57942](https://github.com/JuliaLang/julia/issues/57942) and [#57944](https://github.com/JuliaLang/julia/issues/57944).

## Background reading
Please add more material here that might be relevant

* Issue: [readbytes! doesn't throw on a non-readable stream](https://github.com/JuliaLang/julia/issues/50584)
* Issue: [readbytes!: support the all keyword for all methods](https://github.com/JuliaLang/julia/issues/40793)
* Issue: [unsafe_read(io, pointer, n) should return the number of read bytes](https://github.com/JuliaLang/julia/issues/16656)
* Issue: [review of IO blocking behaviour](https://github.com/JuliaLang/julia/issues/24526)
* Issue: [Should we breathe new life into `readavailable`?](https://github.com/JuliaLang/julia/issues/57994)
* Issue: [`close` is underdocumented, and inconsistently implemented](https://github.com/JuliaLang/julia/issues/57944)
* Issue: [isreadable is underdocumented and inconsistently implemented](https://github.com/JuliaLang/julia/issues/57942)
* Issue: [semantics of `position`](https://github.com/JuliaLang/julia/issues/57639)
* PR: [added docs for IO interface](https://github.com/JuliaLang/julia/pull/41291)
* Discourse post: [What is the IO interface?](https://discourse.julialang.org/t/what-is-the-io-interface/107350)

## About this proposal
One solution could be to simply thoroughly document the existing methods. Better documentation would certainly help.
However, since the status quo is characterized by a _lack of interface_, documenting an existing ad-hoc interface is not likely to result in a consistent and cohesive API.

A more promising approach is to rethink IO from the ground up: Build a few low-level abstractions, and implement the existing functions in terms of these abstractions.

Therefore, this proposal is a reimagining of IO in Julia from the ground up. It adds several new functions and one trait, and deprecates one existing function. I believe it's better to start from an idealistic point and then scale back the ambitions if the changes are considered too comprehensive.

Note that no changes proposed here should be breaking - everything can be put in a minor release.

*This proposal is only about a reading interface for `IO`. If this proposal is implemented, I will make a similar proposal for the writing interface for `IO`.*

### Reviewing this proposal
1. First, review the core reading interface of `readbuffering`, `getbuffer`, `fillbuffer` and `consume`.
2. Then, review the newly proposed functions `readinto!` and `readall!`.
3. Last, we can discuss clarifying the semantics of the many optional methods like `close` and `seek`.

## Implementation
WIP implementation in https://github.com/JuliaLang/julia/pull/57982

## How does Rust do it?
When looking for good low-level abstractions, Rust is always worth a look. Let's look at the foundational IO abstractions in Rust.

Rust manages to get very far with only 3 basic methods:

* Trait [Read](https://doc.rust-lang.org/std/io/trait.Read.html)
    - `read(&mut self, buf: &mut [u8]) -> io::Result<usize>`
* Trait [BufRead](https://doc.rust-lang.org/std/io/trait.BufRead.html)
    - `fill_buf(&mut self) -> io::Result<&[u8]>`
    - `consume(&mut self, amt: usize)`

If we build on these abstractions, we need to re-fit them for Julia, so let's make the following changes:

1. We don't organize the methods by these traits.
   Instead, we use one trait to enable dispatch between buffered and unbuffered IO. The other methods are ducktyped, and simply implemented directly without any Julia-level traits.
2. There is no way in Rust to get the buffer of a `BufRead` without also filling it. This is annoying. We separate getting and filling the buffer into two distinct methods.

## Proposed core Julia IO interface
This proposal adds:

* New functions: `getbuffer`, `fillbuffer`, `consume`, `readinto!`, and `readall!`
* The trait: `readbuffering`

### Core IO interface
Readers are _buffered_ if they expose a vector with buffered data to the consumer. Buffered IOs are faster, more convenient, and have more methods defined generically. Unbuffered IOs can be wrapped in a buffered IO (from a package) to make them buffered.

* Buffered readers should implement: `getbuffer`, `fillbuffer`, `consume`.
* Unbuffered readers should implement: `readbuffering`, and `readinto!`.

### New functions

* `readbuffering(::Type{<:IO})::Union{Base.IsBuffered, Base.NotBuffered}`

  Trait to signal if a reader IO is buffered. Defaults to `IsBuffered()`. New unbuffered IO types `T <: IO` should signal they are unbuffered by implementing `Base.readbuffering(::Type{T}) = Base.NotBuffered()`. Public but unexported.

* `getbuffer(io::IO)::AbstractVector{UInt8}`

  Get the available bytes of `io`. The returned vector `v` must have indices `1:length(v)`. Callers should avoid mutating the buffer. Calling this function when the buffer is empty should not attempt to fill the buffer. This function should be implemented for buffered readers only, and together with [`fillbuffer`](@ref) and [`consume`](@ref).

* `fillbuffer(io::IO)::Union{Int, Nothing}`

  Fill more bytes into the reading buffer from `io`'s underlying buffer, returning the number of bytes added. After calling `fillbuffer` and getting `n`, the buffer obtained by `getbuffer` should have `n` new bytes appended.
  This function must fill at least one byte, except:

  * If the underlying io is EOF, or there is no underlying io, return `0`.
  * If the buffer is not empty, and cannot be expanded, return `nothing`.

  `IO`s which do not wrap another underlying buffer, and therefore can't fill their buffer, should return `0` unconditionally. This function should never return `nothing` if the buffer is empty. This function should be implemented for buffered readers only, and together with [`getbuffer`](@ref) and [`consume`](@ref).

* `consume(io::IO, n::Int)::Nothing`

  Remove the first `n` bytes of the reading buffer of `io`. Consumed bytes will not be returned by future calls to `getbuffer`. If `n` is negative, or larger than the current reading buffer size, throw a `ConsumeBufferError`. This function should be implemented for buffered readers only, and together with [`getbuffer`](@ref) and [`fillbuffer`](@ref).

* `readinto!(io::IO, v::AbstractVector{UInt8})::Int`

  Read bytes from `io` into the beginning of `v`, returning the number of bytes read. This function should read at least one byte, except if `io` is EOF, or `v` is empty, in which case it should return `0`. Where possible, implementations should make sure `readinto!` performs at most one blocking reading IO operation, even when it means only filling part of `v`.

* `readall!(io::IO, v::AbstractVector{UInt8})::Int`

  Read as many bytes as possible from `io` into the beginning of `v`, returning the number of bytes read. This function will continue to read until `io` is EOF, or `v` has been filled.
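
To make the contract above concrete, here is a minimal sketch of a toy buffered reader and of `readinto!` / `readall!` written only in terms of the buffered trio. Everything in it is an assumption for illustration: none of these functions exist in Base today, so the sketch declares local stand-ins with the proposed names; `ChunkReader` and its fields are hypothetical; and `ArgumentError` stands in for the proposed `ConsumeBufferError`. It is not the implementation from the PR.

```julia
# Local stand-ins for the proposed functions (NOT part of Base today).
function getbuffer end
function fillbuffer end
function consume end

# Hypothetical toy reader: serves an in-memory source in chunks of at most `chunk` bytes.
mutable struct ChunkReader <: IO
    source::Vector{UInt8}   # plays the role of the underlying io
    srcpos::Int             # index of the next unread byte in `source`
    buffer::Vector{UInt8}   # the exposed reading buffer
    chunk::Int              # maximum number of bytes added per fill
end
ChunkReader(s::AbstractString; chunk::Int=4) =
    ChunkReader(collect(codeunits(s)), 1, UInt8[], chunk)

# Never fills the buffer, only exposes it.
getbuffer(r::ChunkReader) = r.buffer

function fillbuffer(r::ChunkReader)
    n = min(r.chunk, length(r.source) - r.srcpos + 1)
    n == 0 && return 0                      # underlying source is at EOF
    append!(r.buffer, view(r.source, r.srcpos:(r.srcpos + n - 1)))
    r.srcpos += n
    return n
end

function consume(r::ChunkReader, n::Int)
    # The proposal throws `ConsumeBufferError` here; `ArgumentError` is a stand-in.
    0 <= n <= length(r.buffer) || throw(ArgumentError("invalid consume size"))
    deleteat!(r.buffer, 1:n)
    return nothing
end

# Generic `readinto!` for buffered readers: at most one fill, may fill only part of `v`.
function readinto!(r, v::AbstractVector{UInt8})
    isempty(v) && return 0
    buf = getbuffer(r)
    if isempty(buf)
        fillbuffer(r) == 0 && return 0      # EOF; never `nothing` on an empty buffer
        buf = getbuffer(r)
    end
    n = min(length(buf), length(v))
    copyto!(v, firstindex(v), buf, 1, n)
    consume(r, n)
    return n
end

# Generic `readall!`: repeat `readinto!` until `v` is full or the reader is at EOF.
function readall!(r, v::AbstractVector{UInt8})
    total = 0
    while total < length(v)
        n = readinto!(r, view(v, (firstindex(v) + total):lastindex(v)))
        n == 0 && break
        total += n
    end
    return total
end

r = ChunkReader("hello world"; chunk=4)
v = zeros(UInt8, 8)
n1 = readinto!(r, v)   # one fill only, so `v` is only partially filled
n2 = readall!(r, v)    # keeps reading until `v` is full or EOF
```

In this sketch the first call stops after a single fill, while the second call loops until the source is exhausted, which is the distinction between the two proposed functions.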
### Optional methods
These functions should have no fallback definition

* `closewrite`
* `isopen`
* `reseteof`
* `mark`, `ismarked`, `unmark`, `reset`
* `isreadable`
* `iswritable`
* `seek` (implementing this gives `seekstart`)
* `position` (implementing this and `seek` gives `skip`)
* `seekend` (requires `seek`)
* `lock` and `unlock`
* `bytesavailable`: Has default impl for buffered readers, may be implemented for unbuffered ones
* `readavailable`: Has default impl for buffered readers, or for readers with `bytesavailable` implemented
* `eof`: Has default impl for buffered readers, may be implemented for unbuffered ones
* `unsafe_read`: Has default impl for buffered readers, may be implemented for unbuffered ones

### Derived methods
The following methods are defined in terms of the core API, and so new subtypes of IO get them for free

* `close`
* `readbytes!`
* `read(::IO)`
* `read(::IO, String)`
* `read!`
* `readeach`
* `readall!` (new function)
* `readinto!` (buffered reader only, new function)
* `peek` (buffered readers only)
* `unsafe_read` (buffered readers only)
* `eof` (buffered readers only)
* `bytesavailable` (buffered readers only)
* `readavailable` (buffered readers only)
* `copyline` (buffered readers only)
* `readline` (buffered readers only)
* `readlines` (buffered readers only)
* `eachline` (buffered readers only)
* `readuntil` (buffered readers only)
* `copyuntil` (buffered readers only)

## Discussion:
### We need a common, documented API for dispatching to dense memory.
Currently, Base recommends people define their IO methods in terms of the pointer-based `unsafe_read` and `unsafe_write`. However, pointers are tricky, dangerous and should be avoided where possible. Julia normally goes to great lengths to not expose APIs based on pointers, so it's a little strange that the IO interface singularly has been left out of the safety and convenience of Julia's memory management.
So why is the current main IO reading and writing 'interface' pointer-based? A major reason is that IO in particular often interacts with non-Julia (e.g. C) code, which forces the use of pointers. Another reason is that a lot of the heavy lifting in IO work is done by libc's `memchr`, `memmove` etc., which require pointers.

This suggests that we need:

* A *main* user-facing API that uses ordinary GC-aware Julia objects for safety and convenience, and revolves around copying bytes between `AbstractVector{UInt8}` using normal Julia functions like `copyto!`, as proposed here,
* But *also*, the ability for implementers to know if the given `AbstractVector{UInt8}` can be read from or written to using a pointer.

We don't have the latter, currently. That's a huge problem for the IO API in particular, because it makes it difficult to write methods like `readinto!(::MyType, ::AbstractVector{UInt8})` with a reasonable level of efficiency.

I can see a few options:

1. Use a single, common vector type, such as my proposed `MemoryView` ([link to repo](https://github.com/BioJulia/MemoryViews.jl)), which would then need to live in Base. Implementers can then write one specialized method for `MemoryView`, and callers will know that if you want the optimised method, you need to use a `MemoryView`.
2. Invent a new trait that tells people if a) something is writable, and b) you can meaningfully take `pointer` and `sizeof` of the object, and then have most methods dispatch on that trait.

See discussion in [#54581](https://github.com/JuliaLang/julia/issues/54581)

### Should we deprecate `mark`, `unmark`, `reset` and `ismarked`?
It's unfortunate that we are stuck with the `mark` family of functions. The implementation of these is generic over `IO`, and relies on reading and writing fields of the IO objects which may not exist. They also rely on other undocumented behaviour like the field being a signed integer, and that the position of an IO is never negative.
What is the purpose of `mark` etc? You can use it to mark a position in the exposed buffer of a reader, preventing the marked position and everything after it from being deleted from the buffer. This is useful for operating on the underlying buffer. However, this functionality is unnecessary, since the same can be done with the three buffered IO primitives.

I can't think of any other purposes. If there are none, could these functions be deprecated?

See discussion here: https://github.com/JuliaLang/julia/issues/58034

### Should we define a noop fallback `close`?
The purpose of `close` is to destroy / free the underlying resource under an IO. In the presence of wrapping IOs (e.g. TranscodingStreams.jl or BufferedStreams.jl), the wrapper type needs an interface to close its wrapped IO. For this to work generically and not error if the wrapper wraps an IO which does not implement `close`, perhaps we should define `close(::IO) = nothing`?

### Can we _change_ the definition of `readavailable` to actually read the available bytes?
See https://github.com/JuliaLang/julia/issues/57994

### Should we have a buffered API for writers?
**Note: The current PR #57982 only concerns itself with reading**

The proposed buffered *reader* API provides users direct access to the bytes of the IO's buffer. This is hugely helpful, as a lot of IO work revolves around reading bytes from an array and copying them around.

What about doing the same thing for writers? Expose a buffer users can mutate to write data to the IO.

The uses for a writing buffer are rarer, but they do exist. The implementation of `Base.show(::IO, ::Float64)`, for example, currently needs to allocate an intermediate `Memory` to store the written representation, because the digits are written backwards, and so can't be written one at a time to the `IO`.

This could be done by implementing a "writer trio" of functions similar to `getbuffer`, `fillbuffer` and `consume`.
Does this rare use case warrant three new (optional) functions in the interface? It's a _little bit_ nice, but I'm not sure these functions justify their own "API weight", so to speak.

### Should new, unbuffered IOs support slow generic fallbacks?
Some IO operations are fast on buffered IOs, but slow on unbuffered ones. Consider `readline`. This can be very efficient for a buffered IO, but for an unbuffered IO, the only generic implementation reads one byte at a time until a `0xA` is found. Indeed, `readline` does currently have such a fallback definition. Now consider if the IO is a network connection with millisecond latency...

So: Should new, unbuffered IOs use such a slow fallback, or should there be no such fallback? Old IOs will, of course, continue to have the fallback.

**For**: Yes, a fallback makes generic code more generic, which increases the composability of Julian IO. People generally prefer an inefficient fallback over a MethodError.

**Against**: No, reading unbuffered IOs byte by byte is unacceptably slow. The user is better served with a MethodError, because debugging slow fallbacks is much harder than debugging exceptions. The error is easily fixed by wrapping their IO in a buffered IO wrapper, which is way more efficient.

Currently, I'm **against**, and the PR _does not_ have a fallback for new, unbuffered IOs.

## Misc
### Thoughts

#### `readbytes!` is bad and should not be used to implement other methods.
- It promises to resize the buffer, which is not generally possible for most buffer types, including the most obvious buffer type: `Memory`. Including the logic to conditionally resize the buffer also significantly complicates the implementation of this basic function, even though it's unnecessary in most cases. It's also inefficient that every call needs to check `eof` to know whether to expand the buffer once it has been filled.
- The `nb` argument is superfluous: If the user wants to read fewer bytes than the buffer length, simply pass in a view of the buffer.
  If `nb` is larger than the buffer length, you can pass in a larger buffer, resizing it yourself.
- The `all` kwarg - which is true by default, but exists ONLY for `IOStream` - significantly changes the semantics of the function. This means that we essentially have the `IOStream` method of `readbytes!` behave very differently from the generic method.

We should deprecate this method, or, at the very least, just make sure it works and direct users to other functions instead.

But what if I want to:

- Read into an array, only doing one read call? Use `readinto!`
- Read N bytes into an array until EOF or array is filled? Use `readall!`
- Read N bytes into an array, where N is shorter than the array length? Use `readall!` or `readinto!` with a view of the array
- Read until EOF into an array, growing the array to fit? Use `read`. We could have a dedicated function for this that allows you to reuse a `Vector` - but growing will likely cause allocations anyway so I see little point.

#### Justification of `readinto!` and `readall!`
There are already so many reading functions. Why add more?

The problem is that the existing reading functions are not clear about exactly how much they read. Part of this proposal is to clear up the semantics of existing functions. However, some methods currently have multiple intended use cases, which is possible _only because their current semantics are imprecise and/or inconsistent_. Therefore, if we clean up the semantics, we might need new functions to cover exposed "API holes".

Here's how the land currently lies:

* `read(::IO, nb=typemax(Int))` promises to read _at most `nb` bytes_.
* `read(::IO, String)` promises to read all of IO.
* `read!(::IO, ::AbstractArray)` promises nothing, but currently reads to EOF or end of array. However, it returns the array, so if EOF is reached before array end, the caller has no way of knowing how many bytes were read.
* `readbytes!(::IO, ::AbstractVector{UInt8}, nb::Int)` reads at most `nb`.
  Docstring is not completely clear, but seems to suggest until EOF or `nb` has been read. Except for the `IOStream` method, where it's documented to do at most one read call.
* `readavailable` says amount of data returned is implementation-dependent. It does not explicitly guarantee reading `bytesavailable`. Current implementations return `bytesavailable`, except where that would be zero, in which case it does a single read call.

I suggest making **the following changes**:

* `read(::IO, nb=typemax(Int))` is now guaranteed to read until either EOF, or `nb` bytes are read, whichever comes first
* `read!(::IO, ::AbstractArray)` now resizes the array if EOF was reached and array was not filled. This is **breaking**, but might be the only way to make that method useful (and it's unlikely anyone relies on the existing behaviour).
* `readbytes!` should be completely deprecated, see the section of this document on the function
* `readavailable` should be guaranteed to read exactly `bytesavailable`, and should throw an error when `bytesavailable` is not implemented. This is **possibly breaking**, depending on how the current docstring is interpreted.

With these changes, we now have the following API "holes":

* We have no way of doing a single read call. `readinto!` will be documented to read the available bytes if any, or else do one read call. I.e. it does what `readavailable` does (in its implementation) now. We _could_ instead document `readavailable` to do this (as it already does), but then we would need a new function `actually_readavailable` to read only the available bytes, and I think that is bad API. `readinto!` also closes the API hole of `read!`, in that it is now the way to partially fill an array.
* We have no way of reading the entire content of an `IO` into a buffer, hence why I propose `readall!`.
  A dedicated function for this might not be super important, since performance insensitive code can just `read(io)` and copy the bytes, and performance sensitive code can implement it as a while loop of `readinto!`, which won't be a big problem as performance sensitive code will probably need to do all sorts of low-level IO operations anyway. So, I'm not attached to `readall!`, but I think it would be a generally useful function to have.

#### The `all` kwarg used in some IO methods
There are methods with the `IOStream` type specifically that have the `all` keyword (`read` and `readbytes!`). This is bad API because setting the `all` keyword completely changes the semantics. It's sort of like if `append!` was instead written as `push!(v, things; all=true)`.

We should preserve the functionality of these keywords but not add them to any more types.

#### `readuntil` and `copyuntil`'s vector argument should be limited to byte vectors
These functions allow the user to copy from one IO to another, until a delimiter is found in the source. This delimiter can be an `AbstractVector{T}`. However, IMO it's not semantically meaningful if the delimiter is e.g. `Vector{Int}`, since IOs, which are conceptually streams of bytes, do not contain any `Int`s, only bytes. The implementation is also buggy. We should limit it to `AbstractVector{UInt8}`.

See https://github.com/JuliaLang/julia/pull/58019

## Migration plan
#### Implement reader part of interface first
As per above, I'm not completely sold on the `getwritebuffer`, `consumewrite`, `fillwritebuffer` trio. Begin by implementing only the reading interface, where I'm more confident in the design.

#### Test all existing functionality with an "Old IO emulator"
This is a type that implements the minimal IO interface _before this proposal_. For the reading part, this corresponds to only `eof` and `read(::IO, ::Type{UInt8})`. Test Base code using this type to ensure nothing breaks when developing the PR.
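
As an illustration of what such an emulator could look like, here is a minimal sketch. The type name `OldIOEmulator` and its fields are hypothetical and are not taken from the PR; the only methods it defines are the two named above, so every other call is served by Base's existing generic fallbacks.

```julia
# Hypothetical "old IO" type: implements only `eof` and `read(::IO, ::Type{UInt8})`.
mutable struct OldIOEmulator <: IO
    data::Vector{UInt8}
    pos::Int    # index of the next unread byte
end
OldIOEmulator(s::AbstractString) = OldIOEmulator(collect(codeunits(s)), 1)

Base.eof(io::OldIOEmulator) = io.pos > length(io.data)

function Base.read(io::OldIOEmulator, ::Type{UInt8})
    eof(io) && throw(EOFError())
    b = io.data[io.pos]
    io.pos += 1
    return b
end

io = OldIOEmulator("first line\nsecond line")
line = readline(io)       # generic fallback, reads one byte at a time
rest = read(io, String)   # generic fallback, reads until EOF
```

A test suite built on a type like this would exercise each generic Base reading method and check that it keeps working unchanged while the new API is introduced.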
#### Use introspection to dispatch between old and new API in Base
Many fallback Base reading methods are defined in terms of `read(::IO, ::Type{UInt8})`. That's extremely inefficient and the generic fallback ought to be implemented in terms of this newly proposed API. However, changing the generic method breaks existing types.

To get around this, whenever I change a generic method from not requiring the new API to requiring the new API, I use introspection to check if the new API is implemented. An example:

```julia
if (
    # These are mandatory API parts of the new interface for unbuffered
    # and buffered readers, respectively.
    readbuffering(typeof(io)) == Base.NotBuffered() ||
    hasmethod(getbuffer, Tuple{typeof(io)})
)
    # use new method definition
else
    # use old method definition
end
```

This is ugly, relies on a shady amount of introspection, and might dispatch to unacceptably slow old generic implementations. However, it's the only way I can think of to avoid breaking code.

## Clarification of existing documentation
**THIS SECTION IS WORK IN PROGRESS**

* `unsafe_read` and `unsafe_write` should return the number of bytes read/written. It should document that the caller only needs to guarantee the pointer is valid and points to at least `nbytes` bytes. It is up to the implementer to make sure this function doesn't access OOB data in the IO itself.
* `close` and `isopen`: Implementing one of these also requires implementing the other, and `eof` if the stream is readable. Also, `close` will currently `flush`: This means it can't be used for read-only IOs. We need to figure out what to do here - perhaps call flush only if this method is implemented (using a `hasmethod` check).
* `closewrite`: Must implement `iswritable` and `eof`
* `reseteof`: Must implement `eof`
* `mark`, `ismarked`, `unmark`: Must all be implemented together.
  Requires `position`
* `reset`: Requires `mark` and `seek`
* `isreadable`: The current default implementation for this is incorrect, as it falls back to `isopen`. However, `isopen` explicitly says an IO can be closed and still contain readable data. Recommend changing default implementation to `isopen(io) || !eof(io)`.
* `iswritable`: Current default implementation is incorrect, since a writer might implement `closewrite`. Link to `closewrite` from docs.
* `seek`, `seekstart`, `position` and `seekend`. Link to each other. `seekstart` and `seekend` should require `position`. Should `seek` also, perhaps?
* `bytesavailable` and `readavailable`: Link to each other.
* `read` is not documented to read all bytes. I think it should be.

## Previously decided discussion points
### We should not have a fallible API
After a discussion on Zoom, we reached this conclusion. IOs should not error during normal operation, e.g. when reaching EOF. Exceptions are for either user errors (e.g. reading a closed stream), or for unexpected events (e.g. a file you are reading is suddenly unavailable).

Instead, construct a new abstract `IOError` type, and make IO errors throw instances of (subtypes of) this type.

### Async IOs
Where possible, IOs should not block the OS thread, but should be blocking in the sense that they may block the current task. All user-facing APIs should be safe to call in a multi-tasking setting, without an expectation that there is a certain number of threads available.

In order for blocking I/O to not block an entire thread or even the whole scheduler, all blocking operations must be implemented on top of async, scheduler-friendly primitives (`yield()`, `wait()`, etc.).

We should not implement API for managing async (waiting for, notifying etc.) IO operations. Users should use the existing `Task` APIs. One exception may be _cancellation_, which is different, and mostly orthogonal to this proposal.
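
As a rough illustration of blocking the current task without blocking the thread, here is a minimal sketch of a `yield()`-based polling loop. The helper name `poll_and_yield` and its calling convention (a zero-argument function that returns `nothing` when no data is ready) are assumptions made for this example only, not part of the proposal or the PR. A `Channel` stands in for a non-blocking data source.

```julia
# Hypothetical helper: repeatedly try a non-blocking operation, yielding to the
# scheduler between attempts so other tasks can make progress.
function poll_and_yield(try_op)
    while true
        result = try_op()
        result === nothing || return result
        yield()
    end
end

ch = Channel{UInt8}(1)              # stand-in for a non-blocking data source
producer = @async put!(ch, 0x2a)    # another task that makes data available
byte = poll_and_yield(() -> isready(ch) ? take!(ch) : nothing)
```

The calling task appears blocked until a byte arrives, but the producer task still gets to run because the loop yields instead of spinning.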
## More on async and cancellation
### Async
- We can provide a helper for the poll-and-yield pattern (`yield()`-based)
- We can provide a helper for libuv-integrated APIs (`wait()`-based)
- If the underlying API will block the thread for an arbitrary length of time (libc `read` et al.), then operations should be done on a dedicated thread
    - We can provide a helper that allocates a thread, puts the internal blocking operation on that thread, and waits in a scheduler-friendly way on that thread to finish. This can be `yield()`-based or `wait()`-based; benchmarking should guide this, or both can be provided.

### Cancellation
**Note: The current PR #57982 does not implement cancellation, this is left for a future PR**

- Blocking operations may or may not support cancellation
- Cancellation support should be something that IO interfaces declare; if an IO interface doesn't support cancellation, then the IO user may choose to change behavior.
- We might need an agreed-upon cancellation API? See https://github.com/davidanthoff/CancellationTokens.jl. Should support polling and waiting methods.
- Async helpers should provide a way to accept a cancellation token (keyword argument, or scoped value, maybe?), and will query the token in the most low-latency manner possible.
    - Polling helper: polls the token
    - Libuv waiting helper: TODO
    - Dedicated thread helper: TODO

## TODO
Work backwards through the abstractions from `read(::IO, ::Type{UInt8})`

* X `read(::IO)`
* X `read(::IO, String)`
* X `readeach(::IO, ::Type{UInt8})`
* `copyuntil(::IO, ::UInt8)`
* `copyline(::IO, ::IO)`
* `readline(::IO)`
* `eachline`
* `readline(::IO)`
* `skipchars`
* `unsafe_read(::IO, ::Ptr{UInt8}, ::UInt)`
* `read(::IO, ::AbstractArray)`
* `read(::IO, ::StridedArray)`

### Agenda for next meeting
* Discuss `readbuffering`, `fillbuffer`, `getbuffer`, and `consume`
* Discuss `readall!` and `readinto!` (see their justification), and how much all the read methods should read
* `mark`, `reset`, `unmark`, `ismarked`
* `close`
* The pointer stuff