IO interface in Julia

The current abstract type IO has no documented interface in Julia. This causes problems:

  1. It's difficult and brittle to create your own subtype of IO, because you don't know what methods to implement.
  2. Both current and new IO types are likely to hit slow fallback methods, because the current IO-related functions don't share any fast abstractions.
  3. Current implementations, even among the few IOs in Base, are quite inconsistent; see e.g. #57942 and #57944.

Background reading

Please add more material here that might be relevant

About this proposal

One solution could be to simply thoroughly document the existing methods. Better documentation would certainly help. However, since the status quo is characterized by a lack of interface, documenting an existing ad-hoc interface is not likely to result in a consistent and cohesive API.
A more promising approach is to rethink IO from the ground up: build a few low-level abstractions, and implement the existing functions in terms of these abstractions.

Therefore, this proposal is a reimagining of IO in Julia from the ground up. It adds several new functions and one trait, and deprecates one existing function.
I believe it's better to start from an idealistic point and then scale back the ambitions if the changes are considered too comprehensive.

Note that no changes proposed here should be breaking - everything can be put in a minor release.

This proposal is only about a reading interface for IO. If this proposal is implemented, I will make a similar proposal for the writing interface for IO.

Reviewing this proposal

  1. First, review the core reading interface of readbuffering, getbuffer, fillbuffer and consume.
  2. Then, review the newly proposed functions readinto! and readall!
  3. Last, we can discuss clarifying the semantics of the many optional methods like close and seek.

Implementation

WIP implementation in https://github.com/JuliaLang/julia/pull/57982

How does Rust do it?

When looking for good low-level abstractions, Rust is always worth a look. Let's look at the foundational IO abstractions in Rust.
Rust manages to get very far with only 3 basic methods:

  • Trait Read
    • read(&mut self, buf: &mut [u8]) -> io::Result<usize>
  • Trait BufRead
    • fill_buf(&mut self) -> io::Result<&[u8]>
    • consume(&mut self, amt: usize)

If we build on these abstractions, we need to refit them for Julia, so let's make the following changes:

  1. We don't organize the methods by these two traits. Instead, we use one trait to enable dispatch between buffered and unbuffered IO. The other methods are duck-typed, and simply implemented directly without any Julia-level traits.
  2. There is no way in Rust to get the buffer of a BufRead without also filling it. This is annoying. We separate getting and filling the buffer into two distinct methods.

Proposed core Julia IO interface

This proposal adds:
New functions: getbuffer, fillbuffer, consume, readinto!, and readall!
The trait: readbuffering

Core IO interface

Readers are buffered if they expose a vector with buffered data to the consumer. Buffered IOs are faster, more convenient, and have more methods defined generically. Unbuffered IOs can be wrapped in a buffered IO (from a package) to make them buffered.

  • Buffered readers should implement: getbuffer, fillbuffer, consume.
  • Unbuffered readers should implement: readbuffering, and readinto!.

New functions

  • readbuffering(::Type{<:IO})::Union{Base.IsBuffered, Base.NotBuffered}

Trait to signal if a reader IO is buffered. Defaults to IsBuffered().
New unbuffered IO types T <: IO should signal they are unbuffered by implementing Base.readbuffering(::Type{T}) = Base.NotBuffered(). Public but unexported.
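
As a rough illustration of how such a trait could drive dispatch, here is a minimal sketch. The types IsBuffered and NotBuffered and the function readbuffering are defined locally as stand-ins for the proposed Base names; RawSocketLike and describe are hypothetical and exist only for this example.

```julia
# Local stand-ins for the proposed Base.IsBuffered / Base.NotBuffered.
struct IsBuffered end
struct NotBuffered end

# Proposed default: an IO is assumed to be buffered unless it opts out.
readbuffering(::Type{<:IO}) = IsBuffered()

# Hypothetical unbuffered reader type that opts out of the default.
struct RawSocketLike <: IO end
readbuffering(::Type{RawSocketLike}) = NotBuffered()

# Generic code branches on the trait value, not on concrete IO types.
describe(io::IO) = describe(readbuffering(typeof(io)), io)
describe(::IsBuffered, ::IO) = "buffered: expects getbuffer, fillbuffer, consume"
describe(::NotBuffered, ::IO) = "unbuffered: expects readinto!"
```

Because the trait is a function of the type, the branch can be resolved at compile time when the concrete IO type is known.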

  • getbuffer(io::IO)::AbstractVector{UInt8}

Get the available bytes of io.

The returned vector must have indices 1:length(v). Callers should avoid mutating the buffer.
Calling this function when the buffer is empty should not attempt to fill the buffer.

This function should be implemented for buffered readers only, and together with
fillbuffer and consume.

  • fillbuffer(io::IO)::Union{Int, Nothing}

Fill more bytes into the reading buffer from io's underlying buffer, returning
the number of bytes added. After calling fillbuffer and getting n,
the buffer obtained by getbuffer should have n new bytes appended.

This function must fill at least one byte, except:

  • If the underlying io is EOF, or there is no underlying io, return 0.
  • If the buffer is not empty, and cannot be expanded, return nothing.

IOs which do not wrap another underlying buffer, and therefore can't fill
their buffer, should return 0 unconditionally.
This function should never return nothing if the buffer is empty.

This function should be implemented for buffered readers only, and together with
getbuffer and consume.

  • consume(io::IO, n::Int)::Nothing

Remove the first n bytes of the reading buffer of io.
Consumed bytes will not be returned by future calls to getbuffer.
If n is negative, or larger than the current reading buffer size,
throw a ConsumeBufferError error.

This function should be implemented for buffered readers only, and together with
getbuffer and fillbuffer.
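
To make the contract of the three functions concrete, here is a minimal sketch of a buffered reader over an in-memory byte vector. VecReader is a hypothetical type, and getbuffer, fillbuffer, consume and ConsumeBufferError are defined locally as stand-ins for the proposed names; none of this is taken from the PR.

```julia
# Hypothetical in-memory reader: a byte vector plus a read position.
mutable struct VecReader <: IO
    data::Vector{UInt8}
    pos::Int  # index of the first unconsumed byte
end
VecReader(data::Vector{UInt8}) = VecReader(data, 1)

# Local stand-in for the proposed error type.
struct ConsumeBufferError <: Exception end

# Expose the unconsumed bytes as a view with indices 1:length, without copying.
getbuffer(io::VecReader) = view(io.data, io.pos:length(io.data))

# There is no underlying IO to pull from, so return 0 unconditionally.
fillbuffer(io::VecReader) = 0

# Drop the first n buffered bytes; reject out-of-range n.
function consume(io::VecReader, n::Int)
    avail = length(io.data) - io.pos + 1
    (n < 0 || n > avail) && throw(ConsumeBufferError())
    io.pos += n
    return nothing
end
```

A reader that wraps another source would differ mainly in fillbuffer, which would append newly fetched bytes and return how many were added.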

  • readinto!(io::IO, v::AbstractVector{UInt8})::Int

Read bytes from io into the beginning of v, returning the number of bytes read.
This function should read at least one byte, except if io is EOF, or v is empty,
in which case it should return 0.
Where possible, implementations should make sure readinto! performs at most one
blocking reading IO operation, even when it means only filling part of v.
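
For buffered readers, readinto! could plausibly be derived from the three buffer functions. The sketch below assumes that shape; ChunkedReader is a hypothetical reader that fetches at most chunk bytes per fill, and all function names are local stand-ins rather than the PR's implementation.

```julia
# Hypothetical reader that pulls at most `chunk` bytes per fill from `src`.
mutable struct ChunkedReader <: IO
    src::Vector{UInt8}   # pretend this is a slow underlying source
    srcpos::Int          # next unread index in src
    buf::Vector{UInt8}   # bytes fetched but not yet consumed
    chunk::Int           # maximum bytes fetched per fillbuffer call
end
ChunkedReader(src::Vector{UInt8}, chunk::Int) = ChunkedReader(src, 1, UInt8[], chunk)

getbuffer(io::ChunkedReader) = io.buf

function fillbuffer(io::ChunkedReader)
    n = min(io.chunk, length(io.src) - io.srcpos + 1)
    append!(io.buf, view(io.src, io.srcpos:(io.srcpos + n - 1)))
    io.srcpos += n
    return n  # 0 once the source is exhausted
end

function consume(io::ChunkedReader, n::Int)
    0 <= n <= length(io.buf) || throw(ArgumentError("cannot consume $n bytes"))
    deleteat!(io.buf, 1:n)
    return nothing
end

# Generic sketch: serve from the buffer if possible, else fill exactly once.
function readinto!(io::IO, v::AbstractVector{UInt8})
    isempty(v) && return 0
    buf = getbuffer(io)
    if isempty(buf)
        fillbuffer(io)       # at most one fill, i.e. one blocking operation
        buf = getbuffer(io)
    end
    n = min(length(buf), length(v))
    copyto!(v, firstindex(v), buf, firstindex(buf), n)
    consume(io, n)
    return n
end
```

Note that the result can be shorter than v even when more data exists in the source, which is the "at most one blocking operation" behaviour described above.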

  • readall!(io::IO, v::AbstractVector{UInt8})::Int

Read as many bytes as possible from io into the beginning of v, returning
the number of bytes read. This function will continue to read until io is EOF,
or v has been filled.
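
One way to picture readall! is as a loop over readinto!. The sketch below assumes that relationship; TrickleReader is a hypothetical unbuffered reader that returns at most maxread bytes per call, and both functions are local stand-ins for the proposed ones.

```julia
# Hypothetical unbuffered reader that yields at most `maxread` bytes per call.
mutable struct TrickleReader <: IO
    data::Vector{UInt8}
    pos::Int      # next unread index in data
    maxread::Int  # upper bound on bytes returned by a single read
end

function readinto!(io::TrickleReader, v::AbstractVector{UInt8})
    n = min(io.maxread, length(v), length(io.data) - io.pos + 1)
    copyto!(v, firstindex(v), io.data, io.pos, n)
    io.pos += n
    return n  # 0 at EOF or when v is empty
end

# Keep calling readinto! on the unfilled tail until EOF or v is full.
function readall!(io::IO, v::AbstractVector{UInt8})
    total = 0
    while total < length(v)
        n = readinto!(io, view(v, (firstindex(v) + total):lastindex(v)))
        n == 0 && break  # EOF
        total += n
    end
    return total
end
```

Reading fewer bytes than the array holds needs no extra keyword in this design: the caller passes a view, e.g. readall!(io, view(v, 1:4)).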

Optional methods

These functions should have no fallback definition, except where a default implementation is noted:

  • closewrite
  • isopen
  • reseteof
  • mark, ismarked, unmark
  • reset
  • isreadable
  • iswritable
  • seek (implementing this gives seekstart)
  • position (implementing this and seek gives skip)
  • seekend (requires seek)
  • lock and unlock
  • bytesavailable: Has default impl for buffered readers, may be implemented for unbuffered ones
  • readavailable: Has default impl for buffered readers, or for readers with bytesavailable implemented
  • eof: Has default impl for buffered readers, may be implemented for unbuffered ones
  • unsafe_read: Has default impl for buffered readers, may be implemented for unbuffered ones

Derived methods

The following methods are defined in terms of the core API, so new subtypes of IO get them for free:

  • close
  • readbytes!
  • read(::IO)
  • read(::IO, String)
  • read!
  • readeach
  • readall! (new function)
  • readinto! (buffered readers only, new function)
  • peek (buffered readers only)
  • unsafe_read (buffered readers only)
  • eof (buffered readers only)
  • bytesavailable (buffered readers only)
  • readavailable (buffered readers only)
  • copyline (buffered readers only)
  • readline (buffered readers only)
  • readlines (buffered readers only)
  • eachline (buffered readers only)
  • readuntil (buffered readers only)
  • copyuntil (buffered readers only)
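
To show how derived methods might fall out of the core API, here is a sketch of eof and readline written only against getbuffer, fillbuffer and consume. ChunkSource is a hypothetical reader, and eof_sketch and readline_sketch are illustrative names chosen so that nothing in Base is overridden. Unlike Base.readline, this simplified version strips only a trailing "\n", not "\r\n".

```julia
# Hypothetical buffered reader fetching at most `chunk` bytes per fill.
mutable struct ChunkSource <: IO
    src::Vector{UInt8}
    srcpos::Int
    buf::Vector{UInt8}
    chunk::Int
end

getbuffer(io::ChunkSource) = io.buf

function fillbuffer(io::ChunkSource)
    n = min(io.chunk, length(io.src) - io.srcpos + 1)
    append!(io.buf, view(io.src, io.srcpos:(io.srcpos + n - 1)))
    io.srcpos += n
    return n
end

consume(io::ChunkSource, n::Int) = (deleteat!(io.buf, 1:n); nothing)

# EOF: nothing is buffered, and an attempt to fill adds nothing.
eof_sketch(io::IO) = isempty(getbuffer(io)) && fillbuffer(io) == 0

function readline_sketch(io::IO)
    line = UInt8[]
    while true
        buf = getbuffer(io)
        if isempty(buf)
            fillbuffer(io) == 0 && break  # EOF: return what was collected
            continue
        end
        i = findfirst(==(0x0a), buf)           # search the whole buffer at once
        n = i === nothing ? length(buf) : i
        append!(line, view(buf, 1:n))
        consume(io, n)
        i === nothing || break                 # stop after the newline
    end
    !isempty(line) && line[end] == 0x0a && pop!(line)  # drop trailing "\n"
    return String(line)
end
```

The buffer is always emptied before fillbuffer is called, so the "buffer is not empty and cannot be expanded" case never arises in this sketch.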

Discussion:

We need a common, documented API for dispatching to dense memory.

Currently, Base recommends people define their IO methods in terms of the pointer-based unsafe_read and unsafe_write.
However, pointers are tricky, dangerous and should be avoided where possible. Julia normally goes to great lengths to not expose APIs based on pointers, so it's a little strange that the IO interface singularly has been left out of the safety and convenience of Julia's memory management.
So why is the current main IO reading and writing 'interface' pointer-based?

A major reason is that IO in particular often interacts with non-Julia (e.g. C) code, which forces the use of pointers. Another reason is that a lot of the heavy lifting in IO work is done by libc's memchr, memmove etc., which require pointers.
This suggests that we need:

  • A main user-facing API that uses ordinary GC-aware Julia objects for safety and convenience, and revolves around copying bytes between AbstractVector{UInt8} using normal Julia functions like copyto!, as proposed here,
  • But also, the ability for implementers to know if the given AbstractVector{UInt8} can be read from or written to using a pointer.

We don't have the latter, currently. That's a huge problem for the IO API in particular, because it makes it difficult to write methods like readinto!(::MyType, ::AbstractVector{UInt8}) with a reasonable level of efficiency.

I can see a few options:

  1. Use a single, common vector type, such as my proposed MemoryView (link to repo), which would then need to live in Base. Implementers can then write one specialized method for MemoryView, and callers will know that if you want the optimised method, you need to use a MemoryView.
  2. Invent a new trait that tells people if a) something is writable, and b) you can meaningfully take pointer and sizeof of the object, and then have most methods dispatch on that trait.

See discussion in #54581

Should we deprecate mark, unmark, reset and ismarked?

It's unfortunate that we are stuck with the mark family of functions. Their implementation is generic over IO, and relies on reading and writing fields of the IO objects which may not exist. It also relies on other undocumented behaviour, such as the field being a signed integer, and the position of an IO never being negative.

What is the purpose of mark etc?

You can use it to mark a position in the exposed buffer of a reader, preventing the marked position and everything after it from being deleted from the buffer. This is useful for operating on the underlying buffer. However, this functionality is redundant: the same thing can be done with the three buffered IO primitives.

I can't think of any other purposes. If there are none, could these functions be deprecated?

See discussion here: https://github.com/JuliaLang/julia/issues/58034

Should we define a noop fallback close?

The purpose of close is to destroy / free the underlying resource under an IO. In the presence of wrapping IOs (e.g. TranscodingStreams.jl or BufferedStreams.jl), the wrapper type needs an interface to close the IO it wraps. For this to work generically and not error if the wrapper wraps an IO which does not implement close, perhaps we should define close(::IO) = nothing?

Can we change the definition of readavailable to actually read the available bytes?

See https://github.com/JuliaLang/julia/issues/57994

Should we have a buffered API for writers?

Note: The current PR #57982 only concerns itself with reading

The proposed buffered reader API provides users direct access to the bytes of the IO's buffer. This is hugely helpful, as a lot of IO work revolves around reading bytes from an array and copying them around.
What about doing the same thing for writers? Expose a buffer users can mutate to write data to the IO.

The uses for a writing buffer are rarer, but they do exist. The implementation of Base.show(::IO, ::Float64), for example, currently needs to allocate an intermediate Memory to store the written representation, because the digits are written backwards, and so can't be written one at a time to the IO.

This could be done by implementing a "writer trio" of functions similar to getbuffer, fillbuffer and consume.

Does this rare use case warrant three new (optional) functions in the interface?
It's a little bit nice, but I'm not sure these functions justify their own "API weight", so to speak.

Should new, unbuffered IOs support slow generic fallbacks?

Some IO operations are fast on buffered IOs, but slow on unbuffered ones. Consider readline. This can be very efficient for a buffered IO, but for an unbuffered IO, the only generic implementation reads one byte at a time until a 0xA is found. Indeed, readline currently has such a fallback definition.
Now consider what happens if the IO is a network connection with millisecond latency: each single-byte read may then cost a full round trip.

So: Should new, unbuffered IOs use such a slow fallback, or should there be no such fallback? Old IOs will, of course, continue to have the fallback.

For: Yes, a fallback makes generic code more generic, which increases the composability of Julian IO. People generally prefer an inefficient fallback over a MethodError.
Against: No, reading unbuffered IOs byte by byte is unacceptably slow. The user is better served with a MethodError, because debugging slow fallbacks is much harder than debugging exceptions. The error is easily fixed by wrapping their IO in a buffered IO wrapper which is way more efficient.

Currently, I'm against, and the PR does not have a fallback for new, unbuffered IOs.

Misc

Thoughts

readbytes! is bad and should not be used to implement other methods.

  • It promises to resize the buffer, which is not generally possible for most buffer types, including the most obvious buffer type: Memory. Including the logic to conditionally resize the buffer also significantly complicates the implementation of this basic function, even though it's unnecessary in most cases. It's also inefficient that every call needs to check eof to know whether to expand the buffer once it has been filled.
  • The nb keyword is superfluous: If the user wants to read fewer bytes than the buffer length, simply pass in a view of the buffer. If nb is larger than the buffer length, you can pass in a larger buffer, resizing it yourself.
  • The all kwarg, which is true by default but ONLY exists for IOStream, significantly changes the semantics of the function. This means that the IOStream method of readbytes! essentially behaves very differently from the generic method.

We should deprecate this method, or, at the very least, just make sure it works and direct users to other functions instead.

But what if I want to:

  • Read into an array, only doing one read call? Use readinto!
  • Read N bytes into an array until EOF or array is filled? Use readall!
  • Read N bytes into an array, where N is shorter than the array length? Use readall! or readinto! with a view of the array
  • Read until EOF into an array, growing the array to fit? Use read. We could have a dedicated function for this that allows you to reuse a Vector, but growing will likely cause allocations anyway, so I see little point.

Justification of readinto! and readall!

There are already so many reading functions. Why add more?
The problem is that the existing reading functions are not clear about exactly how much they read.
Part of this proposal is to clear up the semantics of existing functions. However, some methods currently have multiple intended use cases, which is possible only because their current semantics are imprecise and/or inconsistent. Therefore, if we clean up the semantics, we might need new functions to cover exposed "API holes".

Here's how the land is currently:

  • read(::IO; nb=typemax(Int)) promises to read at most nb bytes.
  • read(::IO, String) promises to read all of IO.
  • read!(::IO, ::AbstractArray) promises nothing, but currently reads to EOF or end of array. However, it returns the array, so if EOF is reached before array end, caller has no way of knowing how many bytes were read.
  • readbytes!(::IO, ::AbstractVector{UInt8}; nb::Int) reads at most nb. Docstring is not completely clear, but seems to suggest until EOF or nb has been read. Except for the IOStream method, where it's documented to do at most one read call.
  • readavailable says amount of data returned is implementation-dependent. It does not explicitly guarantee reading bytesavailable. Current implementations return bytesavailable, except where that would be zero, in which case it does a single read call.

I suggest making the following changes

  • read(::IO; nb=typemax(Int)) is now guaranteed to read until either EOF is reached or nb bytes have been read, whichever comes first.
  • read!(::IO, ::AbstractArray) now resizes the array if EOF was reached and array was not filled. This is breaking, but might be the only way to make that method useful (and it's unlikely anyone relies on the existing behaviour).
  • readbytes! should be completely deprecated, see the section of this document on the function
  • readavailable should be guaranteed to read exactly bytesavailable, and should throw an error when bytesavailable is not implemented. This is possibly breaking, depending on how the current docstring is interpreted.

With these changes, we now have the following API "holes":

  • We have no way of doing a single read call. readinto! will be documented to read the available bytes if any, or else do one read call. I.e. it does what readavailable does (in its implementation) now. We could instead document readavailable to do this (as it already does), but then we would need a new function actually_readavailable to read only the available bytes, and I think that is bad API. It also closes the API hole of read!, in that it is now the way to partially fill an array.
  • We have no way of reading the entire content of an IO into a buffer, which is why I propose readall!. A dedicated function for this might not be super important, since performance-insensitive code can just read(io) and copy the bytes, and performance-sensitive code can implement it as a while loop of readinto!, which won't be a big problem as performance-sensitive code will probably need to do all sorts of low-level IO operations anyway. So, I'm not attached to readall!, but I think it would be a generally useful function to have.

The all kwarg used in some IO methods

There are methods with the IOStream type specifically that have the all keyword (read and readbytes!). This is bad API because setting the all keyword completely changes the semantics. It's sort of like if append! was instead written as push!(v, things; all=true). We should preserve the functionality of these keywords but not add them to any more types.

readuntil and copyuntil's vector argument should be limited to byte vectors

These functions allow the user to copy from one IO to another, until a delimiter is found in the source. This delimiter can be an AbstractVector{T}.
However, IMO it's not semantically meaningful if the delimiter is e.g. Vector{Int}, since IOs, which are conceptually streams of bytes, do not contain any Ints, only bytes. The implementation is also buggy.

We should limit it to AbstractVector{UInt8}.

See https://github.com/JuliaLang/julia/pull/58019

Migration plan

Implement reader part of interface first

As noted above, I'm not completely sold on a buffered writer trio (getwritebuffer, consumewrite, fillwritebuffer). Begin by implementing only the reading interface, where I'm more confident in the design.

Test all existing functionality with an "Old IO emulator"

This is a type that implements the minimal IO interface before this proposal. For the reading part, this corresponds to only eof and read(::IO, ::Type{UInt8}).
Test Base code using this type to ensure nothing breaks when developing the PR.

Use introspection to dispatch between old and new API in Base

Many fallback Base reading methods are defined in terms of read(::IO, ::Type{UInt8}).
That's extremely inefficient and the generic fallback ought to be implemented in terms of this newly proposed API. However, changing the generic method breaks existing types.

To get around this, whenever I change a generic method from not requiring the new API to requiring the new API, I use introspection to check if the new API is implemented.
An example:

if (
    # These are mandatory API parts of the new interface for unbuffered
    # and buffered readers, respectively.
    readbuffering(typeof(io)) == Base.NotBuffered() ||
    hasmethod(getbuffer, Tuple{typeof(io)})
)
    # use new method definition
else
    # use old method definition
end

This is ugly, relies on a shady amount of introspection, and might dispatch to unacceptably slow old generic implementations.
However, it's the only way I can think of to avoid breaking code.
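
A self-contained version of this check can be sketched as follows. The trait types, readbuffering and getbuffer are local stand-ins for the proposed Base names, and OldReader, NewBuffered, NewUnbuffered and uses_new_api are hypothetical names used only for illustration.

```julia
# Local stand-ins for the proposed trait and core function.
struct IsBuffered end
struct NotBuffered end
readbuffering(::Type{<:IO}) = IsBuffered()  # default also applies to old types
function getbuffer end                      # generic function, no methods yet

# A pre-proposal type: implements none of the new API.
struct OldReader <: IO end

# A new buffered reader: detected through its getbuffer method.
struct NewBuffered <: IO end
getbuffer(::NewBuffered) = UInt8[]

# A new unbuffered reader: detected through its explicit trait opt-out.
struct NewUnbuffered <: IO end
readbuffering(::Type{NewUnbuffered}) = NotBuffered()

# True if `io` has opted in to the new interface in either of the two ways.
uses_new_api(io::IO) =
    readbuffering(typeof(io)) == NotBuffered() ||
    hasmethod(getbuffer, Tuple{typeof(io)})
```

Since the trait defaults to IsBuffered(), the trait alone cannot distinguish an old type from a new buffered one, which is why the hasmethod check is needed.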

Clarification of existing documentation

THIS SECTION IS WORK IN PROGRESS

  • unsafe_read and unsafe_write should return the number of bytes read/written. The docs should state that the caller only needs to guarantee that the pointer is valid and points to at least nbytes bytes. It is up to the implementer to make sure this function doesn't access OOB data in the IO itself.

  • close and isopen: Implementing these also requires implementing the other, and eof if the stream is readable. Also, close will currently flush: This means it can't be used for read-only IOs. We need to figure out what to do here - perhaps call flush only if this method is implemented (using a hasmethod check).

  • closewrite: Must implement iswritable and eof

  • reseteof: Must implement eof

  • mark, ismarked, unmark: Must all be implemented together. Requires position

  • reset: Requires mark and seek

  • isreadable: The current default implementation for this is incorrect, as it falls back to isopen. However, isopen explicitly says an IO can be closed and still contain readable data. Recommend changing default implementation to isopen(io) || !eof(io).

  • iswritable: Current default implementation is incorrect, since a writer might implement closewrite. Link to closewrite from docs.

  • seek, seekstart, position and seekend: Link to each other. seekstart and seekend should require position. Perhaps seek should, too?

  • bytesavailable and readavailable: Link to each other.

  • read is not documented to read all bytes. I think it should be.

Previously decided discussion points

We should not have a fallible API

After a discussion on Zoom, we reached this conclusion. IOs should not error during normal operation, e.g. when reaching EOF.
Exceptions are for either user errors (e.g. reading a closed stream), or for unexpected events (e.g. a file you are reading is suddenly unavailable).

Instead, construct a new abstract IOError type, and make IO errors throw instances of (subtypes of) this type.
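
The distinction between normal operation and errors can be sketched as below. AbstractIOError is a local stand-in for the abstract IOError type suggested above, and ReadClosedError, Closable and this readinto! method are hypothetical names for illustration.

```julia
# Local stand-in for the proposed abstract IOError type.
abstract type AbstractIOError <: Exception end

# Hypothetical concrete error for a user mistake: reading a closed stream.
struct ReadClosedError <: AbstractIOError end

# Hypothetical in-memory reader that can be closed.
mutable struct Closable <: IO
    data::Vector{UInt8}
    pos::Int
    open::Bool
end

function readinto!(io::Closable, v::AbstractVector{UInt8})
    io.open || throw(ReadClosedError())  # user error: throw
    n = min(length(v), length(io.data) - io.pos + 1)
    copyto!(v, firstindex(v), io.data, io.pos, n)
    io.pos += n
    return n  # 0 at EOF: normal operation, no exception
end
```

Callers can then catch the abstract type to handle any IO failure, while EOF is handled through the return value.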

Async IOs

Where possible, IOs should not block the OS thread, but should be blocking in the sense that they may block the current task.

All user-facing APIs should be safe to call in a multi-tasking setting, without an expectation that there is a certain number of threads available. In order for blocking I/O to not block an entire thread or even the whole scheduler, all blocking operations must be implemented on top of async, scheduler-friendly primitives (yield(), wait(), etc.).

We should not implement API for managing async (waiting for, notifying etc.) IO operations. Users should use the existing Task APIs. One exception may be cancellation, which is different, and mostly orthogonal to this proposal.

More on async and cancellation

Async

  • We can provide a helper for the poll-and-yield pattern (yield()-based)
  • We can provide a helper for libuv-integrated APIs (wait()-based)
  • If the underlying API will block the thread for an arbitrary length of time (libc read et al.), then operations should be done on a dedicated thread
  • We can provide a helper that allocates a thread, puts the internal blocking operation on that thread, and waits in a scheduler-friendly way on that thread to finish. This can be yield()-based or wait()-based; benchmarking should guide this, or both can be provided.
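
As a rough idea of what the yield()-based helper might look like, here is a minimal sketch. The name poll_and_yield and its signature are assumptions made for this example and are not part of the proposal or the PR.

```julia
# Hypothetical helper: wait until `isready()` is true without blocking the
# thread, by handing control back to the scheduler between polls.
function poll_and_yield(op, isready)
    while !isready()
        yield()  # let other tasks run instead of spinning on the thread
    end
    return op()  # the operation is expected not to block at this point
end

# Usage: another task makes the data available after a few scheduler turns.
flag = Ref(false)
producer = @async begin
    for _ in 1:3
        yield()
    end
    flag[] = true
end
result = poll_and_yield(() -> :data_read, () -> flag[])
wait(producer)
```

A wait()-based helper would instead park the task until it is notified, which avoids repeated polling at the cost of needing a notification source.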

Cancellation

Note: The current PR #57982 does not implement cancellation. This is left for a future PR.

  • Blocking operations may or may not support cancellation
  • Cancellation support should be something that IO interfaces declare; if an IO interface doesn't support cancellation, then the IO user may choose to change behavior.
  • We might need an agreed-upon cancellation API? See https://github.com/davidanthoff/CancellationTokens.jl. Should support polling and waiting methods.
  • Async helpers should provide a way to accept a cancellation token (keyword argument, or scoped value, maybe?), and will query the token in the most low-latency manner possible.
  • Polling helper: polls the token
  • Libuv waiting helper: TODO
  • Dedicated thread helper: TODO

TODO

Work backwards through the abstractions from read(::IO, ::Type{UInt8})

  • X read(::IO)

  • X read(::IO, String)

  • X readeach(::IO, ::Type{UInt8})

  • copyuntil(::IO, ::UInt8)

  • copyline(::IO, ::IO)

  • readline(::IO)

  • eachline

  • readline(::IO)

  • skipchars

  • unsafe_read(::IO, ::Ptr{UInt8}, ::UInt)

  • read(::IO, ::AbstractArray)

  • read(::IO, ::StridedArray)

Agenda for next meeting

  • Discuss readbuffering, fillbuffer, getbuffer, and consume
  • Discuss readall! and readinto! (see their justification), and how much all the read methods should read
  • mark, reset, unmark, ismarked
  • close
  • The pointer stuff