The current abstract type IO has no documented interface in Julia. This causes problems:
Please add more material here that might be relevant
readavailable?
close is underdocumented, and inconsistently implemented
position
One solution could be to simply thoroughly document the existing methods. Better documentation would certainly help. However, since the status quo is characterized by a lack of interface, documenting an existing ad-hoc interface is not likely to result in a consistent and cohesive API.
A more promising approach is to rethink IO from the ground up: Build a few low-level abstractions, and implement the existing functions in terms of these abstractions.
Therefore, this proposal is a reimagining of IO in Julia from the ground up. It adds several new functions and one trait, and deprecates one existing function.
I believe it's better to start from an idealistic point and then scale back the ambitions if the changes are considered too comprehensive.
Note that no changes proposed here should be breaking - everything can be put in a minor release.
This proposal is only about a reading interface for IO. If this proposal is implemented, I will make a similar proposal for the writing interface for IO.
readbuffering, getbuffer, fillbuffer and consume.
readinto! and readall!
close and seek.
WIP implementation in https://github.com/JuliaLang/julia/pull/57982
When looking for good low-level abstractions, Rust is always worth a look. Let's look at the foundational IO abstractions in Rust.
Rust manages to get very far with only 3 basic methods:
read(&mut self, buf: &mut [u8]) -> io::Result<usize>
fill_buf(&mut self) -> io::Result<&[u8]>
consume(&mut self, amt: usize)
If we build on these abstractions, we need to re-fit them for Julia, so let's make the following changes:
With Rust's BufRead, there is no way to get the buffer without also filling it. This is annoying. We separate getting and filling the buffer into two distinct methods.
This proposal adds:
New functions: getbuffer, fillbuffer, consume, readinto!, and readall!
The trait: readbuffering
Readers are buffered if they expose a vector with buffered data to the consumer. Buffered IOs are faster, more convenient, and have more methods defined generically. Unbuffered IOs can be wrapped in a buffered IO (from a package) to make them buffered.
getbuffer, fillbuffer, consume.
readbuffering, and readinto!.
.readbuffering(::Type{<:IO})::Union{Base.IsBuffered, Base.NotBuffered}
Trait to signal if a reader IO is buffered. Defaults to IsBuffered().
New unbuffered io types T <: IO should signal they are unbuffered by implementing Base.readbuffering(::Type{T}) = Base.NotBuffered(). Public but unexported.
getbuffer(io::IO)::AbstractVector{UInt8}
Get the available bytes of io.
The returned vector must have indices 1:length(v). Callers should avoid mutating the buffer.
Calling this function when the buffer is empty should not attempt to fill the buffer.
This function should be implemented for buffered readers only, and together with fillbuffer and consume.
fillbuffer(io::IO)::Union{Int, Nothing}
Fill more bytes into the reading buffer from io's underlying buffer, returning the number of bytes added. After calling fillbuffer and getting n, the buffer obtained by getbuffer should have n new bytes appended.
This function must fill at least one byte, except in the cases where it returns 0 or nothing.
IOs which do not wrap another underlying buffer, and therefore can't fill their buffer, should return 0 unconditionally.
This function should never return nothing if the buffer is empty.
This function should be implemented for buffered readers only, and together with getbuffer and consume.
consume(io::IO, n::Int)::Nothing
Remove the first n bytes of the reading buffer of io.
Consumed bytes will not be returned by future calls to getbuffer.
If n is negative, or larger than the current reading buffer size, throw a ConsumeBufferError.
This function should be implemented for buffered readers only, and together with getbuffer and fillbuffer.
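To make the contract of the three buffered-reader primitives concrete, here is a minimal sketch for a toy in-memory reader. Everything in it is an assumption for illustration: ToyReader and its fields are hypothetical, the three functions are defined locally as stand-ins (none of them exist in Base today), ConsumeBufferError is given an arbitrary shape, and returning 0 once the source is exhausted is my reading of the fillbuffer contract above.

```julia
# Local stand-ins for the proposed functions; Base does not define them.
struct ConsumeBufferError <: Exception
    requested::Int
    available::Int
end

# Hypothetical toy reader: `source` plays the underlying resource,
# `buffer` is the reading buffer exposed to callers.
mutable struct ToyReader
    source::Vector{UInt8}
    sourcepos::Int         # index of the next byte in `source` not yet buffered
    buffer::Vector{UInt8}  # buffered bytes that have not been consumed
    chunk::Int             # maximum number of bytes moved per fill
end
ToyReader(s::String; chunk::Int=4) =
    ToyReader(Vector{UInt8}(codeunits(s)), 1, UInt8[], chunk)

# Never fills: it only exposes what is already buffered.
getbuffer(io::ToyReader) = io.buffer

function fillbuffer(io::ToyReader)
    remaining = length(io.source) - io.sourcepos + 1
    remaining == 0 && return 0  # assumed: 0 signals that no bytes can be added
    n = min(remaining, io.chunk)
    append!(io.buffer, view(io.source, io.sourcepos:(io.sourcepos + n - 1)))
    io.sourcepos += n
    return n
end

function consume(io::ToyReader, n::Int)
    if n < 0 || n > length(io.buffer)
        throw(ConsumeBufferError(n, length(io.buffer)))
    end
    n > 0 && deleteat!(io.buffer, 1:n)
    return nothing
end

io = ToyReader("hello world"; chunk=4)
fillbuffer(io)                              # moves up to 4 bytes into the buffer
first_chunk = String(copy(getbuffer(io)))   # copy, since callers should not mutate the buffer
consume(io, 2)                              # drop the first two buffered bytes
rest = String(copy(getbuffer(io)))
```

A real implementation would likely avoid the deleteat! shuffle by tracking a read offset into a fixed buffer; the sketch favours clarity over speed.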
readinto!(io::IO, v::AbstractVector{UInt8})::Int
Read bytes from io into the beginning of v, returning the number of bytes read.
This function should read at least one byte, except if io is EOF, or v is empty, in which case it should return 0.
Where possible, implementations should make sure readinto! performs at most one blocking reading IO operation, even when it means only filling part of v.
readall!(io::IO, v::AbstractVector{UInt8})::Int
Read as many bytes as possible from io into the beginning of v, returning the number of bytes read. This function will continue to read until io is EOF, or v has been filled.
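The difference between the two functions is easiest to see when a single read call delivers only part of what was asked for. The sketch below is an assumption-laden illustration, not the PR's implementation: TrickleReader is a hypothetical unbuffered reader, and readinto! and readall! are defined locally as stand-ins for the proposed functions.

```julia
# Hypothetical unbuffered reader that hands out at most `chunk` bytes per read call,
# mimicking a source where each read returns only what is currently available.
mutable struct TrickleReader
    data::Vector{UInt8}
    pos::Int    # index of the next unread byte
    chunk::Int
end
TrickleReader(s::String, chunk::Int) =
    TrickleReader(Vector{UInt8}(codeunits(s)), 1, chunk)

# Stand-in for the proposed readinto!: one "read call", possibly a partial fill.
function readinto!(io::TrickleReader, v::AbstractVector{UInt8})
    n = min(length(v), io.chunk, length(io.data) - io.pos + 1)
    n == 0 && return 0  # EOF or empty destination
    copyto!(v, firstindex(v), io.data, io.pos, n)
    io.pos += n
    return n
end

# Stand-in for the proposed readall!: repeat readinto! until `v` is full or EOF.
function readall!(io, v::AbstractVector{UInt8})
    total = 0
    while total < length(v)
        n = readinto!(io, view(v, (firstindex(v) + total):lastindex(v)))
        n == 0 && break  # EOF
        total += n
    end
    return total
end

reader = TrickleReader("abcdefghij", 3)
buf = zeros(UInt8, 8)
nread = readall!(reader, buf)  # several readinto! calls are needed to fill `buf`
```

Passing a view as the destination is also how a caller would limit the number of bytes read, without needing an extra keyword argument.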
These functions should have no fallback definition
closewrite
isopen
reseteof
mark, ismarked, unmark, reset
reset
isreadable
iswritable
seek (implementing this gives seekstart)
position (implementing this and seek gives skip)
seekend (requires seek)
lock and unlock
bytesavailable: Has default impl for buffered readers, may be implemented for unbuffered ones
readavailable: Has default impl for buffered readers, or for readers with bytesavailable implemented
eof: Has default impl for buffered readers, may be implemented for unbuffered ones
unsafe_read: Has default impl for buffered readers, may be implemented for unbuffered ones
The following methods are defined in terms of the core API, and so new subtypes of IO get them for free
close
readbytes!
read(::IO)
read(::IO, String)
read!
readeach
readall! (new function)
readinto! (buffered reader only, new function)
peek (buffered readers only)
unsafe_read (buffered readers only)
eof (buffered readers only)
bytesavailable (buffered readers only)
readavailable (buffered readers only)
copyline (buffered readers only)
readline (buffered readers only)
readlines (buffered readers only)
eachline (buffered readers only)
readuntil (buffered readers only)
copyuntil (buffered readers only)
Currently, Base recommends people define their IO methods in terms of the pointer-based unsafe_read and unsafe_write.
However, pointers are tricky, dangerous and should be avoided where possible. Julia normally goes to great lengths to not expose APIs based on pointers, so it's a little strange that the IO interface singularly has been left out of the safety and convenience of Julia's memory management.
So why is the current main IO reading and writing 'interface' pointer-based?
A major reason is that IO in particular often interacts with non-Julia (e.g. C) code, which forces the use of pointers. Another reason is that a lot of the heavy lifting in IO work is done by libc's memchr, memmove etc, which require pointers.
This suggests that we need:
An interface based on AbstractVector{UInt8} using normal Julia functions like copyto!, as proposed here,
A way to signal that an AbstractVector{UInt8} can be read from or written to using a pointer.
We don't have the latter, currently. That's a huge problem for the IO API in particular, because it makes it difficult to write methods like readinto!(::MyType, ::AbstractVector{UInt8}) with a reasonable level of efficiency.
I can see a few options:
MemoryView (link to repo), which would then need to live in Base. Implementers can then write one specialized method for MemoryView, and callers will know that if you want the optimised method, you need to use a MemoryView.
pointer and sizeof of the object, and then have most methods dispatch on that trait.
See discussion in #54581
mark, unmark, reset and ismarked?
It's unfortunate that we are stuck with the mark family of functions. The implementation of these is generic over IO, and relies on reading and writing fields of the IO objects which may not exist. It also relies on other undocumented behaviour, like the field being a signed integer, and the position of an IO never being negative.
What is the purpose of mark etc?
You can use it to mark a position in the exposed buffer of a reader, preventing the marked position and everything after it from being deleted from the buffer. This is useful for operating on the underlying buffer. However, this functionality is unnecessary and can also be done with the three buffered IO primitives.
I can't think of any other purposes. If there are none, could these functions be deprecated?
See discussion here: https://github.com/JuliaLang/julia/issues/58034
close?
The purpose of close is to destroy / free the underlying resource under an IO. In the presence of wrapping IOs (e.g. TranscodingStreams.jl or BufferedStreams.jl), the wrapper type needs an interface to close its wrapping IO. For this to work generically and not error if the wrapper wraps an IO which does not implement close, perhaps we should define close(::IO) = nothing?
readavailable to actually read the available bytes?
See https://github.com/JuliaLang/julia/issues/57994
Note: The current PR #57982 only concerns itself with reading
The proposed buffered reader API provides users direct access to the bytes of the IO's buffer. This is hugely helpful, as a lot of IO work revolves around reading bytes from an array and copying them around.
What about doing the same thing for writers? Expose a buffer users can mutate to write data to the IO.
The uses for a writing buffer are rarer, but they do exist. The implementation of Base.show(::IO, ::Float64), for example, currently needs to allocate an intermediate Memory to store the written representation, because the digits are written backwards, and so can't be written one at a time to the IO.
This could be done by implementing a "writer trio" of functions similar to getbuffer, fillbuffer and consume.
Does this rare use case warrant three new (optional) functions in the interface?
It's a little bit nice, but I'm not sure these functions justify their own "API weight", so to speak.
Some IO operations are fast on buffered IOs, but slow on unbuffered ones. Consider readline. This can be very efficient for a buffered IO, but for an unbuffered IO, the only generic implementation reads one byte at a time until a 0xA is found. Indeed, readline does currently have such a fallback definition.
Now consider if the IO is a network with millisecond latency…
So: Should new, unbuffered IOs use such a slow fallback, or should there be no such fallback? Old IOs will, of course, continue to have the fallback.
For: Yes, a fallback makes generic code more generic, which increases the composability of Julian IO. People generally prefer an inefficient fallback over a MethodError
Against: No, reading unbuffered IOs byte by byte is unacceptably slow. The user is better served with a MethodError, because debugging slow fallbacks is much harder than debugging exceptions. The error is easily fixed by wrapping their IO in a buffered IO wrapper which is way more efficient.
Currently, I'm against, and the PR does not have a fallback for new, unbuffered IOs.
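To illustrate why readline is cheap on a buffered reader, here is a rough sketch of a line reader written only against the buffered trio. It is not the PR's implementation: LineSource is a hypothetical reader, bufreadline is a made-up name, the trio is defined locally as stand-ins, and the sketch only handles "\n" line endings.

```julia
# Hypothetical buffered reader plus local stand-ins for the proposed trio.
mutable struct LineSource
    pending::Vector{UInt8}  # bytes still in the "underlying resource"
    buffer::Vector{UInt8}   # exposed reading buffer
    chunk::Int              # bytes delivered per fill, e.g. one network packet
end
LineSource(s::String, chunk::Int) =
    LineSource(Vector{UInt8}(codeunits(s)), UInt8[], chunk)

getbuffer(io::LineSource) = io.buffer

function fillbuffer(io::LineSource)
    n = min(io.chunk, length(io.pending))
    n == 0 && return 0  # assumed: 0 once the source is exhausted
    append!(io.buffer, view(io.pending, 1:n))
    deleteat!(io.pending, 1:n)
    return n
end

function consume(io::LineSource, n::Int)
    0 <= n <= length(io.buffer) || throw(ArgumentError("cannot consume $n bytes"))
    n > 0 && deleteat!(io.buffer, 1:n)
    return nothing
end

# Read one line: each fill is one (potentially slow) IO operation, and the
# search for the newline scans the whole buffered chunk at once.
function bufreadline(io)
    line = UInt8[]
    while true
        buf = getbuffer(io)
        if isempty(buf)
            n = fillbuffer(io)
            (n === nothing || n == 0) && break  # EOF
            continue
        end
        i = findfirst(==(0x0a), buf)
        if i === nothing
            append!(line, buf)            # no newline yet: take everything
            consume(io, length(buf))
        else
            append!(line, view(buf, 1:(i - 1)))
            consume(io, i)                # also drop the newline itself
            break
        end
    end
    return String(line)
end

src = LineSource("ab\ncdefgh\nxyz", 4)
lines = [bufreadline(src) for _ in 1:4]
```

With a chunk size of 4, the 13 input bytes need only four successful fills, whereas the byte-by-byte fallback would perform one read per byte.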
readbytes! is bad and should not be used to implement other methods.
Memory. Including the logic to conditionally resize the buffer also significantly complicates the implementation of this basic function, even though it's unnecessary in most cases. It's also inefficient that every call needs to check eof to know whether to expand the buffer once it has been filled.
nb keyword is superfluous: If the user wants to read fewer bytes than the buffer length, simply pass in a view of the buffer. If nb is larger than the buffer length, you can pass in a larger buffer, resizing it yourself.
all kwarg - which is true by default, but ONLY for IOStream - significantly changes the semantics of the function. This means that we essentially have the IOStream method of readbytes! behave very differently from the generic method.
We should deprecate this method, or, at the very least, just make sure it works and direct users to other functions instead.
But what if I want to:
readinto!
readall!
readall! or readinto! with a view of the array
read. We could have a dedicated function for this that allows you to reuse a Vector - but growing will likely cause allocations anyway so I see little point.
readinto! and readall!
There are already so many reading functions. Why add more?
The problem is that the existing reading functions are not clear about exactly how much they read.
Part of this proposal is to clear up the semantics of existing functions. However, some methods currently have multiple intended use cases, which is possible only because their current semantics are imprecise and/or inconsistent. Therefore, if we clean up the semantics, we might need new functions to cover exposed "API holes".
Here's how the land is currently:
read(::IO; nb=typemax(Int)) promises to read at most nb bytes.
read(::IO, String) promises to read all of IO.
read!(::IO, ::AbstractArray) promises nothing, but currently reads to EOF or end of array. However, it returns the array, so if EOF is reached before array end, caller has no way of knowing how many bytes were read.
readbytes!(::IO, ::AbstractVector{UInt8}; nb::Int) reads at most nb. Docstring is not completely clear, but seems to suggest until EOF or nb has been read. Except for the IOStream method, where it's documented to do at most one read call.
readavailable says amount of data returned is implementation-dependent. It does not explicitly guarantee reading bytesavailable. Current implementations return bytesavailable, except where that would be zero, in which case it does a single read call.
I suggest making the following changes
read(::IO; nb=typemax(Int)) is now guaranteed to read until either EOF, or nb bytes are read, whichever comes first
read!(::IO, ::AbstractArray) now resizes the array if EOF was reached and array was not filled. This is breaking, but might be the only way to make that method useful (and it's unlikely anyone relies on the existing behaviour).
readbytes! should be completely deprecated, see the section of this document on the function
readavailable should be guaranteed to read exactly bytesavailable, and should throw an error when bytesavailable is not implemented. This is possibly breaking, depending on how the current docstring is interpreted.
With these changes, we now have the following API "holes":
readinto! will be documented to read the available bytes if any, or else do one read call. I.e. it does what readavailable does (in its implementation) now. We could instead document readavailable to do this (as it already does), but then we would need a new function actually_readavailable to read only the available bytes, and I think that is bad API. It also closes the API hole of read!, in that it is now the way to partially fill an array.
IO into a buffer, hence why I propose readall!. A dedicated function for this might not be super important, since performance insensitive code can just read(io) and copy the bytes, and performance sensitive code can implement it as a while loop of readinto!, which won't be a big problem as performance sensitive code will probably need to do all sorts of low-level IO operations anyway. So, I'm not attached to readall!, but I think it would be a generally useful function to have.
all kwarg used in some IO methods
There are methods with the IOStream type specifically that have the all keyword (read and readbytes!). This is bad API because setting the all keyword completely changes the semantics. It's sort of like if append! was instead written as push!(v, things; all=true). We should preserve the functionality of these keywords but not add them to any more types.
readuntil and copyuntil's vector argument should be limited to byte vectors
These functions allow the user to copy from one IO to another, until a delimiter is found in the source. This delimiter can be an AbstractVector{T}.
However, IMO it's not semantically meaningful if the delimiter is e.g. Vector{Int}, since IOs, which are conceptually streams of bytes, do not contain any Ints, only bytes. The implementation is also buggy.
We should limit it to AbstractVector{UInt8}.
See https://github.com/JuliaLang/julia/pull/58019
As per above, I'm not completely sold on the getwritebuffer, consumewrite, fillwritebuffer trio. Begin by implementing only the reading interface, where I'm more confident on the design.
This is a type that implements the minimal IO interface before this proposal. For the reading part, this corresponds to only eof and read(::IO, ::Type{UInt8}).
Test Base code using this type to ensure nothing breaks when developing the PR.
Many fallback Base reading methods are defined in terms of read(::IO, ::Type{UInt8}).
That's extremely inefficient and the generic fallback ought to be implemented in terms of this newly proposed API. However, changing the generic method breaks existing types.
To get around this, whenever I change a generic method from not requiring the new API to requiring the new API, I use introspection to check if the new API is implemented.
An example
if (
    # These are mandatory API parts of the new interface for unbuffered
    # and buffered readers, respectively.
    readbuffering(typeof(io)) == Base.NotBuffered() ||
    hasmethod(getbuffer, Tuple{typeof(io)})
)
    # use new method definition
else
    # use old method definition
end
This is ugly, relies on a shady amount of introspection, and might dispatch to unacceptably slow old generic implementations.
However, it's the only way I can think of to avoid breaking code.
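A runnable sketch of that check, under stated assumptions: the trait types, readbuffering and getbuffer are local stand-ins for the proposed Base names, the three reader types are hypothetical, and choose_impl only returns a label where a real generic method would branch into the new or old implementation.

```julia
# Local stand-ins for the proposed trait and function (not in Base today).
struct IsBuffered end
struct NotBuffered end
readbuffering(::Type) = IsBuffered()  # proposed default, which legacy types also get
function getbuffer end                # generic function with no methods yet

# Hypothetical reader types.
struct LegacyReader end               # predates the new API, implements none of it
struct NewBufferedReader
    buf::Vector{UInt8}
end
getbuffer(io::NewBufferedReader) = io.buf
struct NewUnbufferedReader end
readbuffering(::Type{NewUnbufferedReader}) = NotBuffered()

# True if `io` opted into the new interface: either it declared itself
# unbuffered, or it implements the mandatory buffered method getbuffer.
implements_new_api(io) =
    readbuffering(typeof(io)) isa NotBuffered ||
    hasmethod(getbuffer, Tuple{typeof(io)})

# A generic method can then pick its implementation (labels only, for illustration).
choose_impl(io) = implements_new_api(io) ? :new : :old

legacy_choice = choose_impl(LegacyReader())
buffered_choice = choose_impl(NewBufferedReader(UInt8[]))
unbuffered_choice = choose_impl(NewUnbufferedReader())
```

Note how the legacy type reports IsBuffered() through the default trait method, so the trait alone cannot identify it; the hasmethod check on getbuffer is what separates it from a genuinely buffered new reader.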
THIS SECTION IS WORK IN PROGRESS
unsafe_read and unsafe_write should return the number of bytes read/written. The docs should state that the caller only needs to guarantee the pointer is valid and points to at least nbytes bytes. It is up to the implementer to make sure this function doesn't access OOB data in the IO itself.
close and isopen: Implementing either of these also requires implementing the other, and eof if the stream is readable. Also, close will currently flush: This means it can't be used for read-only IOs. We need to figure out what to do here - perhaps call flush only if this method is implemented (using a hasmethod check).
closewrite: Must implement iswritable and eof
reseteof: Must implement eof
mark, ismarked, unmark: Must all be implemented together. Requires position
reset: Requires mark and seek
isreadable: The current default implementation for this is incorrect, as it falls back to isopen. However, isopen explicitly says an IO can be closed and still contain readable data. Recommend changing default implementation to isopen(io) || !eof(io).
iswritable: Current default implementation is incorrect, since a writer might implement closewrite. Link to closewrite from docs.
seek, seekstart, position and seekend. Link to each other. seekstart and seekend should require position. Should seek also, perhaps?
bytesavailable and readavailable: Link to each other.
read is not documented to read all bytes. I think it should be.
After a discussion on Zoom, we reached this conclusion. IOs should not error during normal operation, e.g. when reaching EOF.
Exceptions are for either user errors (e.g. reading a closed stream), or for unexpected events (e.g. a file you are reading is suddenly unavailable).
Instead, construct a new abstract IOError type, and make IO errors throw instances of (subtypes of) this type.
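A small sketch of what such a hierarchy could look like. Only the abstract IOError type comes from the text above; the concrete subtypes, DemoReader, tryread and the local readinto! are hypothetical names chosen for illustration.

```julia
# The abstract supertype suggested above, defined locally for this sketch.
abstract type IOError <: Exception end

# Hypothetical concrete subtypes, one per category of failure.
struct ClosedStreamError <: IOError end  # user error: reading a closed stream
struct ResourceLostError <: IOError      # unexpected event: the resource vanished
    what::String
end

# Hypothetical reader used to exercise the error path.
mutable struct DemoReader
    data::Vector{UInt8}
    isopen::Bool
end

# EOF is normal operation: return 0 instead of throwing.
function readinto!(io::DemoReader, v::AbstractVector{UInt8})
    io.isopen || throw(ClosedStreamError())
    n = min(length(v), length(io.data))
    for i in 1:n
        v[firstindex(v) + i - 1] = io.data[i]
    end
    n > 0 && deleteat!(io.data, 1:n)
    return n
end

# Callers can handle the whole family of IO errors in one place,
# while unrelated exceptions still propagate.
function tryread(io, v)
    try
        return readinto!(io, v)
    catch err
        err isa IOError || rethrow()
        return nothing
    end
end

demo = DemoReader(UInt8[0x01, 0x02], true)
scratch = zeros(UInt8, 4)
got = tryread(demo, scratch)     # reads the two available bytes
at_eof = tryread(demo, scratch)  # EOF is reported as 0, not as an exception
demo.isopen = false
after_close = tryread(demo, scratch)
```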
Where possible, IOs should not block the OS thread, but should be blocking in the sense that they may block the current task.
All user-facing APIs should be safe to call in a multi-tasking setting, without an expectation that there is a certain number of threads available. In order for blocking I/O to not block an entire thread or even the whole scheduler, all blocking operations must be implemented on top of async, scheduler-friendly primitives (yield(), wait(), etc.).
We should not implement API for managing async (waiting for, notifying etc.) IO operations. Users should use the existing Task APIs. One exception may be cancellation, which is different, and mostly orthogonal to this proposal.
yield()-based)
wait()-based)
read et al.), then operations should be done on a dedicated thread
yield()-based or wait()-based; benchmarking should guide this, or both can be provided.
Note: The current PR #57982 does not implement cancellation; this is left for a future PR
Work backwards through the abstractions from read(::IO, ::Type{UInt8})
X read(::IO)
X read(::IO, String)
X readeach(::IO, ::Type{UInt8})
copyuntil(::IO, ::UInt8)
copyline(::IO, ::IO)
readline(::IO)
eachline
readline(::IO)
skipchars
unsafe_read(::IO, ::Ptr{UInt8}, ::UInt)
read(::IO, ::AbstractArray)
read(::IO, ::StridedArray)
readbuffering, fillbuffer, getbuffer, and consume
readall! and readinto! (see their justification), and how much all the read methods should read
mark, reset, unmark, ismarked
close