IO interface in Julia
The current abstract type IO has no documented interface in Julia. This causes problems:
Background reading
Please add more material here that might be relevant
readavailable?
close is underdocumented, and inconsistently implemented
position
About this proposal
One solution could be to simply thoroughly document the existing methods. Better documentation would certainly help. However, since the status quo is characterized by a lack of interface, documenting an existing ad-hoc interface is not likely to result in a consistent and cohesive API.
A more promising approach is to rethink IO from the ground up: Build a few low-level abstractions, and implement the existing functions in terms of these abstractions.
Therefore, this proposal is a reimagining of IO in Julia from the ground up. It adds several new functions and one trait, and deprecates one existing function.
I believe it's better to start from an idealistic point and then scale back the ambitions if the changes are considered too comprehensive.
Note that no changes proposed here should be breaking - everything can be put in a minor release.
This proposal is only about a reading interface for IO. If this proposal is implemented, I will make a similar proposal for the writing interface for IO.
Reviewing this proposal
readbuffering, getbuffer, fillbuffer and consume.
readinto! and readall!
close and seek.
Implementation
WIP implementation in https://github.com/JuliaLang/julia/pull/57982
How does Rust do it?
When looking for good low-level abstractions, Rust is always worth a look. Let's look at the foundational IO abstractions in Rust.
Rust manages to get very far with only 3 basic methods:
read(&mut self, buf: &mut [u8]) -> io::Result<usize>
fill_buf(&mut self) -> io::Result<&[u8]>
consume(&mut self, amt: usize)
If we build on these abstractions, we need to re-fit them for Julia, so, let's make the following changes:
In Rust, there is no way to get the buffer of a BufRead without also filling it. This is annoying. We separate getting and filling the buffer into two distinct methods.
Proposed core Julia IO interface
This proposal adds:
New functions: getbuffer, fillbuffer, consume, readinto!, and readall!
The trait: readbuffering
Core IO interface
Readers are buffered if they expose a vector with buffered data to the consumer. Buffered IOs are faster, more convenient, and have more methods defined generically. Unbuffered IOs can be wrapped in a buffered IO (from a package) to make them buffered.
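A minimal sketch of what such a buffered reader could look like. ChunkReader, readinto_sketch! and readall_sketch! are hypothetical names, and getbuffer, fillbuffer and consume are defined locally as stand-ins for the proposed functions, which are not part of any released Julia version. The sketch throws an ArgumentError where the proposal names a ConsumeBufferError, and it assumes the target vector is 1-based.

```julia
# Hypothetical reader: serves `source` to the consumer in chunks of at most `chunk` bytes.
mutable struct ChunkReader <: IO
    source::Vector{UInt8}  # the underlying data
    next::Int              # index of the next byte of `source` not yet buffered
    buffer::Vector{UInt8}  # bytes currently exposed to the consumer
    chunk::Int             # maximum number of bytes added per fill
end
ChunkReader(data::AbstractVector{UInt8}; chunk::Int=4) =
    ChunkReader(collect(data), 1, UInt8[], chunk)

# Stand-in for the proposed getbuffer: return the buffered bytes, never fill.
getbuffer(io::ChunkReader) = io.buffer

# Stand-in for the proposed fillbuffer: append up to `chunk` bytes, return the count (0 at EOF).
function fillbuffer(io::ChunkReader)
    n = min(io.chunk, length(io.source) - io.next + 1)
    append!(io.buffer, view(io.source, io.next:io.next + n - 1))
    io.next += n
    return n
end

# Stand-in for the proposed consume: drop the first n buffered bytes.
function consume(io::ChunkReader, n::Int)
    0 <= n <= length(io.buffer) || throw(ArgumentError("cannot consume $n bytes"))
    deleteat!(io.buffer, 1:n)
    return nothing
end

# Generic code written only against the three functions above.
# Reads into the beginning of v with at most one fill, like the proposed readinto!.
function readinto_sketch!(io, v::AbstractVector{UInt8})
    isempty(getbuffer(io)) && fillbuffer(io)
    buf = getbuffer(io)
    n = min(length(buf), length(v))
    copyto!(v, firstindex(v), buf, 1, n)
    consume(io, n)
    return n
end

# Keeps reading until v is full or the reader is at EOF, like the proposed readall!.
function readall_sketch!(io, v::AbstractVector{UInt8})
    total = 0
    while total < length(v)
        n = readinto_sketch!(io, view(v, total + 1:length(v)))
        n == 0 && break
        total += n
    end
    return total
end
```

With chunk=4, a single readinto_sketch! call on a fresh reader returns at most 4 bytes, while readall_sketch! keeps going until the vector is full or the source is exhausted.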
getbuffer, fillbuffer, consume.
readbuffering, and readinto!.
New functions
readbuffering(::Type{<:IO})::Union{Base.IsBuffered, Base.NotBuffered}
Trait to signal if a reader IO is buffered. Defaults to IsBuffered().
New unbuffered IO types T <: IO should signal they are unbuffered by implementing Base.IOBuffering(::Type{T}) = Base.NotBuffered(). Public but unexported.
getbuffer(io::IO)::AbstractVector{UInt8}
Get the available bytes of io.
The returned vector must have indices 1:length(v). Callers should avoid mutating the buffer.
Calling this function when the buffer is empty should not attempt to fill the buffer.
This function should be implemented for buffered readers only, and together with fillbuffer and consume.
fillbuffer(io::IO)::Union{Int, Nothing}
Fill more bytes into the reading buffer from io's underlying buffer, returning the number of bytes added. After calling fillbuffer and getting n, the buffer obtained by getbuffer should have n new bytes appended.
This function must fill at least one byte, except in the cases where it returns 0 or nothing.
IOs which do not wrap another underlying buffer, and therefore can't fill their buffer, should return 0 unconditionally.
This function should never return nothing if the buffer is empty.
This function should be implemented for buffered readers only, and together with getbuffer and consume.
consume(io::IO, n::Int)::Nothing
Remove the first n bytes of the reading buffer of io.
Consumed bytes will not be returned by future calls to getbuffer.
If n is negative, or larger than the current reading buffer size, throw a ConsumeBufferError error.
This function should be implemented for buffered readers only, and together with getbuffer and fillbuffer.
readinto!(io::IO, v::AbstractVector{UInt8})::Int
Read bytes from io into the beginning of v, returning the number of bytes read.
This function should read at least one byte, except if io is EOF, or v is empty, in which case it should return 0.
Where possible, implementations should make sure readinto! performs at most one blocking reading IO operation, even when it means only filling part of v.
readall!(io::IO, v::AbstractVector{UInt8})::Int
Read as many bytes as possible from io into the beginning of v, returning the number of bytes read. This function will continue to read until io is EOF, or v has been filled.
Optional methods
These functions should have no fallback definition
closewrite
isopen
reseteof
mark, ismarked, unmark, reset
reset
isreadable
iswritable
seek (implementing this gives seekstart)
position (implementing this and seek gives skip)
seekend (requires seek)
lock and unlock
bytesavailable: Has default impl for buffered readers, may be implemented for unbuffered ones
readavailable: Has default impl for buffered readers, or for readers with bytesavailable implemented
eof: Has default impl for buffered readers, may be implemented for unbuffered ones
unsafe_read: Has default impl for buffered readers, may be implemented for unbuffered ones
Derived methods
The following methods are defined in terms of the core API, and so new subtypes of IO get them for free
close
readbytes!
read(::IO)
read(::IO, String)
read!
readeach
readall! (new function)
readinto! (buffered reader only, new function)
peek (buffered readers only)
unsafe_read (buffered readers only)
eof (buffered readers only)
bytesavailable (buffered readers only)
readavailable (buffered readers only)
copyline (buffered readers only)
readline (buffered readers only)
readlines (buffered readers only)
eachline (buffered readers only)
readuntil (buffered readers only)
copyuntil (buffered readers only)
Discussion:
We need a common, documented API for dispatching to dense memory.
Currently, Base recommends people define their IO methods in terms of the pointer-based unsafe_read and unsafe_write.
However, pointers are tricky, dangerous and should be avoided where possible. Julia normally goes to great lengths to not expose APIs based on pointers, so it's a little strange that the IO interface singularly has been left out of the safety and convenience of Julia's memory management.
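To illustrate the difference using existing Base functions only: the first read below goes through a raw pointer, so the caller has to keep the array alive and vouch for its size, while the second leaves both to Julia. The variable names are just for illustration.

```julia
io = IOBuffer("abcdefgh")

# Pointer-based: the caller must keep `buf` rooted and guarantee that
# at least 4 bytes are writable behind the pointer.
buf = zeros(UInt8, 4)
GC.@preserve buf unsafe_read(io, pointer(buf), UInt(4))

# Array-based: bounds and lifetime are handled by Julia.
buf2 = zeros(UInt8, 4)
read!(io, buf2)
```

Passing a too-large byte count to the pointer-based call would write out of bounds without any error, which is the kind of mistake the array-based form rules out.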
So why is the current main IO reading and writing 'interface' pointer-based?
A major reason is that IO in particular often interacts with non-Julia (e.g. C) code, which forces the use of pointers. Another reason is that a lot of the heavy lifting in IO work is done by libc's memchr, memmove etc, which require pointers.
This suggests that we need:
AbstractVector{UInt8} using normal Julia functions like copyto!, as proposed here,
AbstractVector{UInt8} can be read from or written to using a pointer.
We don't have the latter, currently. That's a huge problem for the IO API in particular, because it makes it difficult to write methods like readinto!(::MyType, ::AbstractVector{UInt8}) with a reasonable level of efficiency.
I can see a few options:
MemoryView (link to repo), which would then need to live in Base. Implementers can then write one specialized method for MemoryView, and callers will know that if you want the optimised method, you need to use a MemoryView.
pointer and sizeof of the object, and then have most methods dispatch on that trait.
See discussion in #54581
Should we deprecate mark, unmark, reset and ismarked?
It's unfortunate that we are stuck with the mark family of functions. The implementations of these are generic over IO, and rely on reading and writing fields of the IO objects which may not exist. They also rely on other undocumented behaviour, like the field being a signed integer, and the position of an IO never being negative.
What is the purpose of mark etc?
You can use it to mark a position in the exposed buffer of a reader, preventing the marked position and everything after it from being deleted from the buffer. This is useful for operating on the underlying buffer. However, this functionality is unnecessary and can also be done with the three buffered IO primitives.
I can't think of any other purposes. If there are none, could these functions be deprecated?
See discussion here: https://github.com/JuliaLang/julia/issues/58034
Should we define a noop fallback close?
The purpose of close is to destroy / free the underlying resource under an IO. In the presence of wrapping IOs (e.g. TranscodingStreams.jl or BufferedStreams.jl), the wrapper type needs an interface to close the IO it wraps. For this to work generically and not error if the wrapper wraps an IO which does not implement close, perhaps we should define close(::IO) = nothing?
Can we change the definition of readavailable to actually read the available bytes?
See https://github.com/JuliaLang/julia/issues/57994
Should we have a buffered API for writers?
Note: The current PR #57982 only concerns itself with reading
The proposed buffered reader API provides users direct access to the bytes of the IO's buffer. This is hugely helpful, as a lot of IO work revolves around reading bytes from an array and copying them around.
What about doing the same thing for writers? Expose a buffer users can mutate to write data to the IO.
The uses for a writing buffer are rarer, but they do exist. The implementation of Base.show(::IO, ::Float64), for example, currently needs to allocate an intermediate Memory to store the written representation, because the digits are written backwards, and so can't be written one at a time to the IO.
This could be done by implementing a "writer trio" of functions similar to getbuffer, fillbuffer and consume.
Does this rare use case warrant three new (optional) functions in the interface?
It's a little bit nice, but I'm not sure these functions justify their own "API weight", so to speak.
Should new, unbuffered IOs support slow generic fallbacks?
Some IO operations are fast on buffered IOs, but slow on unbuffered ones. Consider readline. This can be very efficient for a buffered IO, but for an unbuffered IO, the only generic implementation reads one byte at a time until a 0xA is found. Indeed, readline does currently have such a fallback definition.
Now consider if the IO is a network connection with millisecond latency…
So: Should new, unbuffered IOs use such a slow fallback, or should there be no such fallback? Old IOs will, of course, continue to have the fallback.
For: Yes, a fallback makes generic code more generic, which increases the composability of Julian IO. People generally prefer an inefficient fallback over a MethodError
Against: No, reading unbuffered IOs byte by byte is unacceptably slow. The user is better served with a MethodError, because debugging slow fallbacks is much harder than debugging exceptions. The error is easily fixed by wrapping their IO in a buffered IO wrapper which is way more efficient.
Currently, I'm against, and the PR does not have a fallback for new, unbuffered IOs.
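A rough illustration of the cost difference argued about above. CountingSource, readsome!, readline_bytewise and readline_chunked are hypothetical names, not part of Base or of the PR; each readsome! call stands in for one blocking read operation against the underlying resource.

```julia
# Hypothetical source that counts how many read operations are issued against it.
mutable struct CountingSource
    data::Vector{UInt8}
    pos::Int      # index of the next unread byte
    calls::Int    # number of read operations so far
end
CountingSource(s::AbstractString) = CountingSource(collect(codeunits(s)), 1, 0)

# One "blocking" read: copy up to length(dst) bytes into dst, return the count.
function readsome!(src::CountingSource, dst::Vector{UInt8})
    src.calls += 1
    n = min(length(dst), length(src.data) - src.pos + 1)
    n > 0 && copyto!(dst, 1, src.data, src.pos, n)
    src.pos += n
    return n
end

# Unbuffered-style fallback: one read operation per byte until 0x0a or EOF.
function readline_bytewise(src::CountingSource)
    line = UInt8[]
    byte = Vector{UInt8}(undef, 1)
    while readsome!(src, byte) == 1
        byte[1] == 0x0a && break
        push!(line, byte[1])
    end
    return String(line)
end

# Chunked reading: scan each chunk for the newline.
# A real buffered reader would keep the bytes after the newline; this sketch drops them.
function readline_chunked(src::CountingSource; chunksize::Int=4096)
    line = UInt8[]
    chunk = Vector{UInt8}(undef, chunksize)
    while true
        n = readsome!(src, chunk)
        n == 0 && break
        i = findfirst(==(0x0a), view(chunk, 1:n))
        if i === nothing
            append!(line, view(chunk, 1:n))
        else
            append!(line, view(chunk, 1:i - 1))
            break
        end
    end
    return String(line)
end

slow = CountingSource("hello\nworld")
fast = CountingSource("hello\nworld")
readline_bytewise(slow)   # issues one read operation per byte
readline_chunked(fast)    # issues a single read operation here
```

For this input the byte-wise version issues 6 read operations and the chunked version 1; with millisecond latency per operation, that ratio is the whole argument.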
Misc
Thoughts
readbytes! is bad and should not be used to implement other methods.
Memory. Including the logic to conditionally resize the buffer also significantly complicates the implementation of this basic function, even though it's unnecessary in most cases. It's also inefficient that every call needs to check eof to know whether to expand the buffer once it has been filled.
The nb keyword is superfluous: If the user wants to read fewer bytes than the buffer length, simply pass in a view of the buffer. If nb is larger than the buffer length, you can pass in a larger buffer, resizing it yourself.
The all kwarg - which is true by default, but ONLY for IOStream - significantly changes the semantics of the function. This means that we essentially have the IOStream method of readbytes! behave very differently from the generic method.
We should deprecate this method, or, at the very least, just make sure it works and direct users to other functions instead.
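The point about nb can be seen with the existing Base function, where nb is the optional third positional argument. A small sketch, with illustrative variable names, showing that passing a view of the buffer gives the same result as passing nb:

```julia
# Limit the read with nb.
buf = zeros(UInt8, 8)
n1 = readbytes!(IOBuffer("abcdefgh"), buf, 4)

# Limit the read by passing a 4-byte view of the buffer instead.
buf2 = zeros(UInt8, 8)
n2 = readbytes!(IOBuffer("abcdefgh"), view(buf2, 1:4))
```

Both calls read 4 bytes into the first half of their buffer and leave the second half untouched.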
But what if I want to:
readinto!
readall!
readall! or readinto! with a view of the array
read. We could have a dedicated function to do this that allows you to reuse a Vector - but growing will likely cause allocations anyway so I see little point.
Justification of readinto! and readall!
There are already so many reading functions. Why add more?
The problem is that the existing reading functions are not clear about exactly how much they read.
Part of this proposal is to clear up the semantics of existing functions. However, some methods currently have multiple intended use cases, which is possible only because their current semantics are imprecise and/or inconsistent. Therefore, if we clean up the semantics, we might need new functions to cover exposed "API holes".
Here's how the land lies currently:
read(::IO; nb=typemax(Int)) promises to read at most nb bytes.
read(::IO, String) promises to read all of IO.
read!(::IO, ::AbstractArray) promises nothing, but currently reads to EOF or end of array. However, it returns the array, so if EOF is reached before array end, caller has no way of knowing how many bytes were read.
readbytes!(::IO, ::AbstractVector{UInt8}; nb::Int) reads at most nb. Docstring is not completely clear, but seems to suggest until EOF or nb has been read. Except for the IOStream method, where it's documented to do at most one read call.
readavailable says amount of data returned is implementation-dependent. It does not explicitly guarantee reading bytesavailable. Current implementations return bytesavailable, except where that would be zero, in which case it does a single read call.
I suggest making the following changes
read(::IO; nb=typemax(Int)) is now guaranteed to read until either EOF, or nb bytes are read, whichever comes first
read!(::IO, ::AbstractArray) now resizes the array if EOF was reached and array was not filled. This is breaking, but might be the only way to make that method useful (and it's unlikely anyone relies on the existing behaviour).
readbytes! should be completely deprecated, see the section of this document on the function
readavailable should be guaranteed to read exactly bytesavailable, and should throw an error when bytesavailable is not implemented. This is possibly breaking, depending on how the current docstring is interpreted.
With these changes, we now have the following API "holes":
readinto! will be documented to read the available bytes if any, or else do one read call. I.e. it does what readavailable does (in its implementation) now. We could instead document readavailable to do this (as it already does), but then we would need a new function actually_readavailable to read only the available bytes, and I think that is bad API. It also closes the API hole of read!, in that it is now the way to partially fill an array.
IO into a buffer, hence why I propose readall!. A dedicated function for this might not be super important, since performance insensitive code can just read(io) and copy the bytes, and performance sensitive code can implement it as a while loop of readinto!, which won't be a big problem as performance sensitive code will probably need to do all sorts of low-level IO operations anyway. So, I'm not attached to readall!, but I think it would be a generally useful function to have.
The all kwarg used in some IO methods
There are methods with the IOStream type specifically that have the all keyword (read and readbytes!). This is bad API because setting the all keyword completely changes the semantics. It's sort of like if append! was instead written as push!(v, things; all=true). We should preserve the functionality of these keywords but not add them to any more types.
readuntil and copyuntil's vector argument should be limited to byte vectors
These functions allow the user to copy from one IO to another, until a delimiter is found in the source. This delimiter can be an AbstractVector{T}.
However, IMO it's not semantically meaningful if the delimiter is e.g. Vector{Int}, since IOs, which are conceptually streams of bytes, do not contain any Ints, only bytes. The implementation is also buggy.
We should limit it to AbstractVector{UInt8}.
See https://github.com/JuliaLang/julia/pull/58019
Migration plan
Implement reader part of interface first
As per above, I'm not completely sold on the getwritebuffer, consumewrite, fillwritebuffer trio. Begin by implementing only the reading interface, where I'm more confident in the design.
Test all existing functionality with an "Old IO emulator"
This is a type that implements the minimal IO interface before this proposal. For the reading part, this corresponds to only eof and read(::IO, ::Type{UInt8}).
Test Base code using this type to ensure nothing breaks when developing the PR.
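A minimal sketch of what such an emulator might look like. OldIOEmulator is a hypothetical name and not necessarily the type used in the PR; it implements only eof and the single-byte read, and relies on Base's existing generic fallbacks for everything else.

```julia
# Hypothetical "old IO emulator": only eof and single-byte read are implemented.
mutable struct OldIOEmulator <: IO
    data::Vector{UInt8}
    pos::Int   # index of the next byte to read
end
OldIOEmulator(s::AbstractString) = OldIOEmulator(collect(codeunits(s)), 1)

Base.eof(io::OldIOEmulator) = io.pos > length(io.data)

function Base.read(io::OldIOEmulator, ::Type{UInt8})
    eof(io) && throw(EOFError())
    byte = io.data[io.pos]
    io.pos += 1
    return byte
end

# Base's generic fallbacks are built on the two methods above.
first_line = readline(OldIOEmulator("first\nsecond"))
all_bytes = read(OldIOEmulator("abc"))
as_string = read(OldIOEmulator("abc"), String)
```

Running the Base test suite against a type like this is one way to check that reworked generic methods still serve IOs that predate the new interface.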
Use introspection to dispatch between old and new API in Base
Many fallback Base reading methods are defined in terms of read(::IO, ::Type{UInt8}).
That's extremely inefficient and the generic fallback ought to be implemented in terms of this newly proposed API. However, changing the generic method breaks existing types.
To get around this, whenever I change a generic method from not requiring the new API to requiring the new API, I use introspection to check if the new API is implemented.
An example
This is ugly, relies on a shady amount of introspection, and might dispatch to unacceptably slow old generic implementations.
However, it's the only way I can think of to avoid breaking code.
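The introspection-based dispatch described above might look roughly like the following sketch. It is an assumption about the general shape, not the code from the PR: getbuffer and consume are local stand-ins for the proposed functions, and NewStyleReader, OldStyleReader, implements_new_api and read_byte are hypothetical names.

```julia
# Stand-ins for two functions of the proposed API, declared without methods.
function getbuffer end
function consume end

# Hypothetical reader that implements the new API.
struct NewStyleReader <: IO
    buffer::Vector{UInt8}
end
getbuffer(io::NewStyleReader) = io.buffer
consume(io::NewStyleReader, n::Int) = (deleteat!(io.buffer, 1:n); nothing)

# Hypothetical reader that only implements the old minimal interface.
mutable struct OldStyleReader <: IO
    data::Vector{UInt8}
    pos::Int
end
Base.eof(io::OldStyleReader) = io.pos > length(io.data)
function Base.read(io::OldStyleReader, ::Type{UInt8})
    eof(io) && throw(EOFError())
    byte = io.data[io.pos]
    io.pos += 1
    return byte
end

# Introspection: does type T have methods for the new functions?
implements_new_api(::Type{T}) where {T<:IO} =
    hasmethod(getbuffer, Tuple{T}) && hasmethod(consume, Tuple{T, Int})

# A generic method that picks the new or the old path at runtime.
function read_byte(io::IO)
    if implements_new_api(typeof(io))
        # New path. Assumes a non-empty buffer; a real method would fill it first.
        byte = first(getbuffer(io))
        consume(io, 1)
        return byte
    else
        # Old path: fall back to the single-byte read.
        return read(io, UInt8)
    end
end
```

The hasmethod check is what keeps existing types working: a type that never heard of the new functions silently takes the old path.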
Clarification of existing documentation
THIS SECTION IS WORK IN PROGRESS
unsafe_read and unsafe_write should return the number of bytes read/written. It should document that the caller only needs to guarantee the pointer is valid and points to at least nbytes bytes. It is up to the implementer to make sure this function doesn't access OOB data in the IO itself.
close and isopen: Implementing one of these also requires implementing the other, and eof if the stream is readable. Also, close will currently flush: This means it can't be used for read-only IOs. We need to figure out what to do here - perhaps call flush only if this method is implemented (using a hasmethod check).
closewrite: Must implement iswritable and eof
reseteof: Must implement eof
mark, ismarked, unmark: Must all be implemented together. Requires position
reset: Requires mark and seek
isreadable: The current default implementation for this is incorrect, as it falls back to isopen. However, isopen explicitly says an IO can be closed and still contain readable data. Recommend changing default implementation to isopen(io) || !eof(io).
iswritable: Current default implementation is incorrect, since a writer might implement closewrite. Link to closewrite from docs.
seek, seekstart, position and seekend. Link to each other.
seekstart and seekend should require position. Should seek also, perhaps?
bytesavailable and readavailable: Link to each other.
read is not documented to read all bytes. I think it should be.
Previously decided discussion points
We should not have a fallible API
After a discussion on Zoom, we reached this conclusion. IOs should not error during normal operation, e.g. when reaching EOF.
Exceptions are for either user errors (e.g. reading a closed stream), or for unexpected events (e.g. a file you are reading is suddenly unavailable).
Instead, construct a new abstract IOError type, and make IO errors throw instances of (subtypes of) this type.
Async IOs
Where possible, IOs should not block the OS thread, but should be blocking in the sense that they may block the current task.
All user-facing APIs should be safe to call in a multi-tasking setting, without an expectation that there is a certain number of threads available. In order for blocking I/O to not block an entire thread or even the whole scheduler, all blocking operations must be implemented on top of async, scheduler-friendly primitives (yield(), wait(), etc.).
We should not implement API for managing async (waiting for, notifying etc.) IO operations. Users should use the existing Task APIs. One exception may be cancellation, which is different, and mostly orthogonal to this proposal.
More on async and cancellation
Async
yield()-based)
wait()-based)
read et al.), then operations should be done on a dedicated thread
yield()-based or wait()-based; benchmarking should guide this, or both can be provided.
Cancellation
Note: The current PR #57982 does not implement cancellation, this is left for a future PR
TODO
Work backwards through the abstractions from
read(::IO, ::Type{UInt8})
X
read(::IO)
X
read(::IO, String)
X
readeach(::IO, ::Type{UInt8})
copyuntil(::IO, ::UInt8)
copyline(::IO, ::IO)
readline(::IO)
eachline
readline(::IO)
skipchars
unsafe_read(::IO, ::Ptr{UInt8}, ::UInt)
read(::IO, ::AbstractArray)
read(::IO, ::StridedArray)
Agenda for next meeting
readbuffering, fillbuffer, getbuffer, and consume
readall! and readinto! (see their justification), and how much all the read methods should read
mark, reset, unmark, ismarked
close