Work in progress
The case for Path
as a fundamental type that belongs in Base Julia, and a proposal for implementing it well.
Authors: Timothy, [add your name if you've made substantial edits]
Reviewed-by: [your name here]
Julia's approach to file paths was largely inspired by Python …just before Pathlib was adopted. In the years since, the idea that a path type would benefit Julia has been articulated multiple times, in different ways.
@stevengj made an issue proposing a mildly cursed partial workaround for the lack of a path type.
Julia's Slack only keeps 90 days of conversation history, but you can usually search for "path type" and find somebody running into papercuts/headaches. Ignoring my recent gripe, doing this I see Mosé responding to some platform-specific handling that came up on a Julia PR addressing the difference a trailing slash makes with some depot operations:
We should really have a proper path type, strings are simple bad for manipulating them
(💯 ×2)
While C and friends use char-vector types for strings, paths, and more, most modern high-level languages have settled on a dedicated path type, and for good reason. I hold that the value of a dedicated path type, and the importance of it being a part of the base language/stdlib, is self-evident to anybody who has substantial experience with a language that has one. Julia already has dedicated non-String types for regular expressions, substitution strings, and more. There are many reasons why a path type benefits Julia, some of which are:
This path type must exist in Julia's base, for two primary reasons:
Python's pathlib is generally praised for offering an ergonomic way of handling filesystem paths.
Pathlib provides per-platform path classes, and aliases Path/PurePath based on the current platform.
Pathlib separates out purely conceptual and filesystem-grounded paths as pure and concrete paths.
Operating on pure paths does not involve any interaction with the filesystem, while concrete paths check for symlinks, resolve symlinks, and verify various path operations using the filesystem.
The Pathlib API provides the basics (parent, parts, joinpath, home), and also a decent collection of utilities on top:
suffix
suffixes
stem
with_name
with_stem
with_suffix
with_segments
from_uri
as_uri
Currently Julia covers the basics, but could probably do with some more convenience functions.
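To make the gap concrete, here is a minimal sketch of how a few of the Pathlib utilities listed above could be written today on top of functions that already exist in Base (splitext, basename, dirname, joinpath). The names stem, suffix, with_suffix, and with_name are borrowed from Pathlib for illustration only; they are assumptions, not existing Julia functions, and they operate on plain strings rather than on a path type.

```julia
# Hypothetical helpers, named after their Pathlib counterparts.
# None of these exist in Base; they only wrap functions that do.

# The file name without its final extension, like Pathlib's `stem`.
stem(path::AbstractString) = first(splitext(basename(path)))

# The final extension (including the dot), like Pathlib's `suffix`.
suffix(path::AbstractString) = last(splitext(basename(path)))

# Replace the final extension, like Pathlib's `with_suffix`.
with_suffix(path::AbstractString, ext::AbstractString) =
    first(splitext(path)) * ext

# Replace the final component, like Pathlib's `with_name`.
with_name(path::AbstractString, name::AbstractString) =
    joinpath(dirname(path), name)
```

Each helper is a one-liner, which suggests the missing conveniences are cheap to provide once there is an agreed place to put them.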
Rust has two major path types:
Path: an immutable unsized type similar to str, and
PathBuf: a mutable (owned), growable path buffer similar to String
Having an owned type is a concern specific to Rust, but the idea of a type optimised for mutation is interesting.
You can just use Path/PathBuf without worrying about what platform you're on. +1 for ergonomics.
Rust sits somewhere between Pathlib and Julia in terms of convenience functions. Beyond the basics it's got a few things that Julia doesn't, like file_stem, with_extension, and with_file_name, but doesn't have as much as Pathlib.
Most Racket functions that accept filesystem paths take either strings or path objects, with strings being put through string->path on demand.
https://www.oxinabox.net/2016/09/14/an-algebraic-structure-for-path-schema-take2.html
The classic library for paths in Julia. Gets a lot of things right, but suffers from a few fatal flaws (such as type instability).
Tuple{Vararg{String}} for segments makes access/modification O(1), but makes any operations that require the full path O(depth), which isn't great.
FilePaths2 uses a String for the overall path, and returns SubStrings for segments, which is a rather nice approach IMO.
One of the distinguishing aspects of FilePaths2 is that it implements paths as a subtype of AbstractTree. I find this a rather interesting embrace of the nature of the filesystem as a tree (though arguably, with symlinks it is a directed graph).
While we want to end up with a Path type, it would be good to take a step back, consider what makes a "path" conceptually, and define an abstract type we can then specialise on.
I propose that in the abstract, a path is an ordered series of directions that takes you to a location.
From this, we can conceptualise a path as a list of direction segments, and arrive at a few fundamental operations:
root: the origin point
parent: the sequence of directions up to but excluding the most recent one
length: the number of directions in the path
iterate: give each direction of the path
basename: the most recent segment
children: the immediate next paths one may take
joinpath: combine two sets of directions
A path that includes a root is considered absolute, and other paths are relative.
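As a rough illustration of how these operations fit together, the sketch below defines a toy path as an optional root plus a vector of segments. AbstractPath, ToyPath, root, and isabsolute are hypothetical names chosen for this example, not a proposed API, and children is left out because it requires consulting a filesystem.

```julia
# A sketch only: `AbstractPath` and `ToyPath` are hypothetical names.
abstract type AbstractPath end

# A toy concrete path: an optional root plus a vector of segments.
struct ToyPath <: AbstractPath
    root::Union{String, Nothing}   # `nothing` for relative paths
    segments::Vector{String}
end

root(p::ToyPath) = p.root
isabsolute(p::ToyPath) = p.root !== nothing

Base.length(p::ToyPath) = length(p.segments)
Base.eltype(::Type{ToyPath}) = String
Base.iterate(p::ToyPath, state...) = iterate(p.segments, state...)

# Everything up to, but excluding, the most recent direction.
# Returns `nothing` when there are no directions left to drop.
Base.parent(p::ToyPath) =
    isempty(p.segments) ? nothing : ToyPath(p.root, p.segments[1:end-1])

# The most recent segment, or `nothing` for an empty sequence.
Base.basename(p::ToyPath) = isempty(p.segments) ? nothing : last(p.segments)

# Combine two sets of directions; joining an absolute path onto anything
# replaces what came before, mirroring `joinpath` on strings.
Base.joinpath(a::ToyPath, b::ToyPath) =
    isabsolute(b) ? b : ToyPath(a.root, vcat(a.segments, b.segments))
```

With length and iterate defined, collect(ToyPath("/", ["usr", "lib"])) yields the segments in order without any extra code.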
Thinking of a path as a representation of a location, the existence of the . and .. pseudopath components complicates path considerations:
Are foo/bar/.., foo/., and foo equal?
Is foo/bar/ the same as foo/bar?
Are foo\\bar and foo/bar equivalent on Windows?
These questions tend to fall under path normalisation, but by using a dedicated path type and path-specific operations I contend that we can decide on rules for a canonical representation of a path, and make it the only form that can be constructed. There is no need for path normalisation, as it is no longer possible to construct an abnormal path.
Frames also discusses the algebraic appeal of normalised paths in her blog post. It also feels like the more principled choice to me, and I suspect it makes it harder to fall into a few edge cases.
It has come to my attention that due to symlinks the first question cannot be answered in the context of a real/concrete path without querying the filesystem. On-disk, foo/bar/.. != foo with symlinks.
This has been the cause of much consternation, and extensive discussion on Slack. Various ideas were discussed on how to best handle this complication, including:
Calling realpath in the background
Returning nothing or throwing an error when parent is called on a path ending in ..
While mired in the messy filesystem details, Julius made the excellent point that the filesystem is in a constant state of flux (realpath can be wrong the moment after it returns), and so it's overly presumptuous of us to try to handle this using information in the path object itself (using the type domain or runtime information).
So, we should be upfront about this and just say that if you want operations on a path to take into account the filesystem state at the time, you need to call realpath.
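The difference between lexical and filesystem-aware handling of .. can be demonstrated with functions that exist in Base today. The sketch below builds a throwaway directory layout (the names target, deep, and link are arbitrary) and compares normpath, which works purely on the string, with realpath, which asks the filesystem. It assumes a platform where the current user may create symlinks, which on Windows can require extra privileges.

```julia
# Build a throwaway layout where `link` points at `target/deep`:
#   tmp/target/deep/
#   tmp/link -> tmp/target/deep
tmp = realpath(mktempdir())
mkpath(joinpath(tmp, "target", "deep"))
symlink(joinpath(tmp, "target", "deep"), joinpath(tmp, "link"))

via_link = joinpath(tmp, "link", "..")

# Purely lexical normalisation cancels `link/..`, giving `tmp`
# (plus a trailing separator).
lexical = normpath(via_link)

# Asking the filesystem follows the symlink first, giving `tmp/target`.
resolved = realpath(via_link)
```

The two results name different directories, and the resolved one is only guaranteed to be accurate at the moment realpath returned.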
Note: Should realpath be resolve post-Path? It would also be good to have an unnormalised-string to real-Path function (ideally the same function).
This has the following benefits:
There are some characters that may not appear in a path.
Posix: the null byte (\0).
Windows: |, the null byte (\0), and ASCII codes \x01 ~ \x1f (1 to 31).
There are also some restrictions on filenames:
Posix:
/ and the null byte (\0)
. and .. (reserved)
Windows:
<, >, :, ", \, /, |, ?, and *.
The null byte (\0)
ASCII codes \x01 ~ \x1f (1 to 31).
CON, PRN, AUX, NUL, COM1 … COM9, LPT1 … LPT9 regardless of extension
Names ending in a space or .
We will also pretend that the empty path is never allowed (see: Compromises).
References:
GetInvalidFileNameChars(), GetInvalidPathChars() - dotnet/runtime
It's fairly easy to apply the Posix path restrictions when constructing paths, but Windows is a bit of a pain, making me think perhaps it's not worth the effort.
Since the Windows restrictions are a (large) superset of the Posix restrictions, one approach that I'd like to explore is validating that the Posix requirements are met during path construction, and then maybe checking for forms Windows wouldn't like in literal path construction (with the p"" macro) and emitting a warning.
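A minimal sketch of that two-tier idea follows: a hard check that enforces the Posix filename rules, and a soft check that only reports what Windows would object to, so a caller could turn the result into a warning. The function and constant names are hypothetical, and the Windows rules are limited to the ones listed in this document (reserved characters, control characters 1 to 31, reserved device names, and a trailing space or period).

```julia
# Hypothetical helper names; the rules are the ones listed in this document.
const WINDOWS_RESERVED_CHARS = ('<', '>', ':', '"', '\\', '|', '?', '*')
const WINDOWS_RESERVED_NAMES = Set(vcat(
    ["CON", "PRN", "AUX", "NUL"],
    ["COM$i" for i in 1:9],
    ["LPT$i" for i in 1:9]))

# Hard check: throw unless `segment` is a valid Posix file name.
function check_posix_segment(segment::AbstractString)
    isempty(segment) && throw(ArgumentError("empty path segment"))
    for c in ('/', '\0')
        c in segment && throw(ArgumentError(
            "path segment $(repr(segment)) contains $(repr(c))"))
    end
    segment
end

# Soft check: collect the reasons Windows would reject `segment`.
# Assumes `segment` has already passed `check_posix_segment`.
function windows_complaints(segment::AbstractString)
    complaints = String[]
    for c in segment
        if c in WINDOWS_RESERVED_CHARS
            push!(complaints, "reserved character $(repr(c))")
        elseif '\x01' <= c <= '\x1f'
            push!(complaints, "control character $(repr(c))")
        end
    end
    # Reserved device names apply regardless of extension.
    name = uppercase(first(split(segment, '.')))
    name in WINDOWS_RESERVED_NAMES && push!(complaints, "reserved name $name")
    if endswith(segment, ' ') || endswith(segment, '.')
        push!(complaints, "trailing space or period")
    end
    complaints
end
```

Keeping the Windows check as a list of complaints rather than an error leaves the decision (warn, error, or ignore) to the call site, which matches the idea of warning only for literal paths.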
Posix exclusively uses the / delimiter, and Windows accepts \ (preferred) or /.
As such, we can reasonably settle on / as the in-Julia syntax for paths, and handle operating system dependent normalisation in the background. This makes it impossible to accidentally hardcode a particular platform's delimiters.
~ home shorthand
The handling of ~ in paths in Julia tends to trip up people used to shell expansion, but there's a very good reason why Julia doesn't go ahead and interpret "~/dir" as /home/$USER/dir but requires expanduser: ~ is a valid path segment.
Without knowing the intent with which the ~ was written (or generated and passed around) it is not possible to reasonably decide whether it should be interpreted as a "~" segment or a reference to the home directory.
With a path macro, this changes. We can differentiate between a ~ that has been put literally at the start of a path, and a ~ that's come from elsewhere. This makes the convenient ~-home interpretation viable, without re-introducing the current issues. As a tradeoff, expressing an initial ~ segment becomes less convenient, but given the relative frequency of home vs. "~" forms, this seems worthwhile.
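The distinction between a literal ~ and one that arrives at runtime can be sketched with an ordinary string macro. The toy p_str below is a stand-in for the proposed p"" macro and is an assumption in every detail: it returns plain strings rather than a path type, supports no interpolation, and only recognises ~ when it is the entire literal or is followed by /. The helper expand_tilde_literal is likewise a hypothetical name.

```julia
# Decide, from the literal text alone, what expression the macro produces.
function expand_tilde_literal(literal::AbstractString)
    if literal == "~"
        :(homedir())
    elseif startswith(literal, "~/")
        # Drop the leading "~/" and join the rest onto the home directory.
        :(joinpath(homedir(), $(literal[3:end])))
    else
        # Anything else, including a `~` elsewhere in the text, is kept as-is.
        String(literal)
    end
end

# A toy stand-in for the proposed `p""` macro (hypothetical behaviour).
macro p_str(literal)
    expand_tilde_literal(literal)
end
```

Under these assumptions p"~/projects" evaluates to the projects directory under homedir(), while joinpath(dir, name) with name == "~" still produces a segment literally called ~, because the runtime value never passes through the macro.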
@ project shorthand
Note: I'm not completely sold on this idea, but I'm interested and it seems worth exploring.
Within package and project code, it is common to see forms like joinpath(@__DIR__, "..", "..", "assets", "file.txt")
. There are two major issues with this:
@__DIR__, "..", ".."
) only makes this less apparent.
Extending the "special literal prefix" handling to treat @ as a project-prefix, just as ~ is a user-prefix, can improve this situation. The choice of @ seems natural given the existing use of @-prefixed special paths in DEPOT_PATH and --project already.
Besides the two issues above, a @-prefix also provides an opportunity to improve the status quo with regard to relocatability. Enough Julia packages use @__DIR__ in paths to make relocatability a general issue (motivating RelocatableFolders.jl and julia/PR#55146). Implementing @ as a relocatable project-relative path (determined at compile-time) creates a form that is both more convenient and more robust, a "pit of success".
It seems likely to be useful to still be able to model paths of other platforms, and we can do this fairly easily without compromising ergonomics by defining <Platform>Path types and then aliasing Path to the current platform.
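A minimal sketch of the aliasing idea, under the assumption that each platform type stores its segments and only differs in how they are rendered for the operating system. SystemPath, PosixPath, WindowsPath, separator, and native are hypothetical names used for illustration, and roots and drives are ignored to keep the example short.

```julia
# Hypothetical names throughout; relative paths only, for brevity.
abstract type SystemPath end

struct PosixPath <: SystemPath
    segments::Vector{String}
end

struct WindowsPath <: SystemPath
    segments::Vector{String}
end

# The native separator used when handing a path to the operating system.
separator(::Type{PosixPath}) = '/'
separator(::Type{WindowsPath}) = '\\'

# `Path` is whichever flavour matches the running platform.
const Path = Sys.iswindows() ? WindowsPath : PosixPath

# Render a path with the separator of its own flavour.
native(p::P) where {P <: SystemPath} = join(p.segments, separator(P))
```

Code written against Path gets the right behaviour for the current platform, while code that needs to reason about another platform's paths can name WindowsPath or PosixPath explicitly.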
Using a contiguous null terminated char array (whether a String, Memory{UInt8}, or Vector{UInt8}) for the internal representation of system paths makes it possible to pass the path representation directly to Libuv with no overhead.
Since we've got good reason to want a single contiguous on-disk format, operations that fetch components of the path will either have to allocate a new object … or we can use a SubString. This is the approach FilePaths2.jl takes, and I like it.
Since we're viewing a path as an ordered sequence of directions, it makes sense to be able to iterate through them. Together with length, this also makes it possible to simply collect the segments of a path.
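The combination of a contiguous string with SubString segments can be sketched as follows. FlatPath is a hypothetical type, and the sketch assumes the stored string is already in a canonical relative form: segments separated by single / characters, with no leading or trailing separator.

```julia
# Hypothetical type: one contiguous string, segments handed out as views.
struct FlatPath
    data::String   # assumed canonical: segments separated by single '/'
end

Base.IteratorSize(::Type{FlatPath}) = Base.HasLength()
Base.eltype(::Type{FlatPath}) = SubString{String}
Base.length(p::FlatPath) = isempty(p.data) ? 0 : count(==('/'), p.data) + 1

# The iteration state is the index at which the next segment starts.
function Base.iterate(p::FlatPath, start::Int = 1)
    start > ncodeunits(p.data) && return nothing
    stop = findnext('/', p.data, start)
    if stop === nothing
        # Last segment: runs to the end of the string.
        SubString(p.data, start), ncodeunits(p.data) + 1
    else
        # Segment ends just before the separator; resume just after it.
        SubString(p.data, start, prevind(p.data, stop)), stop + 1
    end
end
```

Each segment returned is a view into the original string, so iterating or collecting does not copy the path's bytes.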
When interpreting externally provided content as a path, the existence of pseudopath elements introduces a risk of ending up in an unexpected directory. The intent of code like joinpath(workdir, subdirname) can be subverted (deliberately or accidentally) by providing a subdirname like /some/other/dir, ../stuff, the empty string, or even a null byte. This is most of the Path Traversal class of CVEs. When interpreting a string as a path segment, we can validate that it is a "normal" path segment and raise an error otherwise, preventing a surprising result from appearing with forms like p"path/$var/$name.txt".
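The validation described here can be sketched as a small guard around joinpath. safe_child is a hypothetical helper, not an existing function, and it is deliberately strict: it also rejects \ on every platform, which goes beyond what Posix requires.

```julia
# Hypothetical helper: join `name` under `workdir` only if it is a single,
# ordinary path segment.
function safe_child(workdir::AbstractString, name::AbstractString)
    if isempty(name)
        throw(ArgumentError("empty path segment"))
    elseif name == "." || name == ".."
        throw(ArgumentError("pseudopath segment $(repr(name))"))
    elseif any(c -> c == '/' || c == '\\' || c == '\0', name)
        throw(ArgumentError(
            "segment $(repr(name)) contains a separator or null byte"))
    end
    joinpath(workdir, name)
end
```

With this guard, safe_child(workdir, "../stuff") throws instead of silently escaping workdir, which is the behaviour interpolation into a path literal would want by default.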
In conversation, I've received a fair bit of pushback from multiple individuals on the normalisation I've proposed in the prior section. The essential argument is that the nice, algebraic model of paths isn't able to fully abstract over the world of system-specific details. To give a few examples:
stat foo/bar is different to stat foo/bar/.
Tools like rsync treat foo differently to foo/
/.. is / (root is essentially a fixed point on Linux)
I really dislike these complications, particularly because in accepting this messiness we abdicate the handling of it to the user, and thus make it easier to write naively buggy code.
Looking at this another way, even more pushback along these lines is deserved. There is an abundance of unquestioned issues of this nature with the current status quo. For example: it is currently not possible to write to a file and then move it without there being an opportunity for the file to be replaced entirely in between each step.
This kind of issue and much of the pushback I've received essentially stems from one core issue: we often like to think of paths as unique resource descriptors, when in fact they are unique location descriptors. This is a subtle but important distinction. It is responsible for a large class of bugs and vulnerabilities known as TOCTTOU (Time of Check to Time of Use). Essentially any time a path is reused with any degree of outside influence (over the path or filesystem), it is near trivial to swap out the file in between operations by constructing a deep directory nesting and monitoring the directory atimes (yes, really). The only system where we can truthfully say this is not an issue is one with no concurrency of any sort (including time-sharing).
This is in large part a consequence of the initial POSIX standard being path oriented, a limitation that is gradually being rectified with the addition of f<op> and <op>at calls, such as faccessat. These calls operate not on a path to the file, but on a handle to the file itself: a file descriptor, or FD for short. File descriptors essentially sit in between a description of the location of a resource, and the data on disk. I am less familiar with the NT situation, but am led to believe that it has been ahead of *nix in supporting handle-based path operations.
Other programming languages have also recognised this issue. For example, Python's Pathlib separates paths into pure and concrete paths, creating a clear split between an abstract conception of a path (as I've found myself attracted to thus far), and something that actually exists on the filesystem. I suspect we can go even further, and consider a scheme by which we provoke the user into obtaining a reference to the resource at a path when they want to work on it, and so avoid TOCTTOU-style issues and related messiness wholesale.
I conjecture that with the development of the file descriptor based API in POSIX 2008, Linux 2.6, and OpenBSD we have the capability to fulfil this ideal by building a path-like type that is oriented around file descriptors rather than path strings. We can make a more concrete path type than Pathlib's "concrete" paths. Arguably this isn't really a path any more so much as a handle to a filesystem-addressed resource. I'm not sure what best to call this, but regardless it seems exceptionally useful for writing safe filesystem-interacting code.
There are a few downsides to this approach, which largely stem from the need to actually acquire a file descriptor.
TODO
TODO
From all the investigation we've done so far, we know that:
readdir usage, and for other TOCTTOU-safe filesystem operations
Currently, we just use String for all of these purposes. This is "simple" in the sense that all of the inherent complexity is put off to the user of the API to think about. By contrast, this trio of system path types requires a little more upfront thinking, but this is paid for several times over in the reduction of edge cases that package developers and end users may hit.
While I like this set of design goals, they're ultimately a compromise between various concerns, and so produce some potential pain points. This should be mitigated as much as possible.
With the separation of concerns of the design, path operations are very predictable. However, if pseudopaths are present and/or symlinks need to be accounted for, realpath will need to be called.
This is something that people using Path objects will simply need to remember to do, and so we will sprinkle mention of this liberally into the documentation.
Made easier via interpolation within the p"" macro.
If one reads on Linux Pathname Lookup, one may notice that while empty segments are generally invalid, with the appropriate flags it can be valid to interact with the empty path.
I cannot begin to imagine any legitimate use case for this, and so am inclined to pretend this edge case doesn't exist, particularly since there's no easy way to supply the necessary flag to Julia's (public) filesystem functions.
Is / as the path root on Windows good enough?
uperm?
.. and requiring realpath when you need to guard against symlink shenanigans?
\\?\), for long file name support?
Ideally we'd use a time-travel machine to shoehorn this into Julia 1.0, but the second best time to add a path type to Julia is now.
Avoiding breaking changes means we can't remove papercuts like eachline(::String), but we can provide a better alternative, gradually adopt it, and push for it to become the status quo in the long term.
https://code.tecosaur.net/tec/julia-basic-paths
If you'd like to make a PR etc. this is also now mirrored to GitHub: https://github.com/tecosaur/julia-basic-paths
I'm happy to take feedback in any form you're willing to give it. If easy/possible, I like receiving a .patch with inline comments.