
Charting a Path for Julia

Work in progress

The case for Path as a fundamental type that belongs in Base Julia, and a proposal for implementing it well.

Authors: Timothy, [add your name if you've made substantial edits]
Reviewed-by: [your name here]

History

Julia's approach to file paths was largely inspired by Python just before Pathlib was adopted. In the years since, the idea that a path type would benefit Julia has been articulated multiple times, in different ways.

  • In 2013 the path methods we know today were introduced.
  • In 2014, @stevengj made an issue proposing a mildly cursed partial workaround for the lack of a path type.
  • In 2016, FilePathsBase.jl was started.
  • In 2017 (just before Julia 1.0), Frames wrote a Julep advocating for a path type, but unfortunately it didn't go anywhere before Julia 1.0 was out.
  • In 2018 this was incidentally mentioned in a Discourse topic.
  • In 2020, a newcomer from Common Lisp opened a Discourse topic about missing a path type.
  • In 2020, an issue was opened in the main Julia repo on this.
  • In 2021 Jakob wrote a post that stuck in my mind examining flaws in Julia, including the lack of a path type in the language.
  • In 2022, ExpandingMan started working on FilePaths2.jl.
  • In 2024, I got fed up enough with this after the latest papercut I experienced to write a gripe on Slack.

Julia's Slack only keeps 90 days of conversation history, but you can usually search for "path type" and find somebody running into papercuts/headaches. Doing this now (and ignoring my recent gripe), I see Mosé responding to some platform-specific handling that came up on a Julia PR addressing the difference a trailing slash makes with some depot operations:

We should really have a proper path type, strings are simple bad for manipulating them (💯 ×2)

Motivation

While C and friends use char-vector types for strings, paths, and more, most modern high-level languages have settled on a dedicated path type, and for good reason. I hold that the value of a dedicated path type, and the importance of it being a part of the base language/stdlib, is self-evident to anybody who has substantial experience with a language that has one. Julia already has dedicated non-String types for regular expressions, substitution strings, and more. There are many reasons why a path type benefits Julia, some of which are:

  • Resolving the ambiguity of representing data as a string directly vs. a path to the data
  • Allowing for dispatch on text content/paths, and more generic functions
  • A platform-independent syntax and methods for working with paths
  • Fewer footguns/papercuts, from reduced ambiguity and more rigorous handling

This path type must exist in Julia's base, for two primary reasons:

  • Base itself makes extensive use of paths
  • A 3rd party library simply isn't good (blessed) enough

Prior art

Python's Pathlib

https://docs.python.org/3/library/pathlib.html

Python's pathlib is generally praised for offering an ergonomic way of handling filesystem paths.


Posix and Windows paths

Pathlib provides per-platform path classes, and aliases Path/PurePath based on the current platform.

Pure and Concrete paths

Pathlib separates out purely conceptual and filesystem-grounded paths as pure and concrete paths.

Operating on pure paths does not involve any interaction with the filesystem, while concrete paths check for symlinks, resolve symlinks, and verify various path operations using the filesystem.

Solid path API

The Pathlib API provides the basics (parent, parts, joinpath, home), and also a decent collection of utilities on top:

  • suffix
  • suffixes
  • stem
  • with_name
  • with_stem
  • with_suffix
  • with_segments
  • from_uri
  • as_uri

Currently Julia covers the basics, but could probably do with some more convenience functions.
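To make the gap concrete, here is a minimal sketch of how a few of the Pathlib conveniences could be expressed today using only existing Base functions (`basename`, `dirname`, `splitext`, `joinpath`). The names `stem`, `suffix`, `with_suffix`, and `with_name` are hypothetical helpers borrowed from Pathlib for illustration; they are not part of Base.

```julia
# Hypothetical helpers (not part of Base), built only from existing Base functions.

# Final component without its last extension, like pathlib's `stem`.
stem(path::AbstractString) = first(splitext(basename(path)))

# Last extension including the leading dot, like pathlib's `suffix`.
suffix(path::AbstractString) = last(splitext(basename(path)))

# Replace the last extension, like pathlib's `with_suffix`.
with_suffix(path::AbstractString, ext::AbstractString) =
    first(splitext(path)) * ext

# Replace the final component, like pathlib's `with_name`.
with_name(path::AbstractString, name::AbstractString) =
    joinpath(dirname(path), name)

println(stem("dir/archive.tar.gz"))
println(suffix("dir/archive.tar.gz"))
println(with_suffix("dir/archive.tar.gz", ".xz"))
println(with_name("dir/a.txt", "b.txt"))
```

Each of these is a one-liner, which is arguably the point: they are easy to write, but everybody ends up writing them slightly differently.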

Rust's paths

https://doc.rust-lang.org/std/path/index.html

Immutable and mutable path types

Rust has two major path types:

  • Path: an immutable unsized type similar to str, and
  • PathBuf: a mutable (owned), growable path buffer similar to String

Having an owned type is a concern specific to Rust, but the idea of a type optimised for mutation is interesting.

Platform-agnostic design

You can just use Path/PathBuf without worrying about what platform you're on. +1 for ergonomics.

Path operations API

Rust sits somewhere between Pathlib and Julia in terms of convenience functions. Beyond the basics it's got a few things that Julia doesn't, like file_stem, with_extension, and with_file_name, but it doesn't have as much as Pathlib.

Racket's path

https://docs.racket-lang.org/reference/pathutils.html

Dual representation

Most Racket functions that take a filesystem path accept both strings and path objects, with strings being put through string->path on demand.

Comprehensive API

Modern CL paths library

https://codeberg.org/fosskers/filepaths

Strong points

Frames' proposal

https://www.oxinabox.net/2016/09/14/an-algebraic-structure-for-path-schema-take2.html

Paths as an algebraic structure

Canonical normalisation

FilePathsBase.jl

https://github.com/rofinn/FilePathsBase.jl

The classic library for paths in Julia. It gets a lot of things right, but suffers from a few fatal flaws (such as type instability).

Areas for improvement

  • Being type unstable
  • Using a Tuple{Vararg{String}} for segments makes access/modification O(1), but makes any operations that require the full path O(depth) which isn't great.

FilePaths2.jl

https://gitlab.com/ExpandingMan/FilePaths2.jl

Efficient segment access

FilePaths2 uses a String for the overall path, and returns SubStrings for segments, which is a rather nice approach IMO.
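As a rough illustration of the idea (not FilePaths2's actual implementation), Base's `split` already returns `SubString` views into the original `String`, so segments can be produced without copying any bytes. `path_segments` is a hypothetical name, and the sketch assumes `/` as the only delimiter.

```julia
# Hypothetical helper: the segments of a '/'-delimited path, as views.
path_segments(path::String) = split(path, '/'; keepempty=false)

path = "/usr/local/share/julia"
segs = path_segments(path)

# Each segment is a SubString that records only an offset and a length
# into the original String.
for s in segs
    println(s, " (offset ", s.offset, ", ", s.ncodeunits, " bytes)")
end
```

The full path remains a single contiguous `String`, which is what makes operations on the whole path cheap.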

Implementing paths as nodes of a tree

One of the distinguishing aspects of FilePaths2 is that it implements paths as a subtype of AbstractTree. I find this a rather interesting embrace of the nature of the filesystem as a tree (though arguably, with symlinks it is a directed graph).

Proposal

Design goals

High level path interface

While we want to end up with a Path type, it would be good to take a step back, consider what makes a "path" conceptually, and define an abstract type we can then specialise on.

I propose that in the abstract, a path is an ordered series of directions that takes you to a location.

From this, we can conceptualise a path as a list of direction segments, and arrive at a few fundamental operations:

  • root: the origin point
  • parent: the sequence of directions up to but excluding the most recent one
  • length: the number of directions in the path
  • iterate: give each direction of the path
  • basename: the most recent segment
  • children: the immediate next paths one may take
  • joinpath: combine two sets of directions

A path that includes a root is considered absolute, and other paths are relative.
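The operations above can be sketched with a deliberately naive type. Everything here is an assumption made for illustration: the name `SegPath`, the `Vector{String}` layout (a real implementation would likely use a single contiguous buffer), and the choice to throw when joining onto an absolute path. `children` is left out because it needs to consult the filesystem.

```julia
# A sketch only: `SegPath` and its layout are assumptions for illustration.
struct SegPath
    absolute::Bool            # true when the path starts at the root
    segments::Vector{String}  # the ordered directions
end

# Parse a '/'-delimited string (empty segments are dropped).
SegPath(s::AbstractString) =
    SegPath(startswith(s, "/"), String[seg for seg in split(s, '/'; keepempty=false)])

# length and iterate: the number of directions, and each direction in turn.
Base.length(p::SegPath) = length(p.segments)
Base.iterate(p::SegPath, state...) = iterate(p.segments, state...)
Base.eltype(::Type{SegPath}) = String

# basename: the most recent direction (errors on a path with no segments).
Base.Filesystem.basename(p::SegPath) = last(p.segments)

# parent: every direction except the most recent one.
Base.parent(p::SegPath) = SegPath(p.absolute, p.segments[1:end-1])

# joinpath: follow `a`, then `b`. An absolute `b` is rejected.
function Base.Filesystem.joinpath(a::SegPath, b::SegPath)
    b.absolute && throw(ArgumentError("cannot join onto an absolute path"))
    SegPath(a.absolute, vcat(a.segments, b.segments))
end

Base.:(==)(a::SegPath, b::SegPath) =
    a.absolute == b.absolute && a.segments == b.segments

p = SegPath("/usr/local/bin")
println(length(p), " segments: ", collect(p))
println(basename(p))
println(parent(p) == SegPath("/usr/local"))
```

Because `length` and `iterate` are defined, generic functions such as `collect` work on the path with no further effort.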

Avoid representational ambiguities

Thinking of a path as a representation of a location, the existence of the . and .. pseudopath components complicates path considerations:

  • are foo/bar/.., foo/., and foo equal?
  • is foo/bar/ the same as foo/bar?
  • are foo\\bar and foo/bar equivalent on Windows?

These questions tend to fall under path normalisation, but by using a dedicated path type and path-specific operations I contend that we can decide on rules for a canonical representation of a path, and make it the only form that can be constructed. There is no need for path normalisation, as it is no longer possible to construct an abnormal path.

Frames also discusses the algebraic appeal of normalised paths in her blog post. It also feels like the more principled choice to me, and I suspect it makes it harder to fall into a few edge cases.

It has come to my attention that due to symlinks the first question cannot be answered in the context of a real/concrete path without querying the filesystem. On-disk, foo/bar/.. != foo with symlinks.

This has been the cause of much consternation, and extensive discussion on Slack. Various ideas were discussed on how to best handle this complication, including:

  • Calling realpath in the background
  • Returning nothing or throwing an error when parent is called on a path ending in ..
  • Using type/field information to split types into pure/concrete types a la Pathlib, and handle them differently

While we were mired in the messy filesystem details, Julius made the excellent point that, given the filesystem is in a constant state of flux, realpath can be wrong the moment after it returns, and so it's overly presumptuous of us to try to handle this using information in the path object itself (using the type domain or runtime information).

So, we should be upfront about this and just say that if you want operations on a path to take into account the filesystem state at the time, you need to call realpath.

Note Should realpath be renamed resolve post-Path? It would also be good to have a function that takes an unnormalised string to a real Path (ideally the same function).

This has the following benefits:

  • Simplified model for path operations
  • Predictable normalisation
  • Separation of concerns
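The symlink complication discussed above can be observed with today's String-based functions: `normpath` resolves `..` textually, while `realpath` asks the filesystem. The sketch below is Unix-only (creating symlinks on Windows may need extra privileges), and `dotdot_demo` is a hypothetical name.

```julia
# Build tmp/a/b plus a symlink tmp/link -> tmp/a/b, then resolve tmp/link/..
# both lexically and on disk.
function dotdot_demo()
    mktempdir() do tmp
        mkpath(joinpath(tmp, "a", "b"))
        symlink(joinpath(tmp, "a", "b"), joinpath(tmp, "link"))
        p = joinpath(tmp, "link", "..")
        lexical = normpath(p)   # textual: cancels "link" against ".."
        ondisk = realpath(p)    # follows the symlink first, then applies ".."
        # Canonicalise everything before the temporary directory is removed.
        (lexical = realpath(lexical), ondisk = ondisk,
         tmp = realpath(tmp), a = realpath(joinpath(tmp, "a")))
    end
end

if Sys.isunix()
    r = dotdot_demo()
    println("lexical answer is tmp:  ", r.lexical == r.tmp)
    println("on-disk answer is tmp/a: ", r.ondisk == r.a)
end
```

Both answers are defensible; the point is that only the caller knows which one is wanted, and only `realpath` reflects the disk at the moment it is called.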

Make invalid paths unconstructable

There are some characters that may not appear in a path.

  • Posix (Linux/BSD/Mac): the null byte (\0).
  • Windows: |, the null byte (\0), and ASCII control codes \x01 to \x1f (1 to 31).

There are also some restrictions on filenames:

  • Posix:
    • / and the null byte (\0).
    • Reserved file names: . and ..
  • Windows: This is a superset of the restrictions for Path
    • Reserved characters: <, >, :, ", \, /, |, ?, and *.
    • null byte (\0)
    • ASCII control codes \x01 to \x1f (1 to 31).
    • Reserved names: CON, PRN, AUX, NUL, COM1–COM9, LPT1–LPT9 regardless of extension
    • Any filename ending with a space or .

We will also pretend that the empty path is never allowed (see: Compromises).

References:

It's fairly easy to apply the Posix path restrictions when constructing paths, but Windows is a bit of a pain, making me think perhaps it's not worth the effort.

Since the Windows restrictions are a (large) superset of the Posix restrictions, one approach that I'd like to explore is validating the Posix requirements are met during path construction, and then maybe checking for forms Windows wouldn't like in literal path construction (with the p"" macro) and emitting a warning.
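A minimal sketch of that two-tier approach follows: a strict Posix check that would reject a segment outright, and a softer Windows check that could drive a warning. All names here are hypothetical, and the Windows rules encoded are the ones listed above (reserved characters, control characters 1 to 31, reserved device names regardless of extension, and a trailing space or dot). Matching reserved names case-insensitively is an assumption about how strict the check should be.

```julia
# Hypothetical helpers; the rules are those listed in this section.
const WINDOWS_RESERVED_CHARS = ('<', '>', ':', '"', '\\', '|', '?', '*')
const WINDOWS_RESERVED_NAMES = Set(vcat(["CON", "PRN", "AUX", "NUL"],
    ["COM$i" for i in 1:9], ["LPT$i" for i in 1:9]))

# Posix rule for one segment: non-empty, no '/', no NUL, not "." or "..".
function is_valid_posix_segment(seg::AbstractString)
    !isempty(seg) && seg != "." && seg != ".." &&
        !any(c -> c == '/' || c == '\0', seg)
end

# Would Windows object to this segment? Assumes the Posix check passed.
function windows_unfriendly(seg::AbstractString)
    any(c -> (c in WINDOWS_RESERVED_CHARS) || ('\x01' <= c <= '\x1f'), seg) ||
        endswith(seg, " ") || endswith(seg, ".") ||
        (uppercase(first(split(seg, '.'))) in WINDOWS_RESERVED_NAMES)
end

# Error on Posix-invalid segments, warn on Windows-unfriendly ones.
function check_segment(seg::AbstractString)
    is_valid_posix_segment(seg) ||
        throw(ArgumentError("invalid path segment: $(repr(seg))"))
    windows_unfriendly(seg) &&
        @warn "Path segment would not be accepted on Windows" seg
    seg
end

check_segment("notes.txt")    # passes silently
check_segment("report?.txt")  # passes, with a warning
```

In the proposal the warning would only make sense for literal paths, where the author can act on it.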

Cross-platform path construction

Posix exclusively uses the / delimiter, and Windows accepts \ (preferred) or /.

As such, we can reasonably settle on / as the in-Julia syntax for paths, and handle operating system dependent normalisation in the background. This makes it impossible to accidentally hardcode a particular platform's delimiters.

Convenient prefixes

The ~ home shorthand

The handling of ~ in paths in Julia tends to trip up people used to shell expansion, but there's a very good reason why Julia doesn't go ahead and interpret "~/dir" as /home/$USER/dir but requires expanduser: ~ is a valid path segment.

Without knowing the intent with which the ~ was written (or generated and passed around) it is not possible to reasonably decide whether it should be interpreted as a "~" segment or a reference to the home directory.

With a path macro, this changes. We can differentiate between a ~ that has been put literally at the start of a path, and a ~ that's come from elsewhere. This makes the convenient ~-home interpretation viable, without re-introducing the current issues. As a tradeoff, expressing an initial ~ segment becomes less convenient, but given the relative frequency of home vs. "~" forms, this seems like a worthwhile tradeoff.
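A minimal sketch of just this rule, assuming a string macro named `p` that (for simplicity) returns a plain `String` rather than a path type and ignores interpolation. Since the macro receives the literal text at expansion time, it can tell a hand-written leading `~` apart from one that arrives in a variable at run time.

```julia
# Sketch of the `~` rule only. A real `p""` macro would build a path type
# and handle interpolation; this one returns Strings.
macro p_str(s)
    if s == "~"
        :(homedir())
    elseif startswith(s, "~/")
        rest = s[3:end]
        :(joinpath(homedir(), $rest))
    else
        s   # anything else is passed through untouched
    end
end

println(p"~/dir")    # a literal leading ~ means the home directory
println(p"data/~")   # a ~ elsewhere is an ordinary segment

name = "~"           # a ~ that comes from a variable is never expanded
println(joinpath("data", name))
```

With this rule, writing a leading literal `~` segment would need some other spelling, which is the tradeoff mentioned above.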

Introducing @ project shorthand

Note I'm not completely sold on this idea, but I'm interested and it seems worth exploring

Within package and project code, it is common to see forms like joinpath(@__DIR__, "..", "..", "assets", "file.txt"). There are two major issues with this:

  1. Poor clarity of intent: this is an attempt to express a target location within a project, but the path is expressed relative to the current file (wherever it may be within the project) rather than the project itself. The fact the form is twisted as a result (with @__DIR__, "..", "..") only makes this less apparent.
  2. As a result of the poor clarity/expression, this form is slightly fragile: moving the source file around will break the reference, even if the target remains in the same location within the project.

Extending the "special literal prefix" handling to treat @ as a project-prefix, just as ~ is a user-prefix, can improve this situation. The choice of @ seems natural given the existing use of @-prefixed special paths in DEPOT_PATH and --project.

Besides the two issues above, a @-prefix also provides an opportunity to improve the status quo with regard to relocatability. Enough Julia packages use @__DIR__ in paths to make relocatability a general issue (motivating RelocatableFolders.jl and julia/PR#55146). Implementing @ as a relocatable project-relative path (determined at compile-time) creates a form that is both more convenient and more robust, a "pit of success".
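To illustrate what "project-relative" means, here is a run-time sketch that walks upwards from a directory until it finds a `Project.toml`. `project_root` is a hypothetical helper; the proposal is for this lookup to happen at compile time and be relocatable, which this sketch does not attempt.

```julia
# Hypothetical helper: nearest ancestor directory containing Project.toml,
# or `nothing` if the filesystem root is reached first.
function project_root(dir::AbstractString)
    dir = abspath(dir)
    while true
        isfile(joinpath(dir, "Project.toml")) && return dir
        up = dirname(dir)
        up == dir && return nothing   # reached the filesystem root
        dir = up
    end
end

# Demonstration in a throwaway project layout.
found = mktempdir() do tmp
    touch(joinpath(tmp, "Project.toml"))
    deep = mkpath(joinpath(tmp, "src", "deep"))
    project_root(deep) == tmp
end
println(found)
```

With such a root in hand, the twisted form becomes something like `joinpath(project_root(@__DIR__), "assets", "file.txt")`, which keeps working when the source file moves within the project.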

Platform-specific path types

It seems likely useful to still be able to model paths of other platforms, and we can do this without compromising ergonomics fairly easily by defining <Platform>Path types and then aliasing Path to the current platform.
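A sketch of that arrangement, with placeholder names chosen to avoid implying a final API: one concrete type per platform, plus an alias selected when the code is loaded.

```julia
# Placeholder names; the proposal's own types would be <Platform>Path and Path.
abstract type AbstractPathSketch end

struct PosixPathSketch <: AbstractPathSketch
    path::String
end

struct WindowsPathSketch <: AbstractPathSketch
    path::String
end

# The alias picks the native flavour; the other stays available for modelling
# paths that belong to a different platform.
const NativePathSketch = Sys.iswindows() ? WindowsPathSketch : PosixPathSketch

# Platform differences become ordinary dispatch.
separator(::Type{PosixPathSketch}) = '/'
separator(::Type{WindowsPathSketch}) = '\\'
separator(p::AbstractPathSketch) = separator(typeof(p))

println(separator(NativePathSketch("somewhere")))
println(separator(WindowsPathSketch("C:\\Users")))
```

Most code would only ever mention the alias, so the extra types cost nothing in day-to-day ergonomics.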

Low overhead when invoking Libuv path methods

Using a contiguous null terminated char array (whether a String, Memory{UInt8}, or Vector{UInt8}) for the internal representation of system paths makes it possible to pass the path representation directly to Libuv with no overhead.
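This already holds for `String` today, as the sketch below suggests. It uses the C library's `strlen` purely as a stand-in for a Libuv call: a `String` passed as a `Cstring` argument is handed over as a pointer to its existing buffer rather than copied, and the conversion refuses strings with embedded NUL bytes.

```julia
path = "/tmp/example.txt"

# Julia Strings are stored with a trailing NUL byte, so a `Cstring` argument
# passes the existing buffer to C without copying it.
n = ccall(:strlen, Csize_t, (Cstring,), path)
println(n == ncodeunits(path))

# The `Cstring` conversion rejects embedded NUL bytes with an ArgumentError.
rejected = try
    ccall(:strlen, Csize_t, (Cstring,), "bad\0path")
    false
catch err
    err isa ArgumentError
end
println(rejected)
```

A path type that validates against NUL bytes at construction could, in principle, skip that per-call check as well.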

SubStrings as the type of path components

Since we've got good reason to want a single contiguous on-disk format, operations that fetch components of the path will either have to allocate a new object or we can use a SubString. This is the approach FilePaths2.jl takes, and I like it.

Iterable segments

Since we're viewing a path as an ordered sequence of directions, it makes sense to be able to iterate through them. Together with length, this also makes it possible to simply collect the segments of a path.

Safe path interpolation

When interpreting externally provided content as a path, the existence of the pseudopath element introduces a risk of ending up in an unexpected directory. The intent of code like joinpath(workdir, subdirname) can be subverted (deliberately or accidentally) by providing a subdirname like /some/other/dir, ../stuff, the empty string, or even a null byte. This accounts for most of the Path Traversal class of CVEs. When interpreting a string as a path segment, we can validate that it is a "normal" path segment and raise an error otherwise, preventing a surprising result from appearing with forms like p"path/$var/$name.txt".
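A sketch of the check that interpolation could perform, written against today's String-based `joinpath`. `safe_join` is a hypothetical helper; rejecting `\` everywhere is a conservative assumption (it is a legal filename character on Posix).

```julia
# Hypothetical helper: accept only a single "normal" segment.
function safe_join(base::AbstractString, seg::AbstractString)
    if isempty(seg) || seg == "." || seg == ".." ||
       any(c -> c == '/' || c == '\\' || c == '\0', seg)
        throw(ArgumentError("not a single normal path segment: $(repr(seg))"))
    end
    joinpath(base, seg)
end

workdir = "/srv/work"

# Today, an absolute second argument silently replaces the base on Posix.
Sys.isunix() && println(joinpath(workdir, "/etc"))

println(safe_join(workdir, "logs"))

for bad in ["/etc", "../stuff", "", "a\0b"]
    try
        safe_join(workdir, bad)
    catch err
        println("rejected ", repr(bad))
    end
end
```

The same validation applied inside `p"path/$var/$name.txt"` would turn a silent traversal into an immediate error.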

Recognising the difference between locations and resources

In conversation, I've received a fair bit of pushback from multiple individuals on the normalisation I've proposed in the prior section. The essential argument is that the nice, algebraic model of paths isn't able to fully abstract over the world of system-specific details. To give a few examples:

  • There's the symlink stuff from earlier
  • stat foo/bar is different to stat foo/bar/.
  • Some tools like rsync treat foo differently to foo/
  • /.. is / (root is a fixed point essentially on Linux)

I really dislike these complications, particularly because in accepting this messiness we abdicate the handling of it to the user, and thus make it easier to write naively buggy code.

Looking at this another way, even more pushback along these lines is deserved. There is an abundance of unquestioned issues of this nature with the current status quo. For example: it is currently not possible to write to a file and then move it without there being an opportunity for the file to be replaced entirely in between the two steps.

This kind of issue and much of the pushback I've received essentially stems from one core issue: we often like to think of paths as unique resource descriptors, when in fact they are unique location descriptors. This is a subtle but important distinction. It is responsible for a large class of bugs and vulnerabilities known as TOCTTOU (Time of Check to Time of Use). Essentially any time a path is reused with any degree of outside influence (over the path or filesystem), it is near trivial to swap out the file in between operations by constructing a deep directory nesting and monitoring the directory atimes (yes, really). The only system where we can truthfully say this is not an issue is one with no concurrency of any sort (including time-sharing).

This is in large part a consequence of the initial POSIX standard being path oriented, a limitation that is gradually being rectified with the addition of f<op> and <op>at calls, such as faccessat. These calls operate not on a path to the file, but a handle to the file itself: a file descriptor, or FD for short. File descriptors essentially sit in between a description of the location of a resource, and the data on disk. I am less familiar with the NT situation, but am led to believe that it has been ahead of *nix in supporting handle-based path operations.

Other programming languages have also recognised this issue. For example, Python's Pathlib separates paths into pure and concrete paths, creating a clear split between an abstract conception of a path (as I've found myself attracted to thus far), and something that actually exists on the filesystem. I suspect we can go even further, and consider a scheme by which we provoke the user into obtaining a reference to the resource at a path when they want to work on it, and so avoid TOCTTOU-style issues and related messiness wholesale.

I conjecture that with the development of the file descriptor based API in POSIX 2008, Linux 2.6, and OpenBSD we have the capability to fulfil this ideal by building a path-like type that is oriented around file descriptors rather than path strings. We can make a more concrete path type than Pathlib's "concrete" paths. Arguably this isn't really a path any more so much as a handle to a filesystem-addressed resource. I'm not sure what best to call this, but regardless it seems exceptionally useful for writing safe filesystem-interacting code.
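The difference between a location and a resource can be seen even with today's API. In the sketch below (both function names are hypothetical), the first version asks two questions of the path, so the file can be swapped between them; the second opens the file once and asks everything of the resulting handle. This is not the fd-based path type itself, only an illustration of why holding a handle narrows the window.

```julia
# Racy: the file at `path` can be replaced between the check and the read.
function read_if_small_racy(path::AbstractString, limit::Integer)
    filesize(path) <= limit || return nothing
    read(path, String)
end

# Handle-based: resolve the path once, then query and read the same open file.
function read_if_small(path::AbstractString, limit::Integer)
    open(path) do io
        filesize(io) <= limit || return nothing
        read(io, String)
    end
end

mktempdir() do dir
    file = joinpath(dir, "data.txt")
    write(file, "hello")
    println(read_if_small(file, 10))
    println(read_if_small(file, 2) === nothing)
end
```

Under a quiet filesystem the two functions behave identically, which is exactly why the racy form so often goes unnoticed.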

Performance considerations

There are a few downsides to this approach, which largely stem from the need to actually acquire a file descriptor.

TODO

Implementation details

TODO

Making the safe path the happy path

From all the investigation we've done so far, we know that:

  • An abstract path type allows for sensible and efficient path manipulation
  • A concrete fd-based path type allows for some TOCTTOU-safe filesystem operations
  • A specialised directory entry type allows for efficient readdir usage, and for other TOCTTOU-safe filesystem operations

Currently, we just use String for all of these purposes. This is "simple" in the sense that all of the inherent complexity is put off to the user of the API to think about. By contrast, this trio of system path types requires a little more upfront thinking, but this is paid for several times over in the reduction of edge cases that package developers and end users may hit.

Mitigating Pain Points

While I like this set of design goals, they're ultimately a compromise between various concerns, and so produce some potential pain points. This should be mitigated as much as possible.

With the separation of concerns of the design, path operations are very predictable. However, if pseudopaths are present and/or symlinks need to be accounted for, realpath will need to be called.

This is something that people using Path objects will simply need to remember to do, and so we will sprinkle mention of this liberally into the documentation.

Need to convert strings to paths for common operations

Made easier via interpolation within the p"" macro.

Compromises

Pretending empty segments are invalid

If one reads up on Linux pathname lookup, one may notice that while empty segments are generally invalid, with the appropriate flags it can be valid to interact with the empty path.

I cannot begin to imagine any legitimate use case for this, and so am inclined to pretend this edge case doesn't exist, particularly since there's no easy way to supply the necessary flag to Julia's (public) filesystem functions.

Unresolved questions

  • Is treating the drive + / as the path root on Windows good enough?
  • Should we take this opportunity to copy FilePathsBase.jl / FilePaths2.jl and provide more structured outputs to functions like uperm?
  • Can we get away with eagerly normalising .. and requiring realpath when you need to guard against symlink shenanigans?
  • Do we want to under-the-hood transform absolute Windows paths to verbatim-prefixed paths (\\?\), for long file name support?

Now resolved through community discussion

  • Should joining two absolute paths return the latter absolute path, or raise a runtime warning/error?
    • An error should be thrown

Non-breaking changes

Ideally we'd use a time-travel machine to shoehorn this into Julia 1.0, but the second best time to add a path type to Julia is now.

Avoiding breaking changes means we can't remove papercuts like eachline(::String), but we can provide a better alternative, gradually adopt it, and push for it to become the status quo in the long term.
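The `eachline(::String)` papercut is easy to demonstrate: a `String` argument is always taken to be a file name, so iterating over the lines of text you already hold requires wrapping it in an `IOBuffer`.

```julia
text = "first line\nsecond line"

same = mktempdir() do dir
    file = joinpath(dir, "notes.txt")
    write(file, text)

    # A String argument is always interpreted as a file name...
    from_file = collect(eachline(file))

    # ...so text that is already in memory needs an IO wrapper.
    from_text = collect(eachline(IOBuffer(text)))

    from_file == from_text
end
println(same)
```

With a path type, the file-reading method could dispatch on the path, leaving the `String` method free to be deprecated or reinterpreted in some distant breaking release.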

Prototype

https://code.tecosaur.net/tec/julia-basic-paths

If you'd like to make a PR etc. this is also now mirrored to GitHub: https://github.com/tecosaur/julia-basic-paths

I'm happy to take feedback in any form you're willing to give it. If easy/possible, I like receiving a .patch with inline comments.
