# TzProvider Proposal / TzdbProvider Research and Design Proposal
## General Motivation
The below attempts to lay out the general research for designing a `TzdbProvider` for ICU4X. To that end, this proposal first looks at the current state of the Rust ecosystem tooling for the Olson/IANA time zone database (hereafter referred to as `tzdb`).
----
## Rust Ecosystem overview
Below are the current `tzdb` focused utilities available via crates.io and lib.rs. This is potentially a non-exhaustive list, and consists primarily of crates found during research.
### IANA zoneinfo file parsing
- [parse-zoneinfo](https://crates.io/crates/parse-zoneinfo) - Parser for `tzdb` zoneinfo files. Fork of `zoneinfo_parse`. Used and maintained by `chrono-tz`
- [zoneinfo_parse](https://crates.io/crates/zoneinfo_parse) - Parser for `tzdb` zoneinfo files. Appears to be no longer maintained
- [zoneinfo](https://crates.io/crates/zoneinfo) - Unimplemented. Reserved crate by `time` maintainer
### ZIC - Zoneinfo compiler
No crates were found, although just how much the zoneinfo parsing dips into zic is debatable. There does seem to be general interest in the date/time community in implementing zic.
General zic information can be found [here](https://data.iana.org/time-zones/data/zic.8.txt)
**Note:** the "fat" and "slim" output modes (zic's `-b` option) may need to be accounted for in tz calculations.
Relevant issues:
- https://github.com/BurntSushi/jiff/issues/20
- https://github.com/BurntSushi/jiff/issues/120
### Time zone information format
In general, TZifs can come in 3 versions, with various configurations that are laid out by the [zic compiler options](https://data.iana.org/time-zones/data/zic.8.txt).
- [tzif](https://crates.io/crates/tzif) - ICU4X's utility for parsing tzif files / POSIX timezone strings.
- [libtzfile](https://crates.io/crates/libtzfile) - Third party utility for parsing tzifs
**NOTE:** The `tzif` crate may need to be updated to provide a `TzifData::from_bytes` method and, if possible, `no_std` support.
**NOTE:** `tzif` crate currently only supports version 2+ blocks.
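To make the version distinction concrete, here is a minimal sketch of reading the fixed 44-byte TZif header as laid out in RFC 8536. The `TzifHeader` struct and `parse_tzif_header` function are hypothetical names for illustration only and are not the `tzif` crate's API.

```rust
/// Fixed-size TZif header fields, following the layout in RFC 8536.
/// (Hypothetical type for illustration; not the `tzif` crate's type.)
#[derive(Debug, PartialEq, Eq)]
struct TzifHeader {
    /// 1 for a NUL version byte, otherwise the value of the ASCII digit.
    version: u8,
    isutcnt: u32,
    isstdcnt: u32,
    leapcnt: u32,
    timecnt: u32,
    typecnt: u32,
    charcnt: u32,
}

/// Parses the 44-byte header at the start of `bytes`.
/// Returns `None` if the input is too short or the magic is wrong.
fn parse_tzif_header(bytes: &[u8]) -> Option<TzifHeader> {
    let header = bytes.get(..44)?;
    if !header.starts_with(b"TZif") {
        return None;
    }
    let version = match header[4] {
        0 => 1,
        digit @ b'2'..=b'9' => digit - b'0',
        _ => return None,
    };
    // Bytes 5..20 are reserved; six big-endian u32 counts follow.
    let count = |i: usize| {
        let s = 20 + 4 * i;
        u32::from_be_bytes([header[s], header[s + 1], header[s + 2], header[s + 3]])
    };
    Some(TzifHeader {
        version,
        isutcnt: count(0),
        isstdcnt: count(1),
        leapcnt: count(2),
        timecnt: count(3),
        typecnt: count(4),
        charcnt: count(5),
    })
}
```

In a version 2+ file, this first header and its 32-bit data block are followed by a second header, a 64-bit data block, and a footer holding a POSIX TZ string.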
### Current TZDB implementations
- [chrono-tz](https://crates.io/crates/chrono-tz) - tzdb implementation/integration for `chrono`
- `tzdb-data`/[tzdb](https://crates.io/crates/tzdb) - tzdb implementation for `tz-rs`
- [jiff_tzdb](https://crates.io/crates/jiff-tzdb) - tzdb implementation for `jiff`
- [time-tz](https://docs.rs/time-tz/latest/time_tz/) - tzdb implementation for `time`
- [time-tzdb](https://crates.io/crates/time-tzdb) - Unimplemented. Reserved crate by `time` maintainer
### Time Zone libraries
#### tz
[tz-rs](https://docs.rs/tz-rs/latest/tz/index.html) appears to have been created as a result of [CVE-2020-26235](#CVE-2020-26235).
Primarily, this crate is meant to be an implementation of the libc functions `localtime`, `gmtime`, and `mktime`.
`tz` implements a `TimeZone` and `TimeZoneRef` struct. These structs are
TZif representations specific to `tz-rs`.
```rust
// tz-rs' `TimeZone`
pub struct TimeZone {
/// List of transitions
transitions: Vec<Transition>,
/// List of local time types (cannot be empty)
local_time_types: Vec<LocalTimeType>,
/// List of leap seconds
leap_seconds: Vec<LeapSecond>,
/// Extra transition rule applicable after the last transition
extra_rule: Option<TransitionRule>,
}
/// Reference to a time zone
#[derive(Debug, Copy, Clone, Eq, PartialEq)]
pub struct TimeZoneRef<'a> {
/// List of transitions
transitions: &'a [Transition],
/// List of local time types (cannot be empty)
local_time_types: &'a [LocalTimeType],
/// List of leap seconds
leap_seconds: &'a [LeapSecond],
/// Extra transition rule applicable after the last transition
extra_rule: &'a Option<TransitionRule>,
}
```
The above can be found [here](https://github.com/x-hgg-x/tz-rs/blob/master/src/timezone/mod.rs#L244).
`tz-rs` also provides a `DateTime` (as defined [here](https://github.com/x-hgg-x/tz-rs/blob/master/src/datetime/mod.rs#L184)) that appears to actually be a `ZonedDateTime` with a `localtime_type` time zone field. There is also a `UtcDateTime`, which is a `DateTime` without the associated time zone.
The `tzdb` crate parses time zones into `tz-rs` types and returns a `TimeZoneRef`.
### Summary of Rust Ecosystem
The current Rust ecosystem for time zones and tzdb is rather varied, with coverage for some functionality but none for others. There are various utility crates for interacting with TZif data and zoneinfo files. Both `tzif` and `libtzfile` represent TZif data appropriately, and any missing functionality can be built on top of them (`tzif` is an ICU4X util crate). The parsing of zoneinfo files is also fairly well maintained and tested, as it is already used by `chrono-tz`.
The one notable omission is a readily available [zoneinfo compiler](https://data.iana.org/time-zones/tzdb-2024a/zic.c), although it is arguable just how much of zic.c is ultimately implemented in `parse-zoneinfo`.
----
## Tzdb implementation deep dive
The primary concern for any `TzdbProvider` is going to be sourcing `tzdb` data for operating systems that do not have a precompiled version of the tzdb. If an application can source time zones from the OS/env, then it should first use that path.
**General question**: Should a precompiled version still be included in these scenarios as a fallback?
### Tzdb Crate Walkthrough
The current tzdb crates that can be found are the below implementations. These tzdbs vary in their design and implementation.
#### `chrono-tz` crate
`chrono-tz` is the tzdb implementation for `chrono`. It is automatically generated via the `chrono-tz-build` crate with the GitHub mirror of the tzdb repository `eggert/tz` as a git submodule. Overall, `chrono-tz` is highly integrated with `chrono` and its types.
```rust
use chrono::TimeZone;
use chrono_tz::Tz;
// Note: Tz does have a private method `timespans` that
// would return a `FixedTimespanSet` that appears to be
// akin to tzif data
let tz: Tz = "America/Chicago".parse().unwrap();
// `ymd`/`and_hms` are deprecated in recent `chrono` releases
let dt = tz.with_ymd_and_hms(2016, 10, 22, 12, 0, 0).unwrap();
```
##### Positives of `chrono-tz`:
- Actively maintained.
- Used with `chrono`, which has widespread adoption.
- Git submodules `eggert/tz` (depending on opinions of git submodules)
- Support no_std with "default-features = false"
##### Negatives:
- Major concerns about compatibility issues
- `chrono-tz` is heavily integrated with `chrono` and designed for use with Chrono’s TimeZone trait, which is also coupled with many chrono date structs.
- No clear way to fetch tzif data from an IANA identifier
- Not no_std by default
**NOTE:** `chrono-tz` may actually compile the zoneinfo files, but it is done in a crate specific manner that appears to be private.
----
#### `tzdb` crate
`tzdb` is a tzdb implementation for `tz-rs`.
```rust
use tzdb;
use tzif;
let tz_name = "America/Chicago";
// raw_tz_by_name returns `Option<&'static [u8]>`
let tzif_data = tzdb::raw_tz_by_name(tz_name).unwrap();
```
##### Positives:
- No crate specific trait dependencies.
- Appears to be actively maintained to IANA release versions
- no_std
##### Negatives:
- Potential minor compatibility issues: Designed for `tz-rs`
- Provides a `tzdb::raw_tz_by_name` (as in the example above), which returns an `Option<&'static [u8]>`
----
#### `time-tz` crate
A tzdb crate made for use with the `time` crate. The maintainer of `time` does not appear to be connected to the crate. It was implemented in response to [CVE-2020-26235](#CVE-2020-26235) (see the `time` issue [here](https://github.com/time-rs/time/issues/293)).
##### License
- BSD-3
##### Positives
- TBD
##### Negatives
- TBD
----
#### `jiff_tzdb` crate
The tzdb implementation for the `jiff` crate.
##### Example
```rust
use jiff_tzdb;
// `jiff_tzdb::get` returns `Option<(&'static str, &'static [u8])>`
let (canonical_name, tzif_data) = jiff_tzdb::get("America/Chicago").unwrap();
// `jiff_tzdb::get` is not case sensitive
let (canonical_name, tzif_data) = jiff_tzdb::get("america/chicago").unwrap();
```
##### Positives:
- Actively maintained currently.
- No crate specific traits
- Provides crate agnostic `&'static [u8]` value as the only retrieval method
- no_std
##### Negatives:
- Uses a .dat file over IANA zoneinfo files, which may be maintenance intensive for jiff's maintainer.
- No current API for `get_from_bytes` vs. `get_from_str`
- "slim" version requires a way to determine possible instants from a POSIX TZ string (but this should be supported anyways).
**NOTE:** `jiff_tzdb` does release with TZifs compiled as "slim" version.
**General Question:** What is the best way to retrieve possible instants from nanoseconds?
----
#### `time-tzdb` crate
N/A as the crate is a reserved name.
----
### Summary of tzdb crate Ecosystem
In summary, nearly all current tzdb crates are designed for use with a specific date/time crate and integrate with that crate to varying degrees. As a result, some crates may be more usable than others.
Across the current tzdb crates, there are two main approaches:
- Build libraries / utilities to generate the .rs files of the actual crate (tzdb's make-tzdb, chrono-tz's chrono-tz-build)
- Prepackaged versions of the data in a .dat file (jiff-tzdb)
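As a rough illustration of the second approach, the sketch below embeds a sorted table of identifiers and raw bytes and resolves lookups with a case-insensitive binary search. The `Entry` type, the `EMBEDDED` table, and the placeholder payloads are all hypothetical; this is not how `jiff-tzdb` lays out its data.

```rust
use std::cmp::Ordering;

/// One embedded zone: identifier plus raw TZif bytes (hypothetical layout).
struct Entry {
    id: &'static str,
    tzif: &'static [u8],
}

/// Must stay sorted by ASCII-lowercased identifier for the binary search.
/// The payloads are placeholders, not real TZif data.
static EMBEDDED: &[Entry] = &[
    Entry { id: "America/Chicago", tzif: b"placeholder-chicago" },
    Entry { id: "Europe/London", tzif: b"placeholder-london" },
    Entry { id: "UTC", tzif: b"placeholder-utc" },
];

/// Compares two identifiers while ignoring ASCII case.
fn cmp_ignore_ascii_case(a: &str, b: &str) -> Ordering {
    a.bytes()
        .map(|c| c.to_ascii_lowercase())
        .cmp(b.bytes().map(|c| c.to_ascii_lowercase()))
}

/// Returns the canonical identifier and the embedded bytes, if present.
fn find_embedded(id: &str) -> Option<(&'static str, &'static [u8])> {
    EMBEDDED
        .binary_search_by(|entry| cmp_ignore_ascii_case(entry.id, id))
        .ok()
        .map(|i| (EMBEDDED[i].id, EMBEDDED[i].tzif))
}
```

A generated-`.rs` approach would typically produce a table like `EMBEDDED` at build time, while a `.dat` approach would slice the payloads out of one large included blob.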
Generally, due to the focus on specific usage, the current crates do not readily support zoneinfo compiling. Compiling zoneinfo files may be useful for embedding or generating zoneinfo files for any operating system or environment from a set of IANA zoneinfo files, although this functionality may fall outside the bounds of ICU4X and an internationalization library.
Since there is no current support for compiling and embedding zoneinfo files, there is a possibility that another tzdb crate may enter the space to fill that gap for anyone looking to avoid interacting with `glibc` (although this may eventually be addressed by [jiff](https://github.com/BurntSushi/jiff/issues/20)).
### Some relevant TZDB issues
There are potential concerns about maintaining a tzdb.
The IANA tzdb is updated at various times throughout the year, which may require minor releases to pick up the updates. This could be avoided by providing a way to fetch zoneinfo files from the [IANA time zone repository](https://data.iana.org/time-zones/) between releases.
Even then, there are some things to be considered when handling zoneinfo file fetching.
TZDB Example Issues
- https://github.com/BurntSushi/jiff/issues/113
----
### General thoughts on tzdb
Is there value in reimplementing a tzdb? From an ICU4X perspective, there would be little benefit over using a pre-existing implementation. In fact, reimplementing would most likely just add to the overall maintenance burden; however, from the perspective of the ecosystem as a whole, there is a decent argument to be made for a general tzdb crate that can be used across the date/time library ecosystem.
While there is potential for another crate to exist and provide functionality that does not currently appear to exist, any additional fragmentation in the tzdb space would most likely not be beneficial without some buy in from the wider date/time ecosystem.
A general proposal would be a shared GitHub org and repository for a TZDB crate for the Rust date/time ecosystem that does not integrate with external libraries and is `no_std`. This would ideally minimize fragmentation across the ecosystem and could serve as the primary source for any Rust time zone utilities. It would also allow the maintenance burden to be shared across the general date/time library ecosystem.
- tzdb-org
  - tzif
  - zic
  - tzdb
-----
## TimeZoneProvider trait design
A `TimeZoneProvider` trait would be implemented as a front end for any time zone database provider. In other words, the trait is a front end that would empower a Bring Your Own Time Zone Database model.
The goal in this case would be to offer a trait that empowers the general date/time ecosystem to change their backend databases while also offering some stability to the overarching ecosystem.
### General TimeZoneProvider Goals
- Offer a general Time Zone struct
- Support multiple backends.
- Example code for how to interact with provider.
- Display back ends
There is a strong argument for a `TimeZoneProvider` trait to also exist. The purpose of the trait would be to enable a B.Y.O.TZDB model.
```rust
trait TimeZoneProvider {
fn load_time_zone(&self) -> TimeZone;
fn next_transition(&self) -> TimeZone;
fn previous_transition(&self) -> TimeZone;
}
```
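To illustrate the B.Y.O.TZDB idea, here is a minimal sketch with two interchangeable backends behind one trait. The method name `load_tzif`, its identifier parameter, and both backend types are assumptions made for this example; the trait outline above does not fix those details.

```rust
use std::collections::HashMap;

/// Hypothetical variant of the provider trait: resolve an identifier
/// to raw TZif bytes. The signature is an assumption for illustration.
trait TimeZoneProvider {
    fn load_tzif(&self, identifier: &str) -> Option<Vec<u8>>;
}

/// Backend 1: data compiled into the binary (stand-in for a tzdb crate).
struct EmbeddedProvider;

impl TimeZoneProvider for EmbeddedProvider {
    fn load_tzif(&self, identifier: &str) -> Option<Vec<u8>> {
        match identifier {
            // Placeholder payload, not real TZif data.
            "UTC" => Some(b"embedded-utc".to_vec()),
            _ => None,
        }
    }
}

/// Backend 2: zones supplied by the application at runtime.
struct InMemoryProvider {
    zones: HashMap<String, Vec<u8>>,
}

impl TimeZoneProvider for InMemoryProvider {
    fn load_tzif(&self, identifier: &str) -> Option<Vec<u8>> {
        self.zones.get(identifier).cloned()
    }
}

/// Tries each backend in order and returns the first hit.
fn load_first(providers: &[&dyn TimeZoneProvider], id: &str) -> Option<Vec<u8>> {
    providers.iter().find_map(|provider| provider.load_tzif(id))
}
```

Because callers only see `&dyn TimeZoneProvider`, a library could swap an OS-backed provider for an embedded one without changing downstream code.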
#### What is a time zone?
The question here is more academic / rhetorical. If reworded into a more practical question, what fields should make up a time zone that would be consistent across all backends.
If generally defined, a time zone is an area/region that observes the same time based on local, commercial, and social rules. This time is primarily represented as an offset in seconds. A time zone may also have an identifier as well as a daylight saving flag.
> [name=robertbastian]
I don't think this is correct. A time zone in IANA is defined as a set of rules that map a timestamp to a UTC offset. PDT and PST are the same time zone (i.e. PT, `America/Vancouver`), and a timezone does not have an inherent `is_dst` attribute.
```rust
// Newtype for an offset in seconds
struct Seconds(i32);

struct TimeZone {
    offset: Seconds,
    is_dst: bool,
    identifier: &'static str,
}
```
However, the question here may be primarily about the use of a `&'static str`. Will this be consistent across all backends? Would that be best for the general use case?
There are maybe two ways in which this design could be expanded.
```rust
// `Seconds` is a newtype over `i32`, e.g. `struct Seconds(i32);`
// 1. TimeZone is defined by a lifetime
struct TimeZone<'id> {
    offset: Seconds,
    is_dst: bool,
    identifier: &'id str,
}
// 2. TimeZone uses a TinyAsciiStr for identifier
struct TimeZone {
    offset: Seconds,
    is_dst: bool,
    identifier: TinyAsciiStr<32>,
}
```
Option 2 would allow `TimeZone` to be `Copy`; however, that may come with some tradeoffs. Also, if `TimeZone` takes a const generic for the identifier length, `TimeZoneProvider` would need to be adjusted likewise. Furthermore, Windows zones and tzdb identifiers both appear to currently fit within 32 bytes, though defaulting to `TinyAsciiStr<32>` may cause compatibility issues if a longer identifier is introduced in the future.
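Following the reviewer comment above that a time zone is a set of rules mapping a timestamp to a UTC offset, an alternative shape can be sketched as a sorted transition list with a lookup by timestamp. The `Transition` and `ZoneRules` types are hypothetical and are not taken from any existing crate.

```rust
/// One rule change: from `at` (seconds since the Unix epoch) onward,
/// the zone observes `utc_offset` seconds east of UTC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Transition {
    at: i64,
    utc_offset: i32,
}

/// Hypothetical rule set for a single time zone.
struct ZoneRules {
    /// Offset in effect before the first transition.
    initial_offset: i32,
    /// Sorted ascending by `at`.
    transitions: Vec<Transition>,
}

impl ZoneRules {
    /// Resolves the UTC offset in effect at `epoch_seconds`.
    fn offset_at(&self, epoch_seconds: i64) -> i32 {
        // Count the transitions at or before the timestamp.
        let idx = self
            .transitions
            .partition_point(|t| t.at <= epoch_seconds);
        match idx {
            0 => self.initial_offset,
            n => self.transitions[n - 1].utc_offset,
        }
    }
}

/// Illustrative rules for `America/Chicago` during 2024 only.
fn chicago_2024() -> ZoneRules {
    ZoneRules {
        initial_offset: -21_600, // CST
        transitions: vec![
            // 2024-03-10T08:00:00Z: CST -> CDT
            Transition { at: 1_710_057_600, utc_offset: -18_000 },
            // 2024-11-03T07:00:00Z: CDT -> CST
            Transition { at: 1_730_617_200, utc_offset: -21_600 },
        ],
    }
}
```

In this shape there is no zone-level `is_dst` field; whether daylight saving time applies is a property of the record selected for a given timestamp.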
## TzdbProvider Design Proposal
The most important question for a `TzdbProvider` is where to source tzif data from. First and foremost, a `TzdbProvider` should default to the OS/environment. Unix-like systems generally come with their own version of the `tzdb`, which is managed by the operating system, but the ability to source tzdb data for certain operating systems, primarily Windows, and other environments without a tzdb is still important.
For the below proof of concept, `jiff_tzdb` will be used as it appears to be the most agnostic of the current crates released from the above deep dive.
At its most basic, the implementation may look like the below.
```rust
// components/timezone/provider/tzdb.rs
// NOTE: Below is a rough outline. `Result` is assumed to be a crate-local
// alias, e.g. `type Result<T> = core::result::Result<T, Error>`.
use tzif::data::tzif::TzifData as TzData;

// Store both the provided identifier and the canonical identifier.
// Identifiers are `'static` in this outline; an owned or
// lifetime-parameterized identifier would be an alternative.
#[derive(Debug)]
struct Tzif {
    identifier: &'static [u8],
    canonical_id: &'static [u8],
    data: TzData,
}

struct TzdbProvider;

impl TzdbProvider {
    pub fn get_from_str(iana_identifier: &'static str) -> Result<Option<Tzif>> {
        Self::get_from_bytes(iana_identifier.as_bytes())
    }

    pub fn get_from_bytes(iana_id: &'static [u8]) -> Result<Option<Tzif>> {
        // Fetch time zone by OS or jiff_tzdb
        let Some((canonical, tz_data)) = get_tzif_data(iana_id)? else {
            return Ok(None);
        };
        Ok(Some(Tzif {
            identifier: iana_id,
            canonical_id: canonical,
            data: tz_data,
        }))
    }
}

#[cfg(any(target_os = "linux", target_os = "macos"))]
fn get_tzif_data(id: &[u8]) -> Result<Option<(&'static [u8], TzData)>> {
    // read from file system
    todo!();
}

#[cfg(target_os = "windows")]
fn get_tzif_data(id: &[u8]) -> Result<Option<(&'static [u8], TzData)>> {
    // Fetch from jiff_tzdb
    todo!();
}
```
This provides a simple struct that would then allow the user to fetch the data for a requested identifier.
```rust
let tzif: Option<Tzif> = TzdbProvider::get_from_str("America/Chicago")?;
```
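For the file system path on Linux/macOS, a minimal sketch of reading raw TZif bytes might look like the following. The directory `/usr/share/zoneinfo` is a common location but is an assumption here, and the helper names `is_plausible_identifier` and `read_tzif` are hypothetical. Parsing the bytes into a `Tzif` is left out.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Common location on Linux/macOS; an assumption, not a guarantee.
const DEFAULT_ZONEINFO_DIR: &str = "/usr/share/zoneinfo";

/// Rejects identifiers that could escape the zoneinfo directory
/// (empty components, `..`, absolute paths, unexpected characters).
fn is_plausible_identifier(id: &str) -> bool {
    !id.is_empty()
        && id.split('/').all(|part| {
            !part.is_empty()
                && part
                    .bytes()
                    .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'_' | b'-' | b'+'))
        })
}

/// Reads the raw TZif bytes for `id` from `zoneinfo_dir`.
/// Returns `Ok(None)` when the identifier is rejected, the file is
/// missing, or the file does not start with the TZif magic.
fn read_tzif(zoneinfo_dir: &Path, id: &str) -> io::Result<Option<Vec<u8>>> {
    if !is_plausible_identifier(id) {
        return Ok(None);
    }
    let path: PathBuf = zoneinfo_dir.join(id);
    if !path.is_file() {
        return Ok(None);
    }
    let bytes = fs::read(&path)?;
    Ok(bytes.starts_with(b"TZif").then_some(bytes))
}
```

A call such as `read_tzif(Path::new(DEFAULT_ZONEINFO_DIR), "America/Chicago")` would then feed the bytes to a TZif parser, with an embedded tzdb as a possible fallback when `None` is returned.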
It may also be worthwhile to expand on this with a way to cache the tzif in order to avoid either reading from the filesystem (on Linux/macOS) or retrieving the data from `jiff_tzdb`.
```rust
#[derive(Default)]
struct CachedTzdbProvider {
    // Most recently fetched time zone, if any
    cache: Option<Tzif>,
}

impl CachedTzdbProvider {
    pub fn get_from_str(&mut self, iana_identifier: &'static str) -> Result<Option<Tzif>> {
        self.get_from_bytes(iana_identifier.as_bytes())
    }

    // NOTE: `tzif.clone()` assumes `Tzif` (and the data it holds) implements `Clone`
    pub fn get_from_bytes(&mut self, iana_id: &'static [u8]) -> Result<Option<Tzif>> {
        // Cache hit on either the requested or the canonical identifier
        if let Some(tzif) = &self.cache {
            if tzif.identifier == iana_id || tzif.canonical_id == iana_id {
                return Ok(Some(tzif.clone()));
            }
        }
        // Cache miss: fetch the data and remember it for the next call
        let Some((canonical, tz_data)) = get_tzif_data(iana_id)? else {
            return Ok(None);
        };
        let tzif = Tzif {
            identifier: iana_id,
            canonical_id: canonical,
            data: tz_data,
        };
        self.cache = Some(tzif.clone());
        Ok(Some(tzif))
    }
}
```
One general concern about the above approach to caching is that it relies on cloning the tzif, which may be inefficient, especially for a `TzdbProvider` that should be performant.
That being said, cloning may still be preferable to reading a zoneinfo file and parsing it into a `Tzif` again.
The ergonomics of `CachedTzdbProvider` may be less than ideal (hence the separation from `TzdbProvider`).
```rust
let mut provider = CachedTzdbProvider::default();
let tzif: Option<Tzif> = provider.get_from_str("America/Chicago")?;
```
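One possible way to sidestep the cloning concern is to hand out reference-counted handles, so a cache hit only bumps a counter. The sketch below is an assumption-laden alternative rather than part of the proposal: `ParsedTzif` stands in for the parsed `Tzif`, the loader closure stands in for `get_tzif_data`, and a `HashMap` replaces the single cache slot.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the parsed `Tzif` (hypothetical fields).
#[derive(Debug)]
struct ParsedTzif {
    canonical_id: String,
    data: Vec<u8>,
}

/// Cache that shares parsed data via `Arc` instead of deep-cloning it.
struct SharedCache<F>
where
    F: Fn(&str) -> Option<ParsedTzif>,
{
    loader: F,
    entries: HashMap<String, Arc<ParsedTzif>>,
    /// Number of times the loader produced a new entry.
    loads: usize,
}

impl<F> SharedCache<F>
where
    F: Fn(&str) -> Option<ParsedTzif>,
{
    fn new(loader: F) -> Self {
        Self { loader, entries: HashMap::new(), loads: 0 }
    }

    /// Returns a shared handle, loading and caching on first use.
    fn get(&mut self, id: &str) -> Option<Arc<ParsedTzif>> {
        if let Some(hit) = self.entries.get(id) {
            // Cheap: only the reference count is incremented.
            return Some(Arc::clone(hit));
        }
        let parsed = Arc::new((self.loader)(id)?);
        self.loads += 1;
        self.entries.insert(id.to_owned(), Arc::clone(&parsed));
        Some(parsed)
    }
}
```

The tradeoff is a heap allocation per zone and a dependency on `alloc`, which matters if the provider is meant to be `no_std` without an allocator.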
**NOTE:** Open to any different names than `CachedTzdbProvider`, if it were to be implemented.
**General Question:** Would it be worthwhile for tzifs to be redesigned for `no_std` and stored on the stack vs. storing portions on the heap?
**General Question:** Should linux and mac systems still ship with a tzdb as a fallback? Maybe, have it feature flagged?
## Notes and errata
The below are generally related thoughts to potentially add or ignore for the current proposal. Depending on the interest and support, they may affect the structure of the above.
### Converting from Windows -> IANA Time Zones
cldr can be used to map from [Windows Time Zones to IANA](https://github.com/unicode-org/cldr/blob/main/common/supplemental/windowsZones.xml)
[Rough Proof of Concept for getting IANA time zone names from Windows](https://github.com/nekevss/win-iana-rs)
**Note:** this approach may require fetching a territory code from Windows (this appears to be the Windows GeoName).
@sffc's thoughts:
- Use ZeroTrie where the keys are "Windows zone name/Territory"
- The ZeroTrie should map to an index into a `ZeroVec<TimeZoneBcp47Id>`. Just pick one, I guess the first one from the list.
- Possibly have a second data struct that contains the rest of the list, but the use case seems more limited. That data struct can map to a `VarZeroVec<ZeroSlice<TimeZoneBcp47Id>>`. I think don't add that one right away.
- Use the cursor for lookup. You can loop over the Windows u16 string with the cursor. Just return early if there is a non-ASCII character in the Windows string.
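The lookup described in the bullets above can be sketched with a plain sorted slice standing in for the ZeroTrie. This is a swapped-in technique for illustration, not the ICU4X data structure: the table is a small illustrative excerpt in the spirit of `windowsZones.xml`, the `"001"` fallback mirrors CLDR's default territory, and the function names are hypothetical.

```rust
/// Illustrative excerpt keyed by "Windows zone name/territory",
/// sorted by key. Only the first IANA zone per entry is kept.
static WINDOWS_TO_IANA: &[(&str, &str)] = &[
    ("Central Standard Time/001", "America/Chicago"),
    ("Central Standard Time/CA", "America/Winnipeg"),
    ("Pacific Standard Time/001", "America/Los_Angeles"),
    ("Pacific Standard Time/CA", "America/Vancouver"),
];

/// Decodes a Windows UTF-16 string, returning `None` as soon as a
/// non-ASCII code unit is seen.
fn decode_ascii(utf16: &[u16]) -> Option<String> {
    utf16
        .iter()
        .map(|&unit| u8::try_from(unit).ok().filter(u8::is_ascii).map(char::from))
        .collect()
}

/// Maps a Windows zone name (UTF-16) and optional territory to an
/// IANA identifier, falling back to the "001" (world) entry.
fn windows_to_iana(zone_utf16: &[u16], territory: Option<&str>) -> Option<&'static str> {
    let zone = decode_ascii(zone_utf16)?;
    let lookup = |territory: &str| {
        let key = format!("{zone}/{territory}");
        WINDOWS_TO_IANA
            .binary_search_by(|(k, _)| (*k).cmp(key.as_str()))
            .ok()
            .map(|i| WINDOWS_TO_IANA[i].1)
    };
    territory
        .and_then(|t| lookup(t))
        .or_else(|| lookup("001"))
}
```

A trie with a cursor would avoid building the temporary `String` key by stepping through the UTF-16 units directly, which is the main advantage of the approach suggested above.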
### Implementing an IANA Identifier over `&str`?
**NOTE:** Leaving the below as a reference of not good ideas. But any identifier for IANA should just be a new type on `TinyAsciiStr<32>`. This appears to be a consistent size for both IANA and Windows Zones, and it would hopefully also prevent any code bloat. It's also just smaller overall.
```rust
struct IanaIdentifier(pub TinyAsciiStr<32>);
```
**Note** values should most likely not be stored in the above representation as it would increase storage by ~10617 bytes (calculated by taking the sum 32 - the length of an identifier for all IANA ids)
@sffc: Stick with variable-length field for IANA IDs for the reason stated above.
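The storage estimate above is a sum of per-identifier padding. A minimal sketch of that calculation follows, using a three-identifier sample rather than the full IANA list (the function name `padding_overhead` is hypothetical).

```rust
/// Bytes wasted when every identifier is padded to `width` bytes:
/// the sum of `width - len` over all identifiers.
fn padding_overhead(ids: &[&str], width: usize) -> usize {
    ids.iter()
        .map(|id| width.saturating_sub(id.len()))
        .sum()
}

/// Small sample; the full IANA identifier list is not reproduced here.
fn sample_overhead() -> usize {
    let ids = [
        "UTC",                              // 3 bytes  -> 29 bytes of padding
        "America/Chicago",                  // 15 bytes -> 17 bytes of padding
        "America/Argentina/ComodRivadavia", // 32 bytes -> no padding
    ];
    padding_overhead(&ids, 32)
}
```

Short identifiers such as `UTC` dominate the overhead, which is why a variable-length field is the better fit for stored data.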
~~Put simply, would it be worthwhile creating a `tinystr` representation of an IANA identifier with `TinyAsciiStr` over `&str`?~~
~~With data coming from either a zoneinfo file or a tzdb implementation (jiff_tzdb). The lifetimes of the data may vary from static to short lived lifetimes. What is the best way to handle this as a whole? Currently, `tzif` parses the data into Strings and Vecs.~~
```rust
// NOTE: max IANA part size appears to be 14
// The max for the first part => 10
// Both the second and third parts can be 14 in length
struct IanaIdentifierVersion1<const N: usize> {
parts: [TinyAsciiStr<14>; N]
}
struct IanaIdentifierVersion2<const SUB_PARTS: usize> {
first: TinyAsciiStr<10>,
parts: [TinyAsciiStr<14>; SUB_PARTS]
}
```
~~The above may not actually be the best option as any `from_str` or `from_bytes` must be `IanaIdentifier<3>`.~~
~~Instead, these could be represented in a way where the entire struct size is already known.~~
```rust
struct IanaIdentifierVersion3 {
first: TinyAsciiStr<10>,
parts: Option<[TinyAsciiStr<14>; 2]>
}
struct IanaIdentifierVersion4 {
first_part: TinyAsciiStr<10>,
second_part: Option<TinyAsciiStr<14>>,
third_part: Option<TinyAsciiStr<14>>,
}
```
or even better:
```rust
struct IanaIdentifier {
    first: TinyAsciiStr<10>,
    specifier: Option<TinyAsciiStr<24>>,
}
```
~~The longest the sub-parts can be is 24 (`Argentina/ComodRivadavia`), which allows this representation to be a consistent 34.~~
This could also be interesting because a comparison may be as fast as comparing the number of parts.
With this representation a `Tzif` could then be:
```rust
struct Tzif {
    identifier: IanaIdentifier,
    canonical_identifier: IanaIdentifier,
    data: TzData,
}
```
There are definitely tradeoffs/negatives to these representations of IANA identifiers. `UTC`, `EST`, and `NZ` are now much larger than their actual size. More generally, the entire representation may be larger than needed: `IanaIdentifier` is 38 bytes vs. the 16 bytes of `&str`/`&[u8]`. Also, any benefit that may be derived from `IanaIdentifier` implementing `Copy` would depend on changes to `TzData`. Furthermore, it may actually be slower overall when having to serialize to and deserialize from a `&[u8]`.
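For reference, the `first`/`specifier` split can be sketched as below. To keep the example self-contained, a hand-rolled `FixedAscii<N>` stands in for `TinyAsciiStr<N>`; it is a hypothetical type, and its size and layout differ from `tinystr`'s, so the byte counts quoted above do not apply to it.

```rust
/// Hypothetical stand-in for `TinyAsciiStr<N>`: a fixed-capacity,
/// non-empty ASCII string stored inline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FixedAscii<const N: usize> {
    len: u8,
    bytes: [u8; N],
}

impl<const N: usize> FixedAscii<N> {
    fn new(s: &str) -> Option<Self> {
        if s.is_empty() || !s.is_ascii() || s.len() > N || N > u8::MAX as usize {
            return None;
        }
        let mut bytes = [0u8; N];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Some(Self { len: s.len() as u8, bytes })
    }

    fn as_str(&self) -> &str {
        // Only ASCII is ever stored, so the conversion cannot fail.
        std::str::from_utf8(&self.bytes[..self.len as usize]).unwrap_or("")
    }
}

/// Identifier split at the first `/`, as in the struct above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct IanaIdentifier {
    first: FixedAscii<10>,
    specifier: Option<FixedAscii<24>>,
}

impl IanaIdentifier {
    /// Returns `None` if either part is empty, non-ASCII, or too long.
    fn parse(id: &str) -> Option<Self> {
        match id.split_once('/') {
            Some((first, rest)) => Some(Self {
                first: FixedAscii::new(first)?,
                specifier: Some(FixedAscii::new(rest)?),
            }),
            None => Some(Self {
                first: FixedAscii::new(id)?,
                specifier: None,
            }),
        }
    }
}
```

With this layout, `UTC` parses to a `first` part with no `specifier`, while `America/Argentina/ComodRivadavia` fills the 24-byte `specifier` exactly.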
### POSIX Tz Strings
A markdown table of POSIX TZ strings can be found [here](https://github.com/nekevss/TZif-POSIX-TZ-strings)
### Relevant Time Zone related CVEs
#### CVE-2020-26235
More information can be found on this [issue](https://github.com/time-rs/time/issues/293)
This CVE involved an unsound call to `localtime_r`, which may dereference a potentially dangling pointer when accessing an environment variable.
- [CVE](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26235)
- [NIST](https://nvd.nist.gov/vuln/detail/CVE-2020-26235)
- Relevant RustSec advisories
  - [RUSTSEC-2020-0071](https://rustsec.org/advisories/RUSTSEC-2020-0071.html)
  - [RUSTSEC-2020-0159](https://rustsec.org/advisories/RUSTSEC-2020-0159.html)