# Non-Ascii idents stabilization report

Stabilization report for [RFC 2457](https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html); [tracking issue](https://github.com/rust-lang/rust/issues/55467)

## What this feature enables

This feature enables writing identifiers in Rust outside of the current set of characters `[A-Za-z0-9_]`, to span all "valid identifier" characters in Unicode as defined in [UAX #31](https://www.unicode.org/reports/tr31/). For example, [the following code](https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=74ba374aa750a7cb05ebd846787aaad2) will become legal:

```rust
#[derive(Debug)]
struct 人 {
    /// 普通话
    名字: String,
    /// 廣東話
    屋企: String,
}

fn main() {
    let 我的名字 = "मनीष".to_string();
    let 我嘅屋企 = "Berkeley".to_string();

    // मराठी
    let मनीष = 人 {
        名字: 我的名字,
        屋企: 我嘅屋企,
    };

    // हिंदी
    let उसका_नाम = "مصطفى".to_string();
    let 他的家 = "Oakland".to_string();

    // اردو
    let مصطفى = 人 {
        名字: उसका_नाम,
        屋企: 他的家,
    };

    println!("मी: {:?}", मनीष);
    println!("माझा मित्र: {:?}", مصطفى);
}
```

Currently, only the comments and strings in this code are legal; everything else is disallowed by the compiler. This code is an exaggerated example that uses five languages for demonstrative purposes, but one can imagine this being used with just one. The motivation behind this feature is given [in the RFC](https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html#motivation).

This RFC also introduces a suite of lints to protect against confusion caused by this vast expansion of the identifier set, crafted in such a way that people writing code using their own writing systems are minimally impacted:

 - `confusable_idents`: This checks if multiple identifiers used in the same crate are confusable with each other.
 - `uncommon_codepoints`: This checks for codepoints which are allowed in identifiers but are unlikely to be used by someone just writing identifiers in some language, based on [UTS #39](http://unicode.org/reports/tr39/#General_Security_Profile).
 - `mixed_script_confusables`: This checks if the _only_ time a script is used is for characters which are confusable with other scripts. Mixing scripts is expected to be an operation people will need (someone might want to write stuff like `家_opt`), so we do not lint on all instances of mixed scripts, just when a script is introduced solely with confusables.
 - (The existing style lints are also updated to make sense for non-Latin scripts.)
 - `non_ascii_idents` (`Allow` by default) for forbidding these outright.

Lints have a rather relaxed stability policy, so these can be tweaked in the future. All of these lints except for `non_ascii_idents` are currently `Warn` by default and should stay so after stabilization. As such, they only affect code containing non-ASCII identifiers.

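
As a small illustration of the mixed-script case described above, the sketch below uses the report's `家_opt` name as a function parameter. The helper `home_or_unknown` and its fallback string are hypothetical, chosen only for this example. Per the description of `mixed_script_confusables`, an identifier like this should not be linted, since Han is not being introduced solely through confusable characters.

```rust
// A crate that wants to forbid the feature outright could raise the
// `non_ascii_idents` lint at its crate root, e.g. `#![deny(non_ascii_idents)]`.
// That attribute is deliberately not applied here.

/// Hypothetical helper: returns the given home, or a placeholder if absent.
fn home_or_unknown(家_opt: Option<&str>) -> String {
    // `家_opt` mixes a Han character with Latin letters in one identifier.
    家_opt.unwrap_or("unknown").to_string()
}

fn main() {
    let 屋企 = Some("Berkeley");
    println!("{}", home_or_unknown(屋企));
    println!("{}", home_or_unknown(None));
}
```
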
## Implementation and tests

Implementation:

 - The main parser work has existed since before 1.0
 - Normalizing idents during parsing: [#66670](https://github.com/rust-lang/rust/pull/66670), [#67702](https://github.com/rust-lang/rust/pull/67702)
 - Implementation of `confusable_idents`: [#71542](https://github.com/rust-lang/rust/pull/71542), [#72770](https://github.com/rust-lang/rust/pull/72770)
 - [Implementation of `uncommon_codepoints`](https://github.com/rust-lang/rust/pull/67810)
 - [Adjustments to style lints](https://github.com/rust-lang/rust/pull/73839)
 - [Implementation of `mixed_script_confusables`](https://github.com/rust-lang/rust/pull/72770)

Tests: [lint tests](https://github.com/rust-lang/rust/tree/cebc8fef5f4391a9ed8e4c1dc566a6c5824e2901/src/test/ui/lint/rfc-2457-non-ascii-idents), tests for `nonstandard_style` ([1](https://github.com/rust-lang/rust/blob/cebc8fef5f4391a9ed8e4c1dc566a6c5824e2901/src/test/ui/lint/lint-nonstandard-style-unicode-1.rs), [2](https://github.com/rust-lang/rust/blob/cebc8fef5f4391a9ed8e4c1dc566a6c5824e2901/src/test/ui/lint/lint-nonstandard-style-unicode-2.rs), [3](https://github.com/rust-lang/rust/blob/cebc8fef5f4391a9ed8e4c1dc566a6c5824e2901/src/test/ui/lint/lint-nonstandard-style-unicode-3.rs)), [tests for the feature itself](https://github.com/rust-lang/rust/tree/cebc8fef5f4391a9ed8e4c1dc566a6c5824e2901/src/test/ui/rfc-2457). Other tests are scattered throughout and can be found by grepping for `non_ascii_idents`.

## Resolved questions

These are the "unresolved questions" from the RFC that have already been resolved by past discussions:

> Which name mangling scheme is used by the compiler?

Punycode; see RFC 2603.

> Is there a better name for the `less_used_codepoints` lint?

We picked `uncommon_codepoints`.

> right-to-left scripts can lead to weird rendering in mixed contexts (depending on the software used), especially when mixed with operators.
> This is not something that should block stabilization, however we feel it is important to explicitly call out

Unnecessary to resolve, as noted in the RFC.

> Similarly to out-of-line modules (`mod фоо;`), extern crates and paths with a first segment naming a crate should not be able to do filesystem search using those non-ASCII identifiers (i.e. no `extern crate ьаг;` or `му_сгате::baz`).

[We have chosen to disallow this](https://github.com/rust-lang/rust/pull/73305).

> How are non-ASCII idents best supported in debuggers?

DWARF seems to handle UTF-8 just fine.

> Which lint should the global mixed scripts confusables detection trigger?

`mixed_script_confusables`; we decided against the more complicated lint that was proposed as an alternative in the RFC.

## Hopefully resolved questions

These are "unresolved questions" that I propose a resolution for. None of these decisions are irreversible.

> Which context is adequate for confusable detection: file, current scope, crate?

Crate, as implemented, seems fine for now. We can tweak later if needed.

> Should [ZWNJ and ZWJ be allowed in identifiers](https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters)?

There are potential reasons to allow this, but it can be tweaked later. We should disallow them for now and open a follow-up issue to point people at, and see what comes up.

> How badly do non-ASCII idents exacerbate const pattern confusion (#7526, #49680)?

Not a blocker; best to see it get used and improve lints as necessary.

> In `mixed_script_confusables`, do we actually need to make an exception for `Latin` identifiers?

The exception is necessary to prevent programs that use short ASCII identifiers from erroring.

> Terminal width is tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well.
> How does rustfmt deal with this when determining line width?

Not a blocker, and already a problem for strings.

> Tweak `XID_Start` / `XID_Continue`? #4928
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
>
> The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.

This specification isn't clear enough, and we should follow Unicode.
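
To make the line-width question above more concrete, here is a minimal sketch using only the standard library. It compares the UTF-8 byte length with the number of Unicode scalar values for three names taken from the example program. The helper `byte_and_char_len` is a name invented for this illustration. Neither number is a display width: computing that would need data from outside the standard library, and it is not attempted here.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values) for `s`.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // An ASCII name, a Han name, and a Devanagari name that includes a
    // dependent vowel sign (a "modifier" in the sense of the question).
    for ident in ["name", "名字", "मनीष"] {
        let (bytes, chars) = byte_and_char_len(ident);
        // Neither value says how many terminal columns the text occupies.
        println!("{}: {} bytes, {} chars", ident, bytes, chars);
    }
}
```

The two counts agree only for the ASCII name, which is why a width limit that is obvious for ASCII identifiers has no single natural definition once other scripts are allowed. As the answer above notes, string literals already have the same problem.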