# GSoC2023(GCC) Final Report

Organization: [The GNU Compiler Collection](https://gcc.gnu.org/)

Contributor: Raiki Tamura

Mentors: Philip Herron, Arthur Cohen

## Introduction

In my Google Summer of Code 2023 project, I worked on Unicode support for [GCC Rust](https://rust-gcc.github.io/).
The main goals of the project were to support Unicode identifiers in GCC Rust and to improve their location information for better error messages.

## Results

Currently, GCC Rust can compile programs containing non-ASCII identifiers and strings like this:

```rust
mod モジュール {
    pub fn funcţie<'живот>() {
        let variable: &'живот str = "変数";
        let 변수 = '\u{3042}';
    }
}
```

After tokenization, all identifiers are normalized to [the NFC form](https://unicode.org/reports/tr15/), then some of them are mangled.
In the legacy mangling scheme, non-ASCII names are encoded by escaping.
In [the v0 mangling scheme](https://rust-lang.github.io/rfcs/2603-rust-symbol-name-mangling-v0.html), on the other hand, they are encoded as [Punycode](https://datatracker.ietf.org/doc/html/rfc3492).
For now, GCC Rust supports the legacy mangling scheme but does not fully support the v0 mangling scheme.
However, Punycode encoding has been implemented in the backend.

## What was done

### First Step

First, the lexer was refactored to treat all input characters as UTF-8 characters.
Then, the lexer was modified to tokenize non-ASCII identifiers, whitespace, and newlines.
`libcpp/`, a library for the C pre-processor in GCC, was reused to look up codepoints that can be used in identifiers.

### Second Step

A check for `#[crate_name="XXX"]` attributes was added in this step.
Values for the attribute must consist of Unicode alphabetic or numeric codepoints.
To look up these properties, [a Python script](https://github.com/Rust-GCC/gccrs/blob/87c5cc15163351f4ab0f0740b44a895a405a3eff/gcc/rust/util/make-rust-unicode.py) which generates a C++ header file containing Unicode tables was written.
The generator was written in Python because that makes it easy to keep up with new Unicode versions.

### Third Step

The internal data structure for identifiers was refactored to also store their location information, making it easier to use the location of an identifier in error diagnostics.
To actually use this location information in error messages, the parts of the compiler that emit error messages will need to be modified in the future.

### Fourth Step

Next, Unicode normalization of identifiers was added.
Normalization is a process in which strings that are canonically equivalent but encoded differently are converted to the same byte string.
For example, `Å`(U+212B) is normalized to `Å`(U+00C5) in [Normalization Form C](https://unicode.org/reports/tr15/), or NFC for short.
Since [the Rust RFC 2457](https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html) specifies that all identifiers must be normalized to NFC, NFC normalization has been completely implemented.
However, optimization strategies such as [Quick Check](https://unicode.org/reports/tr15/#Optimization_Strategies) have not been implemented yet and should be added to speed up the tokenization process.
In addition, many more tests for NFC normalization should be added, which can be achieved by making use of [NormalizationTest.txt](https://unicode.org/Public/UNIDATA/NormalizationTest.txt).

### Last Step

In the last step, Punycode encoding was implemented.
Punycode is used in the v0 mangling scheme to encode Unicode identifiers.
Alongside this, the legacy mangling was fixed to support Unicode identifiers.

## Remaining Tasks

As mentioned above, three tasks related to Unicode are left:

- adding Quick Check for NFC,
- adding more tests for NFC normalization, and
- completing the implementation of the v0 mangling scheme.

I will continue working on the v0 mangling scheme after GSoC2023.

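
To make the Punycode step more concrete, here is a minimal sketch of the encoder described in RFC 3492, written in Rust for illustration. It is not taken from the gccrs sources (the actual implementation lives in the C++ backend), and the names `punycode_encode`, `adapt`, and `encode_digit` are hypothetical. It also implements only generic Punycode and does not include any adjustments specific to the v0 mangling scheme.

```rust
// Bootstring parameters for Punycode, as given in RFC 3492.
const BASE: u32 = 36;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;
const INITIAL_BIAS: u32 = 72;
const INITIAL_N: u32 = 128;

// Maps a digit value to its character: 0..=25 -> 'a'..='z', 26..=35 -> '0'..='9'.
fn encode_digit(d: u32) -> char {
    if d < 26 {
        (b'a' + d as u8) as char
    } else {
        (b'0' + (d - 26) as u8) as char
    }
}

// Bias adaptation function from RFC 3492, section 6.1.
fn adapt(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta = if first_time { delta / DAMP } else { delta / 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + ((BASE - T_MIN + 1) * delta) / (delta + SKEW)
}

// Hypothetical helper: encodes `input` as Punycode, or returns None on overflow.
fn punycode_encode(input: &str) -> Option<String> {
    let code_points: Vec<u32> = input.chars().map(|c| c as u32).collect();

    // Copy the basic (ASCII) code points first, followed by the delimiter.
    let mut output: String = input.chars().filter(|c| c.is_ascii()).collect();
    let basic_len = output.len() as u32;
    let mut handled = basic_len;
    if basic_len > 0 {
        output.push('-');
    }

    let mut n = INITIAL_N;
    let mut delta: u32 = 0;
    let mut bias = INITIAL_BIAS;

    while (handled as usize) < code_points.len() {
        // Smallest code point not yet handled.
        let m = *code_points.iter().filter(|&&c| c >= n).min()?;
        delta = delta.checked_add((m - n).checked_mul(handled + 1)?)?;
        n = m;

        for &c in &code_points {
            if c < n {
                delta = delta.checked_add(1)?;
            }
            if c == n {
                // Emit delta as a generalized variable-length integer.
                let mut q = delta;
                let mut k = BASE;
                loop {
                    let t = if k <= bias {
                        T_MIN
                    } else if k >= bias + T_MAX {
                        T_MAX
                    } else {
                        k - bias
                    };
                    if q < t {
                        break;
                    }
                    output.push(encode_digit(t + (q - t) % (BASE - t)));
                    q = (q - t) / (BASE - t);
                    k += BASE;
                }
                output.push(encode_digit(q));
                bias = adapt(delta, handled + 1, handled == basic_len);
                delta = 0;
                handled += 1;
            }
        }
        delta += 1;
        n += 1;
    }
    Some(output)
}
```

With this sketch, `punycode_encode("bücher")` yields `Some("bcher-kva")`: the ASCII characters are copied first, and the position and value of `ü` are encoded after the `-` delimiter.
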
## Pull Requests and Issues

### Pull Requests

First Step

- https://github.com/Rust-GCC/gccrs/pull/2284
- https://github.com/Rust-GCC/gccrs/pull/2307
- https://github.com/Rust-GCC/gccrs/pull/2320
- https://github.com/Rust-GCC/gccrs/pull/2338
- https://github.com/Rust-GCC/gccrs/pull/2339
- https://github.com/Rust-GCC/gccrs/pull/2347
- https://github.com/Rust-GCC/gccrs/pull/2374

Second Step

- https://github.com/Rust-GCC/gccrs/pull/2425
- https://github.com/Rust-GCC/gccrs/pull/2463
- https://github.com/Rust-GCC/gccrs/pull/2492
- https://github.com/Rust-GCC/gccrs/pull/2529

Third Step

- https://github.com/Rust-GCC/gccrs/pull/2364

Fourth Step

- https://github.com/Rust-GCC/gccrs/pull/2467
- https://github.com/Rust-GCC/gccrs/pull/2530

Last Step

- https://github.com/Rust-GCC/gccrs/pull/2533
- https://github.com/Rust-GCC/gccrs/pull/2535
- https://github.com/Rust-GCC/gccrs/pull/2547
- https://github.com/Rust-GCC/gccrs/pull/2552

### Issues

- https://github.com/Rust-GCC/gccrs/issues/2287
- https://github.com/Rust-GCC/gccrs/issues/2306
- https://github.com/Rust-GCC/gccrs/issues/2308
- https://github.com/Rust-GCC/gccrs/issues/2309
- https://github.com/Rust-GCC/gccrs/issues/2379
- https://github.com/Rust-GCC/gccrs/issues/2548

All of them can also be found [here](https://github.com/Rust-GCC/gccrs/pulls?q=author%3Atamaroning+created%3A2023-06-01..2023-08-20).

## My Learnings

I am glad to have learned many things during the project.
First of all, I had no prior experience with such a large C++ codebase, so it was a great opportunity to learn about compiler implementation as well as C++.
I also read and implemented several specifications, including Rust RFCs, the Unicode Standard, and IETF RFCs, which were technically interesting.

I was also very anxious about my English skills, especially my speaking skills.
However, my mentors were very patient and helped me express what I wanted to say.
It was a precious opportunity to improve my English while living in Japan.

## Acknowledgements

This project would not have been possible without the continued support of many people.
I would especially like to thank:

- [Arthur Cohen](https://github.com/CohenArthur) and [Philip Herron](https://github.com/philberty): for being fantastic mentors and having continuously guided me since last year
- [Pierre-Emmanuel Patry](https://github.com/P-E-P), [Marc Poulhiès](https://github.com/dkm), and [Thomas Schwinge](https://github.com/tschwinge): for code review and dedicated advice
- [GCC Community](https://gcc.gnu.org/): for discussion and technical advice
- [Southball](https://github.com/southball): for proofreading my proposal and this report as a friend