Organization: The GNU Compiler Collection
Contributor: Raiki Tamura
Mentors: Philip Herron, Arthur Cohen
In my Google Summer of Code 2023 project, I worked on Unicode support for GCC Rust.
The main goals of the project were to support Unicode identifiers in GCC Rust and to improve identifier location information for better error messages.
Currently, GCC Rust compiles any program containing non-ASCII strings like this:
After tokenization, all identifiers are normalized to the NFC form, then some of them are mangled. In the legacy mangling scheme, non-ASCII names are encoded by escaping. On the other hand, in the v0 mangling scheme, they are encoded as Punycode. For now, GCC Rust supports the legacy mangling scheme but does not fully support the v0 mangling scheme. However, Punycode encoding has been implemented in the backend.
First, the lexer was refactored to treat all input characters as UTF-8. Then it was modified to tokenize non-ASCII identifiers, whitespace, and newlines. libcpp/, a library for the C preprocessor in GCC, was reused to look up the codepoints that can be used in identifiers.
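The identifier scanning described above can be illustrated with a small Python sketch. It is not the GCC Rust lexer: the function names are hypothetical, and instead of libcpp it assumes that Python's str.isidentifier(), which checks the Unicode XID_Start and XID_Continue properties (plus the underscore), is a close enough stand-in for the lookup.

```python
def is_ident_start(ch):
    # On a single character, isidentifier() accepts XID_Start or '_'.
    return ch.isidentifier()


def is_ident_continue(ch):
    # Prefixing a known start character tests XID_Continue for `ch`.
    return ("a" + ch).isidentifier()


def scan_identifiers(source):
    """Return the identifiers found in UTF-8 encoded `source` (bytes)."""
    text = source.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    idents = []
    i = 0
    while i < len(text):
        if is_ident_start(text[i]):
            start = i
            i += 1
            while i < len(text) and is_ident_continue(text[i]):
                i += 1
            idents.append(text[start:i])
        elif is_ident_continue(text[i]):
            # Something like a number literal: skip the whole run so that
            # its tail is not mistaken for an identifier.
            while i < len(text) and is_ident_continue(text[i]):
                i += 1
        else:
            i += 1
    return idents
```

For example, scan_identifiers("let 変数 = café;".encode("utf-8")) yields the three names let, 変数 and café. Keyword handling and every other token kind are left out of this sketch.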
A check for #[crate_name="XXX"] attributes was added in this step. The value of the attribute must consist of Unicode alphabetic or numeric codepoints. To look up these properties, a Python script that generates a C++ header file containing Unicode tables was written. Python was chosen for the generator because it makes it easy to keep up with new Unicode versions.
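A generator of this kind could look roughly like the sketch below. It is not the script from the project: the function names, the array name, and the input format (lines in the style of the Unicode Character Database property files) are all assumptions made for illustration.

```python
import bisect

# Sample input in the style of a UCD property file (assumed format).
SAMPLE_UCD = """\
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Alphabetic # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
"""


def parse_ranges(text, prop):
    """Collect sorted (first, last) codepoint ranges that have property `prop`."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return sorted(ranges)


def emit_header(ranges, array_name):
    """Render the ranges as a C++ array definition (hypothetical layout)."""
    rows = ",\n".join(f"  {{0x{lo:04X}, 0x{hi:04X}}}" for lo, hi in ranges)
    return (
        "// Generated file, do not edit.\n"
        f"static const unsigned int {array_name}[][2] = {{\n{rows}\n}};\n"
    )


def in_ranges(ranges, cp):
    """Binary search, mirroring how a consumer of the table might look up `cp`."""
    i = bisect.bisect_right(ranges, (cp, 0x10FFFF)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]
```

Keeping the table as sorted ranges makes the generated header small and lets the lookup be a binary search. Moving to a new Unicode version then only means running the script again on the updated data file.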
The internal data structure for identifiers was refactored to also store their location information, making it easier to use the location of an identifier in error diagnostics. To use this information in error messages, the parts of the compiler that output error messages will be modified in the future.
Next, Unicode normalization of identifiers was added. Normalization is a process in which canonically equivalent but different byte strings are converted to the same byte string. For example, Å (U+212B) is normalized to Å (U+00C5) under Normalization Form C, or NFC for short.
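The example above can be reproduced with Python's standard unicodedata module. This only illustrates what NFC does; it is not the C++ implementation in GCC Rust, and the helper names are hypothetical.

```python
import unicodedata


def nfc(name):
    """Normalize `name` to Normalization Form C."""
    return unicodedata.normalize("NFC", name)


def same_identifier(a, b):
    """Two spellings name the same identifier if their NFC forms match."""
    return nfc(a) == nfc(b)


angstrom_sign = "\u212B"   # ANGSTROM SIGN
a_with_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"     # 'A' followed by COMBINING RING ABOVE

assert nfc(angstrom_sign) == a_with_ring
assert nfc(decomposed) == a_with_ring
assert same_identifier(angstrom_sign, decomposed)
```

All three spellings look the same on screen but are different byte strings. After normalization they compare equal, which is what lets the compiler treat them as one identifier.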
Since Rust RFC 2457 specifies that all identifiers must be normalized to NFC, NFC normalization has been completely implemented. However, optimization strategies such as Quick Check should be used to speed up tokenization. In addition, many more tests for NFC normalization should be added, which can be achieved by making use of NormalizationTest.txt.
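Both follow-up ideas can be sketched in Python. The fast path below assumes that unicodedata.is_normalized (Python 3.8 or later) is an acceptable stand-in for a Quick Check style test. The conformance check assumes the semicolon-separated column layout of NormalizationTest.txt (source, NFC, NFD, NFKC, NFKD). The function names and the sample line are illustrative, not taken from the project.

```python
import unicodedata


def normalize_identifier(name):
    """Return the NFC form of `name`, skipping work when it is not needed."""
    if name.isascii():
        return name  # ASCII-only text is already in NFC
    if unicodedata.is_normalized("NFC", name):
        return name  # already normalized, no new string is built
    return unicodedata.normalize("NFC", name)


def check_conformance_line(line):
    """Check the NFC invariants for one data line of the test file format."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    c1, c2, c3, c4, c5 = (
        "".join(chr(int(cp, 16)) for cp in field.split())
        for field in data.split(";")[:5]
    )
    # NFC invariants: c2 == NFC(c1) == NFC(c2) == NFC(c3)
    #                 c4 == NFC(c4) == NFC(c5)
    return (
        all(normalize_identifier(c) == c2 for c in (c1, c2, c3))
        and all(normalize_identifier(c) == c4 for c in (c4, c5))
    )


# A sample line in the assumed format, for the Å example used earlier.
SAMPLE_LINE = "212B;00C5;0041 030A;00C5;0041 030A; # ANGSTROM SIGN"
assert check_conformance_line(SAMPLE_LINE)
```

Most identifiers in real programs are ASCII or already normalized, so the two early returns would cover the common case. The full normalization path would then run only for the few names that actually need it.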
In the last step, Punycode encoding was implemented. Punycode is used in the v0 mangling scheme to encode Unicode identifiers. Alongside this, the legacy mangling scheme was fixed to support Unicode identifiers.
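To show what the encoding produces, here is a short Python sketch using the built-in punycode codec. It is not the GCC Rust implementation. The helper names are hypothetical, and the replacement of the "-" delimiter with "_" assumes the rule given for v0 symbols in Rust RFC 2603.

```python
def punycode(name):
    """Encode `name` with Python's built-in Punycode codec (RFC 3492)."""
    return name.encode("punycode").decode("ascii")


def v0_punycode(name):
    """Punycode as it would appear inside a v0 symbol.

    Assumption: the v0 scheme writes the '-' delimiter as '_', since '-'
    is not part of the character set used in mangled names.
    """
    return punycode(name).replace("-", "_")


assert punycode("gödel") == "gdel-5qa"
assert v0_punycode("gödel") == "gdel_5qa"
```

The ASCII characters of the name come first, followed by the delimiter and a compact encoding of where the non-ASCII characters go. The length prefix and the marker that flags a Punycode identifier in a full v0 symbol are not covered by this sketch.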
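As mentioned above, three tasks related to Unicode are left: using identifier location information in error messages, optimizing and further testing NFC normalization, and fully supporting the v0 mangling scheme.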
I will continue working on the v0 mangling scheme after GSoC 2023.
First Step
Second Step
Third Step
Fourth Step
Last Step
All of them can also be found here.
I am glad to have learned many things during the project.
First of all, I did not have experience with such large C++ codebases before, so it was a great opportunity to learn about compiler implementation as well as C++. I have also read and implemented several specifications, including Rust RFCs, the Unicode Standard, and IETF RFCs, which were technically interesting.
Also, I was very anxious about my English skills, especially my speaking skills. However, my mentors were very patient and helped me express what I wanted to say. It was a precious opportunity to improve my English while living in Japan.
This project would not have been possible without the continued support of many people. I would like to especially thank