GSoC2023(GCC) Final Report

Organization: The GNU Compiler Collection
Contributor: Raiki Tamura
Mentors: Philip Herron, Arthur Cohen

Introduction

In my Google Summer of Code 2023 project, I worked on Unicode support for GCC Rust.
The main goals of the project were to support Unicode identifiers in GCC Rust and to improve their location information for better error messages.

Results

GCC Rust can now compile programs containing non-ASCII identifiers and strings, such as the following:

pub mod モジュール {
    pub fn funcţie<'живот>() {
        let variable: &'живот str = "変数";
        let 변수 = '\u{3042}';
    }
}

After tokenization, all identifiers are normalized to NFC, and some of them are then mangled. In the legacy mangling scheme, non-ASCII characters are escaped; in the v0 mangling scheme, identifiers containing them are encoded as Punycode. For now, GCC Rust supports the legacy mangling scheme but does not yet fully support the v0 mangling scheme, although Punycode encoding has already been implemented in the backend.
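To make the escaping concrete, here is a minimal sketch in Rust. It assumes the $u<hex>$ escape form used by rustc's legacy mangler and ignores the other characters that scheme rewrites. The function name legacy_escape is hypothetical and is not taken from the GCC Rust sources.

```rust
/// Escape every character that is not an ASCII letter, digit or underscore
/// as `$u<hex>$`, where <hex> is the lowercase hexadecimal codepoint.
/// Assumption: this mirrors only the non-ASCII part of legacy mangling.
fn legacy_escape(ident: &str) -> String {
    let mut out = String::new();
    for c in ident.chars() {
        if c.is_ascii_alphanumeric() || c == '_' {
            // Plain ASCII identifier characters are copied through unchanged.
            out.push(c);
        } else {
            // Everything else is replaced by its escaped codepoint.
            out.push_str(&format!("$u{:x}$", c as u32));
        }
    }
    out
}
```

Under these assumptions, an identifier spelled c, a, f, U+00E9 becomes caf$ue9$, so the resulting symbol name contains only ASCII characters.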

What was done

First Step

First, the lexer was refactored to treat its input as UTF-8 text. It was then modified to tokenize non-ASCII identifiers, whitespace, and newlines. libcpp/, the C preprocessor library in GCC, was reused to look up the codepoints that can be used in identifiers.
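The idea of scanning an identifier codepoint by codepoint, rather than byte by byte, can be sketched as follows. This is only an illustration: it uses char::is_alphabetic and char::is_alphanumeric from the Rust standard library as rough stand-ins for the XID_Start and XID_Continue lookups, whereas GCC Rust itself is written in C++ and uses libcpp for this. The names scan_identifier, is_ident_start and is_ident_continue are hypothetical.

```rust
/// Rough stand-in for the XID_Start check (assumption, not the real table).
fn is_ident_start(c: char) -> bool {
    c == '_' || c.is_alphabetic()
}

/// Rough stand-in for the XID_Continue check (assumption, not the real table).
fn is_ident_continue(c: char) -> bool {
    c == '_' || c.is_alphanumeric()
}

/// Split `src` into a leading identifier and the remaining input.
/// Returns None when `src` does not start with an identifier.
fn scan_identifier(src: &str) -> Option<(&str, &str)> {
    // char_indices decodes UTF-8 and yields (byte offset, codepoint) pairs.
    let mut chars = src.char_indices();
    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }
    // The identifier ends at the first codepoint that cannot continue it.
    let end = chars
        .find(|&(_, c)| !is_ident_continue(c))
        .map_or(src.len(), |(offset, _)| offset);
    Some(src.split_at(end))
}
```

Because the offsets come from char_indices, the split always falls on a codepoint boundary, which is the property the refactored lexer needs.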

Second Step

A check for #[crate_name="XXX"] attributes was added in this step. The value of the attribute must consist of Unicode alphabetic or numeric codepoints. To look up these properties, a Python script was written that generates a C++ header file containing Unicode tables. Python was chosen for the generator because it makes it easy to regenerate the tables for new Unicode versions.
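A minimal sketch of such a check is shown below. It relies on char::is_alphanumeric from the Rust standard library instead of generated tables. Accepting the underscore and rejecting the empty name are assumptions borrowed from rustc's behaviour, not something stated above. The function name is_valid_crate_name is hypothetical.

```rust
/// Return true when every codepoint of `name` is Unicode alphabetic or
/// numeric. Assumptions: '_' is also accepted and an empty name is rejected.
fn is_valid_crate_name(name: &str) -> bool {
    !name.is_empty() && name.chars().all(|c| c.is_alphanumeric() || c == '_')
}
```

A real implementation would report which character was rejected, so that the error message can point at it.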

Third Step

The internal data structure for identifiers was refactored to also store their location information, which makes it easier to use that information in error diagnostics. Some parts of the compiler that emit error messages still need to be modified to take advantage of it; this is left for future work.
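The shape of such a data structure can be sketched as follows. All names here (Location, Identifier, describe) and the line/column representation are hypothetical and chosen for illustration; the actual GCC Rust classes are written in C++ and differ in detail.

```rust
/// Hypothetical source position; the real compiler uses its own location type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Location {
    line: u32,
    column: u32,
}

/// An identifier that carries the position where it was written.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Identifier {
    name: String,
    loc: Location,
}

/// Build a diagnostic prefix directly from the identifier, without having
/// to look up its location elsewhere.
fn describe(ident: &Identifier) -> String {
    format!("{}:{}: identifier `{}`", ident.loc.line, ident.loc.column, ident.name)
}
```

Keeping the location next to the name means any pass that holds an identifier can point at it in an error message.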

Fourth Step

Next, Unicode normalization of identifiers was added. Normalization is a process in which canonically equivalent but different byte strings are converted to the same byte string. For example, under Normalization Form C (NFC), Å (U+212B) is normalized to Å (U+00C5).
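The following sketch shows why this matters: three canonically equivalent spellings of the same letter have three different UTF-8 encodings, so a byte-wise comparison would treat them as different identifiers. The Rust standard library does not provide normalization, so the snippet only prints the encodings. The constant and function names are hypothetical.

```rust
/// U+212B ANGSTROM SIGN
const ANGSTROM_SIGN: &str = "\u{212B}";
/// U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE (the NFC form)
const A_WITH_RING: &str = "\u{00C5}";
/// 'A' followed by U+030A COMBINING RING ABOVE
const A_PLUS_COMBINING_RING: &str = "A\u{030A}";

/// Render the UTF-8 encoding of `s` as space-separated uppercase hex bytes.
fn utf8_hex(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<_>>()
        .join(" ")
}
```

The three encodings are E2 84 AB, C3 85 and 41 CC 8A respectively. After NFC normalization all three become C3 85.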

Since Rust RFC 2457 specifies that all identifiers must be normalized to NFC, NFC normalization has been fully implemented. However, an optimization such as Quick Check should still be added to speed up tokenization. In addition, many more tests for NFC normalization should be added, which can be done by making use of NormalizationTest.txt.
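As an illustration of the kind of shortcut Quick Check provides, here is a deliberately simplified fast path. It is not the full Quick Check algorithm from UAX #15, which consults the NFC_Quick_Check property and combining classes of each codepoint. It only relies on the fact that every codepoint below U+0300 is unaffected by NFC, so a string made of such codepoints is already normalized. The names QuickCheck and nfc_fast_path are hypothetical.

```rust
/// Result of the simplified check: Yes means "already in NFC",
/// Maybe means "run the full normalization to find out".
#[derive(Debug, PartialEq, Eq)]
enum QuickCheck {
    Yes,
    Maybe,
}

/// Simplified stand-in for Quick Check: identifiers whose codepoints are
/// all below U+0300 can skip normalization entirely.
fn nfc_fast_path(ident: &str) -> QuickCheck {
    if ident.chars().all(|c| (c as u32) < 0x0300) {
        QuickCheck::Yes
    } else {
        QuickCheck::Maybe
    }
}
```

Since most identifiers in real code are ASCII, even this coarse test would let the lexer skip normalization for the vast majority of tokens.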

Last Step

In the last step, Punycode encoding was implemented. Punycode is used in the v0 mangling scheme to encode Unicode identifiers. Alongside this, the legacy mangling scheme was fixed to support Unicode identifiers.
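For reference, the Punycode encoder described in IETF RFC 3492 can be sketched in Rust as follows. This is an independent illustration of the algorithm, not the C++ code that was added to GCC Rust. The helper v0_punycode, which replaces the "-" delimiter with "_" as the v0 mangling RFC describes, is a hypothetical name.

```rust
// Parameters fixed by RFC 3492.
const BASE: u32 = 36;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;
const INITIAL_BIAS: u32 = 72;
const INITIAL_N: u32 = 128;

/// Map a digit value 0..=35 to 'a'..='z' followed by '0'..='9'.
fn encode_digit(d: u32) -> char {
    if d < 26 { (b'a' + d as u8) as char } else { (b'0' + (d - 26) as u8) as char }
}

/// Bias adaptation function from RFC 3492, section 6.1.
fn adapt(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta /= if first_time { DAMP } else { 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + (BASE - T_MIN + 1) * delta / (delta + SKEW)
}

/// Encode `input` as Punycode. Returns None on arithmetic overflow.
fn punycode_encode(input: &str) -> Option<String> {
    let code_points: Vec<u32> = input.chars().map(|c| c as u32).collect();
    // Copy the ASCII codepoints first, followed by the delimiter if any exist.
    let mut output: String = input.chars().filter(|c| c.is_ascii()).collect();
    let basic_len = output.len() as u32;
    let mut handled = basic_len;
    if basic_len > 0 {
        output.push('-');
    }
    let (mut n, mut delta, mut bias) = (INITIAL_N, 0u32, INITIAL_BIAS);
    while (handled as usize) < code_points.len() {
        // Smallest codepoint that has not been handled yet.
        let m = *code_points.iter().filter(|&&c| c >= n).min()?;
        delta = delta.checked_add((m - n).checked_mul(handled + 1)?)?;
        n = m;
        for &c in &code_points {
            if c < n {
                delta = delta.checked_add(1)?;
            }
            if c == n {
                // Emit delta as a variable-length base-36 integer.
                let (mut q, mut k) = (delta, BASE);
                loop {
                    let t = if k <= bias { T_MIN }
                            else if k >= bias + T_MAX { T_MAX }
                            else { k - bias };
                    if q < t {
                        break;
                    }
                    output.push(encode_digit(t + (q - t) % (BASE - t)));
                    q = (q - t) / (BASE - t);
                    k += BASE;
                }
                output.push(encode_digit(q));
                bias = adapt(delta, handled + 1, handled == basic_len);
                delta = 0;
                handled += 1;
            }
        }
        delta += 1;
        n += 1;
    }
    Some(output)
}

/// Hypothetical helper: Punycode with '-' replaced by '_', as used in v0 names.
fn v0_punycode(ident: &str) -> Option<String> {
    punycode_encode(ident).map(|s| s.replace('-', "_"))
}
```

For example, the identifier g, U+00F6, d, e, l encodes to gdel-5qa in plain Punycode and to gdel_5qa in the form used inside v0 symbol names.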

Remaining Tasks

As mentioned above, three tasks related to Unicode are left:

adding Quick Check for NFC,
adding more tests for NFC normalization, and
completing the implementation of the v0 mangling scheme.

I will continue to work on the v0 mangling scheme after GSoC2023.

Pull Requests and Issues

Pull Requests

First Step

Second Step

Third Step

https://github.com/Rust-GCC/gccrs/pull/2364

Fourth Step

Last Step

Issues

All of them can also be found here.

My Learnings

I am glad to have learned many things during the project.

First of all, I did not have experience with such large C++ codebases before, so it was a great opportunity to learn about compiler implementation as well as C++. I have also read and implemented several specifications, including Rust RFCs, the Unicode Standard, and IETF RFCs, which were technically interesting.

Also, I was very anxious about my English skills, especially my speaking skills. However, my mentors were very patient and helped me express what I wanted to say. It was a precious opportunity to improve my English while living in Japan.

Acknowledgements

This project would not have been possible without the continued support of many people. I would especially like to thank:

Arthur Cohen, Philip Herron: for being fantastic mentors and for having continuously guided me since last year
Pierre-Emmanuel Patry, Marc Poulhiès, and Thomas Schwinge: for code review and dedicated advice
GCC Community: for discussion and technical advice
Southball: for proofreading my proposal and this report as a friend