# Compiler Project (Draft)
Build an educational compiler from scratch.
## Phase 1
### TBD
- Project name:
- e.g. Orange
> Not even having a *C* :laughing:
> How about *VitaminC*? I came up with this idea from the C.C. Lemon thing.
> [name=Lai-YT]
> **VitaminC** it is!
> [name=Lee]
- GitHub Organization name:
- Fruits
> Then, I suggest that we can use **fruits** as our organization name, and "Food for thought" as the description. 🤟🍊
> [name=Lee]
- Source Language:
- [x] C (Simple)
- Target Language:
- [x] RISC-V (Simple)
- ARM (accessible hardware such as Raspberry Pi, or use a simulator)
> It would be much more enjoyable if we could run the compiled C executable on an x86-64 computer, wouldn't it? XD (looking forward to supporting it as one of our goals)
> [name=Lai-YT]
> Sure, we can add x86-64 as our second target. I chose RISC-V and ARM because they are RISC instruction sets, which are easier to target. Let's add x86-64 to Phase 2.
> [name=Lee]
- Compiler Language:
- C (Simple, but we need to write our own data structures)
- [x] C++ (Standard Library support, align with LLVM)
- Rust (Cargo package management and better documentation support, but less familiar to us)
- Do we write our own frontend? Or use lex, yacc instead?
- write our own frontend
> I would like to handcraft the parser, but I currently have no idea how to handle the lexer.
> I will explore how other developers have dealt with the lexer and see if that gives me any ideas.
> [name=Lai-YT]
> Another option would be to use Lex and Yacc in Phase 1 to accelerate the development process. We could then prioritize optimization in Phase 2 or use a hand-written parser if optimization is not a major concern.
> But make sure we're careful about using them so that our compiler is well-modularized and each component can evolve with as little pain as possible.
> [name=Lai-YT]
> Sure, using Lex and Yacc can save us time once we've finished our compiler homework.
> [name=Lee]
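
On the open question of how to handle the lexer by hand, here is a minimal sketch of one common approach: scan the input left to right and classify each lexeme by its first character. The names `TokenKind`, `Token`, and `Lex` are hypothetical and not a decided VitaminC design. It ignores keywords, multi-character operators, comments, and error reporting.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds for a tiny C subset (an assumption, not a spec).
enum class TokenKind { Identifier, Number, Punct, End };

struct Token {
  TokenKind kind;
  std::string text;
};

// Scans `src` left to right and emits one token per lexeme.
// Whitespace is skipped; any other unknown character becomes a Punct token.
std::vector<Token> Lex(const std::string& src) {
  std::vector<Token> tokens;
  std::size_t i = 0;
  auto is_ident_char = [](char ch) {
    return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
  };
  while (i < src.size()) {
    const unsigned char c = static_cast<unsigned char>(src[i]);
    if (std::isspace(c)) {
      ++i;
      continue;
    }
    const std::size_t start = i;
    if (std::isalpha(c) || c == '_') {
      // Identifier (keywords would be recognized here in a fuller lexer).
      while (i < src.size() && is_ident_char(src[i])) ++i;
      tokens.push_back({TokenKind::Identifier, src.substr(start, i - start)});
    } else if (std::isdigit(c)) {
      // Decimal integer literal only.
      while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
      tokens.push_back({TokenKind::Number, src.substr(start, i - start)});
    } else {
      // Single-character punctuator.
      ++i;
      tokens.push_back({TokenKind::Punct, src.substr(start, 1)});
    }
  }
  tokens.push_back({TokenKind::End, ""});
  return tokens;
}
```

Calling `Lex("int x = 42;")` would yield the tokens `int`, `x`, `=`, `42`, `;`, and an end marker. Keeping the lexer behind a small interface like this should also make it easier to swap in a Lex-generated one later, or the other way around.
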
### Goals
- [ ] Simple compiler without IR
> I think there is an advantage to using intermediate representation (*IR*) early in the compiler development process.
> By doing so, we can leverage the back-end of *LLVM* after implementing the front-end, rather than having to implement the entire compiler before being able to test it.
> However, I am unsure about the level of difficulty involved in converting the code to *IR* instead of directly converting it into *RISC-V* instructions.
> If it requires significant effort, the potential benefits may not justify the extra work involved.
> [name=Lai-YT, revised by ChatGPT]
> Nice one ChatGPT 😎
>
> Do we want to implement our own IR?
> In the **long** term, yes, I do want us to develop our own IR.
> In the **short** term, I think it's more complicated since we don't have much experience in designing an IR. That's why I moved implementing an IR to Phase 2, after we finish crafting a simple compiler from frontend to backend. Also, I think that after we implement our own backend, we will have more thoughts on designing an IR.
> Yet, you did raise a good point about having an existing backend to test against. Maybe we can try to leverage a small compiler backend like [QBE](https://c9x.me/compile/docs.html)?
>
> [name=Lee]
>
> Designing a new IR seems quite unnecessary. A custom IR would be hard to use with existing tools.
> While *LLVM*'s IR can be quite complex, the *QBE* back-end that you suggested seems like a good starting point for beginners like us. :smile:
> [name=Lai-YT]
> Great! QBE's author even has a yacc example implementation that we can reference.
> [name=Lee]
> Cool!
> [name=Lai-YT]
- [ ] CI + test
> Will we need a simulator to actually run the compiled executable?
> For example, [Spike](https://github.com/riscv-software-src/riscv-isa-sim) or [QEMU](https://github.com/qemu/qemu).
> [name=Lai-YT]
> Yes! 🤟
> [name=Lee]
- [ ] Documentation
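
To get a feel for what targeting QBE (discussed under the first goal) might involve, here is a rough sketch of a front-end emitting QBE's textual intermediate language for `int main() { return a + b; }`. The helper name `EmitMainReturningSum` and the idea of building the text with `std::ostringstream` are assumptions for illustration only. The emitted syntax follows the examples in the QBE documentation as we understand them.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns QBE IL text for a `main` that returns a + b.
// `w` is QBE's 32-bit word type, `$` marks a global symbol,
// `%` marks a function-local temporary, and `@` marks a block label.
std::string EmitMainReturningSum(int a, int b) {
  std::ostringstream out;
  out << "export function w $main() {\n"
      << "@start\n"
      << "\t%sum =w add " << a << ", " << b << "\n"
      << "\tret %sum\n"
      << "}\n";
  return out.str();
}
```

The resulting text could be written to a file and handed to QBE to produce assembly for one of its supported targets, which would let us test the front-end before writing our own backend.
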
### References
- [awesome-compilers](https://github.com/aalhour/awesome-compilers#educational-and-toy-projects)
- [chibicc](https://github.com/rui314/chibicc)
- [shecc](https://github.com/jserv/shecc): targeted at 32-bit _Arm_ and _RISC-V_ architectures
- [tinycc](https://github.com/TinyCC/tinycc): *ANSI C*. Seems like it now supports _RISC-V_ as well.
- [QBE](https://c9x.me/compile/docs.html): A small compiler backend written in C that supports multiple targets, such as amd64 (Linux and macOS), arm64, and riscv64.
- [How I wrote a self-hosting C compiler in 40 days](https://www.sigbus.info/how-i-wrote-a-self-hosting-c-compiler-in-40-days):
> Summary:
> 1. Start small and expand the features step by step.
> 2. Always remember to initialize structs and other objects to zero, or you will get garbage. (I made this mistake several times. 😂)
> 3. Shocking how someone can write a C compiler in a month, but he said he had 15 years of experience in C, so I guess that's why. 😧
> 4. Rui (author): *Although I'm thinking that 8cc is one of the best programs I have ever written, I'd choose a different design than that if I were to write it again. Particularly, I'd use yacc instead of writing a parser by hand and introduce an intermediate language early on.*
> ☝️ Something we can think about, but we're newbies compared with his experience. 😂
> [name=Lee]
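
Since we chose C++ as the compiler language, the zero-initialization advice in point 2 can be sketched as follows. The `Node` struct and `MakeNumber` helper are hypothetical, for illustration only. Empty braces value-initialize the aggregate, so every member starts at zero or `nullptr` instead of holding garbage.

```cpp
#include <cassert>

// Hypothetical AST node, for illustration only.
struct Node {
  int kind;
  int value;
  Node* lhs;
  Node* rhs;
};

// Builds a number node. `Node node{};` zeroes every member first,
// so the fields we do not set explicitly are 0 / nullptr rather than garbage.
Node MakeNumber(int value) {
  Node node{};
  node.value = value;
  return node;
}
```

Writing `Node node;` without the braces would leave `kind`, `lhs`, and `rhs` indeterminate for a local variable, which is exactly the kind of bug the article warns about.
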
## Phase 2
### Goals
- [ ] Intermediate Representation
- [ ] Optimization
- [ ] Target multiple backends, including x86-64