# sse2neon Contribution Team Sharing
###### tags: sse2neon
---
## What's sse2neon?
----
### So what's intrinsic in computer science?
- Intrinsics, formally intrinsic functions, are functions whose implementations are handled specially by the compiler [^8].
- Intrinsics are usually used to implement high-performance code.
[^8]: https://en.wikipedia.org/wiki/Intrinsic_function
----
### What's SSE and NEON?
- SSE (Streaming SIMD Extensions) is a SIMD instruction set extension on x86.
- NEON is a SIMD instruction set extension for the Arm Cortex-A and Cortex-R series of processors.
- SSE/NEON are used for digital signal processing, graphics processing, string matching, and more.
----
### Get back to the point, what's sse2neon?
- `sse2neon` translates SSE intrinsics to Arm NEON, shortening the time needed to get a working Arm program, which can then be profiled to identify hot paths in the code. [^1]
- Many open-source projects have adopted `sse2neon` for Arm/AArch64 support, including the Apple M1 CPU.
[^1]: https://github.com/DLTcollab/sse2neon
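
A minimal sketch of how a project might use it, assuming `sse2neon.h` from the repository is on the include path when building for Arm. `add_floats` and `HAVE_SSE_API` are names made up for this example; the `_mm_*` calls are standard SSE intrinsics, and a plain scalar loop is used when neither header is available.

```c
#include <stddef.h>

/* Pick a provider for the SSE API: the native header on x86,
 * sse2neon on Arm if the header can be found. */
#if defined(__SSE__)
#include <xmmintrin.h>
#define HAVE_SSE_API 1
#elif defined(__ARM_NEON) && defined(__has_include)
#if __has_include("sse2neon.h")
#include "sse2neon.h"
#define HAVE_SSE_API 1
#endif
#endif

/* out[i] = a[i] + b[i] for i in [0, n) */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE_API
    /* Same source on x86 and Arm: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when no SSE API is available). */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```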
---
## My contribution
- `_rdtsc`
----
### `_rdtsc`
- Reads the current value of the processor's time-stamp counter.
- https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_rdtsc
- Usually used for fine-grained performance measurement.
---
## Interaction with Open Source Community (taking sse2neon as example)
----
### What skills you need
- passion
- C fundamentals (function pointer, macro, etc.)
- compiler
- SIMD fundamentals (optional)
- Linux kernel fundamentals
----
### How to make a contribution
- GitHub issues
- #good first issue
- Talk to maintainers
----
### How to get your work reviewed
- Read the contribution guide, including:
    - Git commit messages
    - Rules of thumb for writing C programs
----
### Should I prepare test cases?
- Of course! For unit tests and CI/CD
- Rules of thumb for writing test cases
    - Reflect the target of the test
    - Use a generator (macro) to shorten the code and generalize the cases
---
## Terms
----
### Arm Coprocessors
Arm processors can be extended with a number of coprocessors to perform non-standard calculations and to avoid having to do these calculations in software. Coprocessors are available for memory management, floating point operations, debugging, media, cryptography, ... [^4]
[^4]: https://low-level.readthedocs.io/en/latest/arch/arm/
----
### Privilege/exception level
- The same concept under different names: privilege levels in Armv7, exception levels in Armv8.
- Privilege levels [^7] let the system keep unprivileged (user) code from accessing certain registers.
- In Armv7-A, user-mode code can access the Performance Monitoring Unit (PMU) only if privileged code has enabled that access.
- Otherwise, user code has to go through a kernel API (syscall) instead.
[^7]: https://developer.arm.com/documentation/102412/0102/Privilege-and-Exception-levels
----
### `__asm__ __volatile__("" ::: "memory");`
- It creates a compiler-level memory barrier that prevents the optimizer from reordering memory accesses across it [^6].
[^6]: https://stackoverflow.com/questions/14950614/working-of-asm-volatile-memory
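
A small sketch of the barrier inside a loop, assuming a compiler with GCC-style extended asm (GCC or Clang). `spin` is a hypothetical helper for this slide. The barrier emits no instruction and only constrains the compiler; it does not order anything at the CPU level.

```c
#include <stdint.h>

/* Run `iterations` rounds of an otherwise empty loop. */
uint32_t spin(uint32_t iterations)
{
    uint32_t done = 0;
    for (uint32_t i = 0; i < iterations; i++) {
        /* Empty asm with a "memory" clobber: the compiler must assume
         * memory may have changed here, and because the statement is
         * volatile it cannot be deleted, so the loop is kept. */
        __asm__ __volatile__("" ::: "memory");
        done++;
    }
    return done;
}
```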
---
## The road of implementation
----
### Armv8-A
```c
__asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(val));
```
- Read the value from `CNTVCT_EL0` (Counter-timer Virtual Count register). [^2]
[^2]: https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/CNTVCT-EL0--Counter-timer-Virtual-Count-register
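
Wrapped in a function, the read might look like the sketch below. `read_virtual_counter` is an illustrative name, and the `clock_gettime` branch is an assumption added only so the snippet also builds on non-AArch64 hosts; it is not what sse2neon does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime in strict C modes */
#endif
#include <stdint.h>
#include <time.h>

uint64_t read_virtual_counter(void)
{
#if defined(__aarch64__)
    uint64_t val;
    /* CNTVCT_EL0: Counter-timer Virtual Count register */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(val));
    return val;
#else
    /* Stand-in for other hosts (assumption): monotonic time in ns. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ull + (uint64_t) ts.tv_nsec;
#endif
}
```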
----
### Armv7-A
```c
uint32_t pmccntr, pmuseren, pmcntenset;
// Read the user mode Performance Monitoring Unit (PMU)
// User Enable Register (PMUSERENR) access permissions.
__asm__ __volatile__("mrc p15, 0, %0, c9, c14, 0" : "=r"(pmuseren));
if (pmuseren & 1) {  // Bit 0 set: user mode code may read the PMU registers.
    __asm__ __volatile__("mrc p15, 0, %0, c9, c12, 1" : "=r"(pmcntenset));
    if (pmcntenset & 0x80000000UL) {  // Bit 31 set: is it counting?
        __asm__ __volatile__("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));
        // The counter is set up to count every 64th cycle
        return (uint64_t) (pmccntr) << 6;
    }
}
// Fall back to a syscall (gettimeofday) as we can't enable PMUSERENR in user mode.
struct timeval tv;
gettimeofday(&tv, NULL);
return (uint64_t) (tv.tv_sec) * 1000000 + tv.tv_usec;
```
----
### Armv7-A (cont.)
- Read the `PMCCNTR` register of coprocessor `p15` (not an actual coprocessor, just an entry point for CPU functions) to obtain the cycle count. [^3]
- `PMCCNTR` is only available to an unprivileged application if:
    1. Unprivileged `PMCCNTR` reads are allowed: bit 0 of the `PMUSERENR` register must be set to 1.
    2. `PMCCNTR` is actually counting cycles: bit 31 of the `PMCNTENSET` register must be set to 1.
- If either condition is not met, use a system call to get the system time instead.
[^3]: https://stackoverflow.com/a/40455065
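
The two conditions can be sketched as plain bit tests on already-read register values, which makes them easy to check on any host. The macro and function names below are illustrative, not part of sse2neon.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMUSERENR_EN (1u << 0)  /* bit 0: user-mode PMU access enabled */
#define PMCNTENSET_C (1u << 31) /* bit 31: cycle counter enabled */

/* True only when both conditions hold; otherwise the caller
 * should fall back to a system call. */
bool pmccntr_usable(uint32_t pmuserenr, uint32_t pmcntenset)
{
    return (pmuserenr & PMUSERENR_EN) && (pmcntenset & PMCNTENSET_C);
}

/* The counter ticks every 64th cycle, so scale the raw value by 64. */
uint64_t scale_pmccntr(uint32_t pmccntr)
{
    return (uint64_t) pmccntr << 6;
}
```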
----
### test case
```cpp
uint64_t start = _rdtsc();
for (int i = 0; i < 100000; i++)
__asm__ __volatile__("" ::: "memory");
uint64_t end = _rdtsc();
return end > start ? TEST_SUCCESS : TEST_FAIL;
```
- `__asm__ __volatile__("" ::: "memory");` keeps the compiler from optimizing away the otherwise empty for loop.
---
## Q&A
---
## Appendix
----
### syscall -> vsyscall -> vDSO [^5]
- To use a service provided by the OS, we can make a syscall.
- However, a syscall requires a switch into kernel mode, which adds overhead. vsyscall was created to reduce that latency.
- Yet vsyscall requires a fixed, predictable address, which conflicts with ASLR. vDSO was introduced to reduce the latency while still allowing the address to be randomized.
[^5]: https://hackmd.io/@sysprog/linux-vdso
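
This matters for the Armv7-A fallback path: on Linux, `gettimeofday` is one of the calls the C library can typically serve through the vDSO, so it may not pay the full syscall cost. Whether that holds depends on the kernel and architecture, so treat it as an assumption. A minimal sketch of the fallback as a standalone function (`wall_clock_us` is a hypothetical name):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Wall-clock time in microseconds since the Unix epoch. */
uint64_t wall_clock_us(void)
{
    struct timeval tv;
    /* Looks like an ordinary libc call; the vDSO, when present,
     * is used behind the scenes without any change to this code. */
    gettimeofday(&tv, NULL);
    return (uint64_t) tv.tv_sec * 1000000u + (uint64_t) tv.tv_usec;
}
```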
<style>
.reveal {
font-size: 32px;
}
</style>
<!--
### improvement
1. explain what is intrinsic
2. what is SSE and NEON
3. give an example when you would use SSE/NEON
### Jill's feedback
- 10 min max
- no one knows what the heck are you talking
- Armv7-A implementation
- just talk about normal and syscall
- maybe no need to talk about the difference of Armv7-A and Armv8-A
- maybe no need to talk bout syscall term
- put in appendix/backup slide
- assembly go to appendix
- "memory" -> sleep for a while
-->