- Shoutout to Tsiry Mayet (Maugrim) for taking the time to read my blog post and give meaningful feedback (even though it isn't his field of expertise!).
Film grain application & synthesis
`pp_film_grain.c`.

x86-64

- Write a small function such as `dot_product`, pass it to Godbolt and analyze the generated assembly code. Since it is valid code, you can also step through it with your favorite debugger and try to understand what each line means.
left: C code / middle: x86-64 code / right: SIMD x86-64 code
SIMD intrinsics
- Here are valuable resources to learn SIMD intrinsics:
- Please refer to the x86-64 part for the learning process.
- Intrinsics follow the naming convention `_mm[size]_<action>_<type>`. For example, `_mm_add_epi16` will "add two 128-bit vectors of 16-bit extended packed integers (short type)" (see the sketch below).

SSE is one of the many so-called "SIMD extensions" for x86. There are also the AVX and AVX-512 extensions, for example.
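To make the naming convention above concrete, here is a minimal, self-contained sketch (my own example, not from the decoder) using `_mm_add_epi16`:

```c
#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h> /* SSE2: _mm_add_epi16 */

int main(void)
{
    int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int16_t r[8];

    /* Load two 128-bit vectors of 16-bit integers, add them lane by lane, store the result. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi16(va, vb);
    _mm_storeu_si128((__m128i *)r, vr);

    for (int i = 0; i < 8; i++)
        printf("%d ", r[i]);
    printf("\n"); /* 11 22 33 44 55 66 77 88 */
    return 0;
}
```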
SSE provides 8 vector registers, `xmm0`-`xmm7`. The AMD64 extension from AMD (originally called x86-64) later added 8 more registers, `xmm8`-`xmm15` (this was later ported to the Intel 64 architecture as well).

x86inc.asm
dav1d writes x86-64 SIMD code using an abstraction provided by the `x86inc.asm` file. `x86inc.asm` makes hand-written x86 assembly easier to write by unifying the different syntaxes between Windows and Linux and the different versions of the vectorized instruction sets (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 …).
- Here is an awesome guide to get started.
- I also recommend reading the file itself (`x86inc.asm` is mostly composed of macros) to get a better feel for how things work.
- To understand how `x86inc.asm` is called on the C side, I isolated the code in a gist and expanded the macros using `nasm -E prologue.asm`.
- The declaration syntax in `x86inc.asm` is as follows:

```
cglobal function_name, #args (%1), #regs (%2), #xmm_regs (%3), [stack_size,] (%4) arg_names... (%5-*)
```
For example,

```
cglobal sad_16x16, 4, 7, 5, src, src_stride, dst, dst_stride, src_stride3, dst_stride3, cnt
```

means:

- load the 4 function arguments (`src, src_stride, dst, dst_stride`) into registers;
- use 7 general-purpose registers in total (the 4 arguments plus `src_stride3`, `dst_stride3`, `cnt`);
- use 5 xmm registers;
- the argument-to-register mapping is:
  - `src` => r0 (rdi)
  - `src_stride` => r1 (rsi)
  - `dst` => r2 (rdx)
  - `dst_stride` => r3 (rcx)
  - `src_stride3` => r4 (R8)
  - `dst_stride3` => r5 (R9)
  - `cnt` => r6 (rax)
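For context, calling such a routine from C just means declaring its symbol; here is a hedged sketch of what the prototype might look like (the name and argument types are assumptions, not the project's actual declaration):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prototype for the cglobal routine above; the exact symbol name
 * (prefix, suffix) depends on the project's naming conventions. */
int sad_16x16_sse4(const uint8_t *src, ptrdiff_t src_stride,
                   const uint8_t *dst, ptrdiff_t dst_stride);
```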
Sanity check setup

- Run `diff(1)` between the reference image and your function's output image.
- Run `diff` between the 2 log files to see the differences. Really useful, especially when you are debugging a multi-threaded program. `diff log_ref log_res` outputs the following:
- Set a watchpoint on the faulty value (in my case: `watch ($xmm1.v4_int32[1])==14`).
Breakpoints are set on a function or a line of code, while watchpoints are set on variables.
Once the watchpoint is triggered and the value found, generate a core file with `generate-core-file <number>` each time you step in your debugger. This will let you replay the debugging session as much as you want.
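Another cheap sanity check is a tiny self-test harness. Here is a minimal, self-contained sketch (my own example, not the decoder's actual test setup) that cross-checks two interchangeable implementations of a pixel clip over a range of inputs, as a stand-in for comparing the reference and optimized code paths:

```c
#include <stdint.h>
#include <stdio.h>

/* Naive clip (the "reference"). */
static uint32_t clip_ref(int32_t v, uint32_t bits)
{
    int32_t max = (1 << bits) - 1;
    if (v < 0)   return 0;
    if (v > max) return (uint32_t)max;
    return (uint32_t)v;
}

/* Branchless clip (the "optimized" candidate). */
static uint32_t clip_opt(int32_t v, uint32_t bits)
{
    if (v > 0) {
        int32_t mask = (1 << bits) - 1;
        int32_t overflow = !!(v & (~mask));
        return ((-overflow) & mask) | (v & mask);
    }
    return 0;
}

int main(void)
{
    for (int32_t v = -2048; v < 2048; v++) {
        if (clip_ref(v, 10) != clip_opt(v, 10)) {
            printf("mismatch for v=%d: ref=%u opt=%u\n", v, clip_ref(v, 10), clip_opt(v, 10));
            return 1;
        }
    }
    printf("all values match\n");
    return 0;
}
```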
Benchmark

- I created a struct `BenchWrapper` containing my benchmark function as a function pointer, so that I don't need to change the structure of the existing code in order to bench it:
```c
struct BenchWrapper
{
    size_t nb_fct_call;
    size_t iter_per_fct;
    // Define function arguments here
    int (*func)(const OVSEI *sei, OVFrame **frame);
};

void bench_decorator(struct BenchWrapper bw, char *name, const OVSEI *sei, OVFrame **frame)
{
    volatile int dummy;
    double total_avg_time = 0;
    for (size_t i = 0; i < bw.nb_fct_call; ++i)
    {
        double avg_time = 0;
        for (size_t j = 0; j < bw.iter_per_fct; ++j)
        {
            double t1 = clock();
            volatile const int r = bw.func(sei, frame);
            double t2 = clock();
            dummy = r;
            double dt = (t2 - t1) * 1000 / CLOCKS_PER_SEC; // millisecond
            avg_time += dt;
        }
        avg_time /= (double)bw.iter_per_fct;
        total_avg_time += avg_time;
    }
    total_avg_time /= (double)bw.nb_fct_call;
    printf("%-20s: %f ms (average time / %zu calls of %zu iterations each)\n",
           name,
           total_avg_time,
           bw.nb_fct_call,
           bw.iter_per_fct);
}
```
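For example, wiring it up could look like this (a sketch: `pp_film_grain_apply` stands in for whatever function you want to time and is not necessarily the decoder's real symbol):

```c
/* Hypothetical usage: `pp_film_grain_apply` is an assumed name for the function under test. */
struct BenchWrapper bw = {
    .nb_fct_call  = 10,   /* repeat the whole measurement 10 times          */
    .iter_per_fct = 100,  /* average over 100 iterations per measurement    */
    .func         = pp_film_grain_apply,
};
bench_decorator(bw, "film_grain", sei, &frame);
```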
Hotspot provides several views:

- Summary: shows the top hotspots and displays system information.
- Bottom Up: shows all the functions that call `ov_clip_uintp2`.
- Top Down: shows what `fg_blend_stripe` calls.
- Flame Graph: shows a collection of stack traces/call stacks. To learn about flame graphs, please read Brendan Gregg's awesome post.
- `cycles:u (self)` / `cycles:u (incl.)`: in Top Down, you can see 2 columns:
  - `cycles:u (self)`: total program time spent in the function itself, without taking its callee functions/sub-functions into account.
  - `cycles:u (incl.)`: total program time spent in the function, including its callee functions/sub-functions.
  - In the figure, `cycles:u (incl.)` corresponds to `fg_blend_stripe` as a whole (which includes the `ov_clip_uintp2` calls), while `cycles:u (self)` corresponds to `fg_blend_stripe` only (which excludes the 2 `ov_clip_uintp2` calls).
Thus, it makes sense not to have `cycles:u (incl.)` in Bottom Up, because there we are showing who is calling my function.
- Caller/Callee:
  - caller: cost of calling `fg_blend_stripe`.
  - callee: cost of the sub-functions called by `fg_blend_stripe` and their location in the code (the line `pp_film_grain.c:629` is responsible for 29.6% of the time spent in `fg_blend_stripe` only (`cycles:u (self)`), which is 38.1%).
- Example: the function `fg_blend_stripe` accounts for 21.4% of the program's total time (`cycles:u (incl.)`), of which 16.8% is spent on `ov_clip_uintp2` and 83.2% on the function itself.
Given the numbers reported by Hotspot, how does one interpret a speedup? Let's define the fraction of the total program time spent in our function, both in the reference build and in the optimized build.

- For example, the SSE4 version represents 10% of the total optimized program time, compared to 25% of the total time for the reference version.
- One can see this fraction as a weighting of how much the speedup of our function will be reflected in the whole program.
- This implies diminishing returns when optimizing the same function over and over (especially if it barely accounts for anything in the reference program).
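To make that concrete, here is the usual back-of-the-envelope formula (my notation, not necessarily the one used in the original figures): if a function takes a fraction $p$ of the total program time and we speed that function up by a factor $s$, the whole-program speedup is

$$S_{\text{program}} = \frac{1}{(1 - p) + \frac{p}{s}}$$

With $p = 0.25$ and a 2x speedup of the function ($s = 2$), the program only gets about $\frac{1}{0.75 + 0.125} \approx 1.14\times$ faster, which is why the returns diminish quickly.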
Let's dive into the important part. The function to optimize is `fg_grain_apply_pic()`, which is called by `pp_process_frame()`. The main hot spots in `fg_grain_apply_pic()` are:

- `fg_simulate_grain_blk8x8`
- `fg_compute_block_avg`
- `fg_blend_stripe` (with `ov_clip_uintp2`)
Remark: the function `ov_clip_uintp2()` is called in both `fg_blend_stripe` and `fg_compute_block_avg`, but it is used in a different way: in one case at a granular level, in the other at a vectorized level.
fg_simulate_grain_blk8x8

Here is the reference code of `fg_simulate_grain_blk8x8()`:
```c
void fg_simulate_grain_blk8x8(int32_t *grainStripe, uint32_t grainStripeOffsetBlk8,
                              uint32_t width, uint8_t log2ScaleFactor, int16_t scaleFactor, uint32_t kOffset, uint32_t lOffset, uint8_t h, uint8_t v, uint32_t xSize)
{
    uint32_t k, l;
    uint32_t idx, idx_offset, idx_offset_l, grainStripeOffsetBlk8_l;
    idx_offset = (h * NUM_CUT_OFF_FREQ + v) * DATA_BASE_SIZE * DATA_BASE_SIZE;
    for (l = 0; l < 8; l++) /* y direction */
    {
        idx_offset_l = idx_offset + (l + lOffset) * DATA_BASE_SIZE;
        grainStripeOffsetBlk8_l = grainStripeOffsetBlk8 + (l * width);
        for (k = 0; k < xSize; k++) /* x direction */
        {
            idx = idx_offset_l + (k + kOffset);
            grainStripe[grainStripeOffsetBlk8_l + k] = ((scaleFactor * fg_data_base[idx]) >> (log2ScaleFactor + GRAIN_SCALE));
        }
    }
    return;
}
```
- Given `grainStripe`, we want to fill only a subset of this array (of size 8 x `xSize`) at the offset `grainStripeOffsetBlk8`.
- Let's assume `xSize=8` for convenience. We will fill each cell of `grainStripe` with a `fg_data_base` value (chosen based on some `idx_offset`) that will then be scaled and right-shifted.

Here is the SSE4 version:
```c
void fg_simulate_grain_blk8x8_sse4(int32_t *grainStripe, uint32_t grainStripeOffsetBlk8,
                                   uint32_t width, uint8_t log2ScaleFactor, int16_t scaleFactor, uint32_t kOffset, uint32_t lOffset, uint8_t h, uint8_t v, uint32_t xSize)
{
    uint32_t idx_offset_l1, idx_offset_l2, idx_offset_l3, idx_offset_l4;
    uint32_t grainStripeOffsetBlk8_l1, grainStripeOffsetBlk8_l2, grainStripeOffsetBlk8_l3, grainStripeOffsetBlk8_l4;
    uint32_t idx_offset = (h * NUM_CUT_OFF_FREQ + v) * DATA_BASE_SIZE * DATA_BASE_SIZE;
    __m128i scale = _mm_set_epi32(scaleFactor, scaleFactor, scaleFactor, scaleFactor);
    for (uint32_t l = 0; l < 8; l += 4) {
        idx_offset_l1 = idx_offset + (l + lOffset) * DATA_BASE_SIZE;
        idx_offset_l2 = idx_offset + (l + 1 + lOffset) * DATA_BASE_SIZE;
        idx_offset_l3 = idx_offset + (l + 2 + lOffset) * DATA_BASE_SIZE;
        idx_offset_l4 = idx_offset + (l + 3 + lOffset) * DATA_BASE_SIZE;
        grainStripeOffsetBlk8_l1 = grainStripeOffsetBlk8 + (l * width);
        grainStripeOffsetBlk8_l2 = grainStripeOffsetBlk8 + ((l + 1) * width);
        grainStripeOffsetBlk8_l3 = grainStripeOffsetBlk8 + ((l + 2) * width);
        grainStripeOffsetBlk8_l4 = grainStripeOffsetBlk8 + ((l + 3) * width);
        for (uint32_t k = 0; k < xSize; k += 4)
        {
            /* Load 8 int8_t grain values, sign-extend the lower 4 to int32, scale and shift, store 4 int32 results. */
            __m128i fg_data_1 = _mm_loadu_si64(fg_data_base + idx_offset_l1 + (k + kOffset));
            __m128i fg_data_1_lo = _mm_cvtepi8_epi32(fg_data_1);
            __m128i chunk_1 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_1_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l1 + k], chunk_1);

            __m128i fg_data_2 = _mm_loadu_si64(fg_data_base + idx_offset_l2 + (k + kOffset));
            __m128i fg_data_2_lo = _mm_cvtepi8_epi32(fg_data_2);
            __m128i chunk_2 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_2_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l2 + k], chunk_2);

            __m128i fg_data_3 = _mm_loadu_si64(fg_data_base + idx_offset_l3 + (k + kOffset));
            __m128i fg_data_3_lo = _mm_cvtepi8_epi32(fg_data_3);
            __m128i chunk_3 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_3_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l3 + k], chunk_3);

            __m128i fg_data_4 = _mm_loadu_si64(fg_data_base + idx_offset_l4 + (k + kOffset));
            __m128i fg_data_4_lo = _mm_cvtepi8_epi32(fg_data_4);
            __m128i chunk_4 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_4_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l4 + k], chunk_4);
        }
    }
    return;
}
```
- We process 4 rows of `fg_data_base` at the same time.
- `grainStripe` is of type `int32_t`, so one 128-bit register holds 4 output values. `fg_data_base`, however, is of type `int8_t`: filling a whole register would fetch 16 of these values (128/8). We use `_mm_loadu_si64` instead, which only loads 8 `int8_t` values, as follows:
- Actually, I could have used `_mm_loadu_si32`, but it doesn't work for older versions of gcc (<= 11, see this stackoverflow post).
- The lower 4 `int8_t` values are then sign-extended to 32 bits with `_mm_cvtepi8_epi32` and the result is stored in `fg_data_X_lo`.
Top: Reference version / Bottom: SSE4 version
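As a side note, this load + sign-extend step can be tried in isolation. A minimal standalone sketch (my own example, not decoder code; compile with `-msse4.1`):

```c
#include <stdint.h>
#include <stdio.h>
#include <smmintrin.h> /* SSE4.1: _mm_cvtepi8_epi32 */

int main(void)
{
    int8_t  grain[8] = {-3, 7, -120, 55, 1, 2, 3, 4};
    int32_t out[4];

    __m128i v8  = _mm_loadu_si64(grain);   /* 8 int8_t in the low 64 bits        */
    __m128i v32 = _mm_cvtepi8_epi32(v8);   /* sign-extend the lower 4 to int32_t */
    _mm_storeu_si128((__m128i *)out, v32);

    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* -3 7 -120 55 */
    return 0;
}
```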
fg_compute_block_avg

Here is the reference code of `fg_compute_block_avg()`:
```c
int16_t fg_compute_block_avg(int16_t *dstSampleBlk8, uint32_t widthComp, uint16_t *pNumSamples,
                             uint8_t ySize, uint8_t xSize, uint8_t bitDepth)
{
    uint32_t blockAvg = 0;
    uint16_t numSamples = 0;
    uint8_t k, l;
    for (k = 0; k < ySize; k++)
    {
        for (l = 0; l < xSize; l++)
        {
            blockAvg += dstSampleBlk8[(k * widthComp) + l];
            numSamples++;
        }
    }
    if (numSamples > 0)
    {
        blockAvg /= numSamples;
        blockAvg >>= (bitDepth - 8); /* to handle high bit depths */
    }
    *pNumSamples = numSamples;
    blockAvg = (int16_t)ov_clip_uintp2(blockAvg, 8);
    return blockAvg;
}
```
- `ov_clip_uintp2` is called only after the average is computed (granular level), which means that we don't need to vectorize it.

Here is the SSE4 version:
```c
int16_t fg_compute_block_avg_sse4(int16_t *dstSampleBlk8, uint32_t widthComp, uint16_t *pNumSamples,
                                  uint8_t ySize, uint8_t xSize, uint8_t bitDepth)
{
    uint16_t blockAvg = 0;
    uint16_t numSamples = 0;
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < ySize; i += 1, numSamples += 8)
    {
        __m128i x = _mm_loadu_si128((const __m128i *)&dstSampleBlk8[i * widthComp]);
        acc = _mm_adds_epi16(acc, x);
    }
    if (numSamples > 0)
    {
        acc = _mm_hadd_epi16(acc, acc);
        acc = _mm_hadd_epi16(acc, acc);
        acc = _mm_hadd_epi16(acc, acc);
        blockAvg = _mm_cvtsi128_si32(acc);
        blockAvg /= numSamples;
        blockAvg >>= (bitDepth - 8); /* to handle high bit depths */
    }
    // assert(blockAvg < (1 << 8));
    *pNumSamples = numSamples;
    // blockAvg = (int16_t) OVMIN(OVMAX(0, blockAvg), (1 << 8) - 1 );
    blockAvg = (int16_t)ov_clip_uintp2((uint32_t)blockAvg, 8);
    return blockAvg;
}
```
- In the reference code, the samples are of type `int16_t` but we accumulate on a `uint32_t` (`blockAvg`).
- In the SSE4 version, we accumulate on 16-bit integers (Line 4) in order to fetch more values (8 instead of 4) per `_mm_loadu_si128` call. Doing so forces us to perform a cast to `uint32_t` when calling `ov_clip_uintp2`.
- We call `_mm_hadd_epi16` 3 times in a row to get the final sum. The final sum ends up in every element of the register, but that's not an issue: we just copy the lower 32-bit integer using `_mm_cvtsi128_si32` in order to get only one value.

Top: Reference version / Bottom: SSE4 version.
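The horizontal-sum idiom can likewise be tried in isolation (again a minimal sketch of mine; `_mm_hadd_epi16` needs SSSE3 or higher):

```c
#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h> /* SSSE3: _mm_hadd_epi16 */

int main(void)
{
    int16_t v[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    __m128i acc = _mm_loadu_si128((const __m128i *)v);
    /* Each hadd halves the number of partial sums; after 3 passes every lane holds the total. */
    acc = _mm_hadd_epi16(acc, acc);
    acc = _mm_hadd_epi16(acc, acc);
    acc = _mm_hadd_epi16(acc, acc);

    int sum = (int16_t)_mm_cvtsi128_si32(acc); /* keep only the low 16 bits of the low lane */
    printf("%d\n", sum); /* 36 */
    return 0;
}
```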
For the hand-written x86 version, keep in mind that we only have 8 vector registers (`xmm0`-`xmm7`). Let's first look at how `fg_compute_block_avg` is called:
```c
/* Loop of 16x16 blocks */
for (y = 0; y < heightComp[compCtr]; y += 16)
{
    ...
    for (x = 0; x < widthComp[compCtr]; x += 16)
    {
        ...
        for (blkId = 0; blkId < 4; blkId++)
        {
            yOffset8x8 = (blkId >> 1) * 8;
            xOffset8x8 = (blkId & 0x1) * 8;
            ...
            blockAvg = fg_compute_block_avg(srcSampleBlk8, strideComp[compCtr], &numSamples,
                                            OVMIN(8, (heightComp[compCtr] - y - yOffset8x8)),
                                            OVMIN(8, (widthComp[compCtr] - x - xOffset8x8)),
                                            bitDepth);
        }
        ...
    }
    ...
}
```
- The code iterates over blocks of `16x16`, inside which `fg_compute_block_avg` is called to compute the average over a block of maximum shape `8x8` (because of the clipping operation) at 4 different locations.
- Since the maximum block shape is `8x8` and we have 8 vector registers, we no longer need for loops: for each vector register, we load 8 values at a time.

Top: Reference version / Bottom: SSE4 x86 version.
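For illustration, here is an intrinsics rendition of that "one register per row" idea on a full 8x8 block (a standalone sketch of mine, assuming a complete block; the actual implementation is hand-written assembly):

```c
#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h> /* SSE2 is enough for this sketch */

int main(void)
{
    int16_t blk[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            blk[i][j] = (int16_t)(i * 8 + j); /* 0..63, total sum = 2016 */

    /* One register per row: no for loop needed for an 8x8 block. */
    __m128i r0 = _mm_loadu_si128((const __m128i *)blk[0]);
    __m128i r1 = _mm_loadu_si128((const __m128i *)blk[1]);
    __m128i r2 = _mm_loadu_si128((const __m128i *)blk[2]);
    __m128i r3 = _mm_loadu_si128((const __m128i *)blk[3]);
    __m128i r4 = _mm_loadu_si128((const __m128i *)blk[4]);
    __m128i r5 = _mm_loadu_si128((const __m128i *)blk[5]);
    __m128i r6 = _mm_loadu_si128((const __m128i *)blk[6]);
    __m128i r7 = _mm_loadu_si128((const __m128i *)blk[7]);

    /* Pairwise vertical adds, then a small scalar reduction of the 8 lanes. */
    __m128i s = _mm_add_epi16(_mm_add_epi16(_mm_add_epi16(r0, r1), _mm_add_epi16(r2, r3)),
                              _mm_add_epi16(_mm_add_epi16(r4, r5), _mm_add_epi16(r6, r7)));
    int16_t lanes[8];
    _mm_storeu_si128((__m128i *)lanes, s);

    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    printf("sum = %d, avg = %d\n", sum, sum / 64); /* sum = 2016, avg = 31 */
    return 0;
}
```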
fg_blend_stripe

Here is the reference code of `fg_blend_stripe()`, together with `ov_clip_uintp2()`:
```c
static inline uint32_t ov_clip_uintp2(int32_t val, uint32_t a)
{
    if (val > 0) {
        int32_t mask = (1 << a) - 1;
        int32_t overflow = !!(val & (~mask));
        return ((-overflow) & mask) | (val & mask);
    } else {
        return 0;
    }
#if 0
    return OVMIN(OVMAX(0, val), (1 << a) - 1);
#endif
}

void fg_blend_stripe(int16_t *dstSampleOffsetY, int16_t *srcSampleOffsetY, int32_t *grainStripe, uint32_t widthComp, uint32_t blockHeight, uint8_t bitDepth)
{
    uint32_t k, l;
    int32_t grainSample;
    for (l = 0; l < blockHeight; l++) /* y direction */
    {
        for (k = 0; k < widthComp; k++) /* x direction */
        {
            grainSample = grainStripe[k + (l * widthComp)];
            grainSample <<= (bitDepth - 8);
            dstSampleOffsetY[k + (l * widthComp)] = (int16_t)ov_clip_uintp2(grainSample + srcSampleOffsetY[k + (l * widthComp)], bitDepth);
        }
    }
    return;
}
```
- The function iterates over a block of size `blockHeight x widthComp`. In fact, the function is called in the program in such a way that both dimensions are guaranteed to be a multiple of 8 (for vectorization purposes).
- Each grain sample is left-shifted by `(bitDepth - 8)`, added to the source sample, clipped, and stored back as an `int16_t`.
- `ov_clip_uintp2()` is called inside the for loop, which means that we have to vectorize it as well.

Here is the SSE4 version:
```c
void fg_blend_stripe_sse4(int16_t *dstSampleOffsetY, int16_t *srcSampleOffsetY, int32_t *grainStripe,
                          uint32_t widthComp, uint32_t blockHeight, uint8_t bitDepth)
{
    uint32_t k, l;
    // Prepare SIMD SSE4 ov_clip_uintp2
    __m128i mask = _mm_set1_epi32((1 << bitDepth));
    __m128i not_mask = _mm_xor_si128(mask, mask);
    not_mask = _mm_sub_epi32(not_mask, mask);
    mask = _mm_sub_epi32(mask, _mm_set1_epi32(1));
    for (l = 0; l < blockHeight; l += 1) /* y direction */
    {
        for (k = 0; k < widthComp; k += 4) /* x direction */
        {
            __m128i grainSample = _mm_loadu_si128((__m128i *)&grainStripe[((l + 0) * widthComp) + k]);
            grainSample = _mm_slli_epi32(grainSample, (bitDepth - 8));
            // Can't use a plain load as srcSampleOffsetY is of type int16_t (it would load 8 values instead of 4)
            __m128i offset = _mm_set_epi32((int32_t)srcSampleOffsetY[k + 3 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 2 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 1 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 0 + ((l + 0) * widthComp)]);
            grainSample = _mm_add_epi32(grainSample, offset);
            // SIMD SSE4 ov_clip_uintp2
            // Set all negative values to 0.
            grainSample = _mm_max_epi32(grainSample, _mm_setzero_si128());
            // int32_t overflow = !!(val & (~mask));
            __m128i overflow = _mm_and_si128(grainSample, not_mask);
            overflow = _mm_min_epi32(overflow, _mm_set1_epi32(1));
            overflow = _mm_sub_epi32(_mm_set1_epi32(0), overflow);
            // ((-overflow) & mask) | (val & mask);
            __m128i lhs = _mm_and_si128(overflow, mask);
            __m128i rhs = _mm_and_si128(grainSample, mask);
            __m128i clipped_val = _mm_or_si128(lhs, rhs);
            int32_t *val = (int32_t *)&clipped_val;
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 0)] = (int16_t)val[0];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 1)] = (int16_t)val[1];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 2)] = (int16_t)val[2];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 3)] = (int16_t)val[3];
        }
    }
    return;
}
```
- We increment `k` by 4 because `grainStripe` is of type `int32_t` (and remember, `xmm` registers are 128 bits wide).
- With SIMD intrinsics we only process one row at a time (think of a single register, `xmm0`). It would make sense to use an "increment by 4" version on `l` if you have more registers at hand to iterate over multiple rows at the same time (cf. the SIMD x86 version).
- `srcSampleOffsetY` is of type `int16_t`, thus we need to upcast it to `int32_t` first, otherwise we would load 8 values instead of 4.
- We can save computation by creating the `mask` only once, before the loops (Lines 5-9).
- The clipped values are finally stored back as `int16_t`.

Top: Reference version / Bottom: SSE4 version.
For the hand-written x86 version, we again have 8 vector registers at hand (`xmm0`-`xmm7`). Thus, we can think of the following algorithm:

- Process blocks of `8 x 16` (8 rows and 16 columns).
- Use register pairs (`xmm0`-`xmm1`, `xmm2`-`xmm3`, `xmm4`-`xmm5`, `xmm6`-`xmm7`) to load 8 `int32_t` values for each row (resulting in a total of 32 values loaded).
- Blend, clip, convert the results back to `int16_t` and write them to the output array.

Since `x86inc.asm` will move arguments to registers, we find ourselves running short of registers when developing. To solve this problem, we can put our variables (`grainSample`, `lhs`, `row`, `col`, `left shift`) on the stack as follows:

Top: Reference version / Bottom: SSE4 x86 version.