
Google Summer Of Code 2022 experience at VideoLAN

  • Shoutout to Tsiry Mayet (Maugrim) for taking the time to read my blog post and give meaningful feedback (even though it isn't his field of expertise!)

I/ Learning the materials

Film grain application & synthesis

  • My work focuses on Versatile Video Coding (VVC), a next-generation video compression standard, for which I wrote SIMD code in C (using intrinsics) and in x86-64 assembly for optimization purposes.
  • More precisely, I worked on the application and the synthesis of film grain.
  • To better understand what it means, let's briefly explain the process of a video codec.
    • A video codec is composed of two parts: an encoder and a decoder.
    • Compressing a video file involves passing the input video to the encoder, which changes the representation of your video so that the decoder can reconstruct it at a lower storage cost (which usually reduces the file size while trying to maintain the best quality).
    • If one wants to compress a video with film grain (like old school tv style) such as the following:
      [Image: a sample frame with visible film grain]

      Then, recent video codecs (e.g. AV1, VVC) will automatically remove the grain, encode the video, and generate the grain back at the decoding stage to save bandwidth. As an example, Netflix saves around 30% of the data during transmission by doing so.
  • Film grain accounts for a small part of a video codec. Therefore, it was easier for me to get started on the project; otherwise, it would have been almost impossible without proper video coding expertise. All the logic happens in the file pp_film_grain.c.

x86-64

SIMD intrinsics

  • Part of my work consisted of writing code using SIMD intrinsics.
  • Let's briefly explain what SIMD intrinsics are:
    • SIMD stands for Single Instruction, Multiple Data. Instead of working with a single scalar value, you work on multiple data at the same time and perform the same operation on them in parallel, yielding a roughly proportional increase in performance. The most low-level way to use SIMD is to use the assembly vector instructions directly.
    • Instead, we will use intrinsics, which map to these assembly vector instructions in a C-style way and are available in modern C/C++ compilers.
    • Most SIMD intrinsics follow a naming convention similar to _mm[size]_<action>_<type>.

      For example, _mm_add_epi16 will "add two 128-bit vectors of 16-bit extended packed integers (short type)" (see the short sketch at the end of this section).

  • Your CPU is capable of doing so because it supports an extension to the x86 instruction set called "SSE" (Streaming SIMD Extensions).

    SSE is one of the many so-called “SIMD extensions” for x86. There are also the AVX and AVX-512 extensions, for example.

  • SSE originally added 8 new 128-bit registers known as xmm0-xmm7. The AMD64 extension from AMD (originally called x86-64) later added 8 more registers, xmm8-xmm15 (this was adopted by the Intel 64 architecture as well).

  • Since registers are 128 bits wide, you can load/fetch either:
    • 1 value of size 128-bit.
    • 2 values of size 64-bit.
    • 4 values of size 32-bit.
    • 8 values of size 16-bit.
    • 16 values of size 8-bit.
  • An important thing to remember when using SIMD intrinsics is that you are limited to only 1 vector register (xmm0).
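
  • To make the naming convention and register widths above concrete, here is a minimal standalone sketch (a toy example of my own, not OpenVVC code) that loads 8 values of 16 bits into 128-bit vectors, adds them element-wise with _mm_add_epi16, and stores the result. Compile with SSE2 enabled (e.g. gcc -msse2):

    #include <stdio.h>
    #include <stdint.h>
    #include <emmintrin.h> /* SSE2 intrinsics */

    int main(void)
    {
        int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        int16_t c[8];

        __m128i va = _mm_loadu_si128((const __m128i *)a); /* 8 x 16-bit values in one 128-bit register */
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_add_epi16(va, vb);               /* 8 additions performed in parallel */
        _mm_storeu_si128((__m128i *)c, vc);

        for (int i = 0; i < 8; i++)
            printf("%d ", c[i]);                          /* 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }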

II/ Workflow setup

x86inc.asm

  • Now that I have learnt enough, we can start to get our hands dirty. My mentors advised me to read the dav1d code to familiarize myself with what real assembly development looks like in a piece of software.
  • The reflex to have here, in my opinion, is to see if there is a test suite and then go through it with your debugger to understand the flow of the program and how things are structured. I found using a pen and paper to track down which functions were called quite useful.
  • dav1d writes x86-64 SIMD code by using an abstraction provided by the x86inc.asm file.
    • x86inc.asm makes it easier for the developer to write hand-written assembly on x86 by unifying the different syntaxes between x86 on Windows/Linux and the different versions of vectorized instruction sets (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2).
      • Here is an awesome guide to get started.
  • To be honest, I find it useful to expand the macros by hand (as x86inc.asm is mostly composed of macros) to get a better feel for how things work.
    • For example, to understand how a function declared with x86inc.asm is called in the C side, I isolated the code in a gist and expanded the macros using nasm -E prologue.asm.
    • From this, we can deduce that the syntax to declare a function in x86inc.asm is as follows: cglobal function_name, #args (%1), #regs (%2), #xmm_regs (%3), [stack_size,] (%4) arg_names... (%5-*)
    • Thus, cglobal sad_16x16, 4, 7, 5, src, src_stride, dst, dst_stride, src_stride3, dst_stride3, cnt means:
      • automatically loads 4 arguments (src, src_stride, dst, dst_stride) into registers
      • uses 3 additional registers (src_stride3, dst_stride3, cnt)
      • uses 5 vector registers (m0-m4)
      • performs no allocation of stack space.
    • Here is the mapping of variables to registers:
      • src => r0 (rdi)
      • src_stride => r1 (rsi)
      • dst => r2 (rdx)
      • dst_stride => r3 (rcx)
      • src_stride3 => r4 (r8)
      • dst_stride3 => r5 (r9)
      • cnt => r6 (rax)

Sanity check setup

  • All my work was done on the OpenVVC repository. Therefore, after making sense of the part I had to work on, I decided to set up some sanity checks to make sure that my feature works as intended.
  • Since the OpenVVC decoder outputs an image at the end, one can think of performing a diff(1) between the reference image and your function output image.
  • Even though this method is useful to make sure the function works properly, it gives almost no useful information to the developer in case of a bug.
  • I don't use a small custom dummy image as input for debugging because I don't know whether or not my custom image will activate the flags to generate film grain! That's why I decided to stick with the sample that my mentor provided me.

  • That's where logging comes into play. Since I am dealing with images (matrices), we can print each pixel value into a log file. This way, you can diff the two log files to see the differences. Really useful, especially when you are debugging a multi-threaded program.
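  • As a rough idea of what I mean, here is a minimal sketch of such a pixel logger (names and signature are illustrative, not the actual OpenVVC code):

    #include <stdio.h>
    #include <stdint.h>

    /* Dump every pixel of a plane into a text file, one row per line,
     * so that two runs can be compared with diff(1). */
    static void dump_plane(const char *path, const int16_t *plane,
                           int width, int height, int stride)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++)
                fprintf(f, "%d ", plane[y * stride + x]);
            fprintf(f, "\n");
        }
        fclose(f);
    }

    /* Usage: dump_plane("log_ref", ref_luma, w, h, stride);
     *        dump_plane("log_res", out_luma, w, h, stride);
     *        then: diff log_ref log_res                      */
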
  • Debugging a multi-threaded program can be a pain in the ass. Here's a protocol I've put together to help you get started:
    • Suppose your diff log_ref log_res output the following:
      [Image: example output of diff log_ref log_res]
    • Using your debugger, place a watchpoint (in my case: watch ($xmm1.v4_int32[1])==14).

      Breakpoints are set on functions or lines of code, while watchpoints are set on variables.

    • Once the watchpoint is triggered and the value found, generate a core file with generate-core-file <number> each time you step in your debugger. This will let you replay the debugging session as many times as you want.
    • Replay the debugging session and compare with the output of reference code line by line.

Benchmark

Setup


Theory

  • Now that we are more familiar with Hotspot, how does one interpret a speedup? Let's define the following:
    • $T_{ref}$: total reference program time.
    • $T_{opt}$: total optimized program time.
    • $P_{ref}$: the percentage of the total time the function accounts for in the reference program.
    • $P_{opt}$: the percentage of the total time the function accounts for in the optimized program.
      • For example, the SSE4 version represents 10% ($P_{opt}$) of the total optimized program time ($T_{opt}$), compared to 25% ($P_{ref}$) of the total time for the reference version ($T_{ref}$).
  • The global speedup of the program can be computed as follows:
    $$\text{speedup}_{global} = \frac{\text{total time in ref program}}{\text{total time in opt program}} = \frac{T_{ref}}{T_{opt}}$$
  • However, how much the function accounts in the total program can influence how our speedup in the function is perceived by the program. Indeed, a speedup of x10 in our function will not necessarily translate into a speedup of x10 in the overall program.
  • This is why computing the speedup of the function as follows can be useful:
    $$\text{speedup}_{fct} = \frac{(\text{total time in ref program}) \times (\text{fct \% in ref program})}{(\text{total time in opt program}) \times (\text{SIMD fct \% in opt program})} = \frac{T_{ref} \cdot P_{ref}}{T_{opt} \cdot P_{opt}}$$
  • This way, we can see that:
    • The higher $P_{ref}$ is, the higher my speedup will be. In other words, if my function accounts for 90% of my program, then the speedup of my function will be significant.

      $\frac{T_{ref}}{T_{opt}} \cdot \frac{90}{10} = \frac{T_{ref}}{T_{opt}} \cdot 9$

    • The smaller $P_{ref}$ is, the smaller my speedup will be. In other words, if my function accounts for 11% of my program, then the speedup of my function will be insignificant.

      $\frac{T_{ref}}{T_{opt}} \cdot \frac{11}{10} = \frac{T_{ref}}{T_{opt}} \cdot 1.1$

    One can see $P_{ref}$ as a weighting of how much the speedup of our function will be reflected in the program.

  • From this, we can deduce that even if our function has a x10 speedup, if it accounts for almost nothing in the program, then the speedup will not be noticeable in the program.
    • This implies diminishing returns in optimizing the same function over and over (especially if it accounts for very little in the reference program).
  • However, this is not the case anymore when all other parts of the program are as well optimized because our function will now account for more in our program.
  • The more we optimize other parts, the more our already optimized function will account in the whole program, and the more its speedup will be reflected in the whole program.
  • This is best understood with the following example:
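    (Numbers below are purely illustrative and chosen to keep the arithmetic simple, not measured values.) Suppose $T_{ref} = 100\,s$ and our function takes $25\,s$, i.e. $P_{ref} = 25\%$. A SIMD version that is 5 times faster brings the function down to $5\,s$, so $T_{opt} = 80\,s$ and $P_{opt} = 5/80 = 6.25\%$. Then:

    $$\text{speedup}_{fct} = \frac{100 \times 0.25}{80 \times 0.0625} = \frac{25}{5} = 5 \qquad \text{whereas} \qquad \text{speedup}_{global} = \frac{100}{80} = 1.25$$

    A x5 speedup in the function only yields a x1.25 speedup in the program. Now, if the remaining $75\,s$ of the program are later optimized down to $25\,s$, the whole program takes $30\,s$ and our already optimized function accounts for $5/30 \approx 17\%$ of the runtime instead of $6.25\%$: its speedup is now reflected much more in the overall program.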

III/ Technical coding details

  • Let's dive into the important part. The function to optimize is fg_grain_apply_pic() which is called by pp_process_frame(). Main hot spots in fg_grain_apply_pic() are:

    [Image: Hotspot profile showing the main hot spots in fg_grain_apply_pic()]

    Remark: the function ov_clip_uintp2() is called in both fg_blend_stripe and fg_compute_block_avg, but in different ways: in one case it is applied once on the final scalar result, in the other it is applied inside the loop that we want to vectorize.


fg_simulate_grain_blk8x8

  • Here is the reference version of fg_simulate_grain_blk8x8().
    void fg_simulate_grain_blk8x8(int32_t *grainStripe, uint32_t grainStripeOffsetBlk8,
                                  uint32_t width, uint8_t log2ScaleFactor, int16_t scaleFactor,
                                  uint32_t kOffset, uint32_t lOffset, uint8_t h, uint8_t v, uint32_t xSize)
    {
        uint32_t k, l;
        uint32_t idx, idx_offset, idx_offset_l, grainStripeOffsetBlk8_l;
        idx_offset = (h * NUM_CUT_OFF_FREQ + v) * DATA_BASE_SIZE * DATA_BASE_SIZE;
        for (l = 0; l < 8; l++) /* y direction */
        {
            idx_offset_l = idx_offset + (l + lOffset) * DATA_BASE_SIZE;
            grainStripeOffsetBlk8_l = grainStripeOffsetBlk8 + (l * width);
            for (k = 0; k < xSize; k++) /* x direction */
            {
                idx = idx_offset_l + (k + kOffset);
                grainStripe[grainStripeOffsetBlk8_l + k] = ((scaleFactor * fg_data_base[idx]) >> (log2ScaleFactor + GRAIN_SCALE));
            }
        }
        return;
    }
    • Given an array grainStripe, we want to fill only a subset of this array (of size 8 x xSize) at the offset grainStripeOffsetBlk8.
    • Let's take xSize = 8 for convenience. We will fill each cell of grainStripe with a fg_data_base value (chosen based on some idx_offset) that is then scaled and right shifted.
    • Here is an illustration to make things clearer:
      [Illustration: filling the 8 x xSize sub-block of grainStripe from fg_data_base]

SIMD C version

Algorithm

void fg_simulate_grain_blk8x8_sse4(int32_t *grainStripe, uint32_t grainStripeOffsetBlk8,
                                   uint32_t width, uint8_t log2ScaleFactor, int16_t scaleFactor,
                                   uint32_t kOffset, uint32_t lOffset, uint8_t h, uint8_t v, uint32_t xSize)
{
    uint32_t idx_offset_l1, idx_offset_l2, idx_offset_l3, idx_offset_l4;
    uint32_t grainStripeOffsetBlk8_l1, grainStripeOffsetBlk8_l2, grainStripeOffsetBlk8_l3, grainStripeOffsetBlk8_l4;
    uint32_t idx_offset = (h * NUM_CUT_OFF_FREQ + v) * DATA_BASE_SIZE * DATA_BASE_SIZE;
    __m128i scale = _mm_set_epi32(scaleFactor, scaleFactor, scaleFactor, scaleFactor);
    for (uint32_t l = 0; l < 8; l += 4) {
        idx_offset_l1 = idx_offset + (l + lOffset) * DATA_BASE_SIZE;
        idx_offset_l2 = idx_offset + (l + 1 + lOffset) * DATA_BASE_SIZE;
        idx_offset_l3 = idx_offset + (l + 2 + lOffset) * DATA_BASE_SIZE;
        idx_offset_l4 = idx_offset + (l + 3 + lOffset) * DATA_BASE_SIZE;
        grainStripeOffsetBlk8_l1 = grainStripeOffsetBlk8 + (l * width);
        grainStripeOffsetBlk8_l2 = grainStripeOffsetBlk8 + ((l + 1) * width);
        grainStripeOffsetBlk8_l3 = grainStripeOffsetBlk8 + ((l + 2) * width);
        grainStripeOffsetBlk8_l4 = grainStripeOffsetBlk8 + ((l + 3) * width);
        for (uint32_t k = 0; k < xSize; k += 4)
        {
            /* For each of the 4 rows: load 8 bytes of grain data, sign-extend the
             * low 4 bytes to 32 bits, scale, shift, and store 4 int32_t results. */
            __m128i fg_data_1 = _mm_loadu_si64(fg_data_base + idx_offset_l1 + (k + kOffset));
            __m128i fg_data_1_lo = _mm_cvtepi8_epi32(fg_data_1);
            __m128i chunk_1 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_1_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l1 + k], chunk_1);
            __m128i fg_data_2 = _mm_loadu_si64(fg_data_base + idx_offset_l2 + (k + kOffset));
            __m128i fg_data_2_lo = _mm_cvtepi8_epi32(fg_data_2);
            __m128i chunk_2 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_2_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l2 + k], chunk_2);
            __m128i fg_data_3 = _mm_loadu_si64(fg_data_base + idx_offset_l3 + (k + kOffset));
            __m128i fg_data_3_lo = _mm_cvtepi8_epi32(fg_data_3);
            __m128i chunk_3 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_3_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l3 + k], chunk_3);
            __m128i fg_data_4 = _mm_loadu_si64(fg_data_base + idx_offset_l4 + (k + kOffset));
            __m128i fg_data_4_lo = _mm_cvtepi8_epi32(fg_data_4);
            __m128i chunk_4 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_4_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l4 + k], chunk_4);
        }
    }
    return;
}

[Illustration: the SSE4 version of fg_simulate_grain_blk8x8]

Benchmark

Top: Reference version / Bottom: SSE4 version

  • When comparing the performance between the reference function and the SSE4 function, we can see that our optimized version now accounts for 11.2% of the total program time compared to 25.5% before.

fg_compute_block_avg

  • Here is the reference version of fg_compute_block_avg().
    int16_t fg_compute_block_avg(int16_t *dstSampleBlk8, uint32_t widthComp, uint16_t *pNumSamples,
                                 uint8_t ySize, uint8_t xSize, uint8_t bitDepth)
    {
        uint32_t blockAvg = 0;
        uint16_t numSamples = 0;
        uint8_t k, l;
        for (k = 0; k < ySize; k++)
        {
            for (l = 0; l < xSize; l++)
            {
                blockAvg += dstSampleBlk8[(k * widthComp) + l];
                numSamples++;
            }
        }
        if (numSamples > 0)
        {
            blockAvg /= numSamples;
            blockAvg >>= (bitDepth - 8); /* to handle high bit depths */
        }
        *pNumSamples = numSamples;
        blockAvg = (int16_t) ov_clip_uintp2(blockAvg, 8);
        return blockAvg;
    }
    • It simply computes the average of the values inside a block, nothing fancy here. Notice that ov_clip_uintp2 is called only after the average is computed (at a scalar level), which means that we don't need to vectorize it.

SIMD C version

Algorithm

int16_t fg_compute_block_avg_sse4(int16_t *dstSampleBlk8, uint32_t widthComp, uint16_t *pNumSamples,
                                  uint8_t ySize, uint8_t xSize, uint8_t bitDepth)
{
    uint16_t blockAvg = 0;
    uint16_t numSamples = 0;
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < ySize; i += 1, numSamples += 8)
    {
        /* Load 8 int16_t samples (one row of the 8x8 block) and accumulate. */
        __m128i x = _mm_loadu_si128((const __m128i *)&dstSampleBlk8[i * widthComp]);
        acc = _mm_adds_epi16(acc, x);
    }
    if (numSamples > 0)
    {
        /* Horizontal sum: three hadd passes leave the total in every lane. */
        acc = _mm_hadd_epi16(acc, acc);
        acc = _mm_hadd_epi16(acc, acc);
        acc = _mm_hadd_epi16(acc, acc);
        blockAvg = _mm_cvtsi128_si32(acc);
        blockAvg /= numSamples;
        blockAvg >>= (bitDepth - 8); /* to handle high bit depths */
    }
    // assert(blockAvg < (1 << 8));
    *pNumSamples = numSamples;
    // blockAvg = (int16_t) OVMIN(OVMAX(0, blockAvg), (1 << 8) - 1);
    blockAvg = (int16_t) ov_clip_uintp2((uint32_t)blockAvg, 8);
    return blockAvg;
}

  • In the reference version, the function returns an int16_t but the accumulation is done in a uint32_t blockAvg.
  • Thus, I decided to accumulate on 16-bit lanes (the __m128i acc accumulator) so that each _mm_loadu_si128 call fetches more values (8 instead of 4). Doing so forces us to perform a cast to uint32_t when calling ov_clip_uintp2.
  • When performing the horizontal sum, we call _mm_hadd_epi16 3 times in a row to get the final sum. The final sum ends up in every element of the register, but that's not an issue: we just copy the lower 32-bit integer using _mm_cvtsi128_si32 to get a single value (see the small demo below).
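  • To see why three passes are enough, here is a tiny standalone demo of the reduction (a toy example of my own; compile with SSSE3/SSE4.1 enabled, e.g. gcc -msse4.1):

    #include <stdio.h>
    #include <stdint.h>
    #include <smmintrin.h> /* SSE4.1 (also provides _mm_hadd_epi16 from SSSE3) */

    int main(void)
    {
        /* 8 lanes of 16-bit values: 1..8, total = 36. */
        __m128i acc = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
        acc = _mm_hadd_epi16(acc, acc); /* [3, 7, 11, 15, 3, 7, 11, 15] */
        acc = _mm_hadd_epi16(acc, acc); /* [10, 26, 10, 26, ...]        */
        acc = _mm_hadd_epi16(acc, acc); /* [36, 36, ...]                */
        /* The low 16 bits of the extracted 32-bit value hold the total. */
        printf("%d\n", (int16_t)_mm_cvtsi128_si32(acc)); /* prints 36 */
        return 0;
    }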

Benchmark

Top: Reference version / Bottom: SSE4 version.

  • When comparing the performance between the reference function and the SSE4 function, we can see that our optimized version now accounts for 7.48% of the total program time compared to 15% before.

SIMD x86 version

Algorithm

  • For the x86 version, we have access to far more registers now (we will use 8 registers xmm0-7).
  • Before delving into the algorithm, we have to talk about the context in which the function is called:
    /* Loop of 16x16 blocks */
    for (y = 0; y < heightComp[compCtr]; y += 16)
    {
        ...
        for (x = 0; x < widthComp[compCtr]; x += 16)
        {
            ...
            for (blkId = 0; blkId < 4; blkId++)
            {
                yOffset8x8 = (blkId >> 1) * 8;
                xOffset8x8 = (blkId & 0x1) * 8;
                ...
                blockAvg = fg_compute_block_avg(srcSampleBlk8, strideComp[compCtr], &numSamples,
                                                OVMIN(8, (heightComp[compCtr] - y - yOffset8x8)),
                                                OVMIN(8, (widthComp[compCtr] - x - xOffset8x8)),
                                                bitDepth);
            }
            ...
        }
        ...
    }
    • We are looping over 16x16 blocks, inside which fg_compute_block_avg is called to compute the average on a block of maximum shape 8x8 (because of the OVMIN clipping) at 4 different locations:
      [blkId = 0]: (yOffset8x8, xOffset8x8) = (0, 0)
      [blkId = 1]: (yOffset8x8, xOffset8x8) = (0, 8)
      [blkId = 2]: (yOffset8x8, xOffset8x8) = (8, 0)
      [blkId = 3]: (yOffset8x8, xOffset8x8) = (8, 8)
      
      [Illustration: the four 8x8 sub-blocks inside a 16x16 block]

  • Since the block average is computed over an 8x8 block and we have 8 vector registers, we no longer need the for loops: each vector register loads 8 values (one row) at a time.

Benchmark

Top: Reference version / Bottom: SSE4 x86 version.

  • When comparing the performance between the reference function and the SSE4 function, we can see that our optimized version now accounts for 4.8% of the total program time compared to 16.5% before.
  • It is even faster than the SIMD intrinsic version, which accounted for 7.48% of the total program time.

fg_blend_stripe

  • Here is the reference version of fg_blend_stripe() with ov_clip_uintp2().
    static inline uint32_t ov_clip_uintp2(int32_t val, uint32_t a)
    {
        if (val > 0) {
            int32_t mask = (1 << a) - 1;
            int32_t overflow = !!(val & (~mask));
            return ((-overflow) & mask) | (val & mask);
        } else {
            return 0;
        }
    #if 0
        return OVMIN(OVMAX(0, val), (1 << a) - 1);
    #endif
    }

    void fg_blend_stripe(int16_t *dstSampleOffsetY, int16_t *srcSampleOffsetY, int32_t *grainStripe,
                         uint32_t widthComp, uint32_t blockHeight, uint8_t bitDepth)
    {
        uint32_t k, l;
        int32_t grainSample;
        for (l = 0; l < blockHeight; l++) /* y direction */
        {
            for (k = 0; k < widthComp; k++) /* x direction */
            {
                grainSample = grainStripe[k + (l * widthComp)];
                grainSample <<= (bitDepth - 8);
                dstSampleOffsetY[k + (l * widthComp)] = (int16_t) ov_clip_uintp2(grainSample + srcSampleOffsetY[k + (l * widthComp)], bitDepth);
            }
        }
        return;
    }
    • From reading the code, we can deduce the following:
      • We are going through each pixel of the matrix of size blockHeight x widthComp.

      In fact, the function is called in the program in such a way that both dimensions are guaranteed to be multiples of 8 (for vectorization purposes).

      • For each pixel, we:
        • shift the value left by (bitDepth - 8).
        • add an offset (the corresponding source sample) to the value.
        • perform a clipping operation on the value.
        • cast it down to int16_t.
        • write it to the output array.
    • Notice that this time, ov_clip_uintp2() is called inside the for loop, which means that we have to vectorize it as well.

SIMD C version

Algorithm

void fg_blend_stripe_sse4(int16_t *dstSampleOffsetY, int16_t *srcSampleOffsetY, int32_t *grainStripe,
                          uint32_t widthComp, uint32_t blockHeight, uint8_t bitDepth)
{
    uint32_t k, l;
    // Prepare SIMD SSE4 ov_clip_uintp2
    __m128i mask = _mm_set1_epi32((1 << bitDepth));
    __m128i not_mask = _mm_xor_si128(mask, mask);
    not_mask = _mm_sub_epi32(not_mask, mask);
    mask = _mm_sub_epi32(mask, _mm_set1_epi32(1));
    for (l = 0; l < blockHeight; l += 1) /* y direction */
    {
        for (k = 0; k < widthComp; k += 4) /* x direction */
        {
            __m128i grainSample = _mm_loadu_si128((__m128i *)&grainStripe[((l + 0) * widthComp) + k]);
            grainSample = _mm_slli_epi32(grainSample, (bitDepth - 8));
            // Can't use a plain load as srcSampleOffsetY is of type int16_t (that would load 8 values instead of 4)
            __m128i offset = _mm_set_epi32((int32_t)srcSampleOffsetY[k + 3 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 2 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 1 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 0 + ((l + 0) * widthComp)]);
            grainSample = _mm_add_epi32(grainSample, offset);
            // SIMD SSE4 ov_clip_uintp2
            // Set to 0 all negative values.
            grainSample = _mm_max_epi32(grainSample, _mm_setzero_si128());
            // int32_t overflow = !!(val & (~mask));
            __m128i overflow = _mm_and_si128(grainSample, not_mask);
            overflow = _mm_min_epi32(overflow, _mm_set1_epi32(1));
            overflow = _mm_sub_epi32(_mm_set1_epi32(0), overflow);
            // ((-overflow) & mask) | (val & mask);
            __m128i lhs = _mm_and_si128(overflow, mask);
            __m128i rhs = _mm_and_si128(grainSample, mask);
            __m128i clipped_val = _mm_or_si128(lhs, rhs);
            int32_t *val = (int32_t *)&clipped_val;
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 0)] = (int16_t)val[0];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 1)] = (int16_t)val[1];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 2)] = (int16_t)val[2];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 3)] = (int16_t)val[3];
        }
    }
    return;
}

  • The algorithm can be understood as follows:
    • The grainStripe load: since the dimensions are guaranteed to be multiples of 8, we can use SSE instructions to process multiple values at the same time (in our case 4 values, because grainStripe is of type int32_t and, remember, xmm registers are 128 bits wide).
    • This is why the inner loop is incremented by 4 (because we load 4 values per iteration).
      • An attentive reader may ask why the outer loop is only incremented by 1 and not 4.
      • I tried both versions and they yield similar performance. Thus, I decided to stick with the "increment by 1" version as it makes the code easier to read.
      • My guess on why it performs the same: you only have 1 register available (a.k.a xmm0) when using SIMD intrinsics. The "increment by 4" version would make sense if you had more registers at hand to iterate over multiple rows at the same time (cf. the SIMD x86 version).
    • The _mm_slli_epi32 call performs the left shift on the 4 values at the same time.
    • The _mm_set_epi32 / _mm_add_epi32 pair adds an offset to our 4 values. Be careful here: srcSampleOffsetY is of type int16_t, so we need to upcast to int32_t first, otherwise we would load 8 values instead of 4.
    • The block from _mm_max_epi32 to _mm_or_si128 performs the clipping on the 4 values (a standalone check of this clip is shown after this list).
      • We save some computation by creating the masks only once, before the loops.
    • The final four stores write the 4 values to the output array while casting them down to int16_t.
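  • As a side note, a small standalone check like the following can be used to verify that the vectorized clip matches the scalar ov_clip_uintp2() (a sketch of my own, not part of OpenVVC; compile with -msse4.1):

    #include <stdio.h>
    #include <stdint.h>
    #include <smmintrin.h> /* SSE4.1 */

    /* Scalar reference, copied from the decoder. */
    static inline uint32_t ov_clip_uintp2(int32_t val, uint32_t a)
    {
        if (val > 0) {
            int32_t mask = (1 << a) - 1;
            int32_t overflow = !!(val & (~mask));
            return ((-overflow) & mask) | (val & mask);
        } else {
            return 0;
        }
    }

    int main(void)
    {
        const uint32_t bitDepth = 10;
        int32_t in[4] = { -5, 42, 1023, 5000 }; /* below range, in range, edge, above range */

        /* Equivalent mask preparation to the one in fg_blend_stripe_sse4. */
        __m128i mask     = _mm_set1_epi32(1 << bitDepth);
        __m128i not_mask = _mm_sub_epi32(_mm_setzero_si128(), mask); /* ~((1 << bitDepth) - 1) */
        mask             = _mm_sub_epi32(mask, _mm_set1_epi32(1));   /* (1 << bitDepth) - 1    */

        __m128i v = _mm_loadu_si128((const __m128i *)in);
        v = _mm_max_epi32(v, _mm_setzero_si128());         /* clamp negatives to 0        */
        __m128i ovf = _mm_and_si128(v, not_mask);          /* any bit above bitDepth set? */
        ovf = _mm_min_epi32(ovf, _mm_set1_epi32(1));       /* overflow -> 0 or 1          */
        ovf = _mm_sub_epi32(_mm_setzero_si128(), ovf);     /* 0 or all-ones               */
        __m128i clipped = _mm_or_si128(_mm_and_si128(ovf, mask),
                                       _mm_and_si128(v, mask));

        int32_t out[4];
        _mm_storeu_si128((__m128i *)out, clipped);
        for (int i = 0; i < 4; i++) /* expected: 0, 42, 1023, 1023 on both sides */
            printf("%6d -> simd %4d, scalar %4u\n", in[i], out[i], ov_clip_uintp2(in[i], bitDepth));
        return 0;
    }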

Benchmark

Top: Reference version / Bottom: SSE4 version.

  • When comparing the performance between the reference function and the SSE4 function, we can see that our optimized version now accounts for 27.5% of the total program time compared to 38% before.

SIMD x86 version

Algorithm

  • For the x86 version, we have access to far more registers now (we will use the 8 registers xmm0-xmm7). Thus, we can think of the following algorithm:
    • Suppose the dimensions are 8 x 16 (8 rows and 16 columns).
    • Then, for the first 4 rows, we use 4 pairs of vector registers at the same time (xmm0-xmm1, xmm2-xmm3, xmm4-xmm5, xmm6-xmm7), one pair per row, to load 8 int32_t values for each row (resulting in a total of 32 values loaded). A rough intrinsics sketch of this per-row pairing is shown after the illustration below.
    • We then proceed to do the same computation as in the previous version (left shift, add offset, clipping, cast down to int16_t, write to the output array).
    • When we are done, we increment the column index by 8 to go to the next columns.
    • Once we are done with all the columns, we add 4 to the row index to go to the next rows.
    • We repeat the process until all rows are visited.
  • Here is an illustration to make things clearer:
[Illustration: processing the first four rows with one pair of xmm registers per row]
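
  • To make the "one pair of registers per row" idea concrete, here is a rough intrinsics sketch of processing a single row of 8 int32_t grain values (the real version is hand-written assembly via x86inc.asm and uses the bitmask clip shown earlier; this sketch uses a min/max clip instead, and its names are illustrative only):

    #include <stdint.h>
    #include <smmintrin.h> /* SSE4.1 */

    void blend_row8_sse4(int16_t *dst, const int16_t *src,
                         const int32_t *grain, uint8_t bitDepth)
    {
        const __m128i zero = _mm_setzero_si128();
        const __m128i max  = _mm_set1_epi32((1 << bitDepth) - 1);

        /* One pair of registers covers 8 int32_t grain samples of the row. */
        __m128i g_lo = _mm_loadu_si128((const __m128i *)(grain + 0));
        __m128i g_hi = _mm_loadu_si128((const __m128i *)(grain + 4));
        g_lo = _mm_slli_epi32(g_lo, bitDepth - 8);
        g_hi = _mm_slli_epi32(g_hi, bitDepth - 8);

        /* Widen the 8 int16_t source samples to two vectors of int32_t and add. */
        __m128i s    = _mm_loadu_si128((const __m128i *)src);
        __m128i s_lo = _mm_cvtepi16_epi32(s);
        __m128i s_hi = _mm_cvtepi16_epi32(_mm_srli_si128(s, 8));
        g_lo = _mm_add_epi32(g_lo, s_lo);
        g_hi = _mm_add_epi32(g_hi, s_hi);

        /* Clip to [0, 2^bitDepth - 1] (min/max variant of ov_clip_uintp2). */
        g_lo = _mm_min_epi32(_mm_max_epi32(g_lo, zero), max);
        g_hi = _mm_min_epi32(_mm_max_epi32(g_hi, zero), max);

        /* Narrow back to 8 int16_t and store the whole row at once. */
        _mm_storeu_si128((__m128i *)dst, _mm_packus_epi32(g_lo, g_hi));
    }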

  • However, there are some important points to talk about:
    1. When working with vectorized instructions, make sure your addresses are 16-byte aligned, otherwise it will segfault any time you try to use an aligned vector instruction!
    2. Since x86inc.asm will move arguments into registers, we find ourselves running short of registers when developing. To solve this problem, we can spill our variables (grainSample, lhs, row, col, left shift) onto the stack as follows:
    • [Image: spilling the variables onto the stack]

Benchmark

Top: Reference version / Bottom: SSE4 x86 version.

  • When comparing the performance between the reference function and the SSE4 function, we can see that our optimized version now accounts for 29.4% of the total program time compared to 37.2% before.
  • However, it seems to perform about the same as the intrinsics version while using far more registers.
  • Due to time constraints, I didn't have time to investigate why they yield similar performance. An educated guess is that my function performs too many write operations.

IV/ Conclusion

  • This summer of code was a great experience: I was able to hone my software engineering skills and I learnt a lot about open source development. It was also a great opportunity to learn x86 assembly (something I would never have thought of doing) and about performance in general.
  • For those who want to be involved in GSoC with VideoLAN, don't be scared! You don't need any prior knowledge of video coding. Plus, they make sure you work on features that are easier to get started with (such as mine).
  • Finally, I also want to thank Alexandre for being here to answer all my questions, Jean-Baptiste for giving me the opportunity to participate in this GSOC edition and Pierre-Loup Cabarat for being my OpenVVC mentor!