- Shoutout to Tsiry Mayet (Maugrim) for taking the time to read my blog post and give meaningful feedback (even though it isn't his field of expertise!).
Film grain application & synthesis
`pp_film_grain.c`.

x86-64

- Write a small function such as `dot_product`, pass it to Godbolt and analyze the generated assembly code. Since it is valid code, you can also step through it with your favorite debugger and try to understand what each line means.
left: C code / middle: x86-64 code / right: SIMD x86-64 code
SIMD intrinsics
- Here are valuable resources to learn SIMD intrinsics:
- Please refer to the x86-64 part for the learning process.
- Intrinsics follow the naming convention `_mm[size]_<action>_<type>`. For example, `_mm_add_epi16` will "add two 128-bit vectors of 16-bit extended packed integers (short type)" (see the sketch below).

SSE is one of the many so-called "SIMD extensions" for x86. There are also the AVX and AVX-512 extensions, for example.
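To make the naming convention above concrete, here is a minimal, self-contained sketch (my own example, not from the decoder) using `_mm_add_epi16`:

```c
#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h> /* SSE2: _mm_add_epi16 */

int main(void)
{
    int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int16_t r[8];

    /* Load two 128-bit vectors of 16-bit integers, add them lane by lane, store the result. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi16(va, vb);
    _mm_storeu_si128((__m128i *)r, vr);

    for (int i = 0; i < 8; i++)
        printf("%d ", r[i]);
    printf("\n"); /* 11 22 33 44 55 66 77 88 */
    return 0;
}
```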
SSE provides 8 vector registers, `xmm0`-`xmm7`. The AMD64 extension from AMD (originally called x86-64) later added 8 more registers, `xmm8`-`xmm15` (this was later ported to the Intel 64 architecture as well).

x86inc.asm
dav1d writes x86-64 SIMD code using an abstraction provided by the `x86inc.asm` file. `x86inc.asm` makes hand-written x86 assembly easier to write by unifying the different syntaxes between Windows and Linux and the different versions of the vectorized instruction sets (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 …).
- Here is an awesome guide to get started.
- I also recommend reading the file itself (`x86inc.asm` is mostly composed of macros) to get a better feel for how things work.
- To understand how `x86inc.asm` is called on the C side, I isolated the code in a gist and expanded the macros using `nasm -E prologue.asm`.
- The declaration syntax in `x86inc.asm` is as follows:

```
cglobal function_name, #args (%1), #regs (%2), #xmm_regs (%3), [stack_size,] (%4) arg_names... (%5-*)
```
For example,

```
cglobal sad_16x16, 4, 7, 5, src, src_stride, dst, dst_stride, src_stride3, dst_stride3, cnt
```

means:

- load the 4 function arguments (`src, src_stride, dst, dst_stride`) into registers;
- use 7 general-purpose registers in total (the 4 arguments plus `src_stride3`, `dst_stride3`, `cnt`);
- use 5 xmm registers;
- the argument-to-register mapping is:
  - `src` => r0 (rdi)
  - `src_stride` => r1 (rsi)
  - `dst` => r2 (rdx)
  - `dst_stride` => r3 (rcx)
  - `src_stride3` => r4 (R8)
  - `dst_stride3` => r5 (R9)
  - `cnt` => r6 (rax)
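For context, calling such a routine from C just means declaring its symbol; here is a hedged sketch of what the prototype might look like (the name and argument types are assumptions, not the project's actual declaration):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prototype for the cglobal routine above; the exact symbol name
 * (prefix, suffix) depends on the project's naming conventions. */
int sad_16x16_sse4(const uint8_t *src, ptrdiff_t src_stride,
                   const uint8_t *dst, ptrdiff_t dst_stride);
```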
Sanity check setup

- Run `diff(1)` between the reference image and your function's output image.
- Run `diff` between the 2 log files to see the differences. Really useful, especially when you are debugging a multi-threaded program. `diff log_ref log_res` outputs the following:
- Set a watchpoint on the faulty value (in my case: `watch ($xmm1.v4_int32[1])==14`).
Breakpoints are set on a function or a line of code, while watchpoints are set on variables.
Once the watchpoint is triggered and the value found, generate a core file with `generate-core-file <number>` each time you step in your debugger. This will let you replay the debugging session as much as you want.
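Another cheap sanity check is a tiny self-test harness. Here is a minimal, self-contained sketch (my own example, not the decoder's actual test setup) that cross-checks two interchangeable implementations of a pixel clip over a range of inputs, as a stand-in for comparing the reference and optimized code paths:

```c
#include <stdint.h>
#include <stdio.h>

/* Naive clip (the "reference"). */
static uint32_t clip_ref(int32_t v, uint32_t bits)
{
    int32_t max = (1 << bits) - 1;
    if (v < 0)   return 0;
    if (v > max) return (uint32_t)max;
    return (uint32_t)v;
}

/* Branchless clip (the "optimized" candidate). */
static uint32_t clip_opt(int32_t v, uint32_t bits)
{
    if (v > 0) {
        int32_t mask = (1 << bits) - 1;
        int32_t overflow = !!(v & (~mask));
        return ((-overflow) & mask) | (v & mask);
    }
    return 0;
}

int main(void)
{
    for (int32_t v = -2048; v < 2048; v++) {
        if (clip_ref(v, 10) != clip_opt(v, 10)) {
            printf("mismatch for v=%d: ref=%u opt=%u\n", v, clip_ref(v, 10), clip_opt(v, 10));
            return 1;
        }
    }
    printf("all values match\n");
    return 0;
}
```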
Benchmark

- I created a struct `BenchWrapper` containing my benchmark function as a function pointer, so that I don't need to change the structure of the existing code in order to bench it:
```c
struct BenchWrapper
{
    size_t nb_fct_call;
    size_t iter_per_fct;
    // Define function arguments here
    int (*func)(const OVSEI *sei, OVFrame **frame);
};

void bench_decorator(struct BenchWrapper bw, char *name, const OVSEI *sei, OVFrame **frame)
{
    volatile int dummy;
    double total_avg_time = 0;
    for (size_t i = 0; i < bw.nb_fct_call; ++i)
    {
        double avg_time = 0;
        for (size_t j = 0; j < bw.iter_per_fct; ++j)
        {
            double t1 = clock();
            volatile const int r = bw.func(sei, frame);
            double t2 = clock();
            dummy = r;
            double dt = (t2 - t1) * 1000 / CLOCKS_PER_SEC; // millisecond
            avg_time += dt;
        }
        avg_time /= (double)bw.iter_per_fct;
        total_avg_time += avg_time;
    }
    total_avg_time /= (double)bw.nb_fct_call;
    printf("%-20s: %f ms (average time / %zu calls of %zu iterations each)\n",
           name,
           total_avg_time,
           bw.nb_fct_call,
           bw.iter_per_fct);
}
```
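For example, wiring it up could look like this (a sketch: `pp_film_grain_apply` stands in for whatever function you want to time and is not necessarily the decoder's real symbol):

```c
/* Hypothetical usage: `pp_film_grain_apply` is an assumed name for the function under test. */
struct BenchWrapper bw = {
    .nb_fct_call  = 10,   /* repeat the whole measurement 10 times          */
    .iter_per_fct = 100,  /* average over 100 iterations per measurement    */
    .func         = pp_film_grain_apply,
};
bench_decorator(bw, "film_grain", sei, &frame);
```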
Hotspot provides several views:

- Summary: shows the top hotspots and displays system information.
- Bottom Up: shows all the functions that call `ov_clip_uintp2`.
- Top Down: shows what `fg_blend_stripe` calls.
- Flame Graph: shows a collection of stack traces/call stacks. To learn about flame graphs, please read Brendan Gregg's awesome post.
- `cycles:u (self)` / `cycles:u (incl.)`: in Top Down, you can see 2 columns:
  - `cycles:u (self)`: total program time spent in the function itself, without taking its callee functions/sub-functions into account.
  - `cycles:u (incl.)`: total program time spent in the function, including its callee functions/sub-functions.
  - In the figure, `cycles:u (incl.)` corresponds to `fg_blend_stripe` as a whole (which includes the `ov_clip_uintp2` calls), while `cycles:u (self)` corresponds to `fg_blend_stripe` only (which excludes the 2 `ov_clip_uintp2` calls).
Thus, it makes sense not to have `cycles:u (incl.)` in Bottom Up, because there we are showing who is calling my function.
- Caller/Callee:
  - caller: cost of calling `fg_blend_stripe`.
  - callee: cost of the sub-functions called by `fg_blend_stripe` and their location in the code (the line `pp_film_grain.c:629` is responsible for 29.6% of the time spent in `fg_blend_stripe` only (`cycles:u (self)`), which is 38.1%).
- Example: the function `fg_blend_stripe` accounts for 21.4% of the program's total time (`cycles:u (incl.)`), of which 16.8% is spent on `ov_clip_uintp2` and 83.2% on the function itself.
Given the numbers reported by Hotspot, how does one interpret a speedup? Let's define the fraction of the total program time spent in our function, both in the reference build and in the optimized build.

- For example, the SSE4 version represents 10% of the total optimized program time, compared to 25% of the total time for the reference version.
- One can see this fraction as a weighting of how much the speedup of our function will be reflected in the whole program.
- This implies diminishing returns when optimizing the same function over and over (especially if it barely accounts for anything in the reference program).
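To make that concrete, here is the usual back-of-the-envelope formula (my notation, not necessarily the one used in the original figures): if a function takes a fraction $p$ of the total program time and we speed that function up by a factor $s$, the whole-program speedup is

$$S_{\text{program}} = \frac{1}{(1 - p) + \frac{p}{s}}$$

With $p = 0.25$ and a 2x speedup of the function ($s = 2$), the program only gets about $\frac{1}{0.75 + 0.125} \approx 1.14\times$ faster, which is why the returns diminish quickly.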
Let's dive into the important part. The function to optimize is `fg_grain_apply_pic()`, which is called by `pp_process_frame()`. The main hot spots in `fg_grain_apply_pic()` are:

- `fg_simulate_grain_blk8x8`
- `fg_compute_block_avg`
- `fg_blend_stripe` (with `ov_clip_uintp2`)
Remark: the function `ov_clip_uintp2()` is called in both `fg_blend_stripe` and `fg_compute_block_avg`, but it is used in a different way: in one case at a granular level, in the other at a vectorized level.
fg_simulate_grain_blk8x8

Here is the reference code of `fg_simulate_grain_blk8x8()`:
```c
void fg_simulate_grain_blk8x8(int32_t *grainStripe, uint32_t grainStripeOffsetBlk8,
                              uint32_t width, uint8_t log2ScaleFactor, int16_t scaleFactor, uint32_t kOffset, uint32_t lOffset, uint8_t h, uint8_t v, uint32_t xSize)
{
    uint32_t k, l;
    uint32_t idx, idx_offset, idx_offset_l, grainStripeOffsetBlk8_l;
    idx_offset = (h * NUM_CUT_OFF_FREQ + v) * DATA_BASE_SIZE * DATA_BASE_SIZE;
    for (l = 0; l < 8; l++) /* y direction */
    {
        idx_offset_l = idx_offset + (l + lOffset) * DATA_BASE_SIZE;
        grainStripeOffsetBlk8_l = grainStripeOffsetBlk8 + (l * width);
        for (k = 0; k < xSize; k++) /* x direction */
        {
            idx = idx_offset_l + (k + kOffset);
            grainStripe[grainStripeOffsetBlk8_l + k] = ((scaleFactor * fg_data_base[idx]) >> (log2ScaleFactor + GRAIN_SCALE));
        }
    }
    return;
}
```
- Given `grainStripe`, we want to fill only a subset of this array (of size 8 x `xSize`) at the offset `grainStripeOffsetBlk8`.
- Let's assume `xSize=8` for convenience. We will fill each cell of `grainStripe` with a `fg_data_base` value (chosen based on some `idx_offset`) that will then be scaled and right-shifted.

Here is the SSE4 version:
```c
void fg_simulate_grain_blk8x8_sse4(int32_t *grainStripe, uint32_t grainStripeOffsetBlk8,
                                   uint32_t width, uint8_t log2ScaleFactor, int16_t scaleFactor, uint32_t kOffset, uint32_t lOffset, uint8_t h, uint8_t v, uint32_t xSize)
{
    uint32_t idx_offset_l1, idx_offset_l2, idx_offset_l3, idx_offset_l4;
    uint32_t grainStripeOffsetBlk8_l1, grainStripeOffsetBlk8_l2, grainStripeOffsetBlk8_l3, grainStripeOffsetBlk8_l4;
    uint32_t idx_offset = (h * NUM_CUT_OFF_FREQ + v) * DATA_BASE_SIZE * DATA_BASE_SIZE;
    __m128i scale = _mm_set_epi32(scaleFactor, scaleFactor, scaleFactor, scaleFactor);
    for (uint32_t l = 0; l < 8; l += 4) {
        idx_offset_l1 = idx_offset + (l + lOffset) * DATA_BASE_SIZE;
        idx_offset_l2 = idx_offset + (l + 1 + lOffset) * DATA_BASE_SIZE;
        idx_offset_l3 = idx_offset + (l + 2 + lOffset) * DATA_BASE_SIZE;
        idx_offset_l4 = idx_offset + (l + 3 + lOffset) * DATA_BASE_SIZE;
        grainStripeOffsetBlk8_l1 = grainStripeOffsetBlk8 + (l * width);
        grainStripeOffsetBlk8_l2 = grainStripeOffsetBlk8 + ((l + 1) * width);
        grainStripeOffsetBlk8_l3 = grainStripeOffsetBlk8 + ((l + 2) * width);
        grainStripeOffsetBlk8_l4 = grainStripeOffsetBlk8 + ((l + 3) * width);
        for (uint32_t k = 0; k < xSize; k += 4)
        {
            /* Load 8 int8_t grain values, sign-extend the lower 4 to int32, scale and shift, store 4 int32 results. */
            __m128i fg_data_1 = _mm_loadu_si64(fg_data_base + idx_offset_l1 + (k + kOffset));
            __m128i fg_data_1_lo = _mm_cvtepi8_epi32(fg_data_1);
            __m128i chunk_1 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_1_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l1 + k], chunk_1);

            __m128i fg_data_2 = _mm_loadu_si64(fg_data_base + idx_offset_l2 + (k + kOffset));
            __m128i fg_data_2_lo = _mm_cvtepi8_epi32(fg_data_2);
            __m128i chunk_2 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_2_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l2 + k], chunk_2);

            __m128i fg_data_3 = _mm_loadu_si64(fg_data_base + idx_offset_l3 + (k + kOffset));
            __m128i fg_data_3_lo = _mm_cvtepi8_epi32(fg_data_3);
            __m128i chunk_3 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_3_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l3 + k], chunk_3);

            __m128i fg_data_4 = _mm_loadu_si64(fg_data_base + idx_offset_l4 + (k + kOffset));
            __m128i fg_data_4_lo = _mm_cvtepi8_epi32(fg_data_4);
            __m128i chunk_4 = _mm_srai_epi32(_mm_mullo_epi32(fg_data_4_lo, scale), log2ScaleFactor + GRAIN_SCALE);
            _mm_store_si128((__m128i *)&grainStripe[grainStripeOffsetBlk8_l4 + k], chunk_4);
        }
    }
    return;
}
```
- We process 4 rows of `fg_data_base` at the same time.
- `grainStripe` is of type `int32_t`, so one 128-bit register holds 4 output values. `fg_data_base`, however, is of type `int8_t`: filling a whole register would fetch 16 of these values (128/8). We use `_mm_loadu_si64` instead, which only loads 8 `int8_t` values, as follows:
- Actually, I could have used `_mm_loadu_si32`, but it doesn't work for older versions of gcc (<= 11, see this stackoverflow post).
- The lower 4 `int8_t` values are then sign-extended to 32 bits with `_mm_cvtepi8_epi32` and the result is stored in `fg_data_X_lo`.
Top: Reference version / Bottom: SSE4 version
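As a side note, this load + sign-extend step can be tried in isolation. A minimal standalone sketch (my own example, not decoder code; compile with `-msse4.1`):

```c
#include <stdint.h>
#include <stdio.h>
#include <smmintrin.h> /* SSE4.1: _mm_cvtepi8_epi32 */

int main(void)
{
    int8_t  grain[8] = {-3, 7, -120, 55, 1, 2, 3, 4};
    int32_t out[4];

    __m128i v8  = _mm_loadu_si64(grain);   /* 8 int8_t in the low 64 bits        */
    __m128i v32 = _mm_cvtepi8_epi32(v8);   /* sign-extend the lower 4 to int32_t */
    _mm_storeu_si128((__m128i *)out, v32);

    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* -3 7 -120 55 */
    return 0;
}
```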
fg_compute_block_avg

Here is the reference code of `fg_compute_block_avg()`:
```c
int16_t fg_compute_block_avg(int16_t *dstSampleBlk8, uint32_t widthComp, uint16_t *pNumSamples,
                             uint8_t ySize, uint8_t xSize, uint8_t bitDepth)
{
    uint32_t blockAvg = 0;
    uint16_t numSamples = 0;
    uint8_t k, l;
    for (k = 0; k < ySize; k++)
    {
        for (l = 0; l < xSize; l++)
        {
            blockAvg += dstSampleBlk8[(k * widthComp) + l];
            numSamples++;
        }
    }
    if (numSamples > 0)
    {
        blockAvg /= numSamples;
        blockAvg >>= (bitDepth - 8); /* to handle high bit depths */
    }
    *pNumSamples = numSamples;
    blockAvg = (int16_t)ov_clip_uintp2(blockAvg, 8);
    return blockAvg;
}
```
- `ov_clip_uintp2` is called only after the average is computed (granular level), which means that we don't need to vectorize it.

Here is the SSE4 version:
```c
int16_t fg_compute_block_avg_sse4(int16_t *dstSampleBlk8, uint32_t widthComp, uint16_t *pNumSamples,
                                  uint8_t ySize, uint8_t xSize, uint8_t bitDepth)
{
    uint16_t blockAvg = 0;
    uint16_t numSamples = 0;
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < ySize; i += 1, numSamples += 8)
    {
        __m128i x = _mm_loadu_si128((const __m128i *)&dstSampleBlk8[i * widthComp]);
        acc = _mm_adds_epi16(acc, x);
    }
    if (numSamples > 0)
    {
        acc = _mm_hadd_epi16(acc, acc);
        acc = _mm_hadd_epi16(acc, acc);
        acc = _mm_hadd_epi16(acc, acc);
        blockAvg = _mm_cvtsi128_si32(acc);
        blockAvg /= numSamples;
        blockAvg >>= (bitDepth - 8); /* to handle high bit depths */
    }
    // assert(blockAvg < (1 << 8));
    *pNumSamples = numSamples;
    // blockAvg = (int16_t) OVMIN(OVMAX(0, blockAvg), (1 << 8) - 1 );
    blockAvg = (int16_t)ov_clip_uintp2((uint32_t)blockAvg, 8);
    return blockAvg;
}
```
- In the reference code, the samples are of type `int16_t` but we accumulate on a `uint32_t` (`blockAvg`).
- In the SSE4 version, we accumulate on 16-bit integers (Line 4) in order to fetch more values (8 instead of 4) per `_mm_loadu_si128` call. Doing so forces us to perform a cast to `uint32_t` when calling `ov_clip_uintp2`.
- We call `_mm_hadd_epi16` 3 times in a row to get the final sum. The final sum ends up in every element of the register, but that's not an issue: we just copy the lower 32-bit integer using `_mm_cvtsi128_si32` in order to get only one value.

Top: Reference version / Bottom: SSE4 version.
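The horizontal-sum idiom can likewise be tried in isolation (again a minimal sketch of mine; `_mm_hadd_epi16` needs SSSE3 or higher):

```c
#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h> /* SSSE3: _mm_hadd_epi16 */

int main(void)
{
    int16_t v[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    __m128i acc = _mm_loadu_si128((const __m128i *)v);
    /* Each hadd halves the number of partial sums; after 3 passes every lane holds the total. */
    acc = _mm_hadd_epi16(acc, acc);
    acc = _mm_hadd_epi16(acc, acc);
    acc = _mm_hadd_epi16(acc, acc);

    int sum = (int16_t)_mm_cvtsi128_si32(acc); /* keep only the low 16 bits of the low lane */
    printf("%d\n", sum); /* 36 */
    return 0;
}
```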
For the hand-written x86 version, keep in mind that we only have 8 vector registers (`xmm0`-`xmm7`). Let's first look at how `fg_compute_block_avg` is called:
```c
/* Loop of 16x16 blocks */
for (y = 0; y < heightComp[compCtr]; y += 16)
{
    ...
    for (x = 0; x < widthComp[compCtr]; x += 16)
    {
        ...
        for (blkId = 0; blkId < 4; blkId++)
        {
            yOffset8x8 = (blkId >> 1) * 8;
            xOffset8x8 = (blkId & 0x1) * 8;
            ...
            blockAvg = fg_compute_block_avg(srcSampleBlk8, strideComp[compCtr], &numSamples,
                                            OVMIN(8, (heightComp[compCtr] - y - yOffset8x8)),
                                            OVMIN(8, (widthComp[compCtr] - x - xOffset8x8)),
                                            bitDepth);
        }
        ...
    }
    ...
}
```
- The code iterates over blocks of `16x16`, inside which `fg_compute_block_avg` is called to compute the average over a block of maximum shape `8x8` (because of the clipping operation) at 4 different locations.
- Since the maximum block shape is `8x8` and we have 8 vector registers, we no longer need for loops: for each vector register, we load 8 values at a time.

Top: Reference version / Bottom: SSE4 x86 version.
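For illustration, here is an intrinsics rendition of that "one register per row" idea on a full 8x8 block (a standalone sketch of mine, assuming a complete block; the actual implementation is hand-written assembly):

```c
#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h> /* SSE2 is enough for this sketch */

int main(void)
{
    int16_t blk[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            blk[i][j] = (int16_t)(i * 8 + j); /* 0..63, total sum = 2016 */

    /* One register per row: no for loop needed for an 8x8 block. */
    __m128i r0 = _mm_loadu_si128((const __m128i *)blk[0]);
    __m128i r1 = _mm_loadu_si128((const __m128i *)blk[1]);
    __m128i r2 = _mm_loadu_si128((const __m128i *)blk[2]);
    __m128i r3 = _mm_loadu_si128((const __m128i *)blk[3]);
    __m128i r4 = _mm_loadu_si128((const __m128i *)blk[4]);
    __m128i r5 = _mm_loadu_si128((const __m128i *)blk[5]);
    __m128i r6 = _mm_loadu_si128((const __m128i *)blk[6]);
    __m128i r7 = _mm_loadu_si128((const __m128i *)blk[7]);

    /* Pairwise vertical adds, then a small scalar reduction of the 8 lanes. */
    __m128i s = _mm_add_epi16(_mm_add_epi16(_mm_add_epi16(r0, r1), _mm_add_epi16(r2, r3)),
                              _mm_add_epi16(_mm_add_epi16(r4, r5), _mm_add_epi16(r6, r7)));
    int16_t lanes[8];
    _mm_storeu_si128((__m128i *)lanes, s);

    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    printf("sum = %d, avg = %d\n", sum, sum / 64); /* sum = 2016, avg = 31 */
    return 0;
}
```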
fg_blend_stripe

Here is the reference code of `fg_blend_stripe()`, together with `ov_clip_uintp2()`:
```c
static inline uint32_t ov_clip_uintp2(int32_t val, uint32_t a)
{
    if (val > 0) {
        int32_t mask = (1 << a) - 1;
        int32_t overflow = !!(val & (~mask));
        return ((-overflow) & mask) | (val & mask);
    } else {
        return 0;
    }
#if 0
    return OVMIN(OVMAX(0, val), (1 << a) - 1);
#endif
}

void fg_blend_stripe(int16_t *dstSampleOffsetY, int16_t *srcSampleOffsetY, int32_t *grainStripe, uint32_t widthComp, uint32_t blockHeight, uint8_t bitDepth)
{
    uint32_t k, l;
    int32_t grainSample;
    for (l = 0; l < blockHeight; l++) /* y direction */
    {
        for (k = 0; k < widthComp; k++) /* x direction */
        {
            grainSample = grainStripe[k + (l * widthComp)];
            grainSample <<= (bitDepth - 8);
            dstSampleOffsetY[k + (l * widthComp)] = (int16_t)ov_clip_uintp2(grainSample + srcSampleOffsetY[k + (l * widthComp)], bitDepth);
        }
    }
    return;
}
```
- The function iterates over a block of size `blockHeight x widthComp`. In fact, the function is called in the program in such a way that both dimensions are guaranteed to be a multiple of 8 (for vectorization purposes).
- Each grain sample is left-shifted by `(bitDepth - 8)`, added to the source sample, clipped, and stored back as an `int16_t`.
- `ov_clip_uintp2()` is called inside the for loop, which means that we have to vectorize it as well.

Here is the SSE4 version:
```c
void fg_blend_stripe_sse4(int16_t *dstSampleOffsetY, int16_t *srcSampleOffsetY, int32_t *grainStripe,
                          uint32_t widthComp, uint32_t blockHeight, uint8_t bitDepth)
{
    uint32_t k, l;
    // Prepare SIMD SSE4 ov_clip_uintp2
    __m128i mask = _mm_set1_epi32((1 << bitDepth));
    __m128i not_mask = _mm_xor_si128(mask, mask);
    not_mask = _mm_sub_epi32(not_mask, mask);
    mask = _mm_sub_epi32(mask, _mm_set1_epi32(1));
    for (l = 0; l < blockHeight; l += 1) /* y direction */
    {
        for (k = 0; k < widthComp; k += 4) /* x direction */
        {
            __m128i grainSample = _mm_loadu_si128((__m128i *)&grainStripe[((l + 0) * widthComp) + k]);
            grainSample = _mm_slli_epi32(grainSample, (bitDepth - 8));
            // Can't use a plain load as srcSampleOffsetY is of type int16_t (it would load 8 values instead of 4)
            __m128i offset = _mm_set_epi32((int32_t)srcSampleOffsetY[k + 3 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 2 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 1 + ((l + 0) * widthComp)],
                                           (int32_t)srcSampleOffsetY[k + 0 + ((l + 0) * widthComp)]);
            grainSample = _mm_add_epi32(grainSample, offset);
            // SIMD SSE4 ov_clip_uintp2
            // Set all negative values to 0.
            grainSample = _mm_max_epi32(grainSample, _mm_setzero_si128());
            // int32_t overflow = !!(val & (~mask));
            __m128i overflow = _mm_and_si128(grainSample, not_mask);
            overflow = _mm_min_epi32(overflow, _mm_set1_epi32(1));
            overflow = _mm_sub_epi32(_mm_set1_epi32(0), overflow);
            // ((-overflow) & mask) | (val & mask);
            __m128i lhs = _mm_and_si128(overflow, mask);
            __m128i rhs = _mm_and_si128(grainSample, mask);
            __m128i clipped_val = _mm_or_si128(lhs, rhs);
            int32_t *val = (int32_t *)&clipped_val;
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 0)] = (int16_t)val[0];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 1)] = (int16_t)val[1];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 2)] = (int16_t)val[2];
            dstSampleOffsetY[((l + 0) * widthComp) + (k + 3)] = (int16_t)val[3];
        }
    }
    return;
}
```
- We increment `k` by 4 because `grainStripe` is of type `int32_t` (and remember, `xmm` registers are 128 bits wide).
- With SIMD intrinsics we only process one row at a time (think of a single register, `xmm0`). It would make sense to use an "increment by 4" version on `l` if you have more registers at hand to iterate over multiple rows at the same time (cf. the SIMD x86 version).
- `srcSampleOffsetY` is of type `int16_t`, thus we need to upcast it to `int32_t` first, otherwise we would load 8 values instead of 4.
- We can save computation by creating the `mask` only once, before the loops (Lines 5-9).
- The clipped values are finally stored back as `int16_t`.

Top: Reference version / Bottom: SSE4 version.
For the hand-written x86 version, we again have 8 vector registers at hand (`xmm0`-`xmm7`). Thus, we can think of the following algorithm:

- Process blocks of `8 x 16` (8 rows and 16 columns).
- Use register pairs (`xmm0`-`xmm1`, `xmm2`-`xmm3`, `xmm4`-`xmm5`, `xmm6`-`xmm7`) to load 8 `int32_t` values for each row (resulting in a total of 32 values loaded).
- Blend, clip, convert the results back to `int16_t` and write them to the output array.

Since `x86inc.asm` will move arguments to registers, we find ourselves running short of registers when developing. To solve this problem, we can put our variables (`grainSample`, `lhs`, `row`, `col`, `left shift`) on the stack as follows:

Top: Reference version / Bottom: SSE4 x86 version.