## Geometric data augmentations for computer vision and internals
TechShare, 13 April 2022
by Victor
<!-- Put the link to this slide here so people can follow -->
<!-- slide: https://hackmd.io/@xnXPVlYVRZC457vsNOVEFQ/techshare-geom-dataaugs -->
---
## Content
- Data augmentations and Deep Learning
- Geometric data augmentations:
- Affine, resize transformations
- New Transforms API in TorchVision
- `torch.nn.functional.interpolate`:
- Porting implementation using `TensorIterator` and benchmarks
- Anti-aliasing feature
- Bugs influencing Deep Learning
---
### Data augmentations and Deep Learning
_Here, data augmentation == image augmentation_
Random transformations: geometric, color, GAN-based, ...

:+1:
- Effectively more training data
- Better model generalization

Limitations:
- Data augs are domain-dependent
---
### Data augmentations and Deep Learning
- https://github.com/aleju/imgaug
<img width=500 src="https://i.imgur.com/ueaMdZF.png"/>
---
### Data augmentations and Deep Learning
Example augmentations:
- https://github.com/aleju/imgaug#example-images
- https://github.com/albumentations-team/albumentations#a-few-more-examples-of-augmentations
---
### Data augmentations and Deep Learning
Other libraries for computer vision:
- torchvision
- albumentations (OpenCV as backend)
- NVIDIA DALI (GPU-accelerated)
- MONAI (medical imaging)
- ...
---
### :triangular_ruler: Geometric augmentations
- Cropping / Padding
- Flips / Rotation
- **Resizing**
- **Affine** / Perspective / Elastic
- etc
---
### Image resizing
<img style="background: darkgray;" width=400 src="https://i.imgur.com/SV3NHqk.png"/>
```python
from torchvision.transforms.functional import resize, InterpolationMode
out = resize(inpt, (64, 64), InterpolationMode.NEAREST, **kwargs)
```
Parameters:
- output size or scale
- interpolation method
- anti-aliasing and other options
---
### Image resizing
_torchvision_ uses either `PIL.Image.resize` or `torch.nn.functional.interpolate`:
```python
# NEAREST interpolation
ix = round((ox + 0.5) * scale - 0.5)
...
output[oy, ox] = input[iy, ix]
```
```python
# LINEAR interpolation (one dimension shown; y is analogous)
ix1 = floor(f(ox, scale))
ix2 = ix1 + 1
w2 = f(ox, scale) - ix1
w1 = 1.0 - w2
...
output[ox] = w1 * input[ix1] + w2 * input[ix2]
```
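The linear case above can be sketched end-to-end in NumPy. This is a sketch, not torchvision's code: it assumes `f` is the pixel-center mapping `f(ox, scale) = (ox + 0.5) * scale - 0.5` and that border indices are clamped.

```python
import numpy as np

def resize_linear_1d(inp, out_size):
    # Linear resize of a 1D signal; f(ox, scale) = (ox + 0.5) * scale - 0.5
    # is the assumed pixel-center mapping (align_corners=False convention).
    in_size = len(inp)
    scale = in_size / out_size
    out = np.empty(out_size)
    for ox in range(out_size):
        fx = (ox + 0.5) * scale - 0.5
        ix1 = int(np.floor(fx))
        ix2 = ix1 + 1
        w2 = fx - ix1          # fractional part -> weight of the right neighbor
        w1 = 1.0 - w2
        # clamp at the image border (replicates edge pixels)
        ix1 = min(max(ix1, 0), in_size - 1)
        ix2 = min(max(ix2, 0), in_size - 1)
        out[ox] = w1 * inp[ix1] + w2 * inp[ix2]
    return out
```

For such inputs this should agree with `torch.nn.functional.interpolate(..., mode="linear", align_corners=False)` without anti-aliasing.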
---
### Image/BBox/Mask resizing
```python
from torchvision.prototype import features
from torchvision.prototype.transforms.functional import (
    resize_image_tensor, resize_bounding_box, resize_segmentation_mask
)
out_boxes = resize_bounding_box(in_boxes, out_size, in_boxes.image_size)
out_mask = resize_segmentation_mask(in_mask, out_size)
out_image = resize_image_tensor(in_image, out_size)
```
<img style="background: darkgray;" width=450 src="https://i.imgur.com/yxAafHx.png"/>
---
### Image/BBox/Mask resizing
Prototype Transforms API in _torchvision_ by Philip
```python
from torchvision.prototype import transforms
resize_op = transforms.Resize((48, 52))
transformed_data = resize_op(in_image, in_boxes, in_mask)
out_image, out_boxes, out_mask = transformed_data
```
Stay tuned for the official announcement :slightly_smiling_face:
---
### Affine image transformation
<img style="background: darkgray;" width=450 src="https://i.imgur.com/PIPBQyW.png"/>
Affine matrix: `C * R * S * Sh * IC * Tr`
(center, rotation, scale, shear, inverse center, translation)
```
M = [[a, b, t1],
     [c, d, t2],
     [0, 0, 1]]

[nx, ny, 1] = [x, y, 1] @ M.T
```
---
### Affine image transformation
Parameters used to generate the affine matrix:
- rotation angle
- scale
- translations
- shear X/Y
- transformation center
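One plausible reading of the `C * R * S * Sh * IC * Tr` composition, sketched in NumPy. The factor order, sign conventions, and parameter names are assumptions for illustration, not torchvision's private implementation.

```python
import numpy as np

def make_affine_matrix(angle_deg, scale, tx, ty, shear_x_deg, shear_y_deg, cx, cy):
    # Compose C * R * S * Sh * IC * Tr: rotate/scale/shear about the
    # center (cx, cy), then translate by (tx, ty). A sketch, not the
    # torchvision implementation.
    a = np.deg2rad(angle_deg)
    sx, sy = np.deg2rad(shear_x_deg), np.deg2rad(shear_y_deg)
    C = np.array([[1.0, 0, cx], [0, 1, cy], [0, 0, 1]])    # move center back
    IC = np.array([[1.0, 0, -cx], [0, 1, -cy], [0, 0, 1]]) # move center to origin
    R = np.array([[np.cos(a), -np.sin(a), 0],
                  [np.sin(a),  np.cos(a), 0],
                  [0.0, 0, 1]])
    S = np.diag([scale, scale, 1.0])
    Sh = np.array([[1.0, np.tan(sx), 0],
                   [np.tan(sy), 1, 0],
                   [0.0, 0, 1]])
    Tr = np.array([[1.0, 0, tx], [0, 1, ty], [0, 0, 1]])
    return C @ R @ S @ Sh @ IC @ Tr
```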
---
### Affine image transformation
<img style="background: darkgray;" width=400 src="https://i.imgur.com/PoI6Giy.png"/>
For images (nearest interpolation mode):
```python
inv_affine_matrix = inverse(affine_matrix)
for out_y in range(out_image_h):
    for out_x in range(out_image_w):
        output_pt = [out_x + 0.5, out_y + 0.5, 1.0]
        input_pt = inv_affine_matrix @ output_pt
        in_x, in_y, _ = round(input_pt - 0.5)
        if 0 <= in_x < in_image_w and 0 <= in_y < in_image_h:
            out_image[out_y, out_x] = in_image[in_y, in_x]
```
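The same inverse-warp loop as runnable NumPy. A sketch: `fill` is a hypothetical parameter for out-of-bounds output pixels, not part of the loop above.

```python
import numpy as np

def affine_nearest(in_image, affine_matrix, out_h, out_w, fill=0):
    # Inverse-warp a 2D image with nearest interpolation: for each output
    # pixel center, find the source pixel it came from.
    inv = np.linalg.inv(affine_matrix)
    in_h, in_w = in_image.shape
    out = np.full((out_h, out_w), fill, dtype=in_image.dtype)
    for oy in range(out_h):
        for ox in range(out_w):
            # map the output pixel center back into input coordinates
            x, y, _ = inv @ np.array([ox + 0.5, oy + 0.5, 1.0])
            ix, iy = int(round(x - 0.5)), int(round(y - 0.5))
            if 0 <= ix < in_w and 0 <= iy < in_h:
                out[oy, ox] = in_image[iy, ix]
    return out
```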
---
### Affine image transformation
```
bboxes = [[x1, y1, x2, y2], ...] # xyxy format
```
For bounding boxes:
```python
out_boxes = []
for xmin, ymin, xmax, ymax in in_boxes:
    points = [(xmin, ymin, 1), (xmax, ymin, 1), (xmax, ymax, 1), (xmin, ymax, 1)]
    new_points = points @ affine_matrix.T
    out_boxes.append(
        [
            min(new_points[:, 0]), min(new_points[:, 1]),
            max(new_points[:, 0]), max(new_points[:, 1]),
        ]
    )
```
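A runnable NumPy version of the corner-mapping idea: transform the four box corners, then take the min/max over the x and y columns to get the enclosing axis-aligned box.

```python
import numpy as np

def affine_bboxes(in_boxes, affine_matrix):
    # Transform xyxy boxes: map the 4 corners through the affine matrix,
    # then take the enclosing axis-aligned box of the transformed corners.
    out_boxes = []
    for xmin, ymin, xmax, ymax in in_boxes:
        points = np.array([
            [xmin, ymin, 1.0], [xmax, ymin, 1.0],
            [xmax, ymax, 1.0], [xmin, ymax, 1.0],
        ])
        new_points = points @ np.asarray(affine_matrix).T
        out_boxes.append([
            new_points[:, 0].min(), new_points[:, 1].min(),
            new_points[:, 0].max(), new_points[:, 1].max(),
        ])
    return np.array(out_boxes)
```

Note the enclosing box can be larger than the rotated object, which is inherent to axis-aligned boxes, not a bug in the mapping.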
---
### :rocket: Performance matters
- Dataflow runs in parallel to NN computations
- Dataflow can be a bottleneck: loading, decoding, **augs**, etc
---
### :gear: Image resizing combinatorics
Input:
- Dimensions: 3D, 4D, 5D tensors
- Data type: uint8, float32, uint16, ...
- Memory format: channel last, channel first
- Device: cpu, cuda
Parameters:
- Interpolation modes: nearest, bilinear, bicubic
- Anti-aliasing: true / false
---
#### [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/generated/torch.nn.functional.interpolate.html?highlight=interpolate#torch.nn.functional.interpolate)
A single Python function to resize 3D, 4D, and 5D tensors
- mostly supports floating-point dtypes
- CF/CL memory formats, cuda/cpu devices
- modes: nearest, bilinear, bicubic, area, ...

Performance (CPU):
PIL ~ torch interpolate >> OpenCV
---
#### [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/generated/torch.nn.functional.interpolate.html?highlight=interpolate#torch.nn.functional.interpolate)
#### using `TensorIterator` for CPU
- :+1: Removed the separate 1D, 2D, 3D loops
- :+1: Optimized computations
  - ~2x-3x speed-up in most cases
- :+1: Unified the code across modes and dims
- :-1: Still slower in 2D/3D channels-last cases
---
#### How does it work with `TensorIterator`?
- Precompute source indices and weights
  - for each dimension
For example, for bilinear mode:
- output, source (restrided)
- index_x1, index_x2, wx1, wx2 of size owidth
- index_y1, index_y2, wy1, wy2 of size oheight
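A NumPy sketch of that precomputation for one dimension: the index/weight tables have the length of the output dimension and are reused for every row/column. The pixel-center mapping is an assumption consistent with the earlier slides, not the actual kernel code.

```python
import numpy as np

def linear_indices_weights(in_size, out_size):
    # Precompute, for one dimension, the two source indices and two weights
    # each output pixel needs (linear mode, no anti-aliasing).
    scale = in_size / out_size
    ox = np.arange(out_size)
    fx = (ox + 0.5) * scale - 0.5            # assumed pixel-center mapping
    ix1f = np.floor(fx)
    w2 = fx - ix1f                           # fractional part -> right weight
    w1 = 1.0 - w2
    ix1 = np.clip(ix1f, 0, in_size - 1).astype(np.int64)
    ix2 = np.clip(ix1f + 1, 0, in_size - 1).astype(np.int64)
    return ix1, ix2, w1, w2
```

Resizing a row is then a vectorized gather: `out = w1 * row[ix1] + w2 * row[ix2]`.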
---
#### How does it work with `TensorIterator`?
- Use implicit compiler vectorization
  - static assumptions on strides
```c++
// special-cases to let the compiler apply compile-time input-specific optimizations
if ((strides[0] == sizeof(scalar_t) && (strides[1] == 0) &&
     // NOLINTNEXTLINE(bugprone-branch-clone)
     check_almost_all_zero_stride<out_ndims, 1, scalar_t, int64_t, interp_size>(&strides[2]))) {
  // contiguous channels-first case
  basic_loop<scalar_t, int64_t, out_ndims, interp_size>(data, strides, n);
} else if ((strides[0] == sizeof(scalar_t) && (strides[1] == sizeof(scalar_t)) &&
            check_almost_all_zero_stride<out_ndims, -1, scalar_t, int64_t, interp_size>(&strides[2]))) {
  // contiguous channels-last case
  basic_loop<scalar_t, int64_t, out_ndims, interp_size>(data, strides, n);
} else {
  // fallback
  basic_loop<scalar_t, int64_t, out_ndims, interp_size>(data, strides, n);
}
```
---
### Benchmarks
- https://github.com/pytorch/pytorch/pull/51653
- https://github.com/pytorch/pytorch/pull/54500
It was a fun challenge to beat previous benchmarks
---
### Challenges
- Support for more dtypes: uint8, ...
- Improvements for the channels-last memory format
---
### Adding anti-aliasing (AA) option
<img width=300 src="https://raw.githubusercontent.com/GaParmar/clean-fid/main/docs/images/resize_circle_extended.png"/>
<img width=300 src="https://pbs.twimg.com/media/FDVGYBgVIAEsjsg?format=jpg&name=900x900"/>
- https://github.com/GaParmar/clean-fid
- Image scaling attacks
---
### How does AA work?
For example, for bilinear mode with scale=4:
```
i1, i2, w1, w2 -> 9 indices and weights
```
A larger number of source pixels is used to compute each output pixel
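A sketch of where those extra taps come from (not the PyTorch kernel code): with anti-aliasing the linear kernel is stretched by the scale factor, so with scale=4 up to `2*scale + 1 = 9` source pixels can contribute to one output pixel, instead of 2 without AA.

```python
import numpy as np

def aa_linear_weights(in_size, out_size, ox):
    # Contributing source indices + normalized weights for one output pixel,
    # linear (triangle) kernel with anti-aliasing; downsampling, so scale > 1.
    scale = in_size / out_size
    support = 1.0 * scale            # linear kernel support, stretched by scale
    center = (ox + 0.5) * scale      # output pixel center in input coordinates
    lo = max(int(center - support + 0.5), 0)
    hi = min(int(center + support + 0.5), in_size)
    idx = np.arange(lo, hi)
    # triangle kernel evaluated at distances shrunk by the scale factor
    w = np.clip(1.0 - np.abs((idx + 0.5 - center) / scale), 0.0, None)
    return idx, w / w.sum()          # weights normalized to sum to 1
```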
---
#### Implementation using `TensorIterator` (CPU)
Sub-optimal solution:
- Naively precomputing all indices and weights for all dims (e.g. like bicubic)
Better solution:
- Separable resizing: resizing dim by dim
- Using bounds for indices
---
#### Implementation for GPU
Sub-optimal solution:
- local max-size weights allocation
```c++
// for each thread
scalar_t wx[256];
scalar_t wy[256];
```
Better solution:
- use shared memory and compute shared weights for specific blocks
---
#### Implementation for GPU
```c++
extern __shared__ int smem[];
scalar_t* wx = reinterpret_cast<scalar_t*>(smem) + interp_width * threadIdx.x;
scalar_t* wy = reinterpret_cast<scalar_t*>(smem) + interp_width * blockDim.x + interp_height * threadIdx.y;
if (threadIdx.y == 0)
{
// All threadIdx.y have the same wx weights
upsample_antialias::_compute_weights<scalar_t, accscalar_t>(wx, ...);
}
if (threadIdx.x == 0)
{
// All threadIdx.x have the same wy weights
upsample_antialias::_compute_weights<scalar_t, accscalar_t>(wy, ...);
}
```
- It was fun to write CUDA kernels and optimize them
---
#### Benchmarks: downsampling with AA
PIL vs torch CPU vs torch GPU
```
Num threads: 8
[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------]
| Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
channels_first contiguous torch.float32 | 2851.2 | 874.1 | 57.1
channels_last non-contiguous torch.float32 | 2856.1 | 1155.8 | 130.6
Times are in microseconds (us).
Num threads: 8
[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------]
| Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
channels_first contiguous torch.float32 | 4522.4 | 1406.7 | 170.3
channels_last non-contiguous torch.float32 | 4530.0 | 1435.4 | 242.2
Times are in microseconds (us).
```
---
#### Bugs in resize op influencing Deep Learning
- Nearest interpolation mode is broken in OpenCV and PyTorch
  - both introduced a `nearest-exact` mode to fix it
- TF1 resize op and Deep Lab image size `321x513`
https://ppwwyyxx.com/blog/2021/Where-are-Pixels/
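The two index formulas behind the bug, as a minimal sketch that mirrors (rather than calls) the library implementations: legacy `nearest` ignores pixel centers and shifts content toward the origin, while `nearest-exact` maps output pixel centers.

```python
import math

def nearest(dst, scale):
    # legacy, asymmetric: src = floor(dst * scale)
    return math.floor(dst * scale)

def nearest_exact(dst, scale):
    # fixed: src = floor((dst + 0.5) * scale), maps output pixel centers
    return math.floor((dst + 0.5) * scale)

scale = 5 / 2  # resizing 5 pixels down to 2
print([nearest(d, scale) for d in range(2)])        # [0, 2]
print([nearest_exact(d, scale) for d in range(2)])  # [1, 3]
```

Same input, same output size, but the two modes pick entirely different source pixels.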
---
#### Bugs in resize op influencing Deep Learning
Compatible implementations?
```
PyTorch vs OpenCV vs Scikit-Image vs Pillow vs TF
```
- PyTorch <--> OpenCV are compatible (bugs included)
---
### Thank you!
Any questions :question: