uint8 support for images in interpolate()

**Goal**: support uint8 images in interpolate on CPU for the cross-product of:

- *linear* and *cubic* interpolation (*nearest* already supported)
- with or without antialias
- channels_first and channels_last layout

We're writing 2 main implementations: an optimized AVX version inspired by PIL-SIMD, and a fallback that works in more general cases (and when AVX isn't supported).

dev branch: https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev

We'll submit PRs there and then migrate everything to master when done.

**ETA: Dec 16**

Current status
==============

AVX code
--------

- on 2D images, C == 3 only
- channels last only
- N == 1 only
- bilinear only
- antialias=True only
- Can easily extend to:
  - C < 3 (by unpacking differently)
  - channels first (by unpacking differently)
  - bicubic filter. Nearest[exact] is not critical for now as already supported
  - what about antialias=False: can this just be a different way to compute the weights?

Fallback
--------

- 3D - maybe, but we only care about 2D for now
- Supports all C and N values
- Supports channels first or last
- Supports bilinear + bicubic (TODO: test bicubic) -- nearest[exact] already supported
- antialias=False not yet supported

TODO
====

(decreasing priority order)

- 1 Add consistency tests between AVX vs fallback and between uint8 vs float
- 2 (Victor) AVX and fallback: support for antialias=False and dtype=uint8 for bilinear and bicubic.
- 3 ~~(Nicolas) AVX: support N > 1, C <= 4 and other filters~~ Done
- 4 ~~(Nicolas) AVX: support for channels first~~
- 5 ~~(Victor) if possible, merge weight computation between AVX and fallback~~
- 6 clean up AVX code
- 7 clean up fallback code

---- mergeable PR threshold ----

- Fallback: optimized version for channels last or first
- avoid memory copy in AVX version
  - or perhaps copy single rows instead of entire image
- support SSE and / or port to Vec.h
- Dispatch to AVX version later instead of early, i.e. use AVX implementation within the inner loops of TensorIterator.

Done:

- (Nicolas) basic port of PIL-SIMD implementation
- (Victor) Write fallback
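
On the open question of whether antialias=False can just be a different way to compute the weights: the pure-Python sketch below suggests it plausibly can, at least for the linear filter. It loosely follows a PIL-style coefficient precomputation and is not the code in the dev branch. The names `compute_weights`, `resize_row_uint8` and `triangle` are hypothetical, and it assumes pixel centers at `k + 0.5` (the `align_corners=False` convention). The only difference between the two modes here is whether the kernel is stretched by the scale factor when downsampling.

```python
def triangle(x):
    """Linear (bilinear) filter kernel with native support 1."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def compute_weights(in_size, out_size, antialias, kernel=triangle, kernel_support=1.0):
    """For each output index, return (first_input_index, normalized_weights).

    Hypothetical helper: with antialias the kernel is stretched by the
    downscale factor; without it the kernel keeps its native support.
    """
    scale = in_size / out_size
    stretch = max(scale, 1.0) if antialias else 1.0
    support = kernel_support * stretch
    out = []
    for i in range(out_size):
        center = (i + 0.5) * scale  # output pixel center in input coordinates
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        w = [kernel((x + 0.5 - center) / stretch) for x in range(xmin, xmax)]
        total = sum(w)
        # Normalizing the truncated window acts like index clamping at the borders.
        out.append((xmin, [v / total for v in w]))
    return out


def resize_row_uint8(row, out_size, antialias):
    """Resize one row of uint8 values, rounding to nearest and clamping to [0, 255]."""
    result = []
    for xmin, w in compute_weights(len(row), out_size, antialias):
        acc = sum(wk * row[xmin + k] for k, wk in enumerate(w))
        result.append(min(255, max(0, int(acc + 0.5))))
    return result


row = [0, 100, 200, 255]
print(resize_row_uint8(row, 2, antialias=False))  # two taps per output pixel
print(resize_row_uint8(row, 2, antialias=True))   # three taps per output pixel
```

When upsampling, the stretch factor is 1 in both modes, so the two weight tables coincide. A consistency test between the AVX and fallback paths could compare per-pixel outputs in the same spirit, allowing an off-by-one tolerance if the AVX path uses fixed-point weights.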