# Colorspace CTEs: Chromaticities, Transfer Functions, and Encodings
> *(Ow my head)*
A color space is a mapping between signal values and corresponding absolute colors.
### References
**Use Wikipedia for quick reference.**
But use the official specs for detailed and authoritative reference (generally the first link in the Wikipedia article's References section), though they are very terse. The specs generally specify *encodings*, so RGB->YCbCr, leaving it to us to invert them into YCbCr->RGB decoders.
> *We usually (but not always) say "Rec709" and "Rec2020", but you'll sometimes see "BT.709", "BT.2020", or similar. They're the same. Use whatever name other people are using.*
E.g. for Rec709:
* https://en.wikipedia.org/wiki/Rec._709
* https://www.itu.int/rec/R-REC-BT.709
---
## Chromaticity: Primaries and Whitepoint
The gamut graph we always see is a triangle, with vertices at the red, the green, and the blue primaries. These set the boundary for what colors are representable in our RGB colorspace.
The gamut triangle is usually overlaid on the piecewise blob that represents all human-visible colors. The curved edge traverses the colors for each wavelength of monochromatic light, often marked with nanometer wavelengths. The straight "line of purples" edge connects the shortest- (blue/violet) and longest-wavelength (red) colors of monochromatic light.
For example, the red primary is the absolute color you get when you ask for rgb(1,0,0) in that colorspace. The whitepoint is the x,y chromaticity coordinate of all achromatic (gray) colors from rgb(0,0,0) black through rgb(1,1,1) white.
Most colorspaces use the same "D65" whitepoint, though some don't. One of the important changes between DCI-P3 and Display-P3 is that while the latter uses the common D65 whitepoint, DCI-P3 uses an unusual whitepoint optimized for the white of the ~6300K xenon arc lamp projectors commonly used in theaters. We can still kind of convert between them, but it's a little more troublesome, in that even their whites are different. (100% display-referred white is mutually unrepresentable between the two spaces)
Each colorspace specifies its primary and whitepoint coordinates in a reference colorspace, often as CIE1931 x,y chromaticity coordinates.
```javascript=
const WP_D65 = [0.3127,0.3290]; // CCT ~6504K
const WP_CCT6300 = [0.314,0.351]; // Correlated Color Temperature 6300K
const REC709_PRIMS = {
red: [0.640,0.330],
green: [0.300,0.600],
blue: [0.150,0.060],
white: WP_D65,
};
const SRGB_PRIMS = REC709_PRIMS; // Yep! They have different TFs though!
const REC601_525_NTSC_PRIMS = { // AKA SMPTE170M
red: [0.630,0.340], // Every single coeff is slightly
green: [0.310,0.595], // different from rec709.
blue: [0.155,0.070],
white: WP_D65,
};
const REC601_625_PAL_PRIMS = { // AKA Bt470bg
red: [0.640,0.330],
green: [0.290,0.600], // 0.290 instead of rec709's 0.300.
blue: [0.150,0.060], // Otherwise the same.
white: WP_D65,
};
const REC2020_PRIMS = { // Huge gamut compared to rec709!
red: [0.708,0.292],
green: [0.170,0.797],
blue: [0.131,0.046],
white: WP_D65,
};
const DCI_P3_PRIMS = {
red: [0.680,0.320],
green: [0.265,0.690],
blue: [0.150,0.060],
white: WP_CCT6300,
};
const DISPLAY_P3_PRIMS = { // Between rec709 and rec2020 in gamut size.
red: [0.680,0.320],
green: [0.265,0.690],
blue: [0.150,0.060],
white: WP_D65, // The only change from DCI-P3
};
```
> *I love DisplayP3, because Apple did a great job with compromises in choosing it for their sRGB replacement. It's the same transfer function as sRGB, so no new math or formats needed for linear-blending. It's significantly bigger than sRGB, but without being quite so huge as rec2020. Rec2020 more or less* needs *10 bits or more per channel, whereas while DisplayP3 benefits from 10+ bits, 8-bit channels don't suffer too much. This means you can upgrade to DisplayP3 while still using an rgba8 (or srgba8!) format, without having to jump immediately to rgb10_a2 or something. Finally, the whitepoint is the standard D65 shared by most other colorspaces in active use on consumer displays. Great artists steal!*
Ok, so, great! Based on the colorspace, we know what to display e.g. 100% red or 100% white as! It'll depend on our display, but if we're asked to display 100% red in sRGB, we tell our display to show (or approximate) the well-defined CIE1931 {x=0.640,y=0.330}. (This may well include some tiny bit of our display's green or blue, not just red!)
But what about 50% red?
---
## Transfer Functions
We now have our RGB values in some kind of tuple like `(f32,f32,f32)`!
In normal Standard Dynamic Range (SDR) 0.0 represents darkest-possible and 1.0 represents brightest-possible. Thus (0,0,0) is darkest-possible black and (1,1,1) is brightest-possible white.
What is "half as bright"? Naively we'd want 0.5 to be "half as bright", and it is! ...depending on what "as bright" means!
Historically, people worked pretty hard to ensure that 0.5 generally means "perceptually half brightness". However, the eye's perception of brightness is *not at all linear*. Roughly speaking, if you double the photons the eye receives, you only increase perceived brightness by ~35%. It takes about *five times* the photons before something "looks twice as bright".
> *A classic ground-truth test for this is to make thin alternating stripes of e.g. 1.0 white and 0.0 black. It sure* looks *much brighter than a 0.5 gray!*
>
> *When we resize images using naive linear value interpolation, this same effect is why we call it "gamma-incorrect scaling". Channel values are interpolated incorrectly, and therefore even color hues can shift. In practice, it's usually not as drastically bad as this makes it sound, but in some cases it can be. Doing it right is like cleaning a dirty lens, where it's better after, even if it didn't seem awful before.*
But that might be ok, right? What if we instead said that 0.5 is 50% of the photons of 1.0? Well, when packed into e.g. 8 bits linear in number of photons, going from 0/255 black to 1/255 is just 0.4% of the photons of 1.0, but that's already approximately `Math.pow(1/255, 0.45) -> 8.2%` of our maximum perceived brightness! By 12/255 we're already at a 25.3% mid-dark-grey! On the top end, 254/255 in photons is just 0.17% less bright perceptually, practically indistinguishable!
If we store ratios of physical brightness **linear**ly, then we end up with crushed low-precision lows/blacks, and an overabundance of precision in the highs/whites.
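Here's a tiny sketch reproducing those numbers, using the same rough 0.45 perceptual exponent as above:
```javascript=
// Rough perceived brightness of an 8-bit code value stored *linear in photons*,
// using the approximate 0.45 exponent from above.
function perceived(code_u8) {
  return Math.pow(code_u8 / 255, 0.45);
}
console.log(perceived(1));   // ~0.082: one step above black already looks ~8% bright.
console.log(perceived(12));  // ~0.253: a quarter of the perceived range spent on 12 codes.
console.log(perceived(254)); // ~0.998: practically indistinguishable from 255.
```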
The solution to this is a "transfer function". Transfer functions are a mapping from the "physical" brightness to a "signal" value. (and back again)
The transfer function everyone is most familiar with is an idealized gamma curve:
```javascript=
function from_gamma_22(enc) {
return Math.pow(enc, 2.2);
}
```
In practice, for historic reasons the most common style of TF is sRGB's "piecewise gamma" curve:
```javascript=
function from_piecewise_gamma_srgb(enc) {
if (enc <= 0.04045) return enc / 12.92;
return Math.pow((enc + 0.055) / 1.055, 2.4);
}
```
Or more generically:
```javascript=
function to_piecewise_gamma(desc, linear) {
if (linear < desc.b) return linear * desc.k;
return desc.a * Math.pow(linear, 1/desc.g) - (desc.a - 1);
}
function from_piecewise_gamma(desc, enc) {
let linear_if_low = enc / desc.k;
if (linear_if_low < desc.b) return linear_if_low;
return Math.pow((enc + (desc.a - 1)) / desc.a, desc.g);
}
const SRGB = {
a: 1.055,
b: 0.04045 / 12.92,
g: 2.4,
k: 12.92,
};
const DISPLAY_P3 = SRGB; // How reasonable!
const REC709 = {
a: 1.099,
b: 0.018,
g: 1.0 / 0.45, // ~2.222
k: 4.5,
};
const REC2020_10BIT = REC709;
const REC2020_12BIT = {
a: 1.0993,
b: 0.0181,
g: 1.0 / 0.45, // ~2.222
k: 4.5,
};
```
This starts at 0.0 with a short linear section (which avoids the infinite slope a pure gamma curve would have at zero), and then switches to a gamma curve that ends at 1.0->1.0.
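As a quick sanity check (using the descs and functions above), decode and encode should round-trip, and different TFs give noticeably different linear values for the same signal:
```javascript=
const linear = from_piecewise_gamma(SRGB, 0.5);   // ~0.214
const reenc = to_piecewise_gamma(SRGB, linear);   // ~0.5 again
console.log(linear, reenc);
console.log(from_piecewise_gamma(REC709, 0.5));   // ~0.260: different curve, different result!
```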
### Physical vs Perceptual: So what's rgb(0.5, 0.5, 0.5) mean?
Also for historical reasons, by default graphics APIs generally operate on perceptual brightnesses. This means that outputting (0.5, 0.5, 0.5) via the sRGB TF gets you `Math.pow((0.50+0.055) / 1.055, 2.4)` = 21.4% of the photons as compared to (1,1,1). But it *looks* half as bright. This is often what people intuitively want.
The big downside: For a screen that outputs 300 nits (a common monitor brightness), adding one light to a scene might output 0.5, or just ~64 nits. But naively duplicating this light as 0.5+0.5=1.0 would jump to all 300 nits of output!
You might want...
### Physically Linear Blending
Under (physically) "linear blending", 21% photons + 21% photons isn't 100% photons, it's reasonably 42% of our max photons! Great! Unfortunately, we need to quantize these nice percents back into our u8s, and so we revisit the crushed-black problem.
> *Unfortunately, the term-of-art for this in color and lighting is just "linear", not "physically linear" or any other term that might clarify* what *we're operating linear to. In my own work, I usually say "physical" and "perceptual", which you've seen above. Working in "non-linear" perceptually-linear spaces is still a* kind *of linear, and being precise-though-verbose pays dividends in intelligibility.*
>
> *However, when industry says "linear", we need to understand their meaning, not be mad at terms we can't change.*
What we actually do is still read and write via the transfer function for quantization: we reverse the transfer function when we read data, do the math (physically) linearly, and then re-apply the transfer function when we write results back out.
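In code, that looks something like this (a sketch using the piecewise-gamma helpers from above):
```javascript=
// Physically-linear 50/50 blend of two sRGB-encoded values.
function blend_linear_srgb(enc_a, enc_b) {
  const lin_a = from_piecewise_gamma(SRGB, enc_a);
  const lin_b = from_piecewise_gamma(SRGB, enc_b);
  const lin_mix = (lin_a + lin_b) / 2;      // Do the math on (physically) linear values.
  return to_piecewise_gamma(SRGB, lin_mix); // Re-encode for storage.
}
console.log(blend_linear_srgb(1.0, 0.0)); // ~0.735, not 0.5! (Remember the stripes?)
```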
Render targets that blend this way are called "srgb formats", because they apply the srgb TF during writing, and unapply it during reading. Incredibly unfortunately, within graphics API specs, formats that instead read and write with naive fixed-point normalization are called "linear formats". This means that for physically **linear blending** we use **srgb formats**, and for **perceptual blending** we use **linear formats**.
---
### HDR: Beyond 1.0
But what if we want to go to 1.1?
> "Does that mean it's brighter? Is it any brighter?"
> "Well, it's 0.1 brighter, isn't it? It's not 1.0. You see, most blokes, you know, will be rendering at 1.0...Where can you go from there? Where?"
> "Why don't you make 1.0 a little brighter? Make that the top number and make that a little brighter?"
> "...These go to 1.1!"
Imagine these devices:[^1]
* A has a max brightness of 250 nits
* B has a max brightness of 500 nits
* C has a max brightness of 1000 nits
[^1]: You don't have to imagine: A is e.g. a budget Lenovo Ideapad (2022), B is e.g. Dell XPS 9310 (2022), C is e.g. MacBook Pro 14" (2022) in XDR mode. (In SDR mode, the MacBook Pro is limited to 500 nits)
And consider that we want to display a very light-colored scene (such as sunny clouds, or a beach), where much of the content is at or near 1.0.
* On A, it'll look totally fine indoors, but will wash out in bright locations like outdoors.
* On B, it'll still look great outdoors, but indoors it'll be just a bit too bright, and you'll be reaching for the brightness adjustment setting.
* 1.0 on C will be *blinding* indoors.
Pedantically, before HDR, everything was SDR. We want a way to be *able* to use high brightnesses, without automatically upgrading legacy SDR content to eye-watering brightnesses.
### Absolute Values: Display-Referred vs Scene-Referred
One of the main uses of HDR is for specular highlights. Indeed, for HDR we generally see a specific value specified for "diffuse white". This is (usually) your SDR v=1.0.
So the display has a maximum brightness, and the system chooses a brightness level for "diffuse white". The system will map 1.0 to "diffuse white", but still let values >1.0 (like speculars) through to display as brighter, up to the max brightness of the display. (at which point it might e.g. clip and saturate)
But ultimately, this is the system and the user choosing a reference brightness for 1.0 "diffuse white" of scenes, within the capabilities of the display. This is called **Scene-Referred**, and is the most common form of TF, e.g. BT.1886 (for e.g. sRGB and BT.709) or other compatible TFs (e.g. for Rec2100 HLG, and scRGB).
But what if we want the same image of a scene to display the same regardless of system or user preferences? You might say, "this bright area of sand should be 231 nits" and also "the specular sparkle on the waves should peak at 797 nits". You can scale the image such that the sand is 1.0 and the waves sparkle at 3.45, but the output on displays of two different systems will very likely be different.
Well, how about directly encoding pixel values in absolute nits of brightness instead of scene-referred relative values? This is called **Display-Referred**, and is used most notably by the Rec2100 PQ transfer function (SMPTE ST 2084).
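For a concrete feel for a display-referred TF, here's a sketch of the PQ (SMPTE ST 2084) pair; a signal of 1.0 corresponds to an absolute 10000 nits:
```javascript=
// Rec2100 PQ (SMPTE ST 2084): signal <-> absolute luminance in nits.
const PQ = {
  m1: 2610/16384, m2: 2523/4096*128,
  c1: 3424/4096, c2: 2413/4096*32, c3: 2392/4096*32,
};
function pq_from_nits(nits) {
  const y = Math.pow(nits / 10000, PQ.m1);
  return Math.pow((PQ.c1 + PQ.c2*y) / (1 + PQ.c3*y), PQ.m2);
}
function nits_from_pq(enc) {
  const p = Math.pow(enc, 1/PQ.m2);
  return 10000 * Math.pow(Math.max(p - PQ.c1, 0) / (PQ.c2 - PQ.c3*p), 1/PQ.m1);
}
console.log(nits_from_pq(pq_from_nits(797))); // ~797 nits, regardless of system or display.
```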
---
## Encodings
In practice, colors are often not given to us in nice `(f32,f32,f32)` tuples.
Very commonly, we'll work with 8-bit-per-channel "fixed-point normalized" data, where 0.0 is 0x00, 1.0 is 0xff, and 0.5 is (drat!) 0x7f or 0x80. (or maybe dithered to both!)
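In code, the normalization itself is trivial (a sketch):
```javascript=
// Fixed-point normalization: u8 <-> [0.0, 1.0] float.
function unorm8_from_float(v) { return Math.round(v * 255); }
function float_from_unorm8(b) { return b / 255; }
console.log(unorm8_from_float(0.5));  // 128 (0x80): 0.5 isn't exactly representable.
console.log(float_from_unorm8(0x7f), float_from_unorm8(0x80)); // ~0.498, ~0.502
```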
Very often, we'll receive (especially for video) colors via YUV channels instead of RGB. Encoded YUV channel data can be more properly called YCbCr, but YUV and YCbCr are often treated in practice as interchangeable terms. (except when used to contrast with each other, as follows)
Strictly speaking, converting from encoded YCbCr data goes YCbCr->YUV->RGB. First we (at least conceptually) decode YCbCr (e.g. `(u8,u8,u8)`) data into `(f32,f32,f32)` YUV. Then, we apply a linear (matrix) transform to convert YUV to RGB.
### YUV->RGB
The modern reason for YUV instead of RGB is that the eye is *very* sensitive to changes in brightness, and much less sensitive to individual colors. So the idea is that we often have "half-width-half-height" frame data, where for a 400x400 frame, we'll have 400x400 Y samples, but only 200x200 U and V samples. This is a very simple and efficient form of compression that I'll touch on [further on](#Multiplane-Encodings).
Y is the luma (a weighted sum of R, G, and B), and U and V are proportional to the differences B-Y and R-Y respectively:
```javascript=
function to_yuv_rec709(r,g,b) {
// Rec709 Item 3.2
const y = 0.2126*r + 0.7152*g + 0.0722*b; // E'Y [0...1]
// Item 3.3
const u = (b - y) / 1.8556; // E'CB [-0.5...+0.5]
const v = (r - y) / 1.5748; // E'CR [-0.5...+0.5]
return [y,u,v];
}
```
> *Occasionally you'll see GBR (or gbrp) as a format, as opposed to RGB or YUV. You can see from the rgb->luminance coefficients that green matters the most. (and blue the least! The human eye is* really *bad at blue) So if you have a pipeline that is built around YUV, you can sort of squint at GBR and see it as a shoddy no-cost YUV-like, if you're starting with RGB. It's unusual though, and I only mention it because I think it's interesting.*
This is a linear transform, which means it's often done with matrix multiplication.
```=
y = 0.2126*r + 0.7152*g + 0.0722*b
u = (b - y) / (u_max - u_min) // u_min = -u_max
= (b - y) / (2 * (1 - y(0,0,1)))
= (b - y) / (2 * (1 - 0.0722))
= (-0.2126*r + -0.7152*g + (1-0.0722)*b) / 1.8556
v = (r - y) / 1.5748;
= ((1-0.2126)*r + -0.7152*g + -0.0722*b) / 1.5748
| y | | +0.2126 +0.7152 +0.0722 | | r |
| u | = | -0.2126/1.8556 -0.7152/1.8556 (1-0.0722)/1.8556 | x | g |
| v | | (1-0.2126)/1.5748 -0.7152/1.5748 -0.0722/1.5748 | | b |
```
We can ["simply" invert](https://en.wikipedia.org/wiki/Invertible_matrix#Methods_of_matrix_inversion) this YUV_FROM_RGB matrix to get the RGB_FROM_YUV matrix we need for decode.
### YCbCr->YUV
YCbCr bit-encodings are pretty different from the straightforward fixed-point normalized 8-bit encodings used for RGB data. In YUV, U and V are `[-0.5...+0.5]` instead of `[0...1]`. However, we don't use `i8` signed types for this, but rather we manually bias and scale.
There are generally "full-range" and "limited-range" encodings for ycbcr data. (also known as pc/studio and tv/broadcast respectively)
```javascript=
function ycbcr_full8_to_yuv(yy, cb, cr) {
  const y = yy / (255 - 0); // 0->0, 255->1
  const u = (cb - 128) / (255 - 1); // 1->-0.5, 128->0, 255->+0.5
  const v = (cr - 128) / (255 - 1);
  return [y,u,v];
}
function ycbcr_limited8_to_yuv(yy, cb, cr) {
  const y = (yy - 16) / (235 - 16); // 16->0, 235->1
  const u = (cb - 128) / (240 - 16); // 16->-0.5, 128->0, 240->+0.5
  const v = (cr - 128) / (240 - 16);
  return [y,u,v];
}
```
### In a Shader
In shading languages, ycbcr texture samples are handed to us as `[0...1]` "lowp" floats. (i.e. the u8/255 fixed-point normalization has been decoded to float for us)
We can phrase the ycbcr->yuv transform as a (4x3) matrix as well, even though we have a bunch of zeros. The real trick is that we can then combine matrices!
In all, there are not that many constants needed to rederive the transform that we'll need to apply to each sample in the fragment shader. Since there are usually just 4 verts and millions of fragments, we can just lean on the vertex shader to calculate the matrices via glsl builtins, rather than having to use our own matrix math in our host language:
```glsl=
#version 300 es
uniform vec3 Y_COEFFS;
uniform vec4 YY0_YY1_CB0_CB1;
flat out mat4x3 RGB_FROM_YCBCR;
/// For rec709 8bit limited:
/// * y_coeffs = vec3(0.2126, 0.7152, 0.0722);
/// * yy0_yy1_cb0_cb1 = vec4(16/255, 235/255, 128/255, 240/255);
mat4x3 make_rgb_from_ycbcr(vec3 y_coeffs, vec4 yy0_yy1_cb0_cb1) {
// U = (b - y) / (2*u_max)
float u_max = 1.0 - dot(vec3(0,0,1), y_coeffs); // at rgb(0,0,1)
float v_max = 1.0 - dot(vec3(1,0,0), y_coeffs); // at rgb(1,0,0)
// ctors are col-major, so declare in row-major then transpose
mat3 yuv_from_rgb = transpose(mat3(
y_coeffs,
(vec3(0,0,1) - y_coeffs) / (2.0*u_max),
(vec3(1,0,0) - y_coeffs) / (2.0*v_max)));
mat3 rgb_from_yuv = inverse(yuv_from_rgb);
// yuv = scale * (ycbcr - zero_point), written as an affine (4x3) matrix:
float y_scale = 1.0 / (yy0_yy1_cb0_cb1.y - yy0_yy1_cb0_cb1.x); // yy0->0, yy1->1
float c_scale = 0.5 / (yy0_yy1_cb0_cb1.w - yy0_yy1_cb0_cb1.z); // cb0->0, cb1->+0.5
mat4x3 yuv_from_ycbcr = transpose(mat3x4(
y_scale, 0, 0, -yy0_yy1_cb0_cb1.x * y_scale,
0, c_scale, 0, -yy0_yy1_cb0_cb1.z * c_scale,
0, 0, c_scale, -yy0_yy1_cb0_cb1.z * c_scale));
mat4x3 rgb_from_ycbcr = rgb_from_yuv * yuv_from_ycbcr;
return rgb_from_ycbcr;
}
void main() {
gl_Position = vec4(gl_VertexID & 1, gl_VertexID >> 1, 0, 1);
RGB_FROM_YCBCR = make_rgb_from_ycbcr(Y_COEFFS, YY0_YY1_CB0_CB1);
}
```
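On the fragment side, applying it is then just one multiply per sample. A sketch, assuming a two-plane (NV12-style) layout with hypothetical `Y_PLANE`/`UV_PLANE` samplers and a `TEX_COORD` varying:
```glsl=
#version 300 es
precision mediump float;
uniform sampler2D Y_PLANE;  // .r = Y'
uniform sampler2D UV_PLANE; // .rg = CbCr
flat in mat4x3 RGB_FROM_YCBCR;
in vec2 TEX_COORD;
out vec4 FRAG_COLOR;
void main() {
  vec3 ycbcr = vec3(texture(Y_PLANE, TEX_COORD).r,
                    texture(UV_PLANE, TEX_COORD).rg);
  FRAG_COLOR = vec4(RGB_FROM_YCBCR * vec4(ycbcr, 1.0), 1.0);
}
```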
### Whence BGRA?
Just accept that sometimes APIs want you to use BGRA instead of RGBA (often older desktop APIs), or conversely to stick to RGBA with limited or no support for BGRA (often newer and/or mobile APIs). BGRA is still around for hysterical raisins.
But really, try to stick to RGBA-order. Otherwise you will have two modes which will mostly-but-not-quite match each other.
> *Like y-flip, avoid "oh I'll just add some `bool swap_r_b` arguments", because two wrongs will make a right, which makes things hard to reason about and maintain. (Prefer RGB vs BGR, like YUp vs YDown) Sometimes for performance you might eventually have to cheat, but you should feel sufficiently bad about it. Be as simple as you can, but no simpler.*
>
> *But always, always always always be clear about what you expect where.*
### Multiplane Encodings
As touched on [previously](#YUV-gtRGB), subsampling the color (chroma) data of an image to half-width-half-height is a simple and effective form of compression.
The most common multiplane encoding is "4:2:0" "NV12", which does so in two planes:
* a (full-size) Y (luma) plane
* a half-width, half-height UV (chroma) plane
"4:2:0" means half-width-half-height, where as "4:4:4" means full-width-full-height. Those are the only ones you generally need to know, but there are a bunch of other lesser-used modes that you can learn about on wikipedia: https://en.wikipedia.org/wiki/Chroma_subsampling *(I will also defer to Wikipedia to explain why half of two is zero)*
Occasionally, you'll have 3-plane "planar YUV":
* Y plane
* U plane
* V plane
Sometimes you'll even have 1-plane interleaved YUYV.
### 10-, 12-, and 16-bit Channels
The extension from 8 bits per channel to 16 is exactly what you'd expect: s/u8/u16/, and rescale our constants. (But double-check the spec! E.g. Rec2020 defines 10- and 12-bit constants in detail)
We also see 10-bit channels in packed formats like RGB10_A2, where we give up alpha bits (or just X2 for opaques!) in exchange for being able to pack three 10-bit channels into 32 bits.
RGB10_A2 works great for coplanar interleaved data, but generally we will see 2-plane NV12-like packings. But for 10 and 12 bits, we don't have a great way to pack just two of them for UV. Thus we expand to u16 and have a choice: pad to most-significant bit, or instead to least-significant?
(Check your docs, but) Generally we pad to *most*-significant bit. A fantastic side effect is that by padding to MSB, we can just reuse our 16-bit fixed-point machinery! We shouldn't need separate paths for 10-, 12-, and 16-bit since they can all naively use 16-bit decoding. Instead of NV12 for 8-bit, these are called P010, P012, and P016 respectively.
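A sketch of why MSB-padding makes naive 16-bit decoding close enough:
```javascript=
// A 10-bit code `c` stored MSB-aligned in a u16 (P010-style) is just c << 6.
const c = 1023; // 10-bit max
console.log((c << 6) / 65535); // ~0.99905: naive 16-bit normalization
console.log(c / 1023);         // 1.0: "true" 10-bit normalization
// Worst-case difference is just under one 10-bit step, so one decode path suffices.
```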
---
## Putting it all together
A colorspace is a tuple of:
* Chromaticity: "What color is it?"
* Primaries (Positions of R,G,B, in e.g. CIE1931)
* *Whitepoint (but almost always D65)*
* Transfer Function: "How bright is it?"
* *Encoding: "How do we represent values?"*
Chromaticity and Transfer Function are *definitely* formal parts of "a color space", while Encoding is a more practical addition. Frequently we're already working with abstract float values, in which case the encoding is the no-op identity function, but very often in practice we'll have to care about the difference between the various ways our colors are encoded.
I can see space for an argument that Encoding is part of Transfer Function, or that these two are part of some broader category, just as Chromaticity contains Primaries and Whitepoint.
> *If someone tells you that no, encoding is orthogonal to colorspace, nod your head, they are technically theoretically correct.*
**The goal for all of these color terms is clear communication and mutual intelligibility.** Ask clarifying questions, come up with examples, make sure your mental landmarks are in the same place as other people's are, at least for communication with them. Don't get caught up in who's technically or semantically correct. Instead, figure out how to translate from your mental model into something the group understands. Successful communication is the goal.