# H.264 Video Codec, MP4 and CMAF For Live Streaming
:::info
:bulb: This article doesn't claim to be 100% correct and should not be used as an academic reference.
:::
Video is basically a sequence of pictures, or images, shown one after another; each of these images is called a frame. The video we see in everyday life usually has a 30 FPS frame rate, which means that for every second, approximately 30 frames are displayed.
Since each frame is basically an image, it consists of pixels. Each pixel uses 3 color channels to represent its value, namely Red, Green, and Blue (RGB), each 1 byte in size. Video (to be exact, each frame) comes in several resolutions; the currently common one is Full HD, or 1920 x 1080 pixels.
Using the information we established above, every single Full HD frame takes around **1920 x 1080 x 3 = 6,220,800 bytes** of storage, and every second of video takes around **30 x 6,220,800 = 186,624,000 bytes (~186 MB)**. This makes storing, processing, and delivering raw video infeasible, hence we need to somehow compress the video.
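To make the arithmetic concrete, here is a minimal Python sketch (using the frame size, bytes per pixel, and frame rate assumed above) that computes the raw, uncompressed storage requirement:

```python
# Raw (uncompressed) video size estimation for Full HD at 30 FPS.
WIDTH, HEIGHT = 1920, 1080   # Full HD resolution in pixels
BYTES_PER_PIXEL = 3          # one byte each for R, G, and B
FPS = 30                     # frames per second

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # 6,220,800 bytes per frame
second_bytes = frame_bytes * FPS                 # 186,624,000 bytes per second
hour_gb = second_bytes * 3600 / 1e9              # roughly 672 GB per hour, uncompressed

print(f"1 frame : {frame_bytes:,} bytes")
print(f"1 second: {second_bytes:,} bytes (~{second_bytes / 1e6:.0f} MB)")
print(f"1 hour  : ~{hour_gb:.0f} GB")
```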
## Video Compression
Video compression has basically the same goal as data compression in general: reduce the data size while still retaining the information. There are many video compression methods (lossless and lossy), but this time we will discuss the H.264 compression format, or codec, which is widely used in numerous use cases.
### H.264
In H.264, there are 3 important types of frames, as explained below.

<p style="text-align:center;font-style:italic">
Frames in H.264 by <a href="https://www.networkwebcams.co.uk/blog/h264-video-compression-in-ip-video-surveillance-systems/" target="_blank">
Network Webcams</a>
</p>
#### I-Frame
A self-contained frame that is compressed directly from the image without reference to any other frames.
#### P-Frame
A frame that is "predicted" by using the previous frame. P-Frame only stores the **difference** between the previous frame and itself.
#### B-Frame
The same as a P-Frame, but it references both past and future I- and/or P-frames.
By using these approaches, H.264 is able to reduce the size of each frame, leading to a smaller overall data size. Whenever the video is decoded for playback, the decoder uses the stored *differences* to calculate and reconstruct the original frame information.
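As a very rough illustration of the idea (this is not the actual H.264 algorithm, which predicts motion-compensated blocks rather than subtracting raw pixels), here is a toy Python sketch that stores only the difference from the previous frame and reconstructs frames from those differences:

```python
# Toy illustration of "store the difference, reconstruct on decode".
# Real P-frames use motion-compensated block prediction, not raw pixel subtraction.

def encode(frames):
    """Keep the first frame intact (like an I-frame) and only deltas for the rest."""
    encoded = [("I", frames[0])]
    for prev, curr in zip(frames, frames[1:]):
        delta = [c - p for c, p in zip(curr, prev)]   # difference vs. previous frame
        encoded.append(("P", delta))
    return encoded

def decode(encoded):
    """Rebuild every frame by adding each delta onto the previously decoded frame."""
    frames = [encoded[0][1]]
    for _, delta in encoded[1:]:
        frames.append([p + d for p, d in zip(frames[-1], delta)])
    return frames

# Tiny "frames" of 4 pixels each; most pixels barely change between frames,
# so the deltas are mostly zeros and compress far better than the raw frames.
original = [[10, 10, 10, 10], [10, 11, 10, 10], [10, 11, 12, 10]]
assert decode(encode(original)) == original
```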
## MP4 (MPEG-4)
MP4 is just a media container: a media format that wraps the video and audio streams into a file following certain rules, syntax, and structure. This way, the media can be processed by (is compatible with) a myriad of applications and programs.
MP4 divides its data into units called **atoms**. Each atom has its own functionality and may include other atoms (by encapsulation). There are many [atoms](https://cconcolato.github.io/mp4ra/atoms.html#), but here are the key atoms we have to understand.

### FTYP Atom
**File Type (FTYP)** is an atom that contains the movie's file type information, i.e. its brands (e.g. `M4V`, `isom`, `avc1`). This is used by the media player to correctly process and decode the media stream inside the MP4 container.
### MOOV & MVHD Atom
**Movie (MOOV)** is an atom that encapsulates the **Movie Header (MVHD)** atom. MVHD contains specific information (or configuration) regarding media playback, such as the video timescale, video duration, preferred playback speed, preferred volume, the transformation matrix for point mapping, and the next track ID. This atom is crucial for playback.
### MDAT Atom
Media Data (MDAT) is an atom that contains the actual media data. The video and audio bitstreams are stored in a certain syntax and structure called Network Abstraction Layer (NAL) units. Each MDAT atom typically holds more than one NAL unit, and it must not be corrupted; otherwise the decoder can't reconstruct the media bitstream (due to H.264's compression method, where frames reference other frames).

>For further details, check [here](https://www.cimarronsystems.com/wp-content/uploads/2017/04/Elements-of-the-H.264-VideoAAC-Audio-MP4-Movie-v2_0.pdf)
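To make the atom structure more tangible, here is a minimal Python sketch that walks the top-level atoms (boxes) of an MP4 file and prints their type and size. It relies only on the standard box header layout (a 4-byte big-endian size followed by a 4-byte type, with a 64-bit size variant); the filename `movie.mp4` is just a placeholder:

```python
import os
import struct

def list_top_level_atoms(path):
    """Yield (atom_type, size_in_bytes) for each top-level atom in an MP4 file."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, atom_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:
                # A 64-bit "largesize" follows the type field for very large atoms.
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            elif size == 0:
                # A size of 0 means the atom extends to the end of the file.
                size = file_size - f.tell() + 8
            yield atom_type.decode("ascii", errors="replace"), size
            f.seek(size - header_len, 1)  # skip over the atom's payload

# Typical output: ftyp, then moov and mdat (their order can vary), maybe free/udta.
for atom_type, size in list_top_level_atoms("movie.mp4"):
    print(f"{atom_type}: {size:,} bytes")
```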
## Fragmented MP4
So far we've learnt that H.264 compression significantly reduces the file size, and we've gotten to know the MP4 container. That being said, this is still not optimal for media distribution over the network, especially for Video on Demand (VoD) or live streaming use cases.
Why? Try to think about it.
Typically, a movie runs for 1 to 2 hours. In Full HD resolution, this could take around 1 to 3 GB of storage or even more, depending on the media encoding configuration. We know that for an MP4 file to be processed (decoded and played back), it has to be fully intact. This means that in a VoD scenario, the client (user) has to fully download the file before being able to play the video on their device. It's even impossible for live streaming use cases, as we never know when the video finishes. This is where Fragmented MP4 (fMP4) comes to the rescue.
fMP4 works by splitting the MP4 MOOV and MDAT atoms into several smaller pieces called **segments**. Each segment has a shorter duration and contains its own MOOF and MDAT atoms. Therefore, the media can be processed progressively, segment by segment, rather than waiting for the whole MOOV and MDAT to be retrieved. This significantly decreases playback latency and makes the live streaming scenario possible.
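In practice you rarely build fMP4 by hand; a tool such as FFmpeg can repackage a regular MP4 into a fragmented one. Below is a minimal sketch, assuming `ffmpeg` is installed and using `input.mp4`/`fragmented.mp4` as placeholder filenames:

```python
import subprocess

# Repackage an ordinary MP4 into fragmented MP4 without re-encoding.
# "empty_moov" writes an initial MOOV that carries no sample data, and
# "frag_keyframe" starts a new fragment (MOOF + MDAT pair) at every keyframe.
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",            # placeholder input file
        "-c", "copy",                 # keep the existing compressed bitstreams
        "-movflags", "frag_keyframe+empty_moov+default_base_moof",
        "fragmented.mp4",             # placeholder output file
    ],
    check=True,
)
```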

## Low-Latency Common Media Application Format (LL-CMAF)
fMP4 does enable the live streaming scenario, but it still introduces latency that is unfavourable when we are pursuing *near-realtime* latency. This is where LL-CMAF comes in handy.
CMAF is basically a media format made to unify all types of media streaming formats (e.g. DASH and HLS) in the [Over The Top (OTT)](https://en.wikipedia.org/wiki/Over-the-top_media_service) scenario into one format that is compatible with numerous systems and platforms (e.g. Apple, Android, Windows).
### Chunk Encoding
LL-CMAF works by splitting the fMP4 segments further into smaller pieces called **fragments**. Similar to a segment, a fragment also has its own MOOF and MDAT atoms, but it is even shorter than one segment. In the distribution or streaming context, a fragment is further split into smaller parts called **chunks**. This is done to let the server send the media data more granularly, enabling the client (user) to play the video gradually, chunk by chunk and fragment by fragment.

<p style="text-align:center;font-style:italic">
CMAF Terminology by <a href="https://www.wowza.com/blog/low-latency-cmaf-chunked-transfer-encoding" target="_blank">
Traci Ruether, Wowza</a>
</p>

<p style="text-align:center;font-style:italic">
CMAF Chunks by <a href="https://speakerdeck.com/stswe/cmaf-low-latency-streaming-by-will-law-from-akamai" target="_blank">
Will Law, Akamai</a>
</p>
In a nutshell, a chunk is the smallest unit, a fragment consists of multiple chunks, and a segment consists of multiple fragments.
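This hierarchy can be summarized as a simple data model. Here is a minimal Python sketch (the names, sizes, and layout are illustrative, not taken from the CMAF spec):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    data: bytes                 # a slice of the fragment's media payload
    has_keyframe: bool = False  # only the chunk carrying the I-frame has one

@dataclass
class Fragment:
    chunks: list = field(default_factory=list)     # a fragment is split into chunks

@dataclass
class Segment:
    fragments: list = field(default_factory=list)  # a segment holds several fragments

# A segment made of four short fragments, each delivered as two chunks,
# where the first chunk of each fragment carries the keyframe.
segment = Segment(fragments=[
    Fragment(chunks=[Chunk(b"...", has_keyframe=True), Chunk(b"...")])
    for _ in range(4)
])
```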
#### Notes on CMAF Chunks
Contrary to a CMAF fragment, a CMAF chunk is not independently decodable, because a chunk may not contain an I-frame, the keyframe needed for the decoding process. But as long as the first chunk containing the I-frame within the same fragment has been received, the subsequent chunks can be played directly without waiting for the whole fragment to be reassembled, since the player already has the keyframe it needs for decoding.
### Chunked Transfer Encoding
With chunk encoding, we don't know the total size of the data in advance. Therefore, to support the distribution of LL-CMAF, we have to use a protocol that supports chunked transfer encoding, such as HTTP/1.1. This way the server can immediately send the newest available chunks to the client to be consumed.
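As a rough illustration of the delivery side, here is a minimal Python sketch of an HTTP/1.1 response using chunked transfer encoding, where each CMAF chunk is written to the socket as soon as it becomes available. The `get_next_cmaf_chunk` generator, the payloads, and the port are placeholders, not a real media server:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def get_next_cmaf_chunk():
    """Placeholder: a real server would yield the newest encoded CMAF chunks here."""
    for payload in (b"moof+mdat #1", b"moof+mdat #2", b"moof+mdat #3"):
        yield payload

class LowLatencyHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # chunked transfer encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "video/mp4")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for chunk in get_next_cmaf_chunk():
            # Each HTTP chunk is framed as: <hex length>\r\n<data>\r\n
            self.wfile.write(f"{len(chunk):X}\r\n".encode() + chunk + b"\r\n")
            self.wfile.flush()
        self.wfile.write(b"0\r\n\r\n")   # zero-length chunk terminates the response

if __name__ == "__main__":
    HTTPServer(("", 8080), LowLatencyHandler).serve_forever()
```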
That's the gist of H.264, MP4 and CMAF in the live streaming scenario. Hope this helps in some way :hand_with_index_and_middle_fingers_crossed:
## Contact
Refer to my [hackmd bio](https://hackmd.io/@asterrr)