# 50.012 Lecture 3: Multimedia Application Networks
## Context
Video traffic accounts for the majority of Internet bandwidth consumption.
How do we serve > 1 billion heterogeneous users (e.g. wired vs. mobile)?
## Audio Multimedia

* Analog audio signal sampled at a constant rate.
* e.g. telephone: 8000 samples/sec, CD music: 44,100 samples/sec
* The higher the sampling rate, the more accurately the audio is recorded.
* Each sample is quantized (rounded) to one of a fixed set of values, e.g. 2^8^ = 256 possible quantized values.
* Each quantized value is represented by bits (8 bits for 256 values); a minimal sketch follows below.
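A minimal sketch of this sampling-and-quantization pipeline; the 440 Hz test tone and helper names are illustrative, not from the notes:

```python
# A minimal sketch of uniform sampling and 8-bit quantization.
import math

SAMPLE_RATE = 8000          # samples/sec (telephone-quality)
BITS_PER_SAMPLE = 8         # 2**8 = 256 quantization levels
LEVELS = 2 ** BITS_PER_SAMPLE

def sample_and_quantize(signal, duration_sec):
    """Sample an analog signal (a float function of time, values in [-1, 1])
    at a constant rate, then round each sample to one of LEVELS values."""
    n_samples = int(SAMPLE_RATE * duration_sec)
    quantized = []
    for k in range(n_samples):
        t = k / SAMPLE_RATE
        s = signal(t)                              # analog value in [-1, 1]
        level = round((s + 1) / 2 * (LEVELS - 1))  # quantize to 0..255
        quantized.append(level)                    # each value fits in 8 bits
    return quantized

# Example: one second of a 440 Hz tone -> 8000 one-byte samples (64 kbps).
samples = sample_and_quantize(lambda t: math.sin(2 * math.pi * 440 * t), 1.0)
print(len(samples), "samples,", len(samples) * BITS_PER_SAMPLE, "bits")
```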
## Video Multimedia
* Sequence of images displayed at a constant rate.
* Digital image: array of pixels, where each pixel is represented by bits.
* Coding exploits redundancy within and between images to decrease the number of bits used to encode an image:
* spatial (within image): instead of sending N values of the same colour (all purple), send only two values: the colour value (purple) and the number of repeated values (N).
* temporal (from one image to the next): instead of sending the complete frame at i+1, send only the differences from frame i (see the sketch below).
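A minimal sketch of both ideas, using flat lists of pixel values as toy frames (real codecs are far more elaborate):

```python
def rle_encode(pixels):
    """Spatial: replace runs of identical values with (value, count) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return runs

def frame_diff(prev_frame, next_frame):
    """Temporal: send only (index, new_value) for pixels that changed."""
    return [(i, b) for i, (a, b) in enumerate(zip(prev_frame, next_frame)) if a != b]

frame_i  = [7, 7, 7, 7, 3, 3, 9]
frame_i1 = [7, 7, 7, 7, 3, 5, 9]
print(rle_encode(frame_i))            # [[7, 4], [3, 2], [9, 1]]
print(frame_diff(frame_i, frame_i1))  # [(5, 5)]: only one pixel changed
```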
## Multimedia Networking: Application Types
* Streaming stored audio, video
* streaming: can begin playout before downloading the entire file
* stored: can transmit faster than audio / video will be rendered (storing/buffering at client).
* Conversational voice / video over IP
* Interactive nature of human-to-human conversation limits delay tolerance.
* Streaming live audio, video
## Streaming stored video
### Concept
Video is pre-recorded and stored at a server; clients request it at different times and can begin playout before the entire file has been downloaded.
### Challenges
* Continuous playout constraint: once client playout begins, playback must match original timing.
* However, network delays are variable (jitter), so a client-side buffer is needed to match the playout requirements.
* Other challenges:
* client interactivity: pause, fast-forward, rewind, jump through video.
* lost video packets, which may need to be retransmitted.
### Client-side Buffering, playout

* Client-side buffering and playout delay to compensate for network-added delay, delay jitter.
* The frames that are yet to be played will be stored on client-side buffer.

1. Initial fill of buffer until playout begins at t~p~.
2. Playout begins at t~p~.
3. Buffer fill level varies over time, as the fill rate x(t) varies while the playout rate r is constant.
Playout buffering tradeoff, comparing the average fill rate $\bar{x}$ against the playout rate $r$:
* $\bar{x} < r$: buffer eventually empties, causing video playout to freeze until the buffer fills again.
* $\bar{x} > r$: buffer will not empty, provided the initial playout delay is large enough to absorb the variability in x(t).
* Initial playout delay tradeoff: a longer initial delay takes longer before playout begins, but gives a "smoother" experience (see the sketch below).
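A minimal simulation of this tradeoff, with a made-up fill-rate trace x(t) and a constant playout rate r; it counts how often playout freezes with and without an initial playout delay:

```python
def simulate(fill_trace, r, initial_delay):
    """Return how many ticks playout freezes; playout starts after
    initial_delay ticks and stalls whenever the buffer empties."""
    level, freezes = 0.0, 0
    for t, x in enumerate(fill_trace):
        level += x                       # frames arriving this tick
        if t >= initial_delay:           # playout has begun
            if level >= r:
                level -= r               # consume r frames per tick
            else:
                freezes += 1             # buffer starved: playout freezes
                level = 0.0
    return freezes

trace = [30, 10, 5, 40, 30, 5, 5, 35, 30]      # variable x(t), average ~21
print(simulate(trace, r=20, initial_delay=0))  # 1: freezes with no delay
print(simulate(trace, r=20, initial_delay=3))  # 0: delay absorbs the jitter
```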
### Streaming Multimedia: UDP
* UDP is used because it gives the application control over the sending rate.
* Server sends at a rate that is appropriate for the client.
* Usually send rate = encoding rate = playback rate.
* Send rate can be oblivious to congestion levels.
* Short playout delay (2-5 seconds) to remove network jitter.
* Error recovery: application-level (unlike TCP), time permitting.
* Real-time Transport Protocol (RTP) [RFC 3550]: defines multimedia payload types.
* UDP may not get through firewalls, since many firewalls block UDP traffic.
### Streaming Multimedia: HTTP
* Multimedia file retrieved via HTTP GET
* Send at maximum possible rate under TCP

* Fill rate fluctuates due to TCP congestion control and retransmissions (in-order delivery).
* Larger playout delay needed to smooth out the TCP delivery rate.
* HTTP/TCP passes more easily through firewalls.
### Streaming Multimedia: DASH
* DASH: Dynamic, Adaptive Streaming over HTTP.
* Other adaptive solutions: Apple's HTTP Live Streaming (HLS), Adobe Systems HTTP Dynamic Streaming, Microsoft Smooth Streaming
* Server:
* Encodes video files into multiple versions
* Each version is stored and encoded at a different rate.
* Manifest file: provides URLs for different versions.
* Client:
* Periodically measures server-client bandwidth
* Consulting manifest, requests one chunk at a time.
* Chooses the maximum coding rate sustainable given current bandwidth.
* Can choose different coding rates at different points in time, depending on the available bandwidth at the time (see the sketch after this list).
* Benefit: "intelligence" at client
* When to request chunk so that buffer starvation / overflow doesn't occur
* What encoding rate to request
* Where to request chunk (from URL server that is "close" to the client or has high available bandwidth).
* Can leverage web and its existing infrastructure (proxy, caching, etc.)
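A minimal sketch of the client's rate-selection logic; the manifest structure, URLs, and chunk naming are hypothetical, not taken from the DASH specification:

```python
MANIFEST = {  # encoding rate (kbps) -> URL prefix of that version's chunks
    300:  "http://cdn.example.com/video_300k/",
    1000: "http://cdn.example.com/video_1000k/",
    3000: "http://cdn.example.com/video_3000k/",
}

def choose_rate(measured_kbps):
    """Pick the highest coding rate the measured bandwidth can sustain;
    fall back to the lowest version if none fits."""
    sustainable = [r for r in MANIFEST if r <= measured_kbps]
    return max(sustainable) if sustainable else min(MANIFEST)

def next_chunk_url(chunk_id, measured_kbps):
    rate = choose_rate(measured_kbps)
    return f"{MANIFEST[rate]}chunk_{chunk_id}.m4s"

print(next_chunk_url(1, measured_kbps=2500))  # ...video_1000k/chunk_1.m4s
print(next_chunk_url(2, measured_kbps=4200))  # ...video_3000k/chunk_2.m4s
```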
## Voice-over-IP
### Challenge
VoIP end-to-end delay requirement: needed to maintain "conversational" aspect.
* Higher delays are noticeable and can impair interactivity
* Delay should be < 150 msec.
* Includes application-level (packetization, playout), network delays.
### VoIP Characteristics
* Speaker's audio: alternating talk spurts, silent periods.
* 64 kbps during talk spurt
* packets generated only during talk spurts.
* 20 msec chunks at 8 Kbytes/sec: 160 bytes of data per chunk (see the sketch after this list).
* Application-layer header added to each chunk.
* Chunk + header encapsulated into UDP or TCP segment.
* Application sends segment into socket every 20 msec during talk spurt.
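A quick check of the chunking arithmetic above; the 12-byte header size is an illustrative placeholder:

```python
ENCODING_RATE_BPS = 64_000           # 64 kbps during a talk spurt
CHUNK_INTERVAL_SEC = 0.020           # one chunk every 20 msec
HEADER_BYTES = 12                    # e.g. an RTP-sized app-layer header

chunk_bytes = int(ENCODING_RATE_BPS / 8 * CHUNK_INTERVAL_SEC)
print(chunk_bytes)                   # 160 bytes of audio per chunk
print(chunk_bytes + HEADER_BYTES)    # UDP payload sent every 20 msec
```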
### VoIP: packet loss, delay
* Network loss: IP datagram lost due to network congestion (router buffer overflow)
* Delay loss: IP datagram arrives too late for playout at receiver.
* Delay varies due to queuing in network; end-system (sender, receiver) delays.
* Typical max tolerable delay: 400 msec.
* Loss tolerance: depending on the voice encoding and loss-concealment scheme, packet loss rates between 1% and 10% can be tolerated.
### VoIP: Delay Jitter

End-to-end delays of two consecutive packets can differ: the spacing between them at the receiver can be more or less than the 20 msec spacing at the sender.
### VoIP: Fixed Playout Delay
* Receiver attempts to play out each chunk exactly q msec after the chunk was generated: a chunk time-stamped t is played out at t + q; chunks arriving after t + q are too late and are discarded.
* Tradeoff in choosing q: a large q means less delay loss, while a small q gives a better interactive experience.
### VoIP: Adaptive Playout Delay
* Goal: low playout delay, low late loss rate
* Approach: adaptive playout delay adjustment:
* Estimate network delay, adjust playout delay at beginning of each talk spurt.
* Silent periods compressed and elongated
* Chunks still played out every 20 msec during talk spurt.
* Adaptively estimate packet delay: EWMA (Exponentially Weighted Moving Average):
$$
d_{i}=(1-\alpha) d_{i-1}+\alpha\left(r_{i}-t_{i}\right)
$$
* d~i~: delay estimate after i-th packet
* α: small constant
* r~i~: time received
* t~i~: time sent
* (r~i~ - t~i~): measured delay of i-th packet
We can also estimate average deviation of delay, v~i~:
$$
v_{i}=(1-\beta) v_{i-1}+\beta\left|r_{i}-t_{i}-d_{i}\right|
$$
* Estimates d~i~, v~i~ calculated for every received packet, but used only at the start of talk spurt.
* For first packet in talk spurt, playout time is:
$playoutTime_i = t_i+d_i+Kv_i$
* Remaining packets in talk spurt are played out periodically.
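A minimal sketch of the estimators above; the α, β, and K values are illustrative:

```python
ALPHA, BETA, K = 0.01, 0.01, 4

d = 0.0   # delay estimate d_i (msec)
v = 0.0   # delay deviation estimate v_i (msec)

def on_packet(t_i, r_i):
    """Update d and v for every received packet (times in msec)."""
    global d, v
    delay = r_i - t_i
    d = (1 - ALPHA) * d + ALPHA * delay
    v = (1 - BETA) * v + BETA * abs(delay - d)

def playout_time(t_i):
    """Playout time for the FIRST packet of a talk spurt; later packets
    in the spurt are then played out every 20 msec after it."""
    return t_i + d + K * v

for t_i, r_i in [(0, 95), (20, 130), (40, 110)]:   # (sent, received) msec
    on_packet(t_i, r_i)
print(playout_time(t_i=60))   # schedule the first packet of a new spurt
```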
How does receiver determine whether packet is in a talk spurt?
* If no loss, receiver looks at successive timestamps
* difference of successive stamps > 20 msec => talk spurt begins.
* With loss possible, receiver must look at both timestamps and sequence numbers.
* Difference of successive stamps > 20 msec and sequence numbers without gaps => talk spurt begins.
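A minimal sketch of this rule, assuming 20 msec timestamp spacing between consecutive chunks within a spurt:

```python
def starts_talk_spurt(prev_seq, prev_ts, seq, ts):
    """A new spurt begins when the timestamp gap exceeds 20 msec AND
    no packets were lost in between (sequence numbers have no gap)."""
    no_gap = (seq == prev_seq + 1)
    return no_gap and (ts - prev_ts > 20)

print(starts_talk_spurt(41, 900, 42, 920))    # False: normal 20 msec spacing
print(starts_talk_spurt(42, 920, 43, 1500))   # True: silence ended, new spurt
print(starts_talk_spurt(43, 1500, 45, 1560))  # False: seq gap, cannot be sure
```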
## VoIP: Recovery from Packet Loss
Challenge: Recover from packet loss given small tolerable delay between original transmission and playout
* each ACK/NAK takes ~ one RTT.
* alternative: Forward Error Correction (FEC) => sending enough bits to allow recovery without retransmission.
### Simple FEC
* For every group of n chunks, create a redundant chunk by XOR-ing n original chunks.
* send n+1 chunks, increasing bandwidth by factor 1/n
* can reconstruct original n chunks if at most one lost chunk from n+1 chunks
* Playout delay increases with n: if the first chunk is lost, the receiver must wait for all n+1 chunks to arrive before it can recover it (see the sketch below).
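A minimal sketch of this XOR scheme for n = 3 equal-sized chunks:

```python
def make_parity(chunks):
    """Redundant chunk: byte-wise XOR of the n original chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Rebuild the single lost chunk: XOR the parity chunk with the
    n-1 chunks that did arrive."""
    missing = bytearray(parity)
    for chunk in received:
        for i, b in enumerate(chunk):
            missing[i] ^= b
    return bytes(missing)

chunks = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]   # n = 3
parity = make_parity(chunks)
# chunk 0 is lost; recover it from chunks 1 and 2 plus the parity chunk
print(recover([chunks[1], chunks[2]], parity) == chunks[0])  # True
```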
### Another FEC Scheme
* "piggyback lower quality stream"
* Send lower resolution audio stream as redundant information
* e.g. nominal stream PCM at 64 kbps and redundant stream GSM at 13 kbps.
* non-consecutive loss: receiver can conceal loss
* generalization: can also append (n-1)st and (n-2)nd low-bit rate chunk
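A minimal sketch of the piggyback scheme, with toy strings standing in for PCM and GSM frames; helper names are illustrative:

```python
def build_packet(n, nominal, redundant):
    """Packet n carries chunk n at nominal quality plus chunk n-1 at low quality."""
    return {"seq": n, "hi": nominal[n], "lo": redundant[n - 1] if n > 0 else None}

def playout(packets, n):
    """Play chunk n: prefer its nominal copy; if that packet was lost
    (non-consecutive loss), fall back to the low-rate copy in packet n+1."""
    if n in packets:
        return packets[n]["hi"]
    if n + 1 in packets and packets[n + 1]["lo"] is not None:
        return packets[n + 1]["lo"]          # conceal the loss
    return None                              # consecutive loss: gap remains

nominal   = ["PCM0", "PCM1", "PCM2", "PCM3"]
redundant = ["GSM0", "GSM1", "GSM2", "GSM3"]
packets = {n: build_packet(n, nominal, redundant) for n in (0, 2, 3)}  # 1 lost
print(playout(packets, 1))   # 'GSM1': low-quality copy from packet 2
```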
### Interleaving to conceal loss
* Audio chunks are divided into smaller units, e.g. four 5 msec units per 20 msec audio chunk
* Packet contains small units from different chunks
* If a packet is lost, the receiver still has most of every original chunk.
* No redundancy overhead, but increases playout delay.
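A minimal sketch of interleaving with four 5 msec units per chunk, showing that one lost packet removes only one unit from each original chunk:

```python
def interleave(chunks, units_per_chunk=4):
    """Packet j collects unit j from every chunk, so losing one packet
    removes only one small unit from each original chunk."""
    return [[chunk[j] for chunk in chunks] for j in range(units_per_chunk)]

# Four 20 msec chunks, each split into four 5 msec units (toy labels).
chunks = [[f"c{i}u{j}" for j in range(4)] for i in range(4)]
packets = interleave(chunks)
packets[1] = None                      # one packet lost in the network
for i in range(4):                     # every chunk keeps 3 of its 4 units
    have = [p[i] for p in packets if p is not None]
    print(f"chunk {i}: recovered units {have}")
```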