This tutorial assumes knowledge of the following concepts/tools:
WebRTC (Web Real-Time Communication) is a collection of standards, protocols, and JavaScript APIs which enables peer-to-peer audio, video, and data sharing between browsers, without requiring any third-party plug-ins or libraries.
Delivering rich, high-quality, real-time content is a complex problem. Thankfully, the browser abstracts most of this complexity behind a few simple JavaScript APIs, making it possible to write a teleconferencing web application in a few dozen lines of JavaScript. This tutorial will show you, step by step, how to create a sample WebRTC P2P video chat system.
P2P means the peers will directly communicate with each other. Doing this is not as straightforward as it seems. There are a few problems that we need to solve here:
Do keep these problems in mind while reading the rest of the tutorial. We will solve them one by one.
Do not fret, though. The WebRTC client in your browser encapsulates most of the complex tasks into a few simple APIs. We only need to understand how they work and piece a few components together.
We will be building a P2P video/audio chat application using WebRTC. The app will allow two users to join and video chat with each other.
This tutorial will use Node.js to build the application, so install it if you haven't already. Then navigate to the project folder and execute the following two commands:
What we need is an out-of-band channel that allows the two peers to communicate before they have a direct channel of their own. After the session is established, the peers will then be able to communicate directly.
In theory, any channel (including shouting across the room) could work. This allows interoperability with a variety of other signaling protocols used in existing communication infrastructure (e.g., SIP, Jingle, ISUP). If you have a server that already tracks which users are logged in, then you can simply use that server as the signalling server. We can also use a dedicated server to relay messages between the peers. The server does not need to understand the content of the messages, but it must reside on the public internet where both peers can reach it.
In this tutorial, we'll implement a custom signalling server that uses WebSocket to relay information between the peers.
Create a new directory src containing a new file index.js. This file will contain the implementation of the server. Start by setting up a WebSocket server which listens on an HTTP port.
On each connection request, the server creates a WebSocket, assigns a random ID to it, and keeps it in an array. When this socket receives a message from a client, the server simply forwards it to all other connected clients, together with the ID of the sender. When the socket is closed, the server informs all remaining clients of the closure.
Note that in a more realistic example, the message from the client should include the other peer's ID and the server should only forward the message or send the closure notification to the intended peer.
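As a hypothetical sketch of that targeted variant, the client could include the intended recipient's ID in an envelope, and the server would forward the message only to that peer (the `to` field and `handleDirectedMessage` name are illustrative assumptions):

```javascript
// Forward a message only to the peer named in the envelope's `to` field.
// `clients` is a Map from client ID to a socket-like object with send().
function handleDirectedMessage(clients, senderId, envelope) {
  const target = clients.get(envelope.to);
  if (target) {
    target.send(JSON.stringify({ from: senderId, message: envelope.message }));
  }
}
```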
With our signalling server in place, our peers can use it to communicate with each other to exchange routing information and negotiate the session parameters. Note that the signalling server does not transfer the video, audio, or data of our application. It simply serves as a starting point for the peers to find a direct route and establish a session. Once these are done, the video, audio, and data are sent between the peers directly.
In essence, the signalling server solves the first problem we listed above.
The client side is slightly more complicated than the server side, since this is where most of the complexity of WebRTC lies. We'll build this part step by step and explain what we are doing at each stage.
Since we're building a video chat application, we first have to obtain streams from the user's webcam and microphone. Thankfully, JavaScript provides the Media Capture and Streams API for applications to capture, manipulate, and process video and audio streams from the underlying platform. All the audio and video processing, such as noise cancellation, equalization, and image enhancement, is handled automatically by the audio and video engines.
Navigate to the public/ directory, where the client scripts will be located. Let's first create an HTML document for the client page. It's just a simple one, containing a start button and two video streams. We'll focus on the self-view in this section. You can safely ignore remote-view for now.
In our index.js file, we will define a listener startChat() for the start button we created. For now, it displays the user's video stream (and also plays the user's audio).
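A minimal sketch of this listener might look like the following; the element ID `self-view` is an assumption based on the HTML described above.

```javascript
// Start-button listener: capture the local webcam/microphone stream
// and display it in the self-view <video> element.
async function startChat() {
  // This is the point at which the browser prompts for permission.
  const localStream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  document.getElementById('self-view').srcObject = localStream;
  return localStream;
}
```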
The browser might prompt you for permission to open your webcam. After you give permission, you should be able to see yourself on the left-hand side of the document.
Our problem 2 is also out of the way.
With the video stream, we can then discuss how to send it over WebRTC to our remote peer.
Assuming the signalling server is in place, the peers can now use it to establish a P2P session.
The two peers first need to agree on the session parameters. This includes the types of media to be exchanged (audio, video, and application data), the codecs used and their settings, bandwidth information, and other metadata. These data are collectively called the session description and should be encoded in the Session Description Protocol (SDP) format.
Each peer will have a local description. To establish a connection, the peers need to exchange their respective descriptions. The session initiator will send an 'offer' containing its local description and the callee must send an 'answer' containing its own local description.
This happens in a symmetric manner:
The result of the offer/answer process is that both peers are aware of each other's descriptions. This solves our third problem listed above.
To start, we establish a WebSocket connection with the signalling server and create an RTCPeerConnection object. As we'll see, almost all of the WebRTC functionality is encapsulated in the RTCPeerConnection API.
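A sketch of this setup might look like the following; the signalling server address is an assumption, so replace it with wherever your server is actually running.

```javascript
// Open the signalling channel to the relay server.
function createSignallingChannel() {
  return new WebSocket('ws://localhost:3000');
}

// Almost all of WebRTC's negotiation, ICE handling, and media
// transport hangs off this single object.
function createPeerConnection() {
  return new RTCPeerConnection();
}
```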
To help us send messages over to other peers, we define the following helper function:
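A minimal sketch of such a helper is shown below; the message shape ({ type, content }) and the names `makeSignal` and `sendSignal` are illustrative assumptions.

```javascript
// Serialize a signalling message of a given type.
function makeSignal(type, content) {
  return JSON.stringify({ type, content });
}

// Hand the serialized message to the WebSocket connection
// (`socket` is assumed to be the connection created earlier).
function sendSignal(socket, type, content) {
  socket.send(makeSignal(type, content));
}
```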
On the receiving side, we break down the received message to obtain its sender ID, message type, and content.
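As a sketch, this unwrapping could look like the following; the envelope shape ({ from, message: { type, content } }) is an illustrative assumption about what the relay server sends.

```javascript
// Unwrap a relayed signalling message into its parts.
function parseSignal(raw) {
  const { from, message } = JSON.parse(raw);
  return { senderId: from, type: message.type, content: message.content };
}
```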
When there is a need to start a session negotiation, the negotiationneeded event is fired on the RTCPeerConnection. We attach an onnegotiationneeded event handler that sends a connection offer to the signalling server, which relays it to the remote peer.
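A sketch of this handler is shown below; `sendSignal` stands in for whatever helper you use to relay messages through the signalling server.

```javascript
// Attach a handler that kicks off negotiation whenever the
// RTCPeerConnection decides one is needed.
function attachNegotiationHandler(peerConnection, sendSignal) {
  peerConnection.onnegotiationneeded = async () => {
    // Create an offer describing our media, record it as the local
    // description, then relay it to the remote peer.
    const offer = await peerConnection.createOffer();
    await peerConnection.setLocalDescription(offer);
    sendSignal('offer', peerConnection.localDescription);
  };
}
```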
When the remote peer receives the offer through the signaling server, it will set its remote description, and send an answer containing its own session description.
When the answer signal is received by the initiating peer, it will set its remote description to the received description.
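The offer/answer handling on both sides can be sketched as follows; `sendSignal` is again an illustrative relay helper, and the descriptions come from the remote peer via the signalling server.

```javascript
// Callee side: record the caller's description, then answer with our own.
async function handleOffer(peerConnection, description, sendSignal) {
  await peerConnection.setRemoteDescription(description);
  const answer = await peerConnection.createAnswer();
  await peerConnection.setLocalDescription(answer);
  sendSignal('answer', peerConnection.localDescription);
}

// Caller side: simply record the callee's description.
async function handleAnswer(peerConnection, description) {
  await peerConnection.setRemoteDescription(description);
}
```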
The more careful reader might have noticed that a potential race condition can occur here. If two clients send offers to each other at the same time, the RTCPeerConnection object might try to establish two connections. To avoid this, there is a technique known as perfect negotiation. For a detailed discussion of this issue, see the Mozilla developer guide article on the topic.
In order to establish a peer-to-peer connection, the peers must be able to route packets to each other. This sounds trivial, but is very hard to achieve in practice. There are a few distinct scenarios that can happen: none, one, or both of the peers can be located behind a NAT; they can be behind the same NAT or distinct NATs; there can be numerous layers of NATs between them; worse still, they can be located behind address- and port-dependent NATs. So how can we find a route between the two peers?
The ICE protocol was designed to solve the problem of finding a route between the peers. In short, each peer generates a list of transport candidates and sends it to the other peer. There are three types of candidates:
Relaying through a TURN server is certainly less than optimal, so it is only used as a last resort when all other candidates fail.
Upon receiving the candidates from the other peer, each peer should perform connectivity checks to determine which candidate is viable and use that to establish the direct connection.
Thankfully, the RTCPeerConnection object has a built-in ICE agent. All we need to do is send the local candidates to the remote peer through the signalling channel, and receive the candidates sent by the remote peer. Like the session descriptions, ICE candidates are also encoded in the SDP format.
This solves our problem 4.
For this demo, we will use a public STUN server. You can also include a TURN server if you have one available.
Change the definition of the peerConnection object to this.
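A sketch of the updated definition is shown below. `stun.l.google.com:19302` is a commonly used public STUN server; if you have a TURN server, add a `turn:` URL with its credentials to the same list.

```javascript
// RTCPeerConnection configured with a public STUN server so the ICE
// agent can discover our public (server-reflexive) address.
const configuration = {
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
};

function createConfiguredPeerConnection() {
  return new RTCPeerConnection(configuration);
}
```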
After the session description is set, the RTCPeerConnection object will start to gather ICE candidates. Each time it finds a candidate, it emits an icecandidate event. We listen for this event and send the candidate to the other peer through the signalling server.
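As a sketch, the sending side of this exchange could look like the following (`sendSignal` is an illustrative relay helper):

```javascript
// Forward each locally discovered ICE candidate to the remote peer.
function attachIceCandidateHandler(peerConnection, sendSignal) {
  peerConnection.onicecandidate = (event) => {
    // A null candidate marks the end of gathering; skip it.
    if (event.candidate) {
      sendSignal('ice-candidate', event.candidate);
    }
  };
}
```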
In response to receiving an ICE candidate, the remote client should add the received candidate to the RTCPeerConnection object, which will automatically start connectivity checks.
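A minimal sketch of the receiving side:

```javascript
// Hand a remote candidate to the RTCPeerConnection, which schedules
// connectivity checks internally.
async function handleRemoteCandidate(peerConnection, candidate) {
  try {
    await peerConnection.addIceCandidate(candidate);
  } catch (err) {
    // Candidates can arrive before the remote description is set;
    // a real application would queue or log them.
    console.error('Failed to add ICE candidate', err);
  }
}
```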
Now that we have exchanged session descriptions and ICE candidates, the RTCPeerConnection will establish a working P2P connection. But wait, we are not sending any media! Still, most of the battle is already won. All we need to do is pass our local stream to the RTCPeerConnection object and display the remote stream in our remote-view video tag.
Modify the startChat function to pass our stream to the RTCPeerConnection.
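A sketch of the modified function is shown below; `peerConnection` and the `self-view` element ID are assumptions carried over from the earlier steps.

```javascript
// Capture the local stream, display it, and hand each of its tracks
// to the RTCPeerConnection. Adding tracks triggers negotiation.
async function startChat(peerConnection) {
  const localStream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  document.getElementById('self-view').srcObject = localStream;
  for (const track of localStream.getTracks()) {
    peerConnection.addTrack(track, localStream);
  }
}
```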
Adding local tracks fires a negotiationneeded event, which invokes the onnegotiationneeded event handler, sending our media description to the remote peer.
Conversely, when we receive a remote description (either offer or answer) that contains tracks, our RTCPeerConnection object will receive a track event.
We listen to this event to display the stream sent by our remote peer.
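A sketch of this listener, assuming the `remote-view` element ID from the HTML described earlier:

```javascript
// Display the remote stream when its tracks arrive.
function attachTrackHandler(peerConnection) {
  peerConnection.ontrack = (event) => {
    const remoteView = document.getElementById('remote-view');
    // All tracks of one stream share the same MediaStream object,
    // so setting srcObject once is enough.
    if (remoteView.srcObject !== event.streams[0]) {
      remoteView.srcObject = event.streams[0];
    }
  };
}
```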
When the media is sent over WebRTC, it will be automatically optimized, encoded, and decoded by the WebRTC audio and video engines based on the session description agreed between the peers.
The negotiationneeded event is most commonly triggered when a sending MediaStreamTrack is first added to the RTCPeerConnection and signalingState is stable. The actual steps for checking whether negotiation is needed are quite complex.
The track event is triggered after setting the remote description, if that description contains tracks. The track event can also fire during setLocalDescription, as long as the description is an answer.
ICE candidate discovery is started once a description (local or remote) is set on the peer connection. It is fully automated by the internal ICE agent of the peer connection and runs asynchronously to the rest of the code. We only need to listen for the icecandidate event, which is fired when a new ICE candidate is found.
The state of the SDP negotiation is represented by the signaling state. The standard defines five states, but have-local-pranswer and have-remote-pranswer are usually only employed by legacy hardware. In this demo, we only use three of them.
The W3C spec recommends that when the state changes to disconnected, an ICE restart should be attempted.
When the network gets disrupted, the ICE connection will enter the disconnected state, and the ICE agent will attempt to self-recover. If it fails to recover, the ICE connection will enter the failed state, in which an ICE restart has to be initiated (see Network Interruption below).
An ICE restart is very similar to a fresh offer/answer exchange.
To end an RTCPeerConnection, we can call RTCPeerConnection.close(). It is recommended to set all event handlers of the RTCPeerConnection to null before doing so, so that these handlers will not trigger erroneously during the disconnection process. As an example, the following function will close the peer connection and reset the video elements.
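Here is a minimal sketch of such a function; the element IDs `self-view` and `remote-view` are assumptions matching the HTML described earlier.

```javascript
// Detach handlers first, then close the connection and release media.
function closeConnection(peerConnection) {
  peerConnection.ontrack = null;
  peerConnection.onicecandidate = null;
  peerConnection.onnegotiationneeded = null;
  peerConnection.close();
  for (const id of ['self-view', 'remote-view']) {
    const video = document.getElementById(id);
    if (video.srcObject) {
      // Stop the camera/microphone so the capture indicator turns off.
      video.srcObject.getTracks().forEach((track) => track.stop());
      video.srcObject = null;
    }
  }
}
```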
Of course, it is important to inform the other peer to hang up as well. This can be done through the signalling server. When a peer hangs up the call, a "hang-up" message is sent to the signaling server, which is forwarded to the other peer. When this message is received, the peer connection is closed.
For the initiating side (assume there is a button with ID stop for hanging up):
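A sketch of this listener is shown below; `sendSignal` and `closeConnection` stand in for the signalling helper and teardown routine used elsewhere in the app.

```javascript
// Wire the stop button: notify the remote peer, then tear down locally.
function attachStopButton(peerConnection, sendSignal, closeConnection) {
  document.getElementById('stop').addEventListener('click', () => {
    sendSignal('hang-up', null); // tell the other peer we are leaving
    closeConnection(peerConnection);
  });
}
```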
For the receiving side,
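a minimal sketch might look like this (the message shape and the `handleSignal` name are illustrative assumptions):

```javascript
// Tear down the connection when the remote peer hangs up.
function handleSignal(peerConnection, signal, closeConnection) {
  if (signal.type === 'hang-up') {
    closeConnection(peerConnection);
  }
}
```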
There are some other events you might want to listen to in order to understand the inner workings of the WebRTC process. For example, if we want to log each new state, we can use the following code:
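A sketch of such logging, attached to the state-change events exposed by RTCPeerConnection:

```javascript
// Log every state transition the peer connection goes through.
function attachStateLogging(peerConnection) {
  peerConnection.onsignalingstatechange = () =>
    console.log('signaling state:', peerConnection.signalingState);
  peerConnection.oniceconnectionstatechange = () =>
    console.log('ICE connection state:', peerConnection.iceConnectionState);
  peerConnection.onicegatheringstatechange = () =>
    console.log('ICE gathering state:', peerConnection.iceGatheringState);
  peerConnection.onconnectionstatechange = () =>
    console.log('connection state:', peerConnection.connectionState);
}
```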
WebRTC is also resilient to network interruptions and changes. RTCPeerConnection can self-recover from a temporary network interruption without renegotiation. If a previously exchanged ICE candidate is still valid, e.g., after a temporary disconnection, the client only needs to re-establish its connection with the signaling server.
However, if a peer's network interface changes, e.g., switching from Wi-Fi to 4G, RTCPeerConnection can no longer self-recover, since the candidates exchanged previously are no longer valid. If this happens, we can initiate an "ICE restart" to renegotiate the session. The following code snippet illustrates how to do so (only works in Chrome):
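A sketch of such a restart is shown below. Passing { iceRestart: true } to createOffer forces fresh ICE credentials and candidate gathering; `sendSignal` is again an illustrative relay helper, and newer browsers also expose peerConnection.restartIce() for the same purpose.

```javascript
// Renegotiate the session with fresh ICE candidates.
async function restartConnection(peerConnection, sendSignal) {
  const offer = await peerConnection.createOffer({ iceRestart: true });
  await peerConnection.setLocalDescription(offer);
  sendSignal('offer', peerConnection.localDescription);
}
```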
The RTP Control Protocol (RTCP) tracks the number of sent and lost bytes and packets, the last received sequence number, the inter-arrival jitter for each RTP packet, and other RTP statistics. Periodically, both peers exchange this data and use it to adjust the sending rate, encoding quality, and other parameters of each stream.
WebRTC requires that all traffic be encrypted, so the secure versions of RTP and RTCP (SRTP and SRTCP) are actually used. The packets are encrypted by the browser before sending, which means it is impossible to capture them with Wireshark and read their contents.
However, Firefox supports logging RTCP packets before they are encrypted; check out this Mozilla blog post.
Google Chrome provides a handy tool to inspect the internal workings of WebRTC: go to chrome://webrtc-internals in the address bar to view the current WebRTC events and statistics.
https://webrtchacks.com/sdp-anatomy/
Breaks down every single line of an SDP to help you understand what is going on under the hood.
https://sdp.garyliu.dev
Visualize long SDPs. Quickly find the part you're interested in and collapse the rest.