Originally developed to support browser-based video calling, the WebRTC standard has since become a core building block for a wide range of real-time communication and streaming applications. It enables real-time communication on browsers and mobile applications via a set of standardized protocols and APIs. The protocols provide the rules by which WebRTC peers establish bi-directional, secure, real-time communication, while the WebRTC API gives developers a simple way to use those protocols.
The appeal of WebRTC lies in its ability to deliver sub-second latency, secure peer-to-peer communication, and broad compatibility across modern browsers and devices. It simplifies the delivery of real-time experiences, from video conferencing, live customer support, and interactive broadcasts to online gaming, auctions, and surveillance systems.
This article explores what makes up WebRTC, including the protocols and APIs that enable the transmission of audio, video, and data in real-time with minimal latency.
WebRTC: An overview
Transmitting audio and video over the internet in real time is hard: it demands immediate delivery with minimal buffering. These requirements rule out TCP as the transport protocol, because its reliability mechanisms cause head-of-line blocking (a lost packet holds up delivery of every subsequent packet until it is retransmitted), introducing unacceptable delays. This necessitates the use of UDP, which prioritizes speedy delivery but guarantees neither packet arrival nor ordering. As a result, codecs must be robust enough to tolerate packet loss while keeping distracting artifacts to a minimum.
To minimize latency, WebRTC employs a peer-to-peer approach, utilizing the UDP protocol as its underlying transport. Two WebRTC endpoints, frequently browsers, communicate directly, avoiding the delays introduced by routing data through a central server. However, establishing this direct connection still requires a signaling server to coordinate the initial handshake and exchange of metadata so that each peer can locate and negotiate with the other.
Complicating matters further, most users are behind Network Address Translation (NAT) routers. NAT is a technique that allows multiple devices to share a single public IP address. All the users in an office location likely share the same public IP, but you definitely don’t want them all to receive the same real-time communication streams. NAT obscures the local IP addresses of individual devices and is one of the reasons the Internet is so hugely scalable. But it also makes it difficult for two peers behind different NATs to establish a direct connection, as neither knows how to reach the other.
Recognizing the complexity of these challenges, the designers of WebRTC didn’t reinvent the wheel. Instead, they integrated a range of well-established, single-purpose technologies such as ICE, STUN, TURN, DTLS, and RTP into a cohesive framework that enables real-time communication across diverse network conditions.

In the browser, a JavaScript API sits on top of these and abstracts away most of the complexity.
We discuss the need for and applications of each technology below:
WebRTC signaling
To establish a WebRTC connection, two or more peers must agree on parameters such as which video and audio codecs will be used. The WebRTC standard itself does not mandate how this signaling process occurs, although the WHIP and WHEP standards (see below) have more recently emerged to provide consistent, interoperable methods for servers and clients to perform the signaling step. But if you want to write the settings on a paper airplane and exchange them that way, the WebRTC standard won't stop you!
That said, if you are setting up a WebRTC session from a browser, then Session Description Protocol (SDP) will be used in the negotiation. The WebRTC APIs in browsers (RTCPeerConnection, createOffer, createAnswer) produce and consume SDP strings, as defined in RFC 4566 and extended by the JSEP specification (RFC 8829). JSEP explicitly specifies that SDP is the wire format for describing the media, codecs, encryption keys, and transport parameters during the offer/answer exchange.

The diagram above depicts a selection of signaling possibilities, for example, SIP over WebSocket. In practice, HTTP(S) is the most common transport for this signaling exchange.
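To make the SDP side of the exchange concrete, here is a minimal sketch, in plain JavaScript with no WebRTC APIs, that pulls the codec table out of an SDP blob by reading its a=rtpmap lines. The SDP fragment is an invented example of the kind of string a browser's createAnswer() produces.

```javascript
// Sketch: extract the codec table from an SDP blob by reading its
// a=rtpmap lines (payload type -> codec name / clock rate).
function parseRtpmaps(sdp) {
  const codecs = {};
  for (const line of sdp.split(/\r?\n/)) {
    const m = line.match(/^a=rtpmap:(\d+) ([^/]+)\/(\d+)/);
    if (m) {
      codecs[m[1]] = { name: m[2], clockRate: Number(m[3]) };
    }
  }
  return codecs;
}

// An invented fragment of the kind of SDP createAnswer() might produce.
const sdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

const codecs = parseRtpmaps(sdp);
console.log(codecs["111"].name); // "opus"
```

In a real session, both peers run this kind of parsing (inside the browser's WebRTC stack) over the offer and answer to settle on codecs both sides support.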
Connecting with ICE and NAT traversal with STUN/TURN
Once WebRTC peers have exchanged SDPs through a signalling process, they have enough information to attempt to connect, but we are not out of the woods yet!
The problem stems from how the Internet handles IP addresses. With IPv4’s limited address space of around 4 billion addresses and the increasing number of internet-connected devices, there aren’t enough unique public IP addresses to go around. Network Address Translation (NAT) solves this scarcity by allowing multiple devices on a private network to share a single public IP address. While this conserves IP addresses and provides security benefits by hiding the internal network structure, it creates a connectivity challenge for peer-to-peer applications.
When devices behind NATs communicate with the internet, NAT devices translate their private IP addresses (e.g., 192.168.1.100) into a shared public address. This translation prevents devices behind different NATs from reaching each other directly using the private addresses exchanged in their SDPs, since those addresses are only valid within the private networks where they originated. To solve this, WebRTC relies on STUN and TURN servers, which help discover and relay connection paths. The ICE protocol coordinates these technologies to establish peer-to-peer connectivity.
STUN
STUN (Session Traversal Utilities for NAT) is a protocol that allows a client behind a NAT to learn its public-facing IP address and port by sending a request to a STUN server, usually located on the public internet. The server replies with the address and port it sees, revealing how the client appears externally.
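As a sketch of what that reply actually carries: the XOR-MAPPED-ADDRESS attribute (RFC 5389) stores the mapped port and IPv4 address XORed with STUN's magic cookie, 0x2112A442. The address below is an invented example from a documentation range.

```javascript
// Sketch: XOR-(de)coding of a STUN XOR-MAPPED-ADDRESS value (RFC 5389).
// The mapped port is XORed with the top 16 bits of the magic cookie
// (0x2112) and the IPv4 octets with the full cookie (0x21 0x12 0xA4 0x42).
const MAGIC_COOKIE = [0x21, 0x12, 0xa4, 0x42];

// XOR is its own inverse, so the same function encodes and decodes.
function xorMappedAddress({ port, ip }) {
  return {
    port: port ^ 0x2112,
    ip: ip.map((octet, i) => octet ^ MAGIC_COOKIE[i]),
  };
}

// Pretend the STUN server saw us as 203.0.113.7:54321 (a documentation IP).
const seen = { port: 54321, ip: [203, 0, 113, 7] };
const onWire = xorMappedAddress(seen);    // what the attribute carries
const decoded = xorMappedAddress(onWire); // what the client recovers
console.log(`${decoded.ip.join(".")}:${decoded.port}`); // "203.0.113.7:54321"
```

The XOR step exists because some NATs rewrite any literal copy of their own public address they spot inside packet payloads; the obfuscation keeps the mapped address intact in transit.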
NAT traversal refers to the collection of techniques used to establish communication between devices when one or both are behind a NAT. In WebRTC and similar systems, this typically involves STUN to discover public-facing addresses and ICE (Interactive Connectivity Establishment — see below) to methodically test possible connection paths.
Many firewalls, particularly those built into consumer NAT routers, are stateful: they track outbound traffic and automatically allow inbound packets that match an established session. In practice, once a client sends a packet to a peer, the firewall treats any matching return traffic as part of that session, much like allowing the response to an outbound TCP connection without pre-opening the port. This behavior is a key reason STUN and ICE often succeed in establishing direct peer-to-peer connections without explicit firewall rules.
However, some gateways — particularly those using symmetric NAT — employ address- and port-dependent mapping, where the external port assigned to the client varies for each distinct destination address and port. In these cases, the mapping learned from a STUN server is only valid for communicating with that STUN server; attempts to use it with other peers will fail, preventing a direct connection.
If a direct route cannot be established, as often happens with symmetric NAT or restrictive network policies, the connection falls back to a relay, which forwards traffic through a publicly reachable server at the cost of increased latency and bandwidth usage. This is the role of TURN (Traversal Using Relays around NAT), described next.
TURN
When direct peer-to-peer communication is not possible — for example, due to symmetric NAT (address- and port-dependent mapping) or restrictive firewalls — TURN (Traversal Using Relays around NAT) servers provide a potential solution.
A TURN server operates in the public network, receiving and relaying IP packets between communication endpoints. The keyword in TURN is “relays.” The protocol relies on the presence and availability of a public relay to transmit the data between the peers.

The trade-off with TURN is that it is no longer a peer-to-peer connection. TURN is the most reliable method for providing connectivity between any two peers across networks; however, it incurs a high operating cost, since the relay must have sufficient capacity to service all the data flows. As a result, TURN is best deployed as a last-resort fallback for cases where direct connectivity fails.
ICE
Interactive Connectivity Establishment (ICE) coordinates the use of STUN and TURN to facilitate NAT traversal for all communication scenarios. ICE gathers candidates — IP address and port pairs derived from local network interfaces, STUN servers, and TURN servers. These candidates are then prioritized, with direct connections favored over relayed ones, especially when latency and packet loss are critical concerns.
After gathering candidates, ICE performs connectivity checks to determine which candidate pairs can establish successful communication. The protocol then nominates one valid candidate pair for use, and the system establishes the media connection using the most efficient available path.
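The prioritization step is a simple formula, defined in RFC 8445 (section 5.1.2): each candidate's priority combines a type preference (host candidates score highest, relayed lowest), a local preference, and the component ID. A sketch using the RFC's recommended type-preference values:

```javascript
// Sketch: ICE candidate prioritization per RFC 8445, section 5.1.2.
// Recommended type preferences: host 126, peer-reflexive 110,
// server-reflexive 100, relayed 0.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPref = 65535, componentId = 1) {
  return (
    TYPE_PREFERENCE[type] * 2 ** 24 + // dominant term: candidate type
    localPref * 2 ** 8 +              // tie-break between interfaces
    (256 - componentId)               // RTP before RTCP, etc.
  );
}

console.log(candidatePriority("host"));  // 2130706431 -- tried first
console.log(candidatePriority("relay")); // 16777215   -- last resort
```

Because the type preference occupies the most significant bits, any direct (host) path always outranks any relayed one, which is exactly the ordering described above.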
Transmitting media and application data with WebRTC
Assuming session establishment succeeds (without requiring TURN), the two peers can then exchange UDP packets directly, but additional protocols are still required. While raw UDP delivers packets quickly, it lacks the security and reliability features that WebRTC demands. To address these limitations, WebRTC layers several protocols on top of UDP: DTLS for security, SRTP for secure media transport, and SCTP for data channels.
Security and encryption with DTLS
WebRTC requires that all communication be encrypted to enhance the security of transmitted data. Datagram Transport Layer Security (DTLS) provides an encryption layer by adapting the widely deployed TLS protocol to work over UDP connections.
DTLS performs a secure handshake between peers, similar to how HTTPS establishes secure connections on the web. Unlike HTTPS, WebRTC doesn't use a central certificate authority. Instead, each peer verifies that the certificate exchanged during the DTLS handshake matches the fingerprint shared via signaling.
Delivering media with RTP and RTCP
The Real-Time Transport Protocol (RTP) and the Real-Time Transport Control Protocol (RTCP) are used to transmit audio and video streams in WebRTC.
RTP transports the actual media packets while embedding critical timing and sequencing data into each one. Each RTP packet contains a timestamp that instructs the receiving peer when to play the content, along with a sequence number that enables detection of lost or reordered packets. The timing information is essential for synchronizing audio and video streams and ensuring smooth playback even when network conditions cause delays or packet loss.
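The fields described above live in RTP's fixed 12-byte header (RFC 3550). A minimal decoding sketch, run against a hand-built example packet:

```javascript
// Sketch: decode the fixed 12-byte RTP header (RFC 3550).
function parseRtpHeader(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: bytes[0] >> 6,            // always 2 for RTP
    payloadType: bytes[1] & 0x7f,      // bound to a codec via a=rtpmap in SDP
    marker: (bytes[1] & 0x80) !== 0,   // e.g. last packet of a video frame
    sequenceNumber: view.getUint16(2), // detects loss and reordering
    timestamp: view.getUint32(4),      // drives playout timing and A/V sync
    ssrc: view.getUint32(8),           // identifies the media source
  };
}

// A hand-built example header: version 2, marker set, payload type 96,
// sequence number 4660, timestamp 85907, SSRC 0xDEADBEEF.
const header = new Uint8Array([
  0x80, 0xe0, 0x12, 0x34,
  0x00, 0x01, 0x4f, 0x93,
  0xde, 0xad, 0xbe, 0xef,
]);
console.log(parseRtpHeader(header).sequenceNumber); // 4660
```

A receiver watches sequenceNumber gaps to detect loss and uses timestamp (in the codec's clock rate, e.g. 90 kHz for video) to schedule playout.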
RTCP (Real-time Transport Control Protocol) transports feedback about the quality of media transmission — including packet loss, jitter, and network delays — between peers. Applications can use the feedback to adapt streaming behavior, such as lowering video resolution when high packet loss is detected, or improving quality when network conditions stabilize.
WebRTC secures media streams by encrypting RTP and RTCP packets using keys established during the DTLS handshake. The system uses DTLS-SRTP (RFC 5764) to derive the SRTP encryption keys from that handshake. The result is SRTP (Secure RTP) and SRTCP (Secure RTCP), which retain the underlying timing and control mechanisms while encrypting all media content.
While RTP and RTCP handle the transport and timing of media packets, the choice of video encoding techniques can significantly impact the real-time performance and latency characteristics of a WebRTC stream.
WebRTC and B-frames: a challenging pair
While B-frames (bi-directional predictive frames) offer improved compression efficiency by referencing both past and future frames, they introduce a fundamental challenge for WebRTC’s real-time streaming architecture.
WebRTC requires frames to be transmitted and decoded in presentation order with minimal delay to maintain ultra-low latency. B-frames, however, rely on future frames for decoding, which forces the encoder to buffer and reorder frames before transmission. This buffering adds delay and breaks WebRTC’s real-time playout assumptions. When B-frames are included, the output often exhibits increased latency, jittery or bursty playback where several frames arrive at once after a delay, and occasional audio-video sync issues. These artifacts undermine the seamless interactive experience WebRTC aims to deliver.
For these reasons, B-frames are generally disabled in WebRTC video streams to prioritize consistent low-latency delivery over compression gains.
Delivering application data with SCTP
While RTP handles media streams, WebRTC's DataChannel API uses the Stream Control Transmission Protocol (SCTP) to transport arbitrary application data between peers. Unlike media streams, which prioritize speed and can tolerate some packet loss, application data often has different requirements depending on the use case.
SCTP provides flexible delivery options to match these varying needs. For applications that require reliable delivery, such as file transfers or chat messages, SCTP can provide guaranteed, ordered delivery similar to TCP. However, for applications such as gaming or collaborative editing, SCTP also supports partial reliability modes, in which undelivered data can be abandoned after a time limit or a bounded number of retransmissions, preventing outdated information from blocking the pipeline.
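In the browser, these delivery modes surface as options to createDataChannel (the RTCDataChannelInit dictionary: ordered, maxRetransmits, maxPacketLifeTime). A sketch using a hypothetical helper, channelOptionsFor, to map use cases onto those options; the use-case names and specific values are illustrative choices, not part of any API:

```javascript
// Sketch: choosing RTCDataChannelInit options per use case.
// channelOptionsFor is a hypothetical helper, not part of the WebRTC API.
function channelOptionsFor(useCase) {
  switch (useCase) {
    case "chat": // TCP-like: every message arrives, in order
      return { ordered: true };
    case "game-state": // lossy: stale updates are dropped, never retried
      return { ordered: false, maxRetransmits: 0 };
    case "telemetry": // time-bounded: retry, but give up after 500 ms
      return { ordered: false, maxPacketLifeTime: 500 };
    default:
      throw new Error(`unknown use case: ${useCase}`);
  }
}

// In a browser this would feed straight into the API, e.g.:
//   pc.createDataChannel("state", channelOptionsFor("game-state"));
console.log(channelOptionsFor("game-state")); // { ordered: false, maxRetransmits: 0 }
```

Note that maxRetransmits and maxPacketLifeTime are mutually exclusive in the API; each channel picks at most one way of bounding reliability.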
The JavaScript API
The WebRTC JavaScript API offers developers a straightforward interface for building real-time communication applications.
The API centers around three main objects:
- An RTCPeerConnection instance allows an application to establish peer-to-peer communications with another RTCPeerConnection instance in another browser or with another endpoint that implements WebRTC.
- MediaStream represents a stream of media content, typically containing audio and/or video tracks obtained from sources such as the user’s camera and microphone. The streams are then added to the RTCPeerConnection to be transmitted to remote peers.
- RTCDataChannel provides an interface for sending arbitrary (text, binary, or structured) application data.
A typical WebRTC application follows a straightforward pattern: create a peer connection, add local media streams, exchange SDP offer and answer through a signaling server, and let the WebRTC agent handle all the rest (NAT traversal, ICE candidate gathering, DTLS handshakes, codec negotiation, etc.).
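Because WebRTC leaves the signaling wire format open, each application defines its own way of relaying offers, answers, and ICE candidates. As a sketch, here is one hypothetical JSON envelope for such a signaling channel; the message kinds and field names are invented for illustration:

```javascript
// Sketch: a hypothetical JSON envelope for the signaling channel.
// WebRTC does not standardize this; every application picks its own format.
function envelope(kind, payload) {
  if (!["offer", "answer", "candidate"].includes(kind)) {
    throw new Error(`unexpected message kind: ${kind}`);
  }
  return JSON.stringify({ kind, payload });
}

function open(raw) {
  const { kind, payload } = JSON.parse(raw);
  return { kind, payload };
}

// In a browser, the sender would serialize pc.localDescription into the
// payload, and the receiver would feed it to pc.setRemoteDescription().
const wire = envelope("offer", { type: "offer", sdp: "v=0\r\n..." });
console.log(open(wire).kind); // "offer"
```

Whatever carries these strings, such as a WebSocket, HTTP polling, or that paper airplane, is invisible to the WebRTC agent itself.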
WHIP and WHEP
While WebRTC has proved very successful in applications that require bidirectional peer-to-peer communication, such as video conferencing, its adoption in live broadcasting was hindered by the lack of a standard way to ingest media into a server or deliver it for playback, leaving live streamers to rely on protocols such as RTMP and SRT for ingest and HLS and DASH for playback. The gap arose because WebRTC doesn't specify a signaling protocol, leaving developers to implement custom solutions that often weren't interoperable between vendors.
WHIP (WebRTC-HTTP Ingestion Protocol) addresses ingest by standardizing the method by which encoders and broadcasting applications transmit WebRTC streams to media servers. Instead of custom signaling implementations, WHIP uses simple HTTP POST requests. Encoders send SDP offers via HTTP to WHIP-enabled endpoints and receive SDP answers, establishing WebRTC connections through familiar web infrastructure.
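The shape of that HTTP exchange is simple enough to sketch: an SDP offer POSTed with Content-Type application/sdp, optionally with a bearer token. The endpoint URL and token below are placeholders, not a real service, and the helper is written to return the request description rather than send it, so the wire format is easy to inspect:

```javascript
// Sketch: the HTTP request a WHIP client sends. A real client would pass
// req.init to fetch(req.url) and apply the SDP answer from the response
// body via pc.setRemoteDescription().
function whipRequest(endpoint, sdpOffer, bearerToken) {
  return {
    url: endpoint,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/sdp",
        // Bearer-token auth is optional in WHIP deployments.
        ...(bearerToken ? { Authorization: `Bearer ${bearerToken}` } : {}),
      },
      body: sdpOffer, // the client's SDP offer, verbatim
    },
  };
}

// Placeholder endpoint and token for illustration only.
const req = whipRequest("https://example.com/whip", "v=0\r\n...", "secret");
console.log(req.init.headers["Content-Type"]); // "application/sdp"
```

The response carries the server's SDP answer, and a Location header identifies the session resource so the encoder can later tear it down with an HTTP DELETE.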
WHEP (WebRTC-HTTP Egress Protocol) streamlines delivery by standardizing how WebRTC-based viewers consume content from streaming services. Like WHIP, WHEP utilizes HTTP POST requests for SDP offer/answer exchange, enabling interoperability between WebRTC services and reusable player software.
Norsk supports both WHIP and WHEP, enabling you to ingest media from your sources and deliver it for playback with sub-second latency using WebRTC.
WebRTC simulcast
If you are delivering live video to a broad audience, viewers are bound to have different bandwidth and device capabilities. WebRTC simulcast addresses this by sending multiple renditions of the same video. Instead of a single high-quality stream, the WebRTC client encodes and transmits several streams simultaneously, each at a different resolution and bitrate (e.g., 1080p, 720p, and 360p).
This approach enables the media server to intelligently forward the most suitable stream to each viewer, based on their individual network conditions. A viewer on a high-speed connection receives the high-resolution stream, while a viewer on a slower network automatically gets a lower-quality stream, preventing buffering and ensuring a smooth viewing experience for everyone.
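In the browser API, a simulcast ladder like the one above is expressed as the sendEncodings array passed to addTransceiver. The rid labels, bitrates, and downscale factors below are illustrative choices matching the 1080p/720p/360p example, not values mandated by WebRTC:

```javascript
// Sketch: a three-layer simulcast ladder as sendEncodings for
// RTCPeerConnection.addTransceiver(). Values are illustrative.
const simulcastEncodings = [
  { rid: "f", maxBitrate: 2_500_000 },                             // 1080p
  { rid: "h", maxBitrate: 1_200_000, scaleResolutionDownBy: 1.5 }, // 720p
  { rid: "q", maxBitrate: 400_000, scaleResolutionDownBy: 3 },     // 360p
];

// In a browser:
//   pc.addTransceiver(videoTrack, {
//     direction: "sendonly",
//     sendEncodings: simulcastEncodings,
//   });
console.log(simulcastEncodings.map((e) => e.rid)); // [ 'f', 'h', 'q' ]
```

The rid values travel with each encoded layer, which is how the media server tells the renditions apart when deciding which one to forward to each viewer.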
Using Norsk, you can leverage WebRTC simulcasting to deliver a high-quality, low-latency viewing experience to a broad audience, streaming on various devices.
Deploying WebRTC using Norsk Studio
Norsk enables you to ingest or deliver live media using WebRTC. With our drag-and-drop interface, Norsk Studio, all you need to do is drag a few boxes onto a canvas, and you’re set. Additionally, you can ingest media using some other protocol, say SRT, and have that delivered to your end users using WebRTC, all from the same interface:

WebRTC is a significant addition to the streaming media ecosystem: it delivers ultra-low latency, is supported by all the major browsers, and offers broad codec support (H.264, AV1, VP9, and VP8 for video; Opus and G.711 for audio). Implementing a low-latency, peer-to-peer transport is a non-trivial engineering challenge, involving NAT traversal, connectivity checks, signaling, security, congestion control, and numerous other details. WebRTC handles all of this for us.
Norsk supports WebRTC both as an ingest and egress protocol, enabling seamless integration of real-time streams into your media workflows, whether you’re receiving live content or delivering ultra-low-latency streams directly to web clients.