At id3as, we’ve lived and breathed live streaming for more than a dozen years. In that time, we’ve delivered more than 1 million live events to a plethora of devices over a wide range of protocols: HLS, DASH, WebRTC, TS (UDP, multicast and TCP), RTMP, Smooth Streaming, custom WebSockets … You name it, we’ve sent streams to it.
And in that time we have been forced to lie, over and over again.
Standards—I Think I’ve Heard of Them
The root cause of all of this lying has been consumer devices and their often scant regard for the details of the standards that are their lifeblood.
Let’s start with a spec we can all agree on—possibly the most influential and widely adopted spec the planet has ever seen. Your ability to read this article is almost certainly only possible because of this spec. I’m talking about good old HTTP 1.0 and above. Surely we can all agree on that?
One of the things that all flavors of the HTTP spec agree on is that HTTP is a case-insensitive spec. Host names, header field names, media types—they are all defined to be case-insensitive. So much so that most HTTP servers use standardized case when serving headers so that the code processing them can assume a particular form without having to add toUpper and toLower all over the place.
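To illustrate what "case-insensitive" means in practice, here is a minimal sketch (my own, not from any particular server) of the normalization that robust HTTP implementations do internally, so that `Content-Type`, `content-type`, and `CONTENT-TYPE` are all the same field:

```python
class Headers:
    """A minimal case-insensitive header map (illustrative sketch only).

    Lookups and stores are keyed on the lowercased name, so callers
    never need to care which case the peer used on the wire.
    """

    def __init__(self):
        self._items = {}  # lowercase field name -> value

    def set(self, name: str, value: str) -> None:
        self._items[name.lower()] = value

    def get(self, name: str, default=None):
        return self._items.get(name.lower(), default)


def canonical(name: str) -> str:
    """Render a field name in the conventional Title-Case wire form,
    e.g. 'content-type' -> 'Content-Type'."""
    return "-".join(part.capitalize() for part in name.split("-"))
```

A device that only accepts one spelling is effectively demanding that you skip this normalization and hard-code its preferred case on output.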
Except I can name several TVs that, if you deliver live video to them as a transport stream via a progressive HTTP download and you are foolish enough to disagree with their preferred case when filling in HTTP headers, HAVE TO BE SWITCHED ON AND OFF AT THE WALL to recover.
And of course, one manufacturer only allows CamelCase and another only lowercase. So unless you carefully control what each TV gets, things are going to end badly for at least one of them!
To say nothing of other mainstays of the HTTP (1.1 and above) specs, such as chunked transfer encoding, which, for (pick your set-top box/TV manufacturer), is either mandatory or not supported at all …
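For context, this is roughly what a chunked TS response looks like on the wire: each chunk is a hex byte count, a CRLF, the bytes themselves, another CRLF, and a final zero-length chunk to end the stream. The chunk sizes below are my own illustrative choice (32 TS packets of 188 bytes = 6,016 bytes = 0x1780):

```
HTTP/1.1 200 OK
Content-Type: video/mp2t
Transfer-Encoding: chunked

1780
<6,016 bytes: 32 x 188-byte TS packets>
1780
<the next 6,016 bytes of TS packets>
0

```

A device that mandates chunked encoding rejects the same bytes sent with a plain unbounded body, and one that doesn't support it chokes on the hex length lines, so the same stream cannot satisfy both.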
Thankfully, examples of spec-avoidance quite this bad are getting rarer.
Lies, Damned Lies, and Video Players
But not nearly rare enough.
Let’s say you have an inbound stream delivered in a transport stream over satellite and you simply forward each of the video and audio packets to clients in a progressive download of a TS stream.
In a world full of rainbows and unicorns, it is well known that it never rains (the rainbows are entirely magical and not related to the refraction of light through water droplets). And all of your packets are precious and none of them ever get lost. In the real world, however, it does occasionally rain, and packets sometimes do get lost. Which means that sometimes, you lose an audio frame or a video frame—or maybe even several.
And how do almost all video players deal with this? “Badly” is the short answer. Typically, they do honor video timestamps. To be honest, they don’t really have a choice in the matter, as video motion is an illusion formed by viewing lots of still pictures one after the other. If you lose a frame of video, most video players just display the last frame they did get for a bit longer and pick up again once the next good frame arrives (plus or minus decoding artifacts you get, especially if that frame was the start of a group of pictures). But by and large, a few seconds later, you probably have a good video displayed that is in sync. Hurrah!
But if you lose a frame of audio, the story isn’t quite so good. Despite every modern media format having detailed timestamps associated with each media packet, almost all video players seem to completely ignore these timestamps once they are actually playing and just output audio as it arrives.
So if on a dark and stormy night you get quite a few lost audio packets, then over the course of time your audio slips further and further out of sync with the video. Restart the stream and it’s all good—at start-up, timestamps are obeyed—it’s just that (typically) it goes out the window once the stream is up and running.
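To make the drift arithmetic concrete, here is a hypothetical sketch (the function name and frame duration are my own assumptions) of what happens when a player ignores timestamps and simply plays audio frames back-to-back as they arrive:

```python
def naive_playout_drift_ms(pts_ms, frame_ms):
    """Simulate a player that ignores audio timestamps after start-up.

    Its playhead advances by exactly one frame duration per frame
    received, regardless of gaps in the PTS sequence. Every lost frame
    therefore pulls the audio one frame duration ahead of where the
    timestamps say it should be.

    Returns the final drift in milliseconds (positive = audio is that
    many milliseconds early relative to the video).
    """
    playhead = pts_ms[0]                  # naive player starts in sync
    for _ in pts_ms:
        playhead += frame_ms              # one frame out per frame in
    expected = pts_ms[-1] + frame_ms      # where honoring PTS would end
    return expected - playhead
```

For example, four 20 ms frames with the one at 60 ms lost in transit:

```python
naive_playout_drift_ms([0, 20, 40, 80], 20)  # -> 20: audio is now 20 ms early
naive_playout_drift_ms([0, 20, 40, 60], 20)  # -> 0: nothing lost, no drift
```

Each lost frame adds another frame duration of drift, which is why a lossy link left running for hours ends up seconds out of sync.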
A gold star goes out to WebRTC, where just about every video player we have come across does the right thing. But the rest of the video player world needs to sit on the naughty step and think about what it’s done.
The Specshank Redemption
The good news is that modern protocols have you covered. They know that they operate in the real world and that, in the real world, bad stuff can happen and maybe frames are dropped.
So protocols such as HLS and DASH have sophisticated ways of marking “bad stuff happened” moments in the stream and telling clients to re-sync. Hurrah! Happy day! Problem solved! Cinderella goes to the ball!
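In HLS, for instance, that "bad stuff happened" marker is the EXT-X-DISCONTINUITY tag, which tells the client that timestamps and encoding parameters may reset at the next segment. The segment names and durations below are made up for illustration:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:6.0,
segment100.ts
#EXTINF:6.0,
segment101.ts
#EXT-X-DISCONTINUITY
#EXTINF:6.0,
segment102.ts
```

A player that honors the tag re-primes its decoders and re-syncs its clocks at segment102.ts; one that ignores it carries the old timeline forward and drifts or stalls.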
Except it’s not always quite like that in practice. Many years back, when we first implemented DASH, we enthusiastically read all of the spec and said “Great! We can describe the things that go wrong with real-world live events and generate manifests that reflect the complex realities that occur.”
Oh, how we laughed!
Even mainstream video players do not necessarily interpret complex manifests the same way. Add into the mix less mainstream players and devices, and you have a wild west of behaviors. Some video players made valiant attempts to interpret complex and sometimes ambiguous specifications, with understandably inconsistent results. As for the others, we can only assume they printed out the spec and then used the nice heavy ring binder it produced to prop open the door and get a nice draft flowing through the room, as there is certainly very little evidence that they ever bothered to read the damned thing.
And they ran tests that worked just fine when given the simplest, “everything is just perfect” playlists.
Some of the behavior was so off-spec that we pointed it out to manufacturers (or more accurately, asked our large industry customers to point it out, as they are big enough to have significant sway). In short, the message was “you are so far off spec it’s not even funny.”
And they all instantly changed their ways, and we all lived happily ever after.
Except, as you well know dear reader, that’s not what happened at all.
A World Full of Unicorns and Rainbows
So we live in a world where, unless you know and can control exactly what devices are viewing your streams, you have to cater for lowest-common-denominator implementations of the specs we deal with every day: implementations where the only way to get the desired behavior is for everything to be perfect all the time.
And so we wave a magic wand. We tell downstream devices that everything is perfect in the world and that nothing ever goes wrong, ever. And they believe us.
And by and large, they are right to believe us, because our media platform, Norsk, bends over backwards to make the magic true. No matter how “real world” the source being sent to us is, everything downstream sees a “perfect” output. Missing audio and video is filled in transparently, and frame rate and resolution changes disappear. Video players see a swan serenely gliding down the river, when under the covers that swan is paddling like crazy to preserve the illusion of effortless progress.
So if this article has a message (beyond my long-standing need to howl at the moon on this topic!), it is to suggest that you either control your video player population very carefully or live in a world of fairy godmothers and perfection. The sort of perfection you can get with Norsk.
It’s a streaming expert, so you don’t have to be.