Designing for Failure with Norsk

[Image: a train coming down the tracks towards the viewer, with a police vehicle split in two, one part on either side of the tracks.]

At id3as we have always been in the business of designing for resilience. We aim to be good on a bad day, and our choice of technologies (Erlang, Rust, PureScript) reflects that.

So what does it mean when we give our users the power to execute their own code within that model, and they can suddenly use their own choice of language and their own choice of framework? How does that then impact the way in which those users design those applications?

In order to answer those questions, we must first look at what happens by default when something goes wrong in a Norsk client application.

What Failure Looks Like

We apply a two-step approach to avoiding failure in Norsk itself. The first is using a language (PureScript) that rigidly enforces some notion of correctness. The second is using a platform (Erlang) that allows us to build up a hierarchy of processes, along with a description of how those processes depend on each other and what to do when an individual process crashes.

In Norsk, the unit of failure that we are most interested in is that of a running media node. In the event of failure, we should expect the following behavior:

  • Norsk will not try to restart the stopped node.
  • A node halting should not result in *other* nodes having to stop.
  • Norsk will clear up all subscriptions to/from that media node.
  • The rest of the system will carry on running (therefore allowing output to keep on flowing uninterrupted).
  • The server will inform the client (if still possible).

But why would a media node have to stop? A non-exhaustive list of reasons could include the following:

  • It encounters data that it cannot possibly handle.
  • It cannot perform its function because of bad configuration.
  • There is a (rare) bug in Norsk itself.
  • The client application that created the node disconnects.

Responding to Failure

The client can easily respond to the first three with an appropriate action. The client can either re-create the node with different configuration or switch the input of a node downstream to an alternative source. It is possible that the best course of action may even be to do nothing at all, because it isn’t necessary to start the node again. This level of control is one of the aspects of Norsk that makes it so compelling for the construction of media workflows in the first place.
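To make those three options concrete, here is a minimal sketch of how a client might pick between them. Everything in it is hypothetical: `FailureReason`, `RecoveryContext` and `decideRecovery` are names invented for illustration and are not part of the Norsk SDK, and the rules themselves are just one plausible policy.

```typescript
// Hypothetical failure reasons, mirroring the first three items in the list above.
// These are NOT Norsk SDK types; they exist only for this sketch.
type FailureReason = "unhandleable-data" | "bad-configuration" | "internal-error";

// The three responses described in the text: re-create, switch source, or do nothing.
type RecoveryAction =
  | { kind: "recreate"; settings: Record<string, unknown> }
  | { kind: "switch-source"; sourceId: string }
  | { kind: "ignore" };

// What the client knows about the stopped node (all fields are assumptions).
interface RecoveryContext {
  nodeIsEssential: boolean;
  correctedSettings?: Record<string, unknown>;
  backupSourceId?: string;
}

function decideRecovery(reason: FailureReason, ctx: RecoveryContext): RecoveryAction {
  // If the workflow does not need the node any more, doing nothing is a valid answer.
  if (!ctx.nodeIsEssential) {
    return { kind: "ignore" };
  }
  // Bad configuration is worth retrying only if we have something different to try.
  if (reason === "bad-configuration" && ctx.correctedSettings) {
    return { kind: "recreate", settings: ctx.correctedSettings };
  }
  // Otherwise point the downstream node at an alternative source, if one exists.
  if (ctx.backupSourceId) {
    return { kind: "switch-source", sourceId: ctx.backupSourceId };
  }
  // Nothing sensible to do automatically; leave the node stopped.
  return { kind: "ignore" };
}
```

Keeping the decision in a pure function like this means the policy can be unit tested without a running Norsk instance; the code that actually re-creates nodes or changes subscriptions then only has to act on the returned value.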

Things do go wrong, and aggregating that failure into a single unit and a single action (whilst logging and potentially supplying the reasons) simplifies the handling of that failure at the client level.

What Happens When a Norsk Client Application Disconnects?

We represent each media node in Norsk (at the lowest level) with a two-way gRPC channel. We use this to send initial/follow-up configuration and exchange context changes/subscriptions. The media node will remain operational so long as this channel is open.

Using the TypeScript SDK, we represent this persistent channel with the return value of the function that creates the node.

let input = await norsk.input.rtmpServer(rtmpSettings);

We expect client applications to store this reference for the lifetime of the application. We use this reference to update subscriptions and configuration, respond to context changes, or manually stop the node. Closing the channel or stopping the client application in any way will result in the termination of the corresponding media node running inside Norsk.
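One way to keep those references alive and tidy them up deliberately is a small registry. The sketch below is an assumption-laden illustration rather than SDK code: `NodeHandle` and `NodeRegistry` are hypothetical names, and we assume only that whatever the node-creating function returns can be stopped through an asynchronous `close()` method.

```typescript
// Hypothetical minimal shape of a node reference. The only assumption made about
// the real SDK objects is that they can be stopped with an async close().
interface NodeHandle {
  close(): Promise<void>;
}

class NodeRegistry {
  private readonly nodes = new Map<string, NodeHandle>();

  // Store the reference for the lifetime of the application and hand it back.
  add<T extends NodeHandle>(name: string, node: T): T {
    if (this.nodes.has(name)) {
      throw new Error(`node "${name}" is already registered`);
    }
    this.nodes.set(name, node);
    return node;
  }

  get(name: string): NodeHandle | undefined {
    return this.nodes.get(name);
  }

  // Close the most recently created nodes first. One failing close() must not
  // prevent the remaining nodes from being closed.
  async closeAll(): Promise<string[]> {
    const closed: string[] = [];
    const entries = Array.from(this.nodes.entries()).reverse();
    for (const [name, node] of entries) {
      try {
        await node.close();
        closed.push(name);
      } catch {
        // Ignore and carry on; Norsk tears the node down anyway when the channel goes.
      }
    }
    this.nodes.clear();
    return closed;
  }
}
```

Under those assumptions, the line above could become `registry.add("rtmp", await norsk.input.rtmpServer(rtmpSettings))`, with `closeAll()` called from whatever shutdown hook the application already has.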

On the face of it, this seems a bit extreme; an obvious feature request would be that once a media node is set up and running, it simply stays running. On startup, a client application could simply request references to nodes that it knows are running within Norsk. As is common with technical decisions of this nature, this solution raises more problems than it solves:

  • What happens if we have multiple clients? Who wins?
  • In the absence of a client, how does Norsk make decisions about
    • Media context changes?
    • Incoming RTMP/SRT clients/streams?
    • Composition transitions?

The reality is that the client application connected to Norsk is the brains of the whole operation. We cannot expect Norsk to make any decisions at all if the client application is not present. If the client application terminates, then it is just as if the client application was never there. Norsk completely shuts down all of the resources that the client application created (as cleanly as it can) so that it can start anew when we restart the client application.

Some Design Implications

It is very tempting to write a Node.js client application in TypeScript or JavaScript in the style of some of our examples. A single application can host a web server, access databases, call out to remote services, and communicate with Norsk. This is potentially problematic, because a single uncaught exception can bring the whole application down. A greater surface area of code increases the likelihood of those uncaught exceptions. Crashing the application will result in the termination of all the attached Norsk media nodes, potentially causing service interruption.

In general, the client application that controls Norsk should only be responsible for controlling Norsk. By its very nature, the client application that controls Norsk is going to end up with a fair bit of business logic in it for decision-making around how streams flow through the workflow. That is already quite enough responsibility for a single process.

Clearly the solution is to break up the client application, and there are a few patterns we can use to do this.

Push Work Out of the Client Application

This is the easiest starting point, and probably where we would suggest beginning the application development process. We write all the code we need as we go, keeping it all in a single application but thinking carefully about module design. It will then become obvious during development which aspects of the application are prone to failure. We can then harden the weak links, or remove them from the main process and place them in their own service.

If the Norsk client application is calling out to remote services and they return an error, the Norsk client application can carry on making decisions based on the current state. The logic around this behavior is obviously business-dependent. We can achieve the same with judicious exception handling, but that can be tricky to get right in a deep stack of asynchronous actions.
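A minimal sketch of "carry on with the current state" might look like the following. `LastKnownGood` is a hypothetical helper, not something the Norsk SDK provides, and the remote call is passed in as a function so that the sketch makes no assumptions about which HTTP or RPC client is in use.

```typescript
// Hypothetical helper: remembers the last value a remote service gave us and
// keeps serving it when the service fails, instead of letting the error escape.
class LastKnownGood<T> {
  private current: T;
  private failures = 0;

  constructor(initial: T) {
    this.current = initial;
  }

  get value(): T {
    return this.current;
  }

  get consecutiveFailures(): number {
    return this.failures;
  }

  // Runs the remote lookup. On success the stored value is replaced; on failure
  // the previous value is kept and the failure is only counted.
  async refresh(fetchRemote: () => Promise<T>): Promise<T> {
    try {
      this.current = await fetchRemote();
      this.failures = 0;
    } catch {
      this.failures += 1; // the caller can alert once this passes a threshold
    }
    return this.current;
  }
}
```

The design choice here is that the error boundary sits in exactly one place, right at the edge of the remote call, rather than being spread through a deep stack of asynchronous actions. Whether stale data is acceptable, and for how long, remains a business decision.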

We must be careful during this process to avoid too much unrelated logic taking place within the main process, lest an unhandled exception bring the whole session down.

Treating the Client Application as a Service

The client application is a means to inject business logic and configuration into a running media workflow. This is contrary to the common approach of having to imagine every possible scenario and loading that into the configuration of some black box that cannot then be changed. As such, the client application doesn’t need to solve ‘every single scenario’; it just needs to solve the scenarios that are relevant to the business problem being tackled.

We can write our Norsk client application as a service that takes business-level configuration on startup. The client application can use this to control some of the decisions it might then choose to make on behalf of Norsk. We can then expose business-level services (or another gRPC channel/WebSockets) for controlling updates to this configuration.
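As a rough illustration of what "business-level configuration" could mean in practice, the sketch below keeps a small, coarse configuration object and validates every update before applying it. The field names (`maxPublishers`, `onSourceLoss`, `allowedStreamKeys`) and the `ConfigStore` class are invented for this example; how updates arrive (HTTP, gRPC, WebSockets) is deliberately left out.

```typescript
// Hypothetical business-level settings: deliberately far coarser than per-node options.
interface BusinessConfig {
  maxPublishers: number;
  onSourceLoss: "show-slate" | "switch-to-backup";
  allowedStreamKeys: string[];
}

type ConfigListener = (next: BusinessConfig, previous: BusinessConfig) => void;

// Rejects nonsense before it can influence any decision made on behalf of Norsk.
function validate(config: BusinessConfig): BusinessConfig {
  if (!Number.isInteger(config.maxPublishers) || config.maxPublishers < 1) {
    throw new Error("maxPublishers must be a positive integer");
  }
  if (config.onSourceLoss !== "show-slate" && config.onSourceLoss !== "switch-to-backup") {
    throw new Error("onSourceLoss must be 'show-slate' or 'switch-to-backup'");
  }
  if (!Array.isArray(config.allowedStreamKeys)) {
    throw new Error("allowedStreamKeys must be an array");
  }
  return config;
}

class ConfigStore {
  private config: BusinessConfig;
  private readonly listeners: ConfigListener[] = [];

  constructor(initial: BusinessConfig) {
    this.config = validate(initial);
  }

  get current(): BusinessConfig {
    return this.config;
  }

  onChange(listener: ConfigListener): void {
    this.listeners.push(listener);
  }

  // Applies a partial update. An invalid patch throws before anything changes,
  // so the previous configuration stays in force.
  update(patch: Partial<BusinessConfig>): BusinessConfig {
    const next = validate({ ...this.config, ...patch });
    const previous = this.config;
    this.config = next;
    for (const listener of this.listeners) {
      try {
        listener(next, previous);
      } catch {
        // A faulty listener should not undo or block a valid update.
      }
    }
    return next;
  }
}
```

An update endpoint would then amount to catching the error from `update()` and reporting it to the caller, while the code that talks to Norsk reads `current` or subscribes through `onChange`.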

Of course, this presents the danger that we end up simply building a thin wrapper around Norsk itself. Duplicating the Norsk SDK is a sure sign that this is happening, and we should take care to avoid this. In general, business-level decisions are far coarser than the options provided by Norsk itself. We can always directly embed a business decision in the code to start with. We can then easily change this decision later, because it is our own code!

[Figure: Treating the Norsk client application as a service]

Building a Client Application Shim

The most extreme option would be to wrap up the Norsk interaction in a very thin worker process. This reduces the surface area for bad code causing failure. We can supply the ‘shim’ with ‘fallback behavior’ for when the main application isn’t present. It can spin up and hold the media node channels and use that fallback behavior when it needs to.

This isn’t a course that we would recommend lightly. In a sufficiently complicated environment, however, we can envisage it being a practical way to get the behavior we alluded to earlier in this post: “Can’t we just reconnect to the running session?” Yes, we can, but now the logic around “What do we do when a client is not there?” and “What do we do if more than one client tries to connect?” becomes a decision that our team can make. This is because it is our own code, our own application, and our own business logic.
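The core of such a shim can be quite small. The sketch below is purely illustrative: `StreamEvent`, `Decision`, `Controller` and `Shim` are hypothetical names, the controller is modelled as an in-process object with a synchronous `decide()` for brevity (in reality it would be a separate process reached over some channel), and "first controller wins" is just one possible answer to the multiple-client question.

```typescript
// Hypothetical events and decisions; a real shim would map these onto SDK calls.
type StreamEvent =
  | { kind: "publisher-connected"; streamKey: string }
  | { kind: "source-lost"; sourceId: string };

type Decision = "accept" | "reject" | "show-slate" | "switch-to-backup";

// The main application, as seen from the shim.
interface Controller {
  id: string;
  decide(event: StreamEvent): Decision;
}

class Shim {
  private controller: Controller | undefined;
  private readonly fallback: (event: StreamEvent) => Decision;

  constructor(fallback: (event: StreamEvent) => Decision) {
    this.fallback = fallback;
  }

  // Policy choice (ours to make): the first controller wins, and later ones are
  // refused until it detaches.
  attach(controller: Controller): boolean {
    if (this.controller) {
      return false;
    }
    this.controller = controller;
    return true;
  }

  detach(id: string): void {
    if (this.controller?.id === id) {
      this.controller = undefined;
    }
  }

  // Ask the controller if there is one; otherwise, or if it throws, use the fallback.
  decide(event: StreamEvent): Decision {
    if (!this.controller) {
      return this.fallback(event);
    }
    try {
      return this.controller.decide(event);
    } catch {
      return this.fallback(event); // a faulty controller must not take the shim down
    }
  }
}
```

The shim would also be the process that holds the media node references, which is what allows the main application to come and go without the nodes being torn down.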

[Figure: Building a lightweight shim to minimise failure surface area in a Norsk client application]

Summary

Norsk provides the power to get inside running media workflows and take control of the decision-making process. We can react dynamically to changes in the input and create our own control systems around the active set of streams. To achieve this, we write a client application that sits closely with the media technologies driving that workflow. We have to make decisions around error handling in order to keep delivery going, even on those aforementioned bad days.

Whichever way we choose to tackle it is down to the team writing that application. The business requirements should be the biggest factor in that decision. Hopefully the suggestions above will be helpful in the decision-making process.

You can get started with Norsk straight away by downloading a trial license and visiting our developer page for docs and examples.

 

Rob Ashton

Rob has been writing code for id3as for nearly a decade, working on all aspects of various products from low-level codec integration up to the implementation of actual customer integrations, and loves it all. He lives in Glasgow with his girlfriend and dog and can often be found at conferences with his laptop looking for the next interesting conversation.