
AV1 Low Delay for RTC – Challenges and Suggestions

Thomas Davies, PhD

Distinguished Engineer, Visionular Inc.

Hi there, I’m Thomas Davies and I’m here to talk to you about “AV1 at the Coalface – developing a practical RTC solution using AV1”.

I started working on AV1 by developing the standard, and since then I’ve been eating my own dog food and developing production encoders for AV1.

Now, when I started on video coding, RTC meant video calling, conferencing, that kind of thing. But since then, the scope of RTC has increased to include things like game sharing and live streaming, so it’s become increasingly challenging to develop an AV1 encoder for that range of applications. So, my agenda today will explore some of the issues that come up in that development.

First of all, I’m going to talk about what’s involved in rolling out a new codec and what requirements that imposes on an encoding solution. Then I’ll talk about some of the key challenges in meeting those requirements, especially with an advanced codec like AV1.

I’ll talk about where we’ve gotten to with Aurora1 RTC, and then I’ll have some comments about what the future looks like, especially with ML tools, ML enhancement, and ML encoding. And then I’ll leave you with a few basic takeaways.

This talk was presented by Thomas Davies at the RTC @Scale conference in 2024.

Codec Rollout Strategy

So, for a codec rollout, you need to answer three basic questions: what, why, and how.

Codec rollout strategy - What goes into it?

What?

Now, “what” means characterizing your application from the perspective of a video encoder. So that means what bitrates and resolutions are you supporting? What is your performance envelope, particularly in software? That means how much CPU you can use on each kind of device that you support. It also means what content type you’re targeting, whether that’s standard conferencing or live streaming or screen content, and what latencies you need to support. So that might be standard zero latency or you might be able to have slightly higher latencies, maybe a quarter of a second or so for some streaming applications.
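
To make this concrete, here is a minimal C++ sketch of how that “what” might be captured as data an encoder team can plan against. All of the names, resolutions, and budget numbers are illustrative assumptions, not taken from any specific product.

```cpp
// Hypothetical sketch: capturing the "what" of a rollout as data an encoder
// team can plan against. All names and values are illustrative.
#include <vector>

enum class ContentType { Conferencing, LiveStreaming, ScreenShare };

struct OperatingPoint {
    int width, height, fps;
    int min_kbps, max_kbps;   // bitrate range to support at this resolution
};

struct RolloutProfile {
    ContentType content;
    std::vector<OperatingPoint> ladder;  // resolutions and bitrates in play
    double cpu_budget_pct;               // share of CPU the encoder may use
    int max_latency_ms;                  // 0 for strict zero latency, ~250 for some streaming
};

// Example: a conferencing client on a mid-range device.
static const RolloutProfile kConferencingProfile{
    ContentType::Conferencing,
    {{1280, 720, 30, 600, 1500}, {640, 360, 30, 250, 600}},
    /*cpu_budget_pct=*/40.0,
    /*max_latency_ms=*/0,
};
```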

Why?

Then you need to ask yourself “why”, of course. What is broken? What do you want to fix? What do you want to improve? That might be needing to provide better quality for screen content, or it could be that some mobile clients in your application are experiencing a high degree of loss, or in some circumstances getting no video at all, because over UDP transports they are losing too many frames due to variable bitrates; being able to lower the bitrates you use may improve that resilience considerably.

How?

If you have this rationale for what you want to do, then you can plan how you’re going to do it. That means thinking about where in that bitrate–resolution ladder your new solution is going to fit. Typically, a new codec like AV1 will give you bigger gains at lower bitrates and at higher resolutions, so you need to find a sweet spot. But you may also need to have adequate performance across a whole range of bitrates and resolutions. The more flexible your solution is, and the more performance it has across that range, the easier that is.

The other thing is what CPU envelope can you actually support? Can you afford to spend a little bit more on your new codec or are you fully constrained by what you currently have?

And those things might tell you whether you have a dynamic deployment model, where you are changing your codec on the fly, or whether it’s more static and you’re moving the whole system onto the new, higher-performance codec.
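
As a toy illustration of that planning step, the sketch below walks a hypothetical bitrate/resolution ladder and decides which rungs could move to the new codec given a CPU headroom figure. The gain and cost numbers are placeholders, not measurements.

```cpp
// Illustrative only: deciding which rungs of an existing ladder might move to
// the new codec first. Gains and CPU costs are placeholders, not measurements.
#include <cstdio>
#include <vector>

struct Rung {
    const char* name;
    double est_bitrate_gain_pct;  // expected saving vs. the legacy codec here
    double est_cpu_cost_ratio;    // CPU relative to the legacy encoder (1.0 = same)
};

int main() {
    // New codecs tend to gain most at low bitrates and high resolutions,
    // so those rungs are the usual first candidates.
    std::vector<Rung> ladder = {
        {"1080p high", 30.0, 2.5},
        {"720p mid",   35.0, 1.8},
        {"360p low",   45.0, 1.2},
    };
    const double cpu_headroom = 2.0;  // extra CPU the deployment can afford

    for (const auto& r : ladder) {
        bool deploy = r.est_cpu_cost_ratio <= cpu_headroom;
        std::printf("%-10s gain=%4.1f%% cpu=%.1fx -> %s\n", r.name,
                    r.est_bitrate_gain_pct, r.est_cpu_cost_ratio,
                    deploy ? "roll out" : "stay on legacy for now");
    }
    return 0;
}
```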

Challenges Faced in Building Low-Delay Aurora1 (AV1) RTC Streaming

With that model in mind, we can focus on some of the challenges that we faced in developing Aurora1 RTC, and that anyone faces in developing an RTC encoding solution. I want to pick out three today: performance, rate control, and some of the adaptive behaviors that AV1 supports.

So, what do I mean by performance?

When people talk about video encoders, they often mean something like you can get x% bitrate saving over such and such an encoder, you know, 50% or 20% or whatever.

But in practice, when you’re especially talking about RTC, there’s what I would call a waterfall curve where you have a relationship between the quality that you can achieve and the complexity required to achieve it, as measured here by frames per second.

So, the more compute you can spend, the more quality you can get, up to some kind of ceiling where you trail off and there are diminishing returns. So, you get this kind of curve in the abstract for any codec standard, but also specifically for every codec implementation. But it’s also worth bearing in mind that a particular codec solution will have a sweet spot and will not necessarily cover the full range of this curve. And I think that’s particularly the case for RTC.

So, if you then compare with a legacy codec, the aim is that you are above where that legacy implementation sits across the whole range. You can see that if you spend lots and lots of CPU and you have very poor FPS, you can get big gains, but the challenge is can you get those gains across the whole range?

Ideally, your next-generation solution sits above the curve for your legacy solution. But given the specialist nature of some of the encoding solutions you have for RTC, this can be quite a challenge.

So, you’re aiming to do two things.

Firstly, you’re aiming to move your curve above and to the right to get more gain across this range, but you’re also trying to improve the total range that you can reasonably cover where you do get good performance. So, it’s a double challenge.
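
One way to picture this is to treat each encoder as a set of (speed, quality) operating points and compare the curves at matching speeds rather than at a single preset. The sketch below does exactly that with made-up numbers, purely to illustrate the comparison.

```cpp
// Sketch with made-up numbers: compare two encoders' quality-vs-speed
// ("waterfall") curves at matching speed points rather than at one preset.
#include <cstdio>
#include <vector>

struct Point { double fps, quality; };  // one operating point on a curve

// Linear interpolation of quality at a given speed (points sorted by fps).
double quality_at(const std::vector<Point>& curve, double fps) {
    if (fps <= curve.front().fps) return curve.front().quality;
    if (fps >= curve.back().fps)  return curve.back().quality;
    for (size_t i = 1; i < curve.size(); ++i) {
        if (fps <= curve[i].fps) {
            double t = (fps - curve[i - 1].fps) / (curve[i].fps - curve[i - 1].fps);
            return curve[i - 1].quality + t * (curve[i].quality - curve[i - 1].quality);
        }
    }
    return curve.back().quality;
}

int main() {
    std::vector<Point> legacy = {{30.0, 80.0}, {60.0, 76.0}, {120.0, 70.0}};  // placeholder data
    std::vector<Point> next   = {{30.0, 88.0}, {60.0, 84.0}, {120.0, 74.0}};
    for (double fps : {30.0, 60.0, 90.0, 120.0})
        std::printf("at %5.0f fps: quality delta %+.1f\n", fps,
                    quality_at(next, fps) - quality_at(legacy, fps));
    return 0;
}
```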

So, given that understanding of performance, what are the difficulties that we face?

Well, the major issue in RTC is deadlines. You must produce a frame generally every 33 milliseconds, and those frames can vary considerably in complexity. Now, why do they vary so much in complexity? Well, for modern codecs like AV1, they get their coding gains by having a lot of tools that you can use.

And therefore, there is potentially a long tail of tools that you could exploit to improve quality. If you don’t exploit those tools, then it’s easier to hit those deadlines, but you want to exploit those tools to get the gains.

So how can you solve this conundrum?

Ideas for Hitting RTC Deadlines

We identified three key areas where you can work on this.

The first is through classification: divide and conquer. If you can characterize the type of content and what is likely to work on these difficult frames, then you can reduce the number of things that you have to test and the amount of work that you must do in the limited time available. Here you can apply some real intelligence, both in the sense of lightweight AI or ML, but also through more classical heuristics for analyzing that content.
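
As a rough illustration of that divide-and-conquer idea, here is a hypothetical heuristic classifier that restricts which tools the search is allowed to try. The features, thresholds, and flag names are invented for the example; they are not how any particular encoder does it.

```cpp
// Hypothetical heuristic sketch: classify a frame cheaply, then restrict which
// tools the RD search may try. Features and thresholds are invented.
enum ToolFlags : unsigned {
    TOOL_INTRABC = 1u << 0,  // intra block copy: mostly useful for screen content
    TOOL_PALETTE = 1u << 1,  // palette mode: likewise
    TOOL_WARP    = 1u << 2,  // warped/global motion: camera pans
    TOOL_OBMC    = 1u << 3,  // a "long tail" tool we only enable when time allows
};

struct FrameStats {
    double distinct_color_ratio;  // few distinct colours suggests screen content
    double mean_sad_to_prev;      // cheap motion/complexity proxy
};

unsigned select_tools(const FrameStats& s, bool deadline_tight) {
    unsigned tools = 0;
    if (s.distinct_color_ratio < 0.05) {
        tools |= TOOL_INTRABC | TOOL_PALETTE;  // looks like screen content
    } else {
        tools |= TOOL_WARP;                    // camera content: motion tools
        if (!deadline_tight && s.mean_sad_to_prev > 8.0)
            tools |= TOOL_OBMC;                // long-tail tool only when time allows
    }
    return tools;
}
```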

Then the second thing is, of course, you want to optimize everything inside. That means not just using a lot of SIMD and writing a lot of assembly; it also means reducing the overall code size and memory footprint, and having clever algorithms to truncate the amount of work that you have to do to get a solution.
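
The “truncate the work” idea can be sketched as a mode search that evaluates candidates in order of expected usefulness and exits early when improvement stalls or the frame’s time budget runs out. This is a generic sketch under those assumptions, not a description of any specific encoder’s search.

```cpp
// Generic sketch of truncating work: evaluate candidate modes in order of
// expected usefulness and stop early when improvement stalls or time runs out.
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

struct Candidate {
    std::function<double()> evaluate_rd_cost;  // full RD evaluation (expensive)
};

// Returns the index of the best candidate tried, or -1 if none was evaluated.
int search_with_early_exit(const std::vector<Candidate>& ordered,
                           double time_budget_ms) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    double best_cost = std::numeric_limits<double>::max();
    int best_idx = -1, stalled = 0;
    for (size_t i = 0; i < ordered.size(); ++i) {
        double cost = ordered[i].evaluate_rd_cost();
        if (cost < best_cost) { best_cost = cost; best_idx = int(i); stalled = 0; }
        else if (++stalled >= 3) break;  // several candidates without improvement: stop
        double elapsed_ms =
            std::chrono::duration<double, std::milli>(clock::now() - start).count();
        if (elapsed_ms > time_budget_ms) break;  // respect the frame deadline
    }
    return best_idx;
}
```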

The final approach is parallelization, and this is where there’s another major difference between RTC and other applications because you must operate often with zero latency. You must do most of your parallelizing, if not all of it, within a frame. This can mean that you may have hundreds or even thousands of jobs to do if you want to parallelize to properly exploit the CPU footprint that you have, and you have to do this with the minimum possible overhead.

You need to manage thousands of tasks, which in frame encoding come with a lot of dependencies, without excessive overhead in dispatching those tasks to different cores. This can really help even if you are fast enough to encode in a single-threaded way, because it gives you some safety margin to parallelize if something particularly complex comes out of the rate control.
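
A minimal way to picture in-frame parallelism is a wavefront. Assuming, for the sake of the sketch, that each superblock depends on its left and top-right neighbours, all blocks on the same wave can be encoded concurrently.

```cpp
// Minimal wavefront sketch, not a production scheduler. Assuming each
// superblock depends on its left and top-right neighbours, all blocks on the
// same wave (col + 2*row) are independent and can be encoded concurrently.
#include <cstdio>
#include <thread>
#include <vector>

void encode_superblock(int row, int col) {
    // ... real per-superblock encoding would happen here ...
    std::printf("encoded SB (%d,%d)\n", row, col);
}

void encode_frame_wavefront(int sb_rows, int sb_cols) {
    const int max_wave = (sb_cols - 1) + 2 * (sb_rows - 1);
    for (int wave = 0; wave <= max_wave; ++wave) {
        std::vector<std::thread> workers;
        for (int row = 0; row < sb_rows; ++row) {
            int col = wave - 2 * row;
            if (col >= 0 && col < sb_cols)
                workers.emplace_back(encode_superblock, row, col);
        }
        for (auto& t : workers) t.join();  // wave barrier (simple, but not overhead-free)
    }
}

int main() {
    encode_frame_wavefront(/*sb_rows=*/4, /*sb_cols=*/8);
    return 0;
}
```

A real RTC encoder would keep a persistent thread pool and per-task dependency counters rather than spawning threads per wave, precisely because dispatch overhead is what eats into those per-frame deadlines.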

Rate Control for RTC Applications for Video Codecs

The second area of challenge is around rate control, and this is also related to this very polarized nature of complexity that I alluded to earlier.

Now, one issue here is that, quite often when we think about rate control, we think very much in terms of a classical leaky-bucket buffer model, where you define an average bitrate or a maximum bitrate, bits leak out at that rate, you add encoded frames to your bucket, and so on.
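
For reference, that classical model can be written down in a few lines. This is just the textbook idea, with illustrative thresholds.

```cpp
// Textbook leaky-bucket sketch, illustrative thresholds only: bits leak out at
// the target rate, encoded frames pour in, and QP nudges keep the level sane.
#include <algorithm>

struct LeakyBucket {
    double capacity_bits;       // buffer size
    double target_bps;          // average bitrate
    double fps;                 // frame rate
    double fullness_bits = 0;   // current level

    // Called after encoding one frame; returns a suggested QP adjustment.
    int update(double frame_bits) {
        fullness_bits += frame_bits - target_bps / fps;  // pour in, leak out
        fullness_bits = std::min(std::max(fullness_bits, 0.0), capacity_bits);
        double level = fullness_bits / capacity_bits;
        if (level > 0.8) return +2;  // running hot: raise QP, spend fewer bits
        if (level < 0.2) return -1;  // running cold: lower QP, spend more
        return 0;
    }
};
```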

But what we find particularly with customers is that they’re after something more.

It’s not sufficient just to stay within the parameters of some specific model that they specify. They are often looking for maximum smoothness in the size of your frames and in the rate that you produce.

Rate Control for RTC Applications

Typically, to catch exceptions, customers will apply sliding windows of different time periods over your bitstreams to make sure that things are smooth. They do this because what really causes problems in many RTC applications is sudden changes in bitrate overloading intermediate nodes and causing packet loss, and avoiding packet loss is the objective of a lot of rate control.
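
A sliding-window check of that kind might look like the sketch below, which flags any window, of any of several durations, whose average rate exceeds an agreed ceiling. The parameters are placeholders for illustration.

```cpp
// Illustrative smoothness check: slide windows of several durations over the
// frame sizes and flag any window whose average rate exceeds a ceiling.
#include <cstdio>
#include <numeric>
#include <vector>

bool rate_is_smooth(const std::vector<double>& frame_bits, double fps,
                    const std::vector<int>& window_lengths_frames, double max_bps) {
    bool ok = true;
    for (int w : window_lengths_frames) {
        for (size_t i = 0; i + w <= frame_bits.size(); ++i) {
            double bits = std::accumulate(frame_bits.begin() + i,
                                          frame_bits.begin() + i + w, 0.0);
            double bps = bits * fps / w;
            if (bps > max_bps) {
                std::printf("window of %d frames at frame %zu exceeds cap (%.0f bps)\n",
                            w, i, bps);
                ok = false;
            }
        }
    }
    return ok;
}
```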

For Aurora1, what we developed was a method of lightweight complexity assessment in order to tackle these issues whilst providing smooth dynamics for the rate control.

This means managing your peak bitrate when something complex arrives, knowing that it’s complex when it does arrive, and controlling the maximum frame size so you don’t put a lot of bits onto the wire all at once. Now, bear in mind again that we might be doing this with parallelization in place, so we are trying to control the bitrate whilst distributing a lot of work across several cores.
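
The general idea, though not Aurora1’s actual algorithm, can be sketched as a cheap complexity score feeding a per-frame bit budget with a hard cap on peaks. The factors below are invented for the example.

```cpp
// Hypothetical sketch of the idea, not Aurora1's actual algorithm: a cheap
// pre-analysis complexity score shapes the frame's bit budget and hard-caps
// peaks so one difficult frame cannot create a burst on the wire.
#include <algorithm>

struct FrameBudget {
    double target_bits;  // nominal allocation handed to the encoder
    double max_bits;     // hard cap on what the frame may actually emit
};

FrameBudget plan_frame(double avg_bits_per_frame,
                       double complexity,      // 0..1, e.g. from a fast SAD pass
                       bool scene_change) {
    FrameBudget b;
    // Give complex frames somewhat more bits, but never an unbounded amount.
    b.target_bits = avg_bits_per_frame * (0.75 + 0.5 * complexity);
    // Cap peaks: even a scene change may only exceed the average by a fixed factor.
    double peak_factor = scene_change ? 2.0 : 1.3;
    b.max_bits = std::min(b.target_bits * 1.5, avg_bits_per_frame * peak_factor);
    return b;
}
```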

This enables us to control the resilience and prevent packet loss downstream of the encoder. It depends on identifying scene changes and making reasonable adjustments, like changing resolution, but in particular we’re doing this in the context of layer management.

Often, when people are applying scalability, you are actually trying to hit two or three bitrate targets simultaneously, with perhaps highly variable content, and doing all of this in parallel. This is where adaptivity comes in.

AV1 supports a number of adaptive features that are very useful in RTC, but they pose challenges, especially in conjunction with rate control.

Adaptive Layering:

There’s adaptive layering, for example.

Adaptive Layering in the AV1 video codec

In a scalable solution, you might add or remove layers in your encoding for different targets, and that has an effect on rate control because you need to control the bitrate for each layer independently but produce a common total bitrate as well.
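
A simplified sketch of that split might look like this. The weighting scheme is purely illustrative; each layer’s target would then drive its own rate control while the layers sum to the common total.

```cpp
// Simplified sketch: split one common total across temporal layers and
// recompute the split when layers are added or dropped. Weights are
// illustrative, not a recommendation.
#include <vector>

std::vector<double> allocate_layer_bitrates(double total_bps, int num_layers) {
    std::vector<double> weights;
    double sum = 0.0;
    for (int l = 0; l < num_layers; ++l) {
        double w = (l == 0) ? 1.0 : 0.6 / l;  // base layer heaviest, then tapering
        weights.push_back(w);
        sum += w;
    }
    std::vector<double> per_layer_bps;
    for (double w : weights)
        per_layer_bps.push_back(total_bps * w / sum);  // layers always sum to the total
    return per_layer_bps;  // each layer then runs its own rate control against its target
}
```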

Adaptive Resolution:

Then there’s adaptive resolution, which allows you, for example, to avoid dropping frames when the bitrate goes down, or to increase quality when the bitrate goes up, without a big keyframe that would cause you resilience problems, because you can predict in the normal way. But there are some other cool applications of adaptive resolution.

Adaptive Resolution in the AV1 video codec

You can, for example, insert keyframes that are at a smaller resolution, and that is one way of avoiding problems when there are scene changes.

Now, we have to do all of this within tolerances that customers might set for adapting the bitrate when these kinds of changes happen.
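
One illustrative way to combine those two ideas, a reduced-resolution keyframe at a scene change kept within a burst tolerance, is sketched below. The scaling rule and limits are invented for the example.

```cpp
// Illustrative combination of the two ideas above: at a scene change, code the
// keyframe at reduced resolution so its estimated size stays within a burst
// tolerance, then return to full resolution for the following inter frames.
struct ResolutionChoice { int width, height; };

ResolutionChoice choose_keyframe_resolution(int full_w, int full_h,
                                             double est_keyframe_bits,
                                             double burst_tolerance_bits) {
    ResolutionChoice c{full_w, full_h};
    // Keyframe bits scale roughly with pixel count; halve each dimension until
    // the estimate fits the tolerance (floored at quarter resolution here).
    double est = est_keyframe_bits;
    for (int steps = 0; est > burst_tolerance_bits && steps < 2; ++steps) {
        c.width /= 2;
        c.height /= 2;
        est /= 4.0;
    }
    return c;  // later frames predict from the (rescaled) keyframe as normal
}
```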

How did we perform?

Data from our experiments

Here are a couple of examples from live streaming and video calling. In the live streaming case, I’ve shown you a portion of a waterfall curve.

The reference here is VP8/VP9 medium, which sits on the x264 curve, and we can actually get both speed gains and big quality gains:

  • 48% bitrate reduction in VMAF and 1.6 times the speed of x264.
  • We can actually get up to twice the speed of x264 if we are prepared to give up a few percent of that gain.

Then, for video calling, we implemented our encoder on the iPhone (an iPhone 12, in fact), and we found that we could go 40 to 90% faster at matched video quality, but we were several times faster for screen content while simultaneously giving large bitrate savings of 60% or more.

And then we were doing all of this whilst being more than twice as fast as we needed to be for 1080p30 and 720p30 single-threaded, although as I said you might want to use multiple threads just in case.

Future of video encoding for RTC

When people talk about the future of video encoding, they typically have this kind of hierarchy in mind with three layers.

Future of video encoding

So, there’s where we are now, where you can accelerate and improve the results of video encoding for current standards by having AI-powered encoding and super-resolution pre- and post-processing, that kind of thing: enhancement technologies applied on top of conventional codecs.

Then, looking forward, there will be new standards: the emerging AV2/AVM, and the next-generation standard from MPEG. These may or may not include an increasing number of ML tools within an otherwise conventional architecture. But what people are really excited about is the possibility of implementing a full ML autoencoder model for video coding.

I think one key point there that we really need to understand is where would such an encoder fit on the kind of waterfall curve that I’ve shown you?

If the complexity is so high that it will only fit in the bottom right-hand corner, then that will limit its applications. But if we could get a range of complexities, and adjustable complexities, out of these technologies, then we have something really powerful. So, I think this is where research needs to go: to understand how we can have a flexible solution.

The second thought I have is that the kind of turbocharged gains you get from ML right now are very much related to resolution, and really to the difference between the intrinsic resolution of a video and how many pixels you actually encode. Now, we’ve had multi-resolution encoding for a long time, with scalability and so on, but to really exploit it you need some understanding of how much intrinsic resolution there is and how much you can generate. And so this is a really exciting area where new tools could be invoked frame by frame or block by block in emerging standards.

Summary

If you have a new video codec solution for RTC, it’s important that it should have range and flexibility. It needs to have the features that you want, but it also needs to cover a wide range of operating points so that you can cover as much of your current application as possible and get the gains that you want.

Now, it’s difficult to get big gains and it involves a lot of engineering, but they are definitely there, and you can get some really significant performance that can take you to the next level in RTC.

Finally, we’re seeing increasingly that AI is turbocharging encoders, but also within RTC applications these enhancement technologies are turbocharging the quality that you can get.

Visionular focuses on the use of AI in video compression (for H.264/AVC, HEVC, and AV1), enabling our video codecs to deliver the highest possible compression efficiency at the best encoding speeds!

Our tech is used by the largest content providers globally to deliver the best user experience and to reduce their OPEX by at least 25–30%!
