
An aside: adventures in desktop streaming

A project log for Streamo Encodo Cheapo

Trying to make an inexpensive HDMI encoder with off-the-shelf stuff

WJCarpenter, 08/07/2024 at 19:16

One of the things I want to capture, and which sort of inspired this whole line of research, is something that runs in fullscreen mode on Linux in general and on a Raspberry Pi in particular. Since I am already on a Raspberry Pi, can I efficiently stream an entire desktop? I did a bunch of experiments, mostly on a Raspberry Pi but a bit on a Linux x86 machine. This is a description of some things I found out along the way. You shouldn't think of this as a monograph from an expert in this area. Rather, I am trying to describe a few specific things with enough context to be useful (including to future me). If you are an expert in any of these areas, you will probably chuckle at how much I struggled to figure out what you think is obvious.

My use case requirements simplified the general question somewhat.

There is a heck of a lot that I don't know about MPEG, MPEG-TS, and their friends. I could write a book. No, wait, I couldn't write a book; I mean that what I don't know could fill volumes. As it is, what I do and don't know fills untold quantities of software documentation, forum postings, and wikis. Quite often advice to "just try this" is offered by someone who doesn't really know anything about anything, except that that single "try this" worked for them under some unspecified circumstances. Consequently, there is paradoxically too much information available since a lot of it is wrong or obsolete with no way to tell the difference. It's the Fog of Wikis.

The FFmpeg approach

ffmpeg -h full

Although there are billions of tools for doing things with video streams, almost all of them are built on top of FFmpeg or its constellation of libraries and their close friends. I decided to start by going right to the BMOC and seeing if I could use it directly. For a long time, I've put off learning a lot of things about FFmpeg simply because it can do so much. On my Linux desktop, just printing the "full help" from the ffmpeg command gives over 15,000 lines of text. The tool is ridiculously capable (and I mean that in a complimentary way).

I soon discovered that there are a couple of ways to use the Linux display as the FFmpeg input. The simplest to understand is "x11grab", which in FFmpeg parlance is a format. The input device is something that identifies the X11 display (typically ":0.0"). The other way is "kmsgrab", which does something similar but at a different architectural layer. My (unverified) understanding is that "kmsgrab" uses fewer resources but can sometimes be trickier to use. (Note: Before using ffmpeg with "kmsgrab", you need to set a capability on the ffmpeg binary: "sudo setcap cap_sys_admin+ep /usr/bin/ffmpeg".) This Wikipedia article gives a good overview of DRM/KMS. This article, though with some vendor specifics, is also a pretty good read: https://wiki.st.com/stm32mpu/wiki/DRM_KMS_overview. Both "x11grab" and "kmsgrab" offer options for selecting portions of the screen, the size of the screen capture, and so on.
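
For comparison, here is a minimal, untested sketch of a "kmsgrab" capture along the lines of the example in the FFmpeg documentation. The frames arrive as DRM objects, so they have to be downloaded into system memory before a software encoder like libx264 can touch them; the frame rate, pixel format, and output file here are assumptions that may need adjusting for a particular system.

# Untested sketch: capture via kmsgrab, download the DRM frames, and encode with libx264.
ffmpeg -framerate 30 -f kmsgrab -i - -vf 'hwdownload,format=bgr0' -c:v libx264 -preset ultrafast kmsgrab-test.mkv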

Because FFmpeg is available for so many platforms, and because there are lots of compile-time choices, either or both of those two formats might be missing from an environment. Some years ago, "x11grab" may have been deprecated in favor of "xcbgrab", but most of my copies of ffmpeg still have "x11grab". I don't really know the difference between the two or what their twisted history is all about. My suggestion would be to see which one is present and use it. I think the options are compatible. Find which one you have by running this command:

ffmpeg -formats | grep grab

You might see some other things that happen to have the string "grab" in them, but it should be clear if you have the ones you want. 

I came across another way to approach this, https://github.com/umlaeute/v4l2loopback, but I haven't done any experiments with it so far.

That takes care of the input.

You can ask FFmpeg to transcode the input into an MPEG-TS stream by specifying a format of "mpegts" for the output. That's pretty simple, though it ignores audio, which I'll come back to. Where should that output go? It's easy to capture it in a file, but for streaming purposes we need to get it onto a network somehow. In ancient times, the FFmpeg project had a tool called FFserver which was a streaming server, but it's been gone for a while. You can give an HTTP URL as the FFmpeg output, but only if something is listening at that location to receive the stream. If you want the output destination itself to listen for streaming requests, you have to give the command line option "-listen 1" (single client) or "-listen 2" (multiple clients) before the output URL. I haven't tracked down the precise difference between single and multiple clients, though I have an idea.

I only found out about the "listen" option when I saw it mentioned in a forum response. Though it is listed in the FFmpeg full help, I haven't been able to find any official documentation for it. It's probably in there somewhere. On at least one of the environments where I was experimenting with FFmpeg, the "listen" option wasn't recognized. I thought maybe it was deprecated, so I gave up on it as a possible part of a solution. I have just checked, and it's still part of the libavformat source code, so maybe it was some build-time decision for the environment where it didn't work for me. Or maybe I just made a mistake of some kind in that particular experiment.

Putting all of that together, an example command line for streaming the full desktop to a localhost port looks like this:

ffmpeg -video_size 1920x1080 -framerate 60 -f x11grab -i :0.0  -listen 2 -f mpegts http://127.0.0.1:33444

The accepted answer in this (very old) thread has some good examples of how to combine audio and video from separate sources: https://superuser.com/questions/277642/how-to-merge-audio-and-video-file-in-ffmpeg. This article is a good tutorial on some alternative scenarios: https://json2video.com/how-to/ffmpeg-course/ffmpeg-add-audio-to-video.html.
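
As a concrete (but untested) sketch of that idea for this project, combining the x11grab video with a PulseAudio capture might look something like this. The "default" PulseAudio source is a placeholder; on a real system you would substitute the monitor source for whatever output you want to tap (more on that in the audio section below).

# Untested sketch: mux x11grab video and a PulseAudio source into one MPEG-TS stream.
# "default" is a placeholder; substitute the monitor source you actually want to tap.
ffmpeg -video_size 1920x1080 -framerate 60 -f x11grab -i :0.0 -f pulse -i default -c:v libx264 -preset ultrafast -c:a aac -b:a 128k -listen 2 -f mpegts http://127.0.0.1:33444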

The VLC approach

vlc -H

You are probably familiar with VLC as the all-singing, all-dancing media player that can play just about anything. You might not be familiar with some of its other capabilities: it can capture the desktop screen, transcode on the fly, act as a streaming server over HTTP and other protocols, and run headless from the command line as cvlc.

Until a few days ago, I either didn't know about or wasn't very familiar with most of those capabilities. For some things you do in the GUI, vlc shows you the command line equivalent, which you can modify on the fly. Theoretically, you can also copy those things and use them directly with cvlc. Also, if you turn on vlc debug logging, it will log those CLI pipelines. I say theoretically because I didn't have very good luck with those approaches. The CLI syntax is quite quirky, and I was seldom able to directly use what vlc told me in the GUI or in the logs. That might be due to my thick skull being unwilling to let the syntax waves penetrate. By trial and error, I eventually created a pipeline that worked.

The pseudo-device for screen capture is literally "screen://". There are also command line parameters for controlling some of the screen capture settings. For example, you can control the frame rate (of the capture, not the physical screen). As I mentioned earlier, capturing the screen only deals with the video images and does not include audio. On Linux, the audio is a completely separate subsystem, most commonly PulseAudio or PipeWire. A typical system will have multiple possible audio outputs (HDMI/DisplayPort, analog headset jack, and possibly others). To capture the audio along with the video, VLC treats the audio source as an "input slave", and you have to tell it which audio device to use. It's easy to find out that you choose a PulseAudio device by using the literal string "pulse://", but it's a bit more mysterious how you indicate which PulseAudio device you want. In my case, I wanted the audio being sent over the HDMI connector, and I got lucky by guessing "pulse://hdmi". I have yet to find any documentation suggesting that or explaining why it should work.
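
Before bothering with transcoding and streaming, it can help to confirm that the capture inputs work at all. Something like this (a guess on my part rather than a tested recipe) should simply play the captured screen and the HDMI audio back locally:

# Sanity check: render the screen capture and the HDMI PulseAudio slave locally, no streaming.
vlc screen:// :input-slave=pulse://hdmi :screen-fps=30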

For the transcoding part of the pipeline, I just used fragments of what the GUI told me it was doing. I then used the VLC embedded help (which produces a mere 6000 lines) to find additional parameters to tune that starting configuration.

For the output destination, I used the HTTP target, which acts as a built-in streaming server. You can configure a specific local IP address or name, or you can leave that out and have VLC listen on all of the interfaces. You specify a TCP port number and an optional URL path.

Putting all of the above together, here is an example of capturing the entire screen along with HDMI audio, transcoding it to MPEG-TS, and sending it to the built-in streaming server:

cvlc screen:// :input-slave=pulse://hdmi :screen-fps=60.0 :live-caching=300 --sout '#transcode{vcodec=x264,venc=x264{profile=baseline,keyint=30},acodec="mpga",ab="128",channels="2",samplerate="48000",scodec=none,threads=4}:http{dst=:33444/ts.ts}' :sout-all :sout-keep

In that example, the subtitle codec is specified as "none" because I know I don't have any in this scenario. Some of the other parameters in the pipeline might not be needed, and I am still tuning for performance. The client request for the stream would go to "http://some.server:33444/ts.ts".
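
On the client side, anything that can play an HTTP MPEG-TS stream should do as a quick check, for example:

# Quick client-side checks of the stream (hostname is a placeholder).
vlc http://some.server:33444/ts.ts
ffplay http://some.server:33444/ts.ts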

This documentation section describes the overall streaming model: https://docs.videolan.me/vlc-user/desktop/3.0/en/advanced/stream_out_introduction.html. This wiki page has several examples of cvlc streaming pipelines: https://wiki.videolan.org/Documentation:Streaming_HowTo/Command_Line_Examples/. Here is another useful VLC wiki page: https://wiki.videolan.org/Documentation:Streaming_HowTo_New/.

The TSDuck approach

tsp --help

Along the way, while looking at tooling for this, I came across the TSDuck toolkit. I originally got interested in it because it has a lot of facilities for analyzing and modifying MPEG-TS streams, and I wanted to see if I could figure out why Channels DVR didn't like what my HDMI encoder box sometimes put out. I didn't actually solve that riddle, but I was able to use TSDuck to create a simple proxy that worked around whatever the issue is. I packaged it up as a Docker image. You can read about it here.

As far as I can tell, TSDuck doesn't have a direct way to capture the desktop video. It wants to start with an MPEG-TS stream or file. At the end of its pipeline, TSDuck's tsp command can expose a rudimentary streaming server. Even though it is limited to a single client at a time and has other restrictions, it turned out to be exactly good enough for what I needed.
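
To give a flavor of how tsp pipelines fit together, here is a hedged sketch (not the pipeline from my proxy image): it pulls an MPEG-TS stream over HTTP, for example from one of the FFmpeg or VLC commands above, reports continuity errors, and re-emits the stream over UDP. The addresses are placeholders.

# Hedged sketch of a tsp pipeline; addresses are placeholders.
# Read MPEG-TS over HTTP, log continuity counter errors, re-send over UDP.
tsp -I http http://127.0.0.1:33444 -P continuity -O ip 239.10.10.10:5004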

Audio stuff

I started using Linux back in the days when getting audio output was a matter of having the right driver for one of a handful of sound cards and their clone friends. Things have gone through multiple generations of de facto standards since then. I hadn't really kept up with the details because things always just worked on my various desktops and laptops. I saw a lot of component names mentioned, but I let them fly by without impediment.

In a nutshell, here is what I think the situation is. The early Linux device driver approach eventually gave way to the Open Sound System (OSS), a more modular way of providing pretty much the same thing. OSS is probably still a thing, but it has generally been supplanted by the Advanced Linux Sound Architecture (ALSA). Here, both OSS and ALSA refer to kernel-level interfaces for audio devices. There are also userland libraries for applications to use. For the purposes of this discussion, we can forget about OSS. The latest generation of audio architectures tends toward independently running audio services. The two services that dominate these days are PulseAudio and PipeWire. PulseAudio is pretty well-established, but PipeWire seems to be picking up steam. (There is a third contender, JACK, but you are unlikely to encounter it unless you are doing high-end audio work.) If you have a reasonably modern Linux distribution, it will almost certainly have one or the other of those already installed or readily available. Both use the ALSA kernel interface to talk to actual devices. By running as services, they are able to do sophisticated things like simultaneous access by multiple applications, flexible mixing, and on-the-fly device switching. Because of capitalism or something, all of these things have evolved emulation layers to cater to applications written for one of the others. That's handy for running applications but confusing for sorting out what is actually happening.

Rather than have my little gray cells overrun with details, I decided to focus on what I could do with PulseAudio. All of the Linux systems I have been working with either already had it installed or could easily have it.

For the problem at hand, we want to find out which real or virtual device is receiving audio from the application and tap it for multiplexing into our resulting MPEG-TS stream. Unfortunately, different tools have different ways of describing things, both in notation and in terminology. The best starting point for enumerating devices is "aplay", which is generally installed as part of some ALSA utilities package. It's a standalone CLI audio player, but it also has some useful discovery options. "aplay -l" (lowercase L) lists the hardware devices. "aplay -L" (uppercase L) lists virtual devices, each of which is some combination of encoding/conversion to a hardware device. Some of that "aplay -L" output will usually be described as "no conversion", and those listings should parallel the things listed in the "aplay -l" output.
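
In other words (the output obviously varies from machine to machine):

# List physical ALSA playback devices (cards and subdevices).
aplay -l
# List virtual/PCM device names, including the "no conversion" entries.
aplay -L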

Wait a minute ... the application is playing audio to an output device, but we want an input device to pick it up for our MPEG-TS stream. What's that about? The PulseAudio paradigm for doing this is called a loopback device or a monitor, and it's for exactly this kind of use. This article has a good general overview of how to identify PulseAudio loopback devices: https://wiki.debian.org/audio-loopback. These articles have some good hints about audio capture: https://trac.ffmpeg.org/wiki/Capture/ALSA and https://trac.ffmpeg.org/wiki/Capture/PulseAudio. However, none of those helped me identify the PulseAudio loopback device for the HDMI interface. I simply guessed and got lucky for the VLC case.
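
For what it's worth, the PulseAudio-native way to enumerate the candidates seems to be pactl. The sources whose names end in ".monitor" are the loopback taps of output devices, and one of those names (they vary per system) is what an FFmpeg "pulse" input wants; the source name in the second command below is only a placeholder.

# List PulseAudio sources; names ending in ".monitor" are loopback taps of outputs.
pactl list short sources
# Untested example with a placeholder source name: grab a few seconds from a monitor source.
ffmpeg -f pulse -i alsa_output.placeholder.monitor -t 5 monitor-test.wav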
