[alsa-devel] [RFC] AVB - network-based soundcards in ALSA
Hi all!
This is an RFC for a new class of soundcards. I am not very familiar with how ALSA is tied together under the hood, so what you see here is based on my naive understanding of ALSA. I wear asbestos underwear on a regular basis, so I prefer honesty over sugarcoating :)
I use "I" and "we" interchangeably. By "we" I mean a small R&D group at Cisco Norway; by "I", I mean... well, me. So, we plan for AVB, and I do the kernel-side work. We plan to upstream this, provided the community accepts it.
Also, I've used my private address as that is set up to track kernel-related lists, but added my Cisco address, so please keep that on CC if you reply.
We have recently begun working on Audio Video Bridging (AVB, [1]) and are looking into how this can be added to the Linux kernel via ALSA and video4linux.
But first; for those of you who are not familiar with AVB:
In short, AVB is just a set of open standards governing network and timing configuration so that you can stream audio and video reliably and with low latency. Note that this is not the kind of streaming currently associated with the word (a few companies distributing movies and TV shows come to mind; one rhyming with lightsticks). It is the kind of streaming you use when connecting a pair of speakers to your computer - via ethernet. Or a webcam via the wireless network. (I'm aware of the security implications here, but bear with me.)
For the eager reader, AVB is being promoted by the AVnu Alliance [2]; they have a lot of information available. I also added a link to a very short intro to AVB that Hans gave a few weeks back (focused on the network side) in [3]. Then the IEEE 802.1 working group [4] has a few standards, but these are probably not that relevant to this list, at least not right now.
For AVB to work, you need support in the networking infrastructure. This is not yet prevalent, but it is coming. There are a few manufacturers that provide AVB-ready equipment and some networking gear.
The standards you need for AVB:
* gPTP support (IEEE 802.1AS), an IEEE 1588 (PTP) profile for AVB. This is needed for accurate timestamping of samples, and all nodes in an AVB domain must agree on the _same_ time (note that the _correct_ time is not that important in this setting). .1AS should give you a <1us error between the clocks of the systems involved.
* Stream Reservation (IEEE 802.1Qat, or 802.1Q:2011 Sec. #35) to make sure we have guaranteed bandwidth. This will avoid dropped etherframes due to a congested network. It also caps the amount you can reserve to 75% of total BW, making sure AVB can coexist with normal traffic.
* Traffic Shaping and admission control (IEEE 802.1Qav, or 802.1Q:2011 Sec. #34) to improve utilization but also avoid/minimize jitter due to queues inside switches/routers/bridges.
* IEEE 802.1BA, default configuration for AVB devices and what the network looks like.
* IEEE 1722 (and 1733 for layer-3) Layer 2 Transport for audio/video. The packing is similar to what is done in Firewire. You have 8kHz frame intervals for class A, 4kHz for class B. This gives relatively few samples per frame (a short worked example follows below). Currently we only look at Layer 2, as small peripherals (microphones, speakers) will then only have to implement L2 instead of the entire IP stack.
* IEEE 1722.1 Device discovery protocol (AVDECC) defines how Talkers and Listeners find each other and connect. Any talker will regularly announce its presence, and 1722.1 defines how to announce - and how to respond.
Of all these standards, 802.1BA and 1722 are probably the most interesting ones. AVnu also has a 'best practice' [5] document that gives an outline that serves as a nice starting point.
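To make the class A numbers a bit more concrete, a back-of-the-envelope calculation (48 kHz stereo with 32-bit sample containers is my assumption here, not something the standards mandate):

#include <stdio.h>

int main(void)
{
	/* Class A sends 8000 packets per second, i.e. one every 125 us. */
	const unsigned int rate = 48000;          /* assumed sample rate */
	const unsigned int pkts_per_sec = 8000;
	const unsigned int channels = 2;          /* assumed stereo */
	const unsigned int bytes_per_sample = 4;  /* assumed 32-bit containers */

	unsigned int samples = rate / pkts_per_sec;                   /* 6 */
	unsigned int payload = samples * channels * bytes_per_sample; /* 48 */

	printf("%u samples per channel per packet, %u payload bytes\n",
	       samples, payload);
	return 0;
}

So every etherframe carries only a handful of samples, and header overhead is a sizeable share of each frame.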
Terminology (brief)
- Bridge: Node in the network with more than 1 port (think switches)
- End-station: Node in the network with 1 port.
- Talkers: End-stations that produce media (mic, camera)
- Listeners: End-stations that receive from Talkers
- Streams & Channels: A talker creates a stream through the network to a Listener. Each stream is composed of 1..N channels where each sample is interleaved.
- An end-station can act as both Talker and Listener.
- gPTP domain: set of connected PTP-capable nodes (gPTP will not allow non-time-aware nodes in the domain).
- SRP domain: nodes in a network that support stream reservation.
- AVB domain: intersection of SRP domain and gPTP domain.
To put it in simpler terms, AVB gives you a way to add 'stuff' to your computer and play music to it via the network.
Moving out into ALSA-land and introducing "The plan":
* A central driver, an "avb_core" if you like. Once loaded it will create a configfs directory and start looking at etherframes to see if anything of interest comes along. This will be present from the start and is required for all the rest to work.
* An "avb_media_driver" to split data going to ALSA and v4l as well as combining streams coming back. The easiest way is probably to combine snd_avb and the corresponding v4l driver into a single driver, but expose it as "snd_avb" to ALSA (and ditto for v4l2).
* A userspace tool for tapping into the AVDECC data (for autodiscovery of nodes). Let's call this avdecclib for now (there are a few userspace libraries available on github).
* ConfigFS [6] is then used by userspace to spawn a new avb_media_driver for each stream we want to connect to.
Tree-structure will look something like this after mkdir /config/avb/node0:

config/
└── avb
    └── node0
        ├── channels_in
        ├── channels_out
        ├── enable
        └── mac
(The number of attributes will have to be adjusted as I figure out what makes sense to have in the configfs item.)
Writing 1 to enable will then trigger the negotiating phase and wait for the driver to come online. A new ALSA soundcard will then pop into existence, which can then be used as any regular soundcard attached to the computer.
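A minimal sketch of what the kernel side of that configfs interface could look like (avb_node and friends are placeholder names, and the attribute handlers for channels_in/channels_out/enable/mac are omitted); the real avb_core would hang the negotiation off the attribute writes:

#include <linux/module.h>
#include <linux/configfs.h>
#include <linux/slab.h>
#include <linux/err.h>

struct avb_node {
	struct config_item item;
	/* negotiated stream parameters (mac, channels, ...) would live here */
};

static struct config_item_type avb_node_type = {
	.ct_owner = THIS_MODULE,
	/* .ct_attrs would expose channels_in, channels_out, enable, mac */
};

/* Called by configfs when userspace does mkdir /config/avb/<name>. */
static struct config_item *avb_make_node(struct config_group *group,
					 const char *name)
{
	struct avb_node *node = kzalloc(sizeof(*node), GFP_KERNEL);

	if (!node)
		return ERR_PTR(-ENOMEM);
	config_item_init_type_name(&node->item, name, &avb_node_type);
	return &node->item;
}

static struct configfs_group_operations avb_group_ops = {
	.make_item = avb_make_node,
};

static struct config_item_type avb_root_type = {
	.ct_owner     = THIS_MODULE,
	.ct_group_ops = &avb_group_ops,
};

static struct configfs_subsystem avb_subsys = {
	.su_group = {
		.cg_item = {
			.ci_namebuf = "avb",
			.ci_type    = &avb_root_type,
		},
	},
};

static int __init avb_core_init(void)
{
	config_group_init(&avb_subsys.su_group);
	mutex_init(&avb_subsys.su_mutex);
	return configfs_register_subsystem(&avb_subsys);
}

static void __exit avb_core_exit(void)
{
	configfs_unregister_subsystem(&avb_subsys);
}

module_init(avb_core_init);
module_exit(avb_core_exit);
MODULE_LICENSE("GPL");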
So, an attempt to bring this to life using state-of-the-art ASCII skills:
[Diagram: a media application sits on top of alsalib, v4l2lib and avdecclib in userspace. Below the kernel boundary, alsa core feeds snd_avb/v4l2_avb (the avb_media_driver), with ConfigFS and avb_config alongside; these all sit on top of avb_core, which ties into the time/media_clock infrastructure and the net subsystem.]
So, why in the kernel and not completely in userspace?
Primarily because we would like to make it as easy as possible to create a Talker or a Listener in an AVB domain. Sure, you would need some kind of tool to manage the ConfigFS interface and set up the detailed configuration, but once that is done, _any_ program on a standard GNU/Linux box can use AVB as if it was a regular soundcard. That is a real benefit, and what makes it really exciting.
It is also a bit difficult to associate a physical location with a MAC address. A userspace tool can be configured to remember this, but this is not information that belongs in the kernel. This needs to be persistent anyway, so setting 00:00:A4.. to be "L&R Speaker in Henrik's Den" doesn't really make sense to compile into the kernel.
Then there is the notion of security. If the kernel triggers on every newly discovered device, it is pretty simple to write a Metasploit plugin that will bring any AVB-enabled Linux box to its knees just by flooding the network with Announce messages. Also, I don't necessarily want the stream from my computer to my speakers to be accessed by someone (tm) on my network.
I'd greatly appreciate feedback and comments, especially with regards to the rough outline and the usage of ConfigFS and ioctls.
Stay tuned! Once we have something that doesn't crash and burn in the most horrible sense, I'll submit a few patches for people to look at. If the interest is high, I'll probably create a public repo that I'll update more frequently, but with more of the bleeding-part of the edge.
Thanks!
1) http://en.wikipedia.org/wiki/Audio_Video_Bridging
2) http://www.avnu.org/
3) http://www.slideshare.net/henrikau/avb-v4l2summit
4) http://en.wikipedia.org/wiki/IEEE_802.1
5) http://www.avnu.org/knowledge_center
6) http://events.linuxfoundation.org/sites/events/files/slides/USB%20Gadget%20C...
26.05.2014 19:03, Henrik Austad wrote:
Hi all!
This is an RFC for a new class of soundcards. I am not very familiar with how ALSA is tied together underneath the hood, so what you see here, is based on my naive understanding of ALSA. I wear asbestos underwear on a regular basis, so I prefer honesty over sugarcoating :)
Hello. All of this looks very interesting, but a bit more information is needed in order to put this in context.
First: is this supposed to work with any ethernet card? Or is some special hardware needed on the PC side? If so, which hardware?
Second: I would like to know more about the buffering model. For simplicity, let's consider playback from a PC to a remote receiver. Obviously, as the intention is to create something that looks like a regular ALSA sound card, there should be a circular buffer that holds sound samples (just like the DMA buffer on regular sound cards). There also needs to be "something" that sends samples from this buffer into the network. Is my understanding correct?
- IEEE 1722 (and 1733 for layer-3) Layer 2 Transport for audio/video. The packing is similar to what is done in Firewire. You have 8kHz frame intervals for class A, 4kHz for class B. This gives relatively few samples per frame. Currently we only look at Layer 2 as small peripherals (microphones, speakers) will only have to implement L2 instead of the entire IP stack.
So, are you proposing to create a real-time kernel thread that will wake up 4000 or 8000 times per second in order to turn a few samples from the circular buffer into an Ethernet packet and send it, also advancing the "hardware pointer" in the process? Or do you have an idea how to avoid that rate of wakeups?
On Mon, May 26, 2014 at 10:21:10PM +0600, Alexander E. Patrakov wrote:
26.05.2014 19:03, Henrik Austad wrote:
Hi all!
This is an RFC for a new class of soundcards. I am not very familiar with how ALSA is tied together underneath the hood, so what you see here, is based on my naive understanding of ALSA. I wear asbestos underwear on a regular basis, so I prefer honesty over sugarcoating :)
Hello. All of this looks very interesting, but a bit more information is needed in order to put this in context.
Hi Alexander, thank you for the feedback.
First a disclaimer, I am in no sense an expert in this area, so if something seems fishy, it might just be.
First: is this supposed to work with any ethernet card? Or is some special hardware needed on the PC side? If so, which hardware?
In theory, any NIC should work. That's theory. In practice, it will be a real benefit if the NIC can timestamp etherframes on ingress and egress, and that requires some extra silicon. I've only heard of the I210 (from Intel) that does this, but I may be wrong.
The benefit from doing this is faster clock convergence for the gPTP protocol.
AVB uses the term "clock syntonization", which means that the clocks of 2 entities run at the same frequency. Since achieving this exactly is next to impossible, you need a PLL that can slightly adjust and lock the frequency whenever the GrandMaster updates the PTP time.
If you do not have the capability of doing this, you can fall back to synchronization, which requires some extra care when you correlate the timestamp for a sample with the local media clock.
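Purely for illustration, a toy proportional-integral servo of the kind such a fallback could use to steer the media-clock rate from the observed gPTP offsets; the gains and the interface are made up, and real servos (e.g. the one in linuxptp) are considerably more careful:

struct clock_servo {
	double ratio;      /* frequency adjustment to apply locally, ~1.0 */
	double integral;   /* accumulated error for the I term */
};

/*
 * offset_ns > 0 means the local clock is ahead of the grandmaster;
 * interval_ns is the time since the previous measurement. The returned
 * ratio is what the nominal media-clock rate gets multiplied by.
 */
double servo_update(struct clock_servo *s, double offset_ns, double interval_ns)
{
	const double kp = 0.7, ki = 0.3;   /* arbitrary, illustrative gains */
	double err = offset_ns / interval_ns;

	s->integral += err;
	s->ratio = 1.0 - (kp * err + ki * s->integral);
	return s->ratio;
}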
AVB does place some hard requirements on the network infrastructure though: you need switches capable of SRP/MSRP, gPTP and the queueing enhancements.
Second: I would like to know more about the buffering model. For simplicity, let's consider playback from a PC to a remote receiver.
Sure, I think this is a pretty good scenario for how ALSA would use AVB.
Obviously, as the intention is to create something that looks like a regular ALSA sound card, there should be a circular buffer that holds sound samples (just like the DMA buffer on regular sound cards). There also needs to be "something" that sends samples from this buffer into the network. Is my understanding correct?
Yes, that is pretty much what I've planned. Since we cannot interrupt userspace to fill the buffer all the time, I was planning on adding a ~20ms buffer. If this is enough, I don't know yet.
As stated in the previous mail, I'm no alsa-expert, I expect to learn a lot as I dig into this :)
As to moving samples from the buffer onto the network, one approach would be to wrap a set of samples and place it into a ready frame with headers and bits set and leave it in a buffer for the network layer to pick up.
The exact method here is not clear to me yet, I need to experiment, and probably send something off to the networking guys. But before I do that, I'd like to have a reasonably sane idea of how ALSA should handle this.
I expect this to be rewritten a few times :)
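To make the "ready frame with headers and bits set" idea a bit more tangible, a sketch of such a helper. The header below is abridged and simplified; the real 1722 AVTPDU (plus the 61883-6 CIP header for audio) has more fields and stricter bit packing, so treat it purely as an illustration of the approach, not as the wire format:

#include <linux/types.h>
#include <linux/string.h>
#include <asm/byteorder.h>

/* Abridged, NOT the real IEEE 1722 bit layout - illustration only. */
struct avtp_hdr_sketch {
	u8     subtype;          /* e.g. 61883/IIDC encapsulation */
	u8     flags;            /* sv, version, media clock restart, ... */
	u8     sequence_num;     /* incremented for every frame of the stream */
	u8     reserved;
	__be64 stream_id;
	__be32 avtp_timestamp;   /* gPTP-derived presentation time */
	__be32 gateway_info;
	__be16 stream_data_len;
	__be16 packet_info;
} __packed;

/*
 * Build one ready-to-send payload: header followed by one packet's worth of
 * interleaved samples. The caller then hands the result to the network layer.
 */
static size_t avtp_fill_sketch(void *frame, const void *samples,
			       size_t payload_bytes, u8 seq, u32 ptime)
{
	struct avtp_hdr_sketch *hdr = frame;

	memset(hdr, 0, sizeof(*hdr));
	hdr->sequence_num    = seq;
	hdr->avtp_timestamp  = cpu_to_be32(ptime);
	hdr->stream_data_len = cpu_to_be16((u16)payload_bytes);
	memcpy(hdr + 1, samples, payload_bytes);
	return sizeof(*hdr) + payload_bytes;
}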
- IEEE 1722 (and 1733 for layer-3) Layer 2 Transport for audio/video. The packing is similar to what is done in Firewire. You have 8kHz frame intervals for class A, 4kHz for class B. This gives relatively few samples per frame. Currently we only look at Layer 2 as small peripherals (microphones, speakers) will only have to implement L2 instead of the entire IP stack.
So, are you proposing to create a real-time kernel thread that will wake up 4000 or 8000 times per second in order to turn a few samples from the circular buffer into an Ethernet packet and send it, also advancing the "hardware pointer" in the process? Or do you have an idea how to avoid that rate of wakeups?
I'm hoping to get some help from the NIC's hardware and a DMA engine here, as it would be pretty crazy to do a task wakeup 8k times/sec. Not only would the overhead be high, but if you have a 125us window for filling a buffer, you are going to fail miserably in a GPOS.
For instance, if you can prepare, say, 5ms worth of samples in one go, that would mean you have to prepare 40 frames. If you could then get the NIC and network infrastructure to take those frames and even them out over the next 5 ms, all would be well.
The process of evening out the rate of samples is what traffic shaping and stream reservation will help you do (or enforce, ymmv), to some extent at least. The credit based shaper algorithm is designed to force bursty traffic into a steady stream. How much you can press the queues, I'm not sure. It may very well be that 40 frames is too much.
As you can see, a lot of uncertainties, and a long way to walk.
Thanks for pointing these things out though, it gives me some extra elements to pursue - thanks!
27.05.2014 15:02, Henrik Austad wrote:
On Mon, May 26, 2014 at 10:21:10PM +0600, Alexander E. Patrakov wrote:
26.05.2014 19:03, Henrik Austad wrote:
Hi all!
This is an RFC for a new class of soundcards. I am not very familiar with how ALSA is tied together underneath the hood, so what you see here, is based on my naive understanding of ALSA. I wear asbestos underwear on a regular basis, so I prefer honesty over sugarcoating :)
Hello. All of this looks very interesting, but a bit more information is needed in order to put this in context.
Hi Alexander, thank you for the feedback.
First a disclaimer, I am in no sense an expert in this area, so if something seems fishy, it might just be.
I am not an expert in the kernel part of ALSA, either.
Obviously, as the intention is to create something that looks like a regular ALSA sound card, there should be a circular buffer that holds sound samples (just like the DMA buffer on regular sound cards). There also needs to be "something" that sends samples from this buffer into the network. Is my understanding correct?
Yes, that is pretty much what I've planned. Since we cannot interrupt userspace to fill the buffer all the time, I was planning on adding a ~20ms buffer. If this is enough, I don't know yet.
Actually a sound card with only 20 ms of buffer would be a very strange beast. "Typical sound card" buffers have a 200-2000 ms range. When setting hardware parameters, an ALSA application specifies the desired buffer size (that is, how much they want to survive without getting scheduled) and the period size (i.e. how often they want to be notified that the sound card has played something - in order to supply additional samples). So that "20 ms" buffer size should be client-settable.
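In alsa-lib terms, the client picks these sizes itself, within whatever limits the AVB "card" advertises - a sketch (the device name "hw:CARD=avb0" and the numbers are made up, and error handling is dropped):

#include <alsa/asoundlib.h>

int open_avb_pcm(snd_pcm_t **pcm)
{
	snd_pcm_hw_params_t *hw;
	unsigned int rate = 48000;
	unsigned int buffer_us = 200000;   /* 200 ms total buffer */
	unsigned int period_us = 20000;    /* wake me every 20 ms */

	snd_pcm_open(pcm, "hw:CARD=avb0", SND_PCM_STREAM_PLAYBACK, 0);
	snd_pcm_hw_params_alloca(&hw);
	snd_pcm_hw_params_any(*pcm, hw);
	snd_pcm_hw_params_set_access(*pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
	snd_pcm_hw_params_set_format(*pcm, hw, SND_PCM_FORMAT_S32_LE);
	snd_pcm_hw_params_set_channels(*pcm, hw, 2);
	snd_pcm_hw_params_set_rate_near(*pcm, hw, &rate, NULL);
	snd_pcm_hw_params_set_buffer_time_near(*pcm, hw, &buffer_us, NULL);
	snd_pcm_hw_params_set_period_time_near(*pcm, hw, &period_us, NULL);
	return snd_pcm_hw_params(*pcm, hw);
}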
You also have, in the ideal world, to provide the following:
* An option to disable period wakeups for the application that relies on some other clock source and position queries.
* A method to get the position of the sample currently being played, with good-enough (<= 0.25 ms) precision for the application-level synchronization with other sound cards not sharing the same clock source (via adaptive resampling).
* A method to get the position of the first safe-to-rewrite sample (aka DMA position), for implementing dynamic-latency tricks at the application level (via snd_pcm_rewind).
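From the driver side, those three points roughly translate into advertising the right capability flags and implementing an accurate .pointer callback. A sketch (avb_stream_consumed_bytes() is a hypothetical helper, and the limits are arbitrary):

#include <sound/pcm.h>

/* Hypothetical helper provided by the avb stream code, not an existing API. */
extern size_t avb_stream_consumed_bytes(void *stream);

static const struct snd_pcm_hardware avb_pcm_hw = {
	.info = SNDRV_PCM_INFO_INTERLEAVED |
		SNDRV_PCM_INFO_MMAP |
		SNDRV_PCM_INFO_MMAP_VALID |
		SNDRV_PCM_INFO_NO_PERIOD_WAKEUP,   /* point 1: wakeups optional */
	.formats          = SNDRV_PCM_FMTBIT_S32_LE,
	.rates            = SNDRV_PCM_RATE_48000,
	.rate_min         = 48000,
	.rate_max         = 48000,
	.channels_min     = 1,
	.channels_max     = 8,
	.buffer_bytes_max = 1024 * 1024,
	.period_bytes_min = 64,
	.period_bytes_max = 512 * 1024,
	.periods_min      = 2,
	.periods_max      = 1024,
};

/*
 * Points 2 and 3: .pointer must report, with sub-period accuracy, how far
 * the "hardware" (here: the 1722 transmit path) has consumed the ring
 * buffer since the stream started.
 */
static snd_pcm_uframes_t avb_pcm_pointer(struct snd_pcm_substream *ss)
{
	struct snd_pcm_runtime *runtime = ss->runtime;
	size_t consumed = avb_stream_consumed_bytes(ss->private_data);

	return bytes_to_frames(runtime, consumed) % runtime->buffer_size;
}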
As stated in the previous mail, I'm no alsa-expert, I expect to learn a lot as I dig into this :)
As to moving samples from the buffer onto the network, one approach would be to wrap a set of samples and place it into a ready frame with headers and bits set and leave it in a buffer for the network layer to pick up.
The exact method here is not clear to me yet, I need to experiment, and probably send something off to the networking guys. But before I do that, I'd like to have a reasonably sane idea of how ALSA should handle this.
I expect this to be rewritten a few times :)
I think that snd-pcsp should provide you with some insight on this, possibly even yielding (as a quick hack) a very, very suboptimal (8k interrupts per second) but somewhat-working version, assuming that the arguments for doing this in the kernel are valid. Which is not a given - please talk to the Bluetooth guys about that; they opted for a special socket type + userspace solution in a similar situation.
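For reference, that quick-hack fallback would look roughly like the sketch below: an hrtimer at the 125 us class A interval that ships one packet and advances the position (avb_xmit_one_packet() is a placeholder for "build and queue one 1722 frame"); whether a GPOS keeps up with this wakeup rate reliably is exactly the concern raised above:

#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <sound/pcm.h>

#define AVB_CLASS_A_INTERVAL_NS 125000	/* 8000 packets per second */

struct avb_tx_ctx {
	struct hrtimer timer;
	struct snd_pcm_substream *substream;
	size_t pkt_bytes;	/* ring-buffer bytes consumed per packet */
	size_t period_bytes;	/* bytes per ALSA period */
	size_t in_period;	/* bytes consumed since the last period_elapsed */
};

/* Placeholder for "build one 1722 frame and hand it to the net layer". */
extern void avb_xmit_one_packet(struct avb_tx_ctx *ctx);

static enum hrtimer_restart avb_tx_timer(struct hrtimer *t)
{
	struct avb_tx_ctx *ctx = container_of(t, struct avb_tx_ctx, timer);

	avb_xmit_one_packet(ctx);
	ctx->in_period += ctx->pkt_bytes;
	if (ctx->in_period >= ctx->period_bytes) {
		ctx->in_period -= ctx->period_bytes;
		snd_pcm_period_elapsed(ctx->substream);
	}
	hrtimer_forward_now(t, ns_to_ktime(AVB_CLASS_A_INTERVAL_NS));
	return HRTIMER_RESTART;
}

static void avb_tx_start(struct avb_tx_ctx *ctx)
{
	hrtimer_init(&ctx->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	ctx->timer.function = avb_tx_timer;
	hrtimer_start(&ctx->timer, ns_to_ktime(AVB_CLASS_A_INTERVAL_NS),
		      HRTIMER_MODE_REL);
}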
- IEEE 1722 (and 1733 for layer-3) Layer 2 Transport for audio/video. The packing is similar to what is done in Firewire. You have 8kHz frame intervals for class A, 4kHz for class B. This gives relatively few samples per frame. Currently we only look at Layer 2 as small peripherals (microphones, speakers) will only have to implement L2 instead of the entire IP stack.
So, are you proposing to create a real-time kernel thread that will wake up 4000 or 8000 times per second in order to turn a few samples from the circular buffer into an Ethernet packet and send it, also advancing the "hardware pointer" in the process? Or do you have an idea how to avoid that rate of wakeups?
I'm hoping to get some help from the NIC's hardware and a DMA engine here, as it would be pretty crazy to do a task wakeup 8k times/sec. Not only would the overhead be high, but if you have a 125us window for filling a buffer, you are going to fail miserably in a GPOS.
For instance, if you can prepare, say, 5ms worth of samples in one go, that would mean you have to prepare 40 frames. If you could then get the NIC and network infrastructure to take those frames and even them out over the next 5 ms, all would be well.
Except that on cheap cards, all of this will be software timer-based anyway, and thus will not avoid the 8 kHz interrupt-rate requirement. So maybe we just have to accept this requirement for now at least as a fallback path (especially since even a DNS server at your ISP has more stringent requirements) and add optimizations later.
The process of evening out the rate of samples is what traffic shaping and stream reservation will help you do (or enforce, ymmv), to some extent at least. The credit based shaper algorithm is designed to force bursty traffic into a steady stream. How much you can press the queues, I'm not sure. It may very well be that 40 frames is too much.
Well, yes, because some software (e.g. PulseAudio) sometimes wants to rewind as close to the currently-playing sample as possible. Currently, PulseAudio allows for only 1.3 ms of the safety margin.
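For context, that rewind dance looks like this from the client side (a sketch; real code checks every return value) - the 1.3 ms margin is the part of the buffer treated as already out of reach:

#include <alsa/asoundlib.h>

/* Replace the last 'frames' frames already handed to ALSA with fresher audio. */
void replace_tail(snd_pcm_t *pcm, const void *new_audio, snd_pcm_uframes_t frames)
{
	snd_pcm_sframes_t can = snd_pcm_rewindable(pcm);

	if (can >= 0 && (snd_pcm_uframes_t)can >= frames) {
		snd_pcm_rewind(pcm, frames);            /* take the samples back */
		snd_pcm_writei(pcm, new_audio, frames); /* write the replacements */
	}
}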
On Tue, May 27, 2014 at 06:10:40PM +0600, Alexander E. Patrakov wrote:
27.05.2014 15:02, Henrik Austad wrote:
On Mon, May 26, 2014 at 10:21:10PM +0600, Alexander E. Patrakov wrote:
26.05.2014 19:03, Henrik Austad wrote:
Hi all!
This is an RFC for a new class of soundcards. I am not very familiar with how ALSA is tied together underneath the hood, so what you see here, is based on my naive understanding of ALSA. I wear asbestos underwear on a regular basis, so I prefer honesty over sugarcoating :)
Hello. All of this looks very interesting, but a bit more information is needed in order to put this in context.
Hi Alexander, thank you for the feedback.
First a disclaimer, I am in no sense an expert in this area, so if something seems fishy, it might just be.
I am not an expert in the kernel part of ALSA, either.
Obviously, as the intention is to create something that looks like a regular ALSA sound card, there should be a circular buffer that holds sound samples (just like the DMA buffer on regular sound cards). There also needs to be "something" that sends samples from this buffer into the network. Is my understanding correct?
Yes, that is pretty much what I've planned. Since we cannot interrupt userspace to fill the buffer all the time, I was planning on adding a ~20ms buffer. If this is enough, I don't know yet.
Actually a sound card with only 20 ms of buffer would be a very strange beast. "Typical sound card" buffers have a 200-2000 ms range. When setting hardware parameters, an ALSA application specifies the desired buffer size (that is, how much they want to survive without getting scheduled) and the period size (i.e. how often they want to be notified that the sound card has played something - in order to supply additional samples). So that "20 ms" buffer size should be client-settable.
Ah, true. I just grabbed a size that would give pretty low latency, but yes, you're right, this should be configurable from userspace. Given the nature of AVB, the lower limit should probably be a lot lower than 200ms though. But at this stage, this is just details methinks.
You also have, in the ideal world, to provide the following:
- An option to disable period wakeups for the application that relies on some other clock source and position queries.
Hmm, I see. This makes sense if AVB is not the primary driver of the application.
- A method to get the position of the sample currently being played, with good-enough (<= 0.25 ms) precision for the application-level synchronization with other sound cards not sharing the same clock source (via adaptive resampling).
- A method to get the position of the first safe-to-rewrite sample (aka DMA position), for implementing dynamic-latency tricks at the application level (via snd_pcm_rewind).
All of these are good points, but I'm not sure if this is what I'll start working on right now. I've added them to the list of "stuff to remember once we get going". I fear the size of that list... :)
As stated in the previous mail, I'm no alsa-expert, I expect to learn a lot as I dig into this :)
As to moving samples from the buffer onto the network, one approach would be to wrap a set of samples and place it into a ready frame with headers and bits set and leave it in a buffer for the network layer to pick up.
The exact method here is not clear to me yet, I need to experiment, and probably send something off to the networking guys. But before I do that, I'd like to have a reasonably sane idea of how ALSA should handle this.
I expect this to be rewritten a few times :)
I think that snd-pcsp should provide you some insight on this, possibly even yielding (as a quick hack) a very very suboptimal (8k interrupts per second) but somewhat-working version, assuming that the arguments for doing this in the kernel are valid. Which is not a given - please talk to BlueTooth guys about that, they opted for a special socket type + userspace solution in a similar situation.
Thanks! That is a nice place to start looking, I'll do that.
- IEEE 1722 (and 1733 for layer-3) Layer 2 Transport for audio/video. The packing is similar to what is done in Firewire. You have 8kHz frame intervals for class A, 4kHz for class B. This gives relatively few samples per frame. Currently we only look at Layer 2 as small peripherals (microphones, speakers) will only have to implement L2 instead of the entire IP stack.
So, are you proposing to create a real-time kernel thread that will wake up 4000 or 8000 times per second in order to turn a few samples from the circular buffer into an Ethernet packet and send it, also advancing the "hardware pointer" in the process? Or do you have an idea how to avoid that rate of wakeups?
I'm hoping to get some help from the NIC's hardware and a DMA engine here, as it would be pretty crazy to do a task wakeup 8k times/sec. Not only would the overhead be high, but if you have a 125us window for filling a buffer, you are going to fail miserably in a GPOS.
For instance, if you can prepare, say, 5ms worth of samples in one go, that would mean you have to prepare 40 frames. If you could then get the NIC and network infrastructure to take those frames and even them out over the next 5 ms, all would be well.
Except that on cheap cards, all of this will be software timer-based anyway, and thus will not avoid the 8 kHz interrupt-rate requirement. So maybe we just have to accept this requirement for now at least as a fallback path (especially since even a DNS server at your ISP has more stringent requirements) and add optimizations later.
Well, there's a world of difference between the cheapest, low-end soundcards and those intended for the professional market. Since AVB "moves" the soundcard out of the computer, placing at least -some- demand on the NIC does not seem that far-fetched.
The process of evening out the rate of samples is what traffic shaping and stream reservation will help you do (or enforce, ymmv), to some extent at least. The credit based shaper algorithm is designed to force bursty traffic into a steady stream. How much you can press the queues, I'm not sure. It may very well be that 40 frames is too much.
Well, yes, because some software (e.g. PulseAudio) sometimes wants to rewind as close to the currently-playing sample as possible. Currently, PulseAudio allows for only 1.3 ms of the safety margin.
So PulseAudio requires some extra buffering so that they can alter the samples already given to ALSA? Or does it mean that you can only safely move the next 1.3ms of audio to the soundcard at any given time?
Henrik Austad wrote:
[...] As to moving samples from the buffer onto the network, one approach would be to wrap a set of samples and place it into a ready frame with headers and bits set and leave it in a buffer for the network layer to pick up.
The exact method here is not clear to me yet, I need to experiment, and probably send something off to the networking guys. But before I do that, I'd like to have a reasonably sane idea of how ALSA should handle this.
ALSA expects that the sound card hardware fetches samples whenever it needs them.
For USB and FireWire, there is a short queue of packets; the driver appends new packets whenever a bunch of older packets has been completed (as reported by an interrupt).
(This queue is separate from the ALSA ring buffer, which is then never accessed directly by hardware.)
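In miniature, that model is something like the sketch below; hw_submit_packet() and ring_buffer_at() stand in for whatever the bus-specific driver actually calls:

#include <stddef.h>

struct pkt_queue {
	unsigned int in_flight;   /* packets currently queued towards the hardware */
	unsigned int depth;       /* how many packets we try to keep queued */
	size_t ring_pos;          /* read position in the ALSA ring buffer */
};

/* Placeholders for the bus-specific submission API. */
int hw_submit_packet(const void *payload, size_t bytes);
const void *ring_buffer_at(size_t pos);

static void refill_queue(struct pkt_queue *q, size_t pkt_bytes)
{
	while (q->in_flight < q->depth) {
		hw_submit_packet(ring_buffer_at(q->ring_pos), pkt_bytes);
		q->ring_pos += pkt_bytes;   /* ring wrap-around omitted */
		q->in_flight++;
	}
}

/* Completion handler: one packet finished, top the queue back up and let
 * ALSA know the reported position advanced (snd_pcm_period_elapsed() etc.). */
static void on_packet_complete(struct pkt_queue *q, size_t pkt_bytes)
{
	q->in_flight--;
	refill_queue(q, pkt_bytes);
}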
The process of evening out the rate of samples is what traffic shaping and stream reservation will help you do (or enforce, ymmv), to some extent at least. The credit based shaper algorithm is designed to force bursty traffic into a steady stream.
In the case of USB and FireWire, the hardware already knows to send isochronous packets at a rate of 8 kHz.
A 'normal' NIC wouldn't be able to do this. Are there NICs that have a separate queue for isochronous packets? Or how else can this be handled?
Regards, Clemens
On Tue, May 27, 2014 at 03:47:40PM +0200, Clemens Ladisch wrote:
Henrik Austad wrote:
[...] As to moving samples from the buffer onto the network, one approach would be to wrap a set of samples and place it into a ready frame with headers and bits set and leave it in a buffer for the network layer to pick up.
The exact method here is not clear to me yet, I need to experiment, and probably send something off to the networking guys. But before I do that, I'd like to have a reasonably sane idea of how ALSA should handle this.
ALSA expects that the sound card hardware fetches samples whenever it needs them.
Right, that's what I thought. Is it correct to assume that _all_ soundcards do this? I.e. no polled memory ops here, only DMA?
For USB and FireWire, there is a short queue of packets; the driver appends new packets whenever a bunch of older packets has been completed (as reported by an interrupt).
Yes, that is what I thought was happening. I was then hoping to do something similar with AVB, just with the networking part instead. So the net subsystem would act as a hardware device to ALSA and provide a wakeup to the snd_media_driver once it is done.
On a regular PCI soundcard, I had the impression that it would also fetch the samples whenever it needed them (you only mention USB and Firewire). Is this correct, or is PCI a whole different ballpark?
(This queue is separate from the ALSA ring buffer, which is then never accessed directly by hardware.)
Ah, so userspace places samples in a buffer via alsalib, and snd_<whatever> then moves the samples from that buffer into another buffer which the hardware can access directly?
The process of evening out the rate of samples is what traffic shaping and stream reservation will help you do (or enforce, ymmv), to some extent at least. The credit based shaper algorithm is designed to force bursty traffic into a steady stream.
In the case of USB and FireWire, the hardware already knows to send isochronous packets at a rate of 8 kHz.
Yes, that is true.
A 'normal' NIC wouldn't be able to do this. Are there NICs that have a separate queue for isochronous packets? Or how else can this be handled?
As I said in another email, I've only found i210 with support for AVB at the moment (sorry for the rather intense Intel plugfest this turned into).
From the datasheet [1]:
""" The I210 implements 4 receive queues and 4 transmit queues, where up to two queues are dedicated for stream reservation or priority, and up to three queues for strict priority. In Qav mode, the MAC flow control is disabled. Note that Qav mode is supported only in 100 Mb/s and 1000 Mb/s. Furthermore, Qav is supported only in full-duplex mode with no option for Jumbo packets transmission. """
It goes on further down (In sec 7.2.7.5 if you're interested) to say: """ A queue is eligible for arbitrations only if it has descriptors pointing to at least a single packet in host memory. For SR queues with the time based element enabled a queue is only eligible for arbitration if the fetch time of the up coming packet has been reached. """
So I interpret this as saying that if you create a set of frames ready for transmission and assign them to a reserved, prioritized stream queue, the NIC itself will take care of fetching the data.
Whether this is a correct interpretation, I don't know, but I think it is a fair assessment that you can get support from the hardware to do this. It also means that avb_media_driver needs some awareness of the actual network hardware. This could get somewhat messy.
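If that reading is right, the handoff from avb_media_driver to the NIC could be as small as "here is a finished frame, the SR queue to use, and the time it may leave". A purely hypothetical sketch of that interface (nothing like it exists as a generic kernel API today):

#include <linux/types.h>
#include <linux/netdevice.h>

/* Hypothetical handoff structure, not an existing kernel interface. */
struct avb_tx_frame {
	void   *data;          /* fully built 1722 frame, headers included */
	size_t  len;
	u64     launch_time;   /* gPTP time at which the NIC may fetch/send it */
	u8      sr_queue;      /* which stream-reservation queue to use */
};

/* Hypothetical: queue the frame; the NIC pulls it when launch_time is due. */
int avb_nic_queue_frame(struct net_device *dev, struct avb_tx_frame *frame);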
1) http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/i210-...
Henrik Austad wrote:
On Tue, May 27, 2014 at 03:47:40PM +0200, Clemens Ladisch wrote:
Henrik Austad wrote:
[...] As to moving samples from the buffer onto the network, one approach would be to wrap a set of samples and place it into a ready frame with headers and bits set and leave it in a buffer for the network layer to pick up.
The exact method here is not clear to me yet, I need to experiment, and probably send something off to the networking guys. But before I do that, I'd like to have a reasonably sane idea of how ALSA should handle this.
ALSA expects that the sound card hardware fetches samples whenever it needs them.
Right, that's what I thought. Is it correct to assume that _all_ soundcards do this? I.e. no polled memory ops here, only DMA?
All _real_ sound cards use DMA. As for the rest, I don't want to talk about them. ;-)
For USB and FireWire, there is a short queue of packets; the driver appends new packets whenever a bunch of older packets has been completed (as reported by an interrupt).
Yes, that is what I thought was happening. I was then hoping to do something similar with AVB, just with the networking part instead. So the net subsystem would act as a hardware device to ALSA and provide a wakeup to the snd_media_driver once it is done.
On a regular PCI soundcard, I had the impression that it would also fetch the samples whenever it needed them (you only mention USB and Firewire). Is this correct, or is PCI a whole different ballpark?
I mentioned USB and FireWire because these buses require that samples are sent wrapped inside packets, which implies that the hardware cannot access the samples in ALSA's ring buffer directly. (Actually, this would be possible with flexible enough scatter/gather support, but this has not been implemented yet.)
Regular PCI sound cards typically get told the location of the ring buffer in memory, and then do everything by themselves. (The driver then does not need to do anything, except reporting the current position in the buffer to userspace. This is where disabling period wakeups would make sense.)
The process of evening out the rate of samples is what traffic shaping and stream reservation will help you do (or enforce, ymmv), to some extent at least. The credit based shaper algorithm is designed to force bursty traffic into a steady stream.
In the case of USB and FireWire, the hardware already knows to send isochronous packets at a rate of 8 kHz.
Yes, that is true.
A 'normal' NIC wouldn't be able to do this. Are there NICs that have a separate queue for isochronous packets? Or how else can this be handled?
As I said in another email, I've only found i210 with support for AVB at the moment.
Wikipedia also mentions XMOS and Marvell 88E8059.
""" For SR queues with the time based element enabled a queue is only eligible for arbitration if the fetch time of the up coming packet has been reached. """
This is exactly what I meant.
It also means that avb_media_driver needs to have some awareness over actual network hardware.
Implementing AVB (802.1Qav) is not possible without hardware support, so of course this needs some new interface to the hardware driver(s).
Hmmm, what about https://github.com/AVnu/Open-AVB?
Regards, Clemens
At Mon, 26 May 2014 15:03:52 +0200, Henrik Austad wrote:
Hi all!
This is an RFC for a new class of soundcards. I am not very familiar with how ALSA is tied together underneath the hood, so what you see here, is based on my naive understanding of ALSA. I wear asbestos underwear on a regular basis, so I prefer honesty over sugarcoating :)
[...]
This reminds me of the talk Pierre gave in LPC at San Diego a couple of years ago. Although his topic was more about the audio time accounting, the framework mentioned at that time would fit with this scenario?
Takashi
This reminds me of the talk Pierre gave in LPC at San Diego a couple of years ago. Although his topic was more about the audio time accounting, the framework mentioned at that time would fit with this scenario?
Yes, it is related, but the overall architecture on a first pass of reading seems different: the ideas we presented were more along the lines of letting every subsystem provide an accurate accounting of time and having some userspace parts see and compensate for the differences between system, network, audio and video clocks. Very interesting topic and RFC, thanks for posting this. -Pierre
On Tue, May 27, 2014 at 11:55:26AM -0500, Pierre-Louis Bossart wrote:
This reminds me of the talk Pierre gave in LPC at San Diego a couple of years ago. Although his topic was more about the audio time accounting, the framework mentioned at that time would fit with this scenario?
Yes, it is related, but the overall architecture on a first pass of reading seems different: the ideas we presented were more along the lines of letting every subsystem provide an accurate accounting of time and having some userspace parts see and compensate for the differences between system, network, audio and video clocks.
Any documents/talks available for this? I found "Audio/system time alignment" from Plumbers 2012 (link at the bottom), is that the one?
I see that this also covers AVB, but places everything AVB-related in userspace and then lets the application tie everything together. What was the design-rationale for this?
To turn it around, our idea of placing this in the kernel was:
- easier to integrate with v4l
- a single interface (ALSA) for userspace to play audio
- easy access to NIC internals to ship off frames and whatnot
which we found to be a pretty neat solution.
Did we miss something crucial?
very interesting topic and RFC, thanks for posting this.
Thanks for providing feedback! :)
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/11/2012-lpc-au...
On 5/28/14, 4:43 AM, Henrik Austad wrote:
On Tue, May 27, 2014 at 11:55:26AM -0500, Pierre-Louis Bossart wrote:
This reminds me of the talk Pierre gave in LPC at San Diego a couple of years ago. Although his topic was more about the audio time accounting, the framework mentioned at that time would fit with this scenario?
Yes, it is related, but the overall architecture on a first pass of reading seems different: the ideas we presented were more along the lines of letting every subsystem provide an accurate accounting of time and having some userspace parts see and compensate for the differences between system, network, audio and video clocks.
Any documents/talks available for this? I found "Audio/system time alignment" from Plumbers 2012 (link at the bottom), is that the one?
I see that this also covers AVB, but places everything AVB-related in userspace and then lets the application tie everything together. What was the design-rationale for this?
To turn it around; our idea of placing this in the kernel was
- easier to integrate with v4l
- single interface (ALSA) for userspace to play audio
- easy access to NIC internals to ship off frames and whatnot.
what we found to be a pretty neat solution
There are cases where the clock estimation is done in userspace; I sort of recall that linuxptp does this, for example. So if you want any sort of alignment/correlation between the network and local audio clocks, it needs to be done where the information is available. And if you want to do any compensation on the audio data, that processing also belongs in userspace or DSP firmware, not in the kernel.
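A userspace sketch of that correlation (how the gptp_ns/audio_frames pairs are sampled - linuxptp, driver ioctls, snd_pcm_status - is deliberately left out here); the resulting ratio is what an adaptive resampler would chase:

#include <stdint.h>

struct clock_pair {
	uint64_t gptp_ns;       /* network (gPTP) time at the observation */
	uint64_t audio_frames;  /* frames the audio device reports as played */
};

/*
 * Ratio of the audio clock to the network clock over the window [a, b];
 * 1.0 means they run at the same rate, anything else is drift that an
 * adaptive resampler (or an ASRC in DSP firmware) would compensate for.
 */
double drift_ratio(const struct clock_pair *a, const struct clock_pair *b,
                   unsigned int nominal_rate)
{
	double frames  = (double)(b->audio_frames - a->audio_frames);
	double seconds = (double)(b->gptp_ns - a->gptp_ns) / 1e9;

	return frames / (seconds * nominal_rate);
}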
Did we miss something crucial?
very interesting topic and RFC, thanks for posting this.
Thanks for providing feedback! :)
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/11/2012-lpc-au...
participants (5)
- Alexander E. Patrakov
- Clemens Ladisch
- Henrik Austad
- Pierre-Louis Bossart
- Takashi Iwai