On Sat, Jun 18, 2016 at 02:22:13PM +0900, Takashi Sakamoto wrote:
Hi,
Hi Takashi,
You raise a lot of valid points and questions, I'll try to answer them.
edit: this turned out to be a somewhat lengthy answer. I have tried to shorten it where I could. It is getting late and I'm getting increasingly incoherent (Richard probably knows what I'm talking about ;) so I'll stop for now.
Please post a follow-up with everything that's not clear! Thanks!
Sorry to be late. In this weekday, I have little time for this thread because working for alsa-lib[1]. Besides, I'm not full-time developer for this kind of work. In short, I use my limited private time for this discussion.
Thank you for taking the time to reply to this thread then, it is much appreciated.
On Jun 15 2016 17:06, Richard Cochran wrote:
On Wed, Jun 15, 2016 at 12:15:24PM +0900, Takashi Sakamoto wrote:
On Mon, Jun 13, 2016 at 01:47:13PM +0200, Richard Cochran wrote:
I have seen audio PLL/multiplier chips that will take, for example, a 10 kHz input and produce your 48 kHz media clock. With the right HW design, you can tell your PTP Hardware Clock to produce a 10000 PPS, and you will have a synchronized AVB endpoint. The software is all there already. Somebody should tell the ALSA guys about it.
Just from my curiosity, could I ask you more explanation for it in ALSA side?
(Disclaimer: I really don't know too much about ALSA, except that it is fairly big and complex ;)
In this morning, I read IEEE 1722:2011 and realized that it quite roughly refers to IEC 61883-1/6 and includes much ambiguities to end applications.
As far as I know, 1722 aims to describe how the data is wrapped in AVTPDU (and likewise for control-data), not how the end-station should implement it.
If there are ambiguities, would you mind listing a few? It would serve as a useful guide when looking for other pitfalls as well (thanks!)
(In my opinion, the author just focuses on packet with timestamps, without enough considering about how to implement endpoint applications which perform semi-real sampling, fetching and queueing and so on, so as you. They're satisfied just by handling packet with timestamp, without enough consideration about actual hardware/software applications.)
You are correct, none of the standards explain exactly how it should be implemented, only what the end result should look like. One target of this collection of standards is embedded, dedicated AV equipment, and the authors have no way of knowing (nor should they care, I think) the underlying architecture of these devices.
Here is what I think ALSA should provide:
- The DA and AD clocks should appear as attributes of the HW device.
This would be very useful when determining whether the HW clock is falling behind or racing ahead of the gPTP time domain. It would also help in finding the capture time, or in calculating when a sample in the buffer will be played back by the device.
- There should be a method for measuring the DA/AD clock rate with respect to both the system time and the PTP Hardware Clock (PHC) time.
as above.
- There should be a method for adjusting the DA/AD clock rate if possible. If not, then ALSA should fall back to sample rate conversion.
This is not a requirement from the standard, but will help avoid costly resampling. At least it should be possible to detect the *need* for resampling so that we can try to avoid underruns.
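To make the "detect the need for resampling" part a bit more concrete, here is a minimal sketch in Python. It assumes we can somehow read two (frame counter, PHC time) pairs from the device; the function names and the 1 ppm threshold are made up for illustration and are not an existing ALSA interface.

```python
def rate_offset_ppm(frames_start, frames_end, phc_start_ns, phc_end_ns,
                    nominal_rate_hz):
    """Offset of the measured sample clock from nominal, in parts per million.

    Positive means the DA/AD clock runs fast relative to the PHC (gPTP) time.
    """
    elapsed_ns = phc_end_ns - phc_start_ns
    if elapsed_ns <= 0:
        raise ValueError("PHC timestamps must be strictly increasing")
    measured_rate_hz = (frames_end - frames_start) * 1e9 / elapsed_ns
    return (measured_rate_hz - nominal_rate_hz) / nominal_rate_hz * 1e6


def needs_correction(offset_ppm, threshold_ppm=1.0):
    """Hypothetical policy: act (adjust the clock or resample) above a threshold."""
    return abs(offset_ppm) > threshold_ppm
```

For example, a device that delivered 48048 frames during one second of PHC time is running roughly 1000 ppm fast, and would have to be slowed down or resampled.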
There should be a method to determine the time delay from the point when the audio data are enqueued into ALSA until they pass through the D/A converter. If this cannot be known precisely, then the library should provide an estimate with an error bound.
I think some AVB use cases will need to know the time delay from A/D until the data are available to the local application. (Distributed microphones? I'm not too sure about that.)
yes, if you have multiple microphones that you want to combine into a stream and do signal processing, some cases require sample-sync (so within 1 us accuracy for 48kHz).
If the DA/AD clocks are connected to other clock devices in HW, there should be a way to find this out in SW. For example, if SW can see the PTP-PHC-PLL-DA relationship from the above example, then it knows how to synchronize the DA clock using the network.
[ Implementing this point involves other subsystems beyond ALSA. It isn't really necessary for people designing AVB systems, since they know their designs, but it would be nice to have for writing generic applications that can deal with any kind of HW setup. ]
Depends on which subsystem decides "AVTP presentation time"[3].
Presentation time is set by one of:
a) Local sound card performing capture (in which case it will be 'capture time')
b) Local media application sending a stream across the network (time when the sample should be played out remotely)
c) Remote media application streaming data *to* host, in which case it will be local presentation time on local soundcard
This value is dominant to the number of events included in an IEC 61883-1 packet. If this TSN subsystem decides it, most of these items don't need to be in ALSA.
Not sure if I understand this correctly.
TSN should have a reference to the timing-domain of each *local* sound-device (for local capture or playback) as well as the shared time-reference provided by gPTP.
Unless an End-station acts as GrandMaster for the gPTP-domain, time set forth by gPTP is immutable and cannot be adjusted. It follows that the sample-frequency of the local audio-devices must be adjusted, or the audio-streams to/from said devices must be resampled.
As long as I know, the number of AVTPDU per second seems not to be fixed. So each application is not allowed to calculate the timestamp by its own way unless TSN implementation gives the information to each applications.
Before initiating a stream, an application needs to reserve a path and bandwidth through the network. Every bridge (switch/router) must accept this for the stream-allocation to succeed. If a single bridge along the way declines, the entire stream is denied. The StreamID combined with traffic class and destination address is used to uniquely identify the stream.
Once ready, frames leaving the End-station with the same StreamID will be forwarded through the bridges to the End-station(s).
If you choose to transmit *less* than the bandwidth you reserved, that is fine, but you cannot transmit *more*.
As to timestamps: when a Talker transmits a frame, the timestamp in the AVTPDU describes the presentation-time.
1) If the Talker is a mic, the timestamp will be the capture-time of the sample.
2) For a Listener, the timestamp will be the presentation-time, the time when the *first* sample in the sample-set should be played (or aligned in an offline format with other samples).
The application should be part of the same gPTP-domain as all the other nodes in the domain, and all the nodes share a common sense of time. That means that time X will be the exact same time (or, within a sub-microsecond error) for all the nodes in the same domain.
For your information, in current ALSA implementation of IEC 61883-1/6 on IEEE 1394 bus, the presentation timestamp is decided in ALSA side. The number of isochronous packet transmitted per second is fixed by 8,000 in IEEE 1394, and the number of data blocks in an IEC 61883-1 packet is deterministic according to 'sampling transfer frequency' in IEC 61883-6 and isochronous cycle count passed from Linux FireWire subsystem.
For an audio-stream, it will be very similar. The difference is the split between class A and class B: the former has an 8kHz frame-rate and a guaranteed 2ms latency across the network (think required buffering at end-stations), while class B is 4kHz with a 50ms max latency. Class B is used for links traversing 1 or 2 wireless links.
If you look at the avb-shim in the series, you see that for 48kHz, 2ch, S16_LE, every frame is of the same size, 6 samples per frame, total of 24 bytes / frame. For class B, size doubles to 48 bytes as it transmits frames 4000 times / sec.
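The arithmetic behind those numbers, as a small Python sketch. This is not the code from the avb-shim, just the same calculation written out; the function and table names are mine.

```python
# Frames transmitted per second for each SR class (from the text above).
FRAMES_PER_SEC = {"A": 8000, "B": 4000}


def payload_per_frame(rate_hz, channels, bytes_per_sample, sr_class):
    """Return (samples per channel per frame, payload bytes per frame).

    Only valid when the sample rate divides evenly by the frame rate,
    e.g. 48 kHz. 44.1 kHz needs a varying number of samples per frame.
    """
    samples, remainder = divmod(rate_hz, FRAMES_PER_SEC[sr_class])
    if remainder:
        raise ValueError("sample rate is not a multiple of the frame rate")
    return samples, samples * channels * bytes_per_sample


print(payload_per_frame(48000, 2, 2, "A"))  # (6, 24)
print(payload_per_frame(48000, 2, 2, "B"))  # (12, 48)
```

S16_LE is 2 bytes per sample, so 6 samples x 2 channels x 2 bytes gives the 24 bytes for class A.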
The 44.1 part is a bit more painful/messy/horrible, but is doable because the stream-reservation only gives an *upper* bound of bandwidth.
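To illustrate why an upper bound is enough: 44100 / 8000 = 5.5125, so a class A stream has to alternate between 5 and 6 samples per frame, and the reservation must cover the 6-sample case. The Python sketch below shows one possible way to spread the samples using integer arithmetic. It is an assumption for illustration only, not what the avb-shim does nor what 1722 / IEC 61883-6 mandate.

```python
def samples_in_frame(frame_index, rate_hz=44100, frames_per_sec=8000):
    """Number of samples carried by frame number frame_index (0-based).

    Uses cumulative integer division so that rounding errors never
    accumulate: after frames_per_sec frames, exactly rate_hz samples
    have been sent.
    """
    sent_after = (frame_index + 1) * rate_hz // frames_per_sec
    sent_before = frame_index * rate_hz // frames_per_sec
    return sent_after - sent_before


one_second = [samples_in_frame(i) for i in range(8000)]
# Every frame carries either 5 or 6 samples, and one second of frames
# adds up to exactly 44100 samples.
assert set(one_second) == {5, 6}
assert sum(one_second) == 44100
```

The bandwidth you reserve would then be sized for 6 samples per frame, and the frames carrying only 5 simply use less than what was reserved.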
In the TSN subsystem, like FireWire subsystem, callback for filling payload should have information of 'when the packet is scheduled to be transmitted'.
[ Given that you are part of a gPTP domain and that you share a common sense of what time it is *now* with all the other devices ]
A frame should be transmitted so that it will not arrive too late for it to be presented. A class A link guarantees that a frame will be delivered within 2ms. Then, by looking at the timestamp, you subtract the delivery-time and you get when the frame should be sent at the latest.
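Written out as a Python sketch, with all times in nanoseconds of gPTP time. The names are hypothetical, and the sketch ignores local queueing delay as well as the wrap-around of the timestamp field in the AVTPDU.

```python
# Worst-case delivery time through the network per SR class, in ns
# (2 ms for class A, 50 ms for class B, as described above).
MAX_TRANSIT_NS = {"A": 2_000_000, "B": 50_000_000}


def latest_tx_time_ns(presentation_time_ns, sr_class):
    """Latest gPTP time at which a frame can leave the Talker and still
    be delivered before its presentation time."""
    return presentation_time_ns - MAX_TRANSIT_NS[sr_class]


def is_too_late(now_ns, presentation_time_ns, sr_class):
    """True if sending the frame now can no longer guarantee delivery in time."""
    return now_ns > latest_tx_time_ns(presentation_time_ns, sr_class)
```

So a class A frame that should be presented at t = 10 ms has to be on the wire no later than t = 8 ms.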
With the information, each application can calculate the number of event in the packet and presentation timestamp. Of cource, this timestamp should be handled as 'avtp_timestamp' in packet queueing.
Not sure if I understand what you are asking, but I think maybe I've answered this above (re. 48kHz, 44.1kHz and the upper bound of the frame size?)
In ALSA, sampling rate conversion should be in userspace, not in kernel land. In alsa-lib, sampling rate conversion is implemented in shared object. When userspace applications start playbacking/capturing, depending on PCM node to access, these applications load the shared object and convert PCM frames from buffer in userspace to mmapped DMA-buffer, then commit them.
The AVB use case places an additional requirement on the rate conversion. You will need to adjust the frequency on the fly, as the stream is playing. I would guess that ALSA doesn't have that option?
In ALSA kernel/userspace interfaces , the specification cannot be supported, at all.
Please explain about this requirement, where it comes from, which specification and clause describe it (802.1AS or 802.1Q?). As long as I read IEEE 1722, I cannot find such a requirement.
1722 only describes how the L2 frames are constructed and transmitted. You are correct that it does not mention adjustable clocks there.
- 802.1BA gives an overview of AVB
- 802.1Q-2011 Sec 34 and 35 describes forwarding and queueing and Stream Reservation (basically what the network needs in order to correctly prioritize TSN streams)
- 802.1AS-2011 (gPTP) describes the timing in great detail (from a PTP point of view) and describes in more detail how the clocks should be syntonized (802.1AS-2011, 7.3.3).
Since the clock that drives the sample-rate for the DA/AD must be controlled by the shared clock, the fact that gPTP can adjust the time means that the DA/AD circuit needs to be adjustable as well.
Note that an adjustable sample-clock is not a *requirement*, but in general you'd want to avoid resampling in software.
(When considering about actual hardware codecs, on-board serial bus such as Inter-IC Sound, corresponding controller, immediate change of sampling rate is something imaginary for semi-realtime applications. And the idea has no meaning for typical playback/capture softwares.)
Yes, and no. When you play back a stored file to your soundcard, data is pulled by the card from memory, so you only have a single timing-domain to worry about. So I'd say the idea has meaning in normal scenarios as well, it is just that you don't have to worry about it there.
When you send a stream across the network, you cannot let the Listener pull data from you. You have to have some common sense of time in order to send just enough data, and that is why the gPTP domain is so important.
802.1Q gives you low latency through the network, but more importantly, no dropped frames. gPTP gives you a central reference to time.
[1] [alsa-lib][PATCH 0/9 v3] ctl: add APIs for control element set
    http://mailman.alsa-project.org/pipermail/alsa-devel/2016-June/109274.html
[2] IEEE 1722-2011
    http://ieeexplore.ieee.org/servlet/opac?punumber=5764873
[3] 5.5 Timing and Synchronization, op. cit.
[4] 1394 Open Host Controller Interface Specification
    http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f...
I hope this cleared up some of the questions.