[alsa-devel] Questions about virtual ALSA driver (dummy), PortAudio and full-duplex drops (full-duplex: latency.c)
sd at imi.aau.dk
Fri Sep 13 08:23:15 CEST 2013
Since the PortAudio full-duplex problem I have depends on both the capture and the playback direction, I thought I'd also look into full-duplex, at least from the perspective of ALSA, and post a writeup on that. This is again done by comparing the `hda_intel` and `dummy` ALSA drivers, but run under a different set of userspace programs. Even though I'll have specific questions, please let me know if anything strikes you as wrong in this email; and as always, apologies for the verbosity.
First of all, I wondered if there is a "plain ALSA" program that would be an equivalent of PortAudio's `patest(_duplex)_wire.c` and could be used to do a full-duplex test. I would never have guessed it by the name alone, but it seems there is - in the form of `alsa-lib`'s `test/latency.c`. I had quite a hard time figuring out exactly how this `latency.c` is supposed to behave - so after playing with it a bit, I took the liberty of posting a page about it on the wiki:
... which also includes some valuable snippets from `alsa-devel` discussions - I hope that's OK. While the snippets explain a lot (such as the latency being visible as the playback stream running a number of frames ahead of the capture one), what confuses me is this:
* I specify `-m 128 -M 128` as arguments, which forces latency=128 (frames) in the code
* The program confirms that, saying: "Trying latency 128 frames ..."
* Before the read/write loop, `writebuf(phandle, buffer, latency ...)` is called **twice**, and yet:
** The very first `appl_ptr` obtained from a playback .pointer call is 128 (in most of my captures)
** At the "Success" end, we may have Playback: *** frames = 44288 *** and Capture: *** frames = 44160 *** (as in the output under #Usage on the wiki page); and their difference is 44288-44160 = 128
So my question is: how is it possible to write 128 frames *twice*, and yet still get both the first `appl_ptr` and the p/c stream difference as 128? Shouldn't it be 2*128 = 256, if we started by writing 128 frames twice?
Then, I wondered if there is an ALSA equivalent to PortAudio's use of a single "wire" callback to specify "proper"/"hard" full-duplex operation. To me, it seems to be the function `snd_pcm_link`, for which the docs say: "The two PCMs will start/stop/prepare in sync". Is this what defines proper full-duplex operation - that the playback and capture streams run in sync? Confirmation of this seems to come from portaudio-v19/src/hostapi/alsa/pa_linux_alsa.c:
    /* this will cause the two streams to automatically start/stop/prepare in sync.
     * We only need to execute these operations on one of the pair.
     * A: We don't want to do this on a blocking stream.
     */
    if( self->callbackMode && self->capture.pcm && self->playback.pcm )
    {
        int err = snd_pcm_link( self->capture.pcm, self->playback.pcm );
        if( err == 0 )
            self->pcmsSynced = 1;
        else
            PA_DEBUG(( "%s: Unable to sync pcms: %s\n", __FUNCTION__, snd_strerror( err ) ));
    }
... where we have the only use of `snd_pcm_link`, taken only when the capture and playback streams both exist in the same context (and callbackMode is set) - which, I gather, is only possible when a single "wire" callback is used; is this correct? But then, why wouldn't we want to do that on a blocking stream? Also, would that "blocking" refer only to actually blocking file descriptors - or can it also be understood to include polling waits?
Anyway, I have again posted the scripts for, and debug log acquisitions + visualisations of, the `latency` tests in:
There is a modified version of `latency.c` used there, called `latency-mod.c`; the drivers are supposed to have `trace_printk`s as per `latency-drv.patch`; the related scripts are `run-alsa-lattest.sh`, `traceFGLatLogfile2Csv.py`, `traceFGLatLogGraph.gp` and `lat-anim.pl`. The debug logs are in `captures-alsa-captlat.tar.xz`, and their PDF plots are in the `_cappics05/` subfolder. Please see the end of the updated `Readme`, in the same directory, for more information about these.
To begin with, I tried acquiring the operation of `hda-intel` under `latency-mod.c` with various parameters, to see how it would behave - these parameters are encoded in the respective file/directory names. First, here is the `hda-intel` driver (and card) under `latency-mod.c` without poll (`np`), without Round Robin scheduling (`ns`) and without blocking (`nb`), with a latency of 128 frames and a test duration of 256 frames:
(( Note that all of the provided acquisitions are without Round Robin scheduling (`ns`), with a latency of 128 frames and a test duration of 256 frames. Also, most of the acquisitions represent successfully completed `latency` runs - except for the last two, which are marked with `-f` in their names ))
The .gif, as usual, is not very legible - but you can view the PDFs of the individual frames in the `_cappics05` subfolder. What is clear, though, is that .pointer is called very frequently - and the reason for this is the very frequent calls to `snd_pcm_readi`, which are frequent because neither blocking nor polling is used (so `readbuf` in `latency-mod.c` spins in its loop as fast as it can).
Once we start using polling (`yp`) with `latency`, the situation settles down; note that "polling" in `latency.c` uses `snd_pcm_wait` to wait a given time (in milliseconds) on the capture file handle. As I had some problems with poll at first, I thought it was due to the poll waiting time, which was fixed at 1000 ms in the original `latency.c` - so I added a `--polltime` option to `latency-mod.c`. After some reboots, I couldn't reproduce that problem any more, and it looks like the operation with polling is more or less the same with a 1 ms wait time vs. a 1000 ms one:
Finally, I tried using blocking (`yb`) with `latency` instead - and it turns out, it doesn't look all that much different from polling with `hda-intel`:
... and so I finally settled on these `latency` settings (128-256-np-0-ns-yb) for the later acquisition of `dummy` operation as well. Anyway, in terms of `hda-intel`, I gather the following can be said generally about the operation under `latency(-mod).c`:
* The operation starts with two or three violet "CardIRQ?"s, spanning approx a quarter of the buffer period (the time corresponding to period_size)
** (violet "CardIRQ?" means that an audio driver interrupt handler function, here `azx_interrupt`, has been detected - but no .pointer was called in its context)
* Approx one buffer period after the very first violet "CardIRQ?", the first "proper" "CardIRQ?" fires, and it is capture (blue)
* Approx a quarter of a buffer period after the first capture "CardIRQ?", the first playback (red) "CardIRQ?" fires
* The capture/playback "CardIRQ?"s then each repeat every buffer period (keeping the quarter-buffer-period phase between them) until the end of the program
This seems to match the behavior of the separate capture and playback "montage"s for `hda-intel`, discussed earlier in the thread. The interesting thing for me here is: even if the `latency` program starts by doing two `snd_pcm_writei`s, the driver operation still starts with a capture IRQ, and the playback series is delayed with respect to capture by a quarter buffer period; is this correctly surmised?
For a more detailed analysis, the `lat-anim.pl` script can be used to render a single debug acquisition as an animation; here is one such example for `hda-intel` under `latency`:
... try it with:
vlc --repeat --rate 3.0 captlat-2013-09-02-23-30-38-hda-128-256-np-0-ns-yb.mpg
Again, the supplied `.mpg`s are not very legible - however, you can use `lat-anim.pl` and the provided debug logs to generate the high-res animation frames (and videos) yourself; although note that rendering frames from a debug acquisition like the above takes about an hour on my machine (see the `Readme` for more). By default, an animation frame is 800x600, and in that format looks like this:
On the left-hand side, there is a visualization of the playback (left block) and capture (right block) pointers - from top to bottom:
* For playback, `appl_ptr` changes first - `_pointer` and then `hw_ptr` try to follow it
* For capture, `_pointer` changes first - `hw_ptr` and then `appl_ptr` try to follow it
On the right-hand side, there is the "Card Time" lane overlay, showing a repeat of the `_pointer` values (violet), and the interpolated playback (red) and capture (blue) positions (again, see the `Readme` for more). The interpolated positions start at the first proper capture or playback "cardIRQ?", and then grow as per the sampling rate period (here 1/44100), so they simulate the card's sample clock. The animation in this lane shows that:
* The capture and playback streams do indeed seem close to synchronized - as expected from `snd_pcm_link` - judging by their interpolated positions (the small delta between those values is likely due to calculation errors)
* The respective `_pointers` do seem to track the interpolated positions closely - and they typically advance by about half a buffer (in this case, one period_size) when they update (close to their respective "cardIRQ?"s)
The interesting thing here for me is the fact that the _pointer/interpolated positions do indeed seem to be synchronized - even though the capture and playback IRQ sequences are a quarter buffer period apart! At first, I expected that since the c/p IRQ sequences are a quarter buffer period apart, the respective interpolated/_pointer values would also have to be a quarter buffer_size (half a period_size) apart!? But seen like this, it seems that at the very first (violet) cardIRQ, the sample clock is started synchronously for both playback and capture on the card - and so, _even_ if the playback IRQs hit a little later, the PC is still informed of relatively accurate card values (especially given that, as surmised earlier in the thread, the `_pointer` values seem to propagate from the card to the PC via DMA, and thus do not rely on interrupts "reporting" them). Is this somewhat correctly understood?
[[ This confused me greatly, especially since one of the very first captures (which I also used for developing the scripts) was this:
There the very first playback _pointer has a value of 48 - barely 200 μs (the duration of some 8 frames) after the `snd_pcm_start`! However, 48 is about 48/128 = (1/2.66667)th of the buffer_size, whereas a quarter buffer_size is 32 frames - and since I got this in no other debug log, I guess it was an anomaly/error of sorts (even though `latency` reported successful completion). ]]
Further on to the `dummy` driver: the original version (with the extra `trace_printk`) behaves like this for a blocking `latency` test:
Here, even with blocking, .pointer is called quite frequently - however, as noted earlier in this thread, that is because this driver version recalculates the _pointer value in the .pointer function; and so the value of _pointer has already changed by the time `hw_ptr` is set to _pointer's last value. And since in the capture direction ALSA insists on having `hw_ptr` set to _pointer, it repeats this process.
The behavior of the modified `dummy` driver (`dummy-2.6.32-patest.c` with a respective `trace_printk`) looks like this:
Since `dummy-2.6.32-patest` recalculates the _pointer values in the hrtimer callbacks (and the .pointer function simply returns them), the ALSA engine can quickly update `hw_ptr` to the _pointer value without many repetitions; and so it resembles the behavior of `hda-intel` a bit more. Apparently, due to the `snd_pcm_link`, the hrtimer callbacks for playback and capture are scheduled very close to one another - causing subsequent callbacks to be serviced in the context of a single IRQ; however, here too, the capture callback seems to be serviced first. Looking at one of its debug logs as an animation:
... also the update behavior of the pointers and the interpolated positions seems close to the behavior of `hda-intel`.
While `hda-intel` can also fail (with an XRUN) in a `latency` test like the above, I was more interested in a failure of the modified `dummy-2.6.32-patest` - so here is one such debug plot:
... and as an animation:
At the end of this run, `latency-mod.c` reported the playback/capture frame counts, and an array logging the status of each `readbuf`/`writebuf` call in `latency-mod.c`:
* 192/128 xrun: [rd wr] 0: [64 64] 1: [64 -32] 2: [0 0] 3: [0 0] ...
... which would mean that the second `writebuf` failed. However, in the PDF plot, we can see the first `_readi` called right at the start; after the first "cardIRQ?", the first `_writei` and the second `_readi` are called. In the successful acquisitions, this pattern repeats to the end - but in this failing one, the failing `_writei` is called only after an `xrun()` is decided in the context of the third playback "cardIRQ?" (this is not shown on the PDF - you'll have to look at the `.csv` file to see the `xrun()`). This means that - for some reason - both `_writei` and `_readi` are *skipped* after the second "cardIRQ?"; and I cannot really tell what the reason for that would be. I may look into this further - but as I remember, the PortAudio full-duplex problem was not a typical XRUN, so maybe I'll go back directly to PortAudio from this point.
Thanks for bearing with me so far - and I'm looking forward to any answers/corrections,