[alsa-devel] Questions about virtual ALSA driver (dummy), PortAudio and full-duplex drops (playback)

Wed Aug 14 16:30:31 CEST 2013

Hi all,

Since the PortAudio full-duplex problem I have depends on both the capture and playback directions, I thought I'd also look into playback, again by comparing the `hda_intel` and `dummy` ALSA drivers, and post a writeup on it (given that only the capture direction was discussed so far). It would definitely be nice to get some feedback on it, so I know where I am going wrong with this - and apologies again for the verbosity.

This was greatly motivated by the following essential comment by Clemens, earlier in the thread:

>>> On 2013-07-25 10:37, Clemens Ladisch wrote:
>>>> Your driver's .pointer callback must report the *actual* position at
>>>> which the hardware has finished reading from the buffer
>> ... for a playback stream, or finished writing, for a capture stream.

I would restate that comment in slightly stronger language: the playback and capture operations, although similar in many respects, are **fundamentally** different. I, unfortunately, couldn't fully appreciate the comment until I had some code for tests, and resulting pictures to look at :) These test scripts and images, which will be referred to below, are again posted at the link below (see the slightly updated Readme there):

    http://sdaaubckp.sourceforge.net/post/alsa-capttest/

One of the misconceptions that I had earlier, was that in ALSA, the proper streaming operation essentially boils down to _only_ what the .pointer function returns for each streaming direction. That is probably still correct in a sense - but I guess, it is more correct to say, that: the proper streaming operation in ALSA is essentially determined by three variables/parameters/properties:

* the value returned by the .pointer function (the .pointer position);
* the hw_ptr; and
* the appl_ptr

... of each stream; all of them expressed in units of frames. I think I implicitly took them to mean the same, because they are named the same - however, since the playback and capture direction are *fundamentally* different, so is the meaning of these variables, depending on which stream direction they are attributed to:

* capture:
** .pointer - actual position at which the hardware has completed capturing (finished reading)
** hw_ptr   - follows .pointer - but late: only after a call to `snd_pcm_update_hw_ptr0`
** appl_ptr - follows hw_ptr - but later: only after `snd_pcm_readi` (or its ioctl) has returned the given number of frames to userspace
* playback:
** appl_ptr - how many frames the application has already written to ALSA (after `snd_pcm_writei` returns)
** .pointer - actual position at which the hardware has finished reading from ALSA playback buffer (a.k.a. number of samples already played back)
** hw_ptr   - follows .pointer - but late: only after a call to `snd_pcm_update_hw_ptr0`

Also, .pointer is a variable that wraps at PCM buffer_size (in frames); hw_ptr and appl_ptr are cumulative (although they'd of course wrap too, if they hit the size of the unsigned long integer they are stored in). Additionally, for either direction, these kernel/driver variables are exposed to userspace via `snd_pcm_delay` and `snd_pcm_avail` functions/variables (although it seems possible to retrieve hw_ptr and appl_ptr directly by calling the `snd_pcm_status()` function, and retrieving `snd_pcm_status_t`/`snd_pcm_status` structure).
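To make the wrapping concrete for myself, here is a minimal sketch in C of how a wrapping .pointer position could be folded into a cumulative hw_ptr. It is only a model of the idea, not the kernel's `snd_pcm_update_hw_ptr0`; the names `stream_model` and `fold_pointer` are made up for illustration, and it assumes that .pointer advances by less than one buffer_size between two updates.

```c
#include <assert.h>

/* Hypothetical model of one stream; all values are in frames. */
struct stream_model {
    unsigned long buffer_size; /* .pointer wraps at this value */
    unsigned long hw_base;     /* cumulative frames at the start of the current buffer lap */
    unsigned long hw_ptr;      /* cumulative position: hw_base + last .pointer value */
};

/* Fold a new (wrapping) .pointer value into the cumulative hw_ptr.
 * Assumes fewer than buffer_size frames elapsed since the last update. */
static unsigned long fold_pointer(struct stream_model *s, unsigned long pos)
{
    assert(pos < s->buffer_size);
    unsigned long old_pos = s->hw_ptr - s->hw_base;
    if (pos < old_pos)               /* .pointer went backwards, so it wrapped */
        s->hw_base += s->buffer_size;
    s->hw_ptr = s->hw_base + pos;
    return s->hw_ptr;
}
```

With buffer_size = 64, feeding .pointer values 17, 33 and then 1 gives cumulative positions 17, 33 and 65 in this model.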

It should be mentioned here, that using `snd_pcm_readi`/`snd_pcm_writei` is just one way/method of interacting with ALSA from userspace; there are, as far as I can see, at least five. I first became aware of this stumbling upon http://alumnos.elo.utfsm.cl/~yanez/alsa-sample-programs/ , which mentions METHOD_DIRECT_RW, METHOD_DIRECT_MMAP, METHOD_ASYNC_RW, METHOD_ASYNC_MMAP, METHOD_RW_AND_POLL. But then, I realized there is a userspace program in `alsa-lib/test/pcm.c`, which refers to methods: "write", "write_and_poll", "async", "async_direct", "direct_interleaved", "direct_noninterleaved", and "direct_write"; where all but the first three are mmap-based. I have chosen `snd_pcm_readi`/`snd_pcm_writei` for the tests, in order to keep the kernel debug log acquisitions as short and simple as possible (so as to make their filtering for plotting easier) - however, it's notable that PortAudio uses a poll-based approach instead.

Ignoring these other approaches for now - when using `snd_pcm_readi`/`snd_pcm_writei`, I guess the only hint that the application has of .pointer/hw_ptr/appl_ptr as a whole, is through the number of frames returned for that request. So, say we have this for capture:

    ret_frames = snd_pcm_readi(capture_pcm_handle, audiobuf, 32);

With this, userspace has requested 32 capture/record frames from ALSA. When the function returns, the returned `ret_frames` can be:

* ret_frames = 32       - exactly the amount which was requested; which means no problem
* 0 <= ret_frames < 32  - less than requested 32; which means that an input underflow, or capture underrun, has happened - which may be possible to correct later
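* ret_frames < 0        - a negative number; which means an error code was returned (e.g. -EPIPE, "Broken pipe", for an xrun); the stream then needs explicit recovery (such as `snd_pcm_prepare`) before it can be used again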

When we look at playback:

    ret_frames = snd_pcm_writei(playback_pcm_handle, audiobuf, 32);

... the meaning of the returned `ret_frames` is nearly the same as in the capture case - except here, for playback, 0 <= ret_frames < 32 means that:

* the userspace app requests to write 32 frames to ALSA's playback buffer;
* ALSA managed to write, say, 16 frames of the 32, and the buffer ended up full - so ALSA returns 16 as `ret_frames`;
* since in this case, userspace wrote _more_ frames than ALSA could handle, this is an output overflow, or playback overrun, condition.

For completeness, 0 <= ret_frames < 32 for the capture case means that:

* the userspace app requests to read 32 frames from ALSA's capture buffer;
* ALSA managed to read, say, only 16 frames, and it hit the limits of the buffer, beyond which there is no more data to read - so ALSA returns 16 as `ret_frames`;
* since in this case, userspace got _fewer_ frames from ALSA than it requested, this is an input underflow, or capture underrun, condition.

Note that here, it is likely that ALSA decides what to return as `ret_frames`, based on previous appl_ptr and .pointer/hw_ptr; however, the value that is returned as `ret_frames`, soon enough also becomes the new value of appl_ptr, for a given stream direction (playback or capture).
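The three-way reading of `ret_frames` above can be sketched as a small helper. This is my own illustration, not alsa-lib API: the enum and the function name `classify_ret` are hypothetical, and the only thing borrowed from ALSA is that an xrun is reported as -EPIPE.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for the outcome of one readi/writei call. */
enum xfer_result {
    XFER_COMPLETE, /* all requested frames were transferred */
    XFER_SHORT,    /* 0 <= ret < requested */
    XFER_XRUN,     /* ret == -EPIPE ("Broken pipe") */
    XFER_ERROR     /* any other negative error code */
};

/* Classify the return value of a transfer of `requested` frames. */
static enum xfer_result classify_ret(long ret, long requested)
{
    assert(requested > 0);
    if (ret == -EPIPE)
        return XFER_XRUN;
    if (ret < 0)
        return XFER_ERROR;
    return (ret < requested) ? XFER_SHORT : XFER_COMPLETE;
}
```

A caller would pass the value returned by `snd_pcm_readi` or `snd_pcm_writei` together with the frame count it asked for, and branch on the result.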

With this in mind, here is my experience trying to profile the playback operation. I was using the `playmini.c` file in the `alsa-capttest` folder, again run by `run-alsa-capttest.sh` to obtain kernel debug logs and plots. I first started the same way as I did in `captmini.c` - that is (essentially):

  ... // enable ftrace logging
  ret1 = snd_pcm_writei(playback_pcm_handle, audiobuf, period_chunksize_frames);
  ret2 = snd_pcm_writei(playback_pcm_handle, audiobuf, period_chunksize_frames);
  ... // disable ftrace logging

But that resulted in acquisitions which are rather short - and only a single .pointer firing would be captured, e.g. as on

    http://sdaaubckp.sf.net/post/alsa-capttest/_cappics04/captures-2013-08-10-02-25-46-shp-trace-both.pdf

The blue lines, showing the calls from userspace, indicate that `snd_pcm_writei` could return in 150 μs (for `dummy-patest`, my version of `dummy`) to 290 μs (for `hda-intel`); this is far shorter than the expected period of 32/44100 ~= 726 μs - which otherwise seems to have been the (average approximate) time taken by `snd_pcm_readi` to return (in the capture direction case)! Now, what is confusing to me here, is the issue of blocking; note the docs say:

    ALSA project - the C library reference: PCM (digital audio) interface
    http://www.alsa-project.org/alsa-doc/alsa-lib/pcm.html

    > In blocked behaviour, these I/O functions stop and wait until there is
    > a room in the ring buffer (playback) or until there are a new samples
    > (capture).
    >
    > The ALSA PCM API uses a different behaviour when the device is opened
    > with blocked or non-blocked mode. The mode can be specified with mode
    > argument in snd_pcm_open() function. The blocked mode is the default
    > (without SND_PCM_NONBLOCK mode).

At first read, to me this would mean, that both `snd_pcm_writei` and `snd_pcm_readi` would block, until their request for N (say, period_size, here 32) frames (either for writing or for reading) has been honored; or in other words, they should both block for approximately the period_size time (726 μs for 32 frames @ CD-quality). However, the debug logs show that `snd_pcm_writei` can return in far less time - in nearly a fifth part of the period; so quite obviously, `snd_pcm_writei` doesn't block for the entire period_size time.

At second glance, it does say "until there is room in the ring buffer (playback)". In other words, this "ring buffer" probably refers to the `dma_area` of the playback stream. In effect, what we're talking about here is blocking until data from userspace is transferred to the ALSA driver's `dma_area`. And this `dma_area` is ultimately kernel space of the same PC, and so a copy from userspace to kernelspace, is indeed likely to complete relatively fast - since for this direction (playback), it doesn't need to refer to the card (external hardware) at all! In other words:

* Userspace starts with `snd_pcm_writei`
* `snd_pcm_writei` starts blocking
** Kernel receives the `snd_pcm_writei` / playback ioctl, checks and sees that (say) the playback `dma_area` is empty,
** thus kernel accepts the bytes from userspace, writing them into `dma_area`
** Given that `dma_area` is part of the same kernel, this copy completes relatively fast
* `snd_pcm_writei` stops blocking, and returns the copied amount of frames

In contrast, in the capture direction, `snd_pcm_readi` blocks until N (say 32) frames are available - but whether those frames are available, depends ultimately on the card hardware; and since we have a sampling rate specification, those frames cannot be delivered by the card any faster than period_time=period_size/rate; thus blocking for at least period_time in the capture direction is implied by default.
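The timing figures used here follow from simple arithmetic; as a sanity check, a tiny C helper (the name `frames_to_us` is mine, purely for illustration):

```c
#include <assert.h>

/* Duration of `frames` frames at a sampling rate of `rate` Hz, in microseconds. */
static double frames_to_us(unsigned long frames, unsigned int rate)
{
    assert(rate > 0);
    return (double)frames * 1e6 / (double)rate;
}
```

For 32 frames at 44100 Hz this gives about 725.6 μs, so a `snd_pcm_writei` that returns after 150 μs has taken roughly a fifth of the period time.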

So, assuming the above is correct, I first tried a bit with `snd_pcm_wait`, for which the docs say:

    http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html
    Wait for a PCM to become ready.

But, apparently, the "PCM to become ready" for the playback direction, again is in reference to whether there is space in `dma_area`. So if you start with buffer_size 64 frames; and you do a first `snd_pcm_writei` successfully with 32 frames; you're still left with space of 32 frames to write into in the playback buffer. So if you run `snd_pcm_wait` here, it returns fast - because there is space in the buffer; it doesn't wait for the buffer to finish playing! So this still didn't help me get a better debug acquisition - this is (somewhat) documented in the `doPlayback_v01()` function in `playmini.c`, which is the source for this .gif:

    http://sdaaubckp.sourceforge.net/post/alsa-capttest/capttest_04_shp.gif
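The reason `snd_pcm_wait` returns fast in that situation can be modelled with the usual playback "avail" arithmetic. This is only a sketch under my assumptions: the function names are hypothetical, and I am assuming that "ready" means avail >= avail_min, with avail_min equal to period_size (32 here).

```c
#include <assert.h>

/* Free space in the playback buffer, in frames.
 * hw_ptr and appl_ptr are cumulative; assumes 0 <= appl_ptr - hw_ptr <= buffer_size. */
static unsigned long playback_avail(unsigned long hw_ptr, unsigned long appl_ptr,
                                    unsigned long buffer_size)
{
    assert(appl_ptr >= hw_ptr && appl_ptr - hw_ptr <= buffer_size);
    return hw_ptr + buffer_size - appl_ptr;
}

/* Model of "PCM is ready" for playback: enough free space to write into. */
static int playback_ready(unsigned long hw_ptr, unsigned long appl_ptr,
                          unsigned long buffer_size, unsigned long avail_min)
{
    return playback_avail(hw_ptr, appl_ptr, buffer_size) >= avail_min;
}
```

With buffer_size = 64, after the first write of 32 frames (hw_ptr = 0, appl_ptr = 32) there are still 32 frames free, so the stream counts as ready immediately; only with appl_ptr = 64 would a wait actually have something to wait for.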

So, the typical response I'd get from the `playmini.c` program, for two consecutive `snd_pcm_writei`, would be that the first 32 frames of `writei` were OK, the second were "Broken pipe" (-32); this being for the `hda_intel` driver (note that on the gifs/pdfs, the `dummy` diagram is on top, `hda_intel` on bottom - but on the text overlay, it's the opposite: the first two lines are for `hda_intel`, and the second lines for `dummy`). And during these tests, the `dummy_patest` driver would return 32/OK for both `_writei` calls - but even this caused captures so short, that I cannot see the timer function run even once (which made things really confusing for me, in the sense that: if both `writei`s returned successfully, before the timer function even had a chance to increase .pointer - what made them complete with success then?). Note that more complex userspace ALSA code (like the one in PortAudio) usually performs a write, followed by a poll of the file descriptor - never two writes one immediately after another, like attempted here.

It's also notable that the playback stream gets started "for real" only upon the first `writei` command - only then does the _kernel_ `snd_pcm_start()` function get called. If we try to call `snd_pcm_start(playback_pcm_handle)` from userspace before the first `writei`, that one will simply call something like `snd_pcm_pre_start` in the kernel, (see the .csv source for e.g.:

    http://sdaaubckp.sf.net/post/alsa-capttest/_cappics04/captures-2013-08-10-09-56-34-shp-adsws-trace-both.pdf

) - and then wait until the first `writei`, to call the kernel `snd_pcm_start()`.

By trial and error, I eventually realized that a specific delay (first introduced by multiple `fprintf`s) between the two `snd_pcm_writei`s, makes the `playmini.c` much more likely to complete both writes with success. So I thought I'd look into that - and while at first I thought I'd have to somehow "trigger" .pointer updates from userspace (say by calling `snd_pcm_avail_delay`), it turns out that just a delay - here done via `nanosleep` - is enough. So I used the function `doPlayback_v02()` in `playmini.c`, and additionally used the script `playdelay.sh` to re-run 100 tests of playmini.c and log how many of them fail - for each delay step of 10 μs up to 1 ms; and the corresponding image looks like this:

    http://sdaaubckp.sf.net/post/alsa-capttest/playdelay-hda-intel_v02.png

This image also shows the mean and median of playbacks' `avail` frames; the median pb_avail frames values are: 37, 33, 41, 49, 57, [0] - it's visible that the mean attempts to track this sequence too. A "sweet spot" for the `nanosleep` between the two `snd_pcm_writei` commands, where the number of errors is minimal (here 1 error in 100 runs), is visible - and it "moves" slightly depending on whether `snd_pcm_avail_delay` was used (280 μs - as above) or not (310 μs - notably close to 363 μs, which is half the period time for this case, 726 μs):

    http://sdaaubckp.sf.net/post/alsa-capttest/playdelay-hda-intel_v03.png

... which, in turn, implies that `snd_pcm_avail_delay` costs about 30 μs on this platform. I also noticed here, that every time something like `snd_pcm_avail_delay` or `snd_pcm_status` is used either before or after the "sweet spot" zone, it will trigger an XRUN, which afterwards propagates to all subsequent ALSA calls; the only command that doesn't do this is `snd_pcm_avail_update`, which is described in the docs as "light": "The position is not synced with hardware (driver) position in the sound ring buffer in this function. This function is a light version of snd_pcm_avail()." In relation to this, I also found this (massive) alsa-devel thread from 2008:

    "Re: What does snd_pcm_delay() actually return?"
    http://thread.gmane.org/gmane.linux.alsa.devel/53841/focus=54050

    > The period-based refresh makes it hard to use the fifo effectively.  If
    > the card fifo is allowed to 'suck' all the data from the ringbuffer then
    > it makes it look like an underrun. Also it makes time appear to run fast
    > until the fifo is filled up.
    >
    > The 'fast time' creates problems for ALSA on playback start, because
    > alsa assumes that it will take a whole period for a period of data to be
    > consumed, while the driver is capable of consuming multiple periods
    > almost instantly.  In my driver I have to throttle the rate that data is
    > transferred to the card fifos.

    http://thread.gmane.org/gmane.linux.alsa.devel/53841/focus=53984

    > Yes, this is exactly what I am experiencing. At stream start my
    > estimations (based on update_avail) are way off. Afterwards everything
    > is fine. As a dirty workaround to fix this I halve the initial sleep
    > time always so that I can make sure I don't sleep for too long and get
    > an xrun. But that's really ugly, because halving it is just a wild
    > guess and it isn't even necessary on PCI hardware.

It seems these quotes refer to something similar that I'm seeing with two `writei`'s in a row (need for a specific sleep, possibly half a period, to get to the "sweet spot"); but I cannot tell for sure right now.

Anyways, knowing this delay, I finally came to this piece of (here, pseudo) code:

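  struct timespec ts = { .tv_sec = 0, .tv_nsec = 310 * 1000 };  // 310 μs
  ... // enable ftrace logging
  ret1 = snd_pcm_writei(playback_pcm_handle, audiobuf, period_chunksize_frames);  // period_chunksize_frames = 32
  nanosleep(&ts, NULL);  // land in the "sweet spot"
  ret2 = snd_pcm_writei(playback_pcm_handle, audiobuf, period_chunksize_frames);
  snd_pcm_drain(playback_pcm_handle);  // block until playback has finished
  ... // disable ftrace logging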

... which represents the essence of the `doPlayback_v03()` function in `playmini.c`, used to obtain most of the other playback related acquisitions/plots. To begin with, here is an animation of all those `playmini` test runs, which completed successfully for both `dummy-patest` and `hda-intel` drivers:

    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_04.gif

Basically, in the above code, we know that the first `writei` call will succeed for sure - because it's the first call, and at that time, the playback `dma_area` is empty - and will return quickly. With the `nanosleep` we ensure that the second `writei` call will be in the "sweet spot" - and hopefully, also succeed, and thus return quickly. Since we have no intention of writing any further data, we let ALSA know that by calling `snd_pcm_drain`, for which the docs note: "For playback wait for all pending frames to be played and then stop the PCM.". It was also called in the code before - but here it is specifically added, so it blocks until card has finished playback before stopping ftrace logging - so we can obtain a complete debug log acquisition of the playback process. And, it does work indeed - because, at least, we start getting timers/interrupts firing in the debug logs, as shown in the `capttest_04.gif`.
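Since `nanosleep` takes a `struct timespec` rather than a plain number of microseconds, a small conversion helper is handy when sweeping the delay in 10 μs steps. The helper name `us_to_timespec` is my own; it is a sketch, not something taken from `playmini.c`.

```c
#include <assert.h>
#include <time.h>

/* Convert a delay in microseconds to the timespec that nanosleep() expects. */
static struct timespec us_to_timespec(unsigned long us)
{
    struct timespec ts;
    ts.tv_sec  = (time_t)(us / 1000000UL);           /* whole seconds */
    ts.tv_nsec = (long)(us % 1000000UL) * 1000L;     /* remainder, in nanoseconds */
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
    return ts;
}
```

It would be used as `struct timespec ts = us_to_timespec(310); nanosleep(&ts, NULL);` between the two writes.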

I found the behavior of playback for `hda-intel` somewhat surprising at first, because it is quite different from the capture case; compare the cases of:

    http://sdaaubckp.sf.net/post/alsa-capttest/capttest.gif     (capture  - `hda-intel`: bottom plot)
    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_04.gif  (playback - `hda-intel`: bottom plot)

In the capture case, there are three IRQs fired; starting almost immediately after `readi`, they basically delineate the time taken by two periods. In the playback case, we have four IRQs fired: again the first one fires soon after the first `writei` - but the second fires after only _half_ a period, not after a period like in the first case! From this point on, however, the 3rd and 4th IRQ _do_ fire after a period!

So, one of these playback test run acquisitions/logs (captures-2013-08-11-05-15-21) was taken to be the source of yet another annotated montage; first, for the `hda-intel` driver it is:

    http://sdaaubckp.sf.net/post/alsa-capttest/montage-hda-intel-p.png  (also .pdf)

While I have learned that modern cards do not have on-board buffers, I have still drawn an "intern playback"  buffer for the "Card Time" axis, because I think it could be a useful tool in understanding what should happen. Here's my speculative breakdown, on what (I think) happens here:

* The first `snd_pcm_writei` fires; right before it, the playback `dma_area` is "empty"
* ALSA then starts the process via kernel `snd_pcm_start` soon after
* Approx 50 μs after that (or about 100 μs after `writei` first fired), card responds with an IRQ
** Strangely enough, this first IRQ does *not* trigger a .pointer !
* At about the same time, `snd_pcm_writei` probably returned with 32 frames written;
** so already at about this time, we can count on appl_ptr being set to 32 (or `dma_area` is "half full"; hence another drawing of the "buffer")
** Also about this time, the `nanosleep` (not drawn) in userspace should start
* Some 332 μs (approx half a period time, which is 726/2 = 363 μs) after the first card IRQ, the second card IRQ fires
** this one apparently informs ALSA that playback has started (so .pointer would be at 0 here) - because also here, .pointer is _not_ fired in the context of the IRQ handler
* Time goes by, `nanosleep` has expired, and the second `snd_pcm_writei` is fired in userspace
* Soon after that, .pointer is called for the first time, in the context of the playback ioctl handler
** The values seen by the first pointer are (in frames): hw_ptr = 0, .pointer = 17, appl_ptr = 32; engine sees hw_ptr < .pointer < appl_ptr = 32
** hw_ptr would become = .pointer (=17) very soon after
** so at this point, engine sees 0 < hw_ptr=17 < appl_ptr=32 - which is probably seen as a good sign: hw_ptr is where it's supposed to be after approx half a period; it still hasn't gone over appl_ptr yet, so playback is still active
** and since there is still space in the playback `dma_area` buffer, the ioctl allows `_writei` to complete successfully in userspace
* `writei` completes in userspace, returning 32 more frames; now appl_ptr should be at 64 (the `dma_area` buffer is currently full - meaning if there was a next write, it would wrap)
* `snd_pcm_drain` fires afterward in userspace - triggering again the playback ioctl
* `snd_pcm_drain` is fired in kernel space soon after, apparently waiting for the playback to complete
* Some 738 μs (approx the period time of 726 μs) after the second IRQ, the third card IRQ fires
* Soon after, .pointer is called for the second time, in context of this third card IRQ
** The values seen by the second pointer are (in frames): hw_ptr = 17, .pointer = 33, appl_ptr = 64
** Engine again sees hw_ptr < .pointer < appl_ptr - and ultimately, 0 < hw_ptr = 33 < appl_ptr = 64 - again probably seen as a good sign
* Time goes by - Some 763 μs (again approx the period time of 726 μs) after the third IRQ, the fourth card IRQ fires
* Soon after, .pointer is called for the third time, in context of this fourth card IRQ
** The values seen by the third pointer are (in frames): hw_ptr = 33, .pointer = 1, appl_ptr = 64
** this means .pointer has wrapped past the end of the 64-frame buffer (the cumulative hw_ptr would become 64 + 1 = 65) - so all 64 requested (appl_ptr) frames have finished playing
** engine thus determines `snd_pcm_drain_done()` in kernelspace
* Soon after, `snd_pcm_drain()` exits in userspace - and the debug acquisition completes
* ((there is a "ghost buffer" at the end of capture in "Card Time", to indicate where .pointer would have to be - at approx quarter buffer size - _had_ the playback continued; which it doesn't in this case))

So, this tells me there is probably something like a condition, of either (hw_ptr < appl_ptr) after a .pointer call - or (hw_ptr < .pointer < appl_ptr) right before/during a .pointer call - (with wrapping handled in both cases), which needs to be satisfied, so that the ALSA engine determines that a playback stream is proceeding as expected. I'm not really sure, which of these would be the stronger condition. We can also look at some acquisitions where `hda_intel` fails (vs. `dummy-patest`, which doesn't):

    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_04_bhda.gif (playback - `hda-intel`: bottom plot)

In most of these, .pointer fails to be fired after the second card IRQ (although, one of these didn't get to acquire any card IRQs at all). When .pointer is fired between second and third card IRQ, e.g. as in:

    http://sdaaubckp.sf.net/post/alsa-capttest/_cappics04/captures-2013-08-11-11-56-38-bhda-trace-both.pdf

... at that point .pointer reads: hw_ptr = 0, .pointer = 1, appl_ptr = 32; .pointer here should be 17. At the second .pointer call we have: hw_ptr = 1, .pointer = 33, appl_ptr = 32 (vs. hw_ptr = 17, .pointer = 33, appl_ptr = 64 in the successful run); this is apparently a cause for a call to `azx_pcm_trigger` to stop the stream (which otherwise happens first at `snd_pcm_drain_done()`); and after that, the `snd_pcm_drain()` call exits quickly (given debug acquisition finishes soon after, and no further events are reported on that plot). Now, the second .pointer certainly doesn't satisfy (hw_ptr < .pointer < appl_ptr), nor are the values of .pointer there where they should be according to time expired since start of playback - but I still cannot tell for sure, if this is the exact condition that causes the failure of the second userspace `writei` call.
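To check the speculated condition against the logged values, here is a small C sketch. It encodes only my guess at the rule (old hw_ptr <= new position < appl_ptr, all cumulative, so wrapping is assumed to be already unfolded); the function name `playback_order_ok` is made up, and this is not code from the ALSA core.

```c
#include <assert.h>

/* Speculated "playback is proceeding as expected" check, in cumulative frames:
 * the new position must not be behind the old hw_ptr, and must not have
 * reached appl_ptr (i.e. the card must not have played more than was written). */
static int playback_order_ok(unsigned long hw_ptr, unsigned long new_pos,
                             unsigned long appl_ptr)
{
    assert(hw_ptr <= appl_ptr);
    return hw_ptr <= new_pos && new_pos < appl_ptr;
}
```

The successful `hda-intel` run passes at both calls, (0, 17, 32) and (17, 33, 64); the failing run passes at its first call (0, 1, 32) but not at its second (1, 33, 32), which is where the stream gets stopped. The check is not meant for the last .pointer call of a drain, where the position is expected to reach appl_ptr.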

Anyways, we can now take a look at the `dummy-patest` driver in the playback direction, whose montage of successful run is at:

    http://sdaaubckp.sf.net/post/alsa-capttest/montage-dummy-p.png  (also .pdf)

This image sometimes contains a "ghost copy" of the buffer in the CPU1 lane, because there is a bit of space there I could use; it is simply meant as a visual tool, to see what ALSA would "think" about the "card playback buffer" position (which in this case, for a virtual driver with no hardware, is simulated by the values returned by .pointer, calculated based on time delta in the timer tasklet). Anyways, a brief speculative breakdown would be:

* The first `snd_pcm_writei` fires; right before it, the playback `dma_area` is "empty"
* ALSA then starts the process via `snd_pcm_start` soon after
** Within this, the timer function is scheduled to fire after a period_size time - but there is no firing of "first" timer like in the `hda_intel` case
* Soon after, `snd_pcm_writei` probably returned with 32 frames written;
** so already at about this time, we can count on appl_ptr being set to 32 (or `dma_area` is "half full"; hence another drawing of the "buffer")
** Also about this time, the `nanosleep` (not drawn) in userspace should start
* Some time goes by - and the second `snd_pcm_writei` manages to fire in userspace _before_ the timer function even fires
* but then, the timer function interrupts on CPU0, right before...
* ... the playback_ioctl handler is raised on CPU1!
** The first .pointer is called in context of the playback_ioctl;
** The values seen by the first pointer are (in frames): hw_ptr = 0, .pointer = 0, appl_ptr = 32; this is apparently seen as good sign by the engine, as `_writei` is allowed to complete successfully..
* `writei` completes in userspace, returning 32 more frames; now appl_ptr should be at 64 (the `dma_area` buffer is currently full - meaning if there was a next write, it would wrap)
* `snd_pcm_drain` fires afterward in userspace - triggering again the playback ioctl
* `snd_pcm_drain` is fired in kernel space soon after, apparently waiting for the playback to complete
* Soon after, .pointer is called for the second time, in context of the _drain playback_ioctl handler
** The values seen by the second pointer are (in frames): hw_ptr = 0, .pointer = 38, appl_ptr = 64; this is apparently still good
* Soon after, the second timer function is called
* Soon after, .pointer is called for the third time, in the context of the second timer function
** The values seen by the third pointer are (in frames): hw_ptr = 38, .pointer = 10, appl_ptr = 64; this is apparently good - indicating .pointer has wrapped... but then, it wrapped at 10 frames over, meaning "card" played _more_ samples than requested; but that seems not to be a cause of concern
** engine thus determines `snd_pcm_drain_done()` in kernelspace
** `dummy_pcm_trigger` is called soon after to stop the stream;
* Soon after, `snd_pcm_drain()` exits in userspace - and the debug acquisition completes

We can also look at some acquisitions where `dummy-patest` fails (vs. `hda_intel`, which doesn't):

    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_04_bdum.gif

A quick scan of the top of that animated plot, tells us that in those `dummy-patest` acquisitions, the second timer function doesn't even fire; implying that the stream was stopped already at the first firing of pointer (or the second `writei`). One of those acquisitions is:

    http://sdaaubckp.sf.net/post/alsa-capttest/_cappics04/captures-2013-08-11-11-40-44-bdum-trace-both.pdf

Here we can see that also .pointer fires only once, and it sees values hw_ptr = 0, .pointer = 37, appl_ptr = 32; and as we cannot have played more frames than requested after a single `_writei`, the engine rightly decides something is wrong here - and rightly issues a `_trigger` to stop immediately afterwards.
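As far as I understand the ALSA core, the check that actually stops a playback stream is expressed through avail rather than through the pointers directly: the stream is stopped once avail reaches stop_threshold. The sketch below models that; the function names are hypothetical, and I am assuming stop_threshold equals buffer_size (64 here), which I did not verify for these test runs.

```c
#include <assert.h>

/* Playback free space in frames; it exceeds buffer_size once hw_ptr passes appl_ptr. */
static long playback_avail_frames(long hw_ptr, long appl_ptr, long buffer_size)
{
    assert(buffer_size > 0);
    return hw_ptr + buffer_size - appl_ptr;
}

/* Assumed xrun rule: the stream is stopped once avail reaches stop_threshold. */
static int playback_xrun(long hw_ptr, long appl_ptr, long buffer_size,
                         long stop_threshold)
{
    return playback_avail_frames(hw_ptr, appl_ptr, buffer_size) >= stop_threshold;
}
```

For this failed `dummy-patest` run, once hw_ptr follows .pointer to 37 with appl_ptr = 32, avail is 37 + 64 - 32 = 69 frames, which is over the assumed threshold of 64. For comparison, a position of 17 with appl_ptr = 32 gives avail = 49, which is under it.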

Also, we can have a brief look at the original dummy driver. First, recall that when we compare the capture operation in the original `dummy` vs. `dummy-patest`:

    http://sdaaubckp.sf.net/post/alsa-capttest/capttest.gif     (capture  - `dummy-patest`: top plot)
    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_03.gif  (capture  -  orig `dummy` : top plot)

... the original `dummy`, being able to provide a .pointer that increases each frame, can trigger `snd_pcm_update_hw_ptr0` (and thus the .pointer function) to repeatedly update multiple times; `dummy-patest`, which calculates .pointer position only once in the timer tasklet, doesn't trigger a `snd_pcm_update_hw_ptr0` (and the corresponding .pointer) update more than twice in a row.

The interesting thing is, that in the playback direction, there is no such distinction:

    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_04.gif     (playback - `dummy-patest`: top plot)
    http://sdaaubckp.sf.net/post/alsa-capttest/capttest_04_or.gif  (playback -  orig `dummy` : top plot)

In both cases, the .pointer in context of `snd_pcm_update_hw_ptr0` is called at pretty much the same times. I would guess, that this is because of the fundamental difference between the capture and playback direction - in the capture direction, the card is the initiator of delivering frames to the PC, and .pointer indicates the position that the card has reached in capturing - and it's in the best interest of ALSA to have the latest .pointer position stored in hw_ptr; thus if ALSA keeps on getting new values in .pointer, it will repeatedly try to update to them. But, in the playback direction, userspace is the initiator of delivering frames to the card, and as such ALSA doesn't need to continuously update to have the latest .pointer should it change - it can make do, apparently simply by checking .pointer "once in a while", and making sure the card keeps track with playback as demanded by userspace.

Before I wrap up, here is a small (and crude) ASCII table, summarizing the difference in behavior between the `hda-intel` and `dummy` drivers (here `dummy` refers both to the original and `dummy-patest`, since they both schedule their timer functions the same way) in the context of `captmini`/`playmini` tests, as I see it so far:

           hda-intel               dummy
    0   readi     writei   |   readi     writei
    1   IRQ.p/0   IRQ      |
    16            IRQ/0    |
    32  IRQ.p/32           |   Tmr.p/32  Tmr.p/32
    48            IRQ.p/32 |
    64  IRQ.p              |   Tmr.p     Tmr.p
    16            IRQ.p    |
    ...

Here time is shown through frames, assuming period_size is 32 (so half a period is 16, two periods is 64 - which is also buffer_size). The table shows a comparison of the firing of "period pulses": in case of `hda-intel` provided by a card IRQ; in case of the virtual `dummy` driver provided by timer functions. The `.p`, where present, means that .pointer is expected to be called in context of that callback. The presence of IRQ at "1" for `hda-intel` means "acknowledgment" interrupts are fired immediately after the first command is issued - which doesn't happen in `dummy`. The slash with number (/0, /32), where present, refers to what .pointer position is expected to be reported at that time. This should help make visible that the playback stream for `hda-intel` is "offset" by half a period with respect to the capture one - which again doesn't happen for `dummy`. I think it will be possible to add some variables to `dummy`, and force it to fire its timer functions with the same asymmetric capture/playback pattern as `hda-intel` - whether this will fix the PortAudio full-duplex drop, remains to be seen...
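To illustrate what such a variable might look like, here is a sketch of the pulse schedule from the table, with the first pulse delayed by a configurable offset. The function `pulse_frame` and the `first_offset` parameter are hypothetical; nothing like this exists in `dummy` as far as I know.

```c
#include <assert.h>

/* Time (in frames since stream start) of the k-th period pulse after the
 * initial acknowledgment, for k = 0, 1, 2, ...
 * first_offset is how long after stream start the first such pulse fires. */
static unsigned long pulse_frame(unsigned long k, unsigned long period_size,
                                 unsigned long first_offset)
{
    assert(period_size > 0);
    return first_offset + k * period_size;
}
```

With period_size = 32, a first_offset of 32 gives pulses at 32 and 64 (the `dummy` pattern, and `hda-intel` capture), while a first_offset of 16 gives 16, 48 and 80 (the `hda-intel` playback pattern, half a period early).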

Well, that is as much as I can fit into an email this time :)
Many thanks for any comments - especially if anyone sees anything wrong in this analysis,
Cheers!