[alsa-devel] Questions about virtual ALSA driver (dummy), PortAudio and full-duplex drops

Smilen Dimitrov sd at imi.aau.dk
Sun Aug 4 02:05:10 CEST 2013


Hi Clemens, 

Many thanks for your reply - and apologies that it took me a while to write back (and for a longish email, again). Since reading your reply, I've spent most of my time coding a new test case for discussion, now posted here (see the Readme for more):

    http://sdaaubckp.sourceforge.net/post/alsa-capttest/
    http://sdaaubckp.sourceforge.net/post/alsa-capttest/Readme

I took the previous advice; and this is an ALSA-only (and capture-only, to keep it simple) test, trying to explore what happens when two `snd_pcm_readi` calls:

    ret1 = snd_pcm_readi(capture_pcm_handle, audiobuf, period_32_frames);
    ret2 = snd_pcm_readi(capture_pcm_handle, audiobuf, period_32_frames);

... are run in succession, for 44100Hz/16b/stereo (with period_size=32 and buffer_size=64 frames, resulting in a period time of 725.6 microseconds) in two contexts: 1) with my onboard PCI 'snd_hda_intel' card; and 2) with the virtual 'snd_dummy' driver. Hopefully it will help me get some questions clarified - and eventually result in a virtual ALSA driver that does not trigger the full-duplex drop in PortAudio.
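
For reference, here is a minimal sketch of the core of the test (simplified from the actual capttest code, with error checking omitted - so treat the exact parameter-setting sequence as an assumption, not a verbatim excerpt):

    #include <stdio.h>
    #include <alsa/asoundlib.h>

    int main(void)
    {
        snd_pcm_t *capture_pcm_handle;
        snd_pcm_hw_params_t *hwparams;
        snd_pcm_uframes_t period_32_frames = 32, buffer_64_frames = 64;
        char audiobuf[32 * 2 * 2]; /* 32 frames * 2 channels * 2 bytes */
        snd_pcm_sframes_t ret1, ret2;

        snd_pcm_open(&capture_pcm_handle, "hw:0",
                     SND_PCM_STREAM_CAPTURE, 0);

        snd_pcm_hw_params_alloca(&hwparams);
        snd_pcm_hw_params_any(capture_pcm_handle, hwparams);
        snd_pcm_hw_params_set_access(capture_pcm_handle, hwparams,
                                     SND_PCM_ACCESS_RW_INTERLEAVED);
        snd_pcm_hw_params_set_format(capture_pcm_handle, hwparams,
                                     SND_PCM_FORMAT_S16_LE);
        snd_pcm_hw_params_set_channels(capture_pcm_handle, hwparams, 2);
        snd_pcm_hw_params_set_rate(capture_pcm_handle, hwparams, 44100, 0);
        snd_pcm_hw_params_set_period_size(capture_pcm_handle, hwparams,
                                          period_32_frames, 0);
        snd_pcm_hw_params_set_buffer_size(capture_pcm_handle, hwparams,
                                          buffer_64_frames);
        snd_pcm_hw_params(capture_pcm_handle, hwparams);

        snd_pcm_prepare(capture_pcm_handle);
        snd_pcm_start(capture_pcm_handle);

        /* the two calls under discussion */
        ret1 = snd_pcm_readi(capture_pcm_handle, audiobuf, period_32_frames);
        ret2 = snd_pcm_readi(capture_pcm_handle, audiobuf, period_32_frames);
        printf("ret1 = %ld, ret2 = %ld\n", ret1, ret2);

        snd_pcm_close(capture_pcm_handle);
        return 0;
    }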

On 2013-07-25 10:37, Clemens Ladisch wrote:
>>> [...] and you are not correctly
>>> reporting the number of samples transferred to the hardware?
>>
>> I agree that it must be the core of the problem - but I have problem
>> understanding why, given I currently perceive that I'm doing
>> everything right: I know I have a rate of 44100 frames per second; I
>> choose either a period for timer functions, and calculate bytes per
>> period to match the rate, or vice versa; and in each period, I
>> increase stream buffer positions for that bytes per period amount
>> (taking care of buffer wrapping).
>
> Your driver's .pointer callback must report the *actual* position at
> which the hardware has finished reading from the buffer.  You *must*
> read some hardware register of your DMA controller for this.


I understand this - and would agree with it, if I had such a case, where my driver would talk to actual hardware. However, since here I'm interested in the operation of a virtual (platform) driver, which talks to _no_ soundcard hardware - how could I possibly read a hardware register of a card that doesn't exist? (Maybe it's the "...transferred to the hardware..." mention at the start of the quote that gave the wrong impression of my focus in this case? My statement in the quote refers to what I'm trying to do with the _virtual_ "dummy" driver.)

Anyways, now that it's mentioned, I wanted to make sure I've understood - conceptually, in the context of actual soundcard hardware - the reporting of the actual position "at which the hardware has finished reading from the buffer"; so here is a diagram based on my onboard PCI 'hda-intel' card (replace ".png" with ".pdf" in the link to get a text-searchable PDF version):

    http://sdaaubckp.sourceforge.net/post/alsa-capttest/montage-hda-intel.png

Here's what I'm trying to show on it: I'm assuming that the card has its own internal capture buffer memory on board; and ALSA (the hda_intel driver, actually) manages the equivalent of `substream->runtime->dma_area` (the hda_intel driver actually manages its own `area` pointer, as part of a `chip` structure) as capture buffer memory in the PC's RAM. The main purpose of the ALSA driver, then, is to manage the copying of data from the card's internal capture buffer to the `dma_area` capture buffer in RAM; once the data is in the `dma_area`, the rest of the ALSA engine will make sure that data ends up in `audiobuf` in user space, upon a call to `snd_pcm_readi` as given above.
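
(In other words, I picture the .pointer callback of a real hardware driver looking conceptually like the sketch below - with made-up names such as `struct my_chip` and `MY_DMA_POS_REG`, i.e. not the actual `azx_pcm_pointer()` code:)

    /* sketch: a hardware driver's .pointer reports where the DMA engine
       actually is, by reading a position register on the card */
    static snd_pcm_uframes_t my_pcm_pointer(struct snd_pcm_substream *substream)
    {
        struct my_chip *chip = snd_pcm_substream_chip(substream);
        unsigned int pos_bytes;

        /* the *actual* hardware position - not a software counter */
        pos_bytes = readl(chip->remap_addr + MY_DMA_POS_REG);
        return bytes_to_frames(substream->runtime, pos_bytes);
    }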

What is intended to be shown is: the card starts filling its internal capture buffer soon after `snd_pcm_start`; since the period is set to 32 frames, when the card reaches this boundary in its buffer, it generates an interrupt ("CardIRQ?"); the kernel handles this hardware interrupt by eventually calling the `azx_interrupt()` handler of the hda-intel driver. [[[~ Obviously, I cannot measure the actual interrupts generated by the card, so the "CardIRQ?" positions are interpolated - based on reported kernel interrupt entries, but only those where the `azx_interrupt` handler has been called (since it's also possible to capture interrupt entries for power, for instance); the shown filling of the buffer is then interpolated based on this. ~]]]

On the PC side, the driver's .pointer callback can be triggered both by a userspace call to `snd_pcm_readi`, and (apparently) independently of it - but (surprisingly for me) it is not necessarily periodic! The `dma_area` filling shown is based on the actual position returned by the .pointer callback (as much as space allows). Looking from a distance, the .pointer position returned seems to track the (idealized) filling of the internal capture buffer well. [[[~ However, this may also be due to the .pointer callback (`azx_pcm_pointer()`) being occasionally called in quick succession (apparently in the context of the `snd_pcm_capture_ioctl()` function). ~]]]

Is the above understood correctly? And does the observation that (apparently):
* the "filling" of the `dma_area` buffer on the PC side "tracks well" the "filling" of the internal capture buffer on the card side;
... correctly illustrate the nature of .pointer's "reporting of the actual position at which the hardware has finished reading from the buffer"?


> It is not
> possible to deduce this from the current time because the clocks do not
> run at the same speed, and any kind of buffering will introduce more
> errors.

Thanks for mentioning this - I had otherwise completely forgotten about clock domains; so this was the comment that got me working on this test round! First, I'd like to make sure I understand "the clocks do not run at the same speed" properly:

In the `montage-hda-intel.png` diagram above, there are three time axes, shown vertically. It is assumed that the card hardware has its own separate crystal oscillator (XOcard) as a clock source, in addition to the PC having its own crystal (XOpc); consequently, they have their separate time axes "(Card) Time" and "(PC) Time", given that there will always be some mismatch between the frequencies they generate. The leftmost axis is what I've called "(Real) Time", and is used for no other reason than as a reminder; I guess it would represent the clock of an "independent observer", or the "developer clock" - or the "global date & time clock" (such as retrieved by `ntpdate`). By the way, when I see "wall clock" referred to in code, does that refer to this, which I've called "Real Time"?

The diagram takes the "(Real) Time" axis to have a time unit == 1; the "(Card) Time" is set to have a 1.5% smaller (faster) unit than 1; and the "(PC) Time" is set to have a 1.5% larger (slower) unit than 1. The value of 1.5% is chosen arbitrarily, so the mismatch is more obviously visible through the diagram axes' ticks. Does this illustrate the nature of "the clocks do not run at the same speed" properly (in general)?
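
(Just to make that mismatch concrete for myself, a toy calculation - using the arbitrary 1.5% figure from the diagram, so the numbers are purely illustrative:)

    #include <stdio.h>

    int main(void)
    {
        const double period_us = 725.6;       /* nominal 32-frame period @ 44100 Hz  */
        const double card_unit = 1.0 - 0.015; /* card unit 1.5% smaller (runs fast)  */
        const double pc_unit   = 1.0 + 0.015; /* PC unit 1.5% larger (runs slow)     */
        int n = 1378;                         /* roughly one second worth of periods */

        /* how far the two clocks' views of "n periods" have drifted apart */
        double drift_us = n * period_us * (pc_unit - card_unit);
        printf("drift after ~1 s: %.0f us (~%.0f frames)\n",
               drift_us, drift_us * 44100 / 1e6);
        return 0;
    }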

Anyways - I do understand the impossibility of deducing the (XOcard) timing based solely on algorithms running on (XOpc) timing. And I agree it would have been a problem, if I were in the context of working with actual soundcard hardware.

However, since I'm inquiring about a virtual driver - there is no actual card hardware, and consequently no actual (XOcard) crystal oscillator/clock; as you've noted:


>
> The dummy driver uses a timer because there is no actual hardware.
>

Right - I have tried to visualise this on the diagram below (again, replace `.png` with `.pdf` for a PDF version):

    http://sdaaubckp.sourceforge.net/post/alsa-capttest/montage-dummy.png

That diagram should clearly show that, in the context of a virtual (platform) "dummy" driver, there is no actual card hardware targeted - nor a corresponding oscillator (with a corresponding independent clock domain). The large red arrows, which indicated the hardware IRQ in the `hda-intel` diagram, now indicate the timer function's softirq entry - and since they originate from the CPU itself, they now point from the other side (note also that I'm reusing the "CardIRQ?" engine to render these interrupt entries on the card axis, under the cross-out). Only one buffer's (`dma_area`'s) filling is shown - and as there is no internal card buffer now, there is nothing to compare this `dma_area` filling process to; the filling shown is based solely on what the .pointer callback returns.

And herein lies the crux of my inquiry: given that no hardware is targeted by a virtual driver, I can in principle return whatever I want as the .pointer position. Logic would say that the value returned from .pointer should increase (in this case) by 32 frames each PCM period (726 microseconds) - and it is with this in mind that the timer functions are run in the dummy driver. Userspace, in principle, deals only with this layer of information - so I should be able to simulate proper operation to userspace just by increasing this .pointer value properly. However, even when I do that, I still manage to somehow trigger a full-duplex drop in the PortAudio userspace layer - which means I'm still going wrong somewhere with the .pointer position calculation, even if I believe I'm doing it right.
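
To spell out what I mean by "increasing this .pointer value properly" - the ideal bookkeeping I have in mind is something like the sketch below (with made-up names like `my_timer_fn` and `hw_ptr`, and the old-style timer callback signature; i.e. not the actual dummy code):

    /* sketch: advance a software "hardware pointer" by one period per
       timer tick, wrap it at buffer_size, and notify ALSA
       (in practice, with a 726 us period, this would be an hrtimer) */
    static void my_timer_fn(unsigned long data)
    {
        struct my_pcm *dpcm = (struct my_pcm *)data;

        dpcm->hw_ptr += 32;   /* advance by period_size frames */
        dpcm->hw_ptr %= 64;   /* wrap at buffer_size frames    */
        mod_timer(&dpcm->timer, jiffies + dpcm->period_jiffies);
        snd_pcm_period_elapsed(dpcm->substream);
    }

    /* ... and the .pointer callback would then simply report this: */
    static snd_pcm_uframes_t my_pointer(struct snd_pcm_substream *substream)
    {
        struct my_pcm *dpcm = substream->runtime->private_data;
        return dpcm->hw_ptr;
    }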

The full-duplex part, of course, is not handled in this test, it being capture-only; however, I can notice some things that may eventually have an influence:

* The .pointer callback can be called in quick succession, in context of `snd_pcm_capture_ioctl`, with both `dummy` and `hda-intel` drivers.
* In both cases, right after `snd_pcm_start` is called, .pointer is called, returning zero, BUT:
** with the `hda-intel` driver, the card immediately raises an interrupt here - making for a total of 3 interrupts in a 2 ms capture;
** with the `dummy` driver, the timer function is merely scheduled at the start - it does not fire at the start, but first after a period has expired - making for a total of 2 interrupts in a 2 ms capture (see the sketch after this list).
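
(If that missing initial interrupt turned out to matter, I suppose the dummy's trigger-start could mimic it by letting the hrtimer expire almost immediately once - an untested sketch of such a modification:)

    /* sketch (untested): start the hrtimer with a near-zero first expiry,
       so one "period elapsed" fires right at start, like hda-intel does;
       the callback then re-arms itself with the normal period */
    static int dummy_hrtimer_start(struct snd_pcm_substream *substream)
    {
        struct dummy_hrtimer_pcm *dpcm = substream->runtime->private_data;

        dpcm->base_time = hrtimer_cb_get_time(&dpcm->timer);
        hrtimer_start(&dpcm->timer, ktime_set(0, 1000), HRTIMER_MODE_REL);
        return 0;
    }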


Also, just by looking at `montage-hda-intel.png` vs. `montage-dummy.png`, one would gather that the hardware IRQ runs slightly faster than the expected 726 μs period, while the timer function softirq runs somewhat slower than that. However, that is inaccurate - both drivers' timing IRQs can jitter in either direction; this is especially obvious if several captures are made and the corresponding plots animated, as shown in the animated .gif here:

    http://sdaaubckp.sourceforge.net/post/alsa-capttest/capttest.gif

In that gif, the plots which were vertical in the montages are now shown horizontally; the 2 ms capture of `dummy` is shown on top, while the capture of `hda_intel` is shown below it. It is noticeable, first - focusing on the "(Card) Time" track - that `hda-intel` fires an additional "start" interrupt, which has no counterpart in `dummy`. However, it is also noticeable that:

* Focusing on the "(Card) Time" track, the `hda-intel` hardware interrupts are much less jittery (and follow the expected 726 μs period more closely) than the timer interrupts of `dummy`;

I'm not sure if this is the expected behaviour. As far as I know, the priority order of interrupts in Linux is (crudely):

    hardware IRQ > softIRQ (software IRQ) > task switching/scheduling > everything else

The card, then, uses its own clock, which is not burdened with anything but filling buffers, meaning we can expect tight timing here; and when it generates an IRQ, it is handled by the kernel with the highest priority - ergo, not so much jitter. Timer functions, on the other hand, run in softIRQ context - meaning they (and their scheduling) could be preempted by the hardware IRQ of any other device on the system; ergo, more jitter. Is this reasonable to assume?

* When .pointer runs in quick succession, that usually results in a "correction" for `hda-intel` - but `dummy`'s position remains the same in the same situation.

I just noticed this while writing this mail; otherwise I hadn't paid much attention to it. But I just remembered that the original `dummy` driver calculates a delta, and returns the pointer position based on it, in the .pointer callback itself:

    dummy_hrtimer_pointer(struct snd_pcm_substream *substream)
    {
      ...
      delta = ktime_us_delta(hrtimer_cb_get_time(&dpcm->timer), dpcm->base_time);
      ... // (elided: delta gets converted from microseconds to frames here)
      div_u64_rem(delta, runtime->buffer_size, &pos); // this sets pos
      return pos;
    }

... and in that way - by recomputing the position from the current timer value on each call - it could show the same kind of "correction" that `hda-intel` does when called in quick succession.

However, in my version of the `dummy` driver, I'm also trying to write a few pulses into the `dma_area` - therefore I actually manage the position (returned by .pointer) in the hrtimer tasklet, in a variable `pcm_buf_pos`:

    static void dummy_hrtimer_pcm_elapsed(unsigned long priv) // this is the tasklet
    {
      ...
      delta = ktime_us_delta(hrtimer_cb_get_time(&dpcm->timer), dpcm->base_time);
      ... // (elided: the same microseconds-to-frames conversion of delta)
      div_u64_rem(delta, runtime->buffer_size, &pos); // this sets pos
      ...
      dpcm->pcm_buf_pos = frames_to_bytes(runtime, pos); // position kept in bytes
      ...
    }

... and in the .pointer, I simply return this number:

    dummy_hrtimer_pointer(struct snd_pcm_substream *substream)
    {
      ...
      pos = bytes_to_frames(runtime, dpcm->pcm_buf_pos);
      return pos;
    }

Now, note that the `capttest.gif` animation shows the jitter of the timer *(soft)IRQ entry*; however, the timer function in itself just schedules the tasklet to run even later - and this is also visible in `montage-dummy.png`, where it can be seen that the tasklet `dummy_hrtimer_pcm_elapsed` usually occurs up to some 100 μs *after* the timer IRQ entry! This probably has an influence on the .pointer position calculation - but can it be to such a degree as to cause a PortAudio drop in full-duplex mode?
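
One variant I could try here (an untested sketch - it essentially moves the original driver's delta calculation back into .pointer, while my tasklet would keep doing only the `dma_area` writing):

    /* sketch (untested): recompute the position from the hrtimer at
       .pointer time, so the reported position does not go stale while
       the tasklet is still waiting to run */
    static snd_pcm_uframes_t dummy_hrtimer_pointer(struct snd_pcm_substream *substream)
    {
        struct snd_pcm_runtime *runtime = substream->runtime;
        struct dummy_hrtimer_pcm *dpcm = runtime->private_data;
        u64 delta;
        u32 pos;

        delta = ktime_us_delta(hrtimer_cb_get_time(&dpcm->timer),
                               dpcm->base_time);
        delta = div_u64(delta * runtime->rate + 999999, 1000000); /* us -> frames */
        div_u64_rem(delta, runtime->buffer_size, &pos);
        return pos;
    }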


Since I've mentioned writing into the `dma_area`: `hda-intel` presumably programs the DMA controller to transfer the data from the internal capture memory to the PC's RAM - and as such, the transfer/copy uses no CPU cycles (CPU time). Whereas in my `dummy`, just by trying to `memset` (not even copy) a few bytes, I'm using extra CPU cycles - could this also have an influence on increased jitter?
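
(For completeness, the kind of writing I mean in the tasklet is roughly the following - a sketch using my own field names such as `pcm_buffer_size`, and `memset` standing in for the actual pulse-writing:)

    /* sketch: "fill" one period of the capture buffer in dma_area,
       handling the wrap at the end of the (64-frame) ring buffer */
    unsigned int bytes = frames_to_bytes(runtime, runtime->period_size);

    if (dpcm->pcm_buf_pos + bytes <= dpcm->pcm_buffer_size) {
        memset(runtime->dma_area + dpcm->pcm_buf_pos, 0, bytes);
    } else {
        unsigned int tail = dpcm->pcm_buffer_size - dpcm->pcm_buf_pos;
        memset(runtime->dma_area + dpcm->pcm_buf_pos, 0, tail);
        memset(runtime->dma_area, 0, bytes - tail);
    }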

And now that DMA is mentioned, I might as well ask again:

* What is the meaning of MMAP in the context of SNDRV_PCM_INFO_MMAP? Is it:
** A memory map from the card's internal buffer, via DMA, to the `dma_area` in PC RAM; or
** A memory map from `dma_area` in kernel space, to whatever buffer is referred to in user space?


Well, I hope someone will be able to confirm whether I am right in my understanding so far - and point out where I am otherwise wrong...

Thanks in advance for any feedback,
Cheers!

